Scrapling Web Extractor

Fetch one or more public webpages with Scrapling, extract the main content, and convert HTML into Markdown using html2text. Supports static HTTP, concurrent...

About This Skill

# Web Markdown Scraper

Use this skill when the user wants to:

- Scrape one or more public webpages (static or JavaScript-rendered)
- Convert HTML pages into clean Markdown
- Extract article/body text for summarization, analysis, or indexing
- Bypass anti-bot protections (Cloudflare, Datadome, etc.) via stealth mode
- Scrape many URLs concurrently (async mode)
- Track page elements reliably across website redesigns (automatch)
- Save the extracted results as `.md` files

Fetcher Mode Selection Guide

| Mode | Fetcher Class | Best For |
|------|---------------|----------|
| `http` (default) | `Fetcher` | Fast static pages, RSS, APIs |
| `async` | `AsyncFetcher` | Batch of 5+ static URLs in parallel |
| `stealth` | `StealthyFetcher` | Anti-bot sites, Cloudflare, fingerprint checks |
| `dynamic` | `PlayWrightFetcher` | Heavy SPAs, React/Vue/Angular apps |

Decision rule: Start with `http`. If you get a 403 / CAPTCHA / empty body, switch to `stealth`. If the content is rendered client-side (empty on first load), use `dynamic`. Use `async` when scraping many static URLs at once to save time.
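
The decision rule above can be read as a small fallback function. The sketch below is illustrative only: `initial_mode` and `next_mode` are hypothetical helper names, not part of Scrapling or of the skill's script, and the CAPTCHA check is a naive substring test assumed for the example.

```python
from typing import Optional


def initial_mode(static_url_count: int) -> str:
    """Pick the starting mode: async for batches of 5+ static URLs, else http."""
    return "async" if static_url_count >= 5 else "http"


def next_mode(status: int, body: str, client_rendered: bool = False) -> Optional[str]:
    """Suggest a fallback mode after an http attempt, or None to keep the result."""
    if client_rendered:
        # Content is built by JavaScript, so a real browser is needed.
        return "dynamic"
    # Assumed signals of blocking: HTTP 403, a CAPTCHA page, or an empty body.
    blocked = status == 403 or "captcha" in body.lower() or not body.strip()
    return "stealth" if blocked else None
```

Whether a page is client-rendered usually has to be judged by the caller (for example, from an empty content container), which is why it is passed in as a flag rather than detected here.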

Inputs

URL sources

- `--url URL`: one target URL (repeat the flag for multiple: `--url A --url B`)
- `--url-file FILE`: plain text file with one URL per line

Fetcher

- `--mode http|async|stealth|dynamic`: fetcher backend (default: `http`)

Content extraction

- `--selector CSS`: CSS selector for the main content area (omit to use the full page)
- `--preserve-links`: keep hyperlinks in the Markdown output
- `--output-dir DIR`: save per-page `.md` files and a master `index.json` here

AutoMatch (production resilience)

- `--auto-save`: fingerprint the selected elements and persist them to the local DB on the first run
- `--auto-match`: on subsequent runs, find elements by fingerprint even if the site layout has changed (you do not need to update the CSS selector)

Browser options (stealth / dynamic only)

- `--headless true|false|virtual`: headless mode; `virtual` uses Xvfb (default: `true`)
- `--network-idle`: wait until there has been no network activity for at least 500 ms before capturing
- `--block-images`: block image loading (saves bandwidth and proxy quota)
- `--disable-resources`: drop fonts/images/media/stylesheets for ~25% faster loads
- `--wait-selector CSS`: pause until this element appears in the DOM
- `--wait-selector-state attached|visible|detached|hidden`: element state to wait for (default: `attached`)
- `--timeout MS`: global timeout in ms (default: 30000)
- `--wait MS`: extra idle wait after page load, in ms

StealthyFetcher extras (stealth mode only)

- `--humanize SECONDS`: simulate human-like cursor movement (max duration in seconds)
- `--geoip`: spoof browser timezone, locale, language, and WebRTC IP from the proxy's geolocation
- `--block-webrtc`: prevent real-IP leaks via WebRTC
- `--disable-ads`: install uBlock Origin in the browser session
- `--proxy URL`: HTTP/SOCKS proxy as a URL string, or as JSON: `'{"server":"host:port","username":"u","password":"p"}'`

Reliability

- `--retry N`: retry failed requests up to N times with exponential backoff (max 30 s)
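
The retry description suggests a delay that grows after each failure and stops growing at 30 s. A minimal sketch of such a schedule follows. The 1-second base delay and the doubling factor are assumptions, since only the 30-second cap is stated here, and `backoff_delays` is a hypothetical helper rather than a function of the script.

```python
from typing import List


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> List[float]:
    """Return the wait (in seconds) before each retry, doubling up to `cap`."""
    # Assumed schedule: base * 2**attempt, clamped so no wait exceeds the cap.
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]
```

Under these assumptions, `--retry 3` would wait 1, 2, and 4 seconds, and the cap would only come into play from the sixth retry onward.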

Rules

- Only process public `http://` or `https://` pages.
- Never bypass login walls, CAPTCHAs, paywalls, or access controls.
- Prefer the main article or body content; avoid polluting the output with navigation, headers, footers, or cookie banners. Use `--selector` to target the content area.
- When `--auto-save` is used, always also pass `--selector` so Scrapling knows which element fingerprint to record.
- On subsequent runs for layout-changed pages, use `--auto-match` instead of `--auto-save`. Do not use both flags at once.
- Use `--mode async` for batch jobs with 5+ static URLs for parallel execution.
- Combine `--disable-resources` with `--block-images` in stealth/dynamic mode when you only need text content; this can cut load times by up to 40%.
- Always inspect the top-level `ok` field and per-result `ok` fields before using content. If `ok` is `false`, report the exact `error` string; do not invent or guess content.
- When `--network-idle` is insufficient, use `--wait-selector` for a specific DOM element to guarantee the content has loaded before capture.
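
Several of these rules can be checked mechanically before the script is invoked. The sketch below is one possible pre-flight check, not part of the skill itself: `check_options` is a hypothetical helper, and it verifies only the URL scheme, since whether a page is truly public cannot be determined from the URL alone.

```python
from typing import List, Optional, Sequence
from urllib.parse import urlparse


def check_options(urls: Sequence[str], selector: Optional[str] = None,
                  auto_save: bool = False, auto_match: bool = False) -> List[str]:
    """Return a list of rule violations; an empty list means the options look fine."""
    problems = []
    for url in urls:
        parsed = urlparse(url)
        # Only http:// and https:// targets with a host are accepted.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"not an http(s) URL: {url}")
    if auto_save and auto_match:
        problems.append("--auto-save and --auto-match must not be combined")
    if auto_save and not selector:
        problems.append("--auto-save requires --selector")
    return problems
```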

Command Patterns

Basic static page

```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>"
```

Static page: target a specific content area

```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --selector "article.main-content"
```

Stealth mode: bypass anti-bot protection

```bash
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --mode stealth --network-idle
```

Stealth + proxy + human fingerprint (maximum stealth)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode stealth \
  --proxy "http://user:pass@host:port" \
  --humanize 2.0 \
  --geoip \
  --block-webrtc \
  --network-idle
```

Dynamic SPA page (Playwright Chromium)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode dynamic \
  --wait-selector ".product-list" \
  --network-idle \
  --disable-resources
```

Async concurrent batch (multiple URLs)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --mode async \
  --url "<URL1>" --url "<URL2>" --url "<URL3>"
```

Batch from file + stealth + save to disk

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url-file urls.txt \
  --mode stealth \
  --disable-resources \
  --output-dir outputs
```

First-run automatch setup (save fingerprint)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --selector ".article-body" \
  --auto-save \
  --output-dir outputs
```

Subsequent run after site layout change (adaptive match)

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --selector ".article-body" \
  --auto-match \
  --output-dir outputs
```

Full production scrape

```bash
python3 "{baseDir}/scrape_to_markdown.py" \
  --url "<URL>" \
  --mode stealth \
  --selector "main article" \
  --auto-match \
  --preserve-links \
  --network-idle \
  --disable-resources \
  --timeout 60000 \
  --retry 3 \
  --output-dir outputs
```
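
When the script is driven from Python rather than a shell, the same patterns can be assembled as an argument list, which avoids quoting problems with URLs and selectors. This is a rough sketch: `build_argv` is a hypothetical helper, the script path must be supplied by the caller (the `{baseDir}` placeholder is resolved by the host environment), and only flags documented above are emitted.

```python
import sys
from typing import List, Optional, Sequence


def build_argv(script: str, urls: Sequence[str], mode: str = "http",
               selector: Optional[str] = None,
               output_dir: Optional[str] = None) -> List[str]:
    """Assemble an argument list for scrape_to_markdown.py from documented flags."""
    argv = [sys.executable, script, "--mode", mode]
    for url in urls:
        # The --url flag is repeated once per target.
        argv += ["--url", url]
    if selector:
        argv += ["--selector", selector]
    if output_dir:
        argv += ["--output-dir", output_dir]
    return argv
```

The resulting list could then be passed to `subprocess.run(argv, capture_output=True, text=True)` and the captured stdout parsed as JSON.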

Output Handling

JSON is printed to stdout. Always check `ok` before using content.

Top-level fields:

- `ok`: `true` only if every URL succeeded
- `total` / `succeeded` / `failed`: count summary
- `results`: array of per-URL result objects
- `output_index_file`: path to the saved `index.json` (if `--output-dir` is used)

Per-URL result fields (when `ok: true`):

- `url`: the requested URL
- `status`: HTTP status code (e.g. `200`)
- `title`: page `<title>` text
- `markdown`: extracted content as Markdown (use this as the main content)
- `markdown_length`: character count (useful for quality checks)
- `output_markdown_file`: path to the saved `.md` file (if `--output-dir` is used)

On failure (`ok: false` in a result):

- `error`: exact error message; report this verbatim, do not invent content
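
The field descriptions above translate into a short consumer that separates usable pages from failures. The sketch below assumes the JSON shape described in this section; `split_results` is a hypothetical name, and it reads `url` on failed results defensively because the section does not state whether that field is present on failure.

```python
import json
from typing import List, Optional, Tuple


def split_results(stdout_text: str) -> Tuple[bool, List[Tuple[str, str]],
                                             List[Tuple[Optional[str], Optional[str]]]]:
    """Parse the script's JSON output into (all_ok, pages, errors)."""
    report = json.loads(stdout_text)
    pages, errors = [], []
    for result in report.get("results", []):
        if result.get("ok"):
            pages.append((result["url"], result["markdown"]))
        else:
            # Keep the error string exactly as reported; never substitute content.
            errors.append((result.get("url"), result.get("error")))
    return bool(report.get("ok")), pages, errors
```

Because the top-level `ok` is `true` only when every URL succeeded, a `False` value with a non-empty `pages` list indicates a partial success: the listed pages are usable and the errors should be reported verbatim.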

Use Cases

Fetch public web pages and extract main content as clean Markdown
Scrape multiple URLs in batch and convert HTML to structured Markdown
Extract article content while filtering out navigation, ads, and boilerplate
Build content archives from web sources in Markdown format
Pipeline web content into documentation or knowledge base systems

Pros & Cons

Pros

+ Scrapling library handles anti-bot measures better than simple HTTP requests
+ html2text conversion produces clean, readable Markdown output
+ Batch URL support enables efficient multi-page content extraction

Cons

- Public pages only: no authentication or session-based scraping
- Scrapling library may have limited ecosystem support compared to Playwright

FAQ

What platforms support Scrapling Web Extractor?

Scrapling Web Extractor is available on Claude Code and OpenClaw.
