Web Scraper Pro

About This Skill

# Web Scraper Pro

Intelligent web page fetcher that converts any URL into clean, readable Markdown content. Uses a multi-layer fallback strategy to ensure maximum compatibility.

💰 Pricing: 0.001 USDT per web fetch

⚠️ CRITICAL: Payment Integration (MUST READ FIRST)

This skill is monetized via SkillPay. Before fetching ANY URL, you MUST execute the payment verification step.

Payment Configuration

```python # ═══════════════════════════════════════════════════ # SkillPay Billing Integration # ═══════════════════════════════════════════════════ import requests

BILLING_API_URL = 'https://skillpay.me' BILLING_API_KEY = 'sk_d6d26f291dafc43acc8c2b6215b87cbc9b19c7d093aebdb2deeba42a3a0fea4b' SKILL_ID = '4fb2d57e-e583-4ca3-8170-52df37a6572b' HEADERS = {'X-API-Key': BILLING_API_KEY, 'Content-Type': 'application/json'} ```

Billing API Functions (MANDATORY)

⚠️ THIS CODE MUST BE EXECUTED BEFORE ANY WEB FETCH:

```python import requests import os

BILLING_API_URL = 'https://skillpay.me' BILLING_API_KEY = 'sk_d6d26f291dafc43acc8c2b6215b87cbc9b19c7d093aebdb2deeba42a3a0fea4b' SKILL_ID = '4fb2d57e-e583-4ca3-8170-52df37a6572b' HEADERS = {'X-API-Key': BILLING_API_KEY, 'Content-Type': 'application/json'}

# ① Check balance / 查余额 def check_balance(user_id: str) -> float: resp = requests.get( f'{BILLING_API_URL}/api/v1/billing/balance', params={'user_id': user_id}, headers=HEADERS ) return resp.json()['balance'] # USDT amount

# ② Charge per call / 每次调用扣费 def charge_user(user_id: str) -> dict: resp = requests.post( f'{BILLING_API_URL}/api/v1/billing/charge', headers=HEADERS, json={ 'user_id': user_id, 'skill_id': SKILL_ID, 'amount': 0.001, # USDT per call } ) data = resp.json() if data['success']: return {'ok': True, 'balance': data['balance']} # Insufficient balance → auto returns BNB Chain USDT payment link return {'ok': False, 'balance': data['balance'], 'payment_url': data.get('payment_url')}

# ③ Generate payment link / 生成充值链接 def get_payment_link(user_id: str, amount: float) -> str: resp = requests.post( f'{BILLING_API_URL}/api/v1/billing/payment-link', headers=HEADERS, json={'user_id': user_id, 'amount': amount} ) return resp.json()['payment_url'] # BNB Chain USDT

USER_ID = os.environ.get("SKILLPAY_USER_ID", "anonymous_user") ```

Payment Workflow Integration

```python # ======================================== # STEP 0: CHARGE USER (MANDATORY) # ======================================== # This must be the FIRST step before ANY web fetch

charge_result = charge_user(USER_ID)

if not charge_result['ok']: print(f""" ╔══════════════════════════════════════════════════════════════╗ ║ 💳 INSUFFICIENT BALANCE ║ ║ ║ ║ This web fetch costs 0.001 USDT. ║ ║ Current balance: {charge_result['balance']:<41}║ ║ ║ ║ Please top up at: ║ ║ {charge_result['payment_url']:<56}║ ║ ║ ║ After payment, please retry your request. ║ ╚══════════════════════════════════════════════════════════════╝ """) raise SystemExit("Insufficient balance for web fetch")

print(f"✅ Charged 0.001 USDT. Remaining balance: {charge_result['balance']} USDT") ```

---

多层抓取策略 (Multi-Layer Fetch Strategy)

本 skill 采用智能多层回退策略，确保最大兼容性：

| 层级 | 服务 | URL 前缀 | 特点 | 适用场景 | |------|------|----------|------|----------| | Layer 1 | markdown.new | `https://markdown.new/` | Cloudflare 原生，三层回退，最快 | 大部分网站（首选） | | Layer 2 | defuddle.md | `https://defuddle.md/` | 开源轻量，支持 YAML frontmatter | 非 Cloudflare 站点 | | Layer 3 | Jina Reader | `https://r.jina.ai/` | AI 驱动，内容提取精准 | 复杂页面 | | Layer 4 | Scrapling | Python 库 | 自适应爬虫，反反爬能力强 | 最后兜底 |

Layer 1: markdown.new（首选，最快）

Cloudflare 驱动的 URL→Markdown 转换服务，内置三层回退：
原生 Markdown: `Accept: text/markdown` 内容协商
Workers AI: HTML→Markdown AI 转换
浏览器渲染: 无头浏览器处理 JS 重度页面

```python import requests

def fetch_via_markdown_new(url: str, method: str = "auto", retain_images: bool = True) -> str: """ Layer 1: 使用 markdown.new 抓取网页 Args: url: 目标网页 URL method: 转换方法 - "auto" | "ai" | "browser" retain_images: 是否保留图片链接 Returns: str: Markdown 格式的网页内容 """ api_url = "https://markdown.new/" try: response = requests.post( api_url, headers={"Content-Type": "application/json"}, json={ "url": url, "method": method, "retain_images": retain_images }, timeout=60 ) if response.status_code == 200: token_count = response.headers.get("x-markdown-tokens", "unknown") print(f"✅ [markdown.new] 抓取成功 (tokens: {token_count})") return response.text elif response.status_code == 429: print("⚠️ [markdown.new] 速率限制，切换到下一层...") return None else: print(f"⚠️ [markdown.new] 返回状态码 {response.status_code}，切换到下一层...") return None except requests.exceptions.RequestException as e: print(f"⚠️ [markdown.new] 请求失败: {e}，切换到下一层...") return None ```

支持的查询参数:
`method=auto|ai|browser` - 指定转换方法
`retain_images=true|false` - 是否保留图片
速率限制: 每 IP 每天 500 次请求

Layer 2: defuddle.md（备选方案）

开源的网页→Markdown 提取服务，由 Obsidian Web Clipper 创建者开发。

```python def fetch_via_defuddle(url: str) -> str: """ Layer 2: 使用 defuddle.md 抓取网页 Args: url: 目标网页 URL（不含 https:// 前缀亦可） Returns: str: 带有 YAML frontmatter 的 Markdown 内容 """ # defuddle 接受 URL 路径直接拼接 clean_url = url.replace("https://", "").replace("http://", "") api_url = f"https://defuddle.md/{clean_url}" try: response = requests.get(api_url, timeout=60) if response.status_code == 200 and len(response.text.strip()) > 50: print(f"✅ [defuddle.md] 抓取成功") return response.text else: print(f"⚠️ [defuddle.md] 内容为空或失败 (status: {response.status_code})，切换到下一层...") return None except requests.exceptions.RequestException as e: print(f"⚠️ [defuddle.md] 请求失败: {e}，切换到下一层...") return None ```

Layer 3: Jina Reader（AI 内容提取）

Jina AI 的阅读器服务，擅长处理复杂页面。

```python def fetch_via_jina(url: str) -> str: """ Layer 3: 使用 Jina Reader 抓取网页 Args: url: 目标网页完整 URL Returns: str: 提取的主要文本内容 """ api_url = f"https://r.jina.ai/{url}" try: response = requests.get( api_url, headers={"Accept": "text/markdown"}, timeout=60 ) if response.status_code == 200 and len(response.text.strip()) > 50: print(f"✅ [Jina Reader] 抓取成功") return response.text else: print(f"⚠️ [Jina Reader] 内容为空或失败 (status: {response.status_code})，切换到下一层...") return None except requests.exceptions.RequestException as e: print(f"⚠️ [Jina Reader] 请求失败: {e}，切换到下一层...") return None ```

额外功能: Jina 还支持搜索模式 `https://s.jina.ai/YOUR_SEARCH_QUERY`

Layer 4: Scrapling（终极兜底，反反爬）

强大的自适应爬虫框架，可绕过 Cloudflare Turnstile 等反爬机制。

```bash # 安装 Scrapling pip install scrapling # 如需浏览器功能（反反爬） pip install "scrapling[fetchers]" scrapling install ```

```python def fetch_via_scrapling(url: str, use_stealth: bool = False) -> str: """ Layer 4: 使用 Scrapling 抓取网页（终极兜底方案） Args: url: 目标网页 URL use_stealth: 是否使用隐身模式（绕过 Cloudflare 等） Returns: str: 提取的网页文本内容 """ try: if use_stealth: from scrapling.fetchers import StealthyFetcher page = StealthyFetcher.fetch(url) else: from scrapling.fetchers import Fetcher page = Fetcher.get(url) if page and page.status == 200: # 提取主要文本内容 # 移除脚本和样式标签 for tag in page.css('script, style, nav, footer, header'): tag.remove() # 获取主要内容区域 main_content = ( page.css('article') or page.css('main') or page.css('.content, .post, .article, #content') or page.css('body') ) if main_content: text = main_content[0].text_content print(f"✅ [Scrapling{'(Stealth)' if use_stealth else ''}] 抓取成功") return text print(f"⚠️ [Scrapling] 抓取失败或内容为空") return None except ImportError: print("⚠️ [Scrapling] 未安装，请运行: pip install scrapling") return None except Exception as e: print(f"⚠️ [Scrapling] 抓取失败: {e}") return None ```

---

Workflow

Phase 0: Payment Verification (MANDATORY - DO NOT SKIP)

⚠️ CRITICAL: This phase MUST be completed BEFORE any other phase. DO NOT proceed without payment verification.

Execute payment verification code (see Payment Verification Code section above)
If payment fails: Return the payment URL to user and STOP
If payment succeeds: Proceed to Phase 1

Phase 1: URL Analysis & Strategy Selection (MANDATORY)

当用户提供需要抓取的 URL 时：

接收 URL: 用户提供目标网址
分析 URL 特征: 判断网站类型和最佳抓取策略

| URL 特征 | 推荐策略 | |----------|----------| | 普通网页/博客/文档 | Layer 1 (markdown.new) → 自动回退 | | GitHub/技术文档 | Layer 1 (markdown.new) → Layer 3 (Jina) | | 需要登录/付费墙 | 提示用户需要认证，无法自动抓取 | | JavaScript 重度渲染 (SPA) | Layer 1 (markdown.new, method=browser) → Layer 4 (Scrapling) | | 反爬严格的网站 | Layer 4 (Scrapling, stealth mode) |

告知用户费用: "本次抓取将收取 0.001 USDT"

Phase 2: Execute Fetch (WITH PAYMENT)

⚠️ IMPORTANT: Execute payment verification BEFORE fetching!

使用以下智能抓取函数，自动按层级回退：

```python import requests import os

# ═══════════════════════════════════════════════════ # STEP 0: SKILLPAY BILLING (MANDATORY) # ═══════════════════════════════════════════════════

BILLING_API_URL = 'https://skillpay.me' BILLING_API_KEY = 'sk_d6d26f291dafc43acc8c2b6215b87cbc9b19c7d093aebdb2deeba42a3a0fea4b' SKILL_ID = '4fb2d57e-e583-4ca3-8170-52df37a6572b' HEADERS = {'X-API-Key': BILLING_API_KEY, 'Content-Type': 'application/json'}

def charge_user(user_id: str) -> dict: resp = requests.post( f'{BILLING_API_URL}/api/v1/billing/charge', headers=HEADERS, json={'user_id': user_id, 'skill_id': SKILL_ID, 'amount': 0.001} ) data = resp.json() if data['success']: return {'ok': True, 'balance': data['balance']} return {'ok': False, 'balance': data['balance'], 'payment_url': data.get('payment_url')}

USER_ID = os.environ.get("SKILLPAY_USER_ID", "anonymous_user") charge_result = charge_user(USER_ID) if not charge_result['ok']: print(f""" ╔══════════════════════════════════════════════════════════════╗ ║ 💳 INSUFFICIENT BALANCE ║ ║ ║ ║ This web fetch costs 0.001 USDT. ║ ║ Current balance: {charge_result['balance']:<41}║ ║ ║ ║ Please top up at (BNB Chain USDT): ║ ║ {charge_result['payment_url']:<56}║ ║ ║ ║ After payment, please retry your request. ║ ╚══════════════════════════════════════════════════════════════╝ """) raise SystemExit("Insufficient balance for web fetch")

print(f"✅ Charged 0.001 USDT. Remaining balance: {charge_result['balance']} USDT")

# ======================================== # STEP 1: INTELLIGENT MULTI-LAYER FETCH # ========================================

def smart_fetch(url: str, prefer_method: str = "auto", retain_images: bool = True) -> dict: """ 智能多层抓取：自动按优先级尝试各层服务，直到成功。 Args: url: 目标网页 URL prefer_method: markdown.new 的转换方法 ("auto", "ai", "browser") retain_images: 是否保留图片链接 Returns: dict: { "success": bool, "content": str, # Markdown 内容 "source": str, # 使用的抓取层级 "url": str, # 原始 URL "char_count": int # 内容字符数 } """ # 确保 URL 有协议前缀 if not url.startswith(("http://", "https://")): url = "https://" + url print(f"🔍 开始抓取: {url}") print("=" * 60) # --- Layer 1: markdown.new --- print("📡 Layer 1: 尝试 markdown.new ...") content = fetch_via_markdown_new(url, method=prefer_method, retain_images=retain_images) if content and len(content.strip()) > 100: return {"success": True, "content": content, "source": "markdown.new", "url": url, "char_count": len(content)} # --- Layer 2: defuddle.md --- print("📡 Layer 2: 尝试 defuddle.md ...") content = fetch_via_defuddle(url) if content and len(content.strip()) > 100: return {"success": True, "content": content, "source": "defuddle.md", "url": url, "char_count": len(content)} # --- Layer 3: Jina Reader --- print("📡 Layer 3: 尝试 Jina Reader ...") content = fetch_via_jina(url) if content and len(content.strip()) > 100: return {"success": True, "content": content, "source": "jina-reader", "url": url, "char_count": len(content)} # --- Layer 4: Scrapling (常规模式) --- print("📡 Layer 4a: 尝试 Scrapling (常规模式) ...") content = fetch_via_scrapling(url, use_stealth=False) if content and len(content.strip()) > 100: return {"success": True, "content": content, "source": "scrapling", "url": url, "char_count": len(content)} # --- Layer 4b: Scrapling (隐身模式) --- print("📡 Layer 4b: 尝试 Scrapling (隐身模式) ...") content = fetch_via_scrapling(url, use_stealth=True) if content and len(content.strip()) > 100: return {"success": True, "content": content, "source": "scrapling-stealth", "url": url, "char_count": len(content)} # 所有方法失败 print("❌ 所有抓取方法均失败") return {"success": False, "content": None, "source": None, "url": url, "char_count": 0}

# ======================================== # 执行抓取 # ========================================

TARGET_URL = "{用户提供的 URL}"

result = smart_fetch(TARGET_URL)

if result["success"]: print(f""" ╔══════════════════════════════════════════════════════════════╗ ║ ✅ 抓取成功 ║ ║ ║ ║ 来源: {result['source']:<52}║ ║ 字符数: {result['char_count']:<50}║ ║ URL: {result['url'][:50]:<52}║ ╚══════════════════════════════════════════════════════════════╝ """) # 输出 Markdown 内容 print("\n--- 网页内容 (Markdown) ---\n") print(result["content"]) else: print(f""" ╔══════════════════════════════════════════════════════════════╗ ║ ❌ 抓取失败 ║ ║ ║ ║ 所有 4 层抓取方法均无法获取内容。 ║ ║ 可能的原因: ║ ║ - 目标网站需要登录/认证 ║ ║ - 目标 URL 无效或不可达 ║ ║ - 目标网站有极强的反爬机制 ║ ║ ║ ║ 建议: ║ ║ - 检查 URL 是否正确 ║ ║ - 尝试提供需要登录后的页面源码 ║ ╚══════════════════════════════════════════════════════════════╝ """) ```

Phase 3: Content Processing & Output

抓取成功后：

直接返回 Markdown 内容给用户
如果内容过长（超过 50000 字符），进行智能截取并提示用户
记录交易 ID 用于支付追踪

```python # 内容后处理 def process_content(content: str, max_chars: int = 50000) -> str: """处理和截取过长内容""" if len(content) <= max_chars: return content # 智能截取：在段落边界截断 truncated = content[:max_chars] last_newline = truncated.rfind('\n\n') if last_newline > max_chars * 0.8: truncated = truncated[:last_newline] truncated += f"\n\n---\n⚠️ 内容过长，已截取前 {len(truncated)} 字符（共 {len(content)} 字符）。" return truncated ```

---

使用场景示例

场景 1: 抓取技术文档

``` 用户: 帮我抓取 https://docs.python.org/3/tutorial/index.html 的内容 ```

执行流程:
支付验证 → 通过
Layer 1 (markdown.new) → 尝试抓取
返回 Markdown 格式的 Python 教程内容

场景 2: 抓取 GitHub README

``` 用户: 我想看看这个库的介绍 https://github.com/D4Vinci/Scrapling ```

执行流程:
支付验证 → 通过
Layer 1 (markdown.new) → GitHub 页面通常成功
返回 Scrapling 项目的 README 内容

场景 3: 抓取反爬网站

``` 用户: 帮我抓取这个网页 https://某反爬网站.com/article/123 ```

执行流程:
支付验证 → 通过
Layer 1 → 失败
Layer 2 → 失败
Layer 3 → 失败
Layer 4 (Scrapling Stealth) → 使用隐身模式绕过反爬
返回提取的内容

场景 4: 搜索信息（使用 Jina Search）

``` 用户: 帮我搜一下 "Python asyncio best practices 2025" ```

```python def search_via_jina(query: str) -> str: """使用 Jina Search 搜索信息""" api_url = f"https://s.jina.ai/{query}" try: response = requests.get(api_url, timeout=60) if response.status_code == 200: return response.text return None except: return None

# 执行搜索 search_result = search_via_jina("Python asyncio best practices 2025") print(search_result) ```

---

Prerequisites (按需安装)

基础依赖（Layer 1-3 只需 requests） ```bash pip install requests ```

Scrapling 依赖（Layer 4 - 仅在需要时安装） ```bash # 基础安装 pip install scrapling

# 完整安装（含浏览器和反反爬功能） pip install "scrapling[fetchers]" scrapling install ```

---

💰 Revenue & Analytics

Track your earnings in real-time at SkillPay Dashboard.

Price per fetch: 0.001 USDT
Your revenue share: 95%
Settlement: Instant (BNB Chain)

---

*Powered by SkillPay - AI Skill Monetization Infrastructure*

Use Cases

Convert any URL into clean, readable markdown content
Fetch web pages and extract the main content intelligently
Build content pipelines that transform web pages into structured text
Create readable article extracts from cluttered web pages
Process web content for AI agent consumption and analysis

Pros & Cons

Pros

+Intelligent content extraction focuses on main content, not boilerplate
+Markdown output is universally useful for documentation and AI context
+Simple, focused utility — URL in, clean markdown out

Cons

-No details on how it handles JavaScript-rendered or dynamic content
-Single-page focused — no crawling or multi-page extraction

FAQ

What does Web Scraper Pro do?

Intelligent web page fetcher that converts any URL into clean, readable Markdown content.

What platforms support Web Scraper Pro?

Web Scraper Pro is available on Claude Code, OpenClaw.

What are the use cases for Web Scraper Pro?

Convert any URL into clean, readable markdown content. Fetch web pages and extract the main content intelligently. Build content pipelines that transform web pages into structured text.

Next Step

Use the skill detail page to evaluate fit and install steps. For a direct browser workflow, move into a focused tool route instead of staying in broader support surfaces.

Open Free Tools Try AI Detector