Web Scraper

Extract structured data from websites using browser automation. Use when scraping product listings, articles, contact info, prices, or any web content. Suppo...

About This Skill

# Web Scraper

Overview

A web scraping skill built on agent-browser. It extracts structured data from websites and supports JavaScript-rendered content, pagination, and complex selectors.

Use Cases

E-commerce: Product listings, prices, reviews, inventory
Real Estate: Property listings, prices, agent contacts
Job Boards: Job postings, salaries, requirements
News/Media: Articles, headlines, publication dates
Directories: Business listings, contact information
Competitor Monitoring: Prices, products, content changes

Quick Start

Scrape Single Page

```bash
python scripts/scrape_page.py \
  --url "https://example.com/products" \
  --fields "title=h2.title,price=.price,link=a.href" \
  --output products.csv
```

Scrape with Pagination

```bash
python scripts/scrape_paginated.py \
  --url "https://example.com/products?page={page}" \
  --pages 10 \
  --fields "title,price,description" \
  --output all_products.csv
```

Scripts

scrape_page.py

Scrape a single page or static list.

Arguments:
`--url` - Target URL
`--fields` - Field definitions (name=selector format, comma-separated)
`--output` - Output file (CSV, JSON, or XLSX)
`--format` - Output format (csv, json, xlsx)
`--wait` - Wait time for dynamic content (seconds)

Field Definition Format:

```
fieldname=css_selector
```

Examples:

```
title=h1.product-title
price=.price-tag
description=.product-description
image=img.product-image.src
link=a.product-link.href
```
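The field format above can be illustrated with a small parser. This is a minimal sketch and not the code in `scripts/scrape_page.py`: the names `parse_fields`, `FieldSpec`, and `KNOWN_ATTRIBUTES` are hypothetical, and it assumes that a trailing `.href` or `.src` names an attribute to read, while any other trailing segment (such as `.title` in `h2.title`) is an ordinary CSS class.

```python
from typing import Dict, NamedTuple, Optional

# Assumption: only these trailing suffixes are treated as attribute names.
KNOWN_ATTRIBUTES = {"href", "src"}


class FieldSpec(NamedTuple):
    selector: str
    attribute: Optional[str]  # None means "read the element's text"


def parse_fields(spec: str) -> Dict[str, FieldSpec]:
    """Parse 'name=selector,name=selector' into a dict of FieldSpec."""
    fields: Dict[str, FieldSpec] = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, selector = part.partition("=")
        name, selector = name.strip(), selector.strip()
        if not sep or not name or not selector:
            raise ValueError(f"expected name=selector, got {part!r}")
        # Split off the last dotted segment and check whether it is an attribute.
        head, dot, tail = selector.rpartition(".")
        if dot and head and tail in KNOWN_ATTRIBUTES:
            fields[name] = FieldSpec(head, tail)
        else:
            fields[name] = FieldSpec(selector, None)
    return fields


if __name__ == "__main__":
    for name, field in parse_fields("title=h2.title,link=a.href").items():
        print(name, field)
```

Because the sketch splits on commas, selectors that themselves contain a comma are not supported, and bare field names without `=selector` are rejected.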

scrape_paginated.py

Scrape multiple pages with pagination.

Arguments:
`--url` - URL pattern (use {page} for page number)
`--pages` - Number of pages to scrape
`--fields` - Field definitions
`--output` - Output file
`--delay` - Delay between pages (seconds)
`--next-selector` - CSS selector for "next page" button (alternative to URL pattern)
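To show how a `{page}` URL pattern and a `--pages` count could turn into concrete URLs, here is a minimal sketch. The helper name `build_page_urls` is hypothetical, and the assumption that numbering starts at page 1 is not confirmed by the script's documentation.

```python
from typing import List


def build_page_urls(pattern: str, pages: int, start: int = 1) -> List[str]:
    """Expand a URL pattern containing {page} into one URL per page."""
    if "{page}" not in pattern:
        raise ValueError("URL pattern must contain the {page} placeholder")
    if pages < 1:
        raise ValueError("pages must be at least 1")
    # Assumption: pages are numbered consecutively from `start`.
    return [pattern.replace("{page}", str(n)) for n in range(start, start + pages)]


if __name__ == "__main__":
    for url in build_page_urls("https://example.com/products?page={page}", 3):
        print(url)
```

Sites that number pages from 0, or that paginate by offset, would need a different `start` or a different pattern.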

scrape_infinite_scroll.py

Scrape pages with infinite scroll loading.

Arguments:
`--url` - Target URL
`--scrolls` - Number of scroll actions
`--fields` - Field definitions
`--output` - Output file
`--scroll-delay` - Delay between scrolls (ms)
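The scroll loop behind these arguments might look roughly like the sketch below. It is browser-agnostic: `scroll_once` and `count_items` are hypothetical callables that the caller wires to whatever automation tool is in use, and the early exit when no new items appear is an assumption, not documented behavior of `scrape_infinite_scroll.py`.

```python
import time
from typing import Callable


def scroll_until_stable(
    scroll_once: Callable[[], None],
    count_items: Callable[[], int],
    max_scrolls: int,
    scroll_delay_ms: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Scroll up to max_scrolls times; stop early once no new items load."""
    seen = count_items()
    for _ in range(max_scrolls):
        scroll_once()
        sleep(scroll_delay_ms / 1000)  # give lazy-loaded content time to render
        current = count_items()
        if current <= seen:
            break  # nothing new appeared, so further scrolling is likely wasted
        seen = current
    return seen
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting.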

scrape_dynamic.py

Scrape JavaScript-heavy sites with custom interactions.

Arguments:
`--url` - Target URL
`--actions` - JSON file with interaction sequence
`--fields` - Field definitions
`--output` - Output file

Configuration

Actions JSON Format (for dynamic scraping)

```json
{
  "actions": [
    {"type": "click", "selector": "#load-more"},
    {"type": "wait", "ms": 2000},
    {"type": "scroll", "direction": "down", "pixels": 500},
    {"type": "fill", "selector": "#search", "value": "keyword"},
    {"type": "press", "key": "Enter"}
  ]
}
```
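Before running a long scrape, it can help to check an actions file for typos. The following is a minimal validation sketch: the function name `validate_actions` is hypothetical, and the required keys per action type are inferred only from the example above, so the real script may accept more types or optional keys.

```python
import json
from typing import Any, Dict, List

# Assumption: required keys are inferred from the documented example only.
REQUIRED_KEYS = {
    "click": {"selector"},
    "wait": {"ms"},
    "scroll": {"direction", "pixels"},
    "fill": {"selector", "value"},
    "press": {"key"},
}


def validate_actions(text: str) -> List[Dict[str, Any]]:
    """Parse an actions JSON document and check each action's keys."""
    data = json.loads(text)
    if not isinstance(data, dict) or not isinstance(data.get("actions"), list):
        raise ValueError('top-level object must contain an "actions" list')
    for index, action in enumerate(data["actions"]):
        if not isinstance(action, dict):
            raise ValueError(f"action {index}: expected an object")
        kind = action.get("type")
        if kind not in REQUIRED_KEYS:
            raise ValueError(f"action {index}: unknown type {kind!r}")
        missing = REQUIRED_KEYS[kind] - set(action)
        if missing:
            raise ValueError(f"action {index} ({kind}): missing {sorted(missing)}")
    return data["actions"]
```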

Output Formats

CSV:

```csv
title,price,link,url
"Product A",29.99,https://...,https://...
"Product B",39.99,https://...,https://...
```

JSON:

```json
[
  {
    "title": "Product A",
    "price": "29.99",
    "link": "https://...",
    "scraped_at": "2026-03-07T16:00:00"
  }
]
```

Excel (XLSX):
Same as CSV but with formatting options
Multiple sheets support
Auto-fit columns
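As an illustration of the CSV and JSON outputs, here is a minimal serialization sketch using only the Python standard library. The function name `serialize` is hypothetical and this is not the skill's own writer. XLSX is left out because it requires a third-party library.

```python
import csv
import io
import json
from typing import Dict, List


def serialize(records: List[Dict[str, str]], fmt: str) -> str:
    """Render scraped records as CSV or JSON text."""
    if fmt == "json":
        return json.dumps(records, indent=2, ensure_ascii=False)
    if fmt == "csv":
        # Collect column names in first-seen order so no field is dropped.
        fieldnames: List[str] = []
        for record in records:
            for key in record:
                if key not in fieldnames:
                    fieldnames.append(key)
        buffer = io.StringIO()
        writer = csv.DictWriter(
            buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(records)  # missing fields become empty cells
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt!r}")
```

Building the header from every record, rather than only the first, keeps columns that appear on just some pages.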

Examples

Example 1: Scrape E-commerce Products

```bash
python scripts/scrape_paginated.py \
  --url "https://example.com/shop?page={page}" \
  --pages 5 \
  --fields "name=.product-name,price=.price,rating=.stars,reviews=.review-count,url=a.href" \
  --output products.csv \
  --delay 3
```

Example 2: Scrape News Articles

```bash
python scripts/scrape_page.py \
  --url "https://news-site.com/latest" \
  --fields "headline=h2.article-title,summary=.article-summary,author=.byline,date=.publish-date,url=a.read-more.href" \
  --output articles.json \
  --format json
```

Example 3: Scrape Job Postings

```bash
python scripts/scrape_infinite_scroll.py \
  --url "https://jobs-site.com/search" \
  --scrolls 10 \
  --fields "title=.job-title,company=.company-name,location=.location,salary=.salary,posted=.date-posted,url=a.job-link.href" \
  --output jobs.csv \
  --scroll-delay 1500
```

Example 4: Scrape Real Estate Listings

```bash
python scripts/scrape_paginated.py \
  --url "https://realestate.com/listings?page={page}" \
  --pages 20 \
  --fields "address=.property-address,price=.listing-price,beds=.bedrooms,baths=.bathrooms,sqft=.square-feet,url=a.property-link.href" \
  --output listings.xlsx \
  --format xlsx \
  --delay 5
```

Best Practices

Respect robots.txt - Check and follow site rules
Rate limiting - Add delays between requests (2-5s recommended)
Error handling - Handle missing elements gracefully
User-Agent - Use realistic browser headers
Retry logic - Implement retries for failed requests
Data validation - Validate extracted data before saving
Storage - Save intermediate results for long scrapes
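Two of the practices above, respecting robots.txt and retrying with delays, can be sketched with the standard library alone. The helper names `is_allowed` and `with_retries` are hypothetical, the robots.txt text is passed in as a string so the example runs offline, and exponential backoff with up to one second of random jitter is one reasonable choice rather than the skill's documented behavior.

```python
import random
import time
from typing import Callable, TypeVar
from urllib.robotparser import RobotFileParser

T = TypeVar("T")


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait 2s, 4s, 8s, ... plus up to 1s of jitter to avoid a fixed rhythm.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    raise ValueError("attempts must be at least 1")
```

In a real scrape, `fetch` would wrap a single page load, and the robots.txt text would be downloaded once per site and reused.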

Anti-Scraping Measures

Some sites employ anti-scraping techniques:

| Measure | Countermeasure |
|---------|----------------|
| IP blocking | Use proxies, rotate IPs |
| CAPTCHA | Manual solving or CAPTCHA services |
| Rate limiting | Increase delays, randomize timing |
| JavaScript challenges | Use browser automation (agent-browser) |
| Honeypot traps | Avoid hidden fields, validate selectors |

Legal Considerations

Public data: Often lawful to scrape, but this depends on jurisdiction and on how the data is used
Terms of Service: Review site ToS before scraping
Copyright: Don't republish copyrighted content
Personal data: GDPR/privacy laws may apply
Commercial use: May require permission

Disclaimer: This skill is for educational purposes. Users are responsible for compliance with applicable laws and website terms.

Troubleshooting

Elements not found: Verify CSS selectors with browser dev tools
Empty results: Check if content is JavaScript-rendered (use dynamic scraping)
Timeout errors: Increase wait time or check network
Blocked requests: Add delays, rotate user agents, or use proxies
Incomplete data: Verify pagination or scroll handling
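When results look empty or incomplete, a quick per-field fill rate can help separate a bad selector (one field is always empty) from a loading problem (every field is empty). This is a minimal diagnostic sketch. The name `field_fill_rates` is hypothetical, and it assumes records are dictionaries such as those in the JSON output.

```python
from typing import Any, Dict, List


def field_fill_rates(records: List[Dict[str, Any]]) -> Dict[str, float]:
    """Return, for each field, the fraction of records with a non-empty value."""
    names: List[str] = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)
    rates: Dict[str, float] = {}
    for name in names:
        filled = 0
        for record in records:
            value = record.get(name)
            # Treat missing keys, None, and whitespace-only strings as empty.
            if value is not None and str(value).strip() != "":
                filled += 1
        rates[name] = filled / len(records)
    return rates
```

A field with a rate near zero usually points at its selector, while uniformly low rates suggest the content had not rendered when it was read.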

References

CSS Selector Guide

See `references/css-selectors.md` for comprehensive selector examples.

Common Website Patterns

See `references/website-patterns.md` for common HTML structures and selectors.

Use Cases

Extract structured data from product listing pages for e-commerce analysis
Scrape article content and metadata from news and blog websites
Collect contact information from business directory websites
Build datasets from web sources for research and market analysis
Automate data collection from websites with browser-rendered content

Pros & Cons

Pros

+ Browser automation handles JavaScript-rendered dynamic content
+ Structured data extraction produces clean, usable datasets
+ Covers common scraping targets: products, articles, and contacts

Cons

- Browser automation is slower than HTTP-based scraping approaches
- Web scraping may violate the target website's Terms of Service

FAQ

What platforms support Web Scraper?

Web Scraper is available on Claude Code and OpenClaw.
