# Web Scraper
## Overview
Professional web scraping skill using agent-browser. Extract structured data from any website with support for JavaScript-rendered content, pagination, and complex selectors.
## Use Cases
- E-commerce: Product listings, prices, reviews, inventory
- Real Estate: Property listings, prices, agent contacts
- Job Boards: Job postings, salaries, requirements
- News/Media: Articles, headlines, publication dates
- Directories: Business listings, contact information
- Competitor Monitoring: Prices, products, content changes
## Quick Start
### Scrape Single Page
```bash
python scripts/scrape_page.py \
  --url "https://example.com/products" \
  --fields "title=h2.title,price=.price,link=a.href" \
  --output products.csv
```
### Scrape with Pagination
```bash
python scripts/scrape_paginated.py \
  --url "https://example.com/products?page={page}" \
  --pages 10 \
  --fields "title,price,description" \
  --output all_products.csv
```
## Scripts
### scrape_page.py
Scrape a single page or static list.
- Arguments:
- `--url` - Target URL
- `--fields` - Field definitions (name=selector format, comma-separated)
- `--output` - Output file (CSV, JSON, or XLSX)
- `--format` - Output format (csv, json, xlsx)
- `--wait` - Wait time for dynamic content (seconds)
Field Definition Format:
```
fieldname=css_selector
```
Examples:
```
title=h1.product-title
price=.price-tag
description=.product-description
image=img.product-image.src
link=a.product-link.href
```
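These `name=selector` pairs are simple to parse; the sketch below shows one way to do it, with a trailing `.href` or `.src` treated as an attribute to extract rather than part of the selector. This is illustrative only, not the shipped scripts' actual parsing code.

```python
def parse_fields(spec: str) -> dict:
    """Parse comma-separated 'name=selector' pairs into a field map."""
    fields = {}
    for pair in spec.split(","):
        name, _, selector = pair.strip().partition("=")
        # A trailing .href or .src names an attribute to extract
        # instead of the element's text content.
        attr = None
        if selector.endswith((".href", ".src")):
            selector, _, attr = selector.rpartition(".")
        fields[name] = {"selector": selector, "attr": attr}
    return fields
```

For example, `parse_fields("link=a.product-link.href")` maps `link` to the selector `a.product-link` with the `href` attribute.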
### scrape_paginated.py
Scrape multiple pages with pagination.
- Arguments:
- `--url` - URL pattern (use {page} for page number)
- `--pages` - Number of pages to scrape
- `--fields` - Field definitions
- `--output` - Output file
- `--delay` - Delay between pages (seconds)
- `--next-selector` - CSS selector for "next page" button (alternative to URL pattern)
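In URL-pattern mode, pagination reduces to a loop over `{page}` substitutions with a delay between requests. A minimal sketch, with the per-page fetching function injected as a parameter (the real script's internals are not shown here):

```python
import time

def scrape_pages(url_pattern: str, pages: int, delay: float, fetch) -> list:
    """Fetch each page of a {page}-templated URL with a polite delay."""
    rows = []
    for page in range(1, pages + 1):
        rows.extend(fetch(url_pattern.format(page=page)))
        if page < pages:
            time.sleep(delay)  # rate limiting between page loads
    return rows
```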
### scrape_infinite_scroll.py
Scrape pages with infinite scroll loading.
- Arguments:
- `--url` - Target URL
- `--scrolls` - Number of scroll actions
- `--fields` - Field definitions
- `--output` - Output file
- `--scroll-delay` - Delay between scrolls (ms)
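A common refinement when scrolling is to stop early once the page height stops growing, rather than always performing the full scroll count. A hypothetical sketch with the browser calls injected as callables (the actual script may behave differently):

```python
def scroll_until_stable(scroll_once, get_height, max_scrolls: int) -> int:
    """Scroll up to max_scrolls times, stopping early once the page
    height stops growing (i.e. no more content is lazy-loading).
    Returns the number of scrolls performed."""
    last = get_height()
    for done in range(1, max_scrolls + 1):
        scroll_once()
        height = get_height()
        if height == last:  # nothing new loaded; stop early
            return done
        last = height
    return max_scrolls
```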
### scrape_dynamic.py
Scrape JavaScript-heavy sites with custom interactions.
- Arguments:
- `--url` - Target URL
- `--actions` - JSON file with interaction sequence
- `--fields` - Field definitions
- `--output` - Output file
## Configuration
### Actions JSON Format (for dynamic scraping)
```json
{
  "actions": [
    {"type": "click", "selector": "#load-more"},
    {"type": "wait", "ms": 2000},
    {"type": "scroll", "direction": "down", "pixels": 500},
    {"type": "fill", "selector": "#search", "value": "keyword"},
    {"type": "press", "key": "Enter"}
  ]
}
```
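Replaying such a list amounts to dispatching on each action's `type`. In the sketch below, the `page` methods (`click`, `wait`, `scroll`, `fill`, `press`) are placeholders for whatever the underlying automation layer actually exposes, not a confirmed agent-browser API:

```python
def run_actions(actions: list, page) -> None:
    """Replay an action sequence against a browser page object."""
    for action in actions:
        kind = action["type"]
        if kind == "click":
            page.click(action["selector"])
        elif kind == "wait":
            page.wait(action["ms"])
        elif kind == "scroll":
            page.scroll(action.get("direction", "down"), action.get("pixels", 500))
        elif kind == "fill":
            page.fill(action["selector"], action["value"])
        elif kind == "press":
            page.press(action["key"])
        else:
            raise ValueError(f"unknown action type: {kind!r}")
```

Failing loudly on unknown types makes typos in hand-written action files easy to catch.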
## Output Formats
CSV:
```csv
title,price,link,url
"Product A",29.99,https://...,https://...
"Product B",39.99,https://...,https://...
```
JSON:
```json
[
  {
    "title": "Product A",
    "price": "29.99",
    "link": "https://...",
    "scraped_at": "2026-03-07T16:00:00"
  }
]
```
- Excel (XLSX):
- Same as CSV but with formatting options
- Multiple sheets support
- Auto-fit columns
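Writing the CSV and JSON variants needs only the standard library; this is a minimal sketch, not the scripts' actual writer (XLSX output would additionally require a library such as openpyxl):

```python
import csv
import json

def save_rows(rows: list, path: str, fmt: str) -> None:
    """Write scraped rows (a list of dicts) as CSV or JSON."""
    if fmt == "json":
        with open(path, "w", encoding="utf-8") as f:
            json.dump(rows, f, indent=2, ensure_ascii=False)
    elif fmt == "csv":
        with open(path, "w", newline="", encoding="utf-8") as f:
            # Column order is taken from the first row's keys.
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    else:
        raise ValueError(f"unsupported format: {fmt!r}")
```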
## Examples
### Example 1: Scrape E-commerce Products
```bash
python scripts/scrape_paginated.py \
  --url "https://example.com/shop?page={page}" \
  --pages 5 \
  --fields "name=.product-name,price=.price,rating=.stars,reviews=.review-count,url=a.href" \
  --output products.csv \
  --delay 3
```
### Example 2: Scrape News Articles
```bash
python scripts/scrape_page.py \
  --url "https://news-site.com/latest" \
  --fields "headline=h2.article-title,summary=.article-summary,author=.byline,date=.publish-date,url=a.read-more.href" \
  --output articles.json \
  --format json
```
### Example 3: Scrape Job Postings
```bash
python scripts/scrape_infinite_scroll.py \
  --url "https://jobs-site.com/search" \
  --scrolls 10 \
  --fields "title=.job-title,company=.company-name,location=.location,salary=.salary,posted=.date-posted,url=a.job-link.href" \
  --output jobs.csv \
  --scroll-delay 1500
```
### Example 4: Scrape Real Estate Listings
```bash
python scripts/scrape_paginated.py \
  --url "https://realestate.com/listings?page={page}" \
  --pages 20 \
  --fields "address=.property-address,price=.listing-price,beds=.bedrooms,baths=.bathrooms,sqft=.square-feet,url=a.property-link.href" \
  --output listings.xlsx \
  --format xlsx \
  --delay 5
```
## Best Practices
- Respect robots.txt - Check and follow site rules
- Rate limiting - Add delays between requests (2-5s recommended)
- Error handling - Handle missing elements gracefully
- User-Agent - Use realistic browser headers
- Retry logic - Implement retries for failed requests
- Data validation - Validate extracted data before saving
- Storage - Save intermediate results for long scrapes
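Retry logic and rate limiting combine naturally into one wrapper: exponential backoff between attempts, plus random jitter so repeated requests don't land on a fixed schedule. A sketch under the assumption that the fetch function raises an exception on failure:

```python
import random
import time

def fetch_with_retry(fetch, url: str, retries: int = 3, base_delay: float = 2.0):
    """Call fetch(url), retrying with exponential backoff plus jitter."""
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of retries; surface the last error
            # 2s, 4s, 8s, ... plus random jitter to avoid fixed timing
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```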
## Anti-Scraping Measures
Some sites employ anti-scraping techniques:
| Measure | Countermeasure |
|---------|----------------|
| IP blocking | Use proxies, rotate IPs |
| CAPTCHA | Manual solving or CAPTCHA services |
| Rate limiting | Increase delays, randomize timing |
| JavaScript challenges | Use browser automation (agent-browser) |
| Honeypot traps | Avoid hidden fields, validate selectors |
## Legal Considerations
- Public data: Generally legal to scrape
- Terms of Service: Review site ToS before scraping
- Copyright: Don't republish copyrighted content
- Personal data: GDPR/privacy laws may apply
- Commercial use: May require permission
Disclaimer: This skill is for educational purposes. Users are responsible for compliance with applicable laws and website terms.
## Troubleshooting
- Elements not found: Verify CSS selectors with browser dev tools
- Empty results: Check if content is JavaScript-rendered (use dynamic scraping)
- Timeout errors: Increase wait time or check network
- Blocked requests: Add delays, rotate user agents, or use proxies
- Incomplete data: Verify pagination or scroll handling
## References
### CSS Selector Guide
See `references/css-selectors.md` for comprehensive selector examples.
### Common Website Patterns
See `references/website-patterns.md` for common HTML structures and selectors.