Intermediate · 2-6 hours · 6 steps

Build Web Scrapers with AI

Web scraping is one of those tasks that's conceptually simple but technically finicky: every site has a different structure, sites change without warning, and modern sites load content dynamically through JavaScript. This guide walks through using AI at each stage: analyzing the target site, generating the scraper, handling pagination and authentication, cleaning the data, and scheduling and monitoring runs.

What You'll Build

6 steps · 2-6 hours · 3 tools · 5 prompts
Difficulty: Intermediate
Best for: web scraping, data collection, python, automation

Step-by-Step Guide

Follow this 6-step workflow; expect it to take about 2-6 hours.

1

Analyze the Target Site Structure

Before writing any code, you need to understand how the site delivers its data. Is it server-rendered HTML you can parse directly, or does content load via JavaScript after the initial page? Is there a hidden API the site's own frontend calls that you can use instead of scraping HTML? AI can guide your analysis strategy.

Prompt Template
I want to scrape data from a website. Help me understand the site's structure and the best approach before I write any code.

**Target site:** [URL — e.g., 'https://example-jobs-site.com/listings']

**Data I want to collect:**
- [Field 1 — e.g., 'Job title']
- [Field 2 — e.g., 'Company name']
- [Field 3 — e.g., 'Location']
- [Field 4 — e.g., 'Salary range (if listed)']
- [Field 5 — e.g., 'Date posted']
- [Field 6 — e.g., 'Link to full job listing']

**What I've observed about the site:**
[Describe what you see when you inspect the page — e.g., 'The job listings appear immediately when I load the page, no loading spinner,' or 'There's a brief spinner when I first load the page before listings appear,' or 'I'm not sure — I haven't looked yet']

Please guide me through the site analysis process:

1. **How to determine if it's static HTML or JavaScript-rendered**: Walk me through exactly how to use Chrome DevTools (or another browser tool) to determine whether the content I want is in the initial HTML response or loaded by JavaScript afterward.
2. **How to look for a hidden API**: Explain how to use the browser's Network tab to find XHR/Fetch requests that might be loading the data I want as JSON. Explain what I should look for and what a useful API endpoint looks like vs. noise.
3. **What to look for in the HTML structure**: If it's static HTML, what should I look for in the page source to identify the CSS selectors or XPath that will reliably target the data I want? What makes a selector fragile (will break when the site updates) vs. robust?
4. **Authentication / anti-bot considerations**: Based on the type of site I'm describing, what authentication or anti-bot measures am I likely to encounter? What are the signs that a site is actively blocking scrapers?
5. **Recommended approach**: Given all of the above, what scraping approach would you recommend for this type of site (pure HTML parsing / headless browser / direct API calls), and why?

I'll report back what I find and we'll proceed from there.
Tip: Always check for a hidden API before writing a scraper. Open Chrome DevTools, go to the Network tab, filter by 'Fetch/XHR,' and reload the page. If you see JSON responses containing exactly the data you want, you've found an API you can call directly — this produces cleaner data, is faster, and is far less brittle than HTML parsing. Many sites that look like they require scraping are actually serving clean JSON to their own frontend.
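If the Network tab does turn up a JSON endpoint, calling it directly might look like the sketch below. The URL, query parameters, and response keys (`results`, `title`, `company`) are all placeholders for whatever you actually find, not a real API:

```python
import requests

# Hypothetical endpoint discovered in the Network tab -- replace with the
# real XHR/Fetch URL and parameters you observed.
API_URL = "https://example.com/api/v2/listings"

def fetch_listings(page, session=None):
    """Call the hidden JSON API directly instead of parsing HTML."""
    s = session or requests.Session()
    resp = s.get(
        API_URL,
        params={"page": page, "limit": 20},
        # Some endpoints reject the default python-requests User-Agent.
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
        timeout=15,
    )
    resp.raise_for_status()
    return resp.json()

def extract_rows(payload):
    """Keep only the fields we care about from one page of API results."""
    return [
        {"title": item.get("title"), "company": item.get("company")}
        for item in payload.get("results", [])
    ]
```

Separating the network call (`fetch_listings`) from the field extraction (`extract_rows`) makes the extraction testable without hitting the site.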
2

Generate the Scraper Code

With a clear picture of the site's structure, AI can generate the core scraper. The key is to give it the specific selectors, response structure, or API details you discovered in step 1 — not just the URL. The more precise your input, the more useful the generated code.

Prompt Template
Generate a web scraper in [Python / JavaScript / Node.js] based on my analysis of the target site.

**What I found in Step 1:**
- Site type: [Static HTML / JavaScript-rendered / Hidden JSON API]
- If static HTML or JS-rendered: The data is in [CSS selectors or XPath — e.g., 'Each job listing is in a div with class job-card, the title is in an h2.job-title inside it, the company name is in span.company-name']
- If hidden API: The endpoint is [URL — e.g., 'https://example.com/api/v2/listings?page=1&limit=20'], and it returns JSON like: `[paste a sample of the response structure]`
- Authentication required: [Yes / No — if yes, describe how: cookie, header, login form]

**Scraper requirements:**
- Language: [Python 3 / Node.js]
- Library preference: [Python: requests+BeautifulSoup / requests+lxml / Playwright / Scrapy; Node.js: axios+cheerio / Puppeteer / Playwright]
- Data I want to extract from each item:
  - [Field name]: [Where to find it — selector, key name, or XPath]
  - [Field name]: [Where to find it]
  - [Field name]: [Where to find it]
- Output format: [CSV file / JSON file / SQLite database / print to console for now]
- Output file path: [e.g., './output/jobs.csv']

**Constraints:**
- Be polite to the server: add a [1-3] second random delay between requests
- Set a realistic User-Agent header
- Handle HTTP errors gracefully (retry on 429/503, skip on 404)
- Log progress: print which page/item is being scraped

Generate:
1. The complete scraper script
2. A requirements.txt / package.json with exact dependency versions
3. Instructions to run it
4. Comments explaining the key parts so I can modify it

Make it clean and readable — I may need to modify selectors when the site updates.
Tip: Don't hardcode selectors as magic strings without comments. Add a comment above each selector explaining what it targets in plain English — e.g., `# The outer container for each job card listing`. When the site redesigns (and it will), you'll know exactly where to look to update the selector, and you can ask AI to help you update it if you paste the new HTML structure.
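The commented-selector pattern might look like this minimal requests+BeautifulSoup sketch. The selectors (`div.job-card`, `h2.job-title`, `span.company-name`) are the example values from the prompt template, not anything from a real site:

```python
from bs4 import BeautifulSoup

# The outer container for each job card listing
CARD_SELECTOR = "div.job-card"
# The job title heading inside each card
TITLE_SELECTOR = "h2.job-title"
# The company name span inside each card
COMPANY_SELECTOR = "span.company-name"

def parse_listings(html):
    """Extract one dict per job card; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select(CARD_SELECTOR):
        title = card.select_one(TITLE_SELECTOR)
        company = card.select_one(COMPANY_SELECTOR)
        rows.append({
            "title": title.get_text(strip=True) if title else None,
            "company": company.get_text(strip=True) if company else None,
        })
    return rows
```

Keeping the selectors as named constants at the top of the file means a site redesign is a three-line fix instead of a hunt through the parsing logic.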
3

Handle Pagination and Authentication

A scraper that only gets the first page of results is usually useless. Pagination handling is where most scrapers get complicated — sites use different patterns (URL parameters, infinite scroll, cursor-based APIs, next-page tokens) and each requires different logic.

Prompt Template
Extend my scraper to handle pagination and, if required, authentication.

**My current scraper:**
```python
[Paste your current scraper code from Step 2]
```

**Pagination details:** What type of pagination does the site use? [Choose the closest match and fill in details]
- **URL parameter**: Page number in URL — e.g., `?page=1`, `?page=2` ... total pages: [number, or 'unknown']
- **Next-page button/link**: Each page has a 'Next' button — selector for next button: [CSS selector]
- **Offset/limit**: URL uses offset — e.g., `?offset=0&limit=20`, `?offset=20&limit=20` ...
- **Cursor/token-based**: API returns a next_cursor or next_page_token in the response that you pass to get the next page
- **Infinite scroll**: Content loads as you scroll — I need to simulate scrolling
- **I don't know**: Here's what the URL looks like on pages 1, 2, 3: [paste URLs]

**Stopping condition:** [e.g., 'Stop when next button is absent,' 'Stop after 10 pages during testing, remove limit for production,' 'Stop when API returns empty results array,' 'Stop when date of items goes older than 30 days']

**Authentication** (skip if not needed):
- Auth type: [Login form / Cookie from browser / API key in header / OAuth]
- If login form: Login URL [URL], username field name [field], password field name [field]
- If cookie: [Describe how to get the cookie — e.g., 'Log in manually in browser and copy the session cookie from DevTools']
- If API key: [Header name — e.g., 'Authorization: Bearer YOUR_KEY']

Update the scraper to:
1. Loop through all pages using the correct pagination pattern
2. Implement the stopping condition
3. Handle authentication if required
4. Track progress (print 'Page X of Y' or 'Scraped N items so far')
5. Resume from where it left off if interrupted (save progress to a checkpoint file)
6. Avoid scraping items already collected in a previous run (deduplication by [field — e.g., URL or ID])

Return the complete updated script.
Tip: Always add a maximum page limit parameter to your scraper, even if you intend to scrape everything. Set it to something like 1000 pages. This prevents infinite loops if the pagination logic has a bug, and it gives you a clear way to do a 'test run' on the first 5 pages before committing to a full crawl. When testing, start with `max_pages=3`.
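A page loop with the hard cap and an empty-page stopping condition might be sketched like this. It assumes URL-parameter pagination abstracted behind a `fetch_page(page)` callable; other pagination styles would need a different loop body:

```python
import random
import time

def crawl(fetch_page, max_pages=1000, delay_range=(1, 3)):
    """Loop pages until an empty result or the hard page cap.

    fetch_page(page_number) -> list of items; an empty list means stop.
    max_pages is the safety net against pagination bugs -- set it to 3
    for a test run before committing to a full crawl.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # stopping condition: the site ran out of results
            break
        items.extend(batch)
        print(f"Page {page}: {len(batch)} items ({len(items)} total)")
        # Be polite to the server between requests.
        time.sleep(random.uniform(*delay_range))
    return items
```

Because the cap is an argument rather than a constant buried in the loop, switching between a 3-page test run and a full crawl is a one-word change at the call site.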
4

Clean and Structure the Scraped Data

Raw scraped data is almost never in the format you actually need. Strings have leading/trailing whitespace, dates are in inconsistent formats, numbers have currency symbols or commas, some fields are missing, and duplicates slip through. AI can generate data cleaning and transformation code tailored to your specific fields.

Prompt Template
Help me clean and structure the data my scraper collected.

**Raw data sample** (paste 3-5 rows of what my scraper currently outputs):
```
[Paste sample rows — e.g., JSON array or CSV rows]
```

**Problems I can see in the raw data:**
- [Problem 1 — e.g., 'Salary is a string like " $80,000 - $120,000 / year " — I want two numeric fields: salary_min and salary_max in USD']
- [Problem 2 — e.g., 'Date posted is like "3 days ago" or "Posted Dec 15" — I want ISO 8601 date: 2026-03-14']
- [Problem 3 — e.g., 'Some location values are "Remote" and others are "New York, NY, USA" — I want to split into city, state, country fields and a boolean is_remote']
- [Problem 4 — e.g., 'Some rows have null/missing salary — keep them but mark salary as null, not empty string']
- [Problem 5 — e.g., 'HTML tags occasionally appear in description field — strip all HTML tags']

**Target schema** (what I want the cleaned data to look like):
```
{
  "id": "string (unique hash of URL)",
  "title": "string (trimmed)",
  "company": "string (trimmed)",
  "location_city": "string or null",
  "location_state": "string or null",
  "is_remote": "boolean",
  "salary_min": "integer or null (USD/year)",
  "salary_max": "integer or null (USD/year)",
  "date_posted": "ISO 8601 date string or null",
  "url": "string (full canonical URL)",
  "scraped_at": "ISO 8601 datetime"
}
```

Generate:
1. A data cleaning module that transforms raw scraped output to the target schema
2. A validation step that checks each row conforms to the schema and logs any rows that don't — don't silently drop them
3. Deduplication logic based on [field — e.g., 'URL']
4. A summary report at the end: total rows scraped, rows after deduplication, rows with missing salary, rows with missing date
5. Output to [CSV / JSON / SQLite — specify your choice]

Use only standard library modules plus [pandas if Python] — no other dependencies.
Tip: Never overwrite your raw scraped data when cleaning. Save the raw output first, then apply your cleaning pipeline to produce a separate clean output. This means when your cleaning logic has a bug (and it will), you can fix the code and re-run the cleaning step without having to re-scrape the source site. Add the raw data directory to .gitignore since it can get large.
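As one concrete example of this kind of cleaning, the salary problem from the prompt template might be handled like this. It's a sketch for the " $80,000 - $120,000 / year " style of string; other formats (hourly rates, other currencies) would need their own handling:

```python
import re

def parse_salary(raw):
    """Turn ' $80,000 - $120,000 / year ' into (80000, 120000).

    Returns (None, None) when no numbers are present -- the row is
    kept and marked null, never silently dropped. A single figure
    like '$95,000' yields (95000, 95000).
    """
    if not raw:
        return (None, None)
    # Grab digit runs (commas allowed), then strip the commas.
    numbers = [
        int(s)
        for s in (m.replace(",", "") for m in re.findall(r"[\d,]+", raw))
        if s
    ]
    if not numbers:
        return (None, None)
    return (min(numbers), max(numbers))
```

Because it's a pure function, it can be unit-tested against saved raw rows and re-run freely without touching the source site.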
5

Polish Your Output with Coda One

Give your AI-generated content a final polish — fix grammar, improve readability, and make it sound more natural.

Tip: Free tools, no signup required. Just paste your text and go.
6

Schedule and Monitor Scraper Runs

A scraper you run once manually is a script. A scraper that runs on a schedule, handles errors, and notifies you when something goes wrong is a data pipeline. AI can generate the scheduling configuration and monitoring code to make your scraper production-grade.

Prompt Template
Help me operationalize my scraper — schedule it to run automatically and monitor it for failures.

**My scraper:**
- What it does: [Brief description — e.g., 'Scrapes job listings from X, outputs to SQLite database']
- Script location: [e.g., '/home/user/scrapers/jobs_scraper.py']
- Average runtime: [e.g., '5-10 minutes for full run']
- Required environment: [Python 3.12, dependencies in requirements.txt / Docker container]

**Scheduling:**
- How often to run: [e.g., 'Every 6 hours / Daily at 2 AM UTC / Every Monday at 9 AM']
- Where to run: [Local machine via cron / Linux server via cron / GitHub Actions / Cloud scheduler (GCP/AWS) / Render.com cron job]

**Failure scenarios to handle:**
1. Site is down or returns 5xx errors → [e.g., 'Retry up to 3 times with 10-minute gaps, then mark run as failed and notify me']
2. Scraper crashes mid-run → [e.g., 'Log the error with full traceback, notify me, and preserve whatever data was collected before the crash']
3. Zero items scraped (site may have changed structure) → [e.g., 'This is almost certainly a bug — always notify me if a run completes with 0 items']
4. Run takes longer than expected → [e.g., 'If runtime exceeds 30 minutes, something is wrong — kill the process and notify me']

**Notifications:**
- Notify me via: [Email / Slack webhook / Telegram bot / Discord webhook / simple log file]
- Notify on: [Failure only / Success + failure / Failure + zero items]
- [If Slack/Telegram/Discord]: Webhook URL or bot token: [you'll fill this in]

Generate:
1. The scheduling configuration (crontab entry, GitHub Actions workflow, or cloud scheduler config)
2. A wrapper script that handles timeout enforcement, error catching, and notification
3. A run log format: each run should record start time, end time, items scraped, status (success/fail), error message if any
4. A simple health dashboard: a script I can run manually to see the last 10 run results
5. Instructions for setting up notifications

Also: what should I do if the site changes structure and my scraper suddenly starts returning wrong data silently (without crashing)? How do I detect data quality degradation?
Tip: Add a data quality check as the last step of every scraper run — not just error checking. For example: 'If this run scraped fewer than 80% of the items that the previous run scraped, treat it as a warning.' Sites often partially redesign, breaking some but not all selectors, resulting in a run that 'succeeds' but only captures half the data. Comparing this-run vs. last-run item count catches this silently degrading scenario.
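The this-run vs. last-run comparison from the tip might be sketched as a small classifier you call at the end of each run; the status strings and the 80% default are assumptions to adapt to your pipeline:

```python
def check_run_quality(current_count, previous_count, threshold=0.8):
    """Classify a finished run by comparing item counts with the last run.

    Returns 'alert' for zero items (scraper almost certainly broken),
    'warning' when the count collapsed below the threshold fraction of
    the previous run (likely a partial redesign broke some selectors),
    and 'ok' otherwise.
    """
    if current_count == 0:
        return "alert"
    if previous_count and current_count < threshold * previous_count:
        return "warning"
    return "ok"
```

Wiring this into the notification step means a run that technically "succeeds" but captures half the usual data still pings you.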


Frequently Asked Questions

Is web scraping legal?
The legal landscape is genuinely complicated and varies by jurisdiction, site, and what you do with the data. The clearest rules: never scrape in a way that violates a site's Terms of Service if you have an account there; never scrape personal data and store it without legal basis (GDPR, CCPA); never use scraped data in ways the ToS explicitly prohibits (commercial redistribution, training ML models, etc.). The US courts have generally held that scraping publicly available data is legal (hiQ v. LinkedIn), but this is not universal law. Practical guidance: read the site's robots.txt and ToS, don't scrape at a rate that degrades the site's performance, don't circumvent authentication or paywalls, and if you're building a commercial product on scraped data, have a lawyer review your situation. When in doubt, see if the site offers a public API instead.
My scraper keeps getting blocked. What do I do?
Getting blocked is a signal, not an obstacle to engineer around by default — first ask whether you should be scraping this site at all. If you've determined it's legitimate: the most common blocking triggers are too-fast request rate (add random delays of 2-5 seconds between requests), a missing or bot-like User-Agent header (use a real browser User-Agent string), and missing headers that a real browser would send (Accept-Language, Accept-Encoding, etc.). For more sophisticated blocking, Playwright or Puppeteer with stealth plugins can help because they drive a real browser. If a site is using Cloudflare or similar, respect that — they're blocking you intentionally. Never pay for tools designed specifically to defeat anti-bot measures for sites that clearly don't want to be scraped.
What's the best Python library for web scraping?
It depends on what you're scraping. For static HTML pages: requests + BeautifulSoup is the fastest to get working and has the most AI training data behind it. For sites requiring JavaScript rendering: Playwright is currently the best option (more reliable and faster than Selenium, better maintained than Puppeteer in Python). For large-scale scraping with built-in rate limiting, retry logic, and pipeline management: Scrapy is the right tool, though it has a steeper learning curve. The hierarchy: start with requests + BeautifulSoup; move to Playwright if the site needs JavaScript; only reach for Scrapy if you're scraping at scale (tens of thousands of pages) and need a production-grade framework.
The site updated its HTML and now my scraper is broken. How do I fix it quickly?
This is the most common scraper maintenance task. The fastest fix with AI: open the page in your browser, right-click the element you want to target, click 'Inspect,' copy the surrounding HTML (a few levels of nesting around your target element), paste it into Claude or ChatGPT, and ask 'Write a CSS selector that reliably targets [the data element] in this HTML.' Then update the selector in your code. For a more durable fix, ask AI to write a selector that uses structural attributes (IDs, data-* attributes, aria-labels) rather than generic class names — class names change constantly in sites using CSS-in-JS frameworks, while semantic attributes tend to be more stable.
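The fragile-vs-durable distinction can be seen side by side in a toy example; the hashed class names and `data-testid` attribute here are invented to illustrate the pattern:

```python
from bs4 import BeautifulSoup

# CSS-in-JS frameworks emit hashed class names that change every build,
# while data-* attributes tend to survive redesigns.
html = (
    '<div class="css-1x2y3z" data-testid="job-card">'
    '<h2 class="css-9q8w7e">Data Engineer</h2>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Fragile: tied to a generated class name
fragile = soup.select_one("div.css-1x2y3z h2")
# More durable: targets the semantic attribute instead
durable = soup.select_one('div[data-testid="job-card"] h2')
```

Both selectors work today, but only the second has a chance of surviving the next deploy.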

Tags: web scraping, data collection, python, automation, beautifulsoup, playwright, data pipeline, programming