Data Collector

Systematically collects structured data from websites, APIs, and documents into clean, normalized datasets ready for analysis.

By Anthropic · 5,300 · v1.1.2 · Updated 2026-03-10

Install

Claude Code

Copy the SKILL.md file to your project's .claude/skills/ directory

About This Skill

Data Collector is a systematic data gathering skill that transforms scattered information across websites, APIs, and documents into clean, analysis-ready datasets. It combines web fetching with intelligent extraction to handle both structured and unstructured sources.

Collection Strategies

API-First Approach
For sites with public APIs, the skill generates authenticated request code that handles pagination, rate limiting, and incremental updates. Both REST and GraphQL APIs are supported.
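The pagination loop the skill generates might look like the following sketch. The endpoint, the `page`/`per_page` query parameters, and the bearer-token header are illustrative assumptions; real APIs name these differently, and the fetch function is injectable so the loop can be tested without a network.

```python
import json
import time
import urllib.request


def fetch_all_pages(base_url, token, per_page=100, fetch=None, delay=0.5):
    """Collect every record from a paginated REST endpoint.

    `base_url`, `token`, and the page/per_page parameter names are
    hypothetical; adjust them to the target API. A custom `fetch`
    callable can be passed in for testing or for GraphQL cursors.
    """
    if fetch is None:
        def fetch(url):
            req = urllib.request.Request(
                url, headers={"Authorization": f"Bearer {token}"})
            with urllib.request.urlopen(req) as resp:
                return json.loads(resp.read())

    records, page = [], 1
    while True:
        batch = fetch(f"{base_url}?page={page}&per_page={per_page}")
        if not batch:  # an empty page signals the end of the collection
            break
        records.extend(batch)
        page += 1
        time.sleep(delay)  # basic politeness delay between pages
    return records
```

Incremental updates would add a `since` or cursor parameter to the same loop so only new records are fetched on later runs.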

HTML Extraction
For web pages without APIs, semantic HTML parsing extracts data:

  • Tables → CSV rows
  • Lists and cards → JSON arrays
  • Repeated patterns → structured records
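The "Tables → CSV rows" path can be sketched with only the standard library. This is a minimal illustration assuming well-formed markup (every cell inside a row); production extraction would need to handle `colspan`, nested tables, and malformed HTML.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")  # start a new empty cell

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def table_to_csv(html):
    """Render every table row found in `html` as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    buf = io.StringIO()
    csv.writer(buf).writerows(parser.rows)
    return buf.getvalue()
```

The same event-driven pattern extends to "lists and cards": match the repeating container tag instead of `<tr>` and emit one JSON object per match.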

Document Parsing
Extracts data from PDFs, Word documents, and spreadsheets:

  • Table detection and extraction
  • Form field recognition
  • Metadata capture (author, date, title)
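As one concrete case of metadata capture: a `.docx` file is a ZIP archive whose `docProps/core.xml` part carries Dublin Core properties, so author, title, and creation date can be read with the standard library alone. This sketch covers only OOXML documents; PDFs and legacy formats need dedicated parsers.

```python
import xml.etree.ElementTree as ET
import zipfile

# Dublin Core namespaces used by OOXML core properties.
DC = "{http://purl.org/dc/elements/1.1/}"
DCTERMS = "{http://purl.org/dc/terms/}"


def docx_metadata(path):
    """Read author/title/created from a .docx core-properties part.

    `path` may be a filename or a file-like object; missing
    properties come back as None.
    """
    with zipfile.ZipFile(path) as z:
        root = ET.fromstring(z.read("docProps/core.xml"))
    return {
        "author": root.findtext(f"{DC}creator"),
        "title": root.findtext(f"{DC}title"),
        "created": root.findtext(f"{DCTERMS}created"),
    }
```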

Data Quality

  • Schema validation — enforces consistent field types across records
  • Deduplication — fingerprints records to remove exact and near-duplicates
  • Missing value handling — flags or fills missing values, depending on the configured strategy
  • Source attribution — every record tagged with URL, timestamp, and collection method
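The deduplication step can be sketched as a stable fingerprint over the fields that define record identity. The key names here are hypothetical; the light normalization (trim and lowercase strings before hashing) is one simple way to catch near-duplicates that differ only in case or whitespace.

```python
import hashlib
import json


def fingerprint(record, keys):
    """Stable SHA-256 hash over the identity fields of a record."""
    subset = {
        # Normalize strings so trivially different copies collide.
        k: (record.get(k).strip().lower()
            if isinstance(record.get(k), str) else record.get(k))
        for k in keys
    }
    blob = json.dumps(subset, sort_keys=True, default=str)
    return hashlib.sha256(blob.encode()).hexdigest()


def deduplicate(records, keys):
    """Keep the first record seen for each distinct fingerprint."""
    seen, unique = set(), []
    for rec in records:
        fp = fingerprint(rec, keys)
        if fp not in seen:
            seen.add(fp)
            unique.append(rec)
    return unique
```

Schema validation would run before this step, so every record entering the dedup pass already has consistent field types.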

Output Formats

  • CSV/TSV for spreadsheet tools
  • JSONL for streaming processing
  • SQLite for queryable local datasets
  • Parquet for large-scale analytics
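Two of these sinks fit in a short standard-library sketch: JSONL appends one object per line, and SQLite infers columns from the first record. This is a minimal illustration (no type affinity, no migrations); Parquet would need a library such as pyarrow.

```python
import json
import sqlite3


def write_jsonl(records, path):
    """Append one JSON object per line; streams and appends safely."""
    with open(path, "a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def write_sqlite(records, db_path, table="records"):
    """Create `table` from the first record's keys, then bulk-insert."""
    cols = list(records[0])
    con = sqlite3.connect(db_path)
    col_defs = ", ".join(f'"{c}"' for c in cols)
    con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
    placeholders = ", ".join("?" for _ in cols)
    con.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [tuple(r.get(c) for c in cols) for r in records],
    )
    con.commit()
    return con  # returned so callers can query the dataset directly
```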

Rate Limiting and Politeness

The skill applies configurable delays between requests, retries transient failures with exponential backoff, and checks robots.txt before collection begins.
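Both behaviors have straightforward standard-library sketches. The user-agent string is an assumption, and the robots check performs a real network fetch, so only the backoff helper is exercised offline here.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse


def allowed_by_robots(url, user_agent="data-collector"):
    """Consult the site's robots.txt before collecting `url`.

    Note: this fetches robots.txt over the network.
    """
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)


def fetch_with_backoff(fetch, url, retries=4, base_delay=1.0):
    """Retry transient failures, doubling the delay each attempt."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except OSError:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * 2 ** attempt)
```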

Use Cases

  • Collecting competitor pricing data from e-commerce sites
  • Aggregating news and social media mentions for brand monitoring
  • Building datasets from public government and research APIs
  • Extracting structured information from PDF reports and documents

Pros & Cons

Pros

  • + API-first approach avoids fragile HTML scraping when possible
  • + Source attribution on every record for data lineage
  • + Multiple output formats for different downstream tools
  • + Built-in politeness controls to avoid overloading target sites

Cons

  • - HTML extraction breaks when target sites update their layouts
  • - Must verify legal permission to collect from each target source
