Data Collector
Systematically collects structured data from websites, APIs, and documents into clean, normalized datasets ready for analysis.
Install
Claude Code
Copy the SKILL.md file to your project's .claude/skills/ directory
About This Skill
Data Collector is a systematic data gathering skill that transforms scattered information across websites, APIs, and documents into clean, analysis-ready datasets. It combines web fetching with intelligent extraction to handle both structured and unstructured sources.
Collection Strategies
API-First Approach
For sites with public APIs, the skill generates authenticated request code that handles pagination, rate limiting, and incremental updates. Supports REST and GraphQL APIs.
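The pagination pattern can be sketched as follows. This is a minimal, stdlib-only sketch that assumes a page-numbered API returning a `results` array and a `has_next` flag; adapt the key names and the `fetch_page` callable (e.g. a `requests.get(...).json()` wrapper) to the API you are actually targeting:

```python
import time
from typing import Callable, Iterator

def paginate(fetch_page: Callable[[int], dict], delay: float = 0.5) -> Iterator[dict]:
    """Yield records from a page-numbered API until the last page.

    `fetch_page(page)` is assumed to return a payload shaped like
    {"results": [...], "has_next": bool} -- an illustrative convention,
    not a fixed contract.
    """
    page = 1
    while True:
        payload = fetch_page(page)
        yield from payload.get("results", [])
        if not payload.get("has_next"):
            break
        page += 1
        time.sleep(delay)  # politeness delay between successive requests
```

Keeping the HTTP call behind a callable makes the loop testable offline and lets the same driver serve both REST and GraphQL endpoints.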
HTML Extraction
For web pages without APIs, uses semantic HTML parsing to extract data:
- Tables → CSV rows
- Lists and cards → JSON arrays
- Repeated patterns → structured records
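The tables-to-rows step can be sketched with Python's built-in `html.parser`; real-world pages with malformed markup may call for a more tolerant parser such as BeautifulSoup, so treat this as an illustration of the idea rather than the skill's exact implementation:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect each <tr> of an HTML table as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self._row = []        # cells of the row being built
        self._cell = []       # text fragments of the cell being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

parser = TableExtractor()
parser.feed("<table><tr><th>name</th><th>price</th></tr>"
            "<tr><td>Widget</td><td>9.99</td></tr></table>")
# parser.rows → [['name', 'price'], ['Widget', '9.99']]
```

Each row list can then be written out directly with `csv.writer`.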
Document Parsing
Extracts from PDFs, Word documents, and spreadsheets:
- Table detection and extraction
- Form field recognition
- Metadata capture (author, date, title)
Data Quality
- Schema validation — enforces consistent field types across records
- Deduplication — fingerprints records to remove exact and near-duplicates
- Missing value handling — flags or fills gaps according to the configured strategy
- Source attribution — every record tagged with URL, timestamp, and collection method
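The fingerprinting step behind deduplication can be sketched like this; the hash choice (SHA-256 over a canonical JSON form) and the optional `keys` parameter are illustrative assumptions, not the skill's exact scheme:

```python
import hashlib
import json

def fingerprint(record: dict, keys=None) -> str:
    """Stable content hash of a record; `keys` limits which fields are compared."""
    subset = {k: record[k] for k in (keys or sorted(record))}
    # Canonical serialization: sorted keys so field order never changes the hash.
    canon = json.dumps(subset, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

def dedupe(records, keys=None):
    """Drop exact duplicates, keeping the first occurrence of each fingerprint."""
    seen, unique = set(), []
    for rec in records:
        fp = fingerprint(rec, keys)
        if fp not in seen:
            seen.add(fp)
            unique.append(rec)
    return unique
```

Near-duplicate detection would additionally normalize values (case, whitespace, units) before hashing, or compare similarity scores instead of exact fingerprints.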
Output Formats
- CSV/TSV for spreadsheet tools
- JSONL for streaming processing
- SQLite for queryable local datasets
- Parquet for large-scale analytics
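As one concrete output path, loading collected records into SQLite takes only the standard library; the table name and the untyped schema inferred from the first record are illustrative choices:

```python
import sqlite3

def write_sqlite(records, db_path=":memory:", table="records"):
    """Load a list of homogeneous dicts into a queryable SQLite table."""
    cols = list(records[0])  # assume every record shares the first record's fields
    conn = sqlite3.connect(db_path)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
    conn.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' for _ in cols)})",
        [tuple(r[c] for c in cols) for r in records],
    )
    conn.commit()
    return conn

conn = write_sqlite([{"name": "Widget", "price": 9.99}])
```

Passing a file path instead of `":memory:"` persists the dataset for later ad-hoc SQL queries.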
Rate Limiting and Politeness
Built-in configurable delays, retry logic with exponential backoff, and robots.txt checking before collection begins.
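Retry with exponential backoff can be sketched as below; the attempt count, delay cap, and jitter range are illustrative defaults rather than the skill's configured values:

```python
import random
import time

def with_retries(fn, attempts=4, base_delay=1.0, max_delay=30.0):
    """Call `fn`, retrying on exception with capped exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so many clients don't hammer in lockstep.
            time.sleep(delay * random.uniform(0.5, 1.0))
```

A robots.txt check (via `urllib.robotparser`) would run once per host before any such request loop starts.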
Use Cases
- Collecting competitor pricing data from e-commerce sites
- Aggregating news and social media mentions for brand monitoring
- Building datasets from public government and research APIs
- Extracting structured information from PDF reports and documents
Pros & Cons
Pros
- API-first approach avoids fragile HTML scraping when possible
- Source attribution on every record for data lineage
- Multiple output formats for different downstream tools
- Built-in politeness controls to avoid overloading target sites
Cons
- HTML extraction breaks when target sites update their layouts
- You must verify legal permission to collect from each target source
Related AI Tools
Perplexity
Freemium
AI-powered search engine that answers questions with cited sources
- Real-time web search with inline source citations
- Pro Search multi-step deep research automation
- Multiple model options (Sonar, GPT-4o, Claude)
Claude Code
Paid
Anthropic's agentic CLI for autonomous terminal-native coding workflows
- Terminal-native autonomous coding agent
- Full file system and shell access for multi-step tasks
- Deep codebase understanding via repository indexing
Cursor
Freemium
AI-native code editor with deep multi-model integration and agentic coding
- AI-native Cmd+K inline editing and generation
- Composer Agent for autonomous multi-file changes
- Full codebase indexing and context awareness
Related Skills
Data Pipeline
Designs and implements ETL/ELT data pipelines using Python, SQL, and orchestration tools like Airflow, dbt, and Prefect for batch and streaming workflows.
CSV Transformer
Transforms, cleans, and converts data between CSV, JSON, Excel, and other tabular formats with column mapping, type casting, and validation.