Data Collector
Systematically collects structured data from websites, APIs, and documents into clean, normalized datasets ready for analysis.
To install, copy the SKILL.md file to your project's .claude/skills/ directory.

About This Skill
Data Collector is a systematic data gathering skill that transforms scattered information across websites, APIs, and documents into clean, analysis-ready datasets. It combines web fetching with intelligent extraction to handle both structured and unstructured sources.
Collection Strategies
API-First Approach
For sites with public APIs, generates authenticated request code that handles pagination, rate limiting, and incremental updates. Supports REST and GraphQL APIs.
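The pagination loop behind this approach can be sketched as follows. The response shape ({"items": [...], "next_page": ...}) and the Bearer-token header are assumptions for illustration, not any specific API; adapt both to the endpoint you are collecting from.

```python
import json
import time
import urllib.parse
import urllib.request

def collect_paginated(fetch_page, delay=0.0):
    """Drain a paginated endpoint into one list of records.

    `fetch_page(page_number)` must return a parsed JSON body shaped like
    {"items": [...], "next_page": <int or None>} -- an assumed (but
    common) pagination scheme; adjust the keys for the real API.
    """
    records, page = [], 1
    while page is not None:
        body = fetch_page(page)
        records.extend(body["items"])
        page = body.get("next_page")  # None ends the loop
        time.sleep(delay)             # politeness delay between requests
    return records

def make_http_fetcher(base_url, token):
    """Build a fetch_page callable that sends an authenticated GET."""
    def fetch_page(page):
        url = f"{base_url}?{urllib.parse.urlencode({'page': page})}"
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return fetch_page
```

Separating the pagination logic from the HTTP call keeps the loop testable without a live server and makes it easy to swap in GraphQL cursor pagination later.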
HTML Extraction
For web pages without APIs, uses semantic HTML parsing to extract data:
- Tables → CSV rows
- Lists and cards → JSON arrays
- Repeated patterns → structured records
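A minimal sketch of the tables-to-rows case, using only the standard library's html.parser; real pages need extra handling for nested tables, colspans, and markup quirks:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect each <tr> as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []          # start a new row
        elif tag in ("td", "th"):
            self.cell = []         # start buffering cell text

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.row is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

def extract_table_rows(html):
    parser = TableExtractor()
    parser.feed(html)
    return parser.rows
```

Each returned row is ready to feed to csv.writer.writerow for the CSV output path.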
Document Parsing
Extracts from PDFs, Word documents, and spreadsheets:
- Table detection and extraction
- Form field recognition
- Metadata capture (author, date, title)
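For the metadata-capture case, Word documents are the easiest to sketch with the standard library alone: a .docx file is a ZIP archive whose Dublin Core metadata lives in docProps/core.xml. This covers only .docx; PDFs and spreadsheets need dedicated parsers.

```python
import zipfile
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"
DCTERMS = "{http://purl.org/dc/terms/}"

def docx_metadata(path_or_file):
    """Read author, title, and creation date from a .docx file."""
    with zipfile.ZipFile(path_or_file) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))
    return {
        "author": root.findtext(f"{DC}creator"),
        "title": root.findtext(f"{DC}title"),
        "created": root.findtext(f"{DCTERMS}created"),
    }
```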
Data Quality
- Schema validation — enforces consistent field types across records
- Deduplication — fingerprints records to remove exact and near-duplicates
- Missing value handling — either flags gaps or fills them, depending on the configured strategy
- Source attribution — every record tagged with URL, timestamp, and collection method
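The fingerprint-based deduplication above can be sketched as hashing a normalized canonical form of each record; which fields identify a record is an assumption you set per dataset:

```python
import hashlib
import json

def fingerprint(record, keys):
    """Hash the identifying fields of a record into a stable fingerprint.

    Normalizes casing and surrounding whitespace so trivial variants
    collapse to the same hash (catching near-duplicates of that kind).
    """
    canonical = {k: str(record.get(k, "")).strip().lower() for k in keys}
    blob = json.dumps(canonical, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def deduplicate(records, keys):
    """Keep the first record seen for each fingerprint."""
    seen, unique = set(), []
    for rec in records:
        fp = fingerprint(rec, keys)
        if fp not in seen:
            seen.add(fp)
            unique.append(rec)
    return unique
```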
Output Formats
- CSV/TSV for spreadsheet tools
- JSONL for streaming processing
- SQLite for queryable local datasets
- Parquet for large-scale analytics
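The JSONL and SQLite targets can be sketched with the standard library (CSV is similar via the csv module; Parquet needs an external library such as pyarrow). The homogeneous-records assumption and the default table name are illustrative choices:

```python
import json
import sqlite3

def write_jsonl(records, path):
    """Write one JSON object per line -- streamable and appendable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def write_sqlite(records, path, table="records"):
    """Load dict records into a queryable SQLite table.

    Assumes every record has the same keys; column types are left to
    SQLite's dynamic typing.
    """
    cols = list(records[0])
    placeholders = ", ".join("?" for _ in cols)
    with sqlite3.connect(path) as conn:
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})"
        )
        conn.executemany(
            f"INSERT INTO {table} VALUES ({placeholders})",
            [tuple(r[c] for c in cols) for r in records],
        )
```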
Rate Limiting and Politeness
Built-in configurable delays, retry logic with exponential backoff, and robots.txt checking before collection begins.
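The retry behavior described above can be sketched as exponential backoff with jitter; the parameter defaults are illustrative, not the skill's actual configuration:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Retry a flaky call with exponential backoff plus jitter.

    The delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay;
    the random jitter spreads retries out so parallel collectors don't
    hammer a recovering server in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

For the robots.txt check, the standard library's urllib.robotparser can answer can_fetch(user_agent, url) before any request is sent.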
Use Cases
- Collecting competitor pricing data from e-commerce sites
- Aggregating news and social media mentions for brand monitoring
- Building datasets from public government and research APIs
- Extracting structured information from PDF reports and documents
Pros & Cons
Pros
- API-first approach avoids fragile HTML scraping when possible
- Source attribution on every record for data lineage
- Multiple output formats for different downstream tools
- Built-in politeness controls to avoid overloading target sites
Cons
- HTML extraction breaks when target sites update their layouts
- Must verify legal permission to collect from each target source
Related AI Tools
Perplexity
AI-powered search engine that answers questions with cited sources
- Real-time web search with inline source citations
- Pro Search multi-step deep research automation
- Multiple model options (Sonar, GPT-4o, Claude)
Claude Code
Anthropic's agentic CLI for autonomous terminal-native coding workflows
- Terminal-native autonomous coding agent
- Full file system and shell access for multi-step tasks
- Deep codebase understanding via repository indexing
Cursor
AI-native code editor with deep multi-model integration and agentic coding
- AI-native Cmd+K inline editing and generation
- Composer Agent for autonomous multi-file changes
- Full codebase indexing and context awareness
Related Skills
Data Pipeline
Designs and implements ETL/ELT data pipelines using Python, SQL, and orchestration tools like Airflow, dbt, and Prefect for batch and streaming workflows.
CSV Transformer
Transforms, cleans, and converts data between CSV, JSON, Excel, and other tabular formats with column mapping, type casting, and validation.