Data Collector
Systematically collects structured data from websites, APIs, and documents into clean, normalized datasets ready for analysis.
Install
Claude Code
Copy the SKILL.md file to your project's .claude/skills/ directory
About This Skill
Data Collector is a systematic data gathering skill that transforms scattered information across websites, APIs, and documents into clean, analysis-ready datasets. It combines web fetching with intelligent extraction to handle both structured and unstructured sources.
Collection Strategies
API-First Approach
For sites with public APIs, the skill generates authenticated request code that handles pagination, rate limiting, and incremental updates. Supports REST and GraphQL APIs.
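The pagination pattern can be sketched as follows. This is a minimal, stdlib-only sketch that assumes a page-numbered API returning a `results` array and a `has_next` flag; adapt the key names and the `fetch_page` callable (e.g. a `requests.get(...).json()` wrapper) to the API you are actually targeting:

```python
import time
from typing import Callable, Iterator

def paginate(fetch_page: Callable[[int], dict], delay: float = 0.5) -> Iterator[dict]:
    """Yield records from a page-numbered API until the last page.

    `fetch_page(page)` is assumed to return a payload shaped like
    {"results": [...], "has_next": bool} -- an illustrative convention,
    not a fixed contract.
    """
    page = 1
    while True:
        payload = fetch_page(page)
        yield from payload.get("results", [])
        if not payload.get("has_next"):
            break
        page += 1
        time.sleep(delay)  # politeness delay between successive requests
```

Keeping the HTTP call behind a callable makes the loop testable offline and lets the same driver serve both REST and GraphQL endpoints.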
HTML Extraction
For web pages without APIs, uses semantic HTML parsing to extract data:
- Tables → CSV rows
- Lists and cards → JSON arrays
- Repeated patterns → structured records
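The tables-to-rows step can be sketched with Python's built-in `html.parser`; real-world pages with malformed markup may call for a more tolerant parser such as BeautifulSoup, so treat this as an illustration of the idea rather than the skill's exact implementation:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect each <tr> of an HTML table as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self._row = []        # cells of the row being built
        self._cell = []       # text fragments of the cell being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

parser = TableExtractor()
parser.feed("<table><tr><th>name</th><th>price</th></tr>"
            "<tr><td>Widget</td><td>9.99</td></tr></table>")
# parser.rows → [['name', 'price'], ['Widget', '9.99']]
```

Each row list can then be written out directly with `csv.writer`.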
Document Parsing
Extracts from PDFs, Word documents, and spreadsheets:
- Table detection and extraction
- Form field recognition
- Metadata capture (author, date, title)
Data Quality
- Schema validation — enforces consistent field types across records
- Deduplication — fingerprints records to remove exact and near-duplicates
- Missing value handling — flags or fills gaps according to the configured strategy
- Source attribution — every record tagged with URL, timestamp, and collection method
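The fingerprinting step behind deduplication can be sketched like this; the hash choice (SHA-256 over a canonical JSON form) and the optional `keys` parameter are illustrative assumptions, not the skill's exact scheme:

```python
import hashlib
import json

def fingerprint(record: dict, keys=None) -> str:
    """Stable content hash of a record; `keys` limits which fields are compared."""
    subset = {k: record[k] for k in (keys or sorted(record))}
    # Canonical serialization: sorted keys so field order never changes the hash.
    canon = json.dumps(subset, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

def dedupe(records, keys=None):
    """Drop exact duplicates, keeping the first occurrence of each fingerprint."""
    seen, unique = set(), []
    for rec in records:
        fp = fingerprint(rec, keys)
        if fp not in seen:
            seen.add(fp)
            unique.append(rec)
    return unique
```

Near-duplicate detection would additionally normalize values (case, whitespace, units) before hashing, or compare similarity scores instead of exact fingerprints.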
Output Formats
- CSV/TSV for spreadsheet tools
- JSONL for streaming processing
- SQLite for queryable local datasets
- Parquet for large-scale analytics
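As one concrete output path, loading collected records into SQLite takes only the standard library; the table name and the untyped schema inferred from the first record are illustrative choices:

```python
import sqlite3

def write_sqlite(records, db_path=":memory:", table="records"):
    """Load a list of homogeneous dicts into a queryable SQLite table."""
    cols = list(records[0])  # assume every record shares the first record's fields
    conn = sqlite3.connect(db_path)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
    conn.executemany(
        f"INSERT INTO {table} VALUES ({', '.join('?' for _ in cols)})",
        [tuple(r[c] for c in cols) for r in records],
    )
    conn.commit()
    return conn

conn = write_sqlite([{"name": "Widget", "price": 9.99}])
```

Passing a file path instead of `":memory:"` persists the dataset for later ad-hoc SQL queries.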
Rate Limiting and Politeness
Built-in configurable delays, retry logic with exponential backoff, and robots.txt checking before collection begins.
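Retry with exponential backoff can be sketched as below; the attempt count, delay cap, and jitter range are illustrative defaults rather than the skill's configured values:

```python
import random
import time

def with_retries(fn, attempts=4, base_delay=1.0, max_delay=30.0):
    """Call `fn`, retrying on exception with capped exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so many clients don't hammer in lockstep.
            time.sleep(delay * random.uniform(0.5, 1.0))
```

A robots.txt check (via `urllib.robotparser`) would run once per host before any such request loop starts.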
Use Cases
- Collecting competitor pricing data from e-commerce sites
- Aggregating news and social media mentions for brand monitoring
- Building datasets from public government and research APIs
- Extracting structured information from PDF reports and documents
Pros & Cons
Pros
- API-first approach avoids fragile HTML scraping when possible
- Source attribution on every record for data lineage
- Multiple output formats for different downstream tools
- Built-in politeness controls to avoid overloading target sites
Cons
- HTML extraction breaks when target sites update their layouts
- You must verify legal permission to collect from each target source
Related AI Tools
Perplexity
Freemium
AI-powered search engine that answers questions with cited sources
- Real-time web search with inline source citations
- Pro Search multi-step deep research automation
- Multiple model options (Sonar, GPT-4o, Claude)
Claude Code
Paid
Anthropic's agentic CLI for autonomous terminal-native coding workflows
- Terminal-native autonomous coding agent
- Full file system and shell access for multi-step tasks
- Deep codebase understanding via repository indexing
Cursor
Freemium
AI-native code editor with deep multi-model integration and agentic coding
- AI-native Cmd+K inline editing and generation
- Composer Agent for autonomous multi-file changes
- Full codebase indexing and context awareness
Related Skills
Data Pipeline
Designs and implements ETL/ELT data pipelines using Python, SQL, and orchestration tools like Airflow, dbt, and Prefect for batch and streaming workflows.
CSV Transformer
Transforms, cleans, and converts data between CSV, JSON, Excel, and other tabular formats with column mapping, type casting, and validation.