Data Validator
Caution
Build data quality validation pipelines with schema enforcement, anomaly detection, referential integrity checks, and data quality reports.
Install
Claude Code
Copy the SKILL.md file to .claude/skills/data-validator.md
About This Skill
Data Validator generates data quality validation code using industry-standard frameworks to catch data issues before they propagate through pipelines.
Schema Validation
- JSON Schema — Draft 2020-12 validation with ajv (Node.js) or jsonschema (Python). Generates schema from sample data automatically.
- Pydantic — Python data models with field validators, pre/post validators, and discriminated unions for polymorphic data (see the sketch after this list).
- Great Expectations — Expectation suites for batch data validation with data docs HTML reports.
- Pandera — pandas DataFrame schema validation with statistical checks (column distributions, outlier thresholds).
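For instance, a minimal sketch of the kind of Pydantic model this skill might generate, written against Pydantic v2's `field_validator` API; the `UserEvent` fields and allowed event types are illustrative assumptions, not part of the skill's output:

```python
from datetime import datetime
from pydantic import BaseModel, Field, field_validator

class UserEvent(BaseModel):
    # Hypothetical payload shape; field names are illustrative
    user_id: int = Field(gt=0)
    email: str = Field(pattern=r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # simple email format check
    event_type: str
    created_at: datetime  # ISO-8601 strings are coerced automatically

    @field_validator("event_type")
    @classmethod
    def event_type_known(cls, v: str) -> str:
        allowed = {"signup", "login", "purchase"}
        if v not in allowed:
            raise ValueError(f"unknown event_type: {v!r}")
        return v

# Invalid payloads raise pydantic.ValidationError with per-field error details
event = UserEvent(
    user_id=42,
    email="user@example.com",
    event_type="signup",
    created_at="2024-01-01T00:00:00",
)
```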
Rule Types
- Structural — required fields, data types, format (email, date, UUID), enum values
- Statistical — value ranges (min/max), mean/std deviation bounds, null rate thresholds, cardinality limits (see the sketch after this list)
- Referential — foreign key existence checks, orphan record detection, circular reference detection
- Temporal — timestamp ordering, date range validity, event sequence integrity
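As a sketch of how the structural and statistical rule types might translate into a Pandera schema (the column names, enum values, and thresholds are assumptions for illustration):

```python
import pandas as pd
import pandera as pa

# Structural rules: types, uniqueness, enum values.
# Statistical rules: a value range and a null-rate threshold.
order_schema = pa.DataFrameSchema({
    "order_id": pa.Column(int, checks=pa.Check.gt(0), unique=True),
    "status": pa.Column(str, checks=pa.Check.isin(["new", "shipped", "delivered"])),
    "amount": pa.Column(float, checks=pa.Check.in_range(0, 10_000)),
    "coupon": pa.Column(
        str,
        nullable=True,
        # series-level check: at most 30% of coupon values may be null
        checks=pa.Check(lambda s: s.isna().mean() <= 0.30),
    ),
})

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "status": ["new", "shipped", "delivered"],
    "amount": [19.99, 5.00, 120.50],
    "coupon": ["SAVE10", None, "SAVE20"],
})
order_schema.validate(df, lazy=True)  # lazy=True reports all failures at once
```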
Anomaly Detection
Z-score and IQR methods for outlier detection. Seasonality-aware anomaly detection using STL decomposition for time-series data. CUSUM algorithm for drift detection.
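A minimal sketch of the two rule-of-thumb outlier methods named above; the thresholds (3σ and 1.5×IQR) are the conventional defaults, and the sample series is invented:

```python
import pandas as pd

def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() > threshold

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

metrics = pd.Series([10, 12, 11, 13, 12, 95, 11])
# On a sample this small the IQR rule flags the spike at 95, while
# z-scores are less sensitive because the spike inflates the std itself.
print(metrics[zscore_outliers(metrics) | iqr_outliers(metrics)])
```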
Pipeline Integration
Drops into dbt tests, Airflow task validation steps, or GitHub Actions data quality gates that run on data files changed in a PR.
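As one example of the Airflow integration, a hedged sketch using the TaskFlow API; the DAG name, staging file path, and column names are assumptions, not part of the skill:

```python
from datetime import datetime

import pandas as pd
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def etl_with_quality_gate():
    @task
    def validate_extract() -> None:
        # Assumed staging file; replace with your pipeline's output location
        df = pd.read_parquet("/data/staging/orders.parquet")
        if df["order_id"].isna().any():
            raise ValueError("null order_id values in staging data")
        if not df["order_id"].is_unique:
            raise ValueError("duplicate order_id values in staging data")
        # Raising here fails the task, so the load step never runs.

    @task
    def load_to_warehouse() -> None:
        ...  # runs only if the validation task succeeded

    validate_extract() >> load_to_warehouse()

etl_with_quality_gate()
```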
Reporting
Data quality scorecard: total records, pass/fail counts per rule, sample failing rows, and trend over time. Slack/email alerts on quality degradation.
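A minimal sketch of how such a scorecard could be assembled with pandas; the rule names and sample data are illustrative:

```python
import pandas as pd

def scorecard(df: pd.DataFrame, rules: dict) -> pd.DataFrame:
    """Rules map a name to a vectorized predicate returning True for passing rows."""
    rows = []
    for name, predicate in rules.items():
        passed = predicate(df)
        failures = df[~passed]
        rows.append({
            "rule": name,
            "total": len(df),
            "passed": int(passed.sum()),
            "failed": len(failures),
            "sample_failures": failures.head(3).to_dict("records"),
        })
    return pd.DataFrame(rows)

orders = pd.DataFrame({"amount": [10.0, -5.0, 99.0], "email": ["a@x.io", "", "b@y.io"]})
report = scorecard(orders, {
    "amount_non_negative": lambda d: d["amount"] >= 0,
    "email_present": lambda d: d["email"].str.len() > 0,
})
print(report[["rule", "total", "passed", "failed"]])
```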
Use Cases
- Validating incoming API webhook payloads against JSON Schema before processing (see the example after this list)
- Running data quality checks on ETL pipeline outputs before loading to warehouse
- Detecting anomalies in time-series metrics (sudden spikes, missing data points)
- Generating data quality scorecards for stakeholder reporting
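To make the first use case concrete, a sketch using the Python jsonschema package's Draft 2020-12 validator; the schema and payload are invented for illustration:

```python
from jsonschema import Draft202012Validator

schema = {
    "type": "object",
    "required": ["id", "email", "amount"],
    "properties": {
        "id": {"type": "string"},
        "email": {"type": "string", "format": "email"},
        "amount": {"type": "number", "minimum": 0},
    },
}

# Note: "format" assertions are only enforced if a format_checker is passed,
# e.g. Draft202012Validator(schema, format_checker=FormatChecker())
validator = Draft202012Validator(schema)
payload = {"id": "abc-123", "amount": -3}  # missing email, negative amount

# iter_errors collects every violation instead of stopping at the first
for err in validator.iter_errors(payload):
    print(f"{list(err.path)}: {err.message}")
# e.g. []: 'email' is a required property
#      ['amount']: -3 is less than the minimum of 0
```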
Pros & Cons
Pros
- + Great Expectations data docs provide human-readable quality reports for stakeholders
- + Schema auto-generation from sample data accelerates initial setup
- + Anomaly detection catches statistical outliers that rule-based checks miss
- + dbt/Airflow integration makes validation a first-class pipeline citizen
Cons
- - Great Expectations has significant setup overhead and a steep learning curve
- - Statistical anomaly detection requires sufficient historical data to establish baselines
Related AI Tools
Claude Code
Paid
Anthropic's agentic CLI for autonomous terminal-native coding workflows
- Terminal-native autonomous coding agent
- Full file system and shell access for multi-step tasks
- Deep codebase understanding via repository indexing
Cursor
Freemium
AI-native code editor with deep multi-model integration and agentic coding
- AI-native Cmd+K inline editing and generation
- Composer Agent for autonomous multi-file changes
- Full codebase indexing and context awareness
GitHub Copilot
Freemium
AI pair programmer that suggests code in real time across your IDE
- Real-time code completions across 30+ languages
- Copilot Chat for natural language code Q&A
- Pull request description and summary generation
Related Skills
Data Pipeline
Caution
Designs and implements ETL/ELT data pipelines using Python, SQL, and orchestration tools like Airflow, dbt, and Prefect for batch and streaming workflows.
Schema Designer
Verified
Designs relational and NoSQL database schemas with proper normalization, indexing strategies, migration scripts, and entity-relationship diagrams.