
Data Validator

Build data quality validation pipelines with schema enforcement, anomaly detection, referential integrity checks, and data quality reports.

By community · 3,600 · v1.2.0 · Updated 2026-03-08

Install

Claude Code

Copy the SKILL.md file to .claude/skills/data-validator.md

About This Skill

Data Validator generates data quality validation code using industry-standard frameworks to catch data issues before they propagate through pipelines.

Schema Validation

  • JSON Schema — Draft 2020-12 validation with ajv (Node.js) or jsonschema (Python). Generates schema from sample data automatically.
  • Pydantic — Python data models with field validators, pre/post validators, and discriminated unions for polymorphic data.
  • Great Expectations — Expectation suites for batch data validation with data docs HTML reports.
  • Pandera — pandas DataFrame schema validation with statistical checks (column distributions, outlier thresholds).
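As a sketch of the kind of model-based validation described above, here is a minimal Pydantic (v2) example. The record type and its fields (`order_id`, `quantity`, `email`) are illustrative, not part of this skill's output:

```python
from pydantic import BaseModel, Field, ValidationError, field_validator

class OrderRecord(BaseModel):
    """Illustrative record type; the fields are made up for this sketch."""
    order_id: str = Field(min_length=1)
    quantity: int = Field(gt=0)  # structural rule: positive integer
    email: str

    @field_validator("email")
    @classmethod
    def email_has_at_sign(cls, value: str) -> str:
        # Deliberately loose format check; real code would use a stricter pattern.
        if "@" not in value:
            raise ValueError("not a plausible email address")
        return value

# Collect per-field errors instead of letting the exception propagate.
try:
    OrderRecord(order_id="A-1001", quantity=0, email="no-at-sign")
    failures = []
except ValidationError as exc:
    failures = [(err["loc"][0], err["type"]) for err in exc.errors()]
```

Validating eagerly at the model boundary like this surfaces every failing field in one pass, which is what makes the per-rule failure counts in the quality reports possible.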

Rule Types

  • Structural — required fields, data types, format (email, date, UUID), enum values
  • Statistical — value ranges (min/max), mean/std deviation bounds, null rate thresholds, cardinality limits
  • Referential — foreign key existence checks, orphan record detection, circular reference detection
  • Temporal — timestamp ordering, date range validity, event sequence integrity
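The referential rules above reduce to set membership. A framework-free sketch of orphan record detection (the row shapes and field names are invented for illustration):

```python
def find_orphans(child_rows, fk_field, parent_ids):
    """Referential rule: every child row's foreign key must exist in parent_ids.

    Returns the rows whose foreign key points at no parent (orphan records).
    """
    parent_ids = set(parent_ids)  # O(1) membership checks
    return [row for row in child_rows if row[fk_field] not in parent_ids]

users = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": "A", "user_id": 1},
    {"order_id": "B", "user_id": 99},  # orphan: user 99 does not exist
]
orphans = find_orphans(orders, "user_id", (u["id"] for u in users))
```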

Anomaly Detection

Z-score and IQR methods for outlier detection. Seasonality-aware anomaly detection using STL decomposition for time-series data. CUSUM algorithm for drift detection.
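The two simplest detectors here need only the standard library. A hedged sketch using `statistics` (thresholds are the conventional defaults, not values prescribed by this skill):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * std]

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

series = [10, 11, 9, 10, 12, 10, 11, 100]  # 100 is an obvious spike
spikes = iqr_outliers(series)
```

Note that a single extreme value inflates the standard deviation enough to mask itself under a strict z-score cutoff, which is one reason to run the IQR check alongside it.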

Pipeline Integration

Drops into dbt tests, Airflow validation tasks, or GitHub Actions data quality gates that run on data files changed in pull requests.
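On the dbt side, the structural and referential rules map onto generic tests in a schema file. A minimal sketch (model and column names are illustrative):

```yaml
# models/schema.yml — not_null covers a structural rule,
# relationships covers a referential (foreign-key existence) rule
version: 2
models:
  - name: orders
    columns:
      - name: user_id
        tests:
          - not_null
          - relationships:
              to: ref('users')
              field: id
```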

Reporting

Data quality scorecard: total records, pass/fail counts per rule, sample failing rows, and trend over time. Slack/email alerts on quality degradation.
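A minimal sketch of the scorecard aggregation. The `(rule_name, passed, row)` tuple shape is invented for this example, not a real framework API:

```python
from collections import Counter

def scorecard(results):
    """Aggregate per-rule pass/fail counts and keep sample failing rows."""
    passes, fails = Counter(), Counter()
    samples = {}
    for rule, passed, row in results:
        if passed:
            passes[rule] += 1
        else:
            fails[rule] += 1
            samples.setdefault(rule, []).append(row)  # sample failing rows
    total = sum(passes.values()) + sum(fails.values())
    return {"total_checks": total, "passes": dict(passes),
            "fails": dict(fails), "sample_failures": samples}

card = scorecard([
    ("not_null:email", True, {"id": 1}),
    ("not_null:email", False, {"id": 2, "email": None}),
    ("range:quantity", True, {"id": 1}),
])
```

Persisting one such dict per pipeline run is enough to plot the trend over time and trigger the Slack/email alerts when a rule's fail rate degrades.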

Use Cases

  • Validating incoming API webhook payloads against JSON Schema before processing
  • Running data quality checks on ETL pipeline outputs before loading to warehouse
  • Detecting anomalies in time-series metrics (sudden spikes, missing data points)
  • Generating data quality scorecards for stakeholder reporting

Pros & Cons

Pros

  • + Great Expectations data docs provide human-readable quality reports for stakeholders
  • + Schema auto-generation from sample data accelerates initial setup
  • + Anomaly detection catches statistical outliers that rule-based checks miss
  • + dbt/Airflow integration makes validation a first-class pipeline citizen

Cons

  • - Great Expectations has significant setup overhead and a steep learning curve
  • - Statistical anomaly detection requires sufficient historical data to establish baselines
