
Data Cleaner

Verified

Profiles, cleans, and standardizes messy datasets by detecting and fixing inconsistencies, outliers, duplicates, and formatting issues.

By Anthropic · 6,400 · v1.1.0 · Updated 2026-03-10

Install

Claude Code

Copy the SKILL.md file to your project's .claude/skills/ directory

About This Skill

Data Cleaner automates the most tedious part of any data project: getting raw data into a usable state. It goes beyond find-and-replace by inferring what each column represents (dates, currency, names, phone numbers) and applying a cleaning strategy suited to that kind of data.

Data Profiling

Before cleaning, Data Cleaner produces a profile report covering:

  • Column-level statistics (type, cardinality, null rate, min/max)
  • Distribution shapes for numeric columns
  • Pattern frequency for text columns (email, phone, date formats present)
  • Correlation matrix highlighting redundant features
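
To make the column-level statistics concrete, here is a minimal sketch in plain Python. The function name `profile_column` and the shape of the returned dictionary are assumptions for illustration, not the skill's actual report format, and the sketch covers only type, cardinality, null rate, and min/max.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column given as a list of raw values.

    Hypothetical helper: None and "" are both treated as missing.
    """
    present = [v for v in values if v is not None and v != ""]
    type_counts = Counter(type(v).__name__ for v in present)
    # bool is a subclass of int, so exclude it from numeric min/max.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return {
        "type": type_counts.most_common(1)[0][0] if type_counts else None,
        "cardinality": len(set(present)),
        "null_rate": 1 - len(present) / len(values) if values else 0.0,
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }
```

For example, `profile_column([3, 1, None, 3])` reports type `int`, cardinality 2, a null rate of 0.25, and a min/max of 1 and 3.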

Cleaning Operations

Type Standardization

  • Date parsing across 30+ formats → ISO 8601
  • Currency strings ("$1,234.56", "€1.234,56") → numeric
  • Boolean variants ("Yes/No", "1/0", "TRUE/FALSE") → one consistent representation
  • Phone numbers → E.164 format (+1XXXXXXXXXX)
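
The conversions above can be sketched with the standard library alone. Everything here is illustrative: the function names, the short `DATE_FORMATS` list (four formats, not the 30+ the skill advertises), and the accepted boolean spellings are assumptions. The currency parser assumes that whichever of "." or "," appears last is the decimal mark, so a value with a single separator such as "1,234" is ambiguous and would need a locale hint. Phone number formatting is not covered.

```python
import re
from datetime import datetime
from decimal import Decimal

# Hypothetical format list; day-first is assumed for slash-separated dates.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y", "%d %b %Y")
TRUE_VALUES = {"yes", "y", "true", "1"}
FALSE_VALUES = {"no", "n", "false", "0"}


def parse_date(text):
    """Return an ISO 8601 date string, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_currency(text):
    """Convert strings like "$1,234.56" or "€1.234,56" to Decimal."""
    cleaned = re.sub(r"[^\d.,-]", "", text)
    # The separator that appears last is taken to be the decimal mark.
    if cleaned.rfind(".") > cleaned.rfind(","):
        decimal_sep, group_sep = ".", ","
    else:
        decimal_sep, group_sep = ",", "."
    cleaned = cleaned.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


def parse_bool(text):
    """Map common boolean spellings to True/False, or None if unknown."""
    key = str(text).strip().lower()
    if key in TRUE_VALUES:
        return True
    if key in FALSE_VALUES:
        return False
    return None
```

Returning `None` for unrecognized dates and booleans, rather than raising, lets a caller count and report the values that could not be standardized.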

String Normalization

  • Case standardization (Title Case for names, uppercase for codes)
  • Whitespace trimming and internal whitespace collapse
  • Unicode normalization (NFC) and encoding repair (mojibake detection)
  • Consistent abbreviation expansion ("St." → "Street", "Dr" → "Doctor")
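
As a rough illustration of the whitespace, Unicode, and encoding steps, the sketch below uses only `unicodedata` and `re`. The helper names are hypothetical, and `repair_mojibake` handles just one common case (UTF-8 bytes that were mis-decoded as Windows-1252); real mojibake detection is likely broader than this.

```python
import re
import unicodedata


def normalize_text(text):
    """Apply NFC normalization, trim, and collapse internal whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def repair_mojibake(text):
    """Reverse a UTF-8-read-as-cp1252 mix-up; return text unchanged otherwise."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip fails for text that was never mis-decoded this way.
        return text
```

The round trip is a no-op for pure ASCII and fails cleanly for correctly decoded accented text, which is why the function falls back to returning its input.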

Deduplication

  • Exact duplicate removal
  • Fuzzy deduplication using Jaro-Winkler similarity for names and addresses
  • Blocking strategies for large datasets to make fuzzy matching tractable
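
The sketch below shows how Jaro-Winkler similarity and blocking fit together. The similarity functions follow the standard textbook definition, but the blocking key (first character of the lowercased name), the 0.9 threshold, and the function `find_fuzzy_duplicates` are assumptions chosen for illustration, not the skill's documented behavior.

```python
from collections import defaultdict
from itertools import combinations


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Boost the Jaro score for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


def find_fuzzy_duplicates(names, threshold=0.9):
    """Compare names only within blocks that share a first character."""
    blocks = defaultdict(list)
    for name in names:
        key = name.strip().lower()
        blocks[key[:1]].append(key)
    pairs = []
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if jaro_winkler(a, b) >= threshold:
                pairs.append((a, b))
    return pairs
```

Blocking trades recall for speed: pairs are compared only inside a block, so two records whose first characters differ are never matched. A production blocking key would typically be more forgiving than a single leading character.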

Missing Value Handling

  • Mean/median/mode imputation for numeric columns
  • Forward-fill or backward-fill for time series
  • Indicator variable creation for informative missingness
  • Row removal when missing rate exceeds configurable threshold
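
Each of these strategies is simple to express on plain lists. The following is a sketch under stated assumptions: missing values are represented as `None`, rows are dictionaries, and the helper names and the default `max_missing_rate=0.5` are hypothetical rather than the skill's actual configuration.

```python
from statistics import median


def impute_median(values):
    """Replace None with the median of the observed values."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to impute from
    fill = median(present)
    return [fill if v is None else v for v in values]


def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def missing_indicator(values):
    """Return 1 where the value was missing and 0 elsewhere."""
    return [1 if v is None else 0 for v in values]


def drop_sparse_rows(rows, max_missing_rate=0.5):
    """Keep rows whose share of None fields is at most the threshold."""
    kept = []
    for row in rows:
        missing = sum(v is None for v in row.values())
        rate = missing / len(row) if row else 1.0
        if rate <= max_missing_rate:
            kept.append(row)
    return kept
```

Computing the indicator before imputing matters: once values are filled in, the information about which rows were originally missing is gone.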

Audit Trail

Every transformation is logged to `cleaning_log.json` with the column affected, the operation, the number of rows changed, and before/after samples.
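
A log entry with those four pieces of information might be built as in the sketch below. The JSON field names (`column`, `operation`, `rows_changed`, `samples`) and both helper functions are assumptions; the actual schema of `cleaning_log.json` is not documented here.

```python
import json


def log_transformation(log, column, operation, before, after, sample_size=3):
    """Append one entry describing how a column changed (hypothetical schema)."""
    changed = [(b, a) for b, a in zip(before, after) if b != a]
    log.append({
        "column": column,
        "operation": operation,
        "rows_changed": len(changed),
        "samples": [
            {"before": b, "after": a} for b, a in changed[:sample_size]
        ],
    })
    return log


def write_log(log, path="cleaning_log.json"):
    """Write the accumulated entries as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(log, fh, indent=2, ensure_ascii=False)
```

Capping the number of samples keeps the log small on large datasets while still showing reviewers what each operation did.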

Use Cases

  • Standardizing address and phone number formats across CRM exports
  • Deduplicating customer records with fuzzy name matching
  • Fixing encoding issues in international datasets
  • Imputing missing values using appropriate statistical strategies

Pros & Cons

Pros

  • + Never overwrites original data; always writes to a new output file
  • + Comprehensive data profiling before any changes are made
  • + Fuzzy deduplication for name and address matching
  • + Full audit trail in cleaning_log.json for data governance

Cons

  • - Fuzzy matching on very large datasets (1M+ rows) requires chunking and may be slow
  • - Domain-specific cleaning rules (e.g., medical codes) may need custom extensions
