
Data Cleaner

Flagged

Profiles, cleans, and standardizes messy datasets by detecting and fixing inconsistencies, outliers, duplicates, and formatting issues.

By Community · 6,400 stars · v1.1.0 · Updated 2026-03-10
Copy the SKILL.md file to your project's .claude/skills/ directory.

About This Skill

Data Cleaner automates the most tedious part of any data project — getting raw data into a usable state. It goes beyond find-and-replace to understand the semantics of your data and apply the right cleaning strategy for each issue.

Data Profiling

Before cleaning, Data Cleaner produces a profile report covering:

  • Column-level statistics (type, cardinality, null rate, min/max)
  • Distribution shapes for numeric columns
  • Pattern frequency for text columns (email, phone, date formats present)
  • Correlation matrix highlighting redundant features
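
As a rough illustration of the column-level part of such a report, the sketch below computes type, cardinality, null rate, min/max, and pattern counts for a single column using only the Python standard library. The function name `profile_column`, the `PATTERNS` table, and the output keys are assumptions made for this example, not the skill's actual implementation. Distribution shapes and the correlation matrix are left out.

```python
import re
from collections import Counter

# Hypothetical pattern table; a real profiler would recognize many more formats.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "iso_date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
}


def profile_column(values):
    """Return basic statistics for one column given as a list (None means null)."""
    present = [v for v in values if v is not None]
    profile = {
        "null_rate": (1 - len(present) / len(values)) if values else 0.0,
        "cardinality": len(set(present)),
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if present and len(numeric) == len(present):
        profile["type"] = "numeric"
        profile["min"], profile["max"] = min(numeric), max(numeric)
    else:
        profile["type"] = "text"
        counts = Counter()
        for v in present:
            for name, pattern in PATTERNS.items():
                if pattern.match(str(v)):
                    counts[name] += 1
        profile["patterns"] = dict(counts)
    return profile


print(profile_column([1, 2, None, 4]))
print(profile_column(["a@b.com", "2024-01-02", "x"]))
```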

Cleaning Operations

Type Standardization

  • Date parsing across 30+ formats → ISO 8601
  • Currency strings ("$1,234.56", "€1.234,56") → numeric
  • Boolean variants ("Yes/No", "1/0", "TRUE/FALSE") → consistent
  • Phone numbers → E.164 format (+1XXXXXXXXXX)
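
The conversions listed under Type Standardization could look roughly like the following. This is a minimal sketch under stated assumptions: the helper names, the short `DATE_FORMATS` list (four formats, not 30+), the month-first reading of ambiguous dates, and the default country code `1` are all illustrative choices, not the skill's documented behavior.

```python
import re
from datetime import datetime
from decimal import Decimal

# Assumed format list; order decides how ambiguous dates such as 03/04/2026 are read.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]
TRUTHY = {"yes", "y", "1", "true", "t"}
FALSY = {"no", "n", "0", "false", "f"}


def parse_date(text):
    """Try each known format and return an ISO 8601 date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_currency(text):
    """Convert strings like "$1,234.56" or "€1.234,56" to Decimal."""
    cleaned = re.sub(r"[^\d.,-]", "", text)
    last_sep = max(cleaned.rfind("."), cleaned.rfind(","))
    if last_sep == -1:
        return Decimal(cleaned)
    whole = re.sub(r"[.,]", "", cleaned[:last_sep])
    frac = cleaned[last_sep + 1:]
    other = "," if cleaned[last_sep] == "." else "."
    if len(frac) == 3 and other not in cleaned:
        # A lone separator followed by three digits is assumed to be grouping ("1,234").
        return Decimal(whole + frac)
    # Otherwise the rightmost separator is treated as the decimal mark.
    return Decimal(f"{whole}.{frac}")


def parse_bool(text):
    """Map common boolean spellings to True/False; None if unrecognized."""
    key = str(text).strip().lower()
    if key in TRUTHY:
        return True
    if key in FALSY:
        return False
    return None


def to_e164(text, default_country_code="1"):
    """Normalize a 10- or 11-digit number to E.164; None if it does not fit."""
    digits = re.sub(r"\D", "", text)
    if len(digits) == 10:
        return f"+{default_country_code}{digits}"
    if len(digits) == 11 and digits.startswith(default_country_code):
        return f"+{digits}"
    return None


print(parse_currency("$1,234.56"), parse_currency("€1.234,56"))
print(parse_date("03/10/2026"), parse_bool("Yes"), to_e164("(555) 123-4567"))
```

Returning `None` for unparseable input, rather than raising, keeps bad values visible so they can be reported instead of silently coerced.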

String Normalization

  • Case standardization (Title Case for names, uppercase for codes)
  • Whitespace trimming and internal whitespace collapse
  • Unicode normalization (NFC) and encoding repair (mojibake detection)
  • Consistent abbreviation expansion ("St." → "Street", "Dr" → "Doctor")
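
A few of these normalization steps can be sketched with the standard library alone. The function names and the `ABBREVIATIONS` table are hypothetical, and the mojibake repair shown covers only one common case (UTF-8 bytes that were wrongly decoded as Windows-1252). A string that legitimately contains such character sequences would be altered, so treat it as a heuristic.

```python
import re
import unicodedata

# Hypothetical lookup table; expansions are context dependent ("St." can also mean "Saint").
ABBREVIATIONS = {"St.": "Street", "Ave.": "Avenue"}


def normalize_text(value):
    """Apply NFC normalization, collapse internal whitespace, and trim the ends."""
    value = unicodedata.normalize("NFC", value)
    return re.sub(r"\s+", " ", value).strip()


def repair_mojibake(value):
    """Undo UTF-8 text that was mis-decoded as cp1252; return input unchanged on failure."""
    try:
        return value.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return value


def expand_abbreviations(value, table=ABBREVIATIONS):
    """Replace whole space-separated tokens found in the table."""
    return " ".join(table.get(token, token) for token in value.split(" "))


print(normalize_text("  Jose\u0301   Garcia "))
print(repair_mojibake("caf\u00c3\u00a9"))
print(expand_abbreviations("12 Main St."))
```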

Deduplication

  • Exact duplicate removal
  • Fuzzy deduplication using Jaro-Winkler similarity for names and addresses
  • Blocking strategies for large datasets to make fuzzy matching tractable
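
To make the fuzzy-matching idea concrete, here is a small pure-Python sketch of Jaro-Winkler similarity combined with a very simple blocking scheme. The blocking key (first letter of the name), the `0.9` threshold, and the function names are assumptions chosen for illustration. The skill's actual blocking strategy and thresholds are not documented here.

```python
from collections import defaultdict
from itertools import combinations


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


def find_fuzzy_duplicates(names, threshold=0.9):
    """Compare names only within blocks that share a first letter (assumed key)."""
    blocks = defaultdict(list)
    for name in names:
        blocks[name.strip().lower()[:1]].append(name)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if jaro_winkler(a.lower(), b.lower()) >= threshold:
                pairs.append((a, b))
    return pairs


print(find_fuzzy_duplicates(["Jon Smith", "John Smith", "Mary Jones"]))
# → [('Jon Smith', 'John Smith')]
```

Blocking trades recall for speed: pairs whose keys differ (for example a typo in the first letter) are never compared, which is why real pipelines often combine several blocking keys.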

Missing Value Handling

  • Mean/median/mode imputation for numeric columns
  • Forward-fill or backward-fill for time series
  • Indicator variable creation for informative missingness
  • Row removal when missing rate exceeds configurable threshold
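
The strategies above can each be expressed in a few lines. The sketch below represents a column as a list with `None` for missing values and a row as a dict. The function names and the default threshold of `0.5` are assumptions for this example rather than the skill's defaults.

```python
from statistics import median


def impute_median(values):
    """Replace None with the median of the observed values."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to impute from
    fill = median(present)
    return [fill if v is None else v for v in values]


def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def missing_indicator(values):
    """Return 1 where the value was missing and 0 elsewhere."""
    return [int(v is None) for v in values]


def drop_sparse_rows(rows, max_missing_rate=0.5):
    """Keep rows (dicts) whose share of None fields does not exceed the threshold."""
    kept = []
    for row in rows:
        rate = sum(v is None for v in row.values()) / len(row)
        if rate <= max_missing_rate:
            kept.append(row)
    return kept


print(impute_median([1, None, 3]))
print(forward_fill([None, 5, None, None, 7]))
print(missing_indicator([1, None]))
```

Creating the indicator column before imputing preserves the information that a value was originally missing, which matters when missingness itself is informative.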

Audit Trail

Every transformation is logged to `cleaning_log.json` with the column affected, the operation, the number of rows changed, and before/after samples.
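
One plausible shape for such a log entry is sketched below. Only the file name `cleaning_log.json` and the four kinds of information come from the description above. The key names (`column`, `operation`, `rows_changed`, `samples`) and the helper functions are assumptions, so the real schema may differ.

```python
import json


def make_log_entry(column, operation, before, after, sample_size=3):
    """Build one audit record from a column's values before and after a step."""
    changed = [(b, a) for b, a in zip(before, after) if b != a]
    return {
        "column": column,            # assumed key name
        "operation": operation,      # assumed key name
        "rows_changed": len(changed),
        "samples": [{"before": b, "after": a} for b, a in changed[:sample_size]],
    }


def write_log(entries, path="cleaning_log.json"):
    """Write all entries to disk as a JSON array."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(entries, fh, indent=2, ensure_ascii=False)


entry = make_log_entry("name", "trim_whitespace", [" Ana", "Bo"], ["Ana", "Bo"])
print(json.dumps(entry, indent=2))
```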

Use Cases

  • Standardizing address and phone number formats across CRM exports
  • Deduplicating customer records with fuzzy name matching
  • Fixing encoding issues in international datasets
  • Imputing missing values using appropriate statistical strategies

Pros & Cons

Pros

  • Never overwrites original data; always writes to a new output file
  • Comprehensive data profiling before any changes are made
  • Fuzzy deduplication for name and address matching
  • Full audit trail in cleaning_log.json for data governance

Cons

  • Fuzzy matching on very large datasets (1M+ rows) requires chunking and may be slow
  • Domain-specific cleaning rules (e.g., medical codes) may need custom extensions

FAQ

What platforms support Data Cleaner?
Data Cleaner is available on Claude Code, Cursor, and OpenAI Codex CLI.
What tools work with Data Cleaner?
Data Cleaner works well with Claude Code, Cursor, and GitHub Copilot.
