Data Cleaner
Verified
Profiles, cleans, and standardizes messy datasets by detecting and fixing inconsistencies, outliers, duplicates, and formatting issues.
Install
Claude Code
Copy the SKILL.md file to your project's .claude/skills/ directory.
About This Skill
Data Cleaner automates the most tedious part of any data project — getting raw data into a usable state. It goes beyond find-and-replace to understand the semantics of your data and apply the right cleaning strategy for each issue.
Data Profiling
Before cleaning, the skill produces a profile report:
- Column-level statistics (type, cardinality, null rate, min/max)
- Distribution shapes for numeric columns
- Pattern frequency for text columns (email, phone, date formats present)
- Correlation matrix highlighting redundant features
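The column-level statistics above can be sketched with the Python standard library alone. This is a minimal illustration, not the skill's actual implementation: the function name `profile_column`, the returned keys, and the rule that empty strings count as missing are all assumptions.

```python
from collections import Counter

def profile_column(values):
    """Summarize one column: inferred type, cardinality, null rate, min/max."""
    # Assumption: None and empty/whitespace-only strings count as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    null_rate = 1 - len(present) / len(values) if values else 0.0

    # Try to read every present value as a number; give up on the first failure.
    numeric = []
    for v in present:
        try:
            numeric.append(float(v))
        except (TypeError, ValueError):
            numeric = None
            break

    profile = {
        "type": "numeric" if numeric else "text",
        "cardinality": len(set(present)),
        "null_rate": round(null_rate, 3),
    }
    if numeric:
        profile["min"], profile["max"] = min(numeric), max(numeric)
    else:
        # For text columns, report the most frequent values instead of min/max.
        profile["top_values"] = Counter(present).most_common(3)
    return profile
```

Calling `profile_column(["10", "20", None, "", "5"])` reports a numeric column with a null rate of 0.4, a minimum of 5.0, and a maximum of 20.0.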
Cleaning Operations
Type Standardization
- Date parsing across 30+ formats → ISO 8601
- Currency strings ("$1,234.56", "€1.234,56") → numeric
- Boolean variants ("Yes/No", "1/0", "TRUE/FALSE") → consistent
- Phone numbers → E.164 format (+1XXXXXXXXXX)
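As a rough sketch of how the currency, boolean, and date conversions might work, the helpers below use simple heuristics. The function names, the three date formats shown, and the "rightmost separator is the decimal mark" rule are illustrative assumptions, not the skill's documented behavior.

```python
import re
from datetime import datetime
from decimal import Decimal

def parse_currency(text):
    """Heuristic: the rightmost separator decides the decimal mark."""
    # Keep only digits, separators, and a minus sign.
    s = re.sub(r"[^\d.,-]", "", text)
    if s.rfind(",") > s.rfind("."):
        if re.search(r",\d{2}$", s):
            # "1.234,56": comma is the decimal mark, dots group thousands.
            s = s.replace(".", "").replace(",", ".")
        else:
            # "1,234": comma groups thousands.
            s = s.replace(",", "")
    else:
        # "1,234.56" or no separators: commas group thousands.
        s = s.replace(",", "")
    return Decimal(s)

_BOOLS = {"yes": True, "no": False, "1": True, "0": False,
          "true": True, "false": False}

def parse_bool(text):
    """Map the listed variants to True/False; None means unrecognized."""
    return _BOOLS.get(str(text).strip().lower())

# Only three formats are shown here for illustration.
_DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%Y/%m/%d")

def parse_date(text):
    """Return an ISO 8601 date string, or None if no format matches."""
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Both `parse_currency("$1,234.56")` and `parse_currency("€1.234,56")` yield `Decimal("1234.56")`. Genuinely ambiguous inputs such as "1.234" are read as decimals under this heuristic, so a real pipeline would likely need a per-column locale hint.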
String Normalization
- Case standardization (Title Case for names, uppercase for codes)
- Whitespace trimming and internal whitespace collapse
- Unicode normalization (NFC) and encoding repair (mojibake detection)
- Consistent abbreviation expansion ("St." → "Street", "Dr" → "Doctor")
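The whitespace, Unicode, and encoding steps could look roughly like the following. The helper names are hypothetical, and the mojibake repair covers only one common case (UTF-8 bytes mis-decoded as Windows-1252), which is an assumption about what "encoding repair" includes.

```python
import re
import unicodedata

def normalize_text(text):
    """Apply NFC normalization, then trim and collapse whitespace."""
    # NFC composes a base letter plus combining mark into one code point.
    text = unicodedata.normalize("NFC", text)
    # Collapse any run of whitespace to a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

def repair_mojibake(text):
    """Undo UTF-8 text that was mis-decoded as Windows-1252."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so the text was probably not mojibake.
        return text
```

With these helpers, `repair_mojibake("CafÃ©")` returns `"Café"`, while an already-correct `"Café"` passes through unchanged because its Windows-1252 bytes are not valid UTF-8.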
Deduplication
- Exact duplicate removal
- Fuzzy deduplication using Jaro-Winkler similarity for names and addresses
- Blocking strategies for large datasets to make fuzzy matching tractable
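To illustrate how Jaro-Winkler similarity and blocking fit together, here is a small standard-library sketch. The blocking key (first character of the lowercased name), the 0.9 threshold, and the function names are assumptions chosen for the example; the skill's own strategy is not specified in this listing.

```python
from collections import defaultdict
from itertools import combinations

def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Boost the Jaro score for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)

def fuzzy_duplicates(names, threshold=0.9):
    """Compare names only within blocks that share a first character."""
    blocks = defaultdict(list)
    for name in names:
        key = name.strip().lower()
        blocks[key[:1]].append(key)
    pairs = []
    for members in blocks.values():
        for a, b in combinations(members, 2):
            if jaro_winkler(a, b) >= threshold:
                pairs.append((a, b))
    return pairs
```

Blocking trades recall for speed: records are compared only inside their block, so a typo in the first character would hide a duplicate under this particular key. Larger datasets typically combine several blocking keys to reduce that risk.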
Missing Value Handling
- Mean/median/mode imputation for numeric columns
- Forward-fill or backward-fill for time series
- Indicator variable creation for informative missingness
- Row removal when missing rate exceeds configurable threshold
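A minimal sketch of three of these strategies follows, assuming missing values are represented as `None`. The function names and the reading of "missing rate" as a per-row fraction are assumptions made for illustration.

```python
from statistics import median

def impute_median(values):
    """Fill gaps with the median; also return an indicator column."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # raises StatisticsError if nothing is observed
    flags = [v is None for v in values]  # True where a value was imputed
    filled = [fill if v is None else v for v in values]
    return filled, flags

def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def drop_sparse_rows(rows, max_missing_rate=0.5):
    """Keep rows whose fraction of missing cells is within the threshold."""
    return [
        row for row in rows
        if sum(v is None for v in row) / len(row) <= max_missing_rate
    ]
```

Returning the indicator flags alongside the filled values preserves the fact that a value was missing, which matters when missingness itself carries information.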
Audit Trail
Every transformation is logged to `cleaning_log.json` with the column affected, the operation, the number of rows changed, and before/after samples.
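A log entry carrying those four pieces of information might be built as shown below. The exact schema of `cleaning_log.json` is not given in this listing, so the key names (`column`, `operation`, `rows_changed`, `samples`) and the helper `log_entry` are hypothetical.

```python
import json

def log_entry(column, operation, before, after, sample_size=3):
    """Build one audit record from a column's values before and after a step."""
    changed = [(b, a) for b, a in zip(before, after) if b != a]
    return {
        "column": column,
        "operation": operation,
        "rows_changed": len(changed),
        # Keep only a few before/after pairs so the log stays small.
        "samples": [{"before": b, "after": a} for b, a in changed[:sample_size]],
    }

entry = log_entry("name", "trim_whitespace", ["  Ada ", "Bob"], ["Ada", "Bob"])
print(json.dumps([entry], indent=2))
```

Here only the first value changed, so the entry records `rows_changed` as 1 with a single before/after sample.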
Use Cases
- Standardizing address and phone number formats across CRM exports
- Deduplicating customer records with fuzzy name matching
- Fixing encoding issues in international datasets
- Imputing missing values using appropriate statistical strategies
Pros & Cons
Pros
- Never overwrites original data; always writes to a new output file
- Comprehensive data profiling before any changes are made
- Fuzzy deduplication for name and address matching
- Full audit trail in cleaning_log.json for data governance
Cons
- Fuzzy matching on very large datasets (1M+ rows) requires chunking and may be slow
- Domain-specific cleaning rules (e.g., medical codes) may need custom extensions