Intelligent data cleaning CLI with natural language support - Docker-powered Python engine
Cleanifix
A CLI tool that automatically cleans your data files through natural language commands. Like having a data analyst in your terminal.
Quick Start
# Install
npm install -g cleanifix
# Basic usage
cleanifix @sales.csv "remove duplicates"
cleanifix @users.csv "fill missing emails with 'unknown@example.com'"
cleanifix @data.json "standardize all dates to ISO format"
# Interactive mode
cleanifix interactive @messy_data.csv
🎯 Features
Core Capabilities (MVP)
Missing Value Detection & Handling - Find and fix missing data automatically
Data Standardization - Normalize dates, phone numbers, addresses, and more
Deduplication - Remove duplicate rows with smart matching
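To make the first capability concrete, here is a minimal sketch of missing-value detection and constant filling. It is not Cleanifix's actual engine code: it uses only the Python standard library rather than pandas, and the names `MISSING_TOKENS`, `is_missing`, `count_missing`, and `fill_missing` are hypothetical, as is the set of strings treated as "missing".

```python
import csv
import io

# Assumption: these placeholder strings count as "missing" (not taken from Cleanifix).
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "nan"}


def is_missing(value):
    """Treat None and common placeholder strings as missing."""
    return value is None or value.strip().lower() in MISSING_TOKENS


def count_missing(rows):
    """Return {column: number of missing cells} for a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if is_missing(value):
                counts[column] = counts.get(column, 0) + 1
    return counts


def fill_missing(rows, column, default):
    """Return new rows with missing cells in `column` replaced by `default`."""
    return [
        {**row, column: default} if is_missing(row.get(column)) else dict(row)
        for row in rows
    ]


# Small in-memory CSV standing in for a real file.
sample = io.StringIO("name,email\nAda,ada@example.com\nBob,\nCy,N/A\n")
rows = list(csv.DictReader(sample))

missing_report = count_missing(rows)
filled = fill_missing(rows, "email", "unknown@example.com")
```

The input rows are never mutated, so the report and the cleaned copy can be compared side by side before anything is written back to disk.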
Natural Language Interface
# Just describe what you want
cleanifix @customers.csv "find missing phone numbers and fill with 'N/A'"
cleanifix @inventory.csv "standardize product names to title case"
cleanifix @transactions.csv "remove duplicate entries keeping the most recent"
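A request such as "remove duplicate entries keeping the most recent" could plausibly reduce to the operation below. This is a sketch under assumptions, not the tool's real implementation: the function `dedupe_keep_latest` and the column names `id` and `updated` are hypothetical, and timestamps are assumed to be ISO 8601 strings so that plain string comparison orders them correctly.

```python
def dedupe_keep_latest(rows, key, order_by):
    """Keep one row per `key` value: the one with the greatest `order_by` value.

    Output order follows the first appearance of each key.
    Ties on `order_by` are resolved in favour of the later row.
    """
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] >= latest[k][order_by]:
            latest[k] = row
    return list(latest.values())


# Hypothetical transaction rows; `updated` holds ISO 8601 dates.
transactions = [
    {"id": "t1", "updated": "2024-01-05", "amount": "10"},
    {"id": "t2", "updated": "2024-02-01", "amount": "7"},
    {"id": "t1", "updated": "2024-03-09", "amount": "12"},
]

deduped = dedupe_keep_latest(transactions, key="id", order_by="updated")
```

Here `t1` survives with its March entry and `t2` is left untouched.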
Smart Suggestions
$ cleanifix @data.csv "analyze"
Data Quality Report:
⚠ 156 missing values in 'email' column
⚠ 89 inconsistent date formats
⚠ 34 potential duplicates
Suggested fixes:
1. Fill missing emails with domain-based patterns
2. Standardize dates to YYYY-MM-DD
3. Remove exact duplicates keeping first occurrence
Apply all fixes? [Y/n]
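The kind of counts shown in the report above could be produced along these lines. This is an illustrative sketch only: the candidate list `DATE_FORMATS`, the helper names, and the rule "anything not in the most common format is inconsistent" are all assumptions, not documented Cleanifix behavior.

```python
from collections import Counter
from datetime import datetime

# Assumption: a small fixed list of candidate formats, tried in order.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y")


def detect_format(text):
    """Return the first format in DATE_FORMATS that parses `text`, else None."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return fmt
        except ValueError:
            continue
    return None


def to_iso(text):
    """Convert a recognised date to YYYY-MM-DD; leave anything else untouched."""
    fmt = detect_format(text)
    if fmt is None:
        return text
    return datetime.strptime(text, fmt).date().isoformat()


def quality_report(rows, date_column):
    """Count dates outside the dominant format and exact duplicate rows."""
    formats = Counter(detect_format(r[date_column]) for r in rows)
    dominant = formats.most_common(1)[0][0] if formats else None
    inconsistent = sum(n for fmt, n in formats.items() if fmt != dominant)
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(n - 1 for n in seen.values())
    return {"inconsistent_dates": inconsistent, "exact_duplicates": duplicates}


rows = [
    {"id": "1", "date": "2024-01-05"},
    {"id": "2", "date": "01/07/2024"},
    {"id": "3", "date": "2024-01-09"},
    {"id": "3", "date": "2024-01-09"},
    {"id": "4", "date": "09.01.2024"},
]

report = quality_report(rows, "date")
iso_dates = [to_iso(r["date"]) for r in rows]
```

Ambiguous inputs such as `01/07/2024` are resolved by the order of `DATE_FORMATS`, which is why a real tool would likely ask for confirmation before applying the fix.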
📦 Installation
Prerequisites
Node.js 18+
Python 3.8+
4GB RAM recommended for large files
Install from npm
npm install -g cleanifix
Install from source
git clone https://github.com/rickyjs1955/cleanifix.git
cd cleanifix
./scripts/setup-dev.sh
🛠️ Usage Examples
Basic Cleaning
# Find issues
cleanifix @data.csv "show me data quality issues"
# Fix missing values
cleanifix @sales.csv "fill missing prices with median"
# Standardize formats
cleanifix @contacts.csv "standardize all phone numbers to international format"
# Remove duplicates
cleanifix @emails.csv "remove duplicate emails keeping the latest entry"
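For readers curious what "fill missing prices with median" amounts to, the following sketch shows the idea with the standard library's `statistics.median`. The function name `fill_with_median` and the convention that an empty string or `None` marks a missing cell are assumptions made for illustration; the actual engine is described as pandas-based and may behave differently.

```python
from statistics import median


def fill_with_median(rows, column):
    """Return new rows where missing cells in `column` hold the column median.

    Assumption: a cell is missing when it is "" or None; all other
    cells must be parseable as numbers and are converted to float.
    """
    observed = [float(r[column]) for r in rows if r[column] not in ("", None)]
    if not observed:
        # Nothing to compute a median from, so return unchanged copies.
        return [dict(r) for r in rows]
    mid = median(observed)
    return [
        {**r, column: mid if r[column] in ("", None) else float(r[column])}
        for r in rows
    ]


sales = [
    {"item": "a", "price": "10.0"},
    {"item": "b", "price": ""},
    {"item": "c", "price": "30.0"},
    {"item": "d", "price": "20.0"},
]

filled_sales = fill_with_median(sales, "price")
```

The median is computed from the observed values only (10, 20, 30), so the gap for item `b` is filled with 20.0.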
Batch Processing
# Create a config file
cat > cleaning-rules.yaml << EOF
rules:
  - type: missing_values
    columns: [price, quantity]
    strategy: median
  - type: standardize
    column: phone
    format: E164
  - type: deduplicate
    keys: [email]
    keep: last
EOF
# Run batch cleaning
cleanifix batch @data.csv --rules cleaning-rules.yaml
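One way to picture batch mode is as a list of rules applied in order, each dispatched to a handler by its `type`. The sketch below mirrors the rule file above as a plain Python list to stay within the standard library (no YAML parser). Everything here is hypothetical: the handler names, the `HANDLERS` table, and especially the phone normalizer, which naively assumes a default country code of `1` and is no substitute for a real E.164 library.

```python
import re
from statistics import median


def apply_missing_values(rows, rule):
    """Fill empty cells in each listed column with that column's median."""
    if rule["strategy"] != "median":
        raise ValueError(f"unsupported strategy: {rule['strategy']}")
    for column in rule["columns"]:
        observed = [float(r[column]) for r in rows if r[column] != ""]
        if not observed:
            continue
        fill = median(observed)
        rows = [
            {**r, column: float(r[column]) if r[column] != "" else fill}
            for r in rows
        ]
    return rows


def to_e164(raw, default_country="1"):
    """Naive normalizer (assumption): strip non-digits, add a default country code."""
    digits = re.sub(r"\D", "", raw)
    if not digits:
        return raw
    if raw.strip().startswith("+"):
        return "+" + digits
    return "+" + default_country + digits


def apply_standardize(rows, rule):
    """Rewrite the configured column into the requested format."""
    if rule["format"] != "E164":
        raise ValueError(f"unsupported format: {rule['format']}")
    column = rule["column"]
    return [{**r, column: to_e164(r[column])} for r in rows]


def apply_deduplicate(rows, rule):
    """Keep the first or last row for each combination of key columns."""
    keep_last = rule.get("keep", "first") == "last"
    chosen = {}
    for row in rows:
        key = tuple(row[k] for k in rule["keys"])
        if keep_last or key not in chosen:
            chosen[key] = row
    return list(chosen.values())


# Dispatch table from rule type to handler (hypothetical names).
HANDLERS = {
    "missing_values": apply_missing_values,
    "standardize": apply_standardize,
    "deduplicate": apply_deduplicate,
}


def apply_rules(rows, rules):
    """Apply each rule in order, passing the output of one to the next."""
    for rule in rules:
        rows = HANDLERS[rule["type"]](rows, rule)
    return rows


# Same content as the rule file above, written as Python data.
RULES = [
    {"type": "missing_values", "columns": ["price", "quantity"], "strategy": "median"},
    {"type": "standardize", "column": "phone", "format": "E164"},
    {"type": "deduplicate", "keys": ["email"], "keep": "last"},
]

data = [
    {"email": "a@x.com", "phone": "(555) 010-1234", "price": "10", "quantity": "1"},
    {"email": "b@x.com", "phone": "+44 20 7946 0000", "price": "", "quantity": "3"},
    {"email": "a@x.com", "phone": "555.010.9999", "price": "30", "quantity": ""},
]

cleaned = apply_rules(data, RULES)
```

Rule order matters in this sketch: medians are computed before deduplication, so rows that are later dropped still contribute to the fill value.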
Interactive Mode
cleanifix interactive @messy_data.csv
🧹 Cleanifix Interactive Mode
> analyze my data
> fill missing ages with average by city
> standardize all names to proper case
> save as cleaned_data.csv
> exit
🏗️ Architecture
Cleanifix uses a hybrid architecture:
CLI Interface (Node.js) - Fast, responsive user interaction
Processing Engine (Python) - Powerful data manipulation with pandas
Communication - JSON-based message passing between components
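The JSON message passing between the two components might look roughly like the engine-side loop below, which reads one JSON request per line and writes one JSON response per line. The message schema (`id`, `action`, `payload`, `ok`, `result`, `error`), the `count_rows` action, and the line-delimited transport are all assumptions for illustration; the README does not specify the actual protocol.

```python
import io
import json


def count_rows(payload):
    """Hypothetical action: report how many rows the CLI sent."""
    return {"row_count": len(payload["rows"])}


# Hypothetical registry of actions the engine understands.
ACTIONS = {"count_rows": count_rows}


def error(request_id, message):
    """Build a failure response as a JSON string."""
    return json.dumps({"id": request_id, "ok": False, "error": message})


def handle_request(line):
    """Decode one JSON request line and return one JSON response line."""
    try:
        request = json.loads(line)
    except json.JSONDecodeError as exc:
        return error(None, f"invalid JSON: {exc.msg}")
    if not isinstance(request, dict):
        return error(None, "request must be a JSON object")
    request_id = request.get("id")
    action = request.get("action")
    handler = ACTIONS.get(action) if isinstance(action, str) else None
    if handler is None:
        return error(request_id, "unknown action")
    try:
        result = handler(request.get("payload") or {})
    except (KeyError, TypeError) as exc:
        return error(request_id, f"bad payload: {exc}")
    return json.dumps({"id": request_id, "ok": True, "result": result})


def serve(in_stream, out_stream):
    """Answer every non-blank request line with exactly one response line."""
    for line in in_stream:
        if line.strip():
            out_stream.write(handle_request(line) + "\n")
            out_stream.flush()


# In-memory streams stand in for the pipe between the CLI and the engine.
requests = io.StringIO(
    '{"id": 1, "action": "count_rows", "payload": {"rows": [{"a": 1}, {"a": 2}]}}\n'
    '{"id": 2, "action": "nope"}\n'
    "not json\n"
)
replies = io.StringIO()
serve(requests, replies)
responses = [json.loads(text) for text in replies.getvalue().splitlines()]
```

Echoing the request `id` in every response lets the CLI side match replies to requests even if it sends several before reading any.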
🤝 Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
Development Setup
# Clone the repo
git clone https://github.com/rickyjs1955/cleanifix.git
cd cleanifix
# Setup development environment
./scripts/setup-dev.sh
# Run tests
npm test # CLI tests
python -m pytest # Engine tests
# Run in development mode
npm run dev
Roadmap
Phase 1 (Current) - MVP
Basic CLI interface
Missing value handling
Simple standardization
Exact deduplication
CSV support
JSON support
Phase 2 - Enhanced Rules
Fuzzy deduplication
Custom regex patterns
Outlier detection
Data type inference
Excel support
Phase 3 - ML Integration
Smart imputation
Anomaly detection
Pattern learning
Confidence scoring
Auto-cleaning mode
License
MIT License - see LICENSE file for details
Acknowledgments
Built with:
Commander.js - CLI framework
Pandas - Data manipulation
Chalk - Terminal styling
💬 Support
Documentation: docs.cleanifix.dev
Issues: GitHub Issues
Discussions: GitHub Discussions
Made with ❤️ by data people, for data people