Intelligent data cleaning CLI with natural language support - Docker-powered Python engine

# Cleanifix

A CLI tool that automatically cleans your data files through natural language commands. Like having a data analyst in your terminal.

## 🚀 Quick Start

```bash
# Install
npm install -g cleanifix

# Basic usage
cleanifix @sales.csv "remove duplicates"
cleanifix @users.csv "fill missing emails with 'unknown@example.com'"
cleanifix @data.json "standardize all dates to ISO format"

# Interactive mode
cleanifix interactive @messy_data.csv
```

## 🎯 Features

### Core Capabilities (MVP)

- **Missing Value Detection & Handling** - Find and fix missing data automatically
- **Data Standardization** - Normalize dates, phone numbers, addresses, and more
- **Deduplication** - Remove duplicate rows with smart matching

### Natural Language Interface

```bash
# Just describe what you want
cleanifix @customers.csv "find missing phone numbers and fill with 'N/A'"
cleanifix @inventory.csv "standardize product names to title case"
cleanifix @transactions.csv "remove duplicate entries keeping the most recent"
```

### Smart Suggestions

```bash
$ cleanifix @data.csv "analyze"

📊 Data Quality Report:
✗ 156 missing values in 'email' column
✗ 89 inconsistent date formats
✗ 34 potential duplicates

Suggested fixes:
1. Fill missing emails with domain-based patterns
2. Standardize dates to YYYY-MM-DD
3. Remove exact duplicates keeping first occurrence

Apply all fixes? [Y/n]
```

## 📦 Installation

### Prerequisites

- Node.js 18+
- Python 3.8+
- 4GB RAM recommended for large files

### Install from npm

```bash
npm install -g cleanifix
```

### Install from source

```bash
git clone https://github.com/rickyjs1955/cleanifix.git
cd cleanifix
./scripts/setup-dev.sh
```

## 🛠️ Usage Examples

### Basic Cleaning

```bash
# Find issues
cleanifix @data.csv "show me data quality issues"

# Fix missing values
cleanifix @sales.csv "fill missing prices with median"

# Standardize formats
cleanifix @contacts.csv "standardize all phone numbers to international format"

# Remove duplicates
cleanifix @emails.csv "remove duplicate emails keeping the latest entry"
```

### Batch Processing

```bash
# Create a config file
cat > cleaning-rules.yaml << EOF
rules:
  - type: missing_values
    columns: [price, quantity]
    strategy: median
  - type: standardize
    column: phone
    format: E164
  - type: deduplicate
    keys: [email]
    keep: last
EOF

# Run batch cleaning
cleanifix batch @data.csv --rules cleaning-rules.yaml
```

### Interactive Mode

```bash
cleanifix interactive @messy_data.csv

🧹 Cleanifix Interactive Mode
> analyze my data
> fill missing ages with average by city
> standardize all names to proper case
> save as cleaned_data.csv
> exit
```

## 🏗️ Architecture

Cleanifix uses a hybrid architecture:

- **CLI Interface (Node.js)** - Fast, responsive user interaction
- **Processing Engine (Python)** - Powerful data manipulation with pandas
- **Communication** - JSON-based message passing between components

## 🤝 Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.
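
For contributors finding their way around the engine, here is a rough sketch of how the JSON-based message passing described under Architecture could carry the rule types from the batch config. It is illustrative only and not the project's actual protocol or code. The message fields (`csv`, `rules`, `ok`, `error`, `rows_before`, `rows_after`) and all function names are assumptions made for this example. It also uses only the Python standard library instead of pandas so that it runs without extra dependencies, and it covers just the `missing_values` (median) and `deduplicate` rules.

```python
import csv
import io
import json
import statistics


def fill_median(rows, columns):
    """Fill blank cells in each named column with that column's median."""
    for col in columns:
        present = [float(r[col]) for r in rows if r[col] != ""]
        if not present:
            continue  # no values to compute a median from
        median = statistics.median(present)
        for r in rows:
            if r[col] == "":
                r[col] = str(median)
    return rows


def deduplicate(rows, keys, keep="first"):
    """Drop rows whose key columns repeat; any keep value other than 'first' keeps the last."""
    ordered = rows if keep == "first" else reversed(rows)
    seen, kept = set(), []
    for r in ordered:
        key = tuple(r[k] for k in keys)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    # Restore the original row order when we scanned backwards.
    return kept if keep == "first" else kept[::-1]


def handle_message(raw):
    """Decode one JSON request, apply its rules in order, and encode a JSON reply."""
    request = json.loads(raw)
    reader = csv.DictReader(io.StringIO(request["csv"]))
    rows = list(reader)
    fields = reader.fieldnames or []
    rows_before = len(rows)
    try:
        for rule in request["rules"]:
            if rule["type"] == "missing_values" and rule.get("strategy") == "median":
                rows = fill_median(rows, rule["columns"])
            elif rule["type"] == "deduplicate":
                rows = deduplicate(rows, rule["keys"], rule.get("keep", "first"))
            else:
                raise ValueError(f"unsupported rule type: {rule['type']}")
    except (KeyError, ValueError) as exc:
        # Report failures as data so the caller never has to parse a traceback.
        return json.dumps({"ok": False, "error": str(exc)})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return json.dumps({
        "ok": True,
        "rows_before": rows_before,
        "rows_after": len(rows),
        "csv": out.getvalue(),
    })
```

A caller (in this sketch, the Node.js side) would serialize a request such as `{"csv": "...", "rules": [{"type": "deduplicate", "keys": ["email"], "keep": "last"}]}`, pass it to `handle_message`, and check the `ok` field of the reply before using the cleaned `csv` text. Errors come back as a JSON field rather than as exceptions, which keeps the boundary between the two processes simple.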

### Development Setup

```bash
# Clone the repo
git clone https://github.com/rickyjs1955/cleanifix.git
cd cleanifix

# Setup development environment
./scripts/setup-dev.sh

# Run tests
npm test            # CLI tests
python -m pytest    # Engine tests

# Run in development mode
npm run dev
```

## 📋 Roadmap

### Phase 1 (Current) - MVP

- Basic CLI interface
- Missing value handling
- Simple standardization
- Exact deduplication
- CSV support
- JSON support

### Phase 2 - Enhanced Rules

- Fuzzy deduplication
- Custom regex patterns
- Outlier detection
- Data type inference
- Excel support

### Phase 3 - ML Integration

- Smart imputation
- Anomaly detection
- Pattern learning
- Confidence scoring
- Auto-cleaning mode

## 📄 License

MIT License - see LICENSE file for details

## 🙏 Acknowledgments

Built with:

- Commander.js - CLI framework
- Pandas - Data manipulation
- Chalk - Terminal styling

## 💬 Support

- Documentation: docs.cleanifix.dev
- Issues: GitHub Issues
- Discussions: GitHub Discussions

Made with ❤️ by data people, for data people