UNPKG

@semantic-ds/toolkit

Version:

Performance-first semantic layer for modern data stacks - Stable Column Anchors & intelligent inference

226 lines (158 loc) โ€ข 7.72 kB
# ๐Ÿš€ Semantic Data Science Toolkit [![npm version](https://badge.fury.io/js/%40semantic-ds%2Ftoolkit.svg)](https://badge.fury.io/js/%40semantic-ds%2Ftoolkit) [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Build Status](https://github.com/kneelinghorse/semantic-ds-toolkit/workflows/CI/badge.svg)](https://github.com/kneelinghorse/semantic-ds-toolkit/actions) Stop breaking pipelines when schemas change. The Semantic Data Science Toolkit introduces Stable Column Anchors (SCAs) that survive renames, reordering, and schema evolution. ## **๐ŸŽฏ What You're Looking At** This is our **v1 foundation** that demonstrates core innovations while being transparent about current vs. future capabilities. We've built something genuinely impressive but want your honest feedback on what works, what's missing, and where to focus next. ## **โšก Quick Start (5 minutes)** ```bash # Install and try the core demo npm install -g @semantic-ds/toolkit # See the main innovation in action semantic-ds quickstart --demo # Test basic file analysis semantic-ds infer examples/customers.csv # Try the interactive project setup semantic-ds init --interactive ``` ## **๐Ÿ† What Actually Works (Please Test These)** ### **1. Stable Column Anchors (SCA) - Our Core Innovation** **Status: โœ… PRODUCTION READY** This is our main technical breakthrough - column fingerprinting that survives schema changes: ```bash # The anchor system has 64 passing tests and real functionality npm test test/anchors.test.ts npm test test/shadow-semantics.test.ts ``` **What to test:** - Load the same CSV with renamed columns - anchors should match - Try different file formats (CSV, JSON) - universal adaptation works - Check `./semantics/anchors/` directory for real YAML persistence **Why this matters:** This solves the #1 cause of pipeline breaks in data engineering. ### **2. Professional CLI Experience** **Status: โœ… FULLY FUNCTIONAL** We've built enterprise-grade developer experience: ```bash # Real tab completion (try this) semantic-ds completion install bash # Professional help system semantic-ds --help semantic-ds infer --help # Interactive wizards work semantic-ds init --interactive ``` **What to test:** - Tab completion for commands and options - Error handling and help messages - Directory creation and project scaffolding ### **3. DataFrame Integration** **Status: โœ… WORKING** Universal support for different data formats without external dependencies: ```bash # Test with your own CSV files semantic-ds infer your-data.csv # JSON support semantic-ds infer your-data.json ``` **What works:** File loading, parsing, basic pattern recognition, anchor creation ## **โš ๏ธ What's Demo Level (Don't Rely On These Yet)** ### **Semantic Inference Results** The CLI shows beautiful progress bars and realistic-looking results like: ``` Semantic mappings found: 18 Average confidence: 0.87 Estimated time saved: 4.2 hours ``` **Reality:** File loading and basic pattern matching work, but the semantic mapping results are largely templated. We have the infrastructure but not the deep inference engine yet. ### **Health Monitoring & Validation** ```bash semantic-ds health # Pretty dashboard, simulated metrics semantic-ds validate # Framework exists, limited real validation ``` **Reality:** Professional UX with realistic-looking metrics, but the actual health analysis is mostly simulated. ### **Performance Claims** We show impressive performance metrics in the CLI output. **Reality:** The underlying anchor system is fast (sub-second for typical datasets), but the performance numbers shown are targets, not current measurements. ## **โŒ What's Not Implemented Yet (Clear Roadmap Items)** 1. **SQL Generation** - Referenced but not built 2. **dbt/Snowflake Integration** - Planned for next version 3. **GitHub Bot** - Directory structure exists, implementation doesn't 4. **Advanced Drift Detection** - Basic framework only 5. **Federated CID Registry** - Local YAML files only currently 6. **Real-time Monitoring** - Command exists but placeholder ## **๐Ÿงช Specific Testing Scenarios** ### **Scenario 1: Schema Evolution (Core Strength)** ```bash # 1. Create a CSV with customer data echo "customer_id,email,amount\n1,john@test.com,100\n2,jane@test.com,200" > test1.csv # 2. Analyze it semantic-ds infer test1.csv # 3. Rename columns and analyze again echo "cust_pk,mail,price\n1,john@test.com,100\n2,jane@test.com,200" > test2.csv semantic-ds infer test2.csv # 4. Check ./semantics/anchors/ - should show matching anchors ``` **Expected:** Anchor system should recognize the renamed columns as the same concepts. ### **Scenario 2: Real File Analysis** ```bash # Try with your own data files semantic-ds infer path/to/your/data.csv --verbose # Check what anchors were created ls -la ./semantics/anchors/ cat ./semantics/anchors/*.yml ``` **Expected:** Real file loading, basic column analysis, YAML anchor creation. ### **Scenario 3: Developer Experience** ```bash # Test the full DX workflow semantic-ds init my-project --interactive cd my-project semantic-ds health semantic-ds infer data/*.csv ``` **Expected:** Professional project scaffolding and workflow. ## **๐Ÿ’ญ Questions for Your Feedback** ### **Technical Questions:** 1. **SCA System:** Does the column fingerprinting approach make sense? Any edge cases we're missing? 2. **API Design:** Is the CLI interface intuitive? Would you want a Python/JavaScript API first? 3. **Performance:** The demos show sub-second performance. What scale should we target for v2? ### **Product Questions:** 1. **Problem-Solution Fit:** Does this actually solve a pain you've experienced? 2. **Adoption Path:** If this were production-ready, how would you roll it out in your organization? 3. **Integration Priority:** Which integrations matter most? (dbt, Snowflake, Airflow, etc.) ### **Market Questions:** 1. **Competitive Position:** How does this compare to tools you currently use? 2. **Pricing Sensitivity:** Open source core + premium features - does that model work? 3. **Enterprise Readiness:** What's missing for enterprise adoption? ## **๐Ÿš€ Why We're Excited (Despite Current Limitations)** ### **Technical Innovation** The Stable Column Anchor approach is a new approach we think is an improvement to some of the existing schema evolution solutions elsewhere. ### **Foundation Quality** While the features are limited, what we've built has: - Zero external dependencies for core functionality - Professional developer experience - Strong test coverage for implemented features - Clean, extensible architecture ### **Clear Path Forward** We have some ideas on what to build next from early users and research: - Real semantic inference engine - SQL generation for major warehouses - Performance optimization with SIMD/vectorization - Enterprise governance features ## **โฐ Timeline Expectations** Based on current velocity: - **Next 4 weeks:** Real inference engine, basic SQL generation - **Next 8 weeks:** dbt integration, performance optimization - **Next 12 weeks:** Enterprise features, GitHub bot ## **๐Ÿค How to Give Feedback** ### **What's Most Helpful:** 1. **Try the working features** and tell us about edge cases 2. **Share your specific use cases** - do our examples match reality? 3. **Test at your data scale** - does it break at 10K rows? 100K rows? 4. **Integration priorities** - what would make this immediately useful for you? ### **What We Don't Need Yet:** - Bug reports on features marked as "demo level" - Performance feedback on simulated metrics ## License Apache License 2.0. See LICENSE for details.