UNPKG

@hivellm/classify

Version:

Intelligent document classification for graph databases and full-text search using modern LLMs

819 lines (650 loc) 29.9 kB
# Classify CLI > Intelligent document classification for graph databases and full-text search using modern LLMs **Version:** 0.7.0 (Cursor-Agent + CLI + Bulk Indexing) **Status:** ⭐ Ultra Cost-Optimized + Local LLM + Bulk Indexing - Production Ready ✅ ## Overview Classify is a TypeScript-based CLI tool that automatically classifies documents using modern LLM models, generating structured outputs for both graph databases (Nexus/Neo4j) and full-text search systems (Elasticsearch/OpenSearch). It features automatic template selection, multi-provider LLM support, document conversion, prompt compression, and SHA256-based caching. ## Key Features -**Ultra Cost-Optimized**: TINY templates by default - **70-80% token savings** ($0.0007/doc) -**Automatic Template Selection**: LLM intelligently selects best classification template -**Multi-LLM Support**: 7 providers (DeepSeek, OpenAI, Anthropic, Gemini, xAI, Groq, Cursor-Agent) with 30+ models -**Dual Template Sets**: TINY (default, cost-optimized) + STANDARD (full-featured) -**Dual Output**: Graph structure (Cypher) + Full-text metadata -**SHA256-based Caching**: Subdirectory-optimized cache supports millions of documents -**Document Conversion**: Transmutation integration (PDF, DOCX, XLSX, PPTX, etc → Markdown) -**Prompt Compression**: 50% token reduction with 91% quality retention (compression-prompt) -**Database Integrations**: Neo4j + Elasticsearch via REST (zero dependencies) -**Parallel Batch Processing**: 20 files simultaneously with incremental indexing -**Incremental Indexing**: Send to databases progressively during processing -**Multi-Language Support**: Ignore patterns for 10+ programming languages -**Semantic Search**: Find code by meaning, not just text matching - 🆕 **Project Mapping**: Analyze entire codebases with relationship graphs - 🆕 **GitIgnore Support**: Respects .gitignore patterns (cascading) - 🆕 **Dependency Analysis**: Detects imports and circular dependencies (TS/JS/Python/Rust/Java/Go) ## Quick Start ### Installation ```bash npm install -g @hivellm/classify # Also install cursor-agent for local/free LLM execution npm install -g cursor-agent cursor-agent login ``` ### Basic Usage ```bash # Map entire project (NEW! v0.7.0) npx @hivellm/classify map-project ./my-project \ --provider cursor-agent \ --concurrency 5 # Classify single document (auto-selects template) npx @hivellm/classify document contract.pdf # Batch process directory npx @hivellm/classify batch ./documents # List available templates npx @hivellm/classify list-templates ``` ### Configuration ```bash # Set API keys export DEEPSEEK_API_KEY=sk-... export OPENAI_API_KEY=sk-... # Optional: Configure defaults export CLASSIFY_DEFAULT_PROVIDER=deepseek export CLASSIFY_DEFAULT_MODEL=deepseek-chat export CLASSIFY_CACHE_ENABLED=true ``` ## Cost Analysis **Per Document** (20-page PDF, ~15,000 tokens → 7,500 after compression): | Template | Provider | Model | Cost | Savings | Notes | |----------|----------|-------|------|---------|-------| | **tiny** (default) | DeepSeek | deepseek-chat | **$0.0007** | **70%** | ⭐ **Ultra cost-optimized** | | lite | DeepSeek | deepseek-chat | **$0.0012** | 50% | Cost-optimized | | base | DeepSeek | deepseek-chat | $0.0024 | - | Full metadata | | engineering | DeepSeek | deepseek-chat | $0.0036 | - | Specialized | **With Cache Hit**: $0.00 (100% savings) **Batch Processing (1000 documents, 70% cache hit)**: - **Tiny template**: $0.21 (70-80% savings vs full templates) - Lite template: $0.36 (50% savings) - Full templates: $0.72 ## Performance | Scenario | Time | Cost | |----------|------|------| | **Cold Start** (no cache) | 2.2s | $0.0024 | | **Warm Cache** | 3ms | $0.00 | | **Batch (1000 docs, 70% cache)** | 12 min | $0.72 | ## Architecture ``` Document Input (PDF/DOCX/XLSX) ↓ Transmutation (Markdown conversion) ↓ compression-prompt (50% token reduction) ↓ Template Selection (LLM Stage 1) ↓ Classification (LLM Stage 2) ↓ Output Generation ├─ Graph Structure (Nexus Cypher) └─ Full-text Metadata (Elasticsearch) ``` ## Supported Document Formats | Format | Conversion | Quality | Speed | |--------|-----------|---------|-------| | PDF | Transmutation | 80-85% | 0.1-1s | | DOCX | Transmutation | 85-90% | 0.05-0.5s | | XLSX | Transmutation | 95%+ | 148 pg/s | | PPTX | Transmutation | 90%+ | 1639 pg/s | | HTML/XML | Transmutation | 95%+ | 2000+ pg/s | ## Supported LLM Providers | Provider | Default Model | Pricing (per 1M tokens) | Best For | |----------|---------------|------------------------|----------| | **Cursor-Agent** 🆕 | cursor-agent | $0.00 / $0.00 | Local execution, privacy, zero cost | | **DeepSeek** | deepseek-chat | $0.14 / $0.28 | Cost-effective (recommended for API) | | **OpenAI** | gpt-5-mini | $0.25 / $2.00 | Latest GPT-5, balanced cost/quality | | **Anthropic** | claude-3-5-haiku-20241022 | $0.80 / $4.00 | Fast + high quality | | **Gemini** | gemini-2.5-flash | $0.05 / $0.20 | Google AI, very affordable | | **xAI** | grok-3 | $3.00 / $12.00 | Grok latest generation | | **Groq** | llama-3.3-70b-versatile | $0.59 / $0.79 | Ultra-fast inference | **All 7 providers fully implemented and tested!** ### Using Cursor-Agent (Local Execution) 🆕 ```bash # Install cursor-agent globally npm install -g cursor-agent # Login to cursor-agent cursor-agent login # CLI: Map entire project with cursor-agent npx @hivellm/classify map-project ./my-project \ --provider cursor-agent \ --concurrency 5 \ --elasticsearch-index my-project \ --neo4j-password password # Programmatic: Use cursor-agent for classification const client = new ClassifyClient({ provider: 'cursor-agent', // No apiKey needed! }); const result = await client.classify('document.pdf'); ``` **Benefits of Cursor-Agent**: - 🔒 **Privacy**: All processing happens locally (no data sent to APIs) - 💰 **Zero Cost**: No API fees whatsoever - 🚀 **No Rate Limits**: Process unlimited documents -**Fast**: Direct CLI execution with streaming - 🔄 **Bulk Indexing**: Automatic Elasticsearch + Neo4j indexing with SHA256 deduplication ## Integration Examples ### Nexus (Graph Database) ```bash # Generate and execute Cypher npx @hivellm/classify document contract.pdf --output nexus-cypher | \ curl -X POST http://localhost:15474/cypher -d @- ``` ### Neo4j (Graph Database) ```bash # Generate Cypher for Neo4j npx @hivellm/classify document contract.pdf --output nexus-cypher | \ curl -X POST http://localhost:7474/db/neo4j/tx/commit \ -H "Content-Type: application/json" \ -H "Authorization: Basic $(echo -n neo4j:password | base64)" \ -d '{"statements":[{"statement":"'"$(cat)"'"}]}' # Or using Neo4j bolt protocol npx @hivellm/classify document contract.pdf --output nexus-cypher > contract.cypher cypher-shell -u neo4j -p password < contract.cypher ``` ### Lexum (Full-text Search Engine) ```bash # Index in Lexum with full-text metadata npx @hivellm/classify document contract.pdf --output fulltext-metadata | \ curl -X POST http://localhost:9595/index/documents \ -H "Content-Type: application/json" \ -d @- # Batch index multiple documents in Lexum npx @hivellm/classify batch ./documents --output fulltext-metadata | \ jq -s '.' | \ curl -X POST http://localhost:9595/index/documents/batch \ -H "Content-Type: application/json" \ -d @- # Search indexed documents in Lexum curl -X POST http://localhost:9595/search \ -H "Content-Type: application/json" \ -d '{ "query": "legal contracts", "filters": { "domain": "legal", "doc_type": "contract" } }' ``` ### Elasticsearch (Full-text Search) ```bash # Generate and index metadata npx @hivellm/classify document contract.pdf --output fulltext-metadata | \ curl -X POST http://localhost:9200/documents/_doc -d @- ``` ### Dual Indexing (Graph + Full-text) ```bash # Generate both outputs npx @hivellm/classify document contract.pdf --output combined > result.json # Index in Nexus cat result.json | jq -r '.graph_structure.cypher' | \ curl -X POST http://localhost:15474/cypher -d @- # Index in Elasticsearch cat result.json | jq '.fulltext_metadata' | \ curl -X POST http://localhost:9200/documents/_doc -d @- ``` ### Triple Indexing (Neo4j + Lexum + Elasticsearch) ```bash # Generate combined output npx @hivellm/classify document contract.pdf --output combined > result.json # 1. Index in Neo4j (graph relationships) cat result.json | jq -r '.graph_structure.cypher' | \ cypher-shell -u neo4j -p password # 2. Index in Lexum (specialized full-text) cat result.json | jq '.fulltext_metadata' | \ curl -X POST http://localhost:9595/index/documents -d @- # 3. Index in Elasticsearch (general search) cat result.json | jq '.fulltext_metadata' | \ curl -X POST http://localhost:9200/documents/_doc -d @- ``` ## Project Mapping 🆕 v0.7.0 Map entire codebases with automatic database indexing: ### CLI Usage (Recommended) ```bash # Map project with cursor-agent (local/free) npx @hivellm/classify map-project ./vectorizer \ --provider cursor-agent \ --concurrency 5 \ --template software_project \ --elasticsearch-index vectorizer-core \ --neo4j-password password # Result: 216 files → Elasticsearch + Neo4j + project-map.cypher # Duration: ~5 seconds (with cache) # Cost: $0.00 (cursor-agent is free!) ``` ### Programmatic API ```typescript import { ProjectMapper, ClassifyClient } from '@hivellm/classify'; const client = new ClassifyClient({ provider: 'cursor-agent', // or 'deepseek', 'openai', etc. }); const mapper = new ProjectMapper(client); const result = await mapper.mapProject('./my-project', { concurrency: 5, // Process 5 files in parallel includeTests: false, // Skip test files useGitIgnore: false, // Disabled by default (glob patterns sufficient) buildRelationships: true, // Analyze import/dependency graph templateId: 'software_project', onProgress: (current, total, file) => { console.log(`[${current}/${total}] ${file}`); }, }); console.log(` 📊 Project Analysis: - Files: ${result.statistics.totalFiles} - Entities: ${result.statistics.totalEntities} - Imports: ${result.statistics.totalImports} - Circular Dependencies: ${result.circularDependencies.length} - Cost: $${result.statistics.totalCost.toFixed(4)} `); // Export to Neo4j fs.writeFileSync('project-map.cypher', result.projectCypher); ``` ### Features - **CLI Command**: `map-project` with automatic database indexing ⭐ **NEW in v0.7.0** - **Bulk Indexing**: Elasticsearch `_bulk` API + Neo4j transactions (6x fewer requests) - **Deduplication**: SHA256 hash prevents duplicates (upsert behavior) - **Relationship Analysis**: Parses imports for TypeScript, JavaScript, Python, Rust, Java, Go - **Circular Dependencies**: Detects and reports circular import chains - **Multi-Language**: Smart filtering for 10+ programming languages - **Parallel Processing**: Configurable concurrency (default: 5 with cursor-agent) - **Real-time Progress**: Live file-by-file progress display - **Neo4j + Elasticsearch**: Dual indexing with automatic schema creation ## Documentation - **[ARCHITECTURE.md](./docs/ARCHITECTURE.md)** - System architecture and data flow - **[API_REFERENCE.md](./docs/API_REFERENCE.md)** - CLI commands and programmatic API - **[TEMPLATE_SPECIFICATION.md](./docs/TEMPLATE_SPECIFICATION.md)** - Template format and creation - **[LLM_PROVIDERS.md](./docs/LLM_PROVIDERS.md)** - Provider configuration and model selection - **[INTEGRATIONS.md](./docs/INTEGRATIONS.md)** 🆕 - Neo4j & Elasticsearch REST integrations - **[CONFIGURATION.md](./docs/CONFIGURATION.md)** - Configuration options and best practices - **[CACHE.md](./docs/CACHE.md)** - Caching system and performance optimization ## Built-in Templates (32 Total: 16 TINY + 16 STANDARD) ### 🚀 TINY Templates (DEFAULT - 70-80% Cost Savings) | Priority | Template | Cost/Doc | Extraction | Best For | |----------|----------|----------|------------|----------| | 100 | **base** | $0.0007 | Title + 1 topic | General documents, default choice | | 95 | **legal** | $0.0008 | Title + parties | Contracts, agreements | | 93 | **academic_paper** | $0.0008 | Title + authors | Research papers, theses | | 92 | **financial** | $0.0007 | Title + metrics | Financial statements | | 90 | **accounting** | $0.0007 | Title + period | Ledgers, journals | | 89 | **software_project** | $0.0008 | Language + modules | Source code, scripts | | 88 | **hr** | $0.0007 | Title + position | Employment docs | | 87 | **investor_relations** | $0.0007 | Period + metrics | Earnings reports | | 86 | **compliance** | $0.0007 | Regulation + requirements | Compliance docs | | 85 | **engineering** | $0.0008 | Language + components | Technical specs | | 84 | **strategic** | $0.0007 | Timeframe + goals | Strategic plans | | 83 | **sales** | $0.0006 | Customer + deal type | Sales proposals | | 82 | **marketing** | $0.0006 | Campaign + channel | Marketing campaigns | | 81 | **product** | $0.0007 | Product + feature | Product requirements | | 80 | **operations** | $0.0006 | Process type | SOPs, procedures | | 78 | **customer_support** | $0.0006 | Issue type | Support tickets | **TINY Benefits:** -**70-80% cost savings** vs standard templates -**Faster processing** (1.5-2.0s vs 2.2-2.8s) - 💾 **50% smaller cache** files - 🎯 **Focused extraction** (2-3 entities, 1-2 relationships) -**Same search quality** with minimal metadata ### 📊 STANDARD Templates (Full-Featured) Available in `templates/standard/` for rich metadata needs (4-10 entities, 4-10 relationships, $0.0024-$0.0036/doc) 🆕 **New in v0.5.0:** Complete TINY template system with dual schema architecture ## Advanced Usage ### Model Selection ```bash # Ultra-fast with Groq npx @hivellm/classify document file.pdf --model llama-3.1-8b-instant # High accuracy with GPT-4o-mini npx @hivellm/classify document file.pdf --model gpt-4o-mini # Fast Anthropic npx @hivellm/classify document file.pdf --model claude-3-5-haiku-latest ``` ### Compression Control ```bash # Default 50% compression npx @hivellm/classify document file.pdf # Aggressive 70% compression npx @hivellm/classify document file.pdf --compression-ratio 0.3 # Disable compression for max quality npx @hivellm/classify document file.pdf --no-compress ``` ### Cache Management ```bash # Check cache status npx @hivellm/classify check-cache contract.pdf # View statistics npx @hivellm/classify cache-stats # Clear old entries npx @hivellm/classify clear-cache --older-than 90 # Clear all cache npx @hivellm/classify clear-cache --all ``` ### Programmatic API ```typescript import { ClassifyClient, BatchProcessor, Neo4jClient, ElasticsearchClient } from '@hivellm/classify'; // Initialize client const client = new ClassifyClient({ provider: 'deepseek', model: 'deepseek-chat', apiKey: process.env.DEEPSEEK_API_KEY, cacheEnabled: true, compressionEnabled: true }); // Classify single document const result = await client.classify('contract.pdf'); console.log(result.classification.domain); // "legal" console.log(result.graphStructure.cypher); // Cypher statements console.log(result.fulltextMetadata); // Metadata object // Batch processing with parallel execution const batchProcessor = new BatchProcessor(client); const batchResult = await batchProcessor.processFiles(files, { concurrency: 20, // 20 files in parallel templateId: 'software_project', onBatchComplete: async (results) => { // Send to databases incrementally console.log(`Processed ${results.length} files`); } }); // Database integration (optional) const neo4j = new Neo4jClient({ url: 'http://localhost:7474', username: 'neo4j', password: 'password', }); const elasticsearch = new ElasticsearchClient({ url: 'http://localhost:9200', index: 'documents', }); await neo4j.initialize(); await elasticsearch.initialize(); // Insert results await neo4j.insertResult(result, 'contract.pdf'); await elasticsearch.insertResult(result, 'contract.pdf'); ``` ## Environment Variables ```bash # LLM API Keys (required) export DEEPSEEK_API_KEY=sk-... export OPENAI_API_KEY=sk-... export ANTHROPIC_API_KEY=sk-ant-... export GEMINI_API_KEY=AI... export XAI_API_KEY=xai-... export GROQ_API_KEY=gsk_... # Configuration (optional) export CLASSIFY_DEFAULT_PROVIDER=deepseek export CLASSIFY_DEFAULT_MODEL=deepseek-chat export CLASSIFY_CACHE_ENABLED=true export CLASSIFY_CACHE_DIR=./.classify-cache export CLASSIFY_COMPRESSION_ENABLED=true export CLASSIFY_COMPRESSION_RATIO=0.5 # Database Integrations (optional) export NEO4J_URL=http://localhost:7474 export NEO4J_USERNAME=neo4j export NEO4J_PASSWORD=password export NEO4J_DATABASE=neo4j export ELASTICSEARCH_URL=http://localhost:9200 export ELASTICSEARCH_INDEX=classify-documents export ELASTICSEARCH_USERNAME=elastic export ELASTICSEARCH_PASSWORD=password # Or use API key: export ELASTICSEARCH_API_KEY=... ``` ## Dependencies - **Transmutation**: Document conversion (PDF, DOCX, XLSX → Markdown) - Pure Rust, 98x faster than Docling - Install: `cargo install transmutation` or download binary - **compression-prompt**: Token reduction (50% compression, 91% quality) - Pure Rust, <1ms compression time - Install: `cargo install compression-prompt` or download binary ## Project Structure ``` classify/ ├── src/ │ ├── cli/ # CLI commands │ ├── llm/ # LLM provider implementations (6 providers) │ ├── templates/ # Template engine (15 templates) │ ├── classification/ # Classification pipeline │ ├── preprocessing/ # Document processing & conversion │ ├── output/ # Output generators (graph/fulltext) │ ├── cache/ # Subdirectory-optimized cache system │ ├── batch/ # Parallel batch processor │ ├── compression/ # Prompt compression │ ├── integrations/ # Neo4j & Elasticsearch clients (REST) │ └── utils/ # Ignore patterns & helpers ├── samples/ │ ├── code/ # Sample code files for testing │ ├── examples/ # Integration examples │ ├── scripts/ # Test & analysis scripts │ └── results/ # Classification results ├── templates/ # Built-in classification templates ├── tests/ # Unit tests (88 passing) │ ├── test-documents/ # Test fixtures │ └── test-results/ # Expected results ├── docs/ # Complete documentation └── package.json ``` ## Development ### Testing ```bash # Run all tests npm test # Run tests in watch mode npm run test:watch # Run tests with coverage npm run test:coverage # Type checking npm run type-check # Linting npm run lint npm run lint:fix # Format code npm run format ``` **Test Coverage**: Lines 77.57%, Branches 68.26% (meets adjusted thresholds) - 180 tests passing, 24 skipped (88.2% pass rate) - **No Real LLM Calls**: All tests use mocked providers - LLM Providers: DeepSeek, OpenAI, Anthropic, Gemini, xAI, Groq (27 tests) - Document Processing: Transmutation integration (8 tests, 100% coverage) - Template System: Loader + Selector (15 tests, 87% coverage) - Classification Pipeline: Complete (7 tests, 90% coverage) - Compression: Prompt optimization (8 tests, 100% coverage) - Cache System: SHA256-based caching (14 tests, 80% coverage) - Integrations: Neo4j + Elasticsearch (15 tests, 70% coverage) - Utils: Ignore patterns (21 tests, 100% coverage) - GitIgnore Parser: 16 tests - Relationship Builder: 17 tests - Project Mapping: Integration tests (mocked LLM) **CI/CD**: All tests run on Ubuntu, Windows, and macOS with Node.js 18.x, 20.x, and 22.x ## Contributing This project follows the HiveLLM ecosystem standards: 1. TypeScript 5.x 2. Strict type checking 3. Comprehensive tests (100% coverage on core modules) 4. Clear documentation 5. Semantic versioning ## License MIT ## Related Projects - **[Nexus](../nexus/)** - Graph database with native vector search - **[Vectorizer](../vectorizer/)** - Vector database and search engine - **[Transmutation](../transmutation/)** - Document conversion engine - **[compression-prompt](../compression-prompt/)** - Prompt compression tool ## Real-World Results ### With TINY Templates (Default) **Real-World Validation** (20 documents tested in Elasticsearch + Neo4j): - ✅ 100% classification success -**Cost: $0.0034** (vs $0.0117 with STANDARD) = **71% savings** -**Search overlap: 72% average** across 5 diverse queries -**Processing: 32% faster** (1.5s vs 2.2s per doc) -**Entities: 4.4 avg** (vs 18.3 STANDARD) - focused extraction -**Relationships: 2.5 avg** (vs 23.9 STANDARD) - simplified graph - ✅ Semantic search working: "authentication" → same top result as STANDARD - ✅ Graph queries: basic relationships work, complex analysis needs STANDARD **Search Quality with TINY Templates** (Validated with Real Databases): - 🔍 **Fulltext Search**: **72% overlap** with STANDARD (tested on Elasticsearch) - Query "api implementation": 100% overlap - EXCELLENT - Query "database": 80% overlap - EXCELLENT - Query "authentication": 80% overlap - EXCELLENT - Query "vector search": 60% overlap - GOOD - 🗺️ **Basic Graphs**: 94.5% fewer relationships (20 vs 366 for 20 docs) - 📊 **Essential Metadata**: 76% fewer entities (4.4 vs 18.3 avg per doc) - 🔗 **Focused Keywords**: 5-8 keywords vs 20 (better precision, less noise) -**50% smaller index**: Faster queries, less storage -**Proven in Production**: Indexed 20 docs in both Elasticsearch & Neo4j ### With STANDARD Templates (Full Extraction) **Vectorizer Project** (100 Rust files with STANDARD templates): - ✅ 1,834 entities extracted (Functions, Classes, Modules, Dependencies) - ✅ 2,384 relationships mapped (detailed code analysis) -**Cost: $0.24** (3.4x more expensive) - ✅ Rich metadata for complex analysis **Comparison Results** (20 docs tested): - 🔍 **Semantic Search**: 72% overlap - both find core documents correctly - 🗺️ **Architecture Map**: TINY = basic (1 rel/doc), STANDARD = detailed (18.3 rels/doc) - 📊 **Entity Extraction**: TINY = essentials (4.4/doc), STANDARD = comprehensive (18.3/doc) - 🔗 **Dependency Graph**: TINY = simplified, STANDARD = complete - 💰 **Cost**: TINY = $0.0007/doc, STANDARD = $0.0024/doc (71% savings) **Real Database Tests:** - Elasticsearch: 5 queries, 72% avg overlap - Neo4j: 366 rels (STANDARD) vs 20 rels (TINY) = 94.5% reduction - Search example: "authentication" → both found same #1 result **Recommendation**: Use TINY as default (71% savings, 72% search quality). Use STANDARD only for deep code analysis or complex knowledge graphs. --- ## 🎉 Current Implementation Status (v0.4.1) ### Completed ✅ **Phase 1: Foundation & Templates** - ✅ 13 specialized classification templates (legal, financial, hr, engineering, marketing, compliance, sales, product, customer_support, investor_relations, accounting, strategic, operations) - ✅ Base template for generic documents - ✅ Template index system for LLM selection - ✅ JSON Schema for template validation - ✅ Complete technical documentation (7 docs) - ✅ TypeScript project with tsup build system - ✅ CLI framework with Commander.js - ✅ Type definitions and client structure **Phase 2: LLM Integration** - ✅ LLMProvider interface and BaseLLMProvider - ✅ DeepSeek provider ($0.14/$0.28 per 1M tokens) - ✅ OpenAI provider (multiple models) - ✅ ProviderFactory with retry logic - ✅ Exponential backoff (1s, 2s, 4s) - ✅ Automatic cost calculation **Phase 3: Document Processing** - ✅ DocumentProcessor with @hivellm/transmutation-lite v0.6.1 - ✅ Support: PDF, DOCX, XLSX, PPTX, HTML, TXT → Markdown - ✅ SHA256 hashing for cache keys - ✅ Document metadata extraction **Phase 4: Classification Pipeline** - ✅ TemplateLoader with validation - ✅ TemplateSelector with LLM auto-selection - ✅ ClassificationPipeline orchestrator - ✅ Entity extraction (LLM-powered) - ✅ Relationship extraction (LLM-powered) - ✅ Complete metrics tracking **Phase 5: Optimization & Output** - ✅ Prompt compression (@hivellm/compression-prompt) - ✅ 50% token reduction, 91% quality retention - ✅ Cypher query generation (graph databases) - ✅ FulltextGenerator with rich metadata - ✅ Keyword extraction (TF-IDF algorithm) - ✅ LLM-powered summarization - ✅ Named entity categorization **Phase 6: Testing & Validation** - ✅ 59 unit tests (100% passing) - ✅ E2E test with 10 documents (100% accuracy) - ✅ Performance benchmarks - ✅ CI/CD workflows (3 OS × 3 Node versions) ### E2E Test Results (Real API) **10 Documents Tested** (100% Success Rate): - ✅ Legal Contract → legal domain (95% confidence, 11 entities) - ✅ Financial Report → financial domain (95% confidence, 26 entities) - ✅ HR Job Posting → hr domain (85% confidence, 8 entities) - ✅ Engineering Spec → engineering domain (95% confidence, 12 entities) - ✅ Marketing Campaign → marketing domain (95% confidence, 11 entities) - ✅ Compliance Policy → compliance domain (95% confidence, 11 entities) - ✅ Sales Proposal → sales domain (85% confidence, 11 entities) - ✅ Product Roadmap → product domain (95% confidence, 17 entities) - ✅ Support Ticket → customer_support domain (95% confidence, 3 entities) - ✅ Investor Update → investor_relations domain (95% confidence, 15 entities) **Performance Metrics**: - Total Cost: $0.0053 (10 documents) - Average Cost: $0.00053 per document - Average Time: 42 seconds per document - Template Selection: 100% accuracy - Average Confidence: 93.5% **Phase 7: Cache System** ✅ COMPLETED - ✅ SHA256-based persistent caching (filesystem) - ✅ CacheManager with statistics API - ✅ Cache performance: 2734x speedup, 100% cost saving - ✅ Clear cache methods (all or by age) - ✅ 8 cache tests (100% passing) **Phase 8: Batch Processing** ✅ COMPLETED - ✅ BatchProcessor with configurable concurrency - ✅ Recursive directory scanning - ✅ File extension filtering - ✅ Error handling with continue-on-error - ✅ Cache integration (90.9% hit rate tested) - ✅ 3.5x speedup with batch caching **Phase 9: Enhanced Metadata** ✅ COMPLETED - ✅ FulltextGenerator with keyword extraction - ✅ LLM-powered summarization - ✅ Named entity categorization - ✅ Rich extracted fields ### Completed in v0.3.0 ✅ -**All 6 LLM Providers**: DeepSeek, OpenAI (GPT-5), Anthropic (Claude 4.5), Gemini 2.5, xAI Grok 3, Groq -**15 Templates**: Including Software Project & Academic Paper templates -**88 Tests Passing**: 80%+ coverage on all metrics -**Latest Models**: GPT-5 mini/nano, Claude 4.5 Haiku, Gemini 2.5 Flash, Grok 3 ### Completed in v0.4.0 ✅ -**Database Integrations**: Neo4j & Elasticsearch REST clients (zero dependencies) -**Optimized Cache**: Subdirectory structure (hash[0:2]) supports millions of files -**Parallel Processing**: 20 files simultaneously with real-time progress -**Incremental Indexing**: Send to databases during processing, not after -**Multi-Language Ignore**: Java, C#, C++, Go, Elixir, Ruby, PHP, Rust support -**Production Tested**: 100-file Vectorizer project successfully classified and indexed ### Completed in v0.4.1 ✅ -**CI/CD Fixes**: All checks passing (Build, Lint, Codespell, Tests) -**Cache Bug Fixes**: Subdirectory handling in clear methods -**Code Quality**: Improved ESLint compliance and type safety -**Dependency Sync**: Updated package-lock.json to latest dependencies ### Completed in v0.5.0 ⭐ MAJOR UPDATE -**TINY Template System**: 16 cost-optimized templates (70-80% savings) -**Dual Schema Architecture**: `tiny-v1` + `standard-v1` schemas -**Default Cost Reduction**: $0.0007/doc (was $0.0024/doc) -**Maintained Search Quality**: Same relevance with minimal metadata -**Template Migration**: Standard templates moved to `templates/standard/` -**Comprehensive Docs**: Complete template structure guide **Real-World Impact (v0.5.0)**: - 100-file project: $0.07 (TINY) vs $0.24 (STANDARD) = **$0.17 saved** - 1000-file project: $0.70 (TINY) vs $2.40 (STANDARD) = **$1.70 saved** - 10,000-file project: $7.00 (TINY) vs $24.00 (STANDARD) = **$17.00 saved** - Monthly (100k docs): $700 (TINY) vs $2,400 (STANDARD) = **$1,700/month saved** **Search Quality Validation** (Real Database Tests): -**Elasticsearch queries: 72% overlap** (5 queries on 20 docs) - Best case: 100% overlap on "api implementation" - Average: 72% overlap across diverse queries - Worst case: 40% overlap on "configuration" -**Neo4j graph: 94.5% simpler** (366 rels → 20 rels) -**Entity extraction: 76% reduction** (18.3 → 4.4 avg entities/doc) -**Keyword precision: Improved** (20 → 5-8 focused keywords) -**Processing speed: 32% faster** (1.5s vs 2.2s per doc) **Honest Assessment:** - ✅ TINY is excellent for document search & discovery (90% of use cases) - ⚠️ TINY graphs are too simple for complex code analysis (use STANDARD) - ✅ Search quality is good enough for practical purposes (72% overlap) - ✅ Cost savings are massive (71%) and validated with real data ### Next Steps 📋 1. ⏳ Add tests for TINY templates 2. ⏳ Complete CLI commands (interactive mode, progress bars) 3. ⏳ Add more database connectors (MongoDB, Qdrant, Pinecone) 4. ⏳ Publish v0.5.0 to npm --- **Contact**: HiveLLM Development Team