@andrejs1979/document

# NoSQL - Document Module

A comprehensive MongoDB-compatible document database built for Cloudflare's edge infrastructure, featuring advanced vector integration, intelligent indexing, and real-time capabilities.

## Features

### 🚀 **MongoDB Compatibility**
- Full MongoDB query language support
- CRUD operations (Create, Read, Update, Delete)
- Advanced aggregation pipelines
- Complex query operators ($and, $or, $in, $regex, etc.)
- GridFS-like large document support with R2 integration

### 🔍 **Hybrid Search**
- Text search with full-text indexing
- Vector similarity search
- Semantic search using embeddings
- Multi-modal search (text, image, audio)
- Personalized recommendations
- Similar document discovery

### 🔗 **Relationships**
- Define relationships between collections
- Automatic population of related documents
- One-to-one, one-to-many, many-to-many relationships
- Cascade operations with referential integrity
- Deep population with configurable depth

### ⚡ **Performance**
- Intelligent auto-indexing based on query patterns
- Dynamic field indexing for optimal performance
- Query plan optimization and explanation
- Caching with TTL support
- Connection pooling and batch operations

### 📊 **Bulk Operations**
- High-performance bulk writes
- Streaming inserts for large datasets
- Real-time document streams
- Parallel processing with configurable concurrency
- Error handling and recovery

### 🏷️ **Smart Tagging**
- Automatic content-based tagging
- Hierarchical tag systems
- Tag recommendations and suggestions
- Bulk tagging operations
- Tag analytics and cleanup

### 📈 **Analytics & Monitoring**
- Real-time performance metrics
- Query pattern analysis
- Index usage statistics
- Document lifecycle tracking
- Health monitoring and diagnostics

## Quick Start

### Installation

```typescript
import { EdgeDocumentDB } from './src/document';

// Create database instance
const db = await EdgeDocumentDB.create({
  name: 'my_app_db',
  d1Database: env.DB,    // Cloudflare D1 binding
  kvStore: env.KV,       // Cloudflare KV binding
  r2Bucket: env.BUCKET,  // Cloudflare R2 binding
  options: {
    enableAutoIndexing: true,
    enableRelationships: true,
    vectorConfig: {
      enabled: true,
      autoEmbedding: true
    }
  }
});
```

### Basic Operations

```typescript
// Insert documents
const user = await db.insertOne('users', {
  name: 'John Doe',
  email: 'john@example.com',
  tags: ['developer', 'javascript']
});

// Query documents
const users = await db.find('users', {
  tags: { $in: ['developer'] },
  createdAt: { $gte: new Date('2024-01-01') }
}, {
  sort: { name: 1 },
  limit: 10
});

// Update documents
await db.updateOne('users',
  { email: 'john@example.com' },
  { $set: { lastLogin: new Date() } }
);

// Aggregation pipeline
const stats = await db.aggregate('users', [
  { $match: { active: true } },
  { $group: { _id: '$department', count: { $sum: 1 } } },
  { $sort: { count: -1 } }
]);
```

### Hybrid Search

```typescript
// Text search
const articles = await db.textSearch('content', 'machine learning', {
  filters: { category: 'tech' },
  limit: 10
});

// Vector search
const similar = await db.vectorSearch('content', embeddings, {
  threshold: 0.7,
  limit: 5
});

// Semantic search
const semantic = await db.semanticSearch('content',
  'artificial intelligence neural networks',
  { textWeight: 0.3, vectorWeight: 0.7 }
);

// Hybrid search combining multiple signals
const hybrid = await db.hybridSearch('content', {
  text: 'react development',
  vector: queryEmbedding,
  filter: { publishedAt: { $gte: recentDate } },
  weights: { text: 0.4, vector: 0.6, metadata: 0.0 }
});
```

### Relationships

```typescript
// Define relationships
await db.defineRelationship('posts', 'users', {
  type: 'manyToOne',
  localField: 'authorId',
  foreignField: '_id',
  foreignCollection: 'users'
});

// Query with population
const posts = await db.findWithPopulate('posts', {}, [
  {
    path: 'authorId',
    select: 'name email avatar'
  },
  {
    path: 'comments',
    match: { approved: true },
    options: { sort: { createdAt: -1 }, limit: 5 }
  }
]);
```

### Bulk Operations

```typescript
// Bulk write operations
const bulkOps = [
  { insertOne: { document: { name: 'User 1' } } },
  { updateOne: { filter: { name: 'User 2' }, update: { $set: { active: true } } } },
  { deleteOne: { filter: { name: 'User 3' } } }
];

const result = await db.bulkWrite('users', bulkOps);

// Streaming inserts
async function* generateData() {
  for (let i = 0; i < 100000; i++) {
    yield { id: i, value: Math.random() };
  }
}

const streamResult = await db.streamInsert('analytics', generateData(), {
  batchSize: 1000,
  onProgress: (inserted, total) => console.log(`${inserted}/${total}`)
});
```

### Indexing

```typescript
// Create indexes
await db.createIndex('products', {
  key: { category: 1, price: -1 },
  options: { name: 'category_price_idx' }
});

// Text index
await db.createIndex('articles', {
  key: { title: 'text', content: 'text' },
  options: { weights: { title: 10, content: 5 } }
});

// Vector index
await db.createIndex('embeddings', {
  key: { vector: 'vector' },
  options: {
    vectorOptions: {
      dimensions: 1536,
      similarity: 'cosine',
      type: 'hnsw'
    }
  }
});

// Auto-indexing
await db.autoCreateIndexes('products');

// Get recommendations
const recommendations = await db.getIndexRecommendations('products');
```

### Tagging System

```typescript
// Auto-tag documents
const tags = await db.autoTag('articles', document, {
  tagSources: ['content', 'metadata'],
  customTagger: (doc) => {
    const tags = [];
    if (doc.content?.includes('React')) tags.push('react');
    return tags;
  }
});

// Apply tags
await db.tagDocument('articles', documentId, tags);

// Find by tags
const taggedDocs = await db.findByTags('articles', ['react', 'typescript'], {
  operator: 'and',
  includeHierarchy: true
});

// Tag statistics
const tagStats = await db.getTagStats('articles', {
  sortBy: 'count',
  limit: 20
});
```

## Advanced Features

### Vector Integration

The document module seamlessly integrates with NoSQL's vector capabilities:

```typescript
// Documents with embedded vectors
const document = {
  title: 'AI Research Paper',
  content: 'Latest advances in machine learning...',
  _vector: {
    id: 'doc1',
    data: new Float32Array([0.1, 0.2, 0.3, ...]), // 1536 dimensions
    metadata: { model: 'text-embedding-ada-002' }
  }
};

// Automatic embedding generation
const db = await EdgeDocumentDB.create({
  name: 'ai_db',
  d1Database: env.DB,
  options: {
    vectorConfig: {
      enabled: true,
      autoEmbedding: true,
      embeddingFields: ['content', 'title'],
      defaultModel: 'text-embedding-ada-002'
    }
  }
});
```

### Real-time Streams

```typescript
// Create real-time document stream
const stream = db.createDocumentStream('events', {
  batchSize: 100,
  flushInterval: 5000,
  compression: true,
  transform: (doc) => ({
    ...doc,
    processedAt: new Date()
  }),
  errorHandler: (error, batch) => {
    console.error('Stream error:', error);
    // Implement retry logic or dead letter queue
  }
});

// Write to stream
await stream.write({
  event: 'user_action',
  userId: 'user123',
  action: 'click',
  timestamp: new Date()
});

// Stop stream
await stream.stop();
```

### Performance Monitoring

```typescript
// Database statistics
const stats = await db.stats();
console.log('Total documents:', stats.totalDocuments);
console.log('Index count:', stats.indexCount);

// Query performance
const queryMetrics = db.queryEngine.getQueryMetrics();
const slowQueries = queryMetrics.filter(m => m.latency > 100);

// Index usage
const indexStats = await db.indexManager.getIndexStats('mydb', 'collection');
console.log('Unused indexes:', indexStats.recommendations);
```

## Configuration Options

```typescript
interface DocumentDatabaseConfig {
  name: string;
  d1Database: any;                    // Cloudflare D1 instance
  kvStore?: any;                      // Cloudflare KV store
  r2Bucket?: any;                     // Cloudflare R2 bucket

  // Performance settings
  maxDocumentSize?: number;           // Default: 16MB
  queryTimeout?: number;              // Default: 30s
  batchSize?: number;                 // Default: 100

  // Caching
  enableQueryCache?: boolean;         // Default: true
  queryCacheTTL?: number;             // Default: 300s
  cacheSize?: number;                 // Default: 100MB

  // Indexing
  enableAutoIndexing?: boolean;       // Default: true
  autoIndexThreshold?: number;        // Default: 1000
  maxIndexedFields?: number;          // Default: 20

  // Vector integration
  vectorConfig?: {
    enabled?: boolean;                // Default: true
    defaultDimensions?: number;       // Default: 1536
    defaultModel?: string;            // Default: 'text-embedding-ada-002'
    autoEmbedding?: boolean;          // Default: false
    embeddingFields?: string[];       // Default: ['content', 'text']
  };

  // Features
  enableValidation?: boolean;         // Default: true
  enableSchemaEvolution?: boolean;    // Default: true
  enableChangeStreams?: boolean;      // Default: true
  enableRelationships?: boolean;      // Default: true
  enableQueryLogging?: boolean;       // Default: false
  enablePerformanceMetrics?: boolean; // Default: true

  // Bulk operations
  bulkWriteBatchSize?: number;        // Default: 1000
  bulkWriteParallelism?: number;      // Default: 4
}
```

## Architecture

The document module is built with a modular architecture:

```
src/document/
├── edge-document-db.ts          # Main database class
├── types.ts                     # Type definitions
├── storage/
│   └── document-storage.ts      # Core storage engine
├── operations/
│   ├── query-engine.ts          # MongoDB query processing
│   ├── hybrid-search.ts         # Hybrid search engine
│   └── bulk-operations.ts       # Bulk and streaming operations
├── indexes/
│   └── index-manager.ts         # Intelligent indexing
├── relationships/
│   └── relationship-manager.ts  # Document relationships
├── metadata/
│   └── tagging-system.ts        # Smart tagging system
└── examples/
    └── basic-usage.ts           # Comprehensive examples
```

## Performance Characteristics

### Latency Targets
- Simple queries: < 10ms p99
- Complex aggregations: < 100ms p99
- Bulk operations: 10,000+ docs/second
- Vector similarity: < 50ms p99

### Scalability
- Documents: Unlimited (distributed across D1 + R2)
- Collections: Unlimited
- Indexes: 20 dynamic indexes per collection
- Concurrent operations: 1000+ per database

### Storage Efficiency
- Automatic compression for large documents
- Intelligent caching with LRU eviction
- Vector quantization for storage optimization
- Tiered storage (D1 for metadata, R2 for large docs)

## Best Practices

### Query Optimization

```typescript
// Use indexes effectively
await db.find('products', {
  category: 'electronics',  // Indexed field
  price: { $gte: 100 }      // Indexed field
});

// Limit results
await db.find('products', filter, { limit: 20 });

// Use projection to reduce data transfer
await db.find('products', filter, {
  projection: { name: 1, price: 1 }
});

// Explain queries for optimization
const explanation = await db.explain('products', filter);
```

### Memory Management

```typescript
// Use streaming for large datasets
const stream = db.createDocumentStream('logs', {
  batchSize: 1000,
  flushInterval: 5000
});

// Clear caches periodically
db.hybridSearchEngine.clearSearchCache();
db.relationshipManager.clearPopulateCache();
```

### Error Handling

```typescript
try {
  await db.insertOne('users', document);
} catch (error) {
  if (error instanceof DuplicateKeyError) {
    // Handle duplicate key
  } else if (error instanceof ValidationError) {
    // Handle validation error
  } else {
    // Handle other errors
  }
}
```

## Integration with NoSQL

The document module is designed to work seamlessly with other NoSQL modules:

```typescript
// Unified database instance
import { NoSQLDB } from '../index';

const vectorDB = new NoSQLDB({
  name: 'unified_db',
  d1Database: env.DB,
  kvStore: env.KV,
  r2Bucket: env.BUCKET
});

// Access document operations
const documentDB = vectorDB.documents();
await documentDB.insertOne('content', document);

// Access vector operations
const vectorStore = vectorDB.vectors();
await vectorStore.addVectors(vectors);

// Hybrid operations
const results = await documentDB.hybridSearch('content', {
  text: 'search query',
  vector: queryVector
});
```

## Migration from MongoDB

The document module provides a migration-friendly API:

```typescript
// MongoDB equivalent operations
const collection = db.collection('users');

// NoSQL
const users = await db.find('users', filter, options);
const user = await db.findOne('users', filter);
const result = await db.insertOne('users', document);
const updateResult = await db.updateMany('users', filter, update);
const deleteResult = await db.deleteOne('users', filter);

// Aggregation pipelines work identically
const pipeline = [
  { $match: { active: true } },
  { $group: { _id: '$department', count: { $sum: 1 } } }
];
const results = await db.aggregate('users', pipeline);
```

## Monitoring and Observability

```typescript
// Performance metrics
const metrics = db.queryEngine.getQueryMetrics();
const slowQueries = metrics.filter(m => m.latency > 100);

// Index recommendations
const recommendations = await db.getIndexRecommendations('collection');

// Database health
const isHealthy = await db.ping();

// Resource usage
const cacheStats = db.hybridSearchEngine.getSearchCacheStats();
const populateStats = db.relationshipManager.getPopulateCacheStats();
```

## License

Part of NoSQL - distributed under the same license as the main project.
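
## Appendix: Weighted Score Fusion Sketch

The `weights` option passed to `hybridSearch` (`{ text: 0.4, vector: 0.6, metadata: 0.0 }`) suggests that per-signal scores are blended into a single ranking score. The module's actual scoring logic is not documented here, so the following is only a minimal sketch of one common approach, a normalized weighted sum. It assumes each signal score is already scaled to the range [0, 1]. The names `SignalScores`, `ScoredDoc`, `combineScores`, and `rankByHybridScore` are hypothetical and are not part of the module's API.

```typescript
// Hypothetical illustration only: not the module's real implementation.

// Per-signal scores for one document, each assumed to lie in [0, 1].
interface SignalScores {
  text: number;
  vector: number;
  metadata: number;
}

// A candidate document together with its per-signal scores.
interface ScoredDoc {
  id: string;
  scores: SignalScores;
}

// Blend the signals with a weighted sum, dividing by the weight total so
// the result stays in [0, 1] even when the weights do not sum to 1.
function combineScores(scores: SignalScores, weights: SignalScores): number {
  const total = weights.text + weights.vector + weights.metadata;
  if (total <= 0) {
    throw new Error('At least one weight must be positive');
  }
  const weighted =
    weights.text * scores.text +
    weights.vector * scores.vector +
    weights.metadata * scores.metadata;
  return weighted / total;
}

// Rank candidates by combined score, highest first.
function rankByHybridScore(
  docs: ScoredDoc[],
  weights: SignalScores
): Array<{ id: string; score: number }> {
  return docs
    .map((doc) => ({ id: doc.id, score: combineScores(doc.scores, weights) }))
    .sort((a, b) => b.score - a.score);
}

// Example with the weights used in the Quick Start hybrid search call.
const exampleWeights: SignalScores = { text: 0.4, vector: 0.6, metadata: 0.0 };
const ranked = rankByHybridScore(
  [
    { id: 'a', scores: { text: 0.9, vector: 0.2, metadata: 0.0 } },
    { id: 'b', scores: { text: 0.3, vector: 0.8, metadata: 0.0 } }
  ],
  exampleWeights
);
console.log(ranked);
```

With these weights, document `b` outranks `a` despite its weaker text match, because vector similarity carries more weight. Dividing by the weight total is a design choice in this sketch that keeps scores comparable across different weight settings.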