@andrejs1979/document
MongoDB-compatible document database for NoSQL
# NoSQL - Document Module
A MongoDB-compatible document database built for Cloudflare's edge infrastructure (D1, KV, and R2), with vector search integration, automatic indexing, and real-time document streams.
## Features
### **MongoDB Compatibility**
- Full MongoDB query language support
- CRUD operations (Create, Read, Update, Delete)
- Advanced aggregation pipelines
- Complex query operators ($and, $or, $in, $regex, etc.)
- GridFS-like large document support with R2 integration
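To make the operator list concrete, here is a minimal in-memory sketch of how a filter built from `$and`, `$or`, `$in`, `$regex`, and `$gte` can be evaluated against a plain object. The `matches` and `matchesField` functions and the `Doc` and `Filter` types are hypothetical names written for illustration; they are not the module's query engine and cover only a handful of operators.

```typescript
type Doc = Record<string, unknown>;
type Filter = Record<string, unknown>;

// Evaluate a single field condition: either a literal or an operator object.
function matchesField(value: unknown, condition: unknown): boolean {
  const isOperatorObject =
    typeof condition === 'object' && condition !== null && !Array.isArray(condition);
  if (!isOperatorObject) {
    // Literal condition: an array field matches when it contains the literal.
    return Array.isArray(value) ? value.indexOf(condition) !== -1 : value === condition;
  }
  const ops = condition as Record<string, unknown>;
  return Object.keys(ops).every((op) => {
    const arg = ops[op];
    switch (op) {
      case '$in': {
        const candidates = arg as unknown[];
        const values: unknown[] = Array.isArray(value) ? value : [value];
        return values.some((v) => candidates.indexOf(v) !== -1);
      }
      case '$gte':
        return typeof value === 'number' && typeof arg === 'number' && value >= arg;
      case '$regex':
        return typeof value === 'string' && new RegExp(String(arg)).test(value);
      default:
        throw new Error('Operator not covered by this sketch: ' + op);
    }
  });
}

// Top-level keys are implicitly ANDed; $and / $or take arrays of sub-filters.
function matches(doc: Doc, filter: Filter): boolean {
  return Object.keys(filter).every((key) => {
    const cond = filter[key];
    if (key === '$and') return (cond as Filter[]).every((f) => matches(doc, f));
    if (key === '$or') return (cond as Filter[]).some((f) => matches(doc, f));
    return matchesField(doc[key], cond);
  });
}

const people: Doc[] = [
  { name: 'Ada', tags: ['developer'], age: 36 },
  { name: 'Bob', tags: ['designer'], age: 29 },
];
const selected = people.filter((p) =>
  matches(p, {
    $or: [{ tags: { $in: ['developer'] } }, { name: { $regex: '^Bo' } }],
    age: { $gte: 30 },
  })
);
```

Both people satisfy the `$or` branch, but only Ada also satisfies the `age` condition, so `selected` holds a single document.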
### **Hybrid Search**
- Text search with full-text indexing
- Vector similarity search
- Semantic search using embeddings
- Multi-modal search (text, image, audio)
- Personalized recommendations
- Similar document discovery
### **Relationships**
- Define relationships between collections
- Automatic population of related documents
- One-to-one, one-to-many, many-to-many relationships
- Cascade operations with referential integrity
- Deep population with configurable depth
### **Performance**
- Intelligent auto-indexing based on query patterns
- Dynamic field indexing for optimal performance
- Query plan optimization and explanation
- Caching with TTL support
- Connection pooling and batch operations
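As an illustration of what "caching with TTL support" can mean in practice, the sketch below implements a small time-to-live cache. `TtlCache` is a hypothetical class, not part of the module; the 300-second TTL mirrors the documented `queryCacheTTL` default, and the injectable clock is an assumption added so that expiry can be demonstrated without waiting.

```typescript
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  // `now` is injectable so expiry can be exercised deterministically.
  constructor(
    private ttlMs: number,
    private now: () => number = () => Date.now()
  ) {}

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // lazily evict expired entries on read
      return undefined;
    }
    return entry.value;
  }
}

// Usage with a fake clock: cache a query result for 300 seconds.
let fakeTime = 0;
const queryCache = new TtlCache<string[]>(300 * 1000, () => fakeTime);
queryCache.set('users:active', ['u1', 'u2']);
const freshHit = queryCache.get('users:active');
fakeTime = 300 * 1000; // advance the clock to the expiry instant
const afterExpiry = queryCache.get('users:active');
```

Expired entries are removed lazily on read here; a production cache would typically also bound its size, for example with LRU eviction.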
### **Bulk Operations**
- High-performance bulk writes
- Streaming inserts for large datasets
- Real-time document streams
- Parallel processing with configurable concurrency
- Error handling and recovery
### **Smart Tagging**
- Automatic content-based tagging
- Hierarchical tag systems
- Tag recommendations and suggestions
- Bulk tagging operations
- Tag analytics and cleanup
### **Analytics & Monitoring**
- Real-time performance metrics
- Query pattern analysis
- Index usage statistics
- Document lifecycle tracking
- Health monitoring and diagnostics
## Quick Start
### Installation
```typescript
import { EdgeDocumentDB } from './src/document';

// Create database instance
const db = await EdgeDocumentDB.create({
  name: 'my_app_db',
  d1Database: env.DB,    // Cloudflare D1 binding
  kvStore: env.KV,       // Cloudflare KV binding
  r2Bucket: env.BUCKET,  // Cloudflare R2 binding
  options: {
    enableAutoIndexing: true,
    enableRelationships: true,
    vectorConfig: {
      enabled: true,
      autoEmbedding: true
    }
  }
});
```
### Basic Operations
```typescript
// Insert documents
const user = await db.insertOne('users', {
  name: 'John Doe',
  email: 'john@example.com',
  tags: ['developer', 'javascript']
});

// Query documents
const users = await db.find('users', {
  tags: { $in: ['developer'] },
  createdAt: { $gte: new Date('2024-01-01') }
}, {
  sort: { name: 1 },
  limit: 10
});

// Update documents
await db.updateOne('users',
  { email: 'john@example.com' },
  { $set: { lastLogin: new Date() } }
);

// Aggregation pipeline
const stats = await db.aggregate('users', [
  { $match: { active: true } },
  { $group: { _id: '$department', count: { $sum: 1 } } },
  { $sort: { count: -1 } }
]);
```
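For readers new to aggregation pipelines, the following sketch reproduces the three stages used above (`$match`, `$group` with `$sum: 1`, and `$sort`) over an in-memory array. The `Employee` shape and the `countActiveByDepartment` function are hypothetical and only illustrate the semantics of the stages, not how the module executes them.

```typescript
interface Employee {
  name: string;
  department: string;
  active: boolean;
}

interface GroupCount {
  _id: string;
  count: number;
}

function countActiveByDepartment(docs: Employee[]): GroupCount[] {
  // $match: { active: true }
  const matched = docs.filter((d) => d.active);

  // $group: { _id: '$department', count: { $sum: 1 } }
  const counts: Record<string, number> = {};
  for (const d of matched) {
    counts[d.department] = (counts[d.department] || 0) + 1;
  }

  // $sort: { count: -1 }
  return Object.keys(counts)
    .map((department) => ({ _id: department, count: counts[department] }))
    .sort((a, b) => b.count - a.count);
}

const staff: Employee[] = [
  { name: 'Ada', department: 'ops', active: true },
  { name: 'Bob', department: 'eng', active: true },
  { name: 'Cy', department: 'eng', active: true },
  { name: 'Di', department: 'eng', active: false },
];
const departmentStats = countActiveByDepartment(staff);
```

The inactive employee is dropped by the match step, so `eng` is counted twice and `ops` once, with `eng` sorted first.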
### Hybrid Search
```typescript
// Text search
const articles = await db.textSearch('content', 'machine learning', {
  filters: { category: 'tech' },
  limit: 10
});

// Vector search
const similar = await db.vectorSearch('content', embeddings, {
  threshold: 0.7,
  limit: 5
});

// Semantic search
const semantic = await db.semanticSearch('content',
  'artificial intelligence neural networks', {
    textWeight: 0.3,
    vectorWeight: 0.7
  }
);

// Hybrid search combining multiple signals
const hybrid = await db.hybridSearch('content', {
  text: 'react development',
  vector: queryEmbedding,
  filter: { publishedAt: { $gte: recentDate } },
  weights: { text: 0.4, vector: 0.6, metadata: 0.0 }
});
```
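One common way to combine text and vector signals with weights like those above is a weighted sum of per-signal scores. The sketch below assumes that approach; the README does not state how the module actually scores results, and `cosineSimilarity`, `hybridScore`, and `PerSignal` are hypothetical names. Each signal is assumed to be normalized to the range 0 to 1.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface PerSignal {
  text: number;
  vector: number;
  metadata: number;
}

// Weighted sum of per-signal scores.
function hybridScore(signals: PerSignal, weights: PerSignal): number {
  return (
    signals.text * weights.text +
    signals.vector * weights.vector +
    signals.metadata * weights.metadata
  );
}

const queryVector = [1, 0];
const candidates = [
  { id: 'a', textScore: 0.9, vector: [0, 1] }, // strong text match, unrelated vector
  { id: 'b', textScore: 0.5, vector: [1, 0] }, // weaker text match, identical vector
];
const signalWeights: PerSignal = { text: 0.4, vector: 0.6, metadata: 0.0 };

const ranked = candidates
  .map((c) => ({
    id: c.id,
    score: hybridScore(
      { text: c.textScore, vector: cosineSimilarity(queryVector, c.vector), metadata: 0 },
      signalWeights
    ),
  }))
  .sort((x, y) => y.score - x.score);
```

With a vector weight of 0.6, candidate `b` outranks `a` even though its text score is lower, which is the trade-off the weights control.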
### Relationships
```typescript
// Define relationships
await db.defineRelationship('posts', 'users', {
  type: 'manyToOne',
  localField: 'authorId',
  foreignField: '_id',
  foreignCollection: 'users'
});

// Query with population
const posts = await db.findWithPopulate('posts', {}, [
  {
    path: 'authorId',
    select: 'name email avatar'
  },
  {
    path: 'comments',
    match: { approved: true },
    options: { sort: { createdAt: -1 }, limit: 5 }
  }
]);
```
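Conceptually, populating a many-to-one relationship replaces the stored foreign key with the referenced document. The sketch below shows that idea over in-memory arrays; `populateAuthors` and the `Author`, `Post`, and `PopulatedPost` types are hypothetical, and the choice to yield `null` for a dangling reference is an assumption rather than documented behaviour.

```typescript
interface Author {
  _id: string;
  name: string;
  email: string;
}

interface Post {
  _id: string;
  title: string;
  authorId: string;
}

interface PopulatedPost {
  _id: string;
  title: string;
  authorId: Author | null;
}

function populateAuthors(posts: Post[], authors: Author[]): PopulatedPost[] {
  // Index the foreign collection by its key once, then resolve each reference.
  const byId: Record<string, Author> = {};
  for (const a of authors) {
    byId[a._id] = a;
  }
  return posts.map((p) => ({
    _id: p._id,
    title: p.title,
    authorId: byId[p.authorId] || null, // dangling references become null
  }));
}

const authors: Author[] = [{ _id: 'u1', name: 'Ada', email: 'ada@example.com' }];
const rawPosts: Post[] = [
  { _id: 'p1', title: 'Hello', authorId: 'u1' },
  { _id: 'p2', title: 'Orphan', authorId: 'missing' },
];
const populated = populateAuthors(rawPosts, authors);
```

Building the lookup table first keeps the work linear in the number of posts plus authors, instead of scanning the author list once per post.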
### Bulk Operations
```typescript
// Bulk write operations
const bulkOps = [
  { insertOne: { document: { name: 'User 1' } } },
  { updateOne: { filter: { name: 'User 2' }, update: { $set: { active: true } } } },
  { deleteOne: { filter: { name: 'User 3' } } }
];
const result = await db.bulkWrite('users', bulkOps);

// Streaming inserts
async function* generateData() {
  for (let i = 0; i < 100000; i++) {
    yield { id: i, value: Math.random() };
  }
}

const streamResult = await db.streamInsert('analytics', generateData(), {
  batchSize: 1000,
  onProgress: (inserted, total) => console.log(`${inserted}/${total}`)
});
```
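A streaming insert with a `batchSize` presumably groups items from the generator into fixed-size batches before writing them. The helper below is a minimal sketch of that grouping step under that assumption; `inBatches`, `generateRows`, and `collectBatchSizes` are hypothetical names and do not reflect the module's internals.

```typescript
// Group items from an async source into arrays of at most `batchSize` items.
async function* inBatches<T>(
  source: AsyncIterable<T>,
  batchSize: number
): AsyncGenerator<T[]> {
  if (batchSize < 1) throw new Error('batchSize must be at least 1');
  let batch: T[] = [];
  for await (const item of source) {
    batch.push(item);
    if (batch.length >= batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}

async function* generateRows(count: number): AsyncGenerator<{ id: number; value: number }> {
  for (let i = 0; i < count; i++) {
    yield { id: i, value: Math.random() };
  }
}

// Consume the batches; a real writer would insert each batch here.
async function collectBatchSizes(total: number, batchSize: number): Promise<number[]> {
  const sizes: number[] = [];
  for await (const batch of inBatches(generateRows(total), batchSize)) {
    sizes.push(batch.length);
  }
  return sizes;
}
```

Because the source is consumed lazily, only one batch is held in memory at a time, which is the main reason to prefer a generator over building a full array first.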
### Indexing
```typescript
// Create indexes
await db.createIndex('products', {
  key: { category: 1, price: -1 },
  options: { name: 'category_price_idx' }
});

// Text index
await db.createIndex('articles', {
  key: { title: 'text', content: 'text' },
  options: { weights: { title: 10, content: 5 } }
});

// Vector index
await db.createIndex('embeddings', {
  key: { vector: 'vector' },
  options: {
    vectorOptions: {
      dimensions: 1536,
      similarity: 'cosine',
      type: 'hnsw'
    }
  }
});

// Auto-indexing
await db.autoCreateIndexes('products');

// Get recommendations
const recommendations = await db.getIndexRecommendations('products');
```
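Auto-indexing "based on query patterns" can be pictured as counting how often each field appears in observed filters and recommending the fields that cross a threshold. The sketch below assumes that heuristic; `recommendIndexFields` is a hypothetical function, and treating `autoIndexThreshold` as a usage count and `maxIndexedFields` as a cap on results is an interpretation of the configuration names, not documented behaviour.

```typescript
function recommendIndexFields(
  observedFilters: Array<Record<string, unknown>>,
  threshold: number,
  maxFields: number
): string[] {
  // Count how many observed filters reference each field.
  const usage: Record<string, number> = {};
  for (const filter of observedFilters) {
    for (const field of Object.keys(filter)) {
      if (field.charAt(0) === '$') continue; // skip logical operators such as $or
      usage[field] = (usage[field] || 0) + 1;
    }
  }
  // Keep fields at or above the threshold, most used first, capped at maxFields.
  return Object.keys(usage)
    .filter((field) => usage[field] >= threshold)
    .sort((a, b) => usage[b] - usage[a])
    .slice(0, maxFields);
}

const observed: Array<Record<string, unknown>> = [
  { category: 'electronics', price: { $gte: 100 } },
  { category: 'books' },
  { category: 'toys', price: { $gte: 5 } },
  { name: 'Widget' },
];
const suggestedFields = recommendIndexFields(observed, 2, 20);
```

Here `category` appears three times and `price` twice, so both are suggested, while `name` stays below the threshold of 2.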
### Tagging System
```typescript
// Auto-tag documents
const tags = await db.autoTag('articles', document, {
  tagSources: ['content', 'metadata'],
  customTagger: (doc) => {
    // Named differently from the outer `tags` to avoid shadowing it
    const customTags: string[] = [];
    if (doc.content?.includes('React')) customTags.push('react');
    return customTags;
  }
});

// Apply tags
await db.tagDocument('articles', documentId, tags);

// Find by tags
const taggedDocs = await db.findByTags('articles', ['react', 'typescript'], {
  operator: 'and',
  includeHierarchy: true
});

// Tag statistics
const tagStats = await db.getTagStats('articles', {
  sortBy: 'count',
  limit: 20
});
```
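The `operator` and `includeHierarchy` options suggest that a tag query can match on ancestor tags as well as exact tags. The sketch below shows one plausible interpretation, in which a document tagged `react` also matches a query for its parent `frontend`. The child-to-parent map, `withAncestors`, and `matchesTags` are all hypothetical; the README does not describe how hierarchies are stored or resolved.

```typescript
// Hypothetical hierarchy, stored as child -> parent.
const tagParents: Record<string, string> = {
  react: 'frontend',
  vue: 'frontend',
  frontend: 'web',
};

// Return the tag followed by all of its ancestors.
function withAncestors(tag: string, parents: Record<string, string>): string[] {
  const chain = [tag];
  let current = tag;
  while (Object.prototype.hasOwnProperty.call(parents, current)) {
    current = parents[current];
    if (chain.indexOf(current) !== -1) break; // guard against cycles
    chain.push(current);
  }
  return chain;
}

function matchesTags(
  docTags: string[],
  wanted: string[],
  operator: 'and' | 'or',
  parents: Record<string, string>
): boolean {
  // Expand the document's tags with their ancestors, without duplicates.
  const expanded: string[] = [];
  for (const tag of docTags) {
    for (const candidate of withAncestors(tag, parents)) {
      if (expanded.indexOf(candidate) === -1) expanded.push(candidate);
    }
  }
  const has = (tag: string): boolean => expanded.indexOf(tag) !== -1;
  return operator === 'and' ? wanted.every(has) : wanted.some(has);
}

const articleTags = ['react', 'typescript'];
const andMatch = matchesTags(articleTags, ['frontend', 'typescript'], 'and', tagParents);
const orMiss = matchesTags(articleTags, ['backend', 'vue'], 'or', tagParents);
```

In this sketch `andMatch` is true because `react` implies `frontend`, while `orMiss` is false because a sibling tag such as `vue` is not an ancestor of `react`.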
## Advanced Features
### Vector Integration
The document module integrates with NoSQL's vector capabilities: documents can carry an embedded vector, and embeddings can be generated automatically for configured fields.
```typescript
// Documents with embedded vectors
const document = {
  title: 'AI Research Paper',
  content: 'Latest advances in machine learning...',
  _vector: {
    id: 'doc1',
    data: new Float32Array([0.1, 0.2, 0.3 /* ... */]), // 1536 dimensions
    metadata: { model: 'text-embedding-ada-002' }
  }
};

// Automatic embedding generation
const db = await EdgeDocumentDB.create({
  name: 'ai_db',
  d1Database: env.DB,
  options: {
    vectorConfig: {
      enabled: true,
      autoEmbedding: true,
      embeddingFields: ['content', 'title'],
      defaultModel: 'text-embedding-ada-002'
    }
  }
});
```
### Real-time Streams
```typescript
// Create real-time document stream
const stream = db.createDocumentStream('events', {
  batchSize: 100,
  flushInterval: 5000,
  compression: true,
  transform: (doc) => ({
    ...doc,
    processedAt: new Date()
  }),
  errorHandler: (error, batch) => {
    console.error('Stream error:', error);
    // Implement retry logic or dead letter queue
  }
});

// Write to stream
await stream.write({
  event: 'user_action',
  userId: 'user123',
  action: 'click',
  timestamp: new Date()
});

// Stop stream
await stream.stop();
```
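The stream options above (`batchSize`, `transform`, and a periodic flush) describe a buffered writer. The class below is a simplified, synchronous sketch of that pattern and is an assumption about the general technique, not the module's implementation. `BufferedWriter` and its `sink` callback are hypothetical, and the timer-driven `flushInterval` is left out; the explicit `flush()` call stands in for it.

```typescript
class BufferedWriter<T> {
  private buffer: T[] = [];

  constructor(
    private batchSize: number,
    private transform: (doc: T) => T,
    private sink: (batch: T[]) => void
  ) {}

  // Transform and buffer a document; flush as soon as the batch is full.
  write(doc: T): void {
    this.buffer.push(this.transform(doc));
    if (this.buffer.length >= this.batchSize) this.flush();
  }

  // Hand the buffered documents to the sink and start a new batch.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.sink(batch);
  }
}

interface ClickEvent {
  userId: string;
  action: string;
  processed?: boolean;
}

const flushedBatches: ClickEvent[][] = [];
const writer = new BufferedWriter<ClickEvent>(
  2,
  (event) => ({ ...event, processed: true }),
  (batch) => {
    flushedBatches.push(batch);
  }
);

writer.write({ userId: 'u1', action: 'click' });
writer.write({ userId: 'u2', action: 'click' }); // reaches batchSize and flushes
writer.write({ userId: 'u3', action: 'scroll' });
writer.flush(); // flush the remainder, as a stop or interval tick would
```

Swapping the buffer before calling the sink means a sink that writes back into the writer cannot corrupt the batch being flushed.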
### Performance Monitoring
```typescript
// Database statistics
const stats = await db.stats();
console.log('Total documents:', stats.totalDocuments);
console.log('Index count:', stats.indexCount);
// Query performance
const queryMetrics = db.queryEngine.getQueryMetrics();
const slowQueries = queryMetrics.filter(m => m.latency > 100);
// Index usage
const indexStats = await db.indexManager.getIndexStats('mydb', 'collection');
console.log('Unused indexes:', indexStats.recommendations);
```
## Configuration Options
```typescript
interface DocumentDatabaseConfig {
  name: string;
  d1Database: any;                     // Cloudflare D1 instance
  kvStore?: any;                       // Cloudflare KV store
  r2Bucket?: any;                      // Cloudflare R2 bucket

  // Performance settings
  maxDocumentSize?: number;            // Default: 16MB
  queryTimeout?: number;               // Default: 30s
  batchSize?: number;                  // Default: 100

  // Caching
  enableQueryCache?: boolean;          // Default: true
  queryCacheTTL?: number;              // Default: 300s
  cacheSize?: number;                  // Default: 100MB

  // Indexing
  enableAutoIndexing?: boolean;        // Default: true
  autoIndexThreshold?: number;         // Default: 1000
  maxIndexedFields?: number;           // Default: 20

  // Vector integration
  vectorConfig?: {
    enabled?: boolean;                 // Default: true
    defaultDimensions?: number;        // Default: 1536
    defaultModel?: string;             // Default: 'text-embedding-ada-002'
    autoEmbedding?: boolean;           // Default: false
    embeddingFields?: string[];        // Default: ['content', 'text']
  };

  // Features
  enableValidation?: boolean;          // Default: true
  enableSchemaEvolution?: boolean;     // Default: true
  enableChangeStreams?: boolean;       // Default: true
  enableRelationships?: boolean;       // Default: true
  enableQueryLogging?: boolean;        // Default: false
  enablePerformanceMetrics?: boolean;  // Default: true

  // Bulk operations
  bulkWriteBatchSize?: number;         // Default: 1000
  bulkWriteParallelism?: number;       // Default: 4
}
```
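Because every tuning field is optional, something has to merge user-supplied values over the documented defaults. The sketch below shows one way to do that for a small subset of the fields, including the nested `vectorConfig`. `resolveOptions`, `ResolvedOptions`, `UserOptions`, and `DEFAULT_OPTIONS` are hypothetical names; only the default values are taken from the interface comments above.

```typescript
interface VectorOptions {
  enabled: boolean;
  defaultDimensions: number;
  autoEmbedding: boolean;
}

interface ResolvedOptions {
  batchSize: number;
  enableAutoIndexing: boolean;
  autoIndexThreshold: number;
  vectorConfig: VectorOptions;
}

interface UserOptions {
  batchSize?: number;
  enableAutoIndexing?: boolean;
  autoIndexThreshold?: number;
  vectorConfig?: Partial<VectorOptions>;
}

// Defaults copied from the comments in DocumentDatabaseConfig.
const DEFAULT_OPTIONS: ResolvedOptions = {
  batchSize: 100,
  enableAutoIndexing: true,
  autoIndexThreshold: 1000,
  vectorConfig: { enabled: true, defaultDimensions: 1536, autoEmbedding: false },
};

function resolveOptions(user: UserOptions): ResolvedOptions {
  return {
    ...DEFAULT_OPTIONS,
    ...user,
    // Merge the nested object separately; a shallow spread alone would
    // replace the whole vectorConfig and drop its unspecified defaults.
    vectorConfig: { ...DEFAULT_OPTIONS.vectorConfig, ...user.vectorConfig },
  };
}

const resolved = resolveOptions({ batchSize: 500, vectorConfig: { autoEmbedding: true } });
```

The nested merge is the part that is easy to get wrong: overriding only `autoEmbedding` should leave `enabled` and `defaultDimensions` at their defaults.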
## Architecture
The document module is built with a modular architecture:
```
src/document/
├── edge-document-db.ts            # Main database class
├── types.ts                       # Type definitions
├── storage/
│   └── document-storage.ts        # Core storage engine
├── operations/
│   ├── query-engine.ts            # MongoDB query processing
│   ├── hybrid-search.ts           # Hybrid search engine
│   └── bulk-operations.ts         # Bulk and streaming operations
├── indexes/
│   └── index-manager.ts           # Intelligent indexing
├── relationships/
│   └── relationship-manager.ts    # Document relationships
├── metadata/
│   └── tagging-system.ts          # Smart tagging system
└── examples/
    └── basic-usage.ts             # Comprehensive examples
```
## Performance Characteristics
### Latency and Throughput Targets
- Simple queries: < 10ms p99
- Complex aggregations: < 100ms p99
- Bulk operations: 10,000+ docs/second
- Vector similarity: < 50ms p99
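The targets above are stated as p99 values, meaning that 99% of operations should complete within the stated time. The sketch below computes a percentile from recorded latencies using the nearest-rank method, which is one of several common definitions. `percentile` is a hypothetical helper, not part of the module's metrics API.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('No samples recorded');
  if (p <= 0 || p > 100) throw new Error('p must be in the range (0, 100]');
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// 100 synthetic latencies: 1 ms, 2 ms, ..., 100 ms.
const latenciesMs: number[] = [];
for (let i = 1; i <= 100; i++) {
  latenciesMs.push(i);
}

const p99 = percentile(latenciesMs, 99);
const meetsSimpleQueryTarget = p99 < 10; // compare against the "< 10ms p99" target
```

A percentile is more informative than an average for latency targets, because a small number of very slow requests can hide behind a healthy mean.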
### Scalability
- Documents: Unlimited (distributed across D1 + R2)
- Collections: Unlimited
- Indexes: 20 dynamic indexes per collection
- Concurrent operations: 1000+ per database
### Storage Efficiency
- Automatic compression for large documents
- Intelligent caching with LRU eviction
- Vector quantization for storage optimization
- Tiered storage (D1 for metadata, R2 for large docs)
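The README does not say which quantization scheme is used for vectors. As an illustration of the general idea, the sketch below applies simple symmetric 8-bit scalar quantization, which stores each component in one byte instead of four at the cost of a small rounding error. `quantize`, `dequantize`, and `QuantizedVector` are hypothetical names.

```typescript
interface QuantizedVector {
  codes: Int8Array; // one signed byte per component
  scale: number;    // multiply a code by this to approximate the original value
}

function quantize(vector: Float32Array): QuantizedVector {
  // Find the largest magnitude so that it maps to the code 127.
  let maxAbs = 0;
  for (let i = 0; i < vector.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(vector[i]));
  }
  const scale = maxAbs === 0 ? 1 : maxAbs / 127;
  const codes = new Int8Array(vector.length);
  for (let i = 0; i < vector.length; i++) {
    codes[i] = Math.round(vector[i] / scale);
  }
  return { codes, scale };
}

function dequantize(q: QuantizedVector): Float32Array {
  const restored = new Float32Array(q.codes.length);
  for (let i = 0; i < q.codes.length; i++) {
    restored[i] = q.codes[i] * q.scale;
  }
  return restored;
}

const originalVector = new Float32Array([0.5, -1, 0.25, 0]);
const quantized = quantize(originalVector);
const restoredVector = dequantize(quantized);
```

The component payload shrinks from 4 bytes to 1 byte each, and every restored component differs from the original by roughly half of `scale` at most.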
## Best Practices
### Query Optimization
```typescript
// Use indexes effectively
await db.find('products', {
  category: 'electronics',  // Indexed field
  price: { $gte: 100 }      // Indexed field
});

// Limit results
await db.find('products', filter, { limit: 20 });

// Use projection to reduce data transfer
await db.find('products', filter, {
  projection: { name: 1, price: 1 }
});

// Explain queries for optimization
const explanation = await db.explain('products', filter);
```
### Memory Management
```typescript
// Use streaming for large datasets
const stream = db.createDocumentStream('logs', {
  batchSize: 1000,
  flushInterval: 5000
});
// Clear caches periodically
db.hybridSearchEngine.clearSearchCache();
db.relationshipManager.clearPopulateCache();
```
### Error Handling
```typescript
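// DuplicateKeyError and ValidationError are assumed to be exported by the
// document module; import them from wherever your build exposes them.
try {
  await db.insertOne('users', document);
} catch (error) {
  if (error instanceof DuplicateKeyError) {
    // Handle duplicate key
  } else if (error instanceof ValidationError) {
    // Handle validation error
  } else {
    // Handle other errors
  }
}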
```
## Integration with NoSQL
The document module is designed to work seamlessly with other NoSQL modules:
```typescript
// Unified database instance
import { NoSQLDB } from '../index';

const vectorDB = new NoSQLDB({
  name: 'unified_db',
  d1Database: env.DB,
  kvStore: env.KV,
  r2Bucket: env.BUCKET
});

// Access document operations
const documentDB = vectorDB.documents();
await documentDB.insertOne('content', document);

// Access vector operations
const vectorStore = vectorDB.vectors();
await vectorStore.addVectors(vectors);

// Hybrid operations
const results = await documentDB.hybridSearch('content', {
  text: 'search query',
  vector: queryVector
});
```
## Migration from MongoDB
The document module provides a migration-friendly API:
```typescript
// MongoDB driver: obtain a collection handle, then call methods on it
const collection = db.collection('users');

// NoSQL document module: pass the collection name as the first argument
const users = await db.find('users', filter, options);
const user = await db.findOne('users', filter);
const result = await db.insertOne('users', document);
const updateResult = await db.updateMany('users', filter, update);
const deleteResult = await db.deleteOne('users', filter);

// Aggregation pipelines work identically
const pipeline = [
  { $match: { active: true } },
  { $group: { _id: '$department', count: { $sum: 1 } } }
];
const results = await db.aggregate('users', pipeline);
```
## Monitoring and Observability
```typescript
// Performance metrics
const metrics = db.queryEngine.getQueryMetrics();
const slowQueries = metrics.filter(m => m.latency > 100);
// Index recommendations
const recommendations = await db.getIndexRecommendations('collection');
// Database health
const isHealthy = await db.ping();
// Resource usage
const cacheStats = db.hybridSearchEngine.getSearchCacheStats();
const populateStats = db.relationshipManager.getPopulateCacheStats();
```
## License
Part of NoSQL - distributed under the same license as the main project.