UNPKG

@n2flowjs/nbase

Version:

Neural Vector Database for efficient similarity search

618 lines (505 loc) 15.2 kB
# Advanced Features NBase provides powerful advanced features for complex vector operations, large-scale processing, and specialized search algorithms. This guide covers relationship extraction, community detection, hybrid search, batch processing, and performance optimization techniques. ## Vector Relationships ### Extracting Relationships Find relationships between vectors based on distance thresholds: ```javascript // Extract relationships with distance threshold const relationships = await db.extractRelationships(0.3, { metric: 'cosine', includeMetadata: true, partitionIds: ['p1', 'p2'] // Limit to specific partitions }); console.log('Relationships found:', relationships.length); // Output: [ // { // vector1: { id: 'doc1', partitionId: 'p1', metadata: {...} }, // vector2: { id: 'doc2', partitionId: 'p1', metadata: {...} }, // distance: 0.25 // }, // ... // ] ``` ### Relationship Analysis ```javascript // Find strongly related documents const strongRelationships = relationships.filter(rel => rel.distance < 0.2); // Group relationships by partition const relationshipsByPartition = relationships.reduce((acc, rel) => { const pid = rel.vector1.partitionId; if (!acc[pid]) acc[pid] = []; acc[pid].push(rel); return acc; }, {}); ``` ## Community Detection ### Finding Communities Extract clusters (communities) of closely related vectors: ```javascript // Detect communities with distance threshold const communities = await db.extractCommunities(0.5, { metric: 'euclidean', includeMetadata: true, partitionIds: ['p1', 'p2', 'p3'] }); console.log(`Found ${communities.length} communities`); // Output: [ // [ // { id: 'doc1', partitionId: 'p1', metadata: {...} }, // { id: 'doc2', partitionId: 'p1', metadata: {...} }, // { id: 'doc5', partitionId: 'p2', metadata: {...} } // ], // [ // { id: 'doc3', partitionId: 'p1', metadata: {...} }, // { id: 'doc4', partitionId: 'p1', metadata: {...} } // ], // ... // ] ``` ### Community Analysis ```javascript // Analyze community characteristics communities.forEach((community, index) => { const avgMetadata = analyzeCommunityMetadata(community); console.log(`Community ${index}: ${community.length} vectors, avg score: ${avgMetadata.avgScore}`); }); // Find largest communities const largestCommunities = communities .sort((a, b) => b.length - a.length) .slice(0, 5); ``` ## Hybrid Search ### Multi-Modal Search Combine different search methods for improved results: ```javascript // Hybrid search with multiple algorithms const results = await db.search(queryVector, { k: 10, searchMethod: 'hybrid', methods: ['hnsw', 'lsh'], // Combine HNSW and LSH weights: { hnsw: 0.7, lsh: 0.3 }, useParallelExecution: true }); ``` ### Custom Scoring ```javascript // Implement custom hybrid scoring const hnswResults = await db.search(queryVector, { k: 20, useHNSW: true, includeMetadata: true }); const lshResults = await db.search(queryVector, { k: 20, useHNSW: false, // Use flat/LSH includeMetadata: true }); // Combine and rerank results const combinedResults = combineResults(hnswResults, lshResults, { hnswWeight: 0.6, lshWeight: 0.4, diversityFactor: 0.1 }); ``` ## Batch Processing ### Batch Search Process multiple search queries efficiently: ```javascript // Prepare batch queries const queries = [ { query: vector1, k: 5, options: { includeMetadata: true, distanceMetric: 'cosine', useHNSW: true } }, { query: vector2, k: 10, options: { includeMetadata: true, filter: (id, metadata) => metadata.category === 'science' } }, { query: vector3, k: 3, options: { distanceMetric: 'euclidean', efSearch: 200 } } ]; // Execute batch search const batchResults = await db.batchSearch(queries, { maxBatchSize: 50, // Process in batches of 50 prioritizeOrder: true, // Maintain input order groupSimilarQueries: true, // Optimize similar queries defaultSearchTimeout: 5000 // 5 second timeout per query }); console.log(`Processed ${batchResults.length} query sets`); ``` ### Batch Operations ```javascript // Bulk vector operations const vectors = Array.from({ length: 1000 }, (_, i) => ({ id: `vec-${i}`, vector: generateRandomVector(1536), metadata: { batch: 'test-batch', index: i } })); // Add in bulk const addResult = await db.bulkAdd(vectors); console.log(`Added ${addResult.count} vectors to partitions: ${addResult.partitionIds.join(', ')}`); // Bulk metadata updates const updates = vectors.map(v => ({ id: v.id, metadata: { ...v.metadata, processed: true, timestamp: Date.now() } })); for (const update of updates) { await db.updateMetadata(update.id, update.metadata); } ``` ## Advanced Filtering ### Metadata Filtering ```javascript // Simple metadata filter const results = await db.search(queryVector, { k: 10, includeMetadata: true, filter: (id, metadata) => { return metadata.category === 'technology' && metadata.rating >= 4.5 && metadata.published > '2023-01-01'; } }); ``` ### Complex Filter Objects ```javascript // Advanced filter configuration const complexFilter = { category: { $in: ['article', 'blog', 'research'] }, rating: { $gte: 4.0 }, tags: { $regex: '^tech.*' }, published: { $gt: '2023-06-01' }, author: { $ne: 'Anonymous' }, views: { $between: [1000, 10000] } }; const results = await db.search(queryVector, { k: 20, filters: complexFilter, includeMetadata: true }); ``` ### Custom Filter Functions ```javascript // Custom filter with complex logic const customFilter = (id, metadata) => { // Time-based filtering const age = Date.now() - new Date(metadata.published).getTime(); const isRecent = age < 365 * 24 * 60 * 60 * 1000; // 1 year // Content-based filtering const hasRequiredTags = metadata.tags?.some(tag => ['machine-learning', 'ai', 'neural-networks'].includes(tag) ); // Quality filtering const isHighQuality = metadata.rating > 4.0 && metadata.views > 500; return isRecent && hasRequiredTags && isHighQuality; }; const results = await db.search(queryVector, { k: 15, filter: customFilter, includeMetadata: true }); ``` ## Index Management ### Advanced Index Building ```javascript // Build indexes with progress tracking await db.buildIndexes({ progressCallback: (progress) => { console.log(`Index building: ${progress}% complete`); }, dimensionAware: true, force: true // Rebuild even if exists }); // Build specific index types await db.buildIndexes({ indexType: 'hnsw', hnswOptions: { M: 32, // More connections for better accuracy efConstruction: 400, // Higher build quality dimensionAware: true } }); ``` ### Index Optimization ```javascript // Optimize indexes for specific workloads const optimizationConfig = { // For high-accuracy search highAccuracy: { M: 24, efConstruction: 300, efSearch: 200 }, // For fast search highSpeed: { M: 12, efConstruction: 100, efSearch: 50 }, // For memory efficiency memoryOptimized: { M: 8, efConstruction: 50, efSearch: 32 } }; // Apply optimization await db.buildIndexes({ ...optimizationConfig.highAccuracy, progressCallback: console.log }); ``` ### Index Statistics and Monitoring ```javascript // Get detailed index statistics const stats = await db.getStats(); const indexStats = stats.database.indices; console.log('Index Statistics:'); console.log(`HNSW Loaded: ${indexStats.hnswLoadedCount}/${indexStats.totalPartitions}`); console.log('HNSW Details:'); Object.entries(indexStats.hnswStats).forEach(([partitionId, hnswStat]) => { console.log(` ${partitionId}: ${hnswStat.totalNodes} nodes, ${hnswStat.levels} levels`); }); ``` ## Performance Monitoring ### Comprehensive Monitoring Setup ```typescript const db = new Database({ monitoring: { enable: true, intervalMs: 5000, // Collect every 5 seconds logToConsole: true, enableSystemMetrics: true, // CPU, memory, disk enableSearchMetrics: true, // Query performance enableDatabaseMetrics: true, // Vector/index stats enableCacheMetrics: true // Cache hit rates } }); ``` ### Real-time Performance Tracking ```javascript // Monitor search performance db.on('search:complete', (event) => { console.log(`Search completed in ${event.totalTime}ms`); console.log(`Results: ${event.resultCount}, K: ${event.kRequested}`); }); // Monitor system resources setInterval(async () => { const metrics = await db.getStats(); console.log('System Metrics:'); console.log(`CPU: ${(metrics.system.cpuUsage * 100).toFixed(1)}%`); console.log(`Memory: ${(metrics.system.memoryUsage * 100).toFixed(1)}%`); console.log(`Search Cache: ${metrics.searchCache.hitRate.toFixed(1)}% hit rate`); console.log('Database Metrics:'); console.log(`Total Vectors: ${metrics.database.vectors.totalConfigured}`); console.log(`Active Partitions: ${metrics.database.partitions.loadedCount}`); }, 10000); ``` ### Performance Profiling ```typescript // Profile search operations async function profileSearch(query, options, iterations = 100) { const times = []; for (let i = 0; i < iterations; i++) { const start = Date.now(); await db.search(query, options); times.push(Date.now() - start); } const avg = times.reduce((a, b) => a + b) / times.length; const p95 = times.sort((a, b) => a - b)[Math.floor(times.length * 0.95)]; return { avg, p95, min: Math.min(...times), max: Math.max(...times) }; } // Profile different configurations const configs = [ { useHNSW: true, efSearch: 100 }, { useHNSW: true, efSearch: 200 }, { useHNSW: false } // Flat search ]; for (const config of configs) { const profile = await profileSearch(queryVector, { k: 10, ...config }); console.log(`${JSON.stringify(config)}: ${profile.avg.toFixed(1)}ms avg, ${profile.p95}ms p95`); } ``` ## Custom Distance Metrics ### Implementing Custom Metrics ```typescript // Custom distance function function customDistance(a, b) { // Implement your custom distance calculation // Example: Weighted cosine with custom weighting let dotProduct = 0; let normA = 0; let normB = 0; const weights = [1.0, 0.8, 0.6, ...]; // Custom weights per dimension for (let i = 0; i < a.length; i++) { const weight = weights[i] || 1.0; dotProduct += (a[i] * b[i]) * weight; normA += (a[i] * a[i]) * weight; normB += (b[i] * b[i]) * weight; } return 1 - (dotProduct / (Math.sqrt(normA) * Math.sqrt(normB))); } // Use custom metric (requires custom implementation) const results = await db.search(queryVector, { k: 10, distanceMetric: customDistance }); ``` ## Memory Optimization ### Large Dataset Handling ```typescript // Optimize for large datasets const db = new Database({ partitioning: { partitionCapacity: 50000, // Smaller partitions maxActivePartitions: 2, // Limit memory usage autoCreatePartitions: true }, clustering: { useCompression: true, // Enable compression clusterSize: 2000 // Smaller clusters }, cacheSize: 1000, // Smaller cache indexing: { hnswOptions: { M: 12, // Fewer connections efConstruction: 100 // Faster building } } }); ``` ### Memory Usage Monitoring ```typescript // Monitor memory usage function monitorMemory() { const usage = process.memoryUsage(); console.log(`RSS: ${(usage.rss / 1024 / 1024).toFixed(1)}MB`); console.log(`Heap Used: ${(usage.heapUsed / 1024 / 1024).toFixed(1)}MB`); console.log(`Heap Total: ${(usage.heapTotal / 1024 / 1024).toFixed(1)}MB`); console.log(`External: ${(usage.external / 1024 / 1024).toFixed(1)}MB`); } // Monitor every minute setInterval(monitorMemory, 60000); ``` ## Advanced Persistence ### Custom Backup Strategy ```typescript // Implement custom backup with metadata async function createIntelligentBackup(backupPath, options = {}) { const { includeMetadata = true, compressVectors = true, maxPartitionSize = 100000, parallelPartitions = 2 } = options; // Get database state const stats = await db.getStats(); // Create backup directory const backupDir = path.join(backupPath, `backup-${Date.now()}`); await fs.mkdir(backupDir, { recursive: true }); // Backup partitions in parallel const partitionPromises = stats.database.partitions.configs .filter(p => p.active) .slice(0, parallelPartitions) .map(async (partition) => { const partitionBackup = await backupPartition(partition.id, backupDir, { includeMetadata, compressVectors }); return partitionBackup; }); const partitionBackups = await Promise.all(partitionPromises); // Create backup manifest const manifest = { timestamp: new Date().toISOString(), version: '1.0', database: { vectorCount: stats.database.vectors.totalConfigured, partitionCount: stats.database.partitions.totalConfigured }, partitions: partitionBackups, options: { includeMetadata, compressVectors, maxPartitionSize } }; await fs.writeFile( path.join(backupDir, 'manifest.json'), JSON.stringify(manifest, null, 2) ); return backupDir; } ``` ## Integration Patterns ### Microservices Integration ```javascript // Vector service client class VectorServiceClient { constructor(baseURL) { this.baseURL = baseURL; } async search(query, options = {}) { const response = await fetch(`${this.baseURL}/search`, { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ query, ...options }) }); return response.json(); } async addVectors(vectors) { const response = await fetch(`${this.baseURL}/vectors`, { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ vectors }) }); return response.json(); } } // Usage const vectorClient = new VectorServiceClient('http://localhost:1307'); const results = await vectorClient.search(queryVector, { k: 10 }); ``` ### Streaming and Pagination ```javascript // Implement streaming search for large result sets async function* streamingSearch(query, options = {}) { const batchSize = options.batchSize || 100; let offset = 0; while (true) { const batch = await db.search(query, { ...options, k: batchSize, offset: offset }); if (batch.length === 0) break; yield batch; offset += batch.length; // Prevent infinite loops if (offset > options.maxResults) break; } } // Usage for await (const batch of streamingSearch(queryVector, { k: 1000, maxResults: 10000 })) { console.log(`Processing ${batch.length} results...`); // Process batch } ``` These advanced features provide powerful capabilities for complex vector operations, large-scale processing, and specialized search requirements. Experiment with different configurations to optimize performance for your specific use case.