# Contributing to NBase

## Overview

Welcome to NBase! We're excited to have you contribute to our high-performance vector database. This guide will help you get started with development, understand our processes, and make meaningful contributions to the project.

## Development Environment Setup

### Prerequisites

**Required:**

- **Node.js**: v16.0.0 or higher (v18 LTS recommended)
- **NPM**: v7.0.0 or higher (or Yarn v1.22.0+, PNPM v6.0.0+)
- **Git**: v2.25.0 or higher
- **TypeScript**: v4.5.0 or higher (installed automatically)

**Recommended:**

- **VS Code** with TypeScript and Node.js extensions
- **GitHub CLI** for streamlined workflow
- **Docker** for testing and deployment
- **Understanding of**: Vector databases, ANN algorithms, TypeScript, Node.js

### Quick Start

```bash
# Fork and clone the repository
git clone https://github.com/your-username/nbase.git
cd nbase

# Install dependencies
npm install

# Set up development environment
npm run setup:dev

# Run initial tests
npm test

# Start development server
npm run dev
```

### Development Scripts

```bash
# Core development
npm run build         # Build TypeScript to JavaScript
npm run dev           # Start development server with hot reload
npm run clean         # Clean build artifacts
npm run type-check    # Run TypeScript type checking

# Testing
npm test                  # Run all tests
npm run test:unit         # Run unit tests only
npm run test:integration  # Run integration tests
npm run test:watch        # Run tests in watch mode
npm run test:coverage     # Generate test coverage report

# Benchmarking
npm run benchmark         # Run all benchmarks
npm run benchmark:suite1  # Run large-scale benchmarks
npm run benchmark:suite2  # Run latency benchmarks

# Quality
npm run lint        # Run ESLint
npm run lint:fix    # Auto-fix linting issues
npm run format      # Format code with Prettier
npm run type-check  # TypeScript type checking

# Documentation
npm run docs        # Generate API documentation
npm run docs:serve  # Serve documentation locally
npm run docs:build  # Build documentation for deployment
```

## Project Structure

```
nbase/
├── src/                         # Source code
│   ├── index.ts                 # Main entry point
│   ├── types.ts                 # TypeScript type definitions
│   ├── ann/                     # Approximate Nearest Neighbor algorithms
│   │   ├── hnsw.ts              # HNSW implementation
│   │   └── lsh.ts               # LSH implementation
│   ├── compression/             # Vector compression algorithms
│   │   ├── index.ts
│   │   ├── kmeans.ts
│   │   └── product_quantization.ts
│   ├── config/                  # Configuration management
│   │   ├── default.ts
│   │   ├── factory.ts
│   │   └── index.ts
│   ├── database/                # Core database implementations
│   │   ├── database.ts          # Main database class
│   │   └── scripts/
│   │       └── service.ts
│   ├── search/                  # Search engine implementations
│   │   ├── batch_search.ts
│   │   ├── hybrid_search.ts
│   │   ├── knn_search.ts
│   │   ├── reranking.ts
│   │   └── unified_search.ts
│   ├── server/                  # REST API server
│   │   ├── index.ts
│   │   ├── middleware/
│   │   │   └── common.ts
│   │   └── routes/
│   │       ├── index.ts
│   │       ├── search.ts
│   │       └── vectors.ts
│   ├── utils/                   # Utility functions
│   │   ├── distance_metrics.ts
│   │   ├── log.ts
│   │   ├── profiling.ts
│   │   └── vector_monitoring.ts
│   └── vector/                  # Vector database implementations
│       ├── clustered_vector_db.ts
│       ├── index.ts
│       ├── partitioned_vector_db.ts
│       └── vector_db.ts
├── test/                        # Test files
│   ├── api/                     # API tests
│   ├── benchmarks/              # Performance benchmarks
│   ├── test-helpers/            # Test utilities
│   └── *.test.ts                # Unit and integration tests
├── docs/                        # Documentation
│   ├── api-reference.md
│   ├── getting-started.md
│   ├── search-techniques.md
│   └── ...
├── database/                    # Runtime data storage
│   └── partitions/              # Partitioned data
├── node_modules/                # Dependencies (generated)
├── package.json                 # Package configuration
├── tsconfig.json                # TypeScript configuration
├── eslint.config.mjs            # ESLint configuration
├── nodemon.json                 # Development server config
└── typedoc.json                 # API documentation config
```

## Development Workflow

### 1. Choose an Issue

- Check [GitHub Issues](https://github.com/n2flowjs/nbase/issues) for open tasks
- Look for issues labeled `good first issue` or `help wanted`
- Comment on the issue to indicate you're working on it

### 2. Create a Branch

```bash
# Create and switch to a new branch
git checkout -b feature/your-feature-name

# Or for bug fixes
git checkout -b fix/issue-description

# Or for documentation
git checkout -b docs/update-readme
```

### 3. Make Changes

```bash
# Ensure you're working with latest code
git pull origin main

# Make your changes
# ... edit files ...

# Run tests to ensure nothing breaks
npm test

# Run linting
npm run lint

# Format code
npm run format
```

### 4. Test Your Changes

```bash
# Run unit tests
npm run test:unit

# Run integration tests
npm run test:integration

# Run benchmarks to ensure performance
npm run benchmark

# Test with your specific use case
# (run `npm run build` first; the snippet loads the compiled output from ./dist)
node -e "
const { Database } = require('./dist');
const db = new Database({ vectorSize: 128 });
// ... your test code ...
"
```

### 5. Commit Your Changes

```bash
# Stage your changes
git add .

# Commit with descriptive message
git commit -m "feat: add new HNSW optimization

- Implement dynamic M parameter adjustment
- Add memory usage monitoring
- Improve search accuracy by 15%

Closes #123"

# Follow conventional commit format
# Type can be: feat, fix, docs, style, refactor, test, chore
```

### 6. Create Pull Request

```bash
# Push your branch
git push origin feature/your-feature-name

# Create PR on GitHub or use GitHub CLI
gh pr create --title "Add new HNSW optimization" --body "Detailed description..."
```

## Coding Standards

### TypeScript Guidelines

```typescript
// Good: Use interfaces for public APIs
interface SearchOptions {
  readonly k: number;
  readonly includeMetadata?: boolean;
  readonly distanceMetric?: DistanceMetric;
  readonly useHNSW?: boolean;
}

// Good: Use type unions for constrained values
type DistanceMetric = 'cosine' | 'euclidean' | 'manhattan' | 'dot-product';

// Good: Document complex functions with TSDoc
/**
 * Performs approximate nearest neighbor search using HNSW algorithm.
 *
 * @param query - The query vector to search for
 * @param options - Search configuration options
 * @returns Promise resolving to search results sorted by distance
 *
 * @example
 * ```typescript
 * const results = await db.search(queryVector, {
 *   k: 10,
 *   useHNSW: true,
 *   distanceMetric: 'cosine'
 * });
 * ```
 *
 * @throws {ValidationError} When query vector dimension doesn't match database
 * @throws {IndexNotBuiltError} When HNSW index is not available
 */
async function search(query: Float32Array, options: SearchOptions): Promise<SearchResult[]> {
  // Implementation
}

// Bad: Avoid any types
function badFunction(param: any): any {
  return param;
}

// Bad: Avoid large interfaces
interface BadInterface {
  prop1: string;
  prop2: number;
  prop3: boolean;
  prop4: string[];
  // ... many more properties
}
```

### Code Style

```typescript
// Good: Use descriptive variable names
const queryVector = new Float32Array(128);
const searchResults = await database.search(queryVector, { k: 10 });

// Good: Use early returns
// (the expected size is passed in explicitly; a standalone function has no `this`)
function validateInput(input: { vector?: unknown } | undefined, vectorSize: number): boolean {
  if (!input) return false;
  if (!Array.isArray(input.vector)) return false;
  if (input.vector.length !== vectorSize) return false;
  return true;
}

// Good: Handle errors appropriately
try {
  await database.addVector(id, vector);
} catch (error) {
  if (error instanceof ValidationError) {
    logger.warn('Invalid vector data:', error.message);
    throw error;
  }
  logger.error('Unexpected error:', error);
  throw new DatabaseError('Failed to add vector');
}

// Bad: Avoid magic numbers
const results = await db.search(query, { k: 10 }); // What does 10 mean?

// Good: Use named constants
const DEFAULT_SEARCH_RESULTS = 10;
const results = await db.search(query, { k: DEFAULT_SEARCH_RESULTS });
```

### File Organization

```typescript
// Good: Group related functionality
// src/search/hnsw_search.ts
export class HNSWSearch {
  // HNSW-specific search implementation
}

// src/search/lsh_search.ts
export class LSHSearch {
  // LSH-specific search implementation
}

// src/search/index.ts
export { HNSWSearch } from './hnsw_search';
export { LSHSearch } from './lsh_search';
export { UnifiedSearch } from './unified_search';

// Bad: Don't put everything in one file
// src/everything.ts - Contains HNSW, LSH, KNN, and more
```

## Testing

### Unit Tests

```typescript
// test/vector_db.test.ts
import { describe, it, expect, beforeEach } from '@jest/globals';
import { VectorDB } from '../src/vector/vector_db';
// generateTestVectors is assumed to be a shared test helper
// (for example from test/test-helpers); it is not defined in this snippet.

describe('VectorDB', () => {
  let db: VectorDB;

  beforeEach(() => {
    db = new VectorDB({
      vectorSize: 128,
      distanceMetric: 'cosine'
    });
  });

  describe('addVector', () => {
    it('should add a vector successfully', async () => {
      const vector = new Float32Array(128);
      const id = 'test-vector';

      await expect(db.addVector(id, vector)).resolves.toBeUndefined();
      expect(await db.getVector(id)).toEqual(vector);
    });

    it('should reject invalid vector dimensions', async () => {
      const invalidVector = new Float32Array(64); // Wrong size
      const id = 'invalid-vector';

      await expect(db.addVector(id, invalidVector))
        .rejects.toThrow('Vector dimension mismatch');
    });

    it('should handle duplicate IDs', async () => {
      const vector1 = new Float32Array(128);
      const vector2 = new Float32Array(128);
      const id = 'duplicate-id';

      await db.addVector(id, vector1);
      await expect(db.addVector(id, vector2))
        .rejects.toThrow('Vector ID already exists');
    });
  });

  describe('search', () => {
    beforeEach(async () => {
      // Setup test data
      const vectors = generateTestVectors(100);
      for (let i = 0; i < vectors.length; i++) {
        await db.addVector(`vec-${i}`, vectors[i]);
      }
    });

    it('should find nearest neighbors', async () => {
      const query = new Float32Array(128);
      const results = await db.search(query, { k: 5 });

      expect(results).toHaveLength(5);
      expect(results[0]).toHaveProperty('id');
      expect(results[0]).toHaveProperty('score');
      expect(results[0]).toHaveProperty('distance');
    });

    it('should respect k parameter', async () => {
      const query = new Float32Array(128);
      const results = await db.search(query, { k: 3 });

      expect(results).toHaveLength(3);
    });
  });
});
```

### Integration Tests

```typescript
// test/api/vectors.test.ts
import { describe, it, expect, beforeAll, afterAll } from '@jest/globals';
// This file lives in test/api/, so src/ is two levels up
import { Server } from '../../src/server';

describe('Vectors API', () => {
  let server: Server;
  let baseUrl: string;

  beforeAll(async () => {
    server = new Server({
      port: 0, // Random port
      database: { vectorSize: 128 }
    });
    await server.start();
    const address = server.getAddress();
    baseUrl = `http://localhost:${address.port}`;
  });

  afterAll(async () => {
    await server.stop();
  });

  describe('POST /api/vectors', () => {
    it('should add a vector successfully', async () => {
      const response = await fetch(`${baseUrl}/api/vectors`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          id: 'test-vector',
          vector: Array.from({ length: 128 }, () => Math.random()),
          metadata: { type: 'test' }
        })
      });

      expect(response.status).toBe(201);
      const result = await response.json();
      expect(result.success).toBe(true);
      expect(result.id).toBe('test-vector');
    });

    it('should handle validation errors', async () => {
      const response = await fetch(`${baseUrl}/api/vectors`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          id: 'invalid-vector',
          vector: [1, 2, 3], // Wrong dimension
          metadata: { type: 'test' }
        })
      });

      expect(response.status).toBe(400);
      const result = await response.json();
      expect(result.success).toBe(false);
      expect(result.error.code).toBe('VECTOR_DIMENSION_MISMATCH');
    });
  });
});
```

### Benchmark Tests

```typescript
// test/benchmarks/custom.benchmark.ts
import { Database } from '../../src/database/database';
import { createTimer } from '../test-helpers/benchmark-utils';
// generateTestVectors and generateRandomVector are assumed to be test helpers
// (for example from test/test-helpers); they are not defined in this snippet.

async function customBenchmark() {
  const db = new Database({
    vectorSize: 128,
    partitionCapacity: 10000
  });

  // Setup test data
  const testVectors = generateTestVectors(5000);
  await db.bulkAdd(testVectors.map((vec, i) => ({
    id: `vec-${i}`,
    vector: vec,
    metadata: { category: `cat-${i % 10}` }
  })));

  // Benchmark different search configurations
  const query = generateRandomVector(128);
  const configurations = [
    { name: 'Flat Search', options: { k: 10, useHNSW: false } },
    { name: 'HNSW Fast', options: { k: 10, useHNSW: true, efSearch: 32 } },
    { name: 'HNSW Balanced', options: { k: 10, useHNSW: true, efSearch: 100 } },
    { name: 'HNSW Accurate', options: { k: 10, useHNSW: true, efSearch: 200 } }
  ];

  console.log('Running custom benchmark...');

  // Run the configurations one at a time so their timings do not overlap
  const rows: Record<string, string>[] = [];
  for (const config of configurations) {
    const timer = createTimer();
    timer.start();
    for (let i = 0; i < 100; i++) {
      await db.search(query, config.options);
    }
    timer.stop();

    rows.push({
      Configuration: config.name,
      'Avg Time (ms)': (timer.total / 100).toFixed(2),
      'Total Time (ms)': timer.total.toFixed(2)
    });
  }
  console.table(rows);
}

customBenchmark().catch(console.error);
```

## Pull Request Process

### Branch Naming Convention

```bash
# Feature branches
feature/add-hnsw-optimization
feature/implement-lsh-index
feature/add-rest-api-endpoints

# Bug fix branches
fix/memory-leak-in-cache
fix/search-accuracy-issue
fix/api-validation-error

# Documentation branches
docs/update-api-reference
docs/add-contribution-guide
docs/improve-search-techniques

# Maintenance branches
chore/update-dependencies
chore/cleanup-unused-code
chore/improve-error-messages
```

### Commit Message Format

We follow the [Conventional Commits](https://conventionalcommits.org/) specification:

```bash
# Format
<type>[optional scope]: <description>

[optional body]

[optional footer(s)]

# Examples
feat(search): add LSH (locality-sensitive hashing) implementation

- Implement LSH index with configurable hash functions
- Add support for cosine and euclidean distance metrics
- Include performance benchmarks and documentation

Closes #123

fix(memory): resolve vector cache memory leak in PartitionedVectorDB

The cache was not properly cleaning up expired entries, leading to
unbounded memory growth over time.

fix(api): correct search result ordering in REST API

Results were being returned in arbitrary order instead of
distance-sorted order as documented.

BREAKING CHANGE: Search results now include distance field

docs(api): update search endpoint documentation

- Add examples for different search algorithms
- Document all available parameters and options
- Include error response examples

refactor(database): extract common functionality to base class

- Create AbstractDatabase base class
- Move shared methods to reduce code duplication
- Improve type safety with generics

test(search): add comprehensive HNSW search test suite

- Test various configurations and edge cases
- Include performance regression tests
- Add memory usage validation
```

### Pull Request Template

When creating a PR, please use this template:

````markdown
## Description

Brief description of the changes and the problem they solve.

## Changes Made

### Code Changes

- [ ] Added new feature/functionality
- [ ] Fixed bug/issue
- [ ] Refactored existing code
- [ ] Updated documentation
- [ ] Added tests

### Files Changed

- `src/search/hnsw.ts` - Added optimization
- `test/hnsw.test.ts` - Added test cases
- `docs/search-techniques.md` - Updated documentation

## Testing

### Test Coverage

- [ ] Unit tests pass (`npm test`)
- [ ] Integration tests pass (`npm run test:integration`)
- [ ] Benchmarks pass (`npm run benchmark`)
- [ ] Code coverage maintained (>90%)

### Manual Testing

- [ ] Tested with sample data
- [ ] Verified performance impact
- [ ] Checked edge cases

## Performance Impact

### Benchmarks

```
Operation          | Before   | After    | Change
-------------------|----------|----------|--------
HNSW Search (k=10) | 15.2ms   | 12.8ms   | +18% faster
Memory Usage       | 120MB    | 118MB    | -2% reduction
Index Build Time   | 45.2s    | 42.1s    | +7% faster
```

### Breaking Changes

- [ ] None
- [ ] Minor API changes (documented)
- [ ] Major API changes (migration guide needed)

## Documentation

- [ ] Updated API documentation
- [ ] Added code examples
- [ ] Updated README if needed
- [ ] Added migration guide for breaking changes

## Checklist

- [ ] Code follows project style guidelines
- [ ] All tests pass
- [ ] Documentation updated
- [ ] Performance benchmarks included
- [ ] No linting errors
- [ ] Commit messages follow conventional format
- [ ] PR description is clear and comprehensive

## Related Issues

Closes #123, #124
Related to #125

## Additional Notes

Any additional context, considerations, or follow-up work needed.
````

## Code Review Process

### Review Checklist for Reviewers

**Code Quality:**

- [ ] Code follows TypeScript best practices
- [ ] Proper error handling and validation
- [ ] No console.log statements in production code
- [ ] Functions have appropriate documentation
- [ ] Code is well-structured and readable

**Testing:**

- [ ] Unit tests cover new functionality
- [ ] Integration tests verify end-to-end behavior
- [ ] Edge cases are handled
- [ ] Performance tests included for performance-critical code

**Documentation:**

- [ ] API documentation updated
- [ ] Code comments added for complex logic
- [ ] Examples provided for new features
- [ ] Breaking changes documented

**Performance:**

- [ ] No performance regressions
- [ ] Memory usage considered
- [ ] Scalability implications reviewed

### Review Comments Guidelines

```typescript
// Good: Specific, actionable feedback
"The error handling here could be more specific. Consider throwing ValidationError instead of generic Error for better client handling."

// Good: Suggest alternatives
"Instead of manual array iteration, consider using Array.find() for better readability and performance."

// Bad: Unclear feedback
"This looks wrong"

// Bad: Demotivating
"This is a terrible implementation"
```

## Release Process

### Version Numbering

We follow [Semantic Versioning](https://semver.org/):

```
MAJOR.MINOR.PATCH

- MAJOR: Breaking changes
- MINOR: New features (backward compatible)
- PATCH: Bug fixes (backward compatible)
```

### Release Steps

1. **Preparation**

   ```bash
   # Ensure all tests pass
   npm test

   # Run benchmarks
   npm run benchmark

   # Update version (in a git repository this also creates a version commit and tag)
   npm version minor  # or patch, major
   ```

2. **Changelog Update**

   ```markdown
   ## [1.1.0] - 2024-01-15

   ### Added
   - LSH (Locality-Sensitive Hashing) implementation (#123)
   - REST API server with full CRUD operations (#124)
   - Vector compression with Product Quantization (#125)

   ### Fixed
   - Memory leak in HNSW index building (#126)
   - Search accuracy regression in clustered DB (#127)

   ### Performance
   - 25% faster search with new HNSW optimizations
   - 30% reduction in memory usage for large datasets
   ```

3. **Release**

   ```bash
   # Build production version
   npm run build

   # Create git tag (skip if `npm version` already created it)
   git tag v1.1.0

   # Push to repository
   git push origin main --tags

   # Publish to NPM
   npm publish
   ```

4. **Post-Release**

   - Update documentation website
   - Announce release on GitHub
   - Update issue labels and milestones

## Documentation

### API Documentation

Use TSDoc comments for all public APIs:

```typescript
/**
 * Represents a vector database that supports various search algorithms
 * and indexing strategies for high-performance similarity search.
 *
 * @example
 * ```typescript
 * const db = new Database({
 *   vectorSize: 128,
 *   indexing: { buildOnStart: true }
 * });
 *
 * await db.addVector('user-1', embedding);
 * const results = await db.search(queryEmbedding, { k: 5 });
 * ```
 */
export class Database {
  /**
   * Creates a new Database instance with the specified configuration.
   *
   * @param config - Database configuration options
   * @throws {ValidationError} When configuration is invalid
   */
  constructor(config: DatabaseConfig) {
    // Implementation
  }

  /**
   * Adds a vector to the database with optional metadata.
   *
   * @param id - Unique identifier for the vector
   * @param vector - The vector data as Float32Array
   * @param metadata - Optional metadata associated with the vector
   * @returns Promise that resolves when vector is added
   * @throws {ValidationError} When vector dimensions don't match
   * @throws {DuplicateError} When ID already exists
   */
  async addVector(
    id: string,
    vector: Float32Array,
    metadata?: Record<string, any>
  ): Promise<void> {
    // Implementation
  }
}
```

### Performance Documentation

Include performance characteristics in documentation:

```typescript
/**
 * Searches for the k nearest neighbors using HNSW algorithm.
 *
 * Performance characteristics:
 * - Time Complexity: O(log n) average case
 * - Space Complexity: O(n × d + n × M) where M is max connections
 * - Accuracy: 95-99% depending on efSearch parameter
 * - Memory Usage: ~50MB per 100K vectors with M=16
 *
 * @param query - Query vector
 * @param options - Search options
 * @returns Search results sorted by distance
 */
async search(query: Float32Array, options: SearchOptions): Promise<SearchResult[]> {
  // Implementation
}
```

## Getting Help

### Communication Channels

- **GitHub Issues**: For bugs, feature requests, and general questions
- **GitHub Discussions**: For longer-form discussions and Q&A
- **Discord**: For real-time chat and community support
- **Documentation**: Check our comprehensive docs first

### Issue Reporting

When reporting bugs, please include:

1. **Clear Title**: Summarize the issue concisely
2. **Description**: Detailed description of the problem
3. **Steps to Reproduce**: Minimal code example that reproduces the issue
4. **Expected vs Actual**: What you expected vs what happened
5. **Environment**: Node.js version, OS, NBase version
6. **Logs**: Relevant error messages or logs

### Feature Requests

For feature requests, please include:

1. **Use Case**: Describe your specific use case
2. **Current Workaround**: How you currently solve this problem
3. **Proposed Solution**: Your suggested implementation
4. **Alternatives**: Other approaches you've considered

## Recognition

Contributors are recognized in several ways:

- **GitHub Contributors**: Listed in repository contributors
- **Changelog**: Mentioned in release notes
- **Documentation**: Featured in contributor acknowledgments
- **Community**: Highlighted in community discussions

## License

By contributing to NBase, you agree that your contributions will be licensed under the MIT License. All contributors must sign our Contributor License Agreement (CLA) before their contributions can be accepted.

---

Thank you for contributing to NBase! Your contributions help make vector databases more accessible and powerful for everyone. 🚀