UNPKG

mongodocs-mcp

Version:

Transform any GitHub repository into searchable vector embeddings. MCP server with smart indexing, voyage-context-3 embeddings, and semantic search for Claude/Cursor IDEs.

549 lines (417 loc) 11.2 kB
# mongodocs-mcp A Model Context Protocol (MCP) server that transforms any GitHub repository into searchable vector embeddings, enabling semantic search across codebases and documentation through IDE integration. ## Architecture The system implements a three-phase indexing pipeline with smart change detection: ``` Repository → Git Clone → Smart Chunking → Vector Embeddings → MongoDB Atlas ↓ ↓ ↓ ↓ Hash Tracking Semantic Split voyage-context-3 Vector Search ``` ### Core Components - **Indexer** (`src/core/indexer.ts`): Git-based change detection using commit hashes - **Semantic Chunker** (`src/core/semantic-chunker.ts`): Multi-strategy content splitting - **Embedding Service** (`src/core/embeddings.ts`): Voyage AI integration with batching - **Storage Service** (`src/core/storage.ts`): MongoDB Atlas vector operations - **Search Service** (`src/core/search.ts`): Vector, hybrid RRF, and MMR algorithms - **MCP Server** (`src/index.ts`): Protocol implementation for IDE integration ## Installation ### Global Package ```bash npm install -g mongodocs-mcp ``` ### From Source ```bash git clone https://github.com/yourusername/mongodocs-mcp.git cd mongodocs-mcp npm install npm run build npm link ``` ## Setup ### 1. MongoDB Atlas Create free M0 cluster at [cloud.mongodb.com](https://cloud.mongodb.com): ```bash # Database structure Database: mongodb_semantic_docs Collection: documents # Connection string format mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority ``` **Network Access Configuration:** - Navigate to Network Access → Add IP Address - Add `0.0.0.0/0` for development (restrict in production) **Vector Search Index Creation:** 1. Go to Atlas Search → Create Index 2. Select "JSON Editor" 3. Paste configuration: ```json { "mappings": { "dynamic": true, "fields": { "embedding": { "type": "knnVector", "dimensions": 1024, "similarity": "cosine" } } } } ``` Name: `vector_index` ### 2. Voyage AI Get API key from [voyageai.com](https://voyageai.com): - Model: `voyage-context-3` - Dimensions: 1024 - Context window: 32,000 tokens - Rate limit: 2000 RPM ### 3. Environment Configuration Create `.env` file: ```bash # Required MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority VOYAGE_API_KEY=pa-your-api-key # Optional GITHUB_TOKEN=ghp_your_token # For private repos ``` ## Usage ### Web Interface ```bash # Start web UI npm run web # Opens http://localhost:3000 # 4-step wizard: # 1. Configure APIs # 2. Select repositories # 3. Review MCP setup # 4. Start processing ``` ### Command Line ```bash # Index repositories (smart mode - only changed files) npm run index # Force complete rebuild npm run rebuild # Monitor indexing progress npm run progress # Database statistics npm run stats # Clean database npm run clean ``` ### Programmatic API ```javascript import { Indexer } from 'mongodocs-mcp'; const config = { repositories: [{ name: 'My Documentation', repo: 'owner/repository', branch: 'main', product: 'custom-my-docs' }], embedding: { model: 'voyage-context-3', dimensions: 1024, chunkSize: 1000, chunkOverlap: 200 } }; const indexer = new Indexer(config); indexer.onProgress((progress) => { console.log(`${progress.phase}: ${progress.current}/${progress.total}`); }); await indexer.index(); ``` ## MCP Integration ### Claude Desktop File: `~/Library/Application Support/Claude/claude_desktop_config.json` ```json { "mcpServers": { "mongodocs": { "command": "npx", "args": ["mongodocs-mcp"], "env": { "MONGODB_URI": "your-connection-string", "VOYAGE_API_KEY": "your-api-key" } } } } ``` ### Cursor IDE File: `.cursor/mcp_settings.json` ```json { "mcpServers": { "mongodocs": { "command": "npx", "args": ["mongodocs-mcp"], "env": { "MONGODB_URI": "your-connection-string", "VOYAGE_API_KEY": "your-api-key" } } } } ``` Restart IDE after configuration. ## Search Methods ### 1. Hybrid RRF Search (Primary) Reciprocal Rank Fusion combining vector and keyword search: ```javascript // Weight configuration vectorWeight: 0.7 keywordWeight: 0.3 // Ranking formula score = 1 / (k + rank) where k = 60 ``` ### 2. MMR Search (Diversity) Maximum Marginal Relevance for result diversity: ```javascript // Parameters fetchK: 20 // Initial candidates lambdaMult: 0.7 // Relevance vs diversity limit: 5 // Final results // Algorithm MMR = λ * Sim(Di, Q) - (1-λ) * max Sim(Di, Dj) ``` ### 3. Pure Vector Search Cosine similarity search: ```javascript // Configuration numCandidates: 40 // 7.5x faster than default 300 limit: 10 ``` ## Technical Implementation ### Semantic Chunking Three-strategy approach with statistical analysis: **1. Interquartile Method** ```javascript // Calculate sentence distances distances = sentences.map(embed).map(cosineDistance) // Find breakpoints at quartile boundaries Q1, Q3 = quartiles(distances) threshold = Q3 + 1.5 * (Q3 - Q1) ``` **2. Gradient Method** ```javascript // Identify semantic transitions gradients = distances.map(derivative) breakpoints = gradients.filter(g => g > threshold) ``` **3. Hybrid Scoring** ```javascript score = 0.6 * interquartile + 0.4 * gradient // Adaptive to content type ``` ### Chunk Optimization ```javascript const CHUNK_CONFIG = { base: 1000, // Target size min: 100, // Prevent empty max: 2500, // Respect limits overlap: 200, // Context preservation // Token validation maxTokens: 6000, // voyage-context-3 safety tokenizer: 'cl100k_base' }; ``` ### Smart Indexing Repository state tracking: ```javascript // Check existing hash const existingHash = await storage.getRepositoryHash(repo.name); const currentHash = await git.getLatestCommit(); if (existingHash === currentHash) { console.log('✅ Repository up to date, skipping...'); return; } // Process only changed files const changedFiles = await git.diff(existingHash, currentHash); await processFiles(changedFiles); await storage.updateRepositoryHash(repo.name, currentHash); ``` ### Error Handling ```javascript // Exponential backoff with jitter const delay = Math.min(1000 * Math.pow(2, attempt), 30000); const jitter = Math.random() * 1000; await sleep(delay + jitter); // Token limit handling if (error.message.includes('32000 tokens')) { // Split chunk and retry const subChunks = emergencySplit(chunk); return processSubChunks(subChunks); } ``` ## Performance Characteristics ### Indexing Metrics - **Processing rate**: 100-150 docs/hour (Voyage API limited) - **Batch size**: 32 documents optimal - **Memory usage**: <500MB peak - **Network bandwidth**: ~10MB/hour ### Search Performance - **Latency**: <100ms p99 - **Throughput**: 1000+ QPS - **Index size**: ~1.5KB per chunk - **Cache TTL**: 5 minutes ### Storage Efficiency ```javascript // Document structure (avg 1.5KB) { _id: ObjectId, title: string, // 50 bytes content: string, // 1000 bytes embedding: float[1024], // 4KB compressed metadata: { // 200 bytes file: string, repo: string, product: string, indexedAt: Date } } ``` ## Repository Configuration ### Default Repositories ```javascript const repositories = [ { name: 'MongoDB Documentation', repo: 'mongodb/docs', branch: 'master', product: 'mongodb-docs', priority: 10 }, // Add custom repositories... ]; ``` ### Custom Repository ```javascript { name: 'Your Documentation', repo: 'owner/repository', branch: 'main', product: 'custom-your-docs', // Optional filters include: ['docs/**/*.md'], exclude: ['**/node_modules/**'], // Processing options chunkSize: 1500, chunkOverlap: 300 } ``` ## Development ### Build Pipeline ```bash # Development with watch npm run dev # Production build npm run build # Type checking npm run typecheck # Linting npm run lint # Testing npm test ``` ### Project Structure ``` src/ ├── core/ │ ├── indexer.ts # Orchestration │ ├── semantic-chunker.ts # Content splitting │ ├── embeddings.ts # Vector generation │ ├── storage.ts # Database operations │ └── search.ts # Query algorithms ├── config/ │ └── index.ts # Repository definitions ├── web/ │ ├── server.ts # Express server │ ├── coordinator.ts # Web orchestration │ └── templates/ # HTML interfaces └── index.ts # MCP server dist/ # Compiled output .repos/ # Cloned repositories ``` ### Key Dependencies ```json { "mongodb": "^6.10.0", // Native driver "voyageai": "^0.0.1-5", // Embeddings "@modelcontextprotocol/sdk": "^1.0.0", // MCP "js-tiktoken": "^1.0.15", // Tokenization "simple-git": "^3.27.0" // Repository ops } ``` ## Troubleshooting ### Connection Issues ```bash # Test MongoDB connection node -e " const { MongoClient } = require('mongodb'); MongoClient.connect(process.env.MONGODB_URI) .then(() => console.log('✅ Connected')) .catch(err => console.error('❌', err.message)); " # Test Voyage AI curl -X POST https://api.voyageai.com/v1/embeddings \ -H "Authorization: Bearer $VOYAGE_API_KEY" \ -H "Content-Type: application/json" \ -d '{"input": ["test"], "model": "voyage-context-3"}' ``` ### Index Issues ```bash # Verify vector index mongosh $MONGODB_URI --eval " db.documents.getSearchIndexes() " # Check document structure mongosh $MONGODB_URI --eval " db.documents.findOne() " ``` ### Performance Tuning ```javascript // Adjust for your use case const tuning = { // Smaller batches for memory constraints batchSize: 16, // More candidates for precision numCandidates: 100, // Larger chunks for context chunkSize: 2000, // Disable for speed smartIndexing: false }; ``` ## Best Practices ### Security - Store credentials in environment variables - Use least-privilege MongoDB user - Rotate API keys regularly - Enable MongoDB audit logging ### Optimization - Index during off-peak hours - Use incremental updates - Monitor token usage - Cache frequent queries ### Scaling - Horizontal sharding for large corpuses - Read replicas for search traffic - CDN for static assets - Queue system for processing ## Contributing Pull requests welcome. Please ensure: - TypeScript strict mode compliance - Test coverage >80% - Conventional commits - Documentation updates ## License MIT ## Support - Issues: [GitHub](https://github.com/yourusername/mongodocs-mcp/issues) - Discussions: [GitHub Discussions](https://github.com/yourusername/mongodocs-mcp/discussions) - MCP Spec: [modelcontextprotocol.io](https://modelcontextprotocol.io) --- Built with MongoDB Atlas vector search and Voyage AI embeddings.