mongodocs-mcp
Version:
Transform any GitHub repository into searchable vector embeddings. MCP server with smart indexing, voyage-context-3 embeddings, and semantic search for Claude/Cursor IDEs.
549 lines (417 loc) • 11.2 kB
Markdown
# mongodocs-mcp
A Model Context Protocol (MCP) server that transforms any GitHub repository into searchable vector embeddings, enabling semantic search across codebases and documentation through IDE integration.
## Architecture
The system implements a three-phase indexing pipeline with smart change detection:
```
Repository → Git Clone → Smart Chunking → Vector Embeddings → MongoDB Atlas
↓ ↓ ↓ ↓
Hash Tracking Semantic Split voyage-context-3 Vector Search
```
### Core Components
- **Indexer** (`src/core/indexer.ts`): Git-based change detection using commit hashes
- **Semantic Chunker** (`src/core/semantic-chunker.ts`): Multi-strategy content splitting
- **Embedding Service** (`src/core/embeddings.ts`): Voyage AI integration with batching
- **Storage Service** (`src/core/storage.ts`): MongoDB Atlas vector operations
- **Search Service** (`src/core/search.ts`): Vector, hybrid RRF, and MMR algorithms
- **MCP Server** (`src/index.ts`): Protocol implementation for IDE integration
## Installation
### Global Package
```bash
npm install -g mongodocs-mcp
```
### From Source
```bash
git clone https://github.com/yourusername/mongodocs-mcp.git
cd mongodocs-mcp
npm install
npm run build
npm link
```
## Setup
### 1. MongoDB Atlas
Create free M0 cluster at [cloud.mongodb.com](https://cloud.mongodb.com):
```bash
# Database structure
Database: mongodb_semantic_docs
Collection: documents
# Connection string format
mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority
```
**Network Access Configuration:**
- Navigate to Network Access → Add IP Address
- Add `0.0.0.0/0` for development (restrict in production)
**Vector Search Index Creation:**
1. Go to Atlas Search → Create Index
2. Select "JSON Editor"
3. Paste configuration:
```json
{
"mappings": {
"dynamic": true,
"fields": {
"embedding": {
"type": "knnVector",
"dimensions": 1024,
"similarity": "cosine"
}
}
}
}
```
Name: `vector_index`
### 2. Voyage AI
Get API key from [voyageai.com](https://voyageai.com):
- Model: `voyage-context-3`
- Dimensions: 1024
- Context window: 32,000 tokens
- Rate limit: 2000 RPM
### 3. Environment Configuration
Create `.env` file:
```bash
# Required
MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/?retryWrites=true&w=majority
VOYAGE_API_KEY=pa-your-api-key
# Optional
GITHUB_TOKEN=ghp_your_token # For private repos
```
## Usage
### Web Interface
```bash
# Start web UI
npm run web
# Opens http://localhost:3000
# 4-step wizard:
# 1. Configure APIs
# 2. Select repositories
# 3. Review MCP setup
# 4. Start processing
```
### Command Line
```bash
# Index repositories (smart mode - only changed files)
npm run index
# Force complete rebuild
npm run rebuild
# Monitor indexing progress
npm run progress
# Database statistics
npm run stats
# Clean database
npm run clean
```
### Programmatic API
```javascript
import { Indexer } from 'mongodocs-mcp';
const config = {
repositories: [{
name: 'My Documentation',
repo: 'owner/repository',
branch: 'main',
product: 'custom-my-docs'
}],
embedding: {
model: 'voyage-context-3',
dimensions: 1024,
chunkSize: 1000,
chunkOverlap: 200
}
};
const indexer = new Indexer(config);
indexer.onProgress((progress) => {
console.log(`${progress.phase}: ${progress.current}/${progress.total}`);
});
await indexer.index();
```
## MCP Integration
### Claude Desktop
File: `~/Library/Application Support/Claude/claude_desktop_config.json`
```json
{
"mcpServers": {
"mongodocs": {
"command": "npx",
"args": ["mongodocs-mcp"],
"env": {
"MONGODB_URI": "your-connection-string",
"VOYAGE_API_KEY": "your-api-key"
}
}
}
}
```
### Cursor IDE
File: `.cursor/mcp_settings.json`
```json
{
"mcpServers": {
"mongodocs": {
"command": "npx",
"args": ["mongodocs-mcp"],
"env": {
"MONGODB_URI": "your-connection-string",
"VOYAGE_API_KEY": "your-api-key"
}
}
}
}
```
Restart IDE after configuration.
## Search Methods
### 1. Hybrid RRF Search (Primary)
Reciprocal Rank Fusion combining vector and keyword search:
```javascript
// Weight configuration
vectorWeight: 0.7
keywordWeight: 0.3
// Ranking formula
score = 1 / (k + rank) where k = 60
```
### 2. MMR Search (Diversity)
Maximum Marginal Relevance for result diversity:
```javascript
// Parameters
fetchK: 20 // Initial candidates
lambdaMult: 0.7 // Relevance vs diversity
limit: 5 // Final results
// Algorithm
MMR = λ * Sim(Di, Q) - (1-λ) * max Sim(Di, Dj)
```
### 3. Pure Vector Search
Cosine similarity search:
```javascript
// Configuration
numCandidates: 40 // 7.5x faster than default 300
limit: 10
```
## Technical Implementation
### Semantic Chunking
Three-strategy approach with statistical analysis:
**1. Interquartile Method**
```javascript
// Calculate sentence distances
distances = sentences.map(embed).map(cosineDistance)
// Find breakpoints at quartile boundaries
Q1, Q3 = quartiles(distances)
threshold = Q3 + 1.5 * (Q3 - Q1)
```
**2. Gradient Method**
```javascript
// Identify semantic transitions
gradients = distances.map(derivative)
breakpoints = gradients.filter(g => g > threshold)
```
**3. Hybrid Scoring**
```javascript
score = 0.6 * interquartile + 0.4 * gradient
// Adaptive to content type
```
### Chunk Optimization
```javascript
const CHUNK_CONFIG = {
base: 1000, // Target size
min: 100, // Prevent empty
max: 2500, // Respect limits
overlap: 200, // Context preservation
// Token validation
maxTokens: 6000, // voyage-context-3 safety
tokenizer: 'cl100k_base'
};
```
### Smart Indexing
Repository state tracking:
```javascript
// Check existing hash
const existingHash = await storage.getRepositoryHash(repo.name);
const currentHash = await git.getLatestCommit();
if (existingHash === currentHash) {
console.log('✅ Repository up to date, skipping...');
return;
}
// Process only changed files
const changedFiles = await git.diff(existingHash, currentHash);
await processFiles(changedFiles);
await storage.updateRepositoryHash(repo.name, currentHash);
```
### Error Handling
```javascript
// Exponential backoff with jitter
const delay = Math.min(1000 * Math.pow(2, attempt), 30000);
const jitter = Math.random() * 1000;
await sleep(delay + jitter);
// Token limit handling
if (error.message.includes('32000 tokens')) {
// Split chunk and retry
const subChunks = emergencySplit(chunk);
return processSubChunks(subChunks);
}
```
## Performance Characteristics
### Indexing Metrics
- **Processing rate**: 100-150 docs/hour (Voyage API limited)
- **Batch size**: 32 documents optimal
- **Memory usage**: <500MB peak
- **Network bandwidth**: ~10MB/hour
### Search Performance
- **Latency**: <100ms p99
- **Throughput**: 1000+ QPS
- **Index size**: ~1.5KB per chunk
- **Cache TTL**: 5 minutes
### Storage Efficiency
```javascript
// Document structure (avg 1.5KB)
{
_id: ObjectId,
title: string, // 50 bytes
content: string, // 1000 bytes
embedding: float[1024], // 4KB compressed
metadata: { // 200 bytes
file: string,
repo: string,
product: string,
indexedAt: Date
}
}
```
## Repository Configuration
### Default Repositories
```javascript
const repositories = [
{
name: 'MongoDB Documentation',
repo: 'mongodb/docs',
branch: 'master',
product: 'mongodb-docs',
priority: 10
},
// Add custom repositories...
];
```
### Custom Repository
```javascript
{
name: 'Your Documentation',
repo: 'owner/repository',
branch: 'main',
product: 'custom-your-docs',
// Optional filters
include: ['docs/**/*.md'],
exclude: ['**/node_modules/**'],
// Processing options
chunkSize: 1500,
chunkOverlap: 300
}
```
## Development
### Build Pipeline
```bash
# Development with watch
npm run dev
# Production build
npm run build
# Type checking
npm run typecheck
# Linting
npm run lint
# Testing
npm test
```
### Project Structure
```
src/
├── core/
│ ├── indexer.ts # Orchestration
│ ├── semantic-chunker.ts # Content splitting
│ ├── embeddings.ts # Vector generation
│ ├── storage.ts # Database operations
│ └── search.ts # Query algorithms
├── config/
│ └── index.ts # Repository definitions
├── web/
│ ├── server.ts # Express server
│ ├── coordinator.ts # Web orchestration
│ └── templates/ # HTML interfaces
└── index.ts # MCP server
dist/ # Compiled output
.repos/ # Cloned repositories
```
### Key Dependencies
```json
{
"mongodb": "^6.10.0", // Native driver
"voyageai": "^0.0.1-5", // Embeddings
"@modelcontextprotocol/sdk": "^1.0.0", // MCP
"js-tiktoken": "^1.0.15", // Tokenization
"simple-git": "^3.27.0" // Repository ops
}
```
## Troubleshooting
### Connection Issues
```bash
# Test MongoDB connection
node -e "
const { MongoClient } = require('mongodb');
MongoClient.connect(process.env.MONGODB_URI)
.then(() => console.log('✅ Connected'))
.catch(err => console.error('❌', err.message));
"
# Test Voyage AI
curl -X POST https://api.voyageai.com/v1/embeddings \
-H "Authorization: Bearer $VOYAGE_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": ["test"], "model": "voyage-context-3"}'
```
### Index Issues
```bash
# Verify vector index
mongosh $MONGODB_URI --eval "
db.documents.getSearchIndexes()
"
# Check document structure
mongosh $MONGODB_URI --eval "
db.documents.findOne()
"
```
### Performance Tuning
```javascript
// Adjust for your use case
const tuning = {
// Smaller batches for memory constraints
batchSize: 16,
// More candidates for precision
numCandidates: 100,
// Larger chunks for context
chunkSize: 2000,
// Disable for speed
smartIndexing: false
};
```
## Best Practices
### Security
- Store credentials in environment variables
- Use least-privilege MongoDB user
- Rotate API keys regularly
- Enable MongoDB audit logging
### Optimization
- Index during off-peak hours
- Use incremental updates
- Monitor token usage
- Cache frequent queries
### Scaling
- Horizontal sharding for large corpuses
- Read replicas for search traffic
- CDN for static assets
- Queue system for processing
## Contributing
Pull requests welcome. Please ensure:
- TypeScript strict mode compliance
- Test coverage >80%
- Conventional commits
- Documentation updates
## License
MIT
## Support
- Issues: [GitHub](https://github.com/yourusername/mongodocs-mcp/issues)
- Discussions: [GitHub Discussions](https://github.com/yourusername/mongodocs-mcp/discussions)
- MCP Spec: [modelcontextprotocol.io](https://modelcontextprotocol.io)
---
Built with MongoDB Atlas vector search and Voyage AI embeddings.