# MCP Codebase Indexing Server
[npm](https://www.npmjs.com/package/mcp-codebase-indexing-server) | [MIT License](https://opensource.org/licenses/MIT) | [TypeScript](https://www.typescriptlang.org/) | [Model Context Protocol](https://modelcontextprotocol.io/)
A Model Context Protocol (MCP) server that provides intelligent codebase indexing and semantic search capabilities for AI assistants like Cursor. This server uses Voyage AI for embeddings and Qdrant for vector storage to enable powerful semantic code search across your entire codebase.
## Table of Contents
- [Features](#features)
- [Architecture](#architecture)
- [Quick Start](#quick-start)
- [Cursor Integration Guide](#cursor-integration-guide)
- [Customization Guide](#customization-guide)
- [Configuration](#configuration)
- [MCP Tools](#mcp-tools)
- [Supported Languages](#supported-languages)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)
## Features
- **Intelligent Code Parsing**: Uses tree-sitter to parse code into meaningful chunks (functions, classes, modules, etc.)
- **Semantic Search**: Leverages Voyage AI embeddings for semantic code search that goes beyond keyword matching
- **Vector Storage**: Uses Qdrant for efficient vector storage and fast similarity search
- **Multiple Language Support**: Supports JavaScript, TypeScript, Python, and more
- **Incremental Indexing**: Tracks file changes and only re-indexes when necessary
- **Flexible Search**: Search by language, chunk type, file path, or semantic similarity
- **Context-Aware**: Provides code context and related chunks for better understanding
- **MCP Compatible**: Works seamlessly with Cursor and other MCP-compatible AI assistants
- **Comprehensive Tooling**: A complete set of indexing, search, and health tools for codebase management
## Architecture
```mermaid
graph TB
subgraph "AI Assistant"
A[Cursor/Claude]
end
subgraph "MCP Server"
B[HTTP Server<br/>Custom SSE + JSON-RPC]
C[IndexingService]
D[SearchService]
E[Code Parser<br/>Tree-sitter]
end
subgraph "External Services"
F[Voyage AI<br/>Embeddings]
G[Qdrant<br/>Vector DB]
end
A --> B
B --> C
B --> D
C --> E
C --> F
C --> G
D --> F
D --> G
style A fill:#e1f5fe
style B fill:#f3e5f5
style F fill:#fff3e0
style G fill:#e8f5e8
```
The server consists of several key components:
1. **Code Parser**: Tree-sitter based parser that extracts semantic chunks from code
2. **Voyage Client**: Handles embedding generation via Voyage AI API
3. **Qdrant Client**: Manages vector storage and similarity search
4. **Indexing Service**: Orchestrates the indexing process
5. **Search Service**: Provides semantic search capabilities
6. **MCP Server**: Exposes tools via the Model Context Protocol
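In code, the wiring roughly follows the diagram above. The sketch below is illustrative only: the module paths match the project structure described later, but the class and constructor signatures are assumptions, not the package's exact API.
```typescript
// Illustrative wiring sketch; constructor signatures are assumptions.
import { CodeParser } from './parsers/code-parser';
import { VoyageClient } from './clients/voyage-client';
import { QdrantVectorClient } from './clients/qdrant-client';
import { IndexingService } from './services/indexing-service';
import { SearchService } from './services/search-service';

async function bootstrap() {
  const parser = new CodeParser();                              // Tree-sitter based chunking
  const voyage = new VoyageClient(process.env.VOYAGE_API_KEY!); // embedding generation
  const qdrant = new QdrantVectorClient(process.env.QDRANT_URL ?? 'http://localhost:6333');

  const indexing = new IndexingService(parser, voyage, qdrant); // parse -> embed -> store
  const search = new SearchService(voyage, qdrant);             // embed query -> similarity search

  await indexing.initialize();
  await search.initialize();
  return { indexing, search };
}
```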
## Installation
### NPM Package (Recommended)
```bash
# Install globally
npm install -g mcp-codebase-indexing-server
# Or run directly with npx
npx mcp-codebase-indexing-server
```
### Docker
```bash
# Pull and run
docker run -p 3001:3001 ghcr.io/your-org/mcp-codebase-indexing-server:latest
```
### From Source
```bash
git clone <repository-url>
cd mcp-codebase-indexing-server
npm install
npm run build
npm start
```
## Quick Start
### Prerequisites
- Node.js 18+
- Voyage AI API key ([Get one here](https://www.voyageai.com/))
- Qdrant instance (local or cloud)
- AI assistant that supports MCP (like Cursor)
### 5-Minute Setup
1. **Get your services ready**:
```bash
# Start local Qdrant
docker run -d -p 6333:6333 --name qdrant qdrant/qdrant
# Get Voyage AI API key from https://www.voyageai.com/
```
2. **Deploy the server**:
```bash
git clone <repository-url>
cd mcp-codebase-indexing-server
npm install && npm run build
VOYAGE_API_KEY=your_key_here npm start
```
3. **Connect to Cursor**:
- Add MCP server in Cursor settings
- Use server URL: `http://localhost:3001`
- You should see a green circle with 12 tools available
4. **Test it out**:
- Index your codebase: "Index the current directory"
- Search your code: "Find authentication functions in TypeScript"
## Cursor Integration Guide
### Setting Up MCP Server in Cursor
1. **Open Cursor Settings**:
- Go to Settings → Features → Model Context Protocol
2. **Add MCP Server**:
```json
{
"name": "codebase-indexing",
"command": "node",
"args": ["path/to/your/mcp-codebase-indexing-server/dist/index.js"],
"env": {
"VOYAGE_API_KEY": "your_voyage_api_key_here",
"QDRANT_URL": "http://localhost:6333"
}
}
```
3. **Verify Connection**:
- Look for green circle indicator in Cursor
- Should show "12 tools" when connected
- If red circle: check logs and troubleshooting section
### Using with Cursor
#### Indexing Your Codebase
```
"Index the current directory for semantic search"
"Index the src/ folder in my project"
"Re-index the modified files in my codebase"
```
#### Searching Your Code
```
"Find authentication functions in TypeScript"
"Search for error handling patterns"
"Look for database query functions"
"Find classes that handle user data"
"Show me similar functions to the one I'm looking at"
```
#### Getting Code Context
```
"Get context around the login function"
"Show me similar code to this authentication logic"
"Find related functions in this file"
```
### Troubleshooting Cursor Connection
| Issue | Solution |
|-------|----------|
| Red circle (0 tools) | Check VOYAGE_API_KEY is set correctly |
| "No server info found" | Restart Cursor completely |
| Connection timeout | Ensure Qdrant is running on correct port |
| Tools not responding | Check server logs for errors |
## Customization Guide
### For Different Project Types
#### Large Enterprise Codebases
```env
# Handle large codebases efficiently
BATCH_SIZE=50
MAX_FILE_SIZE=2097152
CHUNK_SIZE=1500
EXCLUDE_PATTERNS=node_modules,dist,build,.git,coverage,logs
```
#### AI/ML Projects
```env
# Optimize for Python-heavy codebases
SUPPORTED_EXTENSIONS=.py,.ipynb,.md,.yaml,.yml
EMBEDDING_MODEL=voyage-code-2
CHUNK_SIZE=2000
```
#### Frontend Projects
```env
# Focus on web technologies
SUPPORTED_EXTENSIONS=.js,.jsx,.ts,.tsx,.vue,.svelte,.css,.scss
EXCLUDE_PATTERNS=node_modules,dist,build,.next,coverage
CHUNK_SIZE=800
```
#### Microservices Architecture
```env
# Index multiple service repositories
COLLECTION_NAME=microservices-org
BATCH_SIZE=100
# Consider separate instances per service
```
### Advanced Configuration Options
#### Performance Tuning
```env
# Memory optimization
BATCH_SIZE=25 # Smaller batches for memory-constrained environments
CHUNK_OVERLAP=100 # Reduce overlap to save storage
MAX_FILE_SIZE=1048576 # Limit file size (1MB default)
# Speed optimization
BATCH_SIZE=200 # Larger batches for faster processing
EMBEDDING_MODEL=voyage-code-2 # Optimized model for code
```
#### Custom File Filtering
```env
# Include only specific file types
SUPPORTED_EXTENSIONS=.py,.js,.ts,.go,.rs
# Exclude testing and generated files
EXCLUDE_PATTERNS=*test*,*spec*,generated,vendor,node_modules
# Include documentation
SUPPORTED_EXTENSIONS=.md,.rst,.txt,.py,.js,.ts
```
#### Multi-Environment Setup
```env
# Development
COLLECTION_NAME=dev-codebase
QDRANT_URL=http://localhost:6333
# Staging
COLLECTION_NAME=staging-codebase
QDRANT_URL=https://staging-qdrant.company.com
# Production
COLLECTION_NAME=prod-codebase
QDRANT_URL=https://qdrant.company.com
QDRANT_API_KEY=prod_api_key
```
## Privacy & Security
### Your Code Stays Private
The MCP server is designed with privacy as a core principle:
#### **Small Code Chunks Only**
- **Chunk Size**: Only small code segments (100-1000 characters) are sent for embedding
- **Default**: 800 characters maximum per chunk (configurable)
- **Enforcement**: Automatic truncation of larger chunks with logging
- **No Full Files**: Complete files are never sent to external services
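The chunk-size cap is simple to reason about: anything longer than `CHUNK_SIZE` is cut down before it is embedded. The helper below is a minimal sketch of that rule, not the package's actual implementation.
```typescript
// Minimal sketch of the chunk-size guarantee described above.
// `enforceChunkSize` is a hypothetical helper, not this package's API.
const CHUNK_SIZE = Number(process.env.CHUNK_SIZE ?? 800); // max characters per chunk

function enforceChunkSize(chunk: string): string {
  if (chunk.length <= CHUNK_SIZE) return chunk;
  console.warn(`Chunk truncated from ${chunk.length} to ${CHUNK_SIZE} characters`);
  return chunk.slice(0, CHUNK_SIZE); // only this truncated segment is ever embedded
}
```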
#### **One-Way Mathematical Representations**
- **Embeddings**: Code chunks are converted to mathematical vectors (embeddings)
- **Irreversible**: Embeddings cannot be converted back to original code
- **Semantic Only**: Vectors capture meaning, not exact text
- **No Code Storage**: Original code never leaves your environment
#### **Local Processing**
- **Parsing**: All code parsing happens locally using Tree-sitter
- **Chunking**: Code segmentation occurs on your machine
- **Storage**: Only vector embeddings stored in your Qdrant instance
- **Search**: Semantic search runs on your infrastructure
#### **Network Security**
- **HTTPS**: All external API calls use TLS encryption
- **API Keys**: Securely stored in environment variables
- **No Logging**: Code content is never logged to external services
- **Minimal Data**: Only mathematical vectors transmitted
### Privacy Configuration
```env
# Privacy-optimized settings
CHUNK_SIZE=800 # Max 800 chars per chunk (100-1000 range)
CHUNK_OVERLAP=100 # Reduced overlap for privacy
MAX_FILE_SIZE=1048576 # 1MB file size limit
EXCLUDE_PATTERNS=*.git*,node_modules/**,dist/** # Skip sensitive directories
```
## Configuration
The server is configured via environment variables:
### Required Environment Variables
- `VOYAGE_API_KEY`: Your Voyage AI API key
### Optional Environment Variables
- `QDRANT_URL`: Qdrant server URL (default: `http://localhost:6333`)
- `QDRANT_API_KEY`: Qdrant API key (if using cloud instance)
- `COLLECTION_NAME`: Name of the Qdrant collection (default: `codebase`)
- `EMBEDDING_MODEL`: Voyage AI model to use (default: `voyage-code-3`)
- `BATCH_SIZE`: Batch size for embedding generation (default: `100`)
- `CHUNK_SIZE`: Maximum chunk size in characters (default: `800`, range: 100-1000)
- `CHUNK_OVERLAP`: Overlap between chunks (default: `100`)
- `MAX_FILE_SIZE`: Maximum file size to index in bytes (default: `1048576`)
- `EXCLUDE_PATTERNS`: Comma-separated patterns to exclude (default: see config)
- `SUPPORTED_EXTENSIONS`: Comma-separated file extensions to support (default: see config)
### Example Configuration
Create a `.env` file in the project root:
```env
VOYAGE_API_KEY=your_voyage_api_key_here
QDRANT_URL=http://localhost:6333
COLLECTION_NAME=my_codebase
EMBEDDING_MODEL=voyage-code-2
BATCH_SIZE=50
MAX_FILE_SIZE=2097152
```
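Internally, these variables map onto a configuration object with the defaults listed above. The loader below is a sketch of that mapping; the `ServerConfig` shape is an assumption, not the package's exact type.
```typescript
// Illustrative config loader for the variables documented above.
// The ServerConfig shape is an assumption, not the package's exact type.
interface ServerConfig {
  voyageApiKey: string;
  qdrantUrl: string;
  qdrantApiKey?: string;
  collectionName: string;
  embeddingModel: string;
  batchSize: number;
  chunkSize: number;
  chunkOverlap: number;
  maxFileSize: number;
}

function loadConfig(env = process.env): ServerConfig {
  if (!env.VOYAGE_API_KEY) throw new Error('VOYAGE_API_KEY is required');
  return {
    voyageApiKey: env.VOYAGE_API_KEY,
    qdrantUrl: env.QDRANT_URL ?? 'http://localhost:6333',
    qdrantApiKey: env.QDRANT_API_KEY,
    collectionName: env.COLLECTION_NAME ?? 'codebase',
    embeddingModel: env.EMBEDDING_MODEL ?? 'voyage-code-3',
    batchSize: Number(env.BATCH_SIZE ?? 100),
    chunkSize: Number(env.CHUNK_SIZE ?? 800),
    chunkOverlap: Number(env.CHUNK_OVERLAP ?? 100),
    maxFileSize: Number(env.MAX_FILE_SIZE ?? 1048576),
  };
}
```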
## Usage
### Running the Server
```bash
npm start
```
Or in development mode:
```bash
npm run dev
```
### Setting up Qdrant
#### Local Qdrant (Docker)
```bash
docker run -p 6333:6333 qdrant/qdrant
```
#### Qdrant Cloud
Sign up at [Qdrant Cloud](https://cloud.qdrant.io/) and get your API key and URL.
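Whichever deployment you use, you can verify connectivity before starting the server. The snippet below uses the official `@qdrant/js-client-rest` package as a quick check; whether this server uses that client internally is an assumption.
```typescript
// Quick Qdrant connectivity check using the official JS client.
import { QdrantClient } from '@qdrant/js-client-rest';

const qdrant = new QdrantClient({
  url: process.env.QDRANT_URL ?? 'http://localhost:6333',
  apiKey: process.env.QDRANT_API_KEY, // required for Qdrant Cloud
});

async function checkQdrant(): Promise<void> {
  const { collections } = await qdrant.getCollections(); // lists existing collections
  console.log('Qdrant reachable, collections:', collections.map((c) => c.name));
}

checkQdrant().catch(console.error);
```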
## MCP Tools
The server provides the following tools, organized by functionality:
<details>
<summary><strong>Indexing Tools (5 tools)</strong></summary>
- **`index_directory`**: Index all files in a directory recursively
- **`index_file`**: Index a single file
- **`reindex_file`**: Re-index a file (force update)
- **`remove_file`**: Remove a file from the index
- **`clear_index`**: Clear the entire search index
</details>
<details>
<summary><strong>Search Tools (6 tools)</strong></summary>
- **`codebase_search`**: **Natural language search** for codebase understanding (e.g., "How is user authentication handled?", "Database connection setup", "Error handling patterns")
- **`search_code`**: Search for code chunks using semantic similarity
- **`search_functions`**: Search for functions by name or description
- **`search_classes`**: Search for classes by name or description
- **`find_similar`**: Find code chunks similar to a given chunk
- **`get_code_context`**: Get code context around a specific chunk
</details>
<details>
<summary><strong>Statistics & Health Tools (6 tools)</strong></summary>
- **`get_indexing_stats`**: Get statistics about the indexed codebase
- **`get_search_stats`**: Get statistics about the search index
- **`get_enhanced_stats`**: Get enhanced statistics including cache and hybrid search metrics
- **`get_health_status`**: Get comprehensive health status of all services
- **`clear_search_cache`**: Clear search cache for fresh results
- **`invalidate_file_cache`**: Invalidate cache for a specific file
</details>
### Example Usage
1. **Index a directory**:
```json
{
"tool": "index_directory",
"arguments": {
"directory_path": "/path/to/your/codebase"
}
}
```
2. **Natural language codebase search**:
```json
{
"tool": "codebase_search",
"arguments": {
"query": "How is user authentication handled?",
"limit": 5,
"enable_hybrid": true,
"enable_reranking": true
}
}
```
3. **Search for authentication functions**:
```json
{
"tool": "search_functions",
"arguments": {
"query": "authentication login user",
"language": "typescript",
"limit": 5
}
}
```
4. **Search for error handling patterns**:
```json
{
"tool": "search_code",
"arguments": {
"query": "error handling exception try catch",
"chunk_type": "function",
"threshold": 0.7
}
}
```
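For quick testing outside an MCP client, the same tool calls can be sent over HTTP, mirroring the `curl` command shown under "Debug Indexing" in the Troubleshooting section. The endpoint and payload shape below come from that example; the response format is an assumption.
```typescript
// Direct HTTP tool call for testing; mirrors the curl example in "Debug Indexing".
async function callTool(tool: string, args: Record<string, unknown>): Promise<unknown> {
  const res = await fetch('http://localhost:3001/tools/call', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ tool, arguments: args }),
  });
  if (!res.ok) throw new Error(`Tool call failed: ${res.status}`);
  return res.json(); // response format is an assumption
}

// Example: semantic search for error-handling functions
callTool('search_code', {
  query: 'error handling exception try catch',
  chunk_type: 'function',
  threshold: 0.7,
}).then(console.log);
```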
### Natural Language Search Examples
The `codebase_search` tool understands natural language queries and provides:
- **Relevant code snippets** with syntax highlighting
- **File paths with line numbers** for direct navigation
- **Similarity scores** as percentages
- **Clickable navigation links** to jump to specific locations
**Example queries that work great:**
- `"How is user authentication handled?"`
- `"Database connection setup"`
- `"Error handling patterns"`
- `"API endpoint definitions"`
- `"Component state management"`
- `"Configuration loading"`
- `"Logging implementation"`
**Sample output format:**
````markdown
# Natural Language Codebase Search
**Query:** "How is user authentication handled?"
## Search Results
- **Found:** 8 relevant code references
- **Search Time:** 45ms
- **Hybrid Search:** Yes (Dense + Sparse)
- **LLM Re-ranked:** Yes (Relevance optimized)
## Code References with Navigation Links
### 1. [src/auth/auth-service.ts:15](file://src/auth/auth-service.ts#L15)
**Lines 15-28** | **function** | **typescript** | **Similarity: 94.2%**
```typescript
async authenticateUser(token: string): Promise<User | null> {
  try {
    const decoded = jwt.verify(token, this.secretKey);
    return await this.userRepository.findById(decoded.userId);
  } catch (error) {
    logger.error('Authentication failed:', error);
    return null;
  }
}
```
````
## Supported Languages
| Language | File Extensions | Status |
|----------|----------------|--------|
| **JavaScript** | `.js`, `.jsx` | Full Support |
| **TypeScript** | `.ts`, `.tsx` | Full Support |
| **Python** | `.py` | Full Support |
| **Go** | `.go` | Coming Soon |
| **Rust** | `.rs` | Coming Soon |
| **Java** | `.java` | Coming Soon |
> **Extensible**: Additional languages can be added by installing the corresponding tree-sitter grammars and updating the configuration.
## API Reference
### Indexing Service
The `IndexingService` class provides:
```typescript
// Initialize the service
await indexingService.initialize();
// Index a directory
const stats = await indexingService.indexDirectory('/path/to/code');
// Index a single file
const chunks = await indexingService.indexFile('/path/to/file.ts');
// Remove a file from index
await indexingService.removeFile('/path/to/file.ts');
// Clear entire index
await indexingService.clearIndex();
```
### Search Service
The `SearchService` class provides:
```typescript
// Initialize the service
await searchService.initialize();
// Basic search
const results = await searchService.search({
query: 'authentication',
language: 'typescript',
limit: 10
});
// Search functions
const functions = await searchService.searchFunctions('login', 'typescript');
// Find similar chunks
const similar = await searchService.findSimilar('chunk_id', 5);
// Get code context
const context = await searchService.getCodeContext('chunk_id', 5);
```
## Performance Considerations
- **Batch Processing**: The server processes files in batches to avoid memory issues
- **Incremental Updates**: Only re-indexes files that have changed
- **Embedding Caching**: Consider caching embeddings to reduce API calls, as sketched after this list
- **Vector Storage**: Qdrant provides efficient vector storage and retrieval
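A minimal version of such an embedding cache, keyed by a content hash so unchanged chunks never trigger a second API call, could look like this (illustrative only, not part of the package):
```typescript
// Minimal content-hash embedding cache, as suggested by "Embedding Caching" above.
import { createHash } from 'node:crypto';

const cache = new Map<string, number[]>();

async function embedWithCache(
  chunk: string,
  embed: (text: string) => Promise<number[]>, // e.g. a Voyage AI embedding call
): Promise<number[]> {
  const key = createHash('sha256').update(chunk).digest('hex');
  const hit = cache.get(key);
  if (hit) return hit;               // unchanged chunk: reuse the stored vector
  const vector = await embed(chunk); // cache miss: generate and remember
  cache.set(key, vector);
  return vector;
}
```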
## Troubleshooting
### Common Issues
#### MCP Connection Issues
| Problem | Symptoms | Solution |
|---------|----------|----------|
| **Server won't start** | `Error: EADDRINUSE` | Port 3001 already in use. Change PORT env var or kill existing process |
| **Connection timeout** | Cursor shows "connecting..." forever | Check VOYAGE_API_KEY is valid and Qdrant is running |
| **Red circle in Cursor** | 0 tools shown | Restart Cursor completely, verify server is running |
| **"Not connected" error** | Tools fail with connection error | Server restarted automatically, wait 30 seconds |
#### Service Connection Issues
1. **Connection to Qdrant fails**:
```bash
# Check if Qdrant is running
curl http://localhost:6333/collections
# Start Qdrant if not running
docker run -d -p 6333:6333 --name qdrant qdrant/qdrant
# Check firewall settings
netstat -tulpn | grep 6333
```
2. **Voyage AI API errors**:
```bash
# Test API key
curl -H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"input": ["test"], "model": "voyage-code-2"}' \
https://api.voyageai.com/v1/embeddings
# Check quota at https://www.voyageai.com/dashboard
```
#### Performance Issues
3. **Out of memory during indexing**:
```env
# Reduce memory usage
BATCH_SIZE=25
MAX_FILE_SIZE=524288
CHUNK_SIZE=500
# Exclude large directories
EXCLUDE_PATTERNS=node_modules,dist,build,.git,logs,coverage,vendor
```
4. **Slow indexing performance**:
```env
# Optimize for speed
BATCH_SIZE=100
CHUNK_OVERLAP=100
# Use faster embedding model if available
EMBEDDING_MODEL=voyage-code-2
```
#### Code Parsing Issues
5. **Tree-sitter parsing errors**:
- **Error**: `Language not supported`
- **Solution**: Add tree-sitter grammar for your language
- **Error**: `Failed to parse file`
- **Solution**: Check file encoding (must be UTF-8)
- **Error**: `File too large`
- **Solution**: Increase MAX_FILE_SIZE or exclude the file
### Diagnostic Commands
#### Check Server Health
```bash
# Test server is running
curl http://localhost:3001/health
# Test MCP endpoint
curl http://localhost:3001/sse
# Check server logs
npm start 2>&1 | tee server.log
```
#### Check Services
```bash
# Test Qdrant
curl http://localhost:6333/collections
# Test Voyage AI
curl -H "Authorization: Bearer $VOYAGE_API_KEY" \
  -H "Content-Type: application/json" \
  https://api.voyageai.com/v1/embeddings \
  -d '{"input":["test"],"model":"voyage-code-2"}'
```
#### Debug Indexing
```bash
# Enable debug mode
DEBUG=1 npm start
# Test specific directory
curl -X POST http://localhost:3001/tools/call \
-H "Content-Type: application/json" \
-d '{"tool":"index_directory","arguments":{"directory_path":"./test"}}'
```
### Debug Mode
Enable comprehensive logging:
```bash
# Full debug output
DEBUG=1 npm start
# Service-specific debugging
DEBUG=indexing npm start
DEBUG=search npm start
DEBUG=mcp npm start
```
### Log Analysis
Look for these patterns in logs:
| Log Pattern | Meaning | Action |
|-------------|---------|--------|
| `Error: VOYAGE_API_KEY is required` | Missing API key | Set VOYAGE_API_KEY environment variable |
| `Failed to connect to Qdrant` | Vector DB unavailable | Check Qdrant is running and accessible |
| `Rate limit exceeded` | API quota reached | Wait or upgrade Voyage AI plan |
| `Memory usage warning` | High memory usage | Reduce BATCH_SIZE or exclude more files |
| `Lazy initialization completed` | Services ready | Normal startup, server ready for requests |
### Getting Help
1. **Check server logs** for specific error messages
2. **Test each service individually** using diagnostic commands
3. **Verify environment variables** are set correctly
4. **Restart services** in order: Qdrant ā MCP Server ā Cursor
5. **Create minimal reproduction** with a small test directory
If issues persist, create a GitHub issue with:
- Complete error logs
- Environment configuration (without API keys)
- Steps to reproduce
- System information (OS, Node.js version, etc.)
## Development
### Project Structure
```
src/
├── types.ts                 # Type definitions
├── index.ts                 # Main MCP server
├── clients/
│   ├── voyage-client.ts     # Voyage AI client
│   └── qdrant-client.ts     # Qdrant client
├── parsers/
│   └── code-parser.ts       # Tree-sitter based parser
└── services/
    ├── indexing-service.ts  # Indexing orchestration
    └── search-service.ts    # Search functionality
```
### Adding New Languages
1. Install the tree-sitter grammar:
```bash
npm install tree-sitter-rust
```
2. Update the `loadLanguage` function in `src/parsers/code-parser.ts`
3. Add language configuration in `initializeLanguageConfigs`
4. Update the file extension mapping
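For example, wiring up Rust might look roughly like the sketch below. `loadLanguage` and `initializeLanguageConfigs` are the functions named in the steps above; the exact signatures and node types shown here are assumptions.
```typescript
// Sketch of steps 2-4 for Rust; signatures and node types are assumptions.
import Rust from 'tree-sitter-rust';

// Step 2: teach loadLanguage about the new grammar
function loadLanguage(language: string): unknown {
  switch (language) {
    case 'rust':
      return Rust;
    // ...existing languages...
    default:
      throw new Error(`Language not supported: ${language}`);
  }
}

// Steps 3-4: register chunkable node types and the extension mapping
const rustConfig = {
  language: 'rust',
  extensions: ['.rs'],
  chunkNodeTypes: ['function_item', 'struct_item', 'impl_item'],
};
```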
### Testing
Run tests with:
```bash
npm test
```
### Linting
Check code style with:
```bash
npm run lint
```
## Contributing
We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details. The usual flow: fork the repository, create a feature branch, make your changes, add tests, and submit a pull request.
### Development Setup
```bash
# Clone the repository
git clone <repository-url>
cd mcp-codebase-indexing-server
# Install dependencies
npm install
# Build the project
npm run build
# Run in development mode
npm run dev
# Run tests
npm test
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- [Voyage AI](https://www.voyageai.com/) for providing excellent code embeddings
- [Qdrant](https://qdrant.tech/) for the powerful vector database
- [Tree-sitter](https://tree-sitter.github.io/) for robust code parsing
- [Model Context Protocol](https://modelcontextprotocol.io/) for the standardized AI integration protocol
## Changelog
### v1.0.0 - Production Release
- Complete MCP protocol implementation with 12 tools
- Lazy initialization to prevent connection timeouts
- Custom SSE implementation for Cursor compatibility
- Support for JavaScript, TypeScript, Python
- Voyage AI integration for semantic embeddings
- Qdrant integration for vector storage
- Incremental indexing with file change tracking
- Automated Fly.io deployment with GitHub Actions
## Related Projects
- [Model Context Protocol](https://modelcontextprotocol.io/) - Official MCP documentation
- [MCP Servers](https://github.com/modelcontextprotocol/servers) - Official MCP server implementations
- [Cursor](https://cursor.sh/) - AI-powered code editor with MCP support