# cindex
**Semantic code search and context retrieval for large codebases**
A Model Context Protocol (MCP) server that provides intelligent code search and context retrieval
for Claude Code. Handles 1M+ lines of code with accuracy-first design.
## Features
- **Semantic Search** - Vector embeddings for intelligent code discovery
- **Hybrid Search** - Combines vector similarity with PostgreSQL full-text search for better natural
language query handling
- **9-Stage Retrieval Pipeline** - Scope filtering → query → files → chunks → symbols → imports →
APIs → dedup → assembly
- **Multi-Project Support** - Monorepo, microservices, and reference repository indexing
- **Scope Filtering** - Global, repository, service, and boundary-aware search modes
- **API Contract Search** - Semantic search for REST/GraphQL/gRPC endpoints
- **Query Caching** - LRU cache with 80%+ hit rate (cached queries ~50ms)
- **Progress Notifications** - Real-time 9-stage pipeline tracking
- **Incremental Indexing** - Only re-index changed files
- **Import Chain Analysis** - Automatic dependency resolution
- **Deduplication** - Remove duplicate utility functions
- **Large Codebase Support** - Efficiently handles 1M+ LoC
- **Claude Code Integration** - Native MCP server with 17 tools
- **Accuracy-First** - Default settings optimized for relevance
- **Configurable Models** - Swap embedding/LLM models via env vars
## Performance
- **Indexing Speed**: 300-600 files/min (with LLM summaries)
- **Query Speed**: First query ~800ms, cached queries ~50ms
- **Cache Hit Rate**: 80%+ for repeated queries
- **Codebase Scale**: Efficiently handles 1M+ lines of code
- **Memory Efficient**: LRU caching with configurable limits
- **Real-Time Progress**: 9-stage pipeline notifications
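To make the caching figures above more concrete, here is a minimal sketch of an LRU query cache. It is an illustration only, not cindex's actual implementation: the `QueryCache` class, its `max_entries` parameter, and the string-valued results are all hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class QueryCache:
    """Tiny LRU cache: the least recently used entry is evicted first."""

    def __init__(self, max_entries: int = 128) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries: "OrderedDict[str, str]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query: str) -> Optional[str]:
        if query in self._entries:
            self._entries.move_to_end(query)  # mark as most recently used
            self.hits += 1
            return self._entries[query]
        self.misses += 1
        return None

    def put(self, query: str, result: str) -> None:
        self._entries[query] = result
        self._entries.move_to_end(query)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the least recently used

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A repeated query is served from memory on `get`, which is where a hit rate like the one quoted above would come from; the size limit keeps memory bounded.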
## Supported Languages
**12 languages** with full tree-sitter parsing: TypeScript, JavaScript, Python, Java, Go, Rust, C,
C++, C#, PHP, Ruby, Kotlin. Swift and other languages use regex fallback parsing.
## Prerequisites
Before installing cindex, you need:
### 1. PostgreSQL with pgvector
PostgreSQL 16+ with pgvector extension for vector similarity search:
```bash
# Ubuntu/Debian
sudo apt install postgresql-16 postgresql-16-pgvector
# macOS
brew install postgresql@16 pgvector
# Start PostgreSQL
sudo systemctl start postgresql # Linux
brew services start postgresql@16 # macOS
```
### 2. Ollama with Models
Ollama for local LLM inference with two models:
**Embedding Model** (for vector generation):
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull embedding model (bge-m3:567m recommended)
ollama pull bge-m3:567m
```
**Coding Model** (for file summaries and analysis):
```bash
# Pull coding model (qwen2.5-coder:7b recommended)
ollama pull qwen2.5-coder:7b
# Alternative for faster indexing (lower quality):
# ollama pull qwen2.5-coder:1.5b
```
**Model Options:**
- **Embedding**: bge-m3:567m (1024 dims, 8K context) - Best accuracy
- **Summary**: qwen2.5-coder:7b (32K context) - High quality, RTX 4060+ recommended
- **Summary**: qwen2.5-coder:3b (32K context) - Balanced
- **Summary**: qwen2.5-coder:1.5b (32K context) - Fast indexing, lower quality
## Installation
### Database Setup
Create the cindex database (the schema is applied in a later step):
```bash
# Create database
createdb cindex_rag_codebase
# Initialize schema (after installing cindex - see "Initialize Database Schema" below)
```
### Install MCP Server
Add cindex to Claude Code using the CLI. You can install for personal use (user scope) or share with
your team (project scope).
#### Quick Install (Personal Use)
Install for all your projects:
```bash
claude mcp add cindex --scope user --transport stdio \
--env POSTGRES_PASSWORD="your_password" \
-- npx -y @gianged/cindex
```
#### Team Install (Shared via Git)
Install for the current project (creates `.mcp.json` in project root):
```bash
claude mcp add cindex --scope project --transport stdio \
--env POSTGRES_PASSWORD="your_password" \
-- npx -y @gianged/cindex
```
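**Note:** With project scope, `.mcp.json` is shared through Git, so do not put the real password in
the command. Set `POSTGRES_PASSWORD` as an environment variable on your system and reference it from
the config (for example as `${POSTGRES_PASSWORD}`). Never commit actual secrets to version control.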
#### Custom Configuration
Add additional environment variables using multiple `--env` flags:
```bash
claude mcp add cindex --scope user --transport stdio \
--env POSTGRES_PASSWORD="your_password" \
--env POSTGRES_HOST="localhost" \
--env POSTGRES_DB="cindex_rag_codebase" \
--env EMBEDDING_MODEL="bge-m3:567m" \
--env SUMMARY_MODEL="qwen2.5-coder:7b" \
-- npx -y @gianged/cindex
```
See [Environment Variables](#environment-variables) section below for all available configuration
options.
#### Manual Configuration (Alternative)
If you prefer to manually edit configuration files, you can add cindex to:
**User Scope** (`~/.claude.json`):
```json
{
"mcpServers": {
"cindex": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@gianged/cindex"],
"env": {
"POSTGRES_PASSWORD": "your_password"
}
}
}
}
```
**Project Scope** (`.mcp.json` in project root):
```json
{
"mcpServers": {
"cindex": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@gianged/cindex"],
"env": {
"POSTGRES_HOST": "${POSTGRES_HOST:-localhost}",
"POSTGRES_PORT": "${POSTGRES_PORT:-5432}",
"POSTGRES_DB": "${POSTGRES_DB:-cindex_rag_codebase}",
"POSTGRES_USER": "${POSTGRES_USER:-postgres}",
"POSTGRES_PASSWORD": "${POSTGRES_PASSWORD}"
}
}
}
}
```
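The `${VAR}` and `${VAR:-default}` placeholders in the project-scope config are filled in from your environment when the config is loaded. The sketch below illustrates that substitution rule only. The `expand` helper is hypothetical, and raising an error for an unset variable without a default is an assumption, not Claude Code's documented behavior.

```python
import re
from typing import Mapping

# Matches ${NAME} or ${NAME:-default}
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\}")


def expand(value: str, env: Mapping[str, str]) -> str:
    """Replace each placeholder with the variable's value or its default."""

    def replace(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # Assumption: a missing variable with no default is treated as an error.
        raise KeyError(f"{name} is not set and has no default")

    return _PLACEHOLDER.sub(replace, value)
```

With only `POSTGRES_PASSWORD` exported, `expand("${POSTGRES_PORT:-5432}", os.environ)` would fall back to `5432`, while `${POSTGRES_PASSWORD}` resolves to the exported secret without it ever appearing in `.mcp.json`.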
### Initialize Database Schema
After configuring MCP, initialize the database schema:
```bash
# Download schema file
curl -o database.sql https://raw.githubusercontent.com/gianged/cindex/main/database.sql
# Apply schema
psql cindex_rag_codebase < database.sql
```
### Start Using
1. Open Claude Code
2. Use the `index_repository` tool to index your codebase
3. Use `search_codebase` to find relevant code
## Environment Variables
All configuration is done through environment variables in your MCP config file.
### Model Configuration
| Variable | Default | Range | Description |
| -------------------------- | ------------------------ | ----------- | -------------------------------------------- |
| `EMBEDDING_MODEL` | `bge-m3:567m` | - | Ollama embedding model for vector generation |
| `EMBEDDING_DIMENSIONS` | `1024` | 1-4096 | Vector dimensions (must match model output) |
| `EMBEDDING_CONTEXT_WINDOW` | `4096` | 512-131072 | Token limit for embedding model |
| `SUMMARY_MODEL` | `qwen2.5-coder:7b` | - | Ollama model for file summaries |
| `SUMMARY_CONTEXT_WINDOW` | `4096` | 512-131072 | Token limit for summary model |
| `OLLAMA_HOST` | `http://localhost:11434` | - | Ollama API endpoint |
| `OLLAMA_TIMEOUT` | `30000` | 1000-300000 | Request timeout in milliseconds |
**Context Window Notes:**
- Default 4096 matches Ollama's default and is sufficient (cindex uses first 100 lines per file)
- Higher values = more VRAM usage + slower inference
- qwen2.5-coder:7b supports up to 32K tokens
- bge-m3:567m supports up to 8K tokens
- Increase only if you encounter issues with large files
### Database Configuration
| Variable | Default | Range | Description |
| -------------------------- | --------------------- | ------- | ------------------------------- |
| `POSTGRES_HOST` | `localhost` | - | PostgreSQL server hostname |
| `POSTGRES_PORT` | `5432` | 1-65535 | PostgreSQL server port |
| `POSTGRES_DB` | `cindex_rag_codebase` | - | Database name |
| `POSTGRES_USER` | `postgres` | - | Database user |
| `POSTGRES_PASSWORD` | _required_ | - | Database password (must be set) |
| `POSTGRES_MAX_CONNECTIONS` | `10` | 1-100 | Maximum connection pool size |
### Performance Tuning
| Variable | Default | Range | Description |
| ---------------------------- | ------- | ------- | ---------------------------------------------------- |
| `HNSW_EF_SEARCH` | `300` | 10-1000 | HNSW search quality (higher = more accurate, slower) |
| `HNSW_EF_CONSTRUCTION` | `200` | 10-1000 | HNSW index quality (higher = better index) |
| `SIMILARITY_THRESHOLD` | `0.3` | 0.0-1.0 | Minimum similarity for file-level retrieval |
| `CHUNK_SIMILARITY_THRESHOLD` | `0.2` | 0.0-1.0 | Minimum similarity for chunk-level retrieval |
| `DEDUP_THRESHOLD` | `0.92` | 0.0-1.0 | Similarity threshold for deduplication |
| `HYBRID_VECTOR_WEIGHT` | `0.7` | 0.0-1.0 | Weight for vector similarity in hybrid search |
| `HYBRID_KEYWORD_WEIGHT` | `0.3` | 0.0-1.0 | Weight for keyword (full-text) score in hybrid search |
| `IMPORT_DEPTH` | `3` | 1-10 | Maximum import chain traversal depth |
| `WORKSPACE_DEPTH` | `2` | 1-10 | Maximum workspace dependency depth |
| `SERVICE_DEPTH` | `1` | 1-10 | Maximum service dependency depth |
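To show how a threshold such as `DEDUP_THRESHOLD` can act on results, here is a minimal sketch of greedy deduplication over embeddings. It assumes cosine similarity as the metric, which this README does not state, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(embeddings: List[Sequence[float]], threshold: float = 0.92) -> List[int]:
    """Return the indices to keep; input is assumed sorted by relevance, best first."""
    kept: List[int] = []
    for i, candidate in enumerate(embeddings):
        # Keep a result only if it is not too similar to anything already kept.
        if all(cosine_similarity(candidate, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

In this sketch a higher threshold (for example `0.95`) removes fewer results, since only near-identical chunks count as duplicates; a lower one removes more.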
### Indexing Configuration
| Variable | Default | Range | Description |
| ------------------ | ------- | ---------- | ---------------------------------- |
| `MAX_FILE_SIZE` | `5000` | 100-100000 | Maximum file size in lines |
| `INCLUDE_MARKDOWN` | `false` | true/false | Include markdown files in indexing |
### Feature Flags
| Variable | Default | Range | Description |
| ------------------------------- | ------- | ---------- | --------------------------------------- |
| `ENABLE_WORKSPACE_DETECTION` | `true` | true/false | Detect monorepo workspaces |
| `ENABLE_SERVICE_DETECTION` | `true` | true/false | Detect microservices |
| `ENABLE_MULTI_REPO` | `false` | true/false | Enable multi-repository support |
| `ENABLE_API_ENDPOINT_DETECTION` | `true` | true/false | Parse API contracts (REST/GraphQL/gRPC) |
| `ENABLE_HYBRID_SEARCH` | `true` | true/false | Combine vector + full-text search |
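The tables above give each variable a default and an allowed range. As a rough sketch of what that implies, the snippet below reads a few of them with defaults and range checks. The helper names are hypothetical, and rejecting out-of-range values with an error is an assumption about behavior, not something this README specifies.

```python
from typing import Any, Dict, Mapping


def read_int(env: Mapping[str, str], name: str, default: int, low: int, high: int) -> int:
    value = int(env.get(name, default))
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside {low}-{high}")
    return value


def read_float(env: Mapping[str, str], name: str, default: float, low: float, high: float) -> float:
    value = float(env.get(name, default))
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside {low}-{high}")
    return value


def read_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    raw = env.get(name)
    if raw is None:
        return default
    lowered = raw.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"{name} must be true or false, got {raw!r}")
    return lowered == "true"


def load_settings(env: Mapping[str, str]) -> Dict[str, Any]:
    # Defaults and ranges are copied from the tables above.
    return {
        "HNSW_EF_SEARCH": read_int(env, "HNSW_EF_SEARCH", 300, 10, 1000),
        "SIMILARITY_THRESHOLD": read_float(env, "SIMILARITY_THRESHOLD", 0.3, 0.0, 1.0),
        "IMPORT_DEPTH": read_int(env, "IMPORT_DEPTH", 3, 1, 10),
        "ENABLE_HYBRID_SEARCH": read_bool(env, "ENABLE_HYBRID_SEARCH", True),
    }
```

Calling `load_settings(os.environ)` with nothing set returns the documented defaults. Note that environment values always arrive as strings, which is why the JSON examples quote numbers such as `"300"`.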
## Example Configurations
### Minimal Configuration
Only the required password:
```json
{
"mcpServers": {
"cindex": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@gianged/cindex"],
"env": {
"POSTGRES_PASSWORD": "your_password"
}
}
}
}
```
### Full Configuration
Commonly used settings with their defaults (see [Environment Variables](#environment-variables) for
the complete list):
```json
{
"mcpServers": {
"cindex": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@gianged/cindex"],
"env": {
"EMBEDDING_MODEL": "bge-m3:567m",
"EMBEDDING_DIMENSIONS": "1024",
"EMBEDDING_CONTEXT_WINDOW": "4096",
"SUMMARY_MODEL": "qwen2.5-coder:7b",
"SUMMARY_CONTEXT_WINDOW": "4096",
"OLLAMA_HOST": "http://localhost:11434",
"POSTGRES_HOST": "localhost",
"POSTGRES_PORT": "5432",
"POSTGRES_DB": "cindex_rag_codebase",
"POSTGRES_USER": "postgres",
"POSTGRES_PASSWORD": "your_password",
"HNSW_EF_SEARCH": "300",
"HNSW_EF_CONSTRUCTION": "200",
"SIMILARITY_THRESHOLD": "0.3",
"CHUNK_SIMILARITY_THRESHOLD": "0.2",
"DEDUP_THRESHOLD": "0.92"
}
}
}
}
```
### Speed-First Configuration
For faster indexing with lower quality:
```json
{
"mcpServers": {
"cindex": {
"type": "stdio",
"command": "npx",
"args": ["-y", "@gianged/cindex"],
"env": {
"POSTGRES_PASSWORD": "your_password",
"SUMMARY_MODEL": "qwen2.5-coder:1.5b",
"SUMMARY_CONTEXT_WINDOW": "4096",
"HNSW_EF_SEARCH": "100",
"HNSW_EF_CONSTRUCTION": "64",
"SIMILARITY_THRESHOLD": "0.4",
"CHUNK_SIMILARITY_THRESHOLD": "0.25",
"DEDUP_THRESHOLD": "0.95"
}
}
}
}
```
**Performance:**
- **Indexing**: 500-1000 files/min (vs 300-600 files/min default)
- **Query Time**: <500ms (vs <800ms default)
- **Relevance**: >85% in top 10 results (vs >92% default)
## Recommended Settings
### RTX 4060 / 8GB VRAM (Tested Configuration)
| Setting | Value | Notes |
| ---------------------------- | ------------------ | ---------------------------------- |
| `EMBEDDING_MODEL` | `bge-m3:567m` | Best accuracy/speed balance |
| `SUMMARY_MODEL` | `qwen2.5-coder:7b` | Good summaries, fits in VRAM |
| `EMBEDDING_CONTEXT_WINDOW` | `4096` | Default, sufficient for most files |
| `HNSW_EF_SEARCH` | `300` | High accuracy retrieval |
| `SIMILARITY_THRESHOLD` | `0.3` | File-level retrieval threshold |
| `CHUNK_SIMILARITY_THRESHOLD` | `0.2` | Chunk-level retrieval threshold |
| `DEDUP_THRESHOLD` | `0.92` | Prevent duplicate results |
### Performance Expectations
- **Indexing:** ~30 files/min (~70 chunks/min)
- **Search:** <1 second per query
- **Codebase:** Tested with 40k LoC (112 files)
## Managing Configuration
### Verify Installation
List all installed MCP servers:
```bash
claude mcp list
```
View cindex configuration:
```bash
claude mcp get cindex
```
### Update Configuration
To update environment variables, remove the server and re-add it with the new settings:
```bash
claude mcp remove cindex
claude mcp add cindex --scope user --transport stdio \
--env POSTGRES_PASSWORD="your_password" \
--env SUMMARY_MODEL="qwen2.5-coder:3b" \
-- npx -y @gianged/cindex
```
### Switch to Speed-First Mode
For faster indexing with lower quality, use these settings:
```bash
claude mcp remove cindex
claude mcp add cindex --scope user --transport stdio \
--env POSTGRES_PASSWORD="your_password" \
--env SUMMARY_MODEL="qwen2.5-coder:1.5b" \
--env HNSW_EF_SEARCH="100" \
--env HNSW_EF_CONSTRUCTION="64" \
--env SIMILARITY_THRESHOLD="0.4" \
--env CHUNK_SIMILARITY_THRESHOLD="0.25" \
--env DEDUP_THRESHOLD="0.95" \
-- npx -y @gianged/cindex
```
**Performance:**
- **Indexing**: 500-1000 files/min (vs 300-600 files/min default)
- **Query Time**: <500ms (vs <800ms default)
- **Relevance**: >85% in top 10 results (vs >92% default)
### Remove Server
```bash
claude mcp remove cindex
```
## MCP Tools
**Status: 17 of 17 tools implemented**
All tools provide structured output with syntax highlighting and comprehensive metadata.
### Core Search Tools
#### `search_codebase`
Semantic code search with multi-stage retrieval and dependency analysis.
**Parameters:**
- `query` (required) - Natural language search query
- `scope` - Search scope: `'global'`, `'repository'`, `'service'`, or `'workspace'`
- `repo_id` - Filter by repository ID
- `service_id` - Filter by service ID
- `workspace_id` - Filter by workspace ID
- `max_results` - Maximum results (1-100, default: 20)
- `similarity_threshold` - Minimum similarity (0.0-1.0, default: 0.75)
- `include_dependencies` - Include imported dependencies (default: false)
**Returns:** Markdown-formatted results with file paths, line numbers, code snippets, and relevance
scores.
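Claude Code calls this tool for you, but it can help to see the shape of the request. MCP clients invoke tools with a JSON-RPC `tools/call` message; the sketch below builds one for `search_codebase`. The `build_tool_call` helper, the query text, and the `my-api` repository ID are hypothetical values chosen for illustration.

```python
import json
from typing import Any, Dict


def build_tool_call(request_id: int, tool: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build an MCP `tools/call` request as a plain dictionary."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


request = build_tool_call(
    1,
    "search_codebase",
    {
        "query": "where are JWT tokens validated?",
        "scope": "repository",
        "repo_id": "my-api",  # hypothetical repository ID
        "max_results": 10,
        "include_dependencies": True,
    },
)
print(json.dumps(request, indent=2))
```

Only `query` is required; every other argument falls back to the default listed above when omitted.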
#### `get_file_context`
Get complete context for a specific file including callers, callees, and import chain.
**Parameters:**
- `file_path` (required) - Absolute or relative file path
- `repo_id` - Repository ID (optional if file path is unique)
- `include_callers` - Include functions that call into this file (default: true)
- `include_callees` - Include functions called by this file (default: true)
- `include_imports` - Include import chain (default: true)
- `max_depth` - Import chain depth (1-5, default: 2)
**Returns:** File summary, symbols, dependencies, and related code context.
#### `find_symbol_definition`
Locate symbol definitions and optionally show usages across the codebase.
**Parameters:**
- `symbol_name` (required) - Function, class, or variable name
- `repo_id` - Filter by repository ID
- `file_path` - Filter by file path
- `symbol_type` - Filter by type: `'function'`, `'class'`, `'variable'`, `'interface'`, etc.
- `include_usages` - Show where symbol is used (default: false)
- `max_usages` - Maximum usage results (1-100, default: 50)
**Returns:** Symbol definitions with file paths, line numbers, signatures, and optional usage
locations.
### Repository Management Tools
#### `index_repository`
Index or re-index a repository with progress notifications and multi-project support.
**Parameters:**
- `repo_path` (required) - Absolute path to repository root
- `repo_id` - Repository identifier (default: directory name)
- `repo_type` - Repository type: `'monolithic'`, `'microservice'`, `'monorepo'`, `'library'`,
`'reference'`, or `'documentation'`
- `force_reindex` - Force full re-index (default: false, uses incremental indexing)
- `detect_workspaces` - Detect monorepo workspaces (default: true)
- `detect_services` - Detect microservices (default: true)
- `detect_api_endpoints` - Parse API contracts (default: true)
- `service_config` - Manual service configuration (optional)
- `version` - Repository version for reference repos (e.g., `'v10.3.0'`)
- `metadata` - Additional metadata (e.g., `{ upstream_url: '...' }`)
**Returns:** Indexing statistics including files indexed, chunks created, symbols extracted,
workspaces/services detected, and timing information.
#### `delete_repository`
Delete one or more indexed repositories and all associated data.
**Parameters:**
- `repo_ids` (required) - Array of repository IDs to delete
**Returns:** Deletion confirmation with statistics (files, chunks, symbols, workspaces, services
removed).
#### `list_indexed_repos`
List all indexed repositories with optional metadata, workspace counts, and service counts.
**Parameters:**
- `include_metadata` - Include repository metadata (default: true)
- `include_workspace_count` - Include workspace count for monorepos (default: true)
- `include_service_count` - Include service count for microservices (default: true)
- `repo_type_filter` - Filter by repository type
**Returns:** List of repositories with IDs, types, file counts, last indexed time, and optional
metadata.
### Monorepo Tools
#### `list_workspaces`
List all workspaces in indexed repositories for monorepo support.
**Parameters:**
- `repo_id` - Filter by repository ID (optional)
- `include_dependencies` - Include dependency information (default: false)
- `include_metadata` - Include package.json metadata (default: false)
**Returns:** List of workspaces with package names, paths, file counts, and optional dependencies.
#### `get_workspace_context`
Get full context for a workspace including dependencies and dependents.
**Parameters:**
- `workspace_id` - Workspace ID (use `list_workspaces` to find)
- `package_name` - Package name (alternative to workspace_id)
- `repo_id` - Repository ID (required if using package_name)
- `include_dependencies` - Include workspace dependencies (default: true)
- `include_dependents` - Include workspaces that depend on this one (default: true)
- `dependency_depth` - Dependency tree depth (1-5, default: 2)
**Returns:** Workspace metadata, dependency tree, dependent workspaces, and file list.
#### `find_cross_workspace_usages`
Find workspace package usages across the monorepo.
**Parameters:**
- `workspace_id` - Source workspace ID
- `package_name` - Source package name (alternative to workspace_id)
- `symbol_name` - Specific symbol to track (optional)
- `include_indirect` - Include indirect usages (default: false)
- `max_depth` - Dependency chain depth (1-5, default: 2)
**Returns:** List of workspaces using the target package/symbol with file locations.
### Microservice Tools
#### `list_services`
List all services across indexed repositories for microservice support.
**Parameters:**
- `repo_id` - Filter by repository ID (optional)
- `service_type` - Filter by type: `'docker'`, `'serverless'`, `'mobile'` (optional)
- `include_dependencies` - Include service dependencies (default: false)
- `include_api_endpoints` - Include API endpoint counts (default: false)
**Returns:** List of services with IDs, names, types, file counts, and optional API information.
#### `get_service_context`
Get full context for a service including API contracts and dependencies.
**Parameters:**
- `service_id` - Service ID (use `list_services` to find)
- `service_name` - Service name (alternative to service_id)
- `repo_id` - Repository ID (required if using service_name)
- `include_dependencies` - Include service dependencies (default: true)
- `include_dependents` - Include services that depend on this one (default: true)
- `include_api_contracts` - Include API endpoint definitions (default: true)
- `dependency_depth` - Dependency tree depth (1-5, default: 1)
**Returns:** Service metadata, API contracts (REST/GraphQL/gRPC), dependency graph, and file list.
#### `find_cross_service_calls`
Find inter-service API calls across microservices.
**Parameters:**
- `source_service_id` - Source service ID (optional)
- `target_service_id` - Target service ID (optional)
- `endpoint_pattern` - Endpoint regex pattern (e.g., `/api/users/.*`, optional)
- `include_reverse` - Also show calls in reverse direction (default: false)
**Returns:** List of inter-service API calls with endpoints, HTTP methods, and call counts.
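Because `endpoint_pattern` is a regular expression, a small sketch may help when choosing one. This README does not say whether cindex anchors the pattern or matches it anywhere in the path, so the example assumes matching from the start of the path (`re.match`). The endpoint paths are made up.

```python
import re
from typing import Iterable, List


def filter_endpoints(endpoints: Iterable[str], pattern: str) -> List[str]:
    """Keep the endpoint paths that match the pattern from their first character."""
    compiled = re.compile(pattern)
    return [path for path in endpoints if compiled.match(path)]


endpoints = [
    "/api/users/42",
    "/api/users/42/orders",
    "/api/orders/7",
    "/v2/api/users/1",
]
matches = filter_endpoints(endpoints, r"/api/users/.*")
print(matches)
```

Under this assumption `/v2/api/users/1` is not selected because the pattern must match at the start; a pattern such as `.*/api/users/.*` would include it.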
### API Contract Tools
#### `search_api_contracts`
Search API endpoints across services with semantic understanding.
**Parameters:**
- `query` (required) - API search query (e.g., "user authentication endpoint")
- `api_types` - Filter by type: `['rest', 'graphql', 'grpc']` (default: all)
- `service_filter` - Filter by service IDs (optional)
- `repo_filter` - Filter by repository IDs (optional)
- `include_deprecated` - Include deprecated endpoints (default: false)
- `max_results` - Maximum results (1-100, default: 20)
- `similarity_threshold` - Minimum similarity (0.0-1.0, default: 0.70)
**Returns:** API endpoints with paths, HTTP methods, service names, implementation files, and
similarity scores.
### Reference & Documentation Tools
Tools for searching reference materials including markdown documentation (syntax references,
Context7-fetched docs) AND reference repository code (indexed frameworks/libraries).
#### `index_documentation`
Index markdown files for documentation search. Works with explicit paths only.
**Parameters:**
- `paths` (required) - Array of file or directory paths to index (e.g.,
`['syntax.md', '/docs/libraries/']`)
- `doc_id` - Document identifier (default: derived from path)
- `tags` - Tags for filtering (e.g., `['typescript', 'react']`)
- `force_reindex` - Force re-index even if unchanged (default: false)
**Returns:** Indexing statistics including files indexed, sections created, code blocks extracted,
and timing.
**Workflow:**
1. Fetch documentation (e.g., from Context7)
2. Save to markdown file
3. Index with `index_documentation`
4. Search with `search_references`
#### `search_references`
Search reference materials including markdown documentation AND reference repository code. Combines
both sources for comprehensive reference search.
**Parameters:**
- `query` (required) - Natural language search query
- `doc_ids` - Filter by document IDs (optional)
- `tags` - Filter by documentation tags (optional)
- `include_docs` - Include markdown documentation results (default: true)
- `include_code` - Include reference repository code results (default: true)
- `max_results` - Maximum results per source (1-50, default: 10)
- `include_code_blocks` - Include code blocks from documentation (default: true)
- `similarity_threshold` - Minimum similarity (0.0-1.0, default: 0.65)
**Returns:** Combined results from both documentation chunks and reference repository code, with
heading breadcrumbs, content snippets, code blocks, file paths, and relevance scores.
**Note:** Reference repositories are indexed using `index_repository` with `repo_type: 'reference'`.
They are excluded from `search_codebase` by default and only searchable via `search_references`.
#### `list_documentation`
List all indexed documentation with metadata.
**Parameters:**
- `doc_ids` - Filter by document IDs (optional)
- `tags` - Filter by tags (optional)
**Returns:** List of indexed documents with file counts, section counts, code block counts, and
indexed timestamps.
#### `delete_documentation`
Delete indexed documentation by document ID.
**Parameters:**
- `doc_ids` (required) - Array of document IDs to delete
**Returns:** Deletion confirmation with chunks and files removed.
---
See [docs/overview.md](./docs/overview.md) for complete tool documentation including
multi-project/monorepo/microservice architecture details.
## Architecture
### Hybrid Search
Combines vector similarity search with PostgreSQL full-text search (tsvector/ts_rank_cd) for
improved natural language query handling:
```
hybrid_score = (0.7 * vector_similarity) + (0.3 * keyword_score)
```
- **Vector search** - Semantic understanding via embeddings
- **Keyword search** - Exact term matching via PostgreSQL full-text search
- Configurable weights via `HYBRID_VECTOR_WEIGHT` and `HYBRID_KEYWORD_WEIGHT`
- Disable with `ENABLE_HYBRID_SEARCH=false` to use vector-only search
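The weighted sum above can be worked through with a small sketch. It assumes both scores are already normalized to the 0-1 range, which this README does not state, and the file names and score values are invented for illustration.

```python
from typing import List, Tuple


def hybrid_score(
    vector_similarity: float,
    keyword_score: float,
    vector_weight: float = 0.7,
    keyword_weight: float = 0.3,
) -> float:
    """Weighted sum of the two scores, using the default weights from this README."""
    return vector_weight * vector_similarity + keyword_weight * keyword_score


def rank(candidates: List[Tuple[str, float, float]]) -> List[Tuple[str, float]]:
    """Score (path, vector_similarity, keyword_score) tuples and sort best first."""
    scored = [(path, hybrid_score(vec, kw)) for path, vec, kw in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranked = rank([
    ("auth/jwt.ts", 0.80, 0.10),    # semantically close, few exact term matches
    ("auth/login.ts", 0.60, 0.90),  # less similar, but strong keyword match
])
```

Here `auth/login.ts` scores 0.7 × 0.60 + 0.3 × 0.90 = 0.69 and overtakes `auth/jwt.ts` at 0.7 × 0.80 + 0.3 × 0.10 = 0.59, which is the effect the keyword component is meant to have on queries that contain exact identifiers.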
### Multi-Stage Retrieval
1. **File-Level** - Find relevant files via summary embeddings + full-text search
2. **Chunk-Level** - Locate specific code chunks (functions/classes)
3. **Symbol Resolution** - Resolve imported symbols and dependencies
4. **Import Expansion** - Build dependency graph (max 3 levels)
5. **Deduplication** - Remove redundant code from results
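Stage 4 can be pictured as a depth-limited walk over the import graph. The sketch below is one plausible way to do it, using breadth-first search; the `expand_imports` function and the dictionary-based graph are hypothetical, not cindex's internal data structures.

```python
from collections import deque
from typing import Dict, List


def expand_imports(graph: Dict[str, List[str]], start: str, max_depth: int = 3) -> Dict[str, int]:
    """Breadth-first walk; returns each reachable file with its depth from `start`."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if depths[current] == max_depth:
            continue  # do not follow imports beyond the depth limit
        for imported in graph.get(current, []):
            if imported not in depths:  # also guards against circular imports
                depths[imported] = depths[current] + 1
                queue.append(imported)
    return depths
```

With a chain `a → b → c → d → e` and the default limit of 3, the walk returns `a` through `d` and stops before `e`. The limit bounds how much dependency context is pulled into a result.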
### Indexing Pipeline
1. File discovery (respects .gitignore)
2. Tree-sitter parsing (with regex fallback)
3. Semantic chunking (functions, classes, blocks)
4. LLM-based file summaries (configurable model)
5. Embedding generation (configurable model)
6. Full-text search vector generation (tsvector)
7. PostgreSQL + pgvector storage
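Step 3 splits files along syntax boundaries rather than at fixed line counts. cindex does this with tree-sitter; to stay dependency-free, the sketch below substitutes Python's built-in `ast` module, so it handles Python source only and extracts just top-level functions and classes. The `Chunk` type and `chunk_python_source` function are hypothetical.

```python
import ast
from typing import List, NamedTuple


class Chunk(NamedTuple):
    kind: str
    name: str
    start_line: int
    end_line: int


def chunk_python_source(source: str) -> List[Chunk]:
    """Return one chunk per top-level function or class, with 1-based line ranges."""
    chunks: List[Chunk] = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # imports, assignments, etc. are skipped in this sketch
        chunks.append(Chunk(kind, node.name, node.lineno, node.end_lineno))
    return chunks
```

Each chunk keeps its line range, which is what lets search results point back to exact locations in a file.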
## Performance Characteristics
### Accuracy-First Mode (Default)
- **Indexing**: 300-600 files/min
- **Query Time**: <800ms
- **Relevance**: >92% in top 10 results
- **Context Noise**: <2%
### Speed-First Mode
- **Indexing**: 500-1000 files/min
- **Query Time**: <500ms
- **Relevance**: >85% in top 10 results
## System Requirements
- **Node.js** 22+ (for MCP server)
- **PostgreSQL** 16+ with pgvector extension
- **Ollama** with models installed
- **Disk Space**: ~1GB per 100k LoC indexed
- **RAM**: 8GB minimum (16GB+ recommended for large codebases)
- **GPU**: Optional but recommended (RTX 3060+ for qwen2.5-coder:7b)
## Troubleshooting
### "Vector dimension mismatch"
Set `EMBEDDING_DIMENSIONS` in your MCP config to your embedding model's output size (1024 for
`bge-m3:567m`), then change the vector dimensions in `database.sql` to the same value.
### "Connection refused" to PostgreSQL
Check `POSTGRES_HOST` and `POSTGRES_PORT` in MCP config. Verify PostgreSQL is running:
```bash
sudo systemctl status postgresql # Linux
brew services list # macOS
```
### "Model not found" in Ollama
Pull the required models:
```bash
ollama pull bge-m3:567m
ollama pull qwen2.5-coder:7b
```
Verify models are available:
```bash
ollama list
```
### Slow indexing
- Use a smaller summary model: `qwen2.5-coder:1.5b` instead of `7b`
- Reduce `HNSW_EF_CONSTRUCTION` to `64`
- Enable incremental indexing (default)
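Incremental indexing saves time because unchanged files are skipped. This README does not say how cindex detects changes; one common approach, assumed here purely for illustration, is to compare content hashes against those stored at the last index run. Both function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def file_hash(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def files_to_reindex(current: Dict[str, bytes], stored_hashes: Dict[str, str]) -> List[str]:
    """Return paths that are new or whose content hash differs from the stored one."""
    return sorted(
        path
        for path, content in current.items()
        if stored_hashes.get(path) != file_hash(content)
    )
```

Setting `force_reindex` on `index_repository` bypasses this kind of check and processes every file again.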
### Low accuracy results
- Set `HNSW_EF_SEARCH` to `300-400` (the default is `300`; speed-first mode uses `100`)
- Raise `SIMILARITY_THRESHOLD` to `0.4-0.5` for stricter file matching
- Raise `CHUNK_SIMILARITY_THRESHOLD` to `0.3-0.4` for stricter chunk matching
- Use a better summary model: `qwen2.5-coder:3b` or `7b`
- Lower `DEDUP_THRESHOLD` to `0.90-0.92`
## Documentation
See [docs/overview.md](./docs/overview.md) for detailed documentation including:
- Complete architecture details
- Database schema
- Configuration reference
- Implementation guide
- Performance tuning
## Development
```bash
git clone https://github.com/gianged/cindex.git
cd cindex
npm install
npm run build
npm test
```
## Implementation Status
- Phase 1 (100%) - Database schema & type system
- Phase 2 (100%) - File discovery, parsing, chunking, workspace/service detection
- Phase 3 (100%) - Embeddings, summaries, API parsing, 12-language support, Docker/serverless/mobile
detection
- Phase 4 (100%) - Multi-stage retrieval pipeline (9-stage)
- Phase 5 (100%) - MCP tools (17 of 17 implemented)
- Phase 6 (100%) - Incremental indexing, optimization, testing
**Overall: 100% complete**
## License
MIT
## Author
**gianged** - Yup, it's me
## Contributing
Contributions welcome! Please open an issue or PR on GitHub.
## Acknowledgments
Built with:
- [Model Context Protocol](https://modelcontextprotocol.io/) by Anthropic
- [pgvector](https://github.com/pgvector/pgvector) for vector search
- [Ollama](https://ollama.ai/) for local LLM inference
- [tree-sitter](https://tree-sitter.github.io/) for code parsing