@gianged/cindex

# cindex **Semantic code search and context retrieval for large codebases** A Model Context Protocol (MCP) server that provides intelligent code search and context retrieval for Claude Code. Handles 1M+ lines of code with accuracy-first design. ## Features - **Semantic Search** - Vector embeddings for intelligent code discovery - **Hybrid Search** - Combines vector similarity with PostgreSQL full-text search for better natural language query handling - **9-Stage Retrieval Pipeline** - Scope filtering → query → files → chunks → symbols → imports → APIs → dedup → assembly - **Multi-Project Support** - Monorepo, microservices, and reference repository indexing - **Scope Filtering** - Global, repository, service, and boundary-aware search modes - **API Contract Search** - Semantic search for REST/GraphQL/gRPC endpoints - **Query Caching** - LRU cache with 80%+ hit rate (cached queries ~50ms) - **Progress Notifications** - Real-time 9-stage pipeline tracking - **Incremental Indexing** - Only re-index changed files - **Import Chain Analysis** - Automatic dependency resolution - **Deduplication** - Remove duplicate utility functions - **Large Codebase Support** - Efficiently handles 1M+ LoC - **Claude Code Integration** - Native MCP server with 17 tools - **Accuracy-First** - Default settings optimized for relevance - **Configurable Models** - Swap embedding/LLM models via env vars ## Performance - **Indexing Speed**: 300-600 files/min (with LLM summaries) - **Query Speed**: First query ~800ms, cached queries ~50ms - **Cache Hit Rate**: 80%+ for repeated queries - **Codebase Scale**: Efficiently handles 1M+ lines of code - **Memory Efficient**: LRU caching with configurable limits - **Real-Time Progress**: 9-stage pipeline notifications ## Supported Languages **12 languages** with full tree-sitter parsing: TypeScript, JavaScript, Python, Java, Go, Rust, C, C++, C#, PHP, Ruby, Kotlin. Swift and other languages use regex fallback parsing. ## Prerequisites Before installing cindex, you need: ### 1. PostgreSQL with pgvector PostgreSQL 16+ with pgvector extension for vector similarity search: ```bash # Ubuntu/Debian sudo apt install postgresql-16 postgresql-16-pgvector # macOS brew install postgresql@16 pgvector # Start PostgreSQL sudo systemctl start postgresql # Linux brew services start postgresql@16 # macOS ``` ### 2. Ollama with Models Ollama for local LLM inference with two models: **Embedding Model** (for vector generation): ```bash # Install Ollama curl https://ollama.ai/install.sh | sh # Pull embedding model (bge-m3:567m recommended) ollama pull bge-m3:567m ``` **Coding Model** (for file summaries and analysis): ```bash # Pull coding model (qwen2.5-coder:7b recommended) ollama pull qwen2.5-coder:7b # Alternative for faster indexing (lower quality): # ollama pull qwen2.5-coder:1.5b ``` **Model Options:** - **Embedding**: bge-m3:567m (1024 dims, 8K context) - Best accuracy - **Summary**: qwen2.5-coder:7b (32K context) - High quality, RTX 4060+ recommended - **Summary**: qwen2.5-coder:3b (32K context) - Balanced - **Summary**: qwen2.5-coder:1.5b (32K context) - Fast indexing, lower quality ## Installation ### Database Setup Create and initialize the cindex database: ```bash # Create database createdb cindex_rag_codebase # Initialize schema (after installing cindex - see next section) ``` ### Install MCP Server Add cindex to Claude Code using the CLI. You can install for personal use (user scope) or share with your team (project scope). #### Quick Install (Personal Use) Install for all your projects: ```bash claude mcp add cindex --scope user --transport stdio \ --env POSTGRES_PASSWORD="your_password" \ -- npx -y @gianged/cindex ``` #### Team Install (Shared via Git) Install for the current project (creates `.mcp.json` in project root): ```bash claude mcp add cindex --scope project --transport stdio \ --env POSTGRES_PASSWORD="your_password" \ -- npx -y @gianged/cindex ``` **Note:** For project scope, set `POSTGRES_PASSWORD` as an environment variable on your system and reference it in the command. Never commit actual secrets to version control. #### Custom Configuration Add additional environment variables using multiple `--env` flags: ```bash claude mcp add cindex --scope user --transport stdio \ --env POSTGRES_PASSWORD="your_password" \ --env POSTGRES_HOST="localhost" \ --env POSTGRES_DB="cindex_rag_codebase" \ --env EMBEDDING_MODEL="bge-m3:567m" \ --env SUMMARY_MODEL="qwen2.5-coder:7b" \ -- npx -y @gianged/cindex ``` See [Environment Variables](#environment-variables) section below for all available configuration options. #### Manual Configuration (Alternative) If you prefer to manually edit configuration files, you can add cindex to: **User Scope** (`~/.claude.json`): ```json { "mcpServers": { "cindex": { "type": "stdio", "command": "npx", "args": ["-y", "@gianged/cindex"], "env": { "POSTGRES_PASSWORD": "your_password" } } } } ``` **Project Scope** (`.mcp.json` in project root): ```json { "mcpServers": { "cindex": { "type": "stdio", "command": "npx", "args": ["-y", "@gianged/cindex"], "env": { "POSTGRES_HOST": "${POSTGRES_HOST:-localhost}", "POSTGRES_PORT": "${POSTGRES_PORT:-5432}", "POSTGRES_DB": "${POSTGRES_DB:-cindex_rag_codebase}", "POSTGRES_USER": "${POSTGRES_USER:-postgres}", "POSTGRES_PASSWORD": "${POSTGRES_PASSWORD}" } } } } ``` ### Initialize Database Schema After configuring MCP, initialize the database schema: ```bash # Download schema file curl -o database.sql https://raw.githubusercontent.com/gianged/cindex/main/database.sql # Apply schema psql cindex_rag_codebase < database.sql ``` ### Start Using 1. Open Claude Code 2. Use the `index_repository` tool to index your codebase 3. Use `search_codebase` to find relevant code ## Environment Variables All configuration is done through environment variables in your MCP config file. ### Model Configuration | Variable | Default | Range | Description | | -------------------------- | ------------------------ | ----------- | -------------------------------------------- | | `EMBEDDING_MODEL` | `bge-m3:567m` | - | Ollama embedding model for vector generation | | `EMBEDDING_DIMENSIONS` | `1024` | 1-4096 | Vector dimensions (must match model output) | | `EMBEDDING_CONTEXT_WINDOW` | `4096` | 512-131072 | Token limit for embedding model | | `SUMMARY_MODEL` | `qwen2.5-coder:7b` | - | Ollama model for file summaries | | `SUMMARY_CONTEXT_WINDOW` | `4096` | 512-131072 | Token limit for summary model | | `OLLAMA_HOST` | `http://localhost:11434` | - | Ollama API endpoint | | `OLLAMA_TIMEOUT` | `30000` | 1000-300000 | Request timeout in milliseconds | **Context Window Notes:** - Default 4096 matches Ollama's default and is sufficient (cindex uses first 100 lines per file) - Higher values = more VRAM usage + slower inference - qwen2.5-coder:7b supports up to 32K tokens - bge-m3:567m supports up to 8K tokens - Increase only if you encounter issues with large files ### Database Configuration | Variable | Default | Range | Description | | -------------------------- | --------------------- | ------- | ------------------------------- | | `POSTGRES_HOST` | `localhost` | - | PostgreSQL server hostname | | `POSTGRES_PORT` | `5432` | 1-65535 | PostgreSQL server port | | `POSTGRES_DB` | `cindex_rag_codebase` | - | Database name | | `POSTGRES_USER` | `postgres` | - | Database user | | `POSTGRES_PASSWORD` | _required_ | - | Database password (must be set) | | `POSTGRES_MAX_CONNECTIONS` | `10` | 1-100 | Maximum connection pool size | ### Performance Tuning | Variable | Default | Range | Description | | ---------------------------- | ------- | ------- | ---------------------------------------------------- | | `HNSW_EF_SEARCH` | `300` | 10-1000 | HNSW search quality (higher = more accurate, slower) | | `HNSW_EF_CONSTRUCTION` | `200` | 10-1000 | HNSW index quality (higher = better index) | | `SIMILARITY_THRESHOLD` | `0.3` | 0.0-1.0 | Minimum similarity for file-level retrieval | | `CHUNK_SIMILARITY_THRESHOLD` | `0.2` | 0.0-1.0 | Minimum similarity for chunk-level retrieval | | `DEDUP_THRESHOLD` | `0.92` | 0.0-1.0 | Similarity threshold for deduplication | | `HYBRID_VECTOR_WEIGHT` | `0.7` | 0.0-1.0 | Weight for vector similarity in hybrid search | | `HYBRID_KEYWORD_WEIGHT` | `0.3` | 0.0-1.0 | Weight for keyword (BM25) score in hybrid search | | `IMPORT_DEPTH` | `3` | 1-10 | Maximum import chain traversal depth | | `WORKSPACE_DEPTH` | `2` | 1-10 | Maximum workspace dependency depth | | `SERVICE_DEPTH` | `1` | 1-10 | Maximum service dependency depth | ### Indexing Configuration | Variable | Default | Range | Description | | ------------------ | ------- | ---------- | ---------------------------------- | | `MAX_FILE_SIZE` | `5000` | 100-100000 | Maximum file size in lines | | `INCLUDE_MARKDOWN` | `false` | true/false | Include markdown files in indexing | ### Feature Flags | Variable | Default | Range | Description | | ------------------------------- | ------- | ---------- | --------------------------------------- | | `ENABLE_WORKSPACE_DETECTION` | `true` | true/false | Detect monorepo workspaces | | `ENABLE_SERVICE_DETECTION` | `true` | true/false | Detect microservices | | `ENABLE_MULTI_REPO` | `false` | true/false | Enable multi-repository support | | `ENABLE_API_ENDPOINT_DETECTION` | `true` | true/false | Parse API contracts (REST/GraphQL/gRPC) | | `ENABLE_HYBRID_SEARCH` | `true` | true/false | Combine vector + full-text search | ## Example Configurations ### Minimal Configuration Only the required password: ```json { "mcpServers": { "cindex": { "type": "stdio", "command": "npx", "args": ["-y", "@gianged/cindex"], "env": { "POSTGRES_PASSWORD": "your_password" } } } } ``` ### Full Configuration All available settings with defaults shown: ```json { "mcpServers": { "cindex": { "type": "stdio", "command": "npx", "args": ["-y", "@gianged/cindex"], "env": { "EMBEDDING_MODEL": "bge-m3:567m", "EMBEDDING_DIMENSIONS": "1024", "EMBEDDING_CONTEXT_WINDOW": "4096", "SUMMARY_MODEL": "qwen2.5-coder:7b", "SUMMARY_CONTEXT_WINDOW": "4096", "OLLAMA_HOST": "http://localhost:11434", "POSTGRES_HOST": "localhost", "POSTGRES_PORT": "5432", "POSTGRES_DB": "cindex_rag_codebase", "POSTGRES_USER": "postgres", "POSTGRES_PASSWORD": "your_password", "HNSW_EF_SEARCH": "300", "HNSW_EF_CONSTRUCTION": "200", "SIMILARITY_THRESHOLD": "0.3", "CHUNK_SIMILARITY_THRESHOLD": "0.2", "DEDUP_THRESHOLD": "0.92" } } } } ``` ### Speed-First Configuration For faster indexing with lower quality: ```json { "mcpServers": { "cindex": { "type": "stdio", "command": "npx", "args": ["-y", "@gianged/cindex"], "env": { "POSTGRES_PASSWORD": "your_password", "SUMMARY_MODEL": "qwen2.5-coder:1.5b", "SUMMARY_CONTEXT_WINDOW": "4096", "HNSW_EF_SEARCH": "100", "HNSW_EF_CONSTRUCTION": "64", "SIMILARITY_THRESHOLD": "0.4", "CHUNK_SIMILARITY_THRESHOLD": "0.25", "DEDUP_THRESHOLD": "0.95" } } } } ``` **Performance:** - **Indexing**: 500-1000 files/min (vs 300-600 files/min default) - **Query Time**: <500ms (vs <800ms default) - **Relevance**: >85% in top 10 results (vs >92% default) ## Recommended Settings ### RTX 4060 / 8GB VRAM (Tested Configuration) | Setting | Value | Notes | | ---------------------------- | ------------------ | ---------------------------------- | | `EMBEDDING_MODEL` | `bge-m3:567m` | Best accuracy/speed balance | | `SUMMARY_MODEL` | `qwen2.5-coder:7b` | Good summaries, fits in VRAM | | `EMBEDDING_CONTEXT_WINDOW` | `4096` | Default, sufficient for most files | | `HNSW_EF_SEARCH` | `300` | High accuracy retrieval | | `SIMILARITY_THRESHOLD` | `0.3` | File-level retrieval threshold | | `CHUNK_SIMILARITY_THRESHOLD` | `0.2` | Chunk-level retrieval threshold | | `DEDUP_THRESHOLD` | `0.92` | Prevent duplicate results | ### Performance Expectations - **Indexing:** ~30 files/min (~70 chunks/min) - **Search:** <1 second per query - **Codebase:** Tested with 40k LoC (112 files) ## Managing Configuration ### Verify Installation List all installed MCP servers: ```bash claude mcp list ``` View cindex configuration: ```bash claude mcp get cindex ``` ### Update Configuration To update environment variables, remove and re-add with new settings: ```bash claude mcp remove cindex claude mcp add cindex --scope user --transport stdio \ --env POSTGRES_PASSWORD="your_password" \ --env SUMMARY_MODEL="qwen2.5-coder:3b" \ -- npx -y @gianged/cindex ``` ### Switch to Speed-First Mode For faster indexing with lower quality, use these settings: ```bash claude mcp remove cindex claude mcp add cindex --scope user --transport stdio \ --env POSTGRES_PASSWORD="your_password" \ --env SUMMARY_MODEL="qwen2.5-coder:1.5b" \ --env HNSW_EF_SEARCH="100" \ --env HNSW_EF_CONSTRUCTION="64" \ --env SIMILARITY_THRESHOLD="0.4" \ --env CHUNK_SIMILARITY_THRESHOLD="0.25" \ --env DEDUP_THRESHOLD="0.95" \ -- npx -y @gianged/cindex ``` **Performance:** - **Indexing**: 500-1000 files/min (vs 300-600 files/min default) - **Query Time**: <500ms (vs <800ms default) - **Relevance**: >85% in top 10 results (vs >92% default) ### Remove Server ```bash claude mcp remove cindex ``` ## MCP Tools **Status: 17 of 17 tools implemented** All tools provide structured output with syntax highlighting and comprehensive metadata. ### Core Search Tools #### `search_codebase` Semantic code search with multi-stage retrieval and dependency analysis. **Parameters:** - `query` (required) - Natural language search query - `scope` - Search scope: `'global'`, `'repository'`, `'service'`, or `'workspace'` - `repo_id` - Filter by repository ID - `service_id` - Filter by service ID - `workspace_id` - Filter by workspace ID - `max_results` - Maximum results (1-100, default: 20) - `similarity_threshold` - Minimum similarity (0.0-1.0, default: 0.75) - `include_dependencies` - Include imported dependencies (default: false) **Returns:** Markdown-formatted results with file paths, line numbers, code snippets, and relevance scores. #### `get_file_context` Get complete context for a specific file including callers, callees, and import chain. **Parameters:** - `file_path` (required) - Absolute or relative file path - `repo_id` - Repository ID (optional if file path is unique) - `include_callers` - Include functions that call this file (default: true) - `include_callees` - Include functions called by this file (default: true) - `include_imports` - Include import chain (default: true) - `max_depth` - Import chain depth (1-5, default: 2) **Returns:** File summary, symbols, dependencies, and related code context. #### `find_symbol_definition` Locate symbol definitions and optionally show usages across the codebase. **Parameters:** - `symbol_name` (required) - Function, class, or variable name - `repo_id` - Filter by repository ID - `file_path` - Filter by file path - `symbol_type` - Filter by type: `'function'`, `'class'`, `'variable'`, `'interface'`, etc. - `include_usages` - Show where symbol is used (default: false) - `max_usages` - Maximum usage results (1-100, default: 50) **Returns:** Symbol definitions with file paths, line numbers, signatures, and optional usage locations. ### Repository Management Tools #### `index_repository` Index or re-index a repository with progress notifications and multi-project support. **Parameters:** - `repo_path` (required) - Absolute path to repository root - `repo_id` - Repository identifier (default: directory name) - `repo_type` - Repository type: `'monolithic'`, `'microservice'`, `'monorepo'`, `'library'`, `'reference'`, or `'documentation'` - `force_reindex` - Force full re-index (default: false, uses incremental indexing) - `detect_workspaces` - Detect monorepo workspaces (default: true) - `detect_services` - Detect microservices (default: true) - `detect_api_endpoints` - Parse API contracts (default: true) - `service_config` - Manual service configuration (optional) - `version` - Repository version for reference repos (e.g., `'v10.3.0'`) - `metadata` - Additional metadata (e.g., `{ upstream_url: '...' }`) **Returns:** Indexing statistics including files indexed, chunks created, symbols extracted, workspaces/services detected, and timing information. #### `delete_repository` Delete one or more indexed repositories and all associated data. **Parameters:** - `repo_ids` (required) - Array of repository IDs to delete **Returns:** Deletion confirmation with statistics (files, chunks, symbols, workspaces, services removed). #### `list_indexed_repos` List all indexed repositories with optional metadata, workspace counts, and service counts. **Parameters:** - `include_metadata` - Include repository metadata (default: true) - `include_workspace_count` - Include workspace count for monorepos (default: true) - `include_service_count` - Include service count for microservices (default: true) - `repo_type_filter` - Filter by repository type **Returns:** List of repositories with IDs, types, file counts, last indexed time, and optional metadata. ### Monorepo Tools #### `list_workspaces` List all workspaces in indexed repositories for monorepo support. **Parameters:** - `repo_id` - Filter by repository ID (optional) - `include_dependencies` - Include dependency information (default: false) - `include_metadata` - Include package.json metadata (default: false) **Returns:** List of workspaces with package names, paths, file counts, and optional dependencies. #### `get_workspace_context` Get full context for a workspace including dependencies and dependents. **Parameters:** - `workspace_id` - Workspace ID (use `list_workspaces` to find) - `package_name` - Package name (alternative to workspace_id) - `repo_id` - Repository ID (required if using package_name) - `include_dependencies` - Include workspace dependencies (default: true) - `include_dependents` - Include workspaces that depend on this one (default: true) - `dependency_depth` - Dependency tree depth (1-5, default: 2) **Returns:** Workspace metadata, dependency tree, dependent workspaces, and file list. #### `find_cross_workspace_usages` Find workspace package usages across the monorepo. **Parameters:** - `workspace_id` - Source workspace ID - `package_name` - Source package name (alternative to workspace_id) - `symbol_name` - Specific symbol to track (optional) - `include_indirect` - Include indirect usages (default: false) - `max_depth` - Dependency chain depth (1-5, default: 2) **Returns:** List of workspaces using the target package/symbol with file locations. ### Microservice Tools #### `list_services` List all services across indexed repositories for microservice support. **Parameters:** - `repo_id` - Filter by repository ID (optional) - `service_type` - Filter by type: `'docker'`, `'serverless'`, `'mobile'` (optional) - `include_dependencies` - Include service dependencies (default: false) - `include_api_endpoints` - Include API endpoint counts (default: false) **Returns:** List of services with IDs, names, types, file counts, and optional API information. #### `get_service_context` Get full context for a service including API contracts and dependencies. **Parameters:** - `service_id` - Service ID (use `list_services` to find) - `service_name` - Service name (alternative to service_id) - `repo_id` - Repository ID (required if using service_name) - `include_dependencies` - Include service dependencies (default: true) - `include_dependents` - Include services that depend on this one (default: true) - `include_api_contracts` - Include API endpoint definitions (default: true) - `dependency_depth` - Dependency tree depth (1-5, default: 1) **Returns:** Service metadata, API contracts (REST/GraphQL/gRPC), dependency graph, and file list. #### `find_cross_service_calls` Find inter-service API calls across microservices. **Parameters:** - `source_service_id` - Source service ID (optional) - `target_service_id` - Target service ID (optional) - `endpoint_pattern` - Endpoint regex pattern (e.g., `/api/users/.*`, optional) - `include_reverse` - Also show calls in reverse direction (default: false) **Returns:** List of inter-service API calls with endpoints, HTTP methods, and call counts. ### API Contract Tools #### `search_api_contracts` Search API endpoints across services with semantic understanding. **Parameters:** - `query` (required) - API search query (e.g., "user authentication endpoint") - `api_types` - Filter by type: `['rest', 'graphql', 'grpc']` (default: all) - `service_filter` - Filter by service IDs (optional) - `repo_filter` - Filter by repository IDs (optional) - `include_deprecated` - Include deprecated endpoints (default: false) - `max_results` - Maximum results (1-100, default: 20) - `similarity_threshold` - Minimum similarity (0.0-1.0, default: 0.70) **Returns:** API endpoints with paths, HTTP methods, service names, implementation files, and similarity scores. ### Reference & Documentation Tools Tools for searching reference materials including markdown documentation (syntax references, Context7-fetched docs) AND reference repository code (indexed frameworks/libraries). #### `index_documentation` Index markdown files for documentation search. Works with explicit paths only. **Parameters:** - `paths` (required) - Array of file or directory paths to index (e.g., `['syntax.md', '/docs/libraries/']`) - `doc_id` - Document identifier (default: derived from path) - `tags` - Tags for filtering (e.g., `['typescript', 'react']`) - `force_reindex` - Force re-index even if unchanged (default: false) **Returns:** Indexing statistics including files indexed, sections created, code blocks extracted, and timing. **Workflow:** 1. Fetch documentation (e.g., from Context7) 2. Save to markdown file 3. Index with `index_documentation` 4. Search with `search_references` #### `search_references` Search reference materials including markdown documentation AND reference repository code. Combines both sources for comprehensive reference search. **Parameters:** - `query` (required) - Natural language search query - `doc_ids` - Filter by document IDs (optional) - `tags` - Filter by documentation tags (optional) - `include_docs` - Include markdown documentation results (default: true) - `include_code` - Include reference repository code results (default: true) - `max_results` - Maximum results per source (1-50, default: 10) - `include_code_blocks` - Include code blocks from documentation (default: true) - `similarity_threshold` - Minimum similarity (0.0-1.0, default: 0.65) **Returns:** Combined results from both documentation chunks and reference repository code, with heading breadcrumbs, content snippets, code blocks, file paths, and relevance scores. **Note:** Reference repositories are indexed using `index_repository` with `repo_type: 'reference'`. They are excluded from `search_codebase` by default and only searchable via `search_references`. #### `list_documentation` List all indexed documentation with metadata. **Parameters:** - `doc_ids` - Filter by document IDs (optional) - `tags` - Filter by tags (optional) **Returns:** List of indexed documents with file counts, section counts, code block counts, and indexed timestamps. #### `delete_documentation` Delete indexed documentation by document ID. **Parameters:** - `doc_ids` (required) - Array of document IDs to delete **Returns:** Deletion confirmation with chunks and files removed. --- See [docs/overview.md](./docs/overview.md) for complete tool documentation including multi-project/monorepo/microservice architecture details. ## Architecture ### Hybrid Search Combines vector similarity search with PostgreSQL full-text search (tsvector/ts_rank_cd) for improved natural language query handling: ``` hybrid_score = (0.7 * vector_similarity) + (0.3 * keyword_score) ``` - **Vector search** - Semantic understanding via embeddings - **Keyword search** - Exact term matching via PostgreSQL full-text search - Configurable weights via `HYBRID_VECTOR_WEIGHT` and `HYBRID_KEYWORD_WEIGHT` - Disable with `ENABLE_HYBRID_SEARCH=false` to use vector-only search ### Multi-Stage Retrieval 1. **File-Level** - Find relevant files via summary embeddings + full-text search 2. **Chunk-Level** - Locate specific code chunks (functions/classes) 3. **Symbol Resolution** - Resolve imported symbols and dependencies 4. **Import Expansion** - Build dependency graph (max 3 levels) 5. **Deduplication** - Remove redundant code from results ### Indexing Pipeline 1. File discovery (respects .gitignore) 2. Tree-sitter parsing (with regex fallback) 3. Semantic chunking (functions, classes, blocks) 4. LLM-based file summaries (configurable model) 5. Embedding generation (configurable model) 6. Full-text search vector generation (tsvector) 7. PostgreSQL + pgvector storage ## Performance Characteristics ### Accuracy-First Mode (Default) - **Indexing**: 300-600 files/min - **Query Time**: <800ms - **Relevance**: >92% in top 10 results - **Context Noise**: <2% ### Speed-First Mode - **Indexing**: 500-1000 files/min - **Query Time**: <500ms - **Relevance**: >85% in top 10 results ## System Requirements - **Node.js** 22+ (for MCP server) - **PostgreSQL** 16+ with pgvector extension - **Ollama** with models installed - **Disk Space**: ~1GB per 100k LoC indexed - **RAM**: 8GB minimum (16GB+ recommended for large codebases) - **GPU**: Optional but recommended (RTX 3060+ for qwen2.5-coder:7b) ## Troubleshooting ### "Vector dimension mismatch" Update `EMBEDDING_DIMENSIONS` in MCP config to match your model, then update vector dimensions in `database.sql`. ### "Connection refused" to PostgreSQL Check `POSTGRES_HOST` and `POSTGRES_PORT` in MCP config. Verify PostgreSQL is running: ```bash sudo systemctl status postgresql # Linux brew services list # macOS ``` ### "Model not found" in Ollama Pull the required models: ```bash ollama pull bge-m3:567m ollama pull qwen2.5-coder:7b ``` Verify models are available: ```bash ollama list ``` ### Slow indexing - Use smaller summary model: `qwen2.5-coder:1.5b` instead of `7b` - Reduce `HNSW_EF_CONSTRUCTION` to `64` - Enable incremental indexing (default) ### Low accuracy results - Increase `HNSW_EF_SEARCH` to `300-400` - Raise `SIMILARITY_THRESHOLD` to `0.4-0.5` for stricter file matching - Raise `CHUNK_SIMILARITY_THRESHOLD` to `0.3-0.4` for stricter chunk matching - Use better summary model: `qwen2.5-coder:3b` or `7b` - Lower `DEDUP_THRESHOLD` to `0.90-0.92` ## Documentation See [docs/overview.md](./docs/overview.md) for detailed documentation including: - Complete architecture details - Database schema - Configuration reference - Implementation guide - Performance tuning ## Development ```bash git clone https://github.com/gianged/cindex.git cd cindex npm install npm run build npm test ``` ## Implementation Status - Phase 1 (100%) - Database schema & type system - Phase 2 (100%) - File discovery, parsing, chunking, workspace/service detection - Phase 3 (100%) - Embeddings, summaries, API parsing, 12-language support, Docker/serverless/mobile detection - Phase 4 (100%) - Multi-stage retrieval pipeline (9-stage) - Phase 5 (100%) - MCP tools (17 of 17 implemented) - Phase 6 (100%) - Incremental indexing, optimization, testing **Overall: 100% complete** ## License MIT ## Author **gianged** - Yup, it's me ## Contributing Contributions welcome! Please open an issue or PR on GitHub. ## Acknowledgments Built with: - [Model Context Protocol](https://modelcontextprotocol.io/) by Anthropic - [pgvector](https://github.com/pgvector/pgvector) for vector search - [Ollama](https://ollama.ai/) for local LLM inference - [tree-sitter](https://tree-sitter.github.io/) for code parsing