embedocs-mcp

Transform any GitHub repository into searchable vector embeddings. MCP server with smart indexing, voyage-context-3 embeddings, and semantic search for Claude/Cursor IDEs.

github.com/romiluz13/EmbeDocs-MCP


```
███████╗███╗   ███╗██████╗ ███████╗██████╗  ██████╗  ██████╗███████╗
██╔════╝████╗ ████║██╔══██╗██╔════╝██╔══██╗██╔═══██╗██╔════╝██╔════╝
█████╗  ██╔████╔██║██████╔╝█████╗  ██║  ██║██║   ██║██║     ███████╗
██╔══╝  ██║╚██╔╝██║██╔══██╗██╔══╝  ██║  ██║██║   ██║██║     ╚════██║
███████╗██║ ╚═╝ ██║██████╔╝███████╗██████╔╝╚██████╔╝╚██████╗███████║
╚══════╝╚═╝     ╚═╝╚═════╝ ╚══════╝╚═════╝  ╚═════╝  ╚═════╝╚══════╝
```

<div align="center">

# 🧠 **AI That Actually Knows Your Docs**

[![npm version](https://img.shields.io/npm/v/embedocs-mcp.svg)](https://www.npmjs.com/package/embedocs-mcp)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Node.js Version](https://img.shields.io/node/v/embedocs-mcp.svg)](https://nodejs.org)
[![Website](https://img.shields.io/badge/Website-embedocs.site-blue)](https://embedocs.site/)

**Stop googling outdated Stack Overflow. Give your AI access to the LATEST documentation.**

*AI knowledge cutoffs are killing developer productivity*

[🌐 Website](https://embedocs.site/) • [🚀 Quick Start](#-quick-start) • [⚡ Power of Semantic Search](#-the-semantic-search-advantage) • [🎯 Examples](#-real-world-examples) • [📖 Setup](#-setup-guide)

</div>

---

## 🤕 **The Documentation Hell Every Developer Lives In**

Your AI assistant has **knowledge cutoffs** - it doesn't know about:

```
❌ New MongoDB 8.0 features (AI knows up to 7.0)
❌ Latest React 19 APIs (AI stuck on 18)
❌ Fresh TypeScript 5.6 syntax (AI knows 5.2)
❌ Your company's internal APIs (AI has no clue)
❌ Updated AWS services (AI knowledge is 6 months old)
```

**So you waste HOURS:**

- 🔍 Googling for current docs
- 📖 Reading through endless documentation pages
- 🤔 Figuring out what's changed since AI's training
- 😫 Getting outdated or wrong answers from AI

---

## 🧠 **EmbeDocs: AI With Current Knowledge**

```
┌──────────────────┐    ┌─────────────────┐    ┌──────────────────┐
│   Latest Docs    │───▶│    EmbeDocs     │───▶│     Smart AI     │
│  📚 MongoDB 8.0  │    │  🧠 Semantic    │    │  💡 Current      │
│  ⚛️ React 19     │    │  🔍 Search      │    │     Answers      │
│  🔷 TypeScript   │    │  ⚡️ Instant     │    │                  │
│  ☁️ AWS Latest   │    │     Context     │    │                  │
└──────────────────┘    └─────────────────┘    └──────────────────┘
```

**Give your AI CURRENT, ACCURATE documentation knowledge in minutes**

✅ **After EmbeDocs**:

```
✅ You: "How do I use MongoDB 8.0's new queryable encryption?"
🤖 AI: [Finds latest docs, explains step-by-step with current syntax]

✅ You: "What's new in React 19 server components?"
🤖 AI: [Returns exact React 19 documentation with examples]

✅ You: "How does TypeScript 5.6 handle the new import assertions?"
🤖 AI: [Shows current TypeScript docs with working code samples]
```

---

## ⚡ **The Semantic Search Advantage**

### 🔍 **Beyond Keyword Matching**

Traditional search finds words. **EmbeDocs understands MEANING.**

```bash
# You search: "slow database"
# Regular search finds: documents containing "slow" AND "database"
# EmbeDocs semantic search finds: performance optimization, indexing strategies,
# query bottlenecks, N+1 problems, connection pooling - ALL related concepts!
```

### 🧠 **Powered by voyage-context-3**

- **1024-dimensional embeddings** - Captures deep semantic relationships
- **32K token context** - Understands entire documentation pages
- **Code-optimized** - Specifically trained on programming content
- **Multi-language** - Works across JavaScript, Python, Go, Rust, Java, C++

### 🎯 **Smart Search Modes**

1. **Hybrid Search** (Default): Combines semantic understanding + keyword precision
2. **MMR Search** (Advanced): Maximum diversity - finds ALL related concepts, not just similar ones
3. **Vector Search** (Pure): 100% meaning-based, perfect for conceptual questions

---

## 🎯 **Real-World Examples**

### **👨‍💻 Keep Up With Fast-Moving Projects**

```bash
# Add repos via web interface
embedocs setup

# Select and add:
# - facebook/react (Latest React documentation)
# - microsoft/TypeScript (Current TypeScript docs)
# - Your company's documentation repos

# Then index them all:
embedocs index

# Now your AI knows CURRENT features:
"What's new in React 19?"
"How do TypeScript 5.6 decorators work?"
"Show me the latest Suspense patterns"
```

### **🏢 Company Internal Documentation**

```bash
# Add your company repos through the web interface
embedocs setup

# Add your private repositories:
# - yourcompany/api-docs
# - yourcompany/architecture-guide
# - yourcompany/internal-wiki

# Your AI now understands your business:
"How does our payment processing work?"
"What are our microservice communication patterns?"
"Where do we handle user authentication?"
```

### **📚 Master New Technologies**

```bash
# Use the web interface to add cutting-edge projects
embedocs setup

# Add repositories like:
# - vercel/next.js
# - openai/openai-python
# - langchain-ai/langchain

# Learn from the source:
"How does Next.js App Router actually work?"
"What's the best way to use OpenAI's new API?"
"Show me advanced LangChain patterns"
```

---

## 🚀 **Quick Start** *(5 Simple Steps)*

### **Step 1: Install**

```bash
npm install -g embedocs-mcp
```

### **Step 2: First Run** *(Auto-launches setup wizard!)*

```bash
embedocs
# ✨ Automatically opens setup wizard on first run!
```

Or manually run setup anytime:

```bash
embedocs setup
```

### **🎨 Beautiful Web Interface**

<div align="center">
<img src="https://embedocs.site/pics/Screenshot%202025-08-19%20at%2015.29.52.png" alt="EmbeDocs Setup Wizard" width="800">
<br>
<em>Modern, intuitive setup wizard with a stunning 2025 UI design</em>
</div>

**🌐 Opens a stunning web interface in your browser!**

- Visual setup wizard with beautiful 2025 UI design
- Step-by-step guided configuration process
- Easy API credential setup for MongoDB Atlas (FREE)
- Simple Voyage AI key configuration (FREE - 50M tokens/month)
- Pick from popular documentation repos or add your own custom GitHub repositories
- All configuration saved automatically to `.env`
- Real-time connection testing and validation

### **Step 3: Add & Index Your Documentation**

**Option A: Using Web Interface** (Recommended ✨)

```bash
embedocs setup # or just 'embedocs' on first run
```

- Select from popular repos, add your own GitHub repositories, or switch to the "Official Website" tab and paste a docs root URL (e.g., https://www.mongodb.com/docs/).
- Click "Validate & Add Website" to ingest the entire site (sitemap + discover).
- Click "Start Indexing" to begin
- All selected repos are saved for future CLI use

**Option B: Command Line** (After adding repos via web)

```bash
# After adding repos through web interface:
embedocs index    # Indexes all your selected repositories
embedocs update   # Updates only changed files
embedocs rebuild  # Force re-index everything
```

**Important**: You must first add repositories using the web interface (`embedocs setup`). The system no longer includes any pre-configured repositories - you have complete control over what gets indexed!
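Since the wizard saves its configuration to `.env`, a quick sanity check before running `embedocs index` might look like the sketch below. It is only an illustration and not part of the EmbeDocs CLI: the helper name `findConfigProblems` is hypothetical, and the only details taken from this README are the two variable names, `MONGODB_URI` and `VOYAGE_API_KEY`.

```typescript
// Hypothetical pre-flight check; EmbeDocs itself does not ship this helper.
type Env = Record<string, string | undefined>;

// Variable names as used in the MCP configuration and .env example in this README.
const REQUIRED_KEYS = ["MONGODB_URI", "VOYAGE_API_KEY"] as const;

function findConfigProblems(env: Env): string[] {
  const problems: string[] = [];

  // Every required key must be present and non-blank.
  for (const key of REQUIRED_KEYS) {
    const value = env[key];
    if (value === undefined || value.trim() === "") {
      problems.push(`${key} is missing or empty`);
    }
  }

  // MongoDB connection strings use the mongodb:// or mongodb+srv:// scheme.
  const uri = (env["MONGODB_URI"] ?? "").trim();
  if (uri !== "" && !/^mongodb(\+srv)?:\/\//.test(uri)) {
    problems.push("MONGODB_URI should start with mongodb:// or mongodb+srv://");
  }

  return problems;
}

// Example: a valid URI but a blank API key.
const exampleProblems = findConfigProblems({
  MONGODB_URI: "mongodb+srv://username:password@cluster.mongodb.net/",
  VOYAGE_API_KEY: "",
});
console.log(exampleProblems); // → ["VOYAGE_API_KEY is missing or empty"]
```

In a real script you would pass the parsed contents of `.env` (or `process.env`) instead of the literal object used here.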
### **Step 4: Connect to Your AI**

**Cursor IDE** (Recommended), in `.cursor/settings.json`:

```json
{
  "mcpServers": {
    "embedocs": {
      "command": "npx",
      "args": ["embedocs-mcp"],
      "env": {
        "MONGODB_URI": "your-mongodb-connection-string",
        "VOYAGE_API_KEY": "your-voyage-api-key"
      }
    }
  }
}
```

**Claude Code** (Same configuration):

```json
{
  "mcpServers": {
    "embedocs": {
      "command": "npx",
      "args": ["embedocs-mcp"],
      "env": {
        "MONGODB_URI": "your-mongodb-connection-string",
        "VOYAGE_API_KEY": "your-voyage-api-key"
      }
    }
  }
}
```

### **Step 5: Ask Current Questions!**

Your AI now has access to the LATEST documentation! 🎉

---

## 🔧 **What EmbeDocs Actually Does**

### 🎯 **Core Function**

**Indexes documentation repositories** and makes them **semantically searchable** by your AI through the Model Context Protocol (MCP).

### 🧠 **Smart Processing**

- **Semantic Chunking**: Intelligently splits docs into meaningful pieces (100-2500 chars)
- **voyage-context-3 Embeddings**: Creates 1024-dimensional vectors that understand code context
- **Automatic Indexing**: MongoDB Atlas vector + text search indexes created automatically
- **Git-Aware Updates**: Only processes changed files on updates

### 🔍 **Semantic Search Power**

- **Understands Intent**: "slow queries" finds performance docs, indexing guides, optimization tips
- **Code Context**: Knows that "authentication" relates to JWT, OAuth, sessions, middleware
- **Cross-Language**: Finds similar patterns across JavaScript, Python, Go implementations
- **Lightning Fast**: <100ms search responses with 7.5x performance optimization

### 🔌 **Universal AI Integration**

- **MCP Protocol**: Works with Claude Desktop, Cursor IDE, any MCP-compatible AI
- **Four Powerful Tools**: Primary hybrid search, advanced MMR search, full context fetcher, system status
- **Production Ready**: Handles 14,880+ documents with 0 failures

---

## 📖 **Setup Requirements** *(All FREE!)*

### **1. MongoDB Atlas** (Free 512MB tier)

- [Sign up here](https://cloud.mongodb.com)
- Create cluster → Copy connection string
- Add `0.0.0.0/0` to Network Access (allows EmbeDocs to connect)

### **2. Voyage AI** (Free 50M tokens/month)

- [Get API key here](https://voyageai.com)
- Industry-leading code embeddings
- 50M tokens = process 1000+ documentation repositories

### **3. Node.js 18+**

- [Download here](https://nodejs.org)

---

## 📊 **Why Semantic Search Matters**

### **Traditional Keyword Search vs EmbeDocs Semantic Search**

| Query | Keyword Search | EmbeDocs Semantic Search |
|-------|----------------|-------------------------|
| "slow database" | Finds docs with "slow" + "database" | Finds: performance tuning, indexing strategies, query optimization, connection pooling, N+1 problems |
| "user login" | Finds "user" + "login" exact matches | Finds: authentication, JWT tokens, OAuth flows, session management, middleware, security |
| "API errors" | Finds "API" + "errors" | Finds: error handling, HTTP status codes, exception patterns, debugging, logging, monitoring |

### **Real Performance Gains**

- **Search Speed**: <100ms average response time
- **Accuracy**: 92% relevance score with MMR diversity
- **Coverage**: Finds 3-5x more relevant results than keyword search
- **Context**: Understands relationships between concepts

---

## 🛠️ **Advanced Usage**

### **Index Multiple Documentation Sources**

```bash
# Frontend ecosystem
embedocs index https://github.com/facebook/react
embedocs index https://github.com/vuejs/core
embedocs index https://github.com/angular/angular

# Backend frameworks
embedocs index https://github.com/expressjs/express
embedocs index https://github.com/nestjs/nest
embedocs index https://github.com/django/django

# Cloud & DevOps
embedocs index https://github.com/aws/aws-cli
embedocs index https://github.com/kubernetes/kubernetes
embedocs index https://github.com/docker/cli
```

### **Monitor Indexing Progress**

```bash
# 🌐 Opens beautiful web dashboard at http://localhost:3333
embedocs progress
```

**Features:**

- Real-time progress bars and statistics
- "Keep Mac Awake" button (prevents sleep during long indexing)
- Shows all repositories being indexed
- Auto-refreshes every 5 seconds
- Estimated time remaining

```bash
# Quick CLI status check (no browser)
embedocs status
```

### **Smart Search Workflow with Full Context**

**CRITICAL: Search returns CHUNKS, not complete files!**

Always use the two-step workflow for complete understanding:

```bash
# Step 1: Search for relevant files
"How does the chatbot generate responses?"
→ mongodb-search finds: generate-response.js (partial chunk showing ~500 chars)

# Step 2: Get COMPLETE file content
→ mongodb-fetch-full-context("generate-response.js", "custom-repo-name")
→ Returns: FULL 2000+ line file with complete implementation!
```

**The Four Tools:**

1. **mongodb-search**: RRF hybrid search - best for general queries
2. **mongodb-mmr-search**: Maximum Marginal Relevance - best for diverse results
3. **mongodb-fetch-full-context**: Gets COMPLETE file content after search
4. **mongodb-status**: System health and statistics

**Smart Search Strategies:**

```bash
# For broad understanding - use hybrid search + fetch full context
"How does React handle state management?"
→ Search finds relevant files → Fetch complete implementations

# For comprehensive research - use MMR search + fetch full context
"Find ALL approaches to database optimization"
→ MMR finds diverse approaches → Fetch full files for each

# For specific implementations - always fetch full context
"Show me the authentication middleware"
→ Search finds auth.js → Fetch complete middleware code
```

---

## 🏗️ **Architecture: How It Works**

```
GitHub Documentation
        ↓
Git Clone & Parse
        ↓
Semantic Chunking (100-2500 chars)
        ↓
voyage-context-3 Embeddings (1024 dimensions)
        ↓
MongoDB Atlas (Vector + Text Indexes)
        ↓
MCP Protocol Tools
        ↓
Your AI Assistant
```

**Built on Production Infrastructure**:

- 🚀 **MongoDB Atlas**: Auto-creates vector search indexes, handles 50K+ documents on free tier
- 🧭 **Voyage AI**: State-of-the-art code embeddings, specifically trained for programming content
- 🤖 **MCP Protocol**: Standard integration works with any MCP-compatible AI assistant

---

## 💰 **Pricing: 100% FREE for Most Developers**

- **MongoDB Atlas**: 512MB free tier (handles 50,000+ documents)
- **Voyage AI**: 50M tokens/month free (index 1000+ repositories)
- **EmbeDocs**: Open source MIT license
- **Total Cost**: $0/month for typical usage

**Enterprise Scale**: Both services offer paid tiers for massive documentation sets.
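The `mongodb-search` tool is described above as an RRF hybrid search, where RRF stands for Reciprocal Rank Fusion. This README does not show how EmbeDocs implements it, so the sketch below only illustrates the general technique: each result list contributes `1 / (k + rank)` to a document's score, and documents are ordered by the summed score. The function name, the file names, and the constant `k = 60` (a commonly used default) are assumptions for illustration, not values taken from EmbeDocs.

```typescript
// Illustrative Reciprocal Rank Fusion; not the EmbeDocs implementation.
interface FusedResult {
  id: string;
  score: number;
}

// Each ranking is a list of document ids ordered best-first (ids unique per list).
// k dampens the influence of top ranks; 60 is a conventional choice, assumed here.
function reciprocalRankFusion(rankings: string[][], k: number = 60): FusedResult[] {
  const scores = new Map<string, number>();

  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }

  // Highest fused score first.
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical result lists from a vector search and a keyword (text) search.
const vectorHits = ["indexing.md", "query-plans.md", "pooling.md"];
const textHits = ["query-plans.md", "slow-queries.md", "indexing.md"];

const fused = reciprocalRankFusion([vectorHits, textHits]);
console.log(fused.map((r) => r.id));
// → ["query-plans.md", "indexing.md", "slow-queries.md", "pooling.md"]
```

Documents that appear in both lists rise to the top, which is why a hybrid of semantic and keyword results tends to suit general queries.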
---

## 🌟 **Why EmbeDocs vs Alternatives**

### **vs Googling Documentation**

- ❌ Google: Outdated results, SEO spam, wrong versions
- ✅ EmbeDocs: Always current, semantic understanding, AI integration

### **vs AI with Knowledge Cutoffs**

- ❌ Standard AI: 6-month old knowledge, makes up answers
- ✅ EmbeDocs: Real-time current docs, factual responses

### **vs Manual Documentation Reading**

- ❌ Manual: Hours of reading, finding specific answers
- ✅ EmbeDocs: Instant semantic search, AI explains in context

### **vs Other Documentation Tools**

- ❌ Others: Keyword search only, complex setup, expensive
- ✅ EmbeDocs: Semantic understanding, 60-second setup, free tier

---

## 🎯 **Perfect For**

### **📚 Documentation-Heavy Projects**

- MongoDB, PostgreSQL, Redis documentation
- AWS, GCP, Azure cloud service docs
- React, Vue, Angular framework documentation
- Company internal API documentation

### **⚡ Fast-Moving Technologies**

- AI/ML libraries (OpenAI, LangChain, Transformers)
- New language features (TypeScript, JavaScript, Python)
- Framework updates (Next.js, Django, Spring)
- New database features (MongoDB, PostgreSQL)

### **🏢 Enterprise Internal Docs**

- Architecture decision records
- API specifications and guides
- Deployment and operational procedures
- Company coding standards and best practices

---

## 🔧 **Troubleshooting**

### **Setup Issues**

- **"embedocs: command not found"**: Run `npm install -g embedocs-mcp` with sudo if needed
- **Web interface doesn't open**: Navigate manually to http://localhost:3333
- **MongoDB connection fails**: Make sure to add `0.0.0.0/0` to Network Access in Atlas

### **Environment Configuration**

If the web setup doesn't work, create a `.env` file manually:

```bash
# Create .env in your project directory
MONGODB_URI=mongodb+srv://username:password@cluster.mongodb.net/
VOYAGE_API_KEY=pa-your-api-key-here
```

### **Indexing Issues**

- **"No repositories configured"**: Run `embedocs setup` to add repositories first
- **Rate limit errors**: Voyage AI free tier is limited to 2000 RPM - indexing automatically handles this
- **"0 chunks" for some files**: Normal for very small files
- **Process seems stuck**: Check `embedocs progress` for real-time status

### **Repository Management**

- All repositories are stored in `.repos/metadata.json`
- No hardcoded/default repositories - you control what gets indexed
- Add repos via web interface: `embedocs setup`
- Remove repos by editing `.repos/metadata.json` or using web interface

## 🤝 **Contributing**

Help make AI smarter about documentation!

```bash
git clone https://github.com/romiluz13/EmbeDocs-MCP.git
cd EmbeDocs-MCP
npm install
npm run build
npm test
```

**Areas for Contribution**:

- Support for more documentation formats (GitBook, Notion, etc.)
- Better chunking strategies for different content types
- Additional embedding models and search algorithms
- UI improvements for the setup wizard

---

## 📝 **License**

MIT © [Rom Iluz](https://github.com/romiluz13)

---

<div align="center">

### **🎯 Stop Fighting Outdated AI Knowledge**

```bash
npm install -g embedocs-mcp && embedocs
# Just run 'embedocs' - it auto-launches setup on first run!
```

**Give your AI access to current, accurate documentation in 60 seconds**

**[🌐 Website](https://embedocs.site/)** • **[⭐ Star on GitHub](https://github.com/romiluz13/EmbeDocs-MCP)** • **[📦 npm Package](https://www.npmjs.com/package/embedocs-mcp)** • **[🐛 Report Issues](https://github.com/romiluz13/EmbeDocs-MCP/issues)**

*"AI knowledge cutoffs are killing developer productivity. EmbeDocs fixes that."*

</div>