# MCP Documentation Server (@vjlanguage/mcp-vj-docs)

A Model Context Protocol (MCP) server for documentation crawling, indexing, and retrieval. This package provides tools for crawling websites, storing and indexing content, and searching through that content using TF-IDF based search. The search results are optimized for large language models.

## Quick Start Guide for Dify Stream Mode

### Installation

```bash
# Install with npm
npm install @vjlanguage/mcp-vj-docs -g

# Or install with yarn
yarn global add @vjlanguage/mcp-vj-docs
```

### Command Line Usage

#### 1. Start the Server (Foreground Mode)

```bash
# Start the server with the default configuration
vjdoc-cli stream-start

# Start the server on a specific port
vjdoc-cli stream-start -p 3000

# Specify the database path and the TF-IDF directory
vjdoc-cli stream-start -d ~/mydata/docs.json --tfidf-dir ~/mydata/tfidf
```

#### 2. Background Service Mode

```bash
# Start the server in background mode
vjdoc-cli stream-serve

# Stop the background server
vjdoc-cli stream-stop
```

#### 3. Command Line Options

All commands support the following options:

- `-p, --port <number>` - Set server port (default: 3000)
- `-d, --db-path <path>` - Set database file path (default: ~/mcpdata/docs.json)
- `--tfidf-dir <path>` - Set TF-IDF files directory (default: ~/mcpdata/tfidf)

### Environment Variable Configuration

You can also configure the server using environment variables:

```bash
# Basic Configuration
VJDOC_DB_PATH=~/mcpdata/docs.json
VJDOC_TFIDF_FILES_DIR=~/mcpdata/tfidf
VJDOC_LOG_LEVEL=info  # debug, info, warn, error
VJDOC_LOG_TO_FILE=true
VJDOC_LOG_DIR=~/mcpdata/logs

# Crawler Configuration
VJDOC_MAX_DEPTH=4
VJDOC_MAX_PAGES=100
FIRECRAWL_API_KEY=your_api_key_here
FIRECRAWL_API_URL=http://localhost:5002

# Transport Configuration
ENABLE_STDIO_TRANSPORT=true
ENABLE_STREAMABLE_HTTP=true
STREAMABLE_HTTP_PORT=3000
```

### Dify Stream Mode Configuration

```json
{
  "mcpServers": {
    "mcp-vjdoc": {
      "transport": "streamable_http",
      "url": "http://192.168.2.9:3000/mcp"
    }
  }
}
```

> - Replace 192.168.2.9 with the address of your machine's network interface
> - Replace 3000 with the port you use when executing the serve command

### Using as an MCP Server

To use this server with MCP clients (such as Dify, Claude, or other applications that support MCP), add the following to your MCP configuration:

```json
{
  "vj-docs": {
    "command": "node",
    "args": [
      "/path/to/mcp-vj-docs/dist/index.js"
    ],
    "env": {
      "FIRECRAWL_API_KEY": "YOUR_API_KEY_HERE",
      "VJDOC_MAX_DEPTH": "4",
      "VJDOC_MAX_PAGES": "100",
      "VJDOC_DB_PATH": "~/mcpdata/docs.json",
      "VJDOC_LOG_DIR": "~/mcpdata/logs",
      "VJDOC_LOG_TO_FILE": "true",
      "VJDOC_LOG_LEVEL": "debug",
      "FIRECRAWL_API_URL": "http://localhost:5002",
      "VJDOC_TFIDF_FILES_DIR": "~/mcpdata/tfidf"
    },
    "disabled": false,
    "timeout": 3600
  }
}
```

### Common Use Cases

#### Crawl Websites and Index Content

Use the MCP tool `vjdoc_crawl` to crawl websites:

```json
{
  "name": "vjdoc_crawl",
  "arguments": {
    "url": "https://example.com/docs",
    "maxDepth": 3,
    "maxPages": 50,
    "includePatterns": ["*/docs/*"],
    "excludePatterns": ["*/api/*"],
    "defaultCategory": "Documentation"
  }
}
```

#### Search Documents

Use the MCP tool `vjdoc_search` to search documents:

```json
{
  "name": "vjdoc_search",
  "arguments": {
    "query": "how to configure the server",
    "limit": 5,
    "filters": {
      "categories": ["Documentation", "Tutorial"]
    }
  }
}
```

### Troubleshooting

#### Common Issues

1. **Command Not Found**
   - Ensure the package is installed globally, or use npx to run the command: `npx vjdoc-cli stream-start`
2. **Permission Errors**
   - Ensure the data directory (~/mcpdata) exists and is writable
   - Create the directory with `mkdir -p ~/mcpdata` (avoid `sudo` here, since it would leave the directory owned by root)
3. **Cannot Connect to Server**
   - Check if the port is already in use: `lsof -i :3000`
   - Ensure the firewall is not blocking the connection
4. **Crawler API Errors**
   - Verify that FIRECRAWL_API_KEY is correct
   - Check that FIRECRAWL_API_URL is reachable

---

# MCP Documentation Server (@vjlanguage/mcp-vj-docs)

A Model Context Protocol (MCP) server for documentation crawling, indexing, and retrieval. This package provides tools for crawling websites, storing and indexing the content, and searching through that content using TF-IDF based search. The search results are optimized for large language models.
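
To make the TF-IDF based search mentioned above more concrete, here is a minimal sketch of the general technique. It is not the package's actual implementation: the function names (`tokenize`, `buildIndex`, `search`), the smoothed IDF formula, and the one-token-per-character handling of Chinese text are all assumptions chosen for illustration.

```javascript
// Minimal TF-IDF sketch. Illustrative only: not the package's real code.

// Lowercase the text and split it into word tokens. Each CJK character becomes
// its own token (an assumed, simplistic stand-in for real Chinese segmentation).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+|[\u4e00-\u9fff]/g) || [];
}

// Build per-document term frequencies and corpus-wide document frequencies.
function buildIndex(docs) {
  const docFreq = new Map();
  const entries = docs.map((doc) => {
    const tokens = tokenize(doc.content);
    const termFreq = new Map();
    for (const t of tokens) termFreq.set(t, (termFreq.get(t) || 0) + 1);
    for (const t of termFreq.keys()) docFreq.set(t, (docFreq.get(t) || 0) + 1);
    return { doc, termFreq, length: tokens.length };
  });
  return { entries, docFreq, size: docs.length };
}

// Score each document as the sum of tf * idf over the query terms,
// drop documents with no matching term, and return the best `limit` results.
function search(index, query, limit = 3) {
  const terms = tokenize(query);
  return index.entries
    .map(({ doc, termFreq, length }) => {
      let score = 0;
      for (const t of terms) {
        const tf = length ? (termFreq.get(t) || 0) / length : 0;
        const df = index.docFreq.get(t) || 0;
        const idf = Math.log((1 + index.size) / (1 + df)) + 1; // smoothed IDF
        score += tf * idf;
      }
      return { title: doc.title, score };
    })
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

// Hypothetical two-document corpus.
const index = buildIndex([
  { title: "Auth guide", content: "Use JWT tokens for authentication." },
  { title: "Crawler guide", content: "Configure crawl depth and page limits." },
]);
console.log(search(index, "authentication tokens"));
```

Terms that occur in few documents receive a higher IDF weight, so rare query words dominate the ranking. The real server additionally weights titles, URLs, and metadata fields, which this sketch leaves out.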
## Features

- **Documentation Crawling**: Crawl documentation from websites using Firecrawl
- **Content Processing**: Convert HTML to Markdown and extract relevant content
- **Storage & Indexing**: Store documents using lowdb with TF-IDF based indexing
- **LLM-Optimized Search**: Search for documentation with aggregated results optimized for large language models
- **Full Content Return**: No character length limits on search results
- **Content-First Results**: Prioritizes content over URLs in search results
- **Smart Deduplication**: Removes duplicate content and returns only the top 3 most relevant results
- **AI-Optimized Format**: Results structured specifically for AI consumption and code generation
- **Complete Document Context**: Returns full document content via the `fullDocument` field for comprehensive context
- **Enhanced Metadata Search**: Analyzes title, URL, and all metadata fields with field-specific weighting
- **Multi-dimensional Scoring**: Evaluates document relevance across content, metadata, URL, and title
- **Custom Corpus Management**: Add your own text corpus files for inclusion in search results
- **Multiple Format Support**: Supports TXT, Markdown, and PDF files
- **Recursive Directory Scanning**: Automatically discovers files in nested subdirectories
- **Automatic Indexing**: Files in the corpus directory are automatically indexed and searchable
- **MCP Integration**: Exposes tools for crawling and searching via the Model Context Protocol
- **Path Handling**: Supports tilde (~) expansion in file paths
- **Server Modes**: Supports both SSE (Server-Sent Events) and stdio transports
- **Multilingual Support**: Enhanced handling for Chinese queries with specialized tokenization

## Changelog

### 2026-04-07 (v0.1.73)

- **Recursive Directory Scanning**: Corpus files in nested subdirectories are now automatically discovered, indexed, and searchable
  - `loadTfidfFiles()` and `getCorpusDocuments()` both scan recursively
  - Document metadata now includes `relativePath` for the nested file location

### 2025-04-22 (v0.1.61)

- **Enhanced Metadata Search**: The search algorithm now analyzes multiple dimensions:
  - **Title Analysis**: Improved title matching with weighted scoring for full and partial matches
  - **URL Analysis**: Extracts and scores URL segments for keyword relevance
  - **Metadata Field Analysis**: Individually processes important fields like keywords, description, and category
  - **Field-Specific Weighting**: Assigns different weights to matches in different metadata fields
  - **Chinese Query Optimization**: Implements specialized tokenization for Chinese characters and phrases

### 2025-04-11

- **Search Result Enhancement**: Modified search functionality to include relevant paragraphs for each individual result item, rather than only showing content for the top result.
- **Result Format Improvement**: Changed the structure to make it clearer which document content belongs to which search result.
- **Document Retrieval Enhancement**: Improved the `vjdoc_get_document` tool to support partial matching for both URL and title parameters.

## Installation

```bash
# Install globally
npm install -g @vjlanguage/mcp-vj-docs

# Or use with npx
npx @vjlanguage/mcp-vj-docs
```

## Firecrawl Registration and API Key

This package uses the Firecrawl service for web crawling. To use it, you need to:

1. **Register for Firecrawl**:
   - Visit the [Firecrawl website](https://firecrawl.dev) and create an account
   - Or use a local Firecrawl service by setting `FIRECRAWL_API_URL` to your local endpoint
2. **Get your API Key**:
   - After registration, navigate to your account dashboard
   - Find and copy your API key
   - Add this key to your environment variables or MCP configuration
3. **Configure the API Key**:
   - Set the `FIRECRAWL_API_KEY` environment variable
   - Or add it to your MCP configuration (see example below)

## Usage

### Environment Variables

- `VJDOC_DB_PATH` - Path to the database file (default: ./data/docs.json)
- `VJDOC_MAX_DEPTH` - Maximum depth to crawl (default: 3)
- `VJDOC_MAX_PAGES` - Maximum number of pages to crawl (default: 100)
- `VJDOC_LOG_DIR` - Directory for log files
- `VJDOC_LOG_TO_FILE` - Whether to log to file (true/false)
- `VJDOC_LOG_LEVEL` - Log level (error, warn, info, debug)
- `FIRECRAWL_API_KEY` - API key for the Firecrawl service
- `FIRECRAWL_API_URL` - Custom URL for the Firecrawl API
- `MCP_TRANSPORT` - Transport method (sse or stdio, default: sse)
- `VJDOC_TFIDF_FILES_DIR` - Directory for custom corpus files (default: ~/mcpdata/tfidf_files)

```json
{
  "mcpServers": {
    "mcp-vj-docs": {
      "command": "npx",
      "args": ["-y", "@vjlanguage/mcp-vj-docs@latest"],
      "env": {
        "FIRECRAWL_API_KEY": "YOUR_API_KEY_HERE",
        "VJDOC_MAX_DEPTH": "4",
        "VJDOC_MAX_PAGES": "100",
        "VJDOC_DB_PATH": "~/mcpdata/docs.json",
        "VJDOC_LOG_DIR": "~/mcpdata/logs",
        "VJDOC_LOG_TO_FILE": "true",
        "VJDOC_LOG_LEVEL": "debug",
        "FIRECRAWL_API_URL": "http://localhost:5002",
        "VJDOC_TFIDF_FILES_DIR": "~/mcpdata/tfidf_files"
      },
      "disabled": false,
      "timeout": 3600,
      "autoApprove": ["vjdoc_search", "vjdoc_crawl", "vjdoc_add_corpus_file"]
    }
  }
}
```

## MCP Tools

The server exposes the following MCP tools:

### 1. `vjdoc_crawl` Tool

Crawls a website and indexes its content for search.

**Parameters:**

- `url` (string, required): The URL to crawl (e.g., "https://example.com/docs")
- `maxDepth` (number, optional): Maximum depth to crawl, default: 3
- `maxPages` (number, optional): Maximum number of pages to crawl, default: 100
- `includePatterns` (array of strings, optional): Patterns to include in the crawl (e.g., ["docs/*"])
- `excludePatterns` (array of strings, optional): Patterns to exclude from the crawl (e.g., ["blog/*"])
- `defaultCategory` (string, optional): Default category for documents when none is detected automatically

**Example:**

```json
{
  "url": "https://example.com/docs",
  "maxDepth": 3,
  "maxPages": 100,
  "includePatterns": ["docs/*"],
  "excludePatterns": ["blog/*"]
}
```

**Response:**

```json
{
  "success": true,
  "message": "Successfully crawled and indexed 42 pages from https://example.com/docs",
  "count": 42
}
```

### 2. `vjdoc_search` Tool

Searches indexed documents with results optimized for large language models.

**Parameters:**

- `query` (string, required): The search query (e.g., "how to use the API")
- `limit` (number, optional): Maximum number of sources to consider, default: 10
- `filters` (object, optional): Filters to narrow down search results
  - `categories` (array of strings, optional): Filter by document categories
  - `dateFrom` (number, optional): Only include documents created after this timestamp
  - `dateTo` (number, optional): Only include documents created before this timestamp
  - `metadata` (object, optional): Filter by metadata fields
- `userId` (string, optional): User ID for personalized results

**Example:**

```json
{
  "query": "how to use the API",
  "limit": 5,
  "filters": {
    "categories": ["API Documentation"]
  }
}
```

**Response:**

```json
{
  "success": true,
  "results": {
    "paragraph": "The API can be used by making HTTP requests to the endpoints...",
    "sources": [
      {
        "url": "https://example.com/docs/api",
        "title": "API Documentation",
        "relevance": 0.85,
        "paragraph": "The API can be used by making HTTP requests to the endpoints...",
        "highlightedParagraph": "The **API** can be used by making **HTTP** requests to the **endpoints**...",
        "fullDocument": "Complete document content for this specific result..."
      }
    ]
  }
}
```

### 3. `vjdoc_add_corpus_file` Tool

Adds a custom corpus file to the TF-IDF files directory for inclusion in search results. This is useful for adding your own code snippets, documentation, error solutions, or technical notes that you want to be searchable.

**Parameters:**

- `content` (string, required): The text content to add to the corpus file
- `filename` (string, optional): Filename for the corpus file (without extension)
- `category` (string, optional): Category for the corpus file

**Recommended Categories:**

- `Code Snippet` - Reusable code patterns and examples
- `API Documentation` - Function and parameter descriptions
- `Error Solution` - Common errors and their fixes
- `Technical Note` - Personal learning summaries

**Example:**

```json
{
  "content": "// Quicksort implementation\nfunction quickSort(arr) {\n  if (arr.length <= 1) return arr;\n  const pivot = arr[0];\n  const left = []; \n  const right = [];\n  for (let i = 1; i < arr.length; i++) {\n    arr[i] < pivot ? left.push(arr[i]) : right.push(arr[i]);\n  }\n  return [...quickSort(left), pivot, ...quickSort(right)];\n}\n\n// Common error: Uncaught TypeError\n// Solution: check whether the variable is null/undefined",
  "filename": "quicksort_algorithm",
  "category": "Code Snippet"
}
```

**Response:**

```json
{
  "success": true,
  "message": "Successfully added corpus file: code_snippet_quicksort_algorithm.txt",
  "filename": "code_snippet_quicksort_algorithm.txt",
  "category": "Code Snippet"
}
```

### 4. `vjdoc_get_docs_meta` Tool

Retrieves metadata about all documents and corpus files to help LLMs understand the available content and plan effective searches.

**Parameters:**

- `query` (string, required): Natural language query or requirement

**Response Format:**

```json
{
  "query": "Original natural language query",
  "documents": [
    {
      "url": "Document URL",
      "title": "Document title",
      "category": "Document category",
      "timestamp": 1712190000000,
      "keywords": ["keyword1", "keyword2", "..."],
      "summary": "Brief summary of document content..."
    }
  ],
  "totalDocuments": 42,
  "categories": ["API Documentation", "Code Snippet", "..."],
  "suggestion": "Search guidance for LLMs"
}
```

### 5. `vjdoc_get_document` Tool

Gets the full content of a specific document by URL or title.

**Parameters:**

- `url` (string, optional): URL of the document to retrieve
- `title` (string, optional): Title of the document to retrieve

**Notes:**

- At least one of `url` or `title` must be provided
- The tool supports partial matching for both parameters
- With the `url` parameter, it finds documents whose URL contains the provided string
- With the `title` parameter, it finds documents whose title contains the provided string (case-insensitive)

**Example:**

```json
{
  "url": "https://example.com/docs/auth"
}
```

or

```json
{
  "title": "Authentication Guide"
}
```

**Response:**

```json
{
  "url": "https://example.com/docs/auth",
  "title": "Authentication Guide",
  "content": "Complete document content...",
  "metadata": {
    "category": "API Documentation",
    "lastModified": "2023-01-15T12:00:00Z"
  }
}
```

## Using with AI Coding Assistants

You can use these MCP tools with various AI coding assistants to enhance your documentation workflow.
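
Whichever assistant you use, its MCP client ultimately invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below only builds such a message so you can see its shape. The helper name `buildToolCall` is hypothetical, and delivering the message over stdio or HTTP is left to the client; in normal use your assistant does all of this for you.

```javascript
// Sketch: the shape of an MCP "tools/call" request (JSON-RPC 2.0).
// `buildToolCall` is a hypothetical helper, not part of this package.
let nextId = 1;

function buildToolCall(name, args = {}) {
  return {
    jsonrpc: "2.0",
    id: nextId++, // each request carries its own id
    method: "tools/call",
    params: { name, arguments: args },
  };
}

// Example: a search request using parameters documented for vjdoc_search.
const request = buildToolCall("vjdoc_search", {
  query: "authentication",
  limit: 5,
});
console.log(JSON.stringify(request, null, 2));
```

The `arguments` object takes the same fields as the tool parameter lists in this README.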

### Using with Cursor

In Cursor, you can use the MCP tools through the command interface:

1. **Setup**: Configure Cursor to use your MCP server
2. **Crawling**: Use the `/mcp` command to invoke the crawl tool

   ```
   /mcp mcp-vj-docs vjdoc_crawl {"url": "https://example.com/docs", "maxDepth": 3, "maxPages": 100}
   ```

3. **Searching**: Use the `/mcp` command to invoke the search tool

   ```
   /mcp mcp-vj-docs vjdoc_search {"query": "authentication", "limit": 5, "filters": {"categories": ["API Documentation"]}}
   ```

4. **Adding Corpus Files**: Use the `/mcp` command to add custom corpus files

   ```
   /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "// Your code here", "category": "Code Snippet"}
   ```

5. **Getting Document Content**: Use the `/mcp` command to get full document content

   ```
   /mcp mcp-vj-docs vjdoc_get_document {"url": "https://example.com/docs/auth"}
   ```

   or

   ```
   /mcp mcp-vj-docs vjdoc_get_document {"title": "Authentication Guide"}
   ```

### Advanced Workflow with AI Assistants

When working with AI assistants like Claude or GPT, you can create a more effective workflow:

1. **First, get document metadata** to understand what's available:

   ```
   /mcp mcp-vj-docs vjdoc_get_docs_meta {"query": "I need to implement JWT authentication"}
   ```

2. **Then, search for relevant documents**:

   ```
   /mcp mcp-vj-docs vjdoc_search {"query": "JWT authentication implementation", "limit": 3}
   ```

3. **Finally, get the full content** of the most relevant document for comprehensive context:

   ```
   /mcp mcp-vj-docs vjdoc_get_document {"url": "https://example.com/docs/auth/jwt"}
   ```

4. **Ask the AI assistant** to explain or generate code based on the full document:

   ```
   Based on this documentation, please explain how to implement JWT authentication in my Node.js application.
   ```

This workflow gives the AI complete context while minimizing token usage, because full content is retrieved only for the most relevant documents.

## Optimized Token Usage with vjdoc_get_docs_meta

### Changelog (v0.1.60) - 2025-04-22

{{ ... }}

## Search Tool Response Format

The `vjdoc_search` tool returns results in the following format:

```jsonc
{
  "results": [
    {
      "url": "https://example.com/docs/api",
      "title": "API Documentation",
      "relevance": 0.85,
      "category": "API Documentation",
      "paragraph": "Content excerpt most relevant to this document...",
      "highlightedParagraph": "Content with **highlighted** query terms for this document...",
      "fullDocument": "Complete content for this specific document..." // Only present for the most relevant result
    },
    {
      "url": "https://example.com/docs/guide",
      "title": "User Guide",
      "relevance": 0.75,
      "category": "Documentation",
      "paragraph": "Content excerpt most relevant to this document...",
      "highlightedParagraph": "Content with **highlighted** query terms for this document..."
      // No fullDocument field for lower-ranked results
    },
    // More results...
  ],
  "content": "Summary of content most relevant to the query...",
  "fullDocument": "Complete document of the most relevant result",
  "personalized": true
}
```

Key fields:

- `results`: List of sources with relevance scores. Each result includes:
  - `url`: Document URL
  - `title`: Document title
  - `relevance`: Relevance score
  - `category`: Document category
  - `paragraph`: Relevant excerpt from this specific document
  - `highlightedParagraph`: The same excerpt with query terms highlighted
  - `fullDocument`: Complete document content (only for the most relevant result)
- `content`: Summary of the extracted content relevant to the query
- `fullDocument`: Complete document content of the most relevant result
- `personalized`: Whether the results were personalized based on the user ID

## Examples

### Searching Across Database and Corpus

```
/mcp mcp-vj-docs vjdoc_search {"query": "authentication", "limit": 5}
```

This searches for "authentication" in both the crawled documents (database) and your custom corpus files.

### Using Natural Language Queries

For natural language requirements, you can use the metadata tool first:

```
/mcp mcp-vj-docs vjdoc_get_docs_meta {"query": "I need to implement user authentication in my React application"}
```

Then use the search tool with the refined query:

```
/mcp mcp-vj-docs vjdoc_search {"query": "React authentication implementation", "filters": {"categories": ["Code Snippet", "API Documentation"]}}
```

### Utilizing the fullDocument Field

When working with LLMs, you can use the `fullDocument` field to provide comprehensive context:

```javascript
// Example of using the fullDocument field with an LLM.
// `searchDocs` and `llm` are placeholders for your own search wrapper and LLM client.
const searchResults = await searchDocs("how to implement JWT authentication");
const fullContext = searchResults.fullDocument;

// Now you can ask the LLM to generate code based on the full document
const generatedCode = await llm.generateCode(
  `Based on this document: ${fullContext}\n\nGenerate a JWT authentication implementation`
);
```

## Real-World Use Cases

### Personal Knowledge Base

- Save code snippets you frequently use for easy reference
- Document API endpoints with examples
- Keep track of error messages and their solutions
- Store configuration examples for different environments
- Create a personal knowledge base of technical notes

**Pro Tip:** Organize your corpus files with
consistent categories to make searching more effective. You can then filter search results by category to find exactly what you need! 使用一致的类别组织您的语料库文件,使搜索更有效。然后,您可以按类别过滤搜索结果,以找到您需要的确切内容! ## PDF Support | PDF 支持 The system now supports adding PDF files to the corpus. PDFs are automatically converted to Markdown format for better searchability. | 系统现在支持将PDF文件添加到语料库。PDF会自动转换为Markdown格式以提高可搜索性。 **Adding a PDF file in Cline | 在Cline中添加PDF文件**: Simply provide the absolute path to your PDF file: ```bash cline mcp mcp-vj-docs vjdoc_add_corpus_file --filePath "/absolute/path/to/your/document.pdf" --category "Documentation" ``` **Adding a PDF file in Cursor | 在Cursor中添加PDF文件**: Simply provide the absolute path to your PDF file: ``` /mcp mcp-vj-docs vjdoc_add_corpus_file {"filePath": "/absolute/path/to/your/document.pdf", "category": "Documentation"} ``` The system extracts text from the PDF and converts it to Markdown format, preserving structure like headings, code blocks, and lists where possible. | 系统从PDF中提取文本并将其转换为Markdown格式,尽可能保留标题、代码块和列表等结构。 ## How It Works | 工作原理 1. When you add a corpus file, it's saved to the `VJDOC_TFIDF_FILES_DIR` directory | 当您添加语料库文件时,它会保存到 `VJDOC_TFIDF_FILES_DIR` 目录 2. If you don't specify a filename, one will be generated automatically | 如果您不指定文件名,将自动生成一个 3. The category will be added as a prefix to the filename | 类别将作为前缀添加到文件名中 4. The file is automatically indexed and will appear in search results | 文件会自动索引并出现在搜索结果中 5. You can search for this content later using the `vjdoc_search` tool | 您可以稍后使用 `vjdoc_search` 工具搜索此内容 ## Practical Workflow Examples | 实用工作流程示例 Here are some practical workflows combining these tools: 以下是结合这些工具的一些实用工作流程: 1. 
**Documentation Indexing | 文档索引** - Crawl your project documentation: | 爬取您的项目文档: ``` /mcp mcp-vj-docs vjdoc_crawl {"url": "https://your-project-docs.com"} ``` - Add custom code snippets: | 添加自定义代码片段: ``` /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "// Your code here", "category": "Code Snippet"} ``` - Search across all indexed content: | 搜索所有已索引内容: ``` /mcp mcp-vj-docs vjdoc_search {"query": "how to implement feature X"} ``` 2. **Personal Knowledge Base | 个人知识库** - Add error solutions as you encounter them: | 添加您遇到的错误解决方案: ``` /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "Error: Module not found\nSolution: Run npm install", "category": "Error Solution"} ``` - Add API documentation for your projects: | 为您的项目添加 API 文档: ``` /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "function getData(id) - Retrieves data by ID from the API", "category": "API Documentation"} ``` - Search your knowledge base when needed: | 在需要时搜索您的知识库: ``` /mcp mcp-vj-docs vjdoc_search {"query": "module not found", "filters": {"categories": ["Error Solution"]}} ``` ## Advanced Workflow with AI Assistants | 与 AI 助手的高级工作流程 When working with AI assistants like Claude or GPT, you can create a more effective workflow: 1. **First, get document metadata** to understand what's available: ``` /mcp mcp-vj-docs vjdoc_get_docs_meta {"query": "I need to implement JWT authentication"} ``` 2. **Then, search for relevant documents**: ``` /mcp mcp-vj-docs vjdoc_search {"query": "JWT authentication implementation", "limit": 3} ``` 3. **Finally, get the full content** of the most relevant document for comprehensive context: ``` /mcp mcp-vj-docs vjdoc_get_document {"url": "https://example.com/docs/auth/jwt"} ``` 4. **Ask the AI assistant** to explain or generate code based on the full document: ``` Based on this documentation, please explain how to implement JWT authentication in my Node.js application. 
```

This workflow ensures the AI has complete context while minimizing token usage by retrieving full content only for the most relevant documents.

### Recommended Usage Strategy

The optimized `vjdoc_get_docs_meta` tool is designed to be more token-efficient while still providing valuable context to LLMs. Here is how to make the best use of it:

1. **Start with a Specific Query**: The more specific your query, the more relevant the returned documents will be.

   ```
   /mcp mcp-vj-docs vjdoc_get_docs_meta {"query": "React server components vs client components"}
   ```

2. **Review the Top Results**: The tool now returns only the most relevant documents, with their snippets and relevance scores.

3. **Use the Suggested Tool Calls**: The response includes ready-to-use examples for:
   - Searching with `vjdoc_search` for more detailed results
   - Getting full document content with `vjdoc_get_document` for the most relevant document

4. **Progressive Disclosure Pattern**: This optimized approach follows a "progressive disclosure" pattern:
   - Start with metadata (minimal tokens)
   - Progress to search results (moderate tokens)
   - Finally, get full document content (maximum tokens) only when necessary

This approach is especially valuable in contexts where token usage directly impacts cost or performance.

### Example Response Format

The optimized response format now includes:

```json
{
  "query": "React hooks",
  "documents": [
    {
      "url": "https://example.com/docs/react/hooks/usestate",
      "title": "useState Hook",
      "category": "React Hooks",
      "score": 0.95,
      "snippet": "useState is a Hook that lets you add React state to function components..."
    }
  ],
  "totalDocuments": 10,
  "categories": ["React Hooks", "React Basics"],
  "userPrompt": "我找到了10个与\"React hooks\"相关的文档...(简洁的提示文本)"
}
```

Only one entry of `documents` is shown here; the array is limited to the most relevant documents. The `userPrompt` value is a concise prompt text in Chinese; the sample reads "I found 10 documents related to "React hooks"...".

## Troubleshooting

### Common Issues

1. **Database Path Issues**
   - Ensure the directory for your database exists
   - Check that you have write permission for the specified path
   - For tilde paths, ensure your home directory is detected correctly

2. **Firecrawl API Issues**
   - Verify that your API key is correct
   - Check whether you have reached the API rate limits
   - If you are using a local Firecrawl service, ensure it is running

3. **Crawling Issues**
   - Some websites may block crawlers
   - Check whether the website requires authentication
   - Try reducing the crawl depth and page limit

### Logs

Check the logs for more detailed error information:

- If `VJDOC_LOG_TO_FILE` is enabled, check the log files in your log directory
- Otherwise, check the console output

## Transport Configuration

The MCP server supports multiple transport protocols, which can be configured through environment variables:

### Environment Variables

#### Transport Protocol Control

- `ENABLE_STDIO_TRANSPORT`: Controls whether the standard input/output transport is enabled (default: true; set to 'false' to disable)
- `ENABLE_STREAMABLE_HTTP`: Controls whether the Streamable HTTP transport is enabled (default: false; set to 'true' to enable)
- `ENABLE_LEGACY_SSE`: Controls whether the legacy SSE endpoints are enabled (default: false; set to 'true' to enable)

#### Port Configuration

- `STREAMABLE_HTTP_PORT`: Sets the port of the Streamable HTTP server (default: 3000)
- `LEGACY_SSE_PORT`: Sets the port of the legacy SSE server (default: 3001)

### Usage Examples

```bash
# Enable all transport protocols
export ENABLE_STDIO_TRANSPORT=true
export ENABLE_STREAMABLE_HTTP=true
export ENABLE_LEGACY_SSE=true

# Set the ports
export STREAMABLE_HTTP_PORT=3000
export LEGACY_SSE_PORT=3001

# Run the server
npm start
```

### Transport Protocol Description

#### Streamable HTTP Transport

A modern HTTP transport protocol that supports streaming and session management. It provides the following endpoints:

- `POST /mcp`: Handles client-to-server communication
- `GET /mcp`: Handles server-to-client notifications (via SSE)
- `DELETE /mcp`: Handles session termination

#### Legacy SSE Transport

The legacy SSE transport protocol. It provides the following endpoints:

- `GET /sse`: Establishes the SSE connection
- `POST /messages?sessionId=<id>`: Handles client messages

#### Stdio Transport

The standard input/output transport protocol, used in command-line environments.

## CHANGELOG

### 2026-04-07 (v0.1.73)

- **Recursive directory scanning**: Corpus files now support recursive directory scanning; files in nested subdirectories are discovered and indexed automatically
  - Added a `scanFilesRecursive()` method that recursively scans for `.txt`, `.md`, and `.pdf` files
  - `loadTfidfFiles()` and `getCorpusDocuments()` have both been refactored to use recursive scanning
  - Document metadata now includes a `relativePath` field that identifies the file's position relative to the directory tree

### 2025-05-08

- **Improved transport protocol configuration**
  - Implemented environment-variable-based transport configuration, so individual transports can be enabled or disabled flexibly
  - New environment variables: `ENABLE_STDIO_TRANSPORT`, `ENABLE_STREAMABLE_HTTP`, `ENABLE_LEGACY_SSE`
  - New port configuration environment variables: `STREAMABLE_HTTP_PORT`, `LEGACY_SSE_PORT`
- **Upgraded Streamable HTTP transport**
  - Adopted the latest MCP
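    specification for session management
  - Supports handling POST, GET, and DELETE requests
  - Improved session ID generation and validation
- **Unified transport protocol handling**
  - Unified the initialization and connection of all transport protocols
  - Improved logging, with more detailed transport status information
  - Strengthened error handling and resource cleanup

### 2025-04-24

- **Improved search result structure**
  - Renamed the `content` field in search results to `paragraph`, which more accurately reflects what it contains
  - Added a `highlightedParagraph` field that provides the paragraph content with highlighting
  - Optimized the search result format so it is better suited to processing by large language models
- **Added PDF support**
  - Integrated the pdf-parse library to support parsing and indexing PDF files
  - Added the `pdf-base64` content type, which allows PDF files to be added directly to the corpus
- **Improved path handling**
  - Enhanced tilde path expansion for better cross-platform path handling
  - Fixed issues related to Node.js path handling
- **Added advanced logging**
  - Integrated the winston logging library for more detailed logging
  - Added configurable log levels and a file logging option
  - New environment variables: `VJDOC_LOG_LEVEL`, `VJDOC_LOG_TO_FILE`, `VJDOC_LOG_DIR`
- **Document timestamp support**
  - Added document timestamps to enable freshness-based document scoring
  - Improved the search algorithm to take a document's age into account
- **Enhanced crawl options**
  - Supports an `ignoreSitemap` option that allows a site's sitemap.xml to be ignored
  - Added `allowExternalLinks` and `allowBackwardLinks` options to control the crawl scope
  - Supports `includePatterns` and `excludePatterns` arrays for precise control over which URLs are crawled
  - Added a `defaultCategory` option that sets a default category for crawled documents
  - Supports custom Firecrawl API configuration (`firecrawlApiKey`, `firecrawlApiUrl`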