@vjlanguage/mcp-vj-docs

# MCP Documentation Server (@vjlanguage/mcp-vj-docs)

A Model Context Protocol (MCP) server for documentation crawling, indexing, and retrieval. This package provides tools for crawling websites, storing and indexing their content, and searching that content with TF-IDF based search. Search results are optimized for large language models.

## Quick Start Guide for Dify Stream Mode

### Installation

```bash
# Install with npm
npm install @vjlanguage/mcp-vj-docs -g

# Or install with yarn
yarn global add @vjlanguage/mcp-vj-docs
```

### Command Line Usage

#### 1. Start the Server (Foreground Mode)

```bash
# Start the server with the default configuration
vjdoc-cli stream-start

# Start the server on a specific port
vjdoc-cli stream-start -p 3000

# Specify the database path and TF-IDF directory
vjdoc-cli stream-start -d ~/mydata/docs.json --tfidf-dir ~/mydata/tfidf
```

#### 2. Background Service Mode

```bash
# Start the server in background mode
vjdoc-cli stream-serve

# Stop the background server
vjdoc-cli stream-stop
```

#### 3. Command Line Options

All commands support the following options:

- `-p, --port <number>` - Set server port (default: 3000)
- `-d, --db-path <path>` - Set database file path (default: ~/mcpdata/docs.json)
- `--tfidf-dir <path>` - Set TF-IDF files directory (default: ~/mcpdata/tfidf)

### Environment Variable Configuration

You can also configure the server using environment variables:

```bash
# Basic Configuration
VJDOC_DB_PATH=~/mcpdata/docs.json
VJDOC_TFIDF_FILES_DIR=~/mcpdata/tfidf
VJDOC_LOG_LEVEL=info # debug, info, warn, error
VJDOC_LOG_TO_FILE=true
VJDOC_LOG_DIR=~/mcpdata/logs

# Crawler Configuration
VJDOC_MAX_DEPTH=4
VJDOC_MAX_PAGES=100
FIRECRAWL_API_KEY=your_api_key_here
FIRECRAWL_API_URL=http://localhost:5002

# Transport Configuration
ENABLE_STDIO_TRANSPORT=true
ENABLE_STREAMABLE_HTTP=true
STREAMABLE_HTTP_PORT=3000
```

### Dify Stream Mode Configuration

```json
{
  "mcpServers": {
    "mcp-vjdoc": {
      "transport": "streamable_http",
      "url": "http://192.168.2.9:3000/mcp"
    }
  }
}
```

> - Replace 192.168.2.9 with the address of your local network interface
> - Replace 3000 with the port you pass to the serve command

### Using as an MCP Server

To use this server with MCP clients (such as Dify, Claude, or other applications that support MCP), add the following to your MCP configuration:

```json
{
  "vj-docs": {
    "command": "node",
    "args": [
      "/path/to/mcp-vj-docs/dist/index.js"
    ],
    "env": {
      "FIRECRAWL_API_KEY": "YOUR_API_KEY_HERE",
      "VJDOC_MAX_DEPTH": "4",
      "VJDOC_MAX_PAGES": "100",
      "VJDOC_DB_PATH": "~/mcpdata/docs.json",
      "VJDOC_LOG_DIR": "~/mcpdata/logs",
      "VJDOC_LOG_TO_FILE": "true",
      "VJDOC_LOG_LEVEL": "debug",
      "FIRECRAWL_API_URL": "http://localhost:5002",
      "VJDOC_TFIDF_FILES_DIR": "~/mcpdata/tfidf"
    },
    "disabled": false,
    "timeout": 3600
  }
}
```

### Common Use Cases

#### Crawl Websites and Index Content

Use the MCP tool `vjdoc_crawl` to crawl websites:

```json
{
  "name": "vjdoc_crawl",
  "arguments": {
    "url": "https://example.com/docs",
    "maxDepth": 3,
    "maxPages": 50,
    "includePatterns": ["*/docs/*"],
    "excludePatterns": ["*/api/*"],
    "defaultCategory": "Documentation"
  }
}
```

#### Search Documents

Use the MCP tool `vjdoc_search` to search documents:

```json
{
  "name": "vjdoc_search",
  "arguments": {
    "query": "how to configure the server",
    "limit": 5,
    "filters": {
      "categories": ["Documentation", "Tutorial"]
    }
  }
}
```

### Troubleshooting

#### Common Issues

1. **Command Not Found**
   - Ensure the package is installed globally, or run the command with npx: `npx vjdoc-cli stream-start`
2. **Permission Errors**
   - Ensure the data directory (~/mcpdata) exists and is writable by your user
   - Create it with `mkdir -p ~/mcpdata` (without `sudo`, which would leave the directory owned by root and not writable by your user)
3. **Cannot Connect to Server**
   - Check whether the port is already in use: `lsof -i :3000`
   - Ensure the firewall is not blocking the connection
4. **Crawler API Errors**
   - Verify that `FIRECRAWL_API_KEY` is correct
   - Check that `FIRECRAWL_API_URL` is reachable

---
## Features

- **Documentation Crawling**: Crawl documentation from websites using Firecrawl
- **Content Processing**: Convert HTML to Markdown and extract relevant content
- **Storage & Indexing**: Store documents using lowdb with TF-IDF based indexing
- **LLM-Optimized Search**: Search for documentation with aggregated results optimized for large language models
- **Full Content Return**: No character length limits on search results
- **Content-First Results**: Prioritizes content over URLs in search results
- **Smart Deduplication**: Removes duplicate content and returns only the top 3 most relevant results
- **AI-Optimized Format**: Results structured specifically for AI consumption and code generation
- **Complete Document Context**: Returns full document content via the `fullDocument` field for comprehensive context
- **Enhanced Metadata Search**: Analyzes title, URL, and all metadata fields with field-specific weighting
- **Multi-dimensional Scoring**: Evaluates document relevance across content, metadata, URL, and title
- **Custom Corpus Management**: Add your own text corpus files for inclusion in search results
- **Multiple Format Support**: Supports TXT, Markdown, and PDF files
- **Recursive Directory Scanning**: Automatically discovers files in nested subdirectories
- **Automatic Indexing**: Files in the corpus directory are automatically indexed and searchable
- **MCP Integration**: Exposes tools for crawling and searching via Model Context Protocol
- **Path Handling**: Supports tilde (~) expansion in file paths
- **Server Modes**: Supports both SSE (Server-Sent Events) and stdio transports
- **Multilingual Support**: Enhanced handling for Chinese queries with specialized tokenization

## Changelog

### 2026-04-07 (v0.1.73)

- **Recursive Directory Scanning**: Corpus files in nested subdirectories are now automatically discovered, indexed, and searchable
  - `loadTfidfFiles()` and `getCorpusDocuments()` both scan recursively
  - Document metadata now includes `relativePath` for nested file location

### 2025-04-22 (v0.1.61)

- **Enhanced Metadata Search**: The search algorithm now analyzes multiple dimensions:
  - **Title Analysis**: Improved title matching with weighted scoring for full and partial matches
  - **URL Analysis**: Extracts and scores URL segments for keyword relevance
  - **Metadata Field Analysis**: Individually processes important fields like keywords, description, and category
  - **Field-Specific Weighting**: Assigns different weights to matches in different metadata fields
  - **Chinese Query Optimization**: Implements specialized tokenization for Chinese characters and phrases

### 2025-04-11

- **Search Result Enhancement**: Search now includes relevant paragraphs for each individual result item, rather than only showing content for the top result.
- **Result Format Improvement**: Changed the structure to make it clearer which document content belongs to which search result.
- **Document Retrieval Enhancement**: Improved the `vjdoc_get_document` tool to support partial matching for both URL and title parameters.

## Installation

```bash
# Install globally
npm install -g @vjlanguage/mcp-vj-docs

# Or use with npx
npx @vjlanguage/mcp-vj-docs
```

## Firecrawl Registration and API Key

This package uses the Firecrawl service for web crawling. To use it, you need to:

1. **Register for Firecrawl**:
   - Visit the [Firecrawl website](https://firecrawl.dev) and create an account
   - Or use a local Firecrawl service by setting `FIRECRAWL_API_URL` to your local endpoint
2. **Get your API Key**:
   - After registration, navigate to your account dashboard
   - Find and copy your API key
   - Add this key to your environment variables or MCP configuration
3. **Configure the API Key**:
   - Set the `FIRECRAWL_API_KEY` environment variable
   - Or add it to your MCP configuration (see example below)

## Usage

### Environment Variables

- `VJDOC_DB_PATH` - Path to the database file (default: ./data/docs.json)
- `VJDOC_MAX_DEPTH` - Maximum depth to crawl (default: 3)
- `VJDOC_MAX_PAGES` - Maximum number of pages to crawl (default: 100)
- `VJDOC_LOG_DIR` - Directory for log files
- `VJDOC_LOG_TO_FILE` - Whether to log to file (true/false)
- `VJDOC_LOG_LEVEL` - Log level (error, warn, info, debug)
- `FIRECRAWL_API_KEY` - API key for the Firecrawl service
- `FIRECRAWL_API_URL` - Custom URL for the Firecrawl API
- `MCP_TRANSPORT` - Transport method (sse or stdio, default: sse)
- `VJDOC_TFIDF_FILES_DIR` - Directory for custom corpus files (default: ~/mcpdata/tfidf_files)

```json
{
  "mcpServers": {
    "mcp-vj-docs": {
      "command": "npx",
      "args": ["-y", "@vjlanguage/mcp-vj-docs@latest"],
      "env": {
        "FIRECRAWL_API_KEY": "YOUR_API_KEY_HERE",
        "VJDOC_MAX_DEPTH": "4",
        "VJDOC_MAX_PAGES": "100",
        "VJDOC_DB_PATH": "~/mcpdata/docs.json",
        "VJDOC_LOG_DIR": "~/mcpdata/logs",
        "VJDOC_LOG_TO_FILE": "true",
        "VJDOC_LOG_LEVEL": "debug",
        "FIRECRAWL_API_URL": "http://localhost:5002",
        "VJDOC_TFIDF_FILES_DIR": "~/mcpdata/tfidf_files"
      },
      "disabled": false,
      "timeout": 3600,
      "autoApprove": ["vjdoc_search", "vjdoc_crawl", "vjdoc_add_corpus_file"]
    }
  }
}
```

## MCP Tools

The server exposes the following MCP tools:

### 1. `vjdoc_crawl` Tool

Crawls a website and indexes its content for search.
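To make "indexes its content for search" more concrete, here is a minimal sketch of TF-IDF ranking over a handful of crawled pages. It is an illustration only and not the package's implementation: `tokenize`, `buildIndex`, and `search` are hypothetical names, the smoothed IDF formula is one common variant, and the tokenizer only handles Latin letters and digits (the specialized Chinese tokenization mentioned in the feature list is not attempted here).

```javascript
// Illustrative TF-IDF ranking; NOT the package's actual implementation.
// tokenize(), buildIndex() and search() are hypothetical helper names.

// Split text into lowercase tokens made of Latin letters and digits.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Precompute per-document term counts and corpus-wide document frequencies.
function buildIndex(docs) {
  const docFreq = new Map();
  const entries = docs.map((doc) => {
    const tokens = tokenize(doc.content);
    const termFreq = new Map();
    for (const t of tokens) termFreq.set(t, (termFreq.get(t) || 0) + 1);
    for (const t of termFreq.keys()) docFreq.set(t, (docFreq.get(t) || 0) + 1);
    return { doc, termFreq, length: tokens.length };
  });
  return { entries, docFreq, size: docs.length };
}

// Score each document as the sum of tf * idf over the query terms,
// drop documents that match nothing, and return the best `limit` results.
function search(index, query, limit = 3) {
  const terms = tokenize(query);
  return index.entries
    .map(({ doc, termFreq, length }) => {
      let score = 0;
      for (const t of terms) {
        const tf = length > 0 ? (termFreq.get(t) || 0) / length : 0;
        const idf = Math.log((1 + index.size) / (1 + (index.docFreq.get(t) || 0))) + 1;
        score += tf * idf;
      }
      return { url: doc.url, score };
    })
    .filter((result) => result.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const index = buildIndex([
  { url: "https://example.com/docs/auth", content: "Configure JWT authentication for the API" },
  { url: "https://example.com/docs/install", content: "Install the package with npm" },
]);
console.log(search(index, "jwt authentication"));
```

Terms that appear in few documents get a higher IDF, so a page that mentions a rare query term ranks above pages that only share common words with the query.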
**Parameters:**

- `url` (string, required): The URL to crawl (e.g., "https://example.com/docs")
- `maxDepth` (number, optional): Maximum depth to crawl, default: 3
- `maxPages` (number, optional): Maximum number of pages to crawl, default: 100
- `includePatterns` (array of strings, optional): Patterns to include in the crawl (e.g., ["docs/*"])
- `excludePatterns` (array of strings, optional): Patterns to exclude from the crawl (e.g., ["blog/*"])
- `defaultCategory` (string, optional): Default category for documents when none is detected automatically

**Example:**

```json
{
  "url": "https://example.com/docs",
  "maxDepth": 3,
  "maxPages": 100,
  "includePatterns": ["docs/*"],
  "excludePatterns": ["blog/*"]
}
```

**Response:**

```json
{
  "success": true,
  "message": "Successfully crawled and indexed 42 pages from https://example.com/docs",
  "count": 42
}
```

### 2. `vjdoc_search` Tool

Searches indexed documents with results optimized for large language models.
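For clients that talk to the server directly rather than through an assistant, a call to this tool is wrapped in an MCP `tools/call` JSON-RPC request. The sketch below only builds that request object; `buildSearchCall` is a hypothetical helper, and it assumes the `dateFrom`/`dateTo` filters are epoch timestamps in milliseconds, which the documentation does not state explicitly.

```javascript
// Builds an MCP "tools/call" JSON-RPC request for vjdoc_search.
// buildSearchCall() is a hypothetical helper, not part of the package.
function buildSearchCall(id, query, { limit = 10, categories, dateFrom, dateTo } = {}) {
  if (typeof query !== "string" || query.trim() === "") {
    throw new Error("query is required");
  }

  const filters = {};
  if (categories && categories.length > 0) filters.categories = categories;
  // Assumption: timestamps are epoch milliseconds, as returned by Date#getTime().
  if (dateFrom instanceof Date) filters.dateFrom = dateFrom.getTime();
  if (dateTo instanceof Date) filters.dateTo = dateTo.getTime();

  const args = { query, limit };
  if (Object.keys(filters).length > 0) args.filters = filters;

  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "vjdoc_search", arguments: args },
  };
}

const request = buildSearchCall(1, "how to use the API", {
  limit: 5,
  categories: ["API Documentation"],
  dateFrom: new Date("2025-01-01T00:00:00Z"),
});
console.log(JSON.stringify(request, null, 2));
```

Omitting `filters` entirely when no filter is set keeps the request identical to the simplest documented form, `{"query": ..., "limit": ...}`.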
**Parameters:**

- `query` (string, required): The search query (e.g., "how to use the API")
- `limit` (number, optional): Maximum number of sources to consider, default: 10
- `filters` (object, optional): Filters to narrow down search results
  - `categories` (array of strings, optional): Filter by document categories
  - `dateFrom` (number, optional): Only include documents created after this timestamp
  - `dateTo` (number, optional): Only include documents created before this timestamp
  - `metadata` (object, optional): Filter by metadata fields
- `userId` (string, optional): User ID for personalized results

**Example:**

```json
{
  "query": "how to use the API",
  "limit": 5,
  "filters": {
    "categories": ["API Documentation"]
  }
}
```

**Response:**

```json
{
  "success": true,
  "results": {
    "paragraph": "The API can be used by making HTTP requests to the endpoints...",
    "sources": [
      {
        "url": "https://example.com/docs/api",
        "title": "API Documentation",
        "relevance": 0.85,
        "paragraph": "The API can be used by making HTTP requests to the endpoints...",
        "highlightedParagraph": "The **API** can be used by making **HTTP** requests to the **endpoints**...",
        "fullDocument": "Complete document content for this specific result..."
      }
    ]
  }
}
```

### 3. `vjdoc_add_corpus_file` Tool

Adds a custom corpus file to the TF-IDF files directory for inclusion in search results. Use it for your own code snippets, documentation, error solutions, or technical notes that you want to be searchable.
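The stored file name appears to combine the category and the supplied filename: in this tool's example response, category `Code Snippet` and filename `quicksort_algorithm` become `code_snippet_quicksort_algorithm.txt`. The sketch below reconstructs that convention as a guess. `slugify` and `corpusFileName` are hypothetical names, and the timestamp-based fallback for a missing filename is an assumption; the server's real rules may differ.

```javascript
// Hypothetical reconstruction of the naming convention seen in the example
// response; the package's real rules may differ.

// Lowercase the text and collapse every run of non-alphanumerics into "_".
function slugify(text) {
  return text
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_+|_+$/g, "");
}

// Prefix the (slugified) category to the (slugified) filename.
function corpusFileName({ filename, category }, now = Date.now()) {
  // Assumption: a timestamp-based name stands in when no filename is given.
  const base = filename ? slugify(filename) : `corpus_${now}`;
  const prefix = category ? slugify(category) : "";
  return `${prefix ? prefix + "_" : ""}${base}.txt`;
}

console.log(corpusFileName({ filename: "quicksort_algorithm", category: "Code Snippet" }));
```

Keeping the category in the file name means the corpus directory stays browsable by category even without the database.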
**Parameters:**

- `content` (string, required): The text content to add to the corpus file
- `filename` (string, optional): Filename for the corpus file (without extension)
- `category` (string, optional): Category for the corpus file

**Recommended Categories:**

- `Code Snippet` - Reusable code patterns and examples
- `API Documentation` - Function and parameter descriptions
- `Error Solution` - Common errors and their fixes
- `Technical Note` - Personal learning summaries

**Example:**

```json
{
  "content": "// Quicksort implementation\nfunction quickSort(arr) {\n if (arr.length <= 1) return arr;\n const pivot = arr[0];\n const left = []; \n const right = [];\n for (let i = 1; i < arr.length; i++) {\n arr[i] < pivot ? left.push(arr[i]) : right.push(arr[i]);\n }\n return [...quickSort(left), pivot, ...quickSort(right)];\n}\n\n// Common error: Uncaught TypeError\n// Solution: check whether the variable is null/undefined",
  "filename": "quicksort_algorithm",
  "category": "Code Snippet"
}
```

**Response:**

```json
{
  "success": true,
  "message": "Successfully added corpus file: code_snippet_quicksort_algorithm.txt",
  "filename": "code_snippet_quicksort_algorithm.txt",
  "category": "Code Snippet"
}
```

### 4. `vjdoc_get_docs_meta` Tool

Retrieves metadata about all documents and corpus files to help LLMs understand the available content and plan effective searches.
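One way to "plan" a search from this metadata is to reuse the categories of the returned documents as a filter for `vjdoc_search`. The sketch below assumes only the `query` and `documents[].category` fields of the response; `planSearch` is a hypothetical helper and the sample data is made up for illustration.

```javascript
// Hypothetical helper: turn a vjdoc_get_docs_meta response into
// arguments for a follow-up vjdoc_search call.
function planSearch(meta, limit = 5) {
  // Collect each distinct, non-empty category in first-seen order.
  const categories = [
    ...new Set(meta.documents.map((doc) => doc.category).filter(Boolean)),
  ];
  const args = { query: meta.query, limit };
  if (categories.length > 0) args.filters = { categories };
  return args;
}

// Made-up metadata response, shaped like the documented format.
const meta = {
  query: "React authentication implementation",
  documents: [
    { url: "https://example.com/docs/auth", title: "Authentication Guide", category: "API Documentation" },
    { url: "https://example.com/snippets/login", title: "Login form", category: "Code Snippet" },
    { url: "https://example.com/docs/auth/jwt", title: "JWT Guide", category: "API Documentation" },
  ],
};
console.log(JSON.stringify(planSearch(meta)));
```

In practice an LLM would usually rewrite the query as well; this sketch keeps the original query and only narrows the categories.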
**Parameters:**

- `query` (string, required): Natural language query or requirement

**Response Format:**

```json
{
  "query": "Original natural language query",
  "documents": [
    {
      "url": "Document URL",
      "title": "Document title",
      "category": "Document category",
      "timestamp": 1712190000000,
      "keywords": ["keyword1", "keyword2", "..."],
      "summary": "Brief summary of document content..."
    }
  ],
  "totalDocuments": 42,
  "categories": ["API Documentation", "Code Snippet", "..."],
  "suggestion": "Search guidance for LLMs"
}
```

### 5. `vjdoc_get_document` Tool

Gets the full content of a specific document by URL or title.

**Parameters:**

- `url` (string, optional): URL of the document to retrieve
- `title` (string, optional): Title of the document to retrieve

**Notes:**

- At least one of `url` or `title` must be provided
- The tool supports partial matching for both parameters
- With `url`, it finds documents whose URL contains the provided string
- With `title`, it finds documents whose title contains the provided string (case-insensitive)

**Example:**

```json
{
  "url": "https://example.com/docs/auth"
}
```

or

```json
{
  "title": "Authentication Guide"
}
```

**Response:**

```json
{
  "url": "https://example.com/docs/auth",
  "title": "Authentication Guide",
  "content": "Complete document content...",
  "metadata": {
    "category": "API Documentation",
    "lastModified": "2023-01-15T12:00:00Z"
  }
}
```

## Using with AI Coding Assistants

You can use these MCP tools with various AI coding assistants to enhance your documentation workflow.
### Using with Cursor

In Cursor, you can use the MCP tools through the command interface:

1. **Setup**: Configure Cursor to use your MCP server
2. **Crawling**: Use the `/mcp` command to invoke the crawl tool
   ```
   /mcp mcp-vj-docs vjdoc_crawl {"url": "https://example.com/docs", "maxDepth": 3, "maxPages": 100}
   ```
3. **Searching**: Use the `/mcp` command to invoke the search tool
   ```
   /mcp mcp-vj-docs vjdoc_search {"query": "authentication", "limit": 5, "filters": {"categories": ["API Documentation"]}}
   ```
4. **Adding Corpus Files**: Use the `/mcp` command to add custom corpus files
   ```
   /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "// Your code here", "category": "Code Snippet"}
   ```
5. **Getting Document Content**: Use the `/mcp` command to get full document content
   ```
   /mcp mcp-vj-docs vjdoc_get_document {"url": "https://example.com/docs/auth"}
   ```
   or
   ```
   /mcp mcp-vj-docs vjdoc_get_document {"title": "Authentication Guide"}
   ```

### Advanced Workflow with AI Assistants

When working with AI assistants like Claude or GPT, you can create a more effective workflow:

1. **First, get document metadata** to understand what's available:
   ```
   /mcp mcp-vj-docs vjdoc_get_docs_meta {"query": "I need to implement JWT authentication"}
   ```
2. **Then, search for relevant documents**:
   ```
   /mcp mcp-vj-docs vjdoc_search {"query": "JWT authentication implementation", "limit": 3}
   ```
3. **Finally, get the full content** of the most relevant document for comprehensive context:
   ```
   /mcp mcp-vj-docs vjdoc_get_document {"url": "https://example.com/docs/auth/jwt"}
   ```
4. **Ask the AI assistant** to explain or generate code based on the full document:
   ```
   Based on this documentation, please explain how to implement JWT authentication in my Node.js application.
   ```

This workflow gives the AI complete context while minimizing token usage, because full content is retrieved only for the most relevant documents.

## Optimized Token Usage with vjdoc_get_docs_meta

### Changelog (v0.1.60) - 2025-04-22

{{ ... }}

## Search Tool Response Format

The `vjdoc_search` tool returns results in the following format:

```json
{
  "results": [
    {
      "url": "https://example.com/docs/api",
      "title": "API Documentation",
      "relevance": 0.85,
      "category": "API Documentation",
      "paragraph": "Content excerpt most relevant to this document...",
      "highlightedParagraph": "Content with **highlighted** query terms for this document...",
      "fullDocument": "Complete content for this specific document..." // Only present for the most relevant result
    },
    {
      "url": "https://example.com/docs/guide",
      "title": "User Guide",
      "relevance": 0.75,
      "category": "Documentation",
      "paragraph": "Content excerpt most relevant to this document...",
      "highlightedParagraph": "Content with **highlighted** query terms for this document..." // No fullDocument field for lower-ranked results
    },
    // More results...
  ],
  "content": "Summary of content most relevant to the query...",
  "fullDocument": "Complete document of the most relevant result",
  "personalized": true
}
```

Key fields:

- `results`: List of sources with relevance scores
  - Each result includes:
    - `url`: Document URL
    - `title`: Document title
    - `relevance`: Relevance score
    - `category`: Document category
    - `paragraph`: Relevant paragraph excerpt from this specific document
    - `highlightedParagraph`: This document's paragraph with query terms highlighted
    - `fullDocument`: Complete document content (only for the most relevant result)
- `content`: Summary of the extracted content relevant to the query
- `fullDocument`: Complete document content of the most relevant result
- `personalized`: Whether the results were personalized based on the user ID

## Examples

### Searching Across Database and Corpus

```
/mcp mcp-vj-docs vjdoc_search {"query": "authentication", "limit": 5}
```

This searches for "authentication" in both the crawled documents (database) and your custom corpus files.

### Using Natural Language Queries

For natural language requirements, you can use the metadata tool first:

```
/mcp mcp-vj-docs vjdoc_get_docs_meta {"query": "I need to implement user authentication in my React application"}
```

Then use the search tool with the refined query:

```
/mcp mcp-vj-docs vjdoc_search {"query": "React authentication implementation", "filters": {"categories": ["Code Snippet", "API Documentation"]}}
```

### Utilizing the fullDocument Field

When working with LLMs, you can use the `fullDocument` field to provide comprehensive context:

```javascript
// Example of using the fullDocument field with an LLM.
// searchDocs() and llm are placeholders for your own MCP client wrapper and LLM client.
const searchResults = await searchDocs("How to implement JWT authentication");
const fullContext = searchResults.fullDocument;

// Now you can ask the LLM to generate code based on the full document
const generatedCode = await llm.generateCode(
  `Based on this documentation: ${fullContext}\n\nGenerate a JWT authentication implementation`
);
```

## Real-World Use Cases

### Personal Knowledge Base

- Save code snippets you frequently use for easy reference
- Document API endpoints with examples
- Keep track of error messages and their solutions
- Store configuration examples for different environments
- Create a personal knowledge base of technical notes

**Pro Tip:** Organize your corpus files with
consistent categories to make searching more effective. You can then filter search results by category to find exactly what you need! 使用一致的类别组织您的语料库文件，使搜索更有效。然后，您可以按类别过滤搜索结果，以找到您需要的确切内容！ ## PDF Support | PDF 支持 The system now supports adding PDF files to the corpus. PDFs are automatically converted to Markdown format for better searchability. | 系统现在支持将PDF文件添加到语料库。PDF会自动转换为Markdown格式以提高可搜索性。 **Adding a PDF file in Cline | 在Cline中添加PDF文件**: Simply provide the absolute path to your PDF file: ```bash cline mcp mcp-vj-docs vjdoc_add_corpus_file --filePath "/absolute/path/to/your/document.pdf" --category "Documentation" ``` **Adding a PDF file in Cursor | 在Cursor中添加PDF文件**: Simply provide the absolute path to your PDF file: ``` /mcp mcp-vj-docs vjdoc_add_corpus_file {"filePath": "/absolute/path/to/your/document.pdf", "category": "Documentation"} ``` The system extracts text from the PDF and converts it to Markdown format, preserving structure like headings, code blocks, and lists where possible. | 系统从PDF中提取文本并将其转换为Markdown格式，尽可能保留标题、代码块和列表等结构。 ## How It Works | 工作原理 1. When you add a corpus file, it's saved to the `VJDOC_TFIDF_FILES_DIR` directory | 当您添加语料库文件时，它会保存到 `VJDOC_TFIDF_FILES_DIR` 目录 2. If you don't specify a filename, one will be generated automatically | 如果您不指定文件名，将自动生成一个 3. The category will be added as a prefix to the filename | 类别将作为前缀添加到文件名中 4. The file is automatically indexed and will appear in search results | 文件会自动索引并出现在搜索结果中 5. You can search for this content later using the `vjdoc_search` tool | 您可以稍后使用 `vjdoc_search` 工具搜索此内容 ## Practical Workflow Examples | 实用工作流程示例 Here are some practical workflows combining these tools: 以下是结合这些工具的一些实用工作流程： 1. 
**Documentation Indexing | 文档索引** - Crawl your project documentation: | 爬取您的项目文档： ``` /mcp mcp-vj-docs vjdoc_crawl {"url": "https://your-project-docs.com"} ``` - Add custom code snippets: | 添加自定义代码片段： ``` /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "// Your code here", "category": "Code Snippet"} ``` - Search across all indexed content: | 搜索所有已索引内容： ``` /mcp mcp-vj-docs vjdoc_search {"query": "how to implement feature X"} ``` 2. **Personal Knowledge Base | 个人知识库** - Add error solutions as you encounter them: | 添加您遇到的错误解决方案： ``` /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "Error: Module not found\nSolution: Run npm install", "category": "Error Solution"} ``` - Add API documentation for your projects: | 为您的项目添加 API 文档： ``` /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "function getData(id) - Retrieves data by ID from the API", "category": "API Documentation"} ``` - Search your knowledge base when needed: | 在需要时搜索您的知识库： ``` /mcp mcp-vj-docs vjdoc_search {"query": "module not found", "filters": {"categories": ["Error Solution"]}} ``` ## Advanced Workflow with AI Assistants | 与 AI 助手的高级工作流程 When working with AI assistants like Claude or GPT, you can create a more effective workflow: 1. **First, get document metadata** to understand what's available: ``` /mcp mcp-vj-docs vjdoc_get_docs_meta {"query": "I need to implement JWT authentication"} ``` 2. **Then, search for relevant documents**: ``` /mcp mcp-vj-docs vjdoc_search {"query": "JWT authentication implementation", "limit": 3} ``` 3. **Finally, get the full content** of the most relevant document for comprehensive context: ``` /mcp mcp-vj-docs vjdoc_get_document {"url": "https://example.com/docs/auth/jwt"} ``` 4. **Ask the AI assistant** to explain or generate code based on the full document: ``` Based on this documentation, please explain how to implement JWT authentication in my Node.js application. 
```

This workflow ensures the AI has complete context while minimizing token usage by only retrieving full content for the most relevant documents.

### Recommended Usage Strategy

The optimized `vjdoc_get_docs_meta` tool is designed to be more token-efficient while still providing valuable context to LLMs. Here's how to best utilize it:

1. **Start with a Specific Query**: The more specific your query, the more relevant the returned documents will be.

   ```
   /mcp mcp-vj-docs vjdoc_get_docs_meta {"query": "React server components vs client components"}
   ```

2. **Review the Top Results**: The tool returns only the most relevant documents, each with a snippet and a relevance score.

3. **Use the Suggested Tool Calls**: The response includes ready-to-use examples for:
   - Searching with `vjdoc_search` for more detailed results
   - Getting full document content with `vjdoc_get_document` for the most relevant document

4. **Progressive Disclosure Pattern**: This optimized approach follows a "progressive disclosure" pattern:
   - Start with metadata (minimal tokens)
   - Progress to search results (moderate tokens)
   - Finally get full document content (maximum tokens) only when necessary

This approach is especially valuable in contexts where token usage directly impacts costs or performance.

### Example Response Format

The optimized response format is shown below. The `documents` array is limited to the most relevant entries, so it can be shorter than `totalDocuments`. The `userPrompt` field holds a concise prompt text; in this example it reads, in Chinese, "I found 10 documents related to "React hooks"...".

```json
{
  "query": "React hooks",
  "documents": [
    {
      "url": "https://example.com/docs/react/hooks/usestate",
      "title": "useState Hook",
      "category": "React Hooks",
      "score": 0.95,
      "snippet": "useState is a Hook that lets you add React state to function components..."
    }
  ],
  "totalDocuments": 10,
  "categories": ["React Hooks", "React Basics"],
  "userPrompt": "我找到了10个与\"React hooks\"相关的文档..."
}
```

## Troubleshooting

### Common Issues

1. **Database Path Issues**
   - Ensure the directory for your database exists
   - Check that you have write permissions to the specified path
   - For tilde paths, ensure your home directory is correctly detected

2. **Firecrawl API Issues**
   - Verify your API key is correct
   - Check whether you have reached API rate limits
   - If using a local Firecrawl service, ensure it is running

3. **Crawling Issues**
   - Some websites may block crawlers
   - Check whether the website requires authentication
   - Try reducing the crawl depth and page limit

### Logs

Check the logs for more detailed error information:

- If `VJDOC_LOG_TO_FILE` is enabled, check the log files in your log directory
- Otherwise, check the console output

## Transport Configuration

The MCP server supports multiple transport protocols, which can be configured through environment variables.

### Environment Variables

#### Transport Protocol Control

- `ENABLE_STDIO_TRANSPORT`: Controls whether the standard input/output transport is enabled (default: true; set to 'false' to disable)
- `ENABLE_STREAMABLE_HTTP`: Controls whether the Streamable HTTP transport is enabled (default: false; set to 'true' to enable)
- `ENABLE_LEGACY_SSE`: Controls whether the legacy SSE endpoints are enabled (default: false; set to 'true' to enable)

#### Port Configuration

- `STREAMABLE_HTTP_PORT`: Sets the port of the Streamable HTTP server (default: 3000)
- `LEGACY_SSE_PORT`: Sets the port of the legacy SSE server (default: 3001)

### Usage Examples

```bash
# Enable all transport protocols
export ENABLE_STDIO_TRANSPORT=true
export ENABLE_STREAMABLE_HTTP=true
export ENABLE_LEGACY_SSE=true

# Set the ports
export STREAMABLE_HTTP_PORT=3000
export LEGACY_SSE_PORT=3001

# Run the server
npm start
```

### Transport Protocol Description

#### Streamable HTTP Transport

The modern HTTP transport, with support for streaming and session management. It provides the following endpoints:

- `POST /mcp`: Handles client-to-server communication
- `GET /mcp`: Handles server-to-client notifications (via SSE)
- `DELETE /mcp`: Handles session termination

#### Legacy SSE Transport

The legacy SSE transport. It provides the following endpoints:

- `GET /sse`: Establishes the SSE connection
- `POST /messages?sessionId=<id>`: Handles client messages

#### Stdio Transport

The standard input/output transport, used in command-line environments.

## CHANGELOG

### 2026-04-07 (v0.1.73)

- **Recursive directory scanning**: Corpus files now support recursive directory scanning, so files in nested subdirectories are discovered and indexed automatically
  - Added the `scanFilesRecursive()` method, which recursively scans for `.txt`, `.md`, and `.pdf` files
  - Refactored both `loadTfidfFiles()` and `getCorpusDocuments()` to use recursive scanning
  - Added a `relativePath` field to document metadata, identifying the file's relative location in the directory tree

### 2025-05-08

- **Improved transport configuration**:
  - Implemented transport configuration based on environment variables, so individual transports can be enabled or disabled flexibly
  - Added environment variables: `ENABLE_STDIO_TRANSPORT`, `ENABLE_STREAMABLE_HTTP`, `ENABLE_LEGACY_SSE`
  - Added port configuration environment variables: `STREAMABLE_HTTP_PORT`, `LEGACY_SSE_PORT`
- **Upgraded Streamable HTTP transport**:
  - Implemented session management according to the latest MCP specification
  - Added handling for POST, GET, and DELETE requests
  - Improved session ID generation and validation
- **Unified transport handling**:
  - Unified the initialization and connection logic of all transports
  - Improved logging with more detailed transport status information
  - Strengthened error handling and resource cleanup

### 2025-04-24

- **Improved search result structure**:
  - Renamed the `content` field in search results to `paragraph`, which reflects its contents more accurately
  - Added a `highlightedParagraph` field containing the paragraph with highlighting
  - Optimized the search result format for processing by large language models
- **Added PDF support**:
  - Integrated the pdf-parse library for parsing and indexing PDF files
  - Added the `pdf-base64` content type, which allows PDF files to be added directly to the corpus
- **Improved path handling**:
  - Enhanced tilde path expansion for better cross-platform path handling
  - Fixed issues related to Node.js path handling
- **Added advanced logging**:
  - Integrated the winston logging library for more detailed logging
  - Added configurable log levels and a file logging option
  - Added environment variables: `VJDOC_LOG_LEVEL`, `VJDOC_LOG_TO_FILE`, `VJDOC_LOG_DIR`
- **Document timestamp support**:
  - Added document timestamps, used for freshness-based document scoring
  - Improved the search algorithm to take document age into account
- **Enhanced crawl options**:
  - Added the `ignoreSitemap` option, which allows a site's sitemap.xml to be ignored
  - Added the `allowExternalLinks` and `allowBackwardLinks` options to control crawl scope
  - Added the `includePatterns` and `excludePatterns` arrays for precise control over which URLs are crawled
  - Added the `defaultCategory` option to set a default category for crawled documents
  - Added support for custom Firecrawl API configuration (`firecrawlApiKey` and `firecrawlApiUrl`)
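
As a rough illustration of the environment-variable transport switches introduced in the 2025-05-08 entry and documented in the Transport Configuration section, the TypeScript sketch below resolves the documented defaults. Only the variable names and default values come from this README. The function name `resolveTransportConfig`, the `TransportConfig` shape, and the fallback to the default port on an invalid value are assumptions made for illustration, not the package's actual implementation.

```typescript
// Hypothetical shape; the package's real internal types are not shown in this README.
interface TransportConfig {
  stdio: boolean;
  streamableHttp: boolean;
  legacySse: boolean;
  streamableHttpPort: number;
  legacySsePort: number;
}

type Env = Record<string, string | undefined>;

// Parse a port value. Assumption: an unset or invalid value falls back
// to the documented default instead of raising an error.
function parsePort(value: string | undefined, fallback: number): number {
  const port = Number(value);
  const valid =
    value !== undefined && Number.isInteger(port) && port > 0 && port < 65536;
  return valid ? port : fallback;
}

function resolveTransportConfig(env: Env): TransportConfig {
  return {
    // Documented default: enabled unless set to 'false'.
    stdio: env["ENABLE_STDIO_TRANSPORT"] !== "false",
    // Documented default: disabled unless set to 'true'.
    streamableHttp: env["ENABLE_STREAMABLE_HTTP"] === "true",
    legacySse: env["ENABLE_LEGACY_SSE"] === "true",
    // Documented default ports: 3000 and 3001.
    streamableHttpPort: parsePort(env["STREAMABLE_HTTP_PORT"], 3000),
    legacySsePort: parsePort(env["LEGACY_SSE_PORT"], 3001),
  };
}

// Example: enable Streamable HTTP on a custom port, leave the rest at defaults.
const config = resolveTransportConfig({
  ENABLE_STREAMABLE_HTTP: "true",
  STREAMABLE_HTTP_PORT: "8080",
});
console.log(config);
```

In a Node.js process, one would presumably pass `process.env` as the argument. Taking the environment as a parameter keeps the sketch free of runtime dependencies and easy to test.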