# MCP Documentation Server (@vjlanguage/mcp-vj-docs)
A Model Context Protocol (MCP) server for documentation crawling, indexing, and retrieval. This package provides tools for crawling websites, storing and indexing their content, and searching that content with TF-IDF based search. Search results are optimized for large language models.
## Quick Start Guide for Dify Stream Mode
### Installation
```bash
# Install with npm
npm install @vjlanguage/mcp-vj-docs -g
# Or install with yarn
yarn global add @vjlanguage/mcp-vj-docs
```
### Command Line Usage
#### 1. Start the Server (Foreground Mode)
```bash
# Start the server with the default configuration
vjdoc-cli stream-start
# Start the server on a specific port
vjdoc-cli stream-start -p 3000
# Specify the database path and the TF-IDF directory
vjdoc-cli stream-start -d ~/mydata/docs.json --tfidf-dir ~/mydata/tfidf
```
#### 2. Background Service Mode
```bash
# Start the server in background mode
vjdoc-cli stream-serve
# Stop the background server
vjdoc-cli stream-stop
```
#### 3. Command Line Options
All commands support the following options:
- `-p, --port <number>` - Set the server port (default: 3000)
- `-d, --db-path <path>` - Set the database file path (default: ~/mcpdata/docs.json)
- `--tfidf-dir <path>` - Set the TF-IDF files directory (default: ~/mcpdata/tfidf)
### Environment Variable Configuration
You can also configure the server using environment variables:
```bash
# Basic configuration
VJDOC_DB_PATH=~/mcpdata/docs.json
VJDOC_TFIDF_FILES_DIR=~/mcpdata/tfidf
VJDOC_LOG_LEVEL=info # debug, info, warn, error
VJDOC_LOG_TO_FILE=true
VJDOC_LOG_DIR=~/mcpdata/logs
# Crawler configuration
VJDOC_MAX_DEPTH=4
VJDOC_MAX_PAGES=100
FIRECRAWL_API_KEY=your_api_key_here
FIRECRAWL_API_URL=http://localhost:5002
# Transport configuration
ENABLE_STDIO_TRANSPORT=true
ENABLE_STREAMABLE_HTTP=true
STREAMABLE_HTTP_PORT=3000
```
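As a rough illustration of how these variables could be consumed, the sketch below reads them from an env-like object and expands a leading `~`. The helper names `expandTilde` and `resolveConfig` are hypothetical, and the fallback values are simply the example values shown above, not confirmed package defaults.

```javascript
// Sketch only: expandTilde and resolveConfig are hypothetical helpers,
// not functions exported by @vjlanguage/mcp-vj-docs.
const os = require("node:os");
const path = require("node:path");

// Expand a leading "~" to the given home directory.
function expandTilde(p, home = os.homedir()) {
  if (p === "~") return home;
  if (p.startsWith("~/")) return path.join(home, p.slice(2));
  return p;
}

// Read the documented variables, falling back to the example values above
// (assumed fallbacks, for illustration only).
function resolveConfig(env = process.env, home = os.homedir()) {
  return {
    dbPath: expandTilde(env.VJDOC_DB_PATH || "~/mcpdata/docs.json", home),
    tfidfDir: expandTilde(env.VJDOC_TFIDF_FILES_DIR || "~/mcpdata/tfidf", home),
    logLevel: env.VJDOC_LOG_LEVEL || "info",
    logToFile: env.VJDOC_LOG_TO_FILE === "true",
    maxDepth: Number(env.VJDOC_MAX_DEPTH || 4),
    maxPages: Number(env.VJDOC_MAX_PAGES || 100),
    port: Number(env.STREAMABLE_HTTP_PORT || 3000),
  };
}
```

Passing the environment and home directory as arguments keeps the function easy to exercise without touching the real process environment.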
### Dify Stream Mode Configuration
```json
{
"mcpServers": {
"mcp-vjdoc": {
"transport": "streamable_http",
"url": "http://192.168.2.9:3000/mcp"
}
}
}
```
- Replace `192.168.2.9` with the actual address of your local network interface.
- Replace `3000` with the port you use when running the serve command.
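If you generate this configuration from a script, a small helper along these lines may be convenient. `buildDifyConfig` is a hypothetical name used only for this sketch; the `/mcp` path and the `streamable_http` transport value are taken from the example above.

```javascript
// Sketch only: buildDifyConfig is a hypothetical helper for this README.
function buildDifyConfig(host, port = 3000, serverName = "mcp-vjdoc") {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`Invalid port: ${port}`);
  }
  return {
    mcpServers: {
      [serverName]: {
        transport: "streamable_http",
        url: `http://${host}:${port}/mcp`,
      },
    },
  };
}

// Print a config for the example address used above.
console.log(JSON.stringify(buildDifyConfig("192.168.2.9", 3000), null, 2));
```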
### Using as an MCP Server
To use this server with MCP clients (such as Dify, Claude, or other applications that support MCP), add the following to your MCP configuration:
```json
{
"vj-docs": {
"command": "node",
"args": [
"/path/to/mcp-vj-docs/dist/index.js"
],
"env": {
"FIRECRAWL_API_KEY": "YOUR_API_KEY_HERE",
"VJDOC_MAX_DEPTH": "4",
"VJDOC_MAX_PAGES": "100",
"VJDOC_DB_PATH": "~/mcpdata/docs.json",
"VJDOC_LOG_DIR": "~/mcpdata/logs",
"VJDOC_LOG_TO_FILE": "true",
"VJDOC_LOG_LEVEL": "debug",
"FIRECRAWL_API_URL": "http://localhost:5002",
"VJDOC_TFIDF_FILES_DIR": "~/mcpdata/tfidf"
},
"disabled": false,
"timeout": 3600
}
}
```
### Common Use Cases
#### Crawl Websites and Index Content
Use the MCP tool `vjdoc_crawl` to crawl websites:
```json
{
"name": "vjdoc_crawl",
"arguments": {
"url": "https://example.com/docs",
"maxDepth": 3,
"maxPages": 50,
"includePatterns": ["*/docs/*"],
"excludePatterns": ["*/api/*"],
"defaultCategory": "Documentation"
}
}
```
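On the wire, MCP clients send tool invocations as JSON-RPC 2.0 requests with the method `tools/call`. The sketch below wraps the name/arguments pair shown above into such a request; `buildToolCall` is a hypothetical helper, and most MCP clients do this wrapping for you.

```javascript
// Sketch only: buildToolCall is a hypothetical helper, not part of the package.
// It wraps a tool name and its arguments in a JSON-RPC 2.0 "tools/call" request.
function buildToolCall(id, name, args = {}) {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name, arguments: args },
  };
}

const crawlRequest = buildToolCall(1, "vjdoc_crawl", {
  url: "https://example.com/docs",
  maxDepth: 3,
  maxPages: 50,
});
console.log(JSON.stringify(crawlRequest));
```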
#### Search Documents
Use the MCP tool `vjdoc_search` to search documents:
```json
{
"name": "vjdoc_search",
"arguments": {
"query": "如何配置服务器",
"limit": 5,
"filters": {
"categories": ["Documentation", "Tutorial"]
}
}
}
```
### Troubleshooting
#### Common Issues
1. **Command Not Found**
   - Ensure the package is installed globally, or run the command through npx: `npx vjdoc-cli stream-start`
2. **Permission Errors**
   - Ensure the data directory (~/mcpdata) exists and is writable
   - Create it with `mkdir -p ~/mcpdata`; avoid `sudo` here, because it would leave the directory owned by root
3. **Cannot Connect to Server**
   - Check whether the port is already in use: `lsof -i :3000`
   - Ensure the firewall is not blocking the connection
4. **Crawler API Errors**
   - Verify that `FIRECRAWL_API_KEY` is correct
   - Check that `FIRECRAWL_API_URL` is reachable
---
## Features
- **Documentation Crawling**: Crawl documentation from websites using Firecrawl
- **Content Processing**: Convert HTML to Markdown and extract relevant content
- **Storage & Indexing**: Store documents using lowdb with TF-IDF based indexing
- **LLM-Optimized Search**: Search for documentation with aggregated results optimized for large language models
- **Full Content Return**: No character length limits on search results
- **Content-First Results**: Prioritizes content over URLs in search results
- **Smart Deduplication**: Removes duplicate content and returns only the top 3 most relevant results
- **AI-Optimized Format**: Results structured specifically for AI consumption and code generation
- **Complete Document Context**: Returns full document content via `fullDocument` field for comprehensive context
- **Enhanced Metadata Search**: Analyzes title, URL, and all metadata fields with field-specific weighting
- **Multi-dimensional Scoring**: Evaluates document relevance across content, metadata, URL, and title
- **Custom Corpus Management**: Add your own text corpus files for inclusion in search results
- **Multiple Format Support**: Supports TXT, Markdown, and PDF files
- **Recursive Directory Scanning**: Automatically discovers files in nested subdirectories
- **Automatic Indexing**: Files in corpus directory are automatically indexed and searchable
- **MCP Integration**: Expose tools for crawling and searching via Model Context Protocol
- **Path Handling**: Support for tilde (~) expansion in file paths
- **Server Modes**: Support for both SSE (Server-Sent Events) and stdio transports
- **Multilingual Support**: Enhanced handling for Chinese queries with specialized tokenization
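To make the TF-IDF based ranking mentioned in the feature list more concrete, here is a minimal, self-contained sketch. It is not the package's implementation: `tokenize`, `buildIndex`, and `search` are hypothetical names, the IDF formula is one common variant, and the tokenizer does no Chinese word segmentation (the package describes specialized tokenization for that).

```javascript
// Sketch only: a toy TF-IDF ranker, not the package's search code.

// Lowercase and split on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
}

// Count term frequencies per document and document frequencies per term.
function buildIndex(docs) {
  const df = new Map();
  const entries = docs.map((doc) => {
    const tf = new Map();
    const tokens = tokenize(doc.content);
    for (const t of tokens) tf.set(t, (tf.get(t) || 0) + 1);
    for (const t of tf.keys()) df.set(t, (df.get(t) || 0) + 1);
    return { doc, tf, length: tokens.length || 1 };
  });
  return { entries, df, size: docs.length };
}

// Score each document as the sum of tf * idf over the query terms.
function search(index, query, limit = 3) {
  const terms = tokenize(query);
  return index.entries
    .map(({ doc, tf, length }) => {
      let score = 0;
      for (const t of terms) {
        const freq = tf.get(t) || 0;
        if (freq === 0) continue;
        const idf = Math.log(1 + index.size / index.df.get(t));
        score += (freq / length) * idf;
      }
      return { doc, score };
    })
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

The default `limit` of 3 mirrors the "top 3 most relevant results" behaviour described above; deduplication is left out of this sketch.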
## Changelog
### 2026-04-07 (v0.1.73)
- **Recursive Directory Scanning**: Corpus files in nested subdirectories are now automatically discovered, indexed, and searchable
  - `loadTfidfFiles()` and `getCorpusDocuments()` both scan recursively
  - Document metadata now includes `relativePath` for nested file location
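A recursive scan of this kind can be sketched with Node's `fs` module as follows. This is an illustration, not the code behind `loadTfidfFiles()` or `getCorpusDocuments()`: `scanCorpusDir` is a hypothetical function, and the extension list is an assumption based on the supported formats (TXT, Markdown, PDF).

```javascript
// Sketch only: scanCorpusDir is a hypothetical illustration of recursive discovery.
const fs = require("node:fs");
const path = require("node:path");

// Assumed extensions for the TXT, Markdown, and PDF formats.
const CORPUS_EXTENSIONS = new Set([".txt", ".md", ".pdf"]);

// Walk rootDir depth-first and collect corpus files with their relativePath.
function scanCorpusDir(rootDir, currentDir = rootDir, found = []) {
  for (const entry of fs.readdirSync(currentDir, { withFileTypes: true })) {
    const fullPath = path.join(currentDir, entry.name);
    if (entry.isDirectory()) {
      scanCorpusDir(rootDir, fullPath, found);
    } else if (
      entry.isFile() &&
      CORPUS_EXTENSIONS.has(path.extname(entry.name).toLowerCase())
    ) {
      found.push({
        path: fullPath,
        relativePath: path.relative(rootDir, fullPath),
      });
    }
  }
  return found;
}
```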
### 2025-04-22 (v0.1.61)
- **Enhanced Metadata Search**: The search algorithm now analyzes multiple dimensions:
  - **Title Analysis**: Improved title matching with weighted scoring for full and partial matches
  - **URL Analysis**: Extracts and scores URL segments for keyword relevance
  - **Metadata Field Analysis**: Individually processes important fields like keywords, description, and category
  - **Field-Specific Weighting**: Assigns different weights to matches in different metadata fields
  - **Chinese Query Optimization**: Implements specialized tokenization for Chinese characters and phrases
### 2025-04-11
- **Search Result Enhancement**: Modified search functionality to include relevant paragraphs for each individual result item, rather than only showing content for the top result.
- **Result Format Improvement**: Changed the structure to make it clearer which document content belongs to which search result.
- **Document Retrieval Enhancement**: Improved the `vjdoc_get_document` tool to support partial matching for both URL and title parameters.
## Installation
```bash
# Install globally
npm install -g @vjlanguage/mcp-vj-docs
# Or run with npx
npx @vjlanguage/mcp-vj-docs
```
## Firecrawl Registration and API Key
This package uses the Firecrawl service for web crawling. To use it, you need to:
1. **Register for Firecrawl**:
   - Visit the [Firecrawl website](https://firecrawl.dev) and create an account
   - Or use a local Firecrawl service by setting `FIRECRAWL_API_URL` to your local endpoint
2. **Get your API key**:
   - After registering, go to your account dashboard
   - Find and copy your API key
   - Add this key to your environment variables or MCP configuration
3. **Configure the API key**:
   - Set the `FIRECRAWL_API_KEY` environment variable
   - Or add it to your MCP configuration (see example below)
## Usage
### Environment Variables
- `VJDOC_DB_PATH` - Path to the database file (default: ./data/docs.json)
- `VJDOC_MAX_DEPTH` - Maximum depth to crawl (default: 3)
- `VJDOC_MAX_PAGES` - Maximum number of pages to crawl (default: 100)
- `VJDOC_LOG_DIR` - Directory for log files
- `VJDOC_LOG_TO_FILE` - Whether to log to a file (true/false)
- `VJDOC_LOG_LEVEL` - Log level (error, warn, info, debug)
- `FIRECRAWL_API_KEY` - API key for the Firecrawl service
- `FIRECRAWL_API_URL` - Custom URL for the Firecrawl API
- `MCP_TRANSPORT` - Transport method (sse or stdio, default: sse)
- `VJDOC_TFIDF_FILES_DIR` - Directory for custom corpus files (default: ~/mcpdata/tfidf_files)
```json
{
"mcpServers": {
"mcp-vj-docs": {
"command": "npx",
"args": ["-y", "@vjlanguage/mcp-vj-docs@latest"],
"env": {
"FIRECRAWL_API_KEY": "YOUR_API_KEY_HERE",
"VJDOC_MAX_DEPTH": "4",
"VJDOC_MAX_PAGES": "100",
"VJDOC_DB_PATH": "~/mcpdata/docs.json",
"VJDOC_LOG_DIR": "~/mcpdata/logs",
"VJDOC_LOG_TO_FILE": "true",
"VJDOC_LOG_LEVEL": "debug",
"FIRECRAWL_API_URL": "http://localhost:5002",
"VJDOC_TFIDF_FILES_DIR": "~/mcpdata/tfidf_files"
},
"disabled": false,
"timeout": 3600,
"autoApprove": ["vjdoc_search", "vjdoc_crawl", "vjdoc_add_corpus_file"]
}
}
}
```
## MCP Tools
The server exposes the following MCP tools:
### 1. `vjdoc_crawl` Tool
Crawls a website and indexes its content for search.
**Parameters:**
- `url` (string, required): The URL to crawl (e.g., "https://example.com/docs")
- `maxDepth` (number, optional): Maximum depth to crawl, default: 3
- `maxPages` (number, optional): Maximum number of pages to crawl, default: 100
- `includePatterns` (array of strings, optional): Patterns to include in the crawl (e.g., ["docs/*"])
- `excludePatterns` (array of strings, optional): Patterns to exclude from the crawl (e.g., ["blog/*"])
- `defaultCategory` (string, optional): Category assigned to documents when none is detected automatically
**Example:**
```json
{
"url": "https://example.com/docs",
"maxDepth": 3,
"maxPages": 100,
"includePatterns": ["docs/*"],
"excludePatterns": ["blog/*"]
}
```
**Response:**
```json
{
"success": true,
"message": "Successfully crawled and indexed 42 pages from https://example.com/docs",
"count": 42
}
```
### 2. `vjdoc_search` Tool
Searches indexed documents with results optimized for large language models.
**Parameters:**
- `query` (string, required): The search query (e.g., "how to use the API")
- `limit` (number, optional): Maximum number of sources to consider, default: 10
- `filters` (object, optional): Filters to narrow down search results
  - `categories` (array of strings, optional): Filter by document categories
  - `dateFrom` (number, optional): Only include documents created after this timestamp
  - `dateTo` (number, optional): Only include documents created before this timestamp
  - `metadata` (object, optional): Filter by metadata fields
- `userId` (string, optional): User ID for personalized results
**Example:**
```json
{
"query": "how to use the API",
"limit": 5,
"filters": {
"categories": ["API Documentation"]
}
}
```
**Response:**
```json
{
"success": true,
"results": {
"paragraph": "The API can be used by making HTTP requests to the endpoints...",
"sources": [
{
"url": "https://example.com/docs/api",
"title": "API Documentation",
"relevance": 0.85,
"paragraph": "The API can be used by making HTTP requests to the endpoints...",
"highlightedParagraph": "The **API** can be used by making **HTTP** requests to the **endpoints**...",
"fullDocument": "Complete document content for this specific result..."
}
]
}
}
```
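The `filters` parameter described above could be applied to candidate documents roughly as follows. This is a sketch under stated assumptions: `applyFilters` is a hypothetical name, documents are assumed to carry `category`, `timestamp` (a number, as in the metadata examples), and `metadata` fields, and the date bounds are treated as inclusive.

```javascript
// Sketch only: applyFilters is a hypothetical illustration of the filters parameter.
function applyFilters(docs, filters = {}) {
  const { categories, dateFrom, dateTo, metadata } = filters;
  return docs.filter((doc) => {
    // Keep only documents in one of the requested categories.
    if (categories && categories.length > 0 && !categories.includes(doc.category)) {
      return false;
    }
    // Date bounds are treated as inclusive in this sketch.
    if (dateFrom !== undefined && doc.timestamp < dateFrom) return false;
    if (dateTo !== undefined && doc.timestamp > dateTo) return false;
    // Every requested metadata field must match exactly.
    if (metadata) {
      for (const [key, value] of Object.entries(metadata)) {
        if ((doc.metadata || {})[key] !== value) return false;
      }
    }
    return true;
  });
}
```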
### 3. `vjdoc_add_corpus_file` Tool
Adds a custom corpus file to the TF-IDF files directory so that it is included in search results. This is useful for your own code snippets, documentation, error solutions, or technical notes that you want to be searchable.
**Parameters:**
- `content` (string, required): The text content to add to the corpus file
- `filename` (string, optional): Filename for the corpus file (without extension)
- `category` (string, optional): Category for the corpus file
**Recommended Categories:**
- `Code Snippet` - Reusable code patterns and examples
- `API Documentation` - Function and parameter descriptions
- `Error Solution` - Common errors and their fixes
- `Technical Note` - Personal learning summaries
**Example:**
```json
{
"content": "// 快速排序实现\nfunction quickSort(arr) {\n if (arr.length <= 1) return arr;\n const pivot = arr[0];\n const left = []; \n const right = [];\n for (let i = 1; i < arr.length; i++) {\n arr[i] < pivot ? left.push(arr[i]) : right.push(arr[i]);\n }\n return [...quickSort(left), pivot, ...quickSort(right)];\n}\n\n// 常见错误:Uncaught TypeError\n// 解决方案:检查变量是否为null/undefined",
"filename": "quicksort_algorithm",
"category": "Code Snippet"
}
```
**Response:**
```json
{
"success": true,
"message": "Successfully added corpus file: code_snippet_quicksort_algorithm.txt",
"filename": "code_snippet_quicksort_algorithm.txt",
"category": "Code Snippet"
}
```
### 4. `vjdoc_get_docs_meta` Tool
Retrieves metadata about all documents and corpus files to help LLMs understand the available content and plan effective searches.
**Parameters:**
- `query` (string, required): Natural language query or requirement
**Response Format:**
```json
{
"query": "Original natural language query",
"documents": [
{
"url": "Document URL",
"title": "Document title",
"category": "Document category",
"timestamp": 1712190000000,
"keywords": ["keyword1", "keyword2", "..."],
"summary": "Brief summary of document content..."
}
],
"totalDocuments": 42,
"categories": ["API Documentation", "Code Snippet", "..."],
"suggestion": "Search guidance for LLMs"
}
```
### 5. `vjdoc_get_document` Tool
Gets the full content of a specific document by URL or title.
**Parameters:**
- `url` (string, optional): URL of the document to retrieve
- `title` (string, optional): Title of the document to retrieve
**Notes:**
- At least one of `url` or `title` must be provided
- The tool supports partial matching for both parameters
- With the `url` parameter, it finds documents whose URL contains the provided string
- With the `title` parameter, it finds documents whose title contains the provided string (case-insensitive)
**Example:**
```json
{
"url": "https://example.com/docs/auth"
}
```
or
```json
{
"title": "Authentication Guide"
}
```
**Response:**
```json
{
"url": "https://example.com/docs/auth",
"title": "Authentication Guide",
"content": "Complete document content...",
"metadata": {
"category": "API Documentation",
"lastModified": "2023-01-15T12:00:00Z"
}
}
```
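The partial-matching rules listed in the notes for this tool can be sketched as a small lookup function. `findDocument` is a hypothetical name, and requiring both conditions to hold when both `url` and `title` are given is an assumption of this sketch rather than documented behaviour.

```javascript
// Sketch only: findDocument is a hypothetical illustration of partial matching.
function findDocument(docs, { url, title } = {}) {
  if (!url && !title) {
    throw new Error("At least one of url or title must be provided");
  }
  const wantedTitle = title ? title.toLowerCase() : null;
  const match = docs.find((doc) => {
    // URL: substring match.
    if (url && !doc.url.includes(url)) return false;
    // Title: case-insensitive substring match.
    if (wantedTitle && !doc.title.toLowerCase().includes(wantedTitle)) return false;
    return true;
  });
  return match || null;
}
```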
## Using with AI Coding Assistants
You can use these MCP tools with various AI coding assistants to enhance your documentation workflow.
### Using with Cursor
In Cursor, you can use the MCP tools through the command interface:
1. **Setup**: Configure Cursor to use your MCP server
2. **Crawling**: Use the `/mcp` command to invoke the crawl tool
   ```
   /mcp mcp-vj-docs vjdoc_crawl {"url": "https://example.com/docs", "maxDepth": 3, "maxPages": 100}
   ```
3. **Searching**: Use the `/mcp` command to invoke the search tool
   ```
   /mcp mcp-vj-docs vjdoc_search {"query": "authentication", "limit": 5, "filters": {"categories": ["API Documentation"]}}
   ```
4. **Adding Corpus Files**: Use the `/mcp` command to add custom corpus files
   ```
   /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "// Your code here", "category": "Code Snippet"}
   ```
5. **Getting Document Content**: Use the `/mcp` command to get full document content
   ```
   /mcp mcp-vj-docs vjdoc_get_document {"url": "https://example.com/docs/auth"}
   ```
   or
   ```
   /mcp mcp-vj-docs vjdoc_get_document {"title": "Authentication Guide"}
   ```
### Advanced Workflow with AI Assistants
When working with AI assistants like Claude or GPT, you can create a more effective workflow:
1. **First, get document metadata** to understand what's available:
```
/mcp mcp-vj-docs vjdoc_get_docs_meta {"query": "I need to implement JWT authentication"}
```
2. **Then, search for relevant documents**:
```
/mcp mcp-vj-docs vjdoc_search {"query": "JWT authentication implementation", "limit": 3}
```
3. **Finally, get the full content** of the most relevant document for comprehensive context:
```
/mcp mcp-vj-docs vjdoc_get_document {"url": "https://example.com/docs/auth/jwt"}
```
4. **Ask the AI assistant** to explain or generate code based on the full document:
```
Based on this documentation, please explain how to implement JWT authentication in my Node.js application.
```
This workflow ensures the AI has complete context while minimizing token usage by only retrieving full content for the most relevant documents.
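The workflow above can also be scripted. The sketch below assumes a caller-supplied `callTool(name, args)` function that invokes an MCP tool and resolves to its parsed result; that function, the `progressiveLookup` name, and the `minRelevance` threshold used to decide whether the full document is worth fetching are all assumptions for illustration.

```javascript
// Sketch only: progressiveLookup and callTool are hypothetical.
// callTool(name, args) must return a Promise of the tool's parsed result.
async function progressiveLookup(callTool, query, minRelevance = 0.8) {
  // Step 1: cheap metadata lookup.
  const meta = await callTool("vjdoc_get_docs_meta", { query });
  const firstMeta = (meta.documents || [])[0];
  if (!firstMeta) return { stage: "meta", context: "" };

  // Step 2: search for relevant documents.
  const found = await callTool("vjdoc_search", { query, limit: 3 });
  const best = (found.results || [])[0];
  if (!best) return { stage: "meta", context: firstMeta.summary || "" };
  if (best.relevance < minRelevance) {
    return { stage: "search", context: best.paragraph };
  }

  // Step 3: fetch the full document only for a strong match.
  const doc = await callTool("vjdoc_get_document", { url: best.url });
  return { stage: "document", context: doc.content };
}
```

The returned `context` can then be placed in the prompt for step 4.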
## Optimized Token Usage with vjdoc_get_docs_meta
### Changelog (v0.1.60) - 2025-04-22
{{ ... }}
## Search Tool Response Format
The `vjdoc_search` tool returns results in the following format:
```json
{
"results": [
{
"url": "https://example.com/docs/api",
"title": "API Documentation",
"relevance": 0.85,
"category": "API Documentation",
"paragraph": "Content excerpt most relevant to this document...",
"highlightedParagraph": "Content with **highlighted** query terms for this document...",
"fullDocument": "Complete content for this specific document..." // Only present for the most relevant result
},
{
"url": "https://example.com/docs/guide",
"title": "User Guide",
"relevance": 0.75,
"category": "Documentation",
"paragraph": "Content excerpt most relevant to this document...",
"highlightedParagraph": "Content with **highlighted** query terms for this document..."
// No fullDocument field for lower-ranked results
},
// More results...
],
"content": "Summary of content most relevant to the query...",
"fullDocument": "Complete document of the most relevant result",
"personalized": true
}
```
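Key fields:
- `results`: List of sources with relevance scores
  - Each result includes:
    - `url`: Document URL
    - `title`: Document title
    - `relevance`: Relevance score
    - `category`: Document category
    - `paragraph`: Relevant excerpt from this specific document
    - `highlightedParagraph`: The same excerpt with query terms highlighted
    - `fullDocument`: Complete document content (only for the most relevant result)
- `content`: Summary of the extracted content relevant to the query
- `fullDocument`: Complete document content of the most relevant result
- `personalized`: Whether the results were personalized based on the user ID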
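A client consuming this response might pick the richest available context with a helper like the one below. `pickContext` is a hypothetical name, and the fallback order (top-level `fullDocument`, then a result-level `fullDocument`, then the joined `paragraph` excerpts) is an assumption of this sketch.

```javascript
// Sketch only: pickContext is a hypothetical helper for consuming a search response.
function pickContext(response) {
  // Prefer the top-level full document of the most relevant result.
  if (response.fullDocument) return response.fullDocument;

  const results = response.results || [];
  // Otherwise use the first result that carries its own full document.
  const withDoc = results.find((r) => r.fullDocument);
  if (withDoc) return withDoc.fullDocument;

  // Fall back to the per-result excerpts.
  return results
    .map((r) => r.paragraph)
    .filter(Boolean)
    .join("\n\n");
}
```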
## Examples
### Searching Across Database and Corpus
```
/mcp mcp-vj-docs vjdoc_search {"query": "authentication", "limit": 5}
```
This will search for "authentication" in both the crawled documents (database) and your custom corpus files.
### Using Natural Language Queries
For natural language requirements, you can use the metadata tool first:
```
/mcp mcp-vj-docs vjdoc_get_docs_meta {"query": "I need to implement user authentication in my React application"}
```
Then use the search tool with the refined query:
```
/mcp mcp-vj-docs vjdoc_search {"query": "React authentication implementation", "filters": {"categories": ["Code Snippet", "API Documentation"]}}
```
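### Utilizing the fullDocument Field
When working with LLMs, you can use the `fullDocument` field to provide comprehensive context:
```javascript
// Example of passing the fullDocument field to an LLM.
// `searchDocs` and `llm` are placeholders for your own MCP client wrapper
// and LLM client; they are not provided by this package.
const searchResults = await searchDocs("How to implement JWT authentication");
const fullContext = searchResults.fullDocument;
// Now you can ask the LLM to generate code based on the full document
const generatedCode = await llm.generateCode(
  `Based on this documentation: ${fullContext}\n\nGenerate a JWT authentication implementation`
);
```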
## Real-World Use Cases
### Personal Knowledge Base
- Save code snippets you use frequently for easy reference
- Document API endpoints with examples
- Keep track of error messages and their solutions
- Store configuration examples for different environments
- Create a personal knowledge base of technical notes
**Pro Tip:**
Organize your corpus files with consistent categories to make searching more effective. You can then filter search results by category to find exactly what you need.
## PDF Support
The system supports adding PDF files to the corpus. PDFs are automatically converted to Markdown for better searchability.
**Adding a PDF file in Cline**:
Provide the absolute path to your PDF file:
```bash
cline mcp mcp-vj-docs vjdoc_add_corpus_file --filePath "/absolute/path/to/your/document.pdf" --category "Documentation"
```
**Adding a PDF file in Cursor**:
Provide the absolute path to your PDF file:
```
/mcp mcp-vj-docs vjdoc_add_corpus_file {"filePath": "/absolute/path/to/your/document.pdf", "category": "Documentation"}
```
The system extracts text from the PDF and converts it to Markdown, preserving structure such as headings, code blocks, and lists where possible.
## How It Works
1. When you add a corpus file, it is saved to the `VJDOC_TFIDF_FILES_DIR` directory
2. If you don't specify a filename, one is generated automatically
3. The category is added as a prefix to the filename
4. The file is automatically indexed and appears in search results
5. You can search for this content later using the `vjdoc_search` tool
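The naming steps above can be sketched as follows. The `slug` and `corpusFilename` helpers are hypothetical, as is the timestamp-based fallback name; the only behaviour taken from this README is that the category is prefixed to the filename, as in `code_snippet_quicksort_algorithm.txt`.

```javascript
// Sketch only: slug and corpusFilename are hypothetical helpers.

// Lowercase, replace runs of non-alphanumeric characters with "_", trim "_".
function slug(text) {
  return text
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_")
    .replace(/^_+|_+$/g, "");
}

// Build "<category>_<filename>.txt"; generate a name when none is given.
// Note: this simple slug drops non-ASCII characters, so a purely non-ASCII
// filename also falls back to the generated name.
function corpusFilename(category, filename, now = Date.now()) {
  const base = slug(filename || "") || `corpus_${now}`;
  const prefix = category ? slug(category) : "";
  return `${prefix ? prefix + "_" : ""}${base}.txt`;
}
```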
## Practical Workflow Examples
Here are some practical workflows combining these tools:
1. **Documentation Indexing**
   - Crawl your project documentation:
     ```
     /mcp mcp-vj-docs vjdoc_crawl {"url": "https://your-project-docs.com"}
     ```
   - Add custom code snippets:
     ```
     /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "// Your code here", "category": "Code Snippet"}
     ```
   - Search across all indexed content:
     ```
     /mcp mcp-vj-docs vjdoc_search {"query": "how to implement feature X"}
     ```
2. **Personal Knowledge Base**
   - Add error solutions as you encounter them:
     ```
     /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "Error: Module not found\nSolution: Run npm install", "category": "Error Solution"}
     ```
   - Add API documentation for your projects:
     ```
     /mcp mcp-vj-docs vjdoc_add_corpus_file {"content": "function getData(id) - Retrieves data by ID from the API", "category": "API Documentation"}
     ```
   - Search your knowledge base when needed:
     ```
     /mcp mcp-vj-docs vjdoc_search {"query": "module not found", "filters": {"categories": ["Error Solution"]}}
     ```
### Recommended Usage Strategy
The optimized `vjdoc_get_docs_meta` tool is designed to be more token-efficient while still providing valuable context to LLMs. Here's how to best utilize it:
1. **Start with a Specific Query**: The more specific your query, the more relevant the returned documents will be.
```
/mcp mcp-vj-docs vjdoc_get_docs_meta {"query": "React server components vs client components"}
```
2. **Review the Top Results**: The tool now returns only the most relevant documents with their snippets and relevance scores.
3. **Use the Suggested Tool Calls**: The response includes ready-to-use examples for:
   - Searching with `vjdoc_search` for more detailed results
   - Getting full document content with `vjdoc_get_document` for the most relevant document
4. **Progressive Disclosure Pattern**: This optimized approach follows a "progressive disclosure" pattern:
   - Start with metadata (minimal tokens)
   - Progress to search results (moderate tokens)
   - Finally get full document content (maximum tokens) only when necessary
This approach is especially valuable in contexts where token usage directly impacts costs or performance.
### Example Response Format
The optimized response format includes the fields below. The `documents` array is limited to the most relevant documents; one entry is shown here:
```json
{
  "query": "React hooks",
  "documents": [
    {
      "url": "https://example.com/docs/react/hooks/usestate",
      "title": "useState Hook",
      "category": "React Hooks",
      "score": 0.95,
      "snippet": "useState is a Hook that lets you add React state to function components..."
    }
  ],
  "totalDocuments": 10,
  "categories": ["React Hooks", "React Basics"],
  "userPrompt": "我找到了10个与\"React hooks\"相关的文档...(简洁的提示文本)"
}
```
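A client that consumes this response may want to check its shape before relying on it. The following is a minimal TypeScript sketch of a runtime type guard. The field names follow the example response above, while the interface and function names (`DocsMetaResponse`, `isDocsMetaResponse`, and so on) are hypothetical and not exported by the package. It also assumes every field shown in the example is always present.

```typescript
// Field names follow the example response; the type names are hypothetical.
interface DocsMetaDocument {
  url: string;
  title: string;
  category: string;
  score: number;
  snippet: string;
}

interface DocsMetaResponse {
  query: string;
  documents: DocsMetaDocument[];
  totalDocuments: number;
  categories: string[];
  userPrompt: string;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isDocsMetaDocument(value: unknown): value is DocsMetaDocument {
  return (
    isRecord(value) &&
    typeof value.url === "string" &&
    typeof value.title === "string" &&
    typeof value.category === "string" &&
    typeof value.score === "number" &&
    typeof value.snippet === "string"
  );
}

// Assumes all fields from the example are required.
function isDocsMetaResponse(value: unknown): value is DocsMetaResponse {
  if (!isRecord(value)) {
    return false;
  }
  const docs = value.documents;
  const cats = value.categories;
  return (
    typeof value.query === "string" &&
    Array.isArray(docs) &&
    docs.every(isDocsMetaDocument) &&
    typeof value.totalDocuments === "number" &&
    Array.isArray(cats) &&
    cats.every((c) => typeof c === "string") &&
    typeof value.userPrompt === "string"
  );
}

const rawResponse: unknown = JSON.parse(
  '{"query":"React hooks","documents":[],"totalDocuments":0,"categories":[],"userPrompt":""}'
);
console.log(isDocsMetaResponse(rawResponse)); // prints true
```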
## Troubleshooting
### Common Issues
1. **Database Path Issues**
   - Ensure the directory for your database exists
   - Check that you have write permission for the specified path
   - For tilde paths, ensure your home directory is correctly detected
2. **Firecrawl API Issues**
   - Verify that your API key is correct
   - Check whether you have reached the API rate limits
   - If using a local Firecrawl service, ensure it is running
3. **Crawling Issues**
   - Some websites may block crawlers
   - Check whether the website requires authentication
   - Try reducing the crawl depth and page limit
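For the tilde-path item under database path issues, it can help to see what expansion is expected to produce. The sketch below is an illustrative TypeScript helper and not the package's own implementation: `expandTilde` is a hypothetical name, and the home directory is passed in explicitly (in Node.js it would typically come from `os.homedir()`).

```typescript
// Expand a leading "~" to the given home directory.
// Hypothetical helper; the package's real expansion logic may differ.
function expandTilde(inputPath: string, homeDir: string): string {
  if (inputPath.charAt(0) !== "~") {
    // Ordinary paths are returned unchanged.
    return inputPath;
  }
  if (inputPath.length === 1) {
    return homeDir;
  }
  const second = inputPath.charAt(1);
  if (second === "/" || second === "\\") {
    // Drop the "~", keep the separator, and avoid a doubled separator.
    const home = homeDir.replace(/[\\/]+$/, "");
    return home + inputPath.slice(1);
  }
  // "~user" forms are not handled here and are returned unchanged.
  return inputPath;
}

console.log(expandTilde("~/mcpdata/docs.json", "/home/alice"));
// prints "/home/alice/mcpdata/docs.json"
```

If the expanded path does not point where you expect, the detected home directory is the first thing to check.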
### Logs
Check the logs for more detailed error information:
- If `VJDOC_LOG_TO_FILE` is enabled, check the log files in your log directory
- Otherwise, check the console output
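## Transport Configuration
The MCP server supports multiple transport protocols, which can be configured through environment variables:
### Environment Variables
#### Transport Protocol Control
- `ENABLE_STDIO_TRANSPORT`: Controls whether the standard input/output (stdio) transport is enabled (default: true; set to 'false' to disable)
- `ENABLE_STREAMABLE_HTTP`: Controls whether the Streamable HTTP transport is enabled (default: false; set to 'true' to enable)
- `ENABLE_LEGACY_SSE`: Controls whether the legacy SSE endpoint is enabled (default: false; set to 'true' to enable)
#### Port Configuration
- `STREAMABLE_HTTP_PORT`: Sets the port for the Streamable HTTP server (default: 3000)
- `LEGACY_SSE_PORT`: Sets the port for the legacy SSE server (default: 3001)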
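The defaults listed above can be summarized in code. The following TypeScript sketch shows one way these variables could be read. It is an illustration and not the server's actual implementation: `readTransportConfig`, `parsePort`, and `TransportConfig` are hypothetical names, and falling back to the default port on an invalid value is an assumption. The environment is passed in as a plain object; in Node.js that would be `process.env`.

```typescript
interface TransportConfig {
  stdio: boolean;
  streamableHttp: boolean;
  legacySse: boolean;
  streamableHttpPort: number;
  legacySsePort: number;
}

type Env = Record<string, string | undefined>;

// Parse a port; fall back to the default when unset or invalid (assumed behavior).
function parsePort(raw: string | undefined, fallback: number): number {
  if (raw === undefined || !/^\d+$/.test(raw)) {
    return fallback;
  }
  const port = parseInt(raw, 10);
  return port >= 1 && port <= 65535 ? port : fallback;
}

function readTransportConfig(env: Env): TransportConfig {
  return {
    // Enabled by default; only the literal string "false" disables it.
    stdio: env.ENABLE_STDIO_TRANSPORT !== "false",
    // Disabled by default; only the literal string "true" enables them.
    streamableHttp: env.ENABLE_STREAMABLE_HTTP === "true",
    legacySse: env.ENABLE_LEGACY_SSE === "true",
    streamableHttpPort: parsePort(env.STREAMABLE_HTTP_PORT, 3000),
    legacySsePort: parsePort(env.LEGACY_SSE_PORT, 3001),
  };
}

console.log(readTransportConfig({ ENABLE_STREAMABLE_HTTP: "true", STREAMABLE_HTTP_PORT: "8080" }));
```

Note the asymmetry: stdio is opt-out, while both HTTP-based transports are opt-in.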
### Usage Examples
```bash
# Enable all transport protocols
export ENABLE_STDIO_TRANSPORT=true
export ENABLE_STREAMABLE_HTTP=true
export ENABLE_LEGACY_SSE=true
# Set the ports
export STREAMABLE_HTTP_PORT=3000
export LEGACY_SSE_PORT=3001
# Run the server
npm start
```
### Transport Protocol Description
#### Streamable HTTP Transport
A modern HTTP transport that supports streaming and session management. It exposes the following endpoints:
- `POST /mcp`: Handles client-to-server communication
- `GET /mcp`: Handles server-to-client notifications (via SSE)
- `DELETE /mcp`: Handles session termination
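To make the endpoints above concrete, the following TypeScript sketch builds the requests a client might send to open and close a session. The header names (`Accept`, `Mcp-Session-Id`) and the `initialize` payload follow the public MCP Streamable HTTP specification (2025-03-26 revision); whether this server expects exactly these values is an assumption. The function names, the `HttpRequestSpec` type, and the `example-client` identity are hypothetical.

```typescript
interface HttpRequestSpec {
  method: "POST" | "GET" | "DELETE";
  url: string;
  headers: Record<string, string>;
  body?: string;
}

function mcpUrl(baseUrl: string): string {
  // Strip trailing slashes so the result has exactly one "/mcp" suffix.
  return baseUrl.replace(/\/+$/, "") + "/mcp";
}

// POST /mcp: the first client-to-server message, a JSON-RPC "initialize" request.
function buildInitializeRequest(baseUrl: string): HttpRequestSpec {
  return {
    method: "POST",
    url: mcpUrl(baseUrl),
    headers: {
      "Content-Type": "application/json",
      // Per the MCP spec, clients accept both plain JSON and SSE responses.
      Accept: "application/json, text/event-stream",
    },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: 1,
      method: "initialize",
      params: {
        protocolVersion: "2025-03-26", // assumed protocol revision
        capabilities: {},
        clientInfo: { name: "example-client", version: "0.0.1" }, // hypothetical client
      },
    }),
  };
}

// DELETE /mcp: ends the session identified by the server-issued session ID.
function buildTerminateRequest(baseUrl: string, sessionId: string): HttpRequestSpec {
  return {
    method: "DELETE",
    url: mcpUrl(baseUrl),
    headers: { "Mcp-Session-Id": sessionId },
  };
}

console.log(buildInitializeRequest("http://localhost:3000").url);
// prints "http://localhost:3000/mcp"
```

Each returned object maps directly onto the arguments of an HTTP client such as `fetch`, which keeps the request construction easy to inspect separately from the network call.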
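#### Legacy SSE Transport
The legacy SSE transport exposes the following endpoints:
- `GET /sse`: Establishes the SSE connection
- `POST /messages?sessionId=<id>`: Handles client messages
#### Stdio Transport
The standard input/output transport, intended for command-line environments.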
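## CHANGELOG
### 2026-04-07 (v0.1.73)
- **Recursive directory scanning**: Corpus files now support recursive directory scanning; files in nested subdirectories are discovered and indexed automatically
  - Added the `scanFilesRecursive()` method, which recursively scans for `.txt`, `.md`, and `.pdf` files
  - Refactored both `loadTfidfFiles()` and `getCorpusDocuments()` to use recursive scanning
  - Added a `relativePath` field to document metadata, identifying the file's position relative to the directory tree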
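### 2025-05-08
- **Improved transport configuration**:
  - Implemented environment-variable-based transport configuration, so individual transports can be enabled or disabled
  - Added environment variables: `ENABLE_STDIO_TRANSPORT`, `ENABLE_STREAMABLE_HTTP`, `ENABLE_LEGACY_SSE`
  - Added port configuration environment variables: `STREAMABLE_HTTP_PORT`, `LEGACY_SSE_PORT`
- **Upgraded Streamable HTTP transport**:
  - Implemented session management according to the latest MCP specification
  - Added handling for POST, GET, and DELETE requests
  - Improved session ID generation and validation
- **Unified transport handling**:
  - Unified initialization and connection across all transports
  - Improved logging with more detailed transport status information
  - Strengthened error handling and resource cleanup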
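### 2025-04-24
- **Improved search result structure**:
  - Renamed the `content` field in search results to `paragraph` to reflect its contents more accurately
  - Added a `highlightedParagraph` field containing the paragraph with highlighting applied
  - Optimized the search result format for processing by large language models
- **Added PDF support**:
  - Integrated the pdf-parse library to parse and index PDF files
  - Added the `pdf-base64` content type, allowing PDF files to be added to the corpus directly
- **Improved path handling**:
  - Enhanced tilde path expansion for better cross-platform path support
  - Fixed issues related to Node.js path handling
- **Added advanced logging**:
  - Integrated the winston logging library for more detailed logs
  - Added configurable log levels and a file logging option
  - Added environment variables: `VJDOC_LOG_LEVEL`, `VJDOC_LOG_TO_FILE`, `VJDOC_LOG_DIR`
- **Document timestamp support**:
  - Added document timestamps to enable freshness-based document scoring
  - Improved the search algorithm to take document age into account
- **Enhanced crawl options**:
  - Added the `ignoreSitemap` option to ignore a site's sitemap.xml
  - Added the `allowExternalLinks` and `allowBackwardLinks` options to control crawl scope
  - Added `includePatterns` and `excludePatterns` arrays for precise control over which URLs are crawled
  - Added the `defaultCategory` option to set a default category for crawled documents
  - Added support for custom Firecrawl API configuration (`firecrawlApiKey` and `firecrawlApiUrl`)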