# MCP Basic Web Crawler

A Model Context Protocol (MCP) server for basic web crawling and search capabilities. This server enables AI assistants to search the web and extract content from web pages, with no API key required, with a focus on private, secure, and responsible web crawling. It is an example of a tool for LLM applications and is often used in conjunction with other tools (e.g. URL crawlers, RSS feeds, API-based web searches, context tools) to build comprehensive web content workflows.

## Features

### 🔍 **Web Search**

- DuckDuckGo search integration
- Structured search results with titles, URLs, and snippets
- Configurable result limits

### 🌐 **Content Extraction**

- Clean text extraction from web pages
- Support for single URLs or batch processing
- Automatic removal of scripts, styles, and navigation elements
- Memory-efficient processing for large content

### 🛡️ **Secure & Mindful Crawling**

- **Rate limiting**: Configurable request throttling to avoid overwhelming servers
- **Memory management**: Dynamic batch sizing based on available system memory
- **Graceful error handling**: Comprehensive error reporting and recovery
- **Respectful user agents**: Proper identification in requests
- **Content filtering**: Removes unnecessary elements for clean text extraction
- **Timeout handling**: Prevents hanging requests

### ⚙️ **Configurable & Robust**

- Command-line configuration options
- Environment variable support
- Comprehensive logging with multiple levels
- TypeScript implementation for type safety

## Integration with MCP Clients

### Claude Desktop

#### Option 1: NPX (Recommended)

```json
{
  "mcpServers": {
    "web-crawler": {
      "command": "npx",
      "args": [
        "mcp-basic-web-crawler",
        "--search-rate-limit", "25",
        "--fetch-rate-limit", "15",
        "--log-level", "info"
      ],
      "env": {
        "MCP_BASIC_WEB_CRAWLER_USER_AGENT": "Basic Web Crawler/1.0"
      }
    }
  }
}
```

#### Option 2: Global Installation

```json
{
  "mcpServers": {
    "web-crawler": {
      "command": "mcp-basic-web-crawler",
      "args": ["--log-level", "info"]
    }
  }
}
```

#### Option 3: Docker

```json
{
  "mcpServers": {
    "web-crawler": {
      "command": "docker",
      "args": [
        "run", "--rm", "-i",
        "--security-opt", "no-new-privileges:true",
        "--memory", "512m",
        "--cpus", "0.5",
        "-e", "MCP_BASIC_WEB_CRAWLER_USER_AGENT=Basic Web Crawler/1.0",
        "calmren/mcp-basic-web-crawler:latest",
        "--search-rate-limit", "25",
        "--fetch-rate-limit", "15",
        "--log-level", "info"
      ]
    }
  }
}
```

### Other MCP Clients

The server communicates via stdio and follows the MCP specification, so it can be integrated with any MCP-compatible client.

## MCP Tools

This server provides two main tools; a client-side sketch that uses both follows their descriptions below.

### 1. `web_search`

Search the web using DuckDuckGo.

**Parameters:**

- `query` (string): Search query
- `maxResults` (number, optional): Maximum results to return (default: 10)

**Example:**

```json
{
  "query": "artificial intelligence latest developments",
  "maxResults": 5
}
```

### 2. `fetch_content`

Extract content from web pages.

**Parameters:**

- `url` (string | string[]): Single URL or array of URLs to fetch

**Examples:**

```json
{
  "url": "https://example.com/article"
}
```

```json
{
  "url": [
    "https://example.com/article1",
    "https://example.com/article2"
  ]
}
```
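### Example: Calling Both Tools from a Client

The two tools compose naturally: search first, then fetch the pages behind the results. The sketch below uses the official MCP TypeScript SDK (`@modelcontextprotocol/sdk`) to spawn the server over stdio, the same way the configs above do. The client name, query, and URLs are placeholder values, and the exact shape of each result's `content` depends on the server's responses.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Spawn the server as a child process communicating over stdio.
const transport = new StdioClientTransport({
  command: "npx",
  args: ["mcp-basic-web-crawler", "--log-level", "info"],
});

const client = new Client({ name: "example-client", version: "1.0.0" });
await client.connect(transport);

// 1. Search the web via the `web_search` tool.
const search = await client.callTool({
  name: "web_search",
  arguments: { query: "artificial intelligence latest developments", maxResults: 5 },
});
console.log(search.content);

// 2. Fetch clean text from specific pages (single URL or an array).
const pages = await client.callTool({
  name: "fetch_content",
  arguments: { url: ["https://example.com/article1", "https://example.com/article2"] },
});
console.log(pages.content);

await client.close();
```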
## Installation

### Method 1: NPX (Recommended)

```bash
# No installation needed - run directly with npx
npx mcp-basic-web-crawler --help
```

### Method 2: Global Installation

```bash
npm install -g mcp-basic-web-crawler
```

### Method 3: Docker

```bash
# Pull the image
docker pull calmren/mcp-basic-web-crawler:latest

# Or build locally
git clone https://github.com/calmren/mcp-basic-web-crawler.git
cd mcp-basic-web-crawler
docker build -t mcp-basic-web-crawler .
```

### Method 4: From Source

```bash
git clone https://github.com/calmren/mcp-basic-web-crawler.git
cd mcp-basic-web-crawler
npm install
npm run build
```

## Usage

### NPX Usage (Recommended)

```bash
# Start the MCP server with npx
npx mcp-basic-web-crawler

# With custom configuration
npx mcp-basic-web-crawler --search-rate-limit 20 --log-level debug
```

### Docker Usage

```bash
# Basic usage
docker run -p 3000:3000 calmren/mcp-basic-web-crawler

# With custom configuration
docker run -p 3000:3000 calmren/mcp-basic-web-crawler \
  --search-rate-limit 20 --log-level debug

# With environment variables
docker run -p 3000:3000 \
  -e MCP_WEB_CRAWLER_LOG_LEVEL=debug \
  -e MCP_WEB_CRAWLER_USER_AGENT="MyApp/1.0" \
  calmren/mcp-basic-web-crawler
```

### Global Installation Usage

```bash
# If installed globally
mcp-basic-web-crawler --search-rate-limit 20 --log-level debug
```

### Configuration Options

| Option | Description | Default |
|--------|-------------|---------|
| `--search-rate-limit <number>` | Maximum search requests per minute | 30 |
| `--fetch-rate-limit <number>` | Maximum fetch requests per minute | 20 |
| `--max-content-length <number>` | Maximum content length to return | 8000 |
| `--timeout <number>` | Request timeout in milliseconds | 30000 |
| `--user-agent <string>` | Custom user agent string | Default MCP crawler UA |
| `--log-level <level>` | Log level (error, warn, info, debug) | info |
| `--help, -h` | Show help message | - |

### Environment Variables

| Variable | Description |
|----------|-------------|
| `MCP_WEB_CRAWLER_LOG_LEVEL` | Set log level |
| `MCP_WEB_CRAWLER_USER_AGENT` | Set custom user agent |

## License

This MCP server is licensed under the MIT License. This means you are free to use, modify, and distribute the software, subject to the terms and conditions of the MIT License. For more details, please see the LICENSE file in the project repository.

## Acknowledgments

- Built on the [Model Context Protocol](https://modelcontextprotocol.io/)
- Uses [DuckDuckGo](https://duckduckgo.com/) for search functionality
- Powered by [Cheerio](https://cheerio.js.org/) for HTML parsing
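## Appendix: Illustrative Content Extraction

The "Content Extraction" feature above removes scripts, styles, and navigation elements before returning clean text, and Cheerio handles the HTML parsing. As a rough illustration of how such a step can be built, here is a minimal sketch; the `extractText` helper and its element list are assumptions for illustration, not this package's actual code, though the 8000-character default mirrors the documented `--max-content-length` default.

```typescript
import * as cheerio from "cheerio";

// Illustrative sketch of Cheerio-based text extraction; not the
// package's actual implementation.
export function extractText(html: string, maxLength = 8000): string {
  const $ = cheerio.load(html);

  // Drop elements that rarely carry article content.
  $("script, style, nav, header, footer, noscript").remove();

  // Collapse whitespace into single spaces for clean text output.
  const text = $("body").text().replace(/\s+/g, " ").trim();

  // Cap the result, as a --max-content-length option might.
  return text.slice(0, maxLength);
}
```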