# 🌐 CleanWeb MCP

<div align="center">

[![npm version](https://badge.fury.io/js/cleanweb-mcp.svg)](https://www.npmjs.com/package/cleanweb-mcp)
[![GitHub stars](https://img.shields.io/github/stars/guangxiangdebizi/cleanweb-mcp.svg)](https://github.com/guangxiangdebizi/cleanweb-mcp)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**A lightweight Model Context Protocol (MCP) server**

Specialized in intelligently extracting core web content, automatically filtering ads and irrelevant elements, and converting to clean Markdown format

[🚀 Quick Start](#-quick-start) • [📖 Documentation](#-usage) • [🔧 Configuration](#-claude-configuration) • [🤝 Contributing](#-contributing)

</div>

## ✨ Features

<div align="center">

| 🌐 Smart Extraction | 🧹 Content Cleaning | 📝 Format Conversion | ⚡ Lightweight Deploy |
|:---:|:---:|:---:|:---:|
| Axios + Cheerio + Readability | Auto-filter ads & distractions | HTML → Markdown | Zero browser dependency |

</div>

### 🎯 Core Advantages

- 🌐 **Smart Content Extraction**: Uses Axios + Cheerio + Readability algorithm to extract main web content
- 🧹 **Intelligent Content Cleaning**: Automatically removes ads, navigation, sidebars and other distracting elements
- 📝 **Markdown Conversion**: Converts HTML content to clean Markdown format
- 🖼️ **Image Link Optimization**: Automatically handles overly long image links for better readability
- ⚡ **Lightweight Deployment**: No browser dependencies, simple and fast deployment
- 🔧 **Multiple Output Formats**: Supports pure Markdown or JSON format with metadata
- 🚀 **MCP Protocol**: Fully compatible with Model Context Protocol standard

### 🛠️ Tech Stack

<div align="center">

![TypeScript](https://img.shields.io/badge/TypeScript-007ACC?style=for-the-badge&logo=typescript&logoColor=white)
![Node.js](https://img.shields.io/badge/Node.js-43853D?style=for-the-badge&logo=node.js&logoColor=white)
![Axios](https://img.shields.io/badge/Axios-5A29E4?style=for-the-badge&logo=axios&logoColor=white)
![Cheerio](https://img.shields.io/badge/Cheerio-E34F26?style=for-the-badge&logo=html5&logoColor=white)

</div>

## 🚀 Quick Start

### 📦 Installation

```bash
# Install from npm
npm install cleanweb-mcp

# Or clone the repository
git clone https://github.com/guangxiangdebizi/cleanweb-mcp.git
cd cleanweb-mcp
npm install
```

> **💡 Advantage**: Uses lightweight HTTP client, no browser download required, simpler deployment! Focused on content cleaning and optimization.

## 🔧 Build Project

```bash
npm run build
```

## 🎯 Usage

### 1. Stdio Mode (Local Development)

```bash
npm run mcp:stdio
```

### 2. SSE Mode (via Supergateway)

```bash
npm run mcp:sse
```

Server will start at `http://localhost:3100/sse`

### 3. WebSocket Mode

```bash
npm run mcp:ws
```

### 4. Development Mode (Watch file changes)

```bash
npm run mcp:dev
```

## 🛠️ Claude Configuration

### Stdio Mode Configuration

Add to Claude's configuration file:

```json
{
  "mcpServers": {
    "cleanweb-mcp": {
      "command": "node",
      "args": ["path/to/your/project/build/index.js"]
    }
  }
}
```

### SSE Mode Configuration

```json
{
  "mcpServers": {
    "cleanweb-mcp-sse": {
      "type": "sse",
      "url": "http://localhost:3100/sse",
      "timeout": 600
    }
  }
}
```

## 🔨 API Reference

### `extract_web_content`

Intelligently extract web content and convert to Markdown format.

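
For readers wiring the tool into their own client: in MCP, a tool is invoked with a JSON-RPC 2.0 `tools/call` request. The sketch below shows roughly what such a request for `extract_web_content` could look like. The helper name `buildExtractRequest` and the TypeScript types are hypothetical and are not part of this package. The defaults mirror the values listed under Parameters in this section, and the `http`/`https` check is an assumption of this sketch rather than documented server behavior.

```typescript
// Hypothetical types for illustration; only the tool name and the three
// argument names (url, format, timeout) come from this README.
type OutputFormat = "markdown" | "json";

interface ExtractArgs {
  url: string;
  format?: OutputFormat;
  timeout?: number; // milliseconds
}

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: {
    name: "extract_web_content";
    arguments: Required<ExtractArgs>;
  };
}

// Build a JSON-RPC 2.0 "tools/call" message, filling in the documented defaults.
function buildExtractRequest(id: number, args: ExtractArgs): ToolCallRequest {
  // Assumption of this sketch: only http(s) URLs are worth sending.
  if (!/^https?:\/\//i.test(args.url)) {
    throw new Error(`Expected an http(s) URL, got: ${args.url}`);
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: {
      name: "extract_web_content",
      arguments: {
        url: args.url,
        format: args.format ?? "markdown", // documented default
        timeout: args.timeout ?? 30000, // documented default (ms)
      },
    },
  };
}

// Usage: build the request and print the JSON that would go over the transport.
const request = buildExtractRequest(1, { url: "https://example.com/article" });
console.log(JSON.stringify(request, null, 2));
```

In practice an MCP client library or an assistant such as Claude constructs this message for you. The sketch only makes the shape of the call explicit.
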
#### Parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `url` | string | ✅ | - | The web URL to extract content from |
| `format` | string | ❌ | `markdown` | Return format: `markdown` or `json` |
| `timeout` | number | ❌ | `30000` | Page loading timeout (milliseconds) |

#### Usage Examples

```javascript
// Basic usage
extract_web_content({
  url: "https://example.com/article"
})

// Advanced usage
extract_web_content({
  url: "https://example.com/article",
  format: "json",
  timeout: 60000
})
```

## 📁 Project Structure

```
cleanweb-mcp/
├── 📄 README.md                     # Project documentation
├── 📦 package.json                  # Project configuration
├── ⚙️ tsconfig.json                 # TypeScript configuration
├── 🔧 claude-config-example.json    # Claude configuration example
├── 📖 example-usage.md              # Usage examples
├── 🏗️ build/                        # Compiled output
│   ├── index.js
│   └── tools/
│       └── web-content-extractor.js
└── 📁 src/                          # Source code
    ├── index.ts                     # MCP server main entry
    └── tools/
        └── web-content-extractor.ts # Web content extraction tool
```

## 🔄 Migration from Express Server

The original Express server (`server.js`) can still run independently:

```bash
npm start
```

The MCP version provides the same core functionality but integrates with AI assistants through the MCP protocol.

## 🚨 Important Notes

1. **Lightweight Implementation**: Uses an HTTP client to fetch static content, so no browser dependencies are required
2. **Network Access**: Requires network access to the target websites
3. **Static Content**: Primarily suited to static HTML content; dynamically rendered content may not be accessible
4. **Timeout Settings**: For slow-loading websites, increase the `timeout` parameter
5. **Content Optimization**: Automatically optimizes how image links are displayed for better readability

## 🤝 Contributing

Issues and pull requests are welcome! If you have any questions or suggestions, feel free to contact me.

## 📞 Contact

- **GitHub**: [guangxiangdebizi](https://github.com/guangxiangdebizi/)
- **Email**: guangxiangdebizi@gmail.com
- **LinkedIn**: [Xingyu Chen](https://www.linkedin.com/in/xingyu-chen-b5b3b0313/)
- **NPM**: [@xingyuchen](https://www.npmjs.com/~xingyuchen)

## 🔗 Related Links

- **GitHub Repository**: [https://github.com/guangxiangdebizi/cleanweb-mcp](https://github.com/guangxiangdebizi/cleanweb-mcp)
- **NPM Package**: [https://www.npmjs.com/package/cleanweb-mcp](https://www.npmjs.com/package/cleanweb-mcp)

## 📄 License

MIT License. See the [LICENSE](LICENSE) file for details.