crawl-page-mcp-server
Version:
MCP server for crawling web pages and converting to markdown
422 lines (343 loc) • 7.33 kB
Markdown
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"format": "markdown"
}
}
```
**输出:**
```json
{
"url": "https://example.com",
"title": "Example Domain",
"description": "This domain is for use in illustrative examples",
"keywords": "",
"format": "markdown",
"content": "# Example Domain\n\nThis domain is for use in illustrative examples in documents...",
"contentLength": 234,
"timestamp": "2023-12-01T12:00:00.000Z"
}
```
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://news.ycombinator.com",
"format": "markdown",
"selector": ".storylink"
}
}
```
```json
{
"name": "extract_links",
"arguments": {
"url": "https://example.com",
"filterPattern": "https://.*\\.pdf$"
}
}
```
**输出:**
```json
{
"url": "https://example.com",
"totalLinks": 3,
"links": [
{
"href": "https://example.com/document1.pdf",
"text": "Document 1",
"title": "Download Document 1"
},
{
"href": "https://example.com/report.pdf",
"text": "Annual Report"
}
],
"timestamp": "2023-12-01T12:00:00.000Z"
}
```
抓取Python官方文档并转换为Markdown格式:
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://docs.python.org/3/tutorial/introduction.html",
"format": "markdown",
"selector": ".body"
}
}
```
**用途:**
- 离线文档存储
- 文档内容分析
- 知识库构建
从新闻网站提取文章主要内容:
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://news.example.com/article/123",
"format": "markdown",
"selector": "article .content",
"timeout": 15000
}
}
```
**用途:**
- 新闻聚合
- 内容分析
- 自动摘要生成
收集学术网站的PDF链接:
```json
{
"name": "extract_links",
"arguments": {
"url": "https://arxiv.org/list/cs.AI/recent",
"filterPattern": "https://arxiv\\.org/pdf/.*\\.pdf$"
}
}
```
**用途:**
- 学术资源收集
- 论文批量下载
- 研究资料整理
抓取技术博客的文章内容:
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://blog.example.com/post/machine-learning-basics",
"format": "markdown",
"selector": ".post-content",
"userAgent": "Mozilla/5.0 (compatible; BlogCrawler/1.0)"
}
}
```
**用途:**
- 技术文章收集
- 知识管理
- 内容备份
提取电商网站的产品信息:
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://shop.example.com/product/123",
"format": "text",
"selector": ".product-details"
}
}
```
**用途:**
- 价格监控
- 产品信息收集
- 竞品分析
对于需要JavaScript渲染的页面,可能需要特殊处理:
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://spa-example.com",
"format": "markdown",
"timeout": 20000,
"userAgent": "Mozilla/5.0 (compatible; ModernBrowser/1.0)"
}
}
```
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://private-site.com/content",
"format": "markdown",
"userAgent": "Mozilla/5.0 (authorized-client)"
}
}
```
提取多种类型的文件链接:
```json
{
"name": "extract_links",
"arguments": {
"url": "https://resources.example.com",
"filterPattern": ".*\\.(pdf|doc|docx|ppt|pptx)$"
}
}
```
处理包含相对链接的页面:
```json
{
"name": "extract_links",
"arguments": {
"url": "https://example.com/page/subpage",
"baseUrl": "https://example.com",
"filterPattern": ".*\\.html$"
}
}
```
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://slow-site.com",
"timeout": 30000,
"format": "markdown"
}
}
```
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://protected-site.com",
"userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
"format": "markdown"
}
}
```
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://example.com",
"selector": "article, .content, .main",
"format": "markdown"
}
}
```
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://heavy-site.com",
"format": "text",
"selector": "p, h1, h2, h3"
}
}
```
通过CSS选择器限制抓取范围:
```json
{
"name": "crawl_page",
"arguments": {
"url": "https://long-article.com",
"format": "markdown",
"selector": ".content p:nth-child(-n+10)"
}
}
```
```bash
node debug-server.js
crawl https://httpbin.org/html markdown
links https://httpbin.org/links/5
test
```
```bash
npm run test:crawl
node quick-debug-test.js
```
```json
{
"mcpServers": {
"crawl-page": {
"command": "node",
"args": ["./dist/index.js"],
"cwd": "/path/to/crawl-page-mcp-server"
}
}
}
```
```javascript
// 示例:使用MCP客户端调用
const client = new MCPClient();
await client.connect();
const result = await client.callTool('crawl_page', {
url: 'https://example.com',
format: 'markdown'
});
console.log(result.content);
```
```javascript
async function crawlWithRetry(url, maxRetries = 3) {
for (let i = 0; i < maxRetries; i++) {
try {
const result = await client.callTool('crawl_page', {
url,
timeout: 10000 + (i * 5000) // 递增超时时间
});
return result;
} catch (error) {
if (i === maxRetries - 1) throw error;
await new Promise(resolve => setTimeout(resolve, 1000 * (i + 1)));
}
}
}
```
```javascript
async function crawlMultiplePages(urls) {
const results = [];
for (const url of urls) {
try {
const result = await client.callTool('crawl_page', { url });
results.push(result);
// 添加延迟避免过于频繁的请求
await new Promise(resolve => setTimeout(resolve, 1000));
} catch (error) {
console.error(`Failed to crawl ${url}:`, error);
}
}
return results;
}
```
```javascript
function validateContent(result) {
const content = JSON.parse(result.content[0].text);
// 检查内容长度
if (content.contentLength < 100) {
console.warn('Content seems too short');
}
// 检查是否包含错误页面标识
if (content.title.includes('404') || content.title.includes('Error')) {
throw new Error('Page not found or error page');
}
return content;
}