paper-search-mcp-nodejs

A Node.js MCP server for searching and downloading academic papers from multiple sources, including arXiv, PubMed, bioRxiv, Web of Science, and more.

# AIPaper-assisant - Comprehensive Project Documentation

## 🎯 Project Overview

AIPaper-assisant is a sophisticated Node.js Model Context Protocol (MCP) server that provides unified access to 14+ academic paper databases and search platforms. It serves as a centralized interface for searching, discovering, and downloading academic papers from sources including arXiv, Web of Science, PubMed, Google Scholar, Sci-Hub, ScienceDirect, Springer, Wiley, Scopus, Crossref, and more.

### Key Value Propositions

- **Unified Interface**: Single API for multiple academic platforms
- **Intelligent Fallbacks**: Automatic platform switching for better results
- **Enterprise Features**: Rate limiting, error handling, and robust architecture
- **MCP Integration**: Native Claude Desktop integration for seamless research workflows

## 🏗️ Architecture Documentation

### System Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     MCP Protocol Layer                      │
├─────────────────────────────────────────────────────────────┤
│                   Server Core (server.ts)                   │
├─────────────────────────────────────────────────────────────┤
│  ┌──────────────┬──────────────┬──────────────┬──────────┐  │
│  │   Crossref   │    arXiv     │  Web of Sci  │  PubMed  │  │
│  │   Searcher   │   Searcher   │   Searcher   │ Searcher │  │
│  └──────────────┴──────────────┴──────────────┴──────────┘  │
│  ┌──────────────┬──────────────┬──────────────┬──────────┐  │
│  │Google Scholar│   Sci-Hub    │ScienceDirect │ Springer │  │
│  │   Searcher   │   Searcher   │   Searcher   │ Searcher │  │
│  └──────────────┴──────────────┴──────────────┴──────────┘  │
├─────────────────────────────────────────────────────────────┤
│                 Abstract Platform Framework                 │
│                      (PaperSource.ts)                       │
├─────────────────────────────────────────────────────────────┤
│                Unified Data Model (Paper.ts)                │
├─────────────────────────────────────────────────────────────┤
│                 Utility Layer (RateLimiter)                 │
└─────────────────────────────────────────────────────────────┘
```

### Core Components

#### 1. MCP Server Core (`src/server.ts`)

- **Purpose**: Main entry point implementing Model Context Protocol
- **Responsibilities**:
  - Tool registration and lifecycle management
  - Request routing and parameter validation
  - Error handling and response formatting
  - MCP protocol compliance

#### 2. Abstract Platform Framework (`src/platforms/PaperSource.ts`)

- **Purpose**: Base class defining contract for all search platforms
- **Key Features**:
  - Standardized search interface
  - Common HTTP client configuration
  - Error handling patterns
  - Capability discovery system
  - Rate limiting integration

#### 3. Unified Data Model (`src/models/Paper.ts`)

- **Purpose**: Consistent data structure across all platforms
- **Components**:
  - `Paper` interface: Standard paper metadata
  - `PaperFactory`: Validation and transformation
  - Type-safe schemas using Zod

#### 4. Rate Limiting System (`src/utils/RateLimiter.ts`)

- **Algorithm**: Token bucket implementation
- **Features**:
  - Per-platform rate configuration
  - Burst capacity management
  - Queue-based request handling
  - Automatic retry mechanisms

## 📋 Platform Integration Guide

### Platform Capabilities Matrix

| Platform | Search | Download | Full Text | Citations | API Key | Special Features |
|----------|--------|----------|-----------|-----------|---------|------------------|
| Crossref | ✅ | ❌ | ❌ | ✅ | ❌ | Default fallback, extensive metadata |
| arXiv | ✅ | ✅ | ✅ | ❌ | ❌ | Category filtering, PDF support |
| Web of Science | ✅ | ❌ | ❌ | ✅ | ✅ | Multi-topic search, date sorting |
| PubMed | ✅ | ❌ | ❌ | ❌ | 🟡 | Biomedical focus, MeSH terms |
| Google Scholar | ✅ | ❌ | ❌ | ✅ | ❌ | Comprehensive coverage |
| Sci-Hub | ✅ | ✅ | ✅ | ❌ | ❌ | Mirror management, DOI-based |
| ScienceDirect | ✅ | ❌ | ❌ | ✅ | ✅ | Elsevier content |
| Springer | ✅ | ✅* | ❌ | ❌ | ✅ | Dual API (Meta + OpenAccess) |
| Scopus | ✅ | ❌ | ❌ | ✅ | ✅ | Largest citation database |
| Wiley | ❌ | ✅ | ✅ | ❌ | ✅ | TDM API, DOI required |

### Integration Patterns

#### 1. Basic Platform Implementation

```typescript
export class PlatformSearcher extends PaperSource {
  constructor() {
    super('platform-name', {
      rateLimit: { tokens: 10, refillRate: 1 },
      capabilities: { search: true, download: false, citations: true }
    });
  }

  async search(query: string, options?: SearchOptions): Promise<Paper[]> {
    // Platform-specific implementation
  }
}
```

#### 2. Error Handling Pattern

```typescript
try {
  const response = await this.client.get(url);
  return this.transformResults(response.data);
} catch (error: any) {
  // Strict mode types catch variables as `unknown`, hence the explicit annotation
  if (error.response?.status === 429) {
    throw new RateLimitError('Platform rate limit exceeded');
  }
  // PlatformError takes the platform name as its second argument
  throw new PlatformError(`Search failed: ${error.message}`, 'platform-name');
}
```

#### 3. Rate Limiting Integration

```typescript
// Automatic rate limiting via base class
await this.rateLimiter.acquire();
const results = await this.performSearch(query);
return results;
```

## 🔧 API Documentation

### Core Tools

#### `search_papers`

Searches for academic papers across multiple platforms.

**Parameters:**

- `query` (string): Search query terms
- `platform` (string, optional): Specific platform to search
- `maxResults` (number, optional): Maximum results per platform (default: 10)
- `yearRange` (string, optional): Year range filter (e.g., "2020-2024")
- `sortBy` (string, optional): Sort criteria (relevance/date)

**Returns:**

```json
{
  "papers": [
    {
      "title": "Paper Title",
      "authors": ["Author 1", "Author 2"],
      "abstract": "Abstract text...",
      "doi": "10.1000/example",
      "year": 2024,
      "source": "arxiv",
      "pdfUrl": "https://...",
      "citations": 42
    }
  ],
  "metadata": {
    "platform": "arxiv",
    "totalResults": 150,
    "searchTime": 1.23
  }
}
```

#### `download_paper`

Downloads a paper PDF from supported platforms.

**Parameters:**

- `doi` (string): DOI identifier
- `title` (string, optional): Paper title for better matching
- `platform` (string, optional): Preferred download platform

**Returns:**

```json
{
  "success": true,
  "pdfUrl": "https://download.example.com/paper.pdf",
  "platform": "arxiv",
  "fileSize": 1024000
}
```

#### `get_platform_status`

Checks the health and capabilities of all platforms.

**Returns:**

```json
{
  "platforms": [
    {
      "name": "arxiv",
      "status": "online",
      "capabilities": {
        "search": true,
        "download": true,
        "citations": false
      },
      "rateLimit": {
        "tokens": 15,
        "refillRate": 1
      }
    }
  ]
}
```

### Platform-Specific Features

#### Web of Science Integration

- Multi-topic search with boolean operators
- Advanced date filtering and sorting
- Citation network analysis
- Research area categorization

#### arXiv Integration

- Category-specific filtering (cs.AI, physics.gen-ph, etc.)
- Date-based sorting and filtering
- Direct PDF download support
- Preprint version tracking

#### Sci-Hub Integration

- Mirror health monitoring
- Automatic mirror failover
- DOI-based paper location
- Legal compliance considerations

## 🚀 Development Workflow

### Setup Instructions

1. **Clone and Install**
   ```bash
   git clone <repository>
   cd paper-search-nodejs
   npm install
   ```

2. **Configuration**
   - Copy `.env.example` to `.env`
   - Add API keys for platforms requiring authentication
   - Configure rate limits as needed

3. **Development Mode**
   ```bash
   npm run dev  # Runs with tsx for hot reload
   ```

4. **Build for Production**
   ```bash
   npm run build  # Compiles TypeScript to dist/
   npm start      # Runs compiled JavaScript
   ```

### Testing Strategy

#### Test Categories

1. **Unit Tests**: Individual platform functionality
2. **Integration Tests**: Cross-platform compatibility
3. **Error Handling**: Invalid inputs and edge cases
4. **Rate Limiting**: Throttling and retry behavior
5. **MCP Protocol**: Tool registration and communication

#### Running Tests

```bash
npm test               # Run all tests
npm run test:watch     # Watch mode for development
npm run test:coverage  # Generate coverage report
```

### Code Quality Standards

#### TypeScript Configuration

- Strict mode enabled
- ES2022 target with ESNext modules
- Source maps for debugging
- Explicit return types required

#### Linting and Formatting

```bash
npm run lint        # ESLint with TypeScript support
npm run format      # Prettier formatting
npm run type-check  # TypeScript compiler checks
```

## 🔐 Security Considerations

### API Key Management

- Store keys in environment variables
- Never commit keys to version control
- Validate keys on platform initialization
- Rotate keys regularly

### Rate Limiting

- Respect platform rate limits
- Implement exponential backoff
- Queue requests during high load
- Monitor for abuse patterns

### Error Sanitization

- Remove sensitive data from errors
- Log sanitized error messages only
- Never expose internal implementation details
- Use generic error messages for users

## 📊 Performance Optimization

### Concurrent Operations

- Parallel search execution where possible
- Non-blocking I/O for all network requests
- Efficient memory usage with streaming

### Caching Strategy

- Platform capability caching
- Mirror health status (Sci-Hub)
- Request deduplication
- Configurable cache TTL

### Network Efficiency

- Connection pooling with keep-alive
- Request timeout configuration
- Retry with exponential backoff
- Compression support (gzip)

## 🎯 Best Practices

### Adding New Platforms

1. Extend `PaperSource` base class
2. Implement required methods (`search`, `download`)
3. Define platform capabilities
4. Add rate limiting configuration
5. Create comprehensive tests
6. Update documentation

### Error Handling Guidelines

```typescript
// Platform-specific errors
class PlatformError extends Error {
  constructor(message: string, public platform: string) {
    super(message);
    this.name = 'PlatformError';
  }
}

// Rate limit errors
class RateLimitError extends Error {
  constructor(message: string, public retryAfter?: number) {
    super(message);
    this.name = 'RateLimitError';
  }
}
```

### Logging Standards

```typescript
// Use structured logging
logger.info('Platform search completed', {
  platform: 'arxiv',
  query: 'machine learning',
  results: 10,
  duration: 1234
});

// Log errors with context
logger.error('Search failed', {
  platform: 'wos',
  error: error.message,
  query: options.query,
  retryCount
});
```

## 📚 Additional Resources

### Related Documentation

- [MCP Protocol Specification](https://modelcontextprotocol.io/)
- [Platform API Documentation](./PLATFORM_APIS.md)
- [Contributing Guidelines](./CONTRIBUTING.md)
- [Changelog](./CHANGELOG.md)

### External References

- [Crossref API](https://api.crossref.org/)
- [arXiv API](https://arxiv.org/help/api/)
- [Web of Science API](https://developer.clarivate.com/)
- [PubMed API](https://www.ncbi.nlm.nih.gov/home/develop/)

---

*This documentation is automatically maintained. For updates, please submit pull requests with detailed descriptions of changes.*
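
The Rate Limiting System section names a token bucket algorithm with burst capacity but does not show it. The following is a minimal sketch of that general technique, not the project's actual `RateLimiter`: the `TokenBucket` class, its option names (`capacity`, `refillPerSecond`, `now`), and its methods are all hypothetical, and it leaves out the queueing and retry features the section lists.

```typescript
// Hypothetical sketch of a token bucket; names are illustrative, not the project's API.
interface TokenBucketOptions {
  capacity: number;        // maximum burst size (tokens the bucket can hold)
  refillPerSecond: number; // tokens credited per second
  now?: () => number;      // clock in milliseconds; injectable so tests can control time
}

class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number;

  constructor(options: TokenBucketOptions) {
    if (options.capacity <= 0 || options.refillPerSecond <= 0) {
      throw new RangeError('capacity and refillPerSecond must be positive');
    }
    this.capacity = options.capacity;
    this.refillPerSecond = options.refillPerSecond;
    this.now = options.now ?? (() => Date.now());
    this.tokens = options.capacity; // start full so an initial burst is allowed
    this.lastRefill = this.now();
  }

  // Credit tokens for the time elapsed since the last refill, capped at capacity.
  private refill(): void {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = current;
  }

  // Take one token if available; returns false instead of waiting.
  tryAcquire(): boolean {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  // Milliseconds until at least one token is available (0 if one is ready now).
  msUntilAvailable(): number {
    this.refill();
    if (this.tokens >= 1) {
      return 0;
    }
    return Math.ceil(((1 - this.tokens) / this.refillPerSecond) * 1000);
  }

  // Wait until a token is available, then consume it.
  async acquire(): Promise<void> {
    while (!this.tryAcquire()) {
      const waitMs = Math.max(this.msUntilAvailable(), 1);
      await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```

Under this sketch's naming, a searcher would call `await bucket.acquire()` before each outbound request. A setting such as `{ tokens: 10, refillRate: 1 }` from the platform example would plausibly map to `capacity: 10` and `refillPerSecond: 1`, although the units of `refillRate` are not stated in this document. The injectable clock is a design choice that lets tests advance time without real delays.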