@futurelab-studio/latest-science-mcp
Version:
MCP Server for Scientific Paper Harvesting from arXiv and OpenAlex
430 lines (327 loc) โข 14.4 kB
Markdown
# Scientific Paper Harvester MCP Server
A comprehensive Model Context Protocol (MCP) server that provides LLMs with real-time access to scientific papers from **6 major academic sources**: arXiv, OpenAlex, PMC (PubMed Central), Europe PMC, bioRxiv/medRxiv, and CORE.
## ๐ Features
### **Comprehensive Source Coverage**
- **arXiv**: Computer science, physics, mathematics preprints and papers
- **OpenAlex**: Open catalog of scholarly papers with citation data
- **PMC**: PubMed Central biomedical and life science literature
- **Europe PMC**: European life science literature database
- **bioRxiv/medRxiv**: Biology and medical preprint servers
- **CORE**: World's largest collection of open access research papers
### **Advanced Capabilities**
- **Paper Fetching**: Get latest papers from any source by category/concept
- **Paper Search**: Search papers by title, abstract, author, or full-text across 4 major sources
- **Full-Text Extraction**: Extract complete text content with intelligent fallback strategies
- **Citation Analysis**: Find top cited papers from OpenAlex since a specific date
- **Paper Lookup**: Retrieve full metadata for specific papers by ID
- **Category Discovery**: Browse available categories from all sources
- **Smart Rate Limiting**: Respectful API usage with per-source rate limiting
- **DOI Resolution**: Advanced DOI resolver with Unpaywall โ Crossref โ Semantic Scholar fallback
- **Dual Interface**: Both MCP protocol and CLI access
- **TypeScript**: Full type safety with ESM modules
## ๐ Coverage Statistics
- **Total Sources**: 6 academic databases
- **Category Coverage**: 100+ categories across all disciplines
- **Paper Access**: 200M+ papers with intelligent text extraction
- **Text Extraction Success**: >90% for supported paper types
- **Response Time**: <15 seconds average for paper fetching
## ๐ Installation
```bash
npm install
npm run build
```
## ๐ MCP Client Configuration
To use this server with an MCP client (like Claude Desktop), add the following to your MCP client configuration:
### For published package (available on npm):
**Option 1: Using npx (recommended for AI tools like Claude)**
```json
{
"mcpServers": {
"scientific-papers": {
"command": "npx",
"args": [
"-y",
"@futurelab-studio/latest-science-mcp@latest"
]
}
}
}
```
**Option 2: Global installation**
```bash
npm install -g @futurelab-studio/latest-science-mcp
```
Then configure:
```json
{
"mcpServers": {
"scientific-papers": {
"command": "latest-science-mcp"
}
}
}
```
## ๐ Usage
### CLI Interface
#### List Categories
```bash
# List arXiv categories
node dist/cli.js list-categories --source=arxiv
# List OpenAlex concepts
node dist/cli.js list-categories --source=openalex
# List PMC biomedical categories
node dist/cli.js list-categories --source=pmc
# List Europe PMC life science categories
node dist/cli.js list-categories --source=europepmc
# List bioRxiv/medRxiv categories (includes both servers)
node dist/cli.js list-categories --source=biorxiv
# List CORE academic categories
node dist/cli.js list-categories --source=core
```
#### Fetch Latest Papers
```bash
# Get latest AI papers from arXiv
node dist/cli.js fetch-latest --source=arxiv --category=cs.AI --count=10
# Get latest biology papers from bioRxiv
node dist/cli.js fetch-latest --source=biorxiv --category="biorxiv:biology" --count=5
# Get latest immunology papers from PMC
node dist/cli.js fetch-latest --source=pmc --category=immunology --count=3
# Get latest papers from CORE by subject
node dist/cli.js fetch-latest --source=core --category=computer_science --count=5
# Search by concept name (OpenAlex)
node dist/cli.js fetch-latest --source=openalex --category="machine learning" --count=3
```
#### Fetch Top Cited Papers
```bash
# Get top 20 cited papers in machine learning since 2024
node dist/cli.js fetch-top-cited --concept="machine learning" --since=2024-01-01 --count=20
# Get top cited papers by concept ID
node dist/cli.js fetch-top-cited --concept=C41008148 --since=2023-06-01 --count=10
```
#### Search Papers
```bash
# Search by keywords across all fields
node dist/cli.js search-papers --source=arxiv --query="machine learning" --count=10
# Search by paper title
node dist/cli.js search-papers --source=openalex --query="neural networks" --field=title --count=5
# Search by author name
node dist/cli.js search-papers --source=europepmc --query="John Smith" --field=author --count=10
# Search full-text content sorted by citations
node dist/cli.js search-papers --source=core --query="climate change" --field=fulltext --sortBy=citations --count=20
```
#### Fetch Specific Paper Content
```bash
# Get arXiv paper by ID
node dist/cli.js fetch-content --source=arxiv --id=2401.12345
# Get bioRxiv paper by DOI
node dist/cli.js fetch-content --source=biorxiv --id="10.1101/2021.01.01.425001"
# Get PMC paper by ID
node dist/cli.js fetch-content --source=pmc --id=PMC8245678
# Get CORE paper by ID
node dist/cli.js fetch-content --source=core --id=12345678
# Show text content with preview
node dist/cli.js fetch-content --source=arxiv --id=2401.12345 --show-text --text-preview=500
```
## ๐ง Available Tools
### `list_categories`
Lists available categories/concepts from any data source.
**Parameters:**
- `source`: `"arxiv"` | `"openalex"` | `"pmc"` | `"europepmc"` | `"biorxiv"` | `"core"`
**Returns:**
- Array of category objects with `id`, `name`, and optional `description`
**Examples:**
```json
{
"name": "list_categories",
"arguments": {
"source": "biorxiv"
}
}
```
### `fetch_latest`
Fetches the latest papers from any source for a given category with **metadata only** (no text extraction).
**Parameters:**
- `source`: `"arxiv"` | `"openalex"` | `"pmc"` | `"europepmc"` | `"biorxiv"` | `"core"`
- `category`: Category ID or concept name (varies by source)
- `count`: Number of papers to fetch (default: 50, max: 200)
**Category Examples by Source:**
- **arXiv**: `"cs.AI"`, `"physics.gen-ph"`, `"math.CO"`
- **OpenAlex**: `"artificial intelligence"`, `"machine learning"`, `"C41008148"`
- **PMC**: `"immunology"`, `"genetics"`, `"neuroscience"`
- **Europe PMC**: `"biology"`, `"medicine"`, `"cancer"`
- **bioRxiv/medRxiv**: `"biorxiv:neuroscience"`, `"medrxiv:psychiatry"`
- **CORE**: `"computer_science"`, `"mathematics"`, `"physics"`
**Returns:**
- Array of paper objects with metadata (id, title, authors, date, pdf_url)
- **Text field**: Empty string (`text: ""`) - use `fetch_content` for full text
### `fetch_top_cited`
Fetches the top cited papers from OpenAlex for a given concept since a specific date.
**Parameters:**
- `concept`: Concept name or OpenAlex concept ID
- `since`: Start date in YYYY-MM-DD format
- `count`: Number of papers to fetch (default: 50, max: 200)
### `search_papers`
Searches for papers across multiple academic sources with field-specific search and sorting options.
**Parameters:**
- `source`: `"arxiv"` | `"openalex"` | `"europepmc"` | `"core"`
- `query`: Search query string (max 1500 characters)
- `field`: `"all"` | `"title"` | `"abstract"` | `"author"` | `"fulltext"` (default: "all")
- `count`: Number of results to return (default: 50, max: 200)
- `sortBy`: `"relevance"` | `"date"` | `"citations"` (default: "relevance")
**Search Capabilities by Source:**
- **arXiv**: Title, abstract, author, and general search with Boolean operators
- **OpenAlex**: Advanced search with relevance scoring and citation sorting
- **Europe PMC**: Biomedical literature with MeSH terms and full-text search
- **CORE**: Global academic papers with advanced query language
**Example Queries:**
- Keywords: `"machine learning"`, `"climate change"`
- Phrases: `"artificial intelligence"` (use quotes for exact phrases)
- Boolean: `"deep learning AND neural networks"` (arXiv supports this)
- Authors: `"John Smith"`, `"Smith J"`
**Returns:**
- Array of paper objects with metadata (id, title, authors, date, pdf_url)
- **Text field**: Empty string (`text: ""`) - use `fetch_content` for full text
### `fetch_content`
Fetches full metadata and text content for a specific paper by ID with **complete text extraction**.
**Parameters:**
- `source`: Any of the 6 supported sources
- `id`: Paper ID (format varies by source)
**ID Formats by Source:**
- **arXiv**: `"2401.12345"`, `"cs/0601001"`, `"1234.5678v2"`
- **OpenAlex**: `"W2741809807"` or numeric `2741809807`
- **PMC**: `"PMC8245678"` or `"12345678"`
- **Europe PMC**: `"PMC8245678"`, `"12345678"`, or DOI
- **bioRxiv/medRxiv**: `"10.1101/2021.01.01.425001"` or `"2021.01.01.425001"`
- **CORE**: Numeric ID like `"12345678"`
## ๐ Paper Metadata Format
All tools return paper objects with the following structure:
```typescript
{
id: string; // Paper ID
title: string; // Paper title
authors: string[]; // List of author names
date: string; // Publication date (ISO format)
pdf_url?: string; // PDF URL (if available)
text: string; // Extracted full text content
textTruncated?: boolean; // Warning: text was truncated due to size limits
textExtractionFailed?: boolean; // Warning: text extraction failed
}
```
## ๐ง Advanced Text Extraction
### Multi-Source Strategy
Each source has specialized text extraction approaches:
- **arXiv**: HTML from `arxiv.org/html` with `ar5iv.labs.arxiv.org` fallback
- **OpenAlex**: HTML sources with DOI resolver fallback chain
- **PMC**: E-utilities API with XML/HTML extraction
- **Europe PMC**: REST API with multiple URL strategies
- **bioRxiv/medRxiv**: Direct HTML extraction with abstract fallback
- **CORE**: PDF/HTML with source URL fallback
### DOI Resolution Chain
Advanced DOI resolver with multiple fallback strategies:
1. **Unpaywall** โ Free full-text sources
2. **Crossref** โ Publisher metadata and links
3. **Semantic Scholar Academic Graph** โ Alternative access
### Performance & Reliability
- **Text Extraction Success**: >90% for HTML-available papers
- **Graceful Degradation**: Always returns metadata even if text extraction fails
- **Size Management**: 6MB text limit with intelligent truncation
- **Caching**: 24-hour LRU cache for DOI resolution
## ๐ Rate Limiting
Respectful API usage with per-source rate limiting:
- **arXiv**: 5 requests per minute
- **OpenAlex**: 10 requests per minute
- **PMC**: 3 requests per second
- **Europe PMC**: 10 requests per minute
- **bioRxiv/medRxiv**: 5 requests per minute
- **CORE**: 10 requests per minute (public), higher with API key
### CORE API Configuration
For enhanced CORE access, set environment variable:
```bash
export CORE_API_KEY="your-api-key"
```
## ๐งช Testing
### Run Test Suite
```bash
# Run all tests
npm test
# Run integration tests
npm run test -- tests/integration
# Run end-to-end workflow tests
npm run test -- tests/e2e
# Run performance benchmarks
npm run test -- tests/integration/performance.test.ts
```
### Test Coverage
- **Integration Tests**: All 6 sources tested end-to-end
- **Performance Tests**: Response time and throughput benchmarks
- **Workflow Tests**: Real research scenarios across multiple sources
- **Unit Tests**: Core components and edge cases
## ๐ Architecture
### **Modular Driver System**
- Clean separation between sources
- Consistent interface across all drivers
- Specialized text extraction per source
### **Advanced Features**
- **DOI Resolution**: Multi-provider fallback chain
- **Rate Limiting**: Token bucket algorithm per source
- **Text Processing**: HTML cleaning and normalization
- **Error Handling**: Structured responses with actionable suggestions
- **Caching**: Intelligent caching for DOI resolution
### **Technology Stack**
- **TypeScript + ESM**: Modern JavaScript with full type safety
- **Modular Design**: Clean separation of concerns
- **Graceful Degradation**: Always functional even with partial failures
- **Response Size Management**: Automatic truncation and warnings
## ๐ Source Comparison
| Source | Papers | Disciplines | Full-Text | Citation Data | Preprints | Search |
|--------|--------|-------------|-----------|---------------|-----------|--------|
| arXiv | 2.3M+ | STEM | HTML โ | Limited | โ | โโโ |
| OpenAlex | 200M+ | All | Variable | โโโ | โ | โโโ |
| PMC | 7M+ | Biomedical | XML/HTML โ | Limited | โ | Limited |
| Europe PMC | 40M+ | Life Sciences | HTML โ | Limited | โ | โโโ |
| bioRxiv/medRxiv | 500K+ | Bio/Medical | HTML โ | Limited | โโโ | Limited |
| CORE | 200M+ | All | PDF/HTML โ | Limited | โ | โโโ |
## ๐ง Development
### Build
```bash
npm run build
```
### Test Individual Sources
```bash
# Test specific sources
node dist/cli.js list-categories --source=arxiv
node dist/cli.js fetch-latest --source=biorxiv --category="biorxiv:biology" --count=3
node dist/cli.js fetch-content --source=core --id=12345678
# Test search functionality
node dist/cli.js search-papers --source=arxiv --query="artificial intelligence" --count=5
node dist/cli.js search-papers --source=openalex --query="quantum computing" --field=title --count=3
```
### Performance Testing
```bash
# Run performance benchmarks
npm run test -- tests/integration/performance.test.ts
# Test memory usage
npm run test -- --reporter=verbose
```
## ๐จ Error Handling
Comprehensive error handling for all sources:
- Invalid paper IDs with format suggestions
- Rate limiting with retry-after information
- API timeouts and server errors
- Missing authentication (CORE API key)
- Network connectivity issues
- Text extraction failures with fallback strategies
## ๐ Troubleshooting
### Common Issues
- **Rate limiting**: Automatic retry with exponential backoff
- **Missing papers**: Try alternative sources for the same content
- **Text extraction failures**: Fallback to abstract or metadata
- **CORE API limits**: Set `CORE_API_KEY` environment variable
### Performance Optimization
- Use appropriate `count` parameters (smaller for faster responses)
- Cache results when possible
- Use `fetch_latest` for discovery, `fetch_content` for detailed reading
## ๐ License
MIT
**Ready to explore the world's scientific knowledge? Start with any of the 6 sources and discover papers across all academic disciplines!** ๐ฌ๐