paper-search-mcp-nodejs

A Node.js MCP server for searching and downloading academic papers from multiple sources, including arXiv, PubMed, bioRxiv, Web of Science, and more.

# Platform API Documentation - Paper Search MCP

This document provides comprehensive API documentation for all academic paper search platform integrations in the Paper Search MCP project. Each platform has unique authentication requirements, API endpoints, rate limits, and data formats.

## Table of Contents

1. [Platform Overview](#platform-overview)
2. [arXiv](#arxiv)
3. [Web of Science](#web-of-science)
4. [Crossref](#crossref)
5. [ScienceDirect (Elsevier)](#sciencedirect-elsevier)
6. [Scopus (Elsevier)](#scopus-elsevier)
7. [Springer Nature](#springer-nature)
8. [Wiley TDM](#wiley-tdm)
9. [PubMed](#pubmed)
10. [Google Scholar](#google-scholar)
11. [Semantic Scholar](#semantic-scholar)
12. [Sci-Hub](#sci-hub)
13. [bioRxiv/medRxiv](#biorxivmedrxiv)
14. [IACR ePrint](#iacr-eprint)
15. [Common Patterns](#common-patterns)
16. [Error Handling](#error-handling)

## Platform Overview

| Platform | API Key Required | Search | Download | Full Text | Citations | Rate Limit |
|----------|------------------|--------|----------|-----------|-----------|------------|
| arXiv | ❌ | ✅ | ✅ | ✅ | ❌ | 30s timeout |
| Web of Science | ✅ | ✅ | ❌ | ❌ | ✅ | 100 req/5min |
| Crossref | ❌ | ✅ | ❌ | ❌ | ✅ | Polite pool |
| ScienceDirect | ✅ | ✅ | ❌ | ❌ | ✅ | 10 req/s |
| Scopus | ✅ | ✅ | ❌ | ❌ | ✅ | 10 req/s |
| Springer | ✅ | ✅ | ✅ | ❌ | ✅ | 3 req/min |
| Wiley TDM | ✅ | ❌ | ✅ | ✅ | ❌ | 6 req/min |
| PubMed | ❌ | ✅ | ❌ | ❌ | ❌ | 3/10 req/s |
| Google Scholar | ❌ | ✅ | ❌ | ❌ | ✅ | Anti-bot |
| Semantic Scholar | ❌ | ✅ | ✅* | ❌ | ✅ | 100 req/5min |
| Sci-Hub | ❌ | ✅** | ✅ | ❌ | ❌ | Manual |
| bioRxiv/medRxiv | ❌ | ✅ | ✅ | ✅ | ❌ | No limit |
| IACR ePrint | ❌ | ✅ | ✅ | ✅ | ❌ | No limit |

\* Only open access papers

\** Only via DOI/URL lookup

---

## Crossref

**Base URL:** `https://api.crossref.org/works`

**Documentation:** https://api.crossref.org/

**API Type:** REST JSON API

**Authentication:** Optional (mailto parameter recommended)

### Capabilities

- ✅ Search papers
- ❌ No PDF download
- ❌ No
full text
- ✅ Citation statistics (via OpenCitations)
- ❌ No API key required

### API Endpoints

#### Search Works

```
GET /works
```

**Parameters:**

- `query` - Search query
- `rows` - Results per request (max: 1000)
- `offset` - Pagination offset
- `filter` - Filters (comma-separated)
- `sort` - Sort field
- `order` - Sort order: `asc`, `desc`
- `mailto` - Email for polite pool

**Available Filters:**

```
from-pub-date:2023-01-01
until-pub-date:2023-12-31
from-created-date:2023-01-01
type:journal-article
has-abstract:true
has-license:true
is-update:true
prefix:10.1038
member:78
funder-name:NIH
award.number:12345
```

**Sort Options:**

- `relevance` - Relevance score
- `published` - Publication date
- `created` - Record creation date
- `deposited` - Deposit date
- `indexed` - Index date
- `is-referenced-by-count` - Citation count

**Response Structure:**

```json
{
  "status": "ok",
  "message-type": "work-list",
  "message-version": "1.0.0",
  "message": {
    "items": [{...}],
    "items-per-page": 20,
    "query": {
      "search-terms": null,
      "start-index": 0
    },
    "total-results": 1234
  }
}
```

**Work Item Fields:**

- `DOI` - Digital Object Identifier
- `title` - Article title
- `author` - Author list with names
- `abstract` - Abstract (when available)
- `container-title` - Journal/publisher
- `published` - Publication date parts
- `is-referenced-by-count` - Citation count
- `type` - Publication type
- `URL` - DOI resolver URL
- `subject` - Keywords/subjects

#### Get Work by DOI

```
GET /works/{doi}
```

#### Get Citations (via OpenCitations)

```
GET https://opencitations.net/index/coci/api/v1/citations/{doi}
```

**Rate Limiting:**

- No hard limits for public API
- Polite pool: include `mailto` parameter
- Recommended: 50 requests per second max

**Special Features:**

- Largest DOI metadata collection
- RESTful DOI resolution
- Citation data via OpenCitations
- Funder information
- License information

---

## ScienceDirect (Elsevier)

**Base URL:** `https://api.elsevier.com`

**Documentation:**
https://dev.elsevier.com/

**API Type:** REST JSON API

**Authentication:** API Key (X-ELS-APIKey header)

### Capabilities

- ✅ Search papers
- ❌ No PDF download (institutional access required)
- ❌ No full text
- ✅ Citation statistics
- ✅ API key required

### API Endpoints

#### Search ScienceDirect

```
PUT /content/search/sciencedirect
```

**Request Body:**

```json
{
  "qs": "machine learning",
  "date": "2023",
  "authors": "Smith",
  "display": {
    "offset": 0,
    "show": 25,
    "sortBy": "relevance"
  }
}
```

**Parameters:**

- `qs` - Query string
- `date` - Year or year range
- `authors` - Author name filter
- `display.offset` - Start position
- `display.show` - Results per page (max: 100)
- `display.sortBy` - Sort: `relevance`, `date`

#### Get Article Details

```
GET /content/article/pii/{pii}
GET /content/article/doi/{doi}
```

**Query Parameters:**

- `view` - Detail level: `META`, `META_ABS`, `FULL`

**Response Fields:**

- `dc:identifier` - Article identifier
- `dc:title` - Title
- `dc:creator` - Authors
- `dc:description` - Abstract
- `prism:publicationName` - Journal
- `prism:coverDate` - Publication date
- `prism:doi` - DOI
- `prism:volume` - Volume
- `prism:issueIdentifier` - Issue
- `prism:pageRange` - Pages
- `citedby-count` - Citation count

**Rate Limiting:**

- Without key: 20 requests/minute
- With key: 10 requests/second
- 5000 requests per week

**Special Features:**

- PII (Publisher Item Identifier) support
- Institutional access integration
- Article clustering
- ScienceDirect recommendations

---

## Scopus (Elsevier)

**Base URL:** `https://api.elsevier.com`

**Documentation:** https://dev.elsevier.com/documentation/SCOPUSSearchAPI.wadl

**API Type:** REST JSON API

**Authentication:** API Key (X-ELS-APIKey header)

### Capabilities

- ✅ Search papers
- ❌ No PDF download
- ❌ No full text
- ✅ Citation statistics
- ✅ API key required

### API Endpoints

#### Search Scopus

```
GET /content/search/scopus
```

**Parameters:**

- `query` - Scopus search query
- `count` - Results per
page (max: 25)
- `start` - Start position
- `view` - Detail level: `STANDARD`, `COMPLETE`
- `field` - Fields to return
- `sort` - Sort field and order

**Query Format (Scopus):**

```
# Title/Abstract/Keywords
TITLE-ABS-KEY(machine learning)

# Author search
AUTHOR(Smith)

# Affiliation search
AFFIL(MIT)

# Journal search
SRCTITLE(Nature)

# Subject area
SUBJAREA(COMP)

# Year range
PUBYEAR > 2020 AND PUBYEAR < 2024

# Open access
OPENACCESS(1)

# Document type
DOCTYPE(Article)
```

#### Get Abstract

```
GET /content/abstract/scopus_id/{scopus_id}
```

**Parameters:**

- `view` - Detail level: `META`, `META_ABS`, `FULL`

**Response Fields:**

- `dc:identifier` - Scopus ID
- `eid` - EID identifier
- `dc:title` - Title
- `dc:creator` - Authors
- `prism:publicationName` - Journal
- `prism:coverDate` - Publication date
- `prism:doi` - DOI
- `citedby-count` - Citation count
- `authkeywords` - Author keywords
- `affiliation` - Author affiliations

**Rate Limiting:**

- Without key: 20 requests/minute
- With key: 10 requests/second
- 5000 requests per week

**Special Features:**

- Largest abstract database
- Author profiles
- Affiliation data
- Citation tracking
- h-index calculations

---

## Springer Nature

**Base URL:** `https://api.springernature.com`

**Documentation:** https://dev.springernature.com/

**API Type:** REST JSON API

**Authentication:** API Key (api_key parameter)

### Capabilities

- ✅ Search papers
- ✅ PDF download (open access)
- ❌ No full text
- ✅ Citation statistics (via Crossref)
- ✅ API key required

### API Endpoints

#### Metadata API v2

```
GET /meta/v2/json
```

#### OpenAccess API

```
GET /openaccess/json
```

**Parameters:**

- `q` - Query string
- `api_key` - API key
- `s` - Start index
- `p` - Page size (max: 100)
- `year` - Publication year
- `name` - Author name
- `pub` - Publication name
- `subject` - Subject area
- `type` - Content type

**Content Types:**

- `Journal` - Journal articles
- `Book` - Books
- `BookChapter` - Book chapters
- `ConferencePaper` -
Conference papers

**Response Structure:**

```json
{
  "records": [{
    "identifier": "doi:10.1038/s41586-023-06083-9",
    "title": "Article Title",
    "creators": [{"creator": "Smith, John"}],
    "publicationName": "Nature",
    "publicationDate": "2023-08-01",
    "doi": "10.1038/s41586-023-06083-9",
    "url": [{"format": "pdf", "value": "https://..."}],
    "abstract": "Abstract text...",
    "openaccess": "true"
  }]
}
```

**Rate Limiting:**

- 5000 requests per day
- ~200 requests per hour
- 3-4 requests per minute recommended

**Special Features:**

- Open access content filtering
- Multiple content types
- Publisher metadata
- PDF URL extraction

---

## Wiley TDM

**Base URL:** `https://api.wiley.com/onlinelibrary/tdm/v1`

**Documentation:** https://onlinelibrary.wiley.com/library-info/resources/text-and-datamining

**API Type:** REST API

**Authentication:** TDM Token (Wiley-TDM-Client-Token header)

### Capabilities

- ❌ No search functionality
- ✅ PDF download by DOI
- ✅ Full PDF text
- ❌ No citation statistics
- ✅ TDM token required

### API Endpoints

#### Download Article PDF

```
GET /articles/{doi}
```

**Headers:**

- `Wiley-TDM-Client-Token: {tdm_token}`
- `Accept: application/pdf`

**DOI Format:**

- URL-encoded DOI: `10.1111%2Fjtsb.12390`

**Error Codes:**

- `400` - No TDM token provided
- `403` - Invalid TDM token
- `404` - Access denied (no subscription)
- `429` - Rate limit exceeded

**Rate Limiting:**

- 3 articles per second maximum
- 60 requests per 10 minutes
- Built-in delays recommended

**Important Notes:**

- TDM API is for Text and Data Mining only
- No search functionality - use Crossref for searching
- Requires institutional subscription
- PDFs are watermarked for TDM use

---

## PubMed

**Base URL:** `https://eutils.ncbi.nlm.nih.gov/entrez/eutils`

**Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25501/

**API Type:** REST XML API

**Authentication:** Optional API key

### Capabilities

- ✅ Search papers
- ❌ No direct PDF download
- ❌ Abstracts only
- ❌ No citation statistics
- ❌ API key
optional

### API Endpoints

#### Search (ESearch)

```
GET /esearch.fcgi
```

**Parameters:**

- `db` - Database ("pubmed")
- `term` - Search query
- `retmax` - Results to return
- `retstart` - Start position
- `sort` - Sort method
- `api_key` - Optional API key

**Query Format:**

```
# Basic search
machine learning

# Field-specific
"Smith J"[Author]
"Nature"[Journal]
2023[Publication Date]

# Boolean combinations
(machine learning OR deep learning) AND 2023[Publication Date]
```

#### Fetch Details (EFetch)

```
GET /efetch.fcgi
```

**Parameters:**

- `db` - Database ("pubmed")
- `id` - PMID list (comma-separated)
- `retmode` - Return format ("xml")
- `api_key` - Optional API key

#### Get Summaries (ESummary)

```
GET /esummary.fcgi
```

**Rate Limiting:**

- Without API key: 3 requests/second
- With API key: 10 requests/second
- Batch requests recommended

**Special Features:**

- Medical literature focus
- MeSH terms indexing
- Clinical trial links
- PMC (PubMed Central) integration

---

## Google Scholar

**Base URL:** `https://scholar.google.com/scholar`

**Documentation:** None (unofficial)

**API Type:** Web scraping

**Authentication:** None (anti-bot measures)

### Capabilities

- ✅ Search papers
- ❌ No direct PDF download
- ❌ Metadata only
- ✅ Citation statistics
- ❌ No API key required

### Scraping Implementation

#### Search Parameters

```
GET /scholar
```

**Parameters:**

- `q` - Search query
- `start` - Start position (pagination)
- `hl` - Interface language
- `as_sdt` - Search type (0,5 for articles)
- `as_vis` - Include citations (1)
- `as_ylo` - Year low
- `as_yhi` - Year high
- `as_sauthors` - Author filter

**Anti-Detection Measures:**

- Random user agents
- Request delays (1-3 seconds)
- Session management
- IP rotation (recommended)

**HTML Structure Parsing:**

```javascript
// Result container
'.gs_ri'

// Title and URL
'h3.gs_rt a'

// Authors and publication info
'div.gs_a'

// Abstract
'div.gs_rs'

// Citation count
'div.gs_fl a:contains("Cited by")'
```

**Rate
Limiting:**

- No explicit limits
- Anti-bot protection
- May require CAPTCHA solving
- IP-based rate limiting

**Special Considerations:**

- Terms of service restrictions
- Unstable HTML structure
- Legal compliance required
- Alternative APIs recommended

---

## Semantic Scholar

**Base URL:** `https://api.semanticscholar.org/graph/v1`

**Documentation:** https://api.semanticscholar.org/api-docs/

**API Type:** REST JSON API

**Authentication:** Optional API key

### Capabilities

- ✅ Search papers
- ✅ PDF download (open access)
- ❌ No full text extraction
- ✅ Citation statistics
- ❌ API key optional

### API Endpoints

#### Search Papers

```
GET /paper/search
```

**Parameters:**

- `query` - Search string
- `limit` - Results (max: 100)
- `fields` - Fields to return
- `year` - Year filter
- `fieldsOfStudy` - Subject areas

**Available Fields:**

```
paperId, title, abstract, venue, year,
referenceCount, citationCount, influentialCitationCount,
isOpenAccess, openAccessPdf, fieldsOfStudy,
publicationTypes, publicationDate, journal,
authors, externalIds, url
```

#### Get Paper Details

```
GET /paper/{paper_id}
```

#### Get Paper by DOI

```
GET /paper/DOI:{doi}
```

**Response Structure:**

```json
{
  "paperId": "abc123",
  "title": "Paper Title",
  "abstract": "Abstract text...",
  "venue": "Conference Name",
  "year": 2023,
  "citationCount": 42,
  "referenceCount": 25,
  "influentialCitationCount": 5,
  "isOpenAccess": true,
  "openAccessPdf": {
    "url": "https://arxiv.org/pdf/...",
    "status": "ok"
  },
  "authors": [{"name": "Author Name"}],
  "externalIds": {
    "DOI": "10.1234/abc",
    "ArXiv": "2308.12345"
  }
}
```

**Rate Limiting:**

- Free tier: 100 requests per 5 minutes
- Pro tier: 1000 requests per 5 minutes
- Bulk API available

**Special Features:**

- AI-powered semantic search
- Influence metrics
- Open access detection
- External ID mapping
- Author disambiguation

---

## Sci-Hub

**Base URLs:** Multiple mirror sites

**Documentation:** None (unofficial)

**API Type:** Web scraping

**Authentication:** None
required

### Capabilities

- ✅ Search by DOI/URL
- ✅ PDF download
- ❌ No metadata search
- ❌ No citation data
- ❌ No API key required

### Mirror Sites

```javascript
[
  'https://sci-hub.se',
  'https://sci-hub.st',
  'https://sci-hub.ru',
  'https://sci-hub.ren',
  'https://sci-hub.wf',
  'https://sci-hub.yt'
]
```

### Implementation Details

#### Search by DOI/URL

```
GET /{doi_or_url}
```

**Supported Input Formats:**

- DOI: `10.1234/abcd.1234`
- URL: `https://doi.org/10.1234/abcd.1234`
- With prefix: `doi:10.1234/abcd.1234`

**PDF URL Extraction:**

```javascript
// Multiple selectors for PDF frame
'#pdf'
'embed[type="application/pdf"]'
'iframe[src*=".pdf"]'
'button[onclick*="download"]'
```

**Health Monitoring:**

- Automatic mirror health checks
- Failover between mirrors
- Response time tracking
- Failure count management

**Rate Limiting:**

- No explicit limits
- Manual delays recommended
- Anti-detection measures

**Legal Considerations:**

- Copyright infringement risks
- Institutional policy compliance
- Alternative access methods
- Ethical usage guidelines

---

## bioRxiv/medRxiv

**Base URL:** `https://api.biorxiv.org/details/{server}`

**Documentation:** https://api.biorxiv.org/

**API Type:** REST JSON API

**Authentication:** None required

### Capabilities

- ✅ Search papers
- ✅ PDF download
- ✅ Full text (PDF)
- ❌ No citation statistics
- ❌ No API key required

### API Endpoints

#### Search by Date Range

```
GET /{server}/{start_date}/{end_date}
```

**Servers:**

- `biorxiv` - Biology preprints
- `medrxiv` - Medicine preprints

**Parameters:**

- `cursor` - Pagination cursor
- `category` - Subject category
- `format` - Response format (json)

**Date Format:** `YYYY-MM-DD`

#### Get PDF

```
GET https://www.{server}.org/content/{doi}v{version}.full.pdf
```

**Response Structure:**

```json
{
  "messages": [{"status": "ok", "count": 50}],
  "collection": [{
    "doi": "10.1101/2023.08.12.553123",
    "title": "Paper Title",
    "authors": "Smith, John; Doe, Jane",
    "author_corresponding": "Smith, John",
    "date": "2023-08-12",
    "version": "1",
    "type": "preprint",
    "license": "cc-by-4.0",
    "category": "bioinformatics",
    "abstract": "Abstract text...",
    "server": "biorxiv"
  }]
}
```

**Categories (bioRxiv):**

- `animal_behavior_and_cognition`
- `biochemistry`
- `bioengineering`
- `bioinformatics`
- `biophysics`
- `cancer_biology`
- `cell_biology`
- `clinical_trials`
- `developmental_biology`
- `ecology`
- `epidemiology`
- `evolutionary_biology`
- `genetics`
- `genomics`
- `immunology`
- `microbiology`
- `molecular_biology`
- `neuroscience`
- `paleontology`
- `pathology`
- `pharmacology_and_toxicology`
- `physiology`
- `plant_biology`
- `scientific_communication_and_education`
- `synthetic_biology`
- `systems_biology`
- `zoology`

**Rate Limiting:**

- No explicit rate limits
- Reasonable request frequency recommended

**Special Features:**

- Preprint server integration
- Version tracking
- Category filtering
- DOI-based PDF access
- Daily content updates

---

## IACR ePrint

**Base URL:** `https://eprint.iacr.org`

**Documentation:** None (web scraping)

**API Type:** HTML scraping

**Authentication:** None required

### Capabilities

- ✅ Search papers
- ✅ PDF download
- ✅ Full text (PDF)
- ❌ No citation statistics
- ❌ No API key required

### Implementation Details

#### Search Interface

```
GET /search
```

**Parameters:**

- `q` - Search query

#### PDF Download

```
GET /{year}/{paper_id}.pdf
```

**Paper ID Format:** `YYYY/paper_id`

**HTML Structure:**

```css
/* Search results */
'.mb-4'                    /* Result container */
'.d-flex .paperlink'       /* Paper ID/link */
'a[href$=".pdf"]'          /* PDF link */
'small.ms-auto'            /* Last updated */
'.ms-md-4 strong'          /* Title */
'.ms-md-4 .fst-italic'     /* Authors */
'.ms-md-4 .badge'          /* Category */
'p.search-abstract'        /* Abstract */
```

#### Detailed Page

```css
/* Paper details */
'h3.mb-3'                  /* Title */
'p.fst-italic'             /* Authors */
'p[style*="white-space"]'  /* Abstract */
'a.badge.bg-secondary'     /* Keywords */
```

**Rate Limiting:**

- No explicit limits
- 1-second delays
between requests
- Respect server resources

**Special Features:**

- Cryptography focus
- Detailed paper information
- Keyword extraction
- Publication history
- Category classification

---

## Common Patterns

### Rate Limiting Strategies

1. **Token Bucket Algorithm**
   - Implemented in `RateLimiter` utility
   - Configurable requests per second
   - Burst capacity management

2. **Adaptive Delays**
   - Platform-specific delays
   - Retry with exponential backoff
   - Health check integration

3. **Request Queuing**
   - Batch request processing
   - Priority queuing
   - Concurrent request limits

### Authentication Patterns

1. **API Key Headers**
   ```javascript
   headers: {
     'X-ELS-APIKey': apiKey,              // Elsevier
     'X-ApiKey': apiKey,                  // Web of Science
     'Wiley-TDM-Client-Token': tdmToken,  // Wiley
     'x-api-key': apiKey                  // Semantic Scholar
   }
   ```

2. **Query Parameters**
   ```javascript
   params: {
     api_key: apiKey,       // Springer
     mailto: emailAddress   // Crossref
   }
   ```

3. **Polite Pool Access**
   ```javascript
   headers: {
     'User-Agent': 'Paper-Search-MCP/1.0 (mailto:user@example.com)'
   }
   ```

### Error Handling Patterns

1. **HTTP Status Code Mapping**
   - `400` - Bad Request
   - `401` - Unauthorized (invalid API key)
   - `403` - Forbidden (access denied)
   - `404` - Not Found
   - `429` - Rate Limit Exceeded
   - `500` - Internal Server Error

2. **Retry Strategies**
   - Network errors: 3 retries
   - Rate limits: Wait and retry
   - Server errors: Exponential backoff
   - Authentication: No retry

3. **Platform-Specific Errors**
   ```javascript
   // Web of Science
   'API key required'
   'Rate limit exceeded'

   // Elsevier
   'Invalid or missing API key'
   'Rate limit exceeded'

   // Wiley
   'TDM Client Token is invalid'
   'Access denied'
   ```

### Data Transformation Patterns

1. **Date Parsing**
   ```javascript
   // Multiple formats
   '2023-08-15'            // ISO
   '2023-08'               // Year-month
   '2023'                  // Year only
   '2023-08-15T10:30:00Z'  // ISO with time
   ```

2. **Author Name Normalization**
   ```javascript
   // Various formats
   'Smith, John A.'  // Last, First
   'John A. Smith'   // First Last
   'Smith JA'        // Last FirstInitials
   ```

3. **DOI Extraction**
   ```javascript
   // From various sources
   '10.1234/abcd.1234'                  // Raw DOI
   'https://doi.org/10.1234/abcd.1234'  // DOI URL
   'doi:10.1234/abcd.1234'              // DOI with prefix
   ```

---

## Error Handling

### Common Error Types

1. **Authentication Errors**
   - Invalid API keys
   - Expired tokens
   - Missing credentials
   - Insufficient permissions

2. **Rate Limiting Errors**
   - Request quota exceeded
   - Too many requests
   - Burst limit exceeded
   - Daily limits reached

3. **Network Errors**
   - Connection timeouts
   - DNS resolution failures
   - SSL certificate issues
   - Proxy errors

4. **Data Errors**
   - Invalid DOI formats
   - Missing required fields
   - Malformed responses
   - Encoding issues

### Error Recovery Strategies

1. **Automatic Retry**
   - Network failures: immediate retry
   - Rate limits: wait and retry
   - Server errors: exponential backoff
   - Timeouts: increase timeout value

2. **Fallback Mechanisms**
   - Alternative API endpoints
   - Different authentication methods
   - Cached data usage
   - Alternative platforms

3. **User Notifications**
   - Clear error messages
   - Suggested actions
   - Alternative approaches
   - Contact information

### Logging and Monitoring

1. **Request Logging**
   - Platform name
   - Endpoint URL
   - Request parameters
   - Response status
   - Response time
   - Error details

2. **Performance Monitoring**
   - Success rates
   - Response times
   - Error frequencies
   - Rate limit hits
   - API quota usage

This documentation serves as a comprehensive reference for developers working with the Paper Search MCP platform integrations. Each platform has unique characteristics and requirements that must be carefully considered when implementing search functionality.
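
As an illustration of the token bucket strategy listed under Rate Limiting Strategies, the sketch below shows one possible way to enforce a per-second budget with burst capacity. It is not the project's `RateLimiter` utility: the class name `TokenBucket`, its constructor options, and the `waitForToken` helper are hypothetical names chosen for this example, and the clock is injectable only so the behaviour can be checked without real delays.

```javascript
// Illustrative token bucket (hypothetical names; not the project's RateLimiter).
class TokenBucket {
  /**
   * @param {object} opts
   * @param {number} opts.capacity         Maximum burst size, in tokens.
   * @param {number} opts.refillPerSecond  Tokens added per second.
   * @param {() => number} [opts.now]      Clock in milliseconds (injectable for tests).
   */
  constructor({ capacity, refillPerSecond, now = () => Date.now() }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start with a full bucket
    this.lastRefill = now();
  }

  // Credit tokens for the time elapsed since the last refill, capped at capacity.
  refill() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = current;
  }

  // Take one token if available; returns false when the caller should wait.
  tryRemoveToken() {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  // Milliseconds until at least one token is available (0 if one is ready now).
  msUntilNextToken() {
    this.refill();
    if (this.tokens >= 1) return 0;
    return Math.ceil(((1 - this.tokens) / this.refillPerSecond) * 1000);
  }
}

// Resolve once a token has been consumed, sleeping between attempts.
async function waitForToken(bucket) {
  while (!bucket.tryRemoveToken()) {
    await new Promise((resolve) => setTimeout(resolve, bucket.msUntilNextToken()));
  }
}
```

Under these assumptions, a limiter for PubMed without an API key (3 requests/second in the table above) could be created with `new TokenBucket({ capacity: 3, refillPerSecond: 3 })`, and each request would call `await waitForToken(bucket)` before being sent. Platforms with longer windows, such as 100 requests per 5 minutes, would use a larger `capacity` and a fractional `refillPerSecond`.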