# bioRxiv MECA Downloader CLI
A command-line interface (CLI) tool to download, process, and manage bioRxiv and medRxiv MECA (Manuscript Exchange Common Approach) files from AWS S3 for text and data mining.
## Features
- **Multi-Server Support**: Works with both bioRxiv and medRxiv servers
- **AWS S3 Integration**: Connect to S3 buckets with requester-pays support
- **Content Exploration**: List, search, and browse available content by month or batch
- **Individual Downloads**: Download MECA files by DOI with API integration
- **Batch Processing**: Process large amounts of data with configurable concurrency
- **Content Summaries**: Get detailed information about preprints
- **Month/Batch Analysis**: Detailed metadata for specific time periods
- **XML Processing**: Robust handling of bioRxiv XML files with entity replacement
## Installation
### Prerequisites
- Node.js 18.0.0 or higher
- AWS account with access to the bioRxiv/medRxiv S3 buckets (the buckets are requester-pays; see `--requester-pays`)
- API key for bioRxiv API access
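To confirm that the local Node.js meets the 18.0.0 minimum listed above, a small shell check can help. This is a minimal sketch; `node_version_ok` is a hypothetical helper name and is not part of the CLI.

```bash
# Hypothetical helper (not part of the biorxiv CLI): succeeds when the given
# Node.js version string (e.g. "v18.17.1") has a major version of 18 or higher.
node_version_ok() {
  version="${1#v}"        # strip the leading "v"
  major="${version%%.*}"  # keep everything before the first dot
  [ "$major" -ge 18 ] 2>/dev/null
}
```

For example, `node_version_ok "$(node --version)" || echo "Node.js 18 or higher is required"` prints a warning on older installations.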
### Global Installation
```bash
npm install -g biorxiv
```
## Quick Start
### Summary
Get a summary of a bioRxiv/medRxiv preprint from a URL or DOI.
**Arguments:**
- `<url-or-doi>`: bioRxiv or medRxiv URL or DOI to summarize
**Options:**
- `-m, --more`: Show additional details and full abstract
- `-s, --server <server>`: Specify server (biorxiv or medrxiv)
**Examples:**
```bash
biorxiv summary "10.1101/2024.05.08.593085"
biorxiv summary -m "10.1101/2024.05.08.593085"
biorxiv summary -s medrxiv "10.1101/2020.03.19.20039131" --more
```
### Download Files
Download MECA files from the bioRxiv/medRxiv S3 buckets by DOI.
**Arguments:**
- `<doi>`: DOI of the paper (e.g., "10.1101/2024.05.08.593085")
**Options:**
- `-o, --output <dir>`: Output directory for downloaded files (default: "./downloads")
- `-a, --api-url <url>`: API base URL
- `--requester-pays`: Enable requester-pays for S3 bucket access
**Examples:**
```bash
biorxiv --requester-pays download "10.1101/2024.05.08.593085"
biorxiv --requester-pays download "10.1101/2024.05.08.593085" --output "./papers"
biorxiv --requester-pays download "10.1101/2024.05.08.593085" --api-url "https://custom-api.com"
```
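Because `download` takes a single DOI, fetching several papers means calling it once per DOI. The sketch below wraps that loop in a shell function. `download_dois` and the `BIORXIV_CMD` override are hypothetical names introduced for illustration; only the `download`, `--output`, and `--requester-pays` flags come from the options documented above.

```bash
# Hypothetical wrapper: download every DOI listed in a text file (one per line).
# Blank lines and lines starting with "#" are skipped.
# BIORXIV_CMD is an assumed override (useful for testing); it defaults to "biorxiv".
download_dois() {
  file="$1"
  outdir="${2:-./downloads}"
  while IFS= read -r doi || [ -n "$doi" ]; do
    case "$doi" in
      ''|'#'*) continue ;;
    esac
    # </dev/null keeps the command from consuming the rest of the DOI list.
    "${BIORXIV_CMD:-biorxiv}" --requester-pays download "$doi" --output "$outdir" </dev/null \
      || echo "failed: $doi" >&2
  done < "$file"
}
```

Calling `download_dois dois.txt ./papers` processes each DOI in turn and reports failures on standard error without stopping the loop.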
### List Bucket Contents
List available content in the bioRxiv or medRxiv S3 bucket.
**Options:**
- `-m, --month <month>`: Filter by specific month (e.g., "2024-01" or "January_2024")
- `-b, --batch <batch>`: Filter by specific batch (e.g., "Batch_01")
- `-l, --limit <number>`: Limit the number of results (default: 50)
- `-s, --server <server>`: Server to use: "biorxiv" or "medrxiv"
**Examples:**
```bash
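# List bucket contents using the default limit of 50 results
biorxiv list
# Filter by month
biorxiv list --month "2024-01"
# Filter by batch on medRxiv, returning up to 100 results
biorxiv list --batch 1 --limit 100 --server medrxiv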
```
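The two month formats accepted by `--month` (`2024-01` and `January_2024`) name the same period. The sketch below shows how one maps to the other; `month_to_name` is a hypothetical helper and makes no claim about how the CLI normalizes the value internally.

```bash
# Hypothetical helper: convert "YYYY-MM" (e.g. "2024-01") to "Month_YYYY"
# (e.g. "January_2024"). Returns non-zero for an unrecognized month number.
month_to_name() {
  year="${1%%-*}"   # text before the first "-"
  month="${1#*-}"   # text after the first "-"
  case "$month" in
    01) name="January" ;;   02) name="February" ;;
    03) name="March" ;;     04) name="April" ;;
    05) name="May" ;;       06) name="June" ;;
    07) name="July" ;;      08) name="August" ;;
    09) name="September" ;; 10) name="October" ;;
    11) name="November" ;;  12) name="December" ;;
    *) return 1 ;;
  esac
  printf '%s_%s\n' "$name" "$year"
}
```

With this helper, `biorxiv list --month "$(month_to_name 2024-01)"` is equivalent to passing `January_2024` directly.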
### Batch Info
List detailed metadata for all files in a specific month or batch.
**Options:**
- `-m, --month <month>`: Month to list (e.g., "January_2024" or "2024-01")
- `-b, --batch <batch>`: Batch to list (e.g., "1", "batch-1", "Batch_01")
- `-s, --server <server>`: Server to use: "biorxiv" or "medrxiv"
**Examples:**
```bash
biorxiv batch-info --month "2024-01"
biorxiv batch-info --batch "1"
biorxiv batch-info --server medrxiv --month "2024-01"
```
## Global Options
### `--requester-pays`
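Enable requester-pays access. The bioRxiv/medRxiv S3 buckets require requester pays for external access, which means AWS bills the request and data transfer costs to your account rather than to the bucket owner.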
## Batch Processing
Batch process MECA files for a given month or batch.
**Options:**
**Time Selection:**
- `-m, --month <month>`: Month(s) to process. Supports: YYYY-MM, comma-separated list, or wildcard patterns
- `-b, --batch <batch>`: Batch to process. Supports: single batch, range, or comma-separated list
**Processing Control:**
- `-l, --limit <number>`: Maximum number of files to process
- `-c, --concurrency <number>`: Number of files to process concurrently (default: 1)
- `--force`: Force reprocessing of existing files
- `--dry-run`: List files without processing them
**Output Control:**
- `-o, --output <dir>`: Output directory for extracted files (default: "./batch-extracted")
- `--keep`: Keep MECA files after processing
- `--full-extract`: Extract entire MECA file instead of selective extraction
- `--max-file-size <size>`: Skip files larger than this size (e.g. 1GB)
**API Configuration:**
- `-a, --api-url <url>`: API base URL (default: "https://biorxiv.csf.now")
- `-k, --api-key <key>`: API key for authentication (or use OPENRXIV_BATCH_PROCESSING_API_KEY env var)
- `-s, --server <server>`: Server type: biorxiv or medrxiv
**Examples:**
```bash
# Process specific month
biorxiv batch-process --month "2025-08" --requester-pays
# Process multiple months
biorxiv batch-process --month "2024-01,2024-02,2024-03" --requester-pays
# Dry run to see what would be processed
biorxiv batch-process --month "2025-08" --dry-run
# Process all of 2025
biorxiv batch-process --month "2025-*" --requester-pays
# Process with concurrency
biorxiv batch-process --month "2025-08" --concurrency 5 --requester-pays
```
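Since `--month` accepts a comma-separated list, a contiguous range of months can be generated instead of typed by hand. The following is a minimal sketch; `month_range` is a hypothetical helper, not a CLI feature.

```bash
# Hypothetical helper: print a comma-separated list of YYYY-MM values for one
# year, from the start month to the end month (inclusive).
# Pass months as plain integers (8, not 08) so they are not read as octal.
month_range() {
  year="$1"
  m="$2"
  end="$3"
  list=""
  while [ "$m" -le "$end" ]; do
    entry="$(printf '%s-%02d' "$year" "$m")"
    list="${list:+$list,}$entry"   # add a comma only between entries
    m=$((m + 1))
  done
  printf '%s\n' "$list"
}
```

A cautious workflow is to preview the selection first, for example `biorxiv batch-process --month "$(month_range 2024 1 3)" --dry-run`, and then rerun the same command with `--requester-pays` in place of `--dry-run`.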
## Configuration
The tool reads AWS credentials from the default profile in your home directory (typically `~/.aws/credentials`), if available.
### Environment Variables
You can also set credentials via environment variables:
```bash
export OPENRXIV_BATCH_PROCESSING_API_KEY="your-api-key"
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
```
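When relying on environment variables rather than the default AWS profile, a pre-flight check can catch a missing value before a long batch run starts. This is a sketch only; `missing_env` is a hypothetical helper, and it does not apply if your AWS credentials come from the profile instead.

```bash
# Hypothetical pre-flight check: print the name of every variable from the
# list above that is unset or empty, and return non-zero if any are missing.
missing_env() {
  status=0
  for var in OPENRXIV_BATCH_PROCESSING_API_KEY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    eval "value=\${$var:-}"   # indirect lookup of the variable named in $var
    if [ -z "$value" ]; then
      echo "$var"
      status=1
    fi
  done
  return "$status"
}
```

For example, `missing_env || echo "Set the variables listed above before running batch-process"` prints each missing name followed by the reminder.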
## Development
### Local Development
```bash
git clone https://github.com/continuous-foundation/biorxiv
cd biorxiv
npm install
```
### Building
```bash
npm run build
```
### Testing
```bash
npm test
npm run test:watch
```
### Linting & Formatting
```bash
npm run lint
npm run lint:format
```
## License
MIT License - see LICENSE file for details.
## Compliance
This tool is designed to comply with bioRxiv's and medRxiv's fair use policies:
- No content redistribution
- Link back to bioRxiv/medRxiv for indexing services
- Respect author copyright and licensing
- Intended for legitimate text and data mining purposes