@pinkpixel/prysm-mcp: MCP server for the Prysm web scraper, enabling AI assistants to scrape web content
# 🔍 Prysm CLI Usage Guide
This document explains how to use the Prysm command-line interface for web scraping.
## Commands
There are two main ways to use Prysm:
1. **Scrape a URL** - Run the web scraper on any URL
2. **Start the API server** - Run the REST API server for remote control
## 1. Scraping a URL
```bash
# Basic usage
npm run scrape "https://example.com"
# With options
npm run scrape "https://example.com" --maxScrolls 10 --noHeadless
```
> Note: You can also use `npm run start:cli` which does the same thing.
### Options
- `--maxScrolls <number>` - Maximum scroll attempts (default: 100)
- `--scrollDelay <ms>` - Delay between scrolls in ms (default: 2000)
- `--headless` - Run in headless mode (default: true)
- `--noHeadless` - Run with browser visible
- `--output <path>` - Custom output path for results
- `--help` - Show help message
### Smart Scan Options
- `--analyze` - Run analysis without scraping (for testing)
- `--skipAnalysis` - Disable Smart Scan and fall back to the traditional brute-force approach
- `--focused` - Optimize for speed with fewer scrolls, main content only
- `--standard` - Balanced approach (default)
- `--deep` - Maximum extraction, slower but thorough
- `--article` - Optimize for articles and blog posts
- `--product` - Optimize for product pages
- `--listing` - Optimize for product listings
### Environment Variables
You can configure output directories using environment variables:
- `PRYSM_OUTPUT_DIR` - Set the main output directory for results
- `PRYSM_IMAGE_OUTPUT_DIR` - Set the output directory for downloaded images
Example:
```bash
# Set output directories using environment variables
export PRYSM_OUTPUT_DIR="/custom/path/to/results"
export PRYSM_IMAGE_OUTPUT_DIR="/custom/path/to/images"
npm run scrape "https://example.com"
```
These environment variables are especially useful when integrating with MCP (Model Context Protocol) servers or other systems where command-line arguments may not be available.
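The variables can also be scoped to a single command instead of exported for the whole session. A minimal sketch of the mechanism (using `sh -c` as a stand-in for the scraper process; the paths are placeholders):

```shell
# Inline assignments apply only to the command that follows them,
# so the variable reaches the child process without polluting the shell.
PRYSM_OUTPUT_DIR="/tmp/prysm-results" sh -c 'echo "results will go to: $PRYSM_OUTPUT_DIR"'
# The same pattern works with npm, which passes the environment through:
#   PRYSM_OUTPUT_DIR="/tmp/prysm-results" npm run scrape "https://example.com"
```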
### Examples
```bash
# Basic scraping
npm run scrape "https://example.com"
# With custom scroll settings
npm run scrape "https://example.com" --maxScrolls 10 --scrollDelay 1000
# Disable headless mode to see the browser
npm run scrape "https://example.com" --noHeadless
# Custom output location
npm run scrape "https://example.com" --output "./my-results"
# Use Smart Scan with focused mode (faster)
npm run scrape "https://example.com" --focused
# Analyze site structure without scraping
npm run scrape "https://example.com" --analyze
# Optimize for article content
npm run scrape "https://example.com" --article
```
## 2. Starting the API Server
```bash
npm run start:api
```
This will start the API server, which automatically finds an available port (3000 by default; if 3000 is taken, the next free port is used).
Once running, you can access:
- API at http://localhost:3000/api
- Documentation at http://localhost:3000/api-docs
Adjust the port in these URLs if the server picked a different one.
## Results
All scraping results are saved in the `results` folder inside the `scraper` directory by default. You can customize this using the `--output` option or the `PRYSM_OUTPUT_DIR` environment variable.
Each result is saved as a JSON file with:
- Page title
- Extracted content
- Metadata
- Structure type (article, recipe, etc.)
- Pagination type used
- Extraction method
- URL and timestamp
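A result file might look roughly like this. This is a hypothetical sketch based only on the fields listed above; the exact key names, value shapes, and nesting in real Prysm output may differ:

```json
{
  "title": "Example Domain",
  "content": ["..."],
  "metadata": { "description": "..." },
  "structureType": "article",
  "paginationType": "infinite-scroll",
  "extractionMethod": "smart-scan",
  "url": "https://example.com",
  "timestamp": "2024-01-01T00:00:00.000Z"
}
```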
## Troubleshooting
If you encounter errors:
1. Make sure you have all dependencies installed: `npm install`
2. Check if the URL is accessible in a regular browser
3. Try with `--noHeadless` to see what's happening in the browser
4. Disable the Cloudflare bypass with `--noBypassCloudflare` if the bypass itself appears to be causing problems