`@pinkpixel/prysm-mcp` - MCP server for the Prysm web scraper, enabling AI assistants to scrape web content.
# 🔍 Prysm CLI Usage Guide

This document explains how to use the Prysm command-line interface for web scraping.

## Commands

There are two main ways to use Prysm:

1. **Scrape a URL** - Run the web scraper on any URL
2. **Start the API server** - Run the REST API server for remote control

## 1. Scraping a URL

```bash
# Basic usage
npm run scrape "https://example.com"

# With options
npm run scrape "https://example.com" --maxScrolls 10 --noHeadless
```

> Note: You can also use `npm run start:cli`, which does the same thing.

### Options

- `--maxScrolls <number>` - Maximum scroll attempts (default: 100)
- `--scrollDelay <ms>` - Delay between scrolls in milliseconds (default: 2000)
- `--headless` - Run in headless mode (default: true)
- `--noHeadless` - Run with the browser visible
- `--output <path>` - Custom output path for results
- `--help` - Show the help message

### Smart Scan Options

- `--analyze` - Run analysis without scraping (useful for testing)
- `--skipAnalysis` - Disable Smart Scan and use the traditional brute-force approach
- `--focused` - Optimize for speed: fewer scrolls, main content only
- `--standard` - Balanced approach (default)
- `--deep` - Maximum extraction; slower but thorough
- `--article` - Optimize for articles and blog posts
- `--product` - Optimize for product pages
- `--listing` - Optimize for product listings

### Environment Variables

You can configure output directories using environment variables:

- `PRYSM_OUTPUT_DIR` - Main output directory for results
- `PRYSM_IMAGE_OUTPUT_DIR` - Output directory for downloaded images

Example:

```bash
# Set output directories using environment variables
export PRYSM_OUTPUT_DIR="/custom/path/to/results"
export PRYSM_IMAGE_OUTPUT_DIR="/custom/path/to/images"
npm run scrape "https://example.com"
```

These environment variables are especially useful when integrating with MCP (Model Context Protocol) or other systems where command-line arguments are not available.
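When wiring these variables into another system, it helps to see the precedence they imply. A minimal sketch, assuming the scraper falls back to a local `./results` directory when neither variable is set (the fallback paths here are assumptions for illustration; only the two `PRYSM_*` variable names are from the documentation above):

```shell
#!/bin/sh
# Assumed fallback behavior: use the environment variable when set,
# otherwise a local default. The ./results default is an assumption,
# not Prysm's documented path.
OUTPUT_DIR="${PRYSM_OUTPUT_DIR:-./results}"
IMAGE_DIR="${PRYSM_IMAGE_OUTPUT_DIR:-$OUTPUT_DIR/images}"

echo "results -> $OUTPUT_DIR"
echo "images  -> $IMAGE_DIR"
```

The `${VAR:-default}` expansion means an exported variable always wins, which is exactly why this style of configuration works where command-line flags cannot be passed.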
### Examples

```bash
# Basic scraping
npm run scrape "https://example.com"

# With custom scroll settings
npm run scrape "https://example.com" --maxScrolls 10 --scrollDelay 1000

# Disable headless mode to see the browser
npm run scrape "https://example.com" --noHeadless

# Custom output location
npm run scrape "https://example.com" --output "./my-results"

# Use Smart Scan with focused mode (faster)
npm run scrape "https://example.com" --focused

# Analyze site structure without scraping
npm run scrape "https://example.com" --analyze

# Optimize for article content
npm run scrape "https://example.com" --article
```

## 2. Starting the API Server

```bash
npm run start:api
```

This starts the API server, which automatically finds an available port (3000 is used if it is free). Once running, you can access the following (substituting the port printed at startup; these URLs assume the server came up on 3001):

- API at http://localhost:3001/api
- Documentation at http://localhost:3001/api-docs

## Results

By default, all scraping results are saved in the `results` folder inside the `scraper` directory. You can customize this with the `--output` option or the `PRYSM_OUTPUT_DIR` environment variable.

Each result is saved as a JSON file containing:

- Page title
- Extracted content
- Metadata
- Structure type (article, recipe, etc.)
- Pagination type used
- Extraction method
- URL and timestamp

## Troubleshooting

If you encounter errors:

1. Make sure all dependencies are installed: `npm install`
2. Check that the URL is accessible in a regular browser
3. Try `--noHeadless` to watch what the browser is doing
4. If Cloudflare protection is causing problems, disable the bypass with `--noBypassCloudflare`
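Result files can be inspected straight from the shell. A sketch using a hypothetical result file whose keys mirror the field list above (the exact key names in a real result may differ, so check a file in your own results folder first):

```shell
#!/bin/sh
# Hypothetical result file: the fields mirror the documented list, but the
# real key names are assumptions; inspect an actual result to confirm them.
cat > sample-result.json <<'EOF'
{
  "title": "Example Domain",
  "url": "https://example.com",
  "structureType": "article",
  "paginationType": "none",
  "extractionMethod": "smart-scan",
  "timestamp": "2024-01-01T00:00:00Z"
}
EOF

# Pull out a single field with python3, which most systems already have.
python3 -c 'import json; print(json.load(open("sample-result.json"))["title"])'
# prints: Example Domain
```

The same one-liner works for any key, which makes it easy to batch over a directory of results with a `for f in results/*.json` loop.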