@pinkpixel/prysm-mcp

MCP server for the Prysm web scraper - enabling AI assistants to scrape web content

# 📝 Changelog

All notable changes to the Prysm scraper will be documented in this file.

## [1.1.1] - 2024-04-05

### Added

- 🔧 Added support for `PRYSM_OUTPUT_DIR` environment variable to configure output directory
- 🖼️ Added support for `PRYSM_IMAGE_OUTPUT_DIR` environment variable to configure image output directory
- 📚 Updated documentation with detailed explanation of environment variables
- 🧩 Enhanced MCP integration examples in documentation

### Fixed

- 🔄 Removed circular dependency where the package referenced itself
- 🛠️ Improved path handling for better compatibility with MCP and other environments

## [1.4.0] - 2024-04-05

### Added

- 🧠 Smart Scan technology that automatically analyzes page structure for optimized scraping
- 🔍 Content type detection for articles, products, listings, and more
- ⚡ Optimized extraction strategies based on detected page structure
- 🔄 Automated pagination strategy selection
- 🚀 Performance profiles for different scraping needs (focused, standard, deep)
- 💨 Streamlined CLI with intuitive speed options (`--focused`, `--standard`, `--deep`)
- 📊 Improved metadata extraction with automatic optimization

### Changed

- 🎯 Focused metadata extraction to prioritize essential information
- 🧩 Restructured extractor organization to work with Smart Scan
- 🤫 Reduced console output for cleaner terminal display
- ⚙️ Improved default settings for various page types

## [1.3.4] - 2024-04-04

### Changed

- 🔨 Implemented true brute force approach that applies all extraction methods to every page
- 🚫 Removed all detection logic and thresholds for maximum content extraction
- 🧹 Removed conditional checks in pagination strategies to try everything on every page
- 🔄 Simplified pagination handling for more consistent results across different sites
- 🖼️ Enhanced image extraction to capture all images without filtering
- 🤫 Significantly reduced console output for a cleaner terminal experience
- ⚡ Streamlined metadata extraction to focus on content and images

## [1.3.3] - 2024-04-03

### Added

- 🧪 Added comprehensive test script with category and name-based filtering
- 🌈 Enhanced test runner with detailed results reporting and summaries
- 📊 Added JSON summary files for test runs with timestamp and statistics

### Improved

- 🎨 Enhanced CLI UI with additional colors and visual formatting
- 📋 Improved error handling and reporting in test scripts
- 👁️ Added more visual feedback during image extraction and downloading

## [1.3.2] - 2024-04-03

### Added

- 🎨 Added beautiful multicolored ASCII banner to the CLI interface
- 🌈 Enhanced terminal output with colored text and multicolored progress indicators
- ✨ Added package version and branding display in CLI

### Fixed

- 🖼️ Fixed image downloading functionality by correcting fs module usage
- 📊 Added duplicate image detection to avoid downloading the same image multiple times
- 🔢 Improved image count accuracy between reported and actual downloaded images

## [1.3.1] - 2024-04-04

### Changed

- ⚙️ Relaxed strict filtering thresholds in content verification
- 🔄 Enhanced URL Parameter pagination with more reliable content loading
- 🖼️ Improved image extraction for sites with lazy-loaded images
- 🚀 Increased default scroll limits for better content capture
- 🧠 Added multiple events to trigger lazy-loading (mousemove, DOMContentLoaded, custom events)
- ⏱️ Improved timing delays for better content loading

## [1.3.0] - 2024-04-03

### Added

- 📄 Added URL Parameter pagination strategy for sites like CigarScanner
- 🔄 Implemented hybrid pagination approach that combines URL parameters with scrolling
- 🧠 Automatic detection of sites that use URL-based pagination (`?page=X`)
- 🛠️ Added `parameter` option to `--paginationStrategy` flag

## [1.2.0] - 2024-04-03

### Added

- 📸 Added image scraping functionality
- 📥 Added image downloading capability
- 🗃️ Images are now included in the JSON output
- 🔧 New CLI options for controlling image scraping:
  - `--scrapeImages`: Enable image extraction
  - `--downloadImages`: Download images locally
  - `--maxImages`: Control maximum images extracted
  - `--minImageSize`: Filter out images smaller than specified size

## [1.0.1] - 2024-04-03

### Added

- 📦 Published package to npm under @pinkpixel organization
- 🏷️ Added npm version and license badges to README
- 📄 Added .npmignore file to exclude development files from the package

## [1.1.0] - 2024-04-02

### Added

- 🔍 Integrated multi-page scraping directly into main CLI
- ⚙️ Added `--pages` parameter to specify number of pages to scrape
- 🔗 Added `--linkSelector` option for custom link selection
- 🌐 Added `--allDomains` flag to follow links across domains
- 🧠 Added 14 specialized scroll strategies for comprehensive content extraction:
  - Standard scroll
  - Chunk scroll (10% increments)
  - Reverse scroll (bottom to top)
  - Pulse scroll (down then slightly up)
  - Zigzag scroll (with horizontal movement)
  - Step scroll (small viewport increments)
  - Bounce scroll (full page bouncing)
  - Hover scroll (mouse movement simulation)
  - Random scroll (random positions)
  - Corner scroll (hits viewport corners)
  - Diagonal scroll (diagonal pattern)
  - Spiral scroll (spiral pattern)
  - Swipe scroll (keyboard-based)
  - Resize scroll (viewport resizing)

### Changed

- 🔄 Optimized default scroll parameters (`maxScrolls: 100`, `scrollDelay: 1000`)
- 📊 Simplified results output to focus on essential information
- 🚀 Improved progress indication with dot-based progress bar
- ⚡ Enhanced content extraction to accumulate results across multiple pagination attempts
- 🧩 Restructured pagination handling to maximize content discovery

### Removed

- 🗑️ Removed redundant `multi_scrape.js` script
- 🔇 Removed verbose logging for cleaner output

### Fixed

- 🐛 Fixed duplicate "Starting scraper" messages
- 🔧 Fixed scroll strategy implementation for better dynamic content capture
- 🧪 Fixed content deduplication to maintain unique items

## [1.0.0] - 2024-03-15

### Added

- 🌐 Initial release of Prysm web scraper
- 🧠 Structure-aware content extraction
- 🕵️‍♂️ Cloudflare bypass capability
- 🚫 Resource blocking for improved performance
- 🔄 Basic pagination handling
- 🌐 REST API for remote control
- 📑 Basic CLI interface

### Fixed

- Initial version - no fixes