@pinkpixel/prysm-mcp
MCP server for the Prysm web scraper - enabling AI assistants to scrape web content
# Changelog
All notable changes to the Prysm scraper will be documented in this file.
## [1.1.1] - 2024-04-05
### Added
- Added support for `PRYSM_OUTPUT_DIR` environment variable to configure output directory
- Added support for `PRYSM_IMAGE_OUTPUT_DIR` environment variable to configure image output directory
- Updated documentation with detailed explanation of environment variables
- Enhanced MCP integration examples in documentation
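
As an illustration of how the two environment variables might be consumed, here is a minimal sketch. Only the names `PRYSM_OUTPUT_DIR` and `PRYSM_IMAGE_OUTPUT_DIR` come from this changelog; the helper `resolveOutputDirs`, the `./prysm-output` default, and the `images` subfolder fallback are assumptions made for the example, not documented Prysm behavior.

```typescript
// Hypothetical helper: only the two variable names come from the changelog.
type Env = Record<string, string | undefined>;

interface OutputDirs {
  outputDir: string;
  imageOutputDir: string;
}

function resolveOutputDirs(env: Env, fallbackDir = "./prysm-output"): OutputDirs {
  // Treat unset or blank values as "not configured".
  const clean = (value: string | undefined): string | undefined => {
    const trimmed = value?.trim();
    return trimmed ? trimmed : undefined;
  };

  const outputDir = clean(env["PRYSM_OUTPUT_DIR"]) ?? fallbackDir;
  // Assumed fallback: an "images" folder inside the output directory.
  const imageOutputDir =
    clean(env["PRYSM_IMAGE_OUTPUT_DIR"]) ?? `${outputDir}/images`;

  return { outputDir, imageOutputDir };
}
```

In a Node.js process this could be called as `resolveOutputDirs(process.env)`. Passing the environment in as an argument keeps the function easy to test.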
### Fixed
- Removed circular dependency where the package referenced itself
- Improved path handling for better compatibility with MCP and other environments
## [1.4.0] - 2024-04-05
### Added
- Smart Scan technology that automatically analyzes page structure for optimized scraping
- Content type detection for articles, products, listings, and more
- Optimized extraction strategies based on detected page structure
- Automated pagination strategy selection
- Performance profiles for different scraping needs (focused, standard, deep)
- Streamlined CLI with intuitive speed options (`--focused`, `--standard`, `--deep`)
- Improved metadata extraction with automatic optimization
### Changed
- Focused metadata extraction to prioritize essential information
- Restructured extractor organization to work with Smart Scan
- Reduced console output for cleaner terminal display
- Improved default settings for various page types
## [1.3.4] - 2024-04-04
### Changed
- Implemented true brute force approach that applies all extraction methods to every page
- Removed all detection logic and thresholds for maximum content extraction
- Removed conditional checks in pagination strategies to try everything on every page
- Simplified pagination handling for more consistent results across different sites
- Enhanced image extraction to capture all images without filtering
- Significantly reduced console output for a cleaner terminal experience
- Streamlined metadata extraction to focus on content and images
## [1.3.3] - 2024-04-03
### Added
- Added comprehensive test script with category and name-based filtering
- Enhanced test runner with detailed results reporting and summaries
- Added JSON summary files for test runs with timestamp and statistics
### Improved
- Enhanced CLI UI with additional colors and visual formatting
- Improved error handling and reporting in test scripts
- Added more visual feedback during image extraction and downloading
## [1.3.2] - 2024-04-03
### Added
- Added multicolored ASCII banner to the CLI
- Enhanced terminal output with colored text and multicolored progress indicators
- Added package version and branding display in CLI
### Fixed
- Fixed image downloading functionality by correcting fs module usage
- Added duplicate image detection to avoid downloading the same image multiple times
- Improved image count accuracy between reported and actual downloaded images
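
The duplicate image detection mentioned above can be pictured with a small sketch. The `ScrapedImage` shape, the `dedupeImages` name, and the rule that two images count as the same when their URLs match after dropping the `#fragment` are all assumptions for illustration. The changelog does not say how Prysm compares images.

```typescript
// Hypothetical image record; Prysm's real output shape may differ.
interface ScrapedImage {
  url: string;
  alt?: string;
}

function dedupeImages(images: ScrapedImage[]): ScrapedImage[] {
  const seen = new Set<string>();
  const unique: ScrapedImage[] = [];

  for (const image of images) {
    // Assumption: the URL without its fragment identifies the image.
    const key = image.url.split("#")[0];
    if (seen.has(key)) continue; // already queued for download
    seen.add(key);
    unique.push(image); // keep the first occurrence, preserve order
  }

  return unique;
}
```

Counting `unique.length` rather than the raw list is one way the reported image count could be kept in line with the number of files actually downloaded.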
## [1.3.1] - 2024-04-04
### Changed
- Relaxed strict filtering thresholds in content verification
- Enhanced URL Parameter pagination with more reliable content loading
- Improved image extraction for sites with lazy-loaded images
- Increased default scroll limits for better content capture
- Added multiple events to trigger lazy-loading (mousemove, DOMContentLoaded, custom events)
- Improved timing delays for better content loading
## [1.3.0] - 2024-04-03
### Added
- Added URL Parameter pagination strategy for sites like CigarScanner
- Implemented hybrid pagination approach that combines URL parameters with scrolling
- Automatic detection of sites that use URL-based pagination (`?page=X`)
- Added `parameter` option to `--paginationStrategy` flag
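
To make the URL parameter strategy concrete, here is a rough sketch of generating `?page=X` URLs and checking whether a URL already carries such a parameter. The function names `buildPageUrls` and `usesPageParameter` are hypothetical, and the detection rule is a simplification. The changelog does not describe how Prysm detects or walks paginated sites.

```typescript
// Hypothetical helpers built on the standard URL API.

// Returns true when the URL already has the pagination parameter.
function usesPageParameter(url: string, param = "page"): boolean {
  return new URL(url).searchParams.has(param);
}

// Builds URLs for pages 1..pages, preserving other query parameters.
function buildPageUrls(baseUrl: string, pages: number, param = "page"): string[] {
  const urls: string[] = [];
  for (let page = 1; page <= pages; page++) {
    const url = new URL(baseUrl);
    url.searchParams.set(param, String(page)); // overwrite any existing value
    urls.push(url.toString());
  }
  return urls;
}
```

A hybrid approach, as described above, would visit each generated URL and then scroll that page before moving to the next one.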
## [1.2.0] - 2024-04-03
### Added
- Added image scraping functionality
- Added image downloading capability
- Images are now included in the JSON output
- New CLI options for controlling image scraping:
  - `--scrapeImages`: Enable image extraction
  - `--downloadImages`: Download images locally
  - `--maxImages`: Control maximum images extracted
  - `--minImageSize`: Filter out images smaller than specified size
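
The effect of `--maxImages` and `--minImageSize` can be sketched as a filter followed by a cap. The changelog does not state the unit or dimension that `--minImageSize` applies to, so this example assumes it is a pixel threshold compared against the smaller of width and height. `ImageInfo`, `ImageOptions`, and `selectImages` are hypothetical names.

```typescript
// Hypothetical types; only the option names mirror the CLI flags.
interface ImageInfo {
  url: string;
  width: number;
  height: number;
}

interface ImageOptions {
  maxImages: number;    // mirrors --maxImages
  minImageSize: number; // mirrors --minImageSize (assumed to be pixels)
}

function selectImages(images: ImageInfo[], options: ImageOptions): ImageInfo[] {
  return images
    // Assumption: an image is "too small" if either side is below the threshold.
    .filter((img) => Math.min(img.width, img.height) >= options.minImageSize)
    // Keep at most maxImages, in page order.
    .slice(0, options.maxImages);
}
```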
## [1.0.1] - 2024-04-03
### Added
- Published package to npm under @pinkpixel organization
- Added npm version and license badges to README
- Added .npmignore file to exclude development files from the package
## [1.1.0] - 2024-04-02
### Added
- Integrated multi-page scraping directly into main CLI
- Added `--pages` parameter to specify number of pages to scrape
- Added `--linkSelector` option for custom link selection
- Added `--allDomains` flag to follow links across domains
- Added 14 specialized scroll strategies for comprehensive content extraction:
  - Standard scroll
  - Chunk scroll (10% increments)
  - Reverse scroll (bottom to top)
  - Pulse scroll (down then slightly up)
  - Zigzag scroll (with horizontal movement)
  - Step scroll (small viewport increments)
  - Bounce scroll (full page bouncing)
  - Hover scroll (mouse movement simulation)
  - Random scroll (random positions)
  - Corner scroll (hits viewport corners)
  - Diagonal scroll (diagonal pattern)
  - Spiral scroll (spiral pattern)
  - Swipe scroll (keyboard-based)
  - Resize scroll (viewport resizing)
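
As one example of what a scroll strategy computes, the chunk scroll (10% increments) can be sketched as a list of target scroll offsets. This is an illustrative guess at the idea, not Prysm's implementation. The function name `chunkScrollPositions` and its `steps` parameter are assumptions.

```typescript
// Hypothetical sketch of the "chunk scroll" idea: visit the page in
// equal increments (10% of the page height per step by default).
function chunkScrollPositions(pageHeight: number, steps = 10): number[] {
  const positions: number[] = [];
  for (let i = 1; i <= steps; i++) {
    // Integer step index avoids accumulating floating-point error.
    positions.push(Math.round((pageHeight * i) / steps));
  }
  return positions;
}
```

In a browser context, each offset could be passed to `window.scrollTo(0, offset)` with a delay between steps. Iterating the same list in reverse gives a bottom-to-top pass similar to the reverse scroll.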
### Changed
- Optimized default scroll parameters (`maxScrolls`: 100, `scrollDelay`: 1000)
- Simplified results output to focus on essential information
- Improved progress indication with dot-based progress bar
- Enhanced content extraction to accumulate results across multiple pagination attempts
- Restructured pagination handling to maximize content discovery
### Removed
- Removed redundant `multi_scrape.js` script
- Removed verbose logging for cleaner output
### Fixed
- Fixed duplicate "Starting scraper" messages
- Fixed scroll strategy implementation for better dynamic content capture
- Fixed content deduplication to maintain unique items
## [1.0.0] - 2024-03-15
### Added
- Initial release of Prysm web scraper
- Structure-aware content extraction
- Cloudflare bypass capability
- Resource blocking for improved performance
- Basic pagination handling
- REST API for remote control
- Basic CLI
### Fixed
- Initial version - no fixes