@pinkpixel/prysm-mcp

# 📝 Changelog

All notable changes to the Prysm scraper will be documented in this file.

## [1.1.1] - 2024-04-05

### Added

- 🔧 Added support for `PRYSM_OUTPUT_DIR` environment variable to configure output directory
- 🖼️ Added support for `PRYSM_IMAGE_OUTPUT_DIR` environment variable to configure image output directory
- 📚 Updated documentation with detailed explanation of environment variables
- 🧩 Enhanced MCP integration examples in documentation

### Fixed

- 🔄 Removed circular dependency where the package referenced itself
- 🛠️ Improved path handling for better compatibility with MCP and other environments

## [1.4.0] - 2024-04-05

### Added

- 🧠 Smart Scan technology that automatically analyzes page structure for optimized scraping
- 🔍 Content type detection for articles, products, listings, and more
- ⚡ Optimized extraction strategies based on detected page structure
- 🔄 Automated pagination strategy selection
- 🚀 Performance profiles for different scraping needs (focused, standard, deep)
- 💨 Streamlined CLI with intuitive speed options (`--focused`, `--standard`, `--deep`)
- 📊 Improved metadata extraction with automatic optimization

### Changed

- 🎯 Focused metadata extraction to prioritize essential information
- 🧩 Restructured extractor organization to work with Smart Scan
- 🤫 Reduced console output for cleaner terminal display
- ⚙️ Improved default settings for various page types

## [1.3.4] - 2024-04-04

### Changed

- 🔨 Implemented true brute force approach that applies all extraction methods to every page
- 🚫 Removed all detection logic and thresholds for maximum content extraction
- 🧹 Removed conditional checks in pagination strategies to try everything on every page
- 🔄 Simplified pagination handling for more consistent results across different sites
- 🖼️ Enhanced image extraction to capture all images without filtering
- 🤫 Significantly reduced console output for a cleaner terminal experience
- ⚡ Streamlined metadata extraction to focus on content and images

## [1.3.3] - 2024-04-03

### Added

- 🧪 Added comprehensive test script with category and name-based filtering
- 🌈 Enhanced test runner with detailed results reporting and summaries
- 📊 Added JSON summary files for test runs with timestamp and statistics

### Improved

- 🎨 Enhanced CLI UI with additional colors and visual formatting
- 📋 Improved error handling and reporting in test scripts
- 👁️ Added more visual feedback during image extraction and downloading

## [1.3.2] - 2024-04-03

### Added

- 🎨 Added beautiful multicolored ASCII banner to the CLI interface
- 🌈 Enhanced terminal output with colored text and multicolored progress indicators
- ✨ Added package version and branding display in CLI

### Fixed

- 🖼️ Fixed image downloading functionality by correcting fs module usage
- 📊 Added duplicate image detection to avoid downloading the same image multiple times
- 🔢 Improved image count accuracy between reported and actual downloaded images

## [1.3.1] - 2024-04-04

### Changed

- ⚙️ Relaxed strict filtering thresholds in content verification
- 🔄 Enhanced URL Parameter pagination with more reliable content loading
- 🖼️ Improved image extraction for sites with lazy-loaded images
- 🚀 Increased default scroll limits for better content capture
- 🧠 Added multiple events to trigger lazy-loading (mousemove, DOMContentLoaded, custom events)
- ⏱️ Improved timing delays for better content loading

## [1.3.0] - 2024-04-03

### Added

- 📄 Added URL Parameter pagination strategy for sites like CigarScanner
- 🔄 Implemented hybrid pagination approach that combines URL parameters with scrolling
- 🧠 Automatic detection of sites that use URL-based pagination (`?page=X`)
- 🛠️ Added `parameter` option to `--paginationStrategy` flag

## [1.2.0] - 2024-04-03

### Added

- 📸 Added image scraping functionality
- 📥 Added image downloading capability
- 🗃️ Images are now included in the JSON output
- 🔧 New CLI options for controlling image scraping:
  - `--scrapeImages`: Enable image extraction
  - `--downloadImages`: Download images locally
  - `--maxImages`: Control maximum images extracted
  - `--minImageSize`: Filter out images smaller than specified size

## [1.0.1] - 2024-04-03

### Added

- 📦 Published package to npm under @pinkpixel organization
- 🏷️ Added npm version and license badges to README
- 📄 Added .npmignore file to exclude development files from the package

## [1.1.0] - 2024-04-02

### Added

- 🔍 Integrated multi-page scraping directly into main CLI
- ⚙️ Added `--pages` parameter to specify number of pages to scrape
- 🔗 Added `--linkSelector` option for custom link selection
- 🌐 Added `--allDomains` flag to follow links across domains
- 🧠 Added 14 specialized scroll strategies for comprehensive content extraction:
  - Standard scroll
  - Chunk scroll (10% increments)
  - Reverse scroll (bottom to top)
  - Pulse scroll (down then slightly up)
  - Zigzag scroll (with horizontal movement)
  - Step scroll (small viewport increments)
  - Bounce scroll (full page bouncing)
  - Hover scroll (mouse movement simulation)
  - Random scroll (random positions)
  - Corner scroll (hits viewport corners)
  - Diagonal scroll (diagonal pattern)
  - Spiral scroll (spiral pattern)
  - Swipe scroll (keyboard-based)
  - Resize scroll (viewport resizing)

### Changed

- 🔄 Optimized default scroll parameters (maxScrolls: 100, scrollDelay: 1000)
- 📊 Simplified results output to focus on essential information
- 🚀 Improved progress indication with dot-based progress bar
- ⚡ Enhanced content extraction to accumulate results across multiple pagination attempts
- 🧩 Restructured pagination handling to maximize content discovery

### Removed

- 🗑️ Removed redundant multi_scrape.js script
- 🔇 Removed verbose logging for cleaner output

### Fixed

- 🐛 Fixed duplicate "Starting scraper" messages
- 🔧 Fixed scroll strategy implementation for better dynamic content capture
- 🧪 Fixed content deduplication to maintain unique items

## [1.0.0] - 2024-03-15

### Added

- 🌐 Initial release of Prysm web scraper
- 🧠 Structure-aware content extraction
- 🕵️‍♂️ Cloudflare bypass capability
- 🚫 Resource blocking for improved performance
- 🔄 Basic pagination handling
- 🌐 REST API for remote control
- 📑 Basic CLI interface

### Fixed

- Initial version - no fixes