# web-scraper-pro

[![npm version](https://badge.fury.io/js/web-scraper-pro.svg)](https://badge.fury.io/js/web-scraper-pro) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Node.js](https://img.shields.io/badge/Node.js-14%2B-green.svg)](https://nodejs.org/)

A professional web scraper powered by Puppeteer and Mozilla Readability. Extract clean, readable content from any website with full TypeScript support and comprehensive error handling.

## 🚀 Installation

```bash
npm install web-scraper-pro
```

## 📦 Quick Start

```javascript
const WebScraper = require("web-scraper-pro");

// Method 1: Default output directory (./output)
const scraper = new WebScraper();
const result = await scraper.scrapeAndSave("https://example.com");

// Method 2: Custom output directory via constructor
const scraper2 = new WebScraper({ outputDir: "./my-downloads" });
const result2 = await scraper2.scrapeAndSave("https://example.com");

// Method 3: Set output directory after creation
const scraper3 = new WebScraper();
scraper3.setOutputDir("./custom-folder");
const result3 = await scraper3.scrapeAndSave("https://example.com");

// Method 4: Extract content only (no files saved)
const content = await scraper.scrapeContentOnly("https://example.com", [
  "title",
  "content",
]);
console.log(content);
```

## 📁 Project Structure

```
web-scraper-pro/
├── src/
│   ├── scraper.js        # Main scraper implementation
│   └── scraper.d.ts      # TypeScript definitions
├── output/               # Generated output files
│   ├── scraped_*.html    # Raw HTML files
│   └── extracted_*.txt   # Extracted content files
├── test/                 # Test files
├── index.js              # Main entry point
├── package.json
└── README.md
```

## 🔧 Usage

### Command Line Interface

Run with the default URL (Wikipedia):

```bash
node src/scraper.js
```

Run with a custom URL:

```bash
node src/scraper.js "https://your-target-url.com"
```

### Setting Custom Output Directory

There are multiple ways to specify where output files should be saved:

#### 1. Via Constructor

```javascript
const WebScraper = require("web-scraper-pro");
const scraper = new WebScraper({ outputDir: "./my-custom-folder" });
```

#### 2. Via setOutputDir() Method

```javascript
const scraper = new WebScraper();
scraper.setOutputDir("./downloads/scraped-data");

// Method chaining is supported
scraper.setOutputDir("./downloads").getOutputDir(); // Returns current path
```

#### 3. Using Absolute Paths

```javascript
const scraper = new WebScraper();
scraper.setOutputDir("C:/Users/Desktop/scraped-content");
// or on Linux/Mac
scraper.setOutputDir("/home/user/scraped-content");
```

#### 4. Check Current Output Directory

```javascript
const currentPath = scraper.getOutputDir();
console.log(`Files will be saved to: ${currentPath}`);
```

**Note:** The output directory will be created automatically if it doesn't exist.

## 📝 API Reference

### Constructor

Creates a new WebScraper instance with optional configuration.

```javascript
const WebScraper = require("web-scraper-pro");

// Default output directory (./output)
const scraper = new WebScraper();

// Custom output directory
const customScraper = new WebScraper({ outputDir: "./my-output" });
```

**Parameters:**

- `options` (object, optional): Configuration options
  - `outputDir` (string): Custom output directory path

### setOutputDir(path)

Sets a custom output directory for saved files.

```javascript
scraper.setOutputDir("./custom-folder");

// Supports method chaining
scraper.setOutputDir("./downloads").getOutputDir();
```

**Parameters:**

- `path` (string): Absolute or relative path to output directory

**Returns:** WebScraper instance (for method chaining)

### getOutputDir()

Gets the current output directory path.

```javascript
const currentPath = scraper.getOutputDir();
console.log(currentPath); // e.g., "C:/Users/Name/project/output"
```

**Returns:** String with current output directory path

### scrapeAndSave(url, returnFields)

Scrapes a webpage and saves both HTML and extracted content to files.

```javascript
const result = await scraper.scrapeAndSave("https://example.com", [
  "url",
  "title",
  "content",
]);

console.log(result.data.title);     // Extracted title
console.log(result.files.htmlFile); // Path to saved HTML file
console.log(result.files.txtFile);  // Path to saved text file
```

**Parameters:**

- `url` (string): The URL to scrape
- `returnFields` (array, optional): Data fields to include in output
  - Default: `['url','title','siteName','length','extractedAt','content']`

**Returns:** Object with `success`, `data`, `files`, and `duration` properties
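To make the return shape concrete, here is a minimal sketch that wraps `scrapeAndSave()` and reports the documented `success`, `data`, `files`, and `duration` fields. It is illustrative only: the `scrapeWithReporting` helper name is made up, the failure path assumes errors either reject the promise or come back as `success: false`, and `duration` is assumed to be in milliseconds.

```javascript
const WebScraper = require("web-scraper-pro");

// Hypothetical helper (not part of the package): wraps scrapeAndSave()
// and inspects the result fields documented above.
async function scrapeWithReporting(url) {
  const scraper = new WebScraper({ outputDir: "./output" });
  try {
    const result = await scraper.scrapeAndSave(url, ["url", "title", "content"]);
    if (result.success) {
      // Assumption: `duration` is reported in milliseconds.
      console.log(`Scraped "${result.data.title}" in ${result.duration} ms`);
      console.log(`HTML file: ${result.files.htmlFile}`);
      console.log(`Text file: ${result.files.txtFile}`);
    } else {
      // Assumption: some failures may surface as success === false instead of throwing.
      console.warn(`Scrape did not succeed for ${url}`);
    }
    return result;
  } catch (err) {
    console.error(`Scrape failed for ${url}: ${err.message}`);
    return null;
  }
}

scrapeWithReporting("https://example.com");
```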
### scrapeContentOnly(url, fields)

Scrapes a webpage and returns formatted content as a string (no files created).

```javascript
const { scrapeContentOnly } = require("web-scraper-pro");

const content = await scrapeContentOnly("https://example.com", [
  "title",
  "url",
  "content",
]);

console.log(content);
// Output:
// TITLE: Page Title
// URL: https://example.com
//
// CONTENT:
// Main content text...
```

**Parameters:**

- `url` (string): The URL to scrape
- `fields` (array): Data fields to include in output string

**Returns:** Formatted string with extracted content

### Available Data Fields

- `url` - The webpage URL
- `title` - Page title
- `content` - Main content (cleaned by Readability)
- `siteName` - Website name
- `length` - Content length in characters
- `extractedAt` - Extraction timestamp
- `excerpt` - Short content summary

## ⚙️ Configuration

### TypeScript Support

Full TypeScript definitions are included:

```typescript
import {
  scrapeAndSave,
  scrapeContentOnly,
  ExtractedData,
  ScrapeResult,
} from "web-scraper-pro";

const result: ScrapeResult = await scrapeAndSave("https://example.com", {
  saveText: true,
  saveHtml: true,
});

const content: string = await scrapeContentOnly("https://example.com", [
  "title",
  "content",
]);
```

### Custom Puppeteer Settings

The scraper uses optimized Puppeteer settings by default, but you can customize them by modifying the source:

```javascript
// In src/scraper.js
browser = await puppeteer.launch({
  headless: false,        // Show browser for debugging
  args: ["--no-sandbox"], // Additional Chrome flags
  timeout: 60000,         // Custom timeout
});
```

### Timeout Configuration

```javascript
await page.goto(url, {
  waitUntil: "networkidle2",
  timeout: 60000, // Increase timeout to 60 seconds
});
```

## 📄 Output Formats

### File Output

Generated files use descriptive naming with timestamps:

```
extracted_example_com_2025-09-08T12-34-56-789Z.txt
scraped_example_com_2025-09-08T12-34-56-789Z.html
```

**Text file format:**

```
URL: https://example.com
TITLE: Page Title
SITE_NAME: Site Name
LENGTH: 1000 characters
EXTRACTED_AT: 2025-09-08T12:34:56.789Z

CONTENT:
Main content extracted by Mozilla Readability...
```

### Programmatic Output

```javascript
{
  url: "https://example.com",
  title: "Page Title",
  content: "Main content...",
  siteName: "Site Name",
  length: 1000,
  extractedAt: "2025-09-08T12:34:56.789Z",
  excerpt: "Brief summary..."
}
```

## 🛠️ Troubleshooting

### Puppeteer Installation Issues

**Linux:**

```bash
sudo apt-get install -y gconf-service libasound2 libatk1.0-0 libcairo-gobject2 libdrm2 libgtk-3-0 libnspr4 libnss3 libx11-xcb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6
```

**macOS (M1/M2):**

```bash
arch -x86_64 npm install puppeteer
```

**Windows:**
Ensure Visual Studio Build Tools are installed for native dependencies.

### Bot Detection

Some websites block automated scraping. Try these solutions:

```javascript
// Add user agent and delays
await page.setUserAgent(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
);
await page.setViewport({ width: 1366, height: 768 });
await page.waitForTimeout(2000); // Add delay
```

### Memory Issues

For large-scale scraping, implement proper cleanup:

```javascript
// Close browser instances properly
await browser.close();

// Monitor memory usage
process.on("exit", () =>
  console.log("Process memory usage:", process.memoryUsage())
);
```
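Building on the cleanup advice above, one way to keep memory predictable on larger jobs is to scrape URLs sequentially and isolate failures per URL. The sketch below is illustrative rather than part of the package: `scrapeAll`, `sleep`, the `./batch-output` directory, and the two-second politeness delay are assumptions layered on top of the documented `scrapeAndSave()` API.

```javascript
const WebScraper = require("web-scraper-pro");

// Hypothetical batch helper (not part of the package).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeAll(urls) {
  const scraper = new WebScraper({ outputDir: "./batch-output" });
  const results = [];

  for (const url of urls) {
    try {
      const result = await scraper.scrapeAndSave(url, ["url", "title", "content"]);
      results.push({ url, ok: true, txtFile: result.files.txtFile });
    } catch (err) {
      // Keep going: one blocked or unreachable site should not abort the batch.
      results.push({ url, ok: false, error: err.message });
    }
    await sleep(2000); // arbitrary politeness delay between requests
  }

  return results;
}

scrapeAll(["https://example.com", "https://example.org"]).then((summary) => {
  console.table(summary);
});
```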
## ✨ Features

- ✅ **JavaScript-rendered content** - Handles dynamic pages
- ✅ **Clean content extraction** - Removes ads, sidebars, navigation
- ✅ **Automatic file naming** - Timestamp-based file organization
- ✅ **Comprehensive error handling** - Robust retry logic
- ✅ **TypeScript support** - Full type definitions included
- ✅ **Dual output modes** - Files + data or string-only
- ✅ **Component architecture** - Modular, maintainable codebase
- ✅ **Professional logging** - Detailed console feedback

## 🧪 Tested Websites

- ✅ **Wikipedia** (multiple languages)
- ✅ **News websites** (BBC, CNN, etc.)
- ✅ **Gaming sites** (Perfect World Games, IGN)
- ✅ **Blog platforms** (Medium, Dev.to)
- ✅ **Documentation sites** (MDN, official docs)
- ✅ **E-commerce** (product pages)

### Running Tests

```bash
# Test core functionality
npm test

# Test with specific examples
node test-new-functions.js
node test-gaming-news.js
```

## 📊 Performance

- **Speed**: ~3-5 seconds per page (includes browser startup)
- **Memory**: ~50-100 MB per browser instance
- **Reliability**: Built-in retry logic for network issues
- **Output size**: Typical compression ratio of 80-90% vs. raw HTML

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Commit changes: `git commit -am 'Add feature'`
4. Push to branch: `git push origin feature-name`
5. Submit a pull request

## 📝 License

MIT License - see [LICENSE](LICENSE) file for details.

## 🔗 Links

- **NPM Package**: https://www.npmjs.com/package/web-scraper-pro
- **GitHub Repository**: https://github.com/nanpapu/web-scraper-pro
- **Issues & Support**: https://github.com/nanpapu/web-scraper-pro/issues

---

Made with ❤️ by [Nanpapu](https://github.com/nanpapu)