# ag-webscrape

A TypeScript web scraper with an intelligent fallback strategy. Attempts direct HTTP fetching first, then falls back to Playwright when anti-scraping protection is detected.

## Features

- **Dual Strategy**: Direct fetch first, Playwright fallback
- **Anti-Scraping Detection**: Automatically detects and bypasses common anti-scraping measures
- **Persistent Browser**: Maintains a browser instance for faster subsequent scrapes
- **Error Handling**: Comprehensive error detection for 4xx/5xx responses
- **TypeScript Support**: Full type safety and IntelliSense
- **Configurable**: Extensive customization options

## Installation

```bash
npm install ag-webscrape
```

## Quick Start

```typescript
import { WebScraper } from 'ag-webscrape';

const scraper = new WebScraper();

// Scrape a single URL
const result = await scraper.scrape('https://example.com');
console.log(result.html);

// Clean up when done
await scraper.dispose();
```

## API Reference

### WebScraper Class

#### Constructor

```typescript
new WebScraper(options?: ScrapingOptions)
```

#### Options

```typescript
interface ScrapingOptions {
  timeout?: number;                 // Request timeout in ms (default: 30000)
  userAgent?: string;               // Custom user agent
  headers?: Record<string, string>; // Additional headers
  retries?: number;                 // Number of retries (default: 3)
  waitForSelector?: string;         // CSS selector to wait for
  waitForTimeout?: number;          // Time to wait in ms (default: 5000)
}
```

#### Methods

##### `scrape(url: string, options?: ScrapingOptions): Promise<ScrapingResult>`

Scrapes a single URL with the fallback strategy.

```typescript
const result = await scraper.scrape('https://example.com', {
  timeout: 60000,
  waitForSelector: '.main-content'
});
```

##### `scrapeMultiple(urls: string[], options?: ScrapingOptions): Promise<ScrapingResult[]>`

Scrapes multiple URLs efficiently.

```typescript
const results = await scraper.scrapeMultiple([
  'https://example1.com',
  'https://example2.com'
]);
```

##### `dispose(): Promise<void>`

Cleans up browser resources. Always call this when done.

```typescript
await scraper.dispose();
```

#### Result Object

```typescript
interface ScrapingResult {
  url: string;                      // Original URL
  html: string;                     // HTML content
  status: number;                   // HTTP status code
  method: 'fetch' | 'playwright';   // Method used
  error?: string;                   // Error message, if any
  redirected?: boolean;             // Whether the request was redirected
  finalUrl?: string;                // Final URL after redirects
}
```

## Advanced Usage

### Custom Headers and User Agent

```typescript
const scraper = new WebScraper({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  headers: {
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9'
  }
});
```

### Waiting for Content

```typescript
// Wait for a specific element
const result = await scraper.scrape('https://spa-app.com', {
  waitForSelector: '.dynamic-content'
});

// Wait for a fixed amount of time
const slowResult = await scraper.scrape('https://slow-app.com', {
  waitForTimeout: 10000
});
```

### Error Handling

```typescript
const result = await scraper.scrape('https://example.com');

if (result.error) {
  console.error('Scraping failed:', result.error);
} else {
  console.log('Success:', result.html.length, 'characters');
}
```

### Batch Scraping

```typescript
const urls = [
  'https://news.site.com/article1',
  'https://news.site.com/article2',
  'https://news.site.com/article3'
];

const results = await scraper.scrapeMultiple(urls, {
  waitForSelector: '.article-content'
});

results.forEach((result, index) => {
  if (!result.error) {
    console.log(`Article ${index + 1}: ${result.html.length} chars`);
  }
});
```

## How It Works

1. **Direct Fetch**: First attempts an HTTP request using `node-fetch`
2. **Anti-Scraping Detection**: Checks the response for common anti-scraping patterns
3. **Playwright Fallback**: If the direct fetch fails or anti-scraping is detected, uses Playwright
4. **Error Detection**: Monitors for 4xx/5xx responses in both methods
5. **Resource Management**: Maintains a browser instance for performance

## Anti-Scraping Protection

The scraper automatically detects and handles:

- Cloudflare protection
- DistilNetworks
- PerimeterX
- DataDome
- Akamai Bot Manager
- CAPTCHA challenges
- JavaScript requirement checks
- Rate limiting
- Access denied pages

## Performance

- **Fast**: Direct fetch for simple pages
- **Efficient**: Reuses the browser instance
- **Robust**: Fallback ensures a high success rate
- **Intelligent**: Only uses Playwright when necessary

## Examples

Check out the `src/example.ts` file for complete usage examples.

## License

MIT

## Contributing

Pull requests welcome! Please ensure TypeScript compilation and tests pass.

## Support

For issues and questions, please use the GitHub issue tracker.
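## Appendix: Detection Heuristic Sketch

The anti-scraping detection step described above can be approximated with a simple heuristic over the fetched HTML and status code. The following is an illustrative sketch only: the `looksProtected` helper, the pattern list, and the thresholds are assumptions for demonstration, not ag-webscrape's actual internals.

```typescript
// Hypothetical sketch of an anti-scraping detector: scans a fetched
// response for markers commonly left by protection vendors. Pattern
// list and thresholds are illustrative, not the package's real code.
const PROTECTION_PATTERNS: RegExp[] = [
  /cloudflare/i,               // Cloudflare challenge pages
  /cf-browser-verification/i,  // Cloudflare browser check
  /distil/i,                   // DistilNetworks
  /perimeterx/i,               // PerimeterX
  /datadome/i,                 // DataDome
  /akamai/i,                   // Akamai Bot Manager
  /captcha/i,                  // Generic CAPTCHA challenges
  /enable javascript/i,        // JavaScript requirement checks
  /access denied/i,            // Access denied pages
];

// Returns true when the response looks like a protection page rather
// than real content: a blocked/rate-limited status, a suspiciously
// tiny body, or a matching vendor marker in the HTML.
function looksProtected(html: string, status: number): boolean {
  if (status === 403 || status === 429) return true; // blocked or rate limited
  if (html.length < 512) return true;                // stub/challenge-sized page
  return PROTECTION_PATTERNS.some((pattern) => pattern.test(html));
}

console.log(looksProtected('<html>Checking your browser... cloudflare</html>', 200)); // true
console.log(looksProtected('<p>' + 'real article text '.repeat(100) + '</p>', 200));  // false
```

A detector like this would run after the direct `node-fetch` attempt; a `true` result is what triggers the Playwright fallback in step 3 above.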