ag-webscrape
A TypeScript web scraper with intelligent fallback strategy. Attempts direct HTTP fetching first, then falls back to Playwright for anti-scraping protection.
- **Dual Strategy**: Direct fetch first, Playwright fallback
- **Anti-Scraping Detection**: Automatically detects and bypasses common anti-scraping measures
- **Persistent Browser**: Maintains browser instance for faster subsequent scrapes
- **Error Handling**: Comprehensive error detection for 4xx/5xx responses
- **TypeScript Support**: Full type safety and IntelliSense
- **Configurable**: Extensive customization options
```bash
npm install ag-webscrape
```
```typescript
import { WebScraper } from 'ag-webscrape';
const scraper = new WebScraper();
// Scrape a single URL
const result = await scraper.scrape('https://example.com');
console.log(result.html);
// Clean up when done
await scraper.dispose();
```
```typescript
new WebScraper(options?: ScrapingOptions)
```
#### Options
```typescript
interface ScrapingOptions {
  timeout?: number;                 // Request timeout in ms (default: 30000)
  userAgent?: string;               // Custom user agent
  headers?: Record<string, string>; // Additional headers
  retries?: number;                 // Number of retries (default: 3)
  waitForSelector?: string;         // CSS selector to wait for
  waitForTimeout?: number;          // Time to wait in ms (default: 5000)
}
```
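The documented defaults can be pictured as a merge step. The following is a minimal sketch, not the library's implementation: `resolveOptions` and `ResolvedOptions` are hypothetical names, and the precedence (per-call options over constructor options over defaults) is an assumption.

```typescript
// Local copy of the documented options shape, so the sketch is self-contained.
interface ScrapingOptions {
  timeout?: number;
  userAgent?: string;
  headers?: Record<string, string>;
  retries?: number;
  waitForSelector?: string;
  waitForTimeout?: number;
}

// Hypothetical shape after defaults have been applied.
interface ResolvedOptions {
  timeout: number;
  retries: number;
  waitForTimeout: number;
  headers: Record<string, string>;
  userAgent: string | undefined;
  waitForSelector: string | undefined;
}

// Default values as documented in the options table above.
const DEFAULTS = { timeout: 30000, retries: 3, waitForTimeout: 5000 };

// Assumed precedence: per-call options, then constructor options, then defaults.
function resolveOptions(
  base: ScrapingOptions = {},
  perCall: ScrapingOptions = {}
): ResolvedOptions {
  return {
    timeout: perCall.timeout ?? base.timeout ?? DEFAULTS.timeout,
    retries: perCall.retries ?? base.retries ?? DEFAULTS.retries,
    waitForTimeout:
      perCall.waitForTimeout ?? base.waitForTimeout ?? DEFAULTS.waitForTimeout,
    // Header maps are merged key by key; per-call headers win on conflict.
    headers: { ...base.headers, ...perCall.headers },
    userAgent: perCall.userAgent ?? base.userAgent,
    waitForSelector: perCall.waitForSelector ?? base.waitForSelector,
  };
}
```

Under these assumptions, `resolveOptions({ timeout: 1000 })` keeps the one-second timeout and fills in `retries: 3` and `waitForTimeout: 5000`.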
`scraper.scrape(url, options?)` scrapes a single URL using the fallback strategy.
```typescript
const result = await scraper.scrape('https://example.com', {
timeout: 60000,
waitForSelector: '.main-content'
});
```
`scraper.scrapeMultiple(urls, options?)` scrapes multiple URLs in one call and returns an array of results.
```typescript
const results = await scraper.scrapeMultiple([
  'https://example1.com',
  'https://example2.com'
]);
```
`scraper.dispose()` cleans up browser resources. Always call this when done.
```typescript
await scraper.dispose();
```
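Because the browser instance persists between scrapes, a `try`/`finally` wrapper is one way to make sure `dispose()` still runs when scraping throws. This is a sketch only: `withScraper` and `HasDispose` are hypothetical names and not part of the package.

```typescript
// Minimal shape this sketch relies on; the real WebScraper has more methods.
interface HasDispose {
  dispose(): Promise<void>;
}

// Hypothetical helper: runs `work`, then disposes the scraper
// whether `work` resolved or threw.
async function withScraper<S extends HasDispose, T>(
  scraper: S,
  work: (scraper: S) => Promise<T>
): Promise<T> {
  try {
    return await work(scraper);
  } finally {
    await scraper.dispose();
  }
}
```

With the real class this would read `await withScraper(new WebScraper(), (s) => s.scrape('https://example.com'))`.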
```typescript
interface ScrapingResult {
  url: string;                    // Original URL
  html: string;                   // HTML content
  status: number;                 // HTTP status code
  method: 'fetch' | 'playwright'; // Method used
  error?: string;                 // Error message if any
  redirected?: boolean;           // Whether request was redirected
  finalUrl?: string;              // Final URL after redirects
}
```
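When handling many results, it can help to split them into successes and failures using the fields above. The sketch below is illustrative: `partitionResults` is a hypothetical helper, and treating a 4xx/5xx `status` as a failure even when `error` is unset is an assumption about how results are populated.

```typescript
// Local copy of the documented result shape, so the sketch is self-contained.
interface ScrapingResult {
  url: string;
  html: string;
  status: number;
  method: 'fetch' | 'playwright';
  error?: string;
  redirected?: boolean;
  finalUrl?: string;
}

// Hypothetical helper: a result counts as failed if it carries an error
// message or (assumption) an HTTP status of 400 or above.
function partitionResults(results: ScrapingResult[]): {
  ok: ScrapingResult[];
  failed: ScrapingResult[];
} {
  const ok: ScrapingResult[] = [];
  const failed: ScrapingResult[] = [];
  for (const result of results) {
    const isFailure = result.error !== undefined || result.status >= 400;
    (isFailure ? failed : ok).push(result);
  }
  return { ok, failed };
}
```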
```typescript
const scraper = new WebScraper({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  headers: {
    'Accept': 'text/html,application/xhtml+xml',
    'Accept-Language': 'en-US,en;q=0.9'
  }
});
```
```typescript
// Wait for a specific element
const spaResult = await scraper.scrape('https://spa-app.com', {
  waitForSelector: '.dynamic-content'
});
// Wait for a fixed amount of time
const slowResult = await scraper.scrape('https://slow-app.com', {
  waitForTimeout: 10000
});
```
```typescript
const result = await scraper.scrape('https://example.com');
if (result.error) {
  console.error('Scraping failed:', result.error);
} else {
  console.log('Success:', result.html.length, 'characters');
}
```
```typescript
const urls = [
  'https://news.site.com/article1',
  'https://news.site.com/article2',
  'https://news.site.com/article3'
];
const results = await scraper.scrapeMultiple(urls, {
  waitForSelector: '.article-content'
});
results.forEach((result, index) => {
  if (!result.error) {
    console.log(`Article ${index + 1}: ${result.html.length} chars`);
  }
});
```
1. **Direct Fetch**: First attempts an HTTP request using `node-fetch`
2. **Anti-Scraping Detection**: Checks the response for common anti-scraping patterns
3. **Playwright Fallback**: If the direct fetch fails or anti-scraping is detected, uses Playwright
4. **Error Detection**: Monitors for 4xx/5xx responses in both methods
5. **Resource Management**: Maintains browser instance for performance
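The steps above can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the package's source: `scrapeWithFallback`, `PageResponse` and `Outcome` are hypothetical names, the three injected callbacks stand in for `node-fetch`, Playwright and the detection step, and counting a 4xx/5xx status as a failed direct fetch is an assumption.

```typescript
type Method = 'fetch' | 'playwright';

interface PageResponse {
  status: number;
  html: string;
}

interface Outcome extends PageResponse {
  method: Method;
}

// Hypothetical orchestration of the fetch-first, browser-second strategy.
async function scrapeWithFallback(
  url: string,
  directFetch: (url: string) => Promise<PageResponse>,
  browserFetch: (url: string) => Promise<PageResponse>,
  looksBlocked: (res: PageResponse) => boolean
): Promise<Outcome> {
  try {
    const res = await directFetch(url);
    // Accept the cheap result only if it succeeded and is not a block page.
    if (res.status < 400 && !looksBlocked(res)) {
      return { ...res, method: 'fetch' };
    }
  } catch {
    // Network-level failure: fall through to the browser.
  }
  const res = await browserFetch(url);
  return { ...res, method: 'playwright' };
}
```

Injecting the fetchers keeps the control flow testable without a network or a browser.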
The scraper automatically detects and handles:
- Cloudflare protection
- Distil Networks
- PerimeterX
- DataDome
- Akamai Bot Manager
- CAPTCHA challenges
- JavaScript requirement checks
- Rate limiting
- Access denied pages
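Detection of this kind is commonly done by checking the status code and scanning the body for telltale phrases. The sketch below shows the idea only: `looksBlocked` and `BLOCK_MARKERS` are hypothetical, and the marker list and status codes are illustrative assumptions rather than the patterns this package actually uses.

```typescript
// Illustrative markers only; the library's real pattern list is not shown here.
const BLOCK_MARKERS: RegExp[] = [
  /captcha/i,
  /access denied/i,
  /enable javascript/i,
  /checking your browser/i,
];

// Hypothetical check: 403 and 429 are treated as blocking or rate limiting
// (an assumption), otherwise the HTML is scanned for known phrases.
function looksBlocked(status: number, html: string): boolean {
  if (status === 403 || status === 429) {
    return true;
  }
  return BLOCK_MARKERS.some((marker) => marker.test(html));
}
```

Phrase matching is cheap but can produce false positives, for example on an article that merely mentions CAPTCHAs, so a fallback triggered this way costs time rather than correctness.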
- **Fast**: Direct fetch for simple pages
- **Efficient**: Reuses browser instance
- **Robust**: Falls back to Playwright when the direct fetch fails or is blocked
- **Intelligent**: Only uses Playwright when necessary
Check out the `src/example.ts` file for complete usage examples.
MIT
Pull requests welcome! Please ensure TypeScript compilation and tests pass.
For issues and questions, please use the GitHub issue tracker.