# Bulk Scraping Guide
The `@monostate/node-scraper` package provides powerful bulk scraping capabilities with automatic request queueing, progress tracking, and efficient resource management.
## Key Features
- **Automatic Request Queueing**: Never worry about "too many browser instances" errors. Requests are automatically queued when the browser pool is full.
- **Smart Browser Pooling**: Reuses browser instances for better performance while preventing memory leaks.
- **Real-time Progress Tracking**: Monitor scraping progress with customizable callbacks.
- **Streaming Support**: Process results as they complete for memory-efficient handling of large datasets.
- **Graceful Error Handling**: Continue processing even when some URLs fail, with detailed error reporting.
## Table of Contents
- [Automatic Request Queueing](#automatic-request-queueing)
- [Basic Usage](#basic-usage)
- [Streaming Results](#streaming-results)
- [Configuration Options](#configuration-options)
- [Error Handling](#error-handling)
- [Performance Optimization](#performance-optimization)
- [Real-World Examples](#real-world-examples)
- [Best Practices](#best-practices)
## Automatic Request Queueing
One of the most powerful features of v1.8.0 is automatic request queueing. When you make multiple concurrent requests:
```javascript
import { bulkScrape } from '@monostate/node-scraper';

// Before v1.8.0, this would throw errors when the browser pool was full.
// From v1.8.0 onward, requests are automatically queued!
const urls = Array.from({ length: 100 }, (_, i) => `https://example.com/page${i}`);
// Even with 100 URLs and only 3 browser instances, no errors!
const results = await bulkScrape(urls, {
concurrency: 20, // Request 20 at a time
method: 'puppeteer' // Force browser usage
});
// The browser pool (max 3 instances) automatically queues requests
// No "too many browser instances" errors!
```
### How It Works
1. **Browser Pool**: Maximum of 3 browser instances by default
2. **Request Queue**: When all browsers are busy, new requests wait in a queue
3. **Automatic Processing**: As browsers become available, queued requests are processed
4. **No Errors**: Instead of throwing errors, requests wait their turn (a conceptual sketch of this queueing model follows the list)
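To make the queueing model concrete, here is a minimal sketch of a fixed-size pool with a waiting queue. It is conceptual only, not the library's internal implementation; the class name `SimplePool` and its members are illustrative.
```javascript
// Conceptual sketch only; not the library's internal implementation.
// A fixed number of slots (like the 3-browser limit) plus a queue for excess work.
class SimplePool {
  constructor(maxSlots = 3) {
    this.maxSlots = maxSlots; // analogous to the default browser limit
    this.active = 0;
    this.queue = [];          // pending tasks wait here instead of throwing
  }

  run(task) {
    return new Promise((resolve, reject) => {
      const attempt = async () => {
        this.active++;
        try {
          resolve(await task());
        } catch (err) {
          reject(err);
        } finally {
          this.active--;
          const next = this.queue.shift(); // hand the freed slot to the next waiter
          if (next) next();
        }
      };
      if (this.active < this.maxSlots) {
        attempt();
      } else {
        this.queue.push(attempt); // queue instead of erroring out
      }
    });
  }
}

// Usage: only 3 tasks run at once; the remaining 7 wait their turn.
const pool = new SimplePool(3);
await Promise.all(
  Array.from({ length: 10 }, (_, i) =>
    pool.run(() => new Promise(resolve => setTimeout(() => resolve(i), 100)))
  )
);
```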
### Benefits
- **No Manual Retry Logic**: The SDK handles queueing automatically
- **Memory Efficient**: Only 3 browsers maximum, preventing OOM errors
- **Optimal Performance**: Browsers are reused for faster processing
- **Graceful Degradation**: System remains stable under high load
### Works Everywhere
The automatic queueing works with all scraping methods, not just bulk operations:
```javascript
import { smartScrape } from '@monostate/node-scraper';
// Make 50 parallel requests with Puppeteer
// Only 3 browsers will be created; the rest will queue automatically
const promises = [];
for (let i = 0; i < 50; i++) {
promises.push(
smartScrape(`https://example.com/page${i}`, { method: 'puppeteer' })
);
}
// All 50 requests complete successfully!
// No "too many browser instances" errors
const results = await Promise.all(promises);
console.log(`Completed ${results.length} requests with only 3 browsers!`);
```
## Basic Usage
### Simple Bulk Scraping
```javascript
import { bulkScrape } from '@monostate/node-scraper';
const urls = [
'https://example1.com',
'https://example2.com',
'https://example3.com'
];
const results = await bulkScrape(urls);
// Access results
results.success.forEach(result => {
console.log(`URL: ${result.url}`);
console.log(`Method: ${result.method}`);
console.log(`Duration: ${result.duration}ms`);
console.log(`Content: ${result.content.substring(0, 100)}...`);
});
// Check failures
results.failed.forEach(failure => {
console.log(`Failed URL: ${failure.url}`);
console.log(`Error: ${failure.error}`);
});
```
### With Progress Tracking
```javascript
const results = await bulkScrape(urls, {
progressCallback: (progress) => {
console.log(`Progress: ${progress.percentage.toFixed(1)}%`);
console.log(`Current: ${progress.current}`);
console.log(`Processed: ${progress.processed}/${progress.total}`);
}
});
```
## Streaming Results
For large datasets, streaming allows you to process results as they complete:
### Basic Streaming
```javascript
import { bulkScrapeStream } from '@monostate/node-scraper';
const stats = await bulkScrapeStream(urls, {
onResult: (result) => {
console.log(`Success: ${result.url}`);
// Process immediately - save to database, write to file, etc.
},
onError: (error) => {
console.log(`Failed: ${error.url} - ${error.error}`);
}
});
console.log(`Total processed: ${stats.processed}`);
console.log(`Success rate: ${(stats.successful / stats.total * 100).toFixed(1)}%`);
```
### Stream to File
```javascript
import { createWriteStream } from 'fs';
import { bulkScrapeStream } from '@monostate/node-scraper';
const outputStream = createWriteStream('results.jsonl');
await bulkScrapeStream(urls, {
onResult: async (result) => {
// Write each result as a JSON line
outputStream.write(JSON.stringify(result) + '\n');
},
onError: async (error) => {
// Log errors to a separate file
outputStream.write(JSON.stringify({ ...error, isError: true }) + '\n');
}
});
outputStream.end();
```
### Stream to Database
```javascript
import { bulkScrapeStream } from '@monostate/node-scraper';
import { db } from './database.js';
await bulkScrapeStream(urls, {
concurrency: 10,
onResult: async (result) => {
await db.scraped_pages.insert({
url: result.url,
content: result.content,
method: result.method,
duration_ms: result.duration,
scraped_at: new Date(result.timestamp)
});
},
onError: async (error) => {
await db.scrape_errors.insert({
url: error.url,
error_message: error.error,
failed_at: new Date(error.timestamp)
});
}
});
```
## Configuration Options
### Concurrency Control
```javascript
// Low concurrency for rate-limited sites
const results = await bulkScrape(urls, {
concurrency: 2, // Only 2 parallel requests
timeout: 30000 // 30 second timeout per request
});
// High concurrency for your own servers
const internalResults = await bulkScrape(internalUrls, {
concurrency: 20, // 20 parallel requests
timeout: 5000 // 5 second timeout
});
```
### Method Selection
```javascript
// Force all URLs to use Puppeteer
const results = await bulkScrape(urls, {
method: 'puppeteer',
concurrency: 3 // Puppeteer is resource-intensive
});
// Force direct fetch for known static sites
const staticResults = await bulkScrape(staticUrls, {
method: 'direct',
concurrency: 50 // Direct fetch can handle high concurrency
});
```
### Continue vs Stop on Error
```javascript
// Continue processing even if some URLs fail (default)
const results = await bulkScrape(urls, {
continueOnError: true
});
// Stop immediately on first error
try {
const results = await bulkScrape(urls, {
continueOnError: false
});
} catch (error) {
console.error('Bulk scraping stopped due to error:', error);
}
```
## Error Handling
### Retry Failed URLs
```javascript
// First pass
const results = await bulkScrape(urls);
// Retry failures with different method
if (results.failed.length > 0) {
const failedUrls = results.failed.map(f => f.url);
const retryResults = await bulkScrape(failedUrls, {
method: 'puppeteer', // Try with full browser
timeout: 60000 // Longer timeout
});
}
```
### Custom Error Handling
```javascript
await bulkScrapeStream(urls, {
onResult: (result) => {
// Process successful results
},
onError: async (error) => {
// Categorize and handle different error types
if (error.error.includes('timeout')) {
await logTimeoutError(error);
} else if (error.error.includes('404')) {
await handle404(error);
} else {
await logGeneralError(error);
}
}
});
```
## Performance Optimization
### Dynamic Concurrency
```javascript
// Start with low concurrency and increase based on success rate
let concurrency = 5;
const batchSize = 100;
for (let i = 0; i < urls.length; i += batchSize) {
const batch = urls.slice(i, i + batchSize);
const results = await bulkScrape(batch, { concurrency });
const successRate = results.stats.successful / batch.length;
// Adjust concurrency based on success rate
if (successRate > 0.95) {
concurrency = Math.min(concurrency + 2, 20);
} else if (successRate < 0.8) {
concurrency = Math.max(concurrency - 2, 2);
}
console.log(`Batch complete. Success rate: ${(successRate * 100).toFixed(1)}%. New concurrency: ${concurrency}`);
}
```
### Memory-Efficient Processing
```javascript
// Process in chunks to avoid memory issues
async function processLargeDataset(allUrls) {
const chunkSize = 1000;
const results = {
successful: 0,
failed: 0,
totalTime: 0
};
for (let i = 0; i < allUrls.length; i += chunkSize) {
const chunk = allUrls.slice(i, i + chunkSize);
console.log(`Processing chunk ${i / chunkSize + 1} of ${Math.ceil(allUrls.length / chunkSize)}`);
const chunkResults = await bulkScrape(chunk, {
concurrency: 10,
progressCallback: (p) => {
const overallProgress = ((i + p.processed) / allUrls.length * 100).toFixed(1);
console.log(`Overall progress: ${overallProgress}%`);
}
});
results.successful += chunkResults.stats.successful;
results.failed += chunkResults.stats.failed;
results.totalTime += chunkResults.stats.totalTime;
// Optional: Process results immediately to free memory
await processChunkResults(chunkResults);
}
return results;
}
```
## Real-World Examples
### E-commerce Price Monitoring
```javascript
import { bulkScrapeStream } from '@monostate/node-scraper';
const productUrls = [
'https://shop1.com/product/laptop-123',
'https://shop2.com/items/laptop-123',
// ... hundreds of product URLs
];
const priceData = [];
await bulkScrapeStream(productUrls, {
concurrency: 5,
onResult: async (result) => {
// Extract price from scraped content
const content = JSON.parse(result.content);
const price = extractPrice(content);
priceData.push({
url: result.url,
price: price,
timestamp: result.timestamp
});
},
progressCallback: (progress) => {
process.stdout.write(`\rChecking prices: ${progress.percentage.toFixed(0)}%`);
}
});
// Analyze price data
const avgPrice = priceData.reduce((sum, p) => sum + p.price, 0) / priceData.length;
console.log(`\nAverage price: $${avgPrice.toFixed(2)}`);
```
### News Aggregation
```javascript
import { bulkScrape } from '@monostate/node-scraper';
const newsUrls = [
'https://news1.com/latest',
'https://news2.com/today',
'https://news3.com/breaking'
];
const results = await bulkScrape(newsUrls, {
concurrency: 3,
timeout: 10000
});
// Extract and combine articles
const allArticles = [];
results.success.forEach(result => {
const content = JSON.parse(result.content);
const articles = extractArticles(content);
allArticles.push(...articles.map(a => ({
...a,
source: new URL(result.url).hostname
})));
});
// Sort by date and deduplicate
const uniqueArticles = deduplicateArticles(allArticles);
console.log(`Found ${uniqueArticles.length} unique articles`);
```
### SEO Analysis
```javascript
import { bulkScrape } from '@monostate/node-scraper';
async function analyzeSEO(urls) {
const results = await bulkScrape(urls, {
concurrency: 10,
method: 'auto'
});
const seoData = results.success.map(result => {
const content = JSON.parse(result.content);
return {
url: result.url,
title: content.title,
metaDescription: content.metaDescription,
headings: content.headings,
loadTime: result.duration,
method: result.method,
hasStructuredData: !!content.structuredData
};
});
// Generate SEO report
const avgLoadTime = seoData.reduce((sum, d) => sum + d.loadTime, 0) / seoData.length;
const missingTitles = seoData.filter(d => !d.title).length;
const missingDescriptions = seoData.filter(d => !d.metaDescription).length;
return {
totalAnalyzed: seoData.length,
avgLoadTime: Math.round(avgLoadTime),
missingTitles,
missingDescriptions,
details: seoData
};
}
```
## Best Practices
### 1. Respect Rate Limits
```javascript
import { smartScrape } from '@monostate/node-scraper';

// Add delays between requests for external sites
async function respectfulBulkScrape(urls, delayMs = 1000) {
const results = [];
for (const url of urls) {
const result = await smartScrape(url);
results.push(result);
// Wait before next request
await new Promise(resolve => setTimeout(resolve, delayMs));
}
return results;
}
```
### 2. Handle Different Content Types
```javascript
const results = await bulkScrape(mixedUrls, {
progressCallback: (progress) => {
console.log(`Processing: ${progress.current}`);
}
});
// Separate different content types
const pdfResults = results.success.filter(r => r.contentType?.includes('pdf'));
const htmlResults = results.success.filter(r => !r.contentType?.includes('pdf'));
console.log(`Found ${pdfResults.length} PDFs and ${htmlResults.length} web pages`);
```
### 3. Monitor Resource Usage
```javascript
import { BNCASmartScraper } from '@monostate/node-scraper';
import browserPool from '@monostate/node-scraper/browser-pool.js';
const scraper = new BNCASmartScraper({ verbose: true });
// Monitor memory usage during bulk scraping
const memoryUsage = [];
const interval = setInterval(() => {
memoryUsage.push(process.memoryUsage());
// Log browser pool statistics
const poolStats = browserPool.getStats();
console.log('Browser Pool:', {
active: poolStats.busyCount,
idle: poolStats.idleCount,
queued: poolStats.queueLength,
totalCreated: poolStats.created,
reused: poolStats.reused
});
}, 1000);
try {
const results = await scraper.bulkScrape(urls, {
concurrency: 10,
progressCallback: (p) => {
const mem = process.memoryUsage();
console.log(`Progress: ${p.percentage.toFixed(1)}% | Memory: ${(mem.heapUsed / 1024 / 1024).toFixed(1)}MB`);
}
});
} finally {
clearInterval(interval);
await scraper.cleanup();
}
```
### 4. Implement Circuit Breaker
```javascript
import { smartScrape } from '@monostate/node-scraper';

class CircuitBreaker {
constructor(threshold = 5, timeout = 60000) {
this.failureCount = 0;
this.threshold = threshold;
this.timeout = timeout;
this.state = 'CLOSED';
this.nextAttempt = Date.now();
}
async call(fn) {
if (this.state === 'OPEN') {
if (Date.now() < this.nextAttempt) {
throw new Error('Circuit breaker is OPEN');
}
this.state = 'HALF_OPEN';
}
try {
const result = await fn();
this.onSuccess();
return result;
} catch (error) {
this.onFailure();
throw error;
}
}
onSuccess() {
this.failureCount = 0;
this.state = 'CLOSED';
}
onFailure() {
this.failureCount++;
if (this.failureCount >= this.threshold) {
this.state = 'OPEN';
this.nextAttempt = Date.now() + this.timeout;
}
}
}
// Use with bulk scraping
const breaker = new CircuitBreaker();
const results = [];
for (const url of urls) {
try {
const result = await breaker.call(() => smartScrape(url));
results.push(result);
} catch (error) {
console.error(`Failed to scrape ${url}: ${error.message}`);
}
}
```
## Performance Tips
1. **Use appropriate concurrency**: Start with 5-10 concurrent requests and adjust based on performance
2. **Choose the right method**: Use `direct` for static sites, `puppeteer` for SPAs
3. **Stream large datasets**: Use `bulkScrapeStream` for datasets over 1000 URLs
4. **Monitor memory usage**: Process results immediately in streaming mode
5. **Implement retry logic**: Some failures are temporary
6. **Cache results**: Avoid re-scraping unchanged content (see the caching sketch after this list)
7. **Use timeouts**: Prevent hanging requests from blocking progress
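As a starting point for tip 6, here is a minimal in-memory caching sketch. The helper `cachedScrape` and its TTL handling are hypothetical, not part of the package's API.
```javascript
import { smartScrape } from '@monostate/node-scraper';

// Hypothetical caching wrapper: serves recent results from memory instead of re-scraping.
const cache = new Map();

async function cachedScrape(url, options = {}, ttlMs = 15 * 60 * 1000) {
  const hit = cache.get(url);
  if (hit && Date.now() - hit.storedAt < ttlMs) {
    return hit.result; // still fresh enough, skip the network entirely
  }
  const result = await smartScrape(url, options);
  cache.set(url, { result, storedAt: Date.now() });
  return result;
}
```
For long-running processes, consider persisting the cache (for example to disk or a key-value store) instead of keeping it only in memory.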
## Troubleshooting
### High Memory Usage
If you experience high memory usage:
1. Reduce concurrency
2. Use streaming mode
3. Process results immediately
4. Call `cleanup()` periodically
### Slow Performance
If scraping is slow:
1. Increase concurrency (if server allows)
2. Use `direct` method when possible
3. Reduce timeout values
4. Check network connectivity
### Many Failures
If many URLs fail:
1. Check if sites require authentication
2. Verify URLs are correct
3. Use `puppeteer` method for JavaScript-heavy sites
4. Implement retry logic with backoff (see the sketch below)
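A simple retry helper with exponential backoff might look like the hypothetical sketch below; the function name and attempt limits are illustrative, not part of the package.
```javascript
import { smartScrape } from '@monostate/node-scraper';

// Hypothetical helper: retry a single URL with exponentially growing delays.
async function scrapeWithBackoff(url, options = {}, maxAttempts = 4) {
  let delayMs = 1000;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await smartScrape(url, options);
    } catch (error) {
      if (attempt === maxAttempts) throw error; // give up after the final attempt
      await new Promise(resolve => setTimeout(resolve, delayMs));
      delayMs *= 2; // double the wait each time: 1s, 2s, 4s, ...
    }
  }
}
```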