# @monostate/node-scraper
Intelligent web scraping with AI Q&A, PDF support and multi-level fallback system - 11x faster than traditional scrapers
> **Lightning-fast web scraping with intelligent fallback system - 11.35x faster than Firecrawl**
[npm](https://www.npmjs.com/package/@monostate/node-scraper) · [Test Results](../../test-results/) · [License](../../LICENSE) · [Node.js 18+](https://nodejs.org/)
## 📦 Installation

```bash
npm install @monostate/node-scraper
# or
yarn add @monostate/node-scraper
# or
pnpm add @monostate/node-scraper
```
## 🎉 What's New

**Fixed in v1.8.1**: Critical production fix - `browser-pool.js` is now included in the npm package.
**New in v1.8.0**: Bulk scraping with automatic request queueing, progress tracking, and streaming results! Process hundreds of URLs efficiently. Plus critical memory leak fix with browser pooling.
**Fixed in v1.7.0**: Critical cross-platform compatibility fix - binaries are now correctly downloaded per platform instead of being bundled.
**New in v1.6.0**: Method override support! Force specific scraping methods with `method` parameter for testing and optimization.
**New in v1.5.0**: AI-powered Q&A! Ask questions about any website using OpenRouter, OpenAI, or built-in AI.
**Also in v1.3.0**: PDF parsing support added! Automatically extracts text, metadata, and page count from PDF documents.
**Also in v1.2.0**: Lightpanda binary is now automatically downloaded and configured during installation! No manual setup required.
The package now automatically:
- Downloads the correct Lightpanda binary for your platform (macOS, Linux, Windows/WSL)
- Configures binary paths and permissions
- Validates installation health on first use
### Basic Usage
```javascript
import { smartScrape, smartScreenshot, quickShot } from '@monostate/node-scraper';
// Simple one-line scraping
const result = await smartScrape('https://example.com');
console.log(result.content); // Extracted content
console.log(result.method); // Method used: direct-fetch, lightpanda, or puppeteer
// Take a screenshot
const screenshot = await smartScreenshot('https://example.com');
console.log(screenshot.screenshot); // Base64 encoded image
// Quick screenshot (optimized for speed)
const quick = await quickShot('https://example.com');
console.log(quick.screenshot); // Fast screenshot capture
// PDF parsing (automatic detection)
const pdfResult = await smartScrape('https://example.com/document.pdf');
console.log(pdfResult.content); // Extracted text, metadata, page count
```
### Advanced Usage

```javascript
import { BNCASmartScraper } from '@monostate/node-scraper';
const scraper = new BNCASmartScraper({
  timeout: 10000,
  verbose: true,
  lightpandaPath: './lightpanda' // optional
});
const result = await scraper.scrape('https://complex-spa.com');
console.log(result.stats); // Performance statistics
await scraper.cleanup(); // Clean up resources
```
### Browser Pooling

The package includes automatic browser instance pooling to prevent memory leaks:
```javascript
// Browser pool is managed automatically with these defaults:
// - Max 3 concurrent browser instances
// - 5 second idle timeout before cleanup
// - Automatic reuse of browser instances
// For heavy workloads, you can manually clean up:
const scraper = new BNCASmartScraper();
// ... perform multiple scrapes ...
await scraper.cleanup(); // Closes all browser instances
```
**Important**: The convenience functions (`smartScrape`, `smartScreenshot`, etc.) automatically handle cleanup. You only need to call `cleanup()` when using the `BNCASmartScraper` class directly.
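The pooling behavior described above can be sketched as a small bounded pool. This is a simplified, hypothetical model for illustration only; the real logic lives in `browser-pool.js` and manages Puppeteer instances and idle timeouts:

```javascript
// Simplified sketch of a bounded instance pool (hypothetical; not the
// package's actual browser-pool.js implementation).
class SimplePool {
  constructor(maxSize, factory) {
    this.maxSize = maxSize; // max concurrent instances (package default: 3)
    this.factory = factory; // creates a fresh instance
    this.idle = [];         // instances waiting for reuse
    this.created = 0;       // total live instances
  }

  acquire() {
    // Prefer reusing an idle instance over creating a new one.
    if (this.idle.length > 0) return this.idle.pop();
    if (this.created < this.maxSize) {
      this.created += 1;
      return this.factory();
    }
    throw new Error('pool exhausted');
  }

  release(instance) {
    // Return the instance for reuse instead of closing it.
    this.idle.push(instance);
  }

  cleanup() {
    // In the real pool this would close the browsers.
    this.idle = [];
    this.created = 0;
  }
}
```

The key point is that `release()` keeps instances alive for reuse, which is why explicit `cleanup()` matters for long-running processes.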
### Method Override (New in v1.6.0)
Force a specific scraping method instead of using automatic fallback:
```javascript
// Force direct fetch (no browser)
let result = await smartScrape('https://example.com', { method: 'direct' });

// Force Lightpanda browser
result = await smartScrape('https://example.com', { method: 'lightpanda' });

// Force Puppeteer (full Chrome)
result = await smartScrape('https://example.com', { method: 'puppeteer' });

// Auto mode (default - intelligent fallback)
result = await smartScrape('https://example.com', { method: 'auto' });
```
**Important**: When forcing a method, no fallback occurs if it fails. This is useful for:
- Testing specific methods in isolation
- Optimizing for known site requirements
- Debugging method-specific issues
**Error Response for Forced Methods**:
```javascript
{
  success: false,
  error: "Lightpanda scraping failed: [specific error]",
  method: "lightpanda",
  errorType: "network|timeout|parsing|service_unavailable",
  details: "Additional error context"
}
```
### Bulk Scraping (New in v1.8.0)

Process multiple URLs efficiently with automatic request queueing and progress tracking:
```javascript
import { bulkScrape } from '@monostate/node-scraper';
// Basic bulk scraping
const urls = [
  'https://example1.com',
  'https://example2.com',
  'https://example3.com',
  // ... hundreds more
];

const results = await bulkScrape(urls, {
  concurrency: 5,        // Process 5 URLs at a time
  continueOnError: true, // Don't stop on failures
  progressCallback: (progress) => {
    console.log(`Progress: ${progress.percentage.toFixed(1)}% (${progress.processed}/${progress.total})`);
  }
});
console.log(`Success: ${results.stats.successful}, Failed: ${results.stats.failed}`);
console.log(`Total time: ${results.stats.totalTime}ms`);
console.log(`Average time per URL: ${results.stats.averageTime}ms`);
```
#### Streaming Results

For large datasets, use streaming to process results as they complete:
```javascript
import { bulkScrapeStream } from '@monostate/node-scraper';
await bulkScrapeStream(urls, {
  concurrency: 10,
  onResult: async (result) => {
    // Process each successful result immediately
    await saveToDatabase(result);
    console.log(`✓ ${result.url} - ${result.duration}ms`);
  },
  onError: async (error) => {
    // Handle errors as they occur
    console.error(`✗ ${error.url} - ${error.error}`);
  },
  progressCallback: (progress) => {
    process.stdout.write(`\rProcessing: ${progress.percentage.toFixed(1)}%`);
  }
});
```
**Features:**
- Automatic request queueing (no more memory errors!)
- Configurable concurrency control
- Real-time progress tracking
- Continue on error or stop on first failure
- Detailed statistics and method tracking
- Browser instance pooling for efficiency
For detailed examples and advanced usage, see [BULK_SCRAPING.md](./BULK_SCRAPING.md).
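Under the hood, request queueing boils down to capping the number of in-flight promises. A minimal sketch of that idea (hypothetical; not the package's actual internals):

```javascript
// Concurrency-limited mapping: at most `limit` workers run at once,
// each pulling the next URL from a shared cursor when it finishes.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  async function run() {
    while (next < items.length) {
      const i = next++;           // claim the next item
      results[i] = await worker(items[i]);
    }
  }

  // Start at most `limit` workers in parallel.
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, run));
  return results;
}
```

Because only `limit` items are processed at a time, memory stays bounded no matter how many URLs are queued, which is the failure mode the v1.8.0 queueing fix addresses.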
## How It Works
BNCA uses a sophisticated multi-tier system with intelligent detection:
### 1. 🔄 Direct Fetch (Fastest)
- Pure HTTP requests with intelligent HTML parsing
- **Performance**: Sub-second responses
- **Success rate**: 75% of websites
- **PDF Detection**: Automatically detects PDFs by URL, content-type, or magic bytes
### 2. 🐼 Lightpanda Browser (Fast)
- Lightweight browser engine (2-3x faster than Chromium)
- **Performance**: Fast JavaScript execution
- **Fallback triggers**: SPA detection
### 3. 🔵 Puppeteer (Complete)
- Full Chromium browser for maximum compatibility
- **Performance**: Complete JavaScript execution
- **Fallback triggers**: Complex interactions needed
### 📄 PDF Parser (Specialized)
- Automatic PDF detection and parsing
- **Features**: Text extraction, metadata, page count
- **Smart Detection**: Works even when PDFs are served with wrong content-types
- **Performance**: Typically 100-500ms for most PDFs
### 📸 Screenshot Methods
- **Chrome CLI**: Direct Chrome screenshot capture
- **Quickshot**: Optimized with retry logic and smart timeouts
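The tiered fallback above can be sketched as a simple loop (a hypothetical simplification; the method names mirror the `result.method` values shown earlier):

```javascript
// Try each method in order, from fastest to most capable, recording
// which methods were attempted and stopping at the first success.
async function scrapeWithFallback(url, methods) {
  const attempted = [];
  for (const { name, fn } of methods) {
    attempted.push(name);
    try {
      const content = await fn(url);
      return { success: true, method: name, attempted, content };
    } catch (err) {
      // Fall through to the next (slower but more capable) method.
    }
  }
  return { success: false, attempted };
}
```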
## 📊 Performance Benchmark
| Site Type | BNCA | Firecrawl | Speed Advantage |
|-----------|------|-----------|----------------|
| **Wikipedia** | 154ms | 4,662ms | **30.3x faster** |
| **Hacker News** | 1,715ms | 4,644ms | **2.7x faster** |
| **GitHub** | 9,167ms | 9,790ms | **1.1x faster** |
**Average**: 11.35x faster than Firecrawl with 100% reliability
## 🎛️ API Reference
### Convenience Functions
#### `smartScrape(url, options?)`
Quick scraping with intelligent fallback.
#### `smartScreenshot(url, options?)`
Take a screenshot of any webpage.
#### `quickShot(url, options?)`
Optimized screenshot capture for maximum speed.
**Parameters:**
- `url` (string): URL to scrape/capture
- `options` (object, optional): Configuration options
**Returns:** `Promise<ScrapingResult>`
### `BNCASmartScraper`
Main scraper class with advanced features.
#### Constructor Options
```javascript
const scraper = new BNCASmartScraper({
  timeout: 10000,                 // Request timeout in ms
  retries: 2,                     // Number of retries per method
  verbose: false,                 // Enable detailed logging
  lightpandaPath: './lightpanda', // Path to Lightpanda binary
  userAgent: 'Mozilla/5.0 ...',   // Custom user agent
});
```
#### `scrape(url)`

Scrape a URL with intelligent fallback.
```javascript
const result = await scraper.scrape('https://example.com');
```
#### `screenshot(url)`

Take a screenshot of a webpage.
```javascript
const result = await scraper.screenshot('https://example.com');
const img = result.screenshot; // data:image/png;base64,...
```
#### `quickshot(url)`

Quick screenshot capture, optimized for speed with retry logic.
```javascript
const result = await scraper.quickshot('https://example.com');
// 2-3x faster than regular screenshot
```
#### `getStats()`

Get performance statistics.
```javascript
const stats = scraper.getStats();
console.log(stats.successRates); // Success rates by method
```
#### `healthCheck()`

Check availability of all scraping methods.
```javascript
const health = await scraper.healthCheck();
console.log(health.status); // 'healthy' or 'unhealthy'
```
#### `cleanup()`

Clean up resources (close browser instances).
```javascript
await scraper.cleanup();
```
### 🤖 AI-Powered Q&A (New in v1.5.0)

Ask questions about any website and get AI-generated answers:
```javascript
// Method 1: Using your own OpenRouter API key
const openRouterScraper = new BNCASmartScraper({
  openRouterApiKey: 'your-openrouter-api-key'
});
const aboutAnswer = await openRouterScraper.askAI('https://example.com', 'What is this website about?');

// Method 2: Using the OpenAI API (or compatible endpoints)
const openAIScraper = new BNCASmartScraper({
  openAIApiKey: 'your-openai-api-key',
  // Optional: Use a compatible endpoint like Groq, Together AI, etc.
  openAIBaseUrl: 'https://api.groq.com/openai'
});
const servicesAnswer = await openAIScraper.askAI('https://example.com', 'What services do they offer?');

// Method 3: One-liner with OpenRouter
import { askWebsiteAI } from '@monostate/node-scraper';
const topicAnswer = await askWebsiteAI('https://example.com', 'What is the main topic?', {
  openRouterApiKey: process.env.OPENROUTER_API_KEY
});

// Method 4: Using the BNCA backend API (requires a BNCA API key)
const backendScraper = new BNCASmartScraper({
  apiKey: 'your-bnca-api-key'
});
const productsAnswer = await backendScraper.askAI('https://example.com', 'What products are featured?');
```
**API Key Priority:**
1. OpenRouter API key (`openRouterApiKey`)
2. OpenAI API key (`openAIApiKey`)
3. BNCA backend API (`apiKey`)
4. Local fallback (pattern matching - no API key required)
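That priority order amounts to a first-match lookup over the configured keys. A sketch (the function name is hypothetical; the option names are the documented ones):

```javascript
// Resolve which AI provider handles askAI(), per the documented priority.
function pickAIProvider(opts) {
  if (opts.openRouterApiKey) return 'openrouter';
  if (opts.openAIApiKey) return 'openai';
  if (opts.apiKey) return 'backend';
  return 'local'; // pattern-matching fallback, no API key required
}
```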
**Configuration Options:**
```javascript
const result = await scraper.askAI(url, question, {
  // OpenRouter specific
  openRouterApiKey: 'sk-or-...',
  model: 'meta-llama/llama-4-scout:free', // default OpenRouter model

  // OpenAI specific (use instead of the OpenRouter options above,
  // with e.g. model: 'gpt-3.5-turbo')
  // openAIApiKey: 'sk-...',
  // openAIBaseUrl: 'https://api.openai.com', // or a compatible endpoint

  // Shared options
  temperature: 0.3,
  maxTokens: 500
});
```
**Response Format:**
```javascript
{
  success: true,
  answer: "This website is about...",
  method: "direct-fetch",  // Scraping method used
  scrapeTime: 1234,        // Time to scrape in ms
  processing: "openrouter" // AI processing method used
}
```
### 📄 PDF Parsing

BNCA automatically detects and parses PDF documents:
```javascript
const pdfResult = await smartScrape('https://example.com/document.pdf');
// Parsed content includes:
const content = JSON.parse(pdfResult.content);
console.log(content.title); // PDF title
console.log(content.author); // Author name
console.log(content.pages); // Number of pages
console.log(content.text); // Full extracted text
console.log(content.creationDate); // Creation date
console.log(content.metadata); // Additional metadata
```
**PDF Detection Methods:**
- URL ending with `.pdf`
- Content-Type header `application/pdf`
- Binary content starting with `%PDF` (magic bytes)
- Works with PDFs served as `application/octet-stream` (e.g., GitHub raw files)
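These heuristics can be sketched as follows (a simplified, hypothetical version of the detection logic, not the package's actual code):

```javascript
// Detect a PDF by URL suffix, content-type, or the "%PDF" magic bytes.
function looksLikePdf({ url = '', contentType = '', bytes = null }) {
  if (url.toLowerCase().endsWith('.pdf')) return true;
  if (contentType.includes('application/pdf')) return true;
  // Magic bytes: every PDF file starts with the ASCII sequence "%PDF".
  if (bytes && bytes.length >= 4) {
    return bytes[0] === 0x25 && bytes[1] === 0x50 &&
           bytes[2] === 0x44 && bytes[3] === 0x46;
  }
  return false;
}
```

The magic-byte check is what makes detection work even when a server mislabels the response as `application/octet-stream`.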
**Limitations:**
- Maximum file size: 20MB
- Text extraction only (no image OCR)
- Requires `pdf-parse` dependency (automatically installed)
## 📱 Next.js Integration
### API Route Example
```javascript
// pages/api/scrape.js or app/api/scrape/route.js
import { smartScrape } from '@monostate/node-scraper';
export async function POST(request) {
  try {
    const { url } = await request.json();
    const result = await smartScrape(url);
    return Response.json({
      success: true,
      data: result.content,
      method: result.method,
      time: result.performance.totalTime
    });
  } catch (error) {
    return Response.json({
      success: false,
      error: error.message
    }, { status: 500 });
  }
}
```
### React Hook Example

```javascript
// hooks/useScraper.js
import { useState } from 'react';
export function useScraper() {
  const [loading, setLoading] = useState(false);
  const [data, setData] = useState(null);
  const [error, setError] = useState(null);

  const scrape = async (url) => {
    setLoading(true);
    setError(null);
    try {
      const response = await fetch('/api/scrape', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ url })
      });
      const result = await response.json();
      if (result.success) {
        setData(result.data);
      } else {
        setError(result.error);
      }
    } catch (err) {
      setError(err.message);
    } finally {
      setLoading(false);
    }
  };

  return { scrape, loading, data, error };
}
```
### Component Example

```javascript
// components/ScraperDemo.jsx
import { useState } from 'react';
import { useScraper } from '../hooks/useScraper';
export default function ScraperDemo() {
  const { scrape, loading, data, error } = useScraper();
  const [url, setUrl] = useState('');

  const handleScrape = () => {
    if (url) scrape(url);
  };

  return (
    <div className="p-4">
      <div className="flex gap-2 mb-4">
        <input
          type="url"
          value={url}
          onChange={(e) => setUrl(e.target.value)}
          placeholder="Enter URL to scrape..."
          className="flex-1 px-3 py-2 border rounded"
        />
        <button
          onClick={handleScrape}
          disabled={loading}
          className="px-4 py-2 bg-blue-500 text-white rounded disabled:opacity-50"
        >
          {loading ? 'Scraping...' : 'Scrape'}
        </button>
      </div>
      {error && (
        <div className="p-3 bg-red-100 text-red-700 rounded mb-4">
          Error: {error}
        </div>
      )}
      {data && (
        <div className="p-3 bg-green-100 rounded">
          <h3 className="font-bold mb-2">Scraped Content:</h3>
          <pre className="text-sm overflow-auto">{data}</pre>
        </div>
      )}
    </div>
  );
}
```
### Server-Side Only

BNCA is designed for **server-side use only** due to:
- Browser automation requirements (Puppeteer)
- File system access for Lightpanda binary
- CORS restrictions in browsers
### Next.js Deployment
- Use in API routes, not client components
- Ensure Node.js 18+ in production environment
- Consider adding Lightpanda binary to deployment
### Lightpanda Setup (Optional)
The Lightpanda binary is downloaded automatically during installation; if that fails, or you need a specific build, install it manually:
```bash
# macOS ARM64
curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-aarch64-macos
chmod +x lightpanda
# Linux x64
curl -L -o lightpanda https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-x86_64-linux
chmod +x lightpanda
```
## 🔒 Privacy & Security
- **No external API calls by default** - scraping runs locally; AI Q&A contacts a provider only when you supply an API key
- **No data collection** - your data stays private
- **Respects robots.txt** (optional enforcement)
- **Configurable rate limiting**
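As a rough illustration of what configurable rate limiting means (a hypothetical sketch, not the package's built-in API), a sliding-window limiter looks like:

```javascript
// Sliding-window rate limiter: allow at most `maxRequests` requests
// within any `intervalMs` window.
class RateLimiter {
  constructor(maxRequests, intervalMs) {
    this.maxRequests = maxRequests;
    this.intervalMs = intervalMs;
    this.timestamps = []; // times of recently allowed requests
  }

  // Returns true if a request may proceed at time `now` (ms).
  tryAcquire(now = Date.now()) {
    // Drop timestamps that fell outside the current window.
    this.timestamps = this.timestamps.filter((t) => now - t < this.intervalMs);
    if (this.timestamps.length >= this.maxRequests) return false;
    this.timestamps.push(now);
    return true;
  }
}
```

Passing `now` explicitly keeps the limiter deterministic and easy to test; in production you would simply call `tryAcquire()` before each scrape.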
## 📝 TypeScript Support
Full TypeScript definitions included:
```typescript
import { BNCASmartScraper, ScrapingResult, ScrapingOptions } from '@monostate/node-scraper';
const scraper: BNCASmartScraper = new BNCASmartScraper({
  timeout: 5000,
  verbose: true
});
const result: ScrapingResult = await scraper.scrape('https://example.com');
```
## 📋 Changelog

### v1.6.0
- **Method Override**: Force specific scraping methods with `method` parameter
- **Enhanced Error Handling**: Categorized error types for better debugging
- **Fallback Chain Tracking**: See which methods were attempted in auto mode
- **Graceful Failures**: No automatic fallback when method is forced
### v1.5.0
- **AI-Powered Q&A**: Ask questions about any website and get AI-generated answers
- **OpenRouter Support**: Native integration with OpenRouter API for advanced AI models
- **OpenAI Support**: Compatible with OpenAI and OpenAI-compatible endpoints (Groq, Together AI, etc.)
- **Smart Fallback**: Automatic fallback chain: OpenRouter -> OpenAI -> Backend API -> Local processing
- **One-liner AI**: New `askWebsiteAI()` convenience function for quick AI queries
- **Enhanced TypeScript**: Complete type definitions for all AI features
### v1.4.0
- Internal release (skipped for public release)
### v1.3.0
- **PDF Support**: Full PDF parsing with text extraction, metadata, and page count
- **Smart PDF Detection**: Detects PDFs by URL patterns, content-type, or magic bytes
- **Robust Parsing**: Handles PDFs served with incorrect content-types (e.g., GitHub raw files)
- **Fast Performance**: PDF parsing typically completes in 100-500ms
- **Comprehensive Extraction**: Title, author, creation date, page count, and full text
### v1.2.0
- **Auto-Installation**: Lightpanda binary is now automatically downloaded during `npm install`
- **Cross-Platform Support**: Automatic detection and installation for macOS, Linux, and Windows/WSL
- **Improved Performance**: Enhanced binary detection and ES6 module compatibility
- **Better Error Handling**: More robust installation scripts with retry logic
- **Zero Configuration**: No manual setup required - works out of the box
### v1.1.1
- Bug fixes and stability improvements
- Enhanced Puppeteer integration
### v1.1.0
- Added screenshot capabilities
- Improved fallback system
- Performance optimizations
## 🤝 Contributing
See the [main repository](https://github.com/your-org/bnca-prototype) for contribution guidelines.
## 📄 License
MIT License - see [LICENSE](../../LICENSE) file for details.
---
<div align="center">
**Built with ❤️ for fast, reliable web scraping**
[⭐ Star on GitHub](https://github.com/your-org/bnca-prototype) | [📖 Full Documentation](https://github.com/your-org/bnca-prototype#readme)
</div>