# web-scraper-pro
[npm version](https://badge.fury.io/js/web-scraper-pro)
[MIT License](https://opensource.org/licenses/MIT)
[Node.js](https://nodejs.org/)
A professional web scraper powered by Puppeteer and Mozilla Readability. Extract clean, readable content from any website with full TypeScript support and comprehensive error handling.
## Installation
```bash
npm install web-scraper-pro
```
## Quick Start
```javascript
const WebScraper = require("web-scraper-pro");

(async () => {
  // Method 1: Default output directory (./output)
  const scraper = new WebScraper();
  const result = await scraper.scrapeAndSave("https://example.com");

  // Method 2: Custom output directory via constructor
  const scraper2 = new WebScraper({ outputDir: "./my-downloads" });
  const result2 = await scraper2.scrapeAndSave("https://example.com");

  // Method 3: Set output directory after creation
  const scraper3 = new WebScraper();
  scraper3.setOutputDir("./custom-folder");
  const result3 = await scraper3.scrapeAndSave("https://example.com");

  // Method 4: Extract content only (no files saved)
  const content = await scraper.scrapeContentOnly("https://example.com", [
    "title",
    "content",
  ]);
  console.log(content);
})();
```
## Project Structure
```
web-scraper-pro/
├── src/
│   ├── scraper.js          # Main scraper implementation
│   └── scraper.d.ts        # TypeScript definitions
├── output/                 # Generated output files
│   ├── scraped_*.html      # Raw HTML files
│   └── extracted_*.txt     # Extracted content files
├── test/                   # Test files
├── index.js                # Main entry point
├── package.json
└── README.md
```
## Usage
### Command Line Interface
Run with default URL (Wikipedia):
```bash
node src/scraper.js
```
Run with custom URL:
```bash
node src/scraper.js "https://your-target-url.com"
```
### Setting Custom Output Directory
There are multiple ways to specify where output files should be saved:
#### 1. Via Constructor
```javascript
const WebScraper = require("web-scraper-pro");
const scraper = new WebScraper({ outputDir: "./my-custom-folder" });
```
#### 2. Via setOutputDir() Method
```javascript
const scraper = new WebScraper();
scraper.setOutputDir("./downloads/scraped-data");
// Method chaining is supported
scraper.setOutputDir("./downloads").getOutputDir(); // Returns current path
```
#### 3. Using Absolute Paths
```javascript
const scraper = new WebScraper();
scraper.setOutputDir("C:/Users/Desktop/scraped-content");
// or on Linux/Mac
scraper.setOutputDir("/home/user/scraped-content");
```
#### 4. Check Current Output Directory
```javascript
const currentPath = scraper.getOutputDir();
console.log(`Files will be saved to: ${currentPath}`);
```
**Note:** The output directory will be created automatically if it doesn't exist.
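Putting it together, here is a minimal end-to-end sketch that uses only the methods documented in the API reference below:

```javascript
const WebScraper = require("web-scraper-pro");

(async () => {
  const scraper = new WebScraper();
  scraper.setOutputDir("./downloads/articles"); // created automatically if missing

  const result = await scraper.scrapeAndSave("https://example.com");
  console.log(`Output directory: ${scraper.getOutputDir()}`);
  console.log(`HTML file: ${result.files.htmlFile}`);
  console.log(`Text file: ${result.files.txtFile}`);
})();
```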
## API Reference
### Constructor
Creates a new WebScraper instance with optional configuration.
```javascript
const WebScraper = require("web-scraper-pro");
// Default output directory (./output)
const scraper = new WebScraper();

// Custom output directory
const customScraper = new WebScraper({ outputDir: "./my-output" });
```
**Parameters:**
- `options` (object, optional): Configuration options
- `outputDir` (string): Custom output directory path
### setOutputDir(path)
Sets a custom output directory for saved files.
```javascript
scraper.setOutputDir("./custom-folder");
// Supports method chaining
scraper.setOutputDir("./downloads").getOutputDir();
```
**Parameters:**
- `path` (string): Absolute or relative path to output directory
**Returns:** WebScraper instance (for method chaining)
### getOutputDir()
Gets the current output directory path.
```javascript
const currentPath = scraper.getOutputDir();
console.log(currentPath); // e.g., "C:/Users/Name/project/output"
```
**Returns:** String with current output directory path
### scrapeAndSave(url, returnFields)
Scrapes a webpage and saves both HTML and extracted content to files.
```javascript
const result = await scraper.scrapeAndSave("https://example.com", [
"url",
"title",
"content",
]);
console.log(result.data.title); // Extracted title
console.log(result.files.htmlFile); // Path to saved HTML file
console.log(result.files.txtFile); // Path to saved text file
```
**Parameters:**
- `url` (string): The URL to scrape
- `returnFields` (array, optional): Data fields to include in output
- Default: `['url','title','siteName','length','extractedAt','content']`
**Returns:** Object with `success`, `data`, `files`, and `duration` properties
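A defensive call might look like the sketch below; it assumes `success` is `false` when scraping fails (the exact failure shape and the units of `duration` are not specified here):

```javascript
const WebScraper = require("web-scraper-pro");

(async () => {
  const scraper = new WebScraper();
  const result = await scraper.scrapeAndSave("https://example.com", [
    "title",
    "content",
  ]);

  if (result.success) {
    console.log(`Scraped "${result.data.title}" (duration: ${result.duration})`);
    console.log(`Saved: ${result.files.htmlFile}, ${result.files.txtFile}`);
  } else {
    console.error("Scrape failed; see the console output for details.");
  }
})();
```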
### scrapeContentOnly(url, fields)
Scrapes a webpage and returns formatted content as a string (no files created).
```javascript
const { scrapeContentOnly } = require("web-scraper-pro");
const content = await scrapeContentOnly("https://example.com", [
"title",
"url",
"content",
]);
console.log(content);
// Output:
// TITLE: Page Title
// URL: https://example.com
//
// CONTENT:
// Main content text...
```
**Parameters:**
- `url` (string): The URL to scrape
- `fields` (array): Data fields to include in output string
**Returns:** Formatted string with extracted content
### Available Data Fields
- `url` - The webpage URL
- `title` - Page title
- `content` - Main content (cleaned by Readability)
- `siteName` - Website name
- `length` - Content length in characters
- `extractedAt` - Extraction timestamp
- `excerpt` - Short content summary
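For example, to collect lightweight metadata without the full article text, request a narrow field set (this sketch assumes any combination of the fields above is accepted):

```javascript
const { scrapeContentOnly } = require("web-scraper-pro");

(async () => {
  const summary = await scrapeContentOnly("https://example.com", [
    "title",
    "siteName",
    "excerpt",
    "length",
  ]);
  console.log(summary);
})();
```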
## Configuration
### TypeScript Support
Full TypeScript definitions are included:
```typescript
import {
scrapeAndSave,
scrapeContentOnly,
ExtractedData,
ScrapeResult,
} from "web-scraper-pro";
const result: ScrapeResult = await scrapeAndSave("https://example.com", [
  "title",
  "content",
]);
const content: string = await scrapeContentOnly("https://example.com", [
"title",
"content",
]);
```
### Custom Puppeteer Settings
The scraper uses optimized Puppeteer settings by default, but you can customize them by modifying the source:
```javascript
// In src/scraper.js
browser = await puppeteer.launch({
headless: false, // Show browser for debugging
args: ["--no-sandbox"], // Additional Chrome flags
timeout: 60000, // Custom timeout
});
```
### Timeout Configuration
```javascript
await page.goto(url, {
waitUntil: "networkidle2",
timeout: 60000, // Increase timeout to 60 seconds
});
```
## Output Formats
### File Output
Generated files use descriptive naming with timestamps:
```
extracted_example_com_2025-09-08T12-34-56-789Z.txt
scraped_example_com_2025-09-08T12-34-56-789Z.html
```
**Text file format:**
```
URL: https://example.com
TITLE: Page Title
SITE_NAME: Site Name
LENGTH: 1000 characters
EXTRACTED_AT: 2025-09-08T12:34:56.789Z
CONTENT:
Main content extracted by Mozilla Readability...
```
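If you need to post-process a saved file, the metadata header and the article body can be split on the `CONTENT:` marker. A minimal sketch, assuming the format shown above:

```javascript
const fs = require("fs");

// Read a saved text file and separate the metadata header from the article body.
const raw = fs.readFileSync(
  "./output/extracted_example_com_2025-09-08T12-34-56-789Z.txt",
  "utf8"
);
const [header, ...rest] = raw.split("CONTENT:");
const body = rest.join("CONTENT:").trim();

console.log("Metadata:\n" + header.trim());
console.log("Body length:", body.length, "characters");
```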
### Programmatic Output
```javascript
{
url: "https://example.com",
title: "Page Title",
content: "Main content...",
siteName: "Site Name",
length: 1000,
extractedAt: "2025-09-08T12:34:56.789Z",
excerpt: "Brief summary..."
}
```
## Troubleshooting
### Puppeteer Installation Issues
**Linux:**
```bash
sudo apt-get install -y gconf-service libasound2 libatk1.0-0 libcairo-gobject2 libdrm2 libgtk-3-0 libnspr4 libnss3 libx11-xcb1 libxcomposite1 libxcursor1 libxdamage1 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6
```
**macOS (M1/M2):**
```bash
arch -x86_64 npm install puppeteer
```
**Windows:**
Ensure Visual Studio Build Tools are installed for native dependencies.
### Bot Detection
Some websites block automated scraping. Try these solutions:
```javascript
// Add user agent and delays
await page.setUserAgent(
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
);
await page.setViewport({ width: 1366, height: 768 });
await page.waitForTimeout(2000); // Add delay (removed in recent Puppeteer versions; use `await new Promise((r) => setTimeout(r, 2000))` there)
```
### Memory Issues
For large-scale scraping, implement proper cleanup:
```javascript
// Close browser instances properly
await browser.close();
// Monitor memory usage
process.on("exit", () =>
console.log("Process memory usage:", process.memoryUsage())
);
```
## Features

- **JavaScript-rendered content** - Handles dynamic pages
- **Clean content extraction** - Removes ads, sidebars, navigation
- **Automatic file naming** - Timestamp-based file organization
- **Comprehensive error handling** - Robust retry logic
- **TypeScript support** - Full type definitions included
- **Dual output modes** - Files + data or string-only
- **Component architecture** - Modular, maintainable codebase
- **Professional logging** - Detailed console feedback
## Tested Websites

- **Wikipedia** (multiple languages)
- **News websites** (BBC, CNN, etc.)
- **Gaming sites** (Perfect World Games, IGN)
- **Blog platforms** (Medium, Dev.to)
- **Documentation sites** (MDN, official docs)
- **E-commerce** (product pages)
### Running Tests
```bash
# Test core functionality
npm test
# Test with specific examples
node test-new-functions.js
node test-gaming-news.js
```
## Performance
- **Speed**: ~3-5 seconds per page (includes browser startup)
- **Memory**: ~50-100MB per browser instance
- **Reliability**: Built-in retry logic for network issues
- **Output size**: Extracted text is typically 80-90% smaller than the raw HTML
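Because browser startup dominates per-page time and each instance holds 50-100 MB, batch jobs are best run sequentially on a single scraper instance. A rough sketch using the documented API (each URL is handled independently so one failure does not stop the batch):

```javascript
const WebScraper = require("web-scraper-pro");

const urls = [
  "https://example.com/a",
  "https://example.com/b",
  "https://example.com/c",
];

(async () => {
  const scraper = new WebScraper({ outputDir: "./batch-output" });

  for (const url of urls) {
    try {
      const result = await scraper.scrapeAndSave(url, ["title", "content"]);
      console.log(`${url}: ${result.success ? "ok" : "failed"}`);
    } catch (err) {
      console.error(`${url}: ${err.message}`);
    }
  }
})();
```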
## Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Commit changes: `git commit -am 'Add feature'`
4. Push to branch: `git push origin feature-name`
5. Submit a pull request
## License
MIT License - see [LICENSE](LICENSE) file for details.
## Links
- **NPM Package**: https://www.npmjs.com/package/web-scraper-pro
- **GitHub Repository**: https://github.com/nanpapu/web-scraper-pro
- **Issues & Support**: https://github.com/nanpapu/web-scraper-pro/issues
---
Made with ❤️ by [Nanpapu](https://github.com/nanpapu)