<p align="center"><img src=".github/art/cover.jpg" alt="Social Card of this repo"></p>
[![npm version][npm-version-src]][npm-version-href]
[![GitHub Actions][github-actions-src]][github-actions-href]
[![Commitizen friendly](https://img.shields.io/badge/commitizen-friendly-brightgreen.svg)](http://commitizen.github.io/cz-cli/)
<!-- [![npm downloads][npm-downloads-src]][npm-downloads-href] -->
<!-- [![Codecov][codecov-src]][codecov-href] -->
# ts-web-scraper
A powerful, type-safe web scraping library for TypeScript and Bun with **zero external dependencies**. Built entirely on Bun's native APIs for maximum performance and minimal footprint.
## Features
- 🚀 **Zero Dependencies** - Built entirely on Bun native APIs
- 💪 **Fully Typed** - Complete TypeScript support with type inference
- ⚡️ **High Performance** - Optimized for speed with native Bun performance
- 🔄 **Rate Limiting** - Built-in token bucket rate limiter with burst support
- 💾 **Smart Caching** - LRU cache with TTL support
- 🔁 **Automatic Retries** - Exponential backoff retry logic
- 📊 **Data Extraction** - Powerful pipeline-based data extraction and transformation
- 🎯 **Validation** - Built-in schema validation for extracted data
- 📈 **Monitoring** - Performance metrics and analytics
- 🔍 **Change Detection** - Track content changes over time with diff algorithms
- 🤖 **Ethical Scraping** - Robots.txt support and user-agent management
- 🍪 **Session Management** - Cookie jar and session persistence
- 📝 **Multiple Export Formats** - JSON, CSV, XML, YAML, Markdown, HTML
- 🌐 **Pagination** - Automatic pagination detection and traversal
- 🎨 **Client-Side Rendering** - Support for JavaScript-heavy sites
- 📚 **Comprehensive Docs** - Full documentation with examples
## Installation
```bash
bun add ts-web-scraper
```
## Quick Start
```typescript
import { createScraper } from 'ts-web-scraper'
// Create a scraper instance
const scraper = createScraper({
  rateLimit: { requestsPerSecond: 2 },
  cache: { enabled: true, ttl: 60000 },
  retry: { maxRetries: 3 },
})

// Scrape a website
const result = await scraper.scrape('https://example.com', {
  extract: doc => ({
    title: doc.querySelector('title')?.textContent,
    headings: Array.from(doc.querySelectorAll('h1')).map(h => h.textContent),
  }),
})
console.log(result.data)
```
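The `retry` option drives the exponential backoff mentioned in the features list. As a rough sketch of what that schedule looks like (illustrative only; the library's exact formula, including any jitter or delay cap, isn't documented here):

```typescript
// A plain doubling schedule starting from initialDelay.
function backoffDelay(attempt: number, initialDelay = 1000): number {
  return initialDelay * 2 ** attempt // attempt 0 → 1s, 1 → 2s, 2 → 4s
}
```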
## Core Concepts
### Scraper
The main scraper class provides a unified API for all scraping operations:
```typescript
import { createScraper } from 'ts-web-scraper'
const scraper = createScraper({
  // Rate limiting
  rateLimit: {
    requestsPerSecond: 2,
    burstSize: 5,
  },

  // Caching
  cache: {
    enabled: true,
    ttl: 60000,
    maxSize: 100,
  },

  // Retry logic
  retry: {
    maxRetries: 3,
    initialDelay: 1000,
  },

  // Performance monitoring
  monitor: true,

  // Change tracking
  trackChanges: true,

  // Cookies & sessions
  cookies: { enabled: true },
})
```
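The `rateLimit` options map directly onto the token bucket model: `requestsPerSecond` is the refill rate and `burstSize` is the bucket capacity. A minimal sketch of that behavior (not the library's internal implementation):

```typescript
// Tokens refill continuously at `requestsPerSecond`, up to `burstSize`;
// each request consumes one token. A full bucket allows a short burst.
class TokenBucket {
  private tokens: number
  private lastRefill = Date.now()

  constructor(private requestsPerSecond: number, private burstSize: number) {
    this.tokens = burstSize
  }

  tryAcquire(): boolean {
    const now = Date.now()
    const elapsedSeconds = (now - this.lastRefill) / 1000
    this.tokens = Math.min(this.burstSize, this.tokens + elapsedSeconds * this.requestsPerSecond)
    this.lastRefill = now
    if (this.tokens >= 1) {
      this.tokens -= 1
      return true
    }
    return false
  }
}
```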
### Data Extraction
Extract and transform data using pipelines:
```typescript
import { extractors, pipeline } from 'ts-web-scraper'
const extractProducts = pipeline()
  .step(extractors.structured('.product', {
    name: '.product-name',
    price: '.product-price',
    rating: '.rating',
  }))
  .map('parse-price', p => ({
    ...p,
    price: Number.parseFloat(p.price.replace(/[^0-9.]/g, '')),
  }))
  .filter('in-stock', products => products.every(p => p.price > 0))
  .sort('by-price', (a, b) => a.price - b.price)

const result = await extractProducts.execute(document)
```
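For example, the `parse-price` step above turns a scraped price string into a number:

```typescript
const raw = { name: 'Widget', price: '$19.99', rating: '4.5' }
const parsed = {
  ...raw,
  price: Number.parseFloat(raw.price.replace(/[^0-9.]/g, '')), // '$19.99' → 19.99
}
```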
### Change Detection
Track content changes over time:
```typescript
const scraper = createScraper({ trackChanges: true })

// First scrape
const result1 = await scraper.scrape('https://example.com', {
  extract: doc => ({ price: doc.querySelector('.price')?.textContent }),
})
// result1.changed === undefined (no previous snapshot)

// Second scrape
const result2 = await scraper.scrape('https://example.com', {
  extract: doc => ({ price: doc.querySelector('.price')?.textContent }),
})
// result2.changed === false (if price hasn't changed)
```
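A typical use is to act only when something actually changed, for example to alert on price movements (sketch):

```typescript
if (result2.changed) {
  console.log('Price changed:', result2.data?.price)
}
```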
### Export Data
Export scraped data to multiple formats:
```typescript
import { exportData, saveExport } from 'ts-web-scraper'

// Export to JSON
const json = exportData(data, { format: 'json', pretty: true })

// Export to CSV
const csv = exportData(data, { format: 'csv' })

// Save to file (format auto-detected from extension)
await saveExport(data, 'output.csv')
await saveExport(data, 'output.json')
await saveExport(data, 'output.xml')
```
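The features list also mentions YAML, Markdown, and HTML output. Assuming those formats use the same lowercase keys as `'json'` and `'csv'` above (an assumption, not verified against the API), the calls would look like:

```typescript
// Assumed format keys, mirroring the documented 'json' and 'csv' examples
const yaml = exportData(data, { format: 'yaml' })
const markdown = exportData(data, { format: 'markdown' })
```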
## Advanced Features
### Pagination
Automatically traverse paginated content:
```typescript
for await (const page of scraper.scrapeAll('https://example.com/posts', {
  extract: doc => ({
    posts: extractors.structured('article', {
      title: 'h2',
      content: '.content',
    }).execute(doc),
  }),
}, { maxPages: 10 })) {
  console.log(`Page ${page.pageNumber}:`, page.data)
}
```
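Since `scrapeAll` is consumed with `for await`, it behaves as an async iterable, so pages can also be collected eagerly, e.g. with `Array.fromAsync` (a sketch, assuming a Bun version that ships it):

```typescript
const pages = await Array.fromAsync(
  scraper.scrapeAll('https://example.com/posts', {
    extract: doc => ({ title: doc.querySelector('title')?.textContent }),
  }, { maxPages: 10 }),
)
console.log(pages.length) // up to 10 pages
```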
### Performance Monitoring
Track and analyze scraping performance:
```typescript
const scraper = createScraper({ monitor: true })

await scraper.scrape('https://example.com')
await scraper.scrape('https://example.com/page2')

const stats = scraper.getStats()
console.log(stats.totalRequests) // 2
console.log(stats.averageDuration) // Average time per request
console.log(stats.cacheHitRate) // Cache effectiveness

const report = scraper.getReport()
console.log(report) // Formatted performance report
```
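These counters make simple health checks easy, e.g. (a sketch, assuming `cacheHitRate` is reported as a 0–1 fraction, which isn't verified here):

```typescript
const stats = scraper.getStats()
if (stats.cacheHitRate < 0.5)
  console.warn('Cache hit rate below 50% — consider raising the cache TTL')
```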
### Content Validation
Validate extracted data against schemas:
```typescript
const result = await scraper.scrape('https://example.com', {
  extract: doc => ({
    title: doc.querySelector('title')?.textContent,
    price: Number.parseFloat(doc.querySelector('.price')?.textContent || '0'),
  }),
  validate: {
    title: { type: 'string', required: true },
    price: { type: 'number', min: 0, required: true },
  },
})

if (result.success) {
  // Data is valid and typed
  console.log(result.data.title, result.data.price)
}
else {
  console.error(result.error)
}
```
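Putting the examples together, the result object appears to carry at least the following fields (a shape inferred from this README, not the library's published types):

```typescript
interface ScrapeResult<T> {
  success: boolean // false when the request or validation fails
  data?: T // the extracted (and validated) payload
  error?: unknown // populated on failure; exact type not documented here
  changed?: boolean // set when trackChanges is enabled and a prior snapshot exists
}
```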
## Documentation
For full documentation, visit [https://ts-web-scraper.netlify.app](https://ts-web-scraper.netlify.app)
- [Getting Started Guide](https://ts-web-scraper.netlify.app/guide/)
- [API Reference](https://ts-web-scraper.netlify.app/api/)
- [Examples](https://ts-web-scraper.netlify.app/examples/)
## Testing
```bash
bun test
```
All 482 tests pass, with comprehensive coverage of:
- Core scraping functionality
- Rate limiting and caching
- Data extraction pipelines
- Change detection and monitoring
- Export formats
- Error handling and edge cases
## Changelog
Please see our [releases](https://github.com/stacksjs/ts-web-scraper/releases) page for more information on what has changed recently.
## Contributing
Please see [CONTRIBUTING](.github/CONTRIBUTING.md) for details.
## Community
For help, discussion about best practices, or any other conversation that would benefit from being searchable:
[Discussions on GitHub](https://github.com/stacksjs/ts-web-scraper/discussions)
For casual chit-chat with others using this package:
[Join the Stacks Discord Server](https://discord.gg/stacksjs)
## Postcardware
"Software that is free, but hopes for a postcard." We love receiving postcards from around the world showing where Stacks is being used! We showcase them on our website too.
Our address: Stacks.js, 12665 Village Ln #2306, Playa Vista, CA 90094, United States 🌎
## Sponsors
We would like to extend our thanks to the following sponsors for funding Stacks development. If you are interested in becoming a sponsor, please reach out to us.
- [JetBrains](https://www.jetbrains.com/)
- [The Solana Foundation](https://solana.com/)
## License
The MIT License (MIT). Please see [LICENSE](LICENSE.md) for more information.
Made with 💙
<!-- Badges -->
[npm-version-src]: https://img.shields.io/npm/v/ts-web-scraper?style=flat-square
[npm-version-href]: https://npmjs.com/package/ts-web-scraper
[github-actions-src]: https://img.shields.io/github/actions/workflow/status/stacksjs/ts-web-scraper/ci.yml?style=flat-square&branch=main
[github-actions-href]: https://github.com/stacksjs/ts-web-scraper/actions?query=workflow%3Aci
<!-- [codecov-src]: https://img.shields.io/codecov/c/gh/stacksjs/ts-web-scraper/main?style=flat-square
[codecov-href]: https://codecov.io/gh/stacksjs/ts-web-scraper -->