@phunky/scrape-channel-listings
Version:
A TypeScript library for scraping TV channel listings from various providers
197 lines (146 loc) • 5.81 kB
Markdown
A TypeScript library for scraping TV channel listings from various providers:
- DIRECTV
- DISH Network
- Sky UK
- Virgin Media
This project started as a proof-of-concept for scraping TV channel listings from various providers. The codebase has been significantly improved with the assistance of Cursor AI
- Parallel scraping with configurable concurrency
- Performance monitoring and statistics
- Error handling and detailed logging
- JSON output by default
- Optional file output for each provider
- Individual provider scraping support
- Available as both a library and CLI tool
This package requires Playwright with Chromium browser for web scraping. After installing the package, you'll need to install Playwright's Chromium browser:
```bash
npm install @phunky/scrape-channel-listings
npx playwright install chromium
```
The channel listings are scraped from the following sources:
- Virgin Media UK: [rxtvinfo.com/virgin-media-channel-list-uk](https://rxtvinfo.com/virgin-media-channel-list-uk/)
- Sky UK: [rxtvinfo.com/sky-channel-list-uk](https://rxtvinfo.com/sky-channel-list-uk/)
- Sky UK Satallite: [rxtvinfo.com/sky-satellite-channel-list-uk](https://rxtvinfo.com/sky-satellite-channel-list-uk/)
- Sky Ireland: [rxtvinfo.com/sky-stream-and-sky-glass-channel-list-republic-of-ireland-epg](https://rxtvinfo.com/sky-stream-and-sky-glass-channel-list-republic-of-ireland-epg/)
- Freesat HD: [rxtvinfo.com/freesat-channel-list-uk/](https://rxtvinfo.com/freesat-channel-list-uk/)
- DIRECTV: [usdirect.com/channels](https://www.usdirect.com/channels)
- DISH Network: [allconnect.com/providers/dish/channel-guide](https://www.allconnect.com/providers/dish/channel-guide)
Please note that these sources are third-party websites and may change without notice. The scrapers are maintained to work with the current structure of these sites, but may need updates if the source websites undergo significant changes.
```bash
npm install @phunky/scrape-channel-listings
```
```typescript
import { scrapeAllProviders, scrapeProvider, type Channel, type ScrapingSummary } from '@phunky/scrape-channel-listings';
// Scrape all providers
const channels = await scrapeAllProviders();
console.log(channels); // Array of { provider: string, channels: Channel[] }
// Scrape with options
const summary = await scrapeAllProviders({
writeFiles: true, // Write results to files
maxConcurrent: 2 // Limit concurrent scrapers
});
console.log(summary); // ScrapingSummary object
// Scrape a specific provider
const result = await scrapeProvider('directv');
console.log(result); // ScraperResult object
```
```bash
npx @phunky/scrape-channel-listings
npx @phunky/scrape-channel-listings --provider virgin
npx @phunky/scrape-channel-listings --provider sky
npx @phunky/scrape-channel-listings --provider skyireland
npx @phunky/scrape-channel-listings --provider skysatellite
npx @phunky/scrape-channel-listings --provider freesat
npx @phunky/scrape-channel-listings --provider directv
npx @phunky/scrape-channel-listings --provider dish
npx @phunky/scrape-channel-listings --write-files
npx @phunky/scrape-channel-listings --max-concurrent 4
```
```typescript
interface Channel {
number: string;
name: string;
}
interface ProviderChannels {
provider: string;
channels: Channel[];
}
interface ScraperResult {
name: string;
success: boolean;
duration: number;
channelCount?: number;
error?: Error;
channels?: Channel[];
}
interface ScrapingOptions {
writeFiles?: boolean;
maxConcurrent?: number;
}
interface ScrapingSummary {
results: ScraperResult[];
totalDuration: number;
successRate: string;
totalChannels: number;
failedScrapers: ScraperResult[];
}
```
Scrapes channel listings from all configured providers. Returns either an array of provider channels or a summary object depending on the `writeFiles` option.
Scrapes channel listings from a specific provider. Throws an error if the provider is not found.
The scraper can be configured using environment variables:
```bash
HEADLESS=false CONCURRENT_SCRAPERS=2 npm run scrape
```
Available environment variables:
- `HEADLESS`: Set to 'false' to see the browser while scraping (default: true)
- `CONCURRENT_SCRAPERS`: Number of scrapers to run in parallel (default: 10)
- `RETRY_ATTEMPTS`: Number of retry attempts for failed scrapes (default: 1)
- `RETRY_DELAY`: Delay between retries in milliseconds (default: 1000)
- `PAGE_TIMEOUT`: Page load timeout in milliseconds (default: 30000)
- `OUTPUT_DIR`: Directory to save results when using --files (default: '../data')
## Error Handling
The scraper will:
- Retry failed attempts based on `RETRY_ATTEMPTS` setting
- Log detailed error messages
- Continue with remaining providers if one fails
- Exit with code 1 if any scraper fails
- Provide error details in the final summary (when using --files)
## Development
```bash
# Install dependencies
npm install
# Build the package
npm run build
# Run tests
npm test
# Run specific scraper
npm run scrape:virgin
npm run scrape:sky
npm run scrape:skyireland
npm run scrape:skysatellite
npm run scrape:freesat
npm run scrape:directv
npm run scrape:dish
```
## License
ISC