<p align="center">
  <h1 align="center">🕷️<br/><code>xscrape</code></h1>
  <p align="center">Extract and transform HTML with your own schema, powered by <code>Standard Schema</code> compatibility.
    <br/>
    by <a href="https://github.com/johnie">@johnie</a>
  </p>
</p>

<br/>

<p align="center">
  <a href="https://opensource.org/licenses/MIT" rel="nofollow"><img src="https://img.shields.io/github/license/johnie/xscrape" alt="License"></a>
  <a href="https://www.npmjs.com/package/xscrape" rel="nofollow"><img src="https://img.shields.io/npm/v/xscrape.svg" alt="npm"></a>
  <a href="https://github.com/johnie/xscrape/actions"><img src="https://github.com/johnie/xscrape/actions/workflows/ci.yml/badge.svg" alt="Build Status"></a>
  <a href="https://github.com/johnie/xscrape" rel="nofollow"><img src="https://img.shields.io/github/stars/johnie/xscrape" alt="stars"></a>
</p>

<br/>
<br/>

## Overview

xscrape is a powerful HTML scraping library that combines the flexibility of query selectors with the safety of schema validation. It works with any validation library that implements the [Standard Schema](https://standardschema.dev) specification, including Zod, Valibot, ArkType, and Effect Schema.
## Features

- **HTML Parsing**: Extract data from HTML using query selectors powered by [cheerio](https://github.com/cheeriojs/cheerio)
- **Universal Schema Support**: Works with any [Standard Schema](https://standardschema.dev) compatible library
- **Type Safety**: Full TypeScript support with types inferred from your schemas
- **Flexible Extraction**: Support for nested objects, arrays, and custom transformation functions
- **Error Handling**: Comprehensive error handling with detailed validation feedback
- **Custom Transformations**: Apply post-processing transformations to validated data
- **Default Values**: Handle missing data gracefully through schema defaults

## Installation

Install xscrape with your preferred package manager:

```bash
npm install xscrape
# or
pnpm add xscrape
# or
bun add xscrape
```

## Quick Start

```typescript
import { defineScraper } from 'xscrape';
import { z } from 'zod';

// Define your schema
const schema = z.object({
  title: z.string(),
  description: z.string(),
  keywords: z.array(z.string()),
  views: z.coerce.number(),
});

// Create a scraper
const scraper = defineScraper({
  schema,
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    keywords: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',') || [],
    },
    views: { selector: 'meta[name="views"]', value: 'content' },
  },
});

// Use the scraper
const { data, error } = await scraper(htmlString);
```

## Usage Examples

### Basic Extraction

Extract basic metadata from an HTML page:

```typescript
import { defineScraper } from 'xscrape';
import { z } from 'zod';

const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    description: z.string(),
    author: z.string(),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    author: { selector: 'meta[name="author"]', value: 'content' },
  },
});

const html = `
<!DOCTYPE html>
<html>
  <head>
    <title>My Blog Post</title>
    <meta name="description" content="An interesting blog post">
    <meta name="author" content="John Doe">
  </head>
  <body>...</body>
</html>
`;

const { data, error } = await scraper(html);
// data: { title: "My Blog Post", description: "An interesting blog post", author: "John Doe" }
```

### Handling Missing Data

Use schema defaults to handle missing data gracefully:

```typescript
const scraper = defineScraper({
  schema: z.object({
    title: z.string().default('Untitled'),
    description: z.string().default('No description available'),
    publishedAt: z.string().optional(),
    views: z.coerce.number().default(0),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    publishedAt: { selector: 'meta[name="published"]', value: 'content' },
    views: { selector: 'meta[name="views"]', value: 'content' },
  },
});

// Even with incomplete HTML, you get sensible defaults
const { data } = await scraper('<html><head><title>Test</title></head></html>');
// data: { title: "Test", description: "No description available", views: 0 }
```

### Extracting Arrays

Extract multiple elements as arrays:

```typescript
const scraper = defineScraper({
  schema: z.object({
    links: z.array(z.string()),
    headings: z.array(z.string()),
  }),
  extract: {
    links: [{ selector: 'a', value: 'href' }],
    headings: [{ selector: 'h1, h2, h3' }],
  },
});

const html = `
<html>
  <body>
    <h1>Main Title</h1>
    <h2>Subtitle</h2>
    <a href="/page1">Link 1</a>
    <a href="/page2">Link 2</a>
  </body>
</html>
`;

const { data } = await scraper(html);
// data: {
//   links: ["/page1", "/page2"],
//   headings: ["Main Title", "Subtitle"]
// }
```

### Nested Objects

Extract complex nested data structures:

```typescript
const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    socialMedia: z.object({
      image: z.string().url(),
      width: z.coerce.number(),
      height: z.coerce.number(),
      type: z.string(),
    }),
  }),
  extract: {
    title: { selector: 'title' },
    socialMedia: {
      selector: 'head',
      value: {
        image: { selector: 'meta[property="og:image"]', value: 'content' },
        width: { selector: 'meta[property="og:image:width"]', value: 'content' },
        height: { selector: 'meta[property="og:image:height"]', value: 'content' },
        type: { selector: 'meta[property="og:type"]', value: 'content' },
      },
    },
  },
});
```

### Custom Value Transformations

Apply custom logic to extracted values:

```typescript
const scraper = defineScraper({
  schema: z.object({
    tags: z.array(z.string()),
    publishedDate: z.date(),
    readingTime: z.number(),
  }),
  extract: {
    tags: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',').map((tag) => tag.trim()) || [],
    },
    publishedDate: {
      selector: 'meta[name="published"]',
      value: (el) => new Date(el.attribs['content']),
    },
    readingTime: {
      selector: 'article',
      value: (el) => {
        const text = el.text();
        const wordsPerMinute = 200;
        const wordCount = text.split(/\s+/).length;
        return Math.ceil(wordCount / wordsPerMinute);
      },
    },
  },
});
```

### Post-Processing with Transform

Apply transformations to the validated data:

```typescript
const scraper = defineScraper({
  schema: z.object({
    title: z.string(),
    description: z.string(),
    tags: z.array(z.string()),
  }),
  extract: {
    title: { selector: 'title' },
    description: { selector: 'meta[name="description"]', value: 'content' },
    tags: {
      selector: 'meta[name="keywords"]',
      value: (el) => el.attribs['content']?.split(',') || [],
    },
  },
  transform: (data) => ({
    ...data,
    slug: data.title.toLowerCase().replace(/\s+/g, '-'),
    tagCount: data.tags.length,
    summary: data.description.substring(0, 100) + '...',
  }),
});
```

## Schema Library Examples

### Zod

```typescript
import { z } from 'zod';

const schema = z.object({
  title: z.string(),
  price: z.coerce.number(),
  inStock: z.boolean().default(false),
});
```

### Valibot

```typescript
import * as v from 'valibot';

const schema = v.object({
  title: v.string(),
  price: v.pipe(v.string(), v.transform(Number)),
  inStock: v.optional(v.boolean(), false),
});
```

### ArkType

```typescript
import { type } from 'arktype';

const schema = type({
  title: 'string',
  price: 'number',
  inStock: 'boolean = false',
});
```

### Effect Schema

```typescript
import { Schema } from 'effect';

const schema = Schema.Struct({
  title: Schema.String,
  price: Schema.NumberFromString,
  inStock: Schema.optionalWith(Schema.Boolean, { default: () => false }),
});
```

## API Reference

### `defineScraper(config)`

Creates a scraper function with the specified configuration.

#### Parameters

- `config.schema`: A Standard Schema compatible schema object
- `config.extract`: Extraction configuration object
- `config.transform?`: Optional post-processing function

#### Returns

A scraper function that takes an HTML string and returns `Promise<{ data?: T, error?: unknown }>`.

### Extraction Configuration

The `extract` object defines how to extract data from HTML:

```typescript
type ExtractConfig = {
  [key: string]: ExtractDescriptor | [ExtractDescriptor];
};

type ExtractDescriptor = {
  selector: string;
  value?: string | ((el: Element) => any) | ExtractConfig;
};
```

#### Properties

- `selector`: CSS selector to find elements
- `value`: How to extract the value:
  - `string`: Attribute name (e.g., `'href'`, `'content'`)
  - `function`: Custom extraction function
  - `object`: Nested extraction configuration
  - `undefined`: Extract text content

#### Array Extraction

Wrap the descriptor in an array to extract multiple elements:

```typescript
{
  links: [{ selector: 'a', value: 'href' }]
}
```

## Error Handling

xscrape provides comprehensive error handling:

```typescript
const { data, error } = await scraper(html);

if (error) {
  // Handle validation errors, extraction errors, or transform errors
  console.error('Scraping failed:', error);
} else {
  // Use the validated data
  console.log('Extracted data:', data);
}
```

## Best Practices

1. **Use Specific Selectors**: Be as specific as possible with CSS selectors to avoid unexpected matches
2. **Handle Missing Data**: Use schema defaults or optional fields for data that might not be present
3. **Validate URLs**: Use URL validation in your schema for `href` attributes
4. **Transform Data Early**: Use custom value functions rather than post-processing when possible
5. **Type Safety**: Let TypeScript infer types from your schema for a better developer experience

## Common Use Cases

- **Web Scraping**: Extract structured data from websites
- **Meta Tag Extraction**: Get social media and SEO metadata
- **Content Migration**: Transform HTML content to structured data
- **Testing**: Validate HTML structure in tests
- **RSS/Feed Processing**: Extract article data from HTML feeds

## Performance Considerations

- xscrape uses cheerio for fast HTML parsing
- Schema validation is performed once, after extraction
- Consider using streaming for large HTML documents
- Cache scrapers when processing many similar documents

## Contributing

We welcome contributions! Please see our [Contributing Guide](https://github.com/johnie/xscrape/blob/main/CONTRIBUTING.md) for details.

## License

MIT License. See the [LICENSE](https://github.com/johnie/xscrape/blob/main/LICENSE) file for details.

## Related Projects

- [cheerio](https://github.com/cheeriojs/cheerio) - jQuery-like server-side HTML parsing
- [Standard Schema](https://standardschema.dev) - Universal schema specification
- [Zod](https://zod.dev) - TypeScript-first schema validation
- [Valibot](https://valibot.dev) - Modular and type-safe schema library
- [Effect](https://effect.website) - Maximum type safety (incl. error handling)
- [ArkType](https://arktype.io) - TypeScript's 1:1 validator, optimized from editor to runtime
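
The "cache scrapers" performance tip above can be sketched as a small memoizing factory. This is a minimal, self-contained illustration, not part of xscrape's API: the `Scraper` type, `getScraper`, and `makeArticleScraper` are hypothetical names, and in real usage the factory body would call `defineScraper({ schema, extract })`.

```typescript
// Hypothetical caching helper: build a scraper once per config key and
// reuse the same instance across many documents.
type Scraper = (html: string) => { title: string };

const scraperCache = new Map<string, Scraper>();
let builds = 0; // counts how many times the factory actually runs

function getScraper(key: string, make: () => Scraper): Scraper {
  let scraper = scraperCache.get(key);
  if (!scraper) {
    scraper = make(); // only built on a cache miss
    scraperCache.set(key, scraper);
  }
  return scraper;
}

// Stand-in for defineScraper({ schema, extract }).
const makeArticleScraper = (): Scraper => {
  builds += 1;
  return (html) => ({ title: html.trim() });
};

const first = getScraper('article', makeArticleScraper);
const second = getScraper('article', makeArticleScraper);
// first === second; builds === 1 — the factory ran only once.
```

The same pattern applies whenever many documents share one schema and extraction config: pay the setup cost once, then call the cached scraper per document.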