# @skypilot/scraper

[![npm latest](https://img.shields.io/npm/v/@skypilot/scraper/alpha?label=latest)](https://www.npmjs.com/package/@skypilot/scraper)
![downloads](https://img.shields.io/npm/dm/@skypilot/scraper)
[![license: ISC](https://img.shields.io/badge/license-ISC-blue.svg)](https://opensource.org/licenses/ISC)

Node-based scriptable web scraper

## How to use

1. Create a database adapter

```typescript
const dbFilePath = 'tmp/demo.json';
const database = new LowDb(dbFilePath);
```

2. Create a scraper that uses the database

```typescript
import { PlaywrightScraper } from './src/PlaywrightScraper';

const scraper = new PlaywrightScraper({ database });
```

3. Use `ScriptBuilder` to build a script:

```typescript
import { ScriptBuilder } from './src/ScriptBuilder';

const builder = new ScriptBuilder()
  .goTo('https://www.iana.org/domains/reserved') // start at a page
  .runOnAll({
    // Runs the nested `commands` on each element that matches `query`
    query: 'table#arpa-table > tbody > tr > td > span.domain.label',
    commands: new ScriptBuilder()
      .follow('a') // follow the href in the first `a` tag
      .query({
        // gather this data for each iteration of the elements matching the `runOnAll` query
        title: 'head > title',
        sponsor: '//h2[contains(text(), "Sponsoring Organisation")]/following-sibling::b',
        adminContact: '//h2[contains(text(), "Administrative Contact")]/following-sibling::b',
        techContact: '//h2[contains(text(), "Technical Contact")]/following-sibling::b',
      })
      .write() // writes to the database
  });
```

4. Pass the script into the scraper's `run` method:

```typescript
const result = scraper.run(builder);
```

## Query

There are two ways to write a query:

### 1. A `Query` or `ShorthandQuery` object

A `Query` object is the standard way to write a selector:

```typescript
interface Query {
  selector: string; // a CSS or XPath selector
  attributeName?: string; // if specified, select this attribute's value; otherwise, select the element's text content
  scope?: 'one' | 'all'; // default = 'one'; when used with `runOnAll`, `scope: 'all'` is automatically set
  limit?: Integer; // limits the selection to `limit` elements
  nthOfType?: Integer; // select the `nth` element matching the selector
}
```

A `ShorthandQuery` is the same as a `Query` object, but uses shorthand names for some of the keys:

```typescript
interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nth?: Integer;
}
```

See [CSS and XPath selectors](https://playwright.dev/docs/selectors). Support for text selectors will be added soon.

A query matches the first element matching the selector, with two exceptions:

- When used with `runOnAll` or when `scope: 'all'`, the selector selects all matching elements up to the `limit` (if any)
- When `nthOfType` is set, the selector selects the `nth` matching element

### 2. A string query

When a string value is used as the query, that value is treated as the `selector` param. For example, if the argument is `'h2'`, it is understood to mean `{ selector: 'h2' }`.
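
To illustrate how the three query forms relate, here is a minimal sketch that reduces a string, a `ShorthandQuery`, or a `Query` to a single `Query` shape. It is not the library's implementation: `normalizeQuery` and `QueryInput` are hypothetical names introduced for this example, `Integer` is approximated as `number`, and filling in the documented default `scope: 'one'` at this step is an assumption.

```typescript
// Hypothetical illustration only; not part of the package's documented API.
type Scope = 'one' | 'all';

interface Query {
  selector: string;
  attributeName?: string;
  scope?: Scope;
  limit?: number; // the README's `Integer` type is approximated as `number`
  nthOfType?: number;
}

interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: Scope;
  limit?: number;
  nth?: number;
}

type QueryInput = string | Query | ShorthandQuery;

function normalizeQuery(input: QueryInput): Query {
  // A bare string is treated as the `selector`
  if (typeof input === 'string') {
    return { selector: input, scope: 'one' };
  }

  // Only ShorthandQuery has a `sel` key, so this check tells the two object forms apart
  if ('sel' in input) {
    const query: Query = { selector: input.sel, scope: input.scope ?? 'one' };
    if (input.attr !== undefined) query.attributeName = input.attr;
    if (input.limit !== undefined) query.limit = input.limit;
    if (input.nth !== undefined) query.nthOfType = input.nth;
    return query;
  }

  // Already a full Query; copy it and fill in the default scope
  return { ...input, scope: input.scope ?? 'one' };
}

// Usage: all three calls produce a `Query` with a `selector` key
const fromString = normalizeQuery('h2');
const fromShorthand = normalizeQuery({ sel: 'a', attr: 'href', nth: 2 });
const fromQuery = normalizeQuery({ selector: 'td', scope: 'all', limit: 5 });
```

Normalizing once at the boundary means the rest of a scraper only has to handle one shape, which is one plausible reason to accept several input forms.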