# @skypilot/scraper
[npm package](https://www.npmjs.com/package/@skypilot/scraper)

[License: ISC](https://opensource.org/licenses/ISC)
Node-based scriptable web scraper
## How to use
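1. Create a database adapter:
```typescript
// Import `LowDb` before this line; its module path is not shown in this README.
const dbFilePath = 'tmp/demo.json'; // JSON file that stores the scraped data
const database = new LowDb(dbFilePath);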
```
2. Create a scraper that uses the database:
```typescript
import { PlaywrightScraper } from './src/PlaywrightScraper';
const scraper = new PlaywrightScraper({ database });
```
3. Use `ScriptBuilder` to build a script:
```typescript
import { ScriptBuilder } from './src/ScriptBuilder';
const builder = new ScriptBuilder()
  .goTo('https://www.iana.org/domains/reserved') // start at a page
  .runOnAll({ // run the nested `commands` on each element that matches `query`
    query: 'table#arpa-table > tbody > tr > td > span.domain.label',
    commands: new ScriptBuilder()
      .follow('a') // follow the href in the first `a` tag
      .query({ // gather this data for each element matched by the `runOnAll` query
        title: 'head > title',
        sponsor: '//h2[contains(text(), "Sponsoring Organisation")]/following-sibling::b',
        adminContact: '//h2[contains(text(), "Administrative Contact")]/following-sibling::b',
        techContact: '//h2[contains(text(), "Technical Contact")]/following-sibling::b',
      })
      .write() // write the gathered data to the database
  });
```
4. Pass the script into the scraper's `run` method:
```typescript
const result = scraper.run(builder);
```
## Query
There are two ways to write a query:
### 1. A `Query` or `ShorthandQuery` object
A `Query` object is the standard way to write a selector:
```typescript
interface Query {
  selector: string; // a CSS or XPath selector
  attributeName?: string; // if specified, select this attribute's value; otherwise, select the element's text content
  scope?: 'one' | 'all'; // default = 'one'; when used with `runOnAll`, `scope: 'all'` is automatically set
  limit?: Integer; // limits the selection to `limit` elements
  nthOfType?: Integer; // select the `nth` element matching the selector
}
```
A `ShorthandQuery` is the same as a `Query` object, but uses shorter names for some of the keys:
```typescript
interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nth?: Integer;
}
```
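The correspondence between the two forms (`sel` to `selector`, `attr` to `attributeName`, `nth` to `nthOfType`) can be illustrated with a small converter. This is a hypothetical sketch rather than the package's own code: `expandShorthand` is an invented name, and `number` stands in for the `Integer` type so that the block is self-contained.

```typescript
// Hypothetical sketch: `expandShorthand` is not a documented export of @skypilot/scraper.
// `number` stands in for the package's `Integer` type.
interface Query {
  selector: string;
  attributeName?: string;
  scope?: 'one' | 'all';
  limit?: number;
  nthOfType?: number;
}

interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: 'one' | 'all';
  limit?: number;
  nth?: number;
}

// Map each shorthand key to its long-form name, skipping keys that were not set.
function expandShorthand(shorthand: ShorthandQuery): Query {
  const query: Query = { selector: shorthand.sel };
  if (shorthand.attr !== undefined) query.attributeName = shorthand.attr;
  if (shorthand.scope !== undefined) query.scope = shorthand.scope;
  if (shorthand.limit !== undefined) query.limit = shorthand.limit;
  if (shorthand.nth !== undefined) query.nthOfType = shorthand.nth;
  return query;
}

// The `href` of up to three links, written in shorthand and expanded to the long form.
const expanded = expandShorthand({ sel: 'a', attr: 'href', scope: 'all', limit: 3 });
console.log(expanded);
```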
See [CSS and XPath selectors](https://playwright.dev/docs/selectors). Support for text selectors will be added soon.
A query matches the first element matching the selector, with two exceptions:
- When used with `runOnAll` or when `scope: 'all'`, the selector selects all matching elements up to the `limit` (if any)
- When `nthOfType` is set, the selector selects the `nth` matching element
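
The rules above can be sketched as a function over an already-matched list of elements. This is an illustration under stated assumptions, not the package's implementation: `selectMatches` and `SelectionOptions` are hypothetical names, `nthOfType` is assumed to be 1-based (as in CSS `:nth-of-type`), and `nthOfType` is assumed to take precedence when it is combined with `scope: 'all'`.

```typescript
// Hypothetical sketch of the selection rules; not the package's implementation.
interface SelectionOptions {
  scope?: 'one' | 'all';
  limit?: number;
  nthOfType?: number; // ASSUMPTION: 1-based, as in CSS `:nth-of-type`
}

function selectMatches<T>(matches: T[], options: SelectionOptions = {}): T[] {
  const { scope = 'one', limit, nthOfType } = options;

  // ASSUMPTION: `nthOfType` wins over `scope` when both are set.
  if (nthOfType !== undefined) {
    const index = nthOfType - 1;
    return index >= 0 && index < matches.length ? [matches[index]] : [];
  }

  // `scope: 'all'` returns every match, capped at `limit` if one is given.
  if (scope === 'all') {
    return limit === undefined ? matches.slice() : matches.slice(0, limit);
  }

  // Default: only the first match.
  return matches.slice(0, 1);
}

const rows = ['row-1', 'row-2', 'row-3'];
console.log(selectMatches(rows)); // [ 'row-1' ]
console.log(selectMatches(rows, { scope: 'all', limit: 2 })); // [ 'row-1', 'row-2' ]
console.log(selectMatches(rows, { nthOfType: 3 })); // [ 'row-3' ]
```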
### 2. A string query
When a string value is used as the query, that value is treated as the `selector` param.
E.g., if the argument is `'h2'`, it is understood to mean `{ selector: 'h2' }`.
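
A minimal sketch of that rule, assuming a helper that accepts either form. `toQuery` and `QueryLike` are hypothetical names used only for illustration; the package may normalize string queries differently.

```typescript
// Hypothetical sketch: `toQuery` and `QueryLike` are illustrative names, not documented exports.
type QueryLike = { selector: string; attributeName?: string };

// A bare string is treated as the `selector`; an object is passed through unchanged.
function toQuery(input: string | QueryLike): QueryLike {
  return typeof input === 'string' ? { selector: input } : input;
}

console.log(toQuery('h2')); // { selector: 'h2' }
console.log(toQuery({ selector: 'a', attributeName: 'href' }));
```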