# crawler-ts

Lightweight crawler written in TypeScript using ES6 generators.

<a href="https://www.npmjs.com/package/crawler-ts">
  <img alt="npm" src="https://img.shields.io/npm/v/crawler-ts.svg?color=green"/>
</a>
<a href="https://bundlephobia.com/result?p=crawler-ts">
  <img alt="bundle size" src="https://img.shields.io/bundlephobia/minzip/crawler-ts?label=bundle size"/>
</a>
<img alt="license" src="https://img.shields.io/npm/l/crawler-ts?label=license&color=green"/>

## Installation

```sh
npm install --save crawler-ts crawler-ts-htmlparser2
```

## Examples

- [Crawl NASA Mars News](./examples/mars-news/src/index.ts)
- [Crawl Hacker News](./examples/hacker-news/src/index.ts)
- [Crawl the file system](./examples/fs/src/index.ts)

## API

The `createCrawler` function expects the following options as its first parameter.

```typescript
/**
 * @type {L} The type of the locations to crawl, e.g. `URL` or `string` that represents a path.
 * @type {R} The type of the response at the location that is crawled, e.g. Cheerio object, file system `fs.Stats`.
 * @type {P} The intermediate parsed result that can be parsed from the response and generated by the crawler.
 */
interface Options<L, R, P> {
  /**
   * This function should return the response for the given location.
   */
  requester(location: L): ValueOrPromise<R | undefined>;
  /**
   * This function should return true if the crawler should parse the response, or false if not.
   */
  shouldParse(props: PreParseProps<L, R>): ValueOrPromise<boolean>;
  /**
   * This function should parse the response and convert the response to the parsed type.
   */
  parser(props: PreParseProps<L, R>): ValueOrPromise<P | undefined>;
  /**
   * This function should return true if the crawler should yield the parsed result, or false if not.
   */
  shouldYield(props: PostParseProps<L, R, P>): ValueOrPromise<boolean>;
  /**
   * This function should yield all the locations to follow in the given parsed result.
   */
  follower(props: PostParseProps<L, R, P>): AsyncGenerator<L>;
  /**
   * This function should return true if the crawler should queue the location for crawling, or false if not.
   */
  shouldQueue(props: { location: L; origin: L; response: R; parsed: P }): ValueOrPromise<boolean>;
  /**
   * The logger can be set to `console` to output debug information to the `console`.
   *
   * @default undefined
   */
  logger?: Logger;
}

interface PreParseProps<L, R> {
  location: L;
  response: R;
}

interface PostParseProps<L, R, P> extends PreParseProps<L, R> {
  parsed: P;
}

type ValueOrPromise<T> = T | Promise<T>;
```

Built-in modules are available that implement some of these options. See the [Modules](#modules) section.

## Modules

### crawler-ts-fetch

<p>
  <a href="https://www.npmjs.com/package/crawler-ts-fetch">
    <img alt="npm" src="https://img.shields.io/npm/v/crawler-ts-fetch.svg?color=green"/>
  </a>
  <a href="https://bundlephobia.com/result?p=crawler-ts-fetch">
    <img alt="bundle size" src="https://img.shields.io/bundlephobia/minzip/crawler-ts-fetch?label=bundle size"/>
  </a>
</p>

This module implements a `requester` that uses `node-fetch` to request content over HTTP.

See [modules/crawler-ts-fetch](./modules/crawler-ts-fetch).

### crawler-ts-htmlparser2

<p>
  <a href="https://www.npmjs.com/package/crawler-ts-htmlparser2">
    <img alt="npm" src="https://img.shields.io/npm/v/crawler-ts-htmlparser2.svg?color=green"/>
  </a>
  <a href="https://bundlephobia.com/result?p=crawler-ts-htmlparser2">
    <img alt="bundle size" src="https://img.shields.io/bundlephobia/minzip/crawler-ts-htmlparser2?label=bundle size"/>
  </a>
</p>

This module implements a `requester`, `parser` and `follower` for HTML. The `requester` uses `crawler-ts-fetch` to request content over HTTP. The `parser` uses `htmlparser2` to parse HTML files. The `follower` uses the parser result to find `<a>` anchor elements and yields their `href` properties.

See [modules/crawler-ts-htmlparser2](./modules/crawler-ts-htmlparser2).

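The division of labour between `requester`, `parser` and `follower` can be illustrated without any network access. The sketch below is not the `crawler-ts` implementation and does not call `createCrawler`: the `pages` map, the regex-based `parser` (standing in for `htmlparser2`), and the `crawlSketch` and `collectLocations` helpers are all hypothetical names introduced for illustration. It only assumes the option shapes documented in the API section.

```typescript
// Hypothetical sketch: mirrors the documented option shapes only.
// It is NOT how crawler-ts or crawler-ts-htmlparser2 are implemented.

// Assumed in-memory "site" that stands in for HTTP responses.
const pages: Record<string, string> = {
  "/": '<a href="/a">A</a> <a href="/b">B</a>',
  "/a": '<a href="/">home</a> <a href="/b">B</a>',
  "/b": "<p>no links here</p>",
};

interface Parsed {
  hrefs: string[];
}

// requester: return the response for a location, or undefined if there is none.
async function requester(location: string): Promise<string | undefined> {
  return pages[location];
}

// parser: convert the raw response into an intermediate parsed result.
// A simple regex replaces htmlparser2 to keep the sketch dependency-free.
function parser(props: { location: string; response: string }): Parsed {
  const hrefs: string[] = [];
  const pattern = /<a\s[^>]*href="([^"]*)"/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(props.response)) !== null) {
    hrefs.push(match[1]);
  }
  return { hrefs };
}

// follower: yield every location found in the parsed result.
async function* follower(props: {
  location: string;
  response: string;
  parsed: Parsed;
}): AsyncGenerator<string> {
  for (const href of props.parsed.hrefs) {
    yield href;
  }
}

// Hypothetical driver: breadth-first, queues each location at most once.
async function* crawlSketch(
  start: string
): AsyncGenerator<{ location: string; parsed: Parsed }> {
  const seen = new Set<string>([start]);
  const queue: string[] = [start];
  while (queue.length > 0) {
    const location = queue.shift() as string;
    const response = await requester(location);
    if (response === undefined) continue;
    const parsed = parser({ location, response });
    yield { location, parsed };
    for await (const next of follower({ location, response, parsed })) {
      // This check plays the role that `shouldQueue` has in the options.
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
}

// Convenience helper that drains the generator into an array of locations.
async function collectLocations(start: string): Promise<string[]> {
  const visited: string[] = [];
  for await (const result of crawlSketch(start)) {
    visited.push(result.location);
  }
  return visited;
}
```

With the sample data, `collectLocations("/")` resolves to `["/", "/a", "/b"]`: each page is requested once, even though `/a` links back to `/`. Keeping the "already seen" decision outside `follower` matches the documented split, where `follower` only reports locations and `shouldQueue` decides whether to crawl them.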
### crawler-ts-fs

<p>
  <a href="https://www.npmjs.com/package/crawler-ts-fs">
    <img alt="npm" src="https://img.shields.io/npm/v/crawler-ts-fs.svg?color=green"/>
  </a>
  <a href="https://bundlephobia.com/result?p=crawler-ts-fs">
    <img alt="bundle size" src="https://img.shields.io/bundlephobia/minzip/crawler-ts-fs?label=bundle size"/>
  </a>
</p>

This module implements a `requester`, `parser` and `follower` for the file system. The `requester` uses `fs.stat` to request file information. The `parser` by default just returns the response from the `requester`. The `follower` follows directories.

## Author

Gillis Van Ginderachter

## License

GNU General Public License v3.0