# commoncrawl
A Node.js client for the commoncrawl.org index
### Install
`commoncrawl` is available on `npm` and can be installed with a single command.
To install `commoncrawl` and add it to your `package.json` file, run:
```sh
$ npm install --save commoncrawl
```
This library can be used to get information about a range of archive captures/mementos, with support for filtering, sorting, and pagination for bulk queries. The actual archive (WARC/ARC) files are not loaded during a query, only the generated CDX index.
### Usage
Once installed, require `commoncrawl` in your application:
```js
const commoncrawl = require('commoncrawl')
```
The `getIndex` method returns the list of available CommonCrawl indexes as JSON:
```js
commoncrawl.getIndex().then((data) => {
  console.log(data);
});
```
The `searchURL` method searches the index and returns a list of matches for the given URL:
```js
commoncrawl.searchURL('example.com')
  .then((data) => {
    console.log(data);
  });
```
You will get back a JSON array similar to the following:
```json
[{
  "urlkey": "com,example)/your/path.html",
  "timestamp": "20190719170504",
  "url": "https://www.example.com/your/path.html",
  "mime": "text/html",
  "filename": "crawl-data/CC-MAIN-2019-30/segments/1563195526324.57/crawldiagnostics/CC-MAIN-20190719161034-20190719183034-00145.warc.gz",
  "length": "1177",
  "offset": "24936937",
  "mime-detected": "text/html",
  "digest": "B2LTWWPUOYAH7UIPQ7ZUPQ4VMBSVC36A",
  "status": "404"
}]
```
*Note: by default the latest index is used to fetch data. You can override this, and other search behavior, by passing an `options` object as the second argument to `searchURL`.*
### Optional Settings
### `index`
By default the latest index is used (`CC-MAIN-2019-30-index` at the time of writing).
You can change this by setting the `index` option.
A JSON array of available indexes can be fetched with the `getIndex()` method.
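For example, a minimal sketch of querying a specific crawl instead of the default (the index id `CC-MAIN-2019-22-index` is assumed for illustration; use an id returned by `getIndex()`):
```js
const commoncrawl = require('commoncrawl')

// Search an older crawl by overriding the `index` option.
// 'CC-MAIN-2019-22-index' is a hypothetical id; pick a real one from getIndex().
commoncrawl.searchURL('example.com', { index: 'CC-MAIN-2019-22-index' })
  .then((data) => {
    console.log(data);
  });
```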
### `from` / `to`
Setting the `from` or `to` option to a timestamp `<ts>` will restrict the results to the given date/time range (inclusive).
Timestamps may be up to 14 digits (`yyyyMMddhhmmss`) and will be padded to the lower bound for `from` and the upper bound for `to`.
For example, `from: '2019', to: '2019'` will return results for `example.com` that
have a timestamp between `20190101000000` and `20191231235959`.
*Available from pywb 0.10.9*
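A minimal sketch of the date-restricted query described above:
```js
const commoncrawl = require('commoncrawl')

// Restrict results to captures made during 2019:
// '2019' is padded to 20190101000000 (from) and 20191231235959 (to).
commoncrawl.searchURL('example.com', { from: '2019', to: '2019' })
  .then((data) => {
    console.log(data);
  });
```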
### `matchType`
The CommonCrawl server supports the following `matchType` options:
- `exact` -- the default; returns captures that match the URL exactly
- `prefix` -- returns captures whose URL begins with the specified path, e.g. `http://example.com/path/*`
- `host` -- returns captures from the host of the given URL (any path segment is ignored)
- `domain` -- returns captures for the given host and all subdomains, e.g. `*.example.com`
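As a sketch, a prefix query mirroring the `http://example.com/path/*` example above (the exact URL form the server accepts may vary):
```js
const commoncrawl = require('commoncrawl')

// Return captures whose URL begins with the given path.
commoncrawl.searchURL('example.com/path/', { matchType: 'prefix' })
  .then((data) => {
    console.log(data);
  });
```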
### `limit`
Setting `limit` will cap the number of index lines returned. `limit` must be a positive integer. If no limit is provided, all matching lines are returned, which may be slow.
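A minimal sketch capping the result set:
```js
const commoncrawl = require('commoncrawl')

// Return at most 10 matching index lines.
commoncrawl.searchURL('example.com', { limit: 10 })
  .then((data) => {
    console.log(data);
  });
```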
### `output` (Default JSON output)
While the CommonCrawl API returns plain text with one JSON object per line, the library converts the response into a well-formatted JSON array.
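For illustration, the promise resolves with a parsed array rather than raw newline-delimited text:
```js
const commoncrawl = require('commoncrawl')

// Raw API output is one JSON object per line; the library parses it,
// so the resolved value is a plain array of capture objects.
commoncrawl.searchURL('example.com').then((data) => {
  console.log(Array.isArray(data)); // true
  console.log(data.length);
});
```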
### `page`
`page` is the current page number and defaults to 0 if omitted. If `page` exceeds the number of available pages (as reported by the page count query), a 400 error will be returned.
### `pageSize`
`pageSize` is an optional parameter which can increase or decrease the amount of data returned in each page.
The default value is configuration dependent.
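A minimal sketch fetching the second page with a custom page size (pages are zero-based, as described above):
```js
const commoncrawl = require('commoncrawl')

// Fetch the second page (page numbers start at 0) with 5 blocks per page.
commoncrawl.searchURL('example.com', { page: 1, pageSize: 5 })
  .then((data) => {
    console.log(data);
  });
```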
### `showNumPages: true|false`
When `showNumPages` is `true`, the query returns only pagination metadata (blocks, pages, and pageSize) instead of captures:
```json
{"blocks": 423, "pages": 85, "pageSize": 5}
```
In this result:
- `pages` is the total number of pages available for this query. The `page` parameter may be between 0 and `pages - 1`
- `pageSize` is the total number of ZipNum compressed blocks that are read for each page.
- `blocks` is the actual number of compressed blocks that match the query. This can be used to quickly estimate the total number of captures, within a margin of error. In general, `blocks / pageSize + 1 = pages` (since there is always at least 1 page even if `blocks < pageSize`)
If changing `pageSize`, the same value should be used for both the `showNumPages` query and the regular paged query.
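For example, a minimal sketch (assuming `showNumPages: true` resolves with the metadata object shown above):
```js
const commoncrawl = require('commoncrawl')

const pageSize = 5;

// First ask how many pages exist for this pageSize...
commoncrawl.searchURL('example.com', { showNumPages: true, pageSize })
  .then((info) => {
    console.log(info); // e.g. {"blocks": 423, "pages": 85, "pageSize": 5}
    // ...then fetch a page using the same pageSize.
    return commoncrawl.searchURL('example.com', { page: 0, pageSize });
  })
  .then((data) => {
    console.log(data);
  });
```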
### Example with Options
```js
const commoncrawl = require('commoncrawl')

let options = {
  index: 'CC-MAIN-2019-30-index',
  from: '2018',
  to: '2019',
  matchType: 'domain', // exact, prefix, host, domain
  limit: 100,
  page: 1,
  pageSize: 100,
  showNumPages: false,
}

commoncrawl.searchURL('example.com', options)
  .then((data) => {
    console.log(data);
  });
```