# broken-link-checker [![NPM Version][npm-image]][npm-url] [![Build Status][travis-image]][travis-url] [![Dependency Status][david-image]][david-url]
> Find broken links, missing images, etc in your HTML.
Features:
* Stream-parses local and remote HTML pages
* Concurrently checks multiple links
* Supports various HTML elements/attributes, not just `<a href>`
* Supports redirects, absolute URLs, relative URLs and `<base>`
* Honors robot exclusions
* Provides detailed information about each link (HTTP and HTML)
* URL keyword filtering with wildcards
* Pause/Resume at any time
## Installation
[Node.js](http://nodejs.org/) `>= 0.10` is required; `< 4.0` will need `Promise` and `Object.assign` polyfills.
There are two ways to use it:
### Command Line Usage
To install, type this at the command line:
```shell
npm install broken-link-checker -g
```
After that, check out the help for available options:
```shell
blc --help
```
A typical site-wide check might look like:
```shell
blc http://yoursite.com -ro
```
### Programmatic API
To install, type this at the command line:
```shell
npm install broken-link-checker
```
The rest of this document describes how to use the API.
## Classes
### `blc.HtmlChecker(options, handlers)`
Scans an HTML document to find broken links.
* `handlers.complete` is fired after the last result, or on zero results.
* `handlers.html` is fired after the HTML document has been fully parsed.
  * `tree` is supplied by [parse5](https://npmjs.com/parse5).
  * `robots` is an instance of [robot-directives](https://npmjs.com/robot-directives) containing any `<meta>` robot exclusions.
* `handlers.junk` is fired with data on each skipped link, as configured in options.
* `handlers.link` is fired with the result of each discovered link (broken or not).
* `.clearCache()` will remove any cached URL responses. This is only relevant if the `cacheResponses` option is enabled.
* `.numActiveLinks()` returns the number of links with active requests.
* `.numQueuedLinks()` returns the number of links that currently have no active requests.
* `.pause()` will pause the internal link queue, but will not pause any active requests.
* `.resume()` will resume the internal link queue.
* `.scan(html, baseUrl)` parses & scans a single HTML document. Returns `false` when there is a previously incomplete scan (and `true` otherwise).
  * `html` can be a stream or a string.
  * `baseUrl` is the address to which all relative URLs will be made absolute. Without a value, links to relative URLs will output an "Invalid URL" error.
```js
var htmlChecker = new blc.HtmlChecker(options, {
    html: function(tree, robots){},
    junk: function(result){},
    link: function(result){},
    complete: function(){}
});
htmlChecker.scan(html, baseUrl);
```
### `blc.HtmlUrlChecker(options, handlers)`
Scans the HTML content at each queued URL to find broken links.
* `handlers.end` is fired when the end of the queue has been reached.
* `handlers.html` is fired after a page's HTML document has been fully parsed.
  * `tree` is supplied by [parse5](https://npmjs.com/parse5).
  * `robots` is an instance of [robot-directives](https://npmjs.com/robot-directives) containing any `<meta>` and `X-Robots-Tag` robot exclusions.
* `handlers.junk` is fired with data on each skipped link, as configured in options.
* `handlers.link` is fired with the result of each discovered link (broken or not) within the current page.
* `handlers.page` is fired after a page's last result, on zero results, or if the HTML could not be retrieved.
* `.clearCache()` will remove any cached URL responses. This is only relevant if the `cacheResponses` option is enabled.
* `.dequeue(id)` removes a page from the queue. Returns `true` on success or an `Error` on failure.
* `.enqueue(pageUrl, customData)` adds a page to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success or an `Error` on failure.
  * `customData` is optional data that is stored in the queue item for the page.
* `.numActiveLinks()` returns the number of links with active requests.
* `.numPages()` returns the total number of pages in the queue.
* `.numQueuedLinks()` returns the number of links that currently have no active requests.
* `.pause()` will pause the queue, but will not pause any active requests.
* `.resume()` will resume the queue.
```js
var htmlUrlChecker = new blc.HtmlUrlChecker(options, {
    html: function(tree, robots, response, pageUrl, customData){},
    junk: function(result, customData){},
    link: function(result, customData){},
    page: function(error, pageUrl, customData){},
    end: function(){}
});
htmlUrlChecker.enqueue(pageUrl, customData);
```
### `blc.SiteChecker(options, handlers)`
Recursively scans (crawls) the HTML content at each queued URL to find broken links.
* `handlers.end` is fired when the end of the queue has been reached.
* `handlers.html` is fired after a page's HTML document has been fully parsed.
  * `tree` is supplied by [parse5](https://npmjs.com/parse5).
  * `robots` is an instance of [robot-directives](https://npmjs.com/robot-directives) containing any `<meta>` and `X-Robots-Tag` robot exclusions.
* `handlers.junk` is fired with data on each skipped link, as configured in options.
* `handlers.link` is fired with the result of each discovered link (broken or not) within the current page.
* `handlers.page` is fired after a page's last result, on zero results, or if the HTML could not be retrieved.
* `handlers.robots` is fired after a site's robots.txt has been downloaded and provides an instance of [robots-txt-guard](https://npmjs.com/robots-txt-guard).
* `handlers.site` is fired after a site's last result, on zero results, or if the *initial* HTML could not be retrieved.
* `.clearCache()` will remove any cached URL responses. This is only relevant if the `cacheResponses` option is enabled.
* `.dequeue(id)` removes a site from the queue. Returns `true` on success or an `Error` on failure.
* `.enqueue(siteUrl, customData)` adds [the first page of] a site to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success or an `Error` on failure.
  * `customData` is optional data that is stored in the queue item for the site.
* `.numActiveLinks()` returns the number of links with active requests.
* `.numPages()` returns the total number of pages in the queue.
* `.numQueuedLinks()` returns the number of links that currently have no active requests.
* `.numSites()` returns the total number of sites in the queue.
* `.pause()` will pause the queue, but will not pause any active requests.
* `.resume()` will resume the queue.
**Note:** `options.filterLevel` is used for determining which links are recursive.
```js
var siteChecker = new blc.SiteChecker(options, {
    robots: function(robots, customData){},
    html: function(tree, robots, response, pageUrl, customData){},
    junk: function(result, customData){},
    link: function(result, customData){},
    page: function(error, pageUrl, customData){},
    site: function(error, siteUrl, customData){},
    end: function(){}
});
siteChecker.enqueue(siteUrl, customData);
```
### `blc.UrlChecker(options, handlers)`
Requests each queued URL to determine if it is broken.
* `handlers.end` is fired when the end of the queue has been reached.
* `handlers.link` is fired for each result (broken or not).
* `.clearCache()` will remove any cached URL responses. This is only relevant if the `cacheResponses` option is enabled.
* `.dequeue(id)` removes a URL from the queue. Returns `true` on success or an `Error` on failure.
* `.enqueue(url, baseUrl, customData)` adds a URL to the queue. Queue items are auto-dequeued when their requests are completed. Returns a queue ID on success or an `Error` on failure.
  * `baseUrl` is the address to which all relative URLs will be made absolute. Without a value, links to relative URLs will output an "Invalid URL" error.
  * `customData` is optional data that is stored in the queue item for the URL.
* `.numActiveLinks()` returns the number of links with active requests.
* `.numQueuedLinks()` returns the number of links that currently have no active requests.
* `.pause()` will pause the queue, but will not pause any active requests.
* `.resume()` will resume the queue.
```js
var urlChecker = new blc.UrlChecker(options, {
    link: function(result, customData){},
    end: function(){}
});
urlChecker.enqueue(url, baseUrl, customData);
```
## Options
### `options.acceptedSchemes`
Type: `Array`
Default value: `["http","https"]`
Will only check links with schemes/protocols mentioned in this list. Any others (except those in `excludedSchemes`) will output an "Invalid URL" error.
### `options.cacheExpiryTime`
Type: `Number`
Default Value: `3600000` (1 hour)
The number of milliseconds in which a cached response should be considered valid. This is only relevant if the `cacheResponses` option is enabled.
### `options.cacheResponses`
Type: `Boolean`
Default Value: `true`
URL request results will be cached when `true`. This will ensure that each unique URL will only be checked once.
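As an illustration, a hypothetical options object that enables the cache but expires entries after 10 minutes instead of the default hour:

```js
// Hypothetical configuration: cache URL responses, but only for 10 minutes.
var options = {
    cacheResponses: true,
    cacheExpiryTime: 10 * 60 * 1000  // milliseconds
};
```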
### `options.excludedKeywords`
Type: `Array`
Default value: `[]`
Will not check or output links that match the keywords and glob patterns in this list. The only wildcard supported is `*`.
This option does *not* apply to `UrlChecker`.
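The `*` wildcard behaves like a simple glob. A minimal sketch of such matching (not the library's actual implementation; the URLs and patterns are made up for illustration):

```js
// Sketch of "*"-only glob matching: escape regex metacharacters,
// then translate each "*" into ".*" and anchor the pattern.
function matchesKeyword(url, keyword) {
    var pattern = keyword
        .replace(/[.+?^${}()|[\]\\]/g, "\\$&")
        .replace(/\*/g, ".*");
    return new RegExp("^" + pattern + "$").test(url);
}

console.log(matchesKeyword("http://example.com/archive/2015", "*/archive/*"));
//=> true
console.log(matchesKeyword("http://example.com/blog", "*/archive/*"));
//=> false
```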
### `options.excludedSchemes`
Type: `Array`
Default value: `["data","geo","javascript","mailto","sms","tel"]`
Will not check or output links with schemes/protocols mentioned in this list. This avoids the output of "Invalid URL" errors with links that cannot be checked.
This option does *not* apply to `UrlChecker`.
### `options.excludeExternalLinks`
Type: `Boolean`
Default value: `false`
Will not check or output external links when `true`; this includes relative links resolved against a remote `<base>`.
This option does *not* apply to `UrlChecker`.
### `options.excludeInternalLinks`
Type: `Boolean`
Default value: `false`
Will not check or output internal links when `true`.
This option does *not* apply to `UrlChecker`, nor to `SiteChecker`'s *crawler*.
### `options.excludeLinksToSamePage`
Type: `Boolean`
Default value: `true`
Will not check or output links to the same page when `true`; this includes relative and absolute fragments/hashes.
This option does *not* apply to `UrlChecker`.
### `options.filterLevel`
Type: `Number`
Default value: `1`
The tags and attributes that are considered links for checking, split into the following levels:
* `0`: clickable links
* `1`: clickable links, media, iframes, meta refreshes
* `2`: clickable links, media, iframes, meta refreshes, stylesheets, scripts, forms
* `3`: clickable links, media, iframes, meta refreshes, stylesheets, scripts, forms, metadata
Recursive links have a slightly different filter subset. To see the exact breakdown of both, check out the [tag map](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/tags.js). `<base>` is not listed because it is not a link, though it is always parsed.
This option does *not* apply to `UrlChecker`.
### `options.honorRobotExclusions`
Type: `Boolean`
Default value: `true`
Will not scan pages that search engine crawlers would not follow. Such exclusions may be specified with any of the following:
* `<a rel="nofollow" href="…">`
* `<area rel="nofollow" href="…">`
* `<meta name="robots" content="noindex,nofollow,…">`
* `<meta name="googlebot" content="noindex,nofollow,…">`
* `<meta name="robots" content="unavailable_after: …">`
* `X-Robots-Tag: noindex,nofollow,…`
* `X-Robots-Tag: googlebot: noindex,nofollow,…`
* `X-Robots-Tag: otherbot: noindex,nofollow,…`
* `X-Robots-Tag: unavailable_after: …`
* robots.txt
This option does *not* apply to `UrlChecker`.
### `options.maxSockets`
Type: `Number`
Default value: `Infinity`
The maximum number of links to check at any given time.
### `options.maxSocketsPerHost`
Type: `Number`
Default value: `1`
The maximum number of links per host/port to check at any given time. This avoids overloading a single target host with too many concurrent requests. This will not limit concurrent requests to other hosts.
### `options.rateLimit`
Type: `Number`
Default value: `0`
The number of milliseconds to wait before each request.
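These three options can be combined to throttle requests politely. A hypothetical configuration:

```js
// Hypothetical throttling configuration: at most 10 concurrent requests
// overall, 1 per host, and a 500 ms delay before each request.
var options = {
    maxSockets: 10,
    maxSocketsPerHost: 1,
    rateLimit: 500
};
```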
### `options.requestMethod`
Type: `String`
Default value: `"head"`
The HTTP request method used in checking links. If you experience problems, try using `"get"`; however, `options.retry405Head` should have you covered.
### `options.retry405Head`
Type: `Boolean`
Default value: `true`
Some servers do not respond correctly to a `"head"` request method. When `true`, a link resulting in an HTTP 405 "Method Not Allowed" error will be re-requested using a `"get"` method before deciding that it is broken.
### `options.userAgent`
Type: `String`
Default value: `"broken-link-checker/0.7.0 Node.js/5.5.0 (OS X El Capitan; x64)"` (or similar)
The HTTP user-agent to use when checking links as well as retrieving pages and robot exclusions.
## Handling Broken/Excluded Links
A broken link will have a `broken` value of `true` and a reason code defined in `brokenReason`. A link that was not checked (emitted as `"junk"`) will have an `excluded` value of `true` and a reason code defined in `excludedReason`.
```js
if (result.broken) {
    console.log(result.brokenReason);
    //=> HTTP_404
} else if (result.excluded) {
    console.log(result.excludedReason);
    //=> BLC_ROBOTS
}
```
Additionally, more descriptive messages are available for each reason code:
```js
console.log(blc.BLC_ROBOTS); //=> Robots Exclusion
console.log(blc.ERRNO_ECONNRESET); //=> connection reset by peer (ECONNRESET)
console.log(blc.HTTP_404); //=> Not Found (404)
// List all
console.log(blc);
```
Putting it all together:
```js
if (result.broken) {
    console.log(blc[result.brokenReason]);
} else if (result.excluded) {
    console.log(blc[result.excludedReason]);
}
```
## HTML and HTTP information
Detailed information for each link result is provided. Check out the [schema](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/linkObj.js#L16-L64) or:
```js
console.log(result);
```
## Roadmap Features
* fix issue where same-page links are not excluded when cache is enabled, despite `excludeLinksToSamePage===true`
* publicize filter handlers
* add cheerio support by using parse5's htmlparser2 tree adaptor?
* add `rejectUnauthorized:false` option to avoid `UNABLE_TO_VERIFY_LEAF_SIGNATURE`
* load sitemap.xml at end of each `SiteChecker` site to possibly check pages that were not linked to
* remove `options.excludedSchemes` and handle schemes not in `options.acceptedSchemes` as junk?
* change order of checking to: tcp error, 4xx code (broken), 5xx code (undetermined), 200
* abort download of body when `options.retry405Head===true`
* option to retry broken links a number of times (default=0)
* option to scrape `response.body` for erroneous sounding text (using [fathom](https://npmjs.com/fathom-web)?), since an error page could be presented but still have code 200
* option to check broken link on archive.org for archived version (using [this lib](https://npmjs.com/archive.org))
* option to run `HtmlUrlChecker` checks on page load (using [jsdom](https://npmjs.com/jsdom)) to include links added with JavaScript?
* option to check if hashes exist in target URL document?
* option to parse Markdown in `HtmlChecker` for links
* option to play sound when broken link is found
* option to hide unbroken links
* option to check plain text URLs
* add throttle profiles (0–9, -1 for "custom") for easy configuring
* check [ftp:](https://npmjs.com/ftp), [sftp:](https://npmjs.com/ssh2) (for downloadable files)
* check ~~mailto:~~, news:, nntp:, telnet:?
* check local files if URL is relative and has no base URL?
* cli json mode -- streamed or not?
* cli non-tty mode -- change nesting ASCII artwork to time stamps?
[npm-image]: https://img.shields.io/npm/v/broken-link-checker.svg
[npm-url]: https://npmjs.org/package/broken-link-checker
[travis-image]: https://img.shields.io/travis/stevenvachon/broken-link-checker.svg
[travis-url]: https://travis-ci.org/stevenvachon/broken-link-checker
[david-image]: https://img.shields.io/david/stevenvachon/broken-link-checker.svg
[david-url]: https://david-dm.org/stevenvachon/broken-link-checker