@algolia/404-crawler
Version:
Detect 404/Not found pages from sitemap
115 lines (75 loc) • 3.3 kB
Markdown
# 404 Crawler 🏊♂️
A command line interface to crawl and detect 404 pages from sitemap.

## 📊 Usage
### Install
> Make sure npm is installed in your computer. To know more about it, visit https://docs.npmjs.com/downloading-and-installing-node-js-and-npm
In a terminal, run
```sh
npm install -g @algolia/404-crawler
```
After that, you'll be able to use the command `404crawler` in your terminal
### Examples
- Crawl and detect every 404 pages from algolia website's sitemap:
```sh
404crawler crawl -u https://algolia.com/sitemap.xml
```
- Use JavaScript rendering to crawl and identify all 404 or 'Not Found' pages on the Algolia website.
```sh
404crawler crawl -u https://algolia.com/sitemap.xml --render-js
```
- Crawl and identify all 404 pages on the Algolia website by analyzing its sitemap, including all potential sub-path variations
```sh
404crawler crawl -u https://algolia.com/sitemap.xml --include-variations
```
### Options
- `--sitemap-url` or `-u`:
**Required** URL of the `sitemap.xml` file.
- `--render-js` or `-r`:
Use JavaScript rendering to crawl and identify a 'Not Found Page' if the status code isn't a 404. This option is useful for websites that returns a 200 status code even if the page is not found (Next.js with custom not found page for example)
- `--output` or `-o`:
Ouput path for the JSON file of the results. Example: `crawler/results.json`. If not set, no file is written after the crawl.
- `--include-variations` or `-v`:
Include all sub-path variations from URLs found in the `sitemap.xml`.
For example, if https://algolia.com/foo/bar/baz is found in the sitemap, the crawler will test https://algolia.com/foo/bar/baz, https://algolia.com/foo/bar, https://algolia.com/foo and https://algolia.com
- `--exit-on-detection` or `-e`:
Exit when a 404 or a 'Not Found' page is detected.
- `--run-in-parallel` or `-p`:
Run the crawler with multiple pages in parallel. By default, the number of parallel instances is set to 10. See `--batch-size` option to configure this number.
- `--batch-size` or `-s`:
Number or parallel instances of crawler to run: the more this number is, the more resources are consumed. Only available when `--run-in-parallel` option is set. If not set, default to 10.
- `--browser-type` or `-b`:
Type of the browser to use to crawl pages. Can be 'firefox', 'chromium' or 'webkit'. If not set, default to 'firefox'.
## 👨💻 Get started (maintainers)
This CLI is built with TypeScript and uses `ts-node` to run the code locally.
### Install
Install all dependencies
```sh
pnpm i
```
### Run locally
```
pnpm 404crawler crawl <options>
```
### Deploy
1. Update `package.json` version
2. Commit and push changes
3. Build JS files in `dist/` with
```sh
pnpm build
```
4. Initialize npm with Algolia org as scope
```sh
npm init --scope=algolia
```
5. Follow instructions
6. Publish package with
```sh
npm publish
```
## 🔗 References
This package uses:
- [TypeScript](https://www.typescriptlang.org/)
- [Commander.js](https://github.com/tj/commander.js)
- [Playwright](https://playwright.dev/)
- [sitemapper](https://github.com/seantomburke/sitemapper)