# sitemap-xml-parser
Parses sitemap XML files and returns all listed URLs. Can be used as a CLI tool or a Node.js library.
- Follows sitemap index files recursively and decompresses gzip automatically
- Supports custom request headers, concurrency control, and request timeouts
- **CLI:** outputs plain URLs, TSV, or JSON with configurable field selection (`--fields`)
- **CLI:** filters URLs by substring or regular expression
## Installation
```sh
npm install sitemap-xml-parser
```
## CLI
Run without installing via `npx`:
```sh
npx sitemap-xml-parser <url> [options]
```
Or, after installing globally (`npm install -g sitemap-xml-parser`):
```sh
sitemap-xml-parser <url> [options]
```
Fetched URLs are printed to stdout, one per line. Errors are printed to stderr. See [Options](#options) for available flags.
## Examples
```sh
# Print all URLs
npx sitemap-xml-parser https://example.com/sitemap.xml
# Save URLs to a file, errors to a log
npx sitemap-xml-parser https://example.com/sitemap.xml > urls.txt 2> errors.log
# Count URLs
npx sitemap-xml-parser https://example.com/sitemap.xml --count
# Stop after 100 entries
npx sitemap-xml-parser https://example.com/sitemap.xml --cap 100
# Filter and count
npx sitemap-xml-parser https://example.com/sitemap.xml --filter "blog" --count
# Filter by regular expression
npx sitemap-xml-parser https://example.com/sitemap.xml --filter-regex "blog/[0-9]{4}/"
# Output as TSV (loc, lastmod, changefreq, priority)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv
```
<details>
<summary>CLI: discovering and outputting additional fields</summary>
Some sitemaps include extension fields such as `image:image` or `news:news` beyond the four standard ones (`loc`, `lastmod`, `changefreq`, `priority`). To include those fields in your output, run `--list-fields` first to see what's available.
```sh
# Output as JSON with all fields (all fields present in the source XML are included by default)
npx sitemap-xml-parser https://example.com/sitemap.xml --format json
# Discover all fields present in a sitemap
npx sitemap-xml-parser https://example.com/sitemap.xml --list-fields
# Output as TSV with custom columns (e.g. image sitemap extension)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv --fields loc,image:image
# Output as TSV with all fields (fetches twice: once to discover fields, once to output)
npx sitemap-xml-parser https://example.com/sitemap.xml --format tsv \
--fields "$(npx sitemap-xml-parser https://example.com/sitemap.xml --list-fields | paste -sd, -)"
```
</details>
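Conceptually, `--list-fields` collects the union of keys seen across every entry. A minimal sketch of that union step in plain JavaScript (the `entries` array here is a hypothetical parsed result, not the library's internal representation):

```javascript
// Collect the union of field names across all parsed entries,
// which is what --list-fields does conceptually.
function listFields(entries) {
  const fields = new Set();
  for (const entry of entries) {
    for (const key of Object.keys(entry)) fields.add(key);
  }
  return [...fields];
}

// Hypothetical entries: one with an image extension field, one without.
const entries = [
  { loc: 'https://example.com/a', lastmod: '2024-01-01' },
  { loc: 'https://example.com/b', 'image:image': { 'image:loc': 'https://example.com/b.jpg' } },
];
console.log(listFields(entries)); // [ 'loc', 'lastmod', 'image:image' ]
```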
## Options
### CLI
| Flag | Default | Description |
|-------------------------|---------|-----------------------------------------------------------------------------|
| `--delay <ms>` | `1000` | Milliseconds to wait between batches when following a sitemap index. `--limit` URLs are fetched in parallel per batch; after each batch completes, the process waits `--delay` ms before starting the next. Set to `0` to disable. |
| `--limit <n>` | `10` | Number of child sitemaps to fetch concurrently per batch. |
| `--timeout <ms>` | `30000` | Milliseconds before a request is aborted. |
| `--cap <n>` | — | Stop collecting after this many URL entries. Useful for sampling large sitemaps. |
| `--header <Name: Value>`| — | Add a request header. Repeatable. Single: `--header "User-Agent: MyBot/1.0"`. Multiple: `--header "User-Agent: MyBot/1.0" --header "Authorization: Bearer token"` |
| `--filter <str>` | — | Only output URLs whose `loc` contains the given string (substring match). Can be combined with `--count` or `--format`. |
| `--filter-regex <regex>`| — | Only output URLs whose `loc` matches the given regular expression. Invalid patterns exit non-zero. Can be combined with `--count` or `--format`. |
| `--format <fmt>` | — | Output format: `tsv` prints a header row followed by one tab-separated row per entry; `json` outputs a JSON array of entry objects including all fields from the source XML. |
| `--fields <f1,f2,...>` | — | Comma-separated list of fields to include in the output. Requires `--format`. For `tsv`, defaults to `loc,lastmod,changefreq,priority`. For `json`, defaults to all fields. Nested values are serialized as JSON in TSV output. |
| `--list-fields` | — | Print all field names found across every entry, one per line. Scans the entire sitemap and outputs the union of all keys seen. Useful for discovering available fields before using `--fields`. Compatible with `--filter` and `--filter-regex`. Cannot be combined with `--format`, `--fields`, `--cap`, or `--count`. |
| `--count` | — | Print only the total number of URLs. |
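The interaction between `--limit` and `--delay` can be sketched as a simple batched loop. This is an illustrative model of the documented behavior, not the package's actual implementation; `fetchOne` is a hypothetical per-URL fetch function:

```javascript
// Fetch child sitemaps in batches of `limit`, pausing `delay` ms between batches.
async function fetchInBatches(urls, fetchOne, { limit = 10, delay = 1000 } = {}) {
  const results = [];
  for (let i = 0; i < urls.length; i += limit) {
    // All URLs within a batch are fetched in parallel.
    const batch = urls.slice(i, i + limit);
    results.push(...await Promise.all(batch.map(fetchOne)));
    // Wait between batches, but not after the final one.
    if (delay > 0 && i + limit < urls.length) {
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  return results;
}
```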
### Library
| Option | Type | Default | Description |
|-----------|------------|---------|------------------------------------|
| `delay` | `number` | `1000` | Same as `--delay`. |
| `limit` | `number` | `10` | Same as `--limit`. |
| `timeout` | `number` | `30000` | Same as `--timeout`. |
| `cap` | `number` | — | Same as `--cap`. |
| `headers` | `object` | — | Key-value map of request headers. Same as repeated `--header`. |
| `onError` | `function` | — | Called as `onError(url, error)` when a fetch or parse fails. The failing URL is skipped whether or not a callback is provided. |
| `onEntry` | `function` | — | Called as `onEntry(entry)` each time a URL entry is parsed. `entry` has the same shape as the objects returned by `fetch()`. |
## Features
- Follows sitemap index files recursively, including nested indexes (an index that lists other indexes)
- Automatically decompresses gzip: supports both `.gz` URLs and `Content-Encoding: gzip` responses
- Batch processing: fetches `limit` child sitemaps in parallel per batch, then waits `delay` ms after each batch completes
- Automatically follows redirects (301/302/303/307/308) up to 5 hops; errors beyond that are reported via `onError`. Custom request headers are forwarded only when the redirect stays on the same origin (same scheme, host, and port); they are stripped on cross-origin redirects.
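The same-origin rule for forwarding headers across redirects can be illustrated with the WHATWG `URL` API. This is a sketch of the documented policy, not the library's code:

```javascript
// A URL's `origin` is its scheme + host + port, so comparing origins
// implements the "same scheme, host, and port" rule directly.
function shouldForwardHeaders(fromUrl, toUrl) {
  return new URL(fromUrl).origin === new URL(toUrl).origin;
}

shouldForwardHeaders('https://example.com/sitemap.xml', 'https://example.com/s2.xml');  // true
shouldForwardHeaders('https://example.com/sitemap.xml', 'https://cdn.example.net/s.xml'); // false: headers stripped
shouldForwardHeaders('https://example.com/sitemap.xml', 'http://example.com/sitemap.xml'); // false: scheme differs
```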
## Usage
```js
const SitemapXMLParser = require('sitemap-xml-parser');
const parser = new SitemapXMLParser('https://example.com/sitemap.xml');
(async () => {
const urls = await parser.fetch();
urls.forEach(entry => {
console.log(entry.loc);
});
})();
```
### Custom headers
```js
const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
headers: {
'User-Agent': 'MyBot/2.0',
'Authorization': 'Bearer my-token',
},
});
```
### Error handling with `onError`
Failed URLs (network errors, non-2xx responses, malformed XML) are skipped by default. Provide an `onError` callback to inspect them:
```js
const parser = new SitemapXMLParser('https://example.com/sitemap.xml', {
onError: (url, err) => {
console.error(`Skipped ${url}: ${err.message}`);
},
});
```
## Return value
`fetch()` resolves to an array of URL entry objects. Each object contains all fields present in the source XML — no field selection is applied at the library level:
```js
[
{
loc: 'https://example.com/page1',
lastmod: '2024-01-01',
changefreq: 'weekly',
priority: '0.8',
},
// ...
]
```
`loc` is always a string. Standard fields (`lastmod`, `changefreq`, `priority`) are strings when present, or `undefined` when absent from the source XML.
Sitemap extension fields (e.g. `image:image`, `news:news`, `video:video`) are also preserved as-is when present in the source XML. Their values reflect the structure parsed by the underlying XML parser — nested elements become objects.
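For example, an entry from an image-extended sitemap might look like the following. This shape is hypothetical; the actual keys depend on the source XML and how the underlying XML parser nests them:

```javascript
// Hypothetical entry with an image sitemap extension field.
const entry = {
  loc: 'https://example.com/page1',
  lastmod: '2024-01-01',
  'image:image': {
    'image:loc': 'https://example.com/photo.jpg',
    'image:title': 'A photo',
  },
};

// Nested extension values are plain objects; optional chaining handles
// entries where the extension field is absent.
const imageLoc = entry['image:image']?.['image:loc'];
console.log(imageLoc); // 'https://example.com/photo.jpg'
```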