sitemap
Version:
Sitemap-generating lib/cli
347 lines (257 loc) • 15.1 kB
Markdown
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
sitemap.js is a TypeScript library and CLI tool for generating sitemap XML files compliant with the sitemaps.org protocol. It supports streaming large datasets, handles sitemap indexes for >50k URLs, and includes parsers for reading existing sitemaps.
## Development Commands
### Building
```bash
npm run build # Compile TypeScript to dist/esm/ and dist/cjs/
npm run build:esm # Build ESM only (dist/esm/)
npm run build:cjs # Build CJS only (dist/cjs/)
```
### Testing
```bash
npm test # Run Jest tests with coverage
npm run test:full # Run lint, build, Jest, and xmllint validation
npm run test:typecheck # Type check only (tsc)
npm run test:perf # Run performance tests (tests/perf.mjs)
npm run test:xmllint # Validate XML schema (requires xmllint)
```
### Linting
```bash
npx eslint lib/* ./cli.ts # Lint TypeScript files
npx eslint lib/* ./cli.ts --fix # Auto-fix linting issues
```
### Running CLI Locally
```bash
node dist/esm/cli.js < urls.txt # Run CLI from built dist
./dist/esm/cli.js --version # Run directly (has shebang)
npm link && sitemap --version # Link and test as global command
```
## Code Architecture
### Entry Points
- **[index.ts](index.ts)**: Main library entry point, exports all public APIs
- **[cli.ts](cli.ts)**: Command-line interface for generating/parsing sitemaps
### File Organization & Responsibilities
The library follows a strict separation of concerns. Each file has a specific purpose:
**Core Infrastructure:**
- **[lib/types.ts](lib/types.ts)**: ALL TypeScript type definitions, interfaces, and enums. NO implementation code.
- **[lib/constants.ts](lib/constants.ts)**: Single source of truth for all shared constants (limits, regexes, defaults).
- **[lib/validation.ts](lib/validation.ts)**: ALL validation logic, type guards, and validators centralized here.
- **[lib/utils.ts](lib/utils.ts)**: Stream utilities, URL normalization, and general helper functions.
- **[lib/errors.ts](lib/errors.ts)**: Custom error class definitions.
- **[lib/sitemap-xml.ts](lib/sitemap-xml.ts)**: Low-level XML generation utilities (text escaping, tag building).
**Stream Processing:**
- **[lib/sitemap-stream.ts](lib/sitemap-stream.ts)**: Main transform stream for URL → sitemap XML.
- **[lib/sitemap-item-stream.ts](lib/sitemap-item-stream.ts)**: Lower-level stream for sitemap item → XML elements.
- **[lib/sitemap-index-stream.ts](lib/sitemap-index-stream.ts)**: Streams for sitemap indexes and multi-file generation.
**Parsers:**
- **[lib/sitemap-parser.ts](lib/sitemap-parser.ts)**: Parses sitemap XML → SitemapItem objects.
- **[lib/sitemap-index-parser.ts](lib/sitemap-index-parser.ts)**: Parses sitemap index XML → IndexItem objects.
**High-Level API:**
- **[lib/sitemap-simple.ts](lib/sitemap-simple.ts)**: Simplified API for common use cases.
### Core Streaming Architecture
The library is built on Node.js Transform streams for memory-efficient processing of large URL lists:
**Stream Chain Flow:**
```
Input → Transform Stream → Output
```
**Key Stream Classes:**
1. **SitemapStream** ([lib/sitemap-stream.ts](lib/sitemap-stream.ts))
- Core Transform stream that converts `SitemapItemLoose` objects to sitemap XML
- Handles single sitemaps (up to ~50k URLs)
- Automatically generates XML namespaces for images, videos, news, xhtml
- Uses `SitemapItemStream` internally for XML element generation
2. **SitemapAndIndexStream** ([lib/sitemap-index-stream.ts](lib/sitemap-index-stream.ts))
- Higher-level stream for handling >50k URLs
- Automatically splits into multiple sitemap files when limit reached
- Generates sitemap index XML pointing to individual sitemaps
- Requires `getSitemapStream` callback to create output files
3. **SitemapItemStream** ([lib/sitemap-item-stream.ts](lib/sitemap-item-stream.ts))
- Low-level Transform stream that converts sitemap items to XML elements
- Validates and normalizes URLs
- Handles image, video, news, and link extensions
4. **XMLToSitemapItemStream** ([lib/sitemap-parser.ts](lib/sitemap-parser.ts))
- Parser that converts sitemap XML back to `SitemapItem` objects
- Built on SAX parser for streaming large XML files
5. **SitemapIndexStream** ([lib/sitemap-index-stream.ts](lib/sitemap-index-stream.ts))
- Generates sitemap index XML from a list of sitemap URLs
- Used for organizing multiple sitemaps
### Type System
**[lib/types.ts](lib/types.ts)** defines the core data structures:
- **SitemapItemLoose**: Flexible input type (accepts strings, objects, arrays for images/videos)
- **SitemapItem**: Strict normalized type (arrays only)
- **ErrorLevel**: Enum controlling validation behavior (SILENT, WARN, THROW)
- **NewsItem**, **Img**, **VideoItem**, **LinkItem**: Extension types for rich sitemap entries
- **IndexItem**: Structure for sitemap index entries
- **StringObj**: Generic object with string keys (used for XML attributes)
### Constants & Limits
**[lib/constants.ts](lib/constants.ts)** is the single source of truth for:
- `LIMITS`: Security limits (max URL length, max items per sitemap, max video tags, etc.)
- `DEFAULT_SITEMAP_ITEM_LIMIT`: Default items per sitemap file (45,000)
All limits are documented with references to sitemaps.org and Google specifications.
### Validation & Normalization
**[lib/validation.ts](lib/validation.ts)** centralizes ALL validation logic:
- `validateSMIOptions()`: Validates complete sitemap item fields
- `validateURL()`, `validatePath()`, `validateLimit()`: Input validation
- `validators`: Regex patterns for field validation (price, language, genres, etc.)
- Type guards: `isPriceType()`, `isResolution()`, `isValidChangeFreq()`, `isValidYesNo()`, `isAllowDeny()`
**[lib/utils.ts](lib/utils.ts)** contains utility functions:
- `normalizeURL()`: Converts `SitemapItemLoose` to `SitemapItem` with validation
- `lineSeparatedURLsToSitemapOptions()`: Stream transform for parsing line-delimited URLs
- `ReadlineStream`: Helper for reading line-by-line input
- `mergeStreams()`: Combines multiple streams into one
### XML Generation
**[lib/sitemap-xml.ts](lib/sitemap-xml.ts)** provides low-level XML building functions:
- Tag generation helpers (`otag`, `ctag`, `element`)
- Sitemap-specific element builders (images, videos, news, links)
### Error Handling
**[lib/errors.ts](lib/errors.ts)** defines custom error classes:
- `EmptyStream`, `EmptySitemap`: Stream validation errors
- `InvalidAttr`, `InvalidVideoFormat`, `InvalidNewsFormat`: Validation errors
- `XMLLintUnavailable`: External tool errors
## When Making Changes
### Where to Add New Code
- **New type or interface?** → Add to [lib/types.ts](lib/types.ts)
- **New constant or limit?** → Add to [lib/constants.ts](lib/constants.ts) (import from here everywhere)
- **New validation function or type guard?** → Add to [lib/validation.ts](lib/validation.ts)
- **New utility function?** → Add to [lib/utils.ts](lib/utils.ts)
- **New error class?** → Add to [lib/errors.ts](lib/errors.ts)
- **New public API?** → Export from [index.ts](index.ts)
### Common Pitfalls to Avoid
1. **DON'T duplicate constants** - Always import from [lib/constants.ts](lib/constants.ts)
2. **DON'T define types in implementation files** - Put them in [lib/types.ts](lib/types.ts)
3. **DON'T scatter validation logic** - Keep it all in [lib/validation.ts](lib/validation.ts)
4. **DON'T break backward compatibility** - Use re-exports if moving code between files
5. **DO update [index.ts](index.ts)** if adding new public API functions
### Adding a New Field to Sitemap Items
1. Add type to [lib/types.ts](lib/types.ts) in both `SitemapItem` and `SitemapItemLoose` interfaces
2. Add XML generation logic in [lib/sitemap-item-stream.ts](lib/sitemap-item-stream.ts) `_transform` method
3. Add parsing logic in [lib/sitemap-parser.ts](lib/sitemap-parser.ts) SAX event handlers
4. Add validation in [lib/validation.ts](lib/validation.ts) `validateSMIOptions` if needed
5. Add constants to [lib/constants.ts](lib/constants.ts) if limits are needed
6. Write tests covering the new field
### Before Submitting Changes
```bash
npm run test:full # Run all tests, linting, and validation
npm run build # Ensure both ESM and CJS builds work
npm test # Verify 90%+ code coverage maintained
```
## Finding Code in the Codebase
### "Where is...?"
- **Validation for sitemap items?** → [lib/validation.ts](lib/validation.ts) (`validateSMIOptions`)
- **URL validation?** → [lib/validation.ts](lib/validation.ts) (`validateURL`)
- **Constants like max URL length?** → [lib/constants.ts](lib/constants.ts) (`LIMITS`)
- **Type guards (isPriceType, isValidYesNo)?** → [lib/validation.ts](lib/validation.ts)
- **Type definitions (SitemapItem, etc)?** → [lib/types.ts](lib/types.ts)
- **XML escaping/generation?** → [lib/sitemap-xml.ts](lib/sitemap-xml.ts)
- **URL normalization?** → [lib/utils.ts](lib/utils.ts) (`normalizeURL`)
- **Stream utilities?** → [lib/utils.ts](lib/utils.ts) (`mergeStreams`, `lineSeparatedURLsToSitemapOptions`)
### "How do I...?"
- **Check if a value is valid?** → Import type guard from [lib/validation.ts](lib/validation.ts)
- **Get a constant limit?** → Import `LIMITS` from [lib/constants.ts](lib/constants.ts)
- **Validate user input?** → Use validation functions from [lib/validation.ts](lib/validation.ts)
- **Generate XML safely?** → Use functions from [lib/sitemap-xml.ts](lib/sitemap-xml.ts) (auto-escapes)
## Testing Strategy
Tests are in [tests/](tests/) directory with Jest:
- **[tests/sitemap-stream.test.ts](tests/sitemap-stream.test.ts)**: Core streaming functionality
- **[tests/sitemap-parser.test.ts](tests/sitemap-parser.test.ts)**: XML parsing
- **[tests/sitemap-index.test.ts](tests/sitemap-index.test.ts)**: Index generation
- **[tests/sitemap-simple.test.ts](tests/sitemap-simple.test.ts)**: High-level API
- **[tests/cli.test.ts](tests/cli.test.ts)**: CLI argument parsing
- **[tests/*-security.test.ts](tests/)**: Security-focused validation and injection tests
- **[tests/sitemap-utils.test.ts](tests/sitemap-utils.test.ts)**: Utility function tests
### Coverage Requirements (enforced by jest.config.cjs)
- Branches: 80%
- Functions: 90%
- Lines: 90%
- Statements: 90%
### When to Write Tests
- **Always** write tests for new validation functions
- **Always** write tests for new security features
- **Always** add security tests for user-facing inputs (URL validation, path traversal, etc.)
- Write tests for bug fixes to prevent regression
- Add edge case tests for data transformations
## TypeScript Configuration
The project uses a dual-build setup for ESM and CommonJS:
- **[tsconfig.json](tsconfig.json)**: ESM build (`module: "NodeNext"`, `moduleResolution: "NodeNext"`)
- Outputs to `dist/esm/`
- Includes both [index.ts](index.ts) and [cli.ts](cli.ts)
- ES2023 target with strict null checks enabled
- **[tsconfig.cjs.json](tsconfig.cjs.json)**: CommonJS build (`module: "CommonJS"`)
- Outputs to `dist/cjs/`
- Excludes [cli.ts](cli.ts) (CLI is ESM-only)
- Only includes [index.ts](index.ts) for library exports
**Important**: All relative imports must include `.js` extensions for ESM compatibility (e.g., `import { foo } from './types.js'`)
## Key Patterns
### Stream Creation
Always create a new stream instance per operation. Streams cannot be reused.
```typescript
const stream = new SitemapStream({ hostname: 'https://example.com' });
stream.write({ url: '/page' });
stream.end();
```
### Memory Management
For large datasets, use streaming patterns with `pipe()` rather than collecting all data in memory:
```typescript
// Good - streams through
lineSeparatedURLsToSitemapOptions(readStream).pipe(sitemapStream).pipe(outputStream);
// Bad - loads everything into memory
const allUrls = await readAllUrls();
allUrls.forEach(url => stream.write(url));
```
### Error Levels
Control validation strictness with `ErrorLevel`:
- `SILENT`: Skip validation (fastest, use in production if data is pre-validated)
- `WARN`: Log warnings (default, good for development)
- `THROW`: Throw on invalid data (strict mode, good for testing)
## Package Distribution
The package is distributed as a dual ESM/CommonJS package with `"type": "module"` in package.json:
- **ESM**: `dist/esm/index.js` (ES modules)
- **CJS**: `dist/cjs/index.js` (CommonJS, via conditional exports)
- **Types**: `dist/esm/index.d.ts` (TypeScript definitions)
- **Binary**: `dist/esm/cli.js` (ESM-only CLI, executable via `npx sitemap`)
- **Engines**: Node.js >=20.19.5, npm >=10.8.2
### Dual Package Exports
The `exports` field in package.json provides conditional exports:
```json
{
"exports": {
".": {
"import": "./dist/esm/index.js",
"require": "./dist/cjs/index.js"
}
}
}
```
This allows both:
```javascript
// ESM
import { SitemapStream } from 'sitemap'
// CommonJS
const { SitemapStream } = require('sitemap')
```
## Git Hooks
Husky pre-commit hooks run lint-staged which:
- Sorts package.json
- Runs eslint --fix on TypeScript files
- Runs prettier on TypeScript files
## Architecture Decisions
### Why This File Structure?
The codebase is organized around **separation of concerns** and **single source of truth** principles:
1. **Types in [lib/types.ts](lib/types.ts)**: All interfaces and enums live here, with NO implementation code. This makes types easy to find and prevents circular dependencies.
2. **Constants in [lib/constants.ts](lib/constants.ts)**: All shared constants (limits, regexes) defined once. This prevents inconsistencies where different files use different values.
3. **Validation in [lib/validation.ts](lib/validation.ts)**: All validation logic centralized. Easy to find, test, and maintain security rules.
4. **Clear file boundaries**: Each file has ONE responsibility. You know exactly where to look for specific functionality.
### Key Principles
- **Single Source of Truth**: Constants and validation logic exist in exactly one place
- **No Duplication**: Import shared code rather than copying it
- **Backward Compatibility**: Use re-exports when moving code between files to avoid breaking changes
- **Types Separate from Implementation**: [lib/types.ts](lib/types.ts) contains only type definitions
- **Security First**: All validation and limits are centralized for consistent security enforcement
### Benefits of This Organization
- **Discoverability**: Developers know exactly where to look for types, constants, or validation
- **Maintainability**: Changes to limits or validation only require editing one file
- **Consistency**: Importing from a single source prevents different parts of the code using different limits
- **Testing**: Centralized validation makes it easy to write comprehensive security tests
- **Refactoring**: Clear boundaries make it safe to refactor without affecting other modules