## llmoptimizer
Generate an llms.txt that gives AI models a clean, structured summary of your website or docs. It works with any site and has first-class helpers for popular frameworks (Vite, Next.js, Nuxt, Astro, Remix), plus a docs generator for Markdown/MDX.
Node.js 18+ is required.
## Why This Matters
- Clear signal for AI: Produce a compact, consistent llms.txt that lists your important pages with key metadata, headings, and structured data.
- Multiple input modes: Crawl a live site, read a sitemap, scan static builds, or run framework-aware adapters without extra setup.
- Docs-first: Generate llms.txt and llms-full.txt directly from Markdown/MDX, including optional sectioned link lists and concatenated context files.
- Robots made easy: Generate a robots.txt that explicitly allows popular search and LLM crawlers, and auto-includes your sitemap.
## Install
```
npm install --save-dev llmoptimizer
```
## Quick Starts
Pick the scenario that matches your project. All commands write llms.txt by default.
```
# 1) Crawl production
npx llmoptimizer generate --url https://example.com --out public/llms.txt --max-pages 200
# 2) Use a sitemap
npx llmoptimizer generate --sitemap https://example.com/sitemap.xml --out llms.txt
# 3) Scan a static export (e.g., Next.js out/)
npx llmoptimizer generate --root ./out --out ./out/llms.txt
# 4) Build-scan (no crawling): search common build dirs for HTML
npx llmoptimizer generate --build-scan --project-root . --out llms.txt
# 5) Docs (Markdown/MDX) → llms.txt + llms-full.txt + stats
npx llmoptimizer docs --docs-dir docs --out-dir build --site-url https://example.com --base-url /
# 6) Autodetect best mode (docs → build-scan → adapter → crawl)
npx llmoptimizer auto --url https://example.com
# 7) Generate robots.txt that allows search + LLM crawlers
npx llmoptimizer robots --out public/robots.txt --sitemap https://example.com/sitemap.xml
```
Common flags:
- `--format markdown|json` (default markdown)
- `--include <glob...>` / `--exclude <glob...>` to filter routes/files
- `--concurrency <n>` and `--delay-ms <ms>` for performance/throttling
- `--no-robots` to skip robots.txt checks in network modes
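If you find yourself repeating these flags across commands, the same knobs can live in `llmoptimizer.config.ts` (covered in the Configuration section below); a minimal sketch using only keys shown in that section:
```ts
// llmoptimizer.config.ts: a minimal sketch mirroring the common CLI flags above.
// Every key here is taken from the Configuration section later in this README.
import { defineConfig } from 'llmoptimizer'

export default defineConfig({
  maxPages: 200,                                    // --max-pages
  concurrency: 8,                                   // --concurrency
  network: { delayMs: 100 },                        // --delay-ms
  obeyRobots: true,                                 // set to false for the effect of --no-robots
  output: { file: 'llms.txt', format: 'markdown' }, // --out / --format
})
```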
## What llmoptimizer Generates
llmoptimizer extracts and summarizes the signals that matter to AI and search.
- Site summary: base URL, generation time, totals
- Per page (varies by mode):
  - Basics: URL, title, description, canonical
  - Metadata: robots meta, keywords, social (OpenGraph/Twitter)
  - Structure: H1–H4 headings, snippets, estimated words/tokens
  - Links/media: internal/external link counts, images, missing alt counts
  - Structured data: schema.org JSON‑LD types summary
Docs mode also emits:
- `llms.txt`: Sectioned link list (or auto-grouped) with a short intro
- `llms-full.txt`: Concatenated cleaned content for all docs
- `llms-stats.json`: Headings, words, token estimates per doc + totals
- Optional: `llms-ctx.txt` and `llms-ctx-full.txt` context bundles
### Structured theme
Use `--theme structured` (or `render.theme: 'structured'` in config) for a more LLM-friendly, categorized Markdown output. It includes:
- Site header with base URL, locales, page count, and totals.
- Categories (Home, Docs, Guides, API, Blog, etc.) with counts and an index.
- Per-page JSON metadata blocks (url/title/description/canonical/locale/metrics/alternates/OG/Twitter) followed by concise headings, links, and images samples.
Example:

````
# llms.txt — Structured Site Summary
Base URL: https://example.com
Generated: 2025-08-27
Pages: 42
Totals: words=12345 images=120 missingAlt=3 internalLinks=420 externalLinks=88

## Categories
- Docs: 20
- Guides: 8
- Blog: 5
- Other: 9

## Docs (20)

### Getting Started
```json
{ "url": "https://example.com/docs/getting-started", "title": "Getting Started", "metrics": { "wordCount": 950 } }
```
- Headings:
  - H1: Getting Started
  - H2: Installation
````
## CLI Overview
1) Generate from a site/build
```
npx llmoptimizer generate [options]
# Modes
--url <https://...> # crawl production (obeys robots by default)
--sitemap <url> # seed from sitemap.xml
--root <dir> # scan a static export/build dir for HTML
--build-scan # scan common build dirs under --project-root
--adapter --project-root . # framework-aware route fetch (when supported)
# Output & format
--out <file> # default: llms.txt
--format markdown|json
--theme default|compact|detailed|structured # default: structured
# Filtering & perf
--include <glob...> --exclude <glob...>
--max-pages <n> --concurrency <n> --delay-ms <ms>
--no-robots
```
2) Debug dump (routes/build/sample)
```
npx llmoptimizer dump \
--project-root . \
--base-url https://example.com --sample 5 \
--scan-build --build-dirs dist .next/server/pages \
--framework-details \
--include "/docs/*" --exclude "/admin/*" \
--out dump.json
```
Outputs JSON including:
- Adapter detection and basic routes/params
- Next.js extractor details (when applicable)
- Framework details (when `--framework-details`):
  - SvelteKit: filesystem-derived route patterns + param names + example blog slugs
  - Nuxt: pages/ routes (Nuxt 2 underscore + Nuxt 3 bracket), i18n locales (best-effort), content/blog slugs
  - Remix: routes from app/routes (dotted segments, $params, pathless parentheses), param names
  - Angular: `angular.json` outputPath, extracted `path:` entries and `loadChildren` hints
- Optional build scan results
- Optional sample of fetched pages when `--base-url` is provided
3) Docs (Markdown/MDX) → llms files
```
npx llmoptimizer docs \
--docs-dir docs --out-dir build --site-url https://example.com --base-url / \
--include-blog --blog-dir blog \
--ignore "advanced/*" "private/*" \
--order "getting-started/*" "guides/*" "api/*" \
--ignore-path docs --add-path api \
--exclude-imports --remove-duplicate-headings \
--generate-markdown-files \
--emit-ctx --ctx-out llms-ctx.txt --ctx-full-out llms-ctx-full.txt \
--llms-filename llms.txt --llms-full-filename llms-full.txt \
--stats-file llms-stats.json \
--title "Your Docs" --description "Great docs" --version 1.0.0 \
--sections-file ./examples/sections.json \
--optional-links-file ./examples/optional-links.json
```
What “sections” mean:
- You can provide explicit sections as JSON (see `examples/sections.json`).
- Or omit them and let auto-sections group content like Getting Started, Guides, API, Tutorials, Reference.
- “Optional” links are supported via a separate JSON file (see `examples/optional-links.json`).
4) Autodetect best mode
```
npx llmoptimizer auto \
--project-root . \
--url https://example.com \
--out llms.txt --format markdown --concurrency 8 --max-pages 200 --delay-ms 0
```
5) Robots.txt generator
```
npx llmoptimizer robots \
--out public/robots.txt \
--sitemap https://example.com/sitemap.xml \
--no-allow-all # optional: do not add Allow: /
--no-llm-allow # optional: skip explicit LLM bot allow-list
--no-search-allow # optional: skip search bot allow-list
--search-bot Googlebot --search-bot Bingbot # override bots
```
It allows popular LLM crawlers (e.g., GPTBot, Google‑Extended, Claude, Perplexity, CCBot, Applebot‑Extended, Meta‑ExternalAgent, Amazonbot, Bytespider) and mainstream search bots (Googlebot, Bingbot, DuckDuckBot, Slurp, Baiduspider, YandexBot).
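For illustration only, the output uses standard robots.txt directives: a per-bot allow list plus your sitemap, roughly along these lines (the exact bots and layout come from the tool and the flags you pass):
```
User-agent: GPTBot
Allow: /

User-agent: Googlebot
Allow: /

Sitemap: https://example.com/sitemap.xml
```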
## Configuration (optional)
Create `llmoptimizer.config.ts` if you prefer setting defaults once instead of passing flags on every CLI call. Structured is the default theme.
```ts
// llmoptimizer.config.ts
import { defineConfig } from 'llmoptimizer'
export default defineConfig({
baseUrl: 'https://example.com',
obeyRobots: true,
maxPages: 200,
concurrency: 8,
network: { delayMs: 100, sitemap: { concurrency: 6, delayMs: 50 } },
// Themes: 'default' | 'compact' | 'detailed' | 'structured'
render: {
theme: 'structured',
// Optional: customize structured output
structured: {
limits: { headings: 16, links: 12, images: 8 },
categories: {
// Control section order
order: ['Home', 'Products', 'Product Categories', 'Docs', 'Guides', 'API', 'Policies', 'Important', 'Blog', 'Company', 'Legal', 'Support', 'Examples', 'Other'],
// Keyword mapping: match in URL path or H1
keywords: {
Products: ['product', 'pricing', 'features'],
'Product Categories': ['category', 'categories', 'catalog', 'collection'],
Policies: ['privacy', 'terms', 'cookies', 'policy', 'policies', 'security', 'gdpr'],
Important: ['status', 'uptime', 'login', 'signup', 'contact'],
},
},
},
},
output: { file: 'public/llms.txt', format: 'markdown' },
robots: {
outFile: 'public/robots.txt',
allowAll: true,
llmAllow: true,
searchAllow: true,
sitemaps: ['https://example.com/sitemap.xml'],
},
})
```
## Framework Integrations
All integrations default to writing llms.txt. You can swap to JSON via `format: 'json'`.
- Vite (React/Vue/Svelte/Solid/Preact)
```ts
// vite.config.ts
import { defineConfig } from 'vite'
import { llmOptimizer } from 'llmoptimizer/vite'
export default defineConfig({
plugins: [
llmOptimizer({
mode: 'static', // or 'crawl' with baseUrl
robots: { outFile: 'dist/robots.txt' },
}),
],
})
```
- Next.js
```ts
// scripts/postbuild-llm.mjs
import { runAfterNextBuild } from 'llmoptimizer/next'
await runAfterNextBuild({
projectRoot: process.cwd(),
baseUrl: process.env.NEXT_PUBLIC_SITE_URL || 'https://yourdomain.com',
outFile: 'public/llms.txt',
// Choose the strategy:
// - static: build-scan (.next/server/*, out) with baseUrl mapping → adapter → crawl
// - adapter: fetch detected routes from baseUrl → build-scan → crawl
// - crawl: breadth-first crawl baseUrl
mode: 'static',
robots: true,
log: true,
})
// package.json
// { "scripts": { "postbuild": "node scripts/postbuild-llm.ts" } }
```
- Nuxt 3 (Nitro)
```ts
// nuxt.config.ts
export default defineNuxtConfig({
modules: [[
'llmoptimizer/nuxt',
{
// static: build-scan on .output/public with baseUrl mapping → crawl fallback
mode: 'static',
baseUrl: process.env.NUXT_PUBLIC_SITE_URL || 'https://yourdomain.com',
robots: true,
},
]],
})
```
- Astro
```ts
// astro.config.mjs
import { defineConfig } from 'astro/config'
import llm from 'llmoptimizer/astro'
export default defineConfig({
integrations: [
llm({
// static: build-scan on dist with baseUrl mapping → crawl fallback
mode: 'static',
baseUrl: process.env.SITE_URL,
robots: true,
})
]
})
```
- Remix
```ts
// scripts/postbuild-llm.mjs
import { runAfterRemixBuild } from 'llmoptimizer/remix'
await runAfterRemixBuild({
// static: build-scan on public with baseUrl mapping → crawl fallback
mode: 'static',
baseUrl: process.env.SITE_URL || 'https://your.app',
outFile: 'public/llms.txt',
robots: true,
})
```
- SvelteKit
```ts
// scripts/sveltekit-postbuild-llm.mjs
import { runAfterSvelteKitBuild } from 'llmoptimizer/sveltekit'
await runAfterSvelteKitBuild({
// static: scan 'build' and map to URLs using baseUrl → crawl fallback if SSR-only
mode: 'static',
buildDir: 'build',
baseUrl: process.env.SITE_URL || 'https://your.app',
outFile: 'build/llms.txt',
theme: 'structured',
// Optional filters and structured theme options
// include: ['/docs/*'], exclude: ['/admin/*'],
// renderOptions: { limits: { headings: 12, links: 10, images: 6 } },
robots: { outFile: 'build/robots.txt' },
})
// package.json → { "scripts": { "postbuild": "node scripts/sveltekit-postbuild-llm.mjs" } }
```
- Angular
```ts
// scripts/angular-postbuild-llm.mjs
import { runAfterAngularBuild } from 'llmoptimizer/angular'
await runAfterAngularBuild({
// static: scan Angular dist output; distDir auto-detected from angular.json when omitted
mode: 'static',
baseUrl: process.env.SITE_URL || 'https://your.app',
theme: 'structured',
// Optional: distDir: 'dist/your-project/browser'
// include/exclude and renderOptions are supported
robots: { outFile: 'dist/robots.txt' },
})
// package.json → { "scripts": { "postbuild": "node scripts/angular-postbuild-llm.mjs" } }
```
- Generic Node script
```ts
// scripts/postbuild-llm.ts
import { runAfterBuild } from 'llmoptimizer/node'
await runAfterBuild({
// static: build-scan on dist with baseUrl mapping → crawl fallback
mode: 'static',
rootDir: 'dist',
baseUrl: process.env.SITE_URL,
robots: true,
})
```
- Generic Node/SSR
```ts
// scripts/postbuild-llm.mjs
import { runAfterBuild } from 'llmoptimizer/node'
await runAfterBuild({ mode: 'crawl', baseUrl: 'https://yourdomain.com', outFile: 'llms.txt' })
```
## Docs Integration Details (Markdown/MDX)
Use the CLI or the API. The integration cleans content, removes duplicate headings, optionally inlines local partials, and can generate cleaned per-doc .md files.
Programmatic example:
```ts
// scripts/generate-docs-llm.mjs
import { docsLLMs } from 'llmoptimizer/docs'
const plugin = docsLLMs({
docsDir: 'docs',
includeBlog: true,
ignoreFiles: ['advanced/*', 'private/*'],
includeOrder: ['getting-started/*', 'guides/*', 'api/*'],
pathTransformation: { ignorePaths: ['docs'], addPaths: ['api'] },
excludeImports: true,
removeDuplicateHeadings: true,
generateMarkdownFiles: true,
autoSections: true,
// Optional: explicit sections/links
// sections: [...],
// optionalLinks: [...],
})
await plugin.postBuild({
outDir: 'build',
siteConfig: { url: 'https://example.com', baseUrl: '/', title: 'Docs', tagline: 'Great docs' },
})
```
Outputs in `build/`:
- `llms.txt` and `llms-full.txt`
- `llms-stats.json` with word/token estimates
- Optionally `llms-ctx.txt` and `llms-ctx-full.txt` (when `emitCtx`)
- Optional cleaned per-doc `.md` files used for link targets
See `examples/sections.json` and `examples/optional-links.json` for input formats.
## Smart Autoregistration (Auto)
Prefer one helper that “just works”? Use the auto integration in a postbuild script. It picks from docs → build → adapter → crawl based on your repo and writes the right output.
```ts
// scripts/auto-llm.mjs
import { autoPostbuild } from 'llmoptimizer/auto'
const res = await autoPostbuild({ baseUrl: 'https://example.com', log: true })
console.log(res) // { mode: 'docs'|'build'|'adapter'|'crawl', outPath: '...' }
```
Add to package.json: `{ "scripts": { "postbuild": "node scripts/auto-llm.mjs" } }`.
## Notes
- Absolute links: Internal links, canonical, hreflang, and images are resolved to absolute URLs using the page URL. Pass `baseUrl` in static/build-scan modes to avoid file:// URLs.
- Build-scan coverage: When `baseUrl` is provided, build-scan enriches routes using framework artifacts (e.g., Next prerender/routes manifests) and falls back to sitemap or crawl if empty.
- Adapter vs static: Adapter fetches via HTTP from `baseUrl` (requires a reachable server). Static uses build output folders and does not require a running server.
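To make the adapter-vs-static distinction concrete, here is a minimal sketch of the Next.js helper in adapter mode, using only options shown earlier in this README; it assumes the site at `SITE_URL` is reachable while the script runs:
```ts
// scripts/postbuild-llm-adapter.mjs (a sketch, not the canonical setup):
// the same helper as in the Next.js section, but fetching detected routes
// over HTTP from baseUrl instead of scanning build output.
import { runAfterNextBuild } from 'llmoptimizer/next'

await runAfterNextBuild({
  projectRoot: process.cwd(),
  baseUrl: process.env.SITE_URL || 'https://yourdomain.com',
  outFile: 'public/llms.txt',
  mode: 'adapter', // falls back to build-scan, then crawl, per the strategy notes above
  log: true,
})
```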
## Examples
- Next postbuild: `examples/next-postbuild-llm.mjs`
- Auto detection: `examples/auto-llm.mjs`
- Nuxt config: `examples/nuxt.config.ts`
- Astro config: `examples/astro.config.mjs`
- Remix postbuild: `examples/remix-postbuild-llm.mjs`
- Vite config: `examples/vite.config.mjs`
- Generic Node postbuild: `examples/node-postbuild-llm.mjs`
- SvelteKit postbuild: `examples/sveltekit-postbuild-llm.mjs`
- Angular postbuild: `examples/angular-postbuild-llm.mjs`
## Best Practices
- Titles and descriptions: Ensure every page has good `<title>` and meta description.
- Structured data: Use JSON‑LD for key entities; we summarize types in output.
- Headings: Keep H1–H3 clear and scannable; these are extracted.
- Internationalization: Use `<html lang>` and `hreflang` alternates when applicable.
- Sitemaps: Keep `sitemap.xml` fresh for coverage.
- Robots: Use the robots generator to allow search + LLM crawlers on public content.
## Troubleshooting
- Empty or few pages: Check `--include/--exclude` filters and robots settings; try `--no-robots` for testing.
- Dynamic routes (adapter mode): Provide sample params or ensure your framework exposes discoverable routes.
- Rate limits: Lower `--concurrency` and add `--delay-ms` when crawling.
- Wrong links in docs mode: Adjust `--ignore-path/--add-path` or provide `--site-url/--base-url`.
## Contact
- Email: ihuzaifashoukat@gmail.com
- GitHub: https://github.com/ihuzaifashoukat
## License
MIT