# llm-gen
A CLI tool to extract readable text from static HTML files (e.g., Next.js static exports) and generate an `llm.txt` file optimized for ingestion by Large Language Models (LLMs). It also supports generating a JSON metadata file (`pages.json`) and an optional interactive HTML UI (`llm_ui.html`) for browsing extracted content.
## Features
- **Text Extraction**: Extracts clean, readable text from HTML files using Cheerio, removing noise like scripts, styles, and hidden elements.
- **Concurrent Processing**: Processes multiple HTML files in parallel, up to a configurable limit (`--concurrency`, default `10`).
- **Output Formats**:
  - `llm.txt`: A single text file with extracted content, organized with file headers and a table of contents.
  - `pages.json`: Metadata about processed files, including file paths, sizes, text lengths, and SHA-256 hashes.
  - `llm_ui.html` (optional): An interactive HTML interface for browsing extracted text with search functionality.
- **Customizable**: Supports glob patterns for file selection, configurable output directories, and verbosity options.
- **Efficient**: Streams large output files and records per-file errors in `pages.json`.
- **MIT Licensed**: Free to use, modify, and distribute.
## Installation
Install `llm-gen` globally via npm for CLI access, or locally for use in scripts:
```bash
# Install globally
npm install -g llm-gen
# Or install locally in a project
npm install llm-gen
```
### Prerequisites
- **Node.js**: Version 18 or higher (uses ES Modules and modern JavaScript features).
- **npm**: For installing dependencies.
## Usage
Run `llm-gen` from the command line, specifying the source directory containing HTML files (e.g., a Next.js static export in the `out` directory).
### Basic Command
```bash
llm-gen --src ./out
```
With the default pattern, this processes all HTML files (`*.html`, `*.htm`) in `./out` and its subdirectories, generating:
- `llm.txt`: Extracted text with a table of contents.
- `pages.json`: Metadata about processed files.
### Options
| Option | Description | Type | Default |
| --------------- | ------------------------------------------------------------------------------ | ------- | ----------------- |
| `--src` | Source directory containing HTML files to process (required). | String | None |
| `--public` | Output directory for generated files (`llm.txt`, `pages.json`, `llm_ui.html`). | String | `.` (current dir) |
| `--out` | Output filename for the extracted text (relative to `--public`). | String | `llm.txt` |
| `--ui` | Generate an interactive HTML UI file (`llm_ui.html`). | Boolean | `false` |
| `--concurrency` | Maximum number of files to process concurrently. | Number | `10` |
| `--verbose` | Enable detailed logging of processing steps. | Boolean | `false` |
| `--pattern` | Glob pattern to match HTML files (relative to `--src`). | String | `**/*.htm?(l)` |
| `--help` | Display help information. | - | - |
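
The default `--pattern` value uses an extglob group: `?(l)` matches zero or one `l`, so the pattern covers both `.htm` and `.html` files at any depth under `--src`. As a rough illustration only, the sketch below approximates that pattern with a regular expression. `matchesDefaultPattern` is a hypothetical helper, not part of llm-gen, and real glob matching may differ in edge cases such as dotfiles or case sensitivity.

```javascript
// Approximation of the default glob `**/*.htm?(l)` as a regular expression.
// Hypothetical helper for illustration; llm-gen itself lists the glob package
// as its dependency for file matching.
const DEFAULT_PATTERN_RE = /^(?:[^/]+\/)*[^/]+\.html?$/;

function matchesDefaultPattern(relativePath) {
  // Assumes the path is relative to --src and uses forward slashes.
  return DEFAULT_PATTERN_RE.test(relativePath);
}

const candidates = [
  'index.html',
  'blog/post/index.html',
  'legacy/page.htm',
  'styles/site.css',
  'notes.html.bak',
];

for (const candidate of candidates) {
  console.log(candidate, matchesDefaultPattern(candidate));
}
```

Passing `--pattern "**/*.html"` instead would exclude `legacy/page.htm` from the list above.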
### Example Commands
1. **Process HTML files in a Next.js `out` directory**:
   ```bash
   llm-gen --src ./out --public ./dist --out extracted.txt
   ```
   - Processes HTML files in `./out`.
   - Writes `extracted.txt` and `pages.json` to `./dist`.
2. **Generate an interactive UI**:
   ```bash
   llm-gen --src ./out --ui --verbose
   ```
   - Generates `llm.txt`, `pages.json`, and `llm_ui.html` in the current directory.
   - Logs detailed progress.
3. **Custom glob pattern with high concurrency**:
   ```bash
   llm-gen --src ./out --pattern "**/*.html" --concurrency 20 --public ./output
   ```
   - Processes only `*.html` files with up to 20 concurrent tasks.
   - Outputs to `./output`.
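
The `--concurrency` option caps how many files are in flight at once. llm-gen lists p-limit as its dependency for this. The dependency-free sketch below shows the same general idea with a hypothetical `mapWithLimit` helper; it is an illustration under that assumption, not the tool's actual code.

```javascript
// Run `task` over `items` with at most `limit` tasks in flight at a time.
// Hypothetical helper illustrating the idea behind --concurrency.
async function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as `items`
}

// Demo: a fake per-file task that tracks the peak number of simultaneous runs.
let active = 0;
let peak = 0;

async function fakeProcess(name) {
  active++;
  peak = Math.max(peak, active);
  await new Promise((resolve) => setTimeout(resolve, 10));
  active--;
  return name.toUpperCase();
}

const demo = mapWithLimit(
  ['a.html', 'b.html', 'c.html', 'd.html', 'e.html'],
  2,
  fakeProcess
).then((out) => {
  console.log(out, 'peak concurrency:', peak);
  return out;
});
```

A higher limit shortens wall-clock time only while file reads and parsing are the bottleneck; past that point it mainly raises memory use.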
## Output Files
- **`llm.txt`**:
  - Contains extracted text from all HTML files, organized with headers for each file.
  - Includes a table of contents with file paths, sizes, and character counts.
  - Example structure:
```
Generated: 2025-08-10T12:34:56.789Z
Source directory: /path/to/out
Files processed: 5
╔══════════════════════════════╤═══════╤═══════╗
║ File Path │ Size │ Chars ║
╟──────────────────────────────┼───────┼───────╢
║ index.html │ 12345 │ 5678 ║
║ about.html │ 6789 │ 2345 ║
╚══════════════════════════════╧═══════╧═══════╝
Total files: 5
Total characters: 12,345
════════════════════════════════════════════════════════════════════════════════
📄 FILE: index.html
════════════════════════════════════════════════════════════════════════════════
Welcome to our website! This is the home page content...
```
- **`pages.json`**:
  - Metadata about processed files in JSON format.
  - Includes file paths, sizes, text lengths, SHA-256 hashes, and any errors.
  - Example:
```json
{
"generatedAt": "2025-08-10T12:34:56.789Z",
"source": "/path/to/out",
"pages": [
{
"path": "index.html",
"size": 12345,
"textLength": 5678,
"hash": "a1b2c3d4...",
"error": null
}
]
}
```
- **`llm_ui.html`** (if `--ui` is enabled):
  - An interactive HTML page for browsing extracted text.
  - Features a search bar to filter files by path or content.
  - Shows a collapsible section for each file with a text preview.
  - Note: Full-text display for large files is not yet implemented; the UI shows a placeholder alert instead.
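
Because `pages.json` records a hash per page, two runs can be compared to find pages whose content changed, for example to re-ingest only those. The sketch below assumes the field names shown in the `pages.json` example (`pages`, `path`, `hash`, `error`). `changedPages` is a hypothetical helper, not something llm-gen ships, and it takes already-parsed objects so that file loading is left to the caller.

```javascript
// Compare two parsed pages.json objects and report added, changed, and removed paths.
// Hypothetical helper; field names follow the documented pages.json example.
function changedPages(previous, current) {
  const oldHashes = new Map(previous.pages.map((p) => [p.path, p.hash]));
  const added = [];
  const changed = [];

  for (const page of current.pages) {
    const known = oldHashes.has(page.path);
    const oldHash = oldHashes.get(page.path);
    oldHashes.delete(page.path); // seen in the current run, so not "removed"

    if (page.error) continue; // skip pages that failed to process
    if (!known) added.push(page.path);
    else if (oldHash !== page.hash) changed.push(page.path);
  }

  // Anything left in the map existed before but is absent from the current run.
  const removed = [...oldHashes.keys()];
  return { added, changed, removed };
}

// Sample data with made-up hash values.
const previousRun = {
  pages: [
    { path: 'index.html', hash: 'aaa', error: null },
    { path: 'about.html', hash: 'bbb', error: null },
    { path: 'old.html', hash: 'ccc', error: null },
  ],
};
const currentRun = {
  pages: [
    { path: 'index.html', hash: 'aaa', error: null },
    { path: 'about.html', hash: 'bbb2', error: null },
    { path: 'new.html', hash: 'ddd', error: null },
  ],
};

const diff = changedPages(previousRun, currentRun);
console.log(diff);
```

In practice the two arguments would come from `JSON.parse` applied to the `pages.json` files of two separate runs.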
## Development
### Setup
Clone the repository and install dependencies:
```bash
git clone https://github.com/Agecoder/llm-gen.git
cd llm-gen
npm install
```
### Scripts
- `npm start`: Run the CLI with default arguments (`--src ./out`).
- `npm run build`: Run the CLI with `./out` as the source directory.
- `npm run dev`: Run the CLI in watch mode for development.
- `npm run lint`: Lint the codebase using ESLint.
- `npm run format`: Format the codebase using Prettier.
### Dependencies
- **cheerio**: HTML parsing for text extraction.
- **fs-extra**: Enhanced file system operations.
- **glob**: File pattern matching.
- **p-limit**: Concurrency control for file processing.
- **yargs**: Command-line argument parsing.
### Dev Dependencies
- **eslint**: Linting for code quality.
- **prettier**: Code formatting.
## Contributing
Contributions are welcome! Please follow these steps:
1. Fork the repository.
2. Create a feature branch (`git checkout -b feature/your-feature`).
3. Commit your changes (`git commit -m "Add your feature"`).
4. Push to the branch (`git push origin feature/your-feature`).
5. Open a pull request.
Please ensure your code passes linting (`npm run lint`) and is formatted (`npm run format`).
## License
This project is licensed under the [MIT License](LICENSE).
## Author
- **Vedant Navale**
  - Email: vedantnavale45@gmail.com
  - GitHub: [Agecoder](https://github.com/Agecoder)
## Issues
Report bugs or suggest features at [GitHub Issues](https://github.com/Agecoder/llm-gen/issues).
## Acknowledgments
- Built with inspiration from static site generation workflows and LLM content ingestion needs.
- Thanks to the open-source community for providing robust libraries like Cheerio and yargs.