# @harutakax/html-rag-optimizer
A powerful HTML optimization tool designed specifically for RAG (Retrieval-Augmented Generation) systems. This library removes unnecessary HTML elements, attributes, and formatting to create clean, search-optimized content while preserving semantic structure.
## Features
- **🚀 Fast Processing**: Optimizes large HTML files (1MB+) in under 5 seconds
- **🎯 RAG-Focused**: Designed specifically for information retrieval systems
- **⚙️ Highly Configurable**: Extensive options for customizing optimization behavior
- **📝 TypeScript Support**: Full TypeScript support with detailed type definitions
- **🛠️ CLI & API**: Both command-line interface and programmatic API
- **🔄 Batch Processing**: Supports single files and entire directories
- **📊 Performance Optimized**: Efficient memory usage and concurrent processing
## Installation
```bash
npm install @harutakax/html-rag-optimizer
```
## Quick Start
### Programmatic API
```typescript
import { optimizeHtml } from '@harutakax/html-rag-optimizer';
const html = `
<div class="container">
<h1 id="title">Welcome</h1>
<p>This is a <strong>sample</strong> paragraph.</p>
<script>console.log('remove me');</script>
<style>.container { margin: 0; }</style>
</div>
`;
// Basic optimization
const optimized = optimizeHtml(html);
console.log(optimized);
// Output: <div><h1>Welcome</h1><p>This is a<strong>sample</strong>paragraph.</p></div>
```
### CLI
```bash
npx @harutakax/html-rag-optimizer input.html -o output.html
```
```bash
npm install -g @harutakax/html-rag-optimizer
html-rag-optimizer --input-dir ./docs --output-dir ./optimized
html-rag-optimizer input.html -o output.html --keep-attributes --exclude-tags script,style
```
## Configuration Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `keepAttributes` | `boolean` | `false` | Preserve HTML attributes |
| `removeEmpty` | `boolean` | `true` | Remove empty elements |
| `preserveWhitespace` | `boolean` | `false` | Preserve whitespace formatting |
| `excludeTags` | `string[]` | `[]` | Tags to exclude from removal |
| `removeComments` | `boolean` | `true` | Remove HTML comments |
| `minifyText` | `boolean` | `true` | Normalize and minify text content |
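To make the defaults in the table concrete, here is a minimal sketch of how a partial options object could be merged with them. The `OptimizeOptions` interface, the `DEFAULT_OPTIONS` constant, and the `resolveOptions` helper are illustrative names and are not confirmed exports of the library; only the field names and default values come from the table.

```typescript
// Hypothetical type mirroring the documented options table.
interface OptimizeOptions {
  keepAttributes: boolean;
  removeEmpty: boolean;
  preserveWhitespace: boolean;
  excludeTags: string[];
  removeComments: boolean;
  minifyText: boolean;
}

// Defaults copied from the table; the constant name is an assumption.
const DEFAULT_OPTIONS: OptimizeOptions = {
  keepAttributes: false,
  removeEmpty: true,
  preserveWhitespace: false,
  excludeTags: [],
  removeComments: true,
  minifyText: true,
};

// Merge caller overrides over the defaults without mutating either object.
function resolveOptions(overrides: Partial<OptimizeOptions> = {}): OptimizeOptions {
  return {
    ...DEFAULT_OPTIONS,
    ...overrides,
    // Copy the array so later edits to the result never touch the inputs.
    excludeTags: [...(overrides.excludeTags ?? DEFAULT_OPTIONS.excludeTags)],
  };
}

const resolved = resolveOptions({ excludeTags: ['code', 'pre'] });
console.log(resolved);
```

Only the keys passed in are overridden, so a caller can change one option and leave the rest at their documented defaults.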
```typescript
import { optimizeHtml } from '@harutakax/html-rag-optimizer';
const options = {
  keepAttributes: false,
  removeEmpty: true,
  preserveWhitespace: false,
  excludeTags: ['code', 'pre'], // Don't remove code blocks
  removeComments: true,
  minifyText: true
};
const optimized = optimizeHtml(html, options);
```
```typescript
import { optimizeHtmlFile, optimizeHtmlDir } from '@harutakax/html-rag-optimizer';
// Process single file
await optimizeHtmlFile('input.html', 'output.html', options);
// Process entire directory
await optimizeHtmlDir('./docs', './optimized', options);
```
```typescript
import { optimizeHtml } from '@harutakax/html-rag-optimizer';
import { promises as fs } from 'fs';
async function processBatch(files: string[]) {
  const results = await Promise.all(
    files.map(async (file) => {
      const html = await fs.readFile(file, 'utf-8');
      return optimizeHtml(html, {
        removeComments: true
      });
    })
  );
  return results;
}
```
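The `Promise.all` pattern above starts every file read at once, which may hold many large documents in memory at the same time. One possible alternative is to bound the number of tasks in flight. The `mapWithLimit` helper below is a hypothetical utility, not part of the library, and the worker shown is a stand-in for a function that reads a file and calls `optimizeHtml`.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` tasks in flight.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  // Each runner repeatedly claims the next index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results; // results keep the order of `items`
}

// Stand-in worker; replace with one that reads a file and optimizes its HTML.
mapWithLimit(['<p>a</p>', '<p>bb</p>'], 1, async (html) => html.length)
  .then((lengths) => console.log(lengths));
```

Because each runner claims an index synchronously before awaiting, no item is processed twice, and a rejected worker rejects the whole call just as `Promise.all` does.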
## CLI Usage
The commands below assume the package is installed globally.
```bash
html-rag-optimizer --help
html-rag-optimizer --version
html-rag-optimizer input.html -o output.html
html-rag-optimizer --input-dir ./src --output-dir ./dist
```
```bash
-o, --output <path> Output file or directory
--input-dir <path> Input directory
--output-dir <path> Output directory
--keep-attributes Keep HTML attributes
--exclude-tags <tags> Exclude tags (comma-separated)
--preserve-whitespace Preserve whitespace
--config <path> Configuration file path
-h, --help Show help
-v, --version Show version
```
### Configuration File
Create a `html-rag-optimizer.json` file:
```json
{
  "keepAttributes": false,
  "removeEmpty": true,
  "excludeTags": ["code", "pre"],
  "removeComments": true,
  "minifyText": true
}
```
Use with: `html-rag-optimizer --config html-rag-optimizer.json input.html -o output.html`
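Since the configuration file is plain JSON, it can be useful to validate it before passing it on. The following is a minimal sketch of such a check, assuming the file may contain only the option keys documented in this README. The `parseConfig` function and the `Config` type are hypothetical, and how the CLI itself validates its config file is not documented here.

```typescript
// Boolean option keys documented in this README.
const BOOLEAN_KEYS = [
  'keepAttributes',
  'removeEmpty',
  'preserveWhitespace',
  'removeComments',
  'minifyText',
] as const;

type BooleanKey = (typeof BOOLEAN_KEYS)[number];
type Config = Partial<Record<BooleanKey, boolean>> & { excludeTags?: string[] };

// Hypothetical validator: parse JSON text and check each documented key's type.
function parseConfig(jsonText: string): Config {
  const raw: unknown = JSON.parse(jsonText);
  if (typeof raw !== 'object' || raw === null || Array.isArray(raw)) {
    throw new TypeError('config must be a JSON object');
  }
  const config: Config = {};
  for (const [key, value] of Object.entries(raw)) {
    if ((BOOLEAN_KEYS as readonly string[]).includes(key)) {
      if (typeof value !== 'boolean') throw new TypeError(`${key} must be a boolean`);
      config[key as BooleanKey] = value;
    } else if (key === 'excludeTags') {
      if (!Array.isArray(value) || !value.every((tag) => typeof tag === 'string')) {
        throw new TypeError('excludeTags must be an array of strings');
      }
      config.excludeTags = [...value];
    } else {
      // Rejecting unknown keys is a choice of this sketch; it catches typos early.
      throw new TypeError(`unknown option: ${key}`);
    }
  }
  return config;
}

console.log(parseConfig('{"keepAttributes": false, "excludeTags": ["code", "pre"]}'));
```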
### Removed
- `<script>` tags and content
- `<style>` tags and content
- `<meta>` tags
- HTML comments (`<!-- -->`)
- All HTML attributes (class, id, style, etc.)
- Empty elements (`<div></div>`, `<p> </p>`)
- Excess whitespace and formatting
### Preserved
- Semantic HTML structure
- Text content
- Essential tags (headings, paragraphs, lists, etc.)
- HTML entities (`&amp;`, `&lt;`, etc.)
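The removal rules listed above can be illustrated with a deliberately simplified, regex-based sketch. This is not the library's implementation: `naiveOptimize` is a hypothetical function, regular expressions cannot parse arbitrary HTML reliably (for example, attribute values containing `>`), and options such as `excludeTags` or `preserveWhitespace` are ignored here. It only shows the order in which such rules might be applied.

```typescript
// Simplified illustration only; the library's real implementation may work differently.
function naiveOptimize(html: string): string {
  let out = html;
  // Drop <script> and <style> elements together with their contents.
  out = out.replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '');
  // Drop HTML comments.
  out = out.replace(/<!--[\s\S]*?-->/g, '');
  // Drop <meta> tags (void elements, so there is no closing tag).
  out = out.replace(/<meta\b[^>]*>/gi, '');
  // Strip attributes from opening tags, keeping the tag name and any self-closing slash.
  out = out.replace(/<([a-zA-Z][a-zA-Z0-9-]*)(?=[\s\/>])[^>]*?(\/?)>/g, '<$1$2>');
  // Collapse runs of whitespace, then drop whitespace next to tag boundaries.
  out = out.replace(/\s+/g, ' ').replace(/>\s+/g, '>').replace(/\s+</g, '<').trim();
  // Remove empty elements repeatedly, since removing one can empty its parent.
  let previous: string;
  do {
    previous = out;
    out = out.replace(/<([a-zA-Z][a-zA-Z0-9-]*)><\/\1>/g, '');
  } while (out !== previous);
  return out;
}

console.log(naiveOptimize('<div class="c"><!-- note --><h1 id="t"> Hi </h1><p> </p></div>'));
```

HTML entities pass through untouched because none of the patterns match them. For real documents a proper HTML parser is the safer choice.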
### Before Optimization
```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Sample Page</title>
  <style>body { font-family: Arial; }</style>
</head>
<body>
  <div class="container" id="main">
    <h1 class="title"> Welcome to Our Site </h1>
    <!-- Navigation goes here -->
    <p class="intro">This is a sample paragraph.</p>
    <div></div>
    <script>console.log('hello');</script>
  </div>
</body>
</html>
```
### After Optimization
```html
<html><head><title>Sample Page</title></head><body><div><h1>Welcome to Our Site</h1><p>This is a sample paragraph.</p></div></body></html>
```
## Performance
- **Large Files**: Processes 1MB+ HTML files in under 5 seconds
- **Memory Efficient**: Memory usage stays under 3x input file size
- **Concurrent Processing**: Supports parallel processing of multiple files
- **Scalable**: Performance scales linearly with input size
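To check timing figures like these on your own hardware, a small harness along the following lines may help. It is a sketch under stated assumptions: `measure` is a hypothetical helper, the synthetic document is only repeated paragraphs, and the transform passed in the example is a stand-in rather than `optimizeHtml`. It measures wall-clock time only, not memory.

```typescript
// Hypothetical harness: time any string-to-string transform on a synthetic document.
function measure(
  transform: (html: string) => string,
  targetBytes: number,
): { bytes: number; ms: number } {
  // ASCII-only unit, so string length equals byte count in UTF-8.
  const unit = '<p class="x">Lorem ipsum dolor sit amet.</p>\n';
  const html = unit.repeat(Math.ceil(targetBytes / unit.length));
  const start = Date.now();
  transform(html);
  return { bytes: html.length, ms: Date.now() - start };
}

// Stand-in transform; pass (html) => optimizeHtml(html) to time the library itself.
const result = measure((html) => html.replace(/\s+/g, ' '), 1000000);
console.log(`${result.bytes} bytes processed in ${result.ms} ms`);
```

Running the harness at several sizes (for example 1 MB, 2 MB, 4 MB) gives a rough view of how processing time grows with input size.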
## Use Cases
The optimizer is well suited to preparing HTML content for vector databases and search systems:
```typescript
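import { optimizeHtml } from '@harutakax/html-rag-optimizer';

// Optimize content before indexing.
// fetchWebPage and url are placeholders for your own fetching logic.
const webContent = await fetchWebPage(url);
const optimizedForRAG = optimizeHtml(webContent, {
  removeComments: true,
  minifyText: true
});
// Index optimizedForRAG in your vector database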
```
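Before indexing, optimized HTML is often split into smaller passages. One possible approach, sketched below, is to start a new chunk at every heading. The `Chunk` type and the `chunkByHeadings` and `stripTags` helpers are hypothetical and not part of the library, and the sketch assumes attributes have already been stripped (the default), so headings appear as bare `<h1>` to `<h6>` tags.

```typescript
interface Chunk {
  heading: string; // plain-text heading, or '' for content before the first heading
  text: string;    // plain-text body under that heading
}

// Replace tags with spaces and collapse whitespace to get plain text.
function stripTags(fragment: string): string {
  return fragment.replace(/<[^>]+>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Hypothetical chunker: start a new chunk at every <h1>..<h6> element.
function chunkByHeadings(optimizedHtml: string): Chunk[] {
  const headingPattern = /<h([1-6])>([\s\S]*?)<\/h\1>/gi;
  const chunks: Chunk[] = [];
  let heading = '';
  let cursor = 0;
  let match: RegExpExecArray | null;

  // Emit the text between `cursor` and `end` under the current heading.
  const flush = (end: number): void => {
    const text = stripTags(optimizedHtml.slice(cursor, end));
    if (heading !== '' || text !== '') chunks.push({ heading, text });
  };

  while ((match = headingPattern.exec(optimizedHtml)) !== null) {
    flush(match.index); // close the chunk that precedes this heading
    heading = stripTags(match[2]);
    cursor = match.index + match[0].length;
  }
  flush(optimizedHtml.length); // close the final chunk
  return chunks;
}

console.log(chunkByHeadings('<h1>A</h1><p>one</p><h2>B</h2><p>two</p>'));
```

Each chunk can then be embedded and stored with its heading as metadata, which tends to make retrieved passages easier to attribute.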
Clean up documentation before feeding to LLMs:
```typescript
import { promises as fs } from 'fs';
import { optimizeHtml } from '@harutakax/html-rag-optimizer';

const docs = await fs.readFile('documentation.html', 'utf-8');
const cleanDocs = optimizeHtml(docs, {
  excludeTags: ['code', 'pre'] // Keep code examples
});
```
Clean scraped content for analysis:
```typescript
// scrapeWebsite and url are placeholders for your own scraping logic.
const scrapedHTML = await scrapeWebsite(url);
const cleanContent = optimizeHtml(scrapedHTML, {
  removeComments: true,
  minifyText: true
});
```
## Requirements
- Node.js 18 or higher
- TypeScript 5.0+ (for development)
## Development
```bash
git clone https://github.com/your-org/html-rag-optimizer.git
cd html-rag-optimizer
pnpm install
pnpm test
pnpm build
pnpm dlx tsx examples/basic-usage.ts
```