# document-outline-extractor
A flexible TypeScript library for extracting structured outlines from documents of arbitrary length, with optional OpenAI/Azure OpenAI integration for enhanced outline generation.
## Features
- Extract outlines from Markdown documents
- Optional AI-powered outline generation using OpenAI/Azure OpenAI
- Automatic document chunking for large documents
- Smart quality scoring to determine if existing outline is sufficient
- Multiple output formats (tree, markdown, JSON, flat)
- Fallback to regex-based extraction when AI is unavailable
- Command-line interface for quick testing
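
The regex-based fallback mentioned above can be pictured as a scan for ATX headings (`#` through `######`). The following is a minimal sketch under that assumption, not the library's actual implementation; `Heading` and `extractHeadings` are hypothetical names introduced for illustration.

```typescript
// Hypothetical types and names for illustration; not the library's API.
interface Heading {
  level: number; // 1 for "#", 2 for "##", and so on
  text: string;
}

function extractHeadings(markdown: string, maxDepth = 6): Heading[] {
  const headings: Heading[] = [];
  let inFence = false;
  for (const line of markdown.split(/\r?\n/)) {
    // Toggle on fenced code blocks so "# comment" lines inside code are skipped.
    if (/^\s*(`{3}|~{3})/.test(line)) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;
    // One to six "#" characters, whitespace, then the heading text.
    const match = /^(#{1,6})\s+(.+)$/.exec(line);
    if (!match) continue;
    const level = (match[1] ?? "").length;
    const text = (match[2] ?? "").trim();
    if (level <= maxDepth && text.length > 0) {
      headings.push({ level, text });
    }
  }
  return headings;
}
```

Calling `extractHeadings(markdown, 2)` would keep only `#` and `##` headings, which mirrors what a maximum-depth option is described as doing.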
## Installation
To use the CLI, install globally:
```bash
npm install -g document-outline-extractor
```
Or as a library:
```bash
npm install document-outline-extractor
```
## CLI Usage
### Basic Commands
```bash
# Extract outline from file
outline-extractor -i document.md
# Extract with specific format
outline-extractor -i document.md -f json -o outline.json
# Use OpenAI for enhanced extraction
outline-extractor -i document.md --openai-key sk-... --model gpt-4o
# Check document quality
outline-extractor -i document.md -q
# Pipe content
cat document.md | outline-extractor -f markdown
# Use configuration file
outline-extractor -i document.md -c config.json
```
### CLI Options
- `-i, --input <file>` - Input markdown file path
- `-o, --output <file>` - Output file path (default: stdout)
- `-f, --format <format>` - Output format: tree, markdown, json, flat
- `-d, --max-depth <n>` - Maximum heading depth to include
- `-q, --quality` - Show quality metrics instead of outline
- `-c, --config <file>` - Configuration file path (JSON)
- `--openai-key <key>` - OpenAI API key
- `--openai-url <url>` - OpenAI base URL
- `--model <name>` - Model name
- `-h, --help` - Show help message
- `-v, --version` - Show version
### Configuration File
Create a `config.json` file:
```json
{
  "format": "markdown",
  "maxDepth": 3,
  "openai": {
    "apiKey": "your-api-key",
    "baseUrl": "https://api.openai.com/v1",
    "model": "gpt-4o-mini",
    "temperature": 0.3,
    "maxTokens": 2000
  },
  "extractor": {
    "chunkSize": 5000,
    "qualityThreshold": 0.8,
    "defaultFormat": "tree"
  }
}
```
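
Any key left out of the `extractor` block presumably falls back to a default. A minimal sketch of that merge, assuming the documented defaults for `chunkSize` (5000) and `qualityThreshold` (0.8); the `"tree"` default format, `ExtractorSettings`, and `resolveExtractorSettings` are assumptions made for illustration and are not part of the library.

```typescript
// Hypothetical helper: names and merge behavior are assumptions.
interface ExtractorSettings {
  chunkSize: number;
  qualityThreshold: number;
  defaultFormat: "tree" | "markdown" | "json" | "flat";
}

// chunkSize and qualityThreshold follow the documented defaults;
// the "tree" default format is an assumption.
const DEFAULT_SETTINGS: ExtractorSettings = {
  chunkSize: 5000,
  qualityThreshold: 0.8,
  defaultFormat: "tree",
};

function resolveExtractorSettings(configJson: string): ExtractorSettings {
  const parsed = JSON.parse(configJson) as {
    extractor?: Partial<ExtractorSettings>;
  };
  // Keys present in the file override the defaults; missing keys keep them.
  return { ...DEFAULT_SETTINGS, ...(parsed.extractor ?? {}) };
}
```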
## Library Usage
### Basic Usage
```typescript
import { OutlineExtractor } from 'document-outline-extractor';
const extractor = new OutlineExtractor();
const outline = await extractor.extract(markdownContent);
console.log(outline);
```
### With OpenAI Configuration
```typescript
import { OutlineExtractor } from 'document-outline-extractor';
const extractor = new OutlineExtractor({
  openai: {
    baseUrl: 'https://api.openai.com/v1',
    apiKey: 'your-api-key',
    model: 'gpt-4o-mini',
    temperature: 0.5,
    maxTokens: 3000
  }
});
const outline = await extractor.extract(markdownContent, {
  format: 'json',
  maxDepth: 3
});
```
### Quality Evaluation
```typescript
const extractor = new OutlineExtractor();
const metrics = extractor.evaluateQuality(markdownContent);
console.log('Quality Score:', metrics.score);
console.log('Heading Count:', metrics.headingCount);
console.log('Max Depth:', metrics.depth);
```
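
The quality score is described as deciding whether the existing outline is sufficient. A minimal sketch of how that decision could be wired up, assuming `metrics.score` uses the same 0 to 1 scale as `qualityThreshold` (the CLI prints it as a percentage); `chooseStrategy` and the strategy names are hypothetical and not part of the library's API.

```typescript
// Hypothetical helper; names are illustrative only.
type Strategy = "existing-headings" | "ai" | "regex-fallback";

function chooseStrategy(
  score: number,
  qualityThreshold = 0.8, // documented default threshold
  aiConfigured = false
): Strategy {
  // A good enough outline is used as-is.
  if (score >= qualityThreshold) return "existing-headings";
  // Otherwise prefer AI when it is configured, else fall back to regex.
  return aiConfigured ? "ai" : "regex-fallback";
}
```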
### Document Chunking
```typescript
const extractor = new OutlineExtractor({ chunkSize: 3000 });
const chunks = extractor.splitDocument(longDocument, 'smart');
for (const chunk of chunks) {
  console.log('Chunk length:', chunk.length);
}
```
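
What the `'smart'` strategy does internally is not documented here. One plausible approach, shown purely as a sketch under that assumption, is to cut at heading boundaries and pack whole sections into chunks of at most `chunkSize` characters; `splitAtHeadings` is a hypothetical name, not a library function.

```typescript
// Hypothetical sketch of heading-aware chunking; not the library's algorithm.
// For brevity it does not treat fenced code blocks specially.
function splitAtHeadings(content: string, chunkSize = 5000): string[] {
  // 1. Cut the document into sections, each starting at a heading line.
  const sections: string[] = [];
  let section: string[] = [];
  for (const line of content.split("\n")) {
    if (/^#{1,6}\s/.test(line) && section.length > 0) {
      sections.push(section.join("\n"));
      section = [];
    }
    section.push(line);
  }
  if (section.length > 0) sections.push(section.join("\n"));

  // 2. Greedily pack whole sections into chunks of at most chunkSize characters.
  const chunks: string[] = [];
  let current = "";
  for (const s of sections) {
    const joined = current === "" ? s : current + "\n" + s;
    if (joined.length <= chunkSize) {
      current = joined;
      continue;
    }
    if (current !== "") chunks.push(current);
    // 3. A single section longer than chunkSize is hard-split by length.
    let rest = s;
    while (rest.length > chunkSize) {
      chunks.push(rest.slice(0, chunkSize));
      rest = rest.slice(chunkSize);
    }
    current = rest;
  }
  if (current !== "") chunks.push(current);
  return chunks;
}
```

Packing whole sections keeps each heading together with its body, so any per-chunk outline pass sees complete sections wherever the size limit allows.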
### Custom OpenAI Parameters per Request
```typescript
// Override temperature and max tokens for specific requests
const extractor = new OutlineExtractor({
  openai: {
    baseUrl: 'https://api.openai.com/v1',
    apiKey: 'your-api-key',
    model: 'gpt-4o-mini'
  }
});
// Pass custom parameters to generateOutlineWithAI
// (`content` and `systemPrompt` are strings you supply)
const outline = await extractor.generateOutlineWithAI(content, systemPrompt, {
  temperature: 0.7,
  maxTokens: 4000,
  maxCompletionTokens: 3500 // Use max_completion_tokens instead of max_tokens
});
```
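
The comment on `maxCompletionTokens` suggests that it takes precedence over `maxTokens` when both are given. A minimal sketch of that mapping onto the `max_tokens` and `max_completion_tokens` request fields of the OpenAI Chat Completions API; `CompletionOverrides` and `buildTokenParams` are hypothetical names, and the precedence rule is an assumption drawn from the comment above.

```typescript
// Hypothetical helper; the library's internal request building may differ.
interface CompletionOverrides {
  temperature?: number;
  maxTokens?: number;
  maxCompletionTokens?: number;
}

function buildTokenParams(overrides: CompletionOverrides): Record<string, number> {
  const params: Record<string, number> = {};
  if (overrides.temperature !== undefined) {
    params["temperature"] = overrides.temperature;
  }
  // Assumed precedence: max_completion_tokens replaces max_tokens when set.
  if (overrides.maxCompletionTokens !== undefined) {
    params["max_completion_tokens"] = overrides.maxCompletionTokens;
  } else if (overrides.maxTokens !== undefined) {
    params["max_tokens"] = overrides.maxTokens;
  }
  return params;
}
```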
## API Reference
### `OutlineExtractor`
Main class for extracting outlines.
#### Constructor Options
```typescript
interface OutlineExtractorConfig {
  openai?: OpenAIConfig;          // OpenAI configuration
  chunkSize?: number;             // Max chunk size (default: 5000)
  qualityThreshold?: number;      // Min quality score (default: 0.8)
  defaultFormat?: OutlineFormat;  // Default output format
  caching?: boolean;              // Enable caching (default: true)
}
```
#### Methods
- `extract(content: string, options?: ExtractOptions)` - Extract outline from content
- `evaluateQuality(content: string)` - Evaluate outline quality score
- `splitDocument(content: string, strategy?: ChunkingStrategy)` - Split document into chunks
- `generateOutlineWithAI(content, systemPrompt, options)` - Generate an outline with the configured OpenAI model, with per-request parameter overrides
- `clearCache()` - Clear internal cache
- `updateConfig(config: Partial<OutlineExtractorConfig>)` - Update configuration
### Output Formats
- **tree** - Indented tree structure
- **markdown** - Markdown headings
- **json** - JSON object with hierarchy
- **flat** - Numbered flat list
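
To make the four formats concrete, here is a minimal sketch that renders one list of headings in each of them. The exact layout the library emits may differ; `Heading`, `OutlineNode`, `buildHierarchy`, and `renderOutline` are hypothetical names introduced for illustration.

```typescript
// Hypothetical types and functions; output layout is an assumption.
interface Heading {
  level: number;
  text: string;
}

interface OutlineNode {
  title: string;
  level: number;
  children: OutlineNode[];
}

type OutlineFormat = "tree" | "markdown" | "json" | "flat";

// Nest headings under the closest preceding heading with a smaller level.
function buildHierarchy(headings: Heading[]): OutlineNode[] {
  const roots: OutlineNode[] = [];
  const stack: OutlineNode[] = [];
  for (const h of headings) {
    const node: OutlineNode = { title: h.text, level: h.level, children: [] };
    let parent = stack[stack.length - 1];
    // Pop siblings and deeper headings until a shallower one remains.
    while (parent !== undefined && parent.level >= h.level) {
      stack.pop();
      parent = stack[stack.length - 1];
    }
    if (parent !== undefined) parent.children.push(node);
    else roots.push(node);
    stack.push(node);
  }
  return roots;
}

function renderOutline(headings: Heading[], format: OutlineFormat): string {
  switch (format) {
    case "tree":
      return headings.map((h) => "  ".repeat(h.level - 1) + "- " + h.text).join("\n");
    case "markdown":
      return headings.map((h) => "#".repeat(h.level) + " " + h.text).join("\n");
    case "flat":
      return headings.map((h, i) => `${i + 1}. ${h.text}`).join("\n");
    case "json":
      return JSON.stringify(buildHierarchy(headings), null, 2);
  }
}
```

The `json` case is the only one that needs the explicit hierarchy; the other three can be produced directly from the flat heading list.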
## Examples
### Extract from README
```bash
outline-extractor -i README.md -f tree
```
### Generate JSON Outline
```bash
outline-extractor -i document.md -f json -o outline.json
```
### Quality Check
```bash
outline-extractor -i document.md -q
```
Output:
```
Document Outline Quality Metrics:
────────────────────────────────────
Overall Score: 85.3%
Richness: 50.0%
Balance: 92.1%
Coherence: 100.0%
Coverage: 8.5%
Heading Count: 12
Max Depth: 3
────────────────────────────────────
✓ Document has good outline structure
```
## Development
```bash
# Install dependencies
npm install
# Build
npm run build
# Test
npm test
# Run CLI in development
npm run cli -- -i document.md
```
## License
MIT