# document-outline-extractor

A flexible TypeScript library for extracting structured outlines from documents of arbitrary length, with optional OpenAI/Azure OpenAI integration for enhanced outline generation.

## Features

- 📝 Extract outlines from Markdown documents
- 🤖 Optional AI-powered outline generation using OpenAI/Azure OpenAI
- 📊 Automatic document chunking for large documents
- 🎯 Smart quality scoring to determine if existing outline is sufficient
- 🔧 Multiple output formats (tree, markdown, JSON)
- ⚡ Fallback to regex-based extraction when AI is unavailable
- 🖥️ Command-line interface for quick testing

## Installation

```bash
npm install -g document-outline-extractor
```

Or as a library:

```bash
npm install document-outline-extractor
```

## CLI Usage

### Basic Commands

```bash
# Extract outline from file
outline-extractor -i document.md

# Extract with specific format
outline-extractor -i document.md -f json -o outline.json

# Use OpenAI for enhanced extraction
outline-extractor -i document.md --openai-key sk-... --model gpt-4o

# Check document quality
outline-extractor -i document.md -q

# Pipe content
cat document.md | outline-extractor -f markdown

# Use configuration file
outline-extractor -i document.md -c config.json
```

### CLI Options

- `-i, --input <file>` - Input markdown file path
- `-o, --output <file>` - Output file path (default: stdout)
- `-f, --format <format>` - Output format: tree, markdown, json, flat
- `-d, --max-depth <n>` - Maximum heading depth to include
- `-q, --quality` - Show quality metrics instead of outline
- `-c, --config <file>` - Configuration file path (JSON)
- `--openai-key <key>` - OpenAI API key
- `--openai-url <url>` - OpenAI base URL
- `--model <name>` - Model name
- `-h, --help` - Show help message
- `-v, --version` - Show version

### Configuration File

Create a `config.json` file:

```json
{
  "format": "markdown",
  "maxDepth": 3,
  "openai": {
    "apiKey": "your-api-key",
    "baseUrl": "https://api.openai.com/v1",
    "model": "gpt-4o-mini",
    "temperature": 0.3,
    "maxTokens": 2000
  },
  "extractor": {
    "chunkSize": 5000,
    "qualityThreshold": 0.8,
    "defaultFormat": "tree"
  }
}
```

## Library Usage

### Basic Usage

```typescript
import { OutlineExtractor } from 'document-outline-extractor';

const extractor = new OutlineExtractor();
const outline = await extractor.extract(markdownContent);
console.log(outline);
```

### With OpenAI Configuration

```typescript
import { OutlineExtractor } from 'document-outline-extractor';

const extractor = new OutlineExtractor({
  openai: {
    baseUrl: 'https://api.openai.com/v1',
    apiKey: 'your-api-key',
    model: 'gpt-4o-mini',
    temperature: 0.5,
    maxTokens: 3000
  }
});

const outline = await extractor.extract(markdownContent, {
  format: 'json',
  maxDepth: 3
});
```

### Quality Evaluation

```typescript
const extractor = new OutlineExtractor();
const metrics = extractor.evaluateQuality(markdownContent);

console.log('Quality Score:', metrics.score);
console.log('Heading Count:', metrics.headingCount);
console.log('Max Depth:', metrics.depth);
```

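To get a feel for what `headingCount` and `depth` describe, the following standalone sketch counts ATX (`#`) headings with a regular expression. It is an illustration only and not the package's implementation: `summarizeHeadings` and `HeadingSummary` are hypothetical names. It assumes that only `#`-style headings outside fenced code blocks count.

```typescript
// Hypothetical helper for illustration; not part of document-outline-extractor.
interface HeadingSummary {
  headingCount: number; // number of ATX headings found
  depth: number;        // deepest heading level seen (0 if there are none)
}

function summarizeHeadings(markdown: string): HeadingSummary {
  const headingPattern = /^(#{1,6})\s+\S/; // 1-6 hashes, whitespace, then text
  const fencePattern = /^\s*(`{3,}|~{3,})/; // start or end of a fenced code block
  let insideFence = false;
  let headingCount = 0;
  let depth = 0;

  for (const line of markdown.split(/\r?\n/)) {
    // Toggle on fence markers so "# comment" lines inside code are skipped.
    if (fencePattern.test(line)) {
      insideFence = !insideFence;
      continue;
    }
    if (insideFence) continue;

    const match = headingPattern.exec(line);
    if (match) {
      headingCount += 1;
      depth = Math.max(depth, match[1].length);
    }
  }
  return { headingCount, depth };
}

const fence = '`'.repeat(3); // built at runtime to keep this example easy to embed
const sample = [
  '# Title',
  '## Section',
  fence + 'bash',
  '# a shell comment, not a heading',
  fence,
  '### Subsection',
].join('\n');

const summary = summarizeHeadings(sample);
console.log(summary);
```

The sketch ignores Setext (underlined) headings and nested fences, so treat its numbers as a rough approximation of the structural inputs to a quality score.
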
### Document Chunking

```typescript
const extractor = new OutlineExtractor({
  chunkSize: 3000
});

const chunks = extractor.splitDocument(longDocument, 'smart');
for (const chunk of chunks) {
  console.log('Chunk length:', chunk.length);
}
```

### Custom OpenAI Parameters per Request

```typescript
// Override temperature and max tokens for specific requests
const extractor = new OutlineExtractor({
  openai: {
    baseUrl: 'https://api.openai.com/v1',
    apiKey: 'your-api-key',
    model: 'gpt-4o-mini'
  }
});

// Pass custom parameters to generateOutlineWithAI
const outline = await extractor.generateOutlineWithAI(content, systemPrompt, {
  temperature: 0.7,
  maxTokens: 4000,
  maxCompletionTokens: 3500 // Use max_completion_tokens instead of max_tokens
});
```

## API Reference

### `OutlineExtractor`

Main class for extracting outlines.

#### Constructor Options

```typescript
interface OutlineExtractorConfig {
  openai?: OpenAIConfig;          // OpenAI configuration
  chunkSize?: number;             // Max chunk size (default: 5000)
  qualityThreshold?: number;      // Min quality score (default: 0.8)
  defaultFormat?: OutlineFormat;  // Default output format
  caching?: boolean;              // Enable caching (default: true)
}
```

#### Methods

- `extract(content: string, options?: ExtractOptions)` - Extract outline from content
- `evaluateQuality(content: string)` - Evaluate outline quality score
- `splitDocument(content: string, strategy?: ChunkingStrategy)` - Split document into chunks
- `clearCache()` - Clear internal cache
- `updateConfig(config: Partial<OutlineExtractorConfig>)` - Update configuration

### Output Formats

- **tree** - Indented tree structure
- **markdown** - Markdown headings
- **json** - JSON object with hierarchy
- **flat** - Numbered flat list

## Examples

### Extract from README

```bash
outline-extractor -i README.md -f tree
```

### Generate JSON Outline

```bash
outline-extractor -i document.md -f json -o outline.json
```

### Quality Check

```bash
outline-extractor -i document.md -q
```

Output:

```
Document Outline Quality Metrics:
────────────────────────────────────
Overall Score:    85.3%
Richness:         50.0%
Balance:          92.1%
Coherence:        100.0%
Coverage:         8.5%
Heading Count:    12
Max Depth:        3
────────────────────────────────────
✓ Document has good outline structure
```

## Development

```bash
# Install dependencies
npm install

# Build
npm run build

# Test
npm test

# Run CLI in development
npm run cli -- -i document.md
```

## License

MIT