html-content-processor

A professional library for processing, cleaning, filtering, and converting HTML content to Markdown. Features advanced customization options, presets, plugin support, fluent API, and TypeScript integration for reliable content extraction.

# HTML Content Processor

A modern TypeScript library for cleaning, filtering, and converting HTML content to Markdown with intelligent content extraction. Supports cross-environment execution (Browser/Node.js) with automatic page type detection.

## Features

- 🚀 **Modern API Design** - Clean functional and class-based APIs
- 🧠 **Intelligent Filtering** - Automatic page type detection with optimal filtering strategies
- 📝 **High-Quality Markdown Conversion** - Advanced HTML to Markdown transformation
- 🌐 **Cross-Environment Support** - Full compatibility with both browser and Node.js environments
- 🎯 **Smart Presets** - Optimized configurations for different content types
- 🔌 **Plugin System** - Extensible plugin architecture
- 📊 **Automatic Detection** - Smart detection of search engines, blogs, news, documentation, and more

## Installation

```bash
npm install html-content-processor
```

## Quick Start

### Basic Usage

```typescript
import { htmlToMarkdown, htmlToText, cleanHtml } from 'html-content-processor';

// Convert HTML to Markdown
const markdown = await htmlToMarkdown('<h1>Hello</h1><p>World</p>');

// Convert HTML to plain text
const text = await htmlToText('<h1>Hello</h1><p>World</p>');

// Clean HTML content
const clean = await cleanHtml('<div>Content</div><script>ads</script>');
```

### Automatic Page Type Detection (Recommended)

The library can automatically detect page types and apply optimal filtering strategies:

```typescript
import { htmlToMarkdownAuto, cleanHtmlAuto, extractContentAuto } from 'html-content-processor';

// Automatic detection with URL context
const markdown = await htmlToMarkdownAuto(html, 'https://example.com/blog-post');

// Clean HTML with automatic page type detection
const cleanHtml = await cleanHtmlAuto(html, 'https://news.example.com/article');

// Extract content with detailed page type information
const result = await extractContentAuto(html, 'https://docs.example.com/guide');
console.log('Detected page type:', result.pageType.type);
console.log('Confidence:', result.pageType.confidence);
console.log('Markdown:', result.markdown.content);
```

### HtmlProcessor Class (Advanced Usage)

```typescript
import { HtmlProcessor } from 'html-content-processor';

// Method chaining
const result = await HtmlProcessor
  .from(html)
  .withBaseUrl('https://example.com')
  .withAutoDetection() // Enable automatic page type detection
  .filter()
  .toMarkdown();

// Manual page type setting
const processor = await HtmlProcessor
  .from(html)
  .withPageType('blog') // Manually set page type
  .filter();

const markdown = await processor.toMarkdown();
```

### Content-Specific Presets

```typescript
import {
  htmlToArticleMarkdown,
  htmlToBlogMarkdown,
  htmlToNewsMarkdown
} from 'html-content-processor';

// Optimized for different content types
const articleMd = await htmlToArticleMarkdown(html, baseUrl);
const blogMd = await htmlToBlogMarkdown(html, baseUrl);
const newsMd = await htmlToNewsMarkdown(html, baseUrl);
```

## API Reference

### Core Functions

| Function | Description | Return Type |
|----------|-------------|-------------|
| `htmlToMarkdown(html, options?)` | Convert HTML to Markdown | `Promise<string>` |
| `htmlToMarkdownWithCitations(html, baseUrl?, options?)` | Convert HTML to Markdown with citations | `Promise<string>` |
| `htmlToText(html, options?)` | Convert HTML to plain text | `Promise<string>` |
| `cleanHtml(html, options?)` | Clean HTML content | `Promise<string>` |
| `extractContent(html, options?)` | Extract content fragments | `Promise<string[]>` |

### Automatic Detection Functions

| Function | Description | Benefits |
|----------|-------------|----------|
| `htmlToMarkdownAuto(html, url?, options?)` | Auto-detect page type and convert to Markdown | Optimal filtering for each page type |
| `cleanHtmlAuto(html, url?, options?)` | Auto-detect page type and clean HTML | Smart noise removal |
| `extractContentAuto(html, url?, options?)` | Auto-detect and extract with detailed results | Comprehensive page analysis |

#### Example: Using Auto-Detection

```typescript
// Blog post detection
const blogResult = await htmlToMarkdownAuto(html, 'https://medium.com/@user/post');
// Automatically applies blog-optimized filtering

// News article detection
const newsResult = await htmlToMarkdownAuto(html, 'https://cnn.com/article');
// Automatically applies news-optimized filtering

// Documentation detection
const docsResult = await htmlToMarkdownAuto(html, 'https://docs.react.dev/guide');
// Automatically applies documentation-optimized filtering

// Search engine results detection
const searchResult = await htmlToMarkdownAuto(html, 'https://google.com/search?q=query');
// Automatically applies search-results-optimized filtering
```

### Content-Specific Presets

| Function | Optimized For |
|----------|---------------|
| `htmlToArticleMarkdown()` | Long-form articles |
| `htmlToBlogMarkdown()` | Blog posts |
| `htmlToNewsMarkdown()` | News articles |
| `strictCleanHtml()` | Aggressive cleaning |
| `gentleCleanHtml()` | Conservative cleaning |

### HtmlProcessor Class

```typescript
// Create processor
const processor = HtmlProcessor.from(html, options);

// Configuration methods
processor.withBaseUrl(url)           // Set base URL
processor.withOptions(options)       // Update options
processor.withAutoDetection(url?)    // Enable auto-detection
processor.withPageType(type)         // Manually set page type

// Processing methods
await processor.filter(options?)     // Apply filtering
await processor.toMarkdown(options?) // Convert to Markdown
await processor.toText()             // Convert to plain text
await processor.toArray()            // Convert to fragment array
processor.toString()                 // Get cleaned HTML

// Information methods
processor.getOptions()               // Get current options
processor.isProcessed()              // Check if processed
processor.getPageTypeResult()        // Get page type detection result
```

## Configuration Options

### Filter Options (FilterOptions)

```typescript
{
  threshold?: number;             // Filtering threshold (default: 2)
  strategy?: 'fixed' | 'dynamic'; // Filtering strategy (default: 'dynamic')
  ratio?: number;                 // Text density ratio (default: 0.48)
  minWords?: number;              // Minimum word count (default: 0)
  preserveStructure?: boolean;    // Preserve structure (default: false)
  keepElements?: string[];        // Elements to keep
  removeElements?: string[];      // Elements to remove
}
```

### Convert Options (ConvertOptions)

```typescript
{
  citations?: boolean;            // Generate citations (default: true)
  ignoreLinks?: boolean;          // Ignore links (default: false)
  ignoreImages?: boolean;         // Ignore images (default: false)
  baseUrl?: string;               // Base URL
  threshold?: number;             // Filter threshold
  strategy?: 'fixed' | 'dynamic'; // Filter strategy
  ratio?: number;                 // Text density ratio
}
```

## Automatic Page Type Detection

The library automatically detects and optimizes for these page types:

- `search-engine` - Search engine result pages
- `blog` - Blog posts and personal articles
- `news` - News articles and journalism
- `documentation` - Technical documentation
- `e-commerce` - E-commerce and product pages
- `social-media` - Social media content
- `forum` - Forum discussions and Q&A
- `article` - General articles and content
- `landing-page` - Marketing and landing pages

### How Auto-Detection Works

```typescript
import { extractContentAuto } from 'html-content-processor';

const result = await extractContentAuto(html, url);

console.log('Page Type:', result.pageType.type);
console.log('Confidence:', (result.pageType.confidence * 100).toFixed(1) + '%');
console.log('Detection Reasons:', result.pageType.reasons);
console.log('Applied Filter Options:', result.pageType.filterOptions);
```

## Environment Support

### Node.js

```bash
npm install jsdom # Recommended for best performance
```

### Browser

Direct support, no additional dependencies required.

### CDN

```html
<script src="https://unpkg.com/html-content-processor"></script>
<script>
  // Global variable: window.htmlFilter
  htmlFilter.htmlToMarkdown(html).then(console.log);

  // Auto-detection example
  htmlFilter.htmlToMarkdownAuto(html, window.location.href).then(result => {
    console.log('Auto-detected content:', result);
  });
</script>
```

## Real-World Examples

### Web Scraping with Auto-Detection

```typescript
import { htmlToMarkdownAuto } from 'html-content-processor';

// Scrape and convert blog post
const response = await fetch('https://blog.example.com/post-123');
const html = await response.text();
const markdown = await htmlToMarkdownAuto(html, response.url);
// Automatically detects it's a blog and applies blog-specific filtering
```

### News Article Processing

```typescript
import { extractContentAuto } from 'html-content-processor';

const result = await extractContentAuto(newsHtml, 'https://news.site.com/article');

if (result.pageType.type === 'news') {
  console.log('High-quality news content extracted');
  console.log('Confidence:', result.pageType.confidence);
}
```

### Documentation Conversion

```typescript
import { htmlToMarkdownAuto } from 'html-content-processor';

// Convert technical documentation
const docMarkdown = await htmlToMarkdownAuto(docsHtml, 'https://docs.example.com/api');
// Automatically preserves code blocks, headers, and technical content structure
```

## Performance

- ⚡ **Fast Processing**: Optimized algorithms for quick content extraction
- 💾 **Memory Efficient**: Minimal memory footprint
- 🔄 **Batch Processing**: Handle multiple documents efficiently
- 📊 **Smart Caching**: Automatic page type detection caching

## License

MIT License
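
## Appendix: Illustrative URL Heuristic

The auto-detection examples in this README pass a URL alongside the HTML and receive a result with `type`, `confidence`, and `reasons` fields. The sketch below shows one way such a URL hint could be turned into a page type guess. It is not the library's detection logic: the function name `guessPageTypeFromUrl`, the matching rules, and the confidence values are all assumptions made for illustration, and it covers only a few of the page types listed above.

```typescript
// Hypothetical sketch, NOT part of html-content-processor.
// Only a subset of the README's page types is covered here.
type PageType = 'search-engine' | 'documentation' | 'news' | 'blog' | 'article';

interface PageTypeGuess {
  type: PageType;
  confidence: number; // 0..1; the values below are arbitrary illustrative numbers
  reasons: string[];
}

function guessPageTypeFromUrl(url: string): PageTypeGuess {
  const { hostname, pathname, searchParams } = new URL(url);

  // Rules are checked from most specific to least specific.
  if (pathname.startsWith('/search') && searchParams.has('q')) {
    return {
      type: 'search-engine',
      confidence: 0.9,
      reasons: ['path starts with /search and a q parameter is present'],
    };
  }
  if (hostname.startsWith('docs.') || pathname.startsWith('/docs')) {
    return {
      type: 'documentation',
      confidence: 0.8,
      reasons: ['docs subdomain or /docs path'],
    };
  }
  if (hostname.startsWith('news.') || pathname.includes('/news/')) {
    return {
      type: 'news',
      confidence: 0.7,
      reasons: ['news subdomain or /news/ path segment'],
    };
  }
  if (hostname.startsWith('blog.') || pathname.includes('/blog')) {
    return {
      type: 'blog',
      confidence: 0.7,
      reasons: ['blog subdomain or /blog in the path'],
    };
  }

  // No rule matched: fall back to the general-purpose type with low confidence.
  return {
    type: 'article',
    confidence: 0.3,
    reasons: ['no URL rule matched'],
  };
}

console.log(guessPageTypeFromUrl('https://docs.example.com/guide').type); // "documentation"
```

A URL-only guess like this is cheap but weak, which is why the fallback carries a low confidence value; a real detector would presumably also inspect the HTML itself before choosing filter options.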