# N8N Tools - Document Processor

[![npm version](https://img.shields.io/npm/v/n8n-nodes-n8ntools-document-processor)](https://www.npmjs.com/package/n8n-nodes-n8ntools-document-processor)
[![npm downloads](https://img.shields.io/npm/dt/n8n-nodes-n8ntools-document-processor)](https://www.npmjs.com/package/n8n-nodes-n8ntools-document-processor)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Process and analyze documents with OCR, text extraction, and format conversion capabilities. This N8N community node provides comprehensive document processing through the N8N Tools platform.

## ✨ Features

- **📄 Text Extraction**: Extract text from various document formats
- **🔍 OCR Processing**: Extract text from images and scanned documents
- **🔄 Format Conversion**: Convert between PDF, DOCX, TXT, HTML, MD, RTF
- **📊 Metadata Extraction**: Get document properties and information
- **✂️ Page Splitting**: Split documents into individual pages
- **🔗 Document Merging**: Combine multiple documents
- **🌍 Multi-language OCR**: Support for Portuguese, English, Spanish, French, German
- **💰 Cost Tracking**: Usage monitoring and budget controls

## 🚀 Quick Start

### Installation

Install this node in your N8N instance:

#### Via Community Nodes (Recommended)

1. Go to **Settings > Community Nodes** in your N8N interface
2. Click **Install a community node**
3. Enter `n8n-nodes-n8ntools-document-processor`
4. Click **Install**

#### Via npm

```bash
npm install n8n-nodes-n8ntools-document-processor
```

### Setup Credentials

1. Sign up at [N8N Tools](https://n8ntools.io) and get your API key
2. In N8N, create new **N8N Tools API** credentials
3. Enter your API URL: `https://api.n8ntools.io`
4. Enter your API key

## 📖 Usage

### Supported Operations

| Operation | Description | Input | Output |
|-----------|-------------|-------|--------|
| **Extract Text** | Extract text content | PDF, DOCX, DOC, RTF | Plain text |
| **Extract Metadata** | Get document properties | Any document | JSON metadata |
| **Convert Format** | Change document format | Various formats | PDF, DOCX, TXT, HTML, MD, RTF |
| **Split Pages** | Split into individual pages | PDF, DOCX | ZIP with pages |
| **Merge Documents** | Combine multiple documents | Multiple files | Single document |
| **OCR Processing** | Extract text from images | PDF, images | Text with OCR |

### Example Workflow

```
[File Trigger] → [N8N Tools Document Processor] → [Extract Data] → [Database/Email]
```

### Configuration Example

**Invoice Text Extraction:**

```json
{
  "operation": "extractText",
  "inputSource": "binaryData",
  "binaryPropertyName": "data",
  "advancedOptions": {
    "extractImages": true,
    "extractTables": true,
    "preserveFormatting": true
  }
}
```

## ⚙️ Node Parameters

### Input Configuration

- **Input Source**: Binary Data, File URL, or Base64
- **Binary Property**: Name of binary property (default: "data")
- **File URL**: Direct URL to document file
- **Base64 Data**: Base64 encoded document content

### Operation-Specific Options

#### Format Conversion

- **Target Format**: PDF, DOCX, TXT, HTML, MD, RTF

#### Page Splitting

- **Page Range**: Specific pages (e.g., "1-5") or "all"

#### OCR Processing

- **Language**: Portuguese, English, Spanish, French, German, Auto-detect

### Advanced Options

- **Extract Images**: Include images from document
- **Extract Tables**: Parse table data
- **Preserve Formatting**: Maintain original formatting
- **Password**: For password-protected documents

## 📤 Output Data

### Text Extraction Result

```json
{
  "text": "This is the extracted text content...",
  "wordCount": 1250,
  "pageCount": 3,
  "hasImages": true,
  "hasTables": true,
  "images": [
    {
      "page": 1,
      "base64": "iVBORw0KGgoAAAANSUhEUgAA...",
      "format": "png"
    }
  ],
  "tables": [
    {
      "page": 2,
      "rows": 5,
      "columns": 3,
      "data": [["Header1", "Header2", "Header3"], ...]
    }
  ],
  "success": true,
  "operation": "extractText",
  "creditsUsed": 2,
  "originalFilename": "invoice.pdf"
}
```

### Format Conversion Result

Returns the converted document as binary data with metadata:

```json
{
  "success": true,
  "operation": "convertFormat",
  "originalFilename": "document.pdf",
  "convertedFilename": "document.docx",
  "targetFormat": "docx",
  "creditsUsed": 1
}
```

### Metadata Extraction Result

```json
{
  "filename": "report.pdf",
  "fileSize": 2048000,
  "mimeType": "application/pdf",
  "pageCount": 15,
  "author": "John Doe",
  "title": "Annual Report 2024",
  "subject": "Company Performance",
  "keywords": ["business", "report", "annual"],
  "creationDate": "2024-01-15T10:30:00Z",
  "modificationDate": "2024-01-16T14:20:00Z",
  "hasPassword": false,
  "isEncrypted": false,
  "success": true
}
```

## 🔧 Supported File Formats

### Input Formats

- **PDF**: PDF documents (including password-protected)
- **Microsoft Word**: DOCX, DOC
- **Text**: TXT, RTF
- **Web**: HTML, XML
- **Images**: PNG, JPG, TIFF (for OCR)

### Output Formats

- **PDF**: Portable Document Format
- **DOCX**: Microsoft Word (newer format)
- **TXT**: Plain text
- **HTML**: HyperText Markup Language
- **MD**: Markdown
- **RTF**: Rich Text Format

## 🔍 OCR Capabilities

### Supported Languages

- **Portuguese** (`por`): Optimized for Brazilian Portuguese
- **English** (`eng`): US and UK English
- **Spanish** (`spa`): Latin American and Iberian Spanish
- **French** (`fra`): French language support
- **German** (`deu`): German language support
- **Auto-detect** (`auto`): Automatic language detection

### OCR Example

```json
{
  "operation": "ocrProcessing",
  "inputSource": "fileUrl",
  "fileUrl": "https://example.com/scanned-invoice.pdf",
  "ocrLanguage": "por",
  "advancedOptions": {
    "extractTables": true,
    "preserveFormatting": true
  }
}
```

## 🛠️ Advanced Use Cases

### Invoice Processing Pipeline

```
[Email Trigger] → [Download Attachment] → [Extract Text] → [Parse Data] → [Update CRM]
```

### Document Classification

```
[File Upload] → [Extract Metadata] → [Classify Type] → [Route to Process]
```

### Bulk Document Conversion

```
[File Monitor] → [Document Processor] → [Convert to PDF] → [Archive]
```

### Contract Analysis

```
[Document Input] → [Extract Text] → [Find Key Terms] → [Generate Summary]
```

## 📊 Processing Examples

### Extract Contract Details

```javascript
// Extract specific information from legal documents
{
  "operation": "extractText",
  "advancedOptions": {
    "extractTables": true,
    "preserveFormatting": true
  }
}
// Then use regex or NLP to find specific clauses
```

### Convert Legacy Documents

```javascript
// Convert old DOC files to modern formats
{
  "operation": "convertFormat",
  "targetFormat": "docx"
}
```

### Process Scanned Forms

```javascript
// OCR processing for form data extraction
{
  "operation": "ocrProcessing",
  "ocrLanguage": "eng",
  "advancedOptions": {
    "extractTables": true // For form fields
  }
}
```

## 💸 Pricing & Limits

- **Text Extraction**: 1 credit per document
- **Format Conversion**: 1 credit per conversion
- **OCR Processing**: 2 credits per document
- **Page Splitting**: 1 credit per document
- **Document Merging**: 1 credit per operation
- **File Size Limit**: 100MB per document
- **Page Limit**: 500 pages per document

## 🚨 Error Handling

Common errors and solutions:

```json
// Password-protected document
{
  "error": "Document is password protected",
  "success": false,
  "suggestion": "Provide password in advancedOptions"
}

// Unsupported format
{
  "error": "Unsupported file format: .xyz",
  "success": false,
  "suggestion": "Check supported input formats"
}

// OCR language not detected
{
  "error": "Could not detect document language",
  "success": false,
  "suggestion": "Specify OCR language manually"
}
```

### Password-Protected Documents

```json
{
  "advancedOptions": {
    "password": "your-document-password"
  }
}
```
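
### Routing Errors in a Workflow

The error objects shown above could be routed by a small helper, for example inside a downstream Code node. The following is a minimal sketch, not part of this package: the function name `classifyResult` and the action labels (`continue`, `retryWithPassword`, `retryWithLanguage`, `skip`, `fail`) are hypothetical, and it assumes the `error` strings match the examples above, which may not hold for every response.

```javascript
// Sketch only: maps a node result to a suggested next step.
// The fields `success` and `error` come from the error examples in this README;
// the action labels and the string matching rules are hypothetical.
function classifyResult(result) {
  if (result && result.success === true) {
    return { action: "continue", reason: null };
  }

  // Normalize the message so matching does not depend on letter case.
  const message = String((result && result.error) || "").toLowerCase();

  if (message.includes("password protected")) {
    // Retry with advancedOptions.password set.
    return { action: "retryWithPassword", reason: result.error };
  }
  if (message.includes("could not detect document language")) {
    // Retry with an explicit ocrLanguage such as "eng" or "por".
    return { action: "retryWithLanguage", reason: result.error };
  }
  if (message.includes("unsupported file format")) {
    // Retrying will not help; skip this document.
    return { action: "skip", reason: result.error };
  }

  return { action: "fail", reason: (result && result.error) || "Unknown error" };
}

// Usage with the password-protected example from this section:
const outcome = classifyResult({
  error: "Document is password protected",
  success: false,
  suggestion: "Provide password in advancedOptions",
});
// outcome.action === "retryWithPassword"
```

Matching on message text is brittle; if the API exposes a stable error code, branching on that would be the better choice.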
## 🔄 Integration Examples

### With PDF Generator

```
[Data] → [Generate PDF] → [Extract Text] → [Validate Content]
```

### With Web Scraper

```
[Scrape URLs] → [Download PDFs] → [Process Documents] → [Store Data]
```

### With Email

```
[Email Attachment] → [Process Document] → [Extract Key Info] → [Reply with Summary]
```

## 🔗 Related Packages

- **[PDF Generator](https://npmjs.com/package/n8n-nodes-n8ntools-pdf-generator)**: Create PDFs from processed data
- **[Web Scraper](https://npmjs.com/package/n8n-nodes-n8ntools-web-scraper)**: Scrape documents from websites

## 📋 Requirements

- N8N version 0.174.0 or higher
- N8N Tools account and API key
- Node.js 18+ (for development)

## 🆘 Support

- 📧 **Email**: support@n8ntools.io
- 📖 **Documentation**: [docs.n8ntools.io](https://docs.n8ntools.io)
- 💬 **Community**: [Discord](https://discord.gg/n8ntools)
- 🐛 **Issues**: [GitHub](https://github.com/n8ntools/n8n-nodes/issues)

## 📄 License

MIT License - see [LICENSE](LICENSE) file for details.

---

**Part of the N8N Tools ecosystem** • [Website](https://n8ntools.io) • [All Packages](https://npmjs.com/search?q=n8ntools)