# N8N Tools - Document Processor
Process and analyze documents with OCR, text extraction, and format conversion. This N8N community node sends documents to the N8N Tools platform, which handles text extraction, OCR, format conversion, metadata extraction, page splitting, and document merging.
## ✨ Features
- **📄 Text Extraction**: Extract text from various document formats
- **🔍 OCR Processing**: Extract text from images and scanned documents
- **🔄 Format Conversion**: Convert between PDF, DOCX, TXT, HTML, MD, RTF
- **📊 Metadata Extraction**: Get document properties and information
- **✂️ Page Splitting**: Split documents into individual pages
- **🔗 Document Merging**: Combine multiple documents
- **🌍 Multi-language OCR**: Support for Portuguese, English, Spanish, French, German
- **💰 Cost Tracking**: Usage monitoring and budget controls
## 🚀 Quick Start
### Installation
Install this node in your N8N instance:
#### Via Community Nodes (Recommended)
1. Go to **Settings > Community Nodes** in your N8N interface
2. Click **Install a community node**
3. Enter `n8n-nodes-n8ntools-document-processor`
4. Click **Install**
#### Via npm
```bash
npm install n8n-nodes-n8ntools-document-processor
```
### Setup Credentials
1. Sign up at [N8N Tools](https://n8ntools.io) and get your API key
2. In N8N, create new **N8N Tools API** credentials
3. Enter your API URL: `https://api.n8ntools.io`
4. Enter your API key
## 📖 Usage
### Supported Operations
| Operation | Description | Input | Output |
|-----------|-------------|-------|--------|
| **Extract Text** | Extract text content | PDF, DOCX, DOC, RTF | Plain text |
| **Extract Metadata** | Get document properties | Any document | JSON metadata |
| **Convert Format** | Change document format | Various formats | PDF, DOCX, TXT, HTML, MD, RTF |
| **Split Pages** | Split into individual pages | PDF, DOCX | ZIP with pages |
| **Merge Documents** | Combine multiple documents | Multiple files | Single document |
| **OCR Processing** | Extract text from images | PDF, images | Text with OCR |
### Example Workflow
```
[File Trigger] → [N8N Tools Document Processor] → [Extract Data] → [Database/Email]
```
### Configuration Example
**Invoice Text Extraction:**
```json
{
  "operation": "extractText",
  "inputSource": "binaryData",
  "binaryPropertyName": "data",
  "advancedOptions": {
    "extractImages": true,
    "extractTables": true,
    "preserveFormatting": true
  }
}
```
## ⚙️ Node Parameters
### Input Configuration
- **Input Source**: Binary Data, File URL, or Base64
- **Binary Property**: Name of binary property (default: "data")
- **File URL**: Direct URL to document file
- **Base64 Data**: Base64 encoded document content
### Operation-Specific Options
#### Format Conversion
- **Target Format**: PDF, DOCX, TXT, HTML, MD, RTF
#### Page Splitting
- **Page Range**: Specific pages (e.g., "1-5") or "all"
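How the service parses the Page Range value is not documented here. As a minimal sketch, assuming the value is either `"all"` or a single `start-end` range like the example above, a workflow could validate it before calling the node. `parsePageRange` is a hypothetical helper and not part of this package:

```javascript
// Hypothetical helper: validates a Page Range value before it is sent to the node.
// Assumes only "all" or a single "start-end" range (1-based, inclusive) is used.
function parsePageRange(value, pageCount) {
  const trimmed = String(value).trim().toLowerCase();
  if (trimmed === "all") {
    return { start: 1, end: pageCount };
  }
  const match = /^(\d+)-(\d+)$/.exec(trimmed);
  if (!match) {
    throw new Error(`Invalid page range: "${value}"`);
  }
  const start = Number(match[1]);
  const end = Number(match[2]);
  if (start < 1 || end < start || end > pageCount) {
    throw new Error(`Page range ${start}-${end} is outside 1-${pageCount}`);
  }
  return { start, end };
}
```

The page count could come from a prior Extract Metadata call, which returns `pageCount`.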
#### OCR Processing
- **Language**: Portuguese, English, Spanish, French, German, Auto-detect
### Advanced Options
- **Extract Images**: Include images from document
- **Extract Tables**: Parse table data
- **Preserve Formatting**: Maintain original formatting
- **Password**: For password-protected documents
## 📤 Output Data
### Text Extraction Result
```json
{
  "text": "This is the extracted text content...",
  "wordCount": 1250,
  "pageCount": 3,
  "hasImages": true,
  "hasTables": true,
  "images": [
    {
      "page": 1,
      "base64": "iVBORw0KGgoAAAANSUhEUgAA...",
      "format": "png"
    }
  ],
  "tables": [
    {
      "page": 2,
      "rows": 5,
      "columns": 3,
      "data": [["Header1", "Header2", "Header3"], ...]
    }
  ],
  "success": true,
  "operation": "extractText",
  "creditsUsed": 2,
  "originalFilename": "invoice.pdf"
}
```
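Downstream nodes often want table rows as keyed objects rather than nested arrays. The sketch below assumes the first row of each table's `data` array holds the column headers, as the sample above suggests. `tableToObjects` is a hypothetical helper, and the cell values are made up for illustration:

```javascript
// Hypothetical helper: turns one entry of the "tables" array into row objects.
// Assumes the first row of "data" holds the column headers.
function tableToObjects(table) {
  if (!Array.isArray(table.data) || table.data.length === 0) {
    return [];
  }
  const [headers, ...rows] = table.data;
  return rows.map((row) =>
    Object.fromEntries(headers.map((header, i) => [header, row[i] ?? null]))
  );
}

// Sample input shaped like one "tables" entry (cell values are illustrative).
const sampleTable = {
  page: 2,
  rows: 3,
  columns: 3,
  data: [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.50"],
    ["Gadget", "1", "20.00"],
  ],
};

const records = tableToObjects(sampleTable);
```

In an N8N Code node, the same function could be applied to each item's `tables` array before writing rows to a database.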
### Format Conversion Result
Returns the converted document as binary data with metadata:
```json
{
  "success": true,
  "operation": "convertFormat",
  "originalFilename": "document.pdf",
  "convertedFilename": "document.docx",
  "targetFormat": "docx",
  "creditsUsed": 1
}
```
### Metadata Extraction Result
```json
{
  "filename": "report.pdf",
  "fileSize": 2048000,
  "mimeType": "application/pdf",
  "pageCount": 15,
  "author": "John Doe",
  "title": "Annual Report 2024",
  "subject": "Company Performance",
  "keywords": ["business", "report", "annual"],
  "creationDate": "2024-01-15T10:30:00Z",
  "modificationDate": "2024-01-16T14:20:00Z",
  "hasPassword": false,
  "isEncrypted": false,
  "success": true
}
```
## 🔧 Supported File Formats
### Input Formats
- **PDF**: PDF documents (including password-protected)
- **Microsoft Word**: DOCX, DOC
- **Text**: TXT, RTF
- **Web**: HTML, XML
- **Images**: PNG, JPG, TIFF (for OCR)
### Output Formats
- **PDF**: Portable Document Format
- **DOCX**: Microsoft Word (newer format)
- **TXT**: Plain text
- **HTML**: HyperText Markup Language
- **MD**: Markdown
- **RTF**: Rich Text Format
## 🔍 OCR Capabilities
### Supported Languages
- **Portuguese** (`por`): Optimized for Brazilian Portuguese
- **English** (`eng`): US and UK English
- **Spanish** (`spa`): Latin American and Iberian Spanish
- **French** (`fra`): French language support
- **German** (`deu`): German language support
- **Auto-detect** (`auto`): Automatic language detection
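When the document language is known from elsewhere in the workflow (for example, a customer locale), it can be mapped to one of the codes above. This is a sketch only: `ocrLanguageForLocale` is a hypothetical helper, and falling back to `auto` for unlisted languages is an assumption about what a workflow might prefer.

```javascript
// Language codes from the list above, keyed by two-letter language tag.
const OCR_LANGUAGE_CODES = new Map([
  ["pt", "por"],
  ["en", "eng"],
  ["es", "spa"],
  ["fr", "fra"],
  ["de", "deu"],
]);

// Hypothetical helper: maps a locale such as "pt-BR" to an ocrLanguage value,
// falling back to "auto" when the language is not in the list above.
function ocrLanguageForLocale(locale) {
  const primary = String(locale || "").toLowerCase().split(/[-_]/)[0];
  return OCR_LANGUAGE_CODES.get(primary) || "auto";
}
```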
### OCR Example
```json
{
  "operation": "ocrProcessing",
  "inputSource": "fileUrl",
  "fileUrl": "https://example.com/scanned-invoice.pdf",
  "ocrLanguage": "por",
  "advancedOptions": {
    "extractTables": true,
    "preserveFormatting": true
  }
}
```
## 🛠️ Advanced Use Cases
### Invoice Processing Pipeline
```
[Email Trigger] → [Download Attachment] → [Extract Text] → [Parse Data] → [Update CRM]
```
### Document Classification
```
[File Upload] → [Extract Metadata] → [Classify Type] → [Route to Process]
```
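The "Classify Type" step is left to the workflow author. One possible sketch, using only fields from the Metadata Extraction Result, is shown here. The route names and keyword rules are illustrative assumptions, and `classifyDocument` is a hypothetical function:

```javascript
// Hypothetical classifier for the "Classify Type" step.
// Field names come from the Metadata Extraction Result; the route names
// and keyword rules are illustrative assumptions.
function classifyDocument(metadata) {
  if (!metadata.success) return "failed";
  if (metadata.hasPassword || metadata.isEncrypted) return "needs-password";

  const keywords = (metadata.keywords || []).map((k) => String(k).toLowerCase());
  const title = String(metadata.title || "").toLowerCase();

  if (keywords.includes("invoice") || title.includes("invoice")) return "invoice";
  if (keywords.includes("report") || title.includes("report")) return "report";
  return "other";
}
```

The returned label could then drive a Switch node for the "Route to Process" step.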
### Bulk Document Conversion
```
[File Monitor] → [Document Processor] → [Convert to PDF] → [Archive]
```
### Contract Analysis
```
[Document Input] → [Extract Text] → [Find Key Terms] → [Generate Summary]
```
## 📊 Processing Examples
### Extract Contract Details
```javascript
// Extract specific information from legal documents
const contractExtractionParams = {
  "operation": "extractText",
  "advancedOptions": {
    "extractTables": true,
    "preserveFormatting": true
  }
};
// Then use regex or NLP to find specific clauses
```
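The regex step mentioned in the comment above might look like the following sketch. It works on the `text` field of an Extract Text result. The term list, the sentence-splitting rule, and the `findClauses` helper are all illustrative assumptions:

```javascript
// Escapes regex metacharacters so a search term is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical post-processing step for the "text" field of an extractText result.
// Splits the text into sentences and returns those that mention each term.
function findClauses(text, terms) {
  const sentences = text.split(/(?<=[.!?])\s+/);
  return terms.map((term) => {
    const pattern = new RegExp(`\\b${escapeRegExp(term)}\\b`, "i");
    return { term, matches: sentences.filter((s) => pattern.test(s)) };
  });
}
```

Naive sentence splitting breaks on abbreviations and numbered clauses, so real contracts will likely need a more careful tokenizer or an NLP step.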
### Convert Legacy Documents
```javascript
// Convert old DOC files to modern formats
const legacyConversionParams = {
  "operation": "convertFormat",
  "targetFormat": "docx"
};
```
### Process Scanned Forms
```javascript
// OCR processing for form data extraction
const formOcrParams = {
  "operation": "ocrProcessing",
  "ocrLanguage": "eng",
  "advancedOptions": {
    "extractTables": true // For form fields
  }
};
```
## 💸 Pricing & Limits
- **Text Extraction**: 1 credit per document
- **Format Conversion**: 1 credit per conversion
- **OCR Processing**: 2 credits per document
- **Page Splitting**: 1 credit per document
- **Document Merging**: 1 credit per operation
- **File Size Limit**: 100 MB per document
- **Page Limit**: 500 pages per document
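To budget a batch before running it, the figures above can be encoded in a small pre-flight check. This is a sketch under stated assumptions: `splitPages` and `mergeDocuments` are guessed operation names (only `extractText`, `convertFormat`, and `ocrProcessing` appear in this README's examples), 100 MB is taken to mean 100 × 1024 × 1024 bytes, and `estimateCredits` is a hypothetical helper.

```javascript
// Credit costs from the list above. "splitPages" and "mergeDocuments" are
// assumed operation names; they do not appear in this README's examples.
const CREDITS_PER_OPERATION = new Map([
  ["extractText", 1],
  ["convertFormat", 1],
  ["ocrProcessing", 2],
  ["splitPages", 1],
  ["mergeDocuments", 1],
]);

const MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024; // 100 MB, assuming binary megabytes
const MAX_PAGES = 500;

// Hypothetical pre-flight check: sums the credits for a batch of jobs and
// rejects any job that exceeds the documented limits.
function estimateCredits(jobs) {
  let total = 0;
  for (const job of jobs) {
    const cost = CREDITS_PER_OPERATION.get(job.operation);
    if (cost === undefined) {
      throw new Error(`No listed price for operation "${job.operation}"`);
    }
    if (job.fileSizeBytes > MAX_FILE_SIZE_BYTES) {
      throw new Error(`${job.filename} exceeds the 100 MB file size limit`);
    }
    if (job.pageCount > MAX_PAGES) {
      throw new Error(`${job.filename} exceeds the 500 page limit`);
    }
    total += cost;
  }
  return total;
}
```

The actual charge is reported in each result's `creditsUsed` field, which remains the authoritative number.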
## 🚨 Error Handling
Common errors and solutions:
```jsonc
// Password-protected document
{
  "error": "Document is password protected",
  "success": false,
  "suggestion": "Provide password in advancedOptions"
}
// Unsupported format
{
  "error": "Unsupported file format: .xyz",
  "success": false,
  "suggestion": "Check supported input formats"
}
// OCR language not detected
{
  "error": "Could not detect document language",
  "success": false,
  "suggestion": "Specify OCR language manually"
}
```
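A workflow can branch on these responses instead of failing outright. The sketch below keys off the `error` strings shown above, which is an assumption: the exact wording may change, and no error codes are documented here. The action names and the `nextActionForResult` helper are hypothetical.

```javascript
// Hypothetical handler for a node result. It matches the "error" strings shown
// above; the returned action names are illustrative assumptions.
function nextActionForResult(result) {
  if (result.success) {
    return { action: "continue" };
  }
  const message = String(result.error || "");
  let action = "fail";
  if (/password protected/i.test(message)) {
    action = "retry-with-password";
  } else if (/unsupported file format/i.test(message)) {
    action = "skip-file";
  } else if (/could not detect document language/i.test(message)) {
    action = "retry-with-language";
  }
  return { action, message, suggestion: result.suggestion || null };
}
```

Matching on message text is brittle, so it is worth keeping the `fail` branch as a catch-all for anything unrecognized.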
### Password-Protected Documents
```json
{
  "advancedOptions": {
    "password": "your-document-password"
  }
}
```
## 🔄 Integration Examples
### With PDF Generator
```
[Data] → [Generate PDF] → [Extract Text] → [Validate Content]
```
### With Web Scraper
```
[Scrape URLs] → [Download PDFs] → [Process Documents] → [Store Data]
```
### With Email
```
[Email Attachment] → [Process Document] → [Extract Key Info] → [Reply with Summary]
```
## 🔗 Related Packages
- **[PDF Generator](https://npmjs.com/package/n8n-nodes-n8ntools-pdf-generator)**: Create PDFs from processed data
- **[Web Scraper](https://npmjs.com/package/n8n-nodes-n8ntools-web-scraper)**: Scrape documents from websites
## 📋 Requirements
- N8N version 0.174.0 or higher
- N8N Tools account and API key
- Node.js 18+ (for development)
## 🆘 Support
- 📧 **Email**: support@n8ntools.io
- 📖 **Documentation**: [docs.n8ntools.io](https://docs.n8ntools.io)
- 💬 **Community**: [Discord](https://discord.gg/n8ntools)
- 🐛 **Issues**: [GitHub](https://github.com/n8ntools/n8n-nodes/issues)
## 📄 License
MIT License. See the [LICENSE](LICENSE) file for details.
---
**Part of the N8N Tools ecosystem** • [Website](https://n8ntools.io) • [All Packages](https://npmjs.com/search?q=n8ntools)