# PDF Parse - Advanced Parsing Methods
This directory contains advanced implementations for parsing large PDFs efficiently.
## Files
- **pdf-parse.js** - Original batch parallelization method
- **pdf-parse-stream.js** - Streaming with chunking for memory efficiency
- **pdf-parse-aggressive.js** - Aggressive streaming with large batches
- **pdf-parse-workers.js** - Worker threads for true CPU parallelism
- **pdf-parse-processes.js** - Child processes for maximum parallelism
- **pdf-worker.js** - Worker thread implementation
- **pdf-child.js** - Child process implementation
- **SmartPDFParser.js** - Intelligent parser that auto-selects best method
## When to Use Each Method
### Standard (pdf-parse.js)
**Best for:** < 500 pages
```javascript
const pdf = require('pdf-parse-new');
await pdf(buffer, {
  parallelizePages: true,
  batchSize: 10
});
```
### Streaming (pdf-parse-stream.js)
**Best for:** 500-1000 pages, memory-constrained environments
```javascript
const pdf = require('pdf-parse-new');
await pdf.stream(buffer, {
  chunkSize: 500,
  batchSize: 10
});
```
**Benefits:**
- Reduced memory pressure
- Better garbage collection
- Progress tracking
- 15-25% faster for large files
### Workers (pdf-parse-workers.js)
**Best for:** 1000+ pages, multi-core systems
```javascript
const PDFWorkers = require('pdf-parse-new/lib/pdf-parse-workers');
await PDFWorkers(buffer, {
  chunkSize: 500,
  maxWorkers: 4
});
```
**Benefits:**
- True multi-threading (worker threads)
- 30-50% faster for huge files
- Maximum CPU utilization
- Lightweight, fast startup
### Processes (pdf-parse-processes.js)
**Best for:** 1000+ pages, when maximum stability is needed
```javascript
const PDFProcesses = require('pdf-parse-new/lib/pdf-parse-processes');
await PDFProcesses(buffer, {
  chunkSize: 500,
  maxProcesses: 4
});
```
**Benefits:**
- True multi-processing (child processes)
- 35-55% faster for huge files
- Better memory isolation
- More stable than workers
- Best overall performance for large PDFs
### Aggressive (pdf-parse-aggressive.js)
**Best for:** Very large PDFs with complex layouts
```javascript
const PDFAggressive = require('pdf-parse-new/lib/pdf-parse-aggressive');
await PDFAggressive(buffer, {
  chunkSize: 500,
  batchSize: 20
});
```
**Benefits:**
- Combines streaming + aggressive batching
- Good for complex PDFs
- Balanced memory/speed
## Performance Tips
1. **Enable Garbage Collection**
```bash
node --expose-gc your-script.js
```
2. **Adjust Chunk and Batch Sizes**
- Small (< 50 pages): defaults are fine, no tuning needed
- Medium (50-500 pages): batchSize 10-20
- Large (500-1000 pages): chunkSize 500, batchSize 10
- Huge (1000+ pages): chunkSize 500, maxWorkers = CPU cores - 1
3. **Monitor Memory**
```javascript
console.log(`Memory: ${(process.memoryUsage().heapUsed / 1024 / 1024).toFixed(2)} MB`);
```
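Putting the tips together, here is a minimal sketch (plain Node.js around the `pdf(buffer, options)` call shown above; `global.gc` only exists when the script runs with `--expose-gc`):
```javascript
const pdf = require('pdf-parse-new');

// Log current heap usage with a label.
function logHeap(label) {
  const mb = process.memoryUsage().heapUsed / 1024 / 1024;
  console.log(`${label}: ${mb.toFixed(2)} MB`);
}

// Sketch: parse with memory monitoring around the call.
async function parseWithMonitoring(buffer) {
  logHeap('Before parse');
  const result = await pdf(buffer, { parallelizePages: true, batchSize: 10 });
  logHeap('After parse');
  if (global.gc) global.gc(); // no-op unless run with --expose-gc
  logHeap('After GC');
  return result;
}
```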
## Architecture
### Batch Parallelization
```
[Page 1-10] → Promise.all → [Text 1-10]
[Page 11-20] → Promise.all → [Text 11-20]
...
```
- Single process, parallel promises
- Shared memory space
- Best for small-medium PDFs
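To make the pattern concrete, here is a minimal sketch of batch parallelization (`extractPageText` is a hypothetical stand-in for the library's per-page extraction, not its real API):
```javascript
// Sketch: process pages in fixed-size batches, awaiting each batch
// (Promise.all) before starting the next.
async function parseInBatches(numPages, batchSize, extractPageText) {
  const texts = [];
  for (let start = 1; start <= numPages; start += batchSize) {
    const end = Math.min(start + batchSize - 1, numPages);
    const batch = [];
    for (let page = start; page <= end; page++) {
      batch.push(extractPageText(page)); // kick off the whole batch in parallel
    }
    texts.push(...await Promise.all(batch)); // wait for the batch to finish
  }
  return texts;
}
```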
### Streaming
```
Chunk 1 [Page 1-500]
├─ Batch [1-10]
├─ Batch [11-20]
└─ ...
→ GC
Chunk 2 [Page 501-1000]
└─ ...
```
- Sequential chunks with batch processing
- Memory-efficient (GC between chunks)
- Best for large PDFs with limited memory
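In sketch form, streaming wraps the same batching in sequential chunks with a GC opportunity in between (reusing the hypothetical `parseInBatches` and `extractPageText` from the previous sketch):
```javascript
// Sketch: sequential chunks, batched within each chunk, with an explicit
// GC opportunity between chunks (requires --expose-gc to take effect).
async function parseStreaming(numPages, chunkSize, batchSize, extractPageText) {
  let text = '';
  for (let start = 1; start <= numPages; start += chunkSize) {
    const end = Math.min(start + chunkSize - 1, numPages);
    const pages = end - start + 1;
    const chunkTexts = await parseInBatches(pages, batchSize,
      offset => extractPageText(start + offset - 1)); // map back to absolute pages
    text += chunkTexts.join('\n');
    if (global.gc) global.gc(); // reclaim chunk memory before the next chunk
    console.log(`Progress: ${Math.round((end / numPages) * 100)}%`);
  }
  return text;
}
```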
### Workers (Worker Threads)
```
Main Thread
├─→ Worker 1 → [Page 1-500] ─┐
├─→ Worker 2 → [Page 501-1000] ├─→ Combine → Result
└─→ Worker 3 → [Page 1001-1500]┘
```
- True multi-threading (shared memory possible)
- Worker threads from Node.js
- Lightweight, faster startup
- Best for huge PDFs on multi-core systems
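A condensed sketch of the fan-out using `node:worker_threads` (the real logic lives in pdf-parse-workers.js and pdf-worker.js; the worker file path and message shape here are assumptions):
```javascript
const { Worker } = require('node:worker_threads');

// Sketch: run one page range in a worker; assumes './pdf-worker.js' parses
// workerData.range from workerData.buffer and posts the extracted text back.
function parseRangeInWorker(buffer, range) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./pdf-worker.js', { workerData: { buffer, range } });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}

// Sketch: split pages across maxWorkers threads and combine in page order.
async function parseWithWorkers(buffer, numPages, maxWorkers) {
  const span = Math.ceil(numPages / maxWorkers);
  const ranges = [];
  for (let start = 1; start <= numPages; start += span) {
    ranges.push({ start, end: Math.min(start + span - 1, numPages) });
  }
  const parts = await Promise.all(ranges.map(r => parseRangeInWorker(buffer, r)));
  return parts.join('\n');
}
```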
### Processes (Child Processes)
```
Main Process
├─→ Child Process 1 → [Page 1-500] ─┐
├─→ Child Process 2 → [Page 501-1000] ├─→ Combine → Result
└─→ Child Process 3 → [Page 1001-1500]┘
```
- True multi-processing (isolated memory)
- Separate Node.js processes via fork()
- Better memory isolation
- Slightly more overhead than workers
- Best for huge PDFs, more stable than workers
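The process variant looks almost identical but forks full Node.js processes and talks over IPC (a sketch; './pdf-child.js' and the message shape are assumptions standing in for the real pdf-child.js protocol):
```javascript
const { fork } = require('node:child_process');

// Sketch: fork a child per page range; each child parses its range and sends
// the text back over IPC. A file path is passed instead of the buffer because
// large buffers are expensive to serialize across process boundaries.
function parseRangeInChild(pdfPath, range) {
  return new Promise((resolve, reject) => {
    const child = fork('./pdf-child.js');
    child.once('message', (msg) => {
      child.kill();
      resolve(msg.text);
    });
    child.once('error', reject);
    child.send({ pdfPath, range });
  });
}

// Sketch: split pages across maxProcesses children and combine in page order.
async function parseWithProcesses(pdfPath, numPages, maxProcesses) {
  const span = Math.ceil(numPages / maxProcesses);
  const ranges = [];
  for (let start = 1; start <= numPages; start += span) {
    ranges.push({ start, end: Math.min(start + span - 1, numPages) });
  }
  const parts = await Promise.all(ranges.map(r => parseRangeInChild(pdfPath, r)));
  return parts.join('\n');
}
```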
## Performance Benchmarks
Quick reference (from real-world testing):
- **Small PDFs (< 100 pages)**: Batch processing with small batch sizes (5-10)
- **Medium PDFs (100-500 pages)**: Batch processing with larger batch sizes (20-50)
- **Large PDFs (500-1000 pages)**: Streaming with chunk size 500
- **Huge PDFs (1000+ pages)**: Workers/Processes for true parallelism (45-50% faster)
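SmartPDFParser.js automates this choice. A hedged sketch of the kind of dispatch the list above implies (thresholds from the benchmarks, calls from the examples earlier in this README; the real auto-selection logic may differ):
```javascript
const os = require('node:os');
const pdf = require('pdf-parse-new');
const PDFWorkers = require('pdf-parse-new/lib/pdf-parse-workers');

// Sketch: pick a method from the page count, mirroring the guidance above.
async function smartParse(buffer, numPages) {
  if (numPages < 100) {
    return pdf(buffer, { parallelizePages: true, batchSize: 10 });
  }
  if (numPages <= 500) {
    return pdf(buffer, { parallelizePages: true, batchSize: 20 });
  }
  if (numPages <= 1000) {
    return pdf.stream(buffer, { chunkSize: 500, batchSize: 10 });
  }
  return PDFWorkers(buffer, {
    chunkSize: 500,
    maxWorkers: Math.max(1, os.cpus().length - 1)
  });
}
```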