# Changelog
All notable changes to pdf-parse-new will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
---
## [2.0.0] - 2025-11-23
### Major Release - Complete Rewrite with AI-Powered Optimization
This is a major release that introduces intelligent automatic method selection, multi-core processing, and comprehensive performance optimizations while maintaining 100% backward compatibility.
### Added
#### SmartPDFParser
- **Intelligent method selection** based on PDF characteristics and system resources
- **CPU-aware thresholds** that adapt from 4-core laptops to 48-core servers
- **Fast-path optimization**: 50x lower decision overhead for small PDFs (25ms → 0.5ms)
- **LRU caching**: 25x faster on repeated similar PDFs (cache hit in ~1ms)
- **Common scenario matching**: 90%+ hit rate for typical PDFs
- **Decision tree** trained on 9,417 real-world benchmark samples
- **Statistics tracking**: method usage, cache hits, optimization rates
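
The LRU caching idea above can be illustrated with a minimal sketch. This is not the SmartPDFParser implementation: the `DecisionCache` class, the `makeKey` helper, and the bucket sizes are all hypothetical, chosen only to show how "similar" PDFs could share a cached decision.

```javascript
// Hypothetical sketch: a tiny LRU cache keyed by coarse PDF characteristics.
// DecisionCache and makeKey are illustrative names, not pdf-parse-new internals.
class DecisionCache {
  constructor(maxEntries = 100) {
    this.maxEntries = maxEntries;
    this.map = new Map(); // Map iterates in insertion order, oldest first
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert so this entry becomes the most recently used one.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // Evict the least recently used entry (first key in iteration order).
      const oldest = this.map.keys().next().value;
      this.map.delete(oldest);
    }
  }
}

// Bucket size and page count so that similar PDFs map to the same key.
// The 0.5 MB and 50-page bucket widths are assumptions.
function makeKey({ sizeBytes, pages, cpuCores }) {
  const sizeBucket = Math.round(sizeBytes / (512 * 1024));
  const pageBucket = Math.round(pages / 50);
  return `${sizeBucket}:${pageBucket}:${cpuCores}`;
}

const cache = new DecisionCache(2);
cache.set(makeKey({ sizeBytes: 3_000_000, pages: 120, cpuCores: 8 }), 'batch');
// A slightly different PDF lands in the same buckets and hits the cache.
console.log(cache.get(makeKey({ sizeBytes: 3_100_000, pages: 118, cpuCores: 8 }))); // prints "batch"
```

Using a `Map` keeps the sketch short because deletion plus re-insertion moves a key to the end of the iteration order, which is all an LRU policy needs.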
#### Multi-Core Processing
- **Child Processes** (`pdf-parse-processes.js`): True multi-processing for maximum performance
- **Worker Threads** (`pdf-parse-workers.js`): Alternative multi-threading with lower overhead
- **Oversaturation factor**: Use 1.5x-2x cores for better CPU utilization (I/O-bound optimization)
- **Automatic memory limiting**: Prevents OOM by monitoring available RAM
- **Progress callbacks**: Real-time progress tracking for long-running tasks
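
As a rough illustration of how an oversaturation factor and a memory limit might combine, the sketch below derives a worker count from the core count and free RAM. The function name `planWorkerCount` and the 200 MB per-worker estimate are assumptions made for this example, not values taken from the library.

```javascript
const os = require('os');

// Hypothetical sketch: choose a worker count from the core count, an
// oversaturation factor, and free memory. The per-worker memory estimate
// (200 MB) is an assumption for illustration only.
function planWorkerCount({
  cores = os.cpus().length,
  freeBytes = os.freemem(),
  oversaturationFactor = 1.5,
  bytesPerWorker = 200 * 1024 * 1024,
} = {}) {
  // CPU-side limit: run more workers than cores to hide I/O waits.
  const byCpu = Math.max(1, Math.floor(cores * oversaturationFactor));
  // Memory-side limit: never plan more workers than free RAM can hold.
  const byMemory = Math.max(1, Math.floor(freeBytes / bytesPerWorker));
  return Math.min(byCpu, byMemory);
}

console.log(planWorkerCount({ cores: 24, freeBytes: 64 * 1024 ** 3 })); // 36 (CPU-bound limit)
console.log(planWorkerCount({ cores: 24, freeBytes: 2 * 1024 ** 3 }));  // 10 (memory-bound limit)
```

Taking the minimum of the two limits means plenty of cores cannot push the process into an out-of-memory condition on a machine with little free RAM.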
#### Performance Optimizations
- **Fast-path for tiny PDFs** (< 0.5 MB): Instant decision, no tree navigation
- **Fast-path for small PDFs** (< 1 MB): Immediate batch-5 selection
- **Cache for similar PDFs**: Second parse of similar PDF takes ~1ms
- **CPU normalization**: Thresholds scale with available cores
- **Memory-safe**: Automatic worker limiting based on available RAM
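
The fast-path and CPU-normalization ideas above can be sketched as a size check placed in front of a slower decision step. Only the 0.5 MB and 1 MB cut-offs and the batch-5 choice come from this changelog; the `chooseMethod` function, the method chosen for tiny PDFs, the 8-core baseline, and the 1000-page scaling rule are assumptions.

```javascript
const MB = 1024 * 1024;

// Hypothetical sketch of a fast-path check in front of a slower decision step.
function chooseMethod({ sizeBytes, pages, cpuCores }) {
  // Fast paths: decide from file size alone, with no further analysis.
  if (sizeBytes < 0.5 * MB) return { method: 'sequential', fastPath: true }; // method is assumed
  if (sizeBytes < 1 * MB) return { method: 'batch', batchSize: 5, fastPath: true };

  // CPU normalization (assumed rule): with more cores, parallel parsing
  // pays off at a lower page count, so the threshold shrinks.
  const baselineCores = 8;
  const parallelThreshold = Math.round(1000 * (baselineCores / cpuCores));
  const method = pages >= parallelThreshold ? 'processes' : 'batch';
  return { method, fastPath: false };
}

console.log(chooseMethod({ sizeBytes: 0.3 * MB, pages: 10, cpuCores: 4 }));
console.log(chooseMethod({ sizeBytes: 13.77 * MB, pages: 9924, cpuCores: 24 }));
```

The early returns are what keep the overhead low for small files: they skip the threshold computation entirely.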
#### Developer Experience
- **7 production-ready examples** in `test/examples/`:
  - `01-basic-parse.js` - Basic usage
  - `02-batch-parse.js` - Batch optimization
  - `03-stream-parse.js` - Memory-efficient streaming
  - `04-workers-parse.js` - Worker threads
  - `05-processes-parse.js` - Child processes
  - `06-smart-parser.js` - SmartPDFParser (recommended)
  - `07-compare-all.js` - Compare all methods
- **npm scripts** for quick example execution (`npm run example:smart`)
- **Complete TypeScript definitions** with all new features
- **Comprehensive benchmarking tools** in `benchmark/`
- **Detailed documentation** with real-world performance data
#### Infrastructure
- **CPU-aware benchmarking**: Tools for collecting data across different CPUs
- **Training pipeline**: Re-train decision tree from benchmark data
- **Incremental saving**: No data loss during long benchmark runs
- **URL support**: Benchmark remote PDFs via HTTP/HTTPS
### Improved
#### Performance
- **2-4x faster** for huge PDFs (1000+ pages) using processes/workers
- **50x lower overhead** for tiny PDFs (< 0.5 MB) via fast-path
- **25x faster** on cache hits for repeated similar PDFs
- **Better CPU utilization** via oversaturation (1.5x cores)
- **Reduced memory usage** with automatic worker limiting
#### API
- **Backward compatible**: All v1.x code continues to work
- **New `_meta` field** in results with method, duration, analysis
- **Progress callbacks** for all parallel methods
- **Timeout support** for child processes
- **Resource limits** for worker threads
#### Code Quality
- **Organized structure**: Examples in `test/examples/`, benchmarks in `benchmark/`
- **Clean root**: No more scattered test files
- **TypeScript coverage**: 100% of public API
- **Error handling**: Comprehensive error messages with troubleshooting hints
- **Path resolution**: NPM-safe, works in `node_modules/`
### Changed
#### Default Behavior
- SmartPDFParser now uses **processes** as default for huge PDFs (more consistent than workers)
- **Oversaturation factor** default is 1.5x (was 1.0x, i.e., cores - 1)
- **Fast-path enabled** by default (can disable with `enableFastPath: false`)
- **Caching enabled** by default (can disable with `enableCache: false`)
#### Benchmarking
- Moved all benchmark tools to `benchmark/` directory
- Private URLs/paths now in `benchmark/test-pdfs.json` (gitignored)
- Template provided in `benchmark/test-pdfs.example.json`
- Removed redundant `intensive-benchmarks.json` file
### Removed
#### Deprecated Files
- Removed `QUICKSTART.js` (replaced by 7 focused examples)
- Removed scattered test files from root (consolidated in `test/examples/`)
- Removed redundant markdown files (consolidated in main README.md)
- Removed `intensive-benchmarks.json` (kept only `smart-parser-benchmarks.json`)
### Documentation
#### New Documentation
- **Complete README.md**: All features, examples, benchmarks
- **test/examples/README.md**: Guide to all 7 examples
- **benchmark/README.md**: Benchmarking guide
- **benchmark/CPU_BENCHMARKING_GUIDE.md**: Multi-CPU testing guide
- **TypeScript definitions**: Complete with JSDoc comments
#### Updated Documentation
- Added "What's New in 2.0.0" section
- Added migration guide from 1.x
- Added real-world performance data
- Added comparison table with original pdf-parse
- Added troubleshooting section
- Added oversaturation explanation
### Fixed
#### Workers/Processes
- Fixed worker exit code 1 error (Buffer serialization issue)
- Fixed memory exhaustion on large PDFs (added safety limits)
- Fixed path resolution for npm module installation
- Fixed double processing on errors (added completion flags)
- Fixed memory calculation for worker limiting
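
The completion-flag fix for double processing can be illustrated with a small sketch. Worker threads and child processes can emit both `error` and `exit` for the same failure, so a flag ensures the result handler runs only once. The `runOnce` helper is hypothetical and is not the library's code.

```javascript
const { EventEmitter } = require('events');

// Hypothetical sketch: guard a completion callback with a flag so that
// 'error' followed by 'exit' does not trigger processing twice.
function runOnce(worker, onDone) {
  let completed = false;
  const finish = (err, result) => {
    if (completed) return; // a later event for the same worker is ignored
    completed = true;
    onDone(err, result);
  };
  worker.once('message', (result) => finish(null, result));
  worker.once('error', (err) => finish(err));
  worker.once('exit', (code) => {
    if (code !== 0) finish(new Error(`worker exited with code ${code}`));
  });
}

// Demo with a plain EventEmitter standing in for a Worker or ChildProcess.
const fake = new EventEmitter();
let calls = 0;
runOnce(fake, () => { calls += 1; });
fake.emit('error', new Error('boom'));
fake.emit('exit', 1);
console.log(calls); // 1
```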
#### SmartPDFParser
- Fixed hardcoded method selection (now respects benchmark data)
- Fixed missing cpuCores in analysis
- Fixed cache key generation
- Fixed stats initialization
### Security
- No known security vulnerabilities
- All dependencies updated to latest secure versions
- Proper cleanup of worker threads and child processes
- Memory limits prevent DoS via large PDFs
### Performance Data
#### Benchmark Results (9,924 pages, 13.77 MB, 24 cores)
```
Method        Time        vs Sequential    vs Batch
───────────────────────────────────────────────────────
Sequential    ~15,000ms   1.00x            1.28x slower
Batch-50      ~11,723ms   1.28x faster     1.00x
Workers       ~6,963ms    2.15x faster     1.68x faster
Processes     ~4,468ms    3.36x faster     2.62x faster

SmartParser: Automatically selects Processes
```
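
The ratios in this table follow directly from the timings. As a quick check, the sketch below recomputes each method's speedup against the Sequential and Batch-50 baselines; a value below 1 means the method is slower than the baseline. The `speedup` helper is illustrative only.

```javascript
// Timings in milliseconds, copied from the benchmark table.
const timings = { Sequential: 15000, 'Batch-50': 11723, Workers: 6963, Processes: 4468 };

// Speedup of a candidate relative to a baseline, rounded to two decimals.
function speedup(baselineMs, candidateMs) {
  return Math.round((baselineMs / candidateMs) * 100) / 100;
}

for (const [method, ms] of Object.entries(timings)) {
  const vsSequential = speedup(timings.Sequential, ms);
  const vsBatch = speedup(timings['Batch-50'], ms);
  console.log(`${method}: ${vsSequential}x vs Sequential, ${vsBatch}x vs Batch-50`);
}
```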
#### Overhead Comparison
```
PDF Type           Before    After    Speedup
───────────────────────────────────────────────
Tiny (< 0.5 MB)    25ms      0.5ms    50x faster
Small (< 1 MB)     25ms      0.5ms    50x faster
Cached             25ms      1ms      25x faster
Common             25ms      2ms      12x faster
Rare               25ms      25ms     Same
```
### Breaking Changes
**None** - Version 2.0.0 is fully backward compatible with 1.x.
All existing code continues to work without modifications. New features are opt-in via SmartPDFParser.
### Migration Guide
#### From 1.x to 2.0.0
**No changes required** - your code will continue to work:
```javascript
// v1.x code (still works in v2.0.0)
const pdf = require('pdf-parse-new');
pdf(buffer).then(data => console.log(data.text));
```
**To use new features:**
```javascript
// Use SmartPDFParser for automatic optimization
const SmartParser = require('pdf-parse-new/lib/SmartPDFParser');

// `await` needs an async context in CommonJS; `buffer` holds the PDF bytes
async function parseSmart(buffer) {
  const parser = new SmartParser();
  const result = await parser.parse(buffer);
  console.log(`Method: ${result._meta.method}`);
  console.log(`Duration: ${result._meta.duration}ms`);
  console.log(`Fast-path: ${result._meta.fastPath}`);
  return result;
}
```
**To force specific method:**
```javascript
// Pick one of the following configurations.
// Force processes for huge PDFs
const processParser = new SmartParser({ forceMethod: 'processes' });
// Force workers (alternative)
const workerParser = new SmartParser({ forceMethod: 'workers' });
// Adjust oversaturation
const saturatedParser = new SmartParser({ oversaturationFactor: 2.0 });
```
### Contributors
- Simone Gosetto - Lead developer, v2.0 implementation
- autokent - Original pdf-parse library
- Mozilla - PDF.js library
### Dependencies
- `debug`: ^4.3.4
- `node-ensure`: ^0.0.0
- PDF.js: v4.5.136 (bundled)
No breaking dependency changes.
---
## [1.x] - Previous Versions
For changelog of versions prior to 2.0.0, see the original [pdf-parse changelog](https://gitlab.com/autokent/pdf-parse/-/blob/master/CHANGELOG.md).
---
**[Unreleased]**: https://github.com/simonegosetto/pdf-parse-new/compare/v2.0.0...HEAD
**[2.0.0]**: https://github.com/simonegosetto/pdf-parse-new/releases/tag/v2.0.0