# ppu-pdf
Easily extract text from digital and scanned PDF files, with coordinates and font sizes included, and optionally group text by lines or render scanned PDFs to canvas/PNG.
There are two classes: `PdfReader` (uses mupdfjs) and `PdfReaderLegacy` (uses pdfjs-dist).
> **[Live Demo →](https://pt-perkasa-pilar-utama.github.io/ppu-pdf/)**
## Features
- **Text Extraction:** Retrieve all text content from a PDF.
- **LLM-Friendly Text Extraction:** Retrieve all text content while keeping bounding boxes, encoded in Token-Oriented Object Notation (TOON).
- **Coordinate Data:** Get precise bounding box and dimension information for each text element.
- **Line Grouping:** Merge individual text tokens into coherent lines.
- **Scanned PDF Detection:** Determine if a PDF/individual page appears to be scanned or digitally generated.
- **Scanned PDF Canvas Rendering:** Render each page of a scanned PDF to a canvas that is ready for processing.
- **Scanned PDF to PNG Images:** Convert and write all pdf pages to PNG images.
- **Scanned PDF Text Extraction:** Retrieve all text content from a scanned PDF using `ppu-paddle-ocr`.
## Differences
| Indicator | PdfReader | PdfReaderLegacy |
| ----------------------------------- | --------- | --------------- |
| Library | mupdfjs | pdfjs-dist |
| Pages index start | 0 | 1 |
| open() | ✅ | ✅ |
| getTexts() | ✅ | ✅ |
| getTextsScanned() | ✅ | ✅ |
| isScanned() | ✅ | ✅ |
| isPageScanned() | ✅ | ✅ |
| getLinesFromTexts() | ✅ | ✅ |
| getCompactLinesFromTexts() | ✅ | ✅ |
| destroy() | ✅ | ✅ |
| destroyPage() | ✅ | ❌ |
| renderAll() | ✅ | ✅ |
| saveCanvasToPng() | ✅ | ✅ |
| dumpCanvasMap() | ✅ | ✅ |
| Resize viewport/Custom DPI | ✅ | ✅ |
| getLinesFromTextsInToon() | ✅ | ✅ |
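Because the two readers key their result maps differently (0-based for `PdfReader`, 1-based for `PdfReaderLegacy`), code that switches between them may want to normalize page keys first. The following is a minimal sketch; `toZeroBased` is a hypothetical helper and not part of the library.

```ts
// Hypothetical helper (not part of ppu-pdf): re-key a page map so that the
// first page is always 0, whichever reader produced it.
function toZeroBased<T>(pages: Map<number, T>, firstIndex: 0 | 1): Map<number, T> {
  const normalized = new Map<number, T>();
  pages.forEach((value, pageNum) => {
    normalized.set(pageNum - firstIndex, value);
  });
  return normalized;
}

// Stand-in for a PdfReaderLegacy result, which starts at page 1.
const legacyPages = new Map<number, string>([
  [1, "first page"],
  [2, "second page"],
]);

const normalizedPages = toZeroBased(legacyPages, 1);
console.log(normalizedPages.get(0)); // "first page"
```

Passing `0` as `firstIndex` leaves a `PdfReader` result unchanged, so the same call site can serve both readers.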
## Benchmark
Benchmarks cover both a digital PDF and a scanned PDF, 28 pages in total.
```sh
clk: ~4.02 GHz
cpu: 11th Gen Intel(R) Core(TM) i5-11400H @ 2.70GHz
runtime: bun 1.2.5 (x64-linux)
benchmark avg (min … max) p75 / p99 (min … top 1%)
--------------------------------------------------------- -------------------------------
pdfReader.getTexts() 461.74 ms/iter 459.53 ms █
(436.19 ms … 524.12 ms) 513.69 ms ███
( 2.29 mb … 107.16 mb) 27.86 mb █▁███▁█▁█▁▁▁▁▁▁▁▁▁▁▁█
pdfReaderLegacy.getTexts() 217.58 ms/iter 233.21 ms █
(196.63 ms … 238.71 ms) 236.38 ms █ ▅
( 1.58 mb … 118.27 mb) 40.84 mb ▇▁█▁▁▁▇▁▁▁▁▁▁▁▁▁▇▁█▇▇
summary
pdfReaderLegacy.getTexts()
2.12x faster than pdfReader.getTexts()
--------------------------------------------------------- -------------------------------
pdfReader.getLinesFromTexts() 3.82 ms/iter 3.91 ms ▆ █
(3.52 ms … 4.95 ms) 4.53 ms ▇█ ▂ ▄█▆
( 0.00 b … 1.29 mb) 590.91 kb ▅██▇█▂███▇▅▃▂▂▂▁▂▂▂▂▂
pdfReaderLegacy.getLinesFromTexts() 4.77 ms/iter 4.87 ms █▄ ▄▅▅
(4.41 ms … 5.66 ms) 5.52 ms ██▂ ███▅
( 0.00 b … 1.29 mb) 647.72 kb ████▇████▆▇▆█▂▂▄▄▅▂▁▄
summary
pdfReader.getLinesFromTexts()
1.25x faster than pdfReaderLegacy.getLinesFromTexts()
--------------------------------------------------------- -------------------------------
pdfReader.getCompactLinesFromTexts() 3.87 ms/iter 3.98 ms ▂█ ▄▃▂
(3.57 ms … 4.60 ms) 4.40 ms ██▃▂ ███▄▂
( 0.00 b … 2.06 mb) 920.63 kb ▇█████▃▇█████▄▄▄▃▄▃▁▂
pdfReaderLegacy.getCompactLinesFromTexts() 4.62 ms/iter 4.71 ms ▃▂ █▂
(4.27 ms … 5.59 ms) 5.52 ms ██ ▆██
( 0.00 b … 1.55 mb) 876.22 kb ▆██▇▆███▇█▅▂▅▄▂▂▁▂▁▁▂
summary
pdfReader.getCompactLinesFromTexts()
1.19x faster than pdfReaderLegacy.getCompactLinesFromTexts()
--------------------------------------------------------- -------------------------------
pdfReader.open() 14.30 ms/iter 19.97 ms █▄
(8.56 ms … 31.51 ms) 27.49 ms ██
( 0.00 b … 58.52 mb) 19.89 mb ██▅▁▃▃▁▃▇▁▃▁▅▁▇█▁▅▁▁▃
pdfReaderLegacy.open() 6.19 ms/iter 6.51 ms █
(5.11 ms … 13.27 ms) 10.40 ms ██▂ ▂
( 0.00 b … 29.13 mb) 682.14 kb ██████▅▄▃▁▃▁▄▃▁▂▁▁▁▁▂
summary
pdfReaderLegacy.open()
2.31x faster than pdfReader.open()
--------------------------------------------------------- -------------------------------
pdfReader.renderAll() 1.10 s/iter 1.12 s █
(1.05 s … 1.13 s) 1.12 s █ █ █
( 66.35 mb … 248.17 mb) 191.68 mb █▁▁▁▁▁▁▁▁██▁▁▁█▁▁▁███
pdfReaderLegacy.renderAll() 1.68 s/iter 1.70 s █
(1.56 s … 1.84 s) 1.77 s ▅ ▅▅ ▅▅ ▅█▅ ▅ ▅
(231.91 mb … 384.77 mb) 352.18 mb █▁██▁▁▁▁██▁███▁█▁▁▁▁█
summary
pdfReader.renderAll()
1.52x faster than pdfReaderLegacy.renderAll()
```
## Installation
Using Bun is recommended.
Install the package via npm:
```bash
npm install ppu-pdf ppu-paddle-ocr
```
Or using Yarn:
```bash
yarn add ppu-pdf ppu-paddle-ocr
```
Bun:
```bash
bun add ppu-pdf ppu-paddle-ocr
```
You can omit `ppu-paddle-ocr` if you do not plan to extract text from scanned PDFs.
## Usage
Below is an example of how to use the library with Bun.
Digital PDF Example:
```ts
import { PdfReader } from "ppu-pdf";
const pdfReader = new PdfReader({ verbose: false });
const file = Bun.file("./assets/opposite-expectation.pdf");
const buffer = await file.arrayBuffer();
const pdf = pdfReader.open(buffer);
// getTexts() returns a Map keyed by page number (0-based for PdfReader)
const texts = await pdfReader.getTexts(pdf);
const page0texts = texts.get(0);
console.log("texts: ", page0texts);
pdfReader.destroy(pdf);
const isScanned = pdfReader.isScanned(texts);
console.log("is pdf scanned: ", isScanned);
```
Scanned PDF Example:
```ts
import { join } from "path";
import { PdfReader } from "ppu-pdf";
import { PaddleOcrService } from "ppu-paddle-ocr";
const fonts = [
{
path: join(__dirname, "..", "fonts", "Arial.ttf"),
name: "Arial",
},
];
const pdfReader = new PdfReader({ verbose: false, fonts: fonts });
const ocr = new PaddleOcrService();
const fileScan = Bun.file("./assets/opposite-expectation-scan.pdf");
const bufferScan = await fileScan.arrayBuffer();
const pdfScan = pdfReader.open(bufferScan);
const canvasMap = await pdfReader.renderAll(pdfScan);
pdfReader.destroy(pdfScan); // Destroy early since we already hold the canvases
pdfReader.dumpCanvasMap(canvasMap, "my-dumped-pdf");
const texts = await pdfReader.getTextsScanned(ocr, canvasMap);
console.log("texts: ", texts.get(0));
```
### Rebuilding a Scanned PDF into a Searchable PDF
You can rebuild a scanned (flattened) PDF into a fully searchable document by overlaying invisible recognized text on top of the original images. For a complete script, refer to `examples/rebuild-pdf.example.ts`.
```ts
import { PdfReader } from "ppu-pdf";
import { PaddleOcrService } from "ppu-paddle-ocr";
const pdfReader = new PdfReader({ verbose: false });
const ocr = new PaddleOcrService();
await ocr.initialize();
const fileScan = Bun.file("./assets/test_japanese.pdf");
const bufferScan = await fileScan.arrayBuffer();
// 1. Open and Render the PDF into Canvases
const pdfScan = pdfReader.open(bufferScan);
const canvasMap = await pdfReader.renderAll(pdfScan);
pdfReader.destroy(pdfScan);
// 2. Extract OCR Texts
const texts = await pdfReader.getTextsScanned(ocr, canvasMap);
// 3. Rebuild Searchable PDF
// Note: We open a fresh instance of the PDF for rebuilding
const pdfForRebuild = pdfReader.open(bufferScan);
const rebuiltPdfBuffer = await pdfReader.rebuild(pdfForRebuild, texts);
pdfReader.destroy(pdfForRebuild);
// 4. Save onto disk
await Bun.write("./test_japanese_searchable.pdf", rebuiltPdfBuffer);
```
## `PdfReaderOptions`
Configuration options for `PdfReader`, allowing customization of PDF text extraction behavior.
| Option | Type | Default Value | Description |
| ---------------------------- | --------- | ------------- | --------------------------------------------------------------------------- |
| `verbose` | `boolean` | `false` | Enables logging for debugging purposes. |
| `excludeFooter` | `boolean` | `true` | Excludes detected footer text from the extracted content. |
| `excludeHeader` | `boolean` | `true` | Excludes detected header text from the extracted content. |
| `raw` | `boolean` | `false` | If `true`, returns raw text without additional processing. |
| `headerFromHeightPercentage` | `number` | `0.02` | Defines the height percentage from the top used to identify header text. |
| `footerFromHeightPercentage` | `number` | `0.95` | Defines the height percentage from the bottom used to identify footer text. |
| `mergeCloseTextNeighbor` | `boolean` | `true` | Merges text elements that are close to each other into a single entity. |
| `simpleSortAlgorithm` | `boolean` | `false` | Uses a simplified sorting algorithm for text positioning. |
| `scale` | `number` | `1` | The PDF document scale. |
| `enableToon` | `boolean` | `false` | Enables PDF word extraction in TOON format. |
### Usage Example:
```typescript
const reader = new PdfReader({ verbose: true, excludeFooter: false });
```
These options allow fine-tuned control over how text is extracted and processed from PDFs.
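To make the two height-percentage options more concrete, here is a minimal sketch of how a header/footer cut-off based on page height could work. It assumes both values are fractions of the page height measured from the top, with y growing downward; the library's actual rule may differ, and `classifyByHeight` and the `Sketch*` types are hypothetical names.

```ts
// Illustrative only, not the library's implementation.
// Assumption: both percentages are fractions of the page height measured
// from the top of the page, and bbox y values grow downward.
interface SketchBBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

type SketchRegion = "header" | "body" | "footer";

function classifyByHeight(
  bbox: SketchBBox,
  pageHeight: number,
  headerFromHeightPercentage = 0.02,
  footerFromHeightPercentage = 0.95
): SketchRegion {
  // Entirely above the header cut-off line
  if (bbox.y1 <= pageHeight * headerFromHeightPercentage) return "header";
  // Starts at or below the footer cut-off line
  if (bbox.y0 >= pageHeight * footerFromHeightPercentage) return "footer";
  return "body";
}

const a4Height = 842; // approximate A4 height in PDF points
console.log(classifyByHeight({ x0: 72, y0: 4, x1: 200, y1: 14 }, a4Height)); // "header"
console.log(classifyByHeight({ x0: 72, y0: 84, x1: 464, y1: 99 }, a4Height)); // "body"
console.log(classifyByHeight({ x0: 72, y0: 810, x1: 200, y1: 822 }, a4Height)); // "footer"
```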
### Compiling
You can compile your project into a single binary, but remember to also copy `node_modules/mupdf/dist/mupdf-wasm.wasm` and place it alongside your binary.
See `package.json` for an example.
## Method Documentation
### `PdfReader` Class
#### Constructor: `constructor(options?: Partial<PdfReaderOptions>)`
Creates an instance of `PdfReader`.
- **Parameters:**
- `options` (optional): Partial options to override the defaults. Refer to the `PdfReaderOptions` interface for available options.
#### Method: `open(filename: string | ArrayBuffer): Promise<PDFDocumentProxy>`
Opens a PDF document.
- **Parameters:**
- `filename`: The path to the PDF file or an `ArrayBuffer` containing the PDF data.
- **Returns:** A promise that resolves with the `PDFDocumentProxy`.
#### Method: `getTexts(pdf: PDFDocumentProxy): Promise<PageTexts>`
Extracts the text content from the PDF document.
- **Parameters:**
- `pdf`: The `PDFDocumentProxy` instance.
- **Returns:** A promise that resolves with a `Map` of page numbers to their corresponding `PdfTexts`.
Sample return:
```js
// Map (1) starting index from 0
{
"0": {
"fullText": "Opposite Expectation: How to See the World as Two-Sided Lorem ipsum",
"words": [
{
"text": "Opposite Expectation: How to See the World as Two-Sided",
"bbox": {
"x0": 72,
"y0": 84,
"x1": 464,
"y1": 99
},
"dimension": {
"width": 392,
"height": 15
},
"metadata": {
"writing": "horizontal",
"direction": "",
"font": {
"name": "AAAAAA+Arial-BoldItalicMT",
"family": "sans-serif",
"weight": "bold",
"style": "italic",
"size": 14
},
"hasEOL": undefined,
"pageNum": 0
},
"id": 0
}
]
}
}
```
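Since the result is a `Map` and every word carries a bounding box, a common follow-up is selecting the words that fall inside a region of a page. The sketch below uses types inferred from the sample above (the library's exported types may be named or shaped differently), and `wordsInRegion` is a hypothetical helper.

```ts
// Shapes inferred from the sample output; reduced to the fields used here.
interface SketchBBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}
interface SketchWord {
  text: string;
  bbox: SketchBBox;
}
interface SketchPageTexts {
  fullText: string;
  words: SketchWord[];
}

// Hypothetical helper: words on one page that lie fully inside `region`.
function wordsInRegion(
  pages: Map<number, SketchPageTexts>,
  pageNum: number,
  region: SketchBBox
): SketchWord[] {
  const page = pages.get(pageNum);
  if (!page) return [];
  return page.words.filter(
    (w) =>
      w.bbox.x0 >= region.x0 &&
      w.bbox.x1 <= region.x1 &&
      w.bbox.y0 >= region.y0 &&
      w.bbox.y1 <= region.y1
  );
}

// Stand-in for a getTexts() result.
const samplePages = new Map<number, SketchPageTexts>([
  [
    0,
    {
      fullText: "Opposite Expectation: How to See the World as Two-Sided Lorem ipsum",
      words: [
        {
          text: "Opposite Expectation: How to See the World as Two-Sided",
          bbox: { x0: 72, y0: 84, x1: 464, y1: 99 },
        },
        // Made-up coordinates for the body text, for illustration only.
        { text: "Lorem ipsum", bbox: { x0: 72, y0: 120, x1: 140, y1: 132 } },
      ],
    },
  ],
]);

// Everything in the top 110 points of page 0
const titleWords = wordsInRegion(samplePages, 0, { x0: 0, y0: 0, x1: 595, y1: 110 });
console.log(titleWords.map((w) => w.text));
```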
#### Method: `getLinesFromTexts(pageTexts: PageTexts): PageLines`
Retrieves line information from the page texts.
- **Parameters:**
- `pageTexts`: A `Map` of page numbers to their corresponding `PdfTexts`.
- **Returns:** A `Map` of page numbers to an array of `PdfLine` objects.
Sample return:
```js
// Map (1) starting index from 0
{
"0": [
{
"bbox": {
"x0": 72,
"y0": 84,
"x1": 464,
"y1": 99
},
"averageFontSize": 14,
"dimension": {
"width": 392,
"height": 15
},
"words": [
{
"text": "Opposite Expectation: How to See the World as Two-Sided",
"bbox": {
"x0": 72,
"y0": 84,
"x1": 464,
"y1": 99
},
"dimension": {
"width": 392,
"height": 15
},
"metadata": {
"writing": "horizontal",
"direction": "",
"font": {
"name": "AAAAAA+Arial-BoldItalicMT",
"family": "sans-serif",
"weight": "bold",
"style": "italic",
"size": 14
},
"hasEOL": false,
"pageNum": 0
},
"id": 0
}
],
"text": "Opposite Expectation: How to See the World as Two-Sided"
}
]
}
```
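The line-level fields in the sample are consistent with a simple aggregation over the words: the line `bbox` as the union of the word boxes, `dimension` as its width and height, and `averageFontSize` as the mean of the word font sizes. The sketch below shows that aggregation under those assumptions; it is not taken from the library's source, and `summarizeLine` and the `Sketch*` types are hypothetical.

```ts
interface SketchBBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}
interface SketchToken {
  text: string;
  bbox: SketchBBox;
  fontSize: number;
}

// Assumed aggregation: union bbox, mean font size, space-joined text.
function summarizeLine(tokens: SketchToken[]) {
  if (tokens.length === 0) throw new Error("a line needs at least one token");
  const bbox: SketchBBox = {
    x0: Math.min(...tokens.map((t) => t.bbox.x0)),
    y0: Math.min(...tokens.map((t) => t.bbox.y0)),
    x1: Math.max(...tokens.map((t) => t.bbox.x1)),
    y1: Math.max(...tokens.map((t) => t.bbox.y1)),
  };
  const averageFontSize =
    tokens.reduce((sum, t) => sum + t.fontSize, 0) / tokens.length;
  return {
    bbox,
    averageFontSize,
    dimension: { width: bbox.x1 - bbox.x0, height: bbox.y1 - bbox.y0 },
    text: tokens.map((t) => t.text).join(" "),
  };
}

// Made-up tokens for illustration.
const lineSummary = summarizeLine([
  { text: "Opposite", bbox: { x0: 72, y0: 84, x1: 140, y1: 99 }, fontSize: 14 },
  { text: "Expectation:", bbox: { x0: 145, y0: 85, x1: 240, y1: 98 }, fontSize: 12 },
]);
console.log(lineSummary.bbox); // { x0: 72, y0: 84, x1: 240, y1: 99 }
console.log(lineSummary.averageFontSize); // 13
```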
#### Method: `getCompactLinesFromTexts(pageTexts: PageTexts, algorithm: PdfCompactLineAlgorithm = "middleY"): CompactPageLines`
Retrieves a compact representation of line information from the page texts using the specified algorithm.
- **Parameters:**
- `pageTexts`: A `Map` of page numbers to their corresponding `PdfTexts`.
- `algorithm`: An optional `PdfCompactLineAlgorithm` specifying the method for grouping lines. Defaults to `middleY`.
- **Returns:** A `Map` of page numbers to an array of `CompactPdfLine` objects, where the line extraction method depends on the chosen algorithm.
Sample return:
```js
// Map (1) starting index from 0
{
"0": [
{
"bbox": {
"x0": 72,
"y0": 83.13183584999996,
"x1": 461.4900053795799,
"y1": 97.13183534999996
},
"words": [
{
"text": "Opposite Expectation: How to See the World as Two-Sided",
"bbox": {
"x0": 72,
"y0": 83.13183584999996,
"x1": 461.4900053795799,
"y1": 97.13183534999996
}
}
],
"text": "Opposite Expectation: How to See the World as Two-Sided"
}
]
}
```
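As a rough illustration of what grouping by `middleY` can mean, the sketch below places tokens on the same line when their vertical midpoints are within a small tolerance. This is one plausible reading of the algorithm name, not the library's implementation; `groupByMiddleY`, the `tolerance` parameter, and its default are assumptions.

```ts
interface SketchBBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}
interface SketchToken {
  text: string;
  bbox: SketchBBox;
}

const middleY = (b: SketchBBox): number => (b.y0 + b.y1) / 2;

// Group tokens whose vertical midpoints lie within `tolerance` points of the
// first token of the current line, then order each line left to right.
function groupByMiddleY(tokens: SketchToken[], tolerance = 2): SketchToken[][] {
  const sorted = tokens.slice().sort((a, b) => middleY(a.bbox) - middleY(b.bbox));
  const lines: SketchToken[][] = [];
  let current: SketchToken[] = [];
  let currentMid = Number.NEGATIVE_INFINITY;
  for (const token of sorted) {
    const mid = middleY(token.bbox);
    if (current.length > 0 && Math.abs(mid - currentMid) <= tolerance) {
      current.push(token);
    } else {
      current = [token];
      currentMid = mid;
      lines.push(current);
    }
  }
  return lines.map((line) => line.sort((a, b) => a.bbox.x0 - b.bbox.x0));
}

// Made-up tokens for illustration, deliberately out of reading order.
const groupedLines = groupByMiddleY([
  { text: "Two-Sided", bbox: { x0: 300, y0: 84, x1: 370, y1: 99 } },
  { text: "Lorem", bbox: { x0: 72, y0: 120, x1: 110, y1: 132 } },
  { text: "Opposite", bbox: { x0: 72, y0: 84.5, x1: 130, y1: 98.5 } },
]).map((line) => line.map((t) => t.text).join(" "));

console.log(groupedLines); // [ "Opposite Two-Sided", "Lorem" ]
```

Comparing midpoints rather than top or bottom edges keeps tokens of slightly different heights on the same line, as long as they share a baseline region.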
#### Method: `isScanned(pageTexts: PageTexts, options?: PdfScannedThreshold): boolean`
Determines whether the PDF appears to be a scanned document.
- **Parameters:**
- `pageTexts`: A `Map` of page numbers to their corresponding `PdfTexts`.
- `options` (optional): Thresholds for scanned detection. Defaults to `CONSTANT.WORDS_PER_PAGE_THRESHOLD` and `CONSTANT.TEXT_LENGTH_THRESHOLD`.
- **Returns:** `true` if the PDF is considered scanned; otherwise, `false`.
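The exact rule and the default threshold values are not stated here. As a hedged sketch of how a two-threshold check like this might work, the code below treats a document as scanned when both the average word count per page and the average text length per page fall below their thresholds. The function name, option names, and the placeholder values `20` and `100` are all assumptions for illustration.

```ts
interface SketchPage {
  fullText: string;
  words: unknown[];
}
// Hypothetical option names; the library's PdfScannedThreshold may differ.
interface SketchThreshold {
  wordsPerPage: number;
  textLength: number;
}

// Placeholder defaults, not the library's CONSTANT values.
function looksScanned(
  pages: Map<number, SketchPage>,
  threshold: SketchThreshold = { wordsPerPage: 20, textLength: 100 }
): boolean {
  let totalWords = 0;
  let totalLength = 0;
  pages.forEach((page) => {
    totalWords += page.words.length;
    totalLength += page.fullText.trim().length;
  });
  // Math.max avoids dividing by zero for an empty map.
  const pageCount = Math.max(pages.size, 1);
  return (
    totalWords / pageCount < threshold.wordsPerPage &&
    totalLength / pageCount < threshold.textLength
  );
}

// A page with no extractable text, as an image-only scan would produce.
const scannedLike = new Map<number, SketchPage>([[0, { fullText: "", words: [] }]]);
// A page with plenty of extractable text.
const digitalLike = new Map<number, SketchPage>([
  [0, { fullText: "lorem ipsum ".repeat(20), words: new Array(30).fill("w") }],
]);

console.log(looksScanned(scannedLike)); // true
console.log(looksScanned(digitalLike)); // false
```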
For the other methods, I encourage you to try them out yourself.
## Web / Browser Support
Starting from v6.0.0, **ppu-pdf** supports running directly in the browser! Import from `ppu-pdf/web` to use `PdfReaderLegacyWeb` — a browser-native PDF reader powered by `pdfjs-dist`.
### Web Features
| Feature | Node (`PdfReaderLegacy`) | Web (`PdfReaderLegacyWeb`) |
| ------------------------ | :----------------------: | :------------------------: |
| `open()` | ✅ | ✅ |
| `getTexts()` | ✅ | ✅ |
| `getTextsScanned()` | ✅ | ✅ (ppu-paddle-ocr/web) |
| `getLinesFromTexts()` | ✅ | ✅ |
| `getCompactLinesFromTexts()` | ✅ | ✅ |
| `getLinesFromTextsInToon()` | ✅ | ✅ |
| `isScanned()` | ✅ | ✅ |
| `isPageScanned()` | ✅ | ✅ |
| `renderAll()` | ✅ (Node Canvas) | ✅ (HTMLCanvasElement) |
| `destroy()` | ✅ | ✅ |
| `dumpCanvasMap()` | ✅ | ❌ (no filesystem) |
| `saveCanvasToPng()` | ✅ | ❌ (no filesystem) |
> **Note:** `PdfReader` (mupdf-based) is **not available** in the browser — it requires native bindings. Only `PdfReaderLegacyWeb` (pdfjs-dist) is supported on the web.
### Web Installation
```bash
npm install ppu-pdf
# Optional for scanned PDF OCR in the browser:
npm install ppu-paddle-ocr
```
### Web Usage (Bundler)
```ts
import { PdfReaderLegacyWeb } from "ppu-pdf/web";
const reader = new PdfReaderLegacyWeb({ verbose: false });
// From a file input or fetch
const response = await fetch("document.pdf");
const buffer = await response.arrayBuffer();
const pdf = await reader.open(buffer);
const texts = await reader.getTexts(pdf);
console.log(texts.get(1)?.fullText);
const lines = reader.getLinesFromTexts(texts);
const compactLines = reader.getCompactLinesFromTexts(texts);
const isScanned = reader.isScanned(texts);
await reader.destroy(pdf);
```
### Web Usage (CDN / No Bundler)
```html
<script type="importmap">
{
"imports": {
"pdfjs-dist": "https://cdn.jsdelivr.net/npm/pdfjs-dist@4.9.155/+esm",
"pdfjs-dist/types/src/display/api": "https://cdn.jsdelivr.net/npm/pdfjs-dist@4.9.155/+esm",
"pdfjs-dist/types/web/interfaces": "https://cdn.jsdelivr.net/npm/pdfjs-dist@4.9.155/+esm",
"@toon-format/toon": "https://cdn.jsdelivr.net/npm/@toon-format/toon@2.1.0/+esm",
"onnxruntime-web": "https://cdn.jsdelivr.net/npm/onnxruntime-web@1.24.2/dist/ort.all.bundle.min.mjs",
"ppu-paddle-ocr/web": "https://cdn.jsdelivr.net/npm/ppu-paddle-ocr@4/web/index.js",
"ppu-ocv/web": "https://cdn.jsdelivr.net/npm/ppu-ocv@2/index.web.js"
}
}
</script>
<script type="module">
import { PdfReaderLegacyWeb } from "https://cdn.jsdelivr.net/npm/ppu-pdf@6/web/index.js";
const reader = new PdfReaderLegacyWeb();
// ... use as shown above
</script>
```
### Scanned PDF OCR (Web)
For scanned PDF text extraction in the browser, initialize the OCR engine with top-level `await` inside a `<script type="module">` rather than in a `DOMContentLoaded` handler.
```html
<script type="module">
import { PdfReaderLegacyWeb } from "ppu-pdf/web";
import { PaddleOcrService } from "ppu-paddle-ocr/web";
// Eagerly pre-warm the WebAssembly engine
const ocrService = new PaddleOcrService();
await ocrService.initialize();
// Create an adapter matching the exact interface `getTextsScanned` requires
const ocrAdapter = {
initialize: async () => {}, // Pre-initialized above to avoid stutter
recognize: async (canvas) => {
return await ocrService.recognize(canvas);
}
};
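const reader = new PdfReaderLegacyWeb();
// `buffer` is an ArrayBuffer with the PDF bytes, e.g. from fetch() or a file input
const buffer = await (await fetch("document.pdf")).arrayBuffer();
const pdf = await reader.open(buffer);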
// Render PDF explicitly into mapped canvas elements first
const canvasMap = await reader.renderAll(pdf, 72);
// Inject the mapped canvas and the OCR engine to extract words and bounding boxes
const texts = await reader.getTextsScanned(ocrAdapter, canvasMap);
console.log(texts.get(1)?.fullText);
await reader.destroy(pdf);
</script>
```
## Contributing
Contributions are welcome! If you would like to contribute, please follow these steps:
1. **Fork the Repository:** Create your own fork of the project.
2. **Create a Feature Branch:** Use a descriptive branch name for your changes.
3. **Implement Changes:** Make your modifications, add tests, and ensure everything passes.
4. **Submit a Pull Request:** Open a pull request to discuss your changes and get feedback.
### Running Tests
This project uses Bun for testing. To run the tests locally, execute:
```bash
bun test
```
Ensure that all tests pass before submitting your pull request.
## License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
## Support
If you encounter any issues or have suggestions, please open an issue in the repository.
Happy coding!
## Scripts
A Linux-based development environment is recommended. Library template: https://github.com/aquapi/lib-template
All script sources and usage.
### [Build](./scripts/build.ts)
Emit `.js` and `.d.ts` files to [`lib`](./lib).
### [Publish](./scripts/publish.ts)
Move [`package.json`](./package.json), [`README.md`](./README.md) to [`lib`](./lib) and publish the package.
### [Bench](./scripts/bench.ts)
Run files that end with the `.bench.ts` extension.
To run a specific file:
```bash
bun task bench index # Run bench/index.bench.ts
```
To run the benchmark in `node`, add the `--node` parameter:
```bash
bun task bench --node
bun task bench --node index # Run bench/index.bench.ts with node
```