pdf2html
Version:
PDF to HTML or Text conversion using Apache Tika. Also generate PDF thumbnail using Apache PDFBox.
470 lines (354 loc) • 13.6 kB
Markdown
# pdf2html
[](https://www.npmjs.com/package/pdf2html)
[](https://www.npmjs.org/package/pdf2html)
[](https://travis-ci.org/shebinleo/pdf2html)
[](https://www.npmjs.org/package/pdf2html)
[](https://nodejs.org)
> Convert PDF files to HTML, extract text, generate thumbnails, extract images, and extract metadata using Apache Tika and PDFBox
## 🚀 Features
- **PDF to HTML conversion** - Maintains formatting and structure
- **Text extraction** - Extract plain text content from PDFs
- **Page-by-page processing** - Process PDFs page by page
- **Metadata extraction** - Extract author, title, creation date, and more
- **Thumbnail generation** - Generate preview images from PDF pages
- **Image extraction** - Extract all embedded images from PDFs
- **Buffer support** - Process PDFs from memory buffers or file paths
- **TypeScript support** - Full type definitions included
- **Async/Promise based** - Modern async API
- **Configurable** - Extensive options for customization
## 📋 Prerequisites
- **Node.js** >= 12.0.0
- **Java Runtime Environment (JRE)** >= 8
- Required for Apache Tika and PDFBox
- [Download Java](https://www.java.com/en/download/)
## 📦 Installation
### Using npm:
```bash
npm install pdf2html
```
### Using yarn:
```bash
yarn add pdf2html
```
### Using pnpm:
```bash
pnpm add pdf2html
```
The installation process will automatically download the required Apache Tika and PDFBox JAR files. You'll see a progress indicator during the download.
## 🔧 Basic Usage
### Convert PDF to HTML
```javascript
const pdf2html = require('pdf2html');
const fs = require('fs');
// From file path
const html = await pdf2html.html('path/to/document.pdf');
console.log(html);
// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const html = await pdf2html.html(pdfBuffer);
console.log(html);
// With options
const html = await pdf2html.html(pdfBuffer, {
maxBuffer: 1024 * 1024 * 10, // 10MB buffer
});
```
### Extract Text
```javascript
// From file path
const text = await pdf2html.text('path/to/document.pdf');
// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const text = await pdf2html.text(pdfBuffer);
console.log(text);
```
### Process Pages Individually
```javascript
// From file path
const htmlPages = await pdf2html.pages('path/to/document.pdf');
// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const htmlPages = await pdf2html.pages(pdfBuffer);
htmlPages.forEach((page, index) => {
console.log(`Page ${index + 1}:`, page);
});
// Get text for each page
const textPages = await pdf2html.pages(pdfBuffer, {
text: true,
});
```
### Extract Metadata
```javascript
// From file path or buffer
const metadata = await pdf2html.meta(pdfBuffer);
console.log(metadata);
// Output: {
// title: 'Document Title',
// author: 'John Doe',
// subject: 'Document Subject',
// keywords: 'pdf, conversion',
// creator: 'Microsoft Word',
// producer: 'Adobe PDF Library',
// creationDate: '2023-01-01T00:00:00Z',
// modificationDate: '2023-01-02T00:00:00Z',
// pages: 10
// }
```
### Generate Thumbnails
```javascript
// From file path
const thumbnailPath = await pdf2html.thumbnail('path/to/document.pdf');
// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const thumbnailPath = await pdf2html.thumbnail(pdfBuffer);
console.log('Thumbnail saved to:', thumbnailPath);
// Custom thumbnail options
const thumbnailPath = await pdf2html.thumbnail(pdfBuffer, {
page: 1, // Page number (default: 1)
imageType: 'png', // 'png' or 'jpg' (default: 'png')
width: 300, // Width in pixels (default: 160)
height: 400, // Height in pixels (default: 226)
});
```
### Extract Images
```javascript
// From file path
const imagePaths = await pdf2html.extractImages('path/to/document.pdf');
console.log('Extracted images:', imagePaths);
// Output: ['/absolute/path/to/files/image/document1.jpg', '/absolute/path/to/files/image/document2.png', ...]
// From buffer
const pdfBuffer = fs.readFileSync('path/to/document.pdf');
const imagePaths = await pdf2html.extractImages(pdfBuffer);
// With custom output directory
const imagePaths = await pdf2html.extractImages(pdfBuffer, {
outputDirectory: './extracted-images', // Custom output directory
});
// With custom buffer size for large PDFs
const imagePaths = await pdf2html.extractImages('large-document.pdf', {
outputDirectory: './output',
maxBuffer: 1024 * 1024 * 10, // 10MB buffer
});
```
## 💻 TypeScript Support
This package includes TypeScript type definitions out of the box. No need to install `@types/pdf2html`.
### Basic TypeScript Usage
```typescript
import * as pdf2html from 'pdf2html';
// or
import { html, text, pages, meta, thumbnail, extractImages, PDFMetadata, PDFProcessingError } from 'pdf2html';
async function convertPDF() {
try {
// All methods accept string paths or Buffers
const htmlContent: string = await pdf2html.html('document.pdf');
const textContent: string = await pdf2html.text(Buffer.from(pdfData));
// Full type safety for options
const thumbnailPath = await pdf2html.thumbnail('document.pdf', {
page: 1, // number
imageType: 'png', // 'png' | 'jpg'
width: 300, // number
height: 400, // number
});
// TypeScript knows the shape of metadata
const metadata: PDFMetadata = await pdf2html.meta('document.pdf');
console.log(metadata['pdf:producer']); // string | undefined
console.log(metadata.resourceName); // string | undefined
} catch (error) {
if (error instanceof pdf2html.PDFProcessingError) {
console.error('PDF processing failed:', error.message);
console.error('Exit code:', error.exitCode);
}
}
}
```
### Type Definitions
```typescript
// Input types - all methods accept either file paths or Buffers
type PDFInput = string | Buffer;
// Options interfaces
interface ProcessingOptions {
maxBuffer?: number; // Maximum buffer size in bytes
}
interface PageOptions extends ProcessingOptions {
text?: boolean; // Extract text instead of HTML
}
interface ThumbnailOptions extends ProcessingOptions {
page?: number; // Page number (default: 1)
imageType?: 'png' | 'jpg'; // Image format (default: 'png')
width?: number; // Width in pixels (default: 160)
height?: number; // Height in pixels (default: 226)
}
// Metadata structure with common fields
interface PDFMetadata {
'pdf:PDFVersion'?: string;
'pdf:producer'?: string;
'xmp:CreatorTool'?: string;
'dc:title'?: string;
'dc:creator'?: string;
resourceName?: string;
[key: string]: any; // Allows additional properties
}
// Error class
class PDFProcessingError extends Error {
command?: string; // The command that failed
exitCode?: number; // The process exit code
}
```
### IntelliSense Support
Full IntelliSense support in VS Code and other TypeScript-aware editors:
- Auto-completion for all methods and options
- Inline documentation on hover
- Type checking at compile time
- Catch errors before runtime
### Advanced TypeScript Usage
```typescript
import { PDFProcessor, utils } from 'pdf2html';
// Using the PDFProcessor class directly
const html = await PDFProcessor.toHTML('document.pdf');
// Using utility classes
const { FileManager, HTMLParser } = utils;
await FileManager.ensureDirectories();
// Type guards
function isPDFProcessingError(error: unknown): error is pdf2html.PDFProcessingError {
return error instanceof pdf2html.PDFProcessingError;
}
// Generic helper with proper typing
async function processPDFSafely<T>(operation: () => Promise<T>, fallback: T): Promise<T> {
try {
return await operation();
} catch (error) {
if (isPDFProcessingError(error)) {
console.error(`PDF operation failed: ${error.message}`);
}
return fallback;
}
}
// Usage
const pages = await processPDFSafely(
() => pdf2html.pages('document.pdf', { text: true }),
[] // fallback to empty array
);
```
## ⚙️ Advanced Configuration
### Buffer Size Configuration
By default, the maximum buffer size is 2MB. For large PDFs, you may need to increase this:
```javascript
const options = {
maxBuffer: 1024 * 1024 * 50, // 50MB buffer
};
// Apply to any method
await pdf2html.html('large-file.pdf', options);
await pdf2html.text('large-file.pdf', options);
await pdf2html.pages('large-file.pdf', options);
await pdf2html.meta('large-file.pdf', options);
await pdf2html.thumbnail('large-file.pdf', options);
```
### Error Handling
Always wrap your calls in try-catch blocks for proper error handling:
```javascript
try {
const html = await pdf2html.html('document.pdf');
// Process HTML
} catch (error) {
if (error.code === 'ENOENT') {
console.error('PDF file not found');
} else if (error.message.includes('Java')) {
console.error('Java is not installed or not in PATH');
} else {
console.error('PDF processing failed:', error.message);
}
}
```
## 🏗️ API Reference
### `pdf2html.html(input, [options])`
Converts PDF to HTML format.
- **input** `string | Buffer` - Path to the PDF file or PDF buffer
- **options** `object` (optional)
- `maxBuffer` `number` - Maximum buffer size in bytes (default: 2MB)
- **Returns:** `Promise<string>` - HTML content
### `pdf2html.text(input, [options])`
Extracts text from PDF.
- **input** `string | Buffer` - Path to the PDF file or PDF buffer
- **options** `object` (optional)
- `maxBuffer` `number` - Maximum buffer size in bytes
- **Returns:** `Promise<string>` - Extracted text
### `pdf2html.pages(input, [options])`
Processes PDF page by page.
- **input** `string | Buffer` - Path to the PDF file or PDF buffer
- **options** `object` (optional)
- `text` `boolean` - Extract text instead of HTML (default: false)
- `maxBuffer` `number` - Maximum buffer size in bytes
- **Returns:** `Promise<string[]>` - Array of HTML or text strings
### `pdf2html.meta(input, [options])`
Extracts PDF metadata.
- **input** `string | Buffer` - Path to the PDF file or PDF buffer
- **options** `object` (optional)
- `maxBuffer` `number` - Maximum buffer size in bytes
- **Returns:** `Promise<object>` - Metadata object
### `pdf2html.thumbnail(input, [options])`
Generates a thumbnail image from PDF.
- **input** `string | Buffer` - Path to the PDF file or PDF buffer
- **options** `object` (optional)
- `page` `number` - Page to thumbnail (default: 1)
- `imageType` `string` - 'png' or 'jpg' (default: 'png')
- `width` `number` - Thumbnail width (default: 160)
- `height` `number` - Thumbnail height (default: 226)
- `maxBuffer` `number` - Maximum buffer size in bytes
- **Returns:** `Promise<string>` - Path to generated thumbnail
## 🔧 Manual Dependency Installation
If automatic download fails (e.g., due to network restrictions), you can manually download the dependencies:
1. Create the vendor directory:
```bash
mkdir -p node_modules/pdf2html/vendor
```
2. Download the required JAR files:
```bash
cd node_modules/pdf2html/vendor
# Download Apache PDFBox
wget https://archive.apache.org/dist/pdfbox/2.0.34/pdfbox-app-2.0.34.jar
# Download Apache Tika
wget https://archive.apache.org/dist/tika/3.2.0/tika-app-3.2.0.jar
```
3. Verify the files are in place:
```bash
ls -la node_modules/pdf2html/vendor/
# Should show both JAR files
```
## 🐛 Troubleshooting
### Common Issues
1. **"Java is not installed"**
- Install Java JRE 8 or higher
- Ensure `java` is in your system PATH
- Verify with: `java -version`
2. **"File not found" errors**
- Check that the PDF path is correct
- Use absolute paths for better reliability
- Ensure the file has read permissions
3. **"Buffer size exceeded"**
- Increase maxBuffer option
- Process large PDFs page by page
- Consider splitting very large PDFs
4. **"Download failed during installation"**
- Check internet connection
- Try manual installation (see above)
- Check proxy settings if behind firewall
### Debug Mode
Enable debug output for troubleshooting:
```bash
DEBUG=pdf2html node your-script.js
```
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## 📝 License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- [Apache Tika](https://tika.apache.org/) - Content analysis toolkit
- [Apache PDFBox](https://pdfbox.apache.org/) - PDF manipulation library
## 📊 Dependencies
- **Production**: Apache Tika 3.2.0, Apache PDFBox 2.0.34
- **Development**: See package.json for development dependencies
---
Made with ❤️ by the pdf2html community