file2md
Version:
A TypeScript library for converting various document types (PDF, DOCX, XLSX, PPTX, HWP, HWPX) into Markdown with image and layout preservation
293 lines (222 loc) โข 7.76 kB
Markdown
# file2md
[](https://badge.fury.io/js/file2md)
[](https://www.typescriptlang.org/)
[](https://opensource.org/licenses/MIT)
A modern TypeScript library for converting various document types (PDF, DOCX, XLSX, PPTX) into Markdown with **advanced layout preservation**, **image extraction**, and **chart conversion**.
## โจ Features
- ๐ **Multiple Format Support**: PDF, DOCX, XLSX, PPTX
- ๐จ **Layout Preservation**: Maintains document structure, tables, and formatting
- ๐ผ๏ธ **Image Extraction**: Automatically extracts and references images
- ๐ **Chart Conversion**: Converts charts to Markdown tables
- ๐ **List & Table Support**: Proper nested lists and complex tables
- ๐ **Type Safety**: Full TypeScript support with comprehensive types
- โก **Modern ESM**: ES2022 modules with CommonJS compatibility
- ๐ **Zero Config**: Works out of the box
## ๐ฆ Installation
```bash
npm install file2md
```
## ๐ Quick Start
### TypeScript / ES Modules
```typescript
import { convert } from 'file2md';
// Convert from file path
const result = await convert('./document.pdf');
console.log(result.markdown);
// Convert with options
const result = await convert('./presentation.pptx', {
imageDir: 'extracted-images',
preserveLayout: true,
extractCharts: true
});
console.log(`โ
Converted successfully!`);
console.log(`๐ Markdown length: ${result.markdown.length}`);
console.log(`๐ผ๏ธ Images extracted: ${result.images.length}`);
console.log(`๐ Charts found: ${result.charts.length}`);
```
### CommonJS
```javascript
const { convert } = require('file2md');
const result = await convert('./document.docx');
console.log(result.markdown);
```
### From Buffer
```typescript
import { convert } from 'file2md';
import { readFile } from 'fs/promises';
const buffer = await readFile('./document.xlsx');
const result = await convert(buffer, {
imageDir: 'spreadsheet-images'
});
```
## ๐ API Reference
### `convert(input, options?)`
**Parameters:**
- `input: string | Buffer` - File path or buffer containing document data
- `options?: ConvertOptions` - Conversion options
**Returns:** `Promise<ConversionResult>`
### Options
```typescript
interface ConvertOptions {
imageDir?: string; // Directory for extracted images (default: 'images')
preserveLayout?: boolean; // Maintain document layout (default: true)
extractCharts?: boolean; // Convert charts to tables (default: true)
extractImages?: boolean; // Extract embedded images (default: true)
maxPages?: number; // Max pages for PDFs (default: unlimited)
}
```
### Result
```typescript
interface ConversionResult {
markdown: string; // Generated Markdown content
images: ImageData[]; // Extracted image information
charts: ChartData[]; // Extracted chart data
metadata: DocumentMetadata; // Document metadata
}
```
## ๐ฏ Format-Specific Features
### ๐ PDF
- โ
**Text extraction** with layout enhancement
- โ
**Table detection** and formatting
- โ
**List recognition** (bullets, numbers)
- โ
**Heading detection** (ALL CAPS, colons)
- โ
**Page-to-image fallback** for complex layouts
### ๐ DOCX
- โ
**Heading hierarchy** (H1-H6)
- โ
**Text formatting** (bold, italic)
- โ
**Complex tables** with merged cells
- โ
**Nested lists** with proper indentation
- โ
**Embedded images** and charts
- โ
**Cell styling** (alignment, colors)
### ๐ XLSX
- โ
**Multiple worksheets** as separate sections
- โ
**Cell formatting** (bold, colors, alignment)
- โ
**Data type preservation**
- โ
**Chart extraction** to data tables
- โ
**Conditional formatting** notes
### ๐ฌ PPTX
- โ
**Slide-by-slide** organization
- โ
**Text positioning** and layout
- โ
**Image placement** per slide
- โ
**Table extraction** from slides
- โ
**Multi-column layouts**
## ๐ผ๏ธ Image Handling
Images are automatically extracted and saved to the specified directory:
```typescript
const result = await convert('./presentation.pptx', {
imageDir: 'my-images'
});
// Result structure:
// my-images/
// โโโ image_1.png
// โโโ image_2.jpg
// โโโ chart_1.png
// Markdown will contain:
// 
```
## ๐ Chart Conversion
Charts are converted to Markdown tables:
```markdown
#### Chart 1: Sales Data
| Category | Q1 | Q2 | Q3 | Q4 |
| --- | --- | --- | --- | --- |
| Revenue | 100 | 150 | 200 | 250 |
| Profit | 20 | 30 | 45 | 60 |
```
## ๐ก๏ธ Error Handling
```typescript
import {
convert,
UnsupportedFormatError,
FileNotFoundError,
ParseError
} from 'file2md';
try {
const result = await convert('./document.pdf');
} catch (error) {
if (error instanceof UnsupportedFormatError) {
console.error('Unsupported file format');
} else if (error instanceof FileNotFoundError) {
console.error('File not found');
} else if (error instanceof ParseError) {
console.error('Failed to parse document:', error.message);
}
}
```
## ๐งช Advanced Usage
### Custom Error Handling
```typescript
import { convert, ConversionError } from 'file2md';
try {
const result = await convert('./complex-document.docx');
} catch (error) {
if (error instanceof ConversionError) {
console.error(`Conversion failed [${error.code}]:`, error.message);
if (error.originalError) {
console.error('Original error:', error.originalError);
}
}
}
```
### Batch Processing
```typescript
import { convert } from 'file2md';
import { readdir } from 'fs/promises';
async function convertFolder(folderPath: string) {
const files = await readdir(folderPath);
const results = [];
for (const file of files) {
if (file.match(/\.(pdf|docx|xlsx|pptx)$/i)) {
try {
const result = await convert(`${folderPath}/${file}`);
results.push({ file, success: true, result });
} catch (error) {
results.push({ file, success: false, error });
}
}
}
return results;
}
```
## ๐๏ธ Development
### Build from Source
```bash
git clone https://github.com/yourusername/file2md.git
cd file2md
npm install
npm run build
```
### Testing
```bash
npm test # Run tests
npm run test:watch # Watch mode
npm run test:coverage # Coverage report
```
### Linting
```bash
npm run lint # Check code style
npm run lint:fix # Fix issues
```
## ๐ค Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## ๐ License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## ๐ Links
- [npm package](https://www.npmjs.com/package/file2md)
- [GitHub repository](https://github.com/yourusername/file2md)
- [Issues & Bug Reports](https://github.com/yourusername/file2md/issues)
## ๐ Supported Formats
| Format | Extension | Layout | Images | Charts | Tables | Lists |
|--------|-----------|---------|---------|---------|---------|--------|
| PDF | `.pdf` | โ
| โ
* | โ | โ
| โ
|
| Word | `.docx` | โ
| โ
| โ
| โ
| โ
|
| Excel | `.xlsx` | โ
| โ | โ
| โ
| โ |
| PowerPoint | `.pptx` | โ
| โ
| โ
| โ
| โ |
*PDF images via page-to-image conversion
---
**Made with โค๏ธ and TypeScript**