UNPKG

@thasmorato/docx-parser

Version:

A modern JavaScript library for parsing and processing Microsoft Word DOCX documents with support for both buffer and stream operations. Features incremental parsing, checkbox detection, footnote support, and document validation.

828 lines (663 loc) • 20.9 kB
# DOCX Parser A modern JavaScript/TypeScript library for parsing DOCX documents using generator architecture and Clean Architecture. ## šŸš€ Features - **Streaming**: Processes DOCX documents incrementally using async generators - **Memory Efficient**: Doesn't load the entire document into memory - **TypeScript**: Complete typing with well-defined interfaces - **Clean Architecture**: Clear separation between layers (Domain, Application, Infrastructure, Interfaces) - **Flexible**: Extracts text, images, tables, metadata, and formatting - **Configurable**: Advanced options to control parsing - **Checkbox Detection**: Automatically detects and parses checkbox states in lists - **Footnote Support**: Identifies and extracts footnotes with proper references - **Header Levels**: Detects document structure with header hierarchy (H1-H6) - **Document Validation**: Built-in validation for DOCX file integrity - **Page Elements**: Supports headers, footers, and page breaks - **List Processing**: Handles numbered, bulleted, and checkbox lists ## šŸ“¦ Installation ```bash npm install @thasmorato/docx-parser # or yarn add @thasmorato/docx-parser # or pnpm add @thasmorato/docx-parser ``` ## šŸ“‹ Main Exports The library exports only essential items for a clean API: ```typescript // Main parsing functions import { parseDocx, // Buffer → AsyncGenerator parseDocxStream, // ReadStream → AsyncGenerator parseDocxHttpStream, // HTTP Readable Stream → AsyncGenerator parseDocxReadable, // Node.js Readable Stream → AsyncGenerator parseDocxWebStream, // ReadableStream → AsyncGenerator parseDocxFile, // File path → AsyncGenerator parseDocxToArray, // Any source → Promise<Array> extractText, // Extract only text extractImages, // Extract only images getMetadata // Extract only metadata } from 'docx-parser'; // Essential types import type { DocumentElement, // Union of all element types ParseOptions, // Configuration options ValidationResult, // Validation response MetadataElement, // Document metadata ParagraphElement, // Text paragraphs ImageElement, // Images with metadata TableElement, // Tables with rows/cells HeaderElement, // Document headers FooterElement, // Document footers ListElement, // Lists (bullet/numbered) PageBreakElement, // Page breaks SectionElement // Document sections } from 'docx-parser'; // Error handling import { DocxParseError } from 'docx-parser'; // Advanced utilities import { StreamAdapter } from 'docx-parser'; import { ValidateDocumentUseCaseImpl } from 'docx-parser'; ``` ## šŸ”„ Basic Usage ### Incremental Parsing (Recommended) ```typescript import { parseDocx } from 'docx-parser'; import { readFileSync } from 'fs'; const buffer = readFileSync('document.docx'); // Process document incrementally for await (const element of parseDocx(buffer)) { console.log(`Type: ${element.type}`); if (element.type === 'paragraph') { console.log(`Text: ${element.content}`); } else if (element.type === 'image') { console.log(`Image: ${element.metadata?.filename}`); } else if (element.type === 'table') { console.log(`Table with ${element.content.length} rows`); } } ``` ### File Parsing ```typescript import { parseDocxFile } from 'docx-parser'; // Read file directly from filesystem for await (const element of parseDocxFile('./document.docx')) { console.log(element.type, element.content); } ``` ### Stream Parsing ```typescript import { parseDocxStream, parseDocxHttpStream, parseDocxWebStream } from 'docx-parser'; import { createReadStream } from 'fs'; import axios from 'axios'; // Using Node.js ReadStream (recommended for local files) const fileStream = createReadStream('./document.docx'); for await (const element of parseDocxStream(fileStream)) { console.log(element.type); } // Using HTTP streams (axios, fetch, etc.) - NEW! const response = await axios({ url: 'https://example.com/document.docx', responseType: 'stream' }); for await (const element of parseDocxHttpStream(response.data)) { console.log(element.type); } // Using Web ReadableStream (for fetch, responses, etc.) const fetchResponse = await fetch('https://example.com/document.docx'); const webStream = fetchResponse.body!; for await (const element of parseDocxWebStream(webStream)) { console.log(element.type); } ``` ### StreamAdapter Utility ```typescript import { StreamAdapter } from 'docx-parser'; import { createReadStream } from 'fs'; // Convert ReadStream to Web ReadableStream const nodeStream = createReadStream('./document.docx'); const webStream = StreamAdapter.toWebStream(nodeStream); // Convert ReadableStream to Buffer const buffer = await StreamAdapter.toBuffer(webStream); // Create ReadableStream from Buffer const streamFromBuffer = StreamAdapter.fromBuffer(buffer); ``` ## šŸ› ļø Complete API ### Main Functions #### `parseDocx(buffer, options?)` Processes a DOCX Buffer incrementally. ```typescript import { parseDocx } from 'docx-parser'; import { readFileSync } from 'fs'; const buffer = readFileSync('doc.docx'); for await (const element of parseDocx(buffer, { includeImages: true, includeTables: true, normalizeWhitespace: true })) { // Process each element } ``` #### `parseDocxStream(stream, options?)` Processes a DOCX ReadStream (Node.js) incrementally. ```typescript import { parseDocxStream } from 'docx-parser'; import { createReadStream } from 'fs'; const stream = createReadStream('doc.docx'); for await (const element of parseDocxStream(stream)) { // Process each element } ``` #### `parseDocxReadable(stream, options?)` Processes a DOCX from any Node.js Readable stream incrementally. ```typescript import { parseDocxReadable } from 'docx-parser'; import { Readable } from 'stream'; // Custom Readable stream const customStream = new Readable({ read() { // Custom stream logic } }); for await (const element of parseDocxReadable(customStream)) { // Process each element } // Transform streams, pipeline streams, etc. import { Transform } from 'stream'; const transformedStream = someStream.pipe(new Transform({ transform(chunk, encoding, callback) { // Transform logic callback(null, chunk); } })); for await (const element of parseDocxReadable(transformedStream)) { // Process transformed stream } ``` #### `parseDocxHttpStream(stream, options?)` Processes a DOCX from HTTP Readable streams (axios, fetch, etc.) incrementally. ```typescript import { parseDocxHttpStream } from 'docx-parser'; import axios from 'axios'; const response = await axios({ url: 'https://example.com/doc.docx', responseType: 'stream' }); for await (const element of parseDocxHttpStream(response.data)) { // Process each element } ``` #### `parseDocxWebStream(stream, options?)` Processes a DOCX ReadableStream (Web API) incrementally. ```typescript import { parseDocxWebStream } from 'docx-parser'; const response = await fetch('document.docx'); const stream = response.body!; for await (const element of parseDocxWebStream(stream)) { // Process each element } ``` #### `parseDocxToArray(source, options?)` Returns all elements as an array (non-streaming). ```typescript import { parseDocxToArray } from 'docx-parser'; import { createReadStream } from 'fs'; // From buffer const elements = await parseDocxToArray(buffer); // From ReadStream const stream = createReadStream('doc.docx'); const elements2 = await parseDocxToArray(stream); console.log(`Document has ${elements.length} elements`); ``` #### `extractText(source, options?)` Extracts only text from the document. ```typescript import { extractText } from 'docx-parser'; import { createReadStream } from 'fs'; // From buffer const text = await extractText(buffer, { preserveFormatting: true }); // From ReadStream const stream = createReadStream('doc.docx'); const text2 = await extractText(stream); console.log(text); ``` #### `extractImages(source)` Extracts only images from the document. ```typescript import { extractImages } from 'docx-parser'; import { createReadStream } from 'fs'; // From buffer for await (const image of extractImages(buffer)) { console.log(`Image: ${image.metadata?.filename}`); } // From ReadStream const stream = createReadStream('doc.docx'); for await (const image of extractImages(stream)) { // image.content contains the image Buffer } ``` #### `getMetadata(source)` Extracts only metadata from the document. ```typescript import { getMetadata } from 'docx-parser'; import { createReadStream } from 'fs'; // From buffer const metadata = await getMetadata(buffer); // From ReadStream const stream = createReadStream('doc.docx'); const metadata2 = await getMetadata(stream); console.log(`Title: ${metadata.title}`); console.log(`Author: ${metadata.author}`); console.log(`Created: ${metadata.created}`); ``` ### Document Validation #### `ValidateDocumentUseCaseImpl` Validates DOCX document structure and integrity. ```typescript import { ValidateDocumentUseCaseImpl } from 'docx-parser'; const validator = new ValidateDocumentUseCaseImpl(); const result = await validator.validate(buffer); if (!result.isValid) { console.log('Invalid document:'); result.errors.forEach(error => { console.log(`- ${error.message} (${error.code})`); }); } else { console.log('Valid document!'); } ``` ## 🌐 HTTP Streams Support The library now supports **HTTP streams** from popular libraries like axios, fetch, and others. This resolves the common error `"stream.getReader is not a function"` when working with HTTP responses. ### Using with Axios ```typescript import axios from 'axios'; import { parseDocxHttpStream, parseDocxToArray } from 'docx-parser'; // Method 1: Specific HTTP stream function (recommended) async function parseFromUrl(url: string) { const response = await axios({ method: 'get', url: url, responseType: 'stream' // Important: use 'stream', not 'buffer' }); for await (const element of parseDocxHttpStream(response.data)) { console.log(`${element.type}: ${element.content}`); } } // Method 2: Automatic detection async function parseToArray(url: string) { const response = await axios({ method: 'get', url: url, responseType: 'stream' }); // parseDocxToArray automatically detects Node.js streams const elements = await parseDocxToArray(response.data, { includeImages: true, includeTables: true }); return elements; } ``` ### Using with Fetch API ```typescript // Fetch API (Node.js 18+) const response = await fetch('https://example.com/document.docx'); if (!response.ok) throw new Error(`HTTP error! status: ${response.status}`); // response.body is a Web ReadableStream - works automatically const elements = await parseDocxToArray(response.body); ``` ### Error Handling for HTTP Streams ```typescript import axios from 'axios'; import { parseDocxHttpStream } from 'docx-parser'; async function safeParseFromUrl(url: string) { try { const response = await axios({ method: 'get', url: url, responseType: 'stream', timeout: 30000, // 30 seconds timeout headers: { 'Accept': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document' } }); const elements = []; for await (const element of parseDocxHttpStream(response.data)) { elements.push(element); } return elements; } catch (error) { if (axios.isAxiosError(error)) { console.error('HTTP Error:', error.response?.status, error.message); } else if (error.name === 'DocxParseError') { console.error('DOCX Parse Error:', error.message); } else { console.error('Unknown Error:', error); } throw error; } } ``` **šŸ“– For complete HTTP streams documentation, see:** [`project/HTTP_STREAMS_GUIDE.md`](./project/HTTP_STREAMS_GUIDE.md) ## āš™ļø Configuration Options ```typescript interface ParseOptions { // Content control includeMetadata?: boolean; // Include metadata (default: true) includeImages?: boolean; // Include images (default: true) includeTables?: boolean; // Include tables (default: true) includeHeaders?: boolean; // Include headers (default: false) includeFooters?: boolean; // Include footers (default: false) // Image processing imageFormat?: 'buffer' | 'base64' | 'stream'; // Image format maxImageSize?: number; // Maximum size in bytes (default: 10MB) // Text formatting preserveFormatting?: boolean; // Preserve formatting (default: true) normalizeWhitespace?: boolean; // Normalize whitespace (default: true) // Performance chunkSize?: number; // Chunk size (default: 64KB) concurrent?: boolean; // Parallel processing (default: false) } ``` ## šŸ“‹ Element Types ### MetadataElement ```typescript { type: 'metadata', id: string, position: { page: number, section: number, order: number }, content: { title?: string, author?: string, subject?: string, created?: Date, modified?: Date, // ... other metadata } } ``` ### ParagraphElement ```typescript { type: 'paragraph', id: string, position: { page: number, section: number, order: number }, content: string, formatting?: { fontFamily?: string, fontSize?: number, bold?: boolean, italic?: boolean, underline?: boolean, color?: string, highlight?: string }, // Special properties checkbox?: { // For checkbox list items checked: boolean }, isFootnote?: boolean, // If this is a footnote footnoteId?: string // Footnote identifier } ``` ### ImageElement ```typescript { type: 'image', id: string, position: { page: number, section: number, order: number }, content: Buffer, metadata?: { filename?: string, format: 'png' | 'jpg' | 'gif' | 'svg' | 'wmf' | 'emf', width: number, height: number }, positioning?: { inline?: boolean, x?: number, y?: number } } ``` ### TableElement ```typescript { type: 'table', id: string, position: { page: number, section: number, order: number }, content: Array<{ cells: Array<{ content: string }>, isHeader: boolean }> } ``` ### HeaderElement ```typescript { type: 'header', id: string, position: { page: number, section: number, order: number }, content: string, level: 1 | 2 | 3 | 4 | 5 | 6, hasPageNumber?: boolean // If header contains page numbers } ``` ### FooterElement ```typescript { type: 'footer', id: string, position: { page: number, section: number, order: number }, content: string } ``` ### ListElement ```typescript { type: 'list', id: string, position: { page: number, section: number, order: number }, content: string[], listType: 'bullet' | 'number', level: number } ``` ### Additional Elements ```typescript // Page breaks { type: 'pageBreak', id: string, position: { page: number, section: number, order: number }, content: null } // Document sections { type: 'section', id: string, position: { page: number, section: number, order: number }, content: string, sectionType: 'header' | 'footer' | 'body' } ``` ## šŸ—ļø Advanced Examples ### Filtering Content Types ```typescript import { parseDocx } from 'docx-parser'; // Only process text and tables for await (const element of parseDocx(buffer, { includeImages: false, includeMetadata: false, includeTables: true })) { if (element.type === 'paragraph') { console.log('Paragraph:', element.content); } else if (element.type === 'table') { console.log('Table found'); element.content.forEach((row, i) => { console.log(`Row ${i}:`, row.cells.map(c => c.content).join(' | ')); }); } } ``` ### Working with Checkboxes ```typescript import { parseDocx } from 'docx-parser'; for await (const element of parseDocx(buffer)) { if (element.type === 'paragraph' && element.checkbox) { const status = element.checkbox.checked ? 'āœ…' : '☐'; console.log(`${status} ${element.content}`); } } ``` ### Processing Footnotes ```typescript import { parseDocx } from 'docx-parser'; const footnotes = []; for await (const element of parseDocx(buffer)) { if (element.type === 'paragraph' && element.isFootnote) { footnotes.push({ id: element.footnoteId, content: element.content }); } } console.log('Found footnotes:', footnotes); ``` ### Header Level Detection ```typescript import { parseDocx } from 'docx-parser'; for await (const element of parseDocx(buffer)) { if (element.type === 'header') { const indent = ' '.repeat(element.level - 1); console.log(`${indent}H${element.level}: ${element.content}`); if (element.hasPageNumber) { console.log(`${indent} (contains page number)`); } } } ``` ### Working with Node.js Readable Streams ```typescript import { parseDocxReadable } from 'docx-parser'; import { Readable, Transform, pipeline } from 'stream'; import { promisify } from 'util'; // Custom stream processing const customStream = new Readable({ read() { // Custom logic to read DOCX data } }); for await (const element of parseDocxReadable(customStream)) { console.log(`${element.type}: ${element.content}`); } // Pipeline with transforms const pipelineAsync = promisify(pipeline); const transformStream = new Transform({ transform(chunk, encoding, callback) { // Apply transformations if needed callback(null, chunk); } }); // Process transformed stream const transformedStream = someDocxStream.pipe(transformStream); for await (const element of parseDocxReadable(transformedStream)) { console.log('Transformed element:', element); } ``` ### Image Processing ```typescript import { writeFileSync } from 'fs'; import { extractImages } from 'docx-parser'; let imageCount = 0; for await (const image of extractImages(buffer)) { const filename = image.metadata?.filename || `image_${++imageCount}.${image.metadata?.format}`; writeFileSync(`./output/${filename}`, image.content); console.log(`Image saved: ${filename}`); } ``` ### Document Validation ```typescript import { ValidateDocumentUseCaseImpl } from 'docx-parser'; const validator = new ValidateDocumentUseCaseImpl(); const result = await validator.validate(buffer); if (!result.isValid) { console.log('Invalid document:'); result.errors.forEach(error => { console.log(`- ${error.message} (${error.code})`); }); } else { console.log('Valid document!'); } ``` ## šŸ”§ Architecture The library follows Clean Architecture with 4 layers: - **Domain**: Types, interfaces, and business rules - **Application**: Use cases and application logic - **Infrastructure**: Concrete implementations (JSZip, XML parsing) - **Interfaces**: Public API and controllers ``` src/ ā”œā”€ā”€ domain/ # Entities and business rules ā”œā”€ā”€ application/ # Use cases and application logic ā”œā”€ā”€ infrastructure/ # Adapters and implementations └── interfaces/ # Public API ``` ## 🚨 Error Handling ```typescript import { DocxParseError } from 'docx-parser'; try { for await (const element of parseDocx(invalidBuffer)) { console.log(element); } } catch (error) { if (error instanceof DocxParseError) { console.log('DOCX-specific error:', error.message); } else { console.log('General error:', error); } } ``` ## šŸ“ Performance Tips 1. **Use streaming**: Prefer `parseDocx()` over `parseDocxToArray()` for large documents 2. **Filter content**: Disable `includeImages` if you don't need images 3. **Chunk size**: Adjust `chunkSize` based on document sizes 4. **Limit images**: Use `maxImageSize` to avoid very large images ## šŸ¤ Contributing 1. Fork the project 2. Create a feature branch 3. Commit your changes 4. Push to the branch 5. Open a Pull Request ### Development Setup ```bash # Clone the repository git clone https://github.com/ThaSMorato/docx-parser.git cd docx-parser # Install dependencies pnpm install # Run tests pnpm test # Run linting pnpm lint # Build the project pnpm build ``` ### Automated Publishing This project uses GitHub Actions for automated NPM publishing with semantic versioning: - **MAJOR**: `MAJOR: breaking change` or `BREAKING CHANGE` in commit - **MINOR**: `MINOR: new feature` or `feat:` prefix - **PATCH**: `fix:`, `docs:`, `chore:` or any other commit type See [`.github/VERSIONING.md`](.github/VERSIONING.md) for detailed information about the versioning system. ## šŸ“„ License MIT License - see [LICENSE](LICENSE) for details. ## šŸ› Reporting Bugs Open an issue on GitHub with: - Library version - Sample DOCX file (if possible) - Code that reproduces the problem - Complete error message --- **Made with ā¤ļø and TypeScript**