# pdf-tax-reader-cl

A Node.js library for extracting specific data from Chilean tax PDF documents. This library is designed to scrape structured PDF documents like "CARPETA TRIBUTARIA ELECTRÓNICA PARA SOLICITAR CRÉDITOS" and extract key information.

[![npm version](https://badge.fury.io/js/pdf-tax-reader-cl.svg)](https://badge.fury.io/js/pdf-tax-reader-cl)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Node.js](https://img.shields.io/badge/node-%3E%3D14.0.0-brightgreen.svg)](https://nodejs.org/)

## 🚀 Quick Start

```bash
npm install pdf-tax-reader-cl
```

```javascript
const { extractTaxData } = require('pdf-tax-reader-cl');

// Extract data from a PDF file
extractTaxData('./tax-document.pdf')
  .then(data => {
    console.log('Extracted Data:', data);
    // {
    //   emitterName: "GUITAL Y PARTNERS LIMITADA",
    //   economicActivities: [
    //     "ASES.COMER.PUBLICIDAD,REPONEDORES,COMERC.FRUT,VERD,BEBIDAS DE FANTASIA",
    //     "463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS"
    //   ],
    //   address: "VITACURA 4380 , Dpto. 31 , VITACURA"
    // }
  })
  .catch(error => {
    console.error('Error:', error.message);
  });
```

## Features

- Extract emitter name from PDF documents
- Extract economic activities list
- Extract address information
- Process single PDF files or entire directories
- Save extracted data to JSON format
- Comprehensive error handling and logging

## Installation

```bash
npm install pdf-tax-reader-cl
```

### Requirements

- Node.js >= 14.0.0
- PDF files must be text-based (not scanned images)
- PDFs must follow the Chilean tax document structure

## Usage

### Single PDF Processing

```javascript
const { extractTaxData } = require('pdf-tax-reader-cl');

// Process a single PDF file
extractTaxData('./documents/tax-document.pdf')
  .then(data => {
    console.log('Extracted Data:', data);
  })
  .catch(error => {
    console.error('Error:', error.message);
  });
```

### Multiple PDF Processing

```javascript
const { processMultiplePDFs } = require('pdf-tax-reader-cl');

// Process all PDF files in a directory
processMultiplePDFs('./documents')
  .then(results => {
    console.log('Processing completed:', results);
    // [
    //   {
    //     filename: "document1.pdf",
    //     data: { emitterName: "...", economicActivities: [...], address: "..." }
    //   },
    //   {
    //     filename: "document2.pdf",
    //     error: "Invalid PDF format"
    //   }
    // ]
  })
  .catch(error => {
    console.error('Error:', error);
  });
```

### TypeScript Support

```typescript
import { extractTaxData, ExtractedTaxData } from 'pdf-tax-reader-cl';

// Extract data with TypeScript types
extractTaxData('./path/to/document.pdf')
  .then((data: ExtractedTaxData) => {
    console.log('Extracted Data:', data);
    // data.emitterName is string | null
    // data.economicActivities is string[]
    // data.address is string | null
  })
  .catch(error => {
    console.error('Error:', error);
  });
```

## Testing

If you're developing or contributing to this library:

```bash
# Clone the repository
git clone https://github.com/Jmzp/pdf-tax-reader-cl.git
cd pdf-tax-reader-cl

# Install dependencies
npm install

# Run tests
npm test
```

The test suite includes:

- Mock data validation
- Single PDF processing test
- Multiple PDF processing test

## Error Handling Examples

The application provides detailed error messages for different types of invalid files:

```javascript
const { extractTaxData } = require('pdf-tax-reader-cl');

// Example error handling, wrapped in an async function so that
// `await` is valid in a CommonJS script
async function main() {
  try {
    const data = await extractTaxData('./invalid-file.txt');
  } catch (error) {
    console.log('Error:', error.message);
    // Output: "Invalid file extension. Expected .pdf, got: txt"
  }

  try {
    const data = await extractTaxData('./corrupted-file.pdf');
  } catch (error) {
    console.log('Error:', error.message);
    // Output: "Invalid PDF format. File does not appear to be a valid PDF document."
  }

  try {
    const data = await extractTaxData('./non-tax-document.pdf');
  } catch (error) {
    console.log('Error:', error.message);
    // Output: "Document does not appear to be a Chilean tax document. Missing expected tax document structure."
  }
}

main();
```

## Data Extraction

The application extracts the following information from PDF documents:

### 1. Emitter Name (Nombre del emisor)

- Extracts the company or entity name that generated the document
- Pattern: `Nombre del emisor: [COMPANY_NAME]`

### 2. Economic Activities (Actividades Económicas)

- Extracts all economic activities listed in the document
- Includes both general descriptions and specific activity codes
- Pattern: Looks for lines containing activity codes (6 digits) or specific keywords

### 3. Address (Domicilio)

- Extracts the registered address of the taxpayer
- Pattern: `Domicilio: [ADDRESS]`

## Output Format

The extracted data is returned in the following JSON format:

```json
{
  "emitterName": "GUITAL Y PARTNERS LIMITADA",
  "economicActivities": [
    "ASES.COMER.PUBLICIDAD, REPONEDORES, COMERC.FRUT, VERD, BEBIDAS DE FANTASIA",
    "463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS",
    "463020 VENTA AL POR MAYOR DE BEBIDAS ALCOHOLICAS Y NO ALCOHOLICAS",
    "692000 ACTIVIDADES DE CONTABILIDAD, TENEDURIA DE LIBROS Y AUDITORIA; CONSULTO",
    "731001 SERVICIOS DE PUBLICIDAD PRESTADOS POR EMPRESAS",
    "783000 OTRAS ACTIVIDADES DE DOTACION DE RECURSOS HUMANOS",
    "854909 OTROS TIPOS DE ENSEÑANZA N.C.P.",
    "855000 ACTIVIDADES DE APOYO A LA ENSEÑANZA"
  ],
  "address": "VITACURA 4380, Dpto. 31, VITACURA"
}
```

## API Reference

### Functions

#### `extractTaxData(pdfPath: string): Promise<ExtractedTaxData>`

Extract tax data from a PDF file.

#### `processMultiplePDFs(directoryPath: string): Promise<ProcessingResult[]>`

Process multiple PDF files in a directory.

#### `saveToJSON(data: any, outputPath: string): void`

Save extracted data to JSON file.

#### `isValidPDF(dataBuffer: Buffer): boolean`

Validate if a file is a valid PDF.

#### `hasValidExtension(filePath: string): boolean`

Validate file extension.

#### `isTaxDocument(text: string): boolean`

Check if the document appears to be a Chilean tax document.

#### `validateExtractedData(data: ExtractedTaxData): ValidationResult`

Validate extracted data completeness.
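
To make the two file-level checks listed above more concrete, here is a minimal sketch of what such helpers could look like. It is not the library's actual implementation: it assumes the extension check is a case-insensitive comparison against `.pdf`, and that the format check looks for the `%PDF-` header that the PDF specification places at the start of a file. The names `hasValidExtensionSketch` and `isValidPDFSketch` are hypothetical stand-ins, chosen so they are not confused with the exported functions.

```javascript
const path = require('path');

// Hypothetical stand-in for hasValidExtension.
// Assumption: the check is a case-insensitive match on ".pdf".
function hasValidExtensionSketch(filePath) {
  return path.extname(filePath).toLowerCase() === '.pdf';
}

// Hypothetical stand-in for isValidPDF.
// Assumption: the check only looks for the "%PDF-" header in the first five bytes.
function isValidPDFSketch(dataBuffer) {
  if (!Buffer.isBuffer(dataBuffer) || dataBuffer.length < 5) {
    return false;
  }
  return dataBuffer.subarray(0, 5).toString('latin1') === '%PDF-';
}

// Small in-memory examples; no files are read.
console.log(hasValidExtensionSketch('./documents/tax-document.PDF'));
console.log(hasValidExtensionSketch('./invalid-file.txt'));
console.log(isValidPDFSketch(Buffer.from('%PDF-1.7\n')));
console.log(isValidPDFSketch(Buffer.from('plain text')));
```

Running the cheap extension check before reading the file, and the header check before parsing, lets invalid inputs fail early. A real validator may also need to inspect more of the file than this sketch does.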
### Types

#### `ExtractedTaxData`

```typescript
interface ExtractedTaxData {
  emitterName: string | null;
  economicActivities: string[];
  address: string | null;
}
```

#### `ProcessingResult`

```typescript
interface ProcessingResult {
  filename: string;
  data?: ExtractedTaxData;
  error?: string;
}
```

## Dependencies

- `pdf-parse`: For extracting text content from PDF files
- Built-in Node.js modules: `fs`, `path`

## Error Handling & Validation

The application includes comprehensive error handling and validation for:

### File Validation

- **File existence**: Checks if the file exists before processing
- **File extension**: Validates that the file has a `.pdf` extension
- **File size**: Ensures the file is not empty
- **PDF format**: Validates PDF structure and signatures

### Content Validation

- **PDF structure**: Verifies the file is a valid PDF document
- **Text content**: Ensures the PDF contains extractable text (not just scanned images)
- **Tax document structure**: Validates that the document appears to be a Chilean tax document
- **Data completeness**: Ensures all required fields are successfully extracted

### Error Types Handled

- File not found errors
- Invalid file extensions (.txt, .doc, etc.)
- Corrupted or invalid PDF files
- Empty files
- PDFs without extractable text
- Non-tax documents
- Incomplete data extraction
- Directory access errors

## Limitations

- The application is designed specifically for Chilean tax documents with the structure shown in the example
- PDF must be text-based (not scanned images)
- Extraction accuracy depends on the consistency of the PDF format
- The application will reject non-PDF files, corrupted PDFs, and documents that don't match the expected tax document structure

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Author

**Jorge Zapata** - [GitHub](https://github.com/Jmzp)

## Support

If you find this library useful, please consider giving it a ⭐️ on GitHub!