# pdf-tax-reader-cl

A Node.js library for extracting specific data from Chilean tax PDF documents. This library is designed to scrape structured PDF documents like "CARPETA TRIBUTARIA ELECTRÓNICA PARA SOLICITAR CRÉDITOS" and extract key information.

[![npm version](https://badge.fury.io/js/pdf-tax-reader-cl.svg)](https://badge.fury.io/js/pdf-tax-reader-cl)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Node.js](https://img.shields.io/badge/node-%3E%3D14.0.0-brightgreen.svg)](https://nodejs.org/)

## 🚀 Quick Start

```bash
npm install pdf-tax-reader-cl
```

```javascript
const { extractTaxData } = require('pdf-tax-reader-cl');

// Extract data from a PDF file
extractTaxData('./tax-document.pdf')
  .then(data => {
    console.log('Extracted Data:', data);
    // {
    //   emitterName: "GUITAL Y PARTNERS LIMITADA",
    //   economicActivities: [
    //     "ASES.COMER.PUBLICIDAD,REPONEDORES,COMERC.FRUT,VERD,BEBIDAS DE FANTASIA",
    //     "463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS"
    //   ],
    //   address: "VITACURA 4380 , Dpto. 31 , VITACURA"
    // }
  })
  .catch(error => {
    console.error('Error:', error.message);
  });
```

## Features

- Extract emitter name from PDF documents
- Extract economic activities list
- Extract address information
- Process single PDF files or entire directories
- Save extracted data to JSON format
- Comprehensive error handling and logging

## Installation

```bash
npm install pdf-tax-reader-cl
```

### Requirements

- Node.js >= 14.0.0
- PDF files must be text-based (not scanned images)
- PDFs must follow the Chilean tax document structure

## Usage

### Single PDF Processing

```javascript
const { extractTaxData } = require('pdf-tax-reader-cl');

// Process a single PDF file
extractTaxData('./documents/tax-document.pdf')
  .then(data => {
    console.log('Extracted Data:', data);
  })
  .catch(error => {
    console.error('Error:', error.message);
  });
```

### Multiple PDF Processing

```javascript
const { processMultiplePDFs } = require('pdf-tax-reader-cl');

// Process all PDF files in a directory
processMultiplePDFs('./documents')
  .then(results => {
    console.log('Processing completed:', results);
    // [
    //   {
    //     filename: "document1.pdf",
    //     data: { emitterName: "...", economicActivities: [...], address: "..." }
    //   },
    //   {
    //     filename: "document2.pdf",
    //     error: "Invalid PDF format"
    //   }
    // ]
  })
  .catch(error => {
    console.error('Error:', error);
  });
```

### TypeScript Support

```typescript
import { extractTaxData, ExtractedTaxData } from 'pdf-tax-reader-cl';

// Extract data with TypeScript types
extractTaxData('./path/to/document.pdf')
  .then((data: ExtractedTaxData) => {
    console.log('Extracted Data:', data);
    // data.emitterName is string | null
    // data.economicActivities is string[]
    // data.address is string | null
  })
  .catch(error => {
    console.error('Error:', error);
  });
```

## Testing

If you're developing or contributing to this library:

```bash
# Clone the repository
git clone https://github.com/Jmzp/pdf-tax-reader-cl.git
cd pdf-tax-reader-cl

# Install dependencies
npm install

# Run tests
npm test
```

The test suite includes:

- Mock data validation
- Single PDF processing test
- Multiple PDF processing test

## Error Handling Examples

The application provides detailed error messages for different types of invalid files:

```javascript
// Example error handling
try {
  const data = await extractTaxData('./invalid-file.txt');
} catch (error) {
  console.log('Error:', error.message);
  // Output: "Invalid file extension. Expected .pdf, got: txt"
}

try {
  const data = await extractTaxData('./corrupted-file.pdf');
} catch (error) {
  console.log('Error:', error.message);
  // Output: "Invalid PDF format. File does not appear to be a valid PDF document."
}

try {
  const data = await extractTaxData('./non-tax-document.pdf');
} catch (error) {
  console.log('Error:', error.message);
  // Output: "Document does not appear to be a Chilean tax document. Missing expected tax document structure."
}
```

## Data Extraction

The application extracts the following information from PDF documents:

### 1. Emitter Name (Nombre del emisor)

- Extracts the company or entity name that generated the document
- Pattern: `Nombre del emisor: [COMPANY_NAME]`

### 2. Economic Activities (Actividades Económicas)

- Extracts all economic activities listed in the document
- Includes both general descriptions and specific activity codes
- Pattern: Looks for lines containing activity codes (6 digits) or specific keywords

### 3. Address (Domicilio)

- Extracts the registered address of the taxpayer
- Pattern: `Domicilio: [ADDRESS]`

## Output Format

The extracted data is returned in the following JSON format:

```json
{
  "emitterName": "GUITAL Y PARTNERS LIMITADA",
  "economicActivities": [
    "ASES.COMER.PUBLICIDAD, REPONEDORES, COMERC.FRUT, VERD, BEBIDAS DE FANTASIA",
    "463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS",
    "463020 VENTA AL POR MAYOR DE BEBIDAS ALCOHOLICAS Y NO ALCOHOLICAS",
    "692000 ACTIVIDADES DE CONTABILIDAD, TENEDURIA DE LIBROS Y AUDITORIA; CONSULTO",
    "731001 SERVICIOS DE PUBLICIDAD PRESTADOS POR EMPRESAS",
    "783000 OTRAS ACTIVIDADES DE DOTACION DE RECURSOS HUMANOS",
    "854909 OTROS TIPOS DE ENSEÑANZA N.C.P.",
    "855000 ACTIVIDADES DE APOYO A LA ENSEÑANZA"
  ],
  "address": "VITACURA 4380, Dpto. 31, VITACURA"
}
```

## API Reference

### Functions

#### `extractTaxData(pdfPath: string): Promise<ExtractedTaxData>`

Extract tax data from a PDF file.

#### `processMultiplePDFs(directoryPath: string): Promise<ProcessingResult[]>`

Process multiple PDF files in a directory.

#### `saveToJSON(data: any, outputPath: string): void`

Save extracted data to JSON file.

#### `isValidPDF(dataBuffer: Buffer): boolean`

Validate if a file is a valid PDF.

#### `hasValidExtension(filePath: string): boolean`

Validate file extension.

#### `isTaxDocument(text: string): boolean`

Check if the document appears to be a Chilean tax document.

#### `validateExtractedData(data: ExtractedTaxData): ValidationResult`

Validate extracted data completeness.
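
To make the patterns listed under Data Extraction more concrete, here is a minimal sketch of how such line-based matching could look. It is not the library's implementation: `sketchExtract` and its three regular expressions are hypothetical, and it assumes the PDF text has already been converted to a plain string with one field per line. It also only covers the 6-digit activity-code case, not the keyword-based matching mentioned above.

```javascript
// Hypothetical sketch only: NOT the library's actual implementation.
// Assumes the PDF text is already a plain string with one field per line.

const EMITTER_RE = /Nombre del emisor:\s*(.+)/i;
const ADDRESS_RE = /Domicilio:\s*(.+)/i;
// A line that starts with a 6-digit activity code followed by a description.
const ACTIVITY_RE = /^\d{6}\s+\S.*$/;

function sketchExtract(text) {
  const lines = text.split(/\r?\n/).map((line) => line.trim());

  // Return the first captured group found on any line, or null if none match.
  const firstMatch = (re) => {
    for (const line of lines) {
      const match = line.match(re);
      if (match) return match[1].trim();
    }
    return null; // same `string | null` shape as the documented result fields
  };

  return {
    emitterName: firstMatch(EMITTER_RE),
    economicActivities: lines.filter((line) => ACTIVITY_RE.test(line)),
    address: firstMatch(ADDRESS_RE),
  };
}

// Made-up sample text using values from the README's example output.
const sample = [
  'Nombre del emisor: GUITAL Y PARTNERS LIMITADA',
  'Domicilio: VITACURA 4380, Dpto. 31, VITACURA',
  '463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS',
  '731001 SERVICIOS DE PUBLICIDAD PRESTADOS POR EMPRESAS',
].join('\n');

const result = sketchExtract(sample);
console.log(result);
```

Real PDF text is rarely this tidy (fields can wrap across lines or share a line), which is likely why extraction accuracy depends on the consistency of the PDF format.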
### Types

#### `ExtractedTaxData`

```typescript
interface ExtractedTaxData {
  emitterName: string | null;
  economicActivities: string[];
  address: string | null;
}
```

#### `ProcessingResult`

```typescript
interface ProcessingResult {
  filename: string;
  data?: ExtractedTaxData;
  error?: string;
}
```

## Dependencies

- `pdf-parse`: For extracting text content from PDF files
- Built-in Node.js modules: `fs`, `path`

## Error Handling & Validation

The application includes comprehensive error handling and validation for:

### File Validation

- **File existence**: Checks if the file exists before processing
- **File extension**: Validates that the file has a `.pdf` extension
- **File size**: Ensures the file is not empty
- **PDF format**: Validates PDF structure and signatures

### Content Validation

- **PDF structure**: Verifies the file is a valid PDF document
- **Text content**: Ensures the PDF contains extractable text (not just scanned images)
- **Tax document structure**: Validates that the document appears to be a Chilean tax document
- **Data completeness**: Ensures all required fields are successfully extracted

### Error Types Handled

- File not found errors
- Invalid file extensions (.txt, .doc, etc.)
- Corrupted or invalid PDF files
- Empty files
- PDFs without extractable text
- Non-tax documents
- Incomplete data extraction
- Directory access errors

## Limitations

- The application is designed specifically for Chilean tax documents with the structure shown in the example
- PDF must be text-based (not scanned images)
- Extraction accuracy depends on the consistency of the PDF format
- The application will reject non-PDF files, corrupted PDFs, and documents that don't match the expected tax document structure

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Author

**Jorge Zapata** - [GitHub](https://github.com/Jmzp)

## Support

If you find this library useful, please consider giving it a ⭐️ on GitHub!