# pdf-tax-reader-cl
A Node.js library that extracts the emitter name, economic activities, and address from structured Chilean tax PDF documents such as the "CARPETA TRIBUTARIA ELECTRÓNICA PARA SOLICITAR CRÉDITOS".
[npm version](https://badge.fury.io/js/pdf-tax-reader-cl)
[License: MIT](https://opensource.org/licenses/MIT)
[Node.js](https://nodejs.org/)
## 🚀 Quick Start
```bash
npm install pdf-tax-reader-cl
```
```javascript
const { extractTaxData } = require('pdf-tax-reader-cl');

// Extract data from a PDF file
extractTaxData('./tax-document.pdf')
  .then(data => {
    console.log('Extracted Data:', data);
    // {
    //   emitterName: "GUITAL Y PARTNERS LIMITADA",
    //   economicActivities: [
    //     "ASES.COMER.PUBLICIDAD,REPONEDORES,COMERC.FRUT,VERD,BEBIDAS DE FANTASIA",
    //     "463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS"
    //   ],
    //   address: "VITACURA 4380 , Dpto. 31 , VITACURA"
    // }
  })
  .catch(error => {
    console.error('Error:', error.message);
  });
```
## Features
- Extract emitter name from PDF documents
- Extract economic activities list
- Extract address information
- Process single PDF files or entire directories
- Save extracted data to JSON format
- Comprehensive error handling and logging
## Installation
```bash
npm install pdf-tax-reader-cl
```
### Requirements
- Node.js >= 14.0.0
- PDF files must be text-based (not scanned images)
- PDFs must follow the Chilean tax document structure
## Usage
### Single PDF Processing
```javascript
const { extractTaxData } = require('pdf-tax-reader-cl');

// Process a single PDF file
extractTaxData('./documents/tax-document.pdf')
  .then(data => {
    console.log('Extracted Data:', data);
  })
  .catch(error => {
    console.error('Error:', error.message);
  });
```
### Multiple PDF Processing
```javascript
const { processMultiplePDFs } = require('pdf-tax-reader-cl');

// Process all PDF files in a directory
processMultiplePDFs('./documents')
  .then(results => {
    console.log('Processing completed:', results);
    // [
    //   {
    //     filename: "document1.pdf",
    //     data: { emitterName: "...", economicActivities: [...], address: "..." }
    //   },
    //   {
    //     filename: "document2.pdf",
    //     error: "Invalid PDF format"
    //   }
    // ]
  })
  .catch(error => {
    console.error('Error:', error);
  });
```
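Because each entry carries either `data` or `error`, the results array can be split into successes and failures before further processing. The following is a minimal sketch that assumes only the result shape shown in the comment above; `partitionResults` and the sample entries are hypothetical and not part of the library.

```javascript
// Hypothetical helper (not part of pdf-tax-reader-cl): split results
// into successful extractions and failures.
function partitionResults(results) {
  const succeeded = [];
  const failed = [];
  for (const result of results) {
    // An entry with an `error` string is treated as a failure
    if (typeof result.error === 'string') {
      failed.push({ filename: result.filename, error: result.error });
    } else {
      succeeded.push({ filename: result.filename, data: result.data });
    }
  }
  return { succeeded, failed };
}

// Made-up sample input mirroring the documented result shape
const sampleResults = [
  {
    filename: 'document1.pdf',
    data: { emitterName: 'EXAMPLE LTDA', economicActivities: [], address: null },
  },
  { filename: 'document2.pdf', error: 'Invalid PDF format' },
];

const { succeeded, failed } = partitionResults(sampleResults);
console.log(`${succeeded.length} succeeded, ${failed.length} failed`);
```

Keeping the filename on both sides makes it straightforward to report which documents need manual review.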
### TypeScript Support
```typescript
import { extractTaxData, ExtractedTaxData } from 'pdf-tax-reader-cl';

// Extract data with TypeScript types
extractTaxData('./path/to/document.pdf')
  .then((data: ExtractedTaxData) => {
    console.log('Extracted Data:', data);
    // data.emitterName is string | null
    // data.economicActivities is string[]
    // data.address is string | null
  })
  .catch(error => {
    console.error('Error:', error);
  });
```
## Testing
If you're developing or contributing to this library:
```bash
# Clone the repository
git clone https://github.com/Jmzp/pdf-tax-reader-cl.git
cd pdf-tax-reader-cl
# Install dependencies
npm install
# Run tests
npm test
```
The test suite includes:
- Mock data validation
- Single PDF processing test
- Multiple PDF processing test
## Error Handling Examples
The library reports a specific error message for each kind of invalid file:
```javascript
const { extractTaxData } = require('pdf-tax-reader-cl');

// In CommonJS modules, `await` must run inside an async function
async function main() {
  try {
    await extractTaxData('./invalid-file.txt');
  } catch (error) {
    console.log('Error:', error.message);
    // Output: "Invalid file extension. Expected .pdf, got: txt"
  }

  try {
    await extractTaxData('./corrupted-file.pdf');
  } catch (error) {
    console.log('Error:', error.message);
    // Output: "Invalid PDF format. File does not appear to be a valid PDF document."
  }

  try {
    await extractTaxData('./non-tax-document.pdf');
  } catch (error) {
    console.log('Error:', error.message);
    // Output: "Document does not appear to be a Chilean tax document. Missing expected tax document structure."
  }
}

main();
```
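When callers need to branch on the kind of failure, one option is to map the message prefixes to short codes. This is only a sketch: `classifyError` and the code names are hypothetical, the prefixes are copied from the messages shown above, and matching on message text is brittle if the wording changes between versions.

```javascript
// Hypothetical helper: map the documented error messages to short codes.
// Messages that match none of the prefixes fall through to 'UNKNOWN'.
const ERROR_PREFIXES = [
  ['Invalid file extension', 'BAD_EXTENSION'],
  ['Invalid PDF format', 'BAD_PDF'],
  ['Document does not appear to be a Chilean tax document', 'NOT_TAX_DOCUMENT'],
];

function classifyError(error) {
  const message = error && error.message ? error.message : '';
  const match = ERROR_PREFIXES.find(([prefix]) => message.startsWith(prefix));
  return match ? match[1] : 'UNKNOWN';
}

const exampleCode = classifyError(
  new Error('Invalid file extension. Expected .pdf, got: txt')
);
console.log(exampleCode);
```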
## Data Extraction
The library extracts the following information from PDF documents:
### 1. Emitter Name (Nombre del emisor)
- Extracts the company or entity name that generated the document
- Pattern: `Nombre del emisor: [COMPANY_NAME]`
### 2. Economic Activities (Actividades Económicas)
- Extracts all economic activities listed in the document
- Includes both general descriptions and specific activity codes
- Pattern: lines that contain a 6-digit activity code or specific keywords
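
The code-based part of this pattern can be illustrated with a small filter. This is a sketch under stated assumptions, not the library's implementation: it assumes the 6-digit code sits at the start of a line, it does not reproduce the keyword matching, and `findActivityLines` is a hypothetical name.

```javascript
// Hypothetical sketch: keep lines that start with a 6-digit activity code
// followed by whitespace and a description.
const ACTIVITY_CODE_LINE = /^\d{6}\s+\S/;

function findActivityLines(text) {
  return text
    .split(/\r?\n/)
    .map(line => line.trim())
    .filter(line => ACTIVITY_CODE_LINE.test(line));
}

// Sample text assembled from values shown elsewhere in this README
const sampleText = [
  'Actividades Económicas',
  '463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS',
  '731001 SERVICIOS DE PUBLICIDAD PRESTADOS POR EMPRESAS',
  'Domicilio: VITACURA 4380, Dpto. 31, VITACURA',
].join('\n');

console.log(findActivityLines(sampleText));
```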
### 3. Address (Domicilio)
- Extracts the registered address of the taxpayer
- Pattern: `Domicilio: [ADDRESS]`
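
The two labeled patterns (`Nombre del emisor:` and `Domicilio:`) share the same shape, so a single helper can illustrate both. A minimal sketch, assuming the value runs to the end of the line that holds the label; `readLabeledValue` is a hypothetical name and the library's real parsing may differ.

```javascript
// Hypothetical sketch: read the value that follows a "Label:" prefix.
// Returns null when the label is absent, mirroring the `string | null` fields.
function readLabeledValue(text, label) {
  // Escape regex metacharacters in the label before building the pattern
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const match = text.match(new RegExp(`${escaped}:[ \\t]*(.+)`));
  return match ? match[1].trim() : null;
}

// Sample text assembled from values shown elsewhere in this README
const sampleText = [
  'Nombre del emisor: GUITAL Y PARTNERS LIMITADA',
  'Domicilio: VITACURA 4380, Dpto. 31, VITACURA',
].join('\n');

console.log(readLabeledValue(sampleText, 'Nombre del emisor'));
console.log(readLabeledValue(sampleText, 'Domicilio'));
```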
## Output Format
The extracted data is returned in the following JSON format:
```json
{
  "emitterName": "GUITAL Y PARTNERS LIMITADA",
  "economicActivities": [
    "ASES.COMER.PUBLICIDAD, REPONEDORES, COMERC.FRUT, VERD, BEBIDAS DE FANTASIA",
    "463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS",
    "463020 VENTA AL POR MAYOR DE BEBIDAS ALCOHOLICAS Y NO ALCOHOLICAS",
    "692000 ACTIVIDADES DE CONTABILIDAD, TENEDURIA DE LIBROS Y AUDITORIA; CONSULTO",
    "731001 SERVICIOS DE PUBLICIDAD PRESTADOS POR EMPRESAS",
    "783000 OTRAS ACTIVIDADES DE DOTACION DE RECURSOS HUMANOS",
    "854909 OTROS TIPOS DE ENSEÑANZA N.C.P.",
    "855000 ACTIVIDADES DE APOYO A LA ENSEÑANZA"
  ],
  "address": "VITACURA 4380, Dpto. 31, VITACURA"
}
```
## API Reference
### Functions
#### `extractTaxData(pdfPath: string): Promise<ExtractedTaxData>`
Extract tax data from a PDF file.
#### `processMultiplePDFs(directoryPath: string): Promise<ProcessingResult[]>`
Process multiple PDF files in a directory.
#### `saveToJSON(data: any, outputPath: string): void`
Save extracted data to a JSON file.
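
For readers who want to see what such a helper typically does, here is a stand-in built on Node's `fs` module. It is a sketch, not the library's code: `writeJsonFile` is a hypothetical name, and the 2-space indentation and synchronous write are assumptions (the latter suggested by the documented `void` return type).

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical stand-in for saveToJSON: serialize with 2-space indentation
// and write the file synchronously.
function writeJsonFile(data, outputPath) {
  fs.writeFileSync(outputPath, JSON.stringify(data, null, 2), 'utf8');
}

// Write to the OS temp directory so the example leaves the project untouched
const outputPath = path.join(os.tmpdir(), 'tax-data-example.json');
writeJsonFile(
  { emitterName: 'GUITAL Y PARTNERS LIMITADA', economicActivities: [], address: null },
  outputPath
);

// Read the file back to confirm the round trip
const roundTrip = JSON.parse(fs.readFileSync(outputPath, 'utf8'));
console.log(roundTrip.emitterName);
```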
#### `isValidPDF(dataBuffer: Buffer): boolean`
Check whether a file's contents, passed as a buffer, form a valid PDF.
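
A common way to perform this kind of check is to look for the `%PDF-` signature at the start of the buffer. The sketch below shows only that signature test; `startsWithPdfSignature` is a hypothetical name, and the library's own validation may inspect more than the header.

```javascript
// Hypothetical sketch: a PDF file begins with the ASCII bytes "%PDF-".
function startsWithPdfSignature(dataBuffer) {
  if (!Buffer.isBuffer(dataBuffer) || dataBuffer.length < 5) {
    return false;
  }
  // Compare only the first five bytes against the signature
  return dataBuffer.subarray(0, 5).toString('latin1') === '%PDF-';
}

const pdfLike = startsWithPdfSignature(Buffer.from('%PDF-1.7\n'));
const plainText = startsWithPdfSignature(Buffer.from('plain text'));
console.log(pdfLike, plainText);
```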
#### `hasValidExtension(filePath: string): boolean`
Validate file extension.
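
An extension check of this kind can be written with `path.extname`. This is an illustrative sketch: `hasPdfExtension` is a hypothetical name, and treating the extension as case-insensitive is an assumption rather than documented behavior.

```javascript
const path = require('path');

// Hypothetical sketch: accept only paths whose extension is ".pdf".
// Case-insensitive matching is an assumption.
function hasPdfExtension(filePath) {
  return path.extname(filePath).toLowerCase() === '.pdf';
}

const checks = [
  hasPdfExtension('./documents/tax-document.pdf'),
  hasPdfExtension('./invalid-file.txt'),
];
console.log(checks);
```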
#### `isTaxDocument(text: string): boolean`
Check if the document appears to be a Chilean tax document.
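
The markers the library actually looks for are not documented here. As an illustration only, the sketch below assumes the two labels named in the Data Extraction section must both appear in the text; `looksLikeTaxDocument` and `REQUIRED_MARKERS` are hypothetical names.

```javascript
// Assumed markers, taken from the labels in the "Data Extraction" section.
// The real check may use different or additional markers.
const REQUIRED_MARKERS = ['nombre del emisor', 'domicilio'];

function looksLikeTaxDocument(text) {
  const haystack = String(text).toLowerCase();
  // Every marker must appear somewhere in the extracted text
  return REQUIRED_MARKERS.every(marker => haystack.includes(marker));
}

const taxLike = looksLikeTaxDocument(
  'Nombre del emisor: GUITAL Y PARTNERS LIMITADA\nDomicilio: VITACURA 4380'
);
const unrelated = looksLikeTaxDocument('Quarterly sales report');
console.log(taxLike, unrelated);
```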
#### `validateExtractedData(data: ExtractedTaxData): ValidationResult`
Validate extracted data completeness.
### Types
#### `ExtractedTaxData`
```typescript
interface ExtractedTaxData {
  emitterName: string | null;
  economicActivities: string[];
  address: string | null;
}
```
#### `ProcessingResult`
```typescript
interface ProcessingResult {
  filename: string;
  data?: ExtractedTaxData;
  error?: string;
}
```
## Dependencies
- `pdf-parse`: For extracting text content from PDF files
- Built-in Node.js modules: `fs`, `path`
## Error Handling & Validation
The library validates both the input file and its content before returning data:
### File Validation
- **File existence**: Checks if the file exists before processing
- **File extension**: Validates that the file has a `.pdf` extension
- **File size**: Ensures the file is not empty
- **PDF format**: Validates PDF structure and signatures
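
The existence and size checks can be illustrated with `fs.statSync`. This is a sketch, not the library's code: `assertReadableNonEmptyFile` is a hypothetical name and the error messages are illustrative, not the library's wording.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical pre-flight check: the file must exist, be a regular file,
// and contain at least one byte.
function assertReadableNonEmptyFile(filePath) {
  let stats;
  try {
    stats = fs.statSync(filePath);
  } catch (err) {
    if (err.code === 'ENOENT') {
      throw new Error(`File not found: ${filePath}`);
    }
    throw err; // permission problems and other errors pass through unchanged
  }
  if (!stats.isFile()) {
    throw new Error(`Not a regular file: ${filePath}`);
  }
  if (stats.size === 0) {
    throw new Error(`File is empty: ${filePath}`);
  }
}

// Demonstrate the empty-file case with a temporary file
const emptyFile = path.join(os.tmpdir(), 'empty-example.pdf');
fs.writeFileSync(emptyFile, '');

let message = null;
try {
  assertReadableNonEmptyFile(emptyFile);
} catch (err) {
  message = err.message;
}
console.log(message);
```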
### Content Validation
- **PDF structure**: Verifies the file is a valid PDF document
- **Text content**: Ensures the PDF contains extractable text (not just scanned images)
- **Tax document structure**: Validates that the document appears to be a Chilean tax document
- **Data completeness**: Ensures all required fields are successfully extracted
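
A completeness check over the three extracted fields might look like the sketch below. The result shape `{ isValid, missingFields }` is an assumption, since the `ValidationResult` type is not shown in this document, and `checkCompleteness` is a hypothetical name.

```javascript
// Hypothetical result shape: { isValid: boolean, missingFields: string[] }.
function checkCompleteness(data) {
  const missingFields = [];
  if (!data.emitterName) {
    missingFields.push('emitterName');
  }
  if (!Array.isArray(data.economicActivities) || data.economicActivities.length === 0) {
    missingFields.push('economicActivities');
  }
  if (!data.address) {
    missingFields.push('address');
  }
  return { isValid: missingFields.length === 0, missingFields };
}

// A record with no address is reported as incomplete
const partial = checkCompleteness({
  emitterName: 'GUITAL Y PARTNERS LIMITADA',
  economicActivities: ['463011 VENTA AL POR MAYOR DE FRUTAS Y VERDURAS'],
  address: null,
});
console.log(partial);
```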
### Error Types Handled
- File not found errors
- Invalid file extensions (.txt, .doc, etc.)
- Corrupted or invalid PDF files
- Empty files
- PDFs without extractable text
- Non-tax documents
- Incomplete data extraction
- Directory access errors
## Limitations
- The library is designed specifically for Chilean tax documents that follow the structure of the "CARPETA TRIBUTARIA ELECTRÓNICA PARA SOLICITAR CRÉDITOS"
- PDFs must be text-based (not scanned images)
- Extraction accuracy depends on the consistency of the PDF format
- The library rejects non-PDF files, corrupted PDFs, and documents that don't match the expected tax document structure
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Author
**Jorge Zapata** - [GitHub](https://github.com/Jmzp)
## Support
If you find this library useful, please consider giving it a ⭐️ on GitHub!