smart-docs-parser
Document Details Parsing using OCR
```smart-docs-parser``` is a Node.js library that parses details from images of ID documents.
https://medium.com/urbanclap-engineering/document-details-parsing-using-ocr-170bf6ad8a97
```smart-docs-parser``` works in three steps:
- Extraction of raw text from document image using OCR
- Validation of document image based on passed document type and extracted raw text
- Parsing relevant information from raw text using document parser
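
The three steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the library's internal code: the function names (`extractRawText`, `isValidPanCard`, `parsePanCard`, `runPipeline`), the stubbed OCR output, and the validation rule are all hypothetical.

```javascript
// Hypothetical sketch of the three steps; not the library's actual implementation.

// Step 1: OCR. Stubbed with fixed lines instead of a real OCR call.
const extractRawText = (documentUrl) => [
  'INCOME TAX DEPARTMENT',
  'GOVT. OF INDIA',
  'Permanent Account Number Card',
  'PANAM8144G',
  'ID NAME',
  '13/02/1994'
];

// A PAN consists of five letters, four digits, and one letter.
const PAN_PATTERN = /^[A-Z]{5}[0-9]{4}[A-Z]$/;

// Step 2: validation. Assumed rule: the text names the issuing department
// and contains a PAN-shaped line.
const isValidPanCard = (rawText) =>
  rawText.some((line) => line.includes('INCOME TAX DEPARTMENT')) &&
  rawText.some((line) => PAN_PATTERN.test(line));

// Step 3: parsing. Picks the PAN-shaped line and converts DD/MM/YYYY to ISO 8601.
const parsePanCard = (rawText) => {
  const identificationNumber = rawText.find((line) => PAN_PATTERN.test(line));
  const dateLine = rawText.find((line) => /^\d{2}\/\d{2}\/\d{4}$/.test(line));
  let dateOfBirth;
  if (dateLine) {
    const [day, month, year] = dateLine.split('/').map(Number);
    dateOfBirth = new Date(Date.UTC(year, month - 1, day)).toISOString();
  }
  return {
    document_type: 'PAN_CARD',
    identification_number: identificationNumber,
    date_of_birth: dateOfBirth
  };
};

// Chains the three steps and returns an object shaped like the library's response.
const runPipeline = (documentUrl) => {
  const rawText = extractRawText(documentUrl);
  const valid = isValidPanCard(rawText);
  return {
    raw_text: rawText,
    is_document_valid: valid,
    document_details: valid ? parsePanCard(rawText) : {}
  };
};

console.log(runPipeline('https://example.com/pan.png'));
```

Keeping validation separate from parsing means an image of the wrong document type can be rejected before any field extraction is attempted.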
```
$ npm install smart-docs-parser
```
Create a _config_ folder at the root of your project and add a _default.json_ file to it with the following content:
```json
{
  "smart-docs-parser": {
    "api_keys": {
      "google-vision": "YOUR_API_KEY"
    }
  }
}
```
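
As a rough illustration of how a key in this structure could be looked up per OCR library, the helper below reads it from a plain object. The name `getApiKey` is hypothetical, and the way the library itself loads the file is not shown here.

```javascript
// Hypothetical helper: looks up the API key for an OCR library in a config
// object shaped like config/default.json.
const getApiKey = (config, ocrLibrary) => {
  const section = config['smart-docs-parser'] || {};
  const apiKeys = section.api_keys || {};
  const key = apiKeys[ocrLibrary];
  if (typeof key !== 'string' || key.length === 0) {
    throw new Error(`Missing API key for OCR library "${ocrLibrary}"`);
  }
  return key;
};

const sampleConfig = {
  'smart-docs-parser': { api_keys: { 'google-vision': 'YOUR_API_KEY' } }
};
console.log(getApiKey(sampleConfig, 'google-vision'));
```

Failing early with a clear message is one option for catching a missing or misspelled key before any OCR request is made.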
```Javascript
// ES6 import statement
import SmartDocuments from 'smart-docs-parser';

// Sample request, wrapped in an async function so that `await` is valid
const run = async () => {
  const extractedDocumentDetails = await SmartDocuments.extractDocumentDetailsFromImage({
    document_url: 'https://avatars2.githubusercontent.com/u/20634933?s=40&v=4',
    document_type: 'PAN_CARD',
    ocr_library: 'google-vision'
  });
  console.log(extractedDocumentDetails);
};
run();

// Sample response (raw_text is shown exactly as returned by the OCR, noise included)
// {
//   raw_text: [
//     'INCOME TAX DEPARTMENT',
//     'GOVT. OF INDIA',
//     'Permanent Account Number Card',
//     'PANAM8144G',
//     '/Name',
//     'ID NAME',
//     'frar TT /Father\'s Name',
//     'FATHER NAME',
//     'ae of Birth',
//     '13/02/1994',
//     'SIGN',
//     'at / Signature',
//     ''
//   ],
//   is_document_valid: true,
//   document_details: {
//     document_type: 'PAN_CARD',
//     identification_number: 'PANAM8144G',
//     name: 'ID NAME',
//     date_of_birth: '1994-02-13T00:00:00.000Z',
//     fathers_name: 'FATHER NAME'
//   }
// }
```
```TypeScript
export interface ExtractDocumentDetailsFromImageRequest {
  document_url: string;
  document_type: string;
  ocr_library: string;
  custom_parser?: object; // Only for custom parsers
  custom_ocr?: object; // Only for custom OCRs
  timeout?: number; // Optional request timeout, defaults to 30 secs
}
```
```TypeScript
export interface ExtractDocumentDetailsFromImageResponse {
  raw_text: Array<string>;
  is_document_valid: boolean;
  document_details: DocumentDetails | object;
}

interface DocumentDetails {
  document_type?: string;
  identification_number?: string;
  name?: string;
  fathers_name?: string;
  date_of_birth?: string;
  gender?: 'M' | 'F';
  address?: string;
}
```
* **raw_text** is the text extracted by the OCR.
* **is_document_valid** denotes whether the document is valid, based on the input *document_type* and the extracted *raw_text*.
* **document_details** is the document information parsed by the specific document parser.
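
Since every field of `document_details` is optional, calling code may want to check `is_document_valid` before reading any of them. A minimal sketch, assuming a response shaped like the interface above; `getValidDetails` is a hypothetical helper name, not part of the library:

```javascript
// Hypothetical helper: returns the parsed fields only when the document
// passed validation, otherwise null.
const getValidDetails = (response) => {
  if (!response || response.is_document_valid !== true) {
    return null;
  }
  return response.document_details;
};

// Response object shaped like the sample response in this README.
const sampleResponse = {
  raw_text: ['INCOME TAX DEPARTMENT', 'PANAM8144G'],
  is_document_valid: true,
  document_details: {
    document_type: 'PAN_CARD',
    identification_number: 'PANAM8144G'
  }
};

const details = getValidDetails(sampleResponse);
// Optional fields may be absent, so fall back to a placeholder when reading them.
const displayName = (details && details.name) || 'UNKNOWN';
console.log(details.identification_number, displayName);
```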
Supported document types:
* PAN CARD
``` Javascript
document_type: 'PAN_CARD'
```
* AADHAAR CARD
``` Javascript
document_type: 'AADHAAR_CARD'
```
Supported OCR libraries:
* Google Vision
``` Javascript
ocr_library: 'google-vision'
```
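
Because `document_type` and `ocr_library` are plain strings, a typo only surfaces at run time. A caller could guard against that with a small check like the sketch below. `SUPPORTED_DOCUMENT_TYPES`, `SUPPORTED_OCR_LIBRARIES`, and `validateRequest` are hypothetical names, the lists contain only the values mentioned above, and the sketch deliberately ignores `custom_parser` and `custom_ocr`.

```javascript
// Values taken from the lists above; the constant names are hypothetical.
const SUPPORTED_DOCUMENT_TYPES = ['PAN_CARD', 'AADHAAR_CARD'];
const SUPPORTED_OCR_LIBRARIES = ['google-vision'];

// Returns a list of problems found in the request; an empty list means it looks fine.
const validateRequest = (request) => {
  const errors = [];
  if (typeof request.document_url !== 'string' || request.document_url.length === 0) {
    errors.push('document_url must be a non-empty string');
  }
  if (!SUPPORTED_DOCUMENT_TYPES.includes(request.document_type)) {
    errors.push(`unsupported document_type: ${request.document_type}`);
  }
  if (!SUPPORTED_OCR_LIBRARIES.includes(request.ocr_library)) {
    errors.push(`unsupported ocr_library: ${request.ocr_library}`);
  }
  return errors;
};

console.log(validateRequest({
  document_url: 'https://example.com/id.png',
  document_type: 'PAN_CARD',
  ocr_library: 'google-vision'
}));
```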
The library can parse the state name and PIN code from an address, but its accuracy for parsing the complete address text is not up to the mark because of the noise introduced by multilingual text.
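
To illustrate why a PIN code is comparatively easy to pick out of noisy text, here is a minimal sketch that looks for a standalone six-digit number whose first digit is not zero. `extractPinCode` is a hypothetical helper, not the library's parser, and the sample address is invented.

```javascript
// Hypothetical helper: finds the first standalone six-digit number that does
// not start with zero. Longer digit runs (such as phone numbers) are skipped.
const extractPinCode = (text) => {
  const match = text.match(/(?<!\d)[1-9][0-9]{5}(?!\d)/);
  return match ? match[0] : null;
};

console.log(extractPinCode('Sector 44, Gurgaon, Haryana - 122003'));
```

A fixed-width numeric pattern survives OCR noise far better than free-form street and locality names, which is one plausible reason full address parsing is harder.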
Contributions are welcome. Please create a pull request if you want to add more document parsers, OCR libraries, or test support, or to enhance the existing code.
* [Parsing more documents with a custom parser](https://github.com/urbanclap-engg/smart-docs-parser/blob/master/docs/custom_parser.md)
* [Adding more OCR libraries with a custom OCR](https://github.com/urbanclap-engg/smart-docs-parser/blob/master/docs/custom_ocr.md)
* [Parsing more documents](https://github.com/urbanclap-engg/smart-docs-parser/blob/master/docs/document_parser.md)
* [Adding more OCR libraries](https://github.com/urbanclap-engg/smart-docs-parser/blob/master/docs/ocr_library.md)
See the [LICENSE](https://github.com/urbanclap-engg/smart-docs-parser/blob/master/LICENSE) file for the license terms of ```smart-docs-parser```.