@pdftron/data-extraction

The Apryse SDK Data-Extraction Module.

www.apryse.com

## @pdftron/data-extraction

This package is meant to be used in conjunction with `@pdftron/pdfnet-node` to support IDP data extraction from Apryse.

Follow this guide for more information on usage:
https://docs.apryse.com/documentation/core/guides/intelligent-data-extraction/

For further reading, check out our blog post on the project:
https://apryse.com/blog/introducing-automated-data-extraction-pdf-idp

#### Supported platform, Node.js, and Electron versions

This package depends on unmanaged add-on binaries, and the add-on binaries are not cross-platform. At the moment we support:

* **OS**: Linux (excluding Alpine), Windows (x64)
* **Node.js version**: 8 - 23
* **Electron version**: 6 - 30

Installation will fail if your OS, Node.js, or Electron version is not supported.

#### Usage

Add the `@pdftron/data-extraction` package as a dependency in your `package.json`.

Inside your `@pdftron/pdfnet-node` code, after initialization, include the following line:

```javascript
await PDFNet.addResourceSearchPath("./node_modules/@pdftron/data-extraction/lib")
```

Here is an example of data extraction that uses this line.
```javascript
const fs = require('fs'); // needed for fs.writeFileSync below
const { PDFNet } = require('@pdftron/pdfnet-node');

const licenseKey = "Insert license key here"
const inputFile = "Insert input file location here"

async function main() {
  // This is where we import data-extraction
  await PDFNet.addResourceSearchPath("./node_modules/@pdftron/data-extraction/lib")

  // Extract document structure as a JSON file
  // (the out/ directory is assumed to exist already)
  console.log('Extract document structure as a JSON file');

  let outputFile = 'out/paragraphs_and_tables.json';
  await PDFNet.DataExtractionModule.extractData(inputFile, outputFile, PDFNet.DataExtractionModule.DataExtractionEngine.e_DocStructure);

  console.log('Result saved in ' + outputFile);

  ///////////////////////////////////////////////////////
  // Extract document structure as a JSON string
  console.log('Extract document structure as a JSON string');

  outputFile = 'out/tagged.json';
  const json = await PDFNet.DataExtractionModule.extractDataAsString(inputFile, PDFNet.DataExtractionModule.DataExtractionEngine.e_DocStructure);
  fs.writeFileSync(outputFile, json);
}

PDFNet.runWithCleanup(main, licenseKey).catch(function (error) {
  console.log('Error: ' + JSON.stringify(error));
}).then(function () { return PDFNet.shutdown(); });
```

A larger code sample can be found [here](https://docs.apryse.com/documentation/samples/node/js/DataExtractionTest/).

To get started, please see the documentation at https://www.pdftron.com/documentation/nodejs/get-started/integration.

#### Licensing

Please go to https://docs.apryse.com/documentation/core/info/license/ to obtain a demo or production license.
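
#### Checking the runtime before installation

Because installation fails on an unsupported OS, Node.js, or Electron version, it can help to check the environment up front. The following is a minimal sketch and not part of this package: `checkRuntime` and `currentRuntime` are hypothetical helper names, the version ranges are copied from the supported-versions list in this README, and detecting Alpine through the `/etc/alpine-release` file is an assumption about the host, not something the installer is documented to do.

```javascript
const fs = require('fs');

// Ranges copied from the supported-versions list in this README.
const SUPPORTED_NODE = { min: 8, max: 23 };
const SUPPORTED_ELECTRON = { min: 6, max: 30 };

// Extract the major version number from strings such as "20.11.0" or "v20.11.0".
function majorOf(version) {
  return parseInt(String(version).replace(/^v/, '').split('.')[0], 10);
}

function inRange(value, range) {
  return value >= range.min && value <= range.max;
}

// Hypothetical helper: returns a list of problems, empty when the runtime looks supported.
function checkRuntime({ platform, arch, nodeVersion, electronVersion, isAlpine }) {
  const problems = [];

  if (platform === 'linux') {
    if (isAlpine) problems.push('Alpine Linux is not supported');
  } else if (platform === 'win32') {
    if (arch !== 'x64') problems.push('Windows is supported on x64 only');
  } else {
    problems.push('Unsupported OS: ' + platform);
  }

  const nodeMajor = majorOf(nodeVersion);
  if (!inRange(nodeMajor, SUPPORTED_NODE)) {
    problems.push('Node.js ' + nodeMajor + ' is outside the supported range 8 - 23');
  }

  // Only checked when running under Electron.
  if (electronVersion !== undefined) {
    const electronMajor = majorOf(electronVersion);
    if (!inRange(electronMajor, SUPPORTED_ELECTRON)) {
      problems.push('Electron ' + electronMajor + ' is outside the supported range 6 - 30');
    }
  }

  return problems;
}

// Collect the facts for the current process.
function currentRuntime() {
  return {
    platform: process.platform,
    arch: process.arch,
    nodeVersion: process.versions.node,
    electronVersion: process.versions.electron, // undefined outside Electron
    isAlpine: fs.existsSync('/etc/alpine-release'), // assumption: marker file present on Alpine
  };
}

const problems = checkRuntime(currentRuntime());
console.log(problems.length === 0 ? 'Runtime looks supported' : problems.join('\n'));
```

Keeping `checkRuntime` free of any direct `process` access makes it easy to test against other environments, for example a CI image that differs from the development machine.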