@stevemao/pdf-extraction

# pdf-extraction

**Pure JavaScript cross-platform module to extract text from PDFs.**

[![version](https://img.shields.io/npm/v/pdf-extraction.svg)](https://www.npmjs.org/package/pdf-extraction)
[![downloads](https://img.shields.io/npm/dt/pdf-extraction.svg)](https://www.npmjs.org/package/pdf-extraction)
[![node](https://img.shields.io/node/v/pdf-extraction.svg)](https://nodejs.org/)
[![status](https://gitlab.com/fwiwDev/pdf-extraction/badges/master/pipeline.svg)](https://gitlab.com/fwiwDev/pdf-extraction/pipelines)

## Installation

`npm install pdf-extraction`

## Basic Usage - Local Files

```js
const fs = require("fs");
const pdf = require("pdf-extraction");

let dataBuffer = fs.readFileSync("path to PDF file...");

pdf(dataBuffer).then(function (data) {
  // number of pages
  console.log(data.numpages);
  // number of rendered pages
  console.log(data.numrender);
  // PDF info
  console.log(data.info);
  // PDF metadata
  console.log(data.metadata);
  // PDF.js version
  // check https://mozilla.github.io/pdf.js/getting_started/
  console.log(data.version);
  // PDF text
  console.log(data.text);
});
```

## Basic Usage - HTTP

You can use [crawler-request](https://www.npmjs.com/package/crawler-request), which uses `pdf-extraction`.

## Exception Handling

```js
const fs = require("fs");
const pdf = require("pdf-extraction");

let dataBuffer = fs.readFileSync("path to PDF file...");

pdf(dataBuffer)
  .then(function (data) {
    // use data
  })
  .catch(function (error) {
    // handle exceptions
  });
```

## Extend

- v1.0.9 and above introduce a breaking change to the `pagerender` callback; see the [changelog](https://gitlab.com/fwiwDev/pdf-extraction/blob/master/CHANGELOG)
- If you need another format, such as JSON, you can change the page render behaviour with a callback
- Check out https://mozilla.github.io/pdf.js/

```js
const fs = require("fs");
const pdf = require("pdf-extraction");

// default render callback
function render_page(pageData) {
  // check documents https://mozilla.github.io/pdf.js/
  let render_options = {
    // replaces all occurrences of whitespace with standard spaces (0x20). The default value is `false`.
    normalizeWhitespace: false,
    // do not attempt to combine same line TextItem's. The default value is `false`.
    disableCombineTextItems: false,
  };

  return pageData.getTextContent(render_options).then(function (textContent) {
    let lastY,
      text = "";
    for (let item of textContent.items) {
      // transform[5] is the item's y position; a change in y starts a new line
      if (lastY == item.transform[5] || !lastY) {
        text += item.str;
      } else {
        text += "\n" + item.str;
      }
      lastY = item.transform[5];
    }
    return text;
  });
}

let options = {
  pagerender: render_page,
};

let dataBuffer = fs.readFileSync("path to PDF file...");

pdf(dataBuffer, options).then(function (data) {
  // use new format
});
```

## Options

```js
const DEFAULT_OPTIONS = {
  // internal page parser callback
  // you can set this option if you need a format other than raw text
  pagerender: render_page,
  // max page number to parse
  max: 0,
  // check https://mozilla.github.io/pdf.js/getting_started/
  version: "v1.10.100",
};
```

### _pagerender_ (callback)

Set this callback if you need an output format other than raw text.

### _max_ (number)

Maximum number of pages to parse. If the value is less than or equal to 0, the parser renders all pages.

### _version_ (string, pdf.js version)

Check [pdf.js](https://mozilla.github.io/pdf.js/getting_started/) for the available versions.

- `'default'`
- `'v1.9.426'`
- `'v1.10.100'`
- `'v1.10.88'`
- `'v2.0.550'`

> _default_ uses version _v1.10.100_
> [mozilla.github.io/pdf.js](https://mozilla.github.io/pdf.js/getting_started/#download)

## Test

- `mocha` or `npm test`
- Check the [test folder](https://gitlab.com/fwiwDev/pdf-extraction/tree/master/test) and [quickstart.js](https://gitlab.com/fwiwDev/pdf-extraction/blob/master/quickstart.js) for more usage examples.

## Support

I use this package actively myself, so it has my top priority. You can chat on WhatsApp about any information, ideas and suggestions.

### Submitting an Issue

If you find a bug or a mistake, you can help by submitting an issue to the [GitLab Repository](https://gitlab.com/fwiwDev/pdf-extraction/issues).

### Creating a Merge Request

GitLab calls it a merge request instead of a pull request.

- [A Guide for First-Timers](https://about.gitlab.com/2016/06/16/fearless-contribution-a-guide-for-first-timers/)
- [How to create a merge request](https://docs.gitlab.com/ee/gitlab-basics/add-merge-request.html)
- Check the [Contributing Guide](https://gitlab.com/fwiwDev/pdf-extraction/blob/master/CONTRIBUTING.md)

## License

[MIT licensed](https://gitlab.com/fwiwDev/pdf-extraction/blob/master/LICENSE) and all its dependencies are MIT or BSD licensed.
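
## Example: JSON output with pagerender

The Extend section mentions that a `pagerender` callback can produce another format such as JSON. The following is a minimal sketch of one way that might look, not the library's own implementation. The names `groupItemsIntoLines` and `renderPageAsJson` are hypothetical, and the sketch assumes that, like the default callback, a custom callback should resolve to a string for each page.

```js
// Hypothetical helper (not part of pdf-extraction): groups PDF.js text items
// into lines by their y coordinate, which is stored in transform[5].
function groupItemsIntoLines(items) {
  const lines = [];
  let lastY;
  for (const item of items) {
    const y = item.transform[5];
    if (lines.length === 0 || y !== lastY) {
      // a different y position starts a new line
      lines.push({ y: y, text: item.str });
    } else {
      // same y position: append to the current line
      lines[lines.length - 1].text += item.str;
    }
    lastY = y;
  }
  return lines;
}

// Hypothetical pagerender callback: resolves to a JSON string for one page.
// Assumption: the callback's result is expected to be a string, as in the
// default render callback.
function renderPageAsJson(pageData) {
  const renderOptions = {
    normalizeWhitespace: false,
    disableCombineTextItems: false,
  };
  return pageData.getTextContent(renderOptions).then(function (textContent) {
    return JSON.stringify(groupItemsIntoLines(textContent.items));
  });
}

// Options object to pass as the second argument of pdf(dataBuffer, options).
const jsonOptions = { pagerender: renderPageAsJson };
```

Passing `jsonOptions` as the second argument to `pdf(dataBuffer, jsonOptions)` should then yield per-page JSON strings instead of plain text. Keeping the grouping logic in a separate function makes it possible to test it without a PDF file.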