# pdf-parse

**A pure JavaScript cross-platform module to extract text from PDFs.**

[![version](https://img.shields.io/npm/v/pdf-parse.svg)](https://www.npmjs.org/package/pdf-parse)
[![downloads](https://img.shields.io/npm/dt/pdf-parse.svg)](https://www.npmjs.org/package/pdf-parse)
[![node](https://img.shields.io/node/v/pdf-parse.svg)](https://nodejs.org/)
[![status](https://gitlab.com/autokent/pdf-parse/badges/master/pipeline.svg)](https://gitlab.com/autokent/pdf-parse/pipelines)

## Similar Packages

* [pdf2json](https://www.npmjs.com/package/pdf2json) buggy, no longer supported, leaks memory, throws non-catchable fatal errors.
* [j-pdfjson](https://www.npmjs.com/package/j-pdfjson) fork of pdf2json.
* [pdf-parser](https://github.com/dunso/pdf-parse) buggy, no tests.
* [pdfreader](https://www.npmjs.com/package/pdfreader) uses pdf2json.
* [pdf-extract](https://www.npmjs.com/package/pdf-extract) not cross-platform, uses xpdf.

## Installation

`npm install pdf-parse`

## Basic Usage - Local Files

```js
const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer).then(function(data) {
    // number of pages
    console.log(data.numpages);
    // number of rendered pages
    console.log(data.numrender);
    // PDF info
    console.log(data.info);
    // PDF metadata
    console.log(data.metadata);
    // PDF.js version
    // check https://mozilla.github.io/pdf.js/getting_started/
    console.log(data.version);
    // PDF text
    console.log(data.text);
});
```

## Basic Usage - HTTP

You can use [crawler-request](https://www.npmjs.com/package/crawler-request), which uses `pdf-parse`.

## Exception Handling

```js
const fs = require('fs');
const pdf = require('pdf-parse');

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer).then(function(data) {
    // use data
})
.catch(function(error) {
    // handle exceptions
});
```

## Extend

* If you need another format like JSON, you can change the page render behaviour with a callback.
* Check out https://mozilla.github.io/pdf.js/

```js
const fs = require('fs');
const pdf = require('pdf-parse');

// default render callback
function render_page(pageData, ret) {
    // check documents https://mozilla.github.io/pdf.js/
    ret.text = ret.text ? ret.text : "";

    let render_options = {
        // replaces all occurrences of whitespace with standard spaces (0x20).
        normalizeWhitespace: true,
        // set to true to stop pdf.js from combining TextItems on the same line.
        disableCombineTextItems: false
    };

    return pageData.getTextContent(render_options)
        .then(function(textContent) {
            let strings = textContent.items.map(item => item.str);
            let text = strings.join(' ');
            // append this page's text to the accumulated result
            ret.text = `${ret.text}${text} \n\n`;
        });
}

let options = {
    pagerender: render_page
};

let dataBuffer = fs.readFileSync('path to PDF file...');

pdf(dataBuffer, options).then(function(data) {
    // use new format
});
```

## Options

### default options

```js
const DEFAULT_OPTIONS = {
    // internal page parser callback
    // set this option if you need a format other than raw text
    pagerender: render_page,
    // max page number to parse
    max: 0,
    // check https://mozilla.github.io/pdf.js/getting_started/
    version: 'v1.9.426'
}
```

### pagerender (callback)

Set this callback if you need an output format other than raw text.

### max (number)

Maximum number of pages to parse. If the value is less than or equal to 0, the parser renders all pages.

### version (string, pdf.js version)

Check [pdf.js](https://mozilla.github.io/pdf.js/getting_started/) for the available versions.

* `'default'`
* `'v1.9.426'`
* `'v1.10.88'`

*default* uses version *v1.9.426*. *v1.9.426* is stable; *v1.10.88* is beta.

## Test

`mocha` or `npm test`

Check the [test folder](https://gitlab.com/autokent/pdf-parse/tree/master/test) and [QUICKSTART.js](https://gitlab.com/autokent/pdf-parse/blob/master/QUICKSTART.js) for more usage examples.

## Support

I use this package actively myself, so it has my top priority.

### Submitting an Issue

If you find a bug or a mistake, you can help by submitting an issue to the [GitLab repository](https://gitlab.com/autokent/pdf-parse/issues).

### Creating a Merge Request

GitLab calls it a merge request instead of a pull request.

* [A Guide for First-Timers](https://about.gitlab.com/2016/06/16/fearless-contribution-a-guide-for-first-timers/)
* [How to create a merge request](https://docs.gitlab.com/ee/gitlab-basics/add-merge-request.html)
* Check the [Contributing Guide](https://gitlab.com/autokent/pdf-parse/blob/master/CONTRIBUTING.md)

### WhatsApp

Chat on WhatsApp about any information, ideas, and suggestions.

[![WhatsApp](https://img.shields.io/badge/style-chat-green.svg?style=flat&label=whatsapp)](https://api.whatsapp.com/send?phone=905063042480&text=Hi%2C%0ALet%27s%20talk%20about%20pdf-parse)

## License

[MIT licensed](https://gitlab.com/autokent/pdf-parse/blob/master/LICENSE), and all its dependencies are MIT or BSD licensed.