Parse files and metadata using Tika.

github.com/critocrito/sugarcube/tree/master/packages/plugin-tika

# `@sugarcube/plugin-tika`

Use the [Apache Tika](https://tika.apache.org/) toolkit to detect and extract metadata and text from over a thousand different file types.

## Installation

```shell
npm install --save @sugarcube/plugin-tika
```

This plugin also requires Java to be installed.

## Plugins

### `tika_parse`

Parse a list of files specified by the query type `glob_pattern`.

```shell
sugarcube -Q 'glob_pattern:files/**/*.pdf' -p tika_parse
```

### `tika_links`

Iterate over all links in `_sc_media` and fetch the text and metadata for each link. Errors thrown during the fetch are ignored.

### `tika_location`

Parse the location stored in the unit field named by the `tika_location_field` query type, for example a URL inside the unit, and fetch its text and metadata.

```shell
sugarcube -Q google_search:Keith\ Johnstone \
  -Q tika_location_field:href \
  -p google_search,tika_location
```

The text and metadata are added to the `_sc_media` collection and also placed directly on the unit. For example, if the location field is `href`, the fields `href_text` and `href_meta` are added to the unit.

### `tika_export`

Export the text and metadata that `tika_location` parses to a file.

```shell
sugarcube -Q google_search:Keith\ Johnstone \
  -p google_search,tika_location,tika_export \
  --tika.location_field href
```

**Configuration Options**:

- `tika.data_dir`: The target directory in which to store all files. Defaults to `./data/tika_location`.

## License

[GPL3](./LICENSE) @ [Christo](mailto:christo@cryptodrunks.net)
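
## Sketch: fields added by `tika_location`

The field naming described for `tika_location` can be illustrated with a small sketch. This is not the plugin's actual code: `addLocationFields` is a hypothetical helper, the `parsed` object stands in for whatever Tika returns, and the shape of the `_sc_media` entry is an assumption, since this README does not specify it. Only the `<field>_text` and `<field>_meta` naming comes from the description above.

```javascript
// Hypothetical helper, NOT part of @sugarcube/plugin-tika.
// It only mirrors the documented naming: for a location field `href`,
// the unit gains `href_text` and `href_meta`.
function addLocationFields(unit, field, parsed) {
  const location = unit[field];
  // Units without the location field are returned unchanged.
  if (location == null) return unit;

  return {
    ...unit,
    [`${field}_text`]: parsed.text,
    [`${field}_meta`]: parsed.meta,
    // Assumption: the real shape of a `_sc_media` entry is not documented here.
    _sc_media: [
      ...(unit._sc_media || []),
      {term: location, text: parsed.text, meta: parsed.meta},
    ],
  };
}

// Example data, made up for illustration.
const unit = {href: "https://example.org/report.pdf", _sc_media: []};
const parsed = {
  text: "Extracted text",
  meta: {"Content-Type": "application/pdf"},
};

const result = addLocationFields(unit, "href", parsed);
console.log(result);
```

With `href` as the location field, the returned unit carries `href_text` and `href_meta` alongside the original `href`, and one additional entry in `_sc_media`.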