microdata-rdf-streaming-parser
Version:
A fast and lightweight streaming Microdata to RDF parser
103 lines (72 loc) • 4.5 kB
Markdown
# Microdata to RDF Streaming Parser
[](https://github.com/rubensworks/microdata-rdf-streaming-parser.js/actions?query=workflow%3ACI)
[](https://coveralls.io/github/rubensworks/microdata-rdf-streaming-parser.js?branch=master)
[](https://www.npmjs.com/package/microdata-rdf-streaming-parser)
A [fast](https://gist.github.com/rubensworks/ec15c73fe042441d74e1ba6157ccc7bc) and lightweight _streaming_ and 100% _spec-compliant_ [Microdata to RDF](https://w3c.github.io/microdata-rdf/) parser,
with [RDFJS](https://github.com/rdfjs/representation-task-force/) representations of RDF terms, quads and triples.
The streaming nature allows triples to be emitted _as soon as possible_, and documents _larger than memory_ to be parsed.
## Installation
```bash
$ npm install microdata-rdf-streaming-parser
```
or
```bash
$ yarn add microdata-rdf-streaming-parser
```
This package also works out-of-the-box in browsers via tools such as [webpack](https://webpack.js.org/) and [browserify](http://browserify.org/).
## Require
```javascript
import {MicrodataRdfParser} from "microdata-rdf-streaming-parser";
```
_or_
```javascript
const MicrodataRdfParser = require("microdata-rdf-streaming-parser").MicrodataRdfParser;
```
## Usage
`MicrodataRdfParser` is a Node [Transform stream](https://nodejs.org/api/stream.html#stream_class_stream_transform)
that takes in chunks of Microdata to RDF data,
and outputs [RDFJS](http://rdf.js.org/)-compliant quads.
It can be used to [`pipe`](https://nodejs.org/api/stream.html#stream_readable_pipe_destination_options) streams to,
or you can write strings into the parser directly.
## Configuration
Optionally, the following parameters can be set in the `MicrodataRdfParser` constructor:
* `dataFactory`: A custom [RDFJS DataFactory](http://rdf.js.org/#datafactory-interface) to construct terms and triples. _(Default: `require('@rdfjs/data-model')`)_
* `baseIRI`: An initial default base IRI. _(Default: `''`)_
* `defaultGraph`: The default graph for constructing [quads](http://rdf.js.org/#dom-datafactory-quad). _(Default: `defaultGraph()`)_
* `htmlParseListener`: An optional listener for the internal HTML parse events, should implement [`IHtmlParseListener`](https://github.com/rubensworks/rdfa-streaming-parser.js/blob/master/lib/IHtmlParseListener.ts) _(Default: `null`)_
* `xmlMode`: If the parser should assume strict X(HT)ML documents. _(Default: `false`)_
* `vocabRegistry`: A vocabulary registry to define specific behaviour for given URI prefixes. _(Default: contents of http://www.w3.org/ns/md)_
```javascript
new RdfaParser({
dataFactory: require('@rdfjs/data-model'),
baseIRI: 'http://example.org/',
defaultGraph: namedNode('http://example.org/graph'),
htmlParseListener: new MyHtmlListener(),
xmlMode: true,
vocabRegistry: {
"http://schema.org/": {
"properties": {
"additionalType": {"subPropertyOf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"}
}
},
"http://microformats.org/profile/hcard": {}
},
});
```
## How it works
This tool makes use of the highly performant [htmlparser2](https://www.npmjs.com/package/htmlparser2) library for parsing HTML in a streaming way.
It listens to tag-events, and maintains the required tag metadata in a [stack-based datastructure](https://www.rubensworks.net/blog/2019/03/13/streaming-rdf-parsers/),
which can then be emitted as triples as soon as possible.
Our algorithm closely resembles the [suggested algorithm for transforming Microdata to RDF](https://w3c.github.io/microdata-rdf/#algorithm),
with a few changes to make it work in a streaming way.
If you want to make use of a different HTML/XML parser,
you can create a regular instance of `MicrodataRdfParser`,
and just call the following methods yourself directly:
* `onTagOpen(name: string, attributes: {[s: string]: string})`
* `onText(data: string)`
* `onTagClose()`
## Specification Compliance
This parser passes all tests from the [Microdata to RDF test suite](https://w3c.github.io/microdata-rdf/tests/).
## License
This software is written by [Ruben Taelman](http://rubensworks.net/).
This code is released under the [MIT license](http://opensource.org/licenses/MIT).