UNPKG

stricter-htmlparser2

Version:

Fast & forgiving HTML/XML/RSS parser with stricter attribute parsing rules.

183 lines (143 loc) 5.77 kB
# stricter-htmlparser2 [![NPM version](http://img.shields.io/npm/v/stricter-htmlparser2.svg?style=flat)](https://npmjs.org/package/stricter-htmlparser2) [![Downloads](https://img.shields.io/npm/dm/stricter-htmlparser2.svg?style=flat)](https://npmjs.org/package/stricter-htmlparser2) [![Build Status](http://img.shields.io/travis/yanghuabei/stricter-htmlparser2/master.svg?style=flat)](http://travis-ci.org/yanghuabei/htmlparser2) [![Coverage](http://img.shields.io/coveralls/yanghuabei/stricter-htmlparser2.svg?style=flat)](https://coveralls.io/r/yanghuabei/htmlparser2) A forgiving HTML/XML/RSS parser. The parser can handle streams and provides a callback interface. ## Installation npm install 'stricter-htmlparser2' A live demo of htmlparser2 is available [here](https://astexplorer.net/#/2AmVrGuGVJ). ## Differences to htmlparser2 1. Attribute value without quote wrapped is not allowed. ``` <foo name=value /> // output { attribs: { name: "", value: "" } } ``` 2. "\"" is allowed in attribute value. ``` <foo name="hello \"world" /> // output { attribs: { name: "hello \\"world" } } ``` 3. Record attributes key in a object that value is wrapped by single quote. The object is passed to `onopentag` method of parser's domhandler through the third argument. ``` <foo name='{{flag ? "true" : "false"}}' title="title" class='hidden'> </foo> // The third argument passed to onopentag: { name: true, class: true } ``` // If used with x-domhandler which is exported by this module, the parse result is like this: ```javascript { type: "tag", name: "foo", attribs: { name: "{{flag ? \"true\" : \"false\"}}", title: "title", class: "hidden" }, singleQuoteAttribs: { name: true, class: true }, selfClose: false } ``` 4. Add `selfClose` flag to parsed node element. **Input:** ```html <foo name='{{flag ? "true" : "false"}}' title="title" class='hidden' /> ``` **Parser code:** ```javascript const {Parser, DomHandler} = require('stricter-htmlparser2'); const parser = new Parser(new DomHandler(), {}).end(inputStr); ``` **Output** ```javascript { type: "tag", name: "foo", attribs: { name: "{{flag ? \"true\" : \"false\"}}", title: "title", class: "hidden" }, singleQuoteAttribs: { name: true, class: true }, selfClose: true } ``` ## Usage ```javascript var htmlparser = require("stricter-htmlparser2"); var parser = new htmlparser.Parser({ onopentag: function(name, attribs){ if(name === "script" && attribs.type === "text/javascript"){ console.log("JS! Hooray!"); } }, ontext: function(text){ console.log("-->", text); }, onclosetag: function(tagname){ if(tagname === "script"){ console.log("That's it?!"); } } }, {decodeEntities: true}); parser.write("Xyz <script type='text/javascript'>var foo = '<<bar>>';</ script>"); parser.end(); ``` Output (simplified): ``` --> Xyz JS! Hooray! --> var foo = '<<bar>>'; That's it?! ``` ## Documentation Read more about the parser and its options in the [wiki](https://github.com/fb55/htmlparser2/wiki/Parser-options). ## Get a DOM The `DomHandler` (known as `DefaultHandler` in the original `htmlparser` module) produces a DOM (document object model) that can be manipulated using the [`DomUtils`](https://github.com/fb55/DomUtils) helper. The `DomHandler`, while still bundled with this module, was moved to its [own module](https://github.com/fb55/domhandler). Have a look at it for further information. ## Parsing RSS/RDF/Atom Feeds ```javascript new htmlparser.FeedHandler(function(<error> error, <object> feed){ ... }); ``` Note: While the provided feed handler works for most feeds, you might want to use [danmactough/node-feedparser](https://github.com/danmactough/node-feedparser), which is much better tested and actively maintained. ## Performance After having some artificial benchmarks for some time, __@AndreasMadsen__ published his [`htmlparser-benchmark`](https://github.com/AndreasMadsen/htmlparser-benchmark), which benchmarks HTML parses based on real-world websites. At the time of writing, the latest versions of all supported parsers show the following performance characteristics on [Travis CI](https://travis-ci.org/AndreasMadsen/htmlparser-benchmark/builds/10805007) (please note that Travis doesn't guarantee equal conditions for all tests): ``` gumbo-parser : 34.9208 ms/file ± 21.4238 html-parser : 24.8224 ms/file ± 15.8703 html5 : 419.597 ms/file ± 264.265 htmlparser : 60.0722 ms/file ± 384.844 htmlparser2-dom: 12.0749 ms/file ± 6.49474 htmlparser2 : 7.49130 ms/file ± 5.74368 hubbub : 30.4980 ms/file ± 16.4682 libxmljs : 14.1338 ms/file ± 18.6541 parse5 : 22.0439 ms/file ± 15.3743 sax : 49.6513 ms/file ± 26.6032 ``` ## How does this module differ from [node-htmlparser](https://github.com/tautologistics/node-htmlparser)? This is a fork of the `htmlparser` module. The main difference is that this is intended to be used only with node (it runs on other platforms using [browserify](https://github.com/substack/node-browserify)). `htmlparser2` was rewritten multiple times and, while it maintains an API that's compatible with `htmlparser` in most cases, the projects don't share any code anymore. The parser now provides a callback interface close to [sax.js](https://github.com/isaacs/sax-js) (originally targeted at [readabilitySAX](https://github.com/fb55/readabilitysax)). As a result, old handlers won't work anymore. The `DefaultHandler` and the `RssHandler` were renamed to clarify their purpose (to `DomHandler` and `FeedHandler`). The old names are still available when requiring `htmlparser2`, your code should work as expected.