html-pe
Version:
Accurate yet performant HTML parser and encoder
172 lines (134 loc) • 4.13 kB
Markdown
is for those who expect accuracy from their HTML parser, but are
unwilling to compromise on performance. This parser handles a variety of edge
cases as prescribed by the HTML5 standard and remains performant thanks to
optimizations such as avoiding regular expressions.
Chunked parsing is supported, and an efficient HTML entity encoder/decoder is
included.
`npm install html-pe`
```javascript
var html = require('html-pe');
var parser = new html.Parser();
// parse chunks
parser.write('<p style="color:red">foo');
parser.write('</p>');
parser.end();
// parse entire string
parser.parse('<div id="main">bar</div>');
```
Optionally, an `options` object may be passed to the constructor (default values
shown):
```javascript
var parser = new html.Parser({
trim: false, // trim whitespace in text nodes
decode: true // decode HTML character entities in attribute values and text nodes
});
```
The parser is an `EventEmitter` and emits the following events:
```javascript
// opening tag
parser.on('open', function (name, attributes, empty) {
// <string> name
// <object> attributes
// <bool> empty: is this an empty (self-closing) tag
});
// text node
parser.on('text', function (text) {
// <string> text
});
// closing tag
parser.on('close', function (name) {
// <string> name
});
// end
parser.on('end', function () {});
```
Two handlers are provided:
```javascript
var dom = new html.DOM(function (document) {
// <object> document
});
dom.listen(parser);
// ...
dom.reset();
var printer = new html.Printer();
printer.listen(parser);
```
The `Printer` handler pretty-prints the parsed HTML to the console. The `DOM`
handler passes a `document` object to its callback, which is detailed below.
Follow the structure in `handlers.js` to create new handlers.
The `DOM` handler will parse `<div id="main">hello <b>world</b></div>` into the
following `document`:
```javascript
{
type: 'element',
parent: {},
name: '',
attributes: {},
children: [{
type: 'element',
parent : [Circular],
name: 'div',
attributes: {
id: 'main'
},
children: [{
type: 'text',
parent: [Circular],
text: 'hello '
},
{
type: 'element',
parent: [Circular],
name: 'b',
attributes: {},
children: [{
type: 'text',
parent: [Circular],
text: 'world'
}]
}]
}]
}
```
Nodes are either `Element` nodes or `Text` nodes. All nodes have a `textContent`
property. `Element` nodes provide the DOM navigation methods `getElementById`,
`getElementsByName`, `getElementsByClassName`, and `getElementsByTagName`. Two
primitive methods are also available: `getElementsBy(test)` and
`getElementBy(test)` where `test` is a function of type `Element -> bool`. For
example, this is how `getElementsBy` is used to implement
`getElementsByTagName`:
```javascript
var Element = html.Element;
Element.prototype.getElementsByTagName = function (name) {
return this.getElementsBy(function (elem) {
return elem.name == name;
});
}
```
See `dom.js` for details.
```javascript
var html = require('html-pe');
html.encode('<p id="main">Hello & world!</p>',
false // encode &<>"' only (default)
);
// '<p id="main">Hello &amp; world!</p>'
html.encode('<p>здравствуйте!</p>',
true // also encode all non-ASCII and non-printable characters
);
// '<p>здравствуйте!</p>'
html.decode('It's ¬ me');
// 'It\'s ¬ me'
```
See `encoder.js` for details.
`html-pe` has slightly worse performance than
[ ](https://github.com/fb55/node-htmlparser). A benchmark is available
in `tests/bench.js`.
MIT. See `LICENSE`.
`html-pe`