@shieldsbetter/nearley-indentify
Version:
Adapts Nearley-compatible lexers to generate indent/dedent events.
266 lines (210 loc) • 11.5 kB
Markdown
# Nearley-Indentify [](https://codecov.io/gh/hamptonsmith/nearley-indentify) [](https://travis-ci.org/hamptonsmith/nearley-indentify)
Adapts existing [Nearley](https://www.npmjs.com/package/nearley)-compatible
lexers such as [moo](https://www.npmjs.com/package/moo) to emit `indent` and
`dedent` tokens in order to support indent-aware languages like Python.
## Quickstart
```javascript
const IndentifyLexer = require("@shieldsbetter/nearley-indentify");
const indentifiedLexer = new IndentifyLexer(mooLexer());
indentifiedLexer.reset(`
Hello World!
Here's some
indentation
And dedentation
`);
let token = indentifiedLexer.next();
while (token) {
console.log({ type: token.type, value: token.value });
token = indentifiedLexer.next();
}
function mooLexer() {
return require("moo").compile({
ws: /[ \t]+/,
nonws: /[^ \t\n]+/,
newline: { match: /\n/, lineBreaks: true }
});
}
```
Outputs:
```javascript
{ type: 'nonws', value: 'Hello' }
{ type: 'ws', value: ' ' }
{ type: 'nonws', value: 'World!' }
{ type: 'eol', value: '\n' }
{ type: 'indent', value: ' ' }
{ type: 'nonws', value: 'Here\'s' }
{ type: 'ws', value: ' ' }
{ type: 'nonws', value: 'some' }
{ type: 'eol', value: '\n' }
{ type: 'nonws', value: 'indentation' }
{ type: 'eol', value: '\n' }
{ type: 'dedent', value: ' ' }
{ type: 'nonws', value: 'And' }
{ type: 'ws', value: ' ' }
{ type: 'nonws', value: 'dedentation' }
{ type: 'eol', value: '\n' }
```
## Overview
Indentified lexers are themselves Nearley-compatible. Input is provided by a
call to `reset()`, and tokens are read by repeatedly calling `next()` until it
returns `undefined`. By default, generated indentation-related tokens can be
recognized by having a `type` field equal to `"indent"`, `"dedent"`, or `"eol"`.
All other tokens will be as they were produced by the base lexer.
The default options require only that base tokens have the single
Nearley-required `value` field, but more advanced customization can rely on
additional implementation-specific realities of the base tokens.
## Options
Additional options may be passed during construction via the second constructor
argument. For example:
```javascript
const indentifiedLexer = new IndentifyLexer(baseLexer, {
determineIndentationLevel: (asString, tokens) => tokens.length
});
```
Available options are:
- `controlTokenRecognizer` - a function that classifies tokens from the base
lexer according to their relevance to indentation parsing. This function
should take the form `baseToken => controlTokenType`, where `baseToken` will
be a token from the base lexer, and `controlTokenType` should be a string from
the set {`"indent"`, `"newline"`}, or `undefined` if the given base token has
no relevance to indentation parsing.
The default function is:
```javascript
baseToken => {
let result;
if (/[ \t]+/.test(baseToken.value)) {
result = 'indent';
} else if (/[\n\r]+/.test(baseToken.value)) {
result = 'newline';
}
return result;
}
```
- `determineIndentationDepth` - assigns a numeric indentation depth to a line of
input, as delimited by `"newline"`-type tokens or the end of the base token
stream. Only lines that contain non-`"indent"`, non-`"newline"` tokens will
be passed to this function.
This numeric depth need not be integral nor contiguous. The only
requirements are that equivalent levels of indent yield the same number,
"deeper" levels of indent yield numbers greater than "shallower" levels, and
"shallower" levels of indent yield numbers less than "deeper" levels.
This function should take the form
`(indentTokens, indentAsString, indentBreakingToken, previousDepth) => depth`.
The first parameter, `indentTokens`, will be the _indent prefix_ of the line,
i.e. an array of contiguous `"indent"`-type base lexer tokens that began the
line and preceded first non-`"indent"`-type token of the line. This array
could be empty for lines with no indentation.
The second parameter, `indentAsString`, will be the concatenation of the
`value` field of each of the tokens in `indentTokens`; i.e.: the result of
`indentTokens.map(t => t.value).join('')`.
The third parameter, `indentBreakingToken`, will be the first
non-`"indent"`-type base token of the line.
The fourth parameter, `previousDepth`, will be a number indicating the
parser's currently established indentation depth, or `undefined` if no
indentation depth has yet been established. This parameter can be useful when
one wishes to ignore the actual indent prefix of the line and instead "force"
the line to exist at a particular depth relative to the current indentation
depth. For example, one may wish to have comment lines always exist at the
previously established depth, irrespective of their actual indent prefix.
- `emptyLineStrategy` - a `(newlineToken, emit) => {}` function to be executed
upon encountering a line consisting of only zero or more `"indent"`-type base
tokens followed by a `"newline"`-type token or the end of the base token input
stream.
The first parameter, `newlineToken`, will be the `"newline"`-type base token
that ended the line, or `undefined` if the line was ended by the end of the
base token stream.
The second parameter, `emit`, is a function that takes a single parameter
value and emits it as a token into the indentified lexer's stream. This can be
useful, for example, when you'd like empty lines to get their own `eol`-type
token. These tokens will be emitted exactly as provided.
The default value is `() => {}`, which emits no tokens when encountering an
empty line.
Any returned value will be ignored.
- `lineListeners` - an array of
`(indentTokens, indentAsString, indentBreakingToken, indentBreakingTokenType) => {}`
functions to be executed after the full indent prefix of each line is parsed
but before the token that broke the indent prefix is emitted. Each listener
will be called for each line, including empty lines.
This can be useful, for example, to check that indent prefixes use consistent
indent characters (indeed, the default provides this functionality).
The first and second parameters, `indentTokens` and `indentAsString`, will
reflect the line's indent prefix as described for the
`determineIndentationDepth` option.
The third parameter, `indentBreakingToken`, will be the non-`"indent"` token
that broke the indent prefix, which may be a content token, a `"newline"`-type
token, or `undefined` if the indent prefix was broken by the end of the base
token stream.
The fourth parameter, `indentBreakingTokenType`, will be the result of
applying `options.controlTokenRecognizer()` to `indentBreakingToken`, or
`undefined` if `indentBreakingToken` is itself `undefined`.
Any returned value will be ignored.
The default is `[ new IndentifyLexer.ConsistentIndentEnforcer() ]`, which
ensures that the shared prefix of characters forming the indent from line to
line does not change and raises an `Error` if it does.
- `tokenBuilder` - a function for constructing `"indent"`, `"dedent"`, and
`"newline"` alignment tokens to be inserted into the indentified stream.
Takes the form `(type, value, baseToken) => alignmentToken`.
The first parameter, `type`, will be a string indicating the requested type of
token (one of `"indent"`, `"dedent"`, or `"eol"`).
The second parmaeter, `value`, will be one of the following:
- If `type === 'eol'`, `value` is the `value` field of the base token that
triggered the end of the line.
- If `type === 'indent'`, `value` is the concatenation of the `value` fields
of each of the tokens that forms the indent prefix of the line whose content
tokens will follow the constructed `indent` token.
- If `type === 'dedent'`, `value` is either the concatenation of the `value`
fields of each of the tokens that forms the indent prefix of the line whose
content tokens will follow the constructed `dedent` token, or, in the case
of a `dedent` token being constructed for an intermediate indentation level
(which will be followed by another `dedent` token or the end of the stream
rather than the content tokens of some line), `value` is the concatenation
of the `value` fields of the tokens that formed the indent prefix of the
line that originally established the intermediate indentation level.
The third parameter, `baseToken` will be some base token to be used as a
template to form the constructed alignment token. More specifically:
- If `type === 'eol'`, `baseToken` will be the `"newline"`-type base token
that triggered the end of the line.
- If `type === 'indent'`, `baseToken` will be the indent-breaking token of
the line whose content tokens will follow the constructed `"indent"` token.
- If `type === 'dedent'`, `baseToken` will be the indent-breaking token of the
line whose content tokens will follow the constructed `"dedent"` token, or,
in the case of a `dedent` token being constructed for an intermediate
indentation level (which will be followed by another `dedent` token or the
end of the stream rather than the content tokens of some line), `baseToken`
will be the `"newline"`-type base token that preceded the dedented line.
The return value should be the requested alignment token, ready to be emitted
into the token stream. Nearley-Indentify performs no further processing on
these returned tokens and they are unconstrained other than the need to be
acceptable to the token consumer.
The default is:
```javascript
(type, value, baseToken) => {
const token = clone(baseToken);
token.type = type;
token.value = value;
return token;
};
```
Where `clone()` is [`clone`](https://www.npmjs.com/package/clone).
## Required Base Lexer Interface
Base lexers should conform to the
[interface expected by Nearley](https://nearley.js.org/docs/tokenizers#custom-lexers).
Nearley doesn't fully specify how token streams are terminated, but we assume
moo-like behavior and specify that wrapped lexers must return `undefined` from
`next()` when there are no further tokens.
Tokens from the `next()` method of the base lexer must be objects with the
Nearley-specified `value` field, but there are no other requirements under the
default configuration. If custom `options.controlTokenRecognizer()` or
`options.buildToken()` functions are specified, tokens must additionally be
acceptable to them. The default control token recognizer requires only a `value`
field and the default token builder requires only that base tokens are objects.
## Extras
- `IndentifyLexer.ConsistentIndentEnforcer` - a line listener, suitable to
provide to `options.lineListeners`, that enforces consistent indentation
between lines.
"Consistent" here means that the string indent prefix of contiguous non-empty
lines at the same numerical indent level are the same string, the indent
prefix of indented non-empty lines are prefixed by the indent prefix of the
previous less-indented non-empty line, and the indent prefix of dedented
non-empty lines forms a prefix of the indent prefix of the previous
more-indented non-empty line.