@shieldsbetter/nearley-indentify

# Nearley-Indentify [![codecov](https://codecov.io/gh/hamptonsmith/nearley-indentify/branch/master/graph/badge.svg)](https://codecov.io/gh/hamptonsmith/nearley-indentify) [![Build Status](https://travis-ci.org/hamptonsmith/nearley-indentify.svg?branch=master)](https://travis-ci.org/hamptonsmith/nearley-indentify) Adapts existing [Nearley](https://www.npmjs.com/package/nearley)-compatible lexers such as [moo](https://www.npmjs.com/package/moo) to emit `indent` and `dedent` tokens in order to support indent-aware languages like Python. ## Quickstart ```javascript const IndentifyLexer = require("@shieldsbetter/nearley-indentify"); const indentifiedLexer = new IndentifyLexer(mooLexer()); indentifiedLexer.reset(` Hello World! Here's some indentation And dedentation `); let token = indentifiedLexer.next(); while (token) { console.log({ type: token.type, value: token.value }); token = indentifiedLexer.next(); } function mooLexer() { return require("moo").compile({ ws: /[ \t]+/, nonws: /[^ \t\n]+/, newline: { match: /\n/, lineBreaks: true } }); } ``` Outputs: ```javascript { type: 'nonws', value: 'Hello' } { type: 'ws', value: ' ' } { type: 'nonws', value: 'World!' } { type: 'eol', value: '\n' } { type: 'indent', value: ' ' } { type: 'nonws', value: 'Here\'s' } { type: 'ws', value: ' ' } { type: 'nonws', value: 'some' } { type: 'eol', value: '\n' } { type: 'nonws', value: 'indentation' } { type: 'eol', value: '\n' } { type: 'dedent', value: ' ' } { type: 'nonws', value: 'And' } { type: 'ws', value: ' ' } { type: 'nonws', value: 'dedentation' } { type: 'eol', value: '\n' } ``` ## Overview Indentified lexers are themselves Nearley-compatible. Input is provided by a call to `reset()`, and tokens are read by repeatedly calling `next()` until it returns `undefined`. By default, generated indentation-related tokens can be recognized by having a `type` field equal to `"indent"`, `"dedent"`, or `"eol"`. All other tokens will be as they were produced by the base lexer. The default options require only that base tokens have the single Nearley-required `value` field, but more advanced customization can rely on additional implementation-specific realities of the base tokens. ## Options Additional options may be passed during construction via the second constructor argument. For example: ```javascript const indentifiedLexer = new IndentifyLexer(baseLexer, { determineIndentationLevel: (asString, tokens) => tokens.length }); ``` Available options are: - `controlTokenRecognizer` - a function that classifies tokens from the base lexer according to their relevance to indentation parsing. This function should take the form `baseToken => controlTokenType`, where `baseToken` will be a token from the base lexer, and `controlTokenType` should be a string from the set {`"indent"`, `"newline"`}, or `undefined` if the given base token has no relevance to indentation parsing. The default function is: ```javascript baseToken => { let result; if (/[ \t]+/.test(baseToken.value)) { result = 'indent'; } else if (/[\n\r]+/.test(baseToken.value)) { result = 'newline'; } return result; } ``` - `determineIndentationDepth` - assigns a numeric indentation depth to a line of input, as delimited by `"newline"`-type tokens or the end of the base token stream. Only lines that contain non-`"indent"`, non-`"newline"` tokens will be passed to this function. This numeric depth need not be integral nor contiguous. The only requirements are that equivalent levels of indent yield the same number, "deeper" levels of indent yield numbers greater than "shallower" levels, and "shallower" levels of indent yield numbers less than "deeper" levels. This function should take the form `(indentTokens, indentAsString, indentBreakingToken, previousDepth) => depth`. The first parameter, `indentTokens`, will be the _indent prefix_ of the line, i.e. an array of contiguous `"indent"`-type base lexer tokens that began the line and preceded first non-`"indent"`-type token of the line. This array could be empty for lines with no indentation. The second parameter, `indentAsString`, will be the concatenation of the `value` field of each of the tokens in `indentTokens`; i.e.: the result of `indentTokens.map(t => t.value).join('')`. The third parameter, `indentBreakingToken`, will be the first non-`"indent"`-type base token of the line. The fourth parameter, `previousDepth`, will be a number indicating the parser's currently established indentation depth, or `undefined` if no indentation depth has yet been established. This parameter can be useful when one wishes to ignore the actual indent prefix of the line and instead "force" the line to exist at a particular depth relative to the current indentation depth. For example, one may wish to have comment lines always exist at the previously established depth, irrespective of their actual indent prefix. - `emptyLineStrategy` - a `(newlineToken, emit) => {}` function to be executed upon encountering a line consisting of only zero or more `"indent"`-type base tokens followed by a `"newline"`-type token or the end of the base token input stream. The first parameter, `newlineToken`, will be the `"newline"`-type base token that ended the line, or `undefined` if the line was ended by the end of the base token stream. The second parameter, `emit`, is a function that takes a single parameter value and emits it as a token into the indentified lexer's stream. This can be useful, for example, when you'd like empty lines to get their own `eol`-type token. These tokens will be emitted exactly as provided. The default value is `() => {}`, which emits no tokens when encountering an empty line. Any returned value will be ignored. - `lineListeners` - an array of `(indentTokens, indentAsString, indentBreakingToken, indentBreakingTokenType) => {}` functions to be executed after the full indent prefix of each line is parsed but before the token that broke the indent prefix is emitted. Each listener will be called for each line, including empty lines. This can be useful, for example, to check that indent prefixes use consistent indent characters (indeed, the default provides this functionality). The first and second parameters, `indentTokens` and `indentAsString`, will reflect the line's indent prefix as described for the `determineIndentationDepth` option. The third parameter, `indentBreakingToken`, will be the non-`"indent"` token that broke the indent prefix, which may be a content token, a `"newline"`-type token, or `undefined` if the indent prefix was broken by the end of the base token stream. The fourth parameter, `indentBreakingTokenType`, will be the result of applying `options.controlTokenRecognizer()` to `indentBreakingToken`, or `undefined` if `indentBreakingToken` is itself `undefined`. Any returned value will be ignored. The default is `[ new IndentifyLexer.ConsistentIndentEnforcer() ]`, which ensures that the shared prefix of characters forming the indent from line to line does not change and raises an `Error` if it does. - `tokenBuilder` - a function for constructing `"indent"`, `"dedent"`, and `"newline"` alignment tokens to be inserted into the indentified stream. Takes the form `(type, value, baseToken) => alignmentToken`. The first parameter, `type`, will be a string indicating the requested type of token (one of `"indent"`, `"dedent"`, or `"eol"`). The second parmaeter, `value`, will be one of the following: - If `type === 'eol'`, `value` is the `value` field of the base token that triggered the end of the line. - If `type === 'indent'`, `value` is the concatenation of the `value` fields of each of the tokens that forms the indent prefix of the line whose content tokens will follow the constructed `indent` token. - If `type === 'dedent'`, `value` is either the concatenation of the `value` fields of each of the tokens that forms the indent prefix of the line whose content tokens will follow the constructed `dedent` token, or, in the case of a `dedent` token being constructed for an intermediate indentation level (which will be followed by another `dedent` token or the end of the stream rather than the content tokens of some line), `value` is the concatenation of the `value` fields of the tokens that formed the indent prefix of the line that originally established the intermediate indentation level. The third parameter, `baseToken` will be some base token to be used as a template to form the constructed alignment token. More specifically: - If `type === 'eol'`, `baseToken` will be the `"newline"`-type base token that triggered the end of the line. - If `type === 'indent'`, `baseToken` will be the indent-breaking token of the line whose content tokens will follow the constructed `"indent"` token. - If `type === 'dedent'`, `baseToken` will be the indent-breaking token of the line whose content tokens will follow the constructed `"dedent"` token, or, in the case of a `dedent` token being constructed for an intermediate indentation level (which will be followed by another `dedent` token or the end of the stream rather than the content tokens of some line), `baseToken` will be the `"newline"`-type base token that preceded the dedented line. The return value should be the requested alignment token, ready to be emitted into the token stream. Nearley-Indentify performs no further processing on these returned tokens and they are unconstrained other than the need to be acceptable to the token consumer. The default is: ```javascript (type, value, baseToken) => { const token = clone(baseToken); token.type = type; token.value = value; return token; }; ``` Where `clone()` is [`clone`](https://www.npmjs.com/package/clone). ## Required Base Lexer Interface Base lexers should conform to the [interface expected by Nearley](https://nearley.js.org/docs/tokenizers#custom-lexers). Nearley doesn't fully specify how token streams are terminated, but we assume moo-like behavior and specify that wrapped lexers must return `undefined` from `next()` when there are no further tokens. Tokens from the `next()` method of the base lexer must be objects with the Nearley-specified `value` field, but there are no other requirements under the default configuration. If custom `options.controlTokenRecognizer()` or `options.buildToken()` functions are specified, tokens must additionally be acceptable to them. The default control token recognizer requires only a `value` field and the default token builder requires only that base tokens are objects. ## Extras - `IndentifyLexer.ConsistentIndentEnforcer` - a line listener, suitable to provide to `options.lineListeners`, that enforces consistent indentation between lines. "Consistent" here means that the string indent prefix of contiguous non-empty lines at the same numerical indent level are the same string, the indent prefix of indented non-empty lines are prefixed by the indent prefix of the previous less-indented non-empty line, and the indent prefix of dedented non-empty lines forms a prefix of the indent prefix of the previous more-indented non-empty line.