RE-Build
========
Build regular expressions with natural language.
## Introduction
Have you ever dealt with complex regular expressions like the following one?
```js
var ipMatch = /(?:(?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)\.){3}(?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)\b/;
```
Using a meaningful variable name can help, and writing comments helps even more, but what's always hard to understand is what the regular expression actually *does*. Regular expressions are often left as some sort of magic trick that is never updated, because their syntax is so obscure that even the authors themselves hardly feel like facing them again. Debugging a regular expression often means rewriting it from scratch.
RE-Build's aim is to change that, turning the creation of a regular expression into the combination of natural language expressions. The above regex would be composed as
```js
var ipNumber = RE.group(
        RE ("1").then.digit.then.digit
        .or ("2").then.oneOf.range("0", "4").then.digit
        .or ("25").then.oneOf.range("0", "5")
        .or .oneOf.range("1", "9").then.digit
        .or .digit
    ),
    ipMatch = RE.matching.exactly(3).group( ipNumber.then(".") )
        .then(ipNumber).then.wordBoundary.regex;
```
This approach is definitely more verbose, but also much clearer and less error prone.
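Whichever form is used, the resulting pattern can be tried against sample input with the standard `RegExp` methods. The sketch below uses only the regex literal from the beginning of this section (no RE-Build calls), and the sample strings are invented for illustration. Note that the pattern is not anchored, so it can also match inside a longer run of digits.

```js
// The literal regex from the introduction, used as-is.
var ipMatch = /(?:(?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)\.){3}(?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)\b/;

// Sample inputs (invented for this example).
var full = ipMatch.exec("10.0.0.255");     // matches the whole address
var tooShort = ipMatch.test("1.2.3");      // false: only three numbers
var partial = ipMatch.exec("256.1.1.1");   // matches "56.1.1.1", skipping the leading "2"

console.log(full[0]);      // 10.0.0.255
console.log(tooShort);     // false
console.log(partial[0]);   // 56.1.1.1
```

If whole-string validation is needed, the pattern has to be anchored at both ends (`theStart` and `theEnd` in RE-Build's vocabulary).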
Another module for the same purpose is [VerbalExpressions](https://github.com/VerbalExpressions/JSVerbalExpressions), but it doesn't allow building just *any* regular expression. RE-Build aims to fill that gap too.
Remember, as a general rule, that RE-Build does *not* care if your environment doesn't support certain `RegExp` features (for example, the `sticky` flag or extended Unicode escape sequences), as the corresponding source code will be generated anyway. Of course, in such an environment you'll get an error when you try to get a `RegExp` object out of it.
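Whether a given flag is available can be checked up front with plain JavaScript. The following is a minimal sketch; `supportsFlag` is a hypothetical helper written for this example and is not part of RE-Build.

```js
// Hypothetical helper (not part of RE-Build): checks whether the current
// engine accepts a given RegExp flag by trying to construct a regex with it.
function supportsFlag(flag) {
    try {
        new RegExp("", flag);
        return true;
    } catch (e) {
        return false;   // the engine rejected the flag
    }
}

var canBeGlobal = supportsFlag("g");   // true in every JavaScript engine
var canBeSticky = supportsFlag("y");   // depends on the environment
```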
## Installation
Via `npm`:
```bash
npm install re-build
```
Via `bower`:
```bash
bower install re-build
```
The package can be loaded as a CommonJS module (node.js, io.js), as an AMD module (RequireJS, ...) or as a standalone script:
```html
<script src="re-build.min.js"></script>
```
## Usage
For detailed documentation, check the [reference sheet](doc/reference.md). Keep in mind that RE-Build is a tool to help build, understand and debug regular expressions, and does *not* prevent you from creating incorrect ones.
### Basics
The *core* point is the `RE` object (or whatever variable name you assigned to it), together with the `matching` method:
```js
var RE = require("re-build");
var builder = RE.matching("xyz");
```
The output is *not*, however, a regular expression, but a regular expression *builder* that can be extended, or used as an extension for other builders. To get the corresponding regular expression, use the `regex` property or the `toRegExp()`/`valueOf()` methods.
```js
var start = RE.matching.theStart.then(builder).toRegExp(); // /^xyz/
var foo = RE.matching(builder).then.oneOrMore.digit.regex; // /xyz\d+/
```
As you can see, you can append additional matching blocks using the `then` word, which is also a function that takes the blocks to add as arguments. The arguments can be strings (which are backslash-escaped), regular expressions or RE-Build'ers, whose `source` property is added to the builder *unescaped*.
The `or` word has a similar meaning, but adds an alternative block to the source:
```js
var hex = RE.matching.digit
.or.oneOf.range("A", "F")
.regex; // /\d|[A-F]/
```
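Escaping matters because a string passed to `then` or `or` is meant to match literally. The following is a minimal sketch of such an escaping step in plain JavaScript; `escapeLiteral` is a hypothetical helper, and the exact set of characters RE-Build escapes may differ.

```js
// Hypothetical helper: puts a backslash in front of every character
// that has a special meaning in a regular expression.
function escapeLiteral(text) {
    return text.replace(/[.*+?^${}()|[\]\\\/]/g, "\\$&");
}

var escaped = escapeLiteral("1.5*2");     // regex source: 1\.5\*2
var literal = new RegExp(escaped);
var unescaped = new RegExp("1.5*2");      // "." and "*" keep their special meaning

console.log(literal.test("1.5*2"));       // true
console.log(literal.test("1x5552"));      // false
console.log(unescaped.test("1x5552"));    // true
```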
### Regex builders are immutable
Regular expression builders are immutable objects, meaning that when extending a builder we get a new builder instance:
```js
var bld1 = RE.matching.digit;
var bld2 = bld1.or.oneOf.range("A", "F");
bld1 === bld2; // => false
```
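To illustrate the idea, here is a deliberately tiny builder written in plain JavaScript. It is a hypothetical sketch, not RE-Build's implementation: `miniBuilder` and its three methods exist only in this example.

```js
// Hypothetical miniature builder, only to illustrate immutability.
// It is NOT RE-Build's implementation and supports just a few operations.
function miniBuilder(source) {
    return Object.freeze({
        source: source,
        then: function (part) { return miniBuilder(source + part); },
        or: function (part) { return miniBuilder(source + "|" + part); },
        toRegExp: function () { return new RegExp(source); }
    });
}

var digit = miniBuilder("\\d");
var hex = digit.or("[A-F]");   // a new object; `digit` is left untouched

console.log(digit === hex);    // false
console.log(digit.source);     // \d
console.log(hex.source);       // \d|[A-F]
```

Because nothing is mutated, a builder such as `digit` can safely be reused as a building block in several other expressions.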
### Special classes, aliases and escaping
RE-Build uses specific names to address common regex character classes:
Name | Result | Notes
---------------|--------------|--------------
`digit` | `\d` | from `0` to `9`
`alphaNumeric` | `\w` | digits, uppercase and lowercase letters and the underscore
`whiteSpace` | `\s` | white space characters
`wordBoundary` | `\b` |
`anyChar` | `.` | universal matcher
`theStart` | `^` |
`theEnd` | `$` |
`cReturn` | `\r` | carriage return
`newLine` | `\n` |
`tab` | `\t` |
`vTab` | `\v` | vertical tab
`formFeed` | `\f` |
`null` | `\0` |
`slash` | `\/` |
`backslash` | `\\` |
`backspace` | `\b` | can be used in character sets `[...]` *only*
The first four names can be negated by prefixing them with `not` to get the complementary meaning:
* `not.digit` for `\D`;
* `not.alphaNumeric` for `\W`;
* `not.whiteSpace` for `\S`;
* `not.wordBoundary` for `\B`.
Single characters can be defined by escape sequences:
Function | Result | Meaning
---------------|----------|-----------
`ascii(n)` | `\xhh` | ASCII character corresponding to `n`
`codePoint(n)` | `\uhhhh` / `\u{hhhhhh}` | Unicode character corresponding to `n`
`control(a)` | `\ca` | Control sequence corresponding to the letter `a`
With the exception of `wordBoundary`, `theStart` and `theEnd`, all of the previous words can be used inside character sets (see the "Character sets" section).
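As a rough illustration of how a number or letter maps to the escape sequences in the table, here is a sketch in plain JavaScript. The helper names, the lowercase hexadecimal digits and the uppercase control letter are assumptions made for this example; RE-Build's own output may be formatted differently.

```js
// Hypothetical helpers, not RE-Build functions.
function asciiEscape(n) {
    return "\\x" + ("0" + n.toString(16)).slice(-2);      // always two hex digits
}
function codePointEscape(n) {
    var hex = n.toString(16);
    return n > 0xFFFF
        ? "\\u{" + hex + "}"                  // only interpreted with the `u` flag
        : "\\u" + ("000" + hex).slice(-4);    // always four hex digits
}
function controlEscape(letter) {
    return "\\c" + letter.toUpperCase();
}

console.log(asciiEscape(65));            // \x41
console.log(codePointEscape(0x20AC));    // \u20ac
console.log(codePointEscape(0x1F600));   // \u{1f600}
console.log(controlEscape("j"));         // \cJ
```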
### Flags
You can set the flags of the regex by prefixing `matching` with one or more of the flagging options:
* `globally` for a global regex;
* `anyCase` for a case-insensitive regex;
* `fullText` for a "multiline" regex (in JavaScript, this flag makes `^` and `$` match at line breaks too; it does not change what the dot '`.`' matches);
* `withUnicode` for a regex with extended Unicode support;
* `stickily` for a "sticky" regex.
Alternatively, you can set the flags with the `withFlags` method of the `RE` object.
```js
// The following regexes are equivalent: /[a-f]/gi
var foo = RE.globally.anyCase.matching.oneOf.range("a", "f").regex;
var bar = RE.withFlags("gi").matching.oneOf.range("a", "f").regex;
```
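The result is an ordinary `RegExp` object, so the effect of the flags can be checked with its native properties. This sketch constructs the regex directly rather than through RE-Build, and the sample sentence is invented.

```js
var re = new RegExp("[a-f]", "gi");        // same as the literal /[a-f]/gi

console.log(re.global);                    // true
console.log(re.ignoreCase);                // true

// With `g`, String.prototype.match returns every match, not just the first.
var letters = "Grab a coffee".match(re);
console.log(letters.join(""));             // abacffee
```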
You can't change a regex builder's flags, as builders are immutable, but you can create a copy of a builder with different flags:
```js
var foo = RE.matching.oneOrMore.alphaNumeric; // /\w+/
var bar = RE.globally.matching(foo); // /\w+/g
```
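For comparison, this is the same "copy with different flags" idea in plain JavaScript: the pattern source is reused and only the flags change. It illustrates the concept with the native `RegExp` constructor and is not a description of how RE-Build works internally.

```js
var words = /\w+/;
var allWords = new RegExp(words.source, "g");   // same pattern, global flag added

console.log(words.global);                            // false, the original is unchanged
console.log(allWords.global);                         // true
console.log("one two three".match(words).length);     // 1 (first match only)
console.log("one two three".match(allWords).length);  // 3
```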
If you don't need to set flags, you can omit the `matching` word as a shorthand:
```js
// These are equivalent:
RE.matching("abc").then.digit;
RE("abc").then.digit;
```
This becomes useful when defining the content of groups, character sets or look-aheads.
### Grouping
Use the `group` word to define a non-capturing group, and `capture` for a capturing group:
```js
var amount = RE.matching("$").then.capture(
RE.oneOrMore.digit
.then.noneOrOne.group(".", RE.oneOrMore.digit)
).regex;
// /\$(\d+(?:\.\d+)?)/
```
The `group` and `capture` words are functions, and the resulting groups will embrace everything passed as arguments. Just like `then` and `or`, arguments can be strings, regular expressions or other RE-Build'ers.
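The difference between the two kinds of group shows up in the match result. The sketch below uses the regex from the comment above as a plain literal; the input string is invented.

```js
var amount = /\$(\d+(?:\.\d+)?)/;
var match = amount.exec("Total: $12.50");

console.log(match[0]);       // $12.50  (the whole match)
console.log(match[1]);       // 12.50   (the capturing group)
console.log(match.length);   // 2: the non-capturing group adds no entry
```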
Backreferences for capturing groups are obtained using the `reference` function, passing the reference number:
```js
var quote = RE.matching.capture( RE.oneOf("'\"") )
.then.anyAmountOf.alphaNumeric
.then.reference(1);
// /(['"])\w*\1/
```
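The backreference forces the closing quote to be the same character as the opening one. A quick check with the resulting regex as a plain literal (sample strings invented):

```js
var quote = /(['"])\w*\1/;

console.log(quote.test("'hello'"));     // true
console.log(quote.test("\"hello\""));   // true
console.log(quote.test("\"hello'"));    // false: the quotes don't match
```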
### Character sets
Character sets (`[...]`) are introduced by the word `oneOf`. Several characters can be included, separated by the word `and`. Additionally, one can include a character interval, using the function `range` and giving the initial and final character of the interval.
Negated character sets can be obtained by prefixing `oneOf` with the word `not`.
```js
var hexColor = RE.matching("#").then.exactly(6)
.oneOf.digit.and.range("a", "f").and.range("A", "F");
// /#[\da-fA-F]{6}/
var hours = RE.oneOf("01").then.digit.or("2").then.oneOf.range("0", "3");
// /[01]\d|2[0-3]/
var quote = RE.matching('"').then.oneOrMore.not.oneOf('"').then('"');
// /"[^"]+"/
```
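One thing to keep in mind with a pattern like `hours`: alternation binds more loosely than everything around it, so anchors or further blocks added directly apply to a single alternative only. Wrapping the alternatives in a group keeps them together. The sketch uses plain regex literals; the anchored variants are written by hand for illustration, not generated by RE-Build.

```js
var loose = /^[01]\d|2[0-3]$/;          // means (^[01]\d) OR (2[0-3]$)
var grouped = /^(?:[01]\d|2[0-3])$/;    // anchors apply to both alternatives

console.log(loose.test("0999"));     // true: "09" satisfies the first alternative
console.log(grouped.test("0999"));   // false
console.log(grouped.test("23"));     // true
console.log(grouped.test("24"));     // false
```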
### Quantifiers
Quantifiers can be defined by prefixing the quantified block with one of these constructs:
Construct | Result
----------------|---------
`anyAmountOf` | `*`
`oneOrMore` | `+`
`noneOrOne` | `?`
`atLeast(n)` | `{n,}`
`atMost(n)` | `{,n}`
`exactly(n)` | `{n}`
`between(n, m)` | `{n,m}`
Quantification is smart enough to translate constructs into their most compact form (e.g., `.atLeast(1)` becomes `+`, `.between(0, 1)` becomes `?` and so on).
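The kind of simplification described here can be sketched as a small function. `compactQuantifier` is hypothetical and written only for this example; RE-Build's actual logic may differ. An unbounded maximum is represented with `Infinity`, which is also an assumption of the sketch.

```js
// Hypothetical function: picks the shortest quantifier for a (min, max) pair.
function compactQuantifier(min, max) {
    if (max === Infinity) {
        if (min === 0) return "*";
        if (min === 1) return "+";
        return "{" + min + ",}";
    }
    if (min === 0 && max === 1) return "?";
    if (min === max) return "{" + min + "}";
    return "{" + min + "," + max + "}";
}

console.log(compactQuantifier(1, Infinity));   // +
console.log(compactQuantifier(0, 1));          // ?
console.log(compactQuantifier(3, Infinity));   // {3,}
console.log(compactQuantifier(2, 2));          // {2}
console.log(compactQuantifier(2, 5));          // {2,5}
```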
Lazy quantifiers can be obtained by putting the word `lazily` before the quantifier.
```js
var number = RE.oneOrMore.digit; // /\d+/
var hexnumber = RE.exactly(2).oneOf.digit.and.range("a", "f");
// /[\da-f]{2}/
var macAddress = RE.anyCase.matching(hexnumber).then.exactly(5).group(
RE("-").then(hexnumber)
);
// /[\da-f]{2}(?:-[\da-f]{2}){5}/i
var quoteAlt = RE.matching.capture(RE.oneOf("'\""))
.then.lazily.anyAmountOf.anyChar
.then.reference(1);
// /(['"]).*?\1/
```
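The effect of `lazily` is easiest to see by comparing the lazy pattern with its greedy counterpart on input that contains more than one quoted part. Both regexes are written as plain literals here, and the sample text is invented.

```js
var lazy = /(['"]).*?\1/;     // stops at the first closing quote
var greedy = /(['"]).*\1/;    // runs to the last closing quote
var text = "'a' and 'b'";

console.log(lazy.exec(text)[0]);     // 'a'
console.log(greedy.exec(text)[0]);   // 'a' and 'b'
```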
### Look-aheads
Look-aheads are introduced by the function `followedBy` (optionally prefixed by `not` for negative look-aheads).
```js
var euro = RE.matching.oneOrMore.digit.followedBy("€");
// /\d+(?=€)/
var foo = RE("a").or.not.followedBy("b").then("c");
// /a|(?!b)c/
```
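A look-ahead checks what follows without including it in the match. This can be verified with the `euro` regex written as a plain literal (sample strings invented):

```js
var euro = /\d+(?=€)/;
var match = euro.exec("Price: 25€");

console.log(match[0]);            // 25  (the "€" is required but not consumed)
console.log(euro.test("25$"));    // false
```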
## Compatibility
* Internet Explorer 9+
* Firefox 4+
* Safari 5+
* Chrome
* Opera 11.60+
* node.js
Basically, every JavaScript environment that supports [`Object.defineProperties`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/defineProperties) should be fine.
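A quick way to check for that feature before loading the library is a plain feature test. The variable name is made up for this sketch.

```js
// True when the environment provides Object.defineProperties.
var hasDefineProperties = typeof Object.defineProperties === "function";

if (!hasDefineProperties) {
    console.log("Object.defineProperties is missing: RE-Build is unlikely to work here.");
}
```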
## Tests
The unit tests are built on top of [mocha](http://mochajs.org/). Once the package is installed, run `npm install` from the package's root directory in order to locally install mocha, then `npm run test` to execute the tests. Open [index.html](test/index.html) with a browser to perform the tests on the client side.
If mocha is installed globally, server-side tests can be run with just the command `mocha` from the package's root directory.
## To do
* More natural language alternatives
* Plurals, articles
* CLI tool to translate regexes to and from RE-Build's syntax
* More examples
* Consider IE8 support
## License
MIT @ Massimo Artizzu 2015-2016. See [LICENSE](LICENSE).