UNPKG

re-build

Version:

Building regular expressions with natural language

289 lines (203 loc) 10.3 kB
RE-Build ======== Build regular expressions with natural language. ## Introduction Have you ever dealt with complex regular expressions like the following one? ```js var ipMatch = /(?:(?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)\.){3}(?:1\d\d|2[0-4]\d|25[0-5]|[1-9]\d|\d)\b/; ``` Using a meaningful variable name can help, writing comments helps even more, but what's always hard to understand is what the regular expression actually *does*: They're left as some sort of magic trick that it's never updated because their syntax is so obscure that even the authors themselves hardly fell like facing them again. Debugging a regular expression often means rewriting it from scratch. RE-Build's aim is to change that, converting the process of creating a regular expression to combining nice natural language expressions. The above regex would be composed as ```js var ipNumber = RE.group( RE ("1").then.digit.then.digit .or ("2").then.oneOf.range("0", "4").then.digit .or ("25").then.oneOf.range("0", "5") .or .oneOf.range("1", "9").then.digit .or .digit ), ipMatch = RE.matching.exactly(3).group( ipNumber.then(".") ) .then(ipNumber).then.wordBoundary.regex; ``` This approach is definitely more verbose, but also much clearer and less error prone. Another module for the same purpose is [VerbalExpressions](https://github.com/VerbalExpressions/JSVerbalExpressions), but it doesn't allow to build just *any* regular expression. RE-Build aims to fill that gap too. Remember, as a general rule, that RE-Build does *not* care if your environment doesn't support certain `RegExp` features (for example, the `sticky` flag or extended Unicode escaping sequences), as the corresponding source code will be generated anyway. Of course, you'll get an error trying to get a `RegExp` object out of it. ## Installation Via `npm`: ```bash npm install re-build ``` Via `bower`: ```bash bower install re-build ``` The package can be loaded as a CommonJS module (node.js, io.js), as an AMD module (RequireJS, ...) or as a standalone script: ```html <script src="re-build.min.js"></script> ``` ## Usage For a detailed documentation, check the [reference sheet](doc/reference.md). Keep in mind that RE-Build is a tool to help building, understanding and debugging regular expressions, and does *not* prevent one to create incorrect results. ### Basics The *core* point is the `RE` object (or whatever variable name you assigned to it), together with the `matching` method: ```js var RE = require("re-build"); var builder = RE.matching("xyz"); ``` The output is *not*, however, a regular expression, but a a regular expression *builder* that can be extended, or used as an extension for other builders. To get the corrisponding regular expression, use the `regex` property or the `toRegExp()/valueOf()` methods. ```js var start = RE.matching.theStart.then(builder).toRegExp(); // /^xyz/ var foo = RE.matching(builder).then.oneOrMore.digit.regex; // /xyz\d+/ ``` As you can see, you can put additional matching blocks using the `then` word, which is also a function that can take arguments as blocks to add too. The arguments can be strings (which are backslash-escaped), regular expressions or RE-Build'ers, whose `source` property is added to the builder *unescaped*. The `or` word has a similar meaning, but adds an alternative block to the source: ```js var hex = RE.matching.digit .or.oneOf.range("A", "F") .regex; // /\d|[A-F]/ ``` ### Regex builders are immutable Regular expression builders are immutable objects, meaning that when extending a builder we get a new builder instance: ```js var bld1 = RE.matching.digit; var bld2 = bld1.or.oneOf.range("A", "F"); bld1 === bld2; // => false ``` ### Special classes, aliases and escaping RE-Build uses specific names to address common regex character classes: Name | Result | Notes ---------------|--------------|-------------- `digit` | `\d` | from `0` to `9` `alphaNumeric` | `\w` | digits, uppercase and lowercase letters and the underscore `whiteSpace` | `\s` | white space characters `wordBoundary` | `\b` | `anyChar` | `.` | universal matcher `theStart` | `^` | `theEnd` | `$` | `cReturn` | `\r` | carriage return `newLine` | `\n` | `tab` | `\t` | `vTab` | `\v` | vertical tab `formFeed` | `\f` | `null` | `\0` | `slash` | `\/` | `backslash` | `\\` | `backspace` | `\b` | can be used in character sets `[...]' *only* The first four names can be negated prefixing them with `not` to get the complementary meaning: * `not.digit` for `\D`; * `not.alphaNumeric` for `\W`; * `not.whiteSpace` for `\S`; * `not.wordBoundary` for `\B`. Single characters can be defined by escape sequences: Function | Result | Meaning ---------------|----------|----------- `ascii(n)` | `\xhh` | ASCII character corrisponding to `n` `codePoint(n)` | `\uhhhh` / `\u{hhhhhh}` | Unicode character corrisponding to `n` `control(a)` | `\ca` | Control sequence corrisponding to the letter `a` With the exception of `wordBoundary`, `theStart` and `theEnd`, all of the previous words can be used inside character sets (see after). ### Flags You can set the flags of the regex prefixing `matching` with one or more of the flagging options: * `globally` for a global regex; * `anyCase` for a case-insensitive regex; * `fullText` for a "multiline" regex (i.e., the dot '`.`' matches new line characters too); * `withUnicode` for a regex with extended Unicode support; * `stickily` for a "sticky" regex. Alternatively, you can set the flags with the `withFlags` method of the `RE` object. ```js // The following regexes are equivalent: /[a-f]/gi var foo = RE.globally.anyCase.matching.oneOf.range("a", "f").regex; var bar = RE.withFlags("gi").matching.oneOf.range("a", "f").regex; ``` You can't change a regex builder's flags, as builders are immutable, but you can create a copy of a builder with different flags: ```js var foo = RE.matching.oneOrMore.alphaNumeric; // /\w+/ var bar = RE.globally.matching(foo); // /\w+/g ``` If you don't need flags set, as a shortened version you can remove the `matching` word: ```js // These are equivalent: RE.matching("abc").then.digit; RE("abc").then.digit; ``` This becomes useful when defining the content of groups, character sets or look-aheads. ### Grouping Use the `group` word to define a non-capturing group, and `capture` for a capturing group: ```js var amount = RE.matching("$").then.capture( RE.oneOrMore.digit .then.noneOrOne.group(".", RE.oneOrMore.digit) ).regex; // /\$(\d+(?:\.\d+)?)/ ``` The `group` and `capture` words are function, and the resulting groups will embrace everything passed as arguments. Just like `then` and `or`, arguments can be strings, regular expression or other RE-Build'ers. Backrefences for capturing groups are obtained using the `reference` function, passing the reference number: ```js var quote = RE.matching.capture( RE.oneOf("'\"") ) .then.anyAmountOf.alphaNumeric .then.reference(1); // /(['"])\w*\1/ ``` ### Character sets Character sets (`[...]`) are introduced by the word `oneOf`. Several characters can be included separated by the word `and`. Additionally, one can include a character interval, using the function `range` and giving the initial and final character of the interval. Exclusive character sets can be obtained prefixing `oneOf` by the word `not`. ```js var hexColor = RE.matching("#").then.exactly(6) .oneOf.digit.and.range("a", "f").and.range("A", "F"); // /#[\da-fA-F]{6}/ var hours = RE.oneOf("01").then.digit.or("2").then.oneOf.range("0", "3"); // /[01]\d|2[0-3]/ var quote = RE.matching('"').then.oneOrMore.not.oneOf('"').then('"'); // /"[^"]+"/ ``` ### Quantifiers Quantifiers can be defined prefixing the quantified block by one of these constructs: Construct | Result ----------------|--------- `anyAmountOf` | `*` `oneOrMore` | `+` `noneOrOne` | `?` `atLeast(n)` | `{n,}` `atMost(n)` | `{,n}` `exactly(n)` | `{n}` `between(n, m)` | `{n,m}` Quantification is smart enough to translate constructs in their most compact form (e.g., `.atLeast(1)` becomes `+`, `.between(0, 1)` becomes `?` and so on). Lazy quantifiers can be obtained prefixing the word `lazily` prior to the quantifier. ```js var number = RE.oneOrMore.digit; // /\d+/ var hexnumber = RE.exactly(2).oneOf.digit.and.range("a", "f"); // /[\da-f]{2}/ var macAddress = RE.anyCase.matching(hexnumber).then.exactly(5).group( RE("-").then(hexnumber) ); // /[\da-f]{2}(?:-[\da-f]{2}){5}/i var quoteAlt = RE.matching.capture(RE.oneOf("'\"")) .then.lazily.anyAmountOf.anyChar .then.reference(1); // /(['"]).*?\1/ ``` ### Look-aheads Look-aheads are introduced by the function `followedBy` (eventually prefixed by `not` for negative look-aheads). ```js var euro = RE.matching.oneOrMore.digit.followedBy("€"); // /\d+(?=€)/ var foo = RE("a").or.not.followedBy("b").then("c"); // /a|(?!b)c/ ``` ## Compatibilty * Internet Explorer 9+ * Firefox 4+ * Safari 5+ * Chrome * Opera 11.60+ * node.js Basically, every Javascript environment that supports [`Object.defineProperties`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/defineProperties) should be fine. ## Tests The unit tests are built on top of [mocha](http://mochajs.org/). Once the package is installed, run `npm install` from the package's root directory in order to locally install mocha, then `npm run test` to execute the tests. Open [index.html](test/index.html) with a browser to perform the tests on the client side. If mocha is installed globally, served side tests can be run with just the command `mocha` from the package's root directory. ## To do * More natural language alternatives * Plurals, articles * CLI tool to translate regexes to and from RE-Build's syntax * More examples * Consider IE8 support ## License MIT @ Massimo Artizzu 2015-2016. See [LICENSE](LICENSE).