pure-regex

A brave RegExp engine implemented in pure JS / vanilla JavaScript.

# Description

A RegExp engine implemented in pure JavaScript. It works without any native RegExp, even in JavaScript runtimes that lack RegExp support.

This package implements the latest features of modern regular expressions (to name a few: Unicode, named capture groups, named backreferences, lookbehind assertions, and the intersection and difference/subtraction operations) without any support from the native RegExp. It is also almost invulnerable to ReDoS (see [the following section](#Security-and-Optimization)).

_Remarks: the development of this package's core features is complete._

# Features

Supports all modern flags, as follows:

- [x] d: hasIndices (ES2022)
- [x] g: global
- [x] i: ignoreCase
- [x] m: multiline
- [x] s: dotAll
- [x] u: unicode
- [x] v: unicodeSets (ES2024)
- [x] y: sticky

# Tutorial

## Installation

```sh
npm i pure-regex
```

## Import

```javascript
var PureRegex = require("pure-regex")
```

Or use `import` (in Node.js, or in the browser with a module bundler or loader):

```javascript
import PureRegex from "pure-regex"
```

In the browser (exported as `PureRegex`):

```html
<script src="https://cdn.jsdelivr.net/npm/pure-regex@1.5/dist.umd.cjs"></script>
<!-- or -->
<script type="module">
  import PureRegex from "https://cdn.jsdelivr.net/npm/pure-regex@1.5/dist.esm.js"
  //...
</script>
```

<!--https://fastly.jsdelivr.net/npm/pure-regex@latest/dist.esm.js https://purge.jsdelivr.net/npm/pure-regex@latest/dist.esm.js-->

# Example

## Match string

```javascript
var pReg = new PureRegex("hello (.+)")
var matches = pReg.exec("hello world")
console.log(matches)
```

## Match string with capture group name

```javascript
var pReg = new PureRegex("hello (?<cap>.+)")
var matches = pReg.exec("hello world")
console.log(matches)
```

## Match string with named capture reference

```javascript
// "\\w+" captures the whole word so that \k<cap> can match its repetition
var pReg = new PureRegex("hello (?<cap>\\w+) \\k<cap>")
var matches = pReg.exec("hello world world")
console.log(matches)
```

## Match string with lookbehind assertion

```javascript
var pReg = new PureRegex("(?<=hello ).+")
var matches = pReg.exec("hello world") //match "world"
console.log(matches)
```

## Test string

```javascript
var pReg = new PureRegex("w.+d")
console.log(pReg.test("hello world"))
```

# Regex Api

## new PureRegex(source | regex, flags): instance

This is the constructor, which is invoked in the same way as the native RegExp. It can also be called as an ordinary function without `new`, such as `PureRegex(...)`.

## static PureRegex.escape(string): string

Escapes the reserved characters of regex syntax so that the string can be used directly in a pattern.

```javascript
PureRegex.escape("(new) file.txt") // \(new\)\\u0020file\.txt
```

```javascript
var url = "..."
var domain = "example.com" // hypothetical domain, for illustration only
var pReg = new PureRegex(`https?://${PureRegex.escape(domain)}`, "g")
url.replace(pReg, "")
```

Or use template literals:

```javascript
// Instead of
new PureRegex(`hello ${ PureRegex.escape(name) }`)
// Use template tag
PureRegex.raw`hello ${name}`
// Avoid escape
PureRegex.raw`hello ${ PureRegex(name) }`
// Or regExp literal embedded, with no escape
PureRegex.raw`hello ${ /regexp?/i }`

// Instead of
new PureRegex(`hello (\\w+)\\b`)
// No need to double the backslashes
PureRegex.raw`hello (\w+)\b`
// Add pattern flags
PureRegex.raw("i")`hello world\b`
```

## Match methods:

## #exec(string): array | null

Returns a match array on success, in the same form as the native RegExp. A `null` return value means the match failed.

## #test(string): bool

Checks whether the string can be matched.

## Props:

## #flags: string

Returns a string containing the regex flags.

## #lastIndex: number

The end position of the last match in the string, which serves as the start position of the next match.

## #hasIndices: bool

Returns true if the `d` (hasIndices) flag is set.

## #global: bool

Returns true if the `g` (global) flag is set.

## #ignoreCase: bool

Returns true if the `i` (ignoreCase) flag is set.

## #multiline: bool

Returns true if the `m` (multiline) flag is set.

## #dotAll: bool

Returns true if the `s` (dotAll) flag is set.

## #unicode: bool

Returns true if the `u` (unicode) flag is set.

## #unicodeSets: bool

Returns true if the `v` (unicodeSets) flag is set.

## #sticky: bool

Returns true if the `y` (sticky) flag is set.

## #source: string

Returns the regex source (also called the regex pattern).

## #toString(): string

Returns the regex source wrapped in slashes, followed by its flags, such as "/.+/g".

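The `#exec` and `#lastIndex` entries above can be illustrated with a stepping loop. The sketch below uses the native `RegExp` as a stand-in, because the methods are described as corresponding to it; replacing `RegExp` with `PureRegex` is assumed, not verified here, to give the same result.

```javascript
// Native RegExp is used as a stand-in; PureRegex is assumed to behave the same way.
var re = new RegExp("\\w+", "g")
var text = "hello pure regex"
var found = []
var m

// With the "g" flag, each exec() call starts at re.lastIndex
// and advances it to the end of the match it just found.
while ((m = re.exec(text)) !== null) {
  found.push({ match: m[0], index: m.index, next: re.lastIndex })
}

console.log(found)
// [ { match: 'hello', index: 0, next: 5 },
//   { match: 'pure', index: 6, next: 10 },
//   { match: 'regex', index: 11, next: 16 } ]

// A failed match resets lastIndex, so the regex can be reused from the start.
console.log(re.lastIndex) // 0
```

Because the loop ends on a failed match, `lastIndex` is back at 0 afterwards and the same instance can be reused on another string.
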
# Extended String Api

## String.prototype.match(pureRegex): array | null

```javascript
var pReg = new PureRegex("(\\w+)$")
var matches = "hello world".match(pReg)
```

## String.prototype.matchAll(pureRegex): Iterator<array>

```javascript
var pReg = new PureRegex("\\w+", "g")
var matchesIterator = "hello world".matchAll(pReg)
console.log([...matchesIterator])
```

## String.prototype.search(pureRegex): number

```javascript
var pReg = new PureRegex("world", "i") //ignore letter case
var index = "Hello World".search(pReg) //6
```

## String.prototype.replace(pureRegex, replacement): string

```javascript
var pReg = new PureRegex("\\b(world)\\b")
var str = "hello world".replace(pReg, "pure-regex") //"hello pure-regex"
```

## String.prototype.replaceAll(pureRegex, replacement): string

```javascript
var pReg = new PureRegex("\\s", "g")
var str = "a b c".replaceAll(pReg, "_") //"a_b_c"
```

## String.prototype.split(pureRegex[, limit]): array

```javascript
var pReg = new PureRegex("\\s")
var chunks = "hello world".split(pReg)
```

---

# Flags

## i - ignoreCase

With case-insensitive matching, uppercase and lowercase letters of the English alphabet are treated as equivalent. Thus "A" is equivalent to "a", both in the regex source and in the text to match.

## s - dotAll

By default, the dot metacharacter matches any character except line breaks ("\n" and "\r"). With the "s" flag set, the dot matches nearly every character, line breaks included. It covers the full code point range from U+0000 to U+10FFFF only when Unicode mode is enabled.

## u - unicode

When the "u" flag is specified, the regex source and the text to match are interpreted as sequences of Unicode code points, and Unicode syntax and features are supported. That is, the regex works in Unicode mode.

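As a rough illustration of what Unicode mode changes, the sketch below matches a single dot against one emoji with and without the "u" flag. It uses the native `RegExp` as a stand-in; that `PureRegex` gives the same results is an assumption.

```javascript
// Native RegExp is used as a stand-in; PureRegex is assumed to behave the same way.
var clover = "🍀" // U+1F340, stored as two UTF-16 code units

var withoutU = new RegExp("^.$")
var withU = new RegExp("^.$", "u")

console.log(clover.length)         // 2
console.log(withoutU.test(clover)) // false: "." consumes a single code unit
console.log(withU.test(clover))    // true: "." consumes the whole code point
```
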
```javascript
var pReg = new PureRegex("[🍀]\\u{1F338}", "u")
var str = "🍀🌸" //Unicode
console.log(pReg.test(str)) //true
```

```javascript
var pReg = new PureRegex("0x\\p{Hex_Digit}+", "u")
var matches = pReg.exec("0xFAF1")
console.log(matches)
```

## v - unicodeSets

The `v` flag was introduced in ES2024 as an upgrade of the `u` (unicode) flag. For details of its features, see the [documentation](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicodeSets).

Some usages:

```javascript
var pReg = new PureRegex("[\\p{White_Space}&&\\p{ASCII}]", "v")
console.log(pReg.test("\n")) //true
console.log(pReg.test("\u2028")) //false
```

```javascript
var pReg = new PureRegex("[\\p{White_Space}--\\q{\n}]", "v")
var pReg = new PureRegex("[\\p{White_Space}--[\n]]", "v") //the same as above
console.log(pReg.test("\r")) //true
console.log(pReg.test("\n")) //false
```

## m - multiline

When the "m" flag is set, the "^" and "$" anchors match the beginning and the end of every line, respectively.

## g - global

With the "g" flag, exec(), match() and replace() can find all matches. They run in a stepping mode in which every match attempt starts from the end position of the previous match. matchAll() and replaceAll() are available only when the global flag is set.

## y - sticky

If the "y" flag is set, the regex attempts a match only at the fixed position given by `pureRegex.lastIndex`.

## d - hasIndices (added in ES2022)

For example:

```javascript
var pReg = new PureRegex("(a)b(c)", "d")
var matches = pReg.exec("abc")
console.log(matches.indices) // [ [ 0, 3 ], [ 0, 1 ], [ 2, 3 ], groups: undefined ]
```

---

# Extend

## static PureRegex.extend(propName, obj): void

Because of package size limits, PureRegex does not include the entire Unicode character set. However, any additional character sets you need can be imported from an external source and registered as extensions.

Just install the following package ([link](https://www.npmjs.com/package/regenerate-unicode-properties)) and import it.

```sh
npm i regenerate-unicode-properties
```

Example:

```javascript
const Basic_Emoji = require("regenerate-unicode-properties/Property_of_Strings/Basic_Emoji.js");
PureRegex.extend("Basic_Emoji", Basic_Emoji)
var preg = PureRegex.raw("v")`^\p{Basic_Emoji}$`
preg.exec("🚀")
```

---

# Security and Optimization

Although it is NFA-based, the engine provides fundamental immunity against [ReDoS](https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS) through a well-designed runtime algorithm that differs from Thompson's construction. That is, it still supports backtracking features, including lookahead and lookbehind assertions and backreferences, rather than giving them up. As long as a pattern does not involve NP-complete/NP-hard problems (namely backreferences), the engine guarantees worst-case linear time complexity, O(n), without sacrificing functionality, which gives it an advantage over existing DFA-like and NFA-like algorithms.

The immunity is rooted in its distinctive algorithm, which, although NFA-like, is built on tree traversal and path switching.

PureRegex also applies numerous optimizations, notably at compile time but not only there. Sometimes a regex reduces to a plain string search.

A simple ReDoS example:

```javascript
var str = "x".repeat(20)
var pat = "(x+)+y"
var pReg = new PureRegex(pat)
var nReg = new RegExp(pat)

console.time("PureRegex")
pReg.exec(str) //returns instantly; the engine is not even started
console.timeEnd("PureRegex")

console.time("RegExp")
nReg.exec(str) //blocks for a while
console.timeEnd("RegExp")
```

Since v1.2.0, the runtime engine itself copes with such complexity in a linearly increasing number of cycles. Notably, it achieves this even without the compile-time optimizations.
The same applies to various other patterns, such as `"(x+)+(?:(x)+)+y"` and `"(x+)+(\\1)y"`.

```javascript
var str = "x".repeat(40) + "!xy"
var pat = "(x+)+y"
var pReg = new PureRegex(pat)
var nReg = new RegExp(pat)

console.time("PureRegex")
pReg.exec(str)
console.timeEnd("PureRegex")

console.time("RegExp")
nReg.exec(str) //blocks for a long time
console.timeEnd("RegExp")
```

The engine has also been immune to ordinary ReDoS since the first version, without any compile-time optimization:

```javascript
var str = "a".repeat(100)
var pat = "^(([a-z])+.)+[A-Z]([a-z])+$"
var pReg = new PureRegex(pat)
var nReg = new RegExp(pat)

console.time("PureRegex")
pReg.exec(str) //a few milliseconds
console.timeEnd("PureRegex")

console.time("RegExp")
nReg.exec(str) //always blocking
console.timeEnd("RegExp")
```

Before v1.1.0, however, the engine gave up backtracking capability in order to obtain complete resistance; this changed in later versions.

Since v1.3.0, the engine has been able to match in reverse. It thereby reached its objective of __linear time__ for arbitrary lookbehind assertions, as for any other expression. This marked the end of early development.

Since v1.4.0, work has gone into making backreferences more adaptable, which involved reworking the internal implementation. Regrettably, determining whether a backreference matches takes polynomial time, and at least linear time.

Since v1.5.0, the optimization mechanism for backreferences has been refined further, making full use of the core algorithm.

Starting from v1.5.32, support for ES2025 features has been added, e.g. inline flags (pattern modifiers):

```javascript
var preg = new PureRegex(`x(?i:HELLO)x`)
preg.test("xHellox") // true
preg.test("xhellox") // true
preg.test("XhelloX") // false
```

## Feedback & Reporting Issues

We value your thoughts! Any suggestion or question? [Click here to provide feedback](https://s.surveyplanet.com/st4afan2)