@gitlab/eslint-plugin
Version:
GitLab package for our custom eslint rules
239 lines (181 loc) • 10.6 kB
Markdown
# Benchmarking /require-i18n-strings
At the heart of `/require-i18n-strings` is `isNonLocalizedStringValue(value: string)` which checks whether a string matches certain criteria and would be a false positive. For example CSS class lists, or `dateformat` strings, or URLs should not be externalized.
That being said most of the checks are regex checks, and some of them very expensive. In first tests, the following are especially bad.
- `([a-zA-Z][a-zA-Z-]+[a-zA-Z])+\.\w+` inside `isCssSelector` takes ~45% of the execution time
- `^([a-zA-Z0-9]+-[a-zA-Z0-9-]*[\s,]*)+$` inside `isKebabCase` takes ~27% of the execution time
- `^(\w+\.?)*\.\w{2,}$` inside `isFileName` takes ~24% of the execution time
Why do we actually care? In context of the GitLab project the execution of: `node --cpu-prof node_modules/.bin/eslint .` takes around 7:48 minutes on my machine, of which we spend 2:15 minutes (~27%) inside of `isNonLocalizedStringValue`. As this is a heavily CPU-bound issue, we likely spend way longer on slower machines, like e.g. CI.
## Benchmarking script
Inside of this folder there are two scripts used for benchmarking. The following commands are all executed inside `scripts/benchmarking` of the eslint-plugin repo. Please run a yarn install in this folder, as we have a few different dependencies which the main plugin doesn't need.
`getAllStrings.js`: This script can be used to extract strings JS and Vue files from a code base. Just provide globs as parameters.
For example, to extract all strings from within GitLab:
```bash
node ./getAllStrings.js "/path/to/gitlab/gitlab/{ee/,}app/**/*.{js,vue}"
```
The strings found are written into `allStrings.json`.
If you are doing iterations on `isNonLocalizedStringValue`,
the idea would be to just generate the lists of strings _once_,
in order to have a comparable baseline.
---
`checkAllStrings.js`: This script has a dual purpose: Writing results of the function to a file time-stamped txt file, in order to compare the results between iterations:
```bash
node ./checkAllStrings.js --write-results
```
In order to look at the efficiency of our function, we can get a CPU profile with:
```bash
node --cpu-prof ./checkAllStrings.js
```
The generated profile can be loaded into e.g. [speedscope]. The `left heavy` view will make it easily understandable where time is spent.
Thanks here for the blog series [Speeding up the JavaScript ecosystem][blog-series] which inspired me to look into this.
[speedscope]: https://www.speedscope.app
[blog-series]: https://marvinh.dev/blog/speeding-up-javascript-ecosystem-part-3/
## Baseline
On my machine I have generated a list of strings based on [1d13ebfbb0] in the GitLab project.
I found `47801` unique strings with `./getAllStrings.js`. For 13 of them the script times out.
`hyperfine --runs 3 "node ./checkAllStrings.js"` reports:
```
Benchmark 1: node ./checkAllStrings.js
Time (mean ± σ): 42.315 s ± 0.186 s [User: 40.149 s, System: 2.791 s]
Range (min … max): 42.139 s … 42.510 s 3 runs
```
The CPU Profile shows that `isCssSelector` is the worst offender (60% of the time spend). Let's have a look.
[1d13ebfbb0]: https://gitlab.com/gitlab-org/gitlab/-/commit/1d13ebfbb0ca339452e883a047806e7f8c167197
## `isCssSelector`
```js
function isCssSelector(value) {
return (
/^,?\s?[\.\#\*\@]\w+.*/.test(value) ||
/^[a-zA-Z]+[a-zA-Z-]*[\.\#\*]+[a-zA-Z]+[a-zA-Z-]*/.test(value) ||
/([a-zA-Z][a-zA-Z-]+[a-zA-Z])+\.\w+/.test(value) ||
/([a-zA-Z]+[a-zA-Z-]*|\*)\[.*\]/.test(value) ||
/^[a-zA-Z]+[a-zA-Z-]*[\#\*]$/.test(value) ||
/^(?:[0-9]+[a-z]|[a-z]+[0-9])[a-z0-9]*\s[a-z0-9][a-z0-9]+-[a-z0-9-]*$/.test(value) ||
isListOfCssProperties(value) ||
isListOfHtmlElements(value)
);
}
```
The third RegExp in this function seems to be the offender: `/([a-zA-Z][a-zA-Z-]+[a-zA-Z])+\.\w+/`. So let us dissect it.
The `\.\w+` at the end is a literal `.` followed by _one or more_ word characters.
Because we actually do not care how _long_ the string of word characters at the end is, we simply can reduce it to: `\w`.
Now let's look at the capturing group: `([a-zA-Z][a-zA-Z-]+[a-zA-Z])+`.
One thing that makes it easier to parse Regexp: substitute a class of characters with a simple character.
So let us substitute `a-zA-Z` with `x`.
Now we have: `(x[x-]+x)+`.
The capturing group inside the brackets match strings starting with an `x`, followed by an amount of x and hyphens, and ending with an `x`.
The shortest strings matching are:
```
xxx
x-x
xxxx
xx-x
x-xx
x--x
```
The capture group can be repeated one or more times. So for example: `x-x` followed by `x-x-xx` => `x-xx-x-xx`.
But no matter the combination of two or more of those matching strings, the original regex `x[x-]+x` in matches all of them as well, which makes the repetition unnecessary and expensive.
So let's simplify it:
```js
/([a-zA-Z][a-zA-Z-]+[a-zA-Z])+\.\w+/ // | \w and \w+ are equivalent here, because we do not care how many word chars come after the first one
/([a-zA-Z][a-zA-Z-]+[a-zA-Z])+\.\w/ // | Let's substitute `[a-zA-Z]` with `x`
/(x[x-]+x)+\.\w/ // | (x[x-]+x)+ and (x[x-]+x) are producing the same results here here
/(x[x-]+x)\.\w/ // | Let's substitute `x` with `[a-zA-Z]` and remove the unnecessary capturing
/[a-zA-Z][a-zA-Z-]+[a-zA-Z]\.\w/
```
### Results
`hyperfine --runs 3 "node ./checkAllStrings.js"` reports:
````
Benchmark 1: node ./checkAllStrings.js
Time (mean ± σ): 25.922 s ± 0.181 s [User: 23.802 s, System: 2.670 s]
Range (min … max): 25.726 s … 26.085 s 3 runs```
````
And 6 of our 13 strings are not timing out any longer. The next offender is `isKebabCase`. Let's have a look.
## `isKebabCase`
```js
function isKebabCase(value) {
return (
/^([a-zA-Z0-9]+-[a-zA-Z0-9-]*[\s,]*)+$/.test(value) ||
/^([a-z0-9]+-[a-z0-9-]*)+(\s[a-z0-9-]*)*/.test(value) ||
/(\s[a-z0-9-]*)*([a-z0-9]+-[a-z0-9-]*)+$/.test(value)
);
}
```
The first regex is the offender here, but let us have a look at the last two first.
`^([a-z0-9]+-[a-z0-9-]*)+(\s[a-z0-9-]*)*` matches strings which starting with one or more `([a-z0-9]+-[a-z0-9-]*)`, followed by zero or more of `(\s[a-z0-9-]*)`.
Because we are not checking the complete string (missing `$` at the end), the latter is actually redundant, because for example: `a-b NONO` still tests true, despite `NONO` not matching that optional group.
If we substitute `[a-z0-9]` again with x, we get: `^(x+-[x-]*)+` for the first group.
As one match of the group, at the start of the string is enough, we can remove the capturing group and end up with `^x+-[x-]*`.
Now the last part of that group is also redundant, because we accept zero matches.
```js
/^([a-z0-9]+-[a-z0-9-]*)+(\s[a-z0-9-]*)*/ // | `(\s[a-z0-9-]*)*` is redundant, because the test will succeed whether it is there or not
/^([a-z0-9]+-[a-z0-9-]*)+/ // | Let's substitute `[a-z0-9]` with `x`
/^(x+-[x-]*)+/ // | One match of the group is actually enough, so we can get rid of the capturing and +
/^x+-[x-]*/ // | Similar to our first step, `[x-]*` became redundant, because zero matches are also fine.
/^x+-/ // | Let's substitute `[a-z0-9]` back in
/^[a-z0-9]+-/
```
Let's do the same for the last regex:
```js
/(\s[a-z0-9-]*)*([a-z0-9]+-[a-z0-9-]*)+$/ // | Again, `(\s[a-z0-9-]*)*` is redundant, because the test will succeed whether it is there or not
/([a-z0-9]+-[a-z0-9-]*)+$/ // | Let's substitute `[a-z0-9]` with `x`
/(x+-[x-]*)+$/ // | One match of the group is actually enough, so we can get rid of the capturing and +
/x+-[x-]*$/ // | Let's substitute `[a-z0-9]` back in
/[a-z0-9]+-[a-z0-9-]*$/
```
Now let us focus on the "beast" `/^([a-zA-Z0-9]+-[a-zA-Z0-9-]*[\s,]*)+$/`.
What is so interesting about this one? The `[\s,]*` bit is an optional list of whitespace characters and commas.
Essentially we want to check whether a list of values separated by those characters matches the rest.
So conceptually it isn't too different from:
```js
value.split(/[\s,]+/).every((item) => /^([a-zA-Z0-9]+-[a-zA-Z0-9-]*)+$/.test(item));
```
One caveat here: e.g. `foo-bar ,` would fail now because the split method actually would have the empty string `''` as the last value of the array.
However, we can build a little helper function:
```js
// Hey, I am actually doing splitAndTrim, but who cares.
function trimAndSplit(value, separator, trimLeft = true, trimRight = true) {
const list = value.split(separator);
if (trimRight && list[list.length - 1] === '') {
list.pop();
}
if (trimLeft && list[0] === '') {
list.shift();
}
return list;
}
// Aaand
trimAndSplit(value, /[\s,]+/, false, true).every((item) =>
/^[a-zA-Z0-9]+-[a-zA-Z0-9-]*$/.test(item),
);
```
Nice. Now if we look at the regex, our exercise will let us simplify the thing again, back to: `/^[a-zA-Z0-9]+-[a-zA-Z0-9-]*$/`.
Let's combine all the things, and now we have:
```js
function isKebabCase(value) {
return (
/^[a-z0-9]+-/.test(value) ||
/[a-z0-9]+-[a-z0-9-]*$/.test(value) ||
trimAndSplit(value, /[\s,]+/, false, true).every((item) =>
/^[a-zA-Z0-9]+-[a-zA-Z0-9-]*$/.test(item),
)
);
}
```
This solves the remaining checks which used to time out and hyperfine reports:
```bash
$ hyperfine --runs 3 "node ./checkAllStrings.js"
Benchmark 1: node ./checkAllStrings.js
Time (mean ± σ): 15.141 s ± 0.015 s [User: 13.198 s, System: 2.514 s]
Range (min … max): 15.127 s … 15.158 s 3 runs
```
As we have no checks timing out any longer, also add a `--no-timeout` arg to checkAllStrings, as the timeout guard comes with a significant performance overhead.
```bash
$ hyperfine "node ./checkAllStrings.js --no-timeout"
Time (mean ± σ): 320.3 ms ± 0.8 ms [User: 324.3 ms, System: 10.5 ms]
Range (min … max): 319.0 ms … 321.4 ms 10 runs
```
What used to take _minutes_, now takes just 320 ms.
Eslint inside the GitLab project has finished on my machine and produced the following results:
- `isNonLocalizedStringValue` now takes 197ms instead of 2:15 minutes (0.06% vs 29%) of the total run time
- In CI the execution time went down from ~1400s to ~940s, which is about a 30% improvement and aligns with the results from the profiling.
For now, this is such a big improvement, that we can move forward with getting this merged and save loads of time and cpu cycles.