longestrepeatedstrings
Version:
Finds duplicated text strings and generates a report about the longest substrings or most frequent words in supplied text
197 lines (143 loc) ⢠7.62 kB
Markdown
[](#)
Longest Repeated Strings
========================
Finds ***duplicated text*** and generates a report about the ***longest substrings*** or
***most frequent words*** in supplied text, weighted by how much space the string
takes up overall (length * occurences).
> You supply input text or files. It returns raw data or a text report.
š§µ [Try an online demo](http://braksator.github.io/lrs)
(This module was designed to analyze javascript code for refactoring opportunities in a Gulp task)
## Stand-alone usage
See online demo link above, or [download project zip file](https://github.com/braksator/LongestRepeatedStrings/releases/) and open index.html to use the GUI.
## Installation
This is a Node.JS module available from the Node Package Manager (NPM).
https://www.npmjs.com/package/longestrepeatedstrings
Here's the command to download and install from NPM:
`npm install longestrepeatedstrings -S`
or with Yarn:
`yarn add longestrepeatedstrings`
## Usage
Include Longest Repeated Strings in your project:
```javascript
var LRS = require('longestrepeatedstrings');
```
### Finding Repeated Substrings in Text
You can analyze a single text by using the `text` function to find the longest repeated substrings:
```javascript
const text = 'Your text content goes here';
const results = LRS.text(text, { maxRes: 20, minLen: 8 });
console.log(results);
```
**Parameters**:
- `text` (String): The input text to analyze.
- `opts` (Object, optional): A configuration object with the following properties:
- `maxRes` (Number, default: 50): The maximum number of results to return.
Restricts the final list to highest scoring results and does not speed up processing.
- `minLen` (Number, default: `4`): The minimum length of substrings to consider.
- `maxLen` (Number, default: `40`): The maximum length of substrings to consider.
- `minOcc` (Number, default: `2`): The minimum number of occurrences a substring must have to be included.
- `penalty` (Number, default: `0`): Per-occurence score penalty, helps order results for deduplication. Setting to the same
- `split` (Array, default: `[' ', ',', '.', '\n']`): Splits input after specified strings. If not using the `words` and `clean`
options, settings THIS up properly for expected input will be key to making this module effective.
- `break` (Array, default: `[]`): Splits input ON these strings and won't include them in matches.
Can be used to concatenate an array of texts with a special char.
- `escSafe` (Boolean, default: `true`): Will take extra care around escaped characters. May as well leave this on.
- `words` (Boolean, default: `true`): If `true`, matches only whole words.
- `clean` (Boolean, default: `false`): If `true`, strips all symbols from input.
- `trim` (Boolean, default: `true`): If `true`, trims white space from results.
- `omit` (Array, default: `[]`): An array of substrings to omit from the results. Can be used to ignore accepted long/frequent words.
as `minLen`, for example, will cause longer substrings to appear earlier in the results. Negative penalty will favor more frequent substrings.
**Returns**: An array of objects containing the repeated substrings, their count, and a score for each.
### Analyzing Files
You can analyze multiple files by using the `files` function. This will read the contents of the files and find repeated substrings in each one.
```javascript
const fs = require('fs');
const files = ['file1.txt', 'file2.txt'];
const results = LRS.files(files, opts);
console.log(results);
```
**Parameters**:
- `files` (Array): An array of file paths to analyze.
- `opts` (Object, optional): Same options as in the `text` function.
**Returns**: An object where the keys are file names and the values are the repeated substrings found in each file.
### Creating Reports
#### File Analysis Report
```javascript
const report = LRS.filesReport(results, 1); // Pass `1` to log to console
console.log(report);
```
**Parameters**:
- `results` (Object): The results returned by the `files` function.
- `out` (Number, optional, default: `0`): If set to `1`, the report will be logged to the console too.
- `chars` (Object, optional): A configuration object with the following properties:
- `delim` (String, default: 'ā
'): Character/s to insert between each result.
- `open` (String, default: 'ā¦
'): Character/s to insert before the repeat count.
- `close` (String, default: 'Ćā¦'): Character/s to insert after the repeat count.
**Returns**: A text report summarizing the repeated substrings found in each file.
#### Text Analysis Report
```javascript
const report = LRS.textReport(results, 1); // Pass `1` to log to console
console.log(report);
```
**Parameters**:
- `results` (Array): The results returned by the `text` function.
- `out` (Number, optional, default: `0`): If set to `1`, the report will be logged to the console too.
- `chars` (Object, optional): Same options as in the `filesReport` function.
**Returns**: A list of repeated substrings with their occurrence counts.
### Example Workflow
1. Either, analyze a single text or multiple files:
```javascript
const text = 'This is an example text with repeated substrings';
const results = LRS.text(text);
```
or
```javascript
const files = ['file1.txt', 'file2.txt'];
const results = LRS.files(files);
```
2. Afterward, generate a report:
```javascript
const report = LRS.filesReport(results, 1); // Logs the report to console
```
### Notes
- Results are sorted by a score, which is calculated based on the length of the substring and the number of occurrences.
- This package is used in [JCrush](https://www.npmjs.com/package/jcrush); a Javascript code deduplicator.
## Gulp usage
In your `gulpfile.mjs`, use **Longest Repeated Strings** as a Gulp plugin:
#### Step 1: Import **Longest Repeated Strings**
```javascript
import LRS from 'longestrepeatedstrings';
```
#### Step 2: Create a Gulp Task for Longest Repeated Strings
```javascript
var analyzeStrings = true;
gulp.task('analyze', function (done) {
if (analyzeStrings) {
LRS.filesReport(LRS.files(['./script.min.js', './styles.min.css', './index.html'], {
clean: 1, words: 1,
omit: [
// This is a list of words that we just accept we've used a lot in the
// content, and we don't need to see them appear in repeated-strings
// reports. (supply all with lower-case)
'consciousness', 'enlightenment', 'ephemeral', 'watching', 'observing',
'communication', 'inspiring', 'realizing', 'uplifting', 'illusion',
],
}), 1, {delim: ", "});
analyzeStrings = false;
}
setTimeout(() => {analyzeStrings = true}, 1000 * 60 * 60); // Only run once an hour.
done(); // Signal completion
});
```
#### Step 3: Run **Longest Repeated Strings** After Minification
To run **Longest Repeated Strings** after your minification tasks, add Longest Repeated Strings in series after other tasks, such as in this example:
```javascript
gulp.task('default', gulp.series(
gulp.parallel('minify-css', 'minify-js', 'minify-html'), // Run your minification tasks first
'analyze' // Then run LRS
));
```
## Contributing
https://github.com/braksator/LongestRepeatedStrings
In lieu of a formal style guide, take care to maintain the existing coding
style.