[![npm](https://img.shields.io/npm/dt/longestrepeatedstrings.svg)](#)

Longest Repeated Strings
========================

Finds ***duplicated text*** and generates a report about the ***longest substrings*** or ***most frequent words*** in supplied text, weighted by how much space the string takes up overall (length * occurrences).

> You supply input text or files. It returns raw data or a text report.

🧵 [Try an online demo](http://braksator.github.io/lrs)

(This module was designed to analyze JavaScript code for refactoring opportunities in a Gulp task.)

## Stand-alone usage

See the online demo link above, or [download the project zip file](https://github.com/braksator/LongestRepeatedStrings/releases/) and open index.html to use the GUI.

## Installation

This is a Node.js module available from the Node Package Manager (NPM).

https://www.npmjs.com/package/longestrepeatedstrings

Here's the command to download and install from NPM:

`npm install longestrepeatedstrings -S`

or with Yarn:

`yarn add longestrepeatedstrings`

## Usage

Include Longest Repeated Strings in your project:

```javascript
var LRS = require('longestrepeatedstrings');
```

### Finding Repeated Substrings in Text

You can analyze a single text by using the `text` function to find the longest repeated substrings:

```javascript
const text = 'Your text content goes here';
const results = LRS.text(text, { maxRes: 20, minLen: 8 });
console.log(results);
```

**Parameters**:
- `text` (String): The input text to analyze.
- `opts` (Object, optional): A configuration object with the following properties:
  - `maxRes` (Number, default: `50`): The maximum number of results to return. Restricts the final list to the highest-scoring results and does not speed up processing.
  - `minLen` (Number, default: `4`): The minimum length of substrings to consider.
  - `maxLen` (Number, default: `40`): The maximum length of substrings to consider.
  - `minOcc` (Number, default: `2`): The minimum number of occurrences a substring must have to be included.
  - `penalty` (Number, default: `0`): Per-occurrence score penalty, helps order results for deduplication. Setting it to the same value as `minLen`, for example, will cause longer substrings to appear earlier in the results. A negative penalty will favor more frequent substrings.
  - `split` (Array, default: `[' ', ',', '.', '\n']`): Splits input after the specified strings. If you are not using the `words` and `clean` options, setting this up properly for the expected input is key to making this module effective.
  - `break` (Array, default: `[]`): Splits input ON these strings and won't include them in matches. Can be used to concatenate an array of texts with a special char.
  - `escSafe` (Boolean, default: `true`): Will take extra care around escaped characters. May as well leave this on.
  - `words` (Boolean, default: `true`): If `true`, matches only whole words.
  - `clean` (Boolean, default: `false`): If `true`, strips all symbols from input.
  - `trim` (Boolean, default: `true`): If `true`, trims white space from results.
  - `omit` (Array, default: `[]`): An array of substrings to omit from the results. Can be used to ignore accepted long/frequent words.

**Returns**: An array of objects containing the repeated substrings, their count, and a score for each.

### Analyzing Files

You can analyze multiple files by using the `files` function. This will read the contents of the files and find repeated substrings in each one.

```javascript
const files = ['file1.txt', 'file2.txt'];
const opts = { maxRes: 20, minLen: 8 }; // Same options as the `text` function
const results = LRS.files(files, opts);
console.log(results);
```

**Parameters**:
- `files` (Array): An array of file paths to analyze.
- `opts` (Object, optional): Same options as in the `text` function.

**Returns**: An object where the keys are file names and the values are the repeated substrings found in each file.
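
When the texts are already in memory rather than on disk, the `break` option described earlier suggests another route: join the texts with a special character and analyze them as one input. The following is a minimal sketch of that idea, not code from the module. The helper name `joinForAnalysis` and the choice of `'\u0000'` as separator are assumptions made for illustration; any string that never occurs in your input should serve.

```javascript
// Hypothetical helper (not part of the module): joins several texts with a
// separator and returns the combined input plus matching options for LRS.text.
function joinForAnalysis(texts, sep = '\u0000') {
  // The separator must not occur in any text, or it would split that text too.
  for (const t of texts) {
    if (t.includes(sep)) {
      throw new Error('Separator occurs in input; choose a different one');
    }
  }
  return {
    input: texts.join(sep),
    opts: { break: [sep] }, // `break` splits ON the separator and excludes it from matches
  };
}

const { input, opts } = joinForAnalysis([
  'first snippet of text',
  'second snippet of text',
]);

// With the package installed, the combined input could then be analyzed:
// const LRS = require('longestrepeatedstrings');
// console.log(LRS.text(input, opts));
```

Note the difference in intent: `files` reports on each file separately, while a joined input is analyzed as a single text.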

### Creating Reports

#### File Analysis Report

```javascript
const report = LRS.filesReport(results, 1); // Pass `1` to log to console
console.log(report);
```

**Parameters**:
- `results` (Object): The results returned by the `files` function.
- `out` (Number, optional, default: `0`): If set to `1`, the report will be logged to the console too.
- `chars` (Object, optional): A configuration object with the following properties:
  - `delim` (String, default: '★'): Character/s to insert between each result.
  - `open` (String, default: '⦅'): Character/s to insert before the repeat count.
  - `close` (String, default: '×⦆'): Character/s to insert after the repeat count.

**Returns**: A text report summarizing the repeated substrings found in each file.

#### Text Analysis Report

```javascript
const report = LRS.textReport(results, 1); // Pass `1` to log to console
console.log(report);
```

**Parameters**:
- `results` (Array): The results returned by the `text` function.
- `out` (Number, optional, default: `0`): If set to `1`, the report will be logged to the console too.
- `chars` (Object, optional): Same options as in the `filesReport` function.

**Returns**: A list of repeated substrings with their occurrence counts.

### Example Workflow

1. Analyze either a single text or multiple files:

   ```javascript
   const text = 'This is an example text with repeated substrings';
   const results = LRS.text(text);
   ```

   or

   ```javascript
   const files = ['file1.txt', 'file2.txt'];
   const results = LRS.files(files);
   ```

2. Afterward, generate a report with the function that matches the results: `textReport` for results from `text`, `filesReport` for results from `files`.

   ```javascript
   // For results from LRS.text, call LRS.textReport(results, 1) instead.
   const report = LRS.filesReport(results, 1); // Logs the report to console
   ```

### Notes

- Results are sorted by a score, which is calculated based on the length of the substring and the number of occurrences.
- This package is used in [JCrush](https://www.npmjs.com/package/jcrush), a JavaScript code deduplicator.
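
To make the scoring note concrete, here is a small self-contained sketch of the "length * occurrences" weighting. It is illustrative only and is not the module's source. In particular, the way `penalty` enters the formula (subtracted from the length once per occurrence) is an assumption based on the option's description, and the names `score`, `rank`, and `candidates` are made up for this example.

```javascript
// Illustrative only: this is not the module's actual scoring code.
function score(length, count, penalty = 0) {
  // Assumption: the penalty is subtracted once per occurrence,
  // i.e. score = length * count - penalty * count.
  return (length - penalty) * count;
}

// Made-up sample data: one short frequent string, one long rarer string.
const candidates = [
  { text: 'function', count: 10 },        // 8 chars, 10 occurrences
  { text: 'addEventListener', count: 4 }, // 16 chars, 4 occurrences
];

// Returns a new array with scores attached, highest score first.
function rank(items, penalty = 0) {
  return items
    .map(c => ({ ...c, score: score(c.text.length, c.count, penalty) }))
    .sort((a, b) => b.score - a.score);
}

console.log(rank(candidates));     // no penalty
console.log(rank(candidates, 4));  // positive penalty
console.log(rank(candidates, -4)); // negative penalty
```

Under this assumed formula, the shorter but more frequent string ranks first with no penalty, a positive penalty moves the longer string ahead, and a negative penalty favors the frequent one again, which matches the behavior the `penalty` option describes.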

## Gulp usage

In your `gulpfile.mjs`, use **Longest Repeated Strings** as a Gulp plugin:

#### Step 1: Import **Longest Repeated Strings**

```javascript
import LRS from 'longestrepeatedstrings';
```

#### Step 2: Create a Gulp Task for Longest Repeated Strings

```javascript
var analyzeStrings = true;
gulp.task('analyze', function (done) {
  if (analyzeStrings) {
    LRS.filesReport(LRS.files(['./script.min.js', './styles.min.css', './index.html'], {
      clean: 1,
      words: 1,
      omit: [
        // This is a list of words that we just accept we've used a lot in the
        // content, and we don't need to see them appear in repeated-strings
        // reports. (supply all with lower-case)
        'consciousness', 'enlightenment', 'ephemeral', 'watching', 'observing',
        'communication', 'inspiring', 'realizing', 'uplifting', 'illusion',
      ],
    }), 1, {delim: ", "});
    analyzeStrings = false;
  }
  setTimeout(() => {analyzeStrings = true}, 1000 * 60 * 60); // Only run once an hour.
  done(); // Signal completion
});
```

#### Step 3: Run **Longest Repeated Strings** After Minification

To run **Longest Repeated Strings** after your minification tasks, add it in series after the other tasks, as in this example:

```javascript
gulp.task('default', gulp.series(
  gulp.parallel('minify-css', 'minify-js', 'minify-html'), // Run your minification tasks first
  'analyze' // Then run LRS
));
```

## Contributing

https://github.com/braksator/LongestRepeatedStrings

In lieu of a formal style guide, take care to maintain the existing coding style.