leo-profanity
Profanity filter based on the "Shutterstock" dictionary. [Demo page](https://jojoee.github.io/leo-profanity/), [API document page](https://jojoee.github.io/leo-profanity/doc/LeoProfanity.html)

Requires Node.js 18 or higher. TypeScript definitions are included.
```
// npm
npm install leo-profanity
npm install leo-profanity --no-optional
// yarn
yarn add leo-profanity
yarn add leo-profanity --ignore-optional
// Bower
bower install leo-profanity
// dictionary/default.json
// githack
<script src="https://raw.githack.com/jojoee/leo-profanity/master/src/index.js"></script>
const filter = LeoProfanity
filter.clearList()
filter.add(["boobs", "butt"])
```
```javascript
// supported languages: en, fr, ru
const filter = require('leo-profanity');
// output: I have ****, etc.
filter.clean('I have boob, etc.');
// replace the current dictionary with the French one
filter.loadDictionary('fr');
// create a new dictionary
filter.addDictionary('th', ['หนึ่ง', 'สอง', 'สาม', 'สี่', 'ห้า']);
```
```typescript
import filter from 'leo-profanity';
// Basic usage
filter.clean('I have boob'); // 'I have ****'
// With options object
filter.clean('I have boob', { replaceKey: '+', nbLetters: 2 }); // 'I have bo++'
```
Exclude specific words from being filtered:
```javascript
const filter = require('leo-profanity');
// Add words to whitelist
filter.addWhitelist('ass');
filter.addWhitelist(['boob', 'butt']);
// These words will not be filtered
filter.clean('I have ass'); // 'I have ass'
filter.check('I have ass'); // false
// Remove from whitelist
filter.removeWhitelist('ass');
// Clear all whitelisted words
filter.clearWhitelist();
// Get all whitelisted words
filter.getWhitelist(); // []
```
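One way to picture the whitelist is as a second set that overrides the word list. The sketch below is only a mental model, not the library's implementation: `blocklist`, `whitelist` and `isFiltered` are hypothetical names, and the word list is a made-up sample.

```javascript
// Hypothetical model: a word is filtered only when it is on the
// word list and has not been whitelisted.
const blocklist = new Set(['ass', 'boob', 'butt']); // sample words only
const whitelist = new Set();

function isFiltered(word) {
  return blocklist.has(word) && !whitelist.has(word);
}

console.log(isFiltered('ass')); // true
whitelist.add('ass');           // comparable to addWhitelist('ass')
console.log(isFiltered('ass')); // false
whitelist.delete('ass');        // comparable to removeWhitelist('ass')
console.log(isFiltered('ass')); // true
```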
The `clean()` method supports an options object:
```javascript
const filter = require('leo-profanity');
// Traditional positional parameters
filter.clean('I have boob', '+', 2); // 'I have bo++'
// Options object (recommended)
filter.clean('I have boob', { replaceKey: '+', nbLetters: 2 }); // 'I have bo++'
// Partial options
filter.clean('I have boob', { replaceKey: '-' }); // 'I have ----'
filter.clean('I have boob', { nbLetters: 2 }); // 'I have bo**'
```
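To make the two options concrete, here is a minimal sketch of how a single matched word could be masked. `maskWord` is a hypothetical helper, not part of the leo-profanity API, and the defaults (`'*'` and `0`) are assumptions inferred from the outputs shown above.

```javascript
// Hypothetical helper: keep the first `nbLetters` characters of the
// word and replace every remaining character with `replaceKey`.
function maskWord(word, replaceKey = '*', nbLetters = 0) {
  const kept = word.slice(0, nbLetters);
  const maskedLength = Math.max(word.length - nbLetters, 0);
  return kept + replaceKey.repeat(maskedLength);
}

console.log(maskWord('boob'));         // ****
console.log(maskWord('boob', '+', 2)); // bo++
console.log(maskWord('boob', '-'));    // ----
console.log(maskWord('boob', '*', 2)); // bo**
```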
See the [LeoProfanity documentation](https://jojoee.github.io/leo-profanity/doc/LeoProfanity.html) for more.

This project splits the problem into two parts, `Sanitize` and `Filter`.
The approaches considered for each part are listed below.
### Sanitize
```
Attempt 1 (1.1): Convert everything to lowercase
  Example:
  - "SomeThing" to "something"
  Advantage:
  - Simple to understand
  - Simple to implement
  Disadvantage or Caution:
  - Ignores case-sensitive words

Attempt 2 (1.2): Turn look-alike symbols into letters
  Example:
  - "@" to "a"
  - "5" or "$" to "s"
  - "@ss" to "ass"
  - "b00b" to "boob"
  - "a$$a$$in" to "assassin"
  Advantage:
  - Detects some trick words
  Disadvantage or Caution:
  - False positives
  - Subjective: it depends on how each person reads the symbol
  - Limits user creativity (users cannot play with words)
    e.g. "joe@ssociallife.com"
    e.g. a user wants to try something funny like "a$$a$$in"

Attempt 3 (1.3): Replace "." and "," with a space to separate words
  People often use "." and "," to join or end sentences
  Example:
  - "I like a55,b00b.t1ts" to "I like a55 b00b t1ts"
  Advantage:
  - Increases the chance of detection, e.g. "I like a55,b00b.t1ts"
  Disadvantage or Caution:
  - Splits tokens that belong together, e.g. "john.doe@gmail.com"
```
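The three sanitize attempts above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: `sanitize`, `symbolsToLetters` and the `LOOKALIKES` table are hypothetical names, and the table only covers the symbols used in the examples above.

```javascript
// Attempts 1.1 and 1.3 combined: lowercase the text, then turn
// "." and "," into spaces so that joined words are separated.
function sanitize(text) {
  return text.toLowerCase().replace(/[.,]/g, ' ');
}

// Attempt 1.2: map look-alike symbols to letters.
// Illustrative table only; a real one would be larger and subjective.
const LOOKALIKES = { '@': 'a', '5': 's', '$': 's', '0': 'o' };

function symbolsToLetters(text) {
  return text.replace(/[@5$0]/g, (ch) => LOOKALIKES[ch]);
}

console.log(sanitize('I like a55,b00b.t1ts'));        // i like a55 b00b t1ts
console.log(sanitize('john.doe@gmail.com'));          // john doe@gmail com
console.log(symbolsToLetters('a$$a$$in'));            // assassin
console.log(symbolsToLetters('joe@ssociallife.com')); // joeassociallife.com
```

The last two calls show the caution noted for attempt 1.2: the mapping rewrites harmless input such as an email address, which then contains a word the filter would flag.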
### Filter
```
Attempt 1 (2.1): Split into an array (or use a regex)
  Split the string on spaces into an array of words, then check each word against the profanity word list
  Example:
  - "I like ass boob" to ["I", "like", "ass", "boob"]
  Advantage:
  - Simple to implement
  Disadvantage:
  - Needs a proper list of profanity words
  - Some false positives, e.g. Great tit (https://en.wikipedia.org/wiki/Great_tit)

Attempt 2 (2.2): Match words inside the text (with or without spaces)
  Detect any run of characters that contains a profanity word
  Example:
  - "thistextisfunnyboobsanda55" contains the suspicious words "boobs" and "a55"
  Advantage:
  - Can detect un-spaced profanity words
  Disadvantage:
  - Many false positives, e.g. http://www.morewords.com/contains/ass/, the Clbuttic mistake (a filter mistake)
```
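The difference between the two filter attempts is easiest to see side by side. The sketch below assumes the text has already been sanitized; `checkBySplit`, `checkBySubstring` and the two-word list are hypothetical and only serve the comparison.

```javascript
const badWords = ['ass', 'boob']; // tiny illustrative list

// Attempt 2.1: split on whitespace and compare whole words.
function checkBySplit(text, words) {
  return text.split(/\s+/).some((token) => words.includes(token));
}

// Attempt 2.2: look for each word anywhere inside the text.
function checkBySubstring(text, words) {
  return words.some((word) => text.includes(word));
}

console.log(checkBySplit('i like ass boob', badWords));                // true
console.log(checkBySplit('thistextisfunnyboobsanda55', badWords));     // false
console.log(checkBySubstring('thistextisfunnyboobsanda55', badWords)); // true
// False positive of 2.2: "classic" contains "ass".
console.log(checkBySplit('a classic mistake', badWords));              // false
console.log(checkBySubstring('a classic mistake', badWords));          // true
```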
### In Summary
- We don't know every way a profanity word can be written
(e.g. how many different ways can you enter a55?)
- There is no algorithm-based approach that fully solves it (yet)
- People will always find a way to connect with each other
(e.g. [Leet](https://en.wikipedia.org/wiki/Leet))

**So, this project goes with 1.1, 1.3 and 2.1.**
(note: you can find the other attempts in the "Reference" section)
## CMD
```
npm run test.watch
npm run validate
npm run doc.generate
# test npm publish
npm publish --dry-run
# mutation test
npm install -g stryker-cli
stryker init
export STRYKER_DASHBOARD_API_KEY=<the_project_api_token>
echo $STRYKER_DASHBOARD_API_KEY
npx stryker run
```
- [x] JavaScript on [npmjs.com/package/leo-profanity](https://www.npmjs.com/package/leo-profanity)
- [x] PHP on [packagist.org/packages/jojoee/leo-profanity](https://packagist.org/packages/jojoee/leo-profanity)
- [x] Python on [pypi.org/project/leoprofanity](https://pypi.org/project/leoprofanity)
- [ ] Java on [Maven](https://maven.apache.org/)
- [ ] WordPress on [wordpress.org](https://wordpress.org/)
## Reference
- Inspired by [jwils0n/profanity-filter](https://github.com/jwils0n/profanity-filter)
- Algorithm / Discussion
  - ["similar-like" symbol to alphabet](http://stackoverflow.com/questions/24515/bad-words-filter#answer-24615)
  - [Replace Bad words using Regex](http://stackoverflow.com/questions/3342011/replace-bad-words-using-regex)
  - [Clbuttic](http://www.computerhope.com/jargon/c/clbuttic.htm)
  - [The Clbuttic Mistake](http://thedailywtf.com/articles/The-Clbuttic-Mistake-)
  - [The Clbuttic Mistake: When obscenity filters go wrong](http://www.telegraph.co.uk/news/newstopics/howaboutthat/2667634/The-Clbuttic-Mistake-When-obscenity-filters-go-wrong.html)
  - [Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?](https://blog.codinghorror.com/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea/)
  - [How do you implement a good profanity filter?](http://stackoverflow.com/questions/273516/how-do-you-implement-a-good-profanity-filter)
  - [The Untold History of Toontown’s SpeedChat (or BlockChat™ from Disney finally arrives)](http://habitatchronicles.com/2007/03/the-untold-history-of-toontowns-speedchat-or-blockchattm-from-disney-finally-arrives/)
  - [Profanity Filter Performance in Java](http://softwareengineering.stackexchange.com/questions/91177/profanity-filter-performance-in-java)
- Resource bad-word list
  - [Bad words list (458 words) by Alejandro U. Alvarez](https://urbanoalvarez.es/blog/2008/04/04/bad-words-list/)
  - DansGuardian - [dansguardian.org](http://dansguardian.org/), [DansGuardian Phraselists](http://contentfilter.futuragts.com/phraselists/)
  - [Seven dirty words](https://en.wikipedia.org/wiki/Seven_dirty_words)
  - [Shutterstock](https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words)
  - [MauriceButler/badwords](https://github.com/MauriceButler/badwords)
  - http://www.cs.cmu.edu/~biglou/resources/bad-words.txt
- Tool
  - [RegExr](http://regexr.com/)