string-punctuation-tokenizer
A small library that provides functions to tokenize a string into an array of words, with or without punctuation.
The `tokenize` function breaks a string of text into smaller contiguous strings called tokens.
### Quick Start
The default options cover the most common use case: extracting just the word tokens from a string.
```js
// In your own code, import from the package instead:
// import {tokenize} from 'string-punctuation-tokenizer';
// This documentation page uses a relative import:
import {tokenize} from '../tokenizers.js';
const text = `Hello, World!`;
const tokens = tokenize({text});
const output = JSON.stringify(tokens, null, 2);
// wrapped in a React fragment for rendering output:
<>
<p><strong>text:</strong> {text}</p>
<p><strong>{tokens.length} tokens:</strong></p>
<pre>{output}</pre>
</>
```
### Token Classifier
Tokenization is driven by an internal token classifier adapted from Borgar's tiny tokenizer: https://gist.github.com/borgar/451393
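That classifier works by trying each parser's regular expression against the remaining input, emitting the earliest match as a typed token, and typing any unmatched gap as `unknown`. A minimal self-contained sketch of the approach (simplified regexes; the library's internal parsers are more elaborate):

```javascript
// Minimal regex-driven token classifier, in the spirit of the tiny tokenizer.
// Each parser is tried against the remaining string; the earliest match wins.
const defaultParsers = {
  number: /\d+/,
  word: /\w+/,
  punctuation: /[^\w\s]/,
  whitespace: /\s+/,
};

function classify(text, parsers = defaultParsers) {
  const tokens = [];
  let s = text;
  while (s) {
    let best = null;
    let index = s.length;
    for (const type of Object.keys(parsers)) {
      const match = parsers[type].exec(s);
      if (match && match.index < index) {
        best = { token: match[0], type };
        index = match.index;
      }
    }
    if (index > 0) {
      // Text before the earliest match belongs to no parser.
      tokens.push({ token: s.slice(0, index), type: 'unknown' });
    }
    if (best) tokens.push(best);
    s = s.slice(index + (best ? best.token.length : 0));
  }
  return tokens;
}
```

With these default parsers, `classify('Hello, World!')` produces five tokens: word, punctuation, whitespace, word, punctuation.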
### Options
- `text {string}`: The string to be tokenized.
- `includeWords {bool}`: Include the word tokens in the output.
- `includeNumbers {bool}`: Include the number tokens in the output.
- `includePunctuation {bool}`: Include the punctuation tokens in the output.
- `includeWhitespace {bool}`: Include the whitespace tokens in the output.
- `includeUnknown {bool}`: Include tokens that match none of the parsers in the output.
- `greedy {bool}`: Use [Greedy Tokens](/#/Greedy%20Tokens); overridden by `parsers`.
- `verbose {bool}`: Output an array of token objects with their types.
- `occurrences {bool}`: Output `occurrence` and `occurrences` for each token; requires `verbose: true`.
- `parsers {object}`: Override the internal token classifier's regular expressions for `{word, number, punctuation, whitespace}`.
- `normalize {bool}`: Enable normalization.
- `normalizations {array}`: Override the internal normalization rules.
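The `occurrence`/`occurrences` pair tells you that a given token is, say, the 2nd of 3 identical tokens in the text. A minimal sketch of how verbose token objects could be annotated this way (`addOccurrences` is a hypothetical helper, not part of the library's API):

```javascript
// Annotate each token object with:
//   occurrence  - 1-based index of this instance among identical tokens
//   occurrences - total count of identical tokens in the array
function addOccurrences(tokens) {
  const totals = new Map();
  for (const t of tokens) totals.set(t.token, (totals.get(t.token) || 0) + 1);
  const seen = new Map();
  return tokens.map((t) => {
    seen.set(t.token, (seen.get(t.token) || 0) + 1);
    return { ...t, occurrence: seen.get(t.token), occurrences: totals.get(t.token) };
  });
}
```

For tokens `a`, `b`, `a`, the first `a` gets `occurrence: 1, occurrences: 2` and the second gets `occurrence: 2, occurrences: 2`.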
### Usage Example
Edit the options and watch the effect on the output.
```js
// import {tokenize} from 'string-punctuation-tokenizer';
import {tokenize, word, number, punctuation, whitespace} from '../tokenizers.js';
const text = `It's said that th\u200Dere are 1,000.00 different ways,\nto say...\t"I—Love—You." (r) ™`;
const options = {
text,
includeWords: true,
includeNumbers: true,
includePunctuation: true,
includeWhitespace: true,
includeUnknown: true,
greedy: true,
verbose: true,
occurrences: true,
//parsers: {word, number, punctuation, whitespace},
//parsers: {word: /\w+/, number: /\d+/, punctuation: /[\.,'"]/, whitespace: /\s+/},
normalize: true,
// normalizations: [
// { inputs: [/\((r|R)\)/g, '®'], output: 'Registered' },
// { inputs: [/\(tm\)/gi, '™'], output: 'Trademark' },
// ],
};
const tokens = tokenize(options);
const output = JSON.stringify(tokens, null, 2);
// wrapped in a React fragment for rendering:
<>
<p>{text}</p>
<p>{tokens.length} tokens:</p>
<pre>{output}</pre>
</>
```
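The commented-out `normalizations` entries above pair a list of `inputs` (regular expressions or strings) with a replacement `output`. A minimal sketch of how such substitution rules might be applied on top of Unicode NFC composition (an assumption for illustration; the library's built-in normalization may differ):

```javascript
// Compose the string to Unicode NFC, then apply each substitution rule.
// Regex inputs must carry the g flag, as replaceAll requires it.
function applyNormalizations(text, normalizations = []) {
  let out = text.normalize('NFC');
  for (const { inputs, output } of normalizations) {
    for (const input of inputs) {
      out = out.replaceAll(input, output);
    }
  }
  return out;
}
```

Applied before tokenization, rules like these let visually equivalent inputs such as `(r)` and `®` produce identical tokens.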
### Hebrew Cantillations Example
Edit the options and watch the effect on the output.
```js
// import {tokenize} from 'string-punctuation-tokenizer';
import {tokenize, word, number, punctuation, whitespace} from '../tokenizers.js';
const text = `
לׅׄוּלֵׅׄ֗אׅׄ
לׅׄוּלֵׅ֗ׄאׅׄ
לוּלֵא
`;
const options = {
text,
includeWords: true,
includeNumbers: true,
includePunctuation: true,
includeWhitespace: true,
includeUnknown: true,
greedy: true,
verbose: true,
occurrences: true,
//parsers: {word, number, punctuation, whitespace},
//parsers: {word: /\w+/, number: /\d+/, punctuation: /[\.,'"]/, whitespace: /\s+/},
normalize: true,
// normalizations: [
// { inputs: [/\((r|R)\)/g, '®'], output: 'Registered' },
// { inputs: [/\(tm\)/gi, '™'], output: 'Trademark' },
// ],
};
const tokens = tokenize(options);
const output = JSON.stringify(tokens, null, 2);
// wrapped in a React fragment for rendering:
<>
<p>{text}</p>
<p>{tokens.length} tokens:</p>
<pre>{output}</pre>
</>
```
### Greek Accents Example
Edit the options and watch the effect on the output.
```js
// import {tokenize} from 'string-punctuation-tokenizer';
import {tokenize, word, number, punctuation, whitespace} from '../tokenizers.js';
const text = `καὶ μὴ και μη`;
const options = {
text,
includeWords: true,
includeNumbers: true,
includePunctuation: true,
includeWhitespace: true,
includeUnknown: true,
greedy: true,
verbose: true,
occurrences: true,
//parsers: {word, number, punctuation, whitespace},
//parsers: {word: /\w+/, number: /\d+/, punctuation: /[\.,'"]/, whitespace: /\s+/},
normalize: true,
// normalizations: [
// { inputs: [/\((r|R)\)/g, '®'], output: 'Registered' },
// { inputs: [/\(tm\)/gi, '™'], output: 'Trademark' },
// ],
};
const tokens = tokenize(options);
const output = JSON.stringify(tokens, null, 2);
// wrapped in a React fragment for rendering:
<>
<p>{text}</p>
<p>{tokens.length} tokens:</p>
<pre>{output}</pre>
</>
```
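Both of the examples above depend on the word parser understanding letters and combining marks beyond ASCII. Overriding `parsers` with a naive `/\w+/` (as in the commented-out line) would fail here, because `\w` matches only ASCII word characters. A Unicode-aware pattern built with property escapes handles Greek accents and Hebrew cantillation marks:

```javascript
// \w+ is ASCII-only, so it cannot match Greek or Hebrew words at all.
const asciiWord = /\w+/;
// \p{L} matches any letter, \p{M} any combining mark (accents, cantillations).
const unicodeWord = /[\p{L}\p{M}]+/u;

console.log(asciiWord.test('καὶ'));      // false - no ASCII word characters
console.log(unicodeWord.exec('καὶ')[0]); // matches the whole accented word
```

This is the kind of pattern you would pass as the `word` entry of a custom `parsers` object when tokenizing non-Latin text.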