bpe-tokenizer-encoding
# bpe-tokenizer
> 🔤 A simple JavaScript implementation of Byte Pair Encoding (BPE) for tokenizing text into subword units.
## 🚀 Install
```bash
npm install bpe-tokenizer
```
## 📆 Usage
### In Node.js:
```js
const { bytePairEncoding } = require('bpe-tokenizer');
const tokens = bytePairEncoding("low_low_lower", 5, true);
console.log("Tokens:", tokens);
```
### Output:
```
Step 1: Merged 'l o' -> lo w _ lo w _ lo w e r
Step 2: Merged 'lo w' -> low _ low _ low e r
Step 3: Merged 'low _' -> low_ low_ low e r
Step 4: Merged 'e r' -> low_ low_ low er
Step 5: Merged 'low_ low_' -> low_low_ low er
Tokens: [ 'low_low_', 'low', 'er' ]
```
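The merge steps logged above can be approximated with a short sketch of the BPE loop. This is not the package's source: `sketchBpe` is a hypothetical name, and the tie-breaking rule (the first pair seen wins) is an assumption, so steps where several pairs share the same count may come out in a different order than in the log.

```javascript
// Illustrative sketch only; `sketchBpe` is a hypothetical name, not part of the package.
function sketchBpe(text, numMerges = 10) {
  // Start from individual characters.
  let tokens = Array.from(text);

  for (let step = 0; step < numMerges; step++) {
    // Count every adjacent pair of tokens.
    const counts = new Map();
    for (let i = 0; i < tokens.length - 1; i++) {
      const key = JSON.stringify([tokens[i], tokens[i + 1]]);
      counts.set(key, (counts.get(key) || 0) + 1);
    }
    if (counts.size === 0) break; // fewer than two tokens left

    // Pick the most frequent pair; on ties the first-seen pair wins (assumption).
    let bestKey = null;
    let bestCount = 0;
    for (const [key, count] of counts) {
      if (count > bestCount) {
        bestKey = key;
        bestCount = count;
      }
    }
    const [left, right] = JSON.parse(bestKey);

    // Replace each non-overlapping occurrence of the pair with the merged token.
    const merged = [];
    for (let i = 0; i < tokens.length; i++) {
      if (i < tokens.length - 1 && tokens[i] === left && tokens[i + 1] === right) {
        merged.push(left + right);
        i++; // skip the right half of the merged pair
      } else {
        merged.push(tokens[i]);
      }
    }
    tokens = merged;
  }
  return tokens;
}

console.log(sketchBpe("low_low_lower", 3));
// [ 'low_', 'low_', 'low', 'e', 'r' ]
```

With three merges the sketch reaches the same state as Step 3 in the log, since those steps do not depend on how later ties are resolved.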
## 🔧 CLI Usage
If you installed the package globally or linked it locally:
```bash
npx bpe-tokenizer "low_low_lower" 5
```
### Parameters:
- First argument: input string (e.g. `"low_low_lower"`)
- Second argument (optional): number of merges (default = 10)
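As a rough illustration of how these two arguments might be read from the command line, here is a minimal sketch. `parseCliArgs` is a hypothetical helper and the error messages are invented for the example; the package's actual `cli.js` may handle arguments differently.

```javascript
// Hypothetical helper; the real cli.js may parse arguments differently.
function parseCliArgs(argv) {
  // argv[0] is the node binary and argv[1] is the script path.
  const [text, mergesArg] = argv.slice(2);
  if (text === undefined) {
    throw new Error('Usage: bpe-tokenizer <text> [numMerges]');
  }
  // Fall back to the documented default of 10 when the second argument is absent.
  const numMerges = mergesArg === undefined ? 10 : Number.parseInt(mergesArg, 10);
  if (!Number.isInteger(numMerges) || numMerges < 0) {
    throw new Error(`numMerges must be a non-negative integer, got "${mergesArg}"`);
  }
  return { text, numMerges };
}

console.log(parseCliArgs(['node', 'cli.js', 'low_low_lower', '5']));
// { text: 'low_low_lower', numMerges: 5 }
```

In a real script the function would be called with `process.argv`, and the result passed on to `bytePairEncoding`.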
## 📚 API
### `bytePairEncoding(text: string, numMerges: number = 10, verbose: boolean = false): string[]`
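- **text**: Input string to tokenize
- **numMerges**: Number of BPE merge iterations to perform (default `10`)
- **verbose**: Whether to log each merge step (default `false`)
- **Returns**: Array of the resulting tokens (`string[]`)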
## 🛠️ Development
### Run locally:
```bash
npm install
node cli.js "low_low_lower" 5
```
## 📃 License
MIT © 2025 MOHD RAZA