base32768
Version:
Binary-to-text encoding highly optimised for UTF-16
159 lines (137 loc) • 6.17 kB
Markdown
# base32768
Base32768 is a binary encoding optimised for UTF-16-encoded text. This JavaScript module, `base32768`, is the first implementation of this encoding.
The efficiency chart speaks for itself. Efficiency ratings are averaged over long inputs. Higher is better.
<table>
<thead>
<tr>
<th colspan="2" rowspan="2">Encoding</th>
<th colspan="3">Efficiency</th>
<th rowspan="2">Bytes per Tweet *</th>
</tr>
<tr>
<th>UTF‑8</th>
<th>UTF‑16</th>
<th>UTF‑32</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="5">ASCII‑constrained</td>
<td>Unary / <a href="https://github.com/ferno/base1">Base1</a></td>
<td style="text-align: right;">0%</td>
<td style="text-align: right;">0%</td>
<td style="text-align: right;">0%</td>
<td style="text-align: right;">1</td>
</tr>
<tr>
<td>Binary</td>
<td style="text-align: right;">13%</td>
<td style="text-align: right;">6%</td>
<td style="text-align: right;">3%</td>
<td style="text-align: right;">35</td>
</tr>
<tr>
<td>Hexadecimal</td>
<td style="text-align: right;">50%</td>
<td style="text-align: right;">25%</td>
<td style="text-align: right;">13%</td>
<td style="text-align: right;">140</td>
</tr>
<tr>
<td>Base64</td>
<td style="text-align: right;"><strong>75%</strong></td>
<td style="text-align: right;">38%</td>
<td style="text-align: right;">19%</td>
<td style="text-align: right;">210</td>
</tr>
<tr>
<td>Base85 †</td>
<td style="text-align: right;">80%</td>
<td style="text-align: right;">40%</td>
<td style="text-align: right;">20%</td>
<td style="text-align: right;">224</td>
</tr>
<tr>
<td rowspan="4">BMP‑constrained</td>
<td><a href="https://github.com/ferno/hexagram-encode">HexagramEncode</a></td>
<td style="text-align: right;">25%</td>
<td style="text-align: right;">38%</td>
<td style="text-align: right;">19%</td>
<td style="text-align: right;">105</td>
</tr>
<tr>
<td><a href="https://github.com/ferno/braille-encode">BrailleEncode</a></td>
<td style="text-align: right;">33%</td>
<td style="text-align: right;">50%</td>
<td style="text-align: right;">25%</td>
<td style="text-align: right;">140</td>
</tr>
<tr>
<td><a href="https://github.com/qntm/base2048">Base2048</a></td>
<td style="text-align: right;">56%</td>
<td style="text-align: right;">69%</td>
<td style="text-align: right;">34%</td>
<td style="text-align: right;"><strong>385</strong></td>
</tr>
<tr>
<td><a href="https://github.com/ferno/base32768">Base32768</a></td>
<td style="text-align: right;">63%</td>
<td style="text-align: right;"><strong>94%</strong></td>
<td style="text-align: right;">47%</td>
<td style="text-align: right;">263</td>
</tr>
<tr>
<td rowspan="3">Full Unicode</td>
<td><a href="https://github.com/keith-turner/ecoji">Ecoji</a></td>
<td style="text-align: right;">31%</td>
<td style="text-align: right;">31%</td>
<td style="text-align: right;">31%</td>
<td style="text-align: right;">175</td>
</tr>
<tr>
<td><a href="https://github.com/ferno/base65536">Base65536</a></td>
<td style="text-align: right;">56%</td>
<td style="text-align: right;">64%</td>
<td style="text-align: right;"><strong>50%</strong></td>
<td style="text-align: right;">280</td>
</tr>
<tr>
<td><a href="https://github.com/ferno/base131072">Base131072</a> ‡</td>
<td style="text-align: right;">53%+</td>
<td style="text-align: right;">53%+</td>
<td style="text-align: right;">53%</td>
<td style="text-align: right;">297</td>
</tr>
</tbody>
</table>
\* New-style "long" Tweets, up to 280 Unicode characters give or take Twitter's complex "weighting" calculation.<br/>
† Base85 is listed for completeness but all variants use characters which are considered hazardous for general use in text: escape characters, brackets, punctuation *etc.*.<br/>
‡ Base131072 is a work in progress, not yet ready for general use.<br/>
Base32768 uses only ["safe" Unicode code points](https://qntm.org/safe) - no unassigned code points, no whitespace, no control characters, etc..
## Installation
```bash
npm install base32768
```
## Usage
```js
import { encode, decode } from 'base32768'
const uint8Array = new Uint8Array([104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100])
const str = encode(uint8Array)
console.log(str)
// 6 code points, '媒腻㐤┖ꈳ埳'
const uint8Array2 = decode(str)
console.log(uint8Array2)
// [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
```
## API
### base32768.encode(uint8Array)
Encodes a [`Uint8Array`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint8Array) and returns a Base32768 `String`. Note that every Node.js [`Buffer`](https://nodejs.org/docs/latest/api/buffer.html#buffer_buffers_and_typedarray) is a `Uint8Array`.
The string is suitable for passing safely through almost any "Unicode-clean" text-handling API. This string contains no special characters and is immune to Unicode normalization. Give or take some padding characters, the output string has 1 character per 15 bits of input.
All characters are chosen from the Basic Multilingual Plane. This means that when encoded as UTF-16, all characters occupy 16 bits. Thus, there are 16 bits of output UTF-16 text per 15 bits of input, an efficiency of 93.75%.
### base32768.decode(str)
Decodes a Base32768 `String` and returns a `Uint8Array` containing the original binary data. Note that a `Uint8Array` can be converted to a Node.js `Buffer` like so:
```js
const buffer = Buffer.from(uint8Array.buffer, uint8Array.byteOffset, uint8Array.byteLength)
```
## License
MIT