# multibyte

[![NPM Link](https://badgen.net/npm/v/multibyte?v=1.0.4)](https://npmjs.com/package/multibyte)
[![Language](https://badgen.net/static/language/TS?v=1.0.4)](https://github.com/search?q=repo:kensnyder/multibyte++language:TypeScript&type=code)
[![Build Status](https://github.com/kensnyder/multibyte/actions/workflows/workflow.yml/badge.svg?v=1.0.4)](https://github.com/kensnyder/multibyte/actions)
[![Code Coverage](https://codecov.io/gh/kensnyder/multibyte/branch/main/graph/badge.svg?v=1.0.4)](https://codecov.io/gh/kensnyder/multibyte)
[![Gzipped Size](https://badgen.net/bundlephobia/minzip/multibyte?label=minzipped&v=1.0.4)](https://bundlephobia.com/package/multibyte@1.0.4)
[![Dependency details](https://badgen.net/bundlephobia/dependency-count/multibyte?v=1.0.4)](https://www.npmjs.com/package/multibyte?activeTab=dependencies)
[![Tree shakeable](https://badgen.net/bundlephobia/tree-shaking/multibyte?v=1.0.4)](https://www.npmjs.com/package/multibyte)
[![ISC License](https://badgen.net/github/license/kensnyder/multibyte?v=1.0.4)](https://opensource.org/licenses/ISC)

multibyte provides common string functions that respect multibyte Unicode characters.

```bash
npm install multibyte
```

## The problem and the solution

JavaScript strings use UTF-16 encoding, and most built-in string methods treat a string as an array of 16-bit code units rather than code points. Unicode characters that take more than 2 bytes in UTF-16 (like newer emoji) are stored as a surrogate pair of 2 code units, and that pair gets split in many situations.

If you display Unicode text from a UTF-8 source, you need these multibyte functions, which take advantage of the fact that `Array.from(string)` iterates by code point instead of by code unit.
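The difference between code units and code points can be seen with built-ins alone. The following is a minimal sketch, not the library's source: the helper names `countCodePoints` and `charAtCodePoint` are hypothetical and exist only for illustration.

```js
// Illustrative only: these helpers are hypothetical, not multibyte's actual code.
const text = 'a🚀c';

// String.prototype.length counts UTF-16 code units; the rocket needs two.
const codeUnits = text.length;

// Array.from() uses the string iterator, which walks by code point,
// so the rocket stays in one piece.
const chars = Array.from(text);

// Count code points instead of code units.
function countCodePoints(str) {
  return Array.from(str).length;
}

// Return the code point at a code-point index, or '' when out of range.
function charAtCodePoint(str, index) {
  return Array.from(str)[index] ?? '';
}

console.log(codeUnits); // 4
console.log(chars.length); // 3
console.log(countCodePoints(text)); // 3
console.log(charAtCodePoint(text, 1) === '🚀'); // true
```

Note that iterating by code point is not the same as iterating by grapheme cluster: sequences built from several code points, such as emoji joined with a zero-width joiner, are still returned as separate elements by `Array.from`.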
```js
import {
  charAt,
  codePointAt,
  length,
  slice,
  split,
  truncateBytes,
} from 'multibyte';

// JavaScript String.prototype.charAt() can return a UTF-16 surrogate
'a🚀c'.charAt(1); // ❌ "\ud83d" (half a rocket)
charAt('a🚀c', 1); // ✅ "🚀"

// JavaScript String.prototype.codePointAt() can return a UTF-16 surrogate
'🚀abc'.codePointAt(1); // ❌ 56960 (low surrogate of the rocket emoji)
codePointAt('🚀abc', 1); // ✅ 97 (the letter a)

// JavaScript returns length in UTF-16 code units, not Unicode characters
'a🚀c'.length; // ❌ 4
length('a🚀c'); // ✅ 3

// JavaScript slices along UTF-16 boundaries, not Unicode characters
'a🚀cdef'.slice(2, 3); // ❌ "\ude80" (half a rocket)
slice('a🚀cdef', 2, 3); // ✅ "c"

// JavaScript splits along UTF-16 boundaries, not Unicode characters
'a🚀c'.split(''); // ❌ ["a", "\ud83d", "\ude80", "c"]
split('a🚀c', ''); // ✅ ["a", "🚀", "c"]

// JavaScript slices strings along UTF-16 boundaries, not Unicode characters
'a🚀cdef'.slice(0, 2); // ❌ "a\ud83d" (half a rocket)
truncateBytes('a🚀cdef', 2); // ✅ "a" (including the rocket would be 3 total bytes)
```

## BOM (Byte order mark) - U+FEFF

Under the hood, all these functions strip a leading BOM if present.
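As a rough illustration of what stripping a leading BOM means, the sketch below removes U+FEFF before counting code points. It is an assumption about the general technique, not the library's implementation: `stripBom` and `lengthWithoutBom` are hypothetical names.

```js
// Illustrative sketch; `stripBom` and `lengthWithoutBom` are hypothetical helpers.
const BOM = '\uFEFF';

// Remove a single leading byte order mark, if there is one.
function stripBom(str) {
  return str.startsWith(BOM) ? str.slice(1) : str;
}

// Count code points after discarding a leading BOM.
function lengthWithoutBom(str) {
  return Array.from(stripBom(str)).length;
}

// Text read from a UTF-8 file may begin with a BOM.
const fromFile = '\uFEFFa🚀c';

console.log(Array.from(fromFile).length); // 4 (the BOM counts as a code point)
console.log(lengthWithoutBom(fromFile)); // 3
```

Without this step, a BOM left over from a file read would silently shift every index by one.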