universal-emoji-parser
This tool parses Unicode and emoji codes into HTML images using emojilib and the Twemoji CDN.
# Performance
Performance characteristics of Universal Emoji Parser and the levers you have when something gets slow. The package is intentionally simple — most "performance" work here is about _not_ introducing slow paths in new code.
## Hot paths
### `parse(text, options)` end-to-end
```
parse()
├── getDefaultOptions() ← O(1) — small object merge
├── parseToShortcode(text)? ← O(N · M) where N = text length, M = catalog size
├── parseToUnicode(text)? ← O(K · M_avg) where K = #shortcodes in text
└── __parseEmojiToHtml(text)? ← O(E + N · E) where E = #emoji entities
```
For typical inputs (a chat message with 1–10 emojis), the whole pipeline runs in under a millisecond. The catalog lookups are O(1) on direct slug hits and O(M) on keyword-scan fallbacks.
### `parseToShortcode` is the slowest
It builds a single regex from `Object.keys(emojiLibJsonData).join('|')` — a 1906-alternation pattern — and runs `text.matchAll` over it. Two costs:
1. **RegExp construction** is O(M) and happens **on every call**. With 1906 alternates and the keycap escape, this is the dominant cost for short inputs.
2. **Matching** is O(N · M) in the worst case (the engine tries the alternatives in order at each position, which is slower than testing a character class).
If a consumer calls `parseToShortcode` in a hot loop, **the regex construction dominates**. Caching it would help but introduces stateful behavior — currently the package recreates it every call.
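For a consumer who does hit this in a hot loop, the caching trade-off described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `sampleKeys`, `escapeRegExp`, `getMatcher`, `findEmojis` and `buildCount` are all hypothetical names, and the three-entry key list stands in for the real catalog keys.

```ts
// Hypothetical sketch: every name below is illustrative, not the package's API.
const sampleKeys: readonly string[] = ['🚀', '⭐', '*\uFE0F\u20E3'] // stand-in for the catalog keys

// Escape regex metacharacters so keys such as the keycap "*" stay literal.
const escapeRegExp = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

let cachedMatcher: RegExp | null = null
let buildCount = 0 // only here to make the caching observable

function getMatcher(): RegExp {
  if (cachedMatcher === null) {
    buildCount++
    const pattern = sampleKeys
      .slice()
      .sort((a, b) => b.length - a.length) // longest first, so a prefix never wins
      .map(escapeRegExp)
      .join('|')
    cachedMatcher = new RegExp(pattern, 'g')
  }
  return cachedMatcher
}

function findEmojis(text: string): string[] {
  // String.prototype.match with a global regex resets lastIndex, so sharing the instance is safe here.
  return text.match(getMatcher()) ?? []
}
```

The regex is built on the first call and reused afterwards. The cost is exactly the module-level state the package currently avoids, so a cache like this probably belongs in the consumer's own wrapper.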
### Catalog lookup priority
`getEmojiObjectByShortcode(shortcode)`:
1. **Direct hit** — `emojiLibJsonData[shortcode]` — O(1)
2. **Keyword scan** — `Object.keys(...).find(...)` — O(M · K_avg) where K_avg ≈ 5 (average keywords per emoji)
For canonical slugs the fast path always wins. For dialect aliases (`:thumbsup:` → 👍), the scan is unavoidable. There's no inverted index — it would double the catalog memory for a marginal speedup.
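The two tiers can be sketched on a tiny stand-in catalog. The shape below (keyed by slug, two entries, a reduced `EmojiEntry` type) and the names `sampleCatalog` and `lookup` are assumptions for illustration only; the real `getEmojiObjectByShortcode` may differ in detail.

```ts
// Hypothetical sketch: reduced entry shape and a two-entry stand-in catalog.
interface EmojiEntry {
  slug: string
  char: string
  keywords: string[]
}

const sampleCatalog: Record<string, EmojiEntry> = {
  thumbs_up: { slug: 'thumbs_up', char: '👍', keywords: ['thumbsup', '+1', 'yes'] },
  rocket: { slug: 'rocket', char: '🚀', keywords: ['launch', 'ship'] },
}

function lookup(shortcode: string): EmojiEntry | undefined {
  const name = shortcode.replace(/:/g, '')

  // Tier 1: direct property access on an own key, O(1).
  if (Object.prototype.hasOwnProperty.call(sampleCatalog, name)) {
    return sampleCatalog[name]
  }

  // Tier 2: linear scan over every entry's keywords, O(M · K_avg).
  const key = Object.keys(sampleCatalog).find((k) => sampleCatalog[k].keywords.includes(name))
  return key === undefined ? undefined : sampleCatalog[key]
}
```

Here `lookup(':rocket:')` resolves in tier 1, while `lookup(':thumbsup:')` falls through to the keyword scan before finding the `thumbs_up` entry.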
## Optimizations already in place
| Optimization | Where | Why it matters |
| ---------------------------------------- | ------------------------------------------------- | -------------------------------------------------------------------- |
| Catalog as static JSON import | `import emojiLibJson from './lib/emoji-lib.json'` | Bundlers inline it as a JS object literal — no JSON.parse at runtime |
| Single Twemoji parse per HTML conversion | `__parseEmojiToHtml` | Twemoji is the slowest single op; calling it twice is wasted work |
| `entitiesFound` dedup | `__parseEmojiToHtml` | Same emoji appearing 5× → 1 regex replace, not 5 |
| Two-tier shortcode lookup | `getEmojiObjectByShortcode` | Slug path is O(1); keyword scan only when needed |
| Frozen-by-convention catalog | `emojiLibJsonData` | Consumers can safely cache references; nothing mutates |
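The `entitiesFound` dedup row is the easiest one to illustrate in isolation. The sketch below is an assumption-laden stand-in: `renderImg`, `toCodePoints` and `replaceDeduped` are hypothetical helpers, the markup is not the package's output, and a `split`/`join` swap is used in place of the regex replace so no escaping is needed.

```ts
// Hypothetical sketch: helper names and markup are illustrative only.
const toCodePoints = (char: string): string =>
  Array.from(char)
    .map((c) => (c.codePointAt(0) as number).toString(16))
    .join('-')

const renderImg = (char: string): string => `<img class="emoji" data-cp="${toCodePoints(char)}">`

function replaceDeduped(text: string, entitiesFound: string[]): { html: string; replaceCalls: number } {
  let html = text
  let replaceCalls = 0
  // A Set collapses repeats, so each distinct emoji triggers one global replace.
  new Set(entitiesFound).forEach((char) => {
    html = html.split(char).join(renderImg(char))
    replaceCalls++
  })
  return { html, replaceCalls }
}
```

With three occurrences of the same emoji in `entitiesFound`, the loop body runs once and still rewrites all three positions in the text.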
## Optimizations _not_ applied (and why)
| Not done | Reason |
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| RegExp caching for `parseToShortcode` | Adds stateful module-level cache; tests pass without it; speed matters only in tight loops |
| Inverted keyword index (`{ keyword: emoji }`) | Doubles memory of the catalog (~5 MB resident) for marginal speedup; current scan is fast enough |
| Streaming/chunked parsing for very long inputs | Unrealistic input size for this package's domain (chat messages, blog posts) |
| Web Worker offload | Not the package's job — consumers can `worker.postMessage(text)` themselves |
| Async API | Adds complexity without benefit; everything is in-memory |
If a consumer benchmarks a real workload and finds these to be the bottleneck, file an issue with numbers and we'll reconsider.
## Bundle size
The package adds ~600 KB minified (~543 KB JSON catalog + ~50 KB code + Twemoji) to consumer bundles. This dominates everything:
```bash
ls -lh dist/index.js # ~600 KB minified
ls -lh src/lib/emoji-lib.json # ~543 KB raw
```
### Why so big?
The catalog has 1906 entries × average ~250 bytes per entry (name, slug, group, version, char, keywords array). Most of the size is keyword arrays and the `name` field.
### What we won't drop
- `name` — useful for accessibility (`alt=` could use it; we currently use the unicode char instead)
- `slug` — the canonical shortcode for `parseToShortcode`
- `keywords` — the alias support is the package's main value-add
- `char` — used by `parseToUnicode` and `__parseEmojiToHtml`
### What might drop in a future major
- `group`, `emoji_version`, `unicode_version`, `skin_tone_support`: currently exported as part of `EmojiType` but **never read by the runtime**. They're metadata for consumers using `emojiLibJsonData` directly. Removing them would save ~30% of the catalog size but break any consumer that reads them. Slated for a 3.x migration discussion.
### Lazy-loading for consumers
A consumer worried about initial-load performance can lazy-import:
```ts
let parserPromise: Promise<typeof import('universal-emoji-parser').default> | null = null
async function parseLazy(text: string): Promise<string> {
parserPromise ??= import('universal-emoji-parser').then((m) => m.default)
return (await parserPromise).parse(text)
}
```
Webpack/Vite/Rollup turn this into a code-split chunk — the catalog only ships when first needed. Trade-off: first call awaits a network fetch.
### Tree-shaking
The catalog is **not tree-shakeable** — `getEmojiObjectByShortcode` enumerates all keys via `Object.keys(emojiLibJsonData)`, so every emoji is reachable. Even consumers who only call `parseToHtml` ship the whole catalog.
A custom subset (e.g., "only emojis used in our app") would require a fork or a wrapper package that pre-filters at build time. Out of scope for this package.
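For anyone building such a wrapper, the pre-filter step might look roughly like this. It is a sketch under assumptions: `pickSubset` and the reduced `Catalog` type are hypothetical, and the only thing taken from the catalog description above is that each entry carries a `slug`.

```ts
// Hypothetical sketch of a build-time filter; not part of this package.
type Catalog = Record<string, { slug: string; char: string; keywords: string[] }>

function pickSubset(catalog: Catalog, allowedSlugs: readonly string[]): Catalog {
  const allowed = new Set(allowedSlugs)
  const subset: Catalog = {}
  for (const key of Object.keys(catalog)) {
    // Filter on the entry's slug so the sketch does not depend on how the catalog is keyed.
    if (allowed.has(catalog[key].slug)) subset[key] = catalog[key]
  }
  return subset
}

// Usage with a stand-in catalog: keep only the emojis the app actually uses.
const fullCatalog: Catalog = {
  rocket: { slug: 'rocket', char: '🚀', keywords: ['launch'] },
  star: { slug: 'star', char: '⭐', keywords: ['favorite'] },
  fire: { slug: 'fire', char: '🔥', keywords: ['hot'] },
}
const appCatalog = pickSubset(fullCatalog, ['rocket', 'fire'])
```

A build script would then serialize `appCatalog` to its own JSON file, which is what the wrapper package would import instead of the full catalog.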
## Memory footprint
Per-process overhead:
- **Catalog** — ~5 MB resident (parsed JS object representation of the 543 KB JSON)
- **Code** — ~50 KB
- **`@twemoji/parser`** — ~50 KB
Loaded once per process; doesn't grow with usage.
In Node, for an app that uses this package and little else, the catalog is the largest string-heavy object counted in `process.memoryUsage().heapUsed`. That is not a problem for typical Node servers, but it matters in memory-constrained environments (256 MB Lambdas, Cloudflare Workers).
## Throughput benchmarks
We don't have a wired-up benchmark suite. Rough numbers from manual measurement (Node 22 on M1):
| Operation | Input | Latency |
| ---------------------------------------------- | ----------------------------------- | -------------------------------------- |
| `parse('hello')` | No emojis | < 0.1 ms |
| `parse('hello :smile: 🚀')` | 2 emojis, 1 shortcode | ~0.3 ms |
| `parse(<200 char chat message with 5 emojis>)` | Realistic chat | ~0.5 ms |
| `parseToShortcode('🚀 ⭐️ ❤️ 😎 🔥')` | 5 unicodes | ~0.8 ms (regex construction dominates) |
| `parseToShortcode(<1 KB text>)` | Same alternation regex, longer scan | ~1.5 ms |
| `parseToHtml(<1 KB text with 20 emojis>)` | Full pipeline | ~2 ms |
Twemoji's `parse()` is the dominant cost in `parseToHtml`. Direct-hit catalog lookups are sub-microsecond.
If a consumer reports >10 ms latencies on realistic input, that's a bug — open an issue.
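Numbers attached to an issue are most useful when they come from a repeatable loop. A minimal timing harness could look like the sketch below; `measure` and `standIn` are hypothetical names, `Date.now()` is used only because it needs no imports (its millisecond resolution is why the loop runs many iterations), and the stand-in workload should be swapped for a call to the package's `parse` in a real project.

```ts
// Hypothetical sketch of a manual timing loop; not a benchmark suite.
function measure(label: string, fn: () => unknown, iterations = 10000): number {
  // Warm-up so the JIT has seen fn before timing starts.
  for (let i = 0; i < 100; i++) fn()

  const start = Date.now()
  for (let i = 0; i < iterations; i++) fn()
  const msPerCall = (Date.now() - start) / iterations

  console.log(`${label}: ${msPerCall.toFixed(4)} ms/call over ${iterations} iterations`)
  return msPerCall
}

// Stand-in workload so the sketch runs on its own; replace with the real call under test.
const standIn = (): string => 'hello :smile:'.replace(/:([a-z_]+):/g, '[$1]')

const perCall = measure('stand-in replace', standIn)
```

Reporting the input, the iteration count and the resulting per-call figure makes it much easier to compare against the rough table above.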
## Adding new code paths — performance checklist
When adding a new method or feature:
- [ ] **No async.** Don't introduce `Promise` returns or `await` calls — the catalog is in-memory; sync is faster and simpler
- [ ] **No catalog mutation.** `emojiLibJsonData` must remain a reference-stable, deep-frozen-by-convention object
- [ ] **Cache RegExp where possible.** If a regex doesn't depend on the input, build it once at module init, not per call
- [ ] **No iteration of the catalog** if a direct lookup will do. `emojiLibJsonData[slug]` is always faster than `Object.keys(...).find(...)`
- [ ] **No new fields on `EmojiType`** without measuring the bundle-size delta. Each field × 1906 entries × every consumer's bundle adds up
- [ ] **Test with realistic input.** `:smile:` is fine for unit tests; don't optimize for the trivial case at the cost of long inputs
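The "no catalog mutation" item is hard to check by eye, since the catalog is frozen by convention only. One way to catch accidental writes in a test is to run the new code against a deep-frozen copy. The helper below is a hypothetical test utility, not something the package ships, and it assumes plain JSON-like data without cycles.

```ts
// Hypothetical test helper: recursively freezes plain JSON-like data.
function deepFreeze<T>(value: T): T {
  if (value !== null && typeof value === 'object') {
    for (const key of Object.keys(value as object)) {
      deepFreeze((value as Record<string, unknown>)[key])
    }
    Object.freeze(value)
  }
  return value
}

// Usage with a stand-in catalog: any in-place edit of nested data now fails.
const frozenCatalog = deepFreeze({
  rocket: { char: '🚀', keywords: ['launch', 'ship'] },
})

let mutationBlocked = false
try {
  frozenCatalog.rocket.keywords.push('oops') // push on a frozen array throws a TypeError
} catch {
  mutationBlocked = true
}
```

In a test, passing a frozen copy to the code under review turns a silent catalog write into an immediate failure.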
## Profiling
Quick profile of a hot call:
```bash
node --prof -e "const u = require('./dist/index.js'); for (let i = 0; i < 10000; i++) u.parse('hello :smile: 🚀 ⭐️ ❤️ 😎')"
node --prof-process isolate-*.log
```
The output will show you which functions dominate. Expected: `RegExp.prototype[@@matchAll]` and the alternation regex construction in `parseToShortcode` — confirming the analysis above.
For browser perf, Chrome DevTools' Performance panel works on bundles built by `npm run build:dev` (production minification obscures function names).
## Common performance mistakes
1. **Constructing a new `RegExp` per loop iteration** — pull it out:
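```ts
const texts = ['hello :smile:', 'bye :wave:'] // sample inputs
const pattern = ':[a-z_]+:' // placeholder pattern, not the package's regex

// ❌ built on every iteration
for (const t of texts) {
  const r = new RegExp(pattern)
  t.match(r)
}

// ✅ built once, reused for every input
const r = new RegExp(pattern)
for (const t of texts) t.match(r)
```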
2. **Cloning the catalog** — `JSON.parse(JSON.stringify(emojiLibJsonData))` is the regenerator's pattern; never do it at runtime
3. **`for-in` over the catalog**: `for-in` also visits inherited enumerable properties and is usually no faster than `Object.keys(...).forEach`. The package uses `Object.keys` consistently
4. **`indexOf` chains**: `array.indexOf(x) !== -1` is harder to read than `array.includes(x)` and no faster
5. **Reading `emojiLibJsonData.length`** — there's no `length`; it's an object, not an array. Use `Object.keys(emojiLibJsonData).length`