extract-zhongwen
Version:
Utility for extracting chinese characters from a string
100 lines (74 loc) • 4.37 kB
Markdown
# extract-zhongwen
extract-zhongwen is a small utility designed to extract Chinese characters from a given string based on Unicode Ranges.
---
## Installation
```bash
npm install extract-zhongwen
```
## Features
- Extracts Chinese characters from an input string.
- Supports **Unicode normalization** (NFKC) with optional preservation of specified characters.
- Allows **whitelisting or blacklisting** of specific characters.
- Option to **remove duplicate characters**.
## Function Signature
```typescript
const extract = (
input: string,
options?: Options
): string
```
### Parameters
### `input` (string)
The input string from which Chinese characters will be extracted.
### `options` (Options)
An object containing configuration options.
| Option | Type | Default | Description |
| ------------------- | ------- | ------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `normalizeUnicode` | boolean | `true` | If `true`, normalizes Unicode characters to NFKC form, while preserving whitelisted characters. |
| `removeDuplicates` | boolean | `true` | If `true`, removes duplicate Chinese characters in the output. |
| `includeCharacters` | string | `""` | A string of characters to explicitly include in the extracted output, even if they don't match general Chinese character ranges. |
| `excludeCharacters` | string | `""` | A string of characters to exclude from the extracted output, even if they match Chinese character ranges. |
## Notes
- Whitelisted characters in `includeCharacters` will not be normalized if present.
- `includeCharacters` and `excludeCharacters` will treat each character individually. This means that it is not possible to whitelist or blacklist specific words or phrases.
- If `includeCharacters` and `excludeCharacters` contain overlapping characters, the overlapping characters will be filtered out.
- Whitespaces, punctuation, and any non-Chinese characters are filtered out by default.
- Duplicate characters are removed at the very end. This means that unnormalized characters and their normalized counterparts may be considered duplicates if the `normalizeUnicode` option is enabled, even if their Unicode values are technically different.
- If `normalizeUnicode` is disabled, characters with **different Unicode representations** might not be merged correctly.
- Performance may vary for **very large input strings**, especially when `removeDuplicates` is enabled, since it requires additional processing.
- If no Chinese characters are found in the input, an **empty string** will be returned.
- The function does not differentiate between **Simplified and Traditional Chinese**.
## Example Usage
```typescript
import { extract } from "extract-zhongwen";
console.log(extract("中文字符 English Characters"));
// Output: "中文字符"
// Example with normalization (NFKC)
console.log(extract("社 社 祖 租", { normalizeUnicode: true }));
// Output: "社社租租"
// Example with duplicate removal
console.log(extract("你好 你好 世界 世界", { removeDuplicates: true }));
// Output: "你好世界"
// Example with duplicate removal disabled
console.log(extract("你好 你好 世界 世界", { removeDuplicates: false }));
// Output: "你好你好世界世界"
// Example including a specific character
console.log(
extract("Hello 你好,世界!", { includeCharacters: "l,! " })
);
// Output: "ll 你好,世界!"
// Example excluding a specific character
console.log(extract("Hello 你好,世界!", { excludeCharacters: "世" }));
// Output: "你好界"
// Example including and excluding characters
console.log(
extract(
"那座山,正当顶上,有一块仙石 On the summit of the mountain was a mythical stone",
{
includeCharacters: "On the summit of the mountain",
excludeCharacters: "那座山",
}
)
);
// Output: "正当顶上有一块仙石 On the summit of the mountain"
```