Utility for extracting Chinese characters from a string

github.com/TheRobertLing/extract-zhongwen

# extract-zhongwen

extract-zhongwen is a small utility that extracts Chinese characters from a given string based on Unicode ranges.

---

## Installation

```bash
npm install extract-zhongwen
```

## Features

- Extracts Chinese characters from an input string.
- Supports **Unicode normalization** (NFKC) with optional preservation of specified characters.
- Allows **whitelisting or blacklisting** of specific characters.
- Option to **remove duplicate characters**.

## Function Signature

```typescript
declare function extract(input: string, options?: Options): string;
```

### Parameters

#### `input` (string)

The input string from which Chinese characters will be extracted.

#### `options` (Options)

An object containing configuration options.

| Option              | Type    | Default | Description                                                                                                        |
| ------------------- | ------- | ------- | ------------------------------------------------------------------------------------------------------------------ |
| `normalizeUnicode`  | boolean | `true`  | If `true`, normalizes Unicode characters to NFKC form, while preserving whitelisted characters.                    |
| `removeDuplicates`  | boolean | `true`  | If `true`, removes duplicate Chinese characters from the output.                                                   |
| `includeCharacters` | string  | `""`    | Characters to explicitly include in the extracted output, even if they fall outside the Chinese character ranges. |
| `excludeCharacters` | string  | `""`    | Characters to exclude from the extracted output, even if they fall within the Chinese character ranges.           |

## Notes

- Characters listed in `includeCharacters` are not normalized.
- `includeCharacters` and `excludeCharacters` treat each character individually, so it is not possible to whitelist or blacklist specific words or phrases.
- If a character appears in both `includeCharacters` and `excludeCharacters`, it is filtered out.
- Whitespace, punctuation, and any other non-Chinese characters are filtered out by default.
- Duplicate characters are removed at the very end. With `normalizeUnicode` enabled, a character and its normalized counterpart may therefore be treated as duplicates, even though their original code points differ.
- If `normalizeUnicode` is disabled, characters with **different Unicode representations** may not be merged.
- Performance may vary for **very large input strings**, especially when `removeDuplicates` is enabled, since it requires additional processing.
- If no Chinese characters are found in the input, an **empty string** is returned.
- The function does not differentiate between **Simplified and Traditional Chinese**.

## Example Usage

```typescript
import { extract } from "extract-zhongwen";

console.log(extract("中文字符 English Characters"));
// Output: "中文字符"

// Example with normalization (NFKC)
console.log(extract("社社祖租", { normalizeUnicode: true }));
// Output: "社社租租"

// Example with duplicate removal
console.log(extract("你好你好世界世界", { removeDuplicates: true }));
// Output: "你好世界"

// Example with duplicate removal disabled
console.log(extract("你好你好世界世界", { removeDuplicates: false }));
// Output: "你好你好世界世界"

// Example including specific characters
console.log(
  extract("Hello 你好，世界！", { includeCharacters: "l,! " })
);
// Output: "ll 你好，世界！"

// Example excluding a specific character
console.log(extract("Hello 你好，世界！", { excludeCharacters: "世" }));
// Output: "你好界"

// Example including and excluding characters
console.log(
  extract(
    "那座山，正当顶上，有一块仙石 On the summit of the mountain was a mythical stone",
    {
      includeCharacters: "On the summit of the mountain",
      excludeCharacters: "那座山",
    }
  )
);
// Output: "正当顶上有一块仙石 On the summit of the mountain"
```
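The interaction between normalization and duplicate removal described in the notes can be illustrated with a standalone sketch. This is not the package's implementation: `sketchExtract`, `isHan`, and `HAN_RANGES` are hypothetical names, and the ranges below are only an illustrative subset of CJK ideograph blocks, since this README does not list the ranges the package actually checks. It assumes the order normalize, then filter, then deduplicate.

```typescript
// Illustrative subset of CJK ideograph blocks (an assumption, not the
// package's actual range list).
const HAN_RANGES: ReadonlyArray<readonly [number, number]> = [
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xf900, 0xfaff], // CJK Compatibility Ideographs
  [0x20000, 0x2a6df], // CJK Unified Ideographs Extension B
];

// True if the first code point of `ch` falls inside one of the ranges above.
const isHan = (ch: string): boolean => {
  const cp = ch.codePointAt(0);
  return cp !== undefined && HAN_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
};

// Hypothetical helper: normalize (optional), keep Han characters,
// then deduplicate (optional) while preserving first-seen order.
function sketchExtract(input: string, normalize = true, dedupe = true): string {
  const text = normalize ? input.normalize("NFKC") : input;
  // Array.from iterates by code point, so supplementary-plane characters stay whole.
  const chars = Array.from(text).filter(isHan);
  return (dedupe ? Array.from(new Set(chars)) : chars).join("");
}

// U+FA4C is a compatibility ideograph that NFKC maps to U+793E (社).
const pair = "\uFA4C\u793E";
console.log(sketchExtract(pair).length); // 1: both forms collapse to U+793E
console.log(sketchExtract(pair, false).length); // 2: distinct code points are kept
```

With normalization on, the two visually identical characters become the same code point before deduplication, so only one survives. With normalization off, they remain distinct and both appear in the output.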