# ipa-cluster
Cluster words with similar IPA transcriptions together.
Two transcriptions count as similar when the edit distance between them is small enough.
That distance is measured with the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance).
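To make the idea concrete, here is a minimal sketch of a Levenshtein distance computed over space-separated IPA segments. It is an illustration only, not the package's implementation: the names `segments` and `segmentDistance` are hypothetical, and counting whole segments (rather than characters) is an assumption based on the tokenized input shown in this README.

```typescript
// Hypothetical helper: split a tokenized transcription such as "b ɑː r" into segments.
function segments(ipa: string): string[] {
  return ipa.trim().split(/\s+/).filter((s) => s.length > 0);
}

// Levenshtein distance counted in whole segments (an assumption), not characters.
function segmentDistance(a: string, b: string): number {
  const s = segments(a);
  const t = segments(b);
  // previous[j] is the distance between the segments of s seen so far and the first j of t.
  let previous: number[] = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      current.push(
        Math.min(
          previous[j] + 1, // deletion
          current[j - 1] + 1, // insertion
          previous[j - 1] + cost // match or substitution
        )
      );
    }
    previous = current;
  }
  return previous[t.length];
}

console.log(segmentDistance("b ɑː r", "b ɑː z")); // 1
console.log(segmentDistance("f uː", "b ɑː r")); // 3
```

Under this sketch, [bɑːr] and [bɑːz] differ by a single substitution, while [fuː] shares no segment with either of them.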
## Installation
```bash
npm install @palasimi/ipa-cluster
```
## Basic example
```typescript
import { clusterByIPA } from "@palasimi/ipa-cluster";
// The IPA transcriptions should be tokenized: segments are separated by spaces.
const dataset = [{ ipa: "f uː" }, { ipa: "b ɑː r" }, { ipa: "b ɑː z" }];
// Produces one cluster for [fuː] and another for [bɑːr] and [bɑːz].
const clusters = clusterByIPA(dataset);
```
## Equivalent sounds
By default, the algorithm matches two IPA segments if and only if their string representations are the same.
You can override this behavior by specifying a set of "equivalent sounds".
Here's an example.
```txt
a ~ o
b ~ p
```
Normally, the algorithm considers [a] and [o] a mismatch. With the first rule in place, it treats them as equivalent.
Pass the rules to `clusterByIPA` through the `ignores` option.
```typescript
// Reuses `clusterByIPA` and `dataset` from the basic example.
const options = {
ignores: `
a ~ o
b ~ p
`,
};
const clusters = clusterByIPA(dataset, options);
```
The option is called `ignores` because it specifies which edits (sound changes) the algorithm should ignore.
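As a rough illustration of what such rules mean, the sketch below parses environment-free rules like `a ~ o` and uses them to decide whether two segments match. This is not the package's parser: `parseSimpleRules` and `soundsMatch` are hypothetical names, and treating each rule as symmetric is an assumption drawn from the word "equivalent".

```typescript
// Hypothetical helper: parse lines such as "a ~ o" into a set of matching pairs.
// Lines that are blank or not of the form "x ~ y" are skipped.
function parseSimpleRules(rules: string): Set<string> {
  const pairs = new Set<string>();
  for (const line of rules.split("\n")) {
    const parts = line.split("~").map((part) => part.trim());
    if (parts.length !== 2 || parts[0] === "" || parts[1] === "") continue;
    const [left, right] = parts;
    // Assumption: equivalence is symmetric, so store both directions.
    pairs.add(`${left} ${right}`);
    pairs.add(`${right} ${left}`);
  }
  return pairs;
}

// Two segments match if they are identical or declared equivalent.
function soundsMatch(x: string, y: string, pairs: Set<string>): boolean {
  return x === y || pairs.has(`${x} ${y}`);
}

const pairs = parseSimpleRules(`
  a ~ o
  b ~ p
`);
console.log(soundsMatch("a", "o", pairs)); // true
console.log(soundsMatch("a", "b", pairs)); // false
```

In an edit-distance computation, a predicate like `soundsMatch` would replace the plain string comparison, so that a substitution between equivalent sounds costs nothing.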
## Environments
You can restrict an equivalence to specific environments.
In a rule, `_` marks the position of the sound and `#` marks a word boundary.
```txt
-- This is a comment.
-- Treat [b] and [p] as equivalent only at the end of a word.
b ~ p / _ #
-- Treat [b] and [v] as equivalent when surrounded by [a]s.
b ~ v / a _ a
```
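One way to picture an environment check is sketched below. It is a simplified model, not the package's implementation: the `Environment` shape and the `fitsEnvironment` function are hypothetical, and the sketch only looks at the immediate neighbors of a position.

```typescript
// Hypothetical representation of an environment such as "a _ a" or "_ #":
// the segment required before and after the position, with "#" standing for
// a word boundary. An undefined side is unconstrained.
interface Environment {
  before?: string;
  after?: string;
}

// Check whether the segment at `index` sits in the given environment.
function fitsEnvironment(tokens: string[], index: number, env: Environment): boolean {
  // Positions outside the word are reported as the boundary symbol "#".
  const neighbor = (i: number): string =>
    i < 0 || i >= tokens.length ? "#" : tokens[i];
  const beforeOk = env.before === undefined || neighbor(index - 1) === env.before;
  const afterOk = env.after === undefined || neighbor(index + 1) === env.after;
  return beforeOk && afterOk;
}

// "b ~ p / _ #": the [b] must be word-final.
console.log(fitsEnvironment(["k", "a", "b"], 2, { after: "#" })); // true
console.log(fitsEnvironment(["b", "a"], 0, { after: "#" })); // false

// "b ~ v / a _ a": the [b] must be surrounded by [a]s.
console.log(fitsEnvironment(["a", "b", "a"], 1, { before: "a", after: "a" })); // true
```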
## Sound classes
Consider the following set of rules.
```txt
q ~ g
q ~ h
q ~ k
q ~ x
g ~ h
g ~ k
g ~ x
h ~ k
h ~ x
k ~ x
```
These rules say that [q], [g], [h], [k] and [x] should all be considered equivalent to each other.
Sound classes make it possible to define rules like this in a more concise manner.
The following expands to the ruleset above.
```txt
{ q g h k x } ~ { q g h k x }
-- Or:
A = { q g h k x }
A ~ A
-- Note that variable names should be capitalized.
```
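The expansion can be sketched as follows. This is an illustration under stated assumptions rather than the package's own code: `expandClassRule` is a hypothetical name, a sound paired with itself is dropped because identical segments already match, and `g ~ q` is treated as the same rule as `q ~ g`.

```typescript
// Hypothetical helper: expand a rule between two sound classes into pairwise rules.
function expandClassRule(left: string[], right: string[]): string[] {
  const seen = new Set<string>();
  for (const x of left) {
    for (const y of right) {
      if (x === y) continue; // A sound already matches itself.
      // Sort each pair so that "g ~ q" and "q ~ g" are only counted once.
      const [a, b] = [x, y].sort();
      seen.add(`${a} ~ ${b}`);
    }
  }
  return Array.from(seen);
}

const A = ["q", "g", "h", "k", "x"];
const rules = expandClassRule(A, A);
console.log(rules.length); // 10
```

Five sounds give ten unordered pairs, which matches the ten rules written out by hand above (each pair may be printed in the opposite order).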
Classes can also be used in environments.
```txt
A = { a e i o u y }
s ~ z / # _ A
```
## License
Copyright 2023 Levi Gruspe
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.