string-probability

A TypeScript library to calculate Jaro-Winkler distance and string similarity probabilities between two strings.

# string-probability

**string-probability** is a TypeScript library for calculating the similarity between strings using probabilistic models based on edit distance. Unlike traditional string comparison methods, this library emphasizes probability-based similarity, providing a more nuanced measure of how closely two strings match.

---

## Features

* Calculate **string similarity probability** between 0 (completely different) and 1 (identical).
* Support for multiple probability models:
  * **Standard**: normalized inverse distance
  * **Alpha**: exponential decay for strict sensitivity
  * **Beta**: power/exponent-based sensitivity curve
* Uses **Jaro-Winkler distance** internally for robust handling of transpositions and common prefixes.
* Flexible configuration to tune sensitivity for your specific use case.

---

## Installation

```bash
npm install string-probability
# or
yarn add string-probability
# or
bun add string-probability
```

---

## Usage

```typescript
import { probability } from "string-probability";

// Standard probability (default)
const prob1 = probability("hello", "hello"); // ~1.0
const prob2 = probability("hello", "world"); // lower probability

// Alpha mode: exponential decay sensitivity
const prob3 = probability("test", "best", { mode: "alpha", value: 1.5 });

// Beta mode: power/exponent sensitivity
const prob4 = probability("cat", "bat", { mode: "beta", value: 2.0 });
```

---

## API

### `probability(str1: string, str2: string, options?)`

Calculates the similarity probability between two strings.

**Parameters**:

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `str1` | `string` | First string |
| `str2` | `string` | Second string |
| `options` | `{ value?: number; mode?: "standard" \| "alpha" \| "beta" }` | Optional configuration for calculation mode and sensitivity |

**Returns**: `number`, a probability between 0 and 1.

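
To make the `mode` and `value` options more concrete, here is a rough, self-contained sketch of the pipeline this README describes: compute a Jaro-Winkler distance `d`, then map it to a probability with the formula for the selected mode. It is not the library's source code. The names `jaroSimilarity`, `jaroWinklerDistance`, `toProbability`, `sketchProbability`, and `SketchOptions` are hypothetical, the prefix scale of 0.1 and prefix cap of 4 are the usual Winkler defaults rather than confirmed library settings, and the default `value` of 1 and the reading of `L` as the longer string's length are assumptions.

```typescript
// Illustrative sketch only: none of these names are exports of string-probability.
type Mode = "standard" | "alpha" | "beta";
interface SketchOptions { mode?: Mode; value?: number }

/** Jaro similarity in [0, 1]; 1 means identical. */
function jaroSimilarity(a: string, b: string): number {
  if (a === b) return 1;
  if (a.length === 0 || b.length === 0) return 0;

  // Characters match only if they are equal and no further apart than this window.
  const window = Math.max(0, Math.floor(Math.max(a.length, b.length) / 2) - 1);
  const aMatched = new Array<boolean>(a.length).fill(false);
  const bMatched = new Array<boolean>(b.length).fill(false);
  let matches = 0;

  for (let i = 0; i < a.length; i++) {
    const end = Math.min(b.length - 1, i + window);
    for (let j = Math.max(0, i - window); j <= end; j++) {
      if (!bMatched[j] && a[i] === b[j]) {
        aMatched[i] = true;
        bMatched[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;

  // Matched characters that line up in a different order each count as half a transposition.
  let outOfOrder = 0;
  let k = 0;
  for (let i = 0; i < a.length; i++) {
    if (!aMatched[i]) continue;
    while (!bMatched[k]) k++;
    if (a[i] !== b[k]) outOfOrder++;
    k++;
  }
  const t = outOfOrder / 2;
  return (matches / a.length + matches / b.length + (matches - t) / matches) / 3;
}

/** Jaro-Winkler distance: 1 - similarity, with a bonus for a shared prefix of up to 4 characters. */
function jaroWinklerDistance(a: string, b: string, prefixScale = 0.1): number {
  const jaro = jaroSimilarity(a, b);
  const maxPrefix = Math.min(4, a.length, b.length);
  let prefix = 0;
  while (prefix < maxPrefix && a[prefix] === b[prefix]) prefix++;
  return 1 - (jaro + prefix * prefixScale * (1 - jaro));
}

/** Map a distance d to a probability using the formula for the selected mode. */
function toProbability(d: number, maxLength: number, options: SketchOptions = {}): number {
  const { mode = "standard", value = 1 } = options; // default value of 1 is an assumption
  switch (mode) {
    case "alpha":
      return Math.exp(-value * d); // p = e^(-α * d)
    case "beta":
      return 1 - Math.pow(d, value); // p = 1 - d^β
    default:
      // p = 1 / (1 + d / L); two empty strings are treated as identical
      return maxLength === 0 ? 1 : 1 / (1 + d / maxLength);
  }
}

/** Hypothetical end-to-end helper: distance first, then the mode curve. */
function sketchProbability(a: string, b: string, options: SketchOptions = {}): number {
  const d = jaroWinklerDistance(a, b);
  return toProbability(d, Math.max(a.length, b.length), options); // L assumed to be the longer length
}

console.log(sketchProbability("hello", "hello")); // 1
console.log(sketchProbability("test", "best", { mode: "alpha", value: 1.5 }));
console.log(sketchProbability("cat", "bat", { mode: "beta", value: 2.0 }));
```

Because `d` never exceeds 1, these formulas taken literally bottom out at `1 / (1 + 1/L)` in standard mode and at `e^(-α)` in alpha mode, while beta mode can reach exactly 0. The published package may scale its inputs differently, so treat the numbers from this sketch as indicative only.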

---

### Probability Modes

1. **Standard (default)**: `p = 1 / (1 + d / L)`
   * Balanced, normalized approach.
   * Intuitive probability values.
2. **Alpha (exponential decay)**: `p = e^(-α * d)`
   * High α → stricter matching
   * Low α → more forgiving
   * Smooth probability degradation
3. **Beta (power/exponent)**: `p = 1 - d^β`
   * β > 1 → more forgiving
   * β < 1 → stricter
   * β = 1 → linear relationship

> `d` is the Jaro-Winkler distance between the strings (0 for identical strings, at most 1), and `L` is the maximum string length.

---

## Why Probability Over Direct Matching?

Traditional string matching methods (e.g., exact equality or simple thresholds) are binary: they only indicate whether strings are identical or “close enough.” Probabilistic approaches provide several advantages:

1. **Graded Similarity**: Probability values express the degree of similarity rather than a yes/no result.
2. **Robustness to Minor Differences**: Small typos, transpositions, or variations reduce the probability smoothly instead of failing outright.
3. **Custom Sensitivity**: Exponential and power models allow fine-tuning for strict or forgiving matching.
4. **Better Integration with Machine Learning**: Probability scores can be used directly in algorithms that require continuous similarity metrics.

Using probability enables smarter decisions in search, matching, deduplication, and natural language applications.

---

## License

MIT © 2025 Mohtasim Alam Sohom