# classification.js

A powerful and flexible text classification library for Node.js that uses an optimized Damerau-Levenshtein distance algorithm to match input texts against a dataset.

## Features

- **Multiple Algorithm Levels**: Choose between Mini, Core, Pro, and Ultra levels to balance performance and accuracy
- **Text Normalization**: Option to normalize text for improved matching
- **Multi-language Support**: Use with any language by providing appropriate datasets
- **Result Logging**: Save classification results to log files
- **Customizable Parameters**: Fine-tune the classification process with various options

## Installation

```bash
npm install classification.js
```

```bash
bun add classification.js
```

```bash
yarn add classification.js
```

## Quick Start

```javascript
import Classifier from 'classification.js';

// Load a dataset
const dataset = Classifier.loadDatasetFromFile('tur'); // Load Turkish dataset

// Create a classifier instance
const classifier = new Classifier(dataset, {
  normalize: true,
  algorithmLevel: 'Pro',
  language: 'tur'
});

// Classify some text
const result = classifier.classify(['Bu bir test cümlesidir.']);
console.log(result);
```

## API Reference

### Classifier Class

#### Constructor

```javascript
new Classifier(dataset, options)
```

- **dataset**: Array of objects with `text` and `label` properties
- **options**: (Optional) Configuration object with the following properties:
  - **normalize**: (Boolean, default: false) Whether to normalize the text
  - **algorithmLevel**: ('Mini' | 'Core' | 'Pro' | 'Ultra', default: 'Pro') The algorithm level
  - **keepLogToFile**: (Boolean, default: false) Whether to save classification results to a log file
  - **truncateLength**: (Number, default: 2048) The length to truncate texts for similarity calculation
  - **language**: (String, default: 'unknown') Language code (for logging purposes)
  - **max_steps**: (Number) Max steps (calculated from algorithmLevel or provided)

#### Static Methods

##### `loadDatasetFromFile(language)`

Loads a dataset from a JSON file in the `datasets` directory.

- **language**: (String) The language code to load (e.g., 'tur', 'eng')
- **Returns**: Array of objects with `text` and `label` properties, or null on error

#### Instance Methods

##### `classify(inputs)`

Classifies the input texts.

- **inputs**: (`Array<string>`) The input texts to classify
- **Returns**: Object with classification results

##### `normalizeText(text)`

Normalizes the text.

- **text**: (String) The text to normalize
- **Returns**: The normalized text

##### `getSimilarity(text1, text2)`

Calculates the similarity between two texts.

- **text1**: (String) The first text
- **text2**: (String) The second text
- **Returns**: Object with similarity metrics

##### `saveResultsToLog(results, language)`

Saves the classification results to a log file.

- **results**: (Array) The classification results
- **language**: (String) The language code for the log file name

## Dataset Format

The dataset should be a JSON file with the following structure:

```json
{
  "example text 1": "label1",
  "example text 2": "label2",
  "example text 3": "label1"
}
```

Save your dataset files in the `datasets` directory with the naming convention `datas_[language].json` (e.g., `datas_tur.json` for Turkish).
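
The JSON file maps each text to its label, while the constructor expects an array of objects with `text` and `label` properties. The sketch below shows one way that conversion could be done by hand, for example when the JSON comes from somewhere other than the `datasets` directory. The helper name `toDatasetArray` is hypothetical and not part of the library; how `loadDatasetFromFile` performs this step internally is not documented here.

```javascript
// Hypothetical helper (not part of classification.js): turns the
// { "text": "label" } JSON shape into [{ text, label }, ...].
function toDatasetArray(jsonObject) {
  return Object.entries(jsonObject).map(([text, label]) => ({ text, label }));
}

// Same shape as the dataset file shown above, inlined for illustration.
const raw = {
  'example text 1': 'label1',
  'example text 2': 'label2',
  'example text 3': 'label1'
};

const dataset = toDatasetArray(raw);
console.log(dataset[0]); // { text: 'example text 1', label: 'label1' }
```

The resulting array can then be passed as the first argument to `new Classifier(dataset, options)`.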
## Algorithm Levels

- **Mini**: Fast but less accurate (max_steps: 10)
- **Core**: Balanced performance (max_steps: 50)
- **Pro**: Good accuracy with reasonable performance (max_steps: 200)
- **Ultra**: Maximum accuracy, may be slower (max_steps: Infinity)

## Example

```javascript
import Classifier from 'classification.js';

// Create a custom dataset
const dataset = [
  { text: "Hello world", label: "greeting" },
  { text: "How are you?", label: "question" },
  { text: "Goodbye", label: "farewell" }
];

// Create a classifier with custom options
const classifier = new Classifier(dataset, {
  normalize: true,
  algorithmLevel: 'Ultra',
  keepLogToFile: true,
  language: 'eng'
});

// Classify multiple texts
const results = classifier.classify([
  "Hello there!",
  "How's everything?",
  "See you later"
]);
console.log(results);
```

## How It Works

The library uses a modified Damerau-Levenshtein distance algorithm to calculate the similarity between input texts and dataset entries. This algorithm measures the "edit distance" between two strings: the minimum number of operations (insertions, deletions, substitutions, and transpositions) required to change one string into another.

### Classification Process

1. The input text is normalized (if the option is enabled)
2. The algorithm compares the input text with each entry in the dataset
3. For each comparison, a similarity score is calculated
4. The dataset entry with the highest similarity score is selected as the match
5. The label of the best match is returned as the classification result

### Performance Optimization

The algorithm includes several optimizations:

- Early termination when the edit distance exceeds the maximum steps
- Text truncation to limit the comparison length
- Different algorithm levels to balance accuracy and performance

## Advanced Usage

### Custom Dataset Creation

```javascript
import Classifier from 'classification.js';

// Create a dataset programmatically
const dataset = [];
for (let i = 0; i < 100; i++) {
  dataset.push({
    text: `Example text ${i}`,
    label: i % 2 === 0 ? 'even' : 'odd'
  });
}

const classifier = new Classifier(dataset, { algorithmLevel: 'Core' });
```

### Batch Classification

```javascript
// Classify multiple texts at once
// (`classifier` is an existing Classifier instance, e.g. the one created above)
const textsToClassify = [
  "First example",
  "Second example",
  "Third example",
  // ... more texts
];

const results = classifier.classify(textsToClassify);
```

### Using with Different Languages

```javascript
import Classifier from 'classification.js';

// Turkish
const turkishDataset = Classifier.loadDatasetFromFile('tur');
const turkishClassifier = new Classifier(turkishDataset, {
  language: 'tur',
  normalize: true
});

// English
const englishDataset = Classifier.loadDatasetFromFile('eng');
const englishClassifier = new Classifier(englishDataset, {
  language: 'eng',
  normalize: true
});

// Classify in both languages
const turkishResults = turkishClassifier.classify(['Merhaba dünya']);
const englishResults = englishClassifier.classify(['Hello world']);
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

ISC

## Author

Nixaut
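
## Appendix: Edit Distance Sketch

The classification process and the early-termination optimization described under How It Works can be illustrated with a small standalone sketch. It is not the library's actual implementation. It assumes the restricted (optimal string alignment) variant of Damerau-Levenshtein, a similarity score of `1 - distance / longerLength`, and a normalization step that only lowercases and trims. The names `editDistance`, `similarity`, `normalize`, and `classifyOne` are all hypothetical.

```javascript
// Illustrative sketch only; not the code used inside classification.js.
// Optimal-string-alignment distance with a maxSteps cutoff.
function editDistance(a, b, maxSteps = Infinity) {
  const rows = a.length + 1;
  const cols = b.length + 1;
  // d[i][j] = distance between the first i chars of a and the first j chars of b
  const d = Array.from({ length: rows }, () => new Array(cols).fill(0));
  for (let i = 0; i < rows; i++) d[i][0] = i;
  for (let j = 0; j < cols; j++) d[0][j] = j;

  for (let i = 1; i < rows; i++) {
    let rowMin = d[i][0];
    for (let j = 1; j < cols; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,       // deletion
        d[i][j - 1] + 1,       // insertion
        d[i - 1][j - 1] + cost // substitution (or match when cost is 0)
      );
      // Transposition of two adjacent characters
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
      rowMin = Math.min(rowMin, d[i][j]);
    }
    // Early termination: row minima never decrease, so once every cell in
    // a row exceeds maxSteps the final distance must exceed it as well.
    if (rowMin > maxSteps) return Infinity;
  }
  return d[a.length][b.length];
}

// Assumed similarity in [0, 1]; 0 when the cutoff was hit.
function similarity(a, b, maxSteps = Infinity) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1;
  const dist = editDistance(a, b, maxSteps);
  return Number.isFinite(dist) ? 1 - dist / longest : 0;
}

// Assumed normalization: lowercase and trim only.
const normalize = (text) => text.toLowerCase().trim();

// Steps 1-5 of the classification process for a single input.
function classifyOne(input, dataset, maxSteps = Infinity) {
  const query = normalize(input);
  let best = { label: null, score: -1 };
  for (const { text, label } of dataset) {
    const score = similarity(query, normalize(text), maxSteps);
    if (score > best.score) best = { label, score };
  }
  return best;
}

const sampleDataset = [
  { text: 'Hello world', label: 'greeting' },
  { text: 'How are you?', label: 'question' },
  { text: 'Goodbye', label: 'farewell' }
];

console.log(editDistance('ca', 'ac')); // 1 (a single transposition)
console.log(classifyOne('Helo world', sampleDataset).label); // greeting
```

A lower `maxSteps` lets the comparison stop sooner on clearly dissimilar pairs, at the cost of scoring those pairs as 0 rather than computing their exact distance.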