# classification.js
A powerful and flexible text classification library for Node.js that uses an optimized Damerau-Levenshtein distance algorithm to match input texts against a dataset.
## Features
- **Multiple Algorithm Levels**: Choose between Mini, Core, Pro, and Ultra levels to balance performance and accuracy
- **Text Normalization**: Option to normalize text for improved matching
- **Multi-language Support**: Use with any language by providing appropriate datasets
- **Result Logging**: Save classification results to log files
- **Customizable Parameters**: Fine-tune the classification process with various options
## Installation
```bash
npm install classification.js
```
```bash
bun add classification.js
```
```bash
yarn add classification.js
```
## Quick Start
```javascript
import Classifier from 'classification.js';

// Load the Turkish dataset (returns null on error)
const dataset = Classifier.loadDatasetFromFile('tur');

// Create a classifier instance
const classifier = new Classifier(dataset, {
  normalize: true,
  algorithmLevel: 'Pro',
  language: 'tur'
});

// Classify some text
const result = classifier.classify(['Bu bir test cümlesidir.']);
console.log(result);
```
## API Reference
### Classifier Class
#### Constructor
```javascript
new Classifier(dataset, options)
```
- **dataset**: Array of objects with `text` and `label` properties
- **options**: (Optional) Configuration object with the following properties:
  - **normalize**: (Boolean, default: false) Whether to normalize texts before comparison
  - **algorithmLevel**: ('Mini' | 'Core' | 'Pro' | 'Ultra', default: 'Pro') The algorithm level
  - **keepLogToFile**: (Boolean, default: false) Whether to save classification results to a log file
  - **truncateLength**: (Number, default: 2048) Maximum text length; longer texts are truncated before similarity is calculated
  - **language**: (String, default: 'unknown') Language code (for logging purposes)
  - **max_steps**: (Number) Maximum number of edit steps per comparison; derived from `algorithmLevel` unless provided explicitly
#### Static Methods
##### `loadDatasetFromFile(language)`
Loads a dataset from a JSON file in the `datasets` directory.
- **language**: (String) The language code to load (e.g., 'tur', 'eng')
- **Returns**: Array of objects with `text` and `label` properties, or `null` on error
#### Instance Methods
##### `classify(inputs)`
Classifies the input texts.
- **inputs**: (`Array<string>`) The input texts to classify
- **Returns**: Object with classification results
##### `normalizeText(text)`
Normalizes the text.
- **text**: (String) The text to normalize
- **Returns**: The normalized text
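The exact normalization rules are not documented here. As a rough illustration only, a normalizer of this kind might lowercase the text, drop punctuation, and collapse whitespace. The function name `normalizeTextSketch` and every rule in it are assumptions, not the library's implementation.

```javascript
// Hypothetical normalizer: NOT the library's normalizeText, just a sketch
// of typical steps (Unicode composition, lowercasing, punctuation removal).
function normalizeTextSketch(text, locale) {
  return text
    .normalize('NFC')                    // compose accented characters consistently
    .toLocaleLowerCase(locale)           // pass e.g. 'tr' for Turkish casing rules
    .replace(/[^\p{L}\p{N}\s]/gu, ' ')   // keep letters, digits, and whitespace only
    .replace(/\s+/g, ' ')                // collapse runs of whitespace
    .trim();
}

console.log(normalizeTextSketch('  Bu bir TEST cümlesidir.  ')); // bu bir test cümlesidir
```

Passing a locale matters for languages such as Turkish, where the lowercase of `I` is `ı` rather than `i`.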
##### `getSimilarity(text1, text2)`
Calculates the similarity between two texts.
- **text1**: (String) The first text
- **text2**: (String) The second text
- **Returns**: Object with similarity metrics
##### `saveResultsToLog(results, language)`
Saves the classification results to a log file.
- **results**: (Array) The classification results
- **language**: (String) The language code for the log file name
## Dataset Format
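A dataset file is a JSON object that maps each example text to its label. `loadDatasetFromFile` returns the entries as an array of objects with `text` and `label` properties, which is the shape the constructor expects: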
```json
{
"example text 1": "label1",
"example text 2": "label2",
"example text 3": "label1"
}
```
Save your dataset files in the `datasets` directory with the naming convention `datas_[language].json` (e.g., `datas_tur.json` for Turkish).
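If you keep data in the file format above but want to build the constructor's array yourself, the conversion is a one-liner. The helper name `toDatasetArray` is hypothetical and is not part of the library; this is only a sketch of how the two shapes relate.

```javascript
// Hypothetical helper: turn the on-disk { text: label } map into
// the [{ text, label }] array that the constructor accepts.
function toDatasetArray(rawDataset) {
  return Object.entries(rawDataset).map(([text, label]) => ({ text, label }));
}

const raw = {
  'example text 1': 'label1',
  'example text 2': 'label2',
  'example text 3': 'label1'
};

const converted = toDatasetArray(raw);
console.log(converted[0]); // { text: 'example text 1', label: 'label1' }
```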
## Algorithm Levels
- **Mini**: Fast but less accurate (max_steps: 10)
- **Core**: Balanced performance (max_steps: 50)
- **Pro**: Good accuracy with reasonable performance (max_steps: 200)
- **Ultra**: Maximum accuracy, may be slower (max_steps: Infinity)
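The mapping above can be pictured as a simple lookup. The sketch below uses the values from the list; the names `MAX_STEPS_BY_LEVEL` and `resolveMaxSteps` are hypothetical, and letting an explicit `max_steps` override the level is an assumption based on the option description.

```javascript
// Values taken from the list above; names are illustrative only.
const MAX_STEPS_BY_LEVEL = {
  Mini: 10,
  Core: 50,
  Pro: 200,
  Ultra: Infinity
};

function resolveMaxSteps(options = {}) {
  // Assumption: an explicitly provided max_steps takes precedence.
  if (typeof options.max_steps === 'number') return options.max_steps;

  // Otherwise derive it from the level, defaulting to 'Pro'.
  const level = options.algorithmLevel ?? 'Pro';
  if (!Object.prototype.hasOwnProperty.call(MAX_STEPS_BY_LEVEL, level)) {
    throw new RangeError(`Unknown algorithmLevel: ${level}`);
  }
  return MAX_STEPS_BY_LEVEL[level];
}

console.log(resolveMaxSteps({ algorithmLevel: 'Core' })); // 50
```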
## Example
```javascript
import Classifier from 'classification.js';

// Create a custom dataset
const dataset = [
  { text: "Hello world", label: "greeting" },
  { text: "How are you?", label: "question" },
  { text: "Goodbye", label: "farewell" }
];

// Create a classifier with custom options
const classifier = new Classifier(dataset, {
  normalize: true,
  algorithmLevel: 'Ultra',
  keepLogToFile: true,
  language: 'eng'
});

// Classify multiple texts
const results = classifier.classify([
  "Hello there!",
  "How's everything?",
  "See you later"
]);
console.log(results);
```
## How It Works
The library uses a modified Damerau-Levenshtein distance algorithm to calculate the similarity between input texts and dataset entries. This algorithm measures the "edit distance" between two strings: the minimum number of operations (insertions, deletions, substitutions, and transpositions of adjacent characters) required to change one string into the other.
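To make the idea concrete, here is a minimal sketch of an edit distance with transpositions. It implements the optimal string alignment variant (sometimes called restricted Damerau-Levenshtein), which is a common textbook form; the library's own modified algorithm may differ, and the function name `osaDistance` is illustrative.

```javascript
// Optimal string alignment distance: Levenshtein plus adjacent transpositions.
function osaDistance(a, b) {
  const s = Array.from(a); // split by code point so astral characters count once
  const t = Array.from(b);
  const rows = s.length + 1;
  const cols = t.length + 1;

  // d[i][j] = distance between the first i chars of s and the first j chars of t
  const d = Array.from({ length: rows }, () => new Array(cols).fill(0));
  for (let i = 0; i < rows; i++) d[i][0] = i;
  for (let j = 0; j < cols; j++) d[0][j] = j;

  for (let i = 1; i < rows; i++) {
    for (let j = 1; j < cols; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution (or match when cost is 0)
      );
      // Adjacent transposition, e.g. "or" <-> "ro"
      if (i > 1 && j > 1 && s[i - 1] === t[j - 2] && s[i - 2] === t[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[s.length][t.length];
}

console.log(osaDistance('form', 'from'));      // 1
console.log(osaDistance('kitten', 'sitting')); // 3
```

The swap in `form` and `from` counts as a single operation, whereas plain Levenshtein distance would charge two substitutions.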
### Classification Process
1. The input text is normalized (if the option is enabled)
2. The algorithm compares the input text with each entry in the dataset
3. For each comparison, a similarity score is calculated
4. The dataset entry with the highest similarity score is selected as the match
5. The label of the best match is returned as the classification result
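The steps above can be sketched in a few lines. This is an illustration under assumptions, not the library's code: the similarity formula (one minus distance divided by the longer length), the simple lowercase-and-trim normalization, and the names `classifyOne`, `similarity`, and `normalizeInput` are all hypothetical.

```javascript
// Edit distance with adjacent transpositions (optimal string alignment).
function editDistance(a, b) {
  const d = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[a.length][b.length];
}

// Assumed normalization: lowercase and trim only.
const normalizeInput = (text) => text.toLowerCase().trim();

// Assumed score: 1 for identical strings, approaching 0 as they diverge.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - editDistance(a, b) / longest;
}

function classifyOne(input, dataset, { normalize = false } = {}) {
  const prepare = normalize ? normalizeInput : (text) => text;
  const query = prepare(input);                              // step 1
  let best = null;
  for (const entry of dataset) {                             // step 2
    const score = similarity(query, prepare(entry.text));    // step 3
    if (best === null || score > best.similarity) {          // step 4
      best = { label: entry.label, match: entry.text, similarity: score };
    }
  }
  return best;                                               // step 5 (null if dataset is empty)
}

const sampleDataset = [
  { text: 'Hello world', label: 'greeting' },
  { text: 'How are you?', label: 'question' },
  { text: 'Goodbye', label: 'farewell' }
];
console.log(classifyOne('hello wrold', sampleDataset, { normalize: true }).label); // greeting
```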
### Performance Optimization
The algorithm includes several optimizations:
- Early termination when the edit distance exceeds the maximum steps
- Text truncation to limit the comparison length
- Different algorithm levels to balance accuracy and performance
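The first two optimizations can be combined in one bounded distance function. The sketch below is an assumption about how such a cutoff might work, not the library's code: it truncates both inputs, keeps only three rows of the table, and gives up as soon as every value in the current row exceeds the limit. Returning `Infinity` for "over the limit" and the name `boundedDistance` are choices made for this example.

```javascript
function boundedDistance(a, b, maxSteps = Infinity, truncateLength = 2048) {
  // Truncation: cap the amount of text that is compared.
  const s = a.slice(0, truncateLength);
  const t = b.slice(0, truncateLength);

  // The length difference is a lower bound on the distance.
  if (Math.abs(s.length - t.length) > maxSteps) return Infinity;

  let prevPrev = null;                                            // row i - 2
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);   // row i - 1

  for (let i = 1; i <= s.length; i++) {
    const curr = new Array(t.length + 1);
    curr[0] = i;
    let rowMin = curr[0];

    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      let value = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
      // Adjacent transposition
      if (i > 1 && j > 1 && s[i - 1] === t[j - 2] && s[i - 2] === t[j - 1]) {
        value = Math.min(value, prevPrev[j - 2] + 1);
      }
      curr[j] = value;
      if (value < rowMin) rowMin = value;
    }

    // Early termination: once a whole row exceeds the limit,
    // no later row can get back under it.
    if (rowMin > maxSteps) return Infinity;

    prevPrev = prev;
    prev = curr;
  }

  const distance = prev[t.length];
  return distance > maxSteps ? Infinity : distance;
}

console.log(boundedDistance('kitten', 'sitting', 3));  // 3
console.log(boundedDistance('abcdef', 'uvwxyz', 2));   // Infinity
```

A lower limit means clearly dissimilar entries are rejected after a few rows, which is where a level such as Mini would save time compared with Ultra.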
## Advanced Usage
### Custom Dataset Creation
```javascript
import Classifier from 'classification.js';

// Create a dataset programmatically
const dataset = [];
for (let i = 0; i < 100; i++) {
  dataset.push({
    text: `Example text ${i}`,
    label: i % 2 === 0 ? 'even' : 'odd'
  });
}

const classifier = new Classifier(dataset, { algorithmLevel: 'Core' });
```
### Batch Classification
```javascript
// Classify multiple texts at once
// (`classifier` is an instance created as in the examples above)
const textsToClassify = [
  "First example",
  "Second example",
  "Third example",
  // ... more texts
];
const results = classifier.classify(textsToClassify);
```
### Using with Different Languages
```javascript
// Turkish
const turkishDataset = Classifier.loadDatasetFromFile('tur');
const turkishClassifier = new Classifier(turkishDataset, {
  language: 'tur',
  normalize: true
});

// English
const englishDataset = Classifier.loadDatasetFromFile('eng');
const englishClassifier = new Classifier(englishDataset, {
  language: 'eng',
  normalize: true
});

// Classify in both languages
const turkishResults = turkishClassifier.classify(['Merhaba dünya']);
const englishResults = englishClassifier.classify(['Hello world']);
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
ISC
## Author
Nixaut