audio-duplicates
Version:
Fast audio duplicate detection using Chromaprint fingerprinting
497 lines (377 loc) • 13.6 kB
Markdown
# Audio Duplicates
[](https://badge.fury.io/js/audio-duplicates)
[](https://opensource.org/licenses/MIT)
[](https://github.com/mcande21/audio-duplicates/actions)
[](https://www.npmjs.com/package/audio-duplicates)
A high-performance audio duplicate detection library built with native C++ and Chromaprint fingerprinting technology. Quickly find duplicate audio files across large collections with robust detection that handles different encodings, bitrates, and formats.
## ✨ Features
- **🚀 High Performance**: Native C++ implementation ~200x faster than JavaScript
- **🎵 Format Support**: MP3, WAV, FLAC, OGG, M4A, AAC, WMA, and more
- **⚡ Fast Matching**: Optimized inverted index for O(1) duplicate lookups
- **🔧 Robust Detection**: Handles different bitrates, sample rates, and encodings
- **💻 CLI Tool**: Full-featured command-line interface for batch processing
- **📝 TypeScript Support**: Complete TypeScript definitions included
- **🌍 Cross-Platform**: Windows, macOS, and Linux support
- **📊 Progress Reporting**: Real-time progress bars and statistics
## 📦 Installation
### Prerequisites
Install the required system libraries first:
#### macOS
```bash
brew install chromaprint libsndfile
```
#### Ubuntu/Debian
```bash
sudo apt-get update
sudo apt-get install libchromaprint-dev libsndfile1-dev
```
#### Windows
Download and install:
- [Chromaprint](https://github.com/acoustid/chromaprint/releases)
- [libsndfile](http://www.mega-nerd.com/libsndfile/#Download)
### Install the Package
#### Global Installation (Recommended for CLI)
```bash
npm install -g audio-duplicates
```
#### Local Installation (for API usage)
```bash
npm install audio-duplicates
```
The package automatically uses prebuilt binaries when available, falling back to source compilation if needed.
## 🚀 Quick Start
### CLI Usage
#### Scan for Duplicates
```bash
# Scan a single directory
audio-duplicates scan /path/to/music
# Scan multiple directories
audio-duplicates scan /music/collection1 /music/collection2
# Scan with custom threshold
audio-duplicates scan /path/to/music --threshold 0.9
# Save results to file
audio-duplicates scan /path/to/music --output duplicates.json --format json
```
#### Compare Two Files
```bash
audio-duplicates compare song1.mp3 song2.mp3
```
#### Generate Fingerprint
```bash
# Generate and display fingerprint
audio-duplicates fingerprint song.mp3
# Save fingerprint to file
audio-duplicates fingerprint song.mp3 --output fingerprint.json
```
### API Usage
#### Basic Duplicate Detection
```javascript
const audioDuplicates = require('audio-duplicates');
async function findDuplicates() {
// Scan directory for duplicates
const duplicates = await audioDuplicates.scanDirectoryForDuplicates('/path/to/music', {
threshold: 0.85,
onProgress: (progress) => {
console.log(`Processing: ${progress.current}/${progress.total} - ${progress.file}`);
}
});
// Display results
duplicates.forEach((group, index) => {
console.log(`\nDuplicate Group ${index + 1}:`);
group.files.forEach(file => {
console.log(` ${file.path} (similarity: ${file.similarity})`);
});
});
}
findDuplicates().catch(console.error);
```
#### Manual Fingerprint Comparison
```javascript
const audioDuplicates = require('audio-duplicates');
async function compareFiles() {
// Generate fingerprints
const fp1 = await audioDuplicates.generateFingerprint('file1.mp3');
const fp2 = await audioDuplicates.generateFingerprint('file2.mp3');
// Compare fingerprints
const result = await audioDuplicates.compareFingerprints(fp1, fp2);
console.log('Similarity Score:', result.similarityScore);
console.log('Are Duplicates:', result.isDuplicate);
console.log('Confidence:', result.confidence);
}
compareFiles().catch(console.error);
```
#### Batch Processing with Index
```javascript
const audioDuplicates = require('audio-duplicates');
async function batchProcess() {
// Initialize index for batch processing
await audioDuplicates.initializeIndex();
// Add files to index
const files = ['song1.mp3', 'song2.mp3', 'song3.mp3'];
for (const file of files) {
const fileId = await audioDuplicates.addFileToIndex(file);
console.log(`Added ${file} with ID: ${fileId}`);
}
// Find all duplicates in the index
const duplicateGroups = await audioDuplicates.findAllDuplicates();
console.log('Found', duplicateGroups.length, 'duplicate groups');
// Get index statistics
const stats = await audioDuplicates.getIndexStats();
console.log('Index Stats:', stats);
// Clear index when done
await audioDuplicates.clearIndex();
}
batchProcess().catch(console.error);
```
### TypeScript Usage
```typescript
import * as audioDuplicates from 'audio-duplicates';
import { DuplicateGroup, ScanOptions, Fingerprint } from 'audio-duplicates';
async function findDuplicatesTyped(): Promise<DuplicateGroup[]> {
const options: ScanOptions = {
threshold: 0.85,
maxDuration: 300, // 5 minutes max
onProgress: (progress: { current: number; total: number; file: string }) => {
console.log(`${progress.current}/${progress.total}: ${progress.file}`);
}
};
return await audioDuplicates.scanDirectoryForDuplicates('/path/to/music', options);
}
async function generateTypedFingerprint(filePath: string): Promise<Fingerprint> {
return await audioDuplicates.generateFingerprint(filePath);
}
```
## 📖 API Reference
### Core Functions
#### `generateFingerprint(filePath: string): Promise<Fingerprint>`
Generate an audio fingerprint from a file.
```javascript
const fingerprint = await audioDuplicates.generateFingerprint('song.mp3');
console.log('Duration:', fingerprint.duration);
console.log('Sample Rate:', fingerprint.sampleRate);
```
#### `generateFingerprintLimited(filePath: string, maxDuration: number): Promise<Fingerprint>`
Generate fingerprint with duration limit (in seconds).
```javascript
// Only fingerprint first 30 seconds
const fingerprint = await audioDuplicates.generateFingerprintLimited('song.mp3', 30);
```
#### `compareFingerprints(fp1: Fingerprint, fp2: Fingerprint): Promise<MatchResult>`
Compare two fingerprints and return similarity metrics.
```javascript
const result = await audioDuplicates.compareFingerprints(fp1, fp2);
console.log('Similarity:', result.similarityScore); // 0.0 to 1.0
console.log('Is Duplicate:', result.isDuplicate); // boolean
console.log('Confidence:', result.confidence); // 0.0 to 1.0
```
### Index Management
#### `initializeIndex(): Promise<boolean>`
Initialize the fingerprint index for batch processing.
#### `addFileToIndex(filePath: string): Promise<number>`
Add a file to the index and return its unique ID.
#### `findAllDuplicates(): Promise<DuplicateGroup[]>`
Find all duplicate groups in the current index.
#### `getIndexStats(): Promise<IndexStats>`
Get statistics about the current index.
```javascript
const stats = await audioDuplicates.getIndexStats();
console.log('Files:', stats.fileCount);
console.log('Index Size:', stats.indexSize);
console.log('Load Factor:', stats.loadFactor);
```
#### `clearIndex(): Promise<boolean>`
Clear the current index and free memory.
### Configuration
#### `setSimilarityThreshold(threshold: number): Promise<boolean>`
Set the similarity threshold (0.0 to 1.0) for duplicate detection.
```javascript
await audioDuplicates.setSimilarityThreshold(0.9); // Stricter matching
```
### High-Level Utilities
#### `scanDirectoryForDuplicates(directory: string, options?: ScanOptions): Promise<DuplicateGroup[]>`
Scan a directory for duplicates with progress reporting.
**Options:**
- `threshold?: number` - Similarity threshold (default: 0.85)
- `maxDuration?: number` - Max duration to fingerprint in seconds
- `onProgress?: (progress) => void` - Progress callback
- `recursive?: boolean` - Scan subdirectories (default: true)
## 🖥️ CLI Reference
### Commands
#### `scan <directories...>`
Scan directories for duplicate audio files.
```bash
# Basic scan
audio-duplicates scan /music
# Advanced options
audio-duplicates scan /music \
--threshold 0.9 \
--format json \
--output results.json \
--max-duration 180 \
--no-progress
```
**Options:**
- `--threshold <number>` - Similarity threshold (0.0-1.0, default: 0.85)
- `--format <format>` - Output format: `json`, `csv`, or `text` (default: text)
- `--output <file>` - Output file path
- `--max-duration <seconds>` - Maximum duration to fingerprint
- `--no-progress` - Disable progress bar
- `--recursive` - Scan subdirectories (default: true)
#### `compare <file1> <file2>`
Compare two audio files directly.
```bash
audio-duplicates compare song1.mp3 song2.wav --max-duration 60
```
#### `fingerprint <file>`
Generate and display fingerprint for an audio file.
```bash
audio-duplicates fingerprint song.mp3 --output fingerprint.json
```
### Global Options
- `-v, --verbose` - Verbose output with detailed information
- `--threshold <number>` - Global similarity threshold
- `--format <format>` - Global output format
## 📊 Performance
### Benchmarks
On a modern CPU (Apple M1):
- **Fingerprint Generation**: 2-5x real-time (faster than playback)
- **Index Lookup**: ~1ms per query
- **Full Comparison**: 10-50ms depending on file length
- **Memory Usage**: ~4KB per minute of audio
- **Scalability**: Efficiently handles 10,000+ files
### Example Performance
```
Collection Size: 10,000 files (50GB)
Scan Time: ~8 minutes
Memory Usage: ~200MB
Duplicates Found: 847 groups (2,341 files)
```
## 🔧 Advanced Usage
### Custom Similarity Thresholds
```javascript
// Exact duplicates only (very strict)
await audioDuplicates.setSimilarityThreshold(0.95);
// Similar versions (more permissive)
await audioDuplicates.setSimilarityThreshold(0.75);
// Near-identical files (default)
await audioDuplicates.setSimilarityThreshold(0.85);
```
### Handling Large Collections
```javascript
async function processLargeCollection(directories) {
await audioDuplicates.initializeIndex();
for (const dir of directories) {
console.log(`Processing directory: ${dir}`);
// Process in batches to manage memory
const duplicates = await audioDuplicates.scanDirectoryForDuplicates(dir, {
threshold: 0.85,
maxDuration: 300, // Limit to 5 minutes per file
onProgress: (progress) => {
if (progress.current % 100 === 0) {
console.log(`Processed ${progress.current}/${progress.total} files`);
}
}
});
console.log(`Found ${duplicates.length} duplicate groups in ${dir}`);
}
// Get final results
const allDuplicates = await audioDuplicates.findAllDuplicates();
console.log(`Total duplicate groups: ${allDuplicates.length}`);
await audioDuplicates.clearIndex();
}
```
### Output Formats
#### JSON Output
```bash
audio-duplicates scan /music --format json --output results.json
```
```json
{
"summary": {
"totalFiles": 1500,
"duplicateGroups": 23,
"duplicateFiles": 67,
"spaceWasted": "1.2GB"
},
"duplicateGroups": [
{
"groupId": 1,
"avgSimilarity": 0.94,
"files": [
{
"path": "/music/song1.mp3",
"size": 5242880,
"similarity": 1.0
},
{
"path": "/music/copy/song1.mp3",
"size": 5242880,
"similarity": 0.94
}
]
}
]
}
```
#### CSV Output
```bash
audio-duplicates scan /music --format csv --output results.csv
```
## 🐛 Troubleshooting
### Common Issues
#### Build Errors
```bash
# macOS: Install dependencies
brew install chromaprint libsndfile
# Ubuntu: Install dependencies
sudo apt-get install libchromaprint-dev libsndfile1-dev
# Clear npm cache and rebuild
npm cache clean --force
npm rebuild
```
#### Runtime Errors
**"Could not locate bindings file"**
```bash
npm run build
```
**"Failed to open audio file"**
- Check file format is supported
- Verify file permissions
- Ensure file is not corrupted
**"Index not initialized"**
```javascript
// Always initialize before using index functions
await audioDuplicates.initializeIndex();
```
### Performance Optimization
For large collections:
- Use `maxDuration` to limit fingerprint length
- Process directories in batches
- Increase similarity threshold for faster results
- Use SSD storage for audio files
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make your changes and add tests
4. Run the test suite: `npm test`
5. Submit a pull request
### Development Setup
```bash
git clone https://github.com/mcande21/audio-duplicates.git
cd audio-duplicates
npm install
npm run build
npm test
```
## 📄 License
MIT License - see [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- [Chromaprint](https://github.com/acoustid/chromaprint) - Audio fingerprinting library
- [libsndfile](http://www.mega-nerd.com/libsndfile/) - Audio file I/O library
- [Node-API](https://nodejs.org/api/n-api.html) - Native addon interface
## 🔗 Related Projects
- [AcoustID](https://acoustid.org/) - Audio identification service
- [fpcalc](https://github.com/acoustid/chromaprint) - Command-line fingerprinting tool
- [MusicBrainz](https://musicbrainz.org/) - Music metadata database
---
**Happy duplicate hunting! 🎵**