# Subtexty
Extract clean plain-text from subtitle files with intelligent deduplication and format support.
[MIT License](https://opensource.org/licenses/MIT)
## Overview
Subtexty is a lightweight, open-source CLI tool and TypeScript library that extracts clean, deduplicated plain text from subtitle files. It strips styling tags and timing metadata and removes redundant content while preserving the original text flow.
## Features
- **Smart Text Extraction**: Removes timing, positioning, and style tags while preserving content
- **Intelligent Deduplication**: Eliminates redundant lines and prefix duplicates
- **Multi-Format Support**: WebVTT (.vtt), SRT (.srt), TTML (.ttml/.xml), SBV (.sbv), JSON3 (.json/.json3)
- **Encoding Handling**: UTF-8 by default with fallback encoding detection and manual override support
- **Dual Interface**: Both CLI tool and programmatic library
- **Performance**: Stream processing for memory efficiency
- **Well Tested**: 80%+ test coverage with comprehensive test suite
## Installation
### NPM (Global CLI)
```bash
npm install -g subtexty
```
### NPM (Project Dependency)
```bash
npm install subtexty
```
## Quick Start
### CLI Usage
```bash
# Extract text to stdout
subtexty input.vtt
# Save to file
subtexty input.srt -o clean-text.txt
# Specify encoding
subtexty input.vtt --encoding utf-8
```
### Library Usage
```typescript
import { extractText } from 'subtexty';
// Basic extraction
const cleanText = await extractText('subtitles.vtt');
console.log(cleanText);
// With options (a separate name avoids redeclaring `cleanText`)
const srtText = await extractText('subtitles.srt', {
  encoding: 'utf-8'
});
```
## CLI Reference
### Basic Usage
```bash
subtexty [options] <input-file>
```
### Arguments
- `input-file` - Subtitle file to process (required)
### Options
- `-v, --version` - Display version number
- `-o, --output <file>` - Output file (default: stdout)
- `--encoding <encoding>` - File encoding (default: utf-8)
- `-h, --help` - Display help for command
### Examples
```bash
# Basic text extraction
subtexty movie-subtitles.vtt
# Multiple file processing with output
subtexty episode1.srt -o episode1-text.txt
subtexty episode2.srt -o episode2-text.txt
# Handle different encodings
subtexty foreign-film.srt --encoding latin1
# Pipe to other tools
subtexty subtitles.vtt | wc -w # Word count
subtexty subtitles.vtt | grep "keyword" # Search
```
### Exit Codes
- `0` - Success
- `1` - File error (not found, permissions, etc.)
- `2` - Parsing error (invalid format, corrupted data)
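Scripts that wrap the CLI may want to turn these codes into readable messages. The sketch below only restates the list above; `describeExitCode` is a hypothetical helper for illustration and is not part of the subtexty API.

```typescript
// Hypothetical helper (not part of subtexty): maps the documented
// exit codes to short human-readable descriptions.
function describeExitCode(code: number): string {
  switch (code) {
    case 0:
      return 'success';
    case 1:
      return 'file error (not found, permissions, etc.)';
    case 2:
      return 'parsing error (invalid format, corrupted data)';
    default:
      // Codes outside the documented set are reported as unknown.
      return `unknown exit code ${code}`;
  }
}
```

A wrapper could pass the exit status of a spawned `subtexty` process to this function before logging it.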
## Library API
### `extractText(filePath, options?)`
Extracts clean text from a subtitle file.
**Parameters:**
- `filePath` (string) - Path to the subtitle file
- `options` (object, optional) - Extraction options
  - `encoding` (string) - File encoding (default: utf-8)
**Returns:**
- `Promise<string>` - Clean extracted text
**Example:**
```typescript
import { extractText } from 'subtexty';
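try {
  const text = await extractText('./subtitles.vtt');
  console.log(text);
} catch (error) {
  // `error` is `unknown` under strict TypeScript, so narrow it before reading `.message`
  const message = error instanceof Error ? error.message : String(error);
  console.error('Extraction failed:', message);
}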
```
### Error Handling
```typescript
import { extractText, isSubtextyError } from 'subtexty';
try {
  const text = await extractText('file.vtt', { encoding: 'utf-8' });
  // Process text...
} catch (error) {
  if (isSubtextyError(error)) {
    // Handle specific subtexty errors
    switch (error.code) {
      case 'FILE_NOT_FOUND':
        console.error('Subtitle file does not exist');
        break;
      case 'UNSUPPORTED_FORMAT':
        console.error('File format not supported');
        break;
      case 'FILE_NOT_READABLE':
        console.error('Cannot read the file');
        break;
      default:
        console.error('Extraction error:', error.message);
    }
  } else {
    // Not a subtexty error: narrow `unknown` before reading `.message`
    const message = error instanceof Error ? error.message : String(error);
    console.error('Unexpected error:', message);
  }
}
```
## Supported Formats
| Format | Extensions | Description |
|--------|------------|-------------|
| **WebVTT** | `.vtt` | Web Video Text Tracks |
| **SRT** | `.srt` | SubRip Subtitle |
| **TTML** | `.ttml`, `.xml` | Timed Text Markup Language |
| **SBV** | `.sbv` | YouTube SBV format |
| **JSON3** | `.json`, `.json3` | JSON-based subtitle format |
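
The table can be read as a simple extension lookup. The following is a minimal sketch under that assumption; `FORMAT_BY_EXTENSION` and `formatForFile` are hypothetical names, and subtexty's actual format detection may work differently (for example, by inspecting file contents).

```typescript
// Hypothetical lookup built from the Supported Formats table;
// not taken from subtexty's source.
type SubtitleFormat = 'WebVTT' | 'SRT' | 'TTML' | 'SBV' | 'JSON3';

const FORMAT_BY_EXTENSION = new Map<string, SubtitleFormat>([
  ['.vtt', 'WebVTT'],
  ['.srt', 'SRT'],
  ['.ttml', 'TTML'],
  ['.xml', 'TTML'],
  ['.sbv', 'SBV'],
  ['.json', 'JSON3'],
  ['.json3', 'JSON3'],
]);

// Returns the format for a file name, or undefined when the extension
// is missing or not listed in the table (for example `.txt`).
function formatForFile(fileName: string): SubtitleFormat | undefined {
  const dot = fileName.lastIndexOf('.');
  if (dot === -1) return undefined;
  const extension = fileName.slice(dot).toLowerCase();
  return FORMAT_BY_EXTENSION.get(extension);
}
```

Lowercasing the extension is a design choice of this sketch so that `MOVIE.VTT` and `movie.vtt` are treated alike.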
## Text Processing Features
### Tag Removal
Removes HTML, XML, and styling tags:
```
Input: <b>Bold text</b> and <i>italic</i>
Output: Bold text and italic
```
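
As a rough illustration of the idea, tag stripping can be approximated with a single regular expression. This is a naive sketch, not subtexty's implementation: `stripTags` is a hypothetical name, and the pattern would also remove ordinary text that happens to sit between `<` and `>`.

```typescript
// Illustrative only: removes any `<...>` span that contains no nested
// angle brackets, which covers tags such as <b>, </i>, and <c.yellow>.
function stripTags(line: string): string {
  return line.replace(/<[^<>]+>/g, '');
}
```

Applied to the input above, `stripTags('<b>Bold text</b> and <i>italic</i>')` returns `Bold text and italic`.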
### Entity Conversion
Converts HTML entities:
```
Input:  Tom &amp; Jerry say &quot;Hello&quot;
Output: Tom & Jerry say "Hello"
```
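
A minimal sketch of entity decoding, assuming only a handful of named entities plus numeric references need to be handled. `NAMED_ENTITIES` and `decodeEntities` are hypothetical names; the set of entities subtexty actually supports is not documented here.

```typescript
// Hypothetical table covering a few common named entities.
const NAMED_ENTITIES = new Map<string, string>([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
]);

// Decodes named (&amp;), decimal (&#233;), and hex (&#xE9;) references
// in a single pass. Unknown or out-of-range references are left as-is.
function decodeEntities(text: string): string {
  return text.replace(
    /&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z]+);/g,
    (match: string, body: string): string => {
      if (body.startsWith('#')) {
        const isHex = body[1] === 'x';
        const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
        // Code points above U+10FFFF are invalid; keep the original text.
        return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
      }
      return NAMED_ENTITIES.get(body) ?? match;
    },
  );
}
```

Because the replacement runs in one pass, `&amp;lt;` becomes the literal text `&lt;` rather than being decoded twice.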
### Smart Deduplication
Removes redundant content intelligently:
**Exact Duplicates:**
```
Input:  Same line
        Same line
        Different line
Output: Same line
        Different line
```
**Prefix Removal:**
```
Input:  I love coding
        I love coding with TypeScript
        Amazing results
Output: I love coding with TypeScript
        Amazing results
```
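
Both rules can be sketched in one pass over the lines. This sketch assumes that only adjacent lines are compared, which fits rolling captions where each cue repeats and extends the previous one; whether subtexty compares adjacent lines or the whole file is not stated here, and `dedupeLines` is a hypothetical name.

```typescript
// Illustrative only: drops a line when it equals the previously kept
// line, and replaces the previously kept line when the new line
// starts with it (the "prefix" case).
function dedupeLines(lines: string[]): string[] {
  const result: string[] = [];
  for (const line of lines) {
    const previous = result.length > 0 ? result[result.length - 1] : undefined;
    if (previous !== undefined && line === previous) {
      // Exact duplicate of the previous kept line: skip it.
      continue;
    }
    if (previous !== undefined && line.startsWith(previous)) {
      // The new line extends the previous one: keep only the longer line.
      result[result.length - 1] = line;
      continue;
    }
    result.push(line);
  }
  return result;
}
```

With the prefix example above, the three input lines reduce to `I love coding with TypeScript` and `Amazing results`.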
### Whitespace Normalization
Cleans up spacing issues:
```
Input:  Multiple    spaces   and     tabs
Output: Multiple spaces and tabs
```
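
One common way to normalize whitespace is to collapse every run of spaces and tabs into a single space. The sketch below takes that approach and also trims the ends of the line, which is an assumption; `normalizeWhitespace` is a hypothetical name rather than a subtexty export.

```typescript
// Illustrative only: collapses runs of whitespace (spaces, tabs,
// newlines) into one space and trims leading and trailing whitespace.
function normalizeWhitespace(line: string): string {
  return line.replace(/\s+/g, ' ').trim();
}
```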
## Development
### Prerequisites
- Node.js ≥14.0.0
- pnpm (recommended) or npm
### Installation
```bash
git clone https://github.com/bytesnack114/subtexty.git
cd subtexty
pnpm install
```
### Development Scripts
```bash
# Development
pnpm dev input.vtt # Run CLI in development mode
pnpm build # Build TypeScript
pnpm clean # Clean build artifacts
# Testing
pnpm test # Run test suite
pnpm test:watch # Watch mode testing
pnpm test:coverage # Coverage report
# Code Quality
pnpm lint # Run ESLint
pnpm lint:fix # Fix linting issues
```
### Project Structure
```
subtexty/
├── src/
│   ├── cli.ts            # CLI interface
│   ├── constants.ts      # Application constants
│   ├── errors.ts         # Custom error classes
│   ├── index.ts          # Library entry point
│   ├── validation.ts     # Input validation
│   ├── cli/              # CLI-specific modules
│   ├── parsers/          # Format-specific parsers
│   ├── types/            # TypeScript definitions
│   ├── utils/            # Text cleaning utilities
│   └── __tests__/        # Test suite
├── coverage/             # Coverage report (generated by `pnpm test:coverage`)
├── dist/                 # Built files (generated by `pnpm build`)
└── example/              # Example input files
```
## Contributing
### Quick Contribution Steps
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make changes and add tests
4. Run tests with coverage: `pnpm test:coverage`
5. Commit changes: `git commit -m 'Add amazing feature'`
6. Push to branch: `git push origin feature/amazing-feature`
7. Open a Pull Request
## Testing
The test suite maintains 80%+ coverage. Common commands:
```bash
# Run all tests
pnpm test
# Generate coverage report
pnpm test:coverage
# View coverage report (macOS; on Linux use xdg-open)
open coverage/lcov-report/index.html
```
### Test Categories
- **Unit Tests**: Individual component testing
- **Integration Tests**: End-to-end workflow testing
- **Parser Tests**: Format-specific parsing validation
- **CLI Tests**: Command-line interface testing
## Performance
- **Memory Efficient**: Stream processing for large files
- **Fast Processing**: Optimized text cleaning pipeline
- **Minimal Dependencies**: Only essential packages included
## Troubleshooting
### Common Issues
**File Not Found Error**
```bash
Error: Input file not found: subtitle.vtt
```
*Solution*: Check file path and permissions
**Unsupported Format**
```bash
Error: Unsupported file format: .txt
```
*Solution*: Use a supported subtitle format (.vtt, .srt, .ttml, .xml, .sbv, .json, .json3)
**Encoding Issues**
```bash
# Specify encoding manually
subtexty file.srt --encoding latin1
```
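
When the encoding of a file is unknown, one possible approach is to try strict UTF-8 first and fall back to another encoding only if the bytes are not valid UTF-8. The sketch below uses the standard `TextDecoder` API, which is available globally in Node.js. `decodeWithFallback` is a hypothetical helper and does not describe how subtexty's own fallback detection works.

```typescript
// Hypothetical helper, not part of subtexty: tries strict UTF-8 and
// falls back to a caller-supplied encoding when the bytes are invalid.
function decodeWithFallback(bytes: Uint8Array, fallback: string = 'latin1'): string {
  try {
    // `fatal: true` makes TextDecoder throw on invalid UTF-8 instead of
    // silently inserting U+FFFD replacement characters.
    return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  } catch {
    return new TextDecoder(fallback).decode(bytes);
  }
}
```

The bytes could come from `fs.readFileSync(path)`, since a Node.js `Buffer` is a `Uint8Array`.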
**Permission Errors**
```bash
# Check file permissions
ls -la subtitle-file.vtt
chmod +r subtitle-file.vtt
```
## License
MIT License - see [LICENSE.md](LICENSE.md) file for details.
## Support
- **Bug Reports**: [GitHub Issues](https://github.com/bytesnack114/subtexty/issues)
- **Email**: bytesnack114@gmail.com