# Subtexty

Extract clean plain-text from subtitle files with intelligent deduplication and format support.

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

## Overview

Subtexty is a lightweight, open-source CLI tool and TypeScript library that extracts clean, deduplicated plain-text from subtitle files. It strips styling tags and timing metadata and removes redundant content while preserving the original text flow.

## Features

- 🎯 **Smart Text Extraction**: Removes timing, positioning, and style tags while preserving content
- 🔄 **Intelligent Deduplication**: Eliminates redundant lines and prefix duplicates
- 🌐 **Multi-Format Support**: WebVTT (.vtt), SRT (.srt), TTML (.ttml/.xml), SBV (.sbv), JSON3 (.json/.json3)
- 🔤 **Encoding Handling**: UTF-8 by default with fallback encoding detection and manual override support
- 📝 **Dual Interface**: Both CLI tool and programmatic library
- ⚡ **Performance**: Stream processing for memory efficiency
- 🧪 **Well Tested**: 80%+ test coverage with comprehensive test suite

## Installation

### NPM (Global CLI)

```bash
npm install -g subtexty
```

### NPM (Project Dependency)

```bash
npm install subtexty
```

## Quick Start

### CLI Usage

```bash
# Extract text to stdout
subtexty input.vtt

# Save to file
subtexty input.srt -o clean-text.txt

# Specify encoding
subtexty input.vtt --encoding utf-8
```

### Library Usage

```typescript
import { extractText } from 'subtexty';

// Basic extraction
const cleanText = await extractText('subtitles.vtt');
console.log(cleanText);

// With options (a distinct name avoids redeclaring `cleanText`)
const cleanTextWithOptions = await extractText('subtitles.srt', {
  encoding: 'utf-8'
});
```

## CLI Reference

### Basic Usage

```bash
subtexty [options] <input-file>
```

### Arguments

- `input-file` - Subtitle file to process (required)

### Options

- `-v, --version` - Display version number
- `-o, --output <file>` - Output file (default: stdout)
- `--encoding <encoding>` - File encoding (default: utf-8)
- `-h, --help` - Display help for command

### Examples

```bash
# Basic text extraction
subtexty movie-subtitles.vtt

# Multiple file processing with output
subtexty episode1.srt -o episode1-text.txt
subtexty episode2.srt -o episode2-text.txt

# Handle different encodings
subtexty foreign-film.srt --encoding latin1

# Pipe to other tools
subtexty subtitles.vtt | wc -w           # Word count
subtexty subtitles.vtt | grep "keyword"  # Search
```

### Exit Codes

- `0` - Success
- `1` - File error (not found, permissions, etc.)
- `2` - Parsing error (invalid format, corrupted data)

## Library API

### `extractText(filePath, options?)`

Extracts clean text from a subtitle file.

**Parameters:**

- `filePath` (string) - Path to the subtitle file
- `options` (object, optional) - Extraction options
  - `encoding` (string) - File encoding (default: utf-8)

**Returns:**

- `Promise<string>` - Clean extracted text

**Example:**

```typescript
import { extractText } from 'subtexty';

try {
  const text = await extractText('./subtitles.vtt');
  console.log(text);
} catch (error) {
  // `error` is `unknown` under strict TypeScript settings, so narrow it first
  console.error('Extraction failed:', error instanceof Error ? error.message : error);
}
```

### Error Handling

```typescript
import { extractText, isSubtextyError } from 'subtexty';

try {
  const text = await extractText('file.vtt', { encoding: 'utf-8' });
  // Process text...
} catch (error) {
  if (isSubtextyError(error)) {
    // Handle specific subtexty errors
    switch (error.code) {
      case 'FILE_NOT_FOUND':
        console.error('Subtitle file does not exist');
        break;
      case 'UNSUPPORTED_FORMAT':
        console.error('File format not supported');
        break;
      case 'FILE_NOT_READABLE':
        console.error('Cannot read the file');
        break;
      default:
        console.error('Extraction error:', error.message);
    }
  } else {
    // Not a subtexty error: narrow before reading `.message`
    console.error('Unexpected error:', error instanceof Error ? error.message : error);
  }
}
```

## Supported Formats

| Format | Extensions | Description |
|--------|------------|-------------|
| **WebVTT** | `.vtt` | Web Video Text Tracks |
| **SRT** | `.srt` | SubRip Subtitle |
| **TTML** | `.ttml`, `.xml` | Timed Text Markup Language |
| **SBV** | `.sbv` | YouTube SBV format |
| **JSON3** | `.json`, `.json3` | JSON-based subtitle format |

## Text Processing Features

### Tag Removal

Removes HTML, XML, and styling tags:

```
Input:  <b>Bold text</b> and <i>italic</i>
Output: Bold text and italic
```

### Entity Conversion

Converts HTML entities:

```
Input:  Tom &amp; Jerry say &quot;Hello&quot;
Output: Tom & Jerry say "Hello"
```

### Smart Deduplication

Removes redundant content intelligently:

**Exact Duplicates:**

```
Input:
Same line
Same line
Different line

Output:
Same line
Different line
```

**Prefix Removal:**

```
Input:
I love coding
I love coding with TypeScript
Amazing results

Output:
I love coding with TypeScript
Amazing results
```

### Whitespace Normalization

Cleans up spacing issues:

```
Input:  Multiple    spaces   and     tabs
Output: Multiple spaces and tabs
```

## Development

### Prerequisites

- Node.js ≥14.0.0
- pnpm (recommended) or npm

### Installation

```bash
git clone https://github.com/bytesnack114/subtexty.git
cd subtexty
pnpm install
```

### Development Scripts

```bash
# Development
pnpm dev input.vtt    # Run CLI in development mode
pnpm build            # Build TypeScript
pnpm clean            # Clean build artifacts

# Testing
pnpm test             # Run test suite
pnpm test:watch       # Watch mode testing
pnpm test:coverage    # Coverage report

# Code Quality
pnpm lint             # Run ESLint
pnpm lint:fix         # Fix linting issues
```

### Project Structure

```
subtexty/
├── src/
│   ├── cli.ts           # CLI interface
│   ├── constants.ts     # Application constants
│   ├── errors.ts        # Custom error classes
│   ├── index.ts         # Library entry point
│   ├── validation.ts    # Input validation
│   ├── cli/             # CLI-specific modules
│   ├── parsers/         # Format-specific parsers
│   ├── types/           # TypeScript definitions
│   ├── utils/           # Text cleaning utilities
│   └── __tests__/       # Test suite
├── coverage/            # Coverage report (generated by `pnpm test:coverage`)
├── dist/                # Built files (generated by `pnpm build`)
└── example/             # Example input files
```

## Contributing

### Quick Contribution Steps

1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Make changes and add tests
4. Run tests with coverage: `pnpm test:coverage`
5. Commit changes: `git commit -m 'Add amazing feature'`
6. Push to branch: `git push origin feature/amazing-feature`
7. Open a Pull Request

## Testing

Subtexty has comprehensive test coverage:

```bash
# Run all tests
pnpm test

# Generate coverage report
pnpm test:coverage

# View coverage report
open coverage/lcov-report/index.html
```

### Test Categories

- **Unit Tests**: Individual component testing
- **Integration Tests**: End-to-end workflow testing
- **Parser Tests**: Format-specific parsing validation
- **CLI Tests**: Command-line interface testing

## Performance

- **Memory Efficient**: Stream processing for large files
- **Fast Processing**: Optimized text cleaning pipeline
- **Minimal Dependencies**: Only essential packages included

## Troubleshooting

### Common Issues

**File Not Found Error**

```bash
Error: Input file not found: subtitle.vtt
```

*Solution*: Check file path and permissions

**Unsupported Format**

```bash
Error: Unsupported file format: .txt
```

*Solution*: Use supported subtitle formats (.vtt, .srt, .ttml, .sbv, .json)

**Encoding Issues**

```bash
# Specify encoding manually
subtexty file.srt --encoding latin1
```

**Permission Errors**

```bash
# Check file permissions
ls -la subtitle-file.vtt
chmod +r subtitle-file.vtt
```

## License

MIT License - see [LICENSE.md](LICENSE.md) file for details.

## Support

- 🐛 **Bug Reports**: [GitHub Issues](https://github.com/bytesnack114/subtexty/issues)
- 📧 **Email**: bytesnack114@gmail.com
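
## Appendix: Text Cleaning Sketch

The steps listed under Text Processing Features (tag removal, entity conversion, whitespace normalization, and deduplication) can be illustrated with a small standalone sketch. This is not subtexty's actual implementation: the helper names (`stripTags`, `decodeEntities`, `normalizeWhitespace`, `dedupe`, `cleanLines`) are hypothetical, the entity table covers only a few common entities, and deduplication is assumed to compare each line only with the previously kept line.

```typescript
// Illustrative only: these helpers are hypothetical and not part of the subtexty API.

// A deliberately small entity table (assumption: real input may need more).
const ENTITIES: Record<string, string> = {
  "&amp;": "&",
  "&lt;": "<",
  "&gt;": ">",
  "&quot;": '"',
  "&#39;": "'",
};

/** Remove anything that looks like an HTML/XML/styling tag, e.g. <b> or </i>. */
function stripTags(line: string): string {
  return line.replace(/<[^>]*>/g, "");
}

/** Convert the entities listed in ENTITIES to their literal characters. */
function decodeEntities(line: string): string {
  return line.replace(/&(amp|lt|gt|quot|#39);/g, (match) => ENTITIES[match] ?? match);
}

/** Collapse runs of spaces and tabs into one space and trim both ends. */
function normalizeWhitespace(line: string): string {
  return line.replace(/\s+/g, " ").trim();
}

/**
 * Drop empty lines and exact repeats of the previously kept line, and replace
 * the previously kept line when the current line extends it (prefix removal).
 */
function dedupe(lines: string[]): string[] {
  const kept: string[] = [];
  for (const line of lines) {
    if (line === "") continue;
    const previous = kept[kept.length - 1];
    if (previous !== undefined && line === previous) continue;
    if (previous !== undefined && line.startsWith(previous)) {
      kept[kept.length - 1] = line; // keep the longer line instead of its prefix
      continue;
    }
    kept.push(line);
  }
  return kept;
}

/** Run the whole pipeline over already-extracted cue text lines. */
function cleanLines(rawLines: string[]): string {
  const cleaned = rawLines.map((line) =>
    normalizeWhitespace(decodeEntities(stripTags(line)))
  );
  return dedupe(cleaned).join("\n");
}

const sample = [
  "<b>I love coding</b>",
  "I love coding with   TypeScript",
  "Tom &amp; Jerry say &quot;Hello&quot;",
  "Tom &amp; Jerry say &quot;Hello&quot;",
];

console.log(cleanLines(sample));
// I love coding with TypeScript
// Tom & Jerry say "Hello"
```

Tags are stripped before entities are decoded so that escaped markup such as `&lt;b&gt;` survives as literal text instead of being removed as a tag. Whether subtexty orders its steps the same way is not stated in this README.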