UNPKG

paragrafs

Version:

A lightweight TypeScript library designed to reconstruct paragraphs from AI transcriptions.

255 lines (183 loc) 8.41 kB
# paragrafs [![wakatime](https://wakatime.com/badge/user/a0b906ce-b8e7-4463-8bce-383238df6d4b/project/77131919-f79c-4be7-a329-d54199396eae.svg)](https://wakatime.com/badge/user/a0b906ce-b8e7-4463-8bce-383238df6d4b/project/77131919-f79c-4be7-a329-d54199396eae) ![Bun](https://img.shields.io/badge/Bun-%23000000.svg?style=for-the-badge&logo=bun&logoColor=white) [![Node.js CI](https://github.com/ragaeeb/paragrafs/actions/workflows/build.yml/badge.svg)](https://github.com/ragaeeb/paragrafs/actions/workflows/build.yml) ![GitHub License](https://img.shields.io/github/license/ragaeeb/paragrafs) ![GitHub Release](https://img.shields.io/github/v/release/ragaeeb/paragrafs) [![codecov](https://codecov.io/gh/ragaeeb/paragrafs/graph/badge.svg?token=B3IRBVOS3H)](https://codecov.io/gh/ragaeeb/paragrafs) [![Size](https://deno.bundlejs.com/badge?q=paragrafs@latest&badge=detailed)](https://bundlejs.com/?q=paragrafs%40latest) ![typescript](https://badgen.net/badge/icon/typescript?icon=typescript&label&color=blue) ![npm](https://img.shields.io/npm/dm/paragrafs) ![GitHub issues](https://img.shields.io/github/issues/ragaeeb/paragrafs) ![GitHub stars](https://img.shields.io/github/stars/ragaeeb/paragrafs?style=social) ![CodeRabbit Pull Request Reviews](https://img.shields.io/coderabbit/prs/github/ragaeeb/paragrafs?utm_source=oss&utm_medium=github&utm_campaign=ragaeeb%2Fparagrafs&labelColor=171717&color=FF570A&link=https%3A%2F%2Fcoderabbit.ai&label=CodeRabbit+Reviews) A lightweight TypeScript library designed to reconstruct paragraphs from AI transcriptions. It helps format unstructured text with appropriate paragraph breaks, handles timestamps for transcripts, and optimizes for readability. ## Features - **Segment Recognition**: Intelligently groups text into logical paragraphs - **Filler Removal**: Identifies and removes common speech fillers (uh, umm, etc.) - **Gap Detection**: Detects significant pauses to identify paragraph breaks - **Timestamp Formatting**: Converts seconds to readable timestamps (HH:MM:SS) - **Punctuation Awareness**: Uses punctuation to identify natural segment breaks - **Customizable Parameters**: Configure minimum words per segment, max segment length, etc. - **Arabic Support**: Handles Arabic question marks and other non-Latin punctuation - **Transcript Formatting**: Converts raw token streams into readable text with appropriate line breaks - **Ground-Truth Token Mapping**: Aligns AI-generated word timestamps to human-edited transcript text using an LCS-based algorithm with intelligent interpolation ## Installation ```bash npm install paragrafs ``` or ```bash pnpm install paragrafs ``` or ```bash yarn add paragrafs ``` or ```bash bun add paragrafs ``` ## Usage ### Basic Example ```typescript import { estimateSegmentFromToken, markAndCombineSegments, mapSegmentsIntoFormattedSegments } from 'paragrafs'; // Example token from transcription const token = { start: 0, end: 5, text: 'This is a sample text. It should be properly segmented.', }; // Estimate segment with word-level tokens const segment = estimateSegmentFromToken(token); // Combine and format segments const formattedSegments = mapSegmentsIntoFormattedSegments([segment]); console.log(formattedSegments[0].text); // Output: "This is a sample text. It should be properly segmented." ``` ### Working with Transcriptions ```typescript import { markAndCombineSegments, mapSegmentsIntoFormattedSegments, formatSegmentsToTimestampedTranscript, } from 'paragrafs'; // Example transcription segments const segments = [ { start: 0, end: 6.5, text: 'The quick brown fox!', tokens: [ { start: 0, end: 1, text: 'The' }, { start: 1, end: 2, text: 'quick' }, { start: 2, end: 3, text: 'brown' }, { start: 3, end: 6.5, text: 'fox!' }, ], }, { start: 8, end: 13, text: 'Jumps right over the', tokens: [ { start: 8, end: 9, text: 'Jumps' }, { start: 9, end: 10, text: 'right' }, { start: 10, end: 11, text: 'over' }, { start: 12, end: 13, text: 'the' }, ], }, ]; // Options for segment formatting const options = { fillers: ['uh', 'umm', 'hmmm'], gapThreshold: 3, maxSecondsPerSegment: 12, minWordsPerSegment: 3, }; // Process the segments const combinedSegments = markAndCombineSegments(segments, options); const formattedSegments = mapSegmentsIntoFormattedSegments(combinedSegments); // Get timestamped transcript const transcript = formatSegmentsToTimestampedTranscript(combinedSegments, 10); console.log(transcript); // Output: // 0:00: The quick brown fox! // 0:08: Jumps right over the ``` ### Aligning AI Tokens to Human-Edited Text ```typescript import { updateSegmentWithGroundTruth } from 'paragrafs'; const rawSegment = { start: 0, end: 10, text: 'The Buick crown flock jumps right over the crazy dog.', tokens: [ /* AI-generated word timestamps */ ], }; const aligned = updateSegmentWithGroundTruth(rawSegment, 'The quick brown fox jumps right over the lazy dog.'); console.log(aligned.tokens); // Each token now matches the ground-truth words exactly, // with missing words interpolated where needed. ``` ## API Reference ### Core Functions #### `estimateSegmentFromToken(token: Token): Segment` Splits a single token into word-level tokens and estimates timing for each word. #### `markTokensWithDividers(tokens: Token[], options): MarkedToken[]` Marks tokens with segment breaks based on fillers, gaps, and punctuation. #### `groupMarkedTokensIntoSegments(markedTokens: MarkedToken[], maxSecondsPerSegment: number): MarkedSegment[]` Groups marked tokens into logical segments based on maximum segment length. #### `mergeShortSegmentsWithPrevious(segments: MarkedSegment[], minWordsPerSegment: number): MarkedSegment[]` Merges segments with too few words into the previous segment. #### `mapSegmentsIntoFormattedSegments(segments: MarkedSegment[], maxSecondsPerLine?: number): Segment[]` Converts marked segments into clean, formatted segments with proper text representation. #### `formatSegmentsToTimestampedTranscript(segments: MarkedSegment[], maxSecondsPerLine: number): string` Formats segments into a human-readable transcript with timestamps. #### `markAndCombineSegments(segments: Segment[], options): MarkedSegment[]` Combined utility that processes segments through all the necessary steps. #### `mapTokensToGroundTruth(segment: Segment): Segment` Synchronizes AI-generated word timestamps with the human-edited transcript (`segment.text`): - Uses a longest-common-subsequence (LCS) to find matching words and preserve their original timing. - Evenly interpolates timestamps for runs of missing words (only when two or more are missing). - Falls back to `estimateSegmentFromToken` if no matches are found. ### Types ```typescript type Token = { start: number; // Start time in seconds end: number; // End time in seconds text: string; // The transcribed text }; type Segment = Token & { tokens: Token[]; // Word-by-word breakdown with timings }; type MarkedToken = 'SEGMENT_BREAK' | Token; type MarkedSegment = { start: number; end: number; tokens: MarkedToken[]; }; ``` ### Utility Functions #### `isEndingWithPunctuation(text: string): boolean` Checks if the text ends with punctuation (including Arabic punctuation). #### `formatSecondsToTimestamp(seconds: number): string` Formats seconds into a human-readable timestamp (H:MM:SS). ## Use Cases - **Transcript Formatting**: Convert raw transcriptions into readable text - **Subtitle Generation**: Create properly formatted subtitles from audio transcriptions - **Document Reconstruction**: Rebuild properly formatted documents from extracted text ## Contributing Contributions are welcome! Please make sure your contributions adhere to the coding standards and are accompanied by relevant tests. To get started: 1. Fork the repository 2. Install dependencies: `bun install` (requires [Bun](https://bun.sh/)) 3. Make your changes 4. Run tests: `bun test` 5. Submit a pull request ## License `paragrafs` is released under the MIT License. See the [LICENSE.MD](./LICENSE.MD) file for more details. ## Author Ragaeeb Haq --- Built with TypeScript and Bun. Uses ESM module format.