# paragrafs
A lightweight TypeScript library designed to reconstruct paragraphs from AI transcriptions. It helps format unstructured text with appropriate paragraph breaks, handles timestamps for transcripts, and optimizes for readability.
## Features
- **Segment reconstruction** – marks filler words, hints, and time gaps to create natural paragraph boundaries and merges overly short segments back into their predecessors.【F:src/transcript.ts†L40-L204】【F:src/transcript.ts†L236-L300】
- **Timestamped formatting** – produces human-friendly transcripts with optional custom formatting callbacks and automatic timestamp rendering.【F:src/transcript.ts†L212-L300】
- **Ground-truth alignment** – synchronises AI-generated tokens with human-edited text, interpolating timings for missing words and removing unknown tokens when applying the ground truth.【F:src/utils/transcriptUtils.ts†L1-L226】【F:src/transcript.ts†L328-L395】
- **Selection helpers** – exposes utilities to find tokens for string queries or cursor selections, enabling rich text editors to jump to precise timestamps.【F:src/transcript.ts†L424-L493】
- **Hint system (Arabic-first)** – robust multi-word hint matching using normalization (diacritics/punctuation tolerant), plus hard boundary insertion via `ALWAYS_BREAK`.【F:src/utils/textUtils.ts†L59-L156】【F:src/transcript.ts†L40-L121】
- **Auto-hint generation** – mines frequent repeated phrases from `Token[]` or `Segment[]` and returns sorted hint candidates for Arabic-heavy transcripts.【F:src/utils/hints.ts†L303-L379】
- **Utility toolkit** – includes helpers for timestamp formatting, punctuation detection (including Arabic punctuation), ground-truth tokenization, and normalization utilities.【F:src/utils/textUtils.ts†L4-L185】
- **Bun-native toolchain** – powered by the upstream `tsdown` CLI for bundling and Biome for linting, so the same commands run locally and in CI without any custom wrappers.【F:package.json†L7-L41】【F:tsdown.config.ts†L1-L9】【F:biome.json†L1-L16】
## Breaking changes (recent)
- **Hints are normalized by default**: `createHints(...)` now uses Arabic-first normalization for matching and mining. If you relied on exact string matching, update your expectations and/or pass explicit normalization options.【F:src/utils/textUtils.ts†L121-L156】
- **`ALWAYS_BREAK` is a true hard boundary**: segments/lines after an `ALWAYS_BREAK` must not be merged into previous segments.【F:src/transcript.ts†L95-L167】【F:src/transcript.ts†L173-L211】
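To make the first change concrete, here is a minimal sketch of what diacritics- and punctuation-tolerant matching can involve. `normalizeArabic` is a hypothetical helper written for this illustration, not the library's `normalizeTokenText`, and the exact set of characters the library strips or folds is an assumption.

```typescript
// Hypothetical helper (not part of paragrafs): folds a word into a
// comparison-friendly form so that vocalised and bare spellings match.
function normalizeArabic(text: string): string {
    return text
        .replace(/[\u064B-\u0652\u0670]/g, '') // strip tashkeel (short vowels, sukun, shadda, dagger alef)
        .replace(/[\u0622\u0623\u0625]/g, '\u0627') // fold alef with madda/hamza into bare alef
        .replace(/[\u060C\u061B\u061F.,!?;:]/g, '') // drop Arabic and Latin punctuation
        .trim();
}

// A vocalised token and its bare spelling normalise to the same string.
console.log(normalizeArabic('أَحْسَنَ') === normalizeArabic('احسن')); // true
```

If your hints previously depended on exact string equality, comparing the normalized forms of both sides in this manner is one way to see which of them will now match.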
## Installation
```bash
npm install paragrafs
```
or
```bash
pnpm install paragrafs
```
or
```bash
yarn add paragrafs
```
or
```bash
bun add paragrafs
```
## Usage
### Basic Example
```typescript
import { estimateSegmentFromToken, mapSegmentsIntoFormattedSegments } from 'paragrafs';

// Example token from transcription
const token = {
    start: 0,
    end: 5,
    text: 'This is a sample text. It should be properly segmented.',
};

// Estimate segment with word-level tokens
const segment = estimateSegmentFromToken(token);

// Format the segment into readable text
const formattedSegments = mapSegmentsIntoFormattedSegments([segment]);
console.log(formattedSegments[0].text);
// Output: "This is a sample text. It should be properly segmented."
```
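For intuition about what "word-level tokens" means here, the sketch below splits one multi-word token into per-word timings. It is not the library's implementation: `splitTokenEvenly` is a hypothetical helper, and spreading the duration evenly across words is an assumption (the library may weight words differently).

```typescript
type Token = { start: number; end: number; text: string };

// Hypothetical helper: divides the token's duration into equal slices,
// one per whitespace-separated word.
function splitTokenEvenly(token: Token): Token[] {
    const parts = token.text.split(/\s+/).filter(Boolean);
    if (parts.length === 0) {
        return [];
    }
    const step = (token.end - token.start) / parts.length;
    return parts.map((text, i) => ({
        start: token.start + i * step,
        end: token.start + (i + 1) * step,
        text,
    }));
}

const perWord = splitTokenEvenly({ start: 0, end: 5, text: 'This is a sample text.' });
console.log(perWord.length); // 5
console.log(perWord[1]); // { start: 1, end: 2, text: 'is' }
```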
### Working with Transcriptions
```typescript
import {
    markAndCombineSegments,
    mapSegmentsIntoFormattedSegments,
    formatSegmentsToTimestampedTranscript,
} from 'paragrafs';

// Example transcription segments
const segments = [
    {
        start: 0,
        end: 6.5,
        text: 'The quick brown fox!',
        tokens: [
            { start: 0, end: 1, text: 'The' },
            { start: 1, end: 2, text: 'quick' },
            { start: 2, end: 3, text: 'brown' },
            { start: 3, end: 6.5, text: 'fox!' },
        ],
    },
    {
        start: 8,
        end: 13,
        text: 'Jumps right over the',
        tokens: [
            { start: 8, end: 9, text: 'Jumps' },
            { start: 9, end: 10, text: 'right' },
            { start: 10, end: 11, text: 'over' },
            { start: 12, end: 13, text: 'the' },
        ],
    },
];

// Options for segment formatting
const options = {
    fillers: ['uh', 'umm', 'hmmm'],
    gapThreshold: 3,
    maxSecondsPerSegment: 12,
    minWordsPerSegment: 3,
};

// Process the segments
const combinedSegments = markAndCombineSegments(segments, options);
const formattedSegments = mapSegmentsIntoFormattedSegments(combinedSegments);

// Get timestamped transcript
const transcript = formatSegmentsToTimestampedTranscript(combinedSegments, 10);
console.log(transcript);
// Output:
// 0:00: The quick brown fox!
// 0:08: Jumps right over the
```
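The `gapThreshold` option is easiest to picture in isolation. The following is a simplified sketch of gap-based grouping only, assuming a break is placed wherever the silence between two consecutive tokens exceeds the threshold. `groupByGap` is a hypothetical helper; the library additionally considers fillers, hints, punctuation, and segment length, none of which are modelled here.

```typescript
type Token = { start: number; end: number; text: string };

// Hypothetical helper: starts a new group whenever the pause before a token
// is longer than gapThreshold seconds.
function groupByGap(tokens: Token[], gapThreshold: number): Token[][] {
    const groups: Token[][] = [];
    let current: Token[] = [];
    for (const token of tokens) {
        const previous = current[current.length - 1];
        if (previous && token.start - previous.end > gapThreshold) {
            groups.push(current);
            current = [];
        }
        current.push(token);
    }
    if (current.length > 0) {
        groups.push(current);
    }
    return groups;
}

const sample: Token[] = [
    { start: 0, end: 1, text: 'Hello' },
    { start: 1, end: 2, text: 'there.' },
    { start: 6, end: 7, text: 'Next' }, // 4 second pause before this token
    { start: 7, end: 8, text: 'topic.' },
];
const paragraphs = groupByGap(sample, 3).map((group) => group.map((t) => t.text).join(' '));
console.log(paragraphs); // [ 'Hello there.', 'Next topic.' ]
```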
### Aligning AI Tokens to Human-Edited Text
```typescript
import { updateSegmentWithGroundTruth } from 'paragrafs';

const rawSegment = {
    start: 0,
    end: 10,
    text: 'The Buick crown flock jumps right over the crazy dog.',
    tokens: [
        /* AI-generated word timestamps */
    ],
};

const aligned = updateSegmentWithGroundTruth(rawSegment, 'The quick brown fox jumps right over the lazy dog.');
console.log(aligned.tokens);
// Each token now matches the ground-truth words exactly,
// with missing words interpolated where needed.
```
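The API reference describes this alignment as LCS-based. As a rough illustration of that idea, the sketch below computes one longest common subsequence between the AI words and the ground-truth words and returns the matching index pairs. `lcsPairs` is a hypothetical helper using exact string equality; how the library compares words and interpolates timings for the unmatched ones is not shown and should not be inferred from this.

```typescript
// Hypothetical helper: returns [aiIndex, truthIndex] pairs for words that
// match along one longest common subsequence of the two word lists.
function lcsPairs(a: string[], b: string[]): Array<[number, number]> {
    // table[i][j] holds the LCS length of a[i..] and b[j..].
    const table: number[][] = Array.from({ length: a.length + 1 }, () =>
        new Array<number>(b.length + 1).fill(0),
    );
    for (let i = a.length - 1; i >= 0; i--) {
        for (let j = b.length - 1; j >= 0; j--) {
            table[i][j] =
                a[i] === b[j] ? table[i + 1][j + 1] + 1 : Math.max(table[i + 1][j], table[i][j + 1]);
        }
    }
    // Walk the table from the start, collecting matched positions.
    const pairs: Array<[number, number]> = [];
    let i = 0;
    let j = 0;
    while (i < a.length && j < b.length) {
        if (a[i] === b[j]) {
            pairs.push([i, j]);
            i++;
            j++;
        } else if (table[i + 1][j] >= table[i][j + 1]) {
            i++;
        } else {
            j++;
        }
    }
    return pairs;
}

const aiWords = 'The Buick crown flock jumps right over the crazy dog.'.split(' ');
const truthWords = 'The quick brown fox jumps right over the lazy dog.'.split(' ');
const matched = lcsPairs(aiWords, truthWords);
console.log(matched.length); // 6
console.log(matched.map(([aiIndex]) => aiWords[aiIndex]).join(' ')); // The jumps right over the dog.
```

The matched pairs act as timing anchors; the words between anchors are the ones that would need replacement or interpolated timings.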
### Auto-generate hint candidates (Arabic-first)
Use this when you have a corpus of tokens/segments and want to discover repeated phrases like "احسن الله اليكم".
```typescript
import { createHints, generateHintsFromTokens, markTokensWithDividers } from 'paragrafs';

const tokens = [
    { start: 0, end: 1, text: 'أَحْسَنَ' },
    { start: 1, end: 2, text: 'الله' },
    { start: 2, end: 3, text: 'إليكم،' },
    // ... repeated in the stream ...
];

const mined = generateHintsFromTokens(tokens, {
    minN: 2,
    maxN: 4,
    minCount: 2,
    dedupe: 'closed',
    normalization: { normalizeAlef: true },
});

// Turn mined phrases into matching hints
const hints = createHints({ normalizeAlef: true }, ...mined.slice(0, 25).map((h) => h.phrase));
const marked = markTokensWithDividers(tokens, { fillers: [], gapThreshold: 999, hints });
```
## Commands
- `bun run build` – compiles the library with the official tsdown pipeline configured in `tsdown.config.ts`.【F:package.json†L33-L41】【F:tsdown.config.ts†L1-L9】
- `bun run lint` – runs Biome’s formatter and linter against the repository root.【F:package.json†L33-L41】【F:biome.json†L1-L16】
- `bun test` – executes the Bun test suite.
- `bun test --coverage` – runs tests with coverage reporting (useful for refactors of segmentation/matching logic).
### Demo app (Svelte + Vite)
This repo includes a minimal static demo app in `demo/` that exercises the major exported functions with configurable JSON/text inputs. It’s intended to be deployed to **`paragrafs.surge.sh`**.
- Live demo: [paragrafs.surge.sh](https://paragrafs.surge.sh)
- **Install**: `bun run demo:install`
- **Dev**: `bun run demo:dev`
- **Build**: `bun run demo:build`
- **Deploy to Surge**: `bun run demo:deploy`
Notes:
- The demo depends on the local package via `file:..`, so `demo:build` runs `bun run build` first to ensure `dist/` exists.
- Deploy target folder is `demo/dist`.
## API Reference
### Transcript builders
- `estimateSegmentFromToken(token: Token): Segment` – splits multi-word tokens into per-word timings so they can participate in downstream processing.【F:src/transcript.ts†L15-L39】
- `markTokensWithDividers(tokens: Token[], options: MarkTokensWithDividersOptions): MarkedToken[]` – inserts divider markers based on fillers, hints, punctuation, and timing gaps.【F:src/transcript.ts†L44-L121】
- `groupMarkedTokensIntoSegments(markedTokens: MarkedToken[], maxSecondsPerSegment: number): MarkedSegment[]` – chunks marked tokens into bounded-length segments.【F:src/transcript.ts†L123-L171】
- `mergeShortSegmentsWithPrevious(segments: MarkedSegment[], minWordsPerSegment: number): MarkedSegment[]` – merges segments that contain fewer than the required word count into their predecessors.【F:src/transcript.ts†L173-L211】
- `cleanupIsolatedTokens(markedTokens: MarkedToken[]): MarkedToken[]` – removes redundant divider markers that would isolate a single token on a line.【F:src/transcript.ts†L314-L326】
- `markAndCombineSegments(segments: Segment[], options): MarkedSegment[]` – convenience pipeline that flattens tokens, marks dividers, groups, and merges short runs in one call.【F:src/transcript.ts†L302-L326】
- `mapSegmentsIntoFormattedSegments(segments: MarkedSegment[], maxSecondsPerLine?: number): Segment[]` – flattens marked segments into readable text while respecting optional line duration caps.【F:src/transcript.ts†L236-L300】
- `formatSegmentsToTimestampedTranscript(segments: MarkedSegment[], maxSecondsPerLine: number, formatTokens?: (buffer: Token) => string): string` – emits newline separated transcript lines with timestamps or a custom formatter.【F:src/transcript.ts†L204-L234】
### Ground-truth alignment
- `updateSegmentWithGroundTruth(segment: Segment, groundTruth: string): GroundedSegment` – applies LCS-based alignment to replace tokens with the ground-truth words while flagging unmatched entries.【F:src/transcript.ts†L328-L359】
- `applyGroundTruthToSegment(segment: Segment, groundTruth: string): Segment` – wraps `updateSegmentWithGroundTruth` and filters unknown tokens for production-ready output.【F:src/transcript.ts†L361-L395】
- `mergeSegments(segments: Segment[], delimiter?: string): Segment` – concatenates sequential segments into one continuous block, preserving timing.【F:src/transcript.ts†L397-L411】
- `splitSegment(segment: Segment, splitTime: number): Segment[]` – divides a segment into two at a specific timestamp.【F:src/transcript.ts†L413-L448】
### Editor helpers
- `getFirstMatchingToken(tokens: Token[], query: string): Token | null` – scans for the first occurrence of a hint sequence produced by `createHints`.【F:src/transcript.ts†L450-L493】
- `getFirstTokenForSelection(segment: Segment, selectionStart: number, selectionEnd: number): Token | null` – maps character selections within `segment.text` back to the corresponding timed token.【F:src/transcript.ts†L495-L546】
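To show how a character selection can be traced back to a timed token, here is a minimal sketch. It assumes the segment text is exactly its token texts joined by single spaces, which may not hold for real segments, and `tokenAtOffset` is a hypothetical helper rather than the library's `getFirstTokenForSelection`.

```typescript
type Token = { start: number; end: number; text: string };

// Hypothetical helper: walks the tokens while tracking the character offset
// each one occupies in the joined text, and returns the token under `offset`.
// An offset that lands on a separating space resolves to the following token.
function tokenAtOffset(tokens: Token[], offset: number): Token | null {
    if (offset < 0) {
        return null;
    }
    let cursor = 0;
    for (const token of tokens) {
        const tokenEnd = cursor + token.text.length;
        if (offset < tokenEnd) {
            return token;
        }
        cursor = tokenEnd + 1; // skip the single separating space
    }
    return null;
}

const timedWords: Token[] = [
    { start: 0, end: 1, text: 'The' },
    { start: 1, end: 2, text: 'quick' },
    { start: 2, end: 3, text: 'brown' },
    { start: 3, end: 6.5, text: 'fox!' },
];
// "The quick brown fox!" has "brown" at character offsets 10..14.
console.log(tokenAtOffset(timedWords, 10)?.start); // 2
```

An editor could pass the selection start as `offset` and seek playback to the returned token's `start`.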
### Utility functions
- `createHints(first: ArabicNormalizationOptions | string, ...rest: string[]): Hints` – creates **normalized** hints for robust Arabic matching (diacritics/punctuation tolerant).【F:src/utils/textUtils.ts†L121-L156】
- `formatSecondsToTimestamp(seconds: number): string` – renders numeric durations into `m:ss` or `h:mm:ss` strings.【F:src/utils/textUtils.ts†L14-L33】
- `isEndingWithPunctuation(text: string): boolean` – checks for trailing punctuation, including Arabic variants.【F:src/utils/textUtils.ts†L4-L12】
- `tokenizeGroundTruth(groundTruth: string): string[]` – tokenises human transcripts while attaching punctuation to the preceding word.【F:src/utils/textUtils.ts†L75-L112】
- `normalizeTokenText(text: string, options?: ArabicNormalizationOptions): string` – Arabic-first normalization used by hint matching and hint mining.【F:src/utils/textUtils.ts†L59-L103】
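The `m:ss` / `h:mm:ss` format used in transcripts can be sketched as follows. `toTimestamp` is a hypothetical stand-in for `formatSecondsToTimestamp`; flooring fractional seconds (rather than rounding) is an assumption.

```typescript
// Hypothetical helper: renders seconds as m:ss, or h:mm:ss once the
// duration reaches one hour. Fractional seconds are floored.
function toTimestamp(seconds: number): string {
    const total = Math.floor(seconds);
    const hours = Math.floor(total / 3600);
    const minutes = Math.floor((total % 3600) / 60);
    const ss = String(total % 60).padStart(2, '0');
    if (hours > 0) {
        return `${hours}:${String(minutes).padStart(2, '0')}:${ss}`;
    }
    return `${minutes}:${ss}`;
}

console.log(toTimestamp(8)); // 0:08
console.log(toTimestamp(75)); // 1:15
console.log(toTimestamp(3725)); // 1:02:05
```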
### Auto-hint generation
- `generateHintsFromTokens(tokens: Token[], options?: GenerateHintsOptions): GeneratedHint[]` – mines frequent n-grams from a token stream and returns candidates sorted by count/length.【F:src/utils/hints.ts†L303-L331】
- `generateHintsFromSegments(segments: Segment[], options?: GenerateHintsOptions): GeneratedHint[]` – mines frequent n-grams from segments; by default phrases do not cross segment boundaries.【F:src/utils/hints.ts†L333-L379】
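The mining step can be pictured as plain n-gram counting. The sketch below counts every run of `minN` to `maxN` consecutive words and keeps those seen at least `minCount` times, most frequent first. `mineNgrams` and `MinedPhrase` are hypothetical names; normalization, the `dedupe` option, and the library's exact ordering rules are deliberately left out.

```typescript
type MinedPhrase = { phrase: string; count: number };

// Hypothetical helper: counts all n-grams with minN <= n <= maxN and returns
// those occurring at least minCount times, sorted by count, then by length.
function mineNgrams(wordList: string[], minN: number, maxN: number, minCount: number): MinedPhrase[] {
    const counts = new Map<string, number>();
    for (let n = minN; n <= maxN; n++) {
        for (let i = 0; i + n <= wordList.length; i++) {
            const phrase = wordList.slice(i, i + n).join(' ');
            counts.set(phrase, (counts.get(phrase) ?? 0) + 1);
        }
    }
    return Array.from(counts.entries())
        .filter(([, count]) => count >= minCount)
        .map(([phrase, count]) => ({ phrase, count }))
        .sort((a, b) => b.count - a.count || b.phrase.length - a.phrase.length);
}

const stream = 'may God reward you now may God reward you'.split(' ');
const candidates = mineNgrams(stream, 2, 3, 2);
console.log(candidates.length); // 5
console.log(candidates.map((c) => c.phrase));
```

Note that shorter phrases contained in a longer repeated phrase are all reported here; filtering those out is presumably what a deduplication option such as `dedupe: 'closed'` addresses, though that reading is an assumption.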
### Types
```typescript
type Token = {
    start: number;
    end: number;
    text: string;
};

type Segment = Token & {
    tokens: Token[];
};

type MarkedToken = Token | typeof SEGMENT_BREAK | typeof ALWAYS_BREAK;

type MarkedSegment = {
    start: number;
    end: number;
    tokens: MarkedToken[];
};

type GroundedToken = Token & { isUnknown?: boolean };

type GroundedSegment = Omit<Segment, 'tokens'> & { tokens: GroundedToken[] };
```
## Use Cases
- **Transcript Formatting**: Convert raw transcriptions into readable text
- **Subtitle Generation**: Create properly formatted subtitles from audio transcriptions
- **Document Reconstruction**: Rebuild properly formatted documents from extracted text
## Contributing
Contributions are welcome! Please make sure your contributions adhere to the coding standards and are accompanied by relevant tests.
To get started:
1. Fork the repository
2. Install dependencies: `bun install` (requires [Bun](https://bun.sh/))
3. Make your changes
4. Run linting: `bun run lint`
5. Build the package: `bun run build`
6. Run tests: `bun test --coverage`
7. Submit a pull request
## License
`paragrafs` is released under the MIT License. See the [LICENSE.MD](./LICENSE.MD) file for more details.
## Author
Ragaeeb Haq
---
Built with TypeScript and Bun. Uses ESM module format.