# flappa-doormal
**Arabic text marker pattern library** - Generate regex patterns from declarative marker configurations.
🎯 **Purpose:** Simplify Arabic text segmentation by replacing complex regex patterns with readable, composable templates.
## Installation
```bash
bun add flappa-doormal
# Peer dependencies
bun add bitaboom baburchi shamela
```
## Quick Start
```typescript
import { generateRegexFromMarker } from 'flappa-doormal';
// Simple numbered marker
const regex = generateRegexFromMarker({
    type: 'numbered' // Defaults: Arabic-Indic numerals, dash separator
});
regex.exec('٥ - نص الحديث');
// Returns: ['٥ - نص الحديث', 'نص الحديث']
```
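To make the result above concrete, here is a hand-written regular expression that gives the same capture for the same input. It is only an illustrative assumption: the pattern the library actually generates may differ, and the whitespace handling around the dash is a guess. The names `numberedSketch`, `quickStartMatch`, and `quickStartBody` are hypothetical.

```typescript
// Hand-written stand-in for the default 'numbered' marker (an assumption,
// not the library's generated pattern): Arabic-Indic digits, an optional
// space, one dash variant, optional whitespace, then the captured text.
const numberedSketch = /^[\u0660-\u0669]+\s?[-–—ـ]\s*(.*)/;

const quickStartMatch = numberedSketch.exec('٥ - نص الحديث');

// Group 1 holds the text after the marker, or null when nothing matched.
const quickStartBody = quickStartMatch ? quickStartMatch[1] : null;
console.log(quickStartBody); // نص الحديث
```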
## Features
- ✅ **13 Preset Types** - Common patterns like `bab`, `hadith-chain`, `basmala`
- ✅ **Template System** - Use `{num}`, `{dash}`, `{bullet}` instead of regex
- ✅ **Type-Safe** - Full TypeScript support
- ✅ **Composable** - Mix and match tokens with quantifiers
- ✅ **Diacritic-Insensitive** - Handles Arabic text variations
## Marker Types
### Basic Types
```typescript
{ type: 'numbered' } // ٥ - text
{ type: 'bullet' } // • text
{ type: 'bab' } // باب chapter
{ type: 'hadith-chain' } // حَدَّثَنَا narrator
{ type: 'basmala' } // بسم الله
{ type: 'square-bracket' } // [٦٥] reference
```
### Numbered Variants
```typescript
{ type: 'num-letter' } // ٥ أ - (number + letter)
{ type: 'num-paren' } // ٥ (أ) - (number + paren)
{ type: 'num-slash' } // ٥/٦ - (number/number)
```
### Custom Patterns
**Using templates (recommended):**
```typescript
{
    type: 'pattern',
    template: '{bullet}? {num}+ {dash}'
}
```
**Using raw regex (for complex patterns):**
```typescript
{
    type: 'pattern',
    pattern: '^CUSTOM: (.*)' // When templates aren't sufficient
}
```
**Using format for numbered:**
```typescript
{
    type: 'numbered',
    format: '{bullet}+ {num} {letter} {dash}'
}
```
## Complex Pattern Examples
### Comma-Separated Numerals
Match patterns like: `٩٩٣٦، ٩٩٣٧ - حَدَّثَنَا`
```typescript
{
    type: 'pattern',
    template: '{num}(?:،{s}{num})*{s}{dash}'
}
```
### Number / Letter
Match patterns like: `١١٠٧٣/ أ - حَدَّثَنَا`
```typescript
{
    type: 'pattern',
    template: '{num}{s}/{s}{letter}{s}{dash}'
}
```
### Number / Number (Built-in)
Match patterns like: `١٠٢٦٦ / ١ - "وَإِذَا`
```typescript
{
    type: 'num-slash' // Built-in preset
}
```
### Repeating Dots
Match patterns like: `. . . . . . . . . .`
```typescript
{
    type: 'pattern',
    template: '\\.(?:{s}\\.)+'
}
```
### Asterisk + Dots + Number
Match patterns like: `*. . . / ٨٦ - حَدَّثَنَا`
**Option 1: Capture from asterisk**
```typescript
{
    type: 'pattern',
    template: '\\*\\.(?:{s}\\.)*{s}/{s}{num}{s}{dash}',
    removeMarker: false // Keep everything
}
```
**Option 2: Detect from asterisk, capture from number**
```typescript
{
    type: 'pattern',
    pattern: '^\\*\\.(?:\\s?\\.)*\\s?/\\s?([\\u0660-\\u0669]+\\s?[-–—ـ].*)'
}
```
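The raw pattern in Option 2 can be checked without the library by compiling it with the built-in `RegExp` constructor. The sketch below uses the pattern string and sample line exactly as given above; the variable names are hypothetical.

```typescript
// Sample line and raw pattern copied from the section above.
const asteriskInput = '*. . . / ٨٦ - حَدَّثَنَا';
const asteriskPattern = new RegExp(
    '^\\*\\.(?:\\s?\\.)*\\s?/\\s?([\\u0660-\\u0669]+\\s?[-–—ـ].*)'
);

const asteriskMatch = asteriskPattern.exec(asteriskInput);

// The asterisk and dots are matched but sit outside the capture group,
// so group 1 begins at the first Arabic-Indic digit.
const fromNumber = asteriskMatch ? asteriskMatch[1] : null;
console.log(fromNumber);
```

The prefix is still required for a match, so a line that starts directly with the number is not detected by this pattern.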
## Template Tokens
| Token | Matches | Expands to |
|-------|---------|---------|
| `{num}` | Arabic-Indic numerals | `[\\u0660-\\u0669]+` |
| `{latin}` | Latin numerals | `\\d+` |
| `{roman}` | Roman numerals | `[IVXLCDM]+` |
| `{dash}` | Various dashes | `[-–—ـ]` |
| `{dot}` | Period | `\\.` |
| `{bullet}` | Bullet variants | `[•*°]` |
| `{letter}` | Arabic letters | `[أ-ي]` |
| `{s}` | Optional space | `\\s?` |
| `{space}` | Required space | `\\s+` |
**Quantifiers:** Add `+`, `*`, `?` after tokens: `{num}+`, `{bullet}?`
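
As a rough mental model of the table, token expansion can be pictured as plain string substitution. The sketch below is an assumption for illustration, not the library's `expandTemplate()`: `SKETCH_TOKENS` and `expandTokensSketch` are hypothetical names, the sketch adds no anchors or capture groups beyond an explicit `^`, and it does not handle a quantifier placed after a token (such as `{num}+`).

```typescript
// Hypothetical token table, copied from the "Template Tokens" table.
const SKETCH_TOKENS: Record<string, string> = {
    num: '[\\u0660-\\u0669]+',
    latin: '\\d+',
    roman: '[IVXLCDM]+',
    dash: '[-–—ـ]',
    dot: '\\.',
    bullet: '[•*°]',
    letter: '[أ-ي]',
    s: '\\s?',
    space: '\\s+',
};

// Replace each {name} with its regex source; unknown names stay untouched.
function expandTokensSketch(
    template: string,
    tokens: Record<string, string> = SKETCH_TOKENS
): string {
    return template.replace(/\{(\w+)\}/g, (whole: string, name: string) => {
        const source = tokens[name];
        return source === undefined ? whole : source;
    });
}

// Templates from "Complex Pattern Examples", anchored at the line start.
const commaList = new RegExp('^' + expandTokensSketch('{num}(?:،{s}{num})*{s}{dash}'));
const numSlashLetter = new RegExp('^' + expandTokensSketch('{num}{s}/{s}{letter}{s}{dash}'));
const dots = new RegExp('^' + expandTokensSketch('\\.(?:{s}\\.)+'));

console.log(commaList.test('٩٩٣٦، ٩٩٣٧ - حَدَّثَنَا')); // true
console.log(numSlashLetter.test('١١٠٧٣/ أ - حَدَّثَنَا')); // true
console.log(dots.test('. . . .')); // true
```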
## Examples
### Before (Regex)
```typescript
const pattern = '^[•*°]+ ([\\u0660-\\u0669]+\\s?[-–—ـ].*)';
```
### After (Template)
```typescript
{
    type: 'numbered',
    format: '{bullet}+ {num} {dash}'
}
```
**The template names each part of the marker, so there are no character classes or escapes to read.**
## API
### `generateRegexFromMarker(config)`
```typescript
import { generateRegexFromMarker, type MarkerConfig } from 'flappa-doormal';
const config: MarkerConfig = {
    type: 'numbered',
    numbering: 'arabic-indic', // or 'latin', 'roman'
    separator: 'dash',         // or 'dot', 'colon', 'paren'
    removeMarker: true,        // Remove marker from capture (default: true)
};
const regex = generateRegexFromMarker(config);
```
### `expandTemplate(template, options)`
```typescript
import { expandTemplate } from 'flappa-doormal';
const pattern = expandTemplate('{num} {dash}');
// Returns: '^[\\u0660-\\u0669]+ [-–—ـ](.*)'
const pattern2 = expandTemplate('{num} {dash}', { removeMarker: false });
// Returns: '^([\\u0660-\\u0669]+ [-–—ـ].*)'
```
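The effect of `removeMarker` is easiest to see by compiling the two pattern strings from the comments above and comparing their capture groups. This sketch uses only the built-in `RegExp` and those two strings; the variable names are hypothetical.

```typescript
// Pattern strings copied from the expandTemplate() comments above.
const withoutMarker = new RegExp('^[\\u0660-\\u0669]+ [-–—ـ](.*)');
const withMarker = new RegExp('^([\\u0660-\\u0669]+ [-–—ـ].*)');

const sampleLine = '٥ - نص الحديث';
const bodyOnlyMatch = withoutMarker.exec(sampleLine);
const wholeLineMatch = withMarker.exec(sampleLine);

// Default (removeMarker: true): group 1 starts right after the dash.
// removeMarker: false: group 1 also contains the number and the dash.
const captured = {
    bodyOnly: bodyOnlyMatch ? bodyOnlyMatch[1] : null,
    wholeLine: wholeLineMatch ? wholeLineMatch[1] : null,
};
console.log(captured);
```

With these exact strings the first capture keeps the space that follows the dash, so callers may want to trim it.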
### `validateTemplate(template)`
```typescript
import { validateTemplate } from 'flappa-doormal';
const result = validateTemplate('{num} {invalid}');
// Returns: { valid: false, errors: ['Unknown tokens: {invalid}'] }
```
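One way such a check could work is to collect every `{name}` in the template and compare it against the known token names. The sketch below is an assumption, not the library's implementation: `validateTemplateSketch` and `KNOWN_TOKEN_NAMES` are hypothetical, and the shape returned for a valid template is a guess, since only the failing case is documented above.

```typescript
// Hypothetical list of known names, taken from the "Template Tokens" table.
const KNOWN_TOKEN_NAMES = ['num', 'latin', 'roman', 'dash', 'dot', 'bullet', 'letter', 's', 'space'];

type ValidationSketch = { valid: boolean; errors?: string[] };

function validateTemplateSketch(template: string): ValidationSketch {
    const unknown: string[] = [];
    const tokenPattern = /\{(\w+)\}/g;
    let match: RegExpExecArray | null;

    // Walk every {name} occurrence and record each unknown name once.
    while ((match = tokenPattern.exec(template)) !== null) {
        const label = '{' + String(match[1]) + '}';
        const isKnown = KNOWN_TOKEN_NAMES.indexOf(String(match[1])) !== -1;
        if (!isKnown && unknown.indexOf(label) === -1) {
            unknown.push(label);
        }
    }

    return unknown.length === 0
        ? { valid: true }
        : { valid: false, errors: ['Unknown tokens: ' + unknown.join(', ')] };
}

console.log(validateTemplateSketch('{num} {invalid}'));
```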
## Configuration Options
```typescript
type MarkerConfig = {
    type: MarkerType;
    numbering?: 'arabic-indic' | 'latin' | 'roman';
    separator?: 'dash' | 'dot' | 'paren' | 'colon' | 'none' | string;
    format?: string;                 // Template for numbered markers
    template?: string;               // Template for pattern markers
    pattern?: string;                // Raw regex (when templates aren't enough)
    tokens?: Record<string, string>; // Custom token definitions
    phrases?: string[];              // For 'phrase' and 'hadith-chain' types
    removeMarker?: boolean;          // Default: true for numbered/bullet
};
```
## Extensibility
### Extending Default Phrase Lists
```typescript
import { DEFAULT_HADITH_PHRASES, generateRegexFromMarker } from 'flappa-doormal';
// Add to existing hadith phrases
const myPhrases = [...DEFAULT_HADITH_PHRASES, 'أَخْبَرَنِي', 'سَمِعْتُ'];
const regex = generateRegexFromMarker({
    type: 'hadith-chain',
    phrases: myPhrases,
});
```
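How a phrase list becomes a diacritic-insensitive pattern is not described here, and the library may delegate this to its peer dependencies. One plausible approach, shown purely as an assumption, is to strip the vowel marks from each phrase and then allow any run of marks after every remaining character. All names in this sketch are hypothetical.

```typescript
// Arabic vowel and related marks (tashkil), U+064B through U+0652.
const TASHKIL = /[\u064B-\u0652]/g;

// Escape regex metacharacters so phrase characters match literally.
function escapeForRegex(text: string): string {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Strip marks from the phrase, then permit optional marks after each character.
function loosePhrasePattern(phrase: string): string {
    return phrase
        .replace(TASHKIL, '')
        .split('')
        .map((ch) => escapeForRegex(ch) + '[\\u064B-\\u0652]*')
        .join('');
}

// Alternation over all phrases, followed by the captured remainder of the line.
function phrasesRegexSketch(phrases: string[]): RegExp {
    const alternatives = phrases.map(loosePhrasePattern).join('|');
    return new RegExp('^(?:' + alternatives + ')\\s*(.*)');
}

const chainSketch = phrasesRegexSketch(['حَدَّثَنَا', 'أَخْبَرَنِي']);
console.log(chainSketch.test('حَدَّثَنَا فلان')); // true
```

Because the marks are optional in the pattern, the same regex accepts both the fully vowelled spelling and the bare spelling of each phrase.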
### Using Type-Specific Generators
```typescript
import { generateHadithChainRegex, DEFAULT_HADITH_PHRASES } from 'flappa-doormal';
// Direct access to type-specific generator
const regex = generateHadithChainRegex(
    { type: 'hadith-chain', phrases: [...DEFAULT_HADITH_PHRASES, 'extra'] },
    true // removeMarker
);
```
### Custom Tokens
```typescript
import { createTokenMap, expandTemplate } from 'flappa-doormal';
const customTokens = createTokenMap({
    verse: '\\[[\\u0660-\\u0669]+\\]',
    tafsir: 'تفسير',
});
const pattern = expandTemplate('{verse} {tafsir}', {
    tokens: customTokens,
    removeMarker: true
});
```
## Available Exports
**Constants:**
- `DEFAULT_HADITH_PHRASES` - Default narrator phrases
- `DEFAULT_BASMALA_PATTERNS` - Default basmala patterns
- `TOKENS` - Token definitions
**Functions:**
- `generateRegexFromMarker()` - Main function
- `generate{Type}Regex()` - 12 type-specific generators
- `expandTemplate()` - Template expansion
- `validateTemplate()` - Template validation
- `createTokenMap()` - Custom token maps
## Testing
This project has comprehensive unit test coverage for all marker type generators.
```bash
# Run all tests
bun test
# Run specific test file
bun test src/markers/type-generators.test.ts
# Run tests with coverage
bun test --coverage
```
**Test Coverage**: 100% coverage for `type-generators.ts` with 54+ test cases covering:
- All 12 marker type generators
- Edge cases (empty phrases, diacritic variations, custom separators)
- Error handling (missing required fields)
- Various numbering styles and separators
## Development
```bash
# Install dependencies
bun install
# Run tests
bun test
# Build (if needed)
bun run build
# Format code
bunx biome format --write .
# Lint code
bunx biome lint .
```
## For AI Agents
See [AGENTS.md](./AGENTS.md) for comprehensive guidance on:
- Project architecture and design patterns
- Adding new marker types
- Testing strategies
- Code quality standards
- Extension points
## License
MIT
## Related
- [bitaboom](https://github.com/ragaeeb/bitaboom) - Arabic text utilities
- [baburchi](https://github.com/ragaeeb/baburchi) - Text sanitization
- [shamela](https://github.com/ragaeeb/shamela) - Shamela library utilities