# TWL Linker - Biblical Semantic Linker

A tool that automatically creates semantic links between USFM Bible text and biblical articles using a context database. This tool generates Translation Word List (TWL) files in TSV format with confidence scoring and disambiguation.

**Available as:**

- **Global CLI tool**: Install with `npm install -g twl-linker`, use with `twl-linker <file>`
- **NPM Package**: Install with `npm install twl-linker` for React.js/Node.js projects
- **Local development**: Clone and run with `node twl-linker.js <file>`

## Features

- **Semantic Matching**: Intelligent text analysis to find biblical terms in USFM files
- **Confidence Scoring**: Each match includes a confidence score based on context analysis
- **Disambiguation**: Automatic disambiguation of ambiguous terms with fallback to manual review
- **Batch Processing**: Process multiple books at once
- **Flexible Output**: Customizable output file naming and locations
- **Alignment Data Handling**: Automatically removes USFM alignment markers to process clean text
- **Built-in Database**: No separate database setup required for CLI or package usage

## Prerequisites

- Python 3.7+ (for building the context database)
- Node.js 12+ (for running the linker)
- Access to the `en_tw` repository with biblical term definitions (should be in `../en_tw/` relative to this project)
- Access to the `en_ult` repository with USFM Bible files (should be in `../en_ult/` relative to this project)

## Setup

### 1. Install Python Dependencies

```bash
pip install -r requirements.txt
```

### 2. Build the Biblical Context Database

The context database needs to be built before running the linker. This script reads biblical term definitions from the `en_tw` repository:

```bash
python build_biblical_context_database.py
```

**Note:** This script expects the `en_tw` repository to be located at `../en_tw/` relative to this project directory.
The script will process definitions from `../en_tw/bible/` which should contain the `kt/`, `names/`, and `other/` subdirectories with biblical term definitions.

This will create `biblical_context_database.json` which contains processed biblical term definitions, variants, and disambiguation rules.

## Usage

### Installation Options

| Usage Type            | Installation                | Command                                    | Use Case                  |
| --------------------- | --------------------------- | ------------------------------------------ | ------------------------- |
| **Global CLI**        | `npm install -g twl-linker` | `twl-linker input.usfm`                    | Command-line processing   |
| **NPM Package**       | `npm install twl-linker`    | `import { generateTWL } from 'twl-linker'` | React.js/Node.js apps     |
| **Local Development** | `git clone <repo>`          | `node twl-linker.js input.usfm`            | Development/customization |

**Option 1: Global Installation (Recommended for CLI usage)**

```bash
npm install -g twl-linker
```

**Option 2: Local Installation (For development or package usage)**

```bash
git clone <this-repository>
cd twl-linker
npm install  # if you add dependencies later
```

**Option 3: NPM Package (For React.js/Node.js projects)**

```bash
npm install twl-linker
```

### Command Line Interface (CLI)

#### Global CLI Usage (after `npm install -g twl-linker`)

Process a single USFM file using the global command:

```bash
twl-linker <input_file> [output_file]
```

**Examples:**

```bash
# Input: 01-GEN.usfm → Output: twl_GEN.tsv (auto-generated)
twl-linker ../en_ult/01-GEN.usfm

# Input: test.usfm → Output: test.tsv (auto-generated)
twl-linker test.usfm

# Custom output file
twl-linker ../en_ult/46-ROM.usfm my_output.tsv
```

#### Local CLI Usage (for development)

If you're working with the source code locally:

```bash
node cli.js <input_file> [output_file]
```

**Examples:**

```bash
# Input: 01-GEN.usfm → Output: twl_GEN.tsv (auto-generated)
node cli.js ../en_ult/01-GEN.usfm

# Input: test.usfm → Output: test.tsv (auto-generated)
node cli.js test.usfm

# Custom output file
node cli.js ../en_ult/46-ROM.usfm my_output.tsv
```

**Output File Naming Rules (applies to both CLI methods):**

- Files starting with number and dash (e.g., `01-GEN.usfm`) → `twl_GEN.tsv`
- Other files (e.g., `test.usfm`) → `test.tsv`
- If you specify an output file, it's used exactly as given

**Note:** The global CLI command (`twl-linker`) includes the built-in biblical context database, so no separate database setup is required after installation.

### Batch Processing

Process all USFM files in a directory (currently requires local installation):

```bash
node process_all_books.js <input_directory> [output_directory]
```

**Examples:**

```bash
# Process files from ../en_ult and output to current directory
node process_all_books.js ../en_ult .

# Process files and output to same directory as input
node process_all_books.js ../en_ult

# Process files and output to a different directory
node process_all_books.js ../en_ult ./output_folder
```

## Output Format

The generated TSV files contain the following columns:

| Column         | Description                                             |
| -------------- | ------------------------------------------------------- |
| Reference      | Chapter:verse reference (e.g., "1:1")                   |
| ID             | Unique 4-character hexadecimal ID                       |
| Tags           | Category of the biblical term (kt, names, other)        |
| OrigWords      | The original word(s) found in the text                  |
| Occurrence     | Occurrence number of this term in the verse             |
| TWLink         | Resource link to the translation word article           |
| Confidence     | Confidence score (0.1-1.0)                              |
| Match_Type     | Type of match (exact, morphological, theological, etc.) |
| Context        | Surrounding text context                                |
| Disambiguation | Disambiguation method used                              |

## Understanding the Output

### Confidence Scores

- **0.8-1.0**: High confidence matches
- **0.6-0.79**: Medium confidence matches
- **0.5-0.59**: Lower confidence matches (review recommended)

### Match Types

- **exact**: Direct term match from cleaned terms
- **morphological**: Match using word variants (plurals, etc.)
- **theological**: Match using theological variants
- **disambiguated**: Automatically disambiguated ambiguous term
- **ambiguous**: Ambiguous term requiring manual review

### Disambiguation

- **single**: Unambiguous term with single meaning
- **auto:X.XX**: Automatically disambiguated (score shown)
- **manual:option1 (alternatives)**: Manual review needed with options listed

## File Structure

```
twl-linker/
├── build_biblical_context_database.py  # Context database builder
├── twl-linker.js                       # Main semantic linker
├── process_all_books.js                # Batch processor
├── usfm-alignment-remover.js           # USFM alignment data remover
├── biblical_context_database.json      # Generated context database
├── requirements.txt                    # Python dependencies
└── README.md                           # This file

../en_tw/                               # Biblical term definitions (separate repo)
└── bible/                              # Biblical term definitions
    ├── kt/                             # Key terms
    ├── names/                          # Biblical names
    └── other/                          # Other terms

../en_ult/                              # Aligned USFM files (separate repo)
├── 01-GEN.usfm
├── 02-EXO.usfm
└── ...
```

## Examples

### Processing a Single Book

```bash
# Process Genesis
node twl-linker.js ../en_ult/01-GEN.usfm

# Output: twl_GEN.tsv with semantic links
```

### Processing All Books

```bash
# Process all books in ../en_ult directory, output to ./output
node process_all_books.js ../en_ult ./output

# This will create:
# ./output/twl_GEN.tsv, ./output/twl_EXO.tsv, ./output/twl_LEV.tsv, etc.
```

## Troubleshooting

### Common Issues

1. **"Context database not found"**
   - Run `python build_biblical_context_database.py` first

2. **"No USFM files found"**
   - Check that the input directory contains `.usfm` files
   - Verify the directory path is correct

3. **"Output directory does not exist"**
   - Create the output directory before running batch processing
   - Or use `.` to output to current directory

4. **"No such file or directory: '../en_tw/bible'"**
   - Ensure the `en_tw` repository is cloned at `../en_tw/` relative to this project
   - The directory structure should be: `../en_tw/bible/kt/`, `../en_tw/bible/names/`, `../en_tw/bible/other/`
   - If the `en_tw` repository is in a different location, you can modify the path in `build_biblical_context_database.py`

### Performance Tips

- The context database is loaded once per batch operation for efficiency
- Large books (like Psalms) may take longer to process
- High numbers of ambiguous terms may require manual review

## Development

### Adding New Terms

1. Add term definitions to the appropriate directory in `../en_tw/bible/`
2. Rebuild the context database: `python build_biblical_context_database.py`
3. Test with sample texts

### Customizing Disambiguation

Edit the disambiguation rules in `twl-linker.js` in the `disambiguateAmbiguousTerm` function.

## License

This project processes biblical text and translation resources. Please ensure compliance with the licenses of the source materials.

## NPM Package Usage

This package can also be used as an npm module in React.js applications or other Node.js projects. **Version 1.0.1+ uses ES6 modules** for better compatibility with modern bundlers.
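Package users who write the TSV to disk themselves may want the same file names the CLI produces. The following is a minimal sketch of the output naming rules listed in the CLI section above (numbered files such as `01-GEN.usfm` become `twl_GEN.tsv`, anything else keeps its base name). The helper name `defaultOutputName` is hypothetical and not part of the package's API, and the CLI's actual implementation may differ.

```javascript
// Sketch only: mirrors the documented naming rules, not the CLI's source.
// `defaultOutputName` is a hypothetical helper name.
function defaultOutputName(inputPath) {
  // Take the last path segment, accepting both "/" and "\" separators.
  const fileName = inputPath.split(/[\\/]/).pop();
  // Drop the final extension (e.g. ".usfm"), if there is one.
  const base = fileName.replace(/\.[^.]+$/, '');
  // Names like "01-GEN" (number, dash, rest) get the "twl_" prefix.
  const numbered = base.match(/^\d+-(.+)$/);
  return numbered ? `twl_${numbered[1]}.tsv` : `${base}.tsv`;
}

console.log(defaultOutputName('../en_ult/01-GEN.usfm')); // twl_GEN.tsv
console.log(defaultOutputName('test.usfm'));             // test.tsv
```

The helper returns only a file name; choosing the output directory is left to the caller.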
### Installation as Package

```bash
npm install twl-linker
```

**Note:** If you need CommonJS support, use version 1.0.0: `npm install twl-linker@1.0.0`

### Usage in React.js/Node.js

```javascript
import { generateTWL, contextDatabase } from 'twl-linker';

// Generate TWL from USFM content
const usfmContent = `\\c 1
\\v 1 In the beginning God created the heaven and the earth.`;

const tsvOutput = generateTWL(usfmContent);
console.log(tsvOutput);

// Access the biblical context database if needed
console.log('Database metadata:', contextDatabase.metadata);
console.log('Total articles:', contextDatabase.metadata.total_articles);

// Access specific articles
const godArticle = contextDatabase.articles['god.md'];
console.log('God article:', godArticle);
```

### ES6 Import (React.js)

```javascript
import { generateTWL, contextDatabase } from 'twl-linker';

function MyComponent() {
  const handleGenerateTWL = (usfmText) => {
    const result = generateTWL(usfmText);
    // Process the TSV result
    return result;
  };

  const getArticleInfo = (articleName) => {
    const article = contextDatabase.articles[`${articleName}.md`];
    return article;
  };

  // Replace with your component JSX
  return null;
}
```

### API Reference for Package Usage

#### `generateTWL(usfmContent)`

Generates Translation Words Links (TWL) from USFM content using the built-in biblical context database.
- **Parameters:**
  - `usfmContent` (string): The USFM text content to process
- **Returns:** String containing TSV format with columns: Reference, ID, Tags, OrigWords, Occurrence, TWLink, Confidence, Match_Type, Context, Disambiguation

#### `contextDatabase`

The complete biblical context database object containing:

- `metadata`: Statistics about the database (total articles, ambiguous terms, categories)
- `articles`: Object containing all biblical articles indexed by filename
- `ambiguous_terms`: Object mapping ambiguous terms to possible articles

**Example accessing database:**

```javascript
// Get all articles in the 'kt' (key terms) category
const ktArticles = Object.entries(contextDatabase.articles)
  .filter(([filename, article]) => article.category === 'kt')
  .map(([filename, article]) => ({ filename, ...article }));

// Get ambiguous terms that need disambiguation
const ambiguousTerms = contextDatabase.ambiguous_terms;
console.log('Ambiguous terms:', Object.keys(ambiguousTerms));
```
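**Example post-processing the TSV output:**

Because `generateTWL` returns a plain TSV string, callers usually need to parse it before acting on the Confidence, Match_Type, or Disambiguation columns. The sketch below shows one possible way to do that and to pick out rows that probably need manual review. It assumes the first line of the output is a header row carrying the column names listed above, and that fields never contain tabs or newlines; neither assumption is confirmed by this README. The helper names `parseTwlTsv`, `confidenceBand`, and `rowsNeedingReview` are hypothetical and not part of the package.

```javascript
// Sketch only. Assumes a header row and no tabs/newlines inside fields.
// All three helper names are hypothetical, not part of twl-linker.
function parseTwlTsv(tsv) {
  const lines = tsv.split(/\r?\n/).filter((line) => line.trim() !== '');
  if (lines.length === 0) return [];
  const header = lines[0].split('\t');
  return lines.slice(1).map((line) => {
    const cells = line.split('\t');
    const row = {};
    header.forEach((name, i) => {
      // Missing trailing cells become empty strings.
      row[name] = cells[i] !== undefined ? cells[i] : '';
    });
    return row;
  });
}

// Bands follow the "Confidence Scores" section of this README.
function confidenceBand(score) {
  if (score >= 0.8) return 'high';
  if (score >= 0.6) return 'medium';
  return 'lower';
}

// Rows worth a human look: unparseable or lower-band scores,
// ambiguous matches, or a "manual:" disambiguation value.
function rowsNeedingReview(rows) {
  return rows.filter((row) => {
    const score = Number.parseFloat(row.Confidence);
    return (
      Number.isNaN(score) ||
      confidenceBand(score) === 'lower' ||
      row.Match_Type === 'ambiguous' ||
      (row.Disambiguation || '').startsWith('manual:')
    );
  });
}
```

With the package installed, `rowsNeedingReview(parseTwlTsv(generateTWL(usfmContent)))` would then give the subset of links to check by hand, provided the header-row assumption holds.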