# TWL Linker - Biblical Semantic Linker
TWL Linker automatically creates semantic links between USFM Bible text and biblical articles using a context database. It generates Translation Words Links (TWL) files in TSV format, with a confidence score and disambiguation information for each link.
**Available as:**
- **Global CLI tool**: Install with `npm install -g twl-linker`, use with `twl-linker <file>`
- **NPM Package**: Install with `npm install twl-linker` for React.js/Node.js projects
- **Local development**: Clone and run with `node twl-linker.js <file>`
## Features
- **Semantic Matching**: Intelligent text analysis to find biblical terms in USFM files
- **Confidence Scoring**: Each match includes a confidence score based on context analysis
- **Disambiguation**: Automatic disambiguation of ambiguous terms with fallback to manual review
- **Batch Processing**: Process multiple books at once
- **Flexible Output**: Customizable output file naming and locations
- **Alignment Data Handling**: Automatically removes USFM alignment markers to process clean text
- **Built-in Database**: No separate database setup required for CLI or package usage
## Prerequisites
- Python 3.7+ (for building the context database)
- Node.js 12+ (for running the linker)
- Access to the `en_tw` repository with biblical term definitions (should be in `../en_tw/` relative to this project)
- Access to the `en_ult` repository with USFM Bible files (should be in `../en_ult/` relative to this project)
## Setup
### 1. Install Python Dependencies
```bash
pip install -r requirements.txt
```
### 2. Build the Biblical Context Database
For local development, build the context database before running the linker (the global CLI and the npm package ship with a built-in database). The build script reads biblical term definitions from the `en_tw` repository:
```bash
python build_biblical_context_database.py
```
**Note:** This script expects the `en_tw` repository to be located at `../en_tw/` relative to this project directory. The script will process definitions from `../en_tw/bible/` which should contain the `kt/`, `names/`, and `other/` subdirectories with biblical term definitions.
This will create `biblical_context_database.json` which contains processed biblical term definitions, variants, and disambiguation rules.
## Usage
### Installation Options
| Usage Type | Installation | Command | Use Case |
| --------------------- | --------------------------- | ------------------------------------------ | ------------------------- |
| **Global CLI** | `npm install -g twl-linker` | `twl-linker input.usfm` | Command-line processing |
| **NPM Package** | `npm install twl-linker` | `import { generateTWL } from 'twl-linker'` | React.js/Node.js apps |
| **Local Development** | `git clone <repo>` | `node twl-linker.js input.usfm` | Development/customization |
**Option 1: Global Installation (Recommended for CLI usage)**
```bash
npm install -g twl-linker
```
**Option 2: Local Installation (For development or package usage)**
```bash
git clone <this-repository>
cd twl-linker
npm install # if you add dependencies later
```
**Option 3: NPM Package (For React.js/Node.js projects)**
```bash
npm install twl-linker
```
### Command Line Interface (CLI)
#### Global CLI Usage (after `npm install -g twl-linker`)
Process a single USFM file using the global command:
```bash
twl-linker <input_file> [output_file]
```
**Examples:**
```bash
# Input: 01-GEN.usfm → Output: twl_GEN.tsv (auto-generated)
twl-linker ../en_ult/01-GEN.usfm
# Input: test.usfm → Output: test.tsv (auto-generated)
twl-linker test.usfm
# Custom output file
twl-linker ../en_ult/46-ROM.usfm my_output.tsv
```
#### Local CLI Usage (for development)
If you're working with the source code locally:
```bash
node cli.js <input_file> [output_file]
```
**Examples:**
```bash
# Input: 01-GEN.usfm → Output: twl_GEN.tsv (auto-generated)
node cli.js ../en_ult/01-GEN.usfm
# Input: test.usfm → Output: test.tsv (auto-generated)
node cli.js test.usfm
# Custom output file
node cli.js ../en_ult/46-ROM.usfm my_output.tsv
```
**Output File Naming Rules (applies to both CLI methods):**
- Files whose names start with a number and a dash (e.g., `01-GEN.usfm`) → `twl_GEN.tsv`
- Other files (e.g., `test.usfm`) → `test.tsv`
- If you specify an output file, it's used exactly as given
**Note:** The global CLI command (`twl-linker`) includes the built-in biblical context database, so no separate database setup is required after installation.
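The naming rules above can be sketched as a small function. This is an illustration only: `defaultOutputName` is a hypothetical helper, not something exported by `twl-linker`, and the real CLI may differ in details such as the directory the file is written to.

```javascript
const path = require('path');

// Hypothetical helper (not part of the twl-linker API).
// Returns the output file name the rules above would produce.
function defaultOutputName(inputPath, explicitOutput) {
  // Rule 3: an explicit output file is used exactly as given.
  if (explicitOutput) return explicitOutput;

  // Strip the directory and the extension: "../en_ult/01-GEN.usfm" -> "01-GEN"
  const base = path.basename(inputPath, path.extname(inputPath));

  // Rule 1: "<number>-<NAME>" becomes "twl_<NAME>.tsv".
  const numbered = base.match(/^\d+-(.+)$/);
  if (numbered) return `twl_${numbered[1]}.tsv`;

  // Rule 2: anything else keeps its base name.
  return `${base}.tsv`;
}
```

For example, `defaultOutputName('../en_ult/01-GEN.usfm')` yields `twl_GEN.tsv`, and `defaultOutputName('test.usfm')` yields `test.tsv`.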
### Batch Processing
Process all USFM files in a directory (currently requires local installation):
```bash
node process_all_books.js <input_directory> [output_directory]
```
**Examples:**
```bash
# Process files from ../en_ult and output to current directory
node process_all_books.js ../en_ult .
# Process files and output to same directory as input
node process_all_books.js ../en_ult
# Process files and output to a different directory
node process_all_books.js ../en_ult ./output_folder
```
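As a rough sketch of what a batch run has to decide, the function below maps a list of file names to input/output path pairs. It is not the code in `process_all_books.js`; `planBatch` is a hypothetical name, and the sketch assumes that only files ending in `.usfm` are processed and that the single-file naming rules also apply in batch mode.

```javascript
const path = require('path');

// Hypothetical helper (not part of twl-linker): plans a batch run without
// touching the file system. outputDir defaults to inputDir, mirroring the
// "output to same directory as input" example above.
function planBatch(inputDir, fileNames, outputDir = inputDir) {
  return fileNames
    // Assumption: only .usfm files are picked up.
    .filter((name) => path.extname(name).toLowerCase() === '.usfm')
    .sort()
    .map((name) => {
      const base = path.basename(name, path.extname(name));
      // Assumption: "01-GEN" -> "twl_GEN.tsv", anything else -> "<base>.tsv".
      const numbered = base.match(/^\d+-(.+)$/);
      const outName = numbered ? `twl_${numbered[1]}.tsv` : `${base}.tsv`;
      return {
        input: path.join(inputDir, name),
        output: path.join(outputDir, outName),
      };
    });
}
```

Calling `planBatch('../en_ult', fs.readdirSync('../en_ult'), './output')` would give one entry per book, which a loop could then pass to the linker.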
## Output Format
The generated TSV files contain the following columns:
| Column | Description |
| -------------- | ------------------------------------------------------- |
| Reference | Chapter:verse reference (e.g., "1:1") |
| ID | Unique 4-character hexadecimal ID |
| Tags | Category of the biblical term (kt, names, other) |
| OrigWords | The original word(s) found in the text |
| Occurrence | Occurrence number of this term in the verse |
| TWLink | Resource link to the translation word article |
| Confidence | Confidence score (0.1-1.0) |
| Match_Type | Type of match (exact, morphological, theological, etc.) |
| Context | Surrounding text context |
| Disambiguation | Disambiguation method used |
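To work with these columns in code, the TSV text can be turned into one object per row. The sketch below is a minimal example, not part of the package: `parseTwlTsv` is a hypothetical helper, it assumes the first line of the file is a header row with the column names above, and the sample row is invented purely for illustration.

```javascript
// Column names as listed in the table above.
const TWL_COLUMNS = [
  'Reference', 'ID', 'Tags', 'OrigWords', 'Occurrence',
  'TWLink', 'Confidence', 'Match_Type', 'Context', 'Disambiguation',
];

// Hypothetical helper (not part of the twl-linker API).
// Assumes a header row and that no field contains a tab or a newline.
function parseTwlTsv(tsv) {
  const lines = tsv.split(/\r?\n/).filter((line) => line.trim() !== '');
  if (lines.length === 0) return [];

  const header = lines[0].split('\t');
  return lines.slice(1).map((line) => {
    const cells = line.split('\t');
    const row = {};
    header.forEach((name, i) => {
      row[name] = cells[i] !== undefined ? cells[i] : '';
    });
    // Convert the numeric columns so they can be compared and sorted.
    row.Confidence = Number(row.Confidence);
    row.Occurrence = Number(row.Occurrence);
    return row;
  });
}

// Invented sample data, only to show the shape of a row.
const sampleTsv = [
  TWL_COLUMNS.join('\t'),
  ['1:1', 'a1b2', 'kt', 'God', '1', 'rc://*/tw/dict/bible/kt/god',
    '0.95', 'exact', 'In the beginning God created', 'single'].join('\t'),
].join('\n');

const rows = parseTwlTsv(sampleTsv);
```

Each entry in `rows` then exposes the columns as properties, for example `rows[0].Reference` and `rows[0].Confidence`.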
## Understanding the Output
### Confidence Scores
- **0.8-1.0**: High confidence matches
- **0.6-0.79**: Medium confidence matches
- **0.5-0.59**: Lower confidence matches (review recommended)
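When post-processing the output, the bands above can be applied with a small function. This is a sketch under the assumption that the thresholds are exactly those listed; `confidenceBand` and its return labels are hypothetical names, and any score below 0.6 is simply treated as needing review.

```javascript
// Hypothetical helper (not part of the twl-linker API).
// Maps a numeric confidence score to one of the bands listed above.
function confidenceBand(score) {
  if (typeof score !== 'number' || Number.isNaN(score)) {
    throw new TypeError('score must be a number');
  }
  if (score >= 0.8) return 'high';
  if (score >= 0.6) return 'medium';
  // Everything lower is flagged for review.
  return 'review';
}
```

A reviewer could use it to list only the rows that need attention, for example `rows.filter((row) => confidenceBand(row.Confidence) === 'review')`.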
### Match Types
- **exact**: Direct term match from cleaned terms
- **morphological**: Match using word variants (plurals, etc.)
- **theological**: Match using theological variants
- **disambiguated**: Automatically disambiguated ambiguous term
- **ambiguous**: Ambiguous term requiring manual review
### Disambiguation
- **single**: Unambiguous term with single meaning
- **auto:X.XX**: Automatically disambiguated (score shown)
- **manual:option1 (alternatives)**: Manual review needed with options listed
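The three value shapes above can be split into structured data as sketched below. `parseDisambiguation` is a hypothetical helper, and the sketch assumes that the alternatives inside the parentheses are separated by commas, which the format description above does not confirm.

```javascript
// Hypothetical helper (not part of the twl-linker API).
// Parses a Disambiguation value such as "single", "auto:0.85",
// or "manual:option1 (option2, option3)".
function parseDisambiguation(value) {
  if (value === 'single') return { method: 'single' };

  const auto = /^auto:(\d+(?:\.\d+)?)$/.exec(value);
  if (auto) return { method: 'auto', score: Number(auto[1]) };

  const manual = /^manual:(.+?)(?:\s*\((.*)\))?$/.exec(value);
  if (manual) {
    // Assumption: alternatives are comma-separated inside the parentheses.
    const alternatives = manual[2]
      ? manual[2].split(',').map((s) => s.trim()).filter(Boolean)
      : [];
    return { method: 'manual', suggested: manual[1].trim(), alternatives };
  }

  // Unrecognized values are returned untouched rather than guessed at.
  return { method: 'unknown', raw: value };
}
```

Rows whose parsed `method` is `manual` are the ones that still need a human decision.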
## File Structure
```
twl-linker/
├── build_biblical_context_database.py  # Context database builder
├── twl-linker.js                       # Main semantic linker
├── process_all_books.js                # Batch processor
├── usfm-alignment-remover.js           # USFM alignment data remover
├── biblical_context_database.json      # Generated context database
├── requirements.txt                    # Python dependencies
└── README.md                           # This file

../en_tw/                               # Biblical term definitions (separate repo)
└── bible/                              # Biblical term definitions
    ├── kt/                             # Key terms
    ├── names/                          # Biblical names
    └── other/                          # Other terms

../en_ult/                              # Aligned USFM files (separate repo)
├── 01-GEN.usfm
├── 02-EXO.usfm
└── ...
```
## Examples
### Processing a Single Book
```bash
# Process Genesis
node twl-linker.js ../en_ult/01-GEN.usfm
# Output: twl_GEN.tsv with semantic links
```
### Processing All Books
**Examples:**
```bash
# Process all books in the ../en_ult directory, output to ./output
node process_all_books.js ../en_ult ./output
# This will create:
# ./output/twl_GEN.tsv, ./output/twl_EXO.tsv, ./output/twl_LEV.tsv, etc.
```
## Troubleshooting
### Common Issues
1. **"Context database not found"**
   - Run `python build_biblical_context_database.py` first
2. **"No USFM files found"**
   - Check that the input directory contains `.usfm` files
   - Verify the directory path is correct
3. **"Output directory does not exist"**
   - Create the output directory before running batch processing
   - Or use `.` to output to the current directory
4. **"No such file or directory: '../en_tw/bible'"**
   - Ensure the `en_tw` repository is cloned at `../en_tw/` relative to this project
   - The directory structure should be: `../en_tw/bible/kt/`, `../en_tw/bible/names/`, `../en_tw/bible/other/`
   - If the `en_tw` repository is in a different location, you can modify the path in `build_biblical_context_database.py`
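Several of the issues above come down to a missing file or directory, so a quick check before building or linking can save time. The sketch below is not shipped with the project: `findMissingPaths` and `REQUIRED_PATHS` are hypothetical names, and the path list is only what the Prerequisites and Troubleshooting sections mention for a local setup.

```javascript
const fs = require('fs');
const path = require('path');

// Paths a local setup is expected to have, relative to the project directory.
// The database file is absent until the build script has been run.
const REQUIRED_PATHS = [
  '../en_tw/bible/kt',
  '../en_tw/bible/names',
  '../en_tw/bible/other',
  'biblical_context_database.json',
];

// Hypothetical helper (not part of twl-linker).
// Returns the entries of relativePaths that do not exist under projectDir.
// The exists parameter can be swapped out to test without touching the disk.
function findMissingPaths(projectDir, relativePaths = REQUIRED_PATHS, exists = fs.existsSync) {
  return relativePaths.filter((rel) => !exists(path.resolve(projectDir, rel)));
}
```

Running `findMissingPaths(process.cwd())` from the project directory returns an empty array when everything is in place, or the list of paths that still need to be cloned or built.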
### Performance Tips
- The context database is loaded once per batch operation for efficiency
- Large books (like Psalms) may take longer to process
- High numbers of ambiguous terms may require manual review
## Development
### Adding New Terms
1. Add term definitions to the appropriate directory in `../en_tw/bible/`
2. Rebuild the context database: `python build_biblical_context_database.py`
3. Test with sample texts
### Customizing Disambiguation
Edit the disambiguation rules in `twl-linker.js` in the `disambiguateAmbiguousTerm` function.
## License
This project processes biblical text and translation resources. Please ensure compliance with the licenses of the source materials.
## NPM Package Usage
This package can also be used as an npm module in React.js applications or other Node.js projects. **Version 1.0.1+ uses ES6 modules** for better compatibility with modern bundlers.
### Installation as Package
```bash
npm install twl-linker
```
**Note:** If you need CommonJS support, use version 1.0.0: `npm install twl-linker@1.0.0`
### Usage in React.js/Node.js
```javascript
import { generateTWL, contextDatabase } from 'twl-linker';
// Generate TWL from USFM content
const usfmContent = `\\c 1
\\v 1 In the beginning God created the heaven and the earth.`;
const tsvOutput = generateTWL(usfmContent);
console.log(tsvOutput);
// Access the biblical context database if needed
console.log('Database metadata:', contextDatabase.metadata);
console.log('Total articles:', contextDatabase.metadata.total_articles);
// Access specific articles
const godArticle = contextDatabase.articles['god.md'];
console.log('God article:', godArticle);
```
### ES6 Import (React.js)
```javascript
import { generateTWL, contextDatabase } from 'twl-linker';
function MyComponent() {
  const handleGenerateTWL = (usfmText) => {
    const result = generateTWL(usfmText);
    // Process the TSV result
    return result;
  };

  const getArticleInfo = (articleName) => {
    const article = contextDatabase.articles[`${articleName}.md`];
    return article;
  };

  // Replace with your component JSX
  return null;
}
```
### API Reference for Package Usage
#### `generateTWL(usfmContent)`
Generates Translation Words Links (TWL) from USFM content using the built-in biblical context database.
- **Parameters:**
  - `usfmContent` (string): The USFM text content to process
- **Returns:** String containing TSV format with columns: Reference, ID, Tags, OrigWords, Occurrence, TWLink, Confidence, Match_Type, Context, Disambiguation
#### `contextDatabase`
The complete biblical context database object containing:
- `metadata`: Statistics about the database (total articles, ambiguous terms, categories)
- `articles`: Object containing all biblical articles indexed by filename
- `ambiguous_terms`: Object mapping ambiguous terms to possible articles
**Example accessing database:**
```javascript
// Get all articles in the 'kt' (key terms) category
const ktArticles = Object.entries(contextDatabase.articles)
  .filter(([, article]) => article.category === 'kt')
  .map(([filename, article]) => ({ filename, ...article }));
// Get ambiguous terms that need disambiguation
const ambiguousTerms = contextDatabase.ambiguous_terms;
console.log('Ambiguous terms:', Object.keys(ambiguousTerms));
```