# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is an NPM package called `@elpassion/semantic-chunking` that provides semantic text chunking capabilities for large language model (LLM) workflows. It splits text into meaningful chunks based on sentence similarity using embedding models.
## Key Architecture
### Core Components
- **chunkit.js**: Main entry point with three primary functions:
  - `chunkit()`: Semantic chunking with similarity-based grouping
  - `cramit()`: Quick chunking by token size only, without similarity grouping
  - `sentenceit()`: Simple sentence splitting
- **embeddingUtils.js**: Contains the embedding model classes:
  - `LocalEmbeddingModel`: Uses local ONNX models via Transformers.js
  - `OpenAIEmbedding`: Uses OpenAI's embedding API
- **similarityUtils.js**: Cosine similarity calculations and threshold adjustments
- **chunkingUtils.js**: Chunk creation and optimization logic
- **config.js**: Default configuration values
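As a sketch of how the three entry points differ, assuming `cramit()` and `sentenceit()` share `chunkit()`'s `(documents, model, options)` signature and the document array shape shown (both assumptions; verify against chunkit.js):

```javascript
// Sketch only: signatures for cramit()/sentenceit() and the document
// array shape are assumptions. Verify in chunkit.js before relying on them.
import { chunkit, cramit, sentenceit, LocalEmbeddingModel } from '@elpassion/semantic-chunking';

const model = new LocalEmbeddingModel();
await model.initialize('Xenova/all-MiniLM-L6-v2');
const docs = [{ document_name: 'example', document_text: 'First sentence. Second sentence.' }];

const semantic = await chunkit(docs, model, { similarityThreshold: 0.5 }); // groups by similarity
const packed = await cramit(docs, model, { maxTokenSize: 500 });           // packs by token count only
const sentences = await sentenceit(docs, model, {});                       // one chunk per sentence
```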
### Model Architecture
The package uses a dependency injection pattern:
1. Initialize an embedding model instance once.
2. Pass that model into the chunking functions for reuse.
3. The model handles tokenization and embedding generation.
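The payoff is that the expensive model load happens once per process. A minimal sketch, assuming `chunkit()`'s `(documents, model, options)` signature from the Usage Patterns section below:

```javascript
// One instance, many calls: ONNX weights and tokenizer load only once.
const model = new LocalEmbeddingModel();
await model.initialize('Xenova/all-MiniLM-L6-v2');

for (const batch of documentBatches) { // documentBatches: hypothetical input
  const chunks = await chunkit(batch, model, { maxTokenSize: 500 });
  console.log(`produced ${chunks.length} chunks from one batch`);
}
```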
## Development Commands
```bash
# Run examples
npm run example-chunkit      # Basic chunkit example
npm run example-cramit       # Quick cramit example
npm run example-sentenceit   # Sentence splitting example

# Model management
npm run download-models      # Download pre-configured models
npm run clean-models         # Clean downloaded models (Unix)
npm run clean-models-win     # Clean downloaded models (Windows)

# Clean install
npm run clean                # Remove node_modules and reinstall
```
## Model Configuration
- Models are downloaded to `./models` directory
- Configuration in `tools/download-models-list.json`
- Default model: `Xenova/all-MiniLM-L6-v2`
- Supported precisions: `fp32`, `fp16`, `q8`, `q4`
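Precision is presumably chosen when the model is initialized; the option name below is a guess, so treat this as a sketch and confirm against `LocalEmbeddingModel` in embeddingUtils.js:

```javascript
// 'precision' is a hypothetical option name; confirm in embeddingUtils.js.
const model = new LocalEmbeddingModel();
await model.initialize('Xenova/all-MiniLM-L6-v2', { precision: 'q8' });
```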
## Web UI
The `webui/` directory contains a standalone web interface for experimenting with chunking parameters. It has its own `package.json` and can be run independently.
## Usage Patterns
Always initialize models before use:
```javascript
// Import names are assumed to be top-level exports of the package.
import { chunkit, LocalEmbeddingModel } from '@elpassion/semantic-chunking';

const model = new LocalEmbeddingModel();
await model.initialize('Xenova/all-MiniLM-L6-v2');
const chunks = await chunkit(documents, model, options);
```
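The OpenAI-backed model should slot into the same call; the constructor and initialize arguments below are assumptions (check `OpenAIEmbedding` in embeddingUtils.js):

```javascript
// Constructor and initialize arguments are assumptions; verify in embeddingUtils.js.
const model = new OpenAIEmbedding({ apiKey: process.env.OPENAI_API_KEY });
await model.initialize('text-embedding-3-small');
const chunks = await chunkit(documents, model, options);
```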
## Key Configuration Options
- `maxTokenSize`: Maximum tokens per chunk (default: 500)
- `similarityThreshold`: Minimum similarity for same chunk (default: 0.5)
- `combineChunks`: Whether to rebalance chunks (default: true)
- `returnEmbedding`: Include embedding vectors in results
- `chunkPrefix`: Add prefix for RAG applications
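Putting the options together, a typical call might look like this (option names come from the list above; the values are illustrative):

```javascript
const chunks = await chunkit(documents, model, {
  maxTokenSize: 500,              // hard cap on tokens per chunk
  similarityThreshold: 0.5,       // below this, a sentence starts a new chunk
  combineChunks: true,            // rebalance undersized chunks afterwards
  returnEmbedding: true,          // attach each chunk's embedding vector
  chunkPrefix: 'search_document', // example prefix for RAG retrieval
});
```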
## Dependencies
- `@huggingface/transformers`: Core ML functionality
- `sentence-parse`: Sentence boundary detection
- `lru-cache`: Embedding caching
- `cli-progress`: Progress bars for model downloads
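For orientation only, embedding caching with `lru-cache` typically follows the pattern sketched below; this is a generic memoization sketch, not the package's actual code, and `embed()` is a hypothetical method name:

```javascript
import { LRUCache } from 'lru-cache';

// Generic memoization pattern; not the package's actual implementation.
const cache = new LRUCache({ max: 500 });

async function embedCached(model, sentence) {
  const hit = cache.get(sentence);
  if (hit) return hit;                        // reuse a previously computed vector
  const vector = await model.embed(sentence); // embed() is a hypothetical method
  cache.set(sentence, vector);
  return vector;
}
```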