# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is an NPM package called `@elpassion/semantic-chunking` that provides semantic text chunking capabilities for large language model (LLM) workflows. It splits text into meaningful chunks based on sentence similarity using embedding models.
## Key Architecture
### Core Components
- **chunkit.js**: Main entry point with three primary functions:
  - `chunkit()`: Semantic chunking with similarity-based grouping
  - `cramit()`: Quick chunking by token size only, without similarity grouping
  - `sentenceit()`: Simple sentence splitting
- **embeddingUtils.js**: Contains the embedding model classes:
  - `LocalEmbeddingModel`: Uses local ONNX models via Transformers.js
  - `OpenAIEmbedding`: Uses OpenAI's embedding API
- **similarityUtils.js**: Cosine similarity calculations and threshold adjustments
- **chunkingUtils.js**: Chunk creation and optimization logic
- **config.js**: Default configuration values
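As a sketch of how the three entry points differ, assuming `cramit()` and `sentenceit()` share `chunkit()`'s `(documents, model, options)` signature and the document array shape shown (both assumptions; verify against chunkit.js):

```javascript
// Sketch only: signatures for cramit()/sentenceit() and the document
// array shape are assumptions. Verify in chunkit.js before relying on them.
import { chunkit, cramit, sentenceit, LocalEmbeddingModel } from '@elpassion/semantic-chunking';

const model = new LocalEmbeddingModel();
await model.initialize('Xenova/all-MiniLM-L6-v2');
const docs = [{ document_name: 'example', document_text: 'First sentence. Second sentence.' }];

const semantic = await chunkit(docs, model, { similarityThreshold: 0.5 }); // groups by similarity
const packed = await cramit(docs, model, { maxTokenSize: 500 });           // packs by token count only
const sentences = await sentenceit(docs, model, {});                       // one chunk per sentence
```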
### Model Architecture
The package uses a dependency injection pattern:
1. Initialize an embedding model instance once.
2. Pass that model into the chunking functions for reuse.
3. The model handles tokenization and embedding generation.
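The payoff is that the expensive model load happens once per process. A minimal sketch, assuming `chunkit()`'s `(documents, model, options)` signature from the Usage Patterns section below:

```javascript
// One instance, many calls: ONNX weights and tokenizer load only once.
const model = new LocalEmbeddingModel();
await model.initialize('Xenova/all-MiniLM-L6-v2');

for (const batch of documentBatches) { // documentBatches: hypothetical input
  const chunks = await chunkit(batch, model, { maxTokenSize: 500 });
  console.log(`produced ${chunks.length} chunks from one batch`);
}
```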
## Development Commands
```bash
# Run examples
npm run example-chunkit      # Basic chunkit example
npm run example-cramit       # Quick cramit example
npm run example-sentenceit   # Sentence splitting example

# Model management
npm run download-models      # Download pre-configured models
npm run clean-models         # Clean downloaded models (Unix)
npm run clean-models-win     # Clean downloaded models (Windows)

# Clean install
npm run clean                # Remove node_modules and reinstall
```
## Model Configuration
- Models are downloaded to `./models` directory
- Configuration in `tools/download-models-list.json`
- Default model: `Xenova/all-MiniLM-L6-v2`
- Supported precisions: `fp32`, `fp16`, `q8`, `q4`
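Precision is presumably chosen when the model is initialized; the option name below is a guess, so treat this as a sketch and confirm against `LocalEmbeddingModel` in embeddingUtils.js:

```javascript
// 'precision' is a hypothetical option name; confirm in embeddingUtils.js.
const model = new LocalEmbeddingModel();
await model.initialize('Xenova/all-MiniLM-L6-v2', { precision: 'q8' });
```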
## Web UI
The `webui/` directory contains a standalone web interface for experimenting with chunking parameters. It has its own `package.json` and can be run independently.
## Usage Patterns
Always initialize models before use:
```javascript
// Import names are assumed to be top-level exports of the package.
import { chunkit, LocalEmbeddingModel } from '@elpassion/semantic-chunking';

const model = new LocalEmbeddingModel();
await model.initialize('Xenova/all-MiniLM-L6-v2');
const chunks = await chunkit(documents, model, options);
```
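The OpenAI-backed model should slot into the same call; the constructor and initialize arguments below are assumptions (check `OpenAIEmbedding` in embeddingUtils.js):

```javascript
// Constructor and initialize arguments are assumptions; verify in embeddingUtils.js.
const model = new OpenAIEmbedding({ apiKey: process.env.OPENAI_API_KEY });
await model.initialize('text-embedding-3-small');
const chunks = await chunkit(documents, model, options);
```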
## Key Configuration Options
- `maxTokenSize`: Maximum tokens per chunk (default: 500)
- `similarityThreshold`: Minimum similarity for same chunk (default: 0.5)
- `combineChunks`: Whether to rebalance chunks (default: true)
- `returnEmbedding`: Include embedding vectors in results
- `chunkPrefix`: Add prefix for RAG applications
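Putting the options together, a typical call might look like this (option names come from the list above; the values are illustrative):

```javascript
const chunks = await chunkit(documents, model, {
  maxTokenSize: 500,              // hard cap on tokens per chunk
  similarityThreshold: 0.5,       // below this, a sentence starts a new chunk
  combineChunks: true,            // rebalance undersized chunks afterwards
  returnEmbedding: true,          // attach each chunk's embedding vector
  chunkPrefix: 'search_document', // example prefix for RAG retrieval
});
```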
## Dependencies
- `@huggingface/transformers`: Core ML functionality
- `sentence-parse`: Sentence boundary detection
- `lru-cache`: Embedding caching
- `cli-progress`: Progress bars for model downloads
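For orientation only, embedding caching with `lru-cache` typically follows the pattern sketched below; this is a generic memoization sketch, not the package's actual code, and `embed()` is a hypothetical method name:

```javascript
import { LRUCache } from 'lru-cache';

// Generic memoization pattern; not the package's actual implementation.
const cache = new LRUCache({ max: 500 });

async function embedCached(model, sentence) {
  const hit = cache.get(sentence);
  if (hit) return hit;                        // reuse a previously computed vector
  const vector = await model.embed(sentence); // embed() is a hypothetical method
  cache.set(sentence, vector);
  return vector;
}
```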