# seta-indexer
A standalone CLI tool for vector database indexing and semantic search over documentation. Supports PDF, Markdown, text, and more. Powered by local embeddings and LanceDB.
## Features
- Indexes PDF, Markdown, MDX, TXT, JSON, YAML, XML, CSV
- Intelligent chunking with sentence/paragraph boundary detection
- Fast local embeddings with all-MiniLM-L6-v2 (via @xenova/transformers)
- Vector similarity search with filtering (library, topic, difficulty)
- Incremental updates and file change detection
- Flexible file selection with `--include` and `--exclude` glob patterns
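
To illustrate what "vector similarity search with filtering" means in practice, here is a minimal TypeScript sketch. It is not seta-indexer's implementation: the real search runs inside LanceDB, and the README does not state which distance metric is used. The `Chunk` shape, the `search` function, and the choice of cosine similarity are all assumptions made for illustration; only the filter names (library, topic, difficulty) come from the feature list above.

```typescript
// Hypothetical record shape. Only the filter names come from the README;
// the rest is an assumption, not seta-indexer's actual schema.
interface Chunk {
  text: string;
  vector: number[];
  library?: string;
  topic?: string;
  difficulty?: string;
}

const FILTER_KEYS = ["library", "topic", "difficulty"] as const;
type Filter = Partial<Record<(typeof FILTER_KEYS)[number], string>>;

// Cosine similarity: 1 for identical direction, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Apply the metadata filter first, then rank the survivors by similarity.
function search(chunks: Chunk[], query: number[], filter: Filter = {}, limit = 5): Chunk[] {
  return chunks
    .filter((c) => FILTER_KEYS.every((k) => filter[k] === undefined || c[k] === filter[k]))
    .map((c) => ({ chunk: c, score: cosineSimilarity(c.vector, query) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
    .map((r) => r.chunk);
}
```

Filtering before ranking keeps the similarity computation limited to chunks that can actually be returned.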
## Installation
```bash
npm install -g seta-indexer
# or use npx
npx seta-indexer <folder> [options]
```
## Usage
### Index a documentation folder
```bash
npx seta-indexer /path/to/docs
```
### Local usage (from cloned repo)
```bash
# Clone and setup
git clone https://github.com/techformist/seta-indexer.git
cd seta-indexer
npm install
# Build the project
npm run build
# Run locally with node
node dist/cli.js /path/to/docs
# Or use the dev script for development
npm run dev -- index /path/to/docs
```
### Available commands (local usage)
```bash
# Index documents
node dist/cli.js index /path/to/docs --verbose
# Search indexed content
node dist/cli.js search "your query" /path/to/docs
# Show database statistics
node dist/cli.js stats /path/to/docs
# Clean/remove database
node dist/cli.js clean /path/to/docs
# Run tests
npm test
```
#### Options
- `--verbose, -v` : Detailed logging
- `--force` : Force re-index all files
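- `--chunk-size <size>` : Maximum chunk size (default: 1000)
- `--chunk-overlap <overlap>` : Overlap between consecutive chunks (default: 200)
- `--model <model>` : Embedding model (default: all-MiniLM-L6-v2)
- `--db-path <path>` : Custom database path (the example output below uses `.seta_lancedb` inside the documentation folder)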
- `--include <patterns...>` : Glob patterns to include (e.g. `**/*.md` `docs/**/*.pdf`)
- `--exclude <patterns...>` : Glob patterns to exclude (e.g. `**/drafts/**`)
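
The interaction between `--chunk-size` and `--chunk-overlap` can be sketched roughly as follows. This is an illustrative TypeScript sketch, not the tool's actual chunker: the `chunkText` function is hypothetical, it assumes sizes are measured in characters (the README does not state the unit), and its boundary heuristic (prefer a paragraph break, then a sentence end) is only one plausible reading of "sentence/paragraph boundary detection".

```typescript
// Hypothetical chunker: sizes are assumed to be in characters.
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  // Never cut earlier than this offset, so every chunk is longer than the
  // overlap and the loop always moves forward.
  const minCut = Math.max(chunkOverlap + 1, Math.floor(chunkSize / 2));
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);
    if (end < text.length) {
      const window = text.slice(start, end);
      const paragraph = window.lastIndexOf("\n\n");
      const sentence = Math.max(
        window.lastIndexOf(". "),
        window.lastIndexOf("? "),
        window.lastIndexOf("! "),
      );
      if (paragraph >= minCut) {
        end = start + paragraph; // cut at the paragraph break
      } else if (sentence >= minCut) {
        end = start + sentence + 1; // keep the closing punctuation
      }
    }
    const piece = text.slice(start, end).trim();
    if (piece) chunks.push(piece);
    if (end >= text.length) break;
    start = end - chunkOverlap; // the next chunk repeats the tail of this one
  }
  return chunks;
}
```

In this sketch the overlap region may begin mid-word, because the next chunk simply starts `chunkOverlap` characters before the previous cut. A larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed and store.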
## Supported File Types
- `.pdf`, `.md`, `.mdx`, `.txt`, `.json`, `.yaml`, `.yml`, `.xml`, `.csv` (indexed by default)
- Use `--include`/`--exclude` for custom file selection
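
As a rough illustration of the default selection, the check below accepts a path only if its extension is in the list above. The `isSupported` helper is hypothetical, and the case-insensitive comparison is an assumption; glob handling for `--include`/`--exclude` is deliberately left out because it would need a glob-matching library.

```typescript
// Default extensions as listed in this README.
const DEFAULT_EXTENSIONS = new Set([
  ".pdf", ".md", ".mdx", ".txt", ".json", ".yaml", ".yml", ".xml", ".csv",
]);

// Hypothetical helper: true when the file's extension is in the default set.
function isSupported(filePath: string): boolean {
  const dot = filePath.lastIndexOf(".");
  const slash = Math.max(filePath.lastIndexOf("/"), filePath.lastIndexOf("\\"));
  // No extension, a dot only in a directory name, or a dotfile like ".env".
  if (dot <= slash + 1) return false;
  // Case-insensitive match is an assumption made for this sketch.
  return DEFAULT_EXTENSIONS.has(filePath.slice(dot).toLowerCase());
}
```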
## Output Examples
**Indexing:**
```
Starting indexing process for: /docs
Documentation path: /docs
Database path: /docs/.seta_lancedb
Loading existing index state...
Scanning documentation files...
Found 25 documentation files
Initializing embedding model...
Connecting to LanceDB...
Processing documentation files...
Processing: main_guide.md
Generated 12 chunks
Generated 12 embedded chunks
Indexing completed
```
```
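
The "Loading existing index state" step, together with the incremental update feature, suggests that the tool compares each file against what it saw on the previous run. The README does not say how changes are detected (content hash, modification time, or something else), so the following TypeScript sketch simply assumes a SHA-256 content hash per file. The `IndexState` type and the `planUpdate` function are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Hypothetical persisted state: file path -> content hash from the last run.
type IndexState = Record<string, string>;

function hashContent(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Decide which files need (re-)indexing and which were deleted.
// `force` mirrors the idea behind the --force flag: re-index everything.
function planUpdate(
  files: Record<string, string>, // path -> current content
  previous: IndexState,
  force = false,
): { toIndex: string[]; removed: string[]; next: IndexState } {
  const toIndex: string[] = [];
  const next: IndexState = {};
  for (const [path, content] of Object.entries(files)) {
    const hash = hashContent(content);
    next[path] = hash;
    if (force || previous[path] !== hash) toIndex.push(path);
  }
  // Files known from the last run that no longer exist on disk.
  const removed = Object.keys(previous).filter((p) => !(p in files));
  return { toIndex, removed, next };
}
```

Under this scheme, unchanged files are skipped entirely, which avoids recomputing embeddings, usually the most expensive part of indexing.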
## Troubleshooting
- Ensure all dependencies are installed (`npm install`)
- For PDF extraction errors, check file integrity
- For embedding errors, ensure enough RAM and disk space for model caching
- For DB errors, use `--force` to re-index all files, or run the `clean` command to remove the database and index again
## License
MIT