# vecpdf — PDF → ChromaDB (HTTP server)
vecpdf is a tiny CLI (and an excuse to not rely on Pinecone or similar hosted services) that:
- extracts text from a PDF (via Python **PyMuPDF**),
- splits the text into chunks (token-aware when `tiktoken` is available),
- and indexes those chunks into a **ChromaDB collection over HTTP**.
> **Note:** Chroma is a local vector database. vecpdf talks to a running Chroma **server** (default `http://localhost:8000`), so your vectors live inside the Chroma server, not in your project folder.
---
## Requirements
- **Python** with:
```bash
pip install PyMuPDF tiktoken
```
(tiktoken is optional. With it, chunks are token-aware; without it, vecpdf falls back to a simple character split.)
- **ChromaDB server** running locally (HTTP). By default, vecpdf uses `http://localhost:8000`.
**Use a specific Python (virtualenv)**
Set `VECPDF_PYTHON` to the interpreter that has PyMuPDF installed:
```bash
# PowerShell example (Windows)
$env:VECPDF_PYTHON="C:\Path\to\your\venv\Scripts\python.exe"
# macOS/Linux example
export VECPDF_PYTHON="$HOME/.venvs/vecpdf/bin/python"
```
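If extraction fails after setting `VECPDF_PYTHON`, it can help to confirm that the chosen interpreter actually has the packages. The script below is a minimal sketch, not part of vecpdf: `pick_python` and `can_import` are hypothetical helper names, and falling back to the current interpreter when the variable is unset is an assumption about reasonable behavior, not a description of what vecpdf does.

```python
import os
import subprocess
import sys


def pick_python() -> str:
    """Return the interpreter named by VECPDF_PYTHON, else the current one."""
    # Assumption: an unset or empty variable means "use whatever is running now".
    return os.environ.get("VECPDF_PYTHON") or sys.executable


def can_import(python: str, module: str) -> bool:
    """Run `<python> -c "import <module>"` and report whether it succeeded."""
    try:
        proc = subprocess.run(
            [python, "-c", f"import {module}"],
            capture_output=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # interpreter missing, not executable, or hung
    return proc.returncode == 0


if __name__ == "__main__":
    py = pick_python()
    for mod in ("fitz", "tiktoken"):  # PyMuPDF is imported as `fitz`
        status = "ok" if can_import(py, mod) else "missing"
        print(f"{mod}: {status} ({py})")
```

A `missing` line for `fitz` usually means `pip install PyMuPDF` was run in a different environment than the one `VECPDF_PYTHON` points at.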
**Chroma server URL**
Default: `http://localhost:8000`
To use a different server:
```bash
export CHROMA_URL="http://localhost:8001"
```
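Before processing a PDF, you may want to check that something is listening at the configured URL. This is a rough sketch using only the standard library; `chroma_endpoint` and `is_listening` are hypothetical names, and the way `CHROMA_URL` is parsed here is an assumption rather than vecpdf's own logic. It only tests the TCP port, so it cannot tell a Chroma server from any other service.

```python
import os
import socket
from urllib.parse import urlparse

DEFAULT_CHROMA_URL = "http://localhost:8000"  # default stated in this README


def chroma_endpoint() -> tuple[str, int]:
    """Return (host, port) from CHROMA_URL, falling back to the default."""
    url = urlparse(os.environ.get("CHROMA_URL") or DEFAULT_CHROMA_URL)
    host = url.hostname or "localhost"
    # Assumption: a URL without an explicit port uses the scheme's usual port.
    port = url.port or (443 if url.scheme == "https" else 80)
    return host, port


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Check that something accepts TCP connections at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host, port = chroma_endpoint()
    state = "reachable" if is_listening(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```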
---
## Quick Start
Create a tiny sample PDF:
```bash
python - <<'PY'
import fitz
doc = fitz.open()
page = doc.new_page()
page.insert_text((72,72), "Neural networks learn by adjusting weights.\nEmbeddings map meaning to vectors.")
doc.save("sample.pdf"); doc.close()
PY
```
Process the PDF:
```bash
# Basic usage (indexes into the 'documents' collection)
vecpdf process sample.pdf
# Append to an existing collection instead of recreating it
vecpdf process sample.pdf --keep-existing
# Use a custom chunk ID prefix (helps avoid collisions + label sources)
vecpdf process sample.pdf --id-prefix "paperA_"
# Adjust chunk size (tokens)
vecpdf process sample.pdf -s 800
```
Query the collection:
```bash
# Top 3 results (preview)
vecpdf query "neural networks" -c documents -n 3
# Print full text for each result
vecpdf query "neural networks" -c documents -n 3 --full
```
---
## CLI Reference
### `vecpdf process <pdf-path> [options]`
- `<pdf-path>`: Path to your PDF file (required)
- `-c, --collection <name>`: Chroma collection name (default: `documents`)
- `-s, --chunk-size <size>`: Token chunk size (default: `500`)
- `--python-script <path>`: Use your own Python script (advanced)
- `--keep-existing`: Append to existing collection instead of recreating it
- `--id-prefix <prefix>`: Custom prefix for new chunk IDs (default: `chunk_`)
### `vecpdf query <query-text> [options]`
- `<query-text>`: Text to search for (required)
- `-c, --collection <name>`: Collection name (default: `documents`)
- `-n, --results <number>`: Number of results to return (default: `5`)
- `--full`: Show full text for each result (instead of a preview)
**Where data lives**
- vecpdf talks to a running Chroma server over HTTP (default `http://localhost:8000`).
- Documents and vectors are stored by that server (not in a local `./vectordb` folder).
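Because the data sits in the Chroma server, any Chroma client can look at it, not only vecpdf. The sketch below uses the `chromadb` Python package (`pip install chromadb`) to count and preview stored chunks. The helper names (`preview`, `summarize`, `peek_collection`) and the 80-character preview length are illustrative assumptions; the host, port, and collection name are the defaults mentioned in this README.

```python
def preview(text: str, limit: int = 80) -> str:
    """Collapse whitespace and truncate long chunk text for display."""
    flat = " ".join(text.split())
    return flat if len(flat) <= limit else flat[: limit - 3] + "..."


def summarize(records: dict, limit: int = 80) -> list[str]:
    """Turn a Chroma `get()` result into one display line per chunk."""
    ids = records.get("ids") or []
    docs = records.get("documents") or []
    return [f"{cid}: {preview(doc or '', limit)}" for cid, doc in zip(ids, docs)]


def peek_collection(
    name: str = "documents", host: str = "localhost", port: int = 8000
) -> list[str]:
    """Fetch a few stored chunks from a running Chroma server."""
    import chromadb  # imported lazily so the helpers above work without it

    client = chromadb.HttpClient(host=host, port=port)
    collection = client.get_collection(name)
    print(f"{name}: {collection.count()} chunks stored on {host}:{port}")
    # get() reads stored records directly, so no embedding step is involved.
    return summarize(collection.get(limit=3, include=["documents"]))
```

Calling `peek_collection()` after `vecpdf process sample.pdf` should list the first few chunk IDs with a short text preview, provided the server is running.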
---
## Troubleshooting
**Python extraction errors**
- Make sure PyMuPDF is installed:
```bash
pip install PyMuPDF
```
- If tiktoken is missing, vecpdf falls back to a simple character split (still works).
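To make the fallback concrete, here is a minimal sketch of token-aware chunking with a character-split fallback. It is an illustration of the idea, not vecpdf's actual extraction script: the function names, the `cl100k_base` encoding, and treating the chunk size as a character count in the fallback are all assumptions.

```python
def split_by_chars(text: str, size: int) -> list[str]:
    """Fallback: cut the text into fixed-width character windows."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i : i + size] for i in range(0, len(text), size)]


def split_by_tokens(
    text: str, size: int, encoding_name: str = "cl100k_base"
) -> list[str]:
    """Token-aware split: each chunk holds at most `size` tokens."""
    import tiktoken  # raises ImportError when tiktoken is not installed

    if size <= 0:
        raise ValueError("size must be positive")
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    # Decode each window of tokens back to text.
    return [enc.decode(tokens[i : i + size]) for i in range(0, len(tokens), size)]


def chunk_text(text: str, size: int = 500) -> list[str]:
    """Prefer token-aware chunks; use the character split if tiktoken is absent."""
    try:
        return split_by_tokens(text, size)
    except ImportError:
        return split_by_chars(text, size)
```

With tiktoken installed, a size of 500 means 500 tokens per chunk; in this sketch's fallback the same number is read as 500 characters, so chunks come out noticeably shorter.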
**Embedding/Indexing errors**
- Your Chroma server needs an embedder. One path is:
```bash
pip install chromadb sentence-transformers
```
- If you see duplicate-ID errors, try a different `--id-prefix` or run without `--keep-existing`.
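The reason a prefix helps can be shown with a small sketch. It assumes chunk IDs are formed as prefix plus a running index (for example `chunk_0`, `chunk_1`); vecpdf's real numbering scheme may differ, and `make_ids` and `find_collisions` are hypothetical names used only for illustration.

```python
def make_ids(count: int, prefix: str = "chunk_", start: int = 0) -> list[str]:
    """Build IDs as prefix + index (assumed format, for illustration only)."""
    return [f"{prefix}{i}" for i in range(start, start + count)]


def find_collisions(existing: list[str], new: list[str]) -> list[str]:
    """Return the IDs that appear in both lists, sorted."""
    return sorted(set(existing) & set(new))


if __name__ == "__main__":
    first_pdf = make_ids(3)                      # indexed with the default prefix
    second_pdf_default = make_ids(2)             # appended with the same prefix
    second_pdf_prefixed = make_ids(2, "paperA_") # appended with --id-prefix "paperA_"

    print(find_collisions(first_pdf, second_pdf_default))   # IDs that clash
    print(find_collisions(first_pdf, second_pdf_prefixed))  # empty list
```

Under that assumption, appending a second PDF with the default prefix reuses IDs already in the collection, while a distinct prefix per source keeps them apart.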
**No results**
- Increase `-n`, try a simpler query, or confirm the `-c` collection name.
---
## License
MIT