[![vecpdf icon](vecpdficon.svg)](https://github.com/thegreatbey/vecpdf)

## Badges

[![Install](https://img.shields.io/badge/Install-npm%20i%20--g%20vecpdf-CB3837?logo=npm)](https://www.npmjs.com/package/vecpdf)
[![npm](https://img.shields.io/npm/v/vecpdf?logo=npm)](https://www.npmjs.com/package/vecpdf)
[![Publish](https://img.shields.io/github/actions/workflow/status/thegreatbey/vecpdf/publish.yml?label=Publish)](https://github.com/thegreatbey/vecpdf/actions/workflows/publish.yml)
[![Install size](https://packagephobia.com/badge?p=vecpdf)](https://packagephobia.com/result?p=vecpdf)
[![Downloads](https://img.shields.io/npm/dm/vecpdf)](https://www.npmjs.com/package/vecpdf)
[![License](https://img.shields.io/github/license/thegreatbey/vecpdf)](https://github.com/thegreatbey/vecpdf/blob/main/LICENSE)

# vecpdf: PDF → ChromaDB (HTTP server)

vecpdf is a tiny CLI that:

- extracts text from a PDF (via Python **PyMuPDF**),
- splits the text into chunks (token-aware when `tiktoken` is available),
- and indexes those chunks into a **ChromaDB collection over HTTP**.

It is also an excuse to avoid relying on hosted vector databases such as Pinecone.

> **Note:** Chroma is a local vector database. vecpdf talks to a running Chroma **server** (default `http://localhost:8000`).
> Remember: vectors live inside the Chroma server, not in your project folder.

---

## Requirements

- **Python** with:
  ```bash
  pip install PyMuPDF tiktoken
  ```
  (`tiktoken` is optional, but gives nicer chunking.)
- **ChromaDB server** running locally (HTTP). By default, vecpdf uses `http://localhost:8000`.

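
To make the chunking behaviour described above more concrete (token-aware when `tiktoken` is available, a plain character split otherwise), here is a minimal sketch. It is not vecpdf's actual script: the function names `load_encoder`, `chunk_by_chars`, and `chunk_text`, the `cl100k_base` encoding, and the 4-characters-per-token fallback ratio are all assumptions made for illustration.

```python
from typing import List


def load_encoder(name: str = "cl100k_base"):
    """Return a tiktoken encoder, or None if tiktoken cannot be used.

    The encoding name is an assumption; vecpdf may use a different one.
    """
    try:
        import tiktoken
        return tiktoken.get_encoding(name)
    except Exception:
        # tiktoken is missing or its encoding data could not be loaded
        return None


def chunk_by_chars(text: str, size: int) -> List[str]:
    """Fallback: cut the text into consecutive slices of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_text(text: str, chunk_size: int = 500, encoder=None,
               chars_per_token: int = 4) -> List[str]:
    """Split text into chunks of about `chunk_size` tokens.

    With an encoder, chunks are cut on token boundaries. Without one,
    the text is cut every `chunk_size * chars_per_token` characters
    (the 4:1 ratio is a rough, hypothetical heuristic).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if encoder is None:
        return chunk_by_chars(text, chunk_size * chars_per_token)
    tokens = encoder.encode(text)
    return [
        encoder.decode(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]
```

A call such as `chunk_text(text, 500, encoder=load_encoder())` would then use token boundaries when `tiktoken` is installed and quietly fall back to character slices when it is not.
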
**Use a specific Python (virtualenv)**

```bash
# PowerShell example (Windows)
$env:VECPDF_PYTHON="C:\Path\to\your\venv\Scripts\python.exe"

# macOS/Linux example
export VECPDF_PYTHON="$HOME/.venvs/vecpdf/bin/python"
```

**Chroma server URL**

Default: `http://localhost:8000`

To use a different server:

```bash
export CHROMA_URL="http://localhost:8001"
```

---

## Quick Start

Create a tiny sample PDF:

```bash
python - <<'PY'
import fitz
doc = fitz.open()
page = doc.new_page()
page.insert_text((72,72), "Neural networks learn by adjusting weights.\nEmbeddings map meaning to vectors.")
doc.save("sample.pdf"); doc.close()
PY
```

Process the PDF:

```bash
# Basic usage (indexes into the 'documents' collection)
vecpdf process sample.pdf

# Append to an existing collection instead of recreating it
vecpdf process sample.pdf --keep-existing

# Use a custom chunk ID prefix (helps avoid collisions + label sources)
vecpdf process sample.pdf --id-prefix "paperA_"

# Adjust chunk size (tokens)
vecpdf process sample.pdf -s 800
```

Query the collection:

```bash
# Top 3 results (preview)
vecpdf query "neural networks" -c documents -n 3

# Print full text for each result
vecpdf query "neural networks" -c documents -n 3 --full
```

---

## CLI Reference

### `vecpdf process <pdf-path> [options]`

- `<pdf-path>`: Path to your PDF file (required)
- `-c, --collection <name>`: Chroma collection name (default: `documents`)
- `-s, --chunk-size <size>`: Token chunk size (default: `500`)
- `--python-script <path>`: Use your own Python script (advanced)
- `--keep-existing`: Append to existing collection instead of recreating it
- `--id-prefix <prefix>`: Custom prefix for new chunk IDs (default: `chunk_`)

### `vecpdf query <query-text> [options]`

- `<query-text>`: Text to search for (required)
- `-c, --collection <name>`: Collection name (default: `documents`)
- `-n, --results <number>`: Number of results to return (default: `5`)
- `--full`: Show full text for each result (instead of a preview)

**Where data lives**

- vecpdf talks to a running Chroma server over HTTP (default `http://localhost:8000`).
- Documents and vectors are stored by that server (not in a local `./vectordb` folder).

---

## Troubleshooting

**Python extraction errors**

- Make sure PyMuPDF is installed:
  ```bash
  pip install PyMuPDF
  ```
- If tiktoken is missing, vecpdf falls back to a simple character split (still works).

**Embedding/Indexing errors**

- Your Chroma server needs an embedder. One path is:
  ```bash
  pip install chromadb sentence-transformers
  ```
- If you see duplicate-ID errors, try a different `--id-prefix` or run without `--keep-existing`.

**No results**

- Increase `-n`, try a simpler query, or confirm the `-c` collection name.

---

## License

MIT