seekmix
Version:
🔍 A local semantic caching library for Node.js.
331 lines (271 loc) • 7.42 kB
Markdown
# vec0 Virtual Tables
## Table of Contents
- [Basic Table Creation](#basic-table-creation)
- [Vector Column Types](#vector-column-types)
- [Metadata Columns](#metadata-columns)
- [Partition Key Columns](#partition-key-columns)
- [Auxiliary Columns](#auxiliary-columns)
- [Performance Tuning](#performance-tuning)
## Basic Table Creation
### Simple vec0 Table
```sql
CREATE VIRTUAL TABLE vec_items USING vec0(
embedding float[4]
);
```
### With Primary Key
```sql
CREATE VIRTUAL TABLE vec_documents USING vec0(
document_id integer primary key,
contents_embedding float[768]
);
```
### With Distance Metric
```sql
CREATE VIRTUAL TABLE vec_documents USING vec0(
document_id integer primary key,
contents_embedding float[768] distance_metric=cosine
);
```
Distance metrics: `l2` (default), `cosine`, `hamming` (for bit vectors)
## Vector Column Types
### float[N] - Float32 Vectors
4 bytes per element, most common for embeddings:
```sql
CREATE VIRTUAL TABLE vec_embeddings USING vec0(
embedding float[1536] -- OpenAI text-embedding-3-small
);
```
### int8[N] - 8-bit Integer Vectors
1 byte per element, for quantized embeddings:
```sql
CREATE VIRTUAL TABLE vec_quantized USING vec0(
embedding int8[768]
);
```
### bit[N] - Binary Vectors
1 bit per element (packed into bytes), for binary quantization:
```sql
CREATE VIRTUAL TABLE vec_binary USING vec0(
embedding bit[768] -- 96 bytes storage
);
```
## Metadata Columns
Metadata columns are indexed alongside vectors and can be filtered in KNN queries.
### Supported Types
- `TEXT` - strings
- `INTEGER` - 8-byte integers
- `FLOAT` - 8-byte floating point
- `BOOLEAN` - 1-bit (0 or 1)
Maximum: 16 metadata columns per table
### Declaration
```sql
CREATE VIRTUAL TABLE vec_movies USING vec0(
movie_id integer primary key,
synopsis_embedding float[1024],
genre text,
num_reviews integer,
mean_rating float,
contains_violence boolean
);
```
### Inserting with Metadata
```python
db.execute("""
INSERT INTO vec_movies(movie_id, synopsis_embedding, genre, num_reviews, mean_rating, contains_violence)
VALUES (?, ?, ?, ?, ?, ?)
""", [
1,
serialize_float32(embedding),
'scifi',
250,
4.2,
False
])
```
### Filtering in KNN Queries
```sql
SELECT *
FROM vec_movies
WHERE synopsis_embedding MATCH ?
AND k = 5
AND genre = 'scifi'
AND num_reviews BETWEEN 100 AND 500
AND mean_rating > 3.5
AND contains_violence = false
ORDER BY distance;
```
### Supported Operators
- `=` - Equals
- `!=` - Not equals
- `>` - Greater than
- `>=` - Greater than or equal
- `<` - Less than
- `<=` - Less than or equal
BOOLEAN columns only support `=` and `!=`
Unsupported: `IS NULL`, `LIKE`, `GLOB`, `REGEXP`, scalar functions
## Partition Key Columns
Partition keys internally shard the vector index for faster filtered queries.
Maximum: 4 partition key columns per table
### Use Cases
1. Multi-tenant data (user_id, organization_id)
2. Temporal data (published_date, created_month)
3. Category-based filtering (document_type, region)
### Single Partition Key
```sql
CREATE VIRTUAL TABLE vec_documents USING vec0(
document_id integer primary key,
user_id integer partition key,
contents_embedding float[1024]
);
```
Query with partition filtering:
```sql
SELECT document_id, distance
FROM vec_documents
WHERE contents_embedding MATCH :query
AND k = 20
AND user_id = 123;
```
### Multiple Partition Keys
```sql
CREATE VIRTUAL TABLE vec_articles USING vec0(
article_id integer primary key,
organization_id integer partition key,
published_date text partition key,
headline_embedding float[1024]
);
```
Query with multiple partition filters:
```sql
SELECT article_id, distance
FROM vec_articles
WHERE headline_embedding MATCH :query
AND k = 10
AND organization_id = 456
AND published_date BETWEEN '2024-01-01' AND '2024-12-31';
```
### Best Practices
- Each unique partition key value should have 100+ vectors
- Avoid over-sharding (too many unique partition values)
- Consider broader keys if queries are slow (e.g., month instead of day)
- Use 1-2 partition keys maximum in most cases
### Supported Operators
- `=` - Equals
- `BETWEEN` - Range (inclusive)
## Auxiliary Columns
Auxiliary columns store unindexed data separately, avoiding JOIN operations.
Maximum: 16 auxiliary columns per table
### Use Cases
- Large text content
- Raw image/document BLOBs
- URLs, metadata not used in WHERE clauses
- Data appearing in SELECT but not WHERE
### Declaration
Prefix column name with `+`:
```sql
CREATE VIRTUAL TABLE vec_chunks USING vec0(
contents_embedding float[1024],
+contents text
);
```
### Multiple Auxiliary Columns
```sql
CREATE VIRTUAL TABLE vec_documents USING vec0(
document_id integer primary key,
embedding float[768],
+title text,
+url text,
+full_text text,
+metadata_json text
);
```
### Querying
Auxiliary columns can appear in SELECT but not in WHERE:
```sql
-- ✓ Valid: auxiliary column in SELECT
SELECT rowid, contents, distance
FROM vec_chunks
WHERE contents_embedding MATCH ?
AND k = 10;
-- ✗ Invalid: auxiliary column in WHERE
SELECT rowid, distance
FROM vec_chunks
WHERE contents_embedding MATCH ?
AND contents LIKE '%search%'; -- ERROR
```
### Image Storage Example
```sql
CREATE VIRTUAL TABLE vec_images USING vec0(
image_id integer primary key,
image_embedding float[512],
+image blob,
+image_url text
);
SELECT image_id, image, image_url, distance
FROM vec_images
WHERE image_embedding MATCH ?
AND k = 5
ORDER BY distance;
```
## Performance Tuning
### chunk_size Parameter
Controls internal chunking for better performance:
```sql
CREATE VIRTUAL TABLE vec_large USING vec0(
embedding float[1536],
chunk_size=512
);
```
Default chunk_size is appropriate for most use cases. Tune for:
- Very large tables (millions of vectors)
- Specific memory constraints
- Bulk insert performance
### Column Type Comparison
| Column Type | Use Case | In WHERE? | In SELECT? | Max Count |
|---------------|-----------------------------------|-----------|------------|-----------|
| Vector | Embeddings | MATCH | ✓ | Multiple |
| Metadata | Filtered searches | ✓ | ✓ | 16 |
| Partition Key | Multi-tenant/temporal sharding | ✓ | ✓ | 4 |
| Auxiliary | Large content, no filtering | ✗ | ✓ | 16 |
### Complete Example
```sql
CREATE VIRTUAL TABLE vec_knowledge_base USING vec0(
-- Primary key
document_id integer primary key,
-- Partition keys (multi-tenant + temporal)
organization_id integer partition key,
created_month text partition key,
-- Vector column
content_embedding float[768] distance_metric=cosine,
-- Metadata columns (filterable)
document_type text,
language text,
word_count integer,
is_public boolean,
-- Auxiliary columns (not filterable)
+title text,
+full_content text,
+url text,
+metadata_json text,
chunk_size=256
);
```
Query example:
```sql
SELECT
document_id,
title,
full_content,
distance
FROM vec_knowledge_base
WHERE content_embedding MATCH ?
AND k = 10
AND organization_id = 123
AND created_month = '2024-12'
AND document_type = 'article'
AND is_public = true
AND language = 'en'
AND word_count > 500
ORDER BY distance;
```