# @asktext/core
TypeScript-first embedding and retrieval engine for voice-enabled Q&A on articles.
## What it does
- **Text processing**: Splits HTML/Markdown into semantic chunks with configurable overlap
- **Embeddings**: Generates OpenAI embeddings for each chunk
- **Storage**: Saves chunks + embeddings to your database (Prisma JSON, pgvector, or custom)
- **Retrieval**: Semantic search to find relevant passages for user questions
## Installation
```bash
npm install @asktext/core openai @prisma/client
```
## Quick Start
### 1. Database Schema
Add to your `schema.prisma`:
```prisma
model ArticleChunk {
  id         String @id @default(cuid())
  postId     String
  chunkIndex Int
  content    String @db.Text
  startChar  Int
  endChar    Int
  embedding  String @db.Text // JSON-encoded float[]

  @@index([postId, chunkIndex])
}
```
Run `npx prisma db push`.
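The `embedding` column holds the vector as JSON text. As a rough illustration of that round trip, the sketch below encodes and decodes a `number[]`. `encodeEmbedding` and `decodeEmbedding` are hypothetical helper names, not part of `@asktext/core`, and the library's own store may serialize differently.

```typescript
// Hypothetical helpers (not part of @asktext/core) showing how a float[]
// can round-trip through a JSON-encoded text column.
function encodeEmbedding(vector: number[]): string {
  return JSON.stringify(vector);
}

function decodeEmbedding(json: string): number[] {
  const parsed: unknown = JSON.parse(json);
  // Reject anything that is not a flat array of finite numbers.
  if (
    !Array.isArray(parsed) ||
    !parsed.every((x) => typeof x === 'number' && Number.isFinite(x))
  ) {
    throw new Error('embedding column does not contain a JSON number array');
  }
  return parsed as number[];
}
```

A text column like this cannot be compared or indexed as a vector by the database, so with this layout similarity scoring presumably has to happen in application code.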
### 2. Embed Articles
```typescript
import { PrismaClient } from '@prisma/client';
import { OpenAIEmbedder, embedAndStore } from '@asktext/core';

const prisma = new PrismaClient();
const store = embedAndStore.createPrismaJsonStore(prisma);
const embedder = new OpenAIEmbedder({
  apiKey: process.env.OPENAI_API_KEY!
});

// Call this when publishing/updating articles
export async function saveEmbeddings(postId: string, htmlContent: string) {
  await embedAndStore({
    articleId: postId,
    htmlOrMarkdown: htmlContent,
    embedder,
    store
  });
}
```
### 3. Retrieve Passages
```typescript
import { retrievePassages } from '@asktext/core';

// `store` and `embedder` are the instances created in step 2
const passages = await retrievePassages({
  query: "How does binary search work?",
  store,
  embedder,
  filter: { postId: "article-123" },
  limit: 5
});
```
## Configuration
### Text Splitting
```typescript
import { TextSplitter } from '@asktext/core';

const splitter = new TextSplitter({
  chunkSize: 1500,    // characters per chunk
  chunkOverlap: 200,  // overlap between chunks
  separators: ['\n\n', '\n', '. ', ' ']  // split priorities
});
```
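To make the `chunkSize`, `chunkOverlap`, and `separators` options concrete, here is a minimal plain-text sketch of overlap chunking. It is not the `TextSplitter` implementation, which this README does not show. `splitWithOverlap` and `TextChunk` are hypothetical names, and the rule of only breaking in the second half of a window is an assumption made for the sketch.

```typescript
// Hypothetical sketch of overlap chunking; not the library's TextSplitter.
interface TextChunk {
  content: string;
  startChar: number; // inclusive offset into the source text
  endChar: number;   // exclusive offset into the source text
}

function splitWithOverlap(
  text: string,
  chunkSize = 1500,
  chunkOverlap = 200,
  separators: string[] = ['\n\n', '\n', '. ', ' '],
): TextChunk[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  const chunks: TextChunk[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + chunkSize, text.length);
    if (end < text.length) {
      // Try separators in priority order; accept the last occurrence that
      // falls in the second half of the window so chunks do not get tiny.
      const minBreak = start + Math.ceil(chunkSize / 2);
      for (const sep of separators) {
        const idx = text.lastIndexOf(sep, end - sep.length);
        if (idx >= minBreak) {
          end = idx + sep.length;
          break;
        }
      }
    }
    chunks.push({ content: text.slice(start, end), startChar: start, endChar: end });
    if (end >= text.length) break;
    // Step back by the overlap, but always move forward by at least one character.
    start = Math.max(end - chunkOverlap, start + 1);
  }
  return chunks;
}
```

The `startChar` and `endChar` offsets mirror the columns in the `ArticleChunk` schema, so each stored chunk can be traced back to its position in the source text.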
### Custom Vector Store
Implement the `VectorStore` interface for your database:
```typescript
interface VectorStore {
  saveChunks(chunks: ChunkWithEmbedding[]): Promise<void>;
  searchSimilar(embedding: number[], limit: number, filter?: any): Promise<ChunkWithScore[]>;
  deleteByArticleId(articleId: string): Promise<void>;
}
```
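As one possible starting point, the sketch below implements the three methods in memory and ranks chunks by cosine similarity. The field names on `ChunkWithEmbedding` and `ChunkWithScore` are assumptions (the README does not define them), the interface is redeclared locally so the example is self-contained, and `InMemoryVectorStore` is a hypothetical class that is only suitable for tests or small data sets.

```typescript
// Assumed shapes: the real ChunkWithEmbedding / ChunkWithScore types in
// @asktext/core may carry different fields.
interface ChunkWithEmbedding {
  articleId: string;
  chunkIndex: number;
  content: string;
  embedding: number[];
}

interface ChunkWithScore extends ChunkWithEmbedding {
  score: number;
}

// Local copy of the interface above, with a narrower filter type for the sketch.
interface VectorStore {
  saveChunks(chunks: ChunkWithEmbedding[]): Promise<void>;
  searchSimilar(
    embedding: number[],
    limit: number,
    filter?: { articleId?: string },
  ): Promise<ChunkWithScore[]>;
  deleteByArticleId(articleId: string): Promise<void>;
}

// Cosine similarity: dot(a, b) / (|a| * |b|); 0 when either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vector lengths differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

class InMemoryVectorStore implements VectorStore {
  private chunks: ChunkWithEmbedding[] = [];

  async saveChunks(chunks: ChunkWithEmbedding[]): Promise<void> {
    this.chunks.push(...chunks);
  }

  async searchSimilar(
    embedding: number[],
    limit: number,
    filter?: { articleId?: string },
  ): Promise<ChunkWithScore[]> {
    return this.chunks
      .filter((c) => !filter?.articleId || c.articleId === filter.articleId)
      .map((c) => ({ ...c, score: cosineSimilarity(embedding, c.embedding) }))
      .sort((x, y) => y.score - x.score) // highest similarity first
      .slice(0, limit);
  }

  async deleteByArticleId(articleId: string): Promise<void> {
    this.chunks = this.chunks.filter((c) => c.articleId !== articleId);
  }
}
```

A linear scan like this costs one similarity computation per stored chunk on every query, which is the main reason to move to an indexed store such as pgvector as the corpus grows.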
## Environment Variables
```bash
OPENAI_API_KEY=sk-... # Required for embeddings
DATABASE_URL=postgresql://... # For Prisma store
```
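The Quick Start reads the key with a non-null assertion (`process.env.OPENAI_API_KEY!`), which hides a missing variable until the first API call fails. A small fail-fast check is one way to surface that at startup. `requireEnv` is a hypothetical helper, not part of `@asktext/core`.

```typescript
// Hypothetical helper: throw at startup instead of failing on the first API call.
function requireEnv(name: string, env: Record<string, string | undefined>): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Usage in a Node.js process:
// const apiKey = requireEnv('OPENAI_API_KEY', process.env);
```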
## Advanced Usage
### Batch Processing
```typescript
// getArticlesToProcess() stands in for your own data-access function
const articles = await getArticlesToProcess();

for (const article of articles) {
  await saveEmbeddings(article.id, article.content);
  console.log(`Processed: ${article.title}`);
}
```
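The loop above processes one article at a time and stops at the first error. For larger backlogs, a bounded worker pool that records failures instead of aborting may be preferable. The sketch below is one way to do that. `mapWithConcurrency` is a hypothetical helper, not part of `@asktext/core`, and a sensible concurrency value depends on your API rate limits.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `concurrency`
// calls in flight, collecting failures instead of stopping at the first one.
async function mapWithConcurrency<T>(
  items: T[],
  concurrency: number,
  worker: (item: T) => Promise<void>,
): Promise<{ item: T; error: unknown }[]> {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new Error('concurrency must be a positive integer');
  }
  const failures: { item: T; error: unknown }[] = [];
  let next = 0; // index of the next unclaimed item, shared by all runners

  async function run(): Promise<void> {
    while (next < items.length) {
      const item = items[next++];
      try {
        await worker(item);
      } catch (error) {
        failures.push({ item, error });
      }
    }
  }

  const runners = Array.from({ length: Math.min(concurrency, items.length) }, () => run());
  await Promise.all(runners);
  return failures;
}

// Usage with saveEmbeddings from the Quick Start:
// const failures = await mapWithConcurrency(articles, 3, (a) => saveEmbeddings(a.id, a.content));
// failures.forEach((f) => console.error(`Failed: ${f.item.id}`, f.error));
```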
### Custom Embedder
```typescript
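import type { Embedder } from '@asktext/core'; // assumes the Embedder type is exported

class CustomEmbedder implements Embedder {
  async embed(texts: string[]): Promise<number[][]> {
    // Your embedding logic: return one vector per input text
    throw new Error('CustomEmbedder.embed is not implemented yet');
  }
}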
```
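For tests that should not call a paid API, a deterministic offline embedder can stand in. The sketch below hashes words into a fixed number of buckets and L2-normalizes the result. It matches the `embed(texts)` signature shown above but captures only word overlap, not meaning. `HashingEmbedder` and `hashEmbed` are hypothetical names, and the interface is redeclared locally so the example is self-contained.

```typescript
// Local copy of the interface shape shown above.
interface Embedder {
  embed(texts: string[]): Promise<number[][]>;
}

// Hash each lowercase word into one of `dims` buckets, then L2-normalize.
function hashEmbed(text: string, dims = 64): number[] {
  const vector = new Array<number>(dims).fill(0);
  for (const word of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    let hash = 0;
    for (let i = 0; i < word.length; i++) {
      hash = (hash * 31 + word.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
    }
    vector[hash % dims] += 1;
  }
  const norm = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? vector : vector.map((x) => x / norm);
}

// Hypothetical test double: deterministic, offline, and free to run.
class HashingEmbedder implements Embedder {
  private readonly dims: number;

  constructor(dims = 64) {
    this.dims = dims;
  }

  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((t) => hashEmbed(t, this.dims));
  }
}
```

Because the output depends only on the input text, retrieval tests built on it are repeatable, which is harder to guarantee against a remote embedding API.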
## Cost Estimation
- **100k words** ≈ 133k tokens ≈ less than **$0.01** with `text-embedding-3-small`
- **1M words** ≈ 1.33M tokens ≈ less than **$0.10**

These figures assume roughly 0.75 English words per token.
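Chunk overlap means some text is embedded more than once, so the billed token count is somewhat higher than the article's own. The sketch below folds that into a rough estimate. `estimateEmbeddingCostUsd` is a hypothetical helper, the price is a parameter you supply from current OpenAI pricing, and the overlap factor is an approximation, since a splitter that breaks at separators produces chunks shorter than `chunkSize`.

```typescript
// Hypothetical estimator: scales the source token count by the overlap
// factor, then applies a caller-supplied price per million tokens.
function estimateEmbeddingCostUsd(
  sourceTokens: number,
  usdPerMillionTokens: number,
  chunkSize = 1500,
  chunkOverlap = 200,
): number {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  // Each chunk advances by (chunkSize - chunkOverlap) characters, so the
  // embedded volume grows by roughly chunkSize / (chunkSize - chunkOverlap).
  const overlapFactor = chunkSize / (chunkSize - chunkOverlap);
  return (sourceTokens * overlapFactor * usdPerMillionTokens) / 1e6;
}
```

With the default `chunkSize: 1500` and `chunkOverlap: 200`, the overlap factor is 1500 / 1300, or about 1.15, so overlap adds roughly 15% to the token count.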
## License
MIT