# @asktext/core

TypeScript-first embedding and retrieval engine for voice-enabled Q&A on articles.

## What it does

- **Text processing**: Splits HTML/Markdown into semantic chunks with configurable overlap
- **Embeddings**: Generates OpenAI embeddings for each chunk
- **Storage**: Saves chunks + embeddings to your database (Prisma JSON, pgvector, or custom)
- **Retrieval**: Runs semantic search to find the passages most relevant to a user's question

## Installation

```bash
npm install @asktext/core openai @prisma/client
```

## Quick Start

### 1. Database Schema

Add to your `schema.prisma`:

```prisma
model ArticleChunk {
  id         String @id @default(cuid())
  postId     String
  chunkIndex Int
  content    String @db.Text
  startChar  Int
  endChar    Int
  embedding  String @db.Text // JSON-encoded float[]

  @@index([postId, chunkIndex])
}
```

Run `npx prisma db push`.

### 2. Embed Articles

```typescript
import { PrismaClient } from '@prisma/client';
import { OpenAIEmbedder, embedAndStore } from '@asktext/core';

const prisma = new PrismaClient();
const store = embedAndStore.createPrismaJsonStore(prisma);
const embedder = new OpenAIEmbedder({ apiKey: process.env.OPENAI_API_KEY! });

// Call this when publishing/updating articles
export async function saveEmbeddings(postId: string, htmlContent: string) {
  await embedAndStore({
    articleId: postId,
    htmlOrMarkdown: htmlContent,
    embedder,
    store
  });
}
```

### 3. Retrieve Passages

```typescript
import { retrievePassages } from '@asktext/core';

// `store` and `embedder` are the instances created in step 2
const passages = await retrievePassages({
  query: "How does binary search work?",
  store,
  embedder,
  filter: { postId: "article-123" },
  limit: 5
});
```

## Configuration

### Text Splitting

```typescript
import { TextSplitter } from '@asktext/core';

const splitter = new TextSplitter({
  chunkSize: 1500,   // characters per chunk
  chunkOverlap: 200, // overlap between chunks
  separators: ['\n\n', '\n', '. ', ' '] // split priorities
});
```

### Custom Vector Store

Implement the `VectorStore` interface for your database:

```typescript
interface VectorStore {
  saveChunks(chunks: ChunkWithEmbedding[]): Promise<void>;
  searchSimilar(embedding: number[], limit: number, filter?: any): Promise<ChunkWithScore[]>;
  deleteByArticleId(articleId: string): Promise<void>;
}
```

## Environment Variables

```bash
OPENAI_API_KEY=sk-...          # Required for embeddings
DATABASE_URL=postgresql://...  # For Prisma store
```

## Advanced Usage

### Batch Processing

```typescript
// getArticlesToProcess() is your own data-access function;
// saveEmbeddings() is defined in Quick Start step 2
const articles = await getArticlesToProcess();

for (const article of articles) {
  await saveEmbeddings(article.id, article.content);
  console.log(`Processed: ${article.title}`);
}
```

### Custom Embedder

```typescript
class CustomEmbedder implements Embedder {
  async embed(texts: string[]): Promise<number[][]> {
    // Your embedding logic: return one vector per input text
    throw new Error('Not implemented');
  }
}
```

## Cost Estimation

- **100k words** ≈ 133k tokens ≈ **under $0.01** with `text-embedding-3-small`
- **1M words** ≈ 1.33M tokens ≈ **about $0.03**

These estimates assume roughly 0.75 words per token and a price of $0.02 per 1M tokens. Check current OpenAI pricing before budgeting.

## License

MIT
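
## Appendix: In-Memory Vector Store Sketch

The Custom Vector Store section lists the three methods a store must provide. As an illustration only, here is a minimal in-memory implementation that ranks chunks by cosine similarity. The shapes of `ChunkWithEmbedding` and `ChunkWithScore` and the `{ postId }` filter are assumptions made for this sketch. The README names those types without defining them, so check the package's type definitions before relying on these fields.

```typescript
// Hypothetical shapes: assumed for this sketch, not taken from the package.
interface ChunkWithEmbedding {
  postId: string;
  chunkIndex: number;
  content: string;
  embedding: number[];
}

interface ChunkWithScore extends ChunkWithEmbedding {
  score: number;
}

// Interface as shown in the "Custom Vector Store" section.
interface VectorStore {
  saveChunks(chunks: ChunkWithEmbedding[]): Promise<void>;
  searchSimilar(embedding: number[], limit: number, filter?: any): Promise<ChunkWithScore[]>;
  deleteByArticleId(articleId: string): Promise<void>;
}

// Cosine similarity of two equal-length vectors; 0 if either has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

class InMemoryVectorStore implements VectorStore {
  private chunks: ChunkWithEmbedding[] = [];

  async saveChunks(chunks: ChunkWithEmbedding[]): Promise<void> {
    this.chunks.push(...chunks);
  }

  // Linear scan: score every stored chunk, sort descending, keep the top `limit`.
  async searchSimilar(
    embedding: number[],
    limit: number,
    filter?: { postId?: string } // assumed filter shape
  ): Promise<ChunkWithScore[]> {
    return this.chunks
      .filter((c) => !filter?.postId || c.postId === filter.postId)
      .map((c) => ({ ...c, score: cosineSimilarity(embedding, c.embedding) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, limit);
  }

  async deleteByArticleId(articleId: string): Promise<void> {
    this.chunks = this.chunks.filter((c) => c.postId !== articleId);
  }
}
```

A linear scan like this is only reasonable for tests or a few thousand chunks. For larger collections, an indexed store such as pgvector, which this README lists as a storage option, avoids scoring every chunk on each query.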