UNPKG

@mastra/rag

Version:

The Retrieval-Augmented Generation (RAG) module contains document processing and embedding utilities.

168 lines (118 loc) 5.86 kB
# ExtractParams ExtractParams configures metadata extraction from document chunks using LLM analysis. ## Example ```typescript import { MDocument } from '@mastra/rag' const doc = MDocument.fromText(text) const chunks = await doc.chunk({ extract: { title: true, // Extract titles using default settings summary: true, // Generate summaries using default settings keywords: true, // Extract keywords using default settings }, }) // Example output: // chunks[0].metadata = { // documentTitle: "AI Systems Overview", // sectionSummary: "Overview of artificial intelligence concepts and applications", // excerptKeywords: "KEYWORDS: AI, machine learning, algorithms" // } ``` ## Parameters The `extract` parameter accepts the following fields: **title** (`boolean | TitleExtractorsArgs`): Enable title extraction. Set to true for default settings, or provide custom configuration. **summary** (`boolean | SummaryExtractArgs`): Enable summary extraction. Set to true for default settings, or provide custom configuration. **questions** (`boolean | QuestionAnswerExtractArgs`): Enable question generation. Set to true for default settings, or provide custom configuration. **keywords** (`boolean | KeywordExtractArgs`): Enable keyword extraction. Set to true for default settings, or provide custom configuration. **schema** (`SchemaExtractArgs`): Enable structured metadata extraction using a Zod schema. ## Extractor arguments ### `TitleExtractorsArgs` **llm** (`MastraLanguageModel`): AI SDK language model to use for title extraction **nodes** (`number`): Number of title nodes to extract **nodeTemplate** (`string`): Custom prompt template for title node extraction. Must include {context} placeholder **combineTemplate** (`string`): Custom prompt template for combining titles. Must include {context} placeholder ### `SummaryExtractArgs` **llm** (`MastraLanguageModel`): AI SDK language model to use for summary extraction **summaries** (`('self' | 'prev' | 'next')[]`): List of summary types to generate. Can only include 'self' (current chunk), 'prev' (previous chunk), or 'next' (next chunk) **promptTemplate** (`string`): Custom prompt template for summary generation. Must include {context} placeholder ### `QuestionAnswerExtractArgs` **llm** (`MastraLanguageModel`): AI SDK language model to use for question generation **questions** (`number`): Number of questions to generate **promptTemplate** (`string`): Custom prompt template for question generation. Must include both {context} and {numQuestions} placeholders **embeddingOnly** (`boolean`): If true, only generate embeddings without actual questions ### `KeywordExtractArgs` **llm** (`MastraLanguageModel`): AI SDK language model to use for keyword extraction **keywords** (`number`): Number of keywords to extract **promptTemplate** (`string`): Custom prompt template for keyword extraction. Must include both {context} and {maxKeywords} placeholders ### `SchemaExtractArgs` **schema** (`ZodType`): Zod schema defining the structure of the data to extract. **llm** (`MastraLanguageModel`): AI SDK language model to use for extraction. **instructions** (`string`): Instructions for the LLM on what to extract. **metadataKey** (`string`): Key to nest extraction results under. If omitted, results are spread into the metadata object. ## Advanced example ```typescript import { MDocument } from '@mastra/rag' const doc = MDocument.fromText(text) const chunks = await doc.chunk({ extract: { // Title extraction with custom settings title: { nodes: 2, // Extract 2 title nodes nodeTemplate: 'Generate a title for this: {context}', combineTemplate: 'Combine these titles: {context}', }, // Summary extraction with custom settings summary: { summaries: ['self'], // Generate summaries for current chunk promptTemplate: 'Summarize this: {context}', }, // Question generation with custom settings questions: { questions: 3, // Generate 3 questions promptTemplate: 'Generate {numQuestions} questions about: {context}', embeddingOnly: false, }, // Keyword extraction with custom settings keywords: { keywords: 5, // Extract 5 keywords promptTemplate: 'Extract {maxKeywords} key terms from: {context}', }, // Schema extraction with Zod schema: { schema: z.object({ productName: z.string(), category: z.enum(['electronics', 'clothing']), }), instructions: 'Extract product information.', metadataKey: 'product', }, }, }) // Example output: // chunks[0].metadata = { // documentTitle: "AI in Modern Computing", // sectionSummary: "Overview of AI concepts and their applications in computing", // questionsThisExcerptCanAnswer: "1. What is machine learning?\n2. How do neural networks work?", // excerptKeywords: "1. Machine learning\n2. Neural networks\n3. Training data", // product: { // productName: "Neural Net 2000", // category: "electronics" // } // } ``` ## Document grouping for title extraction When using the `TitleExtractor`, you can group multiple chunks together for title extraction by specifying a shared `docId` in the `metadata` field of each chunk. All chunks with the same `docId` will receive the same extracted title. If no `docId` is set, each chunk is treated as its own document for title extraction. **Example:** ```ts import { MDocument } from '@mastra/rag' const doc = new MDocument({ docs: [ { text: 'chunk 1', metadata: { docId: 'docA' } }, { text: 'chunk 2', metadata: { docId: 'docA' } }, { text: 'chunk 3', metadata: { docId: 'docB' } }, ], type: 'text', }) await doc.extractMetadata({ title: true }) // The first two chunks will share a title, while the third chunk will be assigned a separate title. ```