@mastra/rag
Version:
The Retrieval-Augmented Generation (RAG) module contains document processing and embedding utilities.
168 lines (118 loc) • 5.86 kB
Markdown
ExtractParams configures metadata extraction from document chunks using LLM analysis.
```typescript
import { MDocument } from '@mastra/rag'
const doc = MDocument.fromText(text)
const chunks = await doc.chunk({
extract: {
title: true, // Extract titles using default settings
summary: true, // Generate summaries using default settings
keywords: true, // Extract keywords using default settings
},
})
// Example output:
// chunks[0].metadata = {
// documentTitle: "AI Systems Overview",
// sectionSummary: "Overview of artificial intelligence concepts and applications",
// excerptKeywords: "KEYWORDS: AI, machine learning, algorithms"
// }
```
The `extract` parameter accepts the following fields:
**title** (`boolean | TitleExtractorsArgs`): Enable title extraction. Set to true for default settings, or provide custom configuration.
**summary** (`boolean | SummaryExtractArgs`): Enable summary extraction. Set to true for default settings, or provide custom configuration.
**questions** (`boolean | QuestionAnswerExtractArgs`): Enable question generation. Set to true for default settings, or provide custom configuration.
**keywords** (`boolean | KeywordExtractArgs`): Enable keyword extraction. Set to true for default settings, or provide custom configuration.
**schema** (`SchemaExtractArgs`): Enable structured metadata extraction using a Zod schema.
**llm** (`MastraLanguageModel`): AI SDK language model to use for title extraction
**nodes** (`number`): Number of title nodes to extract
**nodeTemplate** (`string`): Custom prompt template for title node extraction. Must include {context} placeholder
**combineTemplate** (`string`): Custom prompt template for combining titles. Must include {context} placeholder
**llm** (`MastraLanguageModel`): AI SDK language model to use for summary extraction
**summaries** (`('self' | 'prev' | 'next')[]`): List of summary types to generate. Can only include 'self' (current chunk), 'prev' (previous chunk), or 'next' (next chunk)
**promptTemplate** (`string`): Custom prompt template for summary generation. Must include {context} placeholder
**llm** (`MastraLanguageModel`): AI SDK language model to use for question generation
**questions** (`number`): Number of questions to generate
**promptTemplate** (`string`): Custom prompt template for question generation. Must include both {context} and {numQuestions} placeholders
**embeddingOnly** (`boolean`): If true, only generate embeddings without actual questions
**llm** (`MastraLanguageModel`): AI SDK language model to use for keyword extraction
**keywords** (`number`): Number of keywords to extract
**promptTemplate** (`string`): Custom prompt template for keyword extraction. Must include both {context} and {maxKeywords} placeholders
**schema** (`ZodType`): Zod schema defining the structure of the data to extract.
**llm** (`MastraLanguageModel`): AI SDK language model to use for extraction.
**instructions** (`string`): Instructions for the LLM on what to extract.
**metadataKey** (`string`): Key to nest extraction results under. If omitted, results are spread into the metadata object.
```typescript
import { MDocument } from '@mastra/rag'
const doc = MDocument.fromText(text)
const chunks = await doc.chunk({
extract: {
// Title extraction with custom settings
title: {
nodes: 2, // Extract 2 title nodes
nodeTemplate: 'Generate a title for this: {context}',
combineTemplate: 'Combine these titles: {context}',
},
// Summary extraction with custom settings
summary: {
summaries: ['self'], // Generate summaries for current chunk
promptTemplate: 'Summarize this: {context}',
},
// Question generation with custom settings
questions: {
questions: 3, // Generate 3 questions
promptTemplate: 'Generate {numQuestions} questions about: {context}',
embeddingOnly: false,
},
// Keyword extraction with custom settings
keywords: {
keywords: 5, // Extract 5 keywords
promptTemplate: 'Extract {maxKeywords} key terms from: {context}',
},
// Schema extraction with Zod
schema: {
schema: z.object({
productName: z.string(),
category: z.enum(['electronics', 'clothing']),
}),
instructions: 'Extract product information.',
metadataKey: 'product',
},
},
})
// Example output:
// chunks[0].metadata = {
// documentTitle: "AI in Modern Computing",
// sectionSummary: "Overview of AI concepts and their applications in computing",
// questionsThisExcerptCanAnswer: "1. What is machine learning?\n2. How do neural networks work?",
// excerptKeywords: "1. Machine learning\n2. Neural networks\n3. Training data",
// product: {
// productName: "Neural Net 2000",
// category: "electronics"
// }
// }
```
When using the `TitleExtractor`, you can group multiple chunks together for title extraction by specifying a shared `docId` in the `metadata` field of each chunk. All chunks with the same `docId` will receive the same extracted title. If no `docId` is set, each chunk is treated as its own document for title extraction.
**Example:**
```ts
import { MDocument } from '@mastra/rag'
const doc = new MDocument({
docs: [
{ text: 'chunk 1', metadata: { docId: 'docA' } },
{ text: 'chunk 2', metadata: { docId: 'docA' } },
{ text: 'chunk 3', metadata: { docId: 'docB' } },
],
type: 'text',
})
await doc.extractMetadata({ title: true })
// The first two chunks will share a title, while the third chunk will be assigned a separate title.
```