# Search and indexing
**Added in:** `@mastra/core@1.1.0`
Search lets agents find relevant content in indexed workspace files. When an agent needs to answer a question or find information, it can search the indexed content instead of reading every file.
## How it works
Workspace search has two phases: indexing and querying.
### Indexing
Content must be indexed before it can be searched. When you index a document:
- The content is tokenized (split into searchable terms)
- For BM25: term frequencies and document statistics are computed
- For vector: the content is embedded using your embedder function and stored in the vector store
Each indexed document has:
- **id** - A unique identifier (typically the file path)
- **content** - The text content
- **metadata** - Optional key-value data stored with the document
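
The tokenization and term-frequency steps can be sketched as follows. This is a minimal illustration, not Mastra's tokenizer: the helper names `tokenize` and `termFrequencies` are hypothetical, and a real implementation may add stemming, stop-word removal, or Unicode-aware splitting.

```typescript
// Hypothetical helper: lowercase the text and split on runs of
// non-alphanumeric characters. Mastra's real tokenizer may differ.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((term) => term.length > 0)
}

// Count how often each term occurs in one document.
// A Map avoids collisions with built-in object keys such as "constructor".
function termFrequencies(terms: string[]): Map<string, number> {
  const counts = new Map<string, number>()
  for (const term of terms) {
    counts.set(term, (counts.get(term) ?? 0) + 1)
  }
  return counts
}

const sampleTerms = tokenize('Reset your password. Password reset links expire.')
const sampleFrequencies = termFrequencies(sampleTerms)
```

Here `sampleFrequencies` maps `password` and `reset` to 2 and every other term to 1. Counts like these are the per-document statistics that BM25 scoring builds on.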
### Querying
When you search:
1. The query is processed using the same tokenization/embedding as indexing
2. Documents are scored based on relevance to the query
3. Results are ranked by score and returned with the matching content
Workspaces support three search modes: BM25 keyword search, vector semantic search, and hybrid search that combines both.
## BM25 keyword search
BM25 scores documents based on term frequency and document length. It works well for exact matches and specific terminology.
```typescript
import { Workspace, LocalFilesystem } from '@mastra/core/workspace'

const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: './workspace' }),
  bm25: true,
})
```
For custom BM25 parameters (`k1` is term frequency saturation, `b` is document length normalization):
```typescript
const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: './workspace' }),
  bm25: {
    k1: 1.5,
    b: 0.75,
  },
})
```
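To see what `k1` and `b` control, it helps to look at the textbook Okapi BM25 formula. The sketch below assumes the common Lucene-style IDF variant; Mastra's internal scoring may differ in detail, and the function name `bm25TermScore` is illustrative only.

```typescript
interface Bm25Params {
  k1: number // term frequency saturation
  b: number // document length normalization
}

// Contribution of one query term to one document's score,
// following the standard Okapi BM25 formula (an assumption about
// the general technique, not a copy of Mastra's implementation).
function bm25TermScore(
  termFreq: number, // occurrences of the term in the document
  docLength: number, // number of terms in the document
  avgDocLength: number, // average document length in the index
  docCount: number, // total documents in the index
  docsWithTerm: number, // documents that contain the term
  params: Bm25Params,
): number {
  // Rare terms get a higher inverse document frequency.
  const idf = Math.log(1 + (docCount - docsWithTerm + 0.5) / (docsWithTerm + 0.5))
  // b scales how strongly long documents are penalized.
  const lengthNorm = 1 - params.b + params.b * (docLength / avgDocLength)
  // k1 controls how quickly repeated occurrences stop adding score.
  return (idf * (termFreq * (params.k1 + 1))) / (termFreq + params.k1 * lengthNorm)
}
```

With `b: 0` the length term drops out, so a 50-term and a 500-term document with the same term frequency score identically. With `b: 0.75` the shorter document scores higher. A larger `k1` lets repeated occurrences keep adding score for longer before saturating.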
## Vector search
Vector search uses embeddings to find semantically similar content. It requires a vector store and embedder function.
```typescript
import { Workspace, LocalFilesystem } from '@mastra/core/workspace'
import { PineconeVector } from '@mastra/pinecone'
import { embed } from 'ai'
import { openai } from '@ai-sdk/openai'

const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: './workspace' }),
  vectorStore: new PineconeVector({
    apiKey: process.env.PINECONE_API_KEY,
    index: 'workspace-index',
  }),
  embedder: async (text: string) => {
    const { embedding } = await embed({
      model: openai.embedding('text-embedding-3-small'),
      value: text,
    })
    return embedding
  },
})
```
### Batch embedding
The embedder above takes one text at a time. Indexing a workspace with hundreds of files calls the provider hundreds of times, which is slow and expensive.
When the provider supports batching (for example, through the AI SDK's `embedMany`), pass an embedder that takes an array of texts and returns all of their embeddings in one call. To opt in, set a `batch: true` property on the function. Mastra checks for that property at runtime and switches to the batched path.
The following example replaces the single-text embedder with a batched one. The embedder function takes an array, returns an array of embeddings in the same order, and carries two extra properties:
- `batch: true`: marks the function as batch-capable. Without this property, Mastra calls it one text at a time.
- `maxBatchSize`: the largest array the provider accepts in one call. Mastra splits larger requests into chunks of this size and sends them in parallel. Set this to your provider's documented limit (for example, 2048 for OpenAI, 96 for Cohere, 128 for Voyage). Omit it to send every pending text in one request.
```typescript
import { Workspace, LocalFilesystem } from '@mastra/core/workspace'
import { PineconeVector } from '@mastra/pinecone'
import { embedMany } from 'ai'
import { openai } from '@ai-sdk/openai'

const model = openai.embedding('text-embedding-3-small')

const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: './workspace' }),
  vectorStore: new PineconeVector({
    apiKey: process.env.PINECONE_API_KEY,
    index: 'workspace-index',
  }),
  embedder: Object.assign(
    async (texts: string[]) => {
      const { embeddings } = await embedMany({ model, values: texts })
      return embeddings
    },
    { batch: true as const, maxBatchSize: 2048 },
  ),
})
```
`Object.assign` adds the `batch` and `maxBatchSize` properties to the embedder function. Mastra reads them as metadata and never passes them to the provider.
Single-text embedders still work. The function signature `(text: string) => Promise<number[]>` is unchanged, so existing code keeps running without modification.
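The chunking behavior described for `maxBatchSize` can be pictured with a small sketch. This is not Mastra's internal code: the `BatchEmbedder` type and the `chunkItems` and `embedAll` helpers are hypothetical names used only to show how a caller could split texts, send the chunks in parallel, and keep the embeddings in input order.

```typescript
// Assumed shape of a batch-capable embedder, based on the properties above.
type BatchEmbedder = ((texts: string[]) => Promise<number[][]>) & {
  batch: true
  maxBatchSize?: number
}

// Split an array into consecutive chunks of at most `size` items.
function chunkItems<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError('size must be at least 1')
  const chunks: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size))
  }
  return chunks
}

// Hypothetical caller: embed every text while respecting maxBatchSize.
async function embedAll(embedder: BatchEmbedder, texts: string[]): Promise<number[][]> {
  if (texts.length === 0) return []
  // Without maxBatchSize, everything goes out in a single request.
  const size = embedder.maxBatchSize ?? texts.length
  const batches = chunkItems(texts, size)
  // Promise.all resolves in batch order even if requests finish out of order.
  const results = await Promise.all(batches.map((batch) => embedder(batch)))
  const embeddings: number[][] = []
  for (const group of results) {
    for (const embedding of group) embeddings.push(embedding)
  }
  return embeddings
}
```

Five texts with `maxBatchSize: 2` produce three provider calls (2, 2, and 1 texts), and the returned array lines up index for index with the input.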
## Hybrid search
Configure both BM25 and vector search to enable hybrid mode, which combines keyword matching with semantic understanding.
```typescript
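// pineconeVector and embedderFn stand for the vector store and embedder
// configured as in the vector search section.
const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: './workspace' }),
  bm25: true,
  vectorStore: pineconeVector,
  embedder: embedderFn,
})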
```
## Custom index name
By default, the search index name is derived from the workspace ID. To set a custom name, use `searchIndexName`:
```typescript
const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: './workspace' }),
  bm25: true,
  searchIndexName: 'my_workspace_vectors',
})
```
The index name must be a valid SQL identifier: start with a letter or underscore, contain only letters, numbers, or underscores, and be at most 63 characters long.
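If index names come from user input or configuration, you may want to check them before constructing the workspace. The following sketch encodes the rule above as a regular expression. The `isValidIndexName` helper is hypothetical and not part of Mastra, and it assumes "letters" means ASCII letters.

```typescript
// Starts with a letter or underscore, followed by up to 62 letters,
// digits, or underscores, for a maximum length of 63 characters.
// Assumption: only ASCII letters count as letters.
const INDEX_NAME_PATTERN = /^[A-Za-z_][A-Za-z0-9_]{0,62}$/

function isValidIndexName(candidate: string): boolean {
  return INDEX_NAME_PATTERN.test(candidate)
}
```

Under this check, `my_workspace_vectors` passes, while `my-index` (hyphen) and `1st_index` (leading digit) are rejected.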
## Indexing content
### Manual indexing
Use `workspace.index()` to add content to the search index programmatically. The path you pass becomes the document ID. You can also attach metadata to each document.
```typescript
// Basic indexing
await workspace.index('/docs/guide.md', 'Content of the guide...')

// Index with metadata for filtering or context
await workspace.index('/docs/api.md', apiDocContent, {
  metadata: {
    category: 'api',
    version: '2.0',
  },
})
```
Manual indexing is useful when:
- You're indexing content that doesn't come from files (e.g., database records, API responses)
- You want to pre-process or chunk content before indexing
- You need to add custom metadata to documents
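
As an illustration of the chunking case, the sketch below splits a document into paragraphs and indexes each one separately. It relies only on the `index(path, content, { metadata })` call shown above. The `Indexable` interface, the helper names, and the `#chunk-N` ID scheme are assumptions made for the example, not Mastra conventions.

```typescript
// Minimal shape of the method used here; the real Workspace class has
// more members, and its return type is assumed rather than documented here.
interface Indexable {
  index(
    path: string,
    content: string,
    options?: { metadata?: Record<string, unknown> },
  ): Promise<unknown>
}

// Hypothetical helper: split text on blank lines into paragraph chunks.
function splitIntoParagraphs(text: string): string[] {
  return text
    .split(/\n\s*\n/)
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
}

// Index each paragraph as its own document and return the chunk count.
// The "#chunk-N" suffix is an illustrative ID scheme.
async function indexInChunks(target: Indexable, path: string, text: string): Promise<number> {
  const paragraphs = splitIntoParagraphs(text)
  let position = 0
  for (const paragraph of paragraphs) {
    await target.index(`${path}#chunk-${position}`, paragraph, {
      metadata: { source: path, chunk: position },
    })
    position += 1
  }
  return paragraphs.length
}
```

Storing the source path in metadata lets you map a chunk-level search hit back to the original file.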
### Auto-indexing
Configure `autoIndexPaths` to automatically index files when the workspace initializes. Each entry can be a directory path (indexed recursively) or a glob pattern for selective indexing.
```typescript
const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: './workspace' }),
  bm25: true,
  autoIndexPaths: ['docs', 'support/faq'],
})

await workspace.init()
```
When `init()` is called, all matching files are read and indexed for search. The file path becomes the document ID.
Glob patterns let you index specific file types:
```typescript
const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: './workspace' }),
  bm25: true,
  autoIndexPaths: ['docs/**/*.md', 'support/**/*.txt'],
})
```
## Searching
Use `workspace.search()` to find relevant content. Results are ranked by relevance score.
```typescript
const results = await workspace.search('password reset')

for (const result of results) {
  console.log(`${result.id}: ${result.score}`)
  console.log(result.content)
}
```
### Search options
You can customize the search behavior with options:
```typescript
const results = await workspace.search('authentication flow', {
  topK: 10,
  mode: 'hybrid',
  minScore: 0.5,
  vectorWeight: 0.5,
})
```
| Option | Description |
| -------------- | ------------------------------------------------------------------------------------------------------------- |
| `topK` | Maximum number of results to return. Default: 5 |
| `mode` | Search mode: `'bm25'`, `'vector'`, or `'hybrid'`. Defaults to the best available mode based on configuration. |
| `minScore` | Filter out results below this score threshold (0-1). |
| `vectorWeight` | In hybrid mode, how much to weight vector scores vs BM25. 0 = all BM25, 1 = all vector, 0.5 = equal. |
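
To make the effect of `topK` and `minScore` concrete, here is a rough model of how the two options narrow a result list. The order of operations (filter, then rank, then truncate) and the `applySearchOptions` helper are assumptions for illustration; the actual implementation may apply them differently.

```typescript
interface Scored {
  id: string
  score: number
}

// Hypothetical post-processing: drop weak results, rank, then truncate.
function applySearchOptions<T extends Scored>(
  results: T[],
  options: { topK?: number; minScore?: number },
): T[] {
  const topK = options.topK ?? 5 // default listed in the table above
  const minScore = options.minScore ?? 0
  return results
    .filter((result) => result.score >= minScore) // filter() returns a new array
    .sort((a, b) => b.score - a.score) // highest score first
    .slice(0, topK)
}
```

A high `minScore` can return fewer than `topK` results, or none at all, so callers should handle an empty array.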
### Search results
Each result contains:
```typescript
interface SearchResult {
  id: string // Document ID (typically file path)
  content: string // The matching content
  score: number // Relevance score (0-1)
  lineRange?: {
    // Lines where the match was found
    start: number
    end: number
  }
  metadata?: Record<string, unknown> // Metadata stored with the document
  scoreDetails?: {
    // Score breakdown (hybrid mode only)
    vector?: number
    bm25?: number
  }
}
```
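One way to use the optional `lineRange` field is to build a short citation that an agent can show next to its answer. The sketch below mirrors only the fields it needs from the interface above. The `ResultLike` type, the `formatCitation` helper, and the output format are illustrative and not part of Mastra.

```typescript
// Subset of the SearchResult fields used by this example.
interface ResultLike {
  id: string
  score: number
  lineRange?: { start: number; end: number }
}

// Build a citation such as "/docs/guide.md:12-18 (score 0.83)".
// The line suffix is omitted when no lineRange is present.
function formatCitation(result: ResultLike): string {
  const lines = result.lineRange
    ? `:${result.lineRange.start}-${result.lineRange.end}`
    : ''
  return `${result.id}${lines} (score ${result.score.toFixed(2)})`
}
```

Because `lineRange` is optional, the helper falls back to the bare document ID instead of assuming the field exists.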
**Understanding scores:**
- Scores range from 0 to 1, where 1 is a perfect match
- BM25 scores are normalized based on the best match in the result set
- Vector scores represent cosine similarity between query and document embeddings
- In hybrid mode, scores are combined using the `vectorWeight` parameter
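
The normalization and weighting described above can be sketched in a few lines. This assumes a plain linear blend, which matches the documented endpoints (`0` = all BM25, `1` = all vector) but is not confirmed as Mastra's exact formula. The helper names `normalizeByBest` and `combineScores` are hypothetical.

```typescript
// Scale raw BM25 scores so the best match in the result set scores 1.
function normalizeByBest(rawScores: number[]): number[] {
  const best = Math.max(0, ...rawScores)
  return rawScores.map((score) => (best > 0 ? score / best : 0))
}

// Assumed linear blend of two scores that are already in the 0-1 range.
// vectorWeight 0 = all BM25, 1 = all vector, 0.5 = equal.
function combineScores(bm25: number, vector: number, vectorWeight: number): number {
  return vectorWeight * vector + (1 - vectorWeight) * bm25
}
```

For example, a document with a normalized BM25 score of 0.4 and a vector score of 0.8 lands at about 0.6 with `vectorWeight: 0.5`, and at about 0.76 with `vectorWeight: 0.9`.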
### When to use each mode
| Mode | Best for | Example queries |
| -------- | ------------------------------------ | ------------------------------------------------------------------------ |
| `bm25` | Exact terms, technical queries, code | "useState hook", "404 error", "config.yaml" |
| `vector` | Conceptual queries, natural language | "how to handle user authentication", "best practices for error handling" |
| `hybrid` | General search, unknown query types | Most agent use cases |
## Agent tools
When you configure search on a workspace, agents receive tools for searching and indexing content. See the [workspace class reference](https://mastra.ai/reference/workspace/workspace-class) for details.
## Related
- [Workspace overview](https://mastra.ai/docs/workspace/overview)
- [RAG overview](https://mastra.ai/docs/rag/overview)
- [Workspace class reference](https://mastra.ai/reference/workspace/workspace-class)