# Cortex File System - Complete Documentation
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [File Handler Service](#file-handler-service)
3. [Cortex File Utilities Layer](#cortex-file-utilities-layer)
4. [File Collection System](#file-collection-system)
5. [Tools Integration](#tools-integration)
6. [Data Flow Diagrams](#data-flow-diagrams)
7. [Storage Layers](#storage-layers)
8. [Key Concepts](#key-concepts)
9. [Complete Function Reference](#complete-function-reference)
10. [Error Handling](#error-handling)
---
## Architecture Overview
The Cortex file system is a multi-layered architecture that handles file uploads, storage, retrieval, and management:
```
┌─────────────────────────────────────────────────────────────┐
│ Cortex Application │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ System Tools & Plugins │ │
│ │ (WriteFile, EditFile, Image, FileCollection, etc.) │ │
│ └──────────────────┬─────────────────────────────────┘ │
│ │ │
│ ┌───────────────────▼─────────────────────────────────┐ │
│ │ lib/fileUtils.js │ │
│ │ (Encapsulated file handler interactions) │ │
│ └───────────────────┬─────────────────────────────────┘ │
│ │ │
│ ┌───────────────────▼─────────────────────────────────┐ │
│ │ File Collection System │ │
│ │ (Redis hash maps: FileStoreMap:ctx:<contextId>) │ │
│ └───────────────────┬─────────────────────────────────┘ │
└───────────────────────┼───────────────────────────────────────┘
│
│ HTTP/HTTPS
│
┌───────────────────────▼───────────────────────────────────────┐
│ Cortex File Handler Service │
│ (External Azure Function - cortex-file-handler) │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Azure Blob │ │ GCS │ │ Redis │ │
│ │ Storage │ │ Storage │ │ Metadata │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└────────────────────────────────────────────────────────────────┘
```
### Key Components
1. **File Handler Service** (`cortex-file-handler`): External Azure Function that handles actual file storage
2. **File Utilities** (`lib/fileUtils.js`): Cortex's abstraction layer over the file handler
3. **File Collection System**: Redis-based metadata storage for user file collections
4. **System Tools**: Pathways that use files (WriteFile, EditFile, Image, etc.)
---
## File Handler Service
The file handler is an external Azure Function service that manages file storage and processing.
### Configuration
- **URL**: Configured via `WHISPER_MEDIA_API_URL` environment variable
- **Storage Backends**: Azure Blob Storage (primary), Google Cloud Storage (optional), Local (fallback)
### Key Features
#### 1. Single Container Architecture
- All files stored in a single Azure Blob Storage container
- Files distinguished by blob index tags, not separate containers
- No `container` parameter supported - always uses configured container
#### 2. Retention Management
- **Temporary** (default): Files tagged with `retention=temporary`, auto-deleted after 30 days
- **Permanent**: Files tagged with `retention=permanent`, retained indefinitely
- Retention changed via `setRetention` operation (updates blob tag, no file copying)
#### 3. Context Scoping
- **`contextId`**: Optional parameter for per-user/per-context file isolation
- Redis keys: `<hash>:ctx:<contextId>` for context-scoped files
- Falls back to unscoped keys if no context-scoped entry is found
- **Strongly recommended** for multi-tenant applications
#### 4. Hash-Based Deduplication
- Files identified by xxhash64 hash
- Duplicate uploads return existing file URLs
- Hash stored in Redis for fast lookups
#### 5. Short-Lived URLs
- All operations return `shortLivedUrl` (5-minute expiration, configurable)
- Provides secure, time-limited access
- Preferred for LLM file access
### API Endpoints
#### POST `/file-handler` - Upload File
```javascript
// FormData:
{
  file: <FileStream>,
  hash: "abc123",          // Optional: for deduplication
  contextId: "user-456",   // Optional: for scoping
  requestId: "req-789"     // Optional: for tracking
}

// Response:
{
  url: "https://storage.../file.pdf?long-lived-sas",
  shortLivedUrl: "https://storage.../file.pdf?short-lived-sas",
  gcs: "gs://bucket/file.pdf", // If GCS configured
  hash: "abc123",
  filename: "file.pdf"
}
```
#### GET `/file-handler` - Retrieve/Process File
```javascript
// Query Parameters:
{
  hash: "abc123",                    // Check if file exists
  checkHash: true,                   // Enable hash check
  contextId: "user-456",             // Optional: for scoping
  shortLivedMinutes: 5,              // Optional: URL expiration
  fetch: "https://example.com/file", // Download from URL
  save: true                         // Save converted document
}

// Response (checkHash):
{
  url: "https://storage.../file.pdf",
  shortLivedUrl: "https://storage.../file.pdf?short-lived",
  gcs: "gs://bucket/file.pdf",
  hash: "abc123",
  filename: "file.pdf",
  converted: {                       // If file was converted
    url: "https://storage.../converted.csv",
    gcs: "gs://bucket/converted.csv"
  }
}
```
#### DELETE `/file-handler` - Delete File
```javascript
// Query Parameters:
{
  hash: "abc123",          // Delete by hash
  contextId: "user-456",   // Optional: for scoping
  requestId: "req-789"     // Or delete all files for requestId
}
```
#### POST/PUT `/file-handler` - Set Retention
```javascript
// Body:
{
  hash: "abc123",
  retention: "permanent",  // or "temporary"
  contextId: "user-456",   // Optional: for scoping
  setRetention: true
}

// Response:
{
  hash: "abc123",
  filename: "file.pdf",
  retention: "permanent",
  url: "https://storage.../file.pdf", // Same URL (tag updated)
  shortLivedUrl: "https://storage.../file.pdf?new-sas",
  gcs: "gs://bucket/file.pdf"
}
```
---
## Cortex File Utilities Layer
**Location**: `lib/fileUtils.js`
This is Cortex's abstraction layer that encapsulates all file handler interactions. **No direct axios calls to the file handler should exist** - every interaction goes through these functions.
### Core Functions
#### URL Building
```javascript
buildFileHandlerUrl(baseUrl, params)
```
- Handles separator detection (`?` vs `&`)
- Properly encodes all parameters
- Skips null/undefined/empty values
- **Used by all file handler operations**
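A minimal usage sketch (the base URL and parameter values are illustrative; in practice the base URL comes from `WHISPER_MEDIA_API_URL`):
```javascript
// Illustrative values only
const baseUrl = "https://files.example.com/api/file-handler?code=key";
// null/undefined/empty params are skipped; "?" vs "&" is chosen automatically
const url = buildFileHandlerUrl(baseUrl, {
  hash: "abc123",
  checkHash: true,
  contextId: "user-456",
  requestId: null // skipped
});
// => ".../file-handler?code=key&hash=abc123&checkHash=true&contextId=user-456"
```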
#### File Upload
```javascript
uploadFileToCloud(fileInput, mimeType, filename, pathwayResolver, contextId)
```
- **Input Types**: URL string, base64 string, or Buffer
- **Process**:
1. Converts input to Buffer
2. Computes xxhash64 hash
3. Checks if file exists via `checkHashExists` (deduplication)
4. If exists, returns existing URLs
5. If not, uploads via file handler POST
- **Returns**: `{url, gcs, hash}`
- **ContextId**: Passed in formData body (not URL)
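For example, a hedged sketch of uploading a Buffer (`resolver` is assumed to be an available pathwayResolver; the import path and values are illustrative):
```javascript
import { uploadFileToCloud } from "../lib/fileUtils.js"; // illustrative path
const fileBuffer = Buffer.from("Quarterly report contents", "utf-8");
// Deduplicated upload: if the hash already exists, the existing URLs are returned
const { url, gcs, hash } = await uploadFileToCloud(
  fileBuffer,   // URL string, base64 string, or Buffer
  "text/plain", // mimeType
  "report.txt", // filename
  resolver,     // optional pathwayResolver for logging
  "user-456"    // optional contextId for scoping
);
```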
#### File Retrieval
```javascript
checkHashExists(hash, fileHandlerUrl, pathwayResolver, contextId, shortLivedMinutes)
```
- Checks if file exists by hash
- Returns short-lived URL (prefers converted version)
- **Returns**: `{url, gcs, hash, filename}` or `null`
- Makes single API call (optimized)
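A hedged usage sketch (the hash, `fileHandlerUrl`, and `resolver` are placeholders):
```javascript
const existing = await checkHashExists(hash, fileHandlerUrl, resolver, "user-456", 5);
if (existing) {
  // Already stored: existing.url is a short-lived URL (converted version preferred)
  console.log(`Duplicate detected: ${existing.filename} at ${existing.url}`);
} else {
  // Not found: proceed with an upload
}
```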
```javascript
fetchFileFromUrl(fileUrl, requestId, contextId, save)
```
- Downloads file from URL via file handler
- Processes based on file type
- **Used by**: `azureVideoTranslatePlugin`, `azureCognitivePlugin`
#### File Deletion
```javascript
deleteFileByHash(hash, pathwayResolver, contextId)
```
- Deletes file from cloud storage
- Handles 404 gracefully (file already deleted)
- **Returns**: `true` if deleted, `false` if not found
#### Retention Management
```javascript
setRetentionForHash(hash, retention, contextId, pathwayResolver)
```
- Sets file retention to `'temporary'` or `'permanent'`
- Best-effort operation (logs warnings on failure)
- **Used by**: `addFileToCollection` when `permanent=true`
#### Short-Lived URL Resolution
```javascript
ensureShortLivedUrl(fileObject, fileHandlerUrl, contextId, shortLivedMinutes)
```
- Resolves file object to use short-lived URL
- Updates GCS URL if converted version exists
- **Used by**: Tools that send files to LLMs
#### Media Chunks
```javascript
getMediaChunks(file, requestId, contextId)
```
- Gets chunked media file URLs
- **Used by**: Media processing workflows
#### Cleanup
```javascript
markCompletedForCleanUp(requestId, contextId)
```
- Marks request as completed for cleanup
- **Used by**: `azureCognitivePlugin`
---
## File Collection System
**Location**: `lib/fileUtils.js` + `pathways/system/entity/tools/sys_tool_file_collection.js`
The file collection system stores file metadata in Redis hash maps using atomic operations for concurrent safety. Entries are keyed by file hash and stored in context-scoped hash maps, so each context's collection is isolated; the file content itself lives in cloud storage managed by the file handler.
### Storage Architecture
```
Redis Hash Maps
└── FileStoreMap:ctx:<contextId>
    └── Hash Map (hash → fileData JSON)
        └── File Entry (JSON):
            {
              // CFH-managed fields (preserved from file handler)
              url: "https://storage.../file.pdf",
              gcs: "gs://bucket/file.pdf",
              filename: "uuid-based-filename.pdf",        // CFH-managed
              // Cortex-managed fields (user metadata)
              id: "timestamp-random",
              displayFilename: "user-friendly-name.pdf",  // User-provided name
              mimeType: "application/pdf",
              tags: ["pdf", "report"],
              notes: "Quarterly report",
              hash: "abc123",
              permanent: true,
              addedDate: "2024-01-15T10:00:00.000Z",
              lastAccessed: "2024-01-15T10:00:00.000Z"
            }
```
### Key Features
#### 1. Atomic Operations
- Uses Redis hash map operations (HSET, HGET, HDEL) which are atomic
- No version-based locking needed - Redis operations are thread-safe
- Direct hash map access: `FileStoreMap:ctx:<contextId>` → `{hash: fileData}`
#### 2. Caching
- In-memory cache with 5-second TTL
- Reduces Redis load for read operations
- Cache invalidated on writes
#### 3. Field Ownership
- **CFH-managed fields**: `url`, `gcs`, `filename` (UUID-based, managed by file handler)
- **Cortex-managed fields**: `id`, `displayFilename`, `tags`, `notes`, `mimeType`, `permanent`, `addedDate`, `lastAccessed`
- When merging data, CFH fields are preserved, Cortex fields are updated
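Conceptually, the merge behaves like this simplified sketch (an illustration, not the actual implementation):
```javascript
// CFH-managed fields from the stored entry win; Cortex-managed fields come from the update
function mergeFileEntry(existing, update) {
  return {
    ...existing,                       // keep everything already stored
    ...update,                         // apply incoming Cortex metadata
    url: existing.url ?? update.url,   // preserve CFH-managed fields when present
    gcs: existing.gcs ?? update.gcs,
    filename: existing.filename ?? update.filename
  };
}
```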
### Core Functions
#### Loading
```javascript
loadFileCollection(contextId, contextKey, useCache)
```
- Loads collection from Redis hash map `FileStoreMap:ctx:<contextId>`
- Returns array of file entries (sorted by lastAccessed, most recent first)
- Uses cache if available and fresh (5-second TTL)
- Converts hash map entries to array format
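A hedged usage sketch (the contextId is illustrative; `contextKey` may be `null` for unencrypted contexts):
```javascript
// Most recently accessed files first; served from the in-memory cache when fresh
const files = await loadFileCollection("user-456", null, true);
for (const file of files) {
  console.log(file.displayFilename || file.filename, file.lastAccessed);
}
```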
#### Saving
```javascript
saveFileCollection(contextId, contextKey, collection)
```
- Saves collection to Redis hash map (only updates changed entries)
- Uses atomic HSET operations per file
- Optimized to only write files that actually changed
- Returns `true` if successful, `false` on error
#### Metadata Updates
```javascript
updateFileMetadata(contextId, hash, metadata)
```
- Updates Cortex-managed metadata fields atomically
- Preserves all CFH-managed fields
- Updates only specified fields (displayFilename, tags, notes, mimeType, dates, permanent)
- **Used for**: Updating lastAccessed, modifying tags/notes without full reload
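For instance, updating tags and notes on a single entry might look like this (values are illustrative):
```javascript
// Atomic HSET on the entry keyed by hash; CFH-managed fields are preserved
await updateFileMetadata("user-456", "abc123", {
  tags: ["pdf", "report", "q1"],
  notes: "Quarterly report, reviewed",
  lastAccessed: new Date().toISOString()
});
```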
#### Adding Files
```javascript
addFileToCollection(contextId, contextKey, url, gcs, filename, tags, notes, hash, fileUrl, pathwayResolver, permanent)
```
- Adds file entry to collection via atomic HSET operation
- If `fileUrl` provided, uploads file first via `uploadFileToCloud()`
- If `permanent=true`, sets retention to permanent via `setRetentionForHash()`
- Merges with existing CFH data if file with same hash already exists
- Returns file entry object with `id`
#### Processing Chat History Files
```javascript
syncAndStripFilesFromChatHistory(chatHistory, contextId, contextKey)
```
- Files IN collection: stripped from message (replaced with placeholder), tools can access them
- Files NOT in collection: left in message as-is (model sees them directly)
- Updates lastAccessed for collection files
- **Used by**: `sys_entity_agent` to process incoming chat history
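A hedged usage sketch, assuming the `(chatHistory, agentContext)` form documented in the Complete Function Reference below (context values are illustrative):
```javascript
const agentContext = [
  { contextId: "user-456", contextKey: null, default: true }
];
const { chatHistory: processed, availableFiles } =
  await syncAndStripFilesFromChatHistory(chatHistory, agentContext);
// "processed" has collection files replaced with placeholders;
// "availableFiles" is a formatted file list suitable for prompt templates
```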
### File Entry Schema
```javascript
{
  id: string,                     // Unique ID: "timestamp-random" (Cortex-managed)
  url: string,                    // Azure Blob Storage URL (CFH-managed)
  gcs: string | null,             // Google Cloud Storage URL (CFH-managed)
  filename: string | null,        // UUID-based storage filename (CFH-managed)
  displayFilename: string | null, // User-friendly filename (Cortex-managed)
  mimeType: string | null,        // MIME type (Cortex-managed)
  tags: string[],                 // Searchable tags (Cortex-managed)
  notes: string,                  // User notes/description (Cortex-managed)
  hash: string,                   // File hash for deduplication (used as Redis key)
  permanent: boolean,             // Whether file is permanent (Cortex-managed)
  addedDate: string,              // ISO timestamp when added (Cortex-managed)
  lastAccessed: string            // ISO timestamp of last access (Cortex-managed)
}
```
**Field Ownership Notes**:
- `filename`: Managed by CFH, UUID-based storage filename
- `displayFilename`: Managed by Cortex, user-provided friendly name
- When displaying files, prefer `displayFilename` with fallback to `filename`
---
## Tools Integration
### System Tools That Use Files
#### 1. WriteFile (`sys_tool_writefile.js`)
**Flow**:
1. User provides content and filename
2. Creates Buffer from content
3. Calls `uploadFileToCloud()` with `contextId`
4. Calls `addFileToCollection()` with `permanent=true`
5. Returns file info with `fileId`
**Key Code**:
```javascript
const uploadResult = await uploadFileToCloud(
  fileBuffer, mimeType, filename, resolver, contextId
);
const fileEntry = await addFileToCollection(
  contextId, contextKey, uploadResult.url, uploadResult.gcs,
  filename, tags, notes, uploadResult.hash, null, resolver, true
);
```
#### 2. EditFile (`sys_tool_editfile.js`)
**Flow**:
1. User provides file identifier and modification
2. Resolves file via `resolveFileParameter()` → finds in collection
3. Downloads file content via `axios.get(file.url)`
4. Modifies content (line replacement or search/replace)
5. Uploads modified file via `uploadFileToCloud()` (creates new hash)
6. Updates collection entry atomically via `updateFileMetadata()` with new URL/hash
7. Deletes old file version (if not permanent) via `deleteFileByHash()`
**Key Code**:
```javascript
const foundFile = await resolveFileParameter(fileParam, contextId, contextKey);
const oldHash = foundFile.hash;
const uploadResult = await uploadFileToCloud(
  fileBuffer, mimeType, filename, resolver, contextId
);
// Update file entry atomically (preserves CFH data, updates Cortex metadata)
await updateFileMetadata(contextId, foundFile.hash, {
  url: uploadResult.url,
  gcs: uploadResult.gcs,
  hash: uploadResult.hash
});
if (!foundFile.permanent) {
  await deleteFileByHash(oldHash, resolver, contextId);
}
```
#### 3. FileCollection (`sys_tool_file_collection.js`)
**Tools**:
- `AddFileToCollection`: Adds file to collection (with optional upload)
- `SearchFileCollection`: Searches files by filename, tags, notes
- `ListFileCollection`: Lists all files with filtering/sorting
- `RemoveFileFromCollection`: Removes files (deletes from cloud if not permanent)
**Key Code**:
```javascript
// Add file
await addFileToCollection(
  contextId, contextKey, url, gcs, filename,
  tags, notes, hash, fileUrl, resolver, permanent
);

// Remove file (with permanent check)
if (!fileInfo.permanent) {
  await deleteFileByHash(fileInfo.hash, resolver, contextId);
}
```
#### 4. Image Tools (`sys_tool_image.js`, `sys_tool_image_gemini.js`)
**Flow**:
1. Generates/modifies image
2. Gets image URL
3. Uploads via `uploadFileToCloud()`
4. Adds to collection with `permanent=true`
#### 5. ReadFile (`sys_tool_readfile.js`)
**Flow**:
1. Resolves file via `resolveFileParameter()` → finds in collection
2. Downloads file content via `axios.get(file.url)`
3. Validates file is text-based via `isTextMimeType()`
4. Returns content with line/character range support
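A hedged sketch of this flow (error handling omitted; `fileParam` and `agentContext` are assumed inputs, and the import path and line range are illustrative):
```javascript
import axios from "axios";
import { resolveFileParameter, determineMimeTypeFromUrl, isTextMimeType } from "../lib/fileUtils.js"; // illustrative path

const fileUrl = await resolveFileParameter(fileParam, agentContext); // find in collection
const { data } = await axios.get(fileUrl, { responseType: "text" }); // download content
const mimeType = determineMimeTypeFromUrl(fileUrl, null, fileParam);
if (!isTextMimeType(mimeType)) {
  throw new Error("ReadFile only supports text-based files");
}
const lines = data.split("\n").slice(0, 50); // illustrative line-range support
```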
#### 6. ViewImage (`sys_tool_view_image.js`)
**Flow**:
1. Finds file in collection
2. Resolves to short-lived URL via `ensureShortLivedUrl()`
3. Returns image URL for display
#### 7. AnalyzeFile (`sys_tool_analyzefile.js`)
**Flow**:
1. Extracts files from chat history via `extractFilesFromChatHistory()`
2. Generates file message content via `generateFileMessageContent()`
3. Injects files into chat history via `injectFileIntoChatHistory()`
4. Uses Gemini Vision model to analyze files
### Plugins That Use Files
#### 1. AzureVideoTranslatePlugin
**Flow**:
1. Receives video URL
2. If not from Azure storage, uploads via `fetchFileFromUrl()`
3. Uses uploaded URL for video translation
**Key Code**:
```javascript
const response = await fetchFileFromUrl(videoUrl, this.requestId, contextId, false);
const resultUrl = Array.isArray(response) ? response[0] : response.url;
```
#### 2. AzureCognitivePlugin
**Flow**:
1. Receives file for indexing
2. If not text file, converts via `fetchFileFromUrl()` with `save=true`
3. Uses converted text file for indexing
4. Marks completed via `markCompletedForCleanUp()`
**Key Code**:
```javascript
const data = await fetchFileFromUrl(file, requestId, contextId, true);
url = Array.isArray(data) ? data[0] : data.url;
```
---
## Data Flow Diagrams
### File Upload Flow
```
User/LLM Request
│
▼
System Tool (WriteFile, Image, etc.)
│
▼
uploadFileToCloud()
│
├─► Convert input to Buffer
├─► Compute xxhash64 hash
├─► checkHashExists() ──► File Handler GET /file-handler?checkHash=true
│ │
│ ├─► File exists? ──► Return existing URLs
│ │
│ └─► File not found ──► Continue
│
└─► Upload via POST ──► File Handler POST /file-handler
│ │
│ ├─► Store in Azure Blob Storage
│ ├─► Store in GCS (if configured)
│ ├─► Store metadata in Redis
│ └─► Return {url, gcs, hash, shortLivedUrl}
│
└─► addFileToCollection()
│
├─► If permanent=true ──► setRetentionForHash() ──► File Handler POST /file-handler?setRetention=true
│
└─► Save to Redis hash map (atomic operation)
│
└─► Redis HSET FileStoreMap:ctx:<contextId> <hash> <fileData>
│
├─► Merge with existing CFH data (if hash exists)
├─► Preserve CFH fields (url, gcs, filename)
└─► Update Cortex fields (displayFilename, tags, notes, etc.)
```
### File Retrieval Flow
```
User/LLM Request (e.g., "view file.pdf")
│
▼
System Tool (ViewImage, ReadFile, etc.)
│
▼
resolveFileParameter()
│
├─► Find in collection via findFileInCollection()
│ │
│ └─► Matches by: ID, filename, hash, URL, or fuzzy filename
│
└─► ensureShortLivedUrl()
│
└─► checkHashExists() ──► File Handler GET /file-handler?checkHash=true&shortLivedMinutes=5
│ │
│ ├─► Check Redis for hash metadata
│ ├─► Generate short-lived SAS token
│ └─► Return {url, gcs, hash, filename, shortLivedUrl}
│
└─► Return file object with shortLivedUrl
```
### File Edit Flow
```
User/LLM Request (e.g., "edit file.txt, replace line 5")
│
▼
EditFile Tool
│
├─► resolveFileParameter() ──► Find file in collection
│
├─► Download file content ──► axios.get(file.url)
│
├─► Modify content (line replacement or search/replace)
│
├─► uploadFileToCloud() ──► Upload modified file
│ │
│ └─► Returns new {url, gcs, hash}
│
└─► updateFileMetadata() ──► Redis HSET (atomic update)
│
├─► Preserve CFH fields (url, gcs, filename)
├─► Update Cortex fields (url, gcs, hash)
└─► If update succeeds:
└─► Delete old file (if not permanent)
└─► deleteFileByHash() ──► File Handler DELETE /file-handler?hash=oldHash
```
### File Deletion Flow
```
User/LLM Request (e.g., "remove file.pdf from collection")
│
▼
RemoveFileFromCollection Tool
│
├─► Load collection ──► findFileInCollection() for each fileId
│
├─► Capture file info (hash, permanent) from collection
│
└─► Redis HDEL FileStoreMap:ctx:<contextId> <hash> (atomic deletion)
│
└─► Async deletion (fire and forget)
│
├─► For each file:
│ │
│ ├─► If permanent=true ──► Skip deletion (keep in cloud)
│ │
│ └─► If permanent=false ──► deleteFileByHash()
│ │
│ └─► File Handler DELETE /file-handler?hash=hash&contextId=contextId
│ │
│ ├─► Delete from Azure Blob Storage
│ ├─► Delete from GCS (if configured)
│ └─► Remove from Redis metadata
```
---
## Storage Layers
### Layer 1: Cloud Storage (File Handler)
#### Azure Blob Storage (Primary)
- **Container**: Single container (configured via `AZURE_STORAGE_CONTAINER_NAME`)
- **Naming**: UUID-based filenames
- **Organization**: By `requestId` folders
- **Access**: SAS tokens (long-lived and short-lived)
- **Tags**: Blob index tags for retention (`retention=temporary` or `retention=permanent`)
- **Lifecycle**: Azure automatically deletes `retention=temporary` files after 30 days
#### Google Cloud Storage (Optional)
- **Enabled**: If `GCP_SERVICE_ACCOUNT_KEY` configured
- **URL Format**: `gs://bucket/path`
- **Usage**: Media file chunks, converted files
- **No short-lived URLs**: GCS URLs are permanent (no SAS equivalent)
#### Local Storage (Fallback)
- **Used**: If Azure not configured
- **Served**: Via HTTP on configured port
### Layer 2: Redis Metadata (File Handler)
**Purpose**: Fast hash lookups, file metadata caching
**Key Format**:
- Unscoped: `<hash>`
- Context-scoped: `<hash>:ctx:<contextId>`
- Legacy (migrated): `<hash>:<containerName>` (auto-migrated on read)
**Data Stored**:
```javascript
{
  url: "https://storage.../file.pdf?long-lived-sas",
  shortLivedUrl: "https://storage.../file.pdf?short-lived-sas",
  gcs: "gs://bucket/file.pdf",
  hash: "abc123",
  filename: "file.pdf",
  timestamp: "2024-01-15T10:00:00.000Z",
  converted: {
    url: "https://storage.../converted.csv",
    gcs: "gs://bucket/converted.csv"
  }
}
```
### Layer 3: File Collection (Cortex Redis Hash Maps)
**Purpose**: User-facing file collections with metadata
**Storage**: Redis hash maps (`FileStoreMap:ctx:<contextId>`)
**Format**:
```javascript
// Redis Hash Map Structure:
// Key: FileStoreMap:ctx:<contextId>
// Value: Hash map where each entry is {hash: fileDataJSON}
// Example hash map entry:
{
  "abc123": JSON.stringify({
    // CFH-managed fields
    url: "https://storage.../file.pdf",
    gcs: "gs://bucket/file.pdf",
    filename: "uuid-based-name.pdf",
    // Cortex-managed fields
    id: "1736966400000-abc123",
    displayFilename: "user-friendly-name.pdf",
    mimeType: "application/pdf",
    tags: ["pdf", "report"],
    notes: "Quarterly report",
    hash: "abc123",
    permanent: true,
    addedDate: "2024-01-15T10:00:00.000Z",
    lastAccessed: "2024-01-15T10:00:00.000Z"
  })
}
```
**Features**:
- Atomic operations (Redis HSET/HDEL/HGET are thread-safe)
- In-memory caching (5-second TTL)
- Direct hash map access (no versioning needed)
- Context-scoped isolation (`FileStoreMap:ctx:<contextId>`)
---
## Key Concepts
### 1. Context Scoping (`agentContext`)
**Purpose**: Per-user/per-context file isolation with optional cross-context reading
**Usage**:
- **`agentContext`**: Array of context objects, each with:
- `contextId`: Context identifier (required)
- `contextKey`: Encryption key for this context (optional, `null` for unencrypted)
- `default`: Boolean indicating the default context for write operations (required)
- Stored in Redis with scoped keys: `FileStoreMap:ctx:<contextId>`
**Benefits**:
- Prevents hash collisions between users
- Enables per-user file management
- Supports multi-tenant applications
- Multiple contexts allow reading files from secondary contexts (e.g., workspace files)
- Separate encryption keys allow user-encrypted files alongside unencrypted shared workspace files
- Centralized context management (single parameter instead of multiple)
**Example**:
```javascript
// Upload with contextId (from the default context)
const agentContext = [
  { contextId: "user-123", contextKey: userContextKey, default: true }
];
await uploadFileToCloud(fileBuffer, mimeType, filename, resolver, agentContext[0].contextId);

// Check hash with contextId
await checkHashExists(hash, fileHandlerUrl, null, agentContext[0].contextId);

// Delete with contextId
await deleteFileByHash(hash, resolver, agentContext[0].contextId);

// Load merged collection (reads from both contexts)
// User context is encrypted (userContextKey), workspace is not (null)
const mergedAgentContext = [
  { contextId: "user-123", contextKey: userContextKey, default: true },
  { contextId: "workspace-456", contextKey: null, default: false } // Shared workspace, unencrypted
];
const collection = await loadMergedFileCollection(mergedAgentContext);

// Resolve file from any context in agentContext
const url = await resolveFileParameter("file.pdf", mergedAgentContext);
```
**`agentContext` Behavior**:
- Files are read from all contexts in the array (union)
- Each context uses its own encryption key (`contextKey`)
- Shared workspaces typically use `contextKey: null` (unencrypted) since they're shared between users
- Writes/updates only go to the context marked as `default: true`, using its `contextKey`
- Deduplication: if a file exists in multiple contexts (same hash), the first context takes precedence
- Files from non-default contexts bypass `inCollection` filtering (all files accessible)
- The default context is used for all write operations (uploads, updates, deletions)
**`agentContext` Security Note**:
- `agentContext` allows reading files from multiple contexts, including files that bypass `inCollection` filtering
- **Important**: `agentContext` should be treated as a privileged, server-derived value
- Server-side authorization MUST verify that any contexts in `agentContext` are restricted to trusted, same-tenant contexts (e.g., derived from workspace membership) before use
- Never accept `agentContext` directly from untrusted client inputs without validation
- Only the default context should be used for write operations - non-default contexts are read-only
### 2. Permanent Files (`permanent` flag)
**Purpose**: Indicate files that should be kept indefinitely
**Storage**:
- Stored in file collection entry: `permanent: true`
- Sets blob index tag: `retention=permanent`
- Prevents deletion from cloud storage
**Usage**:
```javascript
// Add permanent file
await addFileToCollection(
  contextId, contextKey, url, gcs, filename, tags, notes, hash,
  null, resolver, true // permanent=true
);

// Check before deletion
if (!file.permanent) {
  await deleteFileByHash(file.hash, resolver, contextId);
}
```
**Behavior**:
- Permanent files are **not deleted** from cloud storage when removed from collection
- Retention set via `setRetentionForHash()` (best-effort)
- Default: `permanent=false` (temporary, 30-day retention)
### 3. Hash Deduplication
**Purpose**: Avoid storing duplicate files
**Process**:
1. Compute xxhash64 hash of file content
2. Check if hash exists via `checkHashExists()`
3. If exists, return existing URLs (no upload)
4. If not, upload and store hash
**Benefits**:
- Saves storage space
- Faster uploads (skip if duplicate)
- Consistent file references
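A hedged sketch of that check (a hypothetical `uploadIfNew` helper; the real logic lives inside `uploadFileToCloud`):
```javascript
async function uploadIfNew(fileBuffer, fileHandlerUrl, resolver, contextId) {
  const hash = computeBufferHash(fileBuffer); // xxhash64 of the content
  const existing = await checkHashExists(hash, fileHandlerUrl, resolver, contextId);
  if (existing) {
    // Reuse the stored file and skip the upload entirely
    return { url: existing.url, gcs: existing.gcs, hash };
  }
  // ...otherwise perform the actual upload (omitted)
}
```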
### 4. Short-Lived URLs
**Purpose**: Secure, time-limited file access
**Features**:
- 5-minute expiration (configurable)
- Always included in file handler responses
- Preferred for LLM file access
- Automatically generated on `checkHash` operations
**Usage**:
```javascript
// Resolve to short-lived URL
const fileWithShortLivedUrl = await ensureShortLivedUrl(
  fileObject, fileHandlerUrl, contextId, 5 // 5 minutes
);
// fileWithShortLivedUrl.url is now a short-lived URL
```
### 5. Atomic Operations
**Purpose**: Ensure thread-safe collection modifications
**Process**:
- Redis hash map operations (HSET, HDEL, HGET) are atomic
- No version-based locking needed
- Direct hash map updates per file (not full collection replacement)
**Functions**:
- `addFileToCollection()`: Atomic HSET operation
- `updateFileMetadata()`: Atomic HSET operation (updates single file)
- `loadFileCollection()`: Atomic HGETALL operation
- File removal: Atomic HDEL operation
**Benefits**:
- No version conflicts (each file updated independently)
- Faster operations (no retry loops)
- Simpler code (no locking logic needed)
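Conceptually, the underlying Redis operations look like the following sketch (a simplification using a generic Node.js Redis client, not the actual implementation):
```javascript
// Each file entry is one field in the context-scoped hash map
const key = `FileStoreMap:ctx:${contextId}`;
// Add or update one entry (atomic)
await redisClient.hSet(key, hash, JSON.stringify(fileEntry));
// Read one entry / the whole collection (atomic)
const entryJson = await redisClient.hGet(key, hash);
const allEntries = await redisClient.hGetAll(key);
// Remove one entry (atomic)
await redisClient.hDel(key, hash);
```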
---
## Complete Function Reference
### File Handler Operations
#### `buildFileHandlerUrl(baseUrl, params)`
Builds file handler URL with query parameters.
- **Parameters**:
- `baseUrl`: File handler service URL
- `params`: Object with query parameters (null/undefined skipped)
- **Returns**: Complete URL with encoded parameters
- **Used by**: All file handler operations
#### `fetchFileFromUrl(fileUrl, requestId, contextId, save)`
Downloads and processes file from URL.
- **Parameters**:
- `fileUrl`: URL to fetch
- `requestId`: Request ID for tracking
- `contextId`: Optional context ID
- `save`: Whether to save converted file (default: false)
- **Returns**: Response data (object or array)
- **Used by**: `azureVideoTranslatePlugin`, `azureCognitivePlugin`
#### `uploadFileToCloud(fileInput, mimeType, filename, pathwayResolver, contextId)`
Uploads file to cloud storage with deduplication.
- **Parameters**:
- `fileInput`: URL string, base64 string, or Buffer
- `mimeType`: MIME type (optional)
- `filename`: Filename (optional, inferred if not provided)
- `pathwayResolver`: Optional resolver for logging
- `contextId`: Optional context ID for scoping
- **Returns**: `{url, gcs, hash}`
- **Process**:
1. Converts input to Buffer
2. Computes hash
3. Checks if exists (deduplication)
4. Uploads if not exists
- **Used by**: All tools that upload files
#### `checkHashExists(hash, fileHandlerUrl, pathwayResolver, contextId, shortLivedMinutes)`
Checks if file exists by hash.
- **Parameters**:
- `hash`: File hash
- `fileHandlerUrl`: File handler URL
- `pathwayResolver`: Optional resolver for logging
- `contextId`: Optional context ID
- `shortLivedMinutes`: URL expiration (default: 5)
- **Returns**: `{url, gcs, hash, filename}` or `null`
- **Used by**: Upload deduplication, file resolution
#### `deleteFileByHash(hash, pathwayResolver, contextId)`
Deletes file from cloud storage.
- **Parameters**:
- `hash`: File hash
- `pathwayResolver`: Optional resolver for logging
- `contextId`: Optional context ID
- **Returns**: `true` if deleted, `false` if not found
- **Handles**: 404 gracefully (file already deleted)
#### `setRetentionForHash(hash, retention, contextId, pathwayResolver)`
Sets file retention (temporary or permanent).
- **Parameters**:
- `hash`: File hash
- `retention`: `'temporary'` or `'permanent'`
- `contextId`: Optional context ID
- `pathwayResolver`: Optional resolver for logging
- **Returns**: Response data or `null`
- **Used by**: `addFileToCollection` when `permanent=true`
#### `ensureShortLivedUrl(fileObject, fileHandlerUrl, contextId, shortLivedMinutes)`
Resolves file to use short-lived URL.
- **Parameters**:
- `fileObject`: File object with `hash` and `url`
- `fileHandlerUrl`: File handler URL
- `contextId`: Optional context ID
- `shortLivedMinutes`: URL expiration (default: 5)
- **Returns**: File object with `url` updated to short-lived URL
- **Used by**: Tools that send files to LLMs
#### `getMediaChunks(file, requestId, contextId)`
Gets chunked media file URLs.
- **Parameters**:
- `file`: File URL
- `requestId`: Request ID
- `contextId`: Optional context ID
- **Returns**: Array of chunk URLs
#### `markCompletedForCleanUp(requestId, contextId)`
Marks request as completed for cleanup.
- **Parameters**:
- `requestId`: Request ID
- `contextId`: Optional context ID
- **Returns**: Response data or `null`
### File Collection Operations
#### `loadFileCollection(contextId, contextKey, useCache)`
Loads file collection from Redis hash map.
- **Parameters**:
- `contextId`: Context ID (required)
- `contextKey`: Optional encryption key
- `useCache`: Whether to use cache (default: true)
- **Returns**: Array of file entries (sorted by lastAccessed, most recent first)
- **Process**:
1. Checks in-memory cache (5-second TTL)
2. Loads from Redis hash map `FileStoreMap:ctx:<contextId>`
3. Filters by `inCollection` (only returns global files or chat-specific files)
4. Converts hash map entries to array format
5. Updates cache
- **Used by**: Primary file collection operations
#### `loadFileCollectionAll(contextId, contextKey)`
Loads ALL files from a context, bypassing `inCollection` filtering.
- **Parameters**:
- `contextId`: Context ID (required)
- `contextKey`: Optional encryption key
- **Returns**: Array of all file entries (no filtering)
- **Used by**: `loadMergedFileCollection` when loading files from all contexts
#### `loadMergedFileCollection(agentContext)`
Loads merged file collection from one or more contexts.
- **Parameters**:
- `agentContext`: Array of context objects, each with `{ contextId, contextKey, default }` (required)
- **Returns**: Array of file entries from all contexts (deduplicated by hash/url/gcs)
- **Process**:
1. Loads first context collection via `loadFileCollectionAll()` with its `contextKey`
2. Tags each file with `_contextId` (internal, stripped before returning to callers)
3. For each additional context, loads collection via `loadFileCollectionAll()` with its `contextKey`
4. Deduplicates: earlier contexts take precedence if same file exists in multiple
5. Returns merged collection (with `_contextId` stripped before returning)
- **Used by**: `syncAndStripFilesFromChatHistory`, `getAvailableFiles`, `resolveFileParameter`, file tools
#### `saveFileCollection(contextId, contextKey, collection)`
Saves file collection to Redis hash map (optimized - only updates changed entries).
- **Parameters**:
- `contextId`: Context ID
- `contextKey`: Optional encryption key (unused, kept for compatibility)
- `collection`: Array of file entries
- **Returns**: `true` if successful, `false` on error
- **Process**:
1. Compares each file with current state
2. Only updates files that changed (optimized)
3. Uses atomic HSET operations per file
4. Preserves CFH-managed fields, updates Cortex-managed fields
- **Used by**: Tools that need to save multiple file changes
#### `updateFileMetadata(contextId, hash, metadata)`
Updates Cortex-managed metadata fields atomically.
- **Parameters**:
- `contextId`: Context ID (required)
- `hash`: File hash (used as Redis key)
- `metadata`: Object with fields to update (displayFilename, tags, notes, mimeType, addedDate, lastAccessed, permanent)
- **Returns**: `true` if successful, `false` on error
- **Process**:
1. Loads existing file data from Redis
2. Merges metadata (preserves CFH fields, updates Cortex fields)
3. Writes back via atomic HSET
4. Invalidates cache
- **Used by**: Search operations (updates lastAccessed), EditFile (updates URL/hash)
#### `addFileToCollection(contextId, contextKey, url, gcs, filename, tags, notes, hash, fileUrl, pathwayResolver, permanent)`
Adds file to collection via atomic operation.
- **Parameters**:
- `contextId`: Context ID (required)
- `contextKey`: Optional encryption key (unused, kept for compatibility)
- `url`: Azure URL (optional if fileUrl provided)
- `gcs`: GCS URL (optional)
- `filename`: User-friendly filename (required)
- `tags`: Array of tags (optional)
- `notes`: Notes string (optional)
- `hash`: File hash (optional, computed if not provided)
- `fileUrl`: URL to upload (optional, uploads if provided)
- `pathwayResolver`: Optional resolver for logging
- `permanent`: Whether file is permanent (default: false)
- **Returns**: File entry object with `id`
- **Process**:
1. If `fileUrl` provided, uploads file first via `uploadFileToCloud()`
2. If `permanent=true`, sets retention to permanent via `setRetentionForHash()`
3. Creates file entry with `displayFilename` (user-friendly name)
4. Writes to Redis hash map via atomic HSET
5. Merges with existing CFH data if hash already exists
- **Used by**: WriteFile, Image tools, FileCollection tool
#### `syncAndStripFilesFromChatHistory(chatHistory, agentContext)`
Processes chat history files based on collection membership.
- **Parameters**:
- `chatHistory`: Chat history array to process
- `agentContext`: Array of context objects, each with `{ contextId, contextKey, default }` (required)
- **Returns**: `{ chatHistory, availableFiles }` - processed chat history and formatted file list
- **Process**:
1. Loads merged file collection from all contexts in `agentContext`
2. For each file in chat history:
- If in collection: strip from message, update lastAccessed and inCollection in owning context (using that context's key)
- If not in collection: leave in message as-is
3. Returns processed history and available files string
4. Uses atomic operations per file, updating the context that owns each file (identified by `_contextId` tag)
- **Used by**: `sys_entity_agent` to process incoming chat history
### File Resolution
#### `resolveFileParameter(fileParam, agentContext, options)`
Resolves file parameter to file URL.
- **Parameters**:
- `fileParam`: File ID, filename, URL, or hash
- `agentContext`: Array of context objects, each with `{ contextId, contextKey, default }` (required)
- `options`: Optional options object:
- `preferGcs`: Boolean - prefer GCS URL over Azure URL
- `useCache`: Boolean - use cache (default: true)
- **Returns**: File URL string (Azure or GCS) or `null` if not found
- **Matching** (via `findFileInCollection()`):
- Exact ID match
- Exact hash match
- Exact URL match (Azure or GCS)
- Exact filename match (case-insensitive, basename comparison)
- Fuzzy filename match (contains, minimum 4 characters)
- **Process**:
1. Loads merged file collection from all contexts in `agentContext`
2. Searches merged collection for matching file
3. Returns file URL if found
- **Used by**: ReadFile, EditFile, and other tools that need file URLs
#### `findFileInCollection(fileParam, collection)`
Finds file in collection array.
- **Parameters**:
- `fileParam`: File identifier
- `collection`: Collection array
- **Returns**: File entry or `null`
- **Used by**: `resolveFileParameter`
#### `generateFileMessageContent(fileParam, agentContext)`
Generates file content for LLM messages.
- **Parameters**:
- `fileParam`: File identifier (ID, filename, URL, or hash)
- `agentContext`: Array of context objects, each with `{ contextId, contextKey, default }` (required)
- **Returns**: File content object with `type`, `url`, `gcs`, `hash` or `null`
- **Process**:
1. Loads merged file collection from all contexts in `agentContext`
2. Finds file in merged collection via `findFileInCollection()`
3. Resolves to short-lived URL via `ensureShortLivedUrl()` using default context
4. Returns OpenAI-compatible format: `{type: 'image_url', url, gcs, hash}`
- **Used by**: AnalyzeFile tool to inject files into chat history
#### `extractFilesFromChatHistory(chatHistory)`
Extracts file metadata from chat history messages.
- **Parameters**:
- `chatHistory`: Chat history array to scan
- **Returns**: Array of file metadata objects `{url, gcs, hash, type}`
- **Process**:
1. Scans all messages for file content objects
2. Extracts from `image_url`, `file`, or direct URL objects
3. Returns normalized format
- **Used by**: File extraction utilities
#### `getAvailableFiles(chatHistory, agentContext)`
Gets formatted list of available files from collection.
- **Parameters**:
- `chatHistory`: Unused (kept for API compatibility)
- `agentContext`: Array of context objects, each with `{ contextId, contextKey, default }` (required)
- **Returns**: Formatted string of available files (last 10 most recent)
- **Process**:
1. Loads merged file collection from all contexts in `agentContext`
2. Formats files via `formatFilesForTemplate()`
3. Returns compact one-line format per file
- **Used by**: Template rendering to show available files
### Utility Functions
#### `getDefaultContext(agentContext)`
Helper function to extract the default context from an agentContext array.
- **Parameters**:
- `agentContext`: Array of context objects, each with `{ contextId, contextKey, default }`
- **Returns**: Context object with `default: true`, or first context if none marked as default, or `null` if array is empty
- **Used by**: Functions that need to determine which context to use for write operations
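For example (illustrative contexts), the returned context is the one write operations should target:
```javascript
const agentContext = [
  { contextId: "workspace-456", contextKey: null, default: false },
  { contextId: "user-123", contextKey: "user-secret-key", default: true }
];
const writeContext = getDefaultContext(agentContext);
// => { contextId: "user-123", contextKey: "user-secret-key", default: true }
```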
#### `computeFileHash(filePath)`
Computes xxhash64 hash of file.
- **Returns**: Hash string (hex)
#### `computeBufferHash(buffer)`
Computes xxhash64 hash of buffer.
- **Returns**: Hash string (hex)
#### `extractFilenameFromUrl(url, gcs)`
Extracts filename from URL (prefers GCS).
- **Returns**: Filename string
#### `ensureFilenameExtension(filename, mimeType)`
Ensures filename has correct extension based on MIME type.
- **Returns**: Filename with correct extension
#### `determineMimeTypeFromUrl(url, gcs, filename)`
Determines MIME type from URL or filename.
- **Returns**: MIME type string
#### `isTextMimeType(mimeType)`
Checks if MIME type is text-based.
- **Parameters**:
- `mimeType`: MIME type string to check
- **Returns**: Boolean (true if text-based)
- **Supports**: All `text/*` types, plus application types like JSON, JavaScript, XML, YAML, Python, etc.
- **Used by**: ReadFile, EditFile to validate file types
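For example (expected results based on the description above):
```javascript
isTextMimeType("text/markdown");    // true  (all text/* types)
isTextMimeType("application/json"); // true  (supported application type)
isTextMimeType("application/pdf");  // false (binary format)
```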
#### `getMimeTypeFromFilename(filenameOrPath, defaultMimeType)`
Gets MIME type from filename or path.
- **Parameters**:
- `filenameOrPath`: Filename or full file path
- `defaultMimeType`: Optional default (default: 'application/octet-stream')
- **Returns**: MIME type string
- **Used by**: File upload, file type detection
#### `getMimeTypeFromExtension(extension, defaultMimeType)`
Gets MIME type from file extension.
- **Parameters**:
- `extension`: File extension (with or without leading dot)
- `defaultMimeType`: Optional default (default: 'application/octet-stream')
- **Returns**: MIME type string
---
## Error Handling
### File Handler Errors
**Network Errors**:
- Handled gracefully in all functions
- Logged via `pathwayResolver` or `logger`
- Non-critical operations return `null` instead of throwing
**404 Errors**:
- Treated as "file not found" (not an error)
- `deleteFileByHash` returns `false` on 404
- `checkHashExists` returns `null` on 404
**Timeout Errors**:
- Upload: 30 seconds
- Check hash: 10 seconds
- Fetch file: 60 seconds
- Set retention: 15 seconds
### File Collection Errors
**Missing ContextId**:
- File collection operations require `contextId`
- Returns `null` or throws error if missing
**Concurrent Modifications**:
- Prevented by atomic Redis operations (HSET, HDEL are thread-safe)
- No version conflicts (each file updated independently)
**Invalid File Data**:
- Invalid JSON entries are skipped during load
- Missing required fields are handled gracefully
### Best Practices
1. **Always pass `contextId`** when available (strongly recommended for multi-tenant)
2. **Use atomic operations** - `addFileToCollection()`, `updateFileMetadata()` are thread-safe
3. **Check `permanent` flag** before deleting files from cloud storage
4. **Handle errors gracefully** - don't throw on non-critical failures
5. **Use short-lived URLs** for LLM file access (via `ensureShortLivedUrl()`)
6. **Check for existing files** before uploading (automatic in `uploadFileToCloud`)
7. **Preserve CFH fields** - when updating metadata, preserve `url`, `gcs`, `filename` from file handler
8. **Use `displayFilename`** for user-facing displays (fallback to `filename` if not set)
---
## Summary
The Cortex file system provides:
✅ **Encapsulated file handler interactions** - No direct axios calls
✅ **Hash-based deduplication** - Avoids duplicate storage
✅ **Context scoping** - Per-user file isolation via `FileStoreMap:ctx:<contextId>`
✅ **Permanent file support** - Indefinite retention
✅ **Atomic operations** - Thread-safe collection modifications via Redis hash maps
✅ **Short-lived URLs** - Secure file access (5-minute expiration)
✅ **Comprehensive error handling** - Graceful failure handling
✅ **Single API call optimization** - Efficient file resolution
✅ **Field ownership separation** - CFH-managed vs Cortex-managed fields
✅ **Chat history integration** - Automatic file syncing from conversations
All file operations flow through `lib/fileUtils.js`, ensuring consistency, maintainability, and proper error handling throughout the system.
### Architecture Highlights
- **File Handler Service**: External Azure Function managing cloud storage
- **File Utilities Layer**: Abstraction over file handler (no direct API calls)
- **File Collection System**: Redis hash maps for user file metadata
- **Atomic Operations**: Thread-safe via Redis HSET/HDEL/HGET operations
- **Context Isolation**: Per-context hash maps for multi-tenant support