llama-cpp-capacitor
Version:
A native Capacitor plugin that embeds llama.cpp directly into mobile apps, enabling offline AI inference with chat-first API design. Supports both simple text generation and advanced chat conversations with system prompts, multimodal processing, TTS, LoRA
718 lines (529 loc) • 21.6 kB
Markdown
# llama-cpp Capacitor Plugin
[](https://github.com/arusatech/llama-cpp/actions)
[](https://opensource.org/licenses/MIT)
[](https://www.npmjs.com/package/llama-cpp-capacitor/)
A native Capacitor plugin that embeds [llama.cpp](https://github.com/ggerganov/llama.cpp) directly into mobile apps, enabling offline AI inference with comprehensive support for text generation, multimodal processing, TTS, LoRA adapters, and more.
[llama.cpp](https://github.com/ggerganov/llama.cpp): Inference of [LLaMA](https://arxiv.org/abs/2302.13971) model in pure C/C++
## 🚀 Features
- **Offline AI Inference**: Run large language models completely offline on mobile devices
- **Text Generation**: Complete text completion with streaming support
- **Chat Conversations**: Multi-turn conversations with context management
- **Multimodal Support**: Process images and audio alongside text
- **Text-to-Speech (TTS)**: Generate speech from text using vocoder models
- **LoRA Adapters**: Fine-tune models with LoRA adapters
- **Embeddings**: Generate vector embeddings for semantic search
- **Reranking**: Rank documents by relevance to queries
- **Session Management**: Save and load conversation states
- **Benchmarking**: Performance testing and optimization tools
- **Structured Output**: Generate JSON with schema validation
- **Cross-Platform**: iOS and Android support with native optimizations
## ✅ **Complete Implementation Status**
This plugin is now **FULLY IMPLEMENTED** with complete native integration of llama.cpp for both iOS and Android platforms. The implementation includes:
### **Completed Features**
- **Complete C++ Integration**: Full llama.cpp library integration with all core components
- **Native Build System**: CMake-based build system for both iOS and Android
- **Platform Support**: iOS (arm64, x86_64) and Android (arm64-v8a, armeabi-v7a, x86, x86_64)
- **TypeScript API**: Complete TypeScript interface matching llama.rn functionality
- **Native Methods**: All 30+ native methods implemented with proper error handling
- **Event System**: Capacitor event system for progress and token streaming
- **Documentation**: Comprehensive README and API documentation
### **Technical Implementation**
- **C++ Core**: Complete llama.cpp library with GGML, GGUF, and all supporting components
- **iOS Framework**: Native iOS framework with Metal acceleration support
- **Android JNI**: Complete JNI implementation with multi-architecture support
- **Build Scripts**: Automated build system for both platforms
- **Error Handling**: Robust error handling and result types
### **Project Structure**
```
llama-cpp/
├── cpp/ # Complete llama.cpp C++ library
│ ├── ggml.c # GGML core
│ ├── gguf.cpp # GGUF format support
│ ├── llama.cpp # Main llama.cpp implementation
│ ├── rn-llama.cpp # React Native wrapper (adapted)
│ ├── rn-completion.cpp # Completion handling
│ ├── rn-tts.cpp # Text-to-speech
│ └── tools/mtmd/ # Multimodal support
├── ios/
│ ├── CMakeLists.txt # iOS build configuration
│ └── Sources/ # Swift implementation
├── android/
│ ├── src/main/
│ │ ├── CMakeLists.txt # Android build configuration
│ │ ├── jni.cpp # JNI implementation
│ │ └── jni-utils.h # JNI utilities
│ └── build.gradle # Android build config
├── src/
│ ├── definitions.ts # Complete TypeScript interfaces
│ ├── index.ts # Main plugin implementation
│ └── web.ts # Web fallback
└── build-native.sh # Automated build script
```
## 📦 Installation
```sh
npm install llama-cpp-capacitor
```
## 🔨 **Building the Native Library**
The plugin includes a complete native implementation of llama.cpp. To build the native libraries:
### **Prerequisites**
- **CMake** (3.16+ for iOS, 3.10+ for Android)
- **Xcode** (for iOS builds, macOS only)
- **Android Studio** with NDK (for Android builds)
- **Make** or **Ninja** build system
### **Automated Build**
```bash
# Build for all platforms
npm run build:native
# Build for specific platforms
npm run build:ios # iOS only
npm run build:android # Android only
# Clean native builds
npm run clean:native
```
### **Manual Build**
#### **iOS Build**
```bash
cd ios
cmake -B build -S .
cmake --build build --config Release
```
#### **Android Build**
```bash
cd android
./gradlew assembleRelease
```
### **Build Output**
- **iOS**: `ios/build/LlamaCpp.framework/`
- **Android**: `android/src/main/jniLibs/{arch}/libllama-cpp-{arch}.so`
### iOS Setup
1. Install the plugin:
```sh
npm install llama-cpp
```
2. Add to your iOS project:
```sh
npx cap add ios
npx cap sync ios
```
3. Open the project in Xcode:
```sh
npx cap open ios
```
### Android Setup
1. Install the plugin:
```sh
npm install llama-cpp
```
2. Add to your Android project:
```sh
npx cap add android
npx cap sync android
```
3. Open the project in Android Studio:
```sh
npx cap open android
```
## 🎯 Quick Start
### Basic Text Completion
```typescript
import { initLlama } from 'llama-cpp';
// Initialize a model
const context = await initLlama({
model: '/path/to/your/model.gguf',
n_ctx: 2048,
n_threads: 4,
n_gpu_layers: 0,
});
// Generate text
const result = await context.completion({
prompt: "Hello, how are you today?",
n_predict: 50,
temperature: 0.8,
});
console.log('Generated text:', result.text);
```
### Chat-Style Conversations
```typescript
const result = await context.completion({
messages: [
{ role: "system", content: "You are a helpful AI assistant." },
{ role: "user", content: "What is the capital of France?" },
{ role: "assistant", content: "The capital of France is Paris." },
{ role: "user", content: "Tell me more about it." }
],
n_predict: 100,
temperature: 0.7,
});
console.log('Chat response:', result.content);
```
### Streaming Completion
```typescript
let fullText = '';
const result = await context.completion({
prompt: "Write a short story about a robot learning to paint:",
n_predict: 150,
temperature: 0.8,
}, (tokenData) => {
// Called for each token as it's generated
fullText += tokenData.token;
console.log('Token:', tokenData.token);
});
console.log('Final result:', result.text);
```
## 🚀 **Mobile-Optimized Speculative Decoding**
**Achieve 2-8x faster inference with significantly reduced battery consumption!**
Speculative decoding uses a smaller "draft" model to predict multiple tokens ahead, which are then verified by the main model. This results in dramatic speedups with identical output quality.
### Basic Usage
```typescript
import { initLlama } from 'llama-cpp-capacitor';
// Initialize with speculative decoding
const context = await initLlama({
model: '/path/to/your/main-model.gguf', // Main model (e.g., 7B)
draft_model: '/path/to/your/draft-model.gguf', // Draft model (e.g., 1.5B)
// Speculative decoding parameters
speculative_samples: 3, // Number of tokens to predict speculatively
mobile_speculative: true, // Enable mobile optimizations
// Standard parameters
n_ctx: 2048,
n_threads: 4,
});
// Use normally - speculative decoding is automatic
const result = await context.completion({
prompt: "Write a story about AI:",
n_predict: 200,
temperature: 0.7,
});
console.log('🚀 Generated with speculative decoding:', result.text);
```
### Mobile-Optimized Configuration
```typescript
// Recommended mobile setup for best performance/battery balance
const mobileContext = await initLlama({
// Quantized models for mobile efficiency
model: '/models/llama-2-7b-chat.q4_0.gguf',
draft_model: '/models/tinyllama-1.1b-chat.q4_0.gguf',
// Conservative mobile settings
n_ctx: 1024, // Smaller context for mobile
n_threads: 3, // Conservative threading
n_batch: 64, // Smaller batch size
n_gpu_layers: 24, // Utilize mobile GPU
// Optimized speculative decoding
speculative_samples: 3, // 2-3 tokens ideal for mobile
mobile_speculative: true, // Enables mobile-specific optimizations
// Memory optimizations
use_mmap: true, // Memory mapping for efficiency
use_mlock: false, // Don't lock memory on mobile
});
```
### Performance Benefits
- **2-8x faster inference** - Dramatically reduced time to generate text
- **50-80% battery savings** - Less time computing = longer battery life
- **Identical output quality** - Same text quality as regular decoding
- **Automatic fallback** - Falls back to regular decoding if draft model fails
- **Mobile optimized** - Specifically tuned for mobile device constraints
### Model Recommendations
| Model Type | Recommended Size | Quantization | Example |
|------------|------------------|--------------|---------|
| **Main Model** | 3-7B parameters | Q4_0 or Q4_1 | `llama-2-7b-chat.q4_0.gguf` |
| **Draft Model** | 1-1.5B parameters | Q4_0 | `tinyllama-1.1b-chat.q4_0.gguf` |
### Error Handling & Fallback
```typescript
// Robust setup with automatic fallback
try {
const context = await initLlama({
model: '/models/main-model.gguf',
draft_model: '/models/draft-model.gguf',
speculative_samples: 3,
mobile_speculative: true,
});
console.log('✅ Speculative decoding enabled');
} catch (error) {
console.warn('⚠️ Falling back to regular decoding');
const context = await initLlama({
model: '/models/main-model.gguf',
// No draft_model = regular decoding
});
}
```
## 📚 API Reference
### Core Functions
#### `initLlama(params: ContextParams, onProgress?: (progress: number) => void): Promise<LlamaContext>`
Initialize a new llama.cpp context with a model.
**Parameters:**
- `params`: Context initialization parameters
- `onProgress`: Optional progress callback (0-100)
**Returns:** Promise resolving to a `LlamaContext` instance
#### `releaseAllLlama(): Promise<void>`
Release all contexts and free memory.
#### `toggleNativeLog(enabled: boolean): Promise<void>`
Enable or disable native logging.
#### `addNativeLogListener(listener: (level: string, text: string) => void): { remove: () => void }`
Add a listener for native log messages.
### LlamaContext Class
#### `completion(params: CompletionParams, callback?: (data: TokenData) => void): Promise<NativeCompletionResult>`
Generate text completion.
**Parameters:**
- `params`: Completion parameters including prompt or messages
- `callback`: Optional callback for token-by-token streaming
#### `tokenize(text: string, options?: { media_paths?: string[] }): Promise<NativeTokenizeResult>`
Tokenize text or text with images.
#### `detokenize(tokens: number[]): Promise<string>`
Convert tokens back to text.
#### `embedding(text: string, params?: EmbeddingParams): Promise<NativeEmbeddingResult>`
Generate embeddings for text.
#### `rerank(query: string, documents: string[], params?: RerankParams): Promise<RerankResult[]>`
Rank documents by relevance to a query.
#### `bench(pp: number, tg: number, pl: number, nr: number): Promise<BenchResult>`
Benchmark model performance.
### Multimodal Support
#### `initMultimodal(params: { path: string; use_gpu?: boolean }): Promise<boolean>`
Initialize multimodal support with a projector file.
#### `isMultimodalEnabled(): Promise<boolean>`
Check if multimodal support is enabled.
#### `getMultimodalSupport(): Promise<{ vision: boolean; audio: boolean }>`
Get multimodal capabilities.
#### `releaseMultimodal(): Promise<void>`
Release multimodal resources.
### TTS (Text-to-Speech)
#### `initVocoder(params: { path: string; n_batch?: number }): Promise<boolean>`
Initialize TTS with a vocoder model.
#### `isVocoderEnabled(): Promise<boolean>`
Check if TTS is enabled.
#### `getFormattedAudioCompletion(speaker: object | null, textToSpeak: string): Promise<{ prompt: string; grammar?: string }>`
Get formatted audio completion prompt.
#### `getAudioCompletionGuideTokens(textToSpeak: string): Promise<Array<number>>`
Get guide tokens for audio completion.
#### `decodeAudioTokens(tokens: number[]): Promise<Array<number>>`
Decode audio tokens to audio data.
#### `releaseVocoder(): Promise<void>`
Release TTS resources.
### LoRA Adapters
#### `applyLoraAdapters(loraList: Array<{ path: string; scaled?: number }>): Promise<void>`
Apply LoRA adapters to the model.
#### `removeLoraAdapters(): Promise<void>`
Remove all LoRA adapters.
#### `getLoadedLoraAdapters(): Promise<Array<{ path: string; scaled?: number }>>`
Get list of loaded LoRA adapters.
### Session Management
#### `saveSession(filepath: string, options?: { tokenSize: number }): Promise<number>`
Save current session to a file.
#### `loadSession(filepath: string): Promise<NativeSessionLoadResult>`
Load session from a file.
## 🔧 Configuration
### Context Parameters
```typescript
interface ContextParams {
model: string; // Path to GGUF model file
n_ctx?: number; // Context size (default: 512)
n_threads?: number; // Number of threads (default: 4)
n_gpu_layers?: number; // GPU layers (iOS only)
use_mlock?: boolean; // Lock memory (default: false)
use_mmap?: boolean; // Use memory mapping (default: true)
embedding?: boolean; // Embedding mode (default: false)
cache_type_k?: string; // KV cache type for K
cache_type_v?: string; // KV cache type for V
pooling_type?: string; // Pooling type
// ... more parameters
}
```
### Completion Parameters
```typescript
interface CompletionParams {
prompt?: string; // Text prompt
messages?: Message[]; // Chat messages
n_predict?: number; // Max tokens to generate
temperature?: number; // Sampling temperature
top_p?: number; // Top-p sampling
top_k?: number; // Top-k sampling
stop?: string[]; // Stop sequences
// ... more parameters
}
```
## 📱 Platform Support
| Feature | iOS | Android | Web |
|---------|-----|---------|-----|
| Text Generation | ✅ | ✅ | ❌ |
| Chat Conversations | ✅ | ✅ | ❌ |
| Streaming | ✅ | ✅ | ❌ |
| Multimodal | ✅ | ✅ | ❌ |
| TTS | ✅ | ✅ | ❌ |
| LoRA Adapters | ✅ | ✅ | ❌ |
| Embeddings | ✅ | ✅ | ❌ |
| Reranking | ✅ | ✅ | ❌ |
| Session Management | ✅ | ✅ | ❌ |
| Benchmarking | ✅ | ✅ | ❌ |
## 🎨 Advanced Examples
### Multimodal Processing
```typescript
// Initialize multimodal support
await context.initMultimodal({
path: '/path/to/mmproj.gguf',
use_gpu: true,
});
// Process image with text
const result = await context.completion({
messages: [
{
role: "user",
content: [
{ type: "text", text: "What do you see in this image?" },
{ type: "image_url", image_url: { url: "file:///path/to/image.jpg" } }
]
}
],
n_predict: 100,
});
console.log('Image analysis:', result.content);
```
### Text-to-Speech
```typescript
// Initialize TTS
await context.initVocoder({
path: '/path/to/vocoder.gguf',
n_batch: 512,
});
// Generate audio
const audioCompletion = await context.getFormattedAudioCompletion(
null, // Speaker configuration
"Hello, this is a test of text-to-speech functionality."
);
const guideTokens = await context.getAudioCompletionGuideTokens(
"Hello, this is a test of text-to-speech functionality."
);
const audioResult = await context.completion({
prompt: audioCompletion.prompt,
grammar: audioCompletion.grammar,
guide_tokens: guideTokens,
n_predict: 1000,
});
const audioData = await context.decodeAudioTokens(audioResult.audio_tokens);
```
### LoRA Adapters
```typescript
// Apply LoRA adapters
await context.applyLoraAdapters([
{ path: '/path/to/adapter1.gguf', scaled: 1.0 },
{ path: '/path/to/adapter2.gguf', scaled: 0.5 }
]);
// Check loaded adapters
const adapters = await context.getLoadedLoraAdapters();
console.log('Loaded adapters:', adapters);
// Generate with adapters
const result = await context.completion({
prompt: "Test prompt with LoRA adapters:",
n_predict: 50,
});
// Remove adapters
await context.removeLoraAdapters();
```
### Structured Output
#### JSON Schema (Auto-converted to GBNF)
```typescript
const result = await context.completion({
prompt: "Generate a JSON object with a person's name, age, and favorite color:",
n_predict: 100,
response_format: {
type: 'json_schema',
json_schema: {
strict: true,
schema: {
type: 'object',
properties: {
name: { type: 'string' },
age: { type: 'number' },
favorite_color: { type: 'string' }
},
required: ['name', 'age', 'favorite_color']
}
}
}
});
console.log('Structured output:', result.content);
```
#### Direct GBNF Grammar
```typescript
// Define GBNF grammar directly for maximum control
const grammar = `
root ::= "{" ws name_field "," ws age_field "," ws color_field "}"
name_field ::= "\\"name\\"" ws ":" ws string_value
age_field ::= "\\"age\\"" ws ":" ws number_value
color_field ::= "\\"favorite_color\\"" ws ":" ws string_value
string_value ::= "\\"" [a-zA-Z ]+ "\\""
number_value ::= [0-9]+
ws ::= [ \\t\\n]*
`;
const result = await context.completion({
prompt: "Generate a person's profile:",
grammar: grammar,
n_predict: 100
});
console.log('Grammar-constrained output:', result.text);
```
#### Manual JSON Schema to GBNF Conversion
```typescript
import { convertJsonSchemaToGrammar } from 'llama-cpp-capacitor';
const schema = {
type: 'object',
properties: {
name: { type: 'string' },
age: { type: 'number' }
},
required: ['name', 'age']
};
// Convert schema to GBNF grammar
const grammar = await convertJsonSchemaToGrammar(schema);
console.log('Generated grammar:', grammar);
const result = await context.completion({
prompt: "Generate a person:",
grammar: grammar,
n_predict: 100
});
```
## 🔍 Model Compatibility
This plugin supports GGUF format models, which are compatible with llama.cpp. You can find GGUF models on Hugging Face by searching for the "GGUF" tag.
### Recommended Models
- **Llama 2**: Meta's latest language model
- **Mistral**: High-performance open model
- **Code Llama**: Specialized for code generation
- **Phi-2**: Microsoft's efficient model
- **Gemma**: Google's open model
### Model Quantization
For mobile devices, consider using quantized models (Q4_K_M, Q5_K_M, etc.) to reduce memory usage and improve performance.
## ⚡ Performance Considerations
### Memory Management
- Use quantized models for better memory efficiency
- Adjust `n_ctx` based on your use case
- Monitor memory usage with `use_mlock: false`
### GPU Acceleration
- iOS: Set `n_gpu_layers` to use Metal GPU acceleration
- Android: GPU acceleration is automatically enabled when available
### Threading
- Adjust `n_threads` based on device capabilities
- More threads may improve performance but increase memory usage
## 🐛 Troubleshooting
### Common Issues
1. **Model not found**: Ensure the model path is correct and the file exists
2. **Out of memory**: Try using a quantized model or reducing `n_ctx`
3. **Slow performance**: Enable GPU acceleration or increase `n_threads`
4. **Multimodal not working**: Ensure the mmproj file is compatible with your model
### Debugging
Enable native logging to see detailed information:
```typescript
import { toggleNativeLog, addNativeLogListener } from 'llama-cpp';
await toggleNativeLog(true);
const logListener = addNativeLogListener((level, text) => {
console.log(`[${level}] ${text}`);
});
```
## 🤝 Contributing
We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments
- [llama.cpp](https://github.com/ggerganov/llama.cpp) - The core inference engine
- [Capacitor](https://capacitorjs.com/) - The cross-platform runtime
- [llama.rn](https://github.com/mybigday/llama.rn) - Inspiration for the React Native implementation
## 📞 Support
- 📧 Email: support@arusatech.com
- 🐛 Issues: [GitHub Issues](https://github.com/arusatech/llama-cpp/issues)
- 📖 Documentation: [GitHub Wiki](https://github.com/arusatech/llama-cpp/wiki)