# AI Features Setup Guide
This guide walks you through setting up the Google Cloud Vertex AI integration used by the `mobile: askllm` command.
## Prerequisites
1. **Google Cloud Account**: You need a Google Cloud account with billing enabled
2. **Google Cloud Project**: Create or select a project
3. **Vertex AI API**: Enable the Vertex AI API for your project
## Step-by-Step Setup
### 1. Create Google Cloud Project
```bash
# Using gcloud CLI (optional)
gcloud projects create your-project-id
gcloud config set project your-project-id
```
Or create one via the [Google Cloud Console](https://console.cloud.google.com/).
### 2. Enable Vertex AI API
```bash
# Using gcloud CLI
gcloud services enable aiplatform.googleapis.com
```
Or enable it via the [API Library](https://console.cloud.google.com/apis/library/aiplatform.googleapis.com).
### 3. Set Up Authentication
#### Option A: Service Account (Recommended for Production)
1. Go to [IAM & Admin > Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccounts)
2. Click "Create Service Account"
3. Give it a name like "ocr-click-plugin-ai"
4. Grant the role "Vertex AI User"
5. Create and download the JSON key file
6. Set the environment variable:
```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account.json"
```
#### Option B: Application Default Credentials (For Development)
```bash
gcloud auth application-default login
```
### 4. Set Environment Variables
Create a `.env` file or export these variables:
```bash
# Required
export GOOGLE_PROJECT_ID="your-gcp-project-id"
export GOOGLE_LOCATION="us-central1" # or your preferred region
export GOOGLE_MODEL="gemini-1.5-flash" # or gemini-1.5-pro
# If using service account
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```
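If you go the `.env` route, the `dotenv` package is one way to load the file before anything reads these variables (an assumption; your setup may already handle this differently):
```javascript
// Minimal sketch: load .env into process.env, then fail fast on missing config.
require('dotenv').config();

for (const name of ['GOOGLE_PROJECT_ID', 'GOOGLE_LOCATION', 'GOOGLE_MODEL']) {
  if (!process.env[name]) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
}
```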
### 5. Available Models
Choose one of these models for `GOOGLE_MODEL`:
- `gemini-1.5-flash` - Faster, cheaper, good for most use cases
- `gemini-1.5-pro` - More capable, better for complex analysis
- `gemini-1.0-pro-vision` - Legacy model, still supported
### 6. Available Regions
Common regions for `GOOGLE_LOCATION`:
- `us-central1` (Iowa)
- `us-west1` (Oregon)
- `us-east1` (South Carolina)
- `europe-west1` (Belgium)
- `asia-southeast1` (Singapore)
Check the [Vertex AI locations](https://cloud.google.com/vertex-ai/docs/general/locations) page for the full list.
## Testing the Setup
### 1. Install Dependencies
```bash
npm install
```
### 2. Build the Plugin
```bash
npm run build
```
### 3. Test Environment
```bash
node -e "
console.log('Project ID:', process.env.GOOGLE_PROJECT_ID);
console.log('Location:', process.env.GOOGLE_LOCATION);
console.log('Model:', process.env.GOOGLE_MODEL);
console.log('Credentials:', process.env.GOOGLE_APPLICATION_CREDENTIALS ? 'Set' : 'Not set');
"
```
### 4. Run the Test Script
```bash
npm run test-askllm
```
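If the test script is not available in your install, an equivalent manual check can be made directly against Vertex AI with the `@google-cloud/vertexai` Node SDK (a sketch, assuming the SDK is installed and the environment variables above are set):
```javascript
// Sketch: send a trivial prompt to confirm project, region, model, and credentials.
const { VertexAI } = require('@google-cloud/vertexai');

(async () => {
  const vertexAI = new VertexAI({
    project: process.env.GOOGLE_PROJECT_ID,
    location: process.env.GOOGLE_LOCATION,
  });
  const model = vertexAI.getGenerativeModel({ model: process.env.GOOGLE_MODEL });

  const result = await model.generateContent('Reply with the single word: ok');
  console.log(result.response.candidates[0].content.parts[0].text);
})();
```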
## Usage Examples
### Basic AI Query
```javascript
const result = await driver.execute('mobile: askllm', {
instruction: 'What buttons are visible on this screen?'
});
console.log(result.response.candidates[0].content.parts[0].text);
```
### Complex Analysis
```javascript
const analysis = await driver.execute('mobile: askllm', {
instruction: 'Analyze this screen and provide: 1) Screen type, 2) Main actions available, 3) Any issues or errors visible'
});
```
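Responses follow the same candidates/content/parts shape as the basic example, so it can be convenient to wrap the call once. A sketch with a hypothetical helper (`askScreen` is not part of the plugin):
```javascript
// Hypothetical helper: runs askllm and returns the first text part,
// surfacing the auth/model/region errors covered under Troubleshooting.
async function askScreen(driver, instruction) {
  try {
    const result = await driver.execute('mobile: askllm', { instruction });
    return result.response.candidates[0].content.parts[0].text;
  } catch (err) {
    console.error('askllm failed:', err.message);
    throw err;
  }
}

const summary = await askScreen(driver, 'Summarize the visible screen state');
console.log(summary);
```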
## Troubleshooting
### Common Issues
**"Authentication error"**
- Check `GOOGLE_APPLICATION_CREDENTIALS` path
- Verify service account has "Vertex AI User" role
- Try `gcloud auth application-default login` (the standalone credential check after this list can also help isolate the issue)
**"API not enabled"**
- Enable Vertex AI API: `gcloud services enable aiplatform.googleapis.com`
**"Model not found"**
- Check model name spelling
- Verify model is available in your region
- Try `gemini-1.5-flash` as a fallback
**"Location not supported"**
- Use `us-central1` as default
- Check [supported regions](https://cloud.google.com/vertex-ai/docs/general/locations)
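To check credentials on their own, outside the plugin, `google-auth-library` (which the Vertex AI SDK builds on) can confirm which project and token resolve. A sketch:
```javascript
// Sketch: verify that Application Default Credentials resolve to a project and token.
const { GoogleAuth } = require('google-auth-library');

(async () => {
  const auth = new GoogleAuth({
    scopes: 'https://www.googleapis.com/auth/cloud-platform',
  });
  const projectId = await auth.getProjectId();
  const client = await auth.getClient();
  const token = await client.getAccessToken();
  console.log('Resolved project:', projectId);
  console.log('Access token acquired:', Boolean(token && token.token));
})();
```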
### Debug Mode
Enable debug logging:
```bash
export DEBUG=1
npm run test-askllm
```
## Cost Considerations
- **gemini-1.5-flash**: ~$0.075 per 1M input tokens, ~$0.30 per 1M output tokens
- **gemini-1.5-pro**: ~$1.25 per 1M input tokens, ~$5.00 per 1M output tokens
Each screenshot is typically 1000-3000 tokens depending on size and complexity.
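For a rough sense of scale using the approximate rates above: 1,000 screenshot queries at ~2,000 input and ~200 output tokens each come to about 2M input + 0.2M output tokens, i.e. roughly $0.15 + $0.06 ≈ $0.21 with gemini-1.5-flash, or about $2.50 + $1.00 = $3.50 with gemini-1.5-pro.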
## Security Best Practices
1. **Never commit credentials** to version control
2. **Use service accounts** with minimal required permissions
3. **Rotate keys regularly**
4. **Monitor usage** in Google Cloud Console
5. **Set up billing alerts** to avoid unexpected charges
## Support
- [Google Cloud Vertex AI Documentation](https://cloud.google.com/vertex-ai/docs)
- [Gemini API Reference](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini)
- [OCR Click Plugin Issues](https://github.com/yourusername/ocr-click-plugin/issues)