# AI Features Setup Guide

This guide will help you set up the Google Cloud Vertex AI integration for the `askllm` API.

## Prerequisites

1. **Google Cloud Account**: You need a Google Cloud account with billing enabled
2. **Google Cloud Project**: Create or select a project
3. **Vertex AI API**: Enable the Vertex AI API for your project

## Step-by-Step Setup

### 1. Create Google Cloud Project

```bash
# Using gcloud CLI (optional)
gcloud projects create your-project-id
gcloud config set project your-project-id
```

Or create via [Google Cloud Console](https://console.cloud.google.com/)

### 2. Enable Vertex AI API

```bash
# Using gcloud CLI
gcloud services enable aiplatform.googleapis.com
```

Or enable via [API Library](https://console.cloud.google.com/apis/library/aiplatform.googleapis.com)

### 3. Set Up Authentication

#### Option A: Service Account (Recommended for Production)

1. Go to [IAM & Admin > Service Accounts](https://console.cloud.google.com/iam-admin/serviceaccounts)
2. Click "Create Service Account"
3. Give it a name like "ocr-click-plugin-ai"
4. Grant the role "Vertex AI User"
5. Create and download the JSON key file
6. Set the environment variable:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account.json"
```
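If you prefer to script this step, the same service account can be created with the gcloud CLI. The sketch below is illustrative: the account name `ocr-click-plugin-ai`, the project ID, and the key path are placeholders to replace with your own values.

```bash
# Create the service account (name is a placeholder)
gcloud iam service-accounts create ocr-click-plugin-ai \
  --project=your-project-id \
  --display-name="ocr-click-plugin AI"

# Grant the "Vertex AI User" role (roles/aiplatform.user)
gcloud projects add-iam-policy-binding your-project-id \
  --member="serviceAccount:ocr-click-plugin-ai@your-project-id.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"

# Create and download a JSON key, then point the plugin at it
gcloud iam service-accounts keys create ./service-account.json \
  --iam-account=ocr-click-plugin-ai@your-project-id.iam.gserviceaccount.com

export GOOGLE_APPLICATION_CREDENTIALS="$(pwd)/service-account.json"
```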
#### Option B: Application Default Credentials (For Development)

```bash
gcloud auth application-default login
```

### 4. Set Environment Variables

Create a `.env` file or export these variables:

```bash
# Required
export GOOGLE_PROJECT_ID="your-gcp-project-id"
export GOOGLE_LOCATION="us-central1"    # or your preferred region
export GOOGLE_MODEL="gemini-1.5-flash"  # or gemini-1.5-pro

# If using a service account
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```

### 5. Available Models

Choose one of these models for `GOOGLE_MODEL`:

- `gemini-1.5-flash` - Faster, cheaper, good for most use cases
- `gemini-1.5-pro` - More capable, better for complex analysis
- `gemini-1.0-pro-vision` - Legacy model, still supported

### 6. Available Regions

Common regions for `GOOGLE_LOCATION`:

- `us-central1` (Iowa)
- `us-west1` (Oregon)
- `us-east1` (South Carolina)
- `europe-west1` (Belgium)
- `asia-southeast1` (Singapore)

Check [Vertex AI locations](https://cloud.google.com/vertex-ai/docs/general/locations) for the full list.

## Testing the Setup

### 1. Install Dependencies

```bash
npm install
```

### 2. Build the Plugin

```bash
npm run build
```

### 3. Test Environment

```bash
node -e "
console.log('Project ID:', process.env.GOOGLE_PROJECT_ID);
console.log('Location:', process.env.GOOGLE_LOCATION);
console.log('Model:', process.env.GOOGLE_MODEL);
console.log('Credentials:', process.env.GOOGLE_APPLICATION_CREDENTIALS ? 'Set' : 'Not set');
"
```

### 4. Run the Test Script

```bash
npm run test-askllm
```

## Usage Examples

### Basic AI Query

```javascript
const result = await driver.execute('mobile: askllm', {
  instruction: 'What buttons are visible on this screen?'
});

console.log(result.response.candidates[0].content.parts[0].text);
```

### Complex Analysis

```javascript
const analysis = await driver.execute('mobile: askllm', {
  instruction: 'Analyze this screen and provide: 1) Screen type, 2) Main actions available, 3) Any issues or errors visible'
});
```

## Troubleshooting

### Common Issues

**"Authentication error"**
- Check the `GOOGLE_APPLICATION_CREDENTIALS` path
- Verify the service account has the "Vertex AI User" role
- Try `gcloud auth application-default login`

**"API not enabled"**
- Enable the Vertex AI API: `gcloud services enable aiplatform.googleapis.com`

**"Model not found"**
- Check the model name spelling
- Verify the model is available in your region
- Try `gemini-1.5-flash` as a fallback

**"Location not supported"**
- Use `us-central1` as the default
- Check the [supported regions](https://cloud.google.com/vertex-ai/docs/general/locations)

### Debug Mode

Enable debug logging:

```bash
export DEBUG=1
npm run test-askllm
```

## Cost Considerations

- **gemini-1.5-flash**: ~$0.075 per 1M input tokens, ~$0.30 per 1M output tokens
- **gemini-1.5-pro**: ~$1.25 per 1M input tokens, ~$5.00 per 1M output tokens

Each screenshot is typically 1000-3000 tokens depending on size and complexity.

## Security Best Practices

1. **Never commit credentials** to version control
2. **Use service accounts** with minimal required permissions
3. **Rotate keys regularly**
4. **Monitor usage** in Google Cloud Console
5. **Set up billing alerts** to avoid unexpected charges

## Support

- [Google Cloud Vertex AI Documentation](https://cloud.google.com/vertex-ai/docs)
- [Gemini API Reference](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini)
- [OCR Click Plugin Issues](https://github.com/yourusername/ocr-click-plugin/issues)
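## Putting It Together

For reference, the sketch below combines the pieces above into a single WebdriverIO script. It is only a sketch: the server address, port, and Android capabilities are assumptions for a local emulator, so adjust them for your own device and make sure the Appium server is running with this plugin enabled.

```javascript
// Minimal sketch (assumptions: local Appium server on port 4723, an Android
// emulator driven by UiAutomator2, and this plugin enabled on the server).
const { remote } = require('webdriverio');

(async () => {
  const driver = await remote({
    hostname: 'localhost',
    port: 4723,
    capabilities: {
      platformName: 'Android',
      'appium:automationName': 'UiAutomator2',
      'appium:deviceName': 'emulator-5554', // placeholder device name
    },
  });

  try {
    // askllm sends the current screen to the configured Gemini model
    const result = await driver.execute('mobile: askllm', {
      instruction: 'What buttons are visible on this screen?',
    });
    console.log(result.response.candidates[0].content.parts[0].text);
  } finally {
    await driver.deleteSession();
  }
})();
```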