# OCR Click Plugin
An Appium plugin that uses OCR (Optical Character Recognition) to find and click text elements on mobile device screens. This plugin leverages Tesseract.js for text recognition, Sharp for image enhancement, and Google Cloud Vertex AI for intelligent screen analysis.
## Features
- **Advanced OCR**: Uses Tesseract.js with an optimized configuration for mobile screens
- **Image Enhancement**: Preprocessing with Sharp for better text recognition
- **AI-Powered Analysis**: Google Cloud Vertex AI integration for intelligent screen understanding
- **Confidence Filtering**: Only considers text matches above a configurable confidence threshold
- **Cross-Platform**: Works with both iOS (XCUITest) and Android (UiAutomator2) drivers
- **Configurable**: Customizable OCR parameters and image processing options
- **Detailed Logging**: Progress tracking and confidence scores for debugging
## Installation
### Prerequisites
- Node.js 14+
- Appium 2.x
- iOS/Android drivers installed
- Google Cloud Project with Vertex AI API enabled (for AI features)
### Install the Plugin
```bash
# Clone the repository
git clone <your-repo-url>
cd ocr-click-plugin
# Install dependencies
npm install
# Build the plugin
npm run build
# Install plugin to Appium
npm run install-plugin
```
### Google Cloud Setup (for AI Features)
1. Create a Google Cloud Project
2. Enable the Vertex AI API
3. Set up authentication (Service Account or Application Default Credentials)
4. Set environment variables:
```bash
export GOOGLE_PROJECT_ID="your-project-id"
export GOOGLE_LOCATION="us-central1" # or your preferred location
export GOOGLE_MODEL="gemini-1.5-flash" # or gemini-1.5-pro
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```
### Development Setup
```bash
# Run development server (uninstall, build, install, and start server)
npm run dev
# Or run individual commands
npm run build
npm run reinstall-plugin
npm run run-server
```
## API Endpoints
### 1. Text Click API
Find and click text elements using OCR.
```
POST /session/{sessionId}/appium/plugin/textclick
```
**Parameters:**
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `text` | string | Yes | - | Text to search for and click |
| `index` | number | No | 0 | Index of match to click (if multiple matches found) |
**Response:**
```json
{
"success": true,
"message": "Clicked on text 'Login' at index 0",
"totalMatches": 2,
"confidence": 87.5,
"imageEnhanced": true
}
```
### 2. Text Check API
Check if text is present on screen without clicking.
```
POST /session/{sessionId}/appium/plugin/checktext
```
**Parameters:**
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `text` | string | Yes | Text to search for |
**Response:**
```json
{
"success": true,
"isPresent": true,
"totalMatches": 1,
"searchText": "Submit",
"matches": [
{
"text": "Submit",
"confidence": 92.3,
"coordinates": { "x": 200, "y": 400 },
"bbox": { "x0": 150, "y0": 380, "x1": 250, "y1": 420 }
}
],
"imageEnhanced": true,
"message": "Text 'Submit' found with 1 match(es)"
}
```
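In the sample response above, `coordinates` is the center point of `bbox` (where `x0`/`y0` is the top-left corner and `x1`/`y1` the bottom-right). A minimal sketch of that relationship; `bboxCenter` is a hypothetical helper, not part of the plugin's API:

```javascript
// Derive a tap point from a Tesseract-style bounding box
// (x0, y0 = top-left; x1, y1 = bottom-right).
function bboxCenter(bbox) {
  return {
    x: Math.round((bbox.x0 + bbox.x1) / 2),
    y: Math.round((bbox.y0 + bbox.y1) / 2),
  };
}

// Matches the sample response: bbox {150, 380, 250, 420} -> { x: 200, y: 400 }
console.log(bboxCenter({ x0: 150, y0: 380, x1: 250, y1: 420 }));
```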
### 3. AI Analysis API (NEW)
Analyze screen content using Google Cloud Vertex AI.
```
POST /session/{sessionId}/appium/plugin/askllm
```
**Parameters:**
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `instruction` | string | Yes | Natural language instruction for AI analysis |
**Response:**
```json
{
"success": true,
"instruction": "What buttons are visible on this screen?",
"response": {
"candidates": [
{
"content": {
"parts": [
{
"text": "I can see several buttons on this screen: 'Login', 'Sign Up', 'Forgot Password', and 'Help'. The Login button appears to be the primary action button."
}
]
}
}
]
},
"message": "AI analysis completed successfully"
}
```
## Usage Examples
### Mobile Commands (Recommended)
```javascript
// JavaScript/TypeScript (WebdriverIO)
const { remote } = require('webdriverio');

const driver = await remote({ capabilities });
// Click text using mobile command
await driver.execute('mobile: textclick', { text: 'Login', index: 0 });
// Check if text exists
const result = await driver.execute('mobile: checktext', { text: 'Welcome' });
console.log(result.isPresent); // true/false
// AI screen analysis
const aiResult = await driver.execute('mobile: askllm', {
instruction: 'What are the main actions a user can take on this screen?'
});
console.log(aiResult.response.candidates[0].content.parts[0].text);
```
### Java Examples
```java
import io.appium.java_client.android.AndroidDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class OCRClickExample {
    public static void main(String[] args) throws Exception {
        DesiredCapabilities capabilities = new DesiredCapabilities();
        capabilities.setCapability("platformName", "Android");
        capabilities.setCapability("appium:automationName", "UiAutomator2");
        AndroidDriver driver = new AndroidDriver(new URL("http://localhost:4723/wd/hub"), capabilities);

        // Click text
        Map<String, Object> clickParams = new HashMap<>();
        clickParams.put("text", "Submit");
        clickParams.put("index", 0);
        Object result = driver.executeScript("mobile: textclick", clickParams);

        // Check text presence
        Map<String, Object> checkParams = new HashMap<>();
        checkParams.put("text", "Error");
        Object checkResult = driver.executeScript("mobile: checktext", checkParams);

        // AI analysis
        Map<String, Object> aiParams = new HashMap<>();
        aiParams.put("instruction", "Describe the layout and main elements of this screen");
        Object aiResult = driver.executeScript("mobile: askllm", aiParams);
        System.out.println("AI Response: " + aiResult);
    }
}
```
### Python Examples
```python
from appium import webdriver
from appium.options.android import UiAutomator2Options

options = UiAutomator2Options()
driver = webdriver.Remote('http://localhost:4723/wd/hub', options=options)
# Click text
result = driver.execute_script('mobile: textclick', {'text': 'Login'})
print(f"Click result: {result}")
# Check text
check_result = driver.execute_script('mobile: checktext', {'text': 'Welcome'})
print(f"Text present: {check_result['isPresent']}")
# AI analysis
ai_result = driver.execute_script('mobile: askllm', {
'instruction': 'What form fields are visible and what information do they require?'
})
print(f"AI Analysis: {ai_result['response']['candidates'][0]['content']['parts'][0]['text']}")
```
### Direct HTTP API
```bash
# Text click
curl -X POST http://localhost:4723/wd/hub/session/{sessionId}/appium/plugin/textclick \
-H "Content-Type: application/json" \
-d '{"text": "Sign Up", "index": 0}'
# Text check
curl -X POST http://localhost:4723/wd/hub/session/{sessionId}/appium/plugin/checktext \
-H "Content-Type: application/json" \
-d '{"text": "Error Message"}'
# AI analysis
curl -X POST http://localhost:4723/wd/hub/session/{sessionId}/appium/plugin/askllm \
-H "Content-Type: application/json" \
-d '{"instruction": "What are the key UI elements and their purposes on this screen?"}'
```
## AI Analysis Use Cases
The `askllm` API enables powerful screen analysis capabilities:
### Screen Understanding
```javascript
await driver.execute('mobile: askllm', {
instruction: 'Describe the main purpose of this screen and its key components'
});
```
### Element Identification
```javascript
await driver.execute('mobile: askllm', {
instruction: 'List all clickable buttons and their likely functions'
});
```
### Form Analysis
```javascript
await driver.execute('mobile: askllm', {
instruction: 'What form fields are present and what type of information do they expect?'
});
```
### Error Detection
```javascript
await driver.execute('mobile: askllm', {
instruction: 'Are there any error messages or warnings visible on this screen?'
});
```
### Navigation Guidance
```javascript
await driver.execute('mobile: askllm', {
instruction: 'How would a user navigate to the settings page from this screen?'
});
```
## Environment Variables
### Required for AI Features
```bash
# Google Cloud Configuration
GOOGLE_PROJECT_ID=your-gcp-project-id
GOOGLE_LOCATION=us-central1
GOOGLE_MODEL=gemini-1.5-flash
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
# Alternative: Use gcloud CLI authentication
# gcloud auth application-default login
```
### Optional Configuration
```bash
# OCR Configuration
OCR_CONFIDENCE_THRESHOLD=60
OCR_LANGUAGE=eng
# Image Processing
ENABLE_IMAGE_ENHANCEMENT=true
SHARP_IGNORE_GLOBAL_LIBVIPS=1
```
## Configuration
### OCR Settings
The plugin uses an optimized Tesseract configuration:
```typescript
const TESSERACT_CONFIG = {
lang: 'eng',
tessedit_char_whitelist: 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 .,!?-_@#$%^&*()',
tessedit_pageseg_mode: '6', // Uniform text block
preserve_interword_spaces: '1',
// ... other optimizations
};
```
### Confidence Threshold
Default minimum confidence threshold is 60%. Words below this confidence are filtered out:
```typescript
const MIN_CONFIDENCE_THRESHOLD = 60;
```
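The effect of the threshold can be sketched as follows, assuming Tesseract.js-style word objects with `text` and `confidence` fields; the sample data is illustrative only:

```javascript
const MIN_CONFIDENCE_THRESHOLD = 60;

// Words below the threshold are dropped before any text matching happens.
const words = [
  { text: 'Login', confidence: 87.5 },
  { text: 'Lcgin', confidence: 41.2 }, // noisy low-confidence read, filtered out
];

const usable = words.filter((w) => w.confidence >= MIN_CONFIDENCE_THRESHOLD);
console.log(usable.map((w) => w.text)); // [ 'Login' ]
```

Lowering the threshold admits more candidate words at the cost of more false positives, which is why the troubleshooting section suggests adjusting it when text is not detected.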
### Image Enhancement
The plugin applies several image processing steps:
1. **Grayscale conversion** - Reduces noise
2. **Normalization** - Enhances contrast
3. **Sharpening** - Improves text clarity
4. **Gamma correction** - Better text contrast
5. **Median filtering** - Removes noise
6. **Binary thresholding** - Clear text separation
## Troubleshooting
### Google Cloud Setup Issues
**Authentication Error:**
```bash
# Set up application default credentials
gcloud auth application-default login
# Or use service account
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
```
**API Not Enabled:**
```bash
# Enable Vertex AI API
gcloud services enable aiplatform.googleapis.com
```
**Model Not Available:**
Try different model names:
- `gemini-1.5-flash` (faster, cheaper)
- `gemini-1.5-pro` (more capable)
- `gemini-1.0-pro-vision` (legacy)
### Sharp Installation Issues
If you encounter Sharp compilation errors during installation, especially with Node.js v24+:
```bash
# Method 1: Use environment variable
SHARP_IGNORE_GLOBAL_LIBVIPS=1 npm install ocr-click-plugin
# Method 2: Install Sharp separately first
SHARP_IGNORE_GLOBAL_LIBVIPS=1 npm install --include=optional sharp
npm install ocr-click-plugin
# Method 3: For Appium plugin installation
SHARP_IGNORE_GLOBAL_LIBVIPS=1 appium plugin install ocr-click-plugin
```
### Text Not Found
- **Check confidence threshold**: Lower `MIN_CONFIDENCE_THRESHOLD` if text is not being detected
- **Verify text spelling**: Ensure exact text match (case-insensitive)
- **Check image quality**: Poor screenshots may affect OCR accuracy
### Inconsistent Results
- **Image enhancement**: The plugin includes advanced preprocessing to improve consistency
- **Confidence filtering**: Only high-confidence matches are considered
- **Character whitelist**: Limits recognition to expected characters
### Performance Issues
- **Reduce image size**: Large screenshots take longer to process
- **Optimize configuration**: Adjust Tesseract parameters for your use case
- **Check device performance**: Ensure adequate resources
## Development
### Project Structure
```
ocr-click-plugin/
├── src/
│   └── index.ts        # Main plugin implementation
├── dist/               # Compiled JavaScript
├── package.json        # Dependencies and scripts
├── tsconfig.json       # TypeScript configuration
└── README.md           # This file
```
### Building
```bash
npm run build
```
### Testing
```bash
npm test
```
### Available Scripts
```bash
npm run dev # Full development workflow
npm run build # Compile TypeScript
npm run install-plugin # Install to Appium
npm run reinstall-plugin # Uninstall and reinstall
npm run run-server # Start Appium server
npm run uninstall # Remove from Appium
```
## Technical Details
### Dependencies
- **@appium/base-plugin**: Appium plugin framework
- **tesseract.js**: OCR engine
- **sharp**: Image processing
- **typescript**: Development language
### Supported Platforms
- ✅ Android (UiAutomator2)
- ✅ iOS (XCUITest)
### Image Processing Pipeline
1. Capture screenshot via Appium driver
2. Convert to grayscale for better OCR
3. Apply normalization and sharpening
4. Gamma correction for text contrast
5. Noise reduction with median filter
6. Binary threshold for clear text separation
7. OCR recognition with Tesseract
8. Confidence filtering and text matching
9. Coordinate calculation and click action
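Steps 8 and 9 can be sketched end to end. This is a simplified illustration, not the plugin's actual code; the word objects are modeled on Tesseract.js output and the matching mirrors the documented behavior (confidence filter, case-insensitive match, index selection, center-of-bbox tap point):

```javascript
const MIN_CONFIDENCE = 60;

// Filter by confidence, match text case-insensitively, select by index,
// and compute the tap point from the matched bounding box.
function resolveClickPoint(words, searchText, index = 0) {
  const matches = words.filter(
    (w) =>
      w.confidence >= MIN_CONFIDENCE &&
      w.text.toLowerCase() === searchText.toLowerCase()
  );
  if (index >= matches.length) return null; // no match at that index
  const { bbox } = matches[index];
  return {
    x: Math.round((bbox.x0 + bbox.x1) / 2),
    y: Math.round((bbox.y0 + bbox.y1) / 2),
    totalMatches: matches.length,
  };
}

const words = [
  { text: 'Submit', confidence: 92.3, bbox: { x0: 150, y0: 380, x1: 250, y1: 420 } },
  { text: 'Cancel', confidence: 88.1, bbox: { x0: 300, y0: 380, x1: 400, y1: 420 } },
];
console.log(resolveClickPoint(words, 'submit')); // { x: 200, y: 400, totalMatches: 1 }
```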
## Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the ISC License - see the LICENSE file for details.
## Changelog
### Version 1.0.0
- Initial release with OCR text detection and clicking
- Advanced image preprocessing for better accuracy
- Confidence-based filtering for consistent results
- Support for multiple text matches with index selection
- Comprehensive logging and error handling