# Cortex AutoGen2 Automated Testing Suite
A comprehensive automated testing framework for evaluating and improving AutoGen2 system quality.
## Features
- ✅ **Automated Test Execution**: Run predefined test cases with zero manual intervention
- 📊 **LLM-Based Evaluation**: Scores progress updates (0-100) and final outputs (0-100) using Cortex API
- 📈 **Performance Metrics**: Track latency, update frequency, error rates, and more
- 🗄️ **Log-Based Storage**: Metadata kept in-memory; artifacts and JSONL logs on disk (no database)
- 💡 **Improvement Suggestions**: LLM analyzes failures and suggests code improvements
- 📉 **Trend Analysis**: Detect quality regressions over time
- 🖥️ **CLI Interface**: Easy-to-use command-line tool
## Quick Start
### 1. Prerequisites
Ensure you have:
- Docker running (for the cortex-autogen-function container)
- Redis running (for progress updates)
- An Azure Storage Queue set up (for task submission)
- Environment variables configured (in a `.env` file)
Required environment variables:
```bash
CORTEX_API_KEY=your_key_here
CORTEX_API_BASE_URL=http://localhost:4000/v1
REDIS_CONNECTION_STRING=redis://localhost:6379
REDIS_CHANNEL=cortex_progress
AZURE_STORAGE_CONNECTION_STRING=your_connection_string
AZURE_QUEUE_NAME=cortex-tasks
```
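Before a run, it can help to sanity-check this configuration. A minimal sketch (not part of the suite; the script is hypothetical):

```python
# check_env.py -- illustrative helper that verifies the required
# environment variables from the Quick Start before a test run.
import os
import sys

REQUIRED_VARS = [
    "CORTEX_API_KEY",
    "CORTEX_API_BASE_URL",
    "REDIS_CONNECTION_STRING",
    "REDIS_CHANNEL",
    "AZURE_STORAGE_CONNECTION_STRING",
    "AZURE_QUEUE_NAME",
]

# Collect any variables that are unset or empty.
missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")
```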
### 2. Install Dependencies
The testing suite uses the same dependencies as the main project. No additional installation needed.
### 3. Run Tests
```bash
# Run all test cases
python tests/cli/run_tests.py --all
# Run specific test
python tests/cli/run_tests.py --test tc001_pokemon_pptx
# View test history
python tests/cli/run_tests.py --history --limit 20
# View score trend for a test case
python tests/cli/run_tests.py --trend tc001_pokemon_pptx
```
## Test Cases
The suite includes 3 predefined test cases:
### TC001: Pokemon PPTX Presentation
Creates a professional PowerPoint presentation with Pokemon images. Tests:
- Image collection (10+ images)
- Professional slide design
- Preview image generation
- File upload with SAS URLs
### TC002: PDF Report with Images
Generates a renewable energy PDF report. Tests:
- Web research and image collection
- Chart/graph generation
- PDF formatting
- Document quality
### TC003: Random CSV Generation
Creates realistic sales data CSVs. Tests:
- Data generation
- Statistical calculations
- CSV formatting
- Quick task execution
## Architecture
```
tests/
├── orchestrator.py               # Main test execution engine
├── test_cases.yaml               # Test case definitions
├── database/
│   └── repository.py             # In-memory data access layer (no SQLite)
├── collectors/
│   ├── progress_collector.py     # Redis subscriber for progress updates
│   └── log_collector.py          # Docker log parser
├── evaluators/
│   ├── llm_scorer.py             # LLM-based evaluation
│   └── prompts.py                # Evaluation prompts and rubrics
├── metrics/
│   └── collector.py              # Performance metrics calculation
├── analysis/
│   ├── improvement_suggester.py  # LLM-powered suggestions
│   └── trend_analyzer.py         # Trend and regression detection
└── cli/
    └── run_tests.py              # CLI interface
```
## How It Works
1. **Test Submission**: Test orchestrator submits the task to the Azure Queue (steps 1-3 are sketched after this list)
2. **Data Collection**:
- Progress collector subscribes to Redis for real-time updates
- Log collector streams Docker container logs
3. **Execution Monitoring**: Wait for task completion or timeout
4. **Data Storage**: Store progress updates, logs, and files in per-request log folders (no database)
5. **Metrics Calculation**: Calculate latency, frequency, error counts
6. **LLM Evaluation**:
- Score progress updates (frequency, clarity, accuracy)
- Score final output (completeness, quality, correctness)
7. **Analysis**: Generate improvement suggestions and track trends
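For orientation, here is a simplified sketch of steps 1-3, assuming the `azure-storage-queue` and `redis` Python packages and a JSON update payload with a `progress` field; the real logic lives in `tests/orchestrator.py`, and the function name here is hypothetical:

```python
import json
import os
import time

import redis
from azure.storage.queue import QueueClient

def run_one_test(task: dict, timeout_s: int = 300) -> list[dict]:
    # Step 1: submit the task to the Azure Queue.
    queue = QueueClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"],
        os.environ["AZURE_QUEUE_NAME"],
    )
    queue.send_message(json.dumps(task))

    # Step 2: subscribe to Redis for real-time progress updates.
    client = redis.from_url(os.environ["REDIS_CONNECTION_STRING"])
    pubsub = client.pubsub()
    pubsub.subscribe(os.environ["REDIS_CHANNEL"])

    # Step 3: collect updates until completion or timeout.
    updates, deadline = [], time.time() + timeout_s
    while time.time() < deadline:
        message = pubsub.get_message(ignore_subscribe_messages=True, timeout=1.0)
        if message is None:
            continue
        update = json.loads(message["data"])  # assumed JSON payload
        updates.append(update)
        if update.get("progress") == 100:     # assumed completion signal
            break
    return updates
```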
## Evaluation Criteria
### Progress Updates (0-100)
- **Frequency** (25 pts): Updates every 2-5 seconds is ideal
- **Clarity** (25 pts): Emojis, concise, informative
- **Accuracy** (25 pts): Progress % matches work done
- **Coverage** (25 pts): All important steps communicated
### Final Output (0-100)
- **Completeness** (25 pts): All deliverables present
- **Quality** (25 pts): Professional, polished, no placeholders
- **Correctness** (25 pts): Accurate data, no hallucinations
- **Presentation** (25 pts): SAS URLs, previews, clear results
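The two 0-100 scores roll up into an overall score. A minimal sketch, assuming the overall score is the simple mean of the two (consistent with the example output below, where 88 and 92 yield 90):

```python
def rubric_total(criteria: dict[str, int]) -> int:
    """Sum four criteria, each scored 0-25, into a 0-100 total."""
    assert all(0 <= pts <= 25 for pts in criteria.values())
    return sum(criteria.values())

# Illustrative per-criterion scores; real values come from the LLM evaluator.
progress = rubric_total({"frequency": 22, "clarity": 23, "accuracy": 21, "coverage": 22})
output = rubric_total({"completeness": 24, "quality": 23, "correctness": 23, "presentation": 22})
overall = round((progress + output) / 2)
print(progress, output, overall)  # 88 92 90
```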
## Run Telemetry (No Database)
- Run metadata is held in-memory via `tests/database/repository.py`.
- Detailed evidence lives in the per-request log folders (bind-mounted to `/tmp/coding/req_<id>/logs`):
  - `logs.jsonl`, `messages.jsonl`, `progress_updates.jsonl`
  - `accomplishments.log`, `agent_journey.log`
  - Generated files under `files/`, with SAS URLs captured in logs
- History/trend commands rely on data held by the current process; for long-term records, read the log folders directly (a minimal reader is sketched below).
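A minimal reader for one of these folders, assuming the bind-mount path above and a known request ID:

```python
import json
from pathlib import Path

def read_progress_updates(request_id: str) -> list[dict]:
    """Load the JSONL progress updates for one request's log folder."""
    log_dir = Path(f"/tmp/coding/req_{request_id}/logs")
    updates = []
    with open(log_dir / "progress_updates.jsonl") as f:
        for line in f:
            if line.strip():  # skip blank lines
                updates.append(json.loads(line))
    return updates
```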
## Example Output
```
🧪 Running Test: Pokemon PowerPoint Presentation with Images
   ID: tc001_pokemon_pptx
   Timeout: 300s
📝 Test run created: ID=1, Request=test_tc001_pokemon_pptx_a3f9b12e
✅ Task submitted to queue
📡 Starting data collection...
   Progress: 10% - 📋 Planning task execution...
   Progress: 25% - 🔍 Collecting Pokemon images...
   Progress: 50% - 💻 Creating PowerPoint presentation...
   Progress: 75% - 📸 Generating slide previews...
   Progress: 100% - ✅ Task completed successfully!
✅ Data collection complete
   Progress updates: 12
   Log entries: 45
📊 Calculating metrics...
   Time to completion: 142.3s
   Progress updates: 12
   Files created: 15
   Errors: 0
🤖 Running LLM evaluation...
   Progress Score: 88/100
   Output Score: 92/100
✨ Evaluation complete:
   Progress Score: 88/100
   Output Score: 92/100
   Overall Score: 90/100
✅ Test Complete: Pokemon PowerPoint Presentation with Images
```
## Extending the Suite
### Add New Test Cases
Edit `tests/test_cases.yaml`:
```yaml
test_cases:
- id: tc004_my_new_test
name: "My New Test"
task: "Test task description..."
timeout_seconds: 300
expected_deliverables:
- type: pdf
pattern: "*.pdf"
min_count: 1
min_progress_updates: 5
quality_criteria:
- "Criterion 1"
- "Criterion 2"
```
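A sketch of how such definitions can be loaded with PyYAML; field names beyond the snippet above are assumptions:

```python
import yaml

# Load the test case definitions from the suite's YAML file.
with open("tests/test_cases.yaml") as f:
    config = yaml.safe_load(f)

for case in config["test_cases"]:
    print(case["id"], case["name"], f"timeout={case['timeout_seconds']}s")
```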
### Customize Evaluation
Modify prompts in `tests/evaluators/prompts.py` to change scoring criteria.
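For illustration, a hypothetical rubric prompt in the style `prompts.py` might use; the exact wording and placeholder names are assumptions, derived from the rubric above:

```python
# Hypothetical evaluation prompt; adjust criteria and weights to taste.
PROGRESS_RUBRIC = """Score these progress updates from 0-100:
- Frequency (25 pts): updates every 2-5 seconds is ideal
- Clarity (25 pts): emojis, concise, informative
- Accuracy (25 pts): progress % matches work done
- Coverage (25 pts): all important steps communicated

Updates:
{updates}

Respond with JSON: {{"score": <int>, "reasoning": "<string>"}}"""
```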
### Add New Metrics
Extend `tests/metrics/collector.py` with additional metric-calculation logic.
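For example, a metric one might add (the function name and the numeric `timestamp` field are assumptions about the update schema):

```python
def mean_update_interval(updates: list[dict]) -> float:
    """Average seconds between consecutive progress updates."""
    timestamps = sorted(u["timestamp"] for u in updates)
    if len(timestamps) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return sum(gaps) / len(gaps)
```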
## Troubleshooting
### No progress updates collected
- Check that Redis is running: `redis-cli ping`
- Verify `REDIS_CONNECTION_STRING` in `.env`
- Check that the Docker container is running: `docker ps`
### LLM evaluation fails
- Verify `CORTEX_API_KEY` is set
- Check that `CORTEX_API_BASE_URL` is accessible
- Review logs for API errors
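A quick connectivity probe covering both checks, assuming the environment variables from the Quick Start are set (any HTTP status code proves the endpoint is reachable):

```python
import os

import redis
import requests

# Redis: PING should return True if the server is reachable.
r = redis.from_url(os.environ["REDIS_CONNECTION_STRING"])
print("Redis:", "OK" if r.ping() else "unreachable")

# Cortex API: any HTTP response means the base URL resolves and accepts connections.
resp = requests.get(os.environ["CORTEX_API_BASE_URL"], timeout=10)
print("Cortex API:", resp.status_code)
```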
## Future Enhancements
- [ ] Web dashboard for viewing results
- [ ] CI/CD integration (GitHub Actions)
- [ ] Parallel test execution
- [ ] Screenshot comparison for visual regression
- [ ] Custom test case generator
- [ ] Export reports (PDF, HTML)
- [ ] Slack/email notifications
## License
Part of the Cortex AutoGen2 project.