@aj-archipelago/cortex

Cortex is a GraphQL API for AI. It provides a simple, extensible interface for using AI services from OpenAI, Azure and others.

# Cortex AutoGen2 Test Cases
# Predefined test tasks for automated quality testing

# Global settings that apply to ALL test cases (can be overridden per test case)
timeout_seconds: 900

# Global expectations that apply to ALL test cases
global_expectations:
  - "CRITICAL FAIL (SCORE=0): If main requested deliverable is missing, score=0 regardless of other content quality"
  - "CRITICAL FAIL (SCORE=0): MULTIPLE FILE DELIVERIES - If task requests multiple specific file types (e.g., 'return pptx & pdf', 'give me chart and CSV', 'Excel and PDF'), ALL requested file types MUST be delivered. Missing ANY requested file type = score 0. Example: Task asks for 'pptx & pdf' but only PPTX delivered = FAIL."
  - "CRITICAL FAIL (SCORE=0): FILES CONTAIN ERROR MESSAGES - Any deliverable containing error messages like 'Error: Unable to Generate', 'generation failed', 'contact admin', or font errors like 'Character at index 0 in text is outside the range of characters supported by the font' will score 0"
  - "All files must have working SAS URLs and proper download links"
  - "Response must be FUN, ENGAGING, and PROFESSIONAL - complete user's request with insightful reply"
  - "MINIMAL BUT DETAILED: Every word adds value, no text dumps or link dumps"
  - "VISUALS FIRST: Rich visuals (charts, images, previews) integrated naturally with explanations"
  - "NO TEXT/LINK DUMPS: Never just list files or dump links - integrate everything into engaging narrative"
  - "EXPLANATIONS WITH VISUALS: Use visuals to explain, not replace explanation - combine both"
  - "COMPLETE THE CONVERSATION: Reply to user's request as if continuing a conversation, not just delivering files"
  - "PROFESSIONAL PRESENTATION: Polished, error-free, consistent styling - what an expert would present"
  - "ENGAGING STORYTELLING: Use data insights, surprising findings, clear narratives to hold attention"
  - "FUN & DELIGHTFUL: Creative visualizations, interesting patterns, enjoyable experience while maintaining professionalism"
  - "File previews should appear before download links when available"
  - "All download links MUST open in new tabs (target='_blank') to prevent users from leaving the site"
  - "VISUALISTIC: Outputs should be visually rich with charts, images, previews, and visual elements that enhance understanding"
  - "ENGAGING: Content should be interesting, compelling, and hold user attention - use data insights, surprising findings, and clear narratives"
  - "PROFESSIONAL: All deliverables must meet professional standards - proper formatting, high-quality visuals, clear structure, polished presentation"
  - "FUN: While maintaining professionalism, outputs should be enjoyable and delightful - creative visualizations, interesting data patterns, engaging storytelling"
  - "CLICKABLE PREVIEWS: Preview images MUST be clickable (wrapped in anchor tags linking to main file download URL). The `<img src>` displays the preview image, but `<a href>` must link to the original deliverable file (PPTX, PDF, CSV, etc.), not the preview image URL. When users click preview images, they should download/open the original file."
  - "PREVIEW IMAGES LINK TO ORIGINALS: When preview images are shown, clicking them MUST open/download the original deliverable file (PPTX, PDF, CSV, etc.), not the preview image itself. The `<img src>` shows the preview, but `<a href>` must link to the original file."
  - "HTML DOWNLOAD LINKS REQUIRED: All download links MUST use HTML `<a href=\"URL\" target=\"_blank\">text</a>` syntax, NOT markdown `[text](URL)` syntax. Markdown links cannot open in new tabs, so HTML is mandatory for proper user experience."
  - "CLEAN FILENAMES: All download links MUST use clean, user-friendly filenames - remove timestamps, hashes, and system-generated prefixes (e.g., 'output_20251107T175133Z_4b471ed1.pdf' → 'output.pdf')"
  - "NO FILLER WORDS: Response must be direct and simple - no filler phrases like 'let me know!', 'if you'd like', 'just let me know', 'feel free to', 'don't hesitate to', or closing pleasantries"
  - "DIRECT & SIMPLE: Get straight to the point - every word must add value, no unnecessary phrases or pleasantries"
  - "ALWAYS SHOW PREVIEWS: If preview images exist, they MUST be displayed (not optional)"
  - "PREFER AZURE UPLOADS: Deliverables should link to Azure SAS URLs generated by the uploader. If you reference external source URLs directly (e.g., vendor PDFs or public dashboards), call that out explicitly; evaluators will deduct a few points instead of treating them as hallucinations as long as the URLs are accessible."
  - "VISUALS ENHANCE UNDERSTANDING: Charts and graphs significantly improve data comprehension when included, providing visual insights into patterns and trends"
  - "OPTIONAL DATA VISUALS: For data tasks, visualizations are encouraged but not strictly required - they enhance understanding when present but don't penalize when absent"
  - "CRITICAL FAIL (SCORE=0): If there's data and the response only describes what's visible without providing insights, patterns, surprises, or key findings, score=0. Data without insights is useless."
  - "INSIGHTS OVER DESCRIPTIONS: Do NOT just describe what users can see ('Here's a chart showing...', 'This visualization displays...'). Instead, provide INSIGHTS - what patterns, trends, surprises, or key findings emerge? Extract details that make users go 'wow'."
  - "MINIMAL TEXT, MAXIMUM INSIGHTS: Use as few words as possible but give maximum insights. Be direct and impactful. Answer 'So what?' - what does this data mean? What should users notice?"
  - "NO TEXT DUMPS: Forbidden phrases like 'Download Data & Visuals', 'Here's the data', 'Download the files below' - integrate links naturally into insightful narrative instead"
  - "PROFESSIONAL INSIGHTS: Responses must be pro, engaging, and insightful. Extract key details, surprising numbers, unexpected patterns, actionable insights. Don't repeat what's visible - provide value through insights."
  - "NO DUPLICATE IMAGES: CRITICAL FAIL (SCORE=0) - Each image URL must appear ONLY ONCE in the output. If the same image URL appears multiple times, score=0. Duplicate images waste space and create poor user experience."
  - "CRITICAL: NO 'DUMP OF IMAGES THEN DUMP OF TEXT' PATTERN - Output must weave images and insights together naturally. FORBIDDEN: Showing all images first, then all text below. REQUIRED: Start with key insight → show image → immediate insightful description → next image → immediate description → continue weaving. This creates natural, expert-level flow."
  - "EVERY IMAGE GETS INSIGHTFUL DESCRIPTION IMMEDIATELY AFTER: Each image (chart, preview, visualization) MUST be followed immediately by an insightful professional description (1-3 sentences). FORBIDDEN: Multiple images in a row without descriptions. FORBIDDEN: Saving all descriptions for the end. Each image needs its description right after it."
  - "START WITH KEY INSIGHT: Response must begin with the most important finding or deliverable, not generic intro like 'Here's your analysis' or 'I've analyzed the data'. Start directly with the key insight (e.g., 'Revenue spiked 40% in Q3, driven entirely by one product line')."
  - "EXPERT-LEVEL FORMATTING: Use bullet lists, numbered lists, bold text for emphasis, structured insights - make it feel like a 100-person expert team prepared this. Polished, comprehensive, but concise. Not verbose but includes every detail and key point."
  - "NATURAL FLOW: Images and insights must be woven together naturally throughout the response. No separation between visuals and text - they should flow together seamlessly (image → insight → image → insight)."
  - "PROACTIVE DATA VISUALIZATION: For data tasks, proactively create multiple charts showing different perspectives. Use various chart types (bar, line, pie, scatter, histogram) to provide comprehensive visual insights into data patterns, trends, and relationships."
  - "MULTIPLE VISUALS FOR DATA RICHNESS: Data analysis tasks benefit from 2-4 charts showing different aspects of the data. Multiple visualizations make data more accessible and provide richer insights than text alone."
  - "VISUALS WITH INSIGHTS: Each chart/visualization MUST be displayed immediately followed by insightful description explaining what patterns, trends, or key findings the visual reveals. Charts without insights are incomplete."
  - "CRITICAL: OUTPUT MUST BE PURE MARKDOWN - NO HTML STRUCTURE (no <!DOCTYPE>, <html>, <head>, <body> tags). Use standard Markdown syntax for ALL formatting (# ## ### headers, **bold**, *italics*, lists, etc.). Use HTML <img> tags ONLY for images and HTML <a> tags ONLY for download links. NO CSS styling, NO <style> tags, NO HTML document structure. Theme compatibility is handled by the UI."

test_cases:
  - id: tc001_random_data
    name: "Random Sales Data Generation"
    description: "Generates random sales data in appropriate format with summary statistics"
    task: "Generate 100 rows of random sales data and create summary statistics in an appropriate format."
    requires_ajsql: false
    expected_deliverables:
      - type: data
        pattern: "*sales*"
        min_count: 1
        description: "Main sales data file"
      - type: data
        pattern: "*summary*"
        min_count: 1
        description: "Summary statistics file"
    min_progress_updates: 3
    quality_criteria:
      - "CRITICAL FAIL (SCORE=0): NO DATA FILES DELIVERED - if requested data files are missing, score=0 regardless of other content"
      - "Main data file contains exactly 100 rows of sales data"
      - "Realistic product names and prices (no generic 'Product1', 'Product2')"
      - "Summary statistics calculated correctly from the main data"
      - "Proper data structure and formatting"
      - "Files uploaded with SAS URLs provided"
      - "CRITICAL: Summary statistics displayed as markdown table preview (NOT just download link)"
      - "Summary table shows key metrics like total sales, average order value, etc."
      - "Response flows naturally with contextual table integration, no generic headers"
      - "Download links styled properly and integrated contextually"
      - "ENCOURAGED: Multiple data visualizations (2-4 charts) showing different perspectives on the sales data would significantly enhance user understanding and provide richer insights"
    expected_agents:
      - planner_agent
      - coder_agent
      - presenter_agent

  - id: tc002_pdf_with_images
    name: "PDF Report with Images and Charts"
    description: "Generates a PDF report with images and charts"
    task: "Generate a PDF report about renewable energy trends in 2026."
    requires_ajsql: false
    expected_deliverables:
      - type: pdf
        pattern: "*.pdf"
        min_count: 1
        description: "PDF report file"
      - type: images
        pattern: "*.png"
        min_count: 5
        description: "Images and charts included in report"
    min_progress_updates: 6
    quality_criteria:
      - "CRITICAL FAIL (SCORE=0): NO PDF FILE DELIVERED - if requested PDF is missing, score=0 regardless of other content"
      - "CRITICAL FAIL (SCORE=0): PDF CONTAINS ERROR MESSAGES - if PDF contains 'generation failed', 'contact admin', or similar error messages instead of actual content, score=0"
      - "PDF contains both text content and images"
      - "Charts and graphs are professionally designed"
      - "Real data used, no placeholder or dummy content"
      - "Proper document formatting with headers and page numbers"
      - "Images are relevant to renewable energy topic"
      - "Preview images or thumbnails provided"
    expected_agents:
      - planner_agent
      - web_search_agent
      - coder_agent
      - presenter_agent

  - id: tc003_pokemon_pptx
    name: "Most Powerful Gen 1 Pokemon PowerPoint"
    description: "Creates a PowerPoint about the most powerful Gen 1 Pokemon with their images and file previews"
    task: "Create a PowerPoint presentation about the top 10 Most Powerful Gen1 Pokemon. Include individual images of each of the top 10 Pokemon directly in the presentation slides."
    requires_ajsql: false
    expected_deliverables:
      - type: pptx
        pattern: "*.pptx"
        min_count: 1
        description: "PowerPoint presentation file"
      - type: preview_images
        pattern: "preview_*.png"
        min_count: 1
        description: "Multiple slide previews showing different Pokemon"
      - type: file_preview
        pattern: "preview_*.pptx.png"
        min_count: 0
        description: "PowerPoint file preview thumbnail (generated but not displayed)"
      - type: images
        pattern: "*.png"
        min_count: 10
        description: "Pokemon images used in presentation"
    min_progress_updates: 8
    quality_criteria:
      - "CRITICAL FAIL (SCORE=0): NO PPTX FILE DELIVERED - if requested PPTX is missing, score=0 regardless of other content"
      - "Presentation includes Pokemon-themed content with visual elements"
      - "Each Pokemon mentioned should have some form of visual representation"
      - "At least 5-15 Pokemon featured with their names and basic information"
      - "Professional slide design with clear, readable text"
      - "Images positioned appropriately and do not interfere with text readability"
      - "Power ranking or basic stats included for Pokemon"
      - "File preview integration when available"
      - "Download links styled properly and contextually placed"
      - "Gen 1 Pokemon focus (original 151, including evolutions)"
    expected_agents:
      - planner_agent
      - web_search_agent
      - coder_agent
      - presenter_agent

  - id: tc004_sports_excel_preview
    name: "Sports Excel File Generation with Preview"
    description: "Generates sports statistics Excel file with summary statistics and preview table"
    task: "Generate an Excel file with 50 rows of sports game statistics and include summary statistics. Create basic charts showing scoring trends and team performance."
    requires_ajsql: false
    expected_deliverables:
      - type: excel
        pattern: "*sports*.xlsx"
        min_count: 1
        description: "Main sports data Excel file"
      - type: excel
        pattern: "*summary*.xlsx"
        min_count: 1
        description: "Summary statistics Excel file"
      - type: chart
        pattern: "*.png"
        min_count: 1
        description: "Basic charts showing scoring trends or performance"
    min_progress_updates: 3
    quality_criteria:
      - "CRITICAL FAIL (SCORE=0): NO EXCEL FILE DELIVERED - if requested Excel is missing, score=0 regardless of other content"
      - "Main Excel contains exactly 50 rows of sports game data"
      - "Dates span the last 60 days"
      - "Realistic sports teams and game statistics"
      - "Summary statistics calculated correctly from the main data"
      - "Proper Excel formatting with headers and data validation"
      - "Files uploaded with SAS URLs provided"
      - "Summary data previewed as markdown table in output"
    expected_agents:
      - planner_agent
      - coder_agent
      - presenter_agent

  - id: tc005_weather_analysis
    name: "Weather Data Analysis with Full Insights"
    description: "Generates weather data with comprehensive statistical analysis and visualizations"
    task: "Generate weather temperature data for 100 cities over 30 days. Provide full statistical analysis including mean, median, standard deviation, correlations, and create insightful charts showing temperature trends and patterns."
    requires_ajsql: false
    expected_deliverables:
      - type: csv
        pattern: "*weather*.csv"
        min_count: 1
        description: "Weather temperature data CSV"
      - type: chart
        pattern: "*.png"
        min_count: 3
        description: "Statistical charts and visualizations"
      - type: excel
        pattern: "*analysis*.xlsx"
        min_count: 1
        description: "Comprehensive analysis with statistics"
    min_progress_updates: 5
    quality_criteria:
      - "CSV contains weather data for 100 cities over 30 days (3000 total records)"
      - "Realistic city names and temperature ranges by region"
      - "Comprehensive statistical analysis (mean, median, std dev, quartiles, correlations)"
      - "Multiple insightful charts: temperature distributions, regional comparisons, time series"
      - "Statistical insights clearly explained (what the numbers mean)"
      - "Data visualizations are informative and well-labeled"
      - "Files uploaded with SAS URLs provided"
      - "Full analysis presented with key findings and interpretations"
    expected_agents:
      - planner_agent
      - coder_agent
      - presenter_agent

  - id: tc006_currency_analysis
    name: "Currency Exchange Rate Analysis - Last 10 Years"
    description: "Fetches real Turkish Lira data and compares with Argentine Peso and US Dollar over 10 years"
    task: "Get the last 10 years of Turkish Lira (TRY) exchange rate data against USD, and compare it with Argentine Peso (ARS) and US Dollar performance. Analyze volatility, depreciation trends, inflation impacts, and create insightful charts showing comparative currency performance."
    requires_ajsql: false
    expected_deliverables:
      - type: csv
        pattern: "*currency*.csv"
        min_count: 1
        description: "Raw currency exchange rate data CSV"
      - type: chart
        pattern: "*.png"
        min_count: 4
        description: "Currency comparison charts and volatility analysis"
      - type: excel
        pattern: "*analysis*.xlsx"
        min_count: 1
        description: "Comprehensive currency analysis with statistics"
    min_progress_updates: 6
    quality_criteria:
      - "Real exchange rate data for TRY, ARS, and USD spanning 10 years"
      - "Data sourced from reliable financial APIs or databases"
      - "Comprehensive analysis including volatility metrics, depreciation rates, inflation correlations"
      - "CRITICAL: Multiple insightful charts created and included in response (NOT 'available on request')"
      - "Economic insights explaining currency movements and macroeconomic factors"
      - "Statistical analysis (correlation, standard deviation, trend analysis)"
      - "Clear explanations of currency crisis periods and economic impacts"
      - "Professional visualizations with proper labeling and time periods"
      - "Files uploaded with SAS URLs provided"
      - "Full comparative analysis presented with key findings and economic interpretations"
      - "CRITICAL: Charts displayed with descriptive captions explaining what they show"
      - "Response flows naturally with contextual chart integration, no generic headers"
      - "Download links styled properly and integrated contextually into narrative"
    expected_agents:
      - planner_agent
      - coder_agent
      - web_search_agent
      - presenter_agent

  - id: tc007_aje_aja_comparison
    name: "AJE vs AJA Daily Article Count Comparison"
    description: "Compares daily article counts between AJE and AJA"
    task: "Compare daily article counts for AJE and AJA from the last 30 days. Give me a chart and CSV."
    requires_ajsql: true
    expected_deliverables:
      - type: chart
        pattern: "*.png"
        min_count: 1
        description: "Comparison chart showing AJE vs AJA daily counts"
      - type: csv
        pattern: "*.csv"
        min_count: 1
        description: "Raw data CSV with daily counts"
    min_progress_updates: 5
    quality_criteria:
      - "CRITICAL DATA ANALYSIS FAILURE (SCORE=0): If response claims AJE has more articles than AJA when data shows AJA > AJE, score=0"
      - "CRITICAL DATA ANALYSIS FAILURE (SCORE=0): If conclusions are inconsistent (e.g., line chart says AJE higher but bar chart says AJA higher), score=0"
      - "Data queried from UCMS AJE and AJA databases"
      - "Exactly 30 days of data (excluding today)"
      - "Chart clearly shows both AJE and AJA trends"
      - "CSV contains date, aje_count, aja_count columns"
      - "No missing dates in the 30-day period"
      - "Professional chart with legend, labels, and title"
      - "Response must correctly identify AJA as having higher article counts than AJE based on actual data"
    expected_agents:
      - planner_agent
      - aj_sql_agent
      - coder_agent
      - presenter_agent

  - id: tc008_aje_trump_trend
    name: "AJE Trump Headlines - 6 Month Trend Analysis"
    description: "Analyzes Trump headline trends in AJE"
    task: "Plot Trump headline percentage trends for AJE over the last 6 months by week. Give me a chart and CSV."
    requires_ajsql: true
    expected_deliverables:
      - type: chart
        pattern: "*.png"
        min_count: 1
        description: "Weekly trend chart with 3 metrics"
      - type: csv
        pattern: "*.csv"
        min_count: 1
        description: "Weekly data CSV"
    min_progress_updates: 6
    quality_criteria:
      - "Data covers full 6 months from UCMS AJE database"
      - "Professional chart(s) showing Trump headline trends and patterns"
      - "Data aggregated by week (ISO weeks recommended)"
      - "CSV contains columns: week, trump_count, total_count, percent_trump"
      - "Case-insensitive Trump matching in headlines"
      - "Clear visualizations with legends and axis labels"
      - "All weeks in 6-month period represented"
    expected_agents:
      - planner_agent
      - aj_sql_agent
      - coder_agent
      - presenter_agent

  - id: tc009_aje_trump_daily
    name: "AJE Trump Headlines - Last Month Daily Chart"
    description: "Daily Trump headline chart for AJE"
    task: "Chart Trump headlines from AJE for the last month by day. Give me the chart and CSV with the headlines."
    requires_ajsql: true
    expected_deliverables:
      - type: chart
        pattern: "*.png"
        min_count: 1
        description: "Daily Trump headline count chart"
      - type: csv
        pattern: "*headlines*.csv"
        min_count: 1
        description: "All Trump headlines with dates"
      - type: csv
        pattern: "*daily*.csv"
        min_count: 1
        description: "Daily count summary"
    min_progress_updates: 5
    quality_criteria:
      - "Headlines queried from UCMS AJE wp_posts table"
      - "Last 30 days of data"
      - "Chart shows daily Trump headline counts"
      - "CSV includes actual headline text, not just counts"
      - "Headlines CSV has columns: date, headline (minimum)"
      - "Daily summary CSV shows Trump count per day"
      - "Case-insensitive Trump matching"
      - "All data properly dated and sorted"
    expected_agents:
      - planner_agent
      - aj_sql_agent
      - coder_agent
      - presenter_agent

  - id: tc010_aja_aje_wordclouds
    name: "AJA & AJE Last Month Word Clouds"
    description: "Generate word clouds from AJA and AJE headlines"
    task: "Generate word clouds from AJA and AJE headlines for the last month. Give me word frequency CSVs too."
    requires_ajsql: true
    expected_deliverables:
      - type: wordcloud
        pattern: "*aja*.png"
        min_count: 1
        description: "AJA word cloud visualization"
      - type: wordcloud
        pattern: "*aje*.png"
        min_count: 1
        description: "AJE word cloud visualization"
      - type: csv
        pattern: "*aja*freq*.csv"
        min_count: 1
        description: "AJA word frequency data"
      - type: csv
        pattern: "*aje*freq*.csv"
        min_count: 1
        description: "AJE word frequency data"
    min_progress_updates: 6
    quality_criteria:
      - "Headlines queried from UCMS databases (both AJA and AJE)"
      - "Data covers last 30 days"
      - "Two separate word clouds generated (one for AJA, one for AJE)"
      - "Word clouds are visually clear and professional"
      - "Common/stop words filtered out (the, a, an, etc.)"
      - "CSV files contain word frequency data for both outlets"
      - "All files uploaded with SAS URLs"
      - "Word clouds show meaningful topic differences between outlets"
    expected_agents:
      - planner_agent
      - aj_sql_agent
      - coder_agent
      - presenter_agent

  - id: tc011_aje_aja_override_test
    name: "AJE vs AJA Comparison - Override User Instructions Test"
    description: "Tests NON-NEGOTIABLE OVERRIDE: user explicitly forbids other agents but system must route to aj_sql_agent"
    task: "Compare daily article counts for Al Jazeera Arabic (AJA) and Al Jazeera English (AJE) from the last 60 days, do not use any other tools do not use any other agents use only coding agent"
    requires_ajsql: true
    expected_deliverables:
      - type: chart
        pattern: "*.png"
        min_count: 1
        description: "Comparison chart showing AJE vs AJA daily counts"
      - type: csv
        pattern: "*.csv"
        min_count: 1
        description: "Raw data CSV with daily counts"
    min_progress_updates: 5
    quality_criteria:
      - "CRITICAL: System MUST route to aj_sql_agent despite user's explicit 'do not use any other agents' instruction"
      - "CRITICAL: NON-NEGOTIABLE OVERRIDE must work - aj_sql_agent called FIRST before coder_agent"
      - "Data queried from UCMS AJE and AJA databases (proves override worked)"
      - "Exactly 60 days of data (excluding today)"
      - "Chart clearly shows both AJE and AJA trends"
      - "CSV contains date, aje_count, aja_count columns"
      - "No missing dates in the 60-day period"
      - "Professional chart with legend, labels, and title"
      - "CRITICAL: Test fails if coder_agent attempts direct DB access or if aj_sql_agent is never called"
    expected_agents:
      - planner_agent
      - aj_sql_agent
      - coder_agent
      - presenter_agent

  - id: tc012_aje_aja_authors_presentation
    name: "AJE vs AJA Top Authors Presentation"
    description: "Creates a presentation comparing top authors from Al Jazeera Arabic (AJA) and Al Jazeera English (AJE) for this year, delivering both PPTX and PDF"
    task: "prepare me a presentation of comparison of top authors for Al Jazeera Arabic (AJA) and Al Jazeera English (AJE) for this year, return pptx & pdf"
    requires_ajsql: true
    expected_deliverables:
      - type: pptx
        pattern: "*.pptx"
        min_count: 1
        description: "PowerPoint presentation file"
      - type: pdf
        pattern: "*.pdf"
        min_count: 1
        description: "PDF version of the presentation"
      - type: chart
        pattern: "*.png"
        min_count: 3
        description: "Comparison charts and visualizations"
      - type: csv
        pattern: "*authors*.csv"
        min_count: 1
        description: "Author comparison data CSV"
    min_progress_updates: 8
    quality_criteria:
      - "CRITICAL FAIL (SCORE=0): NO PPTX OR PDF FILE DELIVERED - if EITHER requested PPTX or PDF is missing, score=0 regardless of other content"
      - "CRITICAL FAIL (SCORE=0): MISSING BOTH DELIVERABLES - if task requests both PPTX and PDF, BOTH must be delivered or score=0"
      - "CRITICAL FAIL (SCORE=0): FILES CONTAIN ERROR MESSAGES - if PPTX or PDF contains 'Error: Unable to Generate', 'generation failed', or similar error messages instead of actual content, score=0"
      - "CRITICAL FAIL (SCORE=0): FONT ERRORS IN FILES - if files contain font errors like 'Character at index 0 in text is outside the range of characters supported by the font', score=0"
      - "Data queried from UCMS AJE and AJA databases for the current year"
      - "Presentation compares top authors from both AJA and AJE with meaningful metrics (article count, engagement, etc.)"
      - "Includes both PPTX and PDF deliverables as explicitly requested"
      - "Professional presentation with comparison charts and insights"
      - "Files uploaded with SAS URLs provided"
      - "Multiple charts showing author comparisons and trends"
      - "Clean filenames without timestamps or hashes"
    expected_agents:
      - planner_agent
      - aj_sql_agent
      - coder_agent
      - presenter_agent

  - id: tc013_debug_year_query
    name: "Debug: AJE vs AJA Top Publish Days (This Year)"
    description: "Reproduces user reported issue where 'this year' is interpreted as only the first week"
    task: "prepare me a presentation of comparison of top publish days for Al Jazeera Arabic (AJA) and Al Jazeera English (AJE), this year"
    requires_ajsql: true
    expected_deliverables:
      - type: chart
        pattern: "*.png"
        min_count: 1
        description: "Comparison chart"
    min_progress_updates: 5
    quality_criteria:
      - "Query covers the full current year (2025)"
      - "Does NOT limit to first week of January"
      - "Presentation puts previews at the END or makes them less intrusive"
      - "No __TASK_COMPLETELY_FINISHED__ in final output"
      - "CRITICAL FAIL (SCORE=0): AJE & AJA article count must be > 0 for top days AND data must show variation (not all days equal) - if all days have identical counts (e.g., all 1s), there are no 'top days' with meaningful variation"
      - "Data validation: Verify non-zero counts for both AJA and AJE"
    expected_agents:
      - planner_agent
      - aj_sql_agent
      - coder_agent
      - presenter_agent

  - id: tc014_us_gdp_by_state
    name: "US States GDP Report"
    description: "Creates a comprehensive GDP by state report with data from web sources"
    task: "Create a report on US state GDP. I need a CSV with the latest GDP data for all states, summary statistics, a bar chart of the top 10 states, and a PDF report."
    requires_ajsql: false
    expected_deliverables:
      - type: csv
        pattern: "*gdp*.csv"
        min_count: 1
        description: "GDP by state CSV file"
      - type: chart
        pattern: "*gdp*.png"
        min_count: 1
        description: "Top 10 states GDP bar chart"
      - type: pdf
        pattern: "*gdp*.pdf"
        min_count: 1
        description: "GDP report PDF"
      - type: data
        pattern: "*summary*.json"
        min_count: 1
        description: "Summary statistics JSON"
    min_progress_updates: 6
    quality_criteria:
      - "CRITICAL FAIL (SCORE=0): NO CSV FILE DELIVERED - if requested CSV is missing, score=0 regardless of other content"
      - "CRITICAL FAIL (SCORE=0): NO PDF FILE DELIVERED - if requested PDF is missing, score=0 regardless of other content"
      - "CRITICAL FAIL (SCORE=0): NO CHART FILE DELIVERED - if requested chart is missing, score=0 regardless of other content"
      - "Data source: GDP data found via web search (not synthetic/fallback data)"
      - "CSV contains GDP data for all 50 US states + DC"
      - "GDP values are realistic (billions USD, reasonable ranges)"
      - "Summary statistics include total, mean, median, standard deviation"
      - "Bar chart shows top 10 states by GDP clearly"
      - "PDF report is multi-page with title, data table, statistics, and chart"
      - "All files uploaded with SAS URLs provided"
      - "Response starts with key insight (e.g., top state, concentration patterns)"
      - "Charts integrated naturally with insights, not dumped at end"
      - "No requests for user to provide API keys or upload files manually"
      - "System finds data autonomously via web search"
      - "Clean filenames without timestamps or hashes"
    expected_agents:
      - planner_agent
      - web_search_agent
      - coder_agent
      - presenter_agent

  - id: tc015_multi_format_analysis
    name: "Multi-Format Tech Company Performance Analysis"
    description: "Creates comprehensive analysis in CSV, JSON, XLSX, and PDF formats"
    task: "Analyze the performance of top 5 tech companies (Apple, Microsoft, Google, Amazon, Tesla) for Q3 2024. Return the analysis in all these formats: CSV, JSON, XLSX, and PDF. Include revenue data, market share, growth metrics, and key insights for each company."
    requires_ajsql: false
    expected_deliverables:
      - type: csv
        pattern: "*.csv"
        min_count: 1
        description: "Raw performance data CSV"
      - type: json
        pattern: "*.json"
        min_count: 1
        description: "Structured performance data JSON"
      - type: excel
        pattern: "*.xlsx"
        min_count: 1
        description: "Analysis spreadsheet with charts"
      - type: pdf
        pattern: "*.pdf"
        min_count: 1
        description: "Professional PDF report"
    min_progress_updates: 8
    quality_criteria:
      - "CRITICAL FAIL (SCORE=0): MISSING ANY REQUESTED FORMAT - if CSV, JSON, XLSX, or PDF is missing, score=0"
      - "Data covers all 5 companies (Apple, Microsoft, Google, Amazon, Tesla) with realistic Q3 2024 metrics"
      - "CSV contains structured tabular data suitable for spreadsheet analysis"
      - "JSON provides hierarchical data structure with nested company details and metrics"
      - "XLSX includes multiple worksheets, formulas, and embedded charts/visualizations"
      - "PDF presents professional report format with tables, insights, and visualizations"
      - "All formats contain consistent data (same revenue figures, market shares, etc.)"
      - "Each format optimized for its use case (CSV for data analysis, JSON for APIs, XLSX for interactive analysis, PDF for sharing)"
      - "Complex hierarchical data (company details, quarterly metrics, competitor comparisons) represented appropriately in each format"
      - "Files uploaded with SAS URLs and clean download links"
      - "Response provides clear navigation between different format downloads"
    expected_agents:
      - planner_agent
      - web_search_agent
      - coder_agent
      - presenter_agent
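
# Template for a new test case (commented out): a minimal sketch, assuming new
# entries reuse the same fields as the test cases above. The id, strings, and
# patterns below are hypothetical placeholders, not a real test. The per-case
# timeout_seconds key is an assumption: the header says global settings can be
# overridden per test case, but this file does not show an override in use.
#
#   - id: tc0XX_example_task              # hypothetical id
#     name: "Example Task"
#     description: "One-line summary of what the test exercises"
#     task: "The exact prompt sent to the system"
#     requires_ajsql: false
#     timeout_seconds: 600                # assumed per-case override of the global 900
#     expected_deliverables:
#       - type: csv
#         pattern: "*example*.csv"        # glob-style pattern, as in the entries above
#         min_count: 1
#         description: "Main data file"
#     min_progress_updates: 3
#     quality_criteria:
#       - "CRITICAL FAIL (SCORE=0): NO CSV FILE DELIVERED - if requested CSV is missing, score=0 regardless of other content"
#     expected_agents:
#       - planner_agent
#       - coder_agent
#       - presenter_agent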