# AIWG Model Evaluation Suite

Evaluate local and cloud models for AIWG compatibility across 6 dimensions.

## Quick Start

```bash
cd tools/eval
npm install
npx tsx src/index.ts hermes3:latest --verbose
```

## Dependencies

`tools/eval` depends on `@matric/eval-client` from the private Gitea npm registry. The `.npmrc` in this directory configures the `@matric` scope automatically, so `npm install` picks it up without extra setup.

`@matric/eval-client` is a TypeScript client for the Python [matric-eval](https://git.integrolabs.net/roctinam/matric-eval) framework. When the `matric-eval` binary is installed and on `$PATH`, standard benchmark scores (HumanEval, GSM8K, ARC, etc.) can be included alongside AIWG-specific dimension scores.

## Dimensions

| Dimension | Weight | What it tests |
|-----------|--------|---------------|
| Tool Use | 25% | Correct tool selection and parameter formatting |
| Instruction Following | 25% | Constraint adherence, multi-part requests |
| Coding | 20% | Code generation quality and correctness |
| Structured Output | 15% | JSON/YAML/Markdown generation |
| Reasoning | 10% | Task decomposition and planning |
| Context Handling | 5% | Long-context accuracy |

## Scoring

- **90-100**: opus tier, fully compatible
- **70-89**: sonnet tier, good with minor limitations
- **50-69**: haiku tier, partial compatibility
- **Below 50**: not recommended

## CLI Options

```bash
npx tsx src/index.ts <model-id> [options]

Options:
  --backend <type>       ollama or api (default: ollama)
  --dimensions <list>    Comma-separated dimensions to evaluate
  --output <format>      json or markdown (default: markdown)
  --ollama-url <url>     Ollama API URL (default: http://localhost:11434)
  --verbose              Show detailed progress
```

## Adding Test Cases

Test cases live in `datasets/<dimension>/` as YAML files:

```yaml
id: unique-test-id
dimension: tool-use
difficulty: basic
prompt: |
  The prompt sent to the model...
expected:
  tool_calls:
    - tool: Read
      params_contain: { file_path: "example.ts" }
  contains: ["keyword"]
  must_not_contain: ["forbidden"]
  valid_json: true
scoring:
  correct_tool: 0.4
  correct_params: 0.4
  no_hallucination: 0.2
```
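
To make the text-level `expected` keys concrete, here is a minimal sketch of how a response might be checked against `contains`, `must_not_contain`, and `valid_json`. It is not the suite's actual implementation: the `Expected` interface, `checkResponse`, and `isValidJson` are hypothetical names, and case-sensitive substring matching is an assumption.

```typescript
// Hypothetical shape mirroring three of the `expected` keys in the YAML example.
interface Expected {
  contains?: string[];
  must_not_contain?: string[];
  valid_json?: boolean;
}

interface CheckResult {
  passed: boolean;
  failures: string[];
}

// Returns true when the text parses as JSON.
function isValidJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Checks a raw model response against the text-level expectations.
// Assumption: matching is a case-sensitive substring test; the real suite may differ.
function checkResponse(response: string, expected: Expected): CheckResult {
  const failures: string[] = [];
  for (const needle of expected.contains ?? []) {
    if (!response.includes(needle)) failures.push(`missing "${needle}"`);
  }
  for (const banned of expected.must_not_contain ?? []) {
    if (response.includes(banned)) failures.push(`contains forbidden "${banned}"`);
  }
  if (expected.valid_json && !isValidJson(response)) {
    failures.push("response is not valid JSON");
  }
  return { passed: failures.length === 0, failures };
}

const result = checkResponse('{"note": "keyword"}', {
  contains: ["keyword"],
  must_not_contain: ["forbidden"],
  valid_json: true,
});
console.log(result.passed, result.failures);
```

Collecting every failure rather than stopping at the first one keeps the output useful when a single response violates several expectations at once.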
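
The weights in the Dimensions table and the bands under Scoring suggest an overall score formed as a weighted average. The README does not state the formula, so the sketch below is an assumption: each dimension is taken to produce a 0-100 score, and every dimension key except `tool-use` is a hypothetical slug. `overallScore` and `tierFor` are illustrative names, not part of the tool.

```typescript
// Weights copied from the Dimensions table (they sum to 1.0).
// Only "tool-use" appears in the README; the other keys are hypothetical slugs.
const WEIGHTS = {
  "tool-use": 0.25,
  "instruction-following": 0.25,
  "coding": 0.2,
  "structured-output": 0.15,
  "reasoning": 0.1,
  "context-handling": 0.05,
} as const;

type Dimension = keyof typeof WEIGHTS;
type Tier = "opus" | "sonnet" | "haiku" | "not recommended";

// Assumed formula: weighted average of per-dimension scores on a 0-100 scale.
function overallScore(scores: Record<Dimension, number>): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    total += WEIGHTS[dim] * scores[dim];
  }
  return total;
}

// Maps a score to the tiers listed under Scoring.
// Assumption: fractional scores such as 89.5 fall into the lower tier.
function tierFor(score: number): Tier {
  if (score >= 90) return "opus";
  if (score >= 70) return "sonnet";
  if (score >= 50) return "haiku";
  return "not recommended";
}

const example = overallScore({
  "tool-use": 80,
  "instruction-following": 90,
  "coding": 70,
  "structured-output": 60,
  "reasoning": 50,
  "context-handling": 40,
});
console.log(example.toFixed(1), tierFor(example));
```

With these sample inputs the weighted total comes to about 72.5, which lands in the sonnet band. Because Tool Use and Instruction Following carry half the weight between them, weak scores there pull the tier down faster than a weak Context Handling score would.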