# AIWG Model Evaluation Suite
Evaluate local and cloud models for AIWG compatibility across 6 dimensions.
## Quick Start
```bash
cd tools/eval
npm install
npx tsx src/index.ts hermes3:latest --verbose
```
## Dependencies
`tools/eval` depends on `@matric/eval-client` from the private Gitea npm registry. The `.npmrc` in this directory configures the `@matric` scope automatically — `npm install` picks it up without extra setup.
`@matric/eval-client` is a TypeScript client for the Python [matric-eval](https://git.integrolabs.net/roctinam/matric-eval) framework. When the `matric-eval` binary is installed and on `$PATH`, standard benchmark scores (HumanEval, GSM8K, ARC, etc.) can be included alongside AIWG-specific dimension scores.
## Dimensions
| Dimension | Weight | What it tests |
|-----------|--------|---------------|
| Tool Use | 25% | Correct tool selection and parameter formatting |
| Instruction Following | 25% | Constraint adherence, multi-part requests |
| Coding | 20% | Code generation quality and correctness |
| Structured Output | 15% | JSON/YAML/Markdown generation |
| Reasoning | 10% | Task decomposition and planning |
| Context Handling | 5% | Long-context accuracy |
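
As a rough illustration of how the weights in the table could combine, the sketch below computes an overall score as a weighted average of per-dimension scores on a 0-100 scale. The README does not state the exact aggregation formula, so the weighted average is an assumption. `overallScore` and every dimension key other than `tool-use` are hypothetical names chosen for this example.

```typescript
// Weights copied from the table above, as percentages that sum to 100.
// Only "tool-use" appears as an identifier in this README; the other
// keys are assumed slugs used for illustration.
const WEIGHTS = {
  "tool-use": 25,
  "instruction-following": 25,
  "coding": 20,
  "structured-output": 15,
  "reasoning": 10,
  "context-handling": 5,
} as const;

type Dimension = keyof typeof WEIGHTS;

// Hypothetical helper: combines per-dimension scores (0-100) into one
// overall score. Assumes a plain weighted average; the real tool may differ.
function overallScore(scores: Record<Dimension, number>): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    total += WEIGHTS[dim] * scores[dim];
  }
  return total / 100;
}

const exampleOverall = overallScore({
  "tool-use": 80,
  "instruction-following": 90,
  "coding": 70,
  "structured-output": 100,
  "reasoning": 60,
  "context-handling": 40,
});
console.log(exampleOverall); // 79.5
```

Because Tool Use and Instruction Following together carry half the weight, a model that is weak in either one cannot make up much ground with strong reasoning or context handling.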
## Scoring
- **90-100**: opus tier — fully compatible
- **70-89**: sonnet tier — good with minor limitations
- **50-69**: haiku tier — partial compatibility
- **Below 50**: not recommended
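
The bands above can be read as a simple threshold lookup. The sketch below is one way to express that, assuming each band's lower bound is inclusive and that fractional scores (for example 89.5) fall into the band below the next threshold. `tierFor` is a hypothetical name, not a function from the tool.

```typescript
type Tier = "opus" | "sonnet" | "haiku" | "not recommended";

// Hypothetical helper: maps an overall score (0-100) to the tier labels
// listed above. Lower bounds are assumed to be inclusive.
function tierFor(score: number): Tier {
  if (score >= 90) return "opus";
  if (score >= 70) return "sonnet";
  if (score >= 50) return "haiku";
  return "not recommended";
}

console.log(tierFor(92)); // "opus"
console.log(tierFor(69)); // "haiku"
```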
## CLI Options
```bash
npx tsx src/index.ts <model-id> [options]

Options:
  --backend <type>      ollama or api (default: ollama)
  --dimensions <list>   Comma-separated dimensions to evaluate
  --output <format>     json or markdown (default: markdown)
  --ollama-url <url>    Ollama API URL (default: http://localhost:11434)
  --verbose             Show detailed progress
```
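
To make the defaults above concrete, here is a minimal sketch of how these flags could be parsed into a typed options object. It is not the tool's actual parser: `EvalOptions` and `parseOptions` are hypothetical names, and representing an omitted `--dimensions` as `null` is an assumption, since the README does not say what happens when the flag is left out.

```typescript
interface EvalOptions {
  modelId: string;
  backend: "ollama" | "api";
  dimensions: string[] | null; // null: flag omitted (behavior not specified here)
  output: "json" | "markdown";
  ollamaUrl: string;
  verbose: boolean;
}

// Hypothetical parser for the flags documented above.
function parseOptions(argv: string[]): EvalOptions {
  const opts: EvalOptions = {
    modelId: "",
    backend: "ollama",
    dimensions: null,
    output: "markdown",
    ollamaUrl: "http://localhost:11434",
    verbose: false,
  };
  const queue = [...argv];
  let arg: string | undefined;
  while ((arg = queue.shift()) !== undefined) {
    if (arg === "--verbose") {
      opts.verbose = true;
    } else if (arg.startsWith("--")) {
      // Every other flag takes exactly one value.
      const value = queue.shift();
      if (value === undefined) throw new Error(`Missing value for ${arg}`);
      if (arg === "--backend") {
        if (value !== "ollama" && value !== "api") throw new Error(`Invalid backend: ${value}`);
        opts.backend = value;
      } else if (arg === "--dimensions") {
        opts.dimensions = value.split(",").map((d) => d.trim());
      } else if (arg === "--output") {
        if (value !== "json" && value !== "markdown") throw new Error(`Invalid output: ${value}`);
        opts.output = value;
      } else if (arg === "--ollama-url") {
        opts.ollamaUrl = value;
      } else {
        throw new Error(`Unknown option: ${arg}`);
      }
    } else {
      opts.modelId = arg; // the positional <model-id>
    }
  }
  if (opts.modelId === "") throw new Error("Missing <model-id>");
  return opts;
}

const parsed = parseOptions(["hermes3:latest", "--dimensions", "tool-use,coding", "--output", "json"]);
console.log(parsed);
```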
## Adding Test Cases
Test cases live in `datasets/<dimension>/` as YAML files:
```yaml
id: unique-test-id
dimension: tool-use
difficulty: basic
prompt: |
  The prompt sent to the model...
expected:
  tool_calls:
    - tool: Read
      params_contain: { file_path: "example.ts" }
  contains: ["keyword"]
  must_not_contain: ["forbidden"]
  valid_json: true
scoring:
  correct_tool: 0.4
  correct_params: 0.4
  no_hallucination: 0.2
```
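
The `scoring` weights suggest that each check contributes a fraction of the test's total score. The sketch below shows one possible reading for the tool-call checks only. It is an assumption-heavy illustration, not the harness's real logic: `scoreToolUse`, `ToolCall`, and the meaning given to each scoring key (in particular, treating `no_hallucination` as "no tool was called beyond the expected ones") are all hypothetical.

```typescript
interface ToolCall {
  tool: string;
  params: Record<string, string>;
}

interface ExpectedToolCall {
  tool: string;
  params_contain: Record<string, string>;
}

interface ToolUseTestCase {
  expected: { tool_calls: ExpectedToolCall[] };
  scoring: { correct_tool: number; correct_params: number; no_hallucination: number };
}

// Hypothetical scorer: returns the sum of the weights of the checks that pass.
function scoreToolUse(test: ToolUseTestCase, actual: ToolCall[]): number {
  const expected = test.expected.tool_calls;
  // correct_tool: every expected tool was called at least once.
  const toolOk = expected.every((e) => actual.some((a) => a.tool === e.tool));
  // correct_params: some call to that tool has each listed parameter
  // containing the expected substring.
  const paramsOk = expected.every((e) =>
    actual.some(
      (a) =>
        a.tool === e.tool &&
        Object.entries(e.params_contain).every(([key, want]) =>
          (a.params[key] ?? "").includes(want),
        ),
    ),
  );
  // no_hallucination (assumed meaning): no call to a tool that was not expected.
  const noExtra = actual.every((a) => expected.some((e) => e.tool === a.tool));
  const w = test.scoring;
  return (
    (toolOk ? w.correct_tool : 0) +
    (paramsOk ? w.correct_params : 0) +
    (noExtra ? w.no_hallucination : 0)
  );
}

// Mirrors the tool-call portion of the YAML example above.
const exampleCase: ToolUseTestCase = {
  expected: { tool_calls: [{ tool: "Read", params_contain: { file_path: "example.ts" } }] },
  scoring: { correct_tool: 0.4, correct_params: 0.4, no_hallucination: 0.2 },
};

const fullMarks = scoreToolUse(exampleCase, [
  { tool: "Read", params: { file_path: "src/example.ts" } },
]);
console.log(fullMarks);
```

Under this reading, a response that calls `Read` on the wrong file would still earn the `correct_tool` and `no_hallucination` weights but lose `correct_params`.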