Model Tester
Test agents or models against predefined test cases to validate model routing, performance, and output quality.
About This Skill
Use `scripts/model_tester.py` to run repeatable test prompts and compare requested vs actual model usage from OpenClaw logs.
Run
From the skill directory (or pass absolute paths):
```bash
python3 scripts/model_tester.py --agent menial --case extract-emails
python3 scripts/model_tester.py --model openai/gpt-4.1 --case math-reasoning
python3 scripts/model_tester.py --agent chat --model openai/gpt-4.1 --case all --out /tmp/model-test.json
```
Inputs
- `--agent <name>`: Target agent (chat, menial, coder, etc.)
- `--model <name>`: Requested model alias/name to test
- `--case <id|all>`: Case from `references/test-cases.json` or `all`
- `--timeout <sec>`: Per-case timeout (default `120`)
- `--out <file>`: Optional JSON output file
At least one of `--agent` or `--model` is required.
What the runner does
- Loads test cases from `references/test-cases.json`.
- Starts `openclaw logs --follow --json` in parallel.
- Runs `openclaw agent --json` with a bounded test prompt (the prompt asks the agent to delegate the task to a subagent).
- Parses the response together with the tailed logs.
- Emits machine-readable JSON and a short human-readable summary (a minimal sketch of this loop follows the list).
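A minimal Python sketch of that loop, for orientation only. The two `openclaw` invocations come from the list above, but how the agent name and prompt are passed to `openclaw agent`, and the result field names, are assumptions rather than the script's actual implementation.

```python
import subprocess
import time

def run_case(agent: str, case_id: str, prompt: str, timeout: int = 120) -> dict:
    """Run one test case: tail logs in parallel, invoke the agent, join results."""
    # Tail gateway logs so the actually-selected model can be recovered later.
    tail = subprocess.Popen(
        ["openclaw", "logs", "--follow", "--json"],
        stdout=subprocess.PIPE,
        text=True,
    )
    start = time.monotonic()
    errors = []
    try:
        # ASSUMPTION: the --agent flag and positional prompt are illustrative;
        # only `openclaw agent --json` itself appears in this document.
        run = subprocess.run(
            ["openclaw", "agent", "--json", "--agent", agent, prompt],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        status = "ok" if run.returncode == 0 else "error"
        if status == "error":
            errors.append(run.stderr.strip())
    except subprocess.TimeoutExpired:
        status = "error"
        errors.append(f"timed out after {timeout}s")
    finally:
        tail.terminate()  # stop following; buffered lines remain readable
    log_lines = tail.stdout.read().splitlines() if tail.stdout else []
    return {
        "test_case": case_id,
        "agent": agent,
        "status": status,
        "runtime_seconds": round(time.monotonic() - start, 2),
        "errors": errors,
        "log_lines": log_lines,  # parsed downstream for model/token fields
    }
```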
Output format
Top-level JSON:
- `tool`
- `timestamp`
- `agent`
- `requested_model`
- `results[]`
Each result entry contains:
- `test_case`
- `agent`
- `requested_model`
- `actual_model` (parsed from logs when available)
- `status` (`ok`/`error`)
- `result_summary`
- `runtime_seconds`
- `tokens` (when discoverable)
- `errors[]`
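An illustrative example of the overall shape, using the field names from the lists above. All values are placeholders, and the `tool` value and `tokens` sub-structure are assumptions, not captured from a real run:

```json
{
  "tool": "model_tester",
  "timestamp": "2025-01-01T00:00:00Z",
  "agent": "menial",
  "requested_model": "openai/gpt-4.1",
  "results": [
    {
      "test_case": "extract-emails",
      "agent": "menial",
      "requested_model": "openai/gpt-4.1",
      "actual_model": "openai/gpt-4.1",
      "status": "ok",
      "result_summary": "extracted 3 addresses",
      "runtime_seconds": 4.2,
      "tokens": {"input": 312, "output": 48},
      "errors": []
    }
  ]
}
```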
Privacy & Safety
- The tester spawns isolated subagent tasks with predefined test prompts only; no user data is passed to models. It tails OpenClaw logs to extract:
  - which model was actually selected (routing validation)
  - token usage statistics
  - runtime metrics
Log extraction uses regex patterns to find model/token fields. No personally identifiable information or arbitrary log content is captured — only structured fields related to the test execution.
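A minimal sketch of that extraction in Python. The field names in the patterns are assumptions, since (per the Notes below) log fields vary by OpenClaw/provider version:

```python
import re

# ASSUMPTION: illustrative field names; real log fields vary by version.
MODEL_RE = re.compile(r'"model"\s*:\s*"([^"]+)"')
TOKEN_RE = re.compile(r'"(input|output)_tokens"\s*:\s*(\d+)')

def extract_fields(log_line: str) -> dict:
    """Capture only the structured fields the tester needs from one log line."""
    found = {}
    model = MODEL_RE.search(log_line)
    if model:
        found["actual_model"] = model.group(1)
    tokens = {kind: int(count) for kind, count in TOKEN_RE.findall(log_line)}
    if tokens:
        found["tokens"] = tokens
    return found
```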
Notes
- Model extraction and token extraction are best-effort because log fields may vary by OpenClaw/provider version.
- If the `openclaw` config is invalid or the gateway is unavailable, the script returns `status=error` with stderr details.
- Edit `references/test-cases.json` to add custom prompts for your benchmark set (an illustrative entry is sketched after this list).
- All test cases are generic; no workspace or user data is baked in.
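A plausible shape for a custom entry, reusing a case id from the Run examples above. Every field name here is an assumption; check the existing entries in `references/test-cases.json` for the real schema:

```json
[
  {
    "id": "extract-emails",
    "prompt": "Extract every email address from the following text and return them as a comma-separated list: alice@example.com wrote to bob@example.org."
  }
]
```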
Use Cases
- Test AI agents against predefined test cases for output validation
- Verify model routing and performance across different configurations
- Build automated quality assurance pipelines for AI model evaluation
- Run regression tests to ensure consistent model behavior after updates
- Compare model outputs across providers using standardized test suites
Pros & Cons
Pros
- Compatible with multiple platforms, including claude-code and openclaw
- Well documented, with detailed usage instructions and examples
- Open source with permissive licensing
Cons
- Requires API tokens or authentication setup before first use
- No built-in analytics or usage-metrics dashboard