Model Tester

Verified

Test agents or models against predefined test cases to validate model routing, performance, and output quality. Use when: (1) verifying a specific agent or m...

79 downloads

$ Add to .claude/skills/

$ openclaw install

About This Skill

Use `scripts/model_tester.py` to run repeatable test prompts and compare requested vs actual model usage from OpenClaw logs.

Run

From the skill directory (or pass absolute paths):

```bash python3 scripts/model_tester.py --agent menial --case extract-emails python3 scripts/model_tester.py --model openai/gpt-4.1 --case math-reasoning python3 scripts/model_tester.py --agent chat --model openai/gpt-4.1 --case all --out /tmp/model-test.json ```

Inputs

`--agent <name>`: Target agent (chat, menial, coder, etc.)
`--model <name>`: Requested model alias/name to test
`--case <id|all>`: Case from `references/test-cases.json` or `all`
`--timeout <sec>`: Per-case timeout (default `120`)
`--out <file>`: Optional JSON output file

Require at least one of `--agent` or `--model`.

What the runner does

Load test cases from `references/test-cases.json`.
Start `openclaw logs --follow --json` in parallel.
Run `openclaw agent --json` with a bounded test prompt (asks agent to use a subagent for the task).
Parse response + tailed logs.
Emit machine-readable JSON and a short human summary.

Output format

Top-level JSON:

`tool`
`timestamp`
`agent`
`requested_model`
`results[]`

Each result entry returns:

`test_case`
`agent`
`requested_model`
`actual_model` (parsed from logs when available)
`status` (`ok`/`error`)
`result_summary`
`runtime_seconds`
`tokens` (when discoverable)
`errors[]`

Privacy & Safety

The tester spawns isolated subagent tasks with predefined test prompts only — no user data is passed to models. It tails OpenClaw logs to extract:
which model was actually selected (routing validation)
token usage statistics
runtime metrics

Log extraction uses regex patterns to find model/token fields. No personally identifiable information or arbitrary log content is captured — only structured fields related to the test execution.

Notes

Model extraction and token extraction are best-effort because log fields may vary by OpenClaw/provider version.
If `openclaw` config is invalid or gateway is unavailable, the script returns `status=error` with stderr details.
Edit `references/test-cases.json` to add custom prompts for your benchmark set.
All test cases are generic; no workspace or user data is baked in.

Use Cases

Test AI agents against predefined test cases for output validation
Verify model routing and performance across different configurations
Build automated quality assurance pipelines for AI model evaluation
Run regression tests to ensure consistent model behavior after updates
Compare model outputs across providers using standardized test suites

Pros & Cons

Pros

+Compatible with multiple platforms including claude-code, openclaw
+Well-documented with detailed usage instructions and examples
+Open source with permissive licensing

Cons

-Requires API tokens or authentication setup before first use
-No built-in analytics or usage metrics dashboard

FAQ

What does Model Tester do?

Test agents or models against predefined test cases to validate model routing, performance, and output quality. Use when: (1) verifying a specific agent or m...

What platforms support Model Tester?

Model Tester is available on Claude Code, OpenClaw.

What are the use cases for Model Tester?

Test AI agents against predefined test cases for output validation. Verify model routing and performance across different configurations. Build automated quality assurance pipelines for AI model evaluation.

100+ free AI tools

Writing, PDF, image, and developer tools — all in your browser.

AI Humanizer

Make AI text undetectable

AI Detector

Free, unlimited

PDF Tools

Merge, split, compress

Next Step

Use the skill detail page to evaluate fit and install steps. For a direct browser workflow, move into a focused tool route instead of staying in broader support surfaces.

Open Free Tools Try AI Detector