
LLM Evaluator Pro

Verified

LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace...

248 downloads
$ Add to .claude/skills/

About This Skill

# LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

When to Use

  • Evaluating quality of search results or AI responses
  • Scoring traces for relevance, accuracy, hallucination detection
  • Batch scoring recent unscored traces
  • Quality assurance on agent outputs

Usage

```bash
# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with a specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
```

Evaluators

| Evaluator | Measures | Scale |
|-----------|----------|-------|
| relevance | Response relevance to query | 0–1 |
| accuracy | Factual correctness | 0–1 |
| hallucination | Made-up information detection | 0–1 |
| helpfulness | Overall usefulness | 0–1 |
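The skill's internal judge prompts aren't shown on this page, but an LLM-as-a-Judge scorer of this shape typically asks the judge model for a numeric verdict per evaluator and normalizes it to the 0–1 scale above. A minimal sketch of that normalization step (the JSON reply format and the `parse_judge_verdict` helper are illustrative assumptions, not the skill's actual implementation; the judge call itself is stubbed out):

```python
import json

# The four evaluators from the table above
EVALUATORS = ("relevance", "accuracy", "hallucination", "helpfulness")

def parse_judge_verdict(raw: str) -> dict:
    """Parse a judge model's JSON reply into 0-1 scores.

    Missing evaluators default to 0.0; out-of-range values are
    clamped so downstream consumers can rely on the 0-1 scale.
    """
    verdict = json.loads(raw)
    scores = {}
    for name in EVALUATORS:
        value = float(verdict.get(name, 0.0))
        scores[name] = min(1.0, max(0.0, value))  # clamp to [0, 1]
    return scores

# Hypothetical judge reply; note helpfulness is out of range and gets clamped
reply = '{"relevance": 0.9, "accuracy": 0.85, "hallucination": 0.1, "helpfulness": 1.2}'
print(parse_judge_verdict(reply))
# → {'relevance': 0.9, 'accuracy': 0.85, 'hallucination': 0.1, 'helpfulness': 1.0}
```

Clamping and defaulting are worth doing regardless of the exact prompt: judge models occasionally emit scores outside the requested range or drop a key.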

Credits

Built by M. Abidi | agxntsix.ai | YouTube | GitHub. Part of the AgxntSix Skill Suite for OpenClaw agents.


Use Cases

  • Evaluate LLM outputs using LLM-as-a-Judge methodology via Langfuse
  • Score AI traces on relevance, accuracy, hallucination, and helpfulness
  • Run batch evaluations across multiple traces for systematic quality assessment
  • Build automated LLM quality monitoring pipelines with configurable criteria
  • Compare model performance across different prompts and configurations
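The `backfill` command in the Usage section implies a select-then-score loop: fetch recent traces, keep those without scores, and stop at the `--limit`. A minimal sketch of the selection step (the trace dictionaries and field names here are illustrative, not the Langfuse API):

```python
def select_unscored(traces, limit):
    """Return the most recent traces that have no scores yet, capped at `limit`."""
    unscored = [t for t in traces if not t.get("scores")]
    # Most recent first, matching a typical backfill order
    unscored.sort(key=lambda t: t["timestamp"], reverse=True)
    return unscored[:limit]

traces = [
    {"id": "t1", "timestamp": 1, "scores": [0.9]},  # already scored, skipped
    {"id": "t2", "timestamp": 2, "scores": []},
    {"id": "t3", "timestamp": 3, "scores": []},
]
print([t["id"] for t in select_unscored(traces, limit=1)])  # → ['t3']
```

Capping the batch size keeps judge-model cost bounded, which is presumably why the CLI exposes `--limit`.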

Pros & Cons

Pros

  • Compatible with multiple platforms, including Claude Code and OpenClaw
  • Well-documented, with detailed usage instructions and examples
  • Open source with permissive licensing

Cons

  • No built-in analytics or usage-metrics dashboard
  • Configuration may require familiarity with AI and machine-learning concepts

FAQ

What does LLM Evaluator Pro do?
LLM Evaluator Pro is an LLM-as-a-Judge evaluator built on Langfuse. It scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as the judge, and supports both single-trace scoring and batch backfill of unscored traces.
What platforms support LLM Evaluator Pro?
LLM Evaluator Pro is available on Claude Code, OpenClaw.
What are the use cases for LLM Evaluator Pro?
Evaluate LLM outputs using LLM-as-a-Judge methodology via Langfuse. Score AI traces on relevance, accuracy, hallucination, and helpfulness. Run batch evaluations across multiple traces for systematic quality assessment.

