
LLM Evaluator Pro

Verified

LLM-as-a-Judge evaluator via Langfuse. Scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as judge. Supports single trace...

248 downloads
$ Add to .claude/skills/

About This Skill

# LLM Evaluator ⚖️

LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.

When to Use

  • Evaluating quality of search results or AI responses
  • Scoring traces for relevance, accuracy, hallucination detection
  • Batch scoring recent unscored traces
  • Quality assurance on agent outputs

Usage

```bash
# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score <trace_id>

# Score with a specific evaluator only
python3 {baseDir}/scripts/evaluator.py score <trace_id> --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
```

Evaluators

| Evaluator | Measures | Scale |
|-----------|----------|-------|
| relevance | Response relevance to query | 0–1 |
| accuracy | Factual correctness | 0–1 |
| hallucination | Made-up information detection | 0–1 |
| helpfulness | Overall usefulness | 0–1 |
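The skill's internal judge prompts aren't shown on this page, but an LLM-as-a-Judge scorer of this shape typically asks the judge model for a numeric verdict per evaluator and normalizes it to the 0–1 scale above. A minimal sketch of that normalization step (the JSON reply format and the `parse_judge_verdict` helper are illustrative assumptions, not the skill's actual implementation; the judge call itself is stubbed out):

```python
import json

# The four evaluators from the table above
EVALUATORS = ("relevance", "accuracy", "hallucination", "helpfulness")

def parse_judge_verdict(raw: str) -> dict:
    """Parse a judge model's JSON reply into 0-1 scores.

    Missing evaluators default to 0.0; out-of-range values are
    clamped so downstream consumers can rely on the 0-1 scale.
    """
    verdict = json.loads(raw)
    scores = {}
    for name in EVALUATORS:
        value = float(verdict.get(name, 0.0))
        scores[name] = min(1.0, max(0.0, value))  # clamp to [0, 1]
    return scores

# Hypothetical judge reply; note helpfulness is out of range and gets clamped
reply = '{"relevance": 0.9, "accuracy": 0.85, "hallucination": 0.1, "helpfulness": 1.2}'
print(parse_judge_verdict(reply))
# → {'relevance': 0.9, 'accuracy': 0.85, 'hallucination': 0.1, 'helpfulness': 1.0}
```

Clamping and defaulting are worth doing regardless of the exact prompt: judge models occasionally emit scores outside the requested range or drop a key.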

Credits

Built by M. Abidi | agxntsix.ai | YouTube | GitHub. Part of the AgxntSix Skill Suite for OpenClaw agents.


Use Cases

  • Evaluate LLM outputs using LLM-as-a-Judge methodology via Langfuse
  • Score AI traces on relevance, accuracy, hallucination, and helpfulness
  • Run batch evaluations across multiple traces for systematic quality assessment
  • Build automated LLM quality monitoring pipelines with configurable criteria
  • Compare model performance across different prompts and configurations
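The `backfill` command in the Usage section implies a select-then-score loop: fetch recent traces, keep those without scores, and stop at the `--limit`. A minimal sketch of the selection step (the trace dictionaries and field names here are illustrative, not the Langfuse API):

```python
def select_unscored(traces, limit):
    """Return the most recent traces that have no scores yet, capped at `limit`."""
    unscored = [t for t in traces if not t.get("scores")]
    # Most recent first, matching a typical backfill order
    unscored.sort(key=lambda t: t["timestamp"], reverse=True)
    return unscored[:limit]

traces = [
    {"id": "t1", "timestamp": 1, "scores": [0.9]},  # already scored, skipped
    {"id": "t2", "timestamp": 2, "scores": []},
    {"id": "t3", "timestamp": 3, "scores": []},
]
print([t["id"] for t in select_unscored(traces, limit=1)])  # → ['t3']
```

Capping the batch size keeps judge-model cost bounded, which is presumably why the CLI exposes `--limit`.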

Pros & Cons

Pros

  • Compatible with multiple platforms, including Claude Code and OpenClaw
  • Well-documented, with detailed usage instructions and examples
  • Open source with permissive licensing

Cons

  • No built-in analytics or usage-metrics dashboard
  • Configuration may require familiarity with AI and machine-learning concepts

FAQ

What does LLM Evaluator Pro do?
LLM Evaluator Pro is an LLM-as-a-Judge evaluator built on Langfuse. It scores traces on relevance, accuracy, hallucination, and helpfulness using GPT-5-nano as the judge, and supports both single-trace scoring and batch backfill of unscored traces.
What platforms support LLM Evaluator Pro?
LLM Evaluator Pro is available on Claude Code, OpenClaw.
What are the use cases for LLM Evaluator Pro?
Evaluate LLM outputs using LLM-as-a-Judge methodology via Langfuse. Score AI traces on relevance, accuracy, hallucination, and helpfulness. Run batch evaluations across multiple traces for systematic quality assessment.

