# Peer Review — Local LLM Critique Layer

> Hypothesis: Local LLMs can catch ≥30% of real errors in cloud output with <50% false positive rate.

Multi-model peer review layer using local LLMs via Ollama to catch errors in cloud model output. Fan out critiques to 2–3 local models, aggregate flags, synthesize consensus.

**Use when:** validating trade analyses, reviewing agent output quality, testing local model accuracy, checking any high-stakes Claude output before publishing or acting on it.

**Don't use when:** simple fact-checking (just search the web), tasks that don't benefit from multi-model consensus, time-critical decisions where 60s latency is unacceptable, reviewing trivial or low-stakes content.

**Negative examples:**
- "Check if this date is correct" → No. Just web search it.
- "Review my grocery list" → No. Not worth multi-model inference.
- "I need this answer in 5 seconds" → No. Peer review adds 30–60s latency.

**Edge cases:**
- Short text (<50 words) → Models may not find meaningful issues. Consider skipping.
- Highly technical domain → Local models may lack domain knowledge. Weight flags lower.
- Creative writing → Factual review doesn't apply well. Use only for logical consistency.
---
## Architecture
```
Cloud Model (Claude) produces analysis
                │
                ▼
┌────────────────────────┐
│  Peer Review Fan-Out   │
├────────────────────────┤
│ Drift (Mistral 7B)     │──► Critique A
│ Pip (TinyLlama 1.1B)   │──► Critique B
│ Lume (Llama 3.1 8B)    │──► Critique C
└────────────────────────┘
                │
                ▼
Aggregator (consensus logic)
                │
                ▼
Final: original + flagged issues
```
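A minimal fan-out sketch, assuming Ollama is serving its default REST API on `localhost:11434` and each critique is written to its own file. The inline prompt and file names are illustrative; this is not the skill's actual `scripts/peer-review.sh`.

```bash
#!/usr/bin/env bash
# Fan the same cloud output to each local reviewer and save one critique per
# model. Assumes `ollama serve` is running with the three models pulled.
set -euo pipefail

INPUT_FILE="$1"                # analysis produced by the cloud model
OUT_DIR="${2:-critiques}"
mkdir -p "$OUT_DIR"

MODELS=("mistral:7b" "tinyllama:1.1b" "llama3.1:8b")   # Drift, Pip, Lume

for model in "${MODELS[@]}"; do
  prompt="You are a skeptical reviewer. Analyze the following text for errors.

TEXT:
---
$(cat "$INPUT_FILE")
---"
  # Non-streamed completion from Ollama's /api/generate; keep only the text.
  jq -n --arg model "$model" --arg prompt "$prompt" \
        '{model: $model, prompt: $prompt, stream: false}' |
    curl -s http://localhost:11434/api/generate -d @- |
    jq -r '.response' > "$OUT_DIR/${model//:/_}.critique.txt" &
done
wait   # the three reviews run in parallel; aggregation happens afterwards
```

Running the three requests in parallel keeps wall-clock latency near the slowest model rather than the sum, which matters given the 30–60s budget noted above.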
---
## Swarm Bot Roles
| Bot | Model | Role | Strengths |
|-----|-------|------|-----------|
| Drift 🌊 | Mistral 7B | Methodical analyst | Structured reasoning, catches logical gaps |
| Pip 🐣 | TinyLlama 1.1B | Fast checker | Quick sanity checks, low latency |
| Lume 💡 | Llama 3.1 8B | Deep thinker | Nuanced analysis, catches subtle issues |
---
## Scripts
| Script | Purpose |
|--------|---------|
| `scripts/peer-review.sh` | Send single input to all models, collect critiques |
| `scripts/peer-review-batch.sh` | Run peer review across a corpus of samples |
| `scripts/seed-test-corpus.sh` | Generate seeded error corpus for testing |
### Usage
```bash
# Single file review
bash scripts/peer-review.sh <input_file> [output_dir]

# Batch review
bash scripts/peer-review-batch.sh <corpus_dir> [results_dir]

# Generate test corpus
bash scripts/seed-test-corpus.sh [count] [output_dir]
```
Scripts live at `workspace/scripts/`; they are not bundled in the skill to avoid duplication.
---
## Critique Prompt Template
```
You are a skeptical reviewer. Analyze the following text for errors.

For each issue found, output JSON:
{"category": "factual|logical|missing|overconfidence|hallucinated_source", "quote": "...", "issue": "...", "confidence": 0-100}

If no issues found, output: {"issues": []}

TEXT:
---
{cloud_output}
---
```
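If the template is kept in a file, rendering it is a single bash substitution, and imperfect JSON from a small model can be tolerated rather than failing the review. The file names, model tag, and fallback behaviour below are assumptions for illustration.

```bash
# Render the critique prompt by replacing the {cloud_output} placeholder,
# query one model, and fall back to an empty flag list if the reply is not
# valid JSON. critique-prompt.txt holds the template above (illustrative).
template="$(cat critique-prompt.txt)"
analysis="$(cat analysis.md)"                      # cloud output under review
prompt="${template/'{cloud_output}'/"$analysis"}"

reply="$(jq -n --arg m "mistral:7b" --arg p "$prompt" \
              '{model: $m, prompt: $p, stream: false}' |
         curl -s http://localhost:11434/api/generate -d @- |
         jq -r '.response')"

# Small local models sometimes wrap or truncate the JSON; treat anything
# unparsable as "no issues found" rather than aborting the review.
echo "$reply" | jq -e . >/dev/null 2>&1 || reply='{"issues": []}'
printf '%s\n' "$reply"
```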
---
## Error Categories
| Category | Description | Example |
|----------|-------------|---------|
| factual | Wrong numbers, dates, names | "Bitcoin launched in 2010" |
| logical | Non-sequiturs, unsupported conclusions | "X is rising, therefore Y will fall" |
| missing | Important context omitted | Ignoring a major counterargument |
| overconfidence | Certainty without justification | "This will definitely happen" on 55% event |
| hallucinated_source | Citing nonexistent sources | "According to a 2024 Reuters report..." |
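Assuming each model's flags are collected into one JSON array per file (a `critiques/*.json` layout, which is an assumption), a short jq pass tallies flags by category and makes it obvious whether a review is dominated by, say, overconfidence flags.

```bash
# Tally collected flags by category. Assumes each critiques/*.json file is an
# array of flag objects in the format requested by the prompt template.
jq -s '[.[][]] | group_by(.category) | map({category: .[0].category, count: length})' \
   critiques/*.json
```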
---
## Discord Workflow
- Post analysis to #the-deep (or #swarm-lab)
- Drift, Pip, and Lume respond with independent critiques
- Celeste synthesizes: deduplicates flags, weights by model confidence
- If consensus (≥2 models agree) → flag is high-confidence (see the aggregation sketch below)
- Final output posted with recommendation: `publish` | `revise` | `flag_for_human`
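A crude sketch of the synthesis step, assuming each model's flags were saved as a JSON array and that two flags "agree" when they share a category and quoted span; real matching would likely need fuzzier text comparison, and the file layout is an assumption.

```bash
# Consensus sketch: keep only flags raised by ≥2 models (matched on
# category + quote), and record how many models voted for each.
jq -s '
  [.[][]]                                # flatten all models into one flag list
  | group_by([.category, .quote])        # same category and same quoted span
  | map(select(length >= 2))             # consensus: two or more models agree
  | map(.[0] + {votes: length})
' critiques/*.json
```

Flags below the threshold could still feed the `revise` or `flag_for_human` recommendation rather than being dropped outright.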
---
## Success Criteria
| Outcome | TPR | FPR | Decision |
|---------|-----|-----|----------|
| Strong pass | ≥50% | <30% | Ship as default layer |
| Pass | ≥30% | <50% | Ship as opt-in layer |
| Marginal | 20–30% | 50–70% | Iterate on prompts, retest |
| Fail | <20% | >70% | Abandon approach |
### Scoring Rules

- Flag = **true positive** if it identifies a real error (even if explanation is imperfect)
- Flag = **false positive** if flagged content is actually correct
- Duplicate flags across models count once for TPR but inform consensus metrics
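The gate above reduces to simple arithmetic once flags are scored. This sketch assumes human-labeled verdicts in a `results/scored.json` file, that cross-model duplicates were already collapsed upstream, and reads FPR as the share of all flags that were false, which is one plausible interpretation of the table; all of these are assumptions.

```bash
# Compute TPR/FPR from scored flags. Assumes results/scored.json is an array
# of flags each labeled {"verdict": "tp"} or {"verdict": "fp"}, and SEEDED is
# the number of errors planted by seed-test-corpus.sh (value illustrative).
SEEDED=40
TP=$(jq '[.[] | select(.verdict == "tp")] | length' results/scored.json)
FP=$(jq '[.[] | select(.verdict == "fp")] | length' results/scored.json)

awk -v tp="$TP" -v fp="$FP" -v seeded="$SEEDED" 'BEGIN {
  total = tp + fp
  printf "TPR = %.0f%%  FPR = %.0f%%\n",
         100 * tp / seeded,
         total > 0 ? 100 * fp / total : 0
}'
```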
---
## Dependencies
- Ollama running locally with models pulled: `mistral:7b`, `tinyllama:1.1b`, `llama3.1:8b`
- `jq` and `curl` installed
- Results stored in `experiments/peer-review-results/`
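A small preflight check along these lines can fail fast when a tool or model is missing; `ollama list` and `ollama pull` are standard Ollama CLI commands, and the rest of the script is an illustrative sketch.

```bash
# Preflight: confirm the CLI tools are on PATH and the three models are pulled.
for cmd in ollama jq curl; do
  command -v "$cmd" >/dev/null || { echo "missing dependency: $cmd" >&2; exit 1; }
done

for model in mistral:7b tinyllama:1.1b llama3.1:8b; do
  ollama list | grep -q "^$model" || echo "not pulled yet: run 'ollama pull $model'" >&2
done

mkdir -p experiments/peer-review-results
```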
---
## Integration
- When peer review passes validation:
  - Package as Reef API endpoint: `POST /review` (call shape sketched below)
  - Agents call it before publishing any analysis
  - Configurable: model selection, consensus threshold, categories
- Log all reviews to `#reef-logs` with TPR tracking
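Once packaged behind Reef, an agent's call might look roughly like this. The host, port, and JSON field names are all assumptions; only the `POST /review` path and the configurable knobs come from this document.

```bash
# Hypothetical request shape for the future Reef endpoint. Host, port, and
# JSON fields are assumptions; only POST /review is specified above.
jq -n --arg text "$(cat analysis.md)" '{
  text: $text,
  models: ["mistral:7b", "llama3.1:8b"],
  consensus_threshold: 2,
  categories: ["factual", "logical", "missing", "overconfidence", "hallucinated_source"]
}' |
curl -s -X POST http://reef.local:8080/review \
     -H 'Content-Type: application/json' -d @-
```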