Benchmark methodology.
The engineering spec behind /benchmarks/. Everything on this page is drawn from our internal detector-adapter spec — we published it externally so our claims on that page are independently verifiable.
Where the 200 texts come from.
The golden set is a single JSON file (data/eval/golden-set-v1.json) versioned alongside the benchmark code. It contains 200 samples split 100 tune / 100 holdout. Only the holdout split is used to produce numbers on /benchmarks/.
Each sample is tagged with a category and a ground-truth label (ai or human). The category balance targets look like this:
| Category | Target count | AI/human split | Source |
|---|---|---|---|
| academic | 40 | 20 / 20 | arXiv abstracts + literature-review snippets (human); GPT-4o regenerations (AI) |
| creative | 30 | 15 / 15 | Reddit WritingPrompts (licensed) + GPT-4o short stories |
| technical | 30 | 15 / 15 | GitHub README / spec excerpts (permissive licenses) + GPT-generated |
| casual | 30 | 15 / 15 | Public Reddit threads + GPT-generated informal posts |
| marketing | 30 | 15 / 15 | Public landing-page copy snapshots + AI-generated equivalents |
| news | 20 | 10 / 10 | Reuters/AP-style human wire copy + GPT-4o style-matched generations |
| student-essay | 20 | 10 / 10 | Released student essays (with permission) + GPT-generated equivalents |
| Total | 200 | 100 / 100 | — |
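The targets in this table are easy to check mechanically before a new version of the set is committed. The following is a minimal sketch, not the project's actual validation code: the sample shape ({ id, category, label }) and the function name checkCategoryBalance are assumptions made for illustration, and only the counts come from the table.

```javascript
// Per-category totals copied from the table above; each is split evenly AI/human.
const TARGETS = {
  academic: 40, creative: 30, technical: 30, casual: 30,
  marketing: 30, news: 20, 'student-essay': 20,
};

// samples: array of { id, category, label } where label is "ai" or "human"
// (hypothetical shape; the real golden-set schema may differ).
function checkCategoryBalance(samples) {
  const counts = {};
  for (const { category, label } of samples) {
    counts[category] ??= { ai: 0, human: 0 };
    if (label === 'ai' || label === 'human') counts[category][label] += 1;
  }

  const problems = [];
  for (const category of Object.keys(counts)) {
    if (!Object.hasOwn(TARGETS, category)) problems.push(`unknown category: ${category}`);
  }
  for (const [category, total] of Object.entries(TARGETS)) {
    const { ai, human } = counts[category] ?? { ai: 0, human: 0 };
    const half = total / 2;
    if (ai !== half || human !== half) {
      problems.push(`${category}: expected ${half}/${half} (AI/human), got ${ai}/${human}`);
    }
  }
  return problems; // an empty array means the set matches the targets
}
```

An empty result means every category hits its target; any other result lists what is off, which is more useful in a CI log than a bare pass/fail.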
Provenance rules
- Human samples must be either public domain, permissively licensed, or used with explicit written consent from the author.
- No PII: email addresses, phone numbers, and real names are scrubbed before a sample is added.
- No scraped copyrighted content. If a source's terms of service prohibit training or benchmark use, the sample is excluded.
- AI samples are regenerated with a modern model (GPT-4o in v1) so we are measuring against current adversarial output, not 2022-era GPT-3.5 artifacts.
- The SHA-256 hash of the golden-set JSON is committed at data/eval/golden-set-v1.sha256. Any change to the set requires a hash update and a commit.
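A reader reproducing the benchmark can confirm they have the same golden set by recomputing the hash. The sketch below uses Node's built-in crypto module. It assumes the .sha256 file holds the hex digest as its first whitespace-separated token (the sha256sum convention); the actual file layout is not specified on this page, and verifyGoldenSet is an illustrative name.

```javascript
const { createHash } = require('node:crypto');
const { readFileSync } = require('node:fs');

// Hex-encoded SHA-256 of a string or Buffer.
function sha256Hex(data) {
  return createHash('sha256').update(data).digest('hex');
}

// Compare the golden-set file against its committed digest.
// Assumption: the hash file's first token is the hex digest.
function verifyGoldenSet(jsonPath, hashPath) {
  const actual = sha256Hex(readFileSync(jsonPath));
  const expected = readFileSync(hashPath, 'utf8').trim().split(/\s+/)[0].toLowerCase();
  return { ok: actual === expected, actual, expected };
}
```

Called as verifyGoldenSet('data/eval/golden-set-v1.json', 'data/eval/golden-set-v1.sha256'), it returns both digests so that a mismatch can be reported rather than silently failing.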
Contamination hygiene
The holdout split is never used for prompt tuning, fine-tuning, or training. This is enforced by CI: any pull request that touches training configs runs a diff against the holdout sample IDs and fails if overlap is detected. An annual audit by our budget owner confirms the pipeline still enforces this.
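The overlap check itself is a set intersection. The sketch below shows the idea under stated assumptions: how sample IDs are extracted from training configs is not described here, so the function simply takes two ID lists, and both function names are hypothetical.

```javascript
// Return the holdout IDs that also appear among the IDs a training config references.
function findHoldoutOverlap(holdoutIds, referencedIds) {
  const holdout = new Set(holdoutIds);
  return [...new Set(referencedIds)].filter((id) => holdout.has(id));
}

// CI-style guard: throw (and so fail the job) when any overlap exists.
function assertNoContamination(holdoutIds, referencedIds) {
  const overlap = findHoldoutOverlap(holdoutIds, referencedIds);
  if (overlap.length > 0) {
    throw new Error(`holdout contamination: ${overlap.join(', ')}`);
  }
}
```

Listing the offending IDs in the error message lets the author of the pull request see exactly which samples leaked without rerunning the diff locally.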
Who marks samples as AI or human.
Ground-truth labels are assigned at sample creation time, not inferred by another detector. Every sample goes through two checks:
- Provenance. AI samples are tagged with the generating model, date, and prompt. Human samples are tagged with source URL, author (if known), and publication date. If we cannot verify human provenance with reasonable confidence, the sample is rejected.
- Human validation. Eva reads every single sample. On borderline cases (short text, non-native-English human writing that might look machine-like) she flags for review with Mira (claim safety) before the sample lands in the set.
We do not crowdsource ground truth. The labels are owned by a single identified person so disputes have an escalation path. When we scale the set from 200 → 1,000 (Month 3 target), we will add a second reviewer for every sample to catch single-reviewer bias.
Exact plan, version, and rate limit per detector.
Detector identity is not just the vendor name — it is the vendor name plus the plan tier and the specific API version. A GPTZero v2 run on the starter plan is not comparable to a v3 run on an enterprise plan. We pin and record both.
| Detector | Endpoint | Plan | Rate limit | Auth |
|---|---|---|---|---|
| GPTZero | api.gptzero.me/v2/predict/text | Essential ($14.99/mo monthly, $8.33/mo annual) | 10 req/min | x-api-key header |
| Originality.ai | api.originality.ai/api/v3/scan/ai | Starter ($14.95/mo, 2,000 credits) | 60 req/min | X-OAI-API-KEY header |
| Copyleaks | api.copyleaks.com/v2/writer-detector/{scan_id}/check | Growth ($29.99/mo) | 30 req/min | JWT (refreshed every 23h) |
| Coda One internal | Self-hosted RoBERTa endpoint | n/a | no external limit | internal only |
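The rate limits in this table translate directly into a minimum spacing between requests. The following sketch shows one simple way to respect them by running calls sequentially with a fixed delay. It is an illustration rather than the adapter's actual scheduler; RATE_LIMITS, minIntervalMs, and runThrottled are hypothetical names, and only the req/min figures come from the table.

```javascript
// Requests per minute, copied from the table above.
const RATE_LIMITS = { gptzero: 10, originality: 60, copyleaks: 30 };

// Smallest delay between request starts that stays within the limit.
function minIntervalMs(requestsPerMinute) {
  return Math.ceil(60_000 / requestsPerMinute);
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run async tasks one at a time, pausing between them.
// tasks: array of zero-argument functions that return a promise.
async function runThrottled(tasks, requestsPerMinute, sleep = defaultSleep) {
  const interval = minIntervalMs(requestsPerMinute);
  const results = [];
  for (let i = 0; i < tasks.length; i++) {
    if (i > 0) await sleep(interval);
    results.push(await tasks[i]());
  }
  return results;
}
```

At 10 req/min this spaces calls six seconds apart, so a 100-sample holdout run against the slowest detector takes about ten minutes of waiting on top of request latency. Sequential spacing is conservative: it never bursts, which trades some speed for predictability.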
Model pinning
Originality.ai lets us pin to aiModelVersion: "3.0.1" — we do. GPTZero and Copyleaks do not expose model pinning on the tiers we use, so we record the API version and the date of the run, and accept that vendor-side model changes may shift numbers between runs.
Privacy discipline
We never send user-submitted text to external detectors in this pipeline. The benchmark operates exclusively on the curated golden set. On Originality.ai we set storeScan: false. On Copyleaks, we reviewed the documented retention policy before integrating.
Normalizing wildly different detector outputs.
Detectors return scores in different native ranges. Without normalization, a raw "72" from Copyleaks and a raw "0.72" from GPTZero cannot be compared directly, and the same raw number can mean different things on different scales. The adapter layer enforces a canonical 0–100 AI-likelihood scale:
| Native range | Used by | Transform |
|---|---|---|
| [0, 1] probability of AI | GPTZero, Originality.ai | × 100 |
| [0, 100] percent AI | Copyleaks | identity |
| Categorical (AI / Human / Mixed) | some legacy detectors | mapped via calibration table with confidence collapsed |
| [0, 1] probability of human | none in v1 set | (1 − x) × 100 |
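The numeric rows of this table can be sketched as a small lookup of transforms. This is an illustrative version, not the adapter code itself: the kind labels ('prob-ai', 'percent-ai', 'prob-human') and the function name toCanonical are assumptions, and the categorical row is left out because its calibration table is not published here.

```javascript
// One transform per native range; each maps onto the canonical 0-100 AI-likelihood scale.
const TRANSFORMS = {
  'prob-ai': { max: 1, apply: (x) => x * 100 },          // GPTZero, Originality.ai
  'percent-ai': { max: 100, apply: (x) => x },           // Copyleaks
  'prob-human': { max: 1, apply: (x) => (1 - x) * 100 }, // none in the v1 set
};

function toCanonical(kind, nativeScore) {
  const transform = TRANSFORMS[kind];
  if (!transform) throw new Error(`unknown native range: ${kind}`);
  if (!Number.isFinite(nativeScore) || nativeScore < 0 || nativeScore > transform.max) {
    // Reject out-of-range input instead of clamping, so a unit mix-up surfaces early.
    throw new RangeError(`score ${nativeScore} is outside [0, ${transform.max}] for ${kind}`);
  }
  return transform.apply(nativeScore);
}
```

Throwing on out-of-range values is a deliberate choice in this sketch: passing a 0–100 score through the [0, 1] transform would otherwise yield a plausible-looking but wrong number.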
Three calibration gotchas, documented in the adapter code
- GPTZero returns both completely_generated_prob and class_probabilities.ai. These are not always equal. We use class_probabilities.ai because it incorporates the "mixed" class: partially-AI text is still meaningfully AI for our use case.
- Copyleaks' summary.ai (integer percent) and results.score.aggregatedScore can differ by 1–2 points. We prefer aggregatedScore (more precise).
- Originality.ai's score.ai + score.original is usually 1.0, but v3 introduced a "mixed" bucket that breaks the sum. We do not assume sum = 1.
Binary threshold
For precision/recall/accuracy, we binarize with a threshold of aiScore ≥ 50 → "AI". This is the same threshold for every detector in v1, which is a blunt tool — individual detectors likely have different optimal operating points. Per-detector calibrated thresholds (ROC analysis) are on the v2 roadmap. Until then, reported precision/recall should be read as "at threshold 50" not "at the optimal threshold for this detector."
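The binarization step and the resulting confusion matrix can be sketched as follows. The row shape ({ aiScore, label }) and the function names are assumptions for illustration; the threshold of 50 and the "≥" comparison come from the text above.

```javascript
const THRESHOLD = 50; // same for every detector in v1

// aiScore is on the canonical 0-100 scale; a score of exactly 50 counts as "ai".
function predictLabel(aiScore, threshold = THRESHOLD) {
  return aiScore >= threshold ? 'ai' : 'human';
}

// rows: array of { aiScore, label } where label is the ground truth ("ai" or "human").
// "Positive" means AI: a human sample predicted as AI is a false positive.
function confusionMatrix(rows, threshold = THRESHOLD) {
  const matrix = { tp: 0, fp: 0, tn: 0, fn: 0 };
  for (const { aiScore, label } of rows) {
    if (predictLabel(aiScore, threshold) === 'ai') {
      matrix[label === 'ai' ? 'tp' : 'fp'] += 1;
    } else {
      matrix[label === 'ai' ? 'fn' : 'tn'] += 1;
    }
  }
  return matrix;
}
```

Because the threshold is a parameter, the same function can be rerun at other cut-offs on the raw per-sample scores, which is the starting point for the per-detector ROC analysis mentioned above.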
Per-detector metrics computed per run
- Confusion matrix: true positives, false positives, true negatives, false negatives
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 = 2 · (precision · recall) / (precision + recall)
- Accuracy = (TP + TN) / total
- p50 and p95 latency (wall-clock per detect() call)
- Score distribution (10-bucket histogram from 0 to 100)
- Raw score per sample for post-hoc analysis
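The metrics in this list follow directly from the confusion matrix and the raw per-sample values. The sketch below is one possible implementation, not the runner's actual code. Two choices in it are assumptions: undefined ratios (zero denominators) are returned as null, and percentiles use the nearest-rank method, which may differ from what the runner does.

```javascript
// Precision, recall, F1, and accuracy from confusion-matrix counts.
function computeMetrics({ tp, fp, tn, fn }) {
  const total = tp + fp + tn + fn;
  const precision = tp + fp === 0 ? null : tp / (tp + fp);
  const recall = tp + fn === 0 ? null : tp / (tp + fn);
  const f1 =
    precision === null || recall === null || precision + recall === 0
      ? null
      : (2 * precision * recall) / (precision + recall);
  const accuracy = total === 0 ? null : (tp + tn) / total;
  return { precision, recall, f1, accuracy };
}

// Nearest-rank percentile, e.g. p = 50 or p = 95 for latency in milliseconds.
function percentile(values, p) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Fixed-width histogram over the canonical 0-100 scale; a score of 100 falls in the last bucket.
function scoreHistogram(scores, buckets = 10) {
  const counts = new Array(buckets).fill(0);
  const width = 100 / buckets;
  for (const score of scores) {
    counts[Math.min(Math.floor(score / width), buckets - 1)] += 1;
  }
  return counts;
}
```

Returning null instead of 0 or NaN for an undefined precision keeps a detector that never predicts "AI" from looking like one with 0% precision.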
How much can we actually conclude from 100 samples?
Short answer: not as much as headline percentages suggest. At a holdout size of 100, the 95% binomial confidence interval for accuracy is roughly ±10 percentage points at 50% accuracy and about ±8 points at 80% (it narrows only as accuracy approaches 0% or 100%). A detector measuring 78% accuracy and one measuring 82% accuracy on this set are not meaningfully different.
We report point estimates on the main page because bar charts with error bars are visually noisy for a general audience. The raw JSON on /admin-data/ai-writing-detector-benchmark.json preserves full confusion matrices so you can compute intervals yourself (or we will — on request).
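For readers who want to compute intervals from the published confusion matrices, the sketch below gives two standard options: the simple normal-approximation half-width and the Wilson score interval, which behaves better near 0% and 100%. It is offered as a starting point, not as the method used internally; the function names are illustrative, and z = 1.96 corresponds to a 95% interval.

```javascript
// Half-width of the normal-approximation (Wald) interval for a proportion p over n trials.
function normalHalfWidth(p, n, z = 1.96) {
  return z * Math.sqrt((p * (1 - p)) / n);
}

// Wilson score interval for `successes` out of `n` trials.
// For accuracy, successes = TP + TN and n = total samples.
function wilsonInterval(successes, n, z = 1.96) {
  if (!Number.isInteger(n) || n <= 0) throw new RangeError('n must be a positive integer');
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}
```

Running wilsonInterval(78, 100) and wilsonInterval(82, 100) and checking whether the two ranges overlap is a quick way to judge whether two detectors are distinguishable on this set.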
What triggers a "real" regression alert
- A 5+ percentage point drop on any detector's humanizer bypass rate vs. the prior run.
- Sustained across two consecutive runs (not a single noisy week).
- Bonferroni-adjusted across detectors — a single detector moving 5 points randomly is expected; three detectors moving together is not.
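The first two rules can be sketched in a few lines. This is one reading of them, stated as an assumption: the baseline is the run just before the drop, and an alert requires both of the two most recent runs to sit at least 5 points below it. The Bonferroni adjustment is not modeled because the underlying test is not specified here. Function names and the history shape are hypothetical.

```javascript
const MIN_DROP_POINTS = 5;

// rates: humanizer bypass rates in percent for one detector, oldest run first.
// True when the last two runs are both at least minDrop points below the run before them.
function isSustainedDrop(rates, minDrop = MIN_DROP_POINTS) {
  if (rates.length < 3) return false; // not enough history to call it sustained
  const [baseline, previous, latest] = rates.slice(-3);
  return baseline - previous >= minDrop && baseline - latest >= minDrop;
}

// historyByDetector: { detectorName: [rate, rate, ...] }
function detectorsToEscalate(historyByDetector) {
  return Object.entries(historyByDetector)
    .filter(([, rates]) => isSustainedDrop(rates))
    .map(([name]) => name);
}
```

A single bad run followed by a recovery returns false under this reading, which matches the intent of not reacting to one noisy week.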
What we do not do
- We do not run significance tests on every pairwise detector comparison and report p-values in marketing copy. That would invite p-hacking and has no audience.
- We do not stop a deploy on a single-week drop. We escalate, investigate, and only revert if root cause is confirmed.
Known failure modes of this benchmark.
Non-native-English false positives
Multiple commercial detectors flag fluent but non-native-English writing as AI-generated. Our golden set is predominantly native-speaker English — we may be underestimating false-positive rates that non-native writers encounter in practice. This is a known ethical problem with the detector category as a whole, not specific to Coda One.
Short text unreliability
Under ~100 words, every detector we tested becomes unreliable (signals are too short). Our samples are ≥150 words; numbers on this page do not generalize to tweet-length content.
Creative-writing confusion
Fictional voice, second-person narration, and deliberately unusual syntax look "AI-like" to perplexity-based detectors. Creative-category numbers are usually the weakest across all detectors in the set.
Sample sourcing bias
Our academic samples are arXiv abstracts, which are a specific register. Real student papers, grant proposals, and legal writing look different. We are transparent about sources so you can judge fit; generalize cautiously.
Single-model AI side
The v1 AI samples are GPT-4o. Claude, Gemini, Mistral, and Llama outputs look different to detectors. We plan multi-model AI sampling for v2.
Temporal drift
The set is a snapshot of "what AI text looks like in April 2026." Six months from now it will be stale. We refresh annually and keep v1 as a historical baseline.
Run the benchmark yourself.
The runner is a single Node.js script. Once the adapter code is open-sourced (Sprint 3 target), the full flow for an independent reproduction looks like this:
```bash
# 1. Clone and install
git clone https://github.com/codaone/coda-site
cd coda-site && npm install

# 2. Provide your own detector API keys
export GPTZERO_API_KEY="..."
export ORIGINALITY_API_KEY="..."
export COPYLEAKS_API_KEY="..."
export COPYLEAKS_EMAIL="..."

# 3. Dry-run to see the plan and estimated cost
node scripts/run-detector-benchmark.js --dry-run

# 4. Real run against the holdout split
node scripts/run-detector-benchmark.js --split holdout

# 5. Full benchmark command (all flags)
node scripts/run-detector-benchmark.js \
  --detectors gptzero,originality,copyleaks,internal \
  --split holdout \
  --output /tmp/my-benchmark.json
```
What you should see vs. what we publish
Your numbers should land within ~3 percentage points of ours on the same holdout set, assuming the same API plan tier and no major detector-side model update between our run and yours. Larger discrepancies are interesting — email [email protected] with your output JSON and we will investigate.
What we will not help with
- Running against your own text (benchmark is for the golden set; use the tools directly for your own text).
- Providing our API keys. Each reproducer needs their own subscriptions — this is what makes the reproduction independent.
- Getting comparable numbers on enterprise detector tiers (we do not run those).
Which detector versions ran when.
Every benchmark run records the detector API version and the git commit SHA. This page shows the human-readable history; the machine-readable form is the versionHistory array in the benchmark JSON.
- 2026-04-16: Schema scaffold committed. No data yet. Detector versions that v1 will pin: GPTZero v2, Originality.ai 3.0.1, Copyleaks v2 writer-detector, internal RoBERTa v2.3.
- Expected 2026-05-02: First real run (per Eva's Sprint 1 timeline). 100-sample holdout across all four detectors.