Technical reference

Benchmark methodology.

The engineering spec behind /benchmarks/. Everything on this page is drawn from our internal detector-adapter spec — we published it externally so our claims on that page are independently verifiable.

Spec version: v0.1 · Last revised: 2026-04-16 · Owner: Eva (AI Evals)
01 · Samples

Where the 200 texts come from.

The golden set is a single JSON file (data/eval/golden-set-v1.json) versioned alongside the benchmark code. It contains 200 samples split 100 tune / 100 holdout. Only the holdout split is used to produce numbers on /benchmarks/.

Each sample is tagged with a category and a ground-truth label (ai or human). The category balance targets look like this:

Category | Target count | AI / human split | Source
academic | 40 | 20 / 20 | arXiv abstracts + literature-review snippets (human); GPT-4o regenerations (AI)
creative | 30 | 15 / 15 | Reddit WritingPrompts (licensed) + GPT-4o short stories
technical | 30 | 15 / 15 | GitHub README / spec excerpts (permissive licenses) + GPT-generated
casual | 30 | 15 / 15 | Public Reddit threads + GPT-generated informal posts
marketing | 30 | 15 / 15 | Public landing-page copy snapshots + AI-generated equivalents
news | 20 | 10 / 10 | Reuters/AP-style human wire copy + GPT-4o style-matched generations
student-essay | 20 | 10 / 10 | Released student essays (with permission) + GPT-generated equivalents
Total | 200 | 100 / 100 |
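
A minimal sketch of how the balance targets above could be checked against the golden-set file. The sample shape (`category`, `label`) and the names `CATEGORY_TARGETS` and `checkCategoryBalance` are assumptions for illustration; the real JSON schema is not shown on this page.

```javascript
// Target counts from the table above. Every category is split evenly AI / human.
const CATEGORY_TARGETS = {
  academic: 40,
  creative: 30,
  technical: 30,
  casual: 30,
  marketing: 30,
  news: 20,
  "student-essay": 20,
};

// Returns a list of human-readable problems; an empty list means the set is balanced.
// Assumes each sample looks like { category: "news", label: "ai" | "human" } (hypothetical shape).
function checkCategoryBalance(samples, targets = CATEGORY_TARGETS) {
  const counts = {};
  for (const { category, label } of samples) {
    counts[category] ??= { ai: 0, human: 0 };
    counts[category][label] += 1;
  }
  const problems = [];
  for (const [category, target] of Object.entries(targets)) {
    const { ai = 0, human = 0 } = counts[category] ?? {};
    const half = target / 2;
    if (ai !== half || human !== half) {
      problems.push(`${category}: expected ${half} / ${half}, got ${ai} / ${human}`);
    }
  }
  return problems;
}
```

Running a check like this before the hash is updated would catch a set that has drifted from the targets.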

Provenance rules

  • Human samples must be public domain, permissively licensed, or used with explicit written consent from the author.
  • No PII: email addresses, phone numbers, and real names are scrubbed before a sample is added.
  • No scraped copyrighted content. If a source's terms of service prohibit training or benchmark use, the sample is excluded.
  • AI samples are regenerated with a modern model (GPT-4o in v1) so we are measuring against current adversarial output, not 2022-era GPT-3.5 artifacts.
  • The SHA-256 hash of the golden-set JSON is committed at data/eval/golden-set-v1.sha256. Any change to the set requires updating the hash and committing it.
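
The hash rule can be checked mechanically. The sketch below uses Node's built-in crypto module; the function names are hypothetical, and it assumes the .sha256 file begins with the hex digest (as `sha256sum` output does), which this page does not confirm.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");

// Hex-encoded SHA-256 digest of a Buffer or string.
function sha256Hex(data) {
  return crypto.createHash("sha256").update(data).digest("hex");
}

// Compares the golden-set file against the committed digest.
// Assumption: the hash file starts with the hex digest, optionally followed by a filename.
function verifyGoldenSet(jsonPath, hashPath) {
  const actual = sha256Hex(fs.readFileSync(jsonPath));
  const expected = fs.readFileSync(hashPath, "utf8").trim().split(/\s+/)[0].toLowerCase();
  return { ok: actual === expected, actual, expected };
}
```

A reproducer could call `verifyGoldenSet("data/eval/golden-set-v1.json", "data/eval/golden-set-v1.sha256")` and refuse to run when `ok` is false.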

Contamination hygiene

The holdout split is never used for prompt tuning, fine-tuning, or training. CI enforces this: any pull request that touches training configs is diffed against the holdout sample IDs and fails if any overlap is detected. An annual audit by our budget owner confirms that the pipeline still enforces the rule.
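
The core of such a CI gate is a set intersection. This is an illustrative sketch, not the actual CI job: the function names are hypothetical, and how sample IDs are extracted from training configs is left to the caller.

```javascript
// Returns the holdout sample IDs that also appear among IDs referenced by training configs.
// Both inputs are plain arrays of ID strings.
function findHoldoutOverlap(holdoutIds, referencedIds) {
  const holdout = new Set(holdoutIds);
  return [...new Set(referencedIds)].filter((id) => holdout.has(id)).sort();
}

// CI-style wrapper: throws (failing the job) when any overlap is found.
function assertNoHoldoutLeak(holdoutIds, referencedIds) {
  const overlap = findHoldoutOverlap(holdoutIds, referencedIds);
  if (overlap.length > 0) {
    throw new Error(`Holdout contamination: ${overlap.join(", ")}`);
  }
}
```

Throwing rather than returning a boolean means an uncaught error exits the Node process with a non-zero status, which is what most CI systems treat as failure.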

02 · Ground truth

Who marks samples as AI or human.

Ground-truth labels are assigned at sample creation time, not inferred by another detector. Every sample goes through two checks:

  1. Provenance. AI samples are tagged with the generating model, date, and prompt. Human samples are tagged with source URL, author (if known), and publication date. If we cannot verify human provenance with reasonable confidence, the sample is rejected.
  2. Human validation. Eva reads every sample. On borderline cases (short text, or non-native-English human writing that might look machine-like), she flags the sample for review with Mira (claim safety) before it lands in the set.
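
The mechanical part of the provenance check can be sketched as a field validator. All field names here (`provenance`, `model`, `generatedAt`, `prompt`, `sourceUrl`, `publishedAt`) are hypothetical stand-ins for the tags listed in check 1; the human-validation step is a manual review and is not modeled.

```javascript
// Required provenance fields per label (field names are hypothetical).
const REQUIRED_PROVENANCE = {
  ai: ["model", "generatedAt", "prompt"],
  human: ["sourceUrl", "publishedAt"], // author is optional ("if known")
};

// Returns the names of missing provenance fields.
// An empty array means the sample passes the mechanical part of check 1.
function missingProvenance(sample) {
  const required = REQUIRED_PROVENANCE[sample.label];
  if (!required) return ["label"];
  const provenance = sample.provenance ?? {};
  return required.filter((field) => {
    const value = provenance[field];
    return typeof value !== "string" || value.trim() === "";
  });
}
```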

We do not crowdsource ground truth. The labels are owned by a single identified person so disputes have an escalation path. When we scale the set from 200 → 1,000 (Month 3 target), we will add a second reviewer for every sample to catch single-reviewer bias.

03 · Detectors

Exact plan, version, and rate limit per detector.

Detector identity is not just the vendor name — it is the vendor name plus the plan tier and the specific API version. A GPTZero v2 run on the starter plan is not comparable to a v3 run on an enterprise plan. We pin and record both.

Detector | Endpoint | Plan | Rate limit | Auth
GPTZero | api.gptzero.me/v2/predict/text | Essential ($14.99/mo monthly, $8.33/mo annual) | 10 req/min | x-api-key header
Originality.ai | api.originality.ai/api/v3/scan/ai | Starter ($14.95/mo, 2,000 credits) | 60 req/min | X-OAI-API-KEY header
Copyleaks | api.copyleaks.com/v2/writer-detector/{scan_id}/check | Growth ($29.99/mo) | 30 req/min | JWT (refreshed per 23h)
Coda One internal | Self-hosted RoBERTa endpoint | n/a | no external limit | internal only
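
The rate limits translate directly into a minimum request spacing and a lower bound on run time. The sketch below is plain arithmetic on the table values; the helper names are hypothetical and the estimate ignores response latency and retries.

```javascript
// Requests per minute from the table above (the internal detector has no external limit).
const RATE_LIMITS_PER_MIN = { gptzero: 10, originality: 60, copyleaks: 30 };

// Minimum spacing between calls, in milliseconds, that stays within a per-minute limit.
function minIntervalMs(requestsPerMinute) {
  return Math.ceil(60000 / requestsPerMinute);
}

// Lower bound on wall-clock minutes to send `sampleCount` sequential requests
// when each request waits `minIntervalMs` after the previous one.
function minRunMinutes(sampleCount, requestsPerMinute) {
  return ((sampleCount - 1) * minIntervalMs(requestsPerMinute)) / 60000;
}
```

At 10 requests per minute, GPTZero needs at least 6 seconds between calls, so the 100-sample holdout takes roughly ten minutes on that detector alone.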

Model pinning

Originality.ai lets us pin to aiModelVersion: "3.0.1" — we do. GPTZero and Copyleaks do not expose model pinning on the tiers we use, so we record the API version and the date of the run, and accept that vendor-side model changes may shift numbers between runs.

Privacy discipline

We never send user-submitted text to external detectors in this pipeline. The benchmark operates exclusively on the curated golden set. On Originality.ai we set storeScan: false. On Copyleaks, we reviewed the documented retention policy before integrating.

04 · Scoring

Normalizing wildly different detector outputs.

Detectors return scores in different native ranges. Without normalization, "72 from Copyleaks" and "0.72 from GPTZero" look different on paper but mean the same thing, and a single threshold cannot be applied across detectors. The adapter layer enforces a canonical 0–100 AI-likelihood scale:

Native range | Used by | Transform
[0, 1] probability of AI | GPTZero, Originality.ai | × 100
[0, 100] percent AI | Copyleaks | identity
Categorical (AI / Human / Mixed) | some legacy detectors | mapped via calibration table with confidence collapsed
[0, 1] probability of human | none in v1 set | (1 − x) × 100
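
The four transforms in the table can be sketched as one function. The `kind` labels (`prob_ai`, `percent_ai`, `prob_human`, `categorical`) and the function names are hypothetical, and the calibration table for categorical outputs must be supplied by the caller because its values are not given here.

```javascript
// Clamps a finite number into the canonical 0-100 range.
function clampScore(x) {
  if (!Number.isFinite(x)) throw new Error(`Score is not a finite number: ${x}`);
  return Math.min(100, Math.max(0, x));
}

// Maps a detector's native score onto the canonical 0-100 AI-likelihood scale.
// `kind` names one of the four rows of the table above (labels are hypothetical).
function toAiScore(kind, value, calibration = {}) {
  switch (kind) {
    case "prob_ai":
      return clampScore(value * 100);
    case "percent_ai":
      return clampScore(value);
    case "prob_human":
      return clampScore((1 - value) * 100);
    case "categorical":
      if (!(value in calibration)) throw new Error(`No calibration entry for "${value}"`);
      return clampScore(calibration[value]);
    default:
      throw new Error(`Unknown score kind: ${kind}`);
  }
}
```

With this in place, `toAiScore("prob_ai", 0.72)` and `toAiScore("percent_ai", 72)` land on the same point of the canonical scale, up to floating-point rounding.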

Three calibration gotchas, documented in the adapter code

  • GPTZero returns both completely_generated_prob and class_probabilities.ai. These are not always equal. We use class_probabilities.ai because it incorporates the "mixed" class — partially-AI text is still meaningfully AI for our use case.
  • Copyleaks' summary.ai (integer percent) and results.score.aggregatedScore can differ by 1–2 points. We prefer aggregatedScore (more precise).
  • Originality.ai's score.ai + score.original is usually 1.0 but v3 introduced a "mixed" bucket that breaks the sum. We do not assume sum = 1.
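
The three field choices above can be sketched as per-detector extractors. The field paths come from the bullets; everything else is an assumption, including where each object sits inside the vendor's full API response and the names `EXTRACTORS` and `aiScoreFrom`. This is not the adapter code itself.

```javascript
// Each extractor takes the already-parsed object that carries the fields named above
// and returns a score on the canonical 0-100 scale.
const EXTRACTORS = {
  // Prefer class_probabilities.ai over completely_generated_prob.
  gptzero: (payload) => payload.class_probabilities.ai * 100,
  // Prefer results.score.aggregatedScore over the integer summary.ai.
  copyleaks: (payload) => payload.results.score.aggregatedScore,
  // Read score.ai directly; never derive it as 1 - score.original.
  originality: (payload) => payload.score.ai * 100,
};

function aiScoreFrom(detector, payload) {
  const extract = EXTRACTORS[detector];
  if (!extract) throw new Error(`No extractor for detector: ${detector}`);
  const score = extract(payload);
  if (!Number.isFinite(score)) throw new Error(`Missing score in ${detector} payload`);
  return score;
}
```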

Binary threshold

For precision/recall/accuracy, we binarize with a threshold of aiScore ≥ 50 → "AI". This is the same threshold for every detector in v1, which is a blunt tool — individual detectors likely have different optimal operating points. Per-detector calibrated thresholds (ROC analysis) are on the v2 roadmap. Until then, reported precision/recall should be read as "at threshold 50" not "at the optimal threshold for this detector."

Per-detector metrics computed per run

  • Confusion matrix: true positives, false positives, true negatives, false negatives
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1 = 2 · (precision · recall) / (precision + recall)
  • Accuracy = (TP + TN) / total
  • p50 and p95 latency (wall-clock per detect() call)
  • Score distribution (10-bucket histogram from 0 to 100)
  • Raw score per sample for post-hoc analysis
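
The metrics listed above can be computed as in the sketch below. The result shape (`aiScore`, `label`) and the function names are hypothetical. The percentile uses the nearest-rank definition, which is one of several common choices; the spec does not say which one the runner uses.

```javascript
const AI_THRESHOLD = 50; // aiScore >= 50 is classified as "AI"

// Builds the confusion matrix from [{ aiScore: 0-100, label: "ai" | "human" }, ...].
function confusionMatrix(results, threshold = AI_THRESHOLD) {
  const m = { tp: 0, fp: 0, tn: 0, fn: 0 };
  for (const { aiScore, label } of results) {
    const predictedAi = aiScore >= threshold;
    if (predictedAi && label === "ai") m.tp++;
    else if (predictedAi) m.fp++;
    else if (label === "ai") m.fn++;
    else m.tn++;
  }
  return m;
}

// Precision, recall, F1 and accuracy; null where a denominator is zero.
function metrics({ tp, fp, tn, fn }) {
  const ratio = (num, den) => (den === 0 ? null : num / den);
  const precision = ratio(tp, tp + fp);
  const recall = ratio(tp, tp + fn);
  const f1 =
    precision === null || recall === null || precision + recall === 0
      ? null
      : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1, accuracy: ratio(tp + tn, tp + fp + tn + fn) };
}

// Nearest-rank percentile of a non-empty array, e.g. p = 50 or p = 95 for latency.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// 10 equal-width buckets over 0-100; a score of exactly 100 goes in the last bucket.
function histogram10(scores) {
  const buckets = new Array(10).fill(0);
  for (const s of scores) buckets[Math.min(9, Math.floor(s / 10))]++;
  return buckets;
}
```

Returning `null` instead of `NaN` for undefined ratios keeps the output JSON-serializable, since `JSON.stringify` turns `NaN` into `null` silently anyway.
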
05 · Statistics

How much can we actually conclude from 100 samples?

Short answer: not as much as headline percentages suggest. At a holdout size of 100, the 95% binomial confidence interval for accuracy is roughly ±10 percentage points at 50% accuracy and about ±8 points at 80% (it narrows only as accuracy approaches 0% or 100%). A detector measuring 78% accuracy and one measuring 82% accuracy on this set are not meaningfully different.
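
The interval width can be computed directly from a confusion matrix. The sketch below shows the normal-approximation (Wald) half-width and the Wilson score interval; choosing these two methods is our illustration, since the spec does not name the interval it uses, and the function names are hypothetical.

```javascript
const Z_95 = 1.959964; // two-sided 95% quantile of the standard normal

// Normal-approximation (Wald) half-width for a proportion p measured on n samples.
function waldHalfWidth(p, n, z = Z_95) {
  return z * Math.sqrt((p * (1 - p)) / n);
}

// Wilson score interval, which behaves better than Wald when p is near 0 or 1.
function wilsonInterval(successes, n, z = Z_95) {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return { lower: center - half, upper: center + half };
}
```

For n = 100 the Wald half-width is about 9.8 points at 50% accuracy and about 7.8 points at 80%. The Wilson intervals for 78 and 82 correct out of 100 overlap substantially, which is why a 4-point gap on this set should not be read as a real difference.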

We report point estimates on the main page because bar charts with error bars are visually noisy for a general audience. The raw JSON at /admin-data/ai-writing-detector-benchmark.json preserves full confusion matrices so you can compute intervals yourself, or ask us and we will compute them on request.

What triggers a "real" regression alert

  • A 5+ percentage point drop on any detector's humanizer bypass rate vs. the prior run.
  • Sustained across two consecutive runs (not a single noisy week).
  • Bonferroni-adjusted across detectors — a single detector moving 5 points randomly is expected; three detectors moving together is not.
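
The first two conditions can be sketched as a simple check on one detector's run history. The reading of "sustained" used here (the last two runs both sit at least 5 points below the run before them) is our assumption, as is the function name. The Bonferroni adjustment across detectors is not modeled, because the underlying test is not specified on this page.

```javascript
const DROP_THRESHOLD_POINTS = 5;

// `history` is one detector's metric per run, oldest first, in percentage points.
// Assumed interpretation: alert only when the last two runs are both at least
// `threshold` points below the run that preceded them.
function isSustainedDrop(history, threshold = DROP_THRESHOLD_POINTS) {
  if (history.length < 3) return false;
  const [baseline, previous, latest] = history.slice(-3);
  return baseline - previous >= threshold && baseline - latest >= threshold;
}
```

A single noisy run, such as 80 then 74 then 79, does not trigger the alert under this reading, while 80 then 74 then 75 does.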

What we do not do

  • We do not run significance tests on every pairwise detector comparison and report p-values in marketing copy. That would invite p-hacking and has no audience.
  • We do not stop a deploy on a single-week drop. We escalate, investigate, and only revert if root cause is confirmed.
06 · Biases

Known failure modes of this benchmark.

Non-native-English false positives

Multiple commercial detectors flag fluent but non-native-English writing as AI-generated. Our golden set is predominantly native-speaker English — we may be underestimating false-positive rates that non-native writers encounter in practice. This is a known ethical problem with the detector category as a whole, not specific to Coda One.

Short text unreliability

Under ~100 words, every detector we tested becomes unreliable (the text carries too little signal). Our samples are ≥150 words; numbers on this page do not generalize to tweet-length content.

Creative-writing confusion

Fictional voice, second-person narration, and deliberately unusual syntax look "AI-like" to perplexity-based detectors. Creative-category numbers are usually the weakest across all detectors in the set.

Sample sourcing bias

Our academic samples are arXiv abstracts, which are a specific register. Real student papers, grant proposals, and legal writing look different. We are transparent about sources so you can judge fit; generalize cautiously.

Single-model AI side

The v1 AI samples are GPT-4o. Claude, Gemini, Mistral, and Llama outputs look different to detectors. We plan multi-model AI sampling for v2.

Temporal drift

The set is a snapshot of "what AI text looks like in April 2026." Six months from now it will be stale. We refresh annually and keep v1 as a historical baseline.

07 · Reproducibility

Run the benchmark yourself.

The runner is a single Node.js script. Once the adapter code is open-sourced (Sprint 3 target), the full flow for an independent reproduction looks like this:

```shell
# 1. Clone and install
git clone https://github.com/codaone/coda-site
cd coda-site && npm install

# 2. Provide your own detector API keys
export GPTZERO_API_KEY="..."
export ORIGINALITY_API_KEY="..."
export COPYLEAKS_API_KEY="..."
export COPYLEAKS_EMAIL="..."

# 3. Dry-run to see the plan and estimated cost
node scripts/run-detector-benchmark.js --dry-run

# 4. Real run against the holdout split
node scripts/run-detector-benchmark.js --split holdout

# 5. Full benchmark command (all flags)
node scripts/run-detector-benchmark.js \
  --detectors gptzero,originality,copyleaks,internal \
  --split holdout \
  --output /tmp/my-benchmark.json
```

What you should see vs. what we publish

Your numbers should land within ~3 percentage points of ours on the same holdout set, assuming the same API plan tier and no major detector-side model update between our run and yours. Larger discrepancies are interesting — email [email protected] with your output JSON and we will investigate.
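
A quick way to apply the ~3-point rule is to diff two sets of per-detector numbers. The input shape (detector name mapped to accuracy in percent) and the function name are hypothetical; the structure of the published output JSON is not described on this page, so extracting these numbers from it is left to the reader.

```javascript
const TOLERANCE_POINTS = 3;

// `ours` and `theirs` map detector name -> accuracy in percent (hypothetical shape).
// Returns the detectors present in both whose absolute difference exceeds the tolerance.
function discrepancies(ours, theirs, tolerance = TOLERANCE_POINTS) {
  return Object.keys(ours)
    .filter((name) => name in theirs)
    .map((name) => ({ name, delta: theirs[name] - ours[name] }))
    .filter(({ delta }) => Math.abs(delta) > tolerance);
}
```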

What we will not help with

  • Running against your own text (benchmark is for the golden set; use the tools directly for your own text).
  • Providing our API keys. Each reproducer needs their own subscriptions — this is what makes the reproduction independent.
  • Getting comparable numbers on enterprise detector tiers (we do not run those).
08 · History

Which detector versions ran when.

Every benchmark run records the detector API version and the git commit SHA. This page shows the human-readable history; the machine-readable form is the versionHistory array in the benchmark JSON.

  1. 2026-04-16
    Schema scaffold committed. No data yet. Detector versions that v1 will pin: GPTZero v2, Originality.ai 3.0.1, Copyleaks v2 writer-detector, internal RoBERTa v2.3.
  2. Expected 2026-05-02
    First real run (per Eva's Sprint 1 timeline). 100-sample holdout across all four detectors.