Benchmarks — how our tools compare.
We publish our real numbers. No cherry-picked screenshots, no "bypasses everything" claims — just reproducible data from weekly runs against three programmatic AI detectors.
AI tool vendors overclaim by default. Users deserve the real numbers.
Marketing copy lies by omission
"Bypasses leading detectors" is everywhere in this category. We have been guilty of it. The problem: "leading" and "bypass" are undefined, and the screenshots are from a single cherry-picked run. This page is our commitment that every future performance claim in our copy must cite a specific row of a specific benchmark run, or it gets pulled.
Regressions should be visible
When a prompt change, model version bump, or infrastructure tweak lowers humanizer performance, silent degradation is the default. We run the same golden set every week so any drop shows up in the delta between runs, not in customer complaints weeks later.
You can reproduce it
The runner is scripts/run-detector-benchmark.js. The golden set is versioned. The detector API adapters are thin and documented. When we open-source the adapter code (Sprint 3 target), you will be able to run the same benchmark with your own subscription keys and reconcile the numbers.
We will lose on some metrics
When a competitor's humanizer beats ours on a particular detector or category, we will report it on this page. Rule of thumb: when a claim would embarrass us if it were true, we publish the metric that settles it either way.
How much humanizer output do detectors miss?
We feed the same AI-generated source texts through the Coda One humanizer, then score the output on each external detector. A higher percentage means the detector classified more of the rewritten texts as human. A 100% bypass score is not achievable and we do not target it: it would indicate the detector is broken, not that our output is flawless.
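As a rough illustration of how a bypass percentage could be computed from normalized detector scores, here is a minimal Python sketch. The function name `bypass_rate`, the 50-point decision threshold, and the sample scores are all assumptions made for the example; the page does not state which threshold the real runner uses.

```python
def bypass_rate(ai_likelihoods, threshold=50.0):
    """Percent of humanized outputs the detector scored below `threshold`.

    `ai_likelihoods` are 0-100 AI-likelihood scores for the rewritten texts.
    The threshold of 50 is an assumption for this sketch, not a documented value.
    """
    if not ai_likelihoods:
        raise ValueError("need at least one score")
    classified_human = sum(1 for score in ai_likelihoods if score < threshold)
    return 100.0 * classified_human / len(ai_likelihoods)


# Made-up scores for four humanized texts on one detector.
example_scores = [12.0, 48.5, 51.0, 90.0]
example_rate = bypass_rate(example_scores)  # two of four fall below 50
```

Because the result depends directly on the threshold, any published bypass number is only comparable across runs that use the same cutoff.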
Three result cards, each reporting precision, recall, F1, accuracy, p50 latency, and p95 latency (values not yet populated).
Coda One AI Detector vs. commercial detectors
Our own RoBERTa-based detector (the one at /ai-detector/) is benchmarked on the same 200-sample golden set and scored on the two metrics that matter most: recall (does it catch AI text?) and false-positive rate (does it flag real human writing?).
We do not claim ours is more accurate than paid detectors. In many categories — academic in particular — the paid detectors likely win on recall. What we publish is the apples-to-apples number so you can decide which tool fits your use case.
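For readers who want the definitions pinned down, the sketch below computes recall and false-positive rate (plus precision, F1, and accuracy) from labeled predictions, treating "ai" as the positive class. The function name and the eight-sample example are illustrative assumptions, not data from a real run.

```python
def detector_metrics(labels, predictions):
    """Compute detector metrics; both inputs are sequences of "ai" or "human"."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == "ai" and p == "ai")        # AI caught
    fn = sum(1 for y, p in pairs if y == "ai" and p == "human")     # AI missed
    fp = sum(1 for y, p in pairs if y == "human" and p == "ai")     # human flagged
    tn = sum(1 for y, p in pairs if y == "human" and p == "human")  # human cleared

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Made-up example: 4 AI samples and 4 human samples.
labels = ["ai"] * 4 + ["human"] * 4
predictions = ["ai", "ai", "ai", "human", "human", "human", "human", "ai"]
metrics = detector_metrics(labels, predictions)
```

In this toy case one AI sample is missed and one human sample is flagged, so recall is 0.75 and the false-positive rate is 0.25.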
Three pillars: golden set, adapters, weekly runs.
200 curated samples, 50/50 AI/human split
Built from public-domain academic abstracts, Reddit WritingPrompts (licensed), GitHub README snippets, and GPT-4o regenerations. Split deterministically into 100 tune / 100 holdout. Luna (ML lead) never sees the holdout split. SHA-256 of the set is committed alongside, so any silent change is detectable.
- Categories: academic, creative, technical, casual, marketing, news, student-essay
- Provenance check on every human sample (no scraped copyrighted text)
- AI samples regenerated on modern models so we are measuring current adversaries
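The deterministic split and the SHA-256 fingerprint described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the sample schema (`id`, `label`, `text`), the hash-ranked split, and the per-label stratification are all hypothetical choices.

```python
import hashlib
import json


def fingerprint(samples):
    """SHA-256 over a canonical JSON encoding, independent of input order."""
    canonical = json.dumps(
        sorted(samples, key=lambda s: s["id"]),
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def _rank_key(sample):
    # Ranking by a hash of the id gives a stable pseudo-random order with no RNG state.
    return hashlib.sha256(sample["id"].encode("utf-8")).hexdigest()


def split(samples, holdout_per_label):
    """Return (tune, holdout); stratifying by label is an assumption of this sketch."""
    tune, holdout = [], []
    for label in sorted({s["label"] for s in samples}):
        ranked = sorted((s for s in samples if s["label"] == label), key=_rank_key)
        holdout.extend(ranked[:holdout_per_label])
        tune.extend(ranked[holdout_per_label:])
    return tune, holdout


# Hypothetical 200-sample set with a 50/50 AI/human split.
samples = [
    {"id": f"s{i:03d}", "label": "ai" if i % 2 == 0 else "human", "text": f"sample text {i}"}
    for i in range(200)
]
tune, holdout = split(samples, 50)
digest = fingerprint(samples)
```

Committing `digest` next to the data means that editing, adding, or removing any sample changes the hash, which is what makes a silent change detectable.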
Same interface across GPTZero, Originality.ai, Copyleaks, internal
Each detector has a thin adapter that calls the vendor API, normalizes the score to a 0–100 AI-likelihood scale, and records latency + request ID. This keeps the runner detector-agnostic; adding a fifth detector is one adapter file.
- GPTZero v2 · starter plan · 10 req/min
- Originality.ai v3.0.1 · starter plan · 60 req/min
- Copyleaks v2 writer-detector · growth plan · 30 req/min
- Coda One internal RoBERTa (self-hosted, same inputs)
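To make the adapter idea concrete, here is a minimal Python sketch of a shared interface that normalizes a vendor score to the 0-100 scale and records latency and request ID. The real adapters live alongside a JavaScript runner and are not shown on this page, so every name here (`DetectorResult`, `DetectorAdapter`, `FractionScoreAdapter`) and the vendor response shape are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class DetectorResult:
    detector: str
    ai_likelihood: float  # normalized to the 0-100 scale
    latency_ms: float
    request_id: str


class DetectorAdapter(Protocol):
    """The single interface the runner would depend on."""
    name: str

    def score(self, text: str) -> DetectorResult: ...


class FractionScoreAdapter:
    """Adapter for a hypothetical vendor that returns an AI probability in [0, 1]."""

    def __init__(self, name: str, call_vendor: Callable[[str], dict]):
        self.name = name
        self._call_vendor = call_vendor  # injected so the sketch needs no network access

    def score(self, text: str) -> DetectorResult:
        start = time.perf_counter()
        # Assumed response shape: {"ai_probability": float, "request_id": str}
        raw = self._call_vendor(text)
        latency_ms = (time.perf_counter() - start) * 1000.0
        likelihood = min(100.0, max(0.0, raw["ai_probability"] * 100.0))
        return DetectorResult(self.name, likelihood, latency_ms, raw["request_id"])


def fake_vendor(text: str) -> dict:
    # Stand-in for a real API call; the values are made up.
    return {"ai_probability": 0.87, "request_id": "req-123"}


adapter = FractionScoreAdapter("example-detector", fake_vendor)
result = adapter.score("Some text to classify.")
```

With this shape, supporting another detector means writing one more class that satisfies `DetectorAdapter`; the runner itself would not need to change.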
Friday 02:00 UTC, committed to git, alerted on regression
GitHub Actions runs the full holdout split every Friday. Output is committed as public/admin-data/ai-writing-detector-benchmark.json. The run SHA is tied to the git commit so any marketing claim can cite a specific immutable artifact.
- Regression gate: a 5-point drop vs the prior run fails the job and alerts the on-call
- Results archived as GitHub Actions artifacts for 90 days
- Cost discipline: ~$55/month for subscriptions + API credits
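The regression gate can be sketched as a comparison between two runs. The version below is a guess at the logic rather than the actual workflow step: the metric names, the flat `{metric: value}` layout, and the made-up numbers are assumptions, and the page does not say which metrics the gate watches.

```python
def find_regressions(previous, current, max_drop=5.0):
    """Return {metric: drop} for each metric that fell by at least `max_drop` points.

    Metrics missing from `current` are skipped here; a stricter gate
    might treat a missing metric as a failure instead.
    """
    regressions = {}
    for metric, old_value in previous.items():
        new_value = current.get(metric)
        if new_value is None:
            continue
        drop = old_value - new_value
        if drop >= max_drop:
            regressions[metric] = drop
    return regressions


# Made-up numbers for two hypothetical metrics, in percentage points.
previous_run = {"detector_a": 62.0, "detector_b": 58.0}
current_run = {"detector_a": 56.0, "detector_b": 55.0}

regressions = find_regressions(previous_run, current_run)
exit_code = 1 if regressions else 0  # a CI step would exit non-zero to fail the job
```

Here `detector_a` drops 6 points and trips the gate, while the 3-point drop on `detector_b` stays under the threshold.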
Performance over time.
Weekly runs start after Eva ships Sprint 1. With one data point, a trend chart would imply more signal than exists — we would rather show nothing than a fake line.
Expected first trend chart: late May 2026 (after 4 weekly runs).
What this benchmark does not tell you.
- Sample size is small. 200 samples in v1, growing to 1,000 by Month 3. At n=100, a 95% confidence interval on accuracy is roughly ±5 percentage points when accuracy is in the low 90s and closer to ±10 near 50%, so numbers that differ by 3 points are probably not meaningfully different.
- English only. The Sprint 1 golden set is English. Coda One runs in seven languages but we have not yet built multilingual eval. Do not generalize these numbers to non-English content.
- No proprietary detectors. Turnitin, ZeroGPT enterprise, and Winston AI enterprise do not expose public APIs at tiers we could automate against. Their absence is a real gap.
- Detectors are not deterministic. Running the same text on the same day can return scores that differ by 2–5 points, especially on borderline cases. We do not chase week-to-week noise.
- Detector vendors ship silent updates. Originality.ai or Copyleaks can change their model under you without a version bump. We record apiVersion per run but cannot control when vendors retrain.
- Golden set curation bias. Our samples come from the sources we chose. A student essay from an SAT prep site is not the same distribution as a real student's final paper. We name every source in the methodology so you can judge fit for your use case.
- Humanizer bypass is not product quality. A humanizer that scores 100% on detector bypass but turns clear prose into word salad is a bad product. We track readability separately; it is not on this page yet.
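The sample-size caveat in the first bullet can be checked with the normal approximation to a binomial proportion. This is a minimal sketch for intuition; the function name is made up, and a Wilson interval would be a better choice near 0% or 100%.

```python
import math


def accuracy_margin(accuracy, n, z=1.96):
    """Half-width of the ~95% normal-approximation interval, in percentage points."""
    return 100.0 * z * math.sqrt(accuracy * (1.0 - accuracy) / n)


margin_50_at_100 = accuracy_margin(0.50, 100)   # about 9.8 points
margin_93_at_100 = accuracy_margin(0.93, 100)   # about 5.0 points
margin_90_at_1000 = accuracy_margin(0.90, 1000)  # about 1.9 points
```

The margin shrinks with the square root of the sample count, which is why growing the set to 1,000 samples narrows it by roughly a factor of three rather than ten.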
Questions we expect.
Why not include Turnitin?
How often is this page updated?
Can I suggest a detector to add?
Is the golden set public?
Why do numbers fluctuate week to week?
Do you benchmark on the same text users paste into your humanizer?
What if your humanizer regresses?
Why only English?
Run the round trip and compare.
Paste any AI-generated text into the humanizer, then paste the output into our detector (or any of the three external ones). Our numbers should be close to what you see — if they are not, we want to hear about it. Email [email protected] with the inputs and outputs.