Public transparency report

Benchmarks — how our tools compare.

We publish our real numbers. No cherry-picked screenshots, no "bypasses everything" claims — just reproducible data from weekly runs against three programmatic AI detectors.

Last updated: Pending first run (Sprint 1)
Samples in run: 0 / 200
Cadence: Weekly (Fri 02:00 UTC)
Run SHA: pending first run

Benchmark data coming in Sprint 1

Eva (our AI evals lead) is running the first 200-sample baseline this week. The numbers below are placeholders; the scaffold shows the structure so you can see exactly what we will publish. The first real run lands on Friday, May 1, 2026 at 02:00 UTC. We are committing to this page before we have the numbers so we cannot quietly walk back the commitment.

Why we publish this

AI tool vendors overclaim by default. Users deserve the real numbers.

01

Marketing copy lies by omission

"Bypasses leading detectors" is everywhere in this category. We have been guilty of it. The problem: "leading" and "bypass" are undefined, and the screenshots are from a single cherry-picked run. This page is our commitment that every future performance claim in our copy must cite a specific row of a specific benchmark run, or it gets pulled.

02

Regressions should be visible

When a prompt change, model version bump, or infrastructure tweak lowers humanizer performance, silent degradation is the default. We run the same golden set every week so any drop shows up in the delta between runs, not in customer complaints weeks later.

03

You can reproduce it

The runner is scripts/run-detector-benchmark.js. The golden set is versioned. The detector API adapters are thin and documented. When we open-source the adapter code (Sprint 3 target), you will be able to run the same benchmark with your own subscription keys and reconcile the numbers.

04

We will lose on some metrics

When a competitor's humanizer beats our own on a particular detector or category, we will report it on this page. Rule of thumb: if a claim would embarrass us were it true, we publish the metric that settles it either way.

Humanizer vs. detectors

How much humanizer output do detectors miss?

We feed the same AI-generated source texts through the Coda One humanizer, then score the output on each detector below. The bypass rate is the share of humanized samples that a detector classifies as human, so a higher percentage means the detector missed more of our output. A 100% bypass rate is not achievable and we do not target it: it would indicate the detector is broken, not that our output is flawless.
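As a minimal sketch of how a bypass rate could be derived from per-sample detector scores: the 0–100 AI-likelihood scale comes from the methodology on this page, but the cutoff of 50 and the function name `bypassRate` are assumptions made for illustration, not the runner's actual code.

```javascript
// Hypothetical cutoff: scores are assumed to sit on a 0–100 AI-likelihood
// scale, and anything below 50 is treated as "classified human".
const HUMAN_THRESHOLD = 50;

/**
 * Share of humanized samples that a detector classified as human.
 * @param {number[]} aiLikelihoodScores one 0–100 score per humanized sample
 * @param {number} threshold scores below this count as "human"
 * @returns {number} bypass rate as a percentage, or NaN for an empty run
 */
function bypassRate(aiLikelihoodScores, threshold = HUMAN_THRESHOLD) {
  if (aiLikelihoodScores.length === 0) return NaN;
  const missed = aiLikelihoodScores.filter((s) => s < threshold).length;
  return (100 * missed) / aiLikelihoodScores.length;
}

// Two of the four illustrative scores fall below the cutoff.
console.log(bypassRate([12, 48, 73, 91])); // 50
```

The result depends heavily on the cutoff, so a published bypass rate is only comparable across detectors if the same cutoff is applied to every normalized score.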

Coda One (internal RoBERTa) · API 2.3 · self-hosted plan
  • Humanizer bypass rate, Precision, Recall, F1, Accuracy, p50 latency, p95 latency: pending first run
  • 0 samples · 0 errored

GPTZero · API v2 · starter plan
  • Humanizer bypass rate, Precision, Recall, F1, Accuracy, p50 latency, p95 latency: pending first run
  • 0 samples · 0 errored

Originality.ai · API 3.0.1 · starter plan
  • Humanizer bypass rate, Precision, Recall, F1, Accuracy, p50 latency, p95 latency: pending first run
  • 0 samples · 0 errored

Copyleaks · API v2 · growth plan
  • Humanizer bypass rate, Precision, Recall, F1, Accuracy, p50 latency, p95 latency: pending first run
  • 0 samples · 0 errored

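The p50 and p95 latency figures could be computed in several ways. The sketch below uses the nearest-rank definition of a percentile, which is an assumption: the page does not say which definition the runner uses, and the latency values are invented for the example.

```javascript
// Nearest-rank percentile: the smallest observed value such that at least
// p percent of the observations are at or below it.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Made-up request latencies in milliseconds, not measured data.
const latenciesMs = [120, 95, 310, 150, 101, 99, 480, 130, 125, 140];

console.log(percentile(latenciesMs, 50)); // 125
console.log(percentile(latenciesMs, 95)); // 480
```

With only ten observations the p95 is simply the slowest request, which is one reason small runs make tail latency noisy.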
Our detector

Coda One AI Detector vs. commercial detectors

Our own RoBERTa-based detector (the one at /ai-detector/) is benchmarked on the same 200-sample golden set as the commercial detectors and scored on the two metrics that matter: does it catch AI text (recall), and does it avoid flagging real human writing (false-positive rate)?

Detector · Recall on AI text · False-positive rate on human text
  • Coda One (internal RoBERTa), ours · pending · pending
  • GPTZero · pending · pending
  • Originality.ai · pending · pending
  • Copyleaks · pending · pending

We do not claim ours is more accurate than paid detectors. In many categories — academic in particular — the paid detectors likely win on recall. What we publish is the apples-to-apples number so you can decide which tool fits your use case.
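For readers who want the definitions behind these columns, here is a small sketch that derives precision, recall, F1, accuracy, and false-positive rate from labeled samples. The field names `isAi` and `flaggedAi` are hypothetical, and the counts are illustrative rather than benchmark results.

```javascript
// Each sample carries the ground truth (isAi) and the detector verdict (flaggedAi).
function classificationMetrics(samples) {
  let tp = 0, fp = 0, tn = 0, fn = 0;
  for (const { isAi, flaggedAi } of samples) {
    if (isAi && flaggedAi) tp++;        // AI text caught
    else if (!isAi && flaggedAi) fp++;  // human text wrongly flagged
    else if (!isAi && !flaggedAi) tn++; // human text passed
    else fn++;                          // AI text missed
  }
  const ratio = (num, den) => (den === 0 ? NaN : num / den);
  const precision = ratio(tp, tp + fp);
  const recall = ratio(tp, tp + fn);
  return {
    precision,
    recall,
    f1: ratio(2 * precision * recall, precision + recall),
    accuracy: ratio(tp + tn, samples.length),
    falsePositiveRate: ratio(fp, fp + tn),
  };
}

// Illustrative counts: 3 AI caught, 1 AI missed, 1 human flagged, 3 human passed.
const rows = (n, isAi, flaggedAi) =>
  Array.from({ length: n }, () => ({ isAi, flaggedAi }));
const demo = [
  ...rows(3, true, true),
  ...rows(1, true, false),
  ...rows(1, false, true),
  ...rows(3, false, false),
];

console.log(classificationMetrics(demo));
// { precision: 0.75, recall: 0.75, f1: 0.75, accuracy: 0.75, falsePositiveRate: 0.25 }
```

Recall only looks at the AI half of the set and false-positive rate only at the human half, which is why both are needed to compare detectors fairly.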

Methodology at a glance

Three pillars: golden set, adapters, weekly runs.

01 · Golden set

200 curated samples, 50/50 AI/human split

Built from public-domain academic abstracts, Reddit WritingPrompts (licensed), GitHub README snippets, and GPT-4o regenerations. Split deterministically into 100 tune / 100 holdout. Luna (ML lead) never sees the holdout split. SHA-256 of the set is committed alongside, so any silent change is detectable.

  • Categories: academic, creative, technical, casual, marketing, news, student-essay
  • Provenance check on every human sample (no scraped copyrighted text)
  • AI samples regenerated on modern models so we are measuring current adversaries
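One way a deterministic split and a set-level SHA-256 could work is sketched below. The sample shape (`id`, `label`, `text`), the hash-ordering rule, and both function names are assumptions for illustration; the page does not describe the actual splitting rule or serialization.

```javascript
const { createHash } = require("node:crypto");

const sha256 = (text) => createHash("sha256").update(text, "utf8").digest("hex");

// Order samples by the hash of their ID and cut the list in half. The result
// depends only on the IDs, never on input order or on a random seed.
function splitGoldenSet(samples) {
  const ordered = [...samples].sort((a, b) => {
    const ha = sha256(a.id);
    const hb = sha256(b.id);
    return ha < hb ? -1 : ha > hb ? 1 : 0;
  });
  const half = Math.floor(ordered.length / 2);
  return { tune: ordered.slice(0, half), holdout: ordered.slice(half) };
}

// Fingerprint of the whole set: hash a canonical, ID-sorted serialization so
// that any edited, added, or removed sample changes the checksum.
function goldenSetChecksum(samples) {
  const canonical = [...samples]
    .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0))
    .map(({ id, label, text }) => JSON.stringify([id, label, text]))
    .join("\n");
  return sha256(canonical);
}

// Tiny illustrative set, not real golden-set content.
const exampleSet = [
  { id: "ai-001", label: "ai", text: "First sample." },
  { id: "hu-001", label: "human", text: "Second sample." },
  { id: "ai-002", label: "ai", text: "Third sample." },
  { id: "hu-002", label: "human", text: "Fourth sample." },
];

console.log(splitGoldenSet(exampleSet).holdout.map((s) => s.id));
console.log(goldenSetChecksum(exampleSet));
```

This sketch does not stratify by label or category, so a production split that must preserve the 50/50 AI/human balance would need to apply the same ordering within each group.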
02 · Detector adapters

Same interface across GPTZero, Originality.ai, Copyleaks, internal

Each detector has a thin adapter that calls the vendor API, normalizes the score to a 0–100 AI-likelihood scale, and records latency + request ID. This keeps the runner detector-agnostic; adding a fifth detector is one adapter file.

  • GPTZero v2 · starter plan · 10 req/min
  • Originality.ai v3.0.1 · starter plan · 60 req/min
  • Copyleaks v2 writer-detector · growth plan · 30 req/min
  • Coda One internal RoBERTa (self-hosted, same inputs)
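The adapter idea can be sketched roughly as follows. Everything here is hypothetical: `makeAdapter`, `toPercent`, the `callVendor` callback, and the result fields are illustrative names, and the stub stands in for a real vendor request rather than reproducing any vendor's API.

```javascript
// Map a vendor score onto the shared 0–100 AI-likelihood scale and clamp it.
function toPercent(value, vendorMax) {
  const pct = (100 * value) / vendorMax;
  return Math.min(100, Math.max(0, pct));
}

// Build a detector-agnostic scoring function around one vendor call.
// callVendor is assumed to resolve to { rawScore, requestId }.
function makeAdapter({ name, apiVersion, vendorMax, callVendor }) {
  return async function score(text) {
    const startedAt = Date.now();
    const { rawScore, requestId } = await callVendor(text);
    return {
      detector: name,
      apiVersion,
      aiLikelihood: toPercent(rawScore, vendorMax),
      latencyMs: Date.now() - startedAt,
      requestId,
    };
  };
}

// Stub vendor call used in place of a real HTTP request.
const stubAdapter = makeAdapter({
  name: "stub-detector",
  apiVersion: "v0",
  vendorMax: 1, // this imaginary vendor returns scores between 0 and 1
  callVendor: async () => ({ rawScore: 0.5, requestId: "req-123" }),
});

stubAdapter("some text").then((result) => console.log(result));
```

Because the runner only sees the returned object, adding another detector under this design means supplying one more `callVendor` function and its score range.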
03 · Weekly runs

Friday 02:00 UTC, committed to git, alerted on regression

GitHub Actions runs the full holdout split every Friday. Output is committed as public/admin-data/ai-writing-detector-benchmark.json. The run SHA is tied to the git commit so any marketing claim can cite a specific immutable artifact.

  • Regression gate: a drop of more than 5 percentage points vs. the prior run fails the job and alerts the on-call
  • Results archived as GitHub Actions artifacts for 90 days
  • Cost discipline: ~$55/month for subscriptions + API credits
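A regression gate along these lines could look like the sketch below. The function name, the shape of the run objects, and the detector keys are assumptions, and the numbers are illustrative inputs rather than results.

```javascript
const MAX_DROP_POINTS = 5;

// Compare per-detector bypass rates (in percent) between two runs and return
// every detector whose rate fell by more than maxDrop points.
function findRegressions(previous, latest, maxDrop = MAX_DROP_POINTS) {
  const regressions = [];
  for (const [detector, prevRate] of Object.entries(previous)) {
    const currRate = latest[detector];
    if (typeof currRate !== "number") continue; // detector missing from this run
    const drop = prevRate - currRate;
    if (drop > maxDrop) {
      regressions.push({ detector, previous: prevRate, latest: currRate, drop });
    }
  }
  return regressions;
}

// Illustrative numbers only, not benchmark results.
const previousRun = { detectorA: 70, detectorB: 62 };
const latestRun = { detectorA: 63, detectorB: 60 };

const regressions = findRegressions(previousRun, latestRun);
console.log(regressions);
// [ { detector: 'detectorA', previous: 70, latest: 63, drop: 7 } ]

// A CI entry point could then fail the job with a non-zero exit status:
//   process.exitCode = regressions.length > 0 ? 1 : 0;
```

A detector that errored out of the latest run is skipped here rather than counted as a regression; whether to fail the job in that case is a separate policy choice.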
Historical trends
More data coming

Weekly runs start after Eva ships Sprint 1. With one data point, a trend chart would imply more signal than exists — we would rather show nothing than a fake line.

Expected first trend chart: late May 2026 (after 4 weekly runs).

Limitations

What this benchmark does not tell you.

  • Sample size is small. 200 samples in v1, growing to 1,000 by Month 3. On the 100-sample holdout, a 95% confidence interval is roughly ±5 percentage points when accuracy is above 90% and widens to about ±10 points near 50%, so numbers that differ by 3 points are probably not meaningfully different.
  • English only. The Sprint 1 golden set is English. Coda One runs in seven languages but we have not yet built multilingual eval. Do not generalize these numbers to non-English content.
  • No closed-API detectors. Turnitin, ZeroGPT enterprise, and Winston AI enterprise do not expose public APIs at tiers we could automate against. Their absence is a real gap.
  • Detectors are not deterministic. Running the same text on the same day can return scores that differ by 2–5 points, especially on borderline cases. We do not chase week-to-week noise.
  • Detector vendors ship silent updates. Originality.ai or Copyleaks can change their model under you without a version bump. We record apiVersion per run but cannot control when vendors retrain.
  • Golden set curation bias. Our samples come from the sources we chose. A student essay from an SAT prep site is not the same distribution as a real student's final paper. We name every source in the methodology so you can judge fit for your use case.
  • Humanizer bypass is not product quality. A humanizer that scores 100% on detector bypass but turns clear prose into word salad is a bad product. We track readability separately; it is not on this page yet.
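The sample-size caveat can be checked with a short calculation. The sketch below uses the normal approximation for a proportion, which is a simplifying assumption (an exact or Wilson interval would differ slightly, especially near 0% or 100%); the function name is illustrative.

```javascript
// Half-width of an approximate 95% confidence interval for a proportion p
// measured on n samples, in percentage points (normal approximation).
function halfWidth95(p, n) {
  return 1.96 * Math.sqrt((p * (1 - p)) / n) * 100;
}

console.log(halfWidth95(0.5, 100).toFixed(1));  // "9.8"
console.log(halfWidth95(0.9, 100).toFixed(1));  // "5.9"
console.log(halfWidth95(0.5, 1000).toFixed(1)); // "3.1"
```

The interval is widest when the measured rate is near 50% and narrows as the rate approaches 0% or 100%, so a 3-point gap on 100 samples sits well inside the noise either way.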
FAQ

Questions we expect.

Why not include Turnitin?
Turnitin does not offer a public API for AI detection, so we cannot run an automated, reproducible benchmark against it. If that changes, we will add an adapter. Until then, we are honest about the gap: Turnitin is widely deployed in education, and its absence from our benchmark is a real limitation.
How often is this page updated?
The weekly benchmark runs Friday at 02:00 UTC via GitHub Actions. If the run succeeds, the JSON on this page updates automatically at the next deploy (usually within 24 hours). If a run fails or a detector is down, we flag the row as degraded rather than hide the failure.
Can I suggest a detector to add?
Yes. Email [email protected] with the detector name, a link to its public API docs, and (if possible) the plan tier you use. We prioritize detectors by user demand and by whether they are programmatically accessible. Each new detector costs a subscription line and engineering time, so we do not add every request.
Is the golden set public?
Not yet. The 200-sample golden set (which Eva is building in Sprint 1) mixes public-domain text with AI generations. We intend to publish the sample IDs and source references once Luna (our ML lead) confirms that publishing will not contaminate training pipelines. SHA-256 checksums are committed publicly so independent reviewers can verify we have not silently changed the set between runs.
Why do numbers fluctuate week to week?
Three sources of noise: (1) detectors themselves are not always deterministic — the same text scored on two different days can differ by 2–5 points; (2) detector vendors ship model updates without notice; (3) sample-level variance on a 100-sample holdout. We do not chase small week-to-week movements. A sustained shift of more than 5 points over two consecutive runs triggers an investigation.
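One possible reading of the "sustained shift" rule is sketched below: the last two runs must both sit more than 5 points away from the run before them, in the same direction. That interpretation, the function name, and the sample histories are assumptions, not the documented rule.

```javascript
// history holds one metric value per weekly run, oldest first.
function isSustainedShift(history, threshold = 5) {
  if (history.length < 3) return false; // need a baseline plus two later runs
  const [baseline, previous, latest] = history.slice(-3);
  const shiftPrevious = previous - baseline;
  const shiftLatest = latest - baseline;
  const sameDirection = Math.sign(shiftPrevious) === Math.sign(shiftLatest);
  return (
    sameDirection &&
    Math.abs(shiftPrevious) > threshold &&
    Math.abs(shiftLatest) > threshold
  );
}

// Illustrative histories only.
console.log(isSustainedShift([60, 53, 54])); // true: both runs are > 5 below 60
console.log(isSustainedShift([60, 53, 59])); // false: the second run recovered
```

Requiring two runs in the same direction filters out a single noisy week, at the cost of reacting one week later to a real change.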
Do you benchmark on the same text users paste into your humanizer?
No. The benchmark uses a separately curated, privacy-safe golden set — no user text is ever sent to third-party detectors. Golden set sources are public-domain academic, creative, technical, casual, marketing, news, and student essay samples. See the methodology page for full sourcing rules.
What if your humanizer regresses?
Our GitHub Actions workflow runs a regression gate: if the latest run drops more than 5 percentage points on any detector versus the previous run, the job exits with a non-zero status and posts to our #ml-alerts channel. Marketing copy citing the older number must be updated or retracted within 7 days per CLAIM-SAFETY-GOVERNANCE §5.
Why only English?
The Sprint 1 golden set is English-only. Multilingual detector evaluation (Arabic, Turkish, Spanish, Portuguese, Indonesian, Traditional Chinese, matching Coda One i18n) is a separate track and is not yet scheduled. Any multilingual claims would be unsupported today.
Reproduce it yourself

Run the round trip and compare.

Paste any AI-generated text into the humanizer, then paste the output into our detector (or any of the three external ones). Our numbers should be close to what you see — if they are not, we want to hear about it. Email [email protected] with the inputs and outputs.