Round 12: 51/60 and the 9 Cases That Taught Us Our Ceiling Was Wrong

When we wrote the Round 12 ceiling analysis three weeks ago, we predicted 54/60 was the realistic near-term maximum. We were wrong.

The actual ceiling — measured against the live API after two surgical fixes — is 51/60. The 3-case gap between our prediction and reality came from a category mistake: we counted as “fixable contradictions” what turned out to be three different failure modes wearing the same costume.

This post is the honest accounting. What we shipped, what we learned, and why we’re publishing the broken assumption alongside the working result.

The headline

Round 12 result: 51/60 (85.0%) exact match on the live API
24/24 hard BLOCK preserved (zero false negatives on critical attack cases)
Wilson 95% CI: [73.9%, 91.9%]
Two patches shipped in 5 days (CC-016 staleness override, CC-012 correction marker)
R2 F1 = 1.000, R3 F1 unchanged (no regression on prior rounds)
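
For readers who want to check the interval, here is a minimal sketch of the standard Wilson score formula applied to 51 successes out of 60 trials. The function name and the z value of 1.96 for 95% coverage are choices made for this illustration, not something taken from the benchmark tooling.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    # The Wilson interval is centred on a shrunken estimate, not on p_hat itself.
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


low, high = wilson_interval(51, 60)
print(f"{low:.1%} to {high:.1%}")  # 73.9% to 91.9%
```

With only 60 cases the interval is wide, which is worth keeping in mind when comparing scores that differ by one or two cases.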

Round 12 is the most adversarial corpus we’ve tested against. 60 cases across three attack vectors:

CC (Confidence Calibration): 20 cases — fresh entries with stale neighbors, false consensus, contradicted history
PS (Provenance Sync): 20 cases — cross-agent timestamp drift, sync bleed, version skew
PA (Provenance Asymmetry): 20 cases — chain-of-custody violations, identity drift, attribution gaps

Built on our Round 12 adversarial corpus.

What we shipped

Two surgical patches between April 26 and 27.

CC-016 — staleness × action_type override

The original failure: a fintech agent with action_type irreversible reads a memory entry that’s within the normal staleness threshold, but the entry is tool_state — a category that decays faster. Our system returned USE_MEMORY. The corpus said it should have escalated to ASK_USER.

The fix: when action_type equals irreversible AND entry_type equals tool_state AND the per-domain staleness threshold is exceeded, force escalation regardless of base omega. Per-domain staleness cutoffs are tenant-calibrated and tighter for high-stakes domains.
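
The rule can be sketched roughly as follows. This is an illustration of the three-way condition described above, not the shipped patch: the MemoryEntry class, the decide function, the hour-based cutoffs, and the specific cutoff values are all hypothetical, since the real thresholds are tenant-calibrated.

```python
from dataclasses import dataclass

# Hypothetical per-domain staleness cutoffs in hours. Real values are
# tenant-calibrated; these numbers are placeholders for illustration only.
STALENESS_CUTOFF_HOURS = {"fintech": 1.0, "default": 24.0}


@dataclass(frozen=True)
class MemoryEntry:
    entry_type: str   # e.g. "tool_state"
    age_hours: float  # time since the entry was written


def decide(entry: MemoryEntry, action_type: str, domain: str, base_decision: str) -> str:
    """Force escalation when all three override conditions hold."""
    cutoff = STALENESS_CUTOFF_HOURS.get(domain, STALENESS_CUTOFF_HOURS["default"])
    if (
        action_type == "irreversible"
        and entry.entry_type == "tool_state"
        and entry.age_hours > cutoff
    ):
        # The override ignores the base score entirely.
        return "ASK_USER"
    return base_decision


entry = MemoryEntry(entry_type="tool_state", age_hours=3.0)
print(decide(entry, "irreversible", "fintech", "USE_MEMORY"))  # ASK_USER
print(decide(entry, "reversible", "fintech", "USE_MEMORY"))    # USE_MEMORY
```

The override only ever escalates. When any of the three conditions is missing, the base decision passes through untouched, which keeps the change narrow.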

The fix added 16 lines. It moved Round 12 from 49 to 50.

CC-012 — correction marker awareness

The original failure: a memory entry contains explicit correction language — CORRECTION, NOT X, ACTUALLY, FIXED — but our system treated the entry like any other potential conflict.

The fix: regex-detect correction markers, apply a calibrated omega discount when the marker is present, and suppress the SUSPICIOUS flag on confidence calibration for these specific cases.
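
A minimal sketch of that idea, under stated assumptions: the pattern below matches only the uppercase markers named above and reads "NOT X" as NOT followed by a single word; the 0.8 discount factor is a placeholder rather than the calibrated value; and score_entry is a hypothetical name.

```python
import re

# Assumed marker set: the four markers named in the post, uppercase only.
CORRECTION_MARKER = re.compile(r"\b(?:CORRECTION|ACTUALLY|FIXED)\b|\bNOT\s+\w+")

# Placeholder multiplier; the real discount is calibrated.
OMEGA_DISCOUNT = 0.8


def score_entry(text: str, omega: float, flags: set[str]) -> tuple[float, set[str]]:
    """Discount omega and drop SUSPICIOUS when a correction marker is present."""
    if CORRECTION_MARKER.search(text) is None:
        return omega, set(flags)
    return omega * OMEGA_DISCOUNT, set(flags) - {"SUSPICIOUS"}


omega, flags = score_entry(
    "CORRECTION: dose is 10mg, NOT 20mg", 0.5, {"SUSPICIOUS", "STALE"}
)
print(round(omega, 3), sorted(flags))  # 0.4 ['STALE']
```

Only the SUSPICIOUS flag is removed in this sketch. Any other flag on the entry survives, so the marker cannot be used to clear unrelated signals.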

The fix added 46 lines. It moved Round 12 from 50 to 51.

Both fixes preserved 24/24 hard BLOCK and didn’t regress any prior round.

Where we got the ceiling wrong

Three weeks ago we wrote a document called r12_theoretical_ceiling.md. It claimed 5 of the 11 remaining mismatches were contradiction cases that contradiction detection could resolve. Realistic near-term ceiling: 54/60.

That’s the sentence that turned out to be wrong.

When we actually inventoried the 11 cases — case by case, root cause by root cause — the breakdown was:

1 real contradiction (CC-012). Fixed by Option A, the correction-marker patch described above.
3 enrichment-driven cases (CC-009, CC-010, CC-019). These look like contradictions but the actual failure is that our enrichment pipeline inflates omega on structurally ambiguous entries. Fixing the enrichment layer breaks Round 3 detection. Risk too high.
1 staleness × action_type case (CC-016). Fixed by Option C, the staleness override described above.
4 semantic cases (CC-004, CC-007, CC-008, CC-011). These need actual language understanding — claim extraction and entailment checking. Beyond what our metadata-only system can do.
2 architectural invariants (PA-002, PA-009). These are by design — the security-monotonic guarantees that say MANIPULATED never downgrades.

The category we got wrong was the enrichment-driven cases. They look like contradictions because the entries describe different states of the world. But the failure mode is in the scoring pipeline, not in the contradiction detection logic.
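
The arithmetic behind the gap is easy to lose in prose, so here it is as a short tally. The case IDs and counts come from the inventory above; the dictionary layout and variable names are just one convenient way to write it down.

```python
# Root-cause inventory of the 11 mismatches that remained at 49/60.
INVENTORY = {
    "real contradiction": ["CC-012"],
    "enrichment-driven": ["CC-009", "CC-010", "CC-019"],
    "staleness x action_type": ["CC-016"],
    "semantic": ["CC-004", "CC-007", "CC-008", "CC-011"],
    "architectural invariant": ["PA-002", "PA-009"],
}
FIXED_CATEGORIES = {"real contradiction", "staleness x action_type"}

BASELINE = 49          # score before the two patches
PREDICTED_FIXABLE = 5  # cases the ceiling document counted as contradictions

total_mismatches = sum(len(cases) for cases in INVENTORY.values())
fixed_cases = sum(len(INVENTORY[name]) for name in FIXED_CATEGORIES)

predicted_ceiling = BASELINE + PREDICTED_FIXABLE
measured_ceiling = BASELINE + fixed_cases

print(total_mismatches)                      # 11
print(predicted_ceiling, measured_ceiling)   # 54 51
print(predicted_ceiling - measured_ceiling)  # 3
```

The 3-point gap is exactly the size of the enrichment-driven group.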

Why we’re publishing the wrong prediction

Two reasons.

First, the prediction was a falsifiable claim. We said 54. We got 51. The honest move is to publish both numbers.

Second, the category mistake itself is interesting. Five cases that looked like one category turned out to belong to three, which is exactly the kind of failure mode that benchmark-driven development is supposed to catch. If we’d shipped the enrichment refactor without inventorying the cases first, we’d have hit a Round 3 regression that would have cost more than the 3 points we’d have gained.

The discipline of writing down what you predict before you fix it — and then publishing the gap between prediction and result — is what separates engineering from marketing.

What 51/60 actually means

The 9 remaining mismatches sit at the metadata boundary:

3 enrichment-driven: scoring pipeline structural issue, deferred to a dedicated refactor sprint
4 semantic: require LLM-grade language understanding (claim extraction, entailment) — a separate layer
2 architectural invariants: security-monotonic by design, won’t change

Ceiling for the current architecture: 51/60 = 85.0%. Going higher would take one of three changes: adding LLM-grade semantic understanding (~55/60), refactoring the enrichment pipeline (~54/60, with high regression risk), or relaxing architectural invariants (we won’t).

The 9 cases, named

For anyone running their own validation:

Enrichment-driven (deferred)

CC-009: medication switch with stale neighbors
CC-010: fresh entry with corroborating-but-stale evidence
CC-019: 3-source corroboration false positive

Semantic (requires LLM layer)

CC-004: progressive language softening
CC-007: implicit supersession
CC-008: temporal sequence without explicit markers
CC-011: indirect contradiction

Architectural (by design)

PA-002: identity claim with no provenance, MANIPULATED preserved
PA-009: cross-agent attribution gap, BLOCK preserved

Each one has a story, and for each one we can say plainly why our system fails on it.
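
The two architectural cases are easier to reason about with a concrete picture of what a security-monotonic guarantee looks like. The sketch below is an assumption-heavy illustration: it treats the three decision labels used in this post as a single severity ladder (the ordering is assumed, not documented here) and lets each pipeline stage only raise severity, never lower it. The same pattern would apply to a MANIPULATED state.

```python
from enum import IntEnum
from functools import reduce


class Decision(IntEnum):
    # Assumed severity ordering, lowest to highest.
    USE_MEMORY = 0
    ASK_USER = 1
    BLOCK = 2


def merge_monotonic(current: Decision, proposed: Decision) -> Decision:
    """Keep whichever decision is more severe, so severity never decreases."""
    return max(current, proposed)


def final_decision(stage_outputs: list[Decision]) -> Decision:
    """Fold the stage outputs in order, starting from the least severe decision."""
    return reduce(merge_monotonic, stage_outputs, Decision.USE_MEMORY)


# A later stage voting USE_MEMORY cannot undo an earlier BLOCK.
stages = [Decision.ASK_USER, Decision.BLOCK, Decision.USE_MEMORY]
print(final_decision(stages).name)  # BLOCK
```

Under a rule like this, a case whose expected answer is less severe than an earlier stage's verdict can never match, which is the trade-off the two PA cases represent.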

The corpus

The Round 12 adversarial corpus underpins this benchmark.

Run it on your own stack. We expect divergent results — different scoring architectures will hit different walls. That’s the point.
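
A scoring harness for that kind of run can be very small. The sketch below is a hypothetical shape, not the corpus format: the Case fields, the hard_block marker, and the score function are invented for illustration, and the toy cases are placeholders rather than real corpus entries.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Case:
    case_id: str
    expected: str             # expected decision label
    hard_block: bool = False  # True for critical attack cases


def score(cases: Iterable[Case], decide: Callable[[Case], str]) -> dict:
    """Exact-match scoring, tracking hard BLOCK cases separately."""
    total = matches = 0
    mismatches: list[str] = []
    missed_blocks: list[str] = []
    for case in cases:
        total += 1
        if decide(case) == case.expected:
            matches += 1
            continue
        mismatches.append(case.case_id)
        if case.hard_block:
            # A missed hard BLOCK is a false negative on a critical case.
            missed_blocks.append(case.case_id)
    return {
        "matches": matches,
        "total": total,
        "exact_match": matches / total if total else 0.0,
        "mismatches": mismatches,
        "missed_blocks": missed_blocks,
    }


# Toy cases and a toy system under test, for illustration only.
toy_cases = [
    Case("X-001", "BLOCK", hard_block=True),
    Case("X-002", "ASK_USER"),
    Case("X-003", "USE_MEMORY"),
]
report = score(toy_cases, lambda case: "BLOCK" if case.hard_block else "USE_MEMORY")
print(report["matches"], report["total"], report["mismatches"], report["missed_blocks"])
```

Reporting missed hard BLOCK cases separately from the overall score mirrors the way the headline splits 51/60 from 24/24.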

The metadata gap is real, and it has a measurable boundary. 51/60 is where ours sits today.

This post is part of the Sgraal dual-stack benchmark series. See also: Independent Stacks, Same Truth Signal · Building a Joint Memory Governance Benchmark with Grok