Round 12: 51/60 and the 9 Cases That Taught Us Our Ceiling Was Wrong
When we wrote the Round 12 ceiling analysis three weeks ago, we predicted 54/60 was the realistic near-term maximum. We were wrong.
The actual ceiling — measured against the live API after two surgical fixes — is 51/60. The 3-case gap between our prediction and reality came from a category mistake: we counted as “fixable contradictions” what turned out to be three different failure modes wearing the same costume.
This post is the honest accounting. What we shipped, what we learned, and why we’re publishing the broken assumption alongside the working result.
The headline
- Round 12 result: 51/60 (85.0%) exact match on the live API
- 24/24 hard BLOCK preserved (zero false negatives on critical attack cases)
- Wilson 95% CI: [73.9%, 91.9%]
- Two patches shipped in 5 days (CC-016 staleness override, CC-012 correction marker)
- R2 F1 = 1.000, R3 F1 unchanged (no regression on prior rounds)
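For readers who want to check the interval above, here is a minimal sketch of the Wilson score calculation for 51 successes out of 60 trials. The function name is ours for illustration; only the standard formula with z = 1.96 is assumed.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(51, 60)
print(f"{low:.1%} to {high:.1%}")  # 73.9% to 91.9%
```

The interval is wide because 60 cases is a small sample, which is worth keeping in mind when comparing single-point differences between runs.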
Round 12 is the most adversarial corpus we’ve tested against. 60 cases across three attack vectors:
- CC (Confidence Calibration): 20 cases — fresh entries with stale neighbors, false consensus, contradicted history
- PS (Provenance Sync): 20 cases — cross-agent timestamp drift, sync bleed, version skew
- PA (Provenance Asymmetry): 20 cases — chain-of-custody violations, identity drift, attribution gaps
The corpus is public: github.com/sgraal-ai/core/tree/main/tests/corpus/round12
What we shipped
Two surgical patches between April 26 and 27.
CC-016 — staleness × action_type override
The original failure: a fintech agent with action_type irreversible reads a memory entry that’s within the normal staleness threshold, but the entry is tool_state — a category that decays faster. Our system returned USE_MEMORY. The corpus said it should have escalated to ASK_USER.
The fix: when action_type equals irreversible AND entry_type equals tool_state AND the per-domain staleness threshold is exceeded, force escalation regardless of base omega. Per-domain staleness cutoffs are tenant-calibrated and tighter for high-stakes domains.
The fix added 16 lines. It moved Round 12 from 49 to 50.
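A rough sketch of the rule as described, not the shipped patch. The cutoff values, the field names, and the choice to escalate only from USE_MEMORY (so stricter decisions are never weakened) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-domain cutoffs in hours. Real values are tenant-calibrated.
STALENESS_CUTOFF_HOURS = {"fintech": 1.0, "default": 24.0}

@dataclass
class MemoryRead:
    domain: str
    action_type: str    # e.g. "irreversible"
    entry_type: str     # e.g. "tool_state"
    age_hours: float
    base_decision: str  # decision produced by base omega scoring

def apply_staleness_override(read: MemoryRead) -> str:
    """Force ASK_USER for stale tool_state entries on irreversible actions."""
    cutoff = STALENESS_CUTOFF_HOURS.get(read.domain, STALENESS_CUTOFF_HOURS["default"])
    triggered = (
        read.action_type == "irreversible"
        and read.entry_type == "tool_state"
        and read.age_hours > cutoff
    )
    # Assumption: only escalate from USE_MEMORY, so a stricter decision stays as it is.
    if triggered and read.base_decision == "USE_MEMORY":
        return "ASK_USER"
    return read.base_decision

stale = MemoryRead("fintech", "irreversible", "tool_state", 3.0, "USE_MEMORY")
print(apply_staleness_override(stale))  # ASK_USER
```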
CC-012 — correction marker awareness
The original failure: a memory entry contains explicit correction language — CORRECTION, NOT X, ACTUALLY, FIXED — but our system treated the entry like any other potential conflict.
The fix: regex-detect correction markers, apply a calibrated omega discount when the marker is present, and suppress the SUSPICIOUS flag on confidence calibration for these specific cases.
The fix added 46 lines. It moved Round 12 from 50 to 51.
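The three steps can be sketched roughly as follows. The regex, the discount factor, and the reading of omega as a risk score where lower is safer are assumptions for illustration; the production pattern and calibrated value are not reproduced here.

```python
import re

# Markers from the list above. "NOT X" is read as NOT followed by one word (assumption).
CORRECTION_RE = re.compile(r"\b(?:CORRECTION|ACTUALLY|FIXED|NOT\s+\w+)\b")

# Hypothetical discount factor, not the calibrated one.
HYPOTHETICAL_DISCOUNT = 0.5

def score_with_correction(content: str, omega: float, suspicious: bool) -> tuple[float, bool]:
    """Discount omega and clear the SUSPICIOUS flag when a correction marker is present."""
    if CORRECTION_RE.search(content):
        return omega * HYPOTHETICAL_DISCOUNT, False
    return omega, suspicious

print(score_with_correction("CORRECTION: the limit is 500, NOT 5000", 0.8, True))  # (0.4, False)
print(score_with_correction("the limit is 500", 0.8, True))                        # (0.8, True)
```

The match is case-sensitive on purpose in this sketch, so an ordinary lowercase "not" or "fixed" in running text does not trigger the discount.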
Both fixes preserved 24/24 hard BLOCK and didn’t regress any prior round.
Where we got the ceiling wrong
Three weeks ago we wrote a document called r12_theoretical_ceiling.md. It claimed 5 of the 11 remaining mismatches were contradiction cases that contradiction detection could resolve. Realistic near-term ceiling: 54/60.
That’s the sentence that turned out to be wrong.
When we actually inventoried the 11 cases — case by case, root cause by root cause — the breakdown was:
- 1 real contradiction (CC-012). Fixed by Option A, the correction marker patch.
- 3 enrichment-driven cases (CC-009, CC-010, CC-019). These look like contradictions but the actual failure is that our enrichment pipeline inflates omega on structurally ambiguous entries. Fixing the enrichment layer breaks Round 3 detection. Risk too high.
- 1 staleness × action_type case (CC-016). Fixed by Option C, the staleness override patch.
- 4 semantic cases (CC-004, CC-007, CC-008, CC-011). These need actual language understanding — claim extraction and entailment checking. Beyond what our metadata-only system can do.
- 2 architectural invariants (PA-002, PA-009). These are by design — the security-monotonic guarantees that say MANIPULATED never downgrades.
The category we got wrong was the enrichment-driven cases. They look like contradictions because the entries describe different states of the world. But the failure mode is in the scoring pipeline, not in the contradiction detection logic.
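The arithmetic behind the gap is small enough to write down. This sketch uses only the case IDs and counts given in this post; the variable names are ours.

```python
# Case IDs and categories as inventoried above.
INVENTORY = {
    "real contradiction": ["CC-012"],
    "enrichment-driven": ["CC-009", "CC-010", "CC-019"],
    "staleness x action_type": ["CC-016"],
    "semantic": ["CC-004", "CC-007", "CC-008", "CC-011"],
    "architectural invariant": ["PA-002", "PA-009"],
}
FIXED = {"CC-012", "CC-016"}  # the two shipped patches
BASELINE = 49                 # score before either patch
PREDICTED_FIXABLE = 5         # cases the ceiling document assumed were fixable

total_mismatches = sum(len(ids) for ids in INVENTORY.values())
predicted = BASELINE + PREDICTED_FIXABLE
actual = BASELINE + len(FIXED)
gap = predicted - actual

print(total_mismatches, predicted, actual, gap)  # 11 54 51 3
print(gap == len(INVENTORY["enrichment-driven"]))  # True
```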
Why we’re publishing the wrong prediction
Two reasons.
First, the prediction was a falsifiable claim. We said 54. We got 51. The honest move is to publish both numbers.
Second, the category mistake itself is interesting. Cases that looked like one category turned out to span three different failure modes, and that is exactly the kind of error that benchmark-driven development is supposed to catch. If we’d shipped the enrichment refactor without inventorying the cases first, we’d have hit a Round 3 regression that would have cost more than the 3 points we’d have gained.
The discipline of writing down what you predict before you fix it — and then publishing the gap between prediction and result — is what separates engineering from marketing.
What 51/60 actually means
The 9 remaining mismatches sit at the metadata boundary:
- 3 enrichment-driven: scoring pipeline structural issue, deferred to a dedicated refactor sprint
- 4 semantic: require LLM-grade language understanding (claim extraction, entailment) — a separate layer
- 2 architectural invariants: security-monotonic by design, won’t change
Ceiling for the current architecture: 51/60 = 85.0%. Going higher requires one of three things: adding LLM-grade semantic understanding (~55/60), refactoring the enrichment pipeline (~54/60 with high regression risk), or relaxing architectural invariants (we won’t).
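To make the invariant concrete, here is one way to read "security-monotonic": a later pipeline stage may raise the severity of a decision but never lower it. The severity ordering and the function name below are assumptions for illustration, not the actual implementation.

```python
from enum import IntEnum

class Decision(IntEnum):
    # Hypothetical severity ordering, lowest to highest.
    USE_MEMORY = 0
    ASK_USER = 1
    BLOCK = 2

def apply_stage(current: Decision, proposed: Decision) -> Decision:
    """A later stage may escalate the decision but never downgrade it."""
    return max(current, proposed)

print(apply_stage(Decision.BLOCK, Decision.USE_MEMORY).name)  # BLOCK
print(apply_stage(Decision.USE_MEMORY, Decision.ASK_USER).name)  # ASK_USER
```

Under a rule like this, a case whose expected answer is softer than an earlier stage's verdict can never match, which is why such cases stay mismatched by design.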
The 9 cases, named
For anyone running their own validation:
Enrichment-driven (deferred)
- CC-009: medication switch with stale neighbors
- CC-010: fresh entry with corroborating-but-stale evidence
- CC-019: 3-source corroboration false positive
Semantic (requires LLM layer)
- CC-004: progressive language softening
- CC-007: implicit supersession
- CC-008: temporal sequence without explicit markers
- CC-011: indirect contradiction
Architectural (by design)
- PA-002: identity claim with no provenance, MANIPULATED preserved
- PA-009: cross-agent attribution gap, BLOCK preserved
Each one has a story, and for each one we have stated plainly why our system fails on it.
The corpus is public
github.com/sgraal-ai/core/tree/main/tests/corpus/round12
Run it on your own stack. We expect divergent results — different scoring architectures will hit different walls. That’s the point.
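If you do, a scoring helper along these lines reproduces the two headline metrics: exact match and hard BLOCK preservation. It makes no assumption about the corpus file format; the CaseResult shape, the demo IDs, and the definition of a hard BLOCK case as one whose expected decision is BLOCK are ours.

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    case_id: str
    expected: str  # decision the corpus expects
    actual: str    # decision your stack returned

def score(results: list[CaseResult]) -> dict:
    """Exact-match count plus how many expected-BLOCK cases were actually blocked."""
    hard = [r for r in results if r.expected == "BLOCK"]
    return {
        "exact": sum(r.expected == r.actual for r in results),
        "total": len(results),
        "hard_block_kept": sum(r.actual == "BLOCK" for r in hard),
        "hard_block_total": len(hard),
        "mismatches": [r.case_id for r in results if r.expected != r.actual],
    }

# Made-up demo rows, not taken from the corpus.
demo = [
    CaseResult("DEMO-1", "ASK_USER", "USE_MEMORY"),
    CaseResult("DEMO-2", "BLOCK", "BLOCK"),
    CaseResult("DEMO-3", "USE_MEMORY", "USE_MEMORY"),
]
summary = score(demo)
print(summary["exact"], summary["total"], summary["mismatches"])  # 2 3 ['DEMO-1']
```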
The metadata gap is real, and it has a measurable boundary. 51/60 is where ours sits today.
This post is part of the Sgraal dual-stack benchmark series. See also: Independent Stacks, Same Truth Signal · Building a Joint Memory Governance Benchmark with Grok