Building a Joint Memory Governance Benchmark with Grok

The Problem

AI agents act on memory. They recall facts, preferences, tool outputs, and prior decisions — and they use that memory to make consequential choices. But what happens when that memory is wrong?

Hallucinations, sponsored drift, and subtle bias are invisible to most systems. An agent that confidently states "Account balance verified: $127,450 (certified by compliance)" looks credible — high trust score, no contradictions, recent timestamp. But the fact was never verified. The certification never happened.

We needed a way to measure detection accuracy for exactly these cases.

The Approach

Sgraal and Grok were built independently. Different teams, different architectures, different philosophies. Grok uses probabilistic fusion and multi-agent consensus. Sgraal uses a 87-module scoring pipeline with Z3 formal verification.

We designed three adversarial corpora — each targeting a specific failure mode that conventional systems miss. No shared training data. No coordinated development. Just side-by-side results on the same inputs.

The Corpora

Round 1 — Sponsored Drift (60 cases)

Affiliate bias, brand manipulation, cross-agent propagation. Memory entries carrying hidden commercial intent inside high-trust sources. The kind of drift that looks clean on the surface.

Round 2 — Subtle Drift (59 cases)

Commercial intent in the mid-range — the hardest band, neither obviously benign nor obviously commercial. Source trust high, conflict low. These entries pass every simple threshold check. Detection requires structural metadata analysis, not just content review.

Round 3 — Hallucination (60 cases)

Confident fabrication, multi-hop echo, and cross-agent amplification. Multiple agents corroborating the same fabricated fact with identical trust scores — testing whether the system can detect coordinated false consensus.

The Results

F1 = 1.000

Across all 3 corpora — 239 cases

No false positives or false negatives on our synthetic structural corpus, measured against our own ground truth (production calibration pending).

F1 measured against Sgraal's own ground truth on synthetic corpora; Grok independently scored the same corpora — corroboration, not external validation. Production calibration pending.

The R1–R3 corpus is published on GitHub for offline reproduction.

Why the Gap?

Grok uses probabilistic fusion with multi-agent consensus — fast, adaptive, and effective for most cases. It achieved F1 = 0.98 on the sponsored drift corpus, with 2 false negatives on the subtle drift round.

Sgraal uses a different approach: 87 scoring modules plus deterministic structural checks and non-overridable detection layers. Every decision comes with overridable: false and a proof hash. These deterministic checks catch what probabilistic systems miss — specifically in boundary cases where confidence is high but the underlying fact is fabricated.

Confidence does not equal truth. A high-trust, low-conflict, recently-updated memory entry can still be completely fabricated.

What We Learned

The dual-stack approach (reasoning + formal proof) is stronger than either alone.
Boundary cases — in the borderline omega range where the decision band is least settled — are exactly where systems diverge. These cases require structural analysis, not just threshold checks.
Hallucination detection requires provenance tracking and cross-agent correlation, not just content analysis. When three agents agree on a fabricated fact, content-level analysis fails.

Try It

The demo key sg_demo_playground gives you access to the full 87-module pipeline — no signup needed.

pip install sgraal

from sgraal import SgraalClient
client = SgraalClient("sg_demo_playground")
result = client.preflight(memory_state=[...], domain="fintech")
print(result["recommended_action"])

The standard is at sgraal.com/standard.