Building a Joint Memory Governance Benchmark with Grok
Sgraal Team
The Problem
AI agents act on memory. They recall facts, preferences, tool outputs, and prior decisions — and they use that memory to make consequential choices. But what happens when that memory is wrong?
Hallucinations, sponsored drift, and subtle bias are invisible to most systems. An agent that confidently states "Account balance verified: $127,450 (certified by compliance)" looks credible — high trust score, no contradictions, recent timestamp. But the fact was never verified. The certification never happened.
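To make the failure mode concrete, here is a minimal sketch of such an entry. The field names (`trust_score`, `conflict_score`, `verified`) are illustrative assumptions, not the actual Sgraal schema: every surface signal looks clean, yet the underlying fact was never checked.

```python
from datetime import datetime, timezone

# Hypothetical memory entry: all surface signals look credible,
# but the certification it claims never happened.
entry = {
    "content": "Account balance verified: $127,450 (certified by compliance)",
    "trust_score": 0.92,       # high trust
    "conflict_score": 0.04,    # no contradictions
    "updated_at": datetime.now(timezone.utc).isoformat(),  # recent timestamp
    "verified": False,         # ground truth: verification never occurred
}

# A naive consumer that reads only surface signals accepts the entry.
looks_credible = entry["trust_score"] > 0.85 and entry["conflict_score"] < 0.15
print(looks_credible)  # True, despite the fabrication
```

The gap between `looks_credible` and `entry["verified"]` is exactly what the benchmark measures.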
We needed a way to measure detection accuracy for exactly these cases.

The Approach
Sgraal and Grok were built independently. Different teams, different architectures, different philosophies. Grok uses probabilistic fusion and multi-agent consensus. Sgraal uses an 83-module scoring pipeline with Z3 formal verification.
We designed three adversarial corpora — each targeting a specific failure mode that conventional systems miss. No shared training data. No coordinated development. Just side-by-side results on the same inputs.
The Corpora
Round 1 — Sponsored Drift (60 cases)
Affiliate bias, brand manipulation, cross-agent propagation. Memory entries carrying hidden commercial intent with trust scores above 0.85. The kind of drift that looks clean on the surface.
Round 2 — Subtle Drift (59 cases)
Commercial intent between 0.30 and 0.55 — the hardest range. Source trust above 0.85, conflict below 0.15. These entries pass every simple threshold check. Detection requires structural metadata analysis, not just content review.
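Why do these entries pass? A sketch of the kind of naive per-signal check they defeat (field names are illustrative, not the Sgraal schema): each signal stays on the safe side of its threshold, so no single rule fires.

```python
def naive_threshold_check(entry):
    """Flag an entry only when a single signal crosses a hard threshold."""
    return (
        entry["commercial_intent"] > 0.6   # obvious sponsorship
        or entry["source_trust"] < 0.85    # untrusted source
        or entry["conflict"] > 0.15        # contradicts other memories
    )

# A Round 2-style entry: commercial intent in the hard 0.30-0.55 band,
# trust above 0.85, conflict below 0.15.
subtle = {"commercial_intent": 0.42, "source_trust": 0.91, "conflict": 0.08}
print(naive_threshold_check(subtle))  # False: the entry slips through
```

Catching these requires reasoning over the structure of the metadata jointly, not testing each signal in isolation.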
Round 3 — Hallucination (60 cases)
Confident fabrication, multi-hop echo, and cross-agent amplification. Multiple agents corroborating the same fabricated fact with identical trust scores — testing whether the system can detect coordinated false consensus.
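One telltale of coordinated false consensus, sketched below with hypothetical data: independent sources rarely produce byte-identical trust scores for the same claim, so exact duplication across agents is a signature of echo rather than confirmation.

```python
from collections import Counter

# Hypothetical cross-agent snapshot: three agents "corroborate" the same
# fabricated fact with identical trust scores.
reports = [
    {"agent": "agent_a", "fact": "Vendor X is SOC 2 certified", "trust": 0.88},
    {"agent": "agent_b", "fact": "Vendor X is SOC 2 certified", "trust": 0.88},
    {"agent": "agent_c", "fact": "Vendor X is SOC 2 certified", "trust": 0.88},
]

def suspicious_consensus(reports, min_echoes=3):
    """Flag a (fact, trust) pair repeated verbatim across several agents."""
    by_claim = Counter((r["fact"], r["trust"]) for r in reports)
    return any(count >= min_echoes for count in by_claim.values())

print(suspicious_consensus(reports))  # True
```

This is a provenance-level check: it never inspects the content of the fact, only the correlation pattern across agents.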
The Results
F1 = 1.000
Across all 3 corpora — 179 cases
Zero false positives. Zero false negatives.
All corpora are publicly available on GitHub. Every case can be reproduced against the live API using the demo key.
Why the Gap?
Grok uses probabilistic fusion with multi-agent consensus — fast, adaptive, and effective for most cases. It achieved F1 = 0.98 on the sponsored drift corpus and recorded 2 false negatives on the subtle drift round.
Sgraal uses a different approach: 83 scoring modules feeding into Z3 formal verification. Every decision comes with overridable: false and a proof hash. The formal layer catches what probabilistic systems miss — specifically in boundary cases where confidence is high but the underlying fact is fabricated.
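The value of the proof hash is tamper-evidence. A minimal sketch of the idea (the decision schema here is an assumption; only overridable: false and the proof hash come from the description above): the hash binds the verdict to its evidence, so any downstream modification is detectable.

```python
import hashlib
import json

# Hypothetical decision record in the spirit described above.
decision = {
    "verdict": "reject",
    "overridable": False,
    "evidence": ["unverified certification claim", "trust/fact mismatch"],
}

# Hash a canonical serialization of the decision. Any change to the verdict
# or evidence produces a different hash.
payload = json.dumps(decision, sort_keys=True).encode()
proof_hash = hashlib.sha256(payload).hexdigest()
print(proof_hash[:12])
```

A consumer re-serializes the record it received and compares hashes; a mismatch means the decision was altered in transit.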
Confidence does not equal truth. A high-trust, low-conflict, recently-updated memory entry can still be completely fabricated.
What We Learned
- The dual-stack approach (reasoning + formal proof) is stronger than either alone.
- Boundary cases — where omega falls between 0.30 and 0.55 — are exactly where systems diverge. These cases require structural analysis, not just threshold checks.
- Hallucination detection requires provenance tracking and cross-agent correlation, not just content analysis. When three agents agree on a fabricated fact, content-level analysis fails.
Try It
The demo key sg_demo_playground gives you access to the full 83-module pipeline — no signup needed.
pip install sgraal
from sgraal import SgraalClient
client = SgraalClient("sg_demo_playground")
result = client.preflight(memory_state=[...], domain="fintech")
print(result["recommended_action"])
The corpus is on GitHub. The standard is at sgraal.com/standard.