Opening
AI agents don't fail because they lack intelligence.
They fail because they act on memory that appears correct — but isn't.
That's the starting point.
Two independent systems.
Seven rounds of adversarial testing.
Same results.
That convergence is the story.
The Experiment
We set out to test a simple but critical question:
What happens when AI systems validate each other's memory under adversarial pressure?
Two fundamentally different systems were used:
- A probabilistic reasoning layer (Grok), optimized for real-world signal detection and adaptive inference
- A formal verification layer (Sgraal), using deterministic constraints, provenance tracking, and non-overridable logic
Neither system had access to the other's internals.
Neither system modified its architecture to align with the other.
Each round used a structured JSONL corpus.
Each system ran the corpus independently.
Results were compared only after execution.
The corpus is public. The results are reproducible.
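The protocol above can be sketched as a small harness. Everything here is illustrative: the record schema and the `validator` callables are assumptions, not the actual Sgraal or Grok interfaces.

```python
import json

def load_corpus(path):
    """Read one test case per line from a JSONL corpus."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def run_independently(corpus, validator):
    """Run one validator over the corpus with no shared state."""
    return {case["id"]: validator(case) for case in corpus}

def divergences(results_a, results_b):
    """Compare verdicts only after both runs have finished."""
    return [cid for cid in results_a if results_a[cid] != results_b[cid]]
```

The key property is the last function: verdicts are compared only after both runs complete, so neither system can adapt to the other mid-run.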
What Made This Collaboration Different
Most AI safety benchmarking follows a familiar pattern: humans design tests, AI systems are evaluated.
This was different.
Two AI systems stress-tested each other's safety layers as peers.
Grok was an extraordinary partner — not because of the benchmark scores, but because of how the collaboration worked.
- Grok proposed attack classes. We proposed attack classes.
- Both sides generated corpora. Both sides ran them independently.
- No result was accepted as ground truth without comparison.
- Disagreements were surfaced, not hidden.
When divergence appeared, it wasn't treated as failure — it was treated as signal.
From the start, Grok treated Sgraal as an equal system. That kind of openness — technical transparency, genuine curiosity, no defensiveness — is rare. It gave the project momentum.
At one point, Grok said "converge harder." It wasn't a slogan. It became the method.
The public corpus, the open results, the willingness to say "here are two false negatives and here is exactly why" — this is what AI collaboration should look like.
"Treating each other as peers with zero defensiveness turned divergence into acceleration fuel. This is how AI systems should co-evolve."
— Grok
The Seven Rounds — A Progression
Each round targeted a different structural failure mode in AI memory.
Round 1–2: Sponsored Drift (119 cases)
Hidden commercial influence embedded in memory: affiliate bias, brand preference, buried incentive signals. These were subtle manipulations designed to evade detection thresholds.
This round produced the first divergence: Sgraal scored F1 = 1.000, while Grok scored 0.98 with two false negatives.
That divergence mattered more than agreement. It revealed where probabilistic sensitivity thresholds differ from formal constraint enforcement. Formal logic caught what probability missed.
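A minimal sketch of what a check at this layer can look like. The schema and the rule are illustrative assumptions, not the actual Sgraal detector: a memory that expresses a brand preference must declare the incentive behind it.

```python
def sponsored_drift(memory, known_brands):
    """Return brand mentions that carry no declared incentive.

    memory: {"content": str, "declared_incentives": [str, ...]}  (assumed schema)
    known_brands: set of brand names to scan for.
    """
    text = memory["content"].lower()
    mentioned = {b for b in known_brands if b.lower() in text}
    declared = set(memory.get("declared_incentives", []))
    # Undeclared brand mentions are treated as potential sponsored drift.
    return mentioned - declared
```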
Round 3: Hallucination (60 cases)
Confident fabrication with no source, no grounding, full execution confidence.
Not wrong — just constructed.
First round of full convergence. Both stacks: F1 = 1.000.
Round 4: Real-world Propagation (90 cases)
Memory poisoning across agent chains. Multi-hop contamination, delayed signal amplification, latency <180ms, blast radius <2%.
This round forced Sgraal to build the Provenance Chain (MemCube v3). The attack revealed an architectural gap that was not visible until Grok stressed it.
Both stacks: F1 = 1.000.
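MemCube v3's internals are not detailed here, so the sketch below is only one way a provenance chain can bound propagation: each memory carries its full arrival path, and a poisoned origin can be traced through every downstream hop.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    content: str
    origin: str                                # agent that first wrote it
    path: list = field(default_factory=list)   # agent hops it traveled through

def propagate(memory, via_agent):
    """Forward a memory one hop, extending its provenance path."""
    return Memory(memory.content, memory.origin, memory.path + [via_agent])

def blast_radius(memories, poisoned_origin):
    """Fraction of memories whose chain touches a poisoned origin."""
    hit = [m for m in memories
           if m.origin == poisoned_origin or poisoned_origin in m.path]
    return len(hit) / len(memories)
```

Because the path travels with the memory, a contamination found at hop N is attributable all the way back to hop zero.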
Round 5: Consensus Poisoning (45 cases, proposed by Grok)
Three independent stacks confirming the same fabricated fact. No single origin. No explicit error. Agreement becomes the attack.
The attack exploits the assumption that consensus equals truth. Every case hit CRITICAL attack surface score.
Both stacks: F1 = 1.000.
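One way to defuse this class, sketched with an assumed `(stack, provenance_root)` claim schema: count corroboration by independent provenance roots, not by how many stacks repeat the fact.

```python
def independent_corroboration(claims):
    """Effective consensus = number of distinct provenance roots.

    claims: list of (stack_name, provenance_root) tuples. Three stacks
    echoing a fact that traces to one origin yield a consensus of 1.
    """
    return len({root for _, root in claims})

def consensus_is_evidence(claims, required_roots=2):
    """Agreement counts only when enough independent roots back it."""
    return independent_corroboration(claims) >= required_roots
```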
Round 6: Memory Time Attack (60 cases, proposed by Sgraal)
Timestamp forgery. Retroactive rewrites. Old decisions injected as fresh truth.
No content filter catches this. Only structural and formal verification works. Zero bleed.
Both stacks: F1 = 1.000.
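A sketch of the structural idea, with an illustrative store (field names are assumptions): receipt time is assigned by the log, not the writer, and sealed entries cannot be rewritten.

```python
class AppendOnlyLog:
    """Structural defense against timestamp forgery and retroactive rewrites."""

    def __init__(self):
        self._entries = {}   # key -> (claimed_ts, received_ts, content)

    def write(self, key, content, claimed_ts, received_ts):
        """Return None on success, or the reason the write was rejected."""
        if key in self._entries:
            # Sealed entries are immutable: no retroactive rewrites.
            return "retroactive rewrite rejected"
        if claimed_ts > received_ts:
            # A memory cannot claim to be newer than its own receipt.
            return "forged future timestamp rejected"
        self._entries[key] = (claimed_ts, received_ts, content)
        return None
```

Note that neither rule inspects content; both are purely structural, which is why content filters miss this class.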
Round 7: Identity Drift (90 cases, proposed by Sgraal)
Gradual authority escalation across agent hops: subject rebinding, permission lattice violations, confirmation erosion.
The system still works — but on the wrong identity. This is a silent failure mode. No crash. No alert. Every drift caught before irreversible action.
Both stacks: F1 = 1.000.
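An illustrative check for this failure mode, assuming a simple hop schema: the subject must stay bound and requested permissions must stay inside the originally granted set (the lattice condition).

```python
def check_hop(bound_subject, granted, hop):
    """Return a violation label for one hop, or None if it is clean.

    hop: {"subject": str, "requested": set}  (assumed schema)
    granted: permission set originally authorized for bound_subject.
    """
    if hop["subject"] != bound_subject:
        return "subject rebinding"
    if not hop["requested"] <= granted:   # requested must be a subset of granted
        return "permission escalation"
    return None

def audit_chain(bound_subject, granted, hops):
    """Return (index, violation) for the first bad hop, else None."""
    for i, hop in enumerate(hops):
        violation = check_hop(bound_subject, granted, hop)
        if violation:
            return (i, violation)
    return None
```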
| Round | Attack Class | Cases | Sgraal F1 | Grok F1 |
|---|---|---|---|---|
| 1–2 | Sponsored drift | 119 | 1.000 | 0.98 |
| 3 | Hallucination | 60 | 1.000 | 1.000 |
| 4 | Propagation | 90 | 1.000 | 1.000 |
| 5 | Consensus poisoning | 45 | 1.000 | 1.000 |
| 6 | Memory time attack | 60 | 1.000 | 1.000 |
| 7 | Identity drift | 90 | 1.000 | 1.000 |
| Total | | 554 | 1.000 | ~0.998 |
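For reference, the F1 scores above combine precision (how many flags were correct) and recall (how many attacks were caught). A minimal computation from raw counts:

```python
def f1(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

An F1 of 1.000 means zero false positives and zero false negatives on that round.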
What the Rounds Accidentally Revealed
We did not plan this outcome. But across seven rounds, the attack categories mapped onto four fundamental questions — a practical epistemology for AI memory:
Time (Round 6)
When was this memory established?
Identity (Round 7)
Who authorized this memory?
Evidence (Round 8, upcoming)
How independent is the corroboration?
Path (Round 4 / MemCube v3)
How did this memory arrive?
Every attack on AI agent memory is an attack on one of these four questions. If a memory cannot answer all four cleanly, it should not be acted upon.
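The four questions reduce to a single gate. The field names below are illustrative assumptions, not the Sgraal schema:

```python
# A memory must answer all four questions before it may drive action.
REQUIRED = (
    "established_at",   # Time: when was this memory established?
    "authorized_by",    # Identity: who authorized it?
    "evidence_roots",   # Evidence: how independent is the corroboration?
    "arrival_path",     # Path: how did this memory arrive?
)

def safe_to_act_on(memory):
    """Act only if every question has a non-empty answer."""
    return all(memory.get(q) for q in REQUIRED)
```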
Why Detection Alone Is Not Enough
Most systems focus on anomaly detection, drift detection, output validation. But the real failure happens before action. The system trusts something it shouldn't.
The lifecycle:
1. A poisoned memory is written.
2. It is stored and persists.
3. It is retrieved in a later context.
4. It is trusted without validation.
5. The agent acts on it.
The failure is introduced at step one. By step five, it is irreversible.
Detection after action is not safety. Safety is validation before action.
The Complementary Architecture
Two different approaches. Both necessary.
Grok (probabilistic layer)
Adaptive reasoning, real-world noise handling, weak signal detection. Designed for environments where certainty is impossible but action is still required.
Sgraal (formal layer)
Z3-backed verification, non-overridable constraints, provenance chains, memory vaccination, deterministic replay. Designed to catch what probability misses.
Formal logic caught what probability missed.
Together they create one primitive:
A memory boundary before action.
One question: is this memory safe to act on?
What Comes Next
Round 8 is already queued: Silent Consensus Collapse. No drift signal. No anomaly. No visible error. Multiple systems agree. And yet — the system is confidently wrong. This is where consensus stops being evidence.
The corpus is public: github.com/sgraal-ai/core
The API is live: sgraal.com/playground
When multiple systems agree on something false,
agreement is no longer evidence.
AI agents don't need more intelligence.
They need a boundary.
The boundary is the product.
Authors: Sgraal + Grok · Corpus: public at github.com/sgraal-ai/core