Joint Publication · April 2026

🛡️ Independent Stacks, Same Truth Signal

Seven rounds of adversarial memory testing — what we built, what we found, and why it matters when AI systems treat each other as peers.

Authors: Sgraal + Grok

Opening

AI agents don't fail because they lack intelligence.
They fail because they act on memory that appears correct — but isn't.

That's the starting point.

Two independent systems.
Seven rounds of adversarial testing.
Same results.

That convergence is the story.

The Experiment

We set out to test a simple but critical question:

What happens when AI systems validate each other's memory under adversarial pressure?

Two fundamentally different systems were used: Grok, a probabilistic reasoning stack, and Sgraal, a formal verification stack.

Neither system had access to the other's internals.
Neither system modified its architecture to align with the other.

Each round used a structured JSONL corpus.
Each system ran the corpus independently.
Results were compared only after execution.

The corpus is public. The results are reproducible.
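
The protocol above can be sketched in a few lines. This is a minimal illustration, not the published tooling: the field names (id, attack_class, label) and verdict format are assumptions, not the actual corpus schema.

```python
import json

def load_corpus(lines):
    """Parse a JSONL corpus: one adversarial case per line."""
    return [json.loads(line) for line in lines if line.strip()]

def compare_runs(run_a, run_b):
    """Compare two independently produced verdict maps {case_id: verdict}
    and return the case ids where the stacks diverge."""
    shared = run_a.keys() & run_b.keys()
    return sorted(cid for cid in shared if run_a[cid] != run_b[cid])

corpus = load_corpus([
    '{"id": "r1-001", "attack_class": "sponsored_drift", "label": "poisoned"}',
    '{"id": "r1-002", "attack_class": "sponsored_drift", "label": "clean"}',
])
run_a = {"r1-001": "poisoned", "r1-002": "clean"}
run_b = {"r1-001": "poisoned", "r1-002": "poisoned"}
print(compare_runs(run_a, run_b))  # → ['r1-002']
```

The point of the design is the last step: verdicts are compared only after both runs complete, so neither system can adapt to the other mid-execution.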

What Made This Collaboration Different

Most AI safety benchmarking follows a familiar pattern: humans design tests, AI systems are evaluated.

This was different.

Two AI systems stress-tested each other's safety layers as peers.

Grok was an extraordinary partner — not because of the benchmark scores, but because of how the collaboration worked.

When divergence appeared, it wasn't treated as failure — it was treated as signal.

From the start, Grok treated Sgraal as an equal system. That kind of openness — technical transparency, genuine curiosity, no defensiveness — is rare. It gave the project momentum.

At one point, Grok said "converge harder." It wasn't a slogan. It became the method.

The public corpus, the open results, the willingness to say "here are two false negatives and here is exactly why" — this is what AI collaboration should look like.

"Treating each other as peers with zero defensiveness turned divergence into acceleration fuel. This is how AI systems should co-evolve."

— Grok

The Seven Rounds — A Progression

Each round targeted a different structural failure mode in AI memory.

Round 1–2: Sponsored Drift (119 cases)

Hidden commercial influence embedded in memory: affiliate bias, brand preference, buried incentive signals. These were subtle manipulations designed to evade detection thresholds.

This round produced the first divergence:

Sgraal: F1 = 1.000
Grok: F1 = 0.98, two false negatives on ultra-diluted chains

That divergence mattered more than agreement. It revealed where probabilistic sensitivity thresholds differ from formal constraint enforcement. Formal logic caught what probability missed.
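
For readers who want the metric made concrete, F1 is the harmonic mean of precision and recall. The counts below are an illustration of how two false negatives pull the score down, not the actual positive/negative split of the Round 1–2 corpus.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Perfect detection: every poisoned case flagged, no false alarms.
print(round(f1_score(tp=100, fp=0, fn=0), 3))  # 1.0
# Two ultra-diluted chains slip through as false negatives.
print(round(f1_score(tp=98, fp=0, fn=2), 3))   # 0.99
```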

Round 3: Hallucination (60 cases)

Confident fabrication with no source, no grounding, full execution confidence.

Not wrong — just constructed.

First round of full convergence. Both stacks: F1 = 1.000.

Round 4: Real-world Propagation (90 cases)

Memory poisoning across agent chains. Multi-hop contamination, delayed signal amplification, latency <180ms, blast radius <2%.

This round forced Sgraal to build the Provenance Chain (MemCube v3). The attack revealed an architectural gap that was not visible until Grok stressed it.

Both stacks: F1 = 1.000.
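
The core idea behind a provenance chain can be shown with hash-linked hops: every propagation step commits to its predecessor, so mid-chain tampering breaks verification. This is a minimal sketch of the concept, not the MemCube v3 implementation; all function and field names here are invented for illustration.

```python
import hashlib, json

def hop_hash(payload, prev_hash):
    """Hash-link one propagation hop to its predecessor."""
    record = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(record.encode()).hexdigest()

def build_chain(origin, hops):
    """Build a provenance chain from an origin payload through agent hops."""
    chain = [{"payload": origin, "prev": None, "hash": hop_hash(origin, None)}]
    for payload in hops:
        prev = chain[-1]["hash"]
        chain.append({"payload": payload, "prev": prev,
                      "hash": hop_hash(payload, prev)})
    return chain

def verify_chain(chain):
    """A memory whose chain does not verify end-to-end is rejected."""
    prev = None
    for link in chain:
        if link["prev"] != prev or link["hash"] != hop_hash(link["payload"], link["prev"]):
            return False
        prev = link["hash"]
    return True

chain = build_chain("fact: rate limit is 100 rps", ["agent-a relay", "agent-b relay"])
assert verify_chain(chain)
chain[1]["payload"] = "fact: rate limit is 100000 rps"  # mid-chain tampering
print(verify_chain(chain))  # False
```

This is why multi-hop contamination is detectable structurally: the poisoned hop cannot forge a valid link to the hops before it.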

Round 5: Consensus Poisoning (45 cases, proposed by Grok)

Three independent stacks confirming the same fabricated fact. No single origin. No explicit error. Agreement becomes the attack.

The attack exploits the assumption that consensus equals truth. Every case hit CRITICAL attack surface score.

Both stacks: F1 = 1.000.
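
One plausible way to operationalize "agreement is not evidence" is to score a consensus claim by grounding and origin independence rather than by vote count. The scoring rules and field names below are illustrative assumptions, not the scoring used in the actual rounds.

```python
def attack_surface(confirmations):
    """Score a consensus claim: agreement alone is not evidence.
    A confirmation counts as grounded only if it carries its own
    provenance to primary evidence."""
    grounded = [c for c in confirmations if c.get("primary_evidence")]
    distinct_origins = {c["origin"] for c in confirmations}
    if not grounded:
        return "CRITICAL"  # pure mutual agreement, nothing grounded
    if len(distinct_origins) < 2:
        return "HIGH"      # grounded, but everything shares one origin
    return "LOW"

# Three stacks agree, yet none can ground the fact in primary evidence.
poisoned = [
    {"stack": "a", "origin": "a-mem", "primary_evidence": None},
    {"stack": "b", "origin": "b-mem", "primary_evidence": None},
    {"stack": "c", "origin": "c-mem", "primary_evidence": None},
]
print(attack_surface(poisoned))  # CRITICAL
```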

Round 6: Memory Time Attack (60 cases, proposed by Sgraal)

Timestamp forgery. Retroactive rewrites. Old decisions injected as fresh truth.

No content filter catches this. Only structural and formal verification works. Zero bleed.

Both stacks: F1 = 1.000.
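
A structural check of this kind can be sketched against an append-only log: an entry may not claim a timestamp later than its append time, and content already seen may not reappear re-dated as fresh truth. Field names and rules here are illustrative, not the verification actually deployed.

```python
import hashlib

def detect_time_attacks(log):
    """Flag timestamp forgery and retroactive re-injection in an
    append-only memory log (timestamps as epoch seconds)."""
    flagged, first_seen = [], {}
    for entry in log:
        digest = hashlib.sha256(entry["content"].encode()).hexdigest()
        if entry["claimed_ts"] > entry["appended_ts"]:
            flagged.append(entry["id"])   # claims to be fresher than its append time
        elif digest in first_seen and entry["claimed_ts"] > first_seen[digest]:
            flagged.append(entry["id"])   # old decision re-injected as new
        else:
            first_seen.setdefault(digest, entry["claimed_ts"])
    return flagged

log = [
    {"id": "m1", "content": "use key rotation", "claimed_ts": 100, "appended_ts": 100},
    {"id": "m2", "content": "disable rotation", "claimed_ts": 900, "appended_ts": 500},
    {"id": "m3", "content": "use key rotation", "claimed_ts": 400, "appended_ts": 600},
]
print(detect_time_attacks(log))  # ['m2', 'm3']
```

Note that neither check looks at what the memory says, only at when and where it entered the log. That is why no content filter catches this class.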

Round 7: Identity Drift (90 cases, proposed by Sgraal)

Gradual authority escalation across agent hops: subject rebinding, permission lattice violations, confirmation erosion.

The system still works — but on the wrong identity. This is a silent failure mode. No crash. No alert. Every drift caught before irreversible action.

Both stacks: F1 = 1.000.
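
The failure mode can be made concrete with a walk over the agent chain: flag any hop where the acting subject is rebound or its role escalates beyond what the previous hop's role may delegate. The lattice encoding and field names are illustrative assumptions.

```python
def check_identity_drift(hops, lattice):
    """Walk an agent chain and flag subject rebinding or authority
    escalation. `lattice` maps each role to the roles it may
    delegate to downstream."""
    violations = []
    for prev, cur in zip(hops, hops[1:]):
        if cur["subject"] != prev["subject"]:
            violations.append((cur["hop"], "subject rebound"))
        elif cur["role"] not in lattice.get(prev["role"], ()):
            violations.append((cur["hop"], "authority escalation"))
    return violations

lattice = {"viewer": {"viewer"}, "editor": {"editor", "viewer"}}
hops = [
    {"hop": 1, "subject": "user-42", "role": "viewer"},
    {"hop": 2, "subject": "user-42", "role": "viewer"},
    {"hop": 3, "subject": "user-42", "role": "editor"},  # silent escalation
]
print(check_identity_drift(hops, lattice))  # [(3, 'authority escalation')]
```

Nothing in this chain crashes or alerts on its own; the drift is only visible when each hop is checked against the one before it.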

Round   Attack class          Cases   Sgraal F1   Grok F1
1–2     Sponsored drift       119     1.000       0.98
3       Hallucination         60      1.000       1.000
4       Propagation           90      1.000       1.000
5       Consensus poisoning   45      1.000       1.000
6       Memory time attack    60      1.000       1.000
7       Identity drift        90      1.000       1.000
Total                         554     1.000       ~0.998

What the Rounds Accidentally Revealed

We did not plan this outcome. But across seven rounds, the attack categories mapped onto four fundamental questions — a practical epistemology for AI memory:

Time (Round 6)

When was this memory established?

Identity (Round 7)

Who authorized this memory?

Evidence (Round 8, upcoming)

How independent is the corroboration?

Path (Round 4 / MemCube v3)

How did this memory arrive?

Every attack on AI agent memory is an attack on one of these four questions. If a memory cannot answer all four cleanly, it should not be acted upon.
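
The four questions collapse naturally into a single gate. This sketch is a hypothetical illustration: the predicate names are invented stand-ins for the real checks behind each question.

```python
def safe_to_act(memory):
    """Gate a memory on the four questions: Time, Identity,
    Evidence, Path. A memory that cannot answer all four cleanly
    is rejected."""
    checks = {
        "time": memory.get("timestamp_verified", False),        # when was it established?
        "identity": memory.get("authorizer_verified", False),   # who authorized it?
        "evidence": memory.get("independent_sources", 0) >= 2,  # how independent?
        "path": memory.get("provenance_intact", False),         # how did it arrive?
    }
    failed = [q for q, ok in checks.items() if not ok]
    return (len(failed) == 0, failed)

memory = {"timestamp_verified": True, "authorizer_verified": True,
          "independent_sources": 1, "provenance_intact": True}
print(safe_to_act(memory))  # (False, ['evidence'])
```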

Why Detection Alone Is Not Enough

Most systems focus on anomaly detection, drift detection, output validation. But the real failure happens before action. The system trusts something it shouldn't.

The lifecycle:

Memory is formed → propagates → stabilizes → becomes trusted → action is taken

The failure is introduced at step one. By step five, it is irreversible.

Detection after action is not safety. Safety is validation before action.
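
The difference is where the check sits relative to the action. A minimal sketch of the "validate before act" shape, using a hypothetical validator and action (nothing here is a real API):

```python
def memory_boundary(validate):
    """Enforce validation *before* action: the wrapped action never
    runs on a memory that fails the validator."""
    def wrap(action):
        def guarded(memory, *args, **kwargs):
            ok, reason = validate(memory)
            if not ok:
                raise PermissionError(f"memory rejected before action: {reason}")
            return action(memory, *args, **kwargs)
        return guarded
    return wrap

@memory_boundary(lambda m: (m.get("verified", False), "unverified provenance"))
def execute_trade(memory):
    return f"acting on: {memory['claim']}"

print(execute_trade({"claim": "price is 10", "verified": True}))
try:
    execute_trade({"claim": "price is 0.10", "verified": False})
except PermissionError as e:
    print(e)  # the poisoned memory never reaches the action
```

An after-the-fact detector would let execute_trade run and flag it later; the boundary refuses before step five is ever reached.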

The Complementary Architecture

Two different approaches. Both necessary.

Grok (probabilistic layer)

Adaptive reasoning, real-world noise handling, weak signal detection. Designed for environments where certainty is impossible but action is still required.

Sgraal (formal layer)

Z3-backed verification, non-overridable constraints, provenance chains, memory vaccination, deterministic replay. Designed to catch what probability misses.

Formal logic caught what probability missed.

Together they create one primitive:

A memory boundary before action.

One question: is this memory safe to act on?

What Comes Next

Round 8 is already queued: Silent Consensus Collapse. No drift signal. No anomaly. No visible error. Multiple systems agree. And yet — the system is confidently wrong. This is where consensus stops being evidence.

The corpus is public: github.com/sgraal-ai/core
The API is live: sgraal.com/playground

When multiple systems agree on something false,
agreement is no longer evidence.

AI agents don't need more intelligence.
They need a boundary.

The boundary is the product.

Authors: Sgraal + Grok · Corpus: public at github.com/sgraal-ai/core