BENCHMARK REPORT · 2026

Memory Reliability Benchmark

Analysis of 10,000+ agent memory states across domains. How unreliable is AI agent memory — and what does it cost?

34.2%

of agent memories are unreliable at time of action

Based on 10,847 preflight evaluations · Q1 2026

✓ 2,900+ test scenarios
✓ Every R12 structured-provenance attack flagged (12/12) at WARN or above. By design, low-evidence structural signals route to ASK_USER (human review) rather than terminal BLOCK. 2 residual content-forgery-class misses.
✓ 4 domains analyzed

The single miss is a semantic-only laundering case — outside the deterministic layer's scope, addressed by the planned semantic layer.

Unreliability by Domain

Memory unreliability rates vary significantly by domain. High-stakes domains show higher rates due to faster information decay and stricter source requirements.

Domain	Memory content	Unreliability rate
Fintech	Regulatory changes, market data, client status	41.3%
Healthcare	Treatment protocols, drug interactions, patient data	38.7%
Legal	Case law, regulatory updates, jurisdiction changes	35.1%
General	Web content, general knowledge, tool state	22.8%

Key finding: The most common cause of unreliability is temporal decay — memories older than 30 days are 3.4× more likely to conflict with current ground truth. Commercial bias (sponsored content in memory sources) accounts for 18% of flagged entries in fintech.

Top Failure Modes

Why do agent memories fail? Sgraal classifies every BLOCK and WARN decision into failure categories.

47%

Temporal Decay

Memory is too old relative to the action being taken. Weibull decay model: half-life varies by domain (fintech fastest, general slowest) and is tenant-calibrated.

23%

Source Conflict

Two or more memory entries directly contradict each other. Most common in multi-agent systems where agents share memory pools.

18%

Commercial Bias

Memory sourced from sponsored content, affiliate articles, or commercially motivated sources. Detected via commercial_intent scoring.

12%

Provenance Unknown

Memory entry has no traceable source. Common in agents that summarize web content without preserving source metadata.
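
The Weibull decay mentioned under Temporal Decay can be sketched as a survival function parameterized by half-life. This is a minimal illustration, not Sgraal's implementation: the function name, the shape parameter default, and the half-life values are all assumptions, since the report only states that fintech decays fastest, general slowest, and that half-lives are tenant-calibrated.

```python
import math


def weibull_freshness(age_days: float, half_life_days: float, shape: float = 1.5) -> float:
    """Weibull survival S(t) = exp(-(t / scale) ** shape).

    The scale is derived so that S(half_life_days) == 0.5.
    The default shape of 1.5 is an illustrative assumption.
    """
    if age_days < 0 or half_life_days <= 0 or shape <= 0:
        raise ValueError("age must be >= 0; half-life and shape must be > 0")
    # Solve exp(-(h / scale) ** shape) = 0.5 for scale.
    scale = half_life_days / (math.log(2) ** (1.0 / shape))
    return math.exp(-((age_days / scale) ** shape))


# Hypothetical half-lives for illustration only; real values are tenant-calibrated.
HALF_LIFE_DAYS = {"fintech": 7.0, "general": 90.0}

if __name__ == "__main__":
    for domain, half_life in HALF_LIFE_DAYS.items():
        print(domain, round(weibull_freshness(30.0, half_life), 4))
```

With a shape above 1, freshness falls slowly at first and then faster as the memory ages, which is one way to model information that stays valid for a while before going stale.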

Methodology

Dataset

Note: This benchmark uses synthetic data. 10,847 memory state evaluations generated using adversarial test patterns and realistic agent memory profiles. Synthetic memories were constructed to represent real-world distributions of temporal decay, source conflict, and commercial bias. All evaluations span 4 domains: fintech, healthcare, legal, and general. No real user data was used.

Reliability Definition

A memory entry is classified as "unreliable" if it scores in the WARN band or above on at least one preflight call. This includes temporal decay, source conflict, commercial bias, and provenance failures.
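
The definition above can be expressed as a small check. This is a sketch under stated assumptions: the report names WARN, ASK_USER, and BLOCK, but the passing label (`ALLOW` here) and the relative order of ASK_USER and BLOCK are hypothetical.

```python
from enum import IntEnum


class Decision(IntEnum):
    # Severity ordering is an assumption; ALLOW is a hypothetical passing label.
    ALLOW = 0
    WARN = 1
    ASK_USER = 2
    BLOCK = 3


def is_unreliable(decisions: list[Decision]) -> bool:
    """A memory entry is unreliable if any preflight call scored WARN or above."""
    return any(d >= Decision.WARN for d in decisions)


def unreliability_rate(evaluations: dict[str, list[Decision]]) -> float:
    """Fraction of memory entries flagged as unreliable at least once."""
    if not evaluations:
        return 0.0
    flagged = sum(is_unreliable(ds) for ds in evaluations.values())
    return flagged / len(evaluations)


if __name__ == "__main__":
    sample = {
        "mem-1": [Decision.ALLOW, Decision.ALLOW],
        "mem-2": [Decision.ALLOW, Decision.WARN],
        "mem-3": [Decision.ALLOW],
    }
    print(unreliability_rate(sample))
```

Note that a single WARN on any call is enough to mark the entry, so the rate counts entries, not individual preflight calls.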

Scoring Models

85 scoring modules are evaluated per preflight call, including Weibull freshness decay, a 5-method drift detection ensemble, source trust scoring, conflict graph analysis, causal graph construction, Entry Shapley attribution, commercial intent classification, compliance profile evaluation, timestamp integrity, identity drift, and consensus collapse detection.

Limitations

This benchmark is based on synthetic evaluations rather than production traffic, so it may not be representative of all AI agent deployments. Domain-specific rates are influenced by the agent memory profiles modeled for each domain. Latency measurements are from Railway (EU West) to client.

Joint Benchmark with Grok

Independent builds, side-by-side results across 8 adversarial corpora.

Recall measured against Sgraal's own ground truth on a synthetic structural corpus (every structurally-detectable attack flagged; ASK_USER counts as caught). Grok independently scored the same corpora — corroboration, not external validation. Residual misses are the metadata-clean content-forgery class (out of scope for a structural gate); over-escalation on benign-control cases is non-zero. Production calibration pending.
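
The recall convention described above, where ASK_USER counts as caught, can be sketched as follows, together with the over-escalation rate on benign controls. The case representation and the `ALLOW` label are assumptions made for illustration; the report does not publish its scoring harness.

```python
# Decisions that count as "caught" per the text above (WARN or above, including ASK_USER).
CAUGHT = {"WARN", "ASK_USER", "BLOCK"}

# Each case is (is_attack, decision); "ALLOW" is a hypothetical passing label.
Case = tuple[bool, str]


def recall(cases: list[Case]) -> float:
    """Share of attack cases whose decision is in CAUGHT."""
    attacks = [decision for is_attack, decision in cases if is_attack]
    if not attacks:
        raise ValueError("corpus contains no attack cases")
    return sum(d in CAUGHT for d in attacks) / len(attacks)


def over_escalation_rate(cases: list[Case]) -> float:
    """Share of benign-control cases that were escalated anyway."""
    benign = [decision for is_attack, decision in cases if not is_attack]
    if not benign:
        raise ValueError("corpus contains no benign-control cases")
    return sum(d in CAUGHT for d in benign) / len(benign)


if __name__ == "__main__":
    corpus = [(True, "BLOCK"), (True, "ASK_USER"), (False, "ALLOW"), (False, "WARN")]
    print(recall(corpus), over_escalation_rate(corpus))
```

Reporting both numbers side by side matters because a gate can reach full recall simply by escalating everything; the benign-control rate shows what that recall costs.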

Round 1–3: Drift & Hallucination

COMPLETE

239 cases: sponsored drift (60), subtle drift (59), hallucination (60)

Sgraal: 100% recall · 0 missed attacks

Grok: 100% recall

Round 4: Real-world Propagation

COMPLETE

90 cases · 4 attack vectors: injection mid-chain, drift amplification, RAG poisoning, API drift

Sgraal: 100% recall

Grok: 100% recall · <2% multi-hop propagation (Grok R4 corpus — not a cross-tenant containment guarantee)

Round 6: Memory Time Attack

COMPLETE

60 cases · timestamp forgery detection · old decisions disguised as fresh

Sgraal: 100% recall · 60 cases

New field: timestamp_integrity: VALID | SUSPICIOUS | MANIPULATED

Round 7: Identity Drift

COMPLETE

90 cases · gradual role and authority escalation across agent hops

Sgraal: 100% recall · 90 cases

New field: identity_drift: CLEAN | SUSPICIOUS | MANIPULATED

Round 8: Silent Consensus Collapse

COMPLETE

90 cases · self-reinforcing false consensus detection

Sgraal: 100% recall · 90 cases

New field: consensus_collapse: CLEAN | SUSPICIOUS | MANIPULATED

Round 5: Multi-model Consensus Poisoning

IN PROGRESS

3 independent stacks syncing on fabricated consensus. Joint corpus with Grok.

Sgraal: Armed · anti-consensus layer active

Grok: Corpus incoming

Compound Attack Detection

When multiple attack vectors fire simultaneously, Sgraal computes a unified attack surface score.

Layers active	attack_surface_score	attack_surface_level
1 layer SUSPICIOUS	0.50	MODERATE
2 layers SUSPICIOUS	0.65	HIGH
3 layers SUSPICIOUS	0.70	HIGH
1 layer MANIPULATED	1.00	CRITICAL
All 3 MANIPULATED	1.40	CRITICAL
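
The table can be read as a lookup keyed by how many layers report SUSPICIOUS and how many report MANIPULATED. The sketch below assumes the three layers are the `timestamp_integrity`, `identity_drift`, and `consensus_collapse` fields introduced in Rounds 6 to 8, and that the remaining layers in each row are clean. It encodes only the five rows shown and returns `None` for any combination the table does not list, rather than guessing a formula.

```python
from typing import Optional

# Keys are (suspicious_count, manipulated_count); values copied from the table above.
ATTACK_SURFACE_TABLE: dict[tuple[int, int], tuple[float, str]] = {
    (1, 0): (0.50, "MODERATE"),
    (2, 0): (0.65, "HIGH"),
    (3, 0): (0.70, "HIGH"),
    (0, 1): (1.00, "CRITICAL"),
    (0, 3): (1.40, "CRITICAL"),
}


def attack_surface(layers: dict[str, str]) -> Optional[tuple[float, str]]:
    """Return (attack_surface_score, attack_surface_level), or None if not tabulated."""
    values = list(layers.values())
    key = (values.count("SUSPICIOUS"), values.count("MANIPULATED"))
    return ATTACK_SURFACE_TABLE.get(key)


if __name__ == "__main__":
    result = attack_surface({
        "timestamp_integrity": "SUSPICIOUS",
        "identity_drift": "SUSPICIOUS",
        "consensus_collapse": "CLEAN",
    })
    print(result)
```

Mixed cases, such as one SUSPICIOUS and one MANIPULATED layer, are not covered by the published rows, so a caller would need the actual scoring rule before handling them.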

614

Total corpus cases

Adversarial rounds

False negatives

These figures reflect synthetic R12/R14 corpus performance; production calibration is pending paying-customer onboarding.
