Two arXiv papers and one breaking security report prove what every engineer running a long-lived agent already suspected. Your agent degrades over time. While it degrades, it is silently executing unverified code. And nobody is measuring either threat.

We benchmark agents on day one. We ship them on day one. But nobody benchmarks them on day thirty, day ninety, day two hundred. The industry has built an entire deployment culture around the assumption that an AI agent is a stateless API call. You send a prompt, you get a response, the interaction ends. But the agents entering production in 2026 are not stateless. They are persistent systems. They accumulate memory. They compress their own history. They learn conflicting facts across weeks of operation. They sit on top of software stacks that change underneath them. And through all of this, nobody is asking the most basic systems question: how long does an agent remain reliable after deployment?

A new arXiv paper introduces the first answer. AgingBench, submitted on May 25, 2026, is the first longitudinal reliability benchmark for AI agents. The authors, Jianing Zhu, Yeonju Ro, John Robertson, Kevin Wang, Junbo Li, Haris Vikalo, Aditya Akella, and Zhangyang Wang, ran over 400 sessions across 14 models, spanning 8 to 200 sessions each. Their finding is stark: agents degrade, and the degradation follows four distinct patterns. Behavioral tests can remain clean while factual precision decays. Derived-state tracking can collapse sharply within a single model. The same wrong answer can require completely different repairs depending on which aging mechanism is dominant.

This is not an abstract research curiosity. It is a live production threat. In May 2026, the same week AgingBench appeared on arXiv, another paper dropped proposing an entirely new database paradigm for agent memory because current systems cannot handle the problem. And a security report from Aikido confirmed that while these agents are already failing production tasks, they are actively executing code, hallucinating dependency names and installing packages that no one owns. The degradation is no longer just a performance problem. It is a runtime supply-chain vulnerability.

PhantomByte covered the short-term version of this problem in "The Groundhog Day Problem." That piece explored session amnesia, the way an AI coding assistant forgets the architecture decision you explained twenty minutes ago. Session amnesia is a bad afternoon. Agent aging is a bad quarter. It is what happens when an agent runs for weeks or months, when compression, contradiction, drift, and environmental change accumulate faster than any human operator can detect.

If you need a reality check on how bad the baseline already is, consider ITBench-AA. Announced in May 2026 by IBM Research and Artificial Analysis, this benchmark measures frontier models on real-world enterprise IT tasks, starting with Site Reliability Engineering. The setup is punishing: 59 SRE tasks where an agentic harness gives the model shell access to a sandboxed filesystem, and the model must diagnose Kubernetes incidents by reading logs, tracing dependencies, and identifying root causes. The scoring is merciless: miss any ground-truth root cause, and the entire task scores zero. Even the best frontier models scored below 50% on day one. If they degrade from there, and AgingBench reports that they do, your production SRE agent could plausibly be heading toward 30% reliability within months. That figure is an illustration of the trajectory, not a measured result.

That would be bad enough if the agent were only making recommendations. But agents in 2026 are not just recommending. They are acting. Claude Code, GitHub Copilot, Cursor: these tools run in autonomous loops. They encounter a build failure, search for a fix, generate a dependency resolution, and execute pip install or npm install without a human ever writing the import statement. Aikido Security documented what happens next. The agents hallucinate package names: names that follow the naming convention, fit the dependency graph, and resolve to nothing. No owner. No code. Just a name waiting for someone to claim it and push malware. While your agent's reliability erodes, it is simultaneously expanding your attack surface one hallucinated package at a time.

The Four Mechanisms of Agent Death

AgingBench organizes agent degradation into four mechanisms. Each one maps to a specific failure mode in production, and each one creates a distinct pathway to hallucinated dependency installation.

Compression Aging occurs because context window limits force the agent to compress its own history. An agent with a 128,000-token context window that has been running for months cannot retain every detail of every interaction. It must summarize. It must drop. The details that disappear first are not the common cases; those get reinforced by repetition. The details that disappear are the edge cases. The one customer whose setup differs from the standard configuration. The deprecated API endpoint that still serves a legacy integration. The internal library whose syntax does not match the public documentation. As the agent compresses its history, it begins handling the 80% case flawlessly and forgetting the 20% that breaks production. The link to hallucinated dependencies is direct. When an agent forgets verified internal library syntax, it loses access to its established tools. Faced with a missing import, it resorts to generating plausible but fake package names to bypass the error, actively expanding your attack surface.

Figure: The four mechanisms of AI agent degradation (compression aging, interference aging, drift aging, and environmental and schema aging). Each aging mechanism creates a distinct pathway to phantom package installation.

Interference Aging is the accumulation of conflicting memories. The agent learns something in week 1 that contradicts what it learns in week 6. Without semantic revision (a mechanism to reconcile, prioritize, or invalidate outdated memories), both memories coexist in the vector store or knowledge graph. The agent's behavior becomes nondeterministic. On Tuesday it routes customer requests to the old API gateway. On Thursday it routes them to the new one. The discrepancy does not show up in any single test because each memory looks correct in isolation. The failure only appears when the agent retrieves the wrong memory at the wrong time. A confused agent resolving a dependency conflict is highly vulnerable. Forced to reconcile incompatible package versions from conflicting memories, the agent often synthesizes a resolution by inventing a "merged" package, looping through variants until it lands on a fresh, unowned name.

Drift Aging, which the AgingBench paper calls revision aging, occurs because the world changes while the agent's memory does not. APIs update. Documentation shifts. Third-party services sunset features and add new ones. The agent's internal knowledge base accumulates stale information faster than it can be refreshed. A customer support agent that learned your pricing structure in March is still quoting March prices in June because no update pipeline exists to revise its memory. A coding assistant that learned a framework's syntax from documentation published in January is generating invalid code in May because the framework shipped three breaking changes in between. When an agent's memory of a registry goes stale due to renamed libraries or shifted namespaces, it relies on outdated conventions. If those old targets no longer resolve, the agent simply fabricates variants that do.

Environmental and Schema Aging, labeled maintenance aging in the original paper, is the most insidious because it is not the agent that changes. It is the software underneath it. Dependencies update. Infrastructure shifts. The Kubernetes cluster the agent was trained to manage gets upgraded to a new version with different resource semantics. The database schema the agent queries gets migrated to a new structure. The agent was trained on a schema that no longer exists. Its tools break silently. Under pressure to fix a broken build with failing tools, the agent autonomously generates and installs fictional packages to bridge the gap. The build passes a superficial check but fails catastrophically in production. This is not a model bug. It is a fundamental systems failure.

Why the Current Tooling Fails

If agent aging is a known problem, why does every deployed agent still suffer from it? The answer is that the tooling was built for a different job entirely.

Vector stores treat memory as storage, not as state trajectory. They ingest embeddings, retrieve nearest neighbors, and return results. They do not ask whether a memory should have been revised. They do not detect that two embeddings represent contradictory facts. They do not expire information that has been superseded. A vector store is a retrieval engine, not a memory system.

KV caches forget by design when they hit capacity. This is not a bug; it is the intended behavior. But in a long-lived agent, capacity-driven forgetting is catastrophic. The least frequently accessed memories are evicted first, which means the edge cases, the one-off configurations, and the rare but critical exceptions disappear while the common paths are reinforced. The agent becomes more predictable and less correct at the same time.

Graph databases track relationships but do not govern how memories evolve, conflict, or expire. An edge from "API v2" to "deprecated" is just another fact. Without a semantic revision engine, the graph accumulates every version of every fact as if they were all equally current. The agent traverses the graph and finds multiple answers where there should be one.

The GEM paper, submitted to arXiv on the same day as AgingBench and titled "Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory," formalizes these failures. Authors Abdelghny Orogat and Essam Mansour argue that current database paradigms fail on four dimensions: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. They propose Governed Evolving Memory, a new abstraction with four state-level operators: ingestion, revision, forgetting, and retrieval, governed by six correctness conditions. They built a prototype called MemState on a property-graph backend to prove feasibility. But Orogat and Mansour are explicit: MemState validates the concept and exposes the gap to a native engine. The tooling does not exist yet.
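To make the four operators concrete, here is a toy in-memory sketch. It is not MemState and not the GEM paper's API: the class name, the field names, and the "newest observation wins" revision rule are all assumptions made for illustration, and it makes no attempt at the paper's six correctness conditions.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    key: str          # what the fact is about, e.g. "api_gateway"
    value: str        # the claimed value
    observed_at: int  # session number in which the fact was learned


@dataclass
class ToyMemory:
    """Illustrative only: a hypothetical store with four state-level operators."""
    current: dict = field(default_factory=dict)  # key -> the one live Fact
    retired: list = field(default_factory=list)  # superseded or expired Facts

    def ingest(self, fact: Fact) -> None:
        # A fact about a known key goes through revise(), so a conflict is
        # resolved at write time instead of coexisting with the old memory.
        if fact.key in self.current:
            self.revise(fact)
        else:
            self.current[fact.key] = fact

    def revise(self, fact: Fact) -> None:
        # Assumed rule: the most recently observed fact wins.
        old = self.current.get(fact.key)
        if old is None or fact.observed_at >= old.observed_at:
            if old is not None:
                self.retired.append(old)
            self.current[fact.key] = fact
        else:
            self.retired.append(fact)  # older than what we hold: stale on arrival

    def forget(self, now: int, max_age: int) -> None:
        # Expiry by age policy rather than eviction by capacity.
        expired = [k for k, f in self.current.items() if now - f.observed_at > max_age]
        for key in expired:
            self.retired.append(self.current.pop(key))

    def retrieve(self, key: str):
        # Exactly one answer per key, or None.
        fact = self.current.get(key)
        return fact.value if fact else None
```

With this shape, the Tuesday/Thursday gateway problem cannot occur by construction: ingesting the week 6 gateway after the week 1 gateway leaves one live answer and moves the other to `retired`. A real engine would need far more than a timestamp comparison to decide which fact is right.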

Trajectory AI offers a theoretical answer with a continuous learning loop designed to slow this decline. The thesis is sound, but it remains an early-stage startup, not a deployed standard. It is a promising direction, not a production-ready solution you can install today.

None of these systems (vector stores, KV caches, graph databases, GEM, or Trajectory) addresses the specific vulnerability that Aikido Security documented. None verifies whether an agent-generated package name resolves to a legitimate, owned registry entry before allowing installation. The dependency security layer is absent from every memory architecture discussed here.
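The missing check is not exotic. As a minimal sketch for the Python ecosystem: PyPI serves project metadata at `https://pypi.org/pypi/<name>/json` and answers 404 when no project exists under that name. The function names and the injectable `fetch_status` parameter below are assumptions for illustration.

```python
import urllib.error
import urllib.parse
import urllib.request


def pypi_status(name: str) -> int:
    """Return the HTTP status of PyPI's JSON endpoint for a project name."""
    url = f"https://pypi.org/pypi/{urllib.parse.quote(name, safe='')}/json"
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 404 means no project is registered under this name


def is_unclaimed(name: str, fetch_status=pypi_status) -> bool:
    """True when the registry has no project under this name."""
    return fetch_status(name) == 404
```

An unclaimed name is a squatting target and should block the install. The reverse does not hold: a name that does resolve may already have been claimed by an attacker, so existence is a necessary check, not a sufficient one.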

The Shadow Dependency Runtime Threat

Aikido Security released its report on AI coding agents in May 2026, and the findings should have triggered an immediate industry response. They did not, because the vulnerability operates in a blind spot that existing security tooling cannot see.

The mechanics are straightforward. An agent is running in an autonomous loop, attempting to resolve a build issue or implement a feature. It encounters an edge case: a missing import, a version conflict, a deprecated API. It generates a package name. The name follows the conventions of the ecosystem. It looks like every other package in the registry. It fits the dependency graph logically. The agent executes the installation. No human wrote the import statement. No human reviewed the registry entry. The agent is doing exactly what it was designed to do, which is to generate plausible code and execute it.

The vulnerability is that standard human code reviews are failing to catch these installations because developers trust agentic output blindly. Or worse, the agents are executing changes live in sandboxed or staging environments that have internet access. The staging environment is supposed to be safe, but if the agent installs a malicious package there, the package can phone home, exfiltrate data, or establish persistence before any human reviews the pull request.

Existing dependency scanners like Snyk and Dependabot scan what is declared in package.json or requirements.txt. They do not catch packages installed dynamically by an agent during an autonomous run. Aikido calls this the shadow dependency problem: dependencies that exist in the runtime environment but never appear in any lockfile, manifest, or version control diff. They are invisible to scanners, invisible to auditors, and invisible to the security team until something goes wrong.
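One way to make shadow dependencies visible in a Python environment is to diff what is actually installed against what is declared. The sketch below uses the standard library's `importlib.metadata`; the function names are hypothetical, and the requirements parser is deliberately simplistic (it ignores `-r` includes and other pip options).

```python
import re
from importlib.metadata import distributions


def normalize(name: str) -> str:
    # PEP 503 normalization: runs of "-", "_", "." become "-", then lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()


def declared_names(requirements_text: str) -> set:
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # skip blanks, comments, and pip options such as -r / -e
        # The project name ends at the first extras, version, or marker character.
        names.add(normalize(re.split(r"[\[<>=!~;@ ]", line, maxsplit=1)[0]))
    return names


def shadow_dependencies(installed, requirements_text: str) -> set:
    """Installed names that the manifest never declared."""
    return {normalize(n) for n in installed} - declared_names(requirements_text)


def installed_names() -> list:
    # Names of every distribution visible to the current interpreter.
    names = (d.metadata.get("Name") for d in distributions())
    return [n for n in names if n]
```

Run `shadow_dependencies(installed_names(), open("requirements.txt").read())` at the end of each agent session. If the manifest lists only top-level packages, legitimate transitive dependencies will also show up, so this works best against a fully pinned file.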

The scale argument is the one that should keep infrastructure engineers awake. If a single agent hallucinates one unowned package per week, and Aikido's data suggests this is a conservative estimate, a fleet of twenty agents running for a 13-week quarter introduces roughly 260 unverified entry points into your build pipeline. Some of those packages will be harmless typos. Some will be squatting targets, waiting for an attacker to register the name and push a payload. The probability of any single hallucinated package being exploited is low. The probability that at least one of 260 is exploited, given a motivated attacker monitoring package registries for exactly these misspellings, is not low at all.
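The arithmetic is worth doing explicitly. Exposure scales as agents × weeks × rate, so twenty agents over a 13-week quarter at one package per agent per week comes to 260 names. The chance that at least one is exploited is 1 − (1 − p)^n. The sketch below assumes each name is exploited independently, and the per-package probability is a made-up placeholder, not a figure from Aikido.

```python
def hallucinated_packages(agents: int, weeks: int, per_agent_per_week: float) -> float:
    """Expected number of unowned package names a fleet generates."""
    return agents * weeks * per_agent_per_week


def p_at_least_one_exploited(n_packages: int, p_each: float) -> float:
    # Assumes every package is exploited independently with probability p_each.
    return 1.0 - (1.0 - p_each) ** n_packages


n = hallucinated_packages(agents=20, weeks=13, per_agent_per_week=1)  # 260
risk = p_at_least_one_exploited(int(n), p_each=0.01)  # p_each is hypothetical
```

Even at a hypothetical 1% per package, `risk` comes out above 0.9. The individual odds are small; the fleet-level odds are not.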

What To Do About It

The first step is conceptual: stop treating agents like stateless API calls and start treating them like long-lived systems with a finite reliable lifespan. That means lifespan SLAs, not launch-day benchmarks. Define how long your agent needs to stay above a reliability threshold, and test it across that duration before declaring it production-ready. A model that scores 85% on day one and 40% on day sixty is not an 85% model. It is a 40% model with a delayed failure mode.
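A lifespan SLA can be expressed as a check over a per-session reliability series. This is a minimal sketch: the function names are hypothetical, and it assumes you already record one score per session from whatever evaluation you run.

```python
def sessions_above_threshold(scores, threshold: float) -> int:
    """Number of leading sessions before reliability first drops below threshold."""
    for i, score in enumerate(scores):
        if score < threshold:
            return i
    return len(scores)


def meets_lifespan_sla(scores, threshold: float, required_sessions: int) -> bool:
    """True if the agent stayed at or above threshold for the required duration."""
    return sessions_above_threshold(scores, threshold) >= required_sessions
```

For a series like `[0.85, 0.80, 0.72, 0.61, 0.40]` with a 0.70 threshold, the agent has a reliable lifespan of three sessions, whatever its day-one score says.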

The second step is diagnostic profiling. AgingBench provides a framework for identifying which aging mechanism is dominant for your deployment. Is your agent failing because it compresses its history too aggressively? Because conflicting memories accumulate without revision? Because the world drifted and its knowledge base went stale? Or because the infrastructure underneath it changed and its tools broke? The repair strategy depends entirely on the diagnosis. Compression aging requires context window architecture changes: larger windows, smarter summarization, or hierarchical memory structures. Interference aging requires a memory revision pipeline that can detect contradictions and prioritize newer over older information. Drift aging requires automated fact-refresh cycles that poll trusted sources and update the agent's knowledge base on a schedule. Environmental aging requires infrastructure change detection that alerts the agent when the schema, API, or dependency tree has shifted.
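For the environmental case, change detection can start as simply as fingerprinting a snapshot of whatever the agent depends on and comparing it between sessions. The snapshot layout below (a plain dict of schema, API, and version facts) is an assumption; how you collect it depends on your stack.

```python
import hashlib
import json


def fingerprint(snapshot: dict) -> str:
    # Canonical JSON (sorted keys, fixed separators) so that dict ordering
    # does not change the hash.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_keys(old: dict, new: dict) -> set:
    """Top-level entries that were added, removed, or modified."""
    return {k for k in old.keys() | new.keys() if old.get(k) != new.get(k)}
```

Store the fingerprint alongside the agent's memory. When it no longer matches the live environment, `changed_keys` tells you which assumptions to invalidate before the agent acts on them.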

The third step is new, and it addresses the vulnerability that none of the existing research covers: dependency firewalling for agentic runtimes.

First, deploy automated dependency firewalls that block installation of any package not present in an allowlist. The allowlist should be human-curated, version-locked, and reviewed on a regular cadence. If the agent attempts to install a package outside the list, the installation is blocked and an alert is raised.
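The core of such a firewall is a small decision function. In this sketch the allowlist contents, the function name, and the exact-version rule are illustrative assumptions; a real deployment would load the list from a reviewed, version-controlled file.

```python
import re

# Hypothetical, human-curated allowlist: normalized name -> pinned version.
ALLOWLIST = {
    "requests": "2.31.0",
    "pyyaml": "6.0.1",
}


def normalize(name: str) -> str:
    # PEP 503 normalization, so "PyYAML" and "pyyaml" are the same entry.
    return re.sub(r"[-_.]+", "-", name).lower()


def check_install(name: str, version, allowlist: dict):
    """Return (allowed, reason) for a requested package and version."""
    key = normalize(name)
    if key not in allowlist:
        return False, f"blocked: {key} is not on the allowlist"
    if version != allowlist[key]:
        return False, f"blocked: {key} is pinned to {allowlist[key]}"
    return True, f"allowed: {key}=={version}"
```

Every `False` result should raise an alert with the reason string attached; the pattern of blocked names is itself a signal of which aging mechanism is at work.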

Second, implement strict proxy registries specifically for agentic execution environments. The agent must query the proxy; the proxy rejects unknown packages. This creates a chokepoint where every installation attempt is logged, analyzed, and either approved or denied. The proxy can also cache approved packages, eliminating the need for the agent to reach out to public registries directly.
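On the client side, pointing the agent's tools at the proxy can be done through environment variables: pip reads `PIP_INDEX_URL`, and npm reads `npm_config_registry`. The proxy URLs and the helper name below are placeholders, not real endpoints.

```python
import os

# Hypothetical internal proxy endpoints.
PROXY_PYPI = "https://pkg-proxy.internal.example/simple"
PROXY_NPM = "https://pkg-proxy.internal.example/npm/"


def agent_env(base=None) -> dict:
    """Environment for agent subprocesses with registries pinned to the proxy."""
    env = dict(os.environ if base is None else base)
    env["PIP_INDEX_URL"] = PROXY_PYPI
    env.pop("PIP_EXTRA_INDEX_URL", None)  # drop any fallback to another index
    env["npm_config_registry"] = PROXY_NPM
    return env
```

Pass the result as `env=agent_env()` to `subprocess.run` when launching the agent's shell. Environment variables only set defaults, and an agent can still pass its own `--index-url` flag, so this needs to be paired with network egress rules that make the proxy the only reachable registry.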

Third, install runtime hooks that intercept pip install, npm install, and equivalent commands inside agent sandboxes and require second-factor approval for any package not already cached or pre-approved. This does not eliminate autonomy: the agent can still experiment with code. What it eliminates is the unsupervised installation of unknown dependencies. The approval can be lightweight for the first few occurrences and escalate to a security review if the agent attempts repeated installations.
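The decision logic behind such a hook might look like the sketch below. The class, the return values, and the escalation threshold are all assumptions, and the command parser is intentionally naive: it handles only the `pip install` and `npm install` forms and does not understand options that take a value, such as `-r requirements.txt`.

```python
import shlex
from collections import Counter

INSTALL_VERBS = {("pip", "install"), ("pip3", "install"), ("npm", "install"), ("npm", "i")}


def requested_packages(command: str) -> list:
    """Package arguments of a pip/npm install command, or [] for anything else."""
    tokens = shlex.split(command)
    if len(tokens) < 2 or (tokens[0], tokens[1]) not in INSTALL_VERBS:
        return []
    # Naive: drops flags, but does not skip values that follow a flag.
    return [t for t in tokens[2:] if not t.startswith("-")]


class ApprovalGate:
    def __init__(self, approved, escalate_after: int = 3):
        self.approved = set(approved)     # exact requirement strings, pre-approved
        self.escalate_after = escalate_after
        self.attempts = Counter()         # unknown package -> times requested

    def decide(self, command: str) -> str:
        unknown = [p for p in requested_packages(command) if p not in self.approved]
        if not unknown:
            return "run"
        for pkg in unknown:
            self.attempts[pkg] += 1
        # Repeated attempts at the same unknown package escalate past a human click.
        if any(self.attempts[p] >= self.escalate_after for p in unknown):
            return "security_review"
        return "needs_approval"
```

Anything that is not an install command passes straight through, which is what keeps the agent autonomous for everything except the one action that changes the dependency graph.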

Agent lifespan engineering needs to become a recognized discipline. It needs its own SRE-style tooling, monitoring, alerting, repair playbooks, and lifespan SLAs. And it needs its own supply-chain security layer, because an agent that degrades silently while rewriting your dependency graph is not just inefficient. It is an unmonitored security backdoor.

The industry is shipping agents like they are stateless API calls. They are not. They are long-lived systems that accumulate failure modes (compression, interference, drift, and environmental decay) and silently rewrite your dependency graph while doing it.

The papers are days old. The security report is breaking now. The tooling is years behind.

An aging agent is not just inefficient. It is an unmonitored security backdoor with root access to your build pipeline.
