I spent Saturday morning reading the GitHub issues on SWE-Bench. All 545 of them. I did this not because I am a masochist, but because my CI/CD pipeline was flagging correct agent output as failed, and I needed to know why. What I found was not a bug in my code. It was a bug in the entire concept of vendor-controlled evaluation.
If your CI/CD pipeline depends on SWE-Bench, you are flying blind right now. If you are searching for SWE-Bench alternatives this week, you are already behind the curve. This is not because some indie dev broke the repository, but because the benchmark itself is structurally unsound. The people who ought to fix it have a conflicting incentive: they also sell the models being graded.
Why SWE-Bench Verified Fails Its Own Test
Let me walk you through what I found in those GitHub issues.
Issue #545: The SWE-Bench CLI marks instances as resolved: false even when all FAIL_TO_PASS tests pass. The failure occurs because PASS_TO_PASS tests get SKIPPED on single-core Docker environments. They do not fail; they get skipped. The harness counts that as a rejection. Your agent wrote correct code, but because the infrastructure could not run the test, you got dinged for it.
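To be clear about the distinction the harness misses: a skipped test is missing evidence, not counter-evidence. Here is a minimal sketch of a grader that treats it that way, assuming pytest-style status strings; the function and statuses are illustrative, not the actual SWE-Bench code:

```python
# Minimal sketch of instance grading from per-test statuses.
# Statuses are pytest-style and illustrative; this is not the SWE-Bench harness.

def grade_instance(fail_to_pass: dict[str, str], pass_to_pass: dict[str, str]) -> bool:
    """Return True if the agent's patch should count as resolved."""
    # Every FAIL_TO_PASS test must now pass.
    if any(status != "PASSED" for status in fail_to_pass.values()):
        return False
    # PASS_TO_PASS tests must not regress. A SKIPPED test is an infrastructure
    # gap, not a regression, so only an explicit FAILED or ERROR counts against it.
    return not any(status in ("FAILED", "ERROR") for status in pass_to_pass.values())


# Correct patch, but one PASS_TO_PASS test was skipped on a single-core runner.
# This grader still counts the instance as resolved.
print(grade_instance(
    {"test_bugfix": "PASSED"},
    {"test_existing": "PASSED", "test_parallel_only": "SKIPPED"},
))  # True
```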
Issue #506: Java evaluations produce inconsistent results because shell tracing interleaves with Maven output, causing race conditions in log parsing. The harness cannot reliably determine test outcomes. You can run the exact same code twice and get a different result.
Issue #563: The benchmark data itself has severe quality problems. Pull request descriptions do not match the gold tests. The ground truth is wrong. It is not your code failing; it is the answer key.
These are not edge cases. A recent arXiv paper (2603.21454) put it bluntly: LLM coding benchmarks face a credibility crisis, with widespread solution leakage and test quality issues undermining SWE-Bench Verified. Another paper (2511.16708) found that 29.6% of SWE-Bench's supposedly solved patches actually fail under proper verification.
When nearly a third of passing solutions are wrong and the harness rejects correct code on single-core containers, you do not have a benchmark. You have a random number generator with a README.
The benchmark contamination problem is just as severe. The same arXiv paper documents widespread solution leakage across frontier models. Models trained on public GitHub data can reproduce gold patches verbatim. When your agents get tested on code they have already memorized from training data, you are not measuring reasoning. You are measuring recall.
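You can at least screen for the most blatant version of this in your own pipeline. A rough sketch using plain difflib similarity; the 0.95 threshold is an arbitrary assumption, not a published cutoff, and a high score is a signal to investigate, not proof of memorization:

```python
import difflib

# Crude leakage screen: flag agent patches that are near-verbatim copies of the
# gold patch. The threshold is an assumption; treat hits as a signal, not proof.

def looks_memorized(agent_patch: str, gold_patch: str, threshold: float = 0.95) -> bool:
    ratio = difflib.SequenceMatcher(None, agent_patch, gold_patch).ratio()
    return ratio >= threshold

gold = "def add(a, b):\n    return a + b\n"
agent = "def add(a, b):\n    return a + b\n"
print(looks_memorized(agent, gold))  # True: suspiciously close to the gold patch
```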
The Sovereign Alternative: Self-Hosted Agent Evaluation
Here is the brutal truth. The entire eval-as-a-service model is collapsing in slow motion. You are paying cloud companies to grade your agents with rubrics they designed, on datasets they curated, with harnesses they refuse to open-source. That is not evaluation. That is vendor-managed performance art.
The harness, not the model, is where real bugs live. SWE-Bench's own GitHub issues prove it. The test infrastructure rejects correct behavior because it was written for a specific runtime configuration. It cannot handle non-determinism, partial solutions, or iterative refinement. If your harness requires a single golden patch to be byte-identical, your harness is the problem.
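To see the difference concretely, here is a toy contrast between the two grading philosophies. Everything in it is illustrative rather than any vendor's actual harness; the point is that the second grader accepts any patch that makes the tests pass:

```python
import subprocess

# Toy contrast between two grading philosophies. Commands and paths are
# illustrative, not any vendor's harness.

def grade_by_diff(agent_patch: str, gold_patch: str) -> bool:
    # Brittle: rejects every correct solution that is not byte-identical
    # to the single golden patch.
    return agent_patch == gold_patch

def grade_by_behavior(repo_dir: str, test_cmd: list[str]) -> bool:
    # Robust: the agent's patch is already applied in repo_dir; run the tests
    # and accept anything that makes them pass. Alternative solutions and
    # iterative refinement are handled by the test suite, not string equality.
    result = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return result.returncode == 0

# Example: grade_by_behavior("/path/to/repo", ["pytest", "-x", "tests/"])
```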
If you are building serious agent CI/CD pipelines today, you need a stack you own, host, and modify. Period. That stack looks like this:
- Self-hosted evaluation harness: Not a vendor API with race conditions in its log parser.
- Open-source memory layer: Not a premium subscription feature locked behind a paywall.
- Own benchmark suite: Augmented with live production traces from your actual failures.
- PostgreSQL with pgvector backbone: Because you already know how to run it.
This is the sovereign eval stack. It is not a product; it is an architecture. It is the only thing that is going to survive the next benchmark collapse. This is what sovereign AI infrastructure actually looks like, and if you have been following PhantomByte's LLMOps coverage, it plugs directly into the observability pipeline you already have.
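To make that concrete, here is a minimal wiring sketch of the loop: tasks come out of Postgres, the agent runs, results go back in. The eval_tasks and eval_results tables, the run_agent callable, and the substring rubric are all assumptions about your setup, not a prescribed schema:

```python
import psycopg  # psycopg 3; assumes a reachable Postgres instance

# Minimal wiring sketch of a self-hosted eval loop. The eval_tasks and
# eval_results tables and run_agent() are placeholders for your own stack.

def run_suite(conn_str: str, run_agent) -> None:
    with psycopg.connect(conn_str) as conn:
        tasks = conn.execute(
            "SELECT id, prompt, checks FROM eval_tasks WHERE active"
        ).fetchall()
        for task_id, prompt, checks in tasks:
            output = run_agent(prompt)  # your agent, your model, your runtime
            # Placeholder rubric: every expected marker must appear in the output.
            passed = all(marker in output for marker in checks)
            conn.execute(
                "INSERT INTO eval_results (task_id, output, passed) VALUES (%s, %s, %s)",
                (task_id, output, passed),
            )
```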
The Missing Link: Persistent State and Memory
An evaluation harness is only half the equation. To build agents that actually learn from their failures during these evaluations, you need persistent state. An evaluation stack without state is just a fancy linter.
Anthropic bundles memory into its Claude Pro plan at $20 a month and Claude Max at $100 a month. Memory is not sold separately. It is locked inside the subscription. You do not own your agent's memory; you rent it. Stop paying, and your agent has amnesia. That is not a feature. That is a hostage situation.
Here is what the real benchmark data shows. On LiveCodeBench, a platform designed to resist contamination by using fresh problems, Qwen3-235B-A22B scores 65.9%. Claude Opus 4 with thinking enabled scores 56.6%, and Claude Sonnet 4 scores 47.1%. A model you can self-host beats every Claude configuration except the thinking-enabled Opus.
Furthermore, models like Qwen3.5-27B are available on HuggingFace right now for free, and it is a dense model that runs on consumer hardware. Qwen3.6-35B-A3B uses a mixture-of-experts (MoE) architecture with 256 experts, activating only 3B parameters per token. The efficiency-to-capacity ratio is absurd.
This means the model is no longer the bottleneck, and the inference cost is no longer the bottleneck. The memory layer, which allows your agent to remember what it did yesterday, learn from previous evaluations, and adapt its behavior over time, is the new battleground.
Mem0 is the popular choice right now with 54,000 stars on GitHub. It is lightweight, easy to plug in, and handles basic episodic memory well. Letta is more robust if you want agent-oriented design with explicit memory management. LangMem is newer, built around LangGraph, and integrates cleanly if you are already in that ecosystem.
However, none of them advertise the fact that they are all just abstractions built on top of a vector database. If you are already running PostgreSQL for your application, pgvector is sitting right there for free. You own the data. You own the queries. Nobody can sunset your API or change your pricing tier.
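Here is what that looks like when you drop the abstraction entirely. A minimal sketch using psycopg and the pgvector Python adapter; the agent_memory table, the 768-dimension column, and the embed() stub are assumptions standing in for your own schema and embedding model:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector  # pip install psycopg pgvector

# Minimal episodic memory on pgvector. The table and the embed() stub are
# assumptions; swap in your own schema and a real embedding model.

def embed(text: str) -> np.ndarray:
    # Placeholder fake embedding. Replace with a self-hosted embedding model.
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.random(768, dtype=np.float32)

with psycopg.connect("dbname=app") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    register_vector(conn)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS agent_memory (
            id bigserial PRIMARY KEY,
            content text NOT NULL,
            embedding vector(768) NOT NULL
        )""")

    # Write: record what the agent did and why.
    note = "Eval run 42 failed: migration tests need the fixtures reset first."
    conn.execute(
        "INSERT INTO agent_memory (content, embedding) VALUES (%s, %s)",
        (note, embed(note)),
    )

    # Read: recall the most relevant memories before the next run.
    memories = conn.execute(
        "SELECT content FROM agent_memory ORDER BY embedding <-> %s LIMIT 5",
        (embed("why did the migration eval fail?"),),
    ).fetchall()
```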
Building Your CI/CD Pipeline Without Vendor Benchmarks
SWE-Bench is structurally unsound. The alternative is to build your own agent evaluation CI/CD. It is not as scary as it sounds.
Step One: Define your own correctness. Do not rely on pass or fail metrics from someone else's test suite. Demand actual correctness. Does the agent's output solve the stated problem? Does it preserve existing tests? Does it respect project conventions? You are building a rubric, not downloading one.
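As a sketch of what "define your own correctness" can mean in practice, here is a toy rubric. The three checks, the test paths, and the use of ruff are placeholders for whatever correctness means in your codebase:

```python
import subprocess
from dataclasses import dataclass

# Toy rubric for agent output. Test paths and the lint command are placeholders.

@dataclass
class RubricResult:
    solves_problem: bool        # does the output solve the stated problem?
    preserves_tests: bool       # does it keep the existing suite green?
    respects_conventions: bool  # does it match project style in the touched files?

    @property
    def passed(self) -> bool:
        return self.solves_problem and self.preserves_tests and self.respects_conventions


def evaluate(repo_dir: str, changed_files: list[str]) -> RubricResult:
    task = subprocess.run(["pytest", "tests/task/"], cwd=repo_dir).returncode == 0
    full = subprocess.run(["pytest"], cwd=repo_dir).returncode == 0
    lint = subprocess.run(["ruff", "check", *changed_files], cwd=repo_dir).returncode == 0
    return RubricResult(task, full, lint)
```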
Step Two: Prioritize the harness over the benchmark. Your harness should test the interaction, not just the output. Write harnesses that can handle non-determinism, partial solutions, and iterative refinement.
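One hedged way to do that is to stop treating a single run as the verdict: run each task several times and record a pass rate. The trial count and thresholds below are assumptions you should tune to your own flakiness budget:

```python
from statistics import mean

# Sketch of a non-determinism-tolerant harness: score a task by pass rate
# across several trials instead of one brittle boolean.

def run_task(agent, task: dict, trials: int = 5, pass_threshold: float = 0.8) -> dict:
    outcomes = []
    for _ in range(trials):
        output = agent(task["prompt"])          # your agent call
        outcomes.append(task["check"](output))  # your behavioral check, returns bool
    rate = mean(outcomes)
    return {
        "task_id": task["id"],
        "pass_rate": rate,
        # Three buckets: solid pass, flaky or partial, hard fail.
        "verdict": "pass" if rate >= pass_threshold else "partial" if rate > 0 else "fail",
    }
```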
Step Three: Augment with live traces. Synthetic benchmarks do not age well, but production traces do. Mine your real tasks, bug reports, feature requests, and refactoring tickets to turn them into regression tests. Your eval suite should grow every week based on exactly what your agents failed the week prior.
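Mechanically, this can be a weekly script that promotes last week's failures into permanent eval tasks. A sketch, assuming an agent_failures table and the hypothetical eval_tasks table from earlier, with checks stored as a text array; all of it is a placeholder for your own schema:

```python
import psycopg

# Sketch: promote last week's production failures into permanent eval tasks.
# agent_failures and eval_tasks are placeholder tables; checks is assumed text[].

def promote_failures(conn_str: str) -> int:
    with psycopg.connect(conn_str) as conn:
        rows = conn.execute("""
            SELECT id, prompt, expected_behavior
            FROM agent_failures
            WHERE created_at > now() - interval '7 days' AND NOT promoted
        """).fetchall()
        for failure_id, prompt, expected in rows:
            conn.execute(
                "INSERT INTO eval_tasks (prompt, checks, active) VALUES (%s, %s, true)",
                (prompt, [expected]),
            )
            conn.execute(
                "UPDATE agent_failures SET promoted = true WHERE id = %s",
                (failure_id,),
            )
        return len(rows)

# Run it weekly from cron or CI: promote_failures("dbname=app")
```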
Step Four: Run it in your own infrastructure. Use Docker Compose and keep it self-hosted. You should not need API keys to grade your own work. If the internet goes down, your CI should still run.
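The CI entrypoint can stay boring. Here is a sketch of a gate script that refuses to depend on anything outside your own Compose stack; the db and harness service names, and the blunt API-key check, are assumptions about your setup:

```python
import os
import subprocess
import sys

# Sketch of a CI gate for a self-hosted eval stack. Service names are
# assumptions about your docker-compose.yml.

def main() -> int:
    # Blunt guard: fail loudly if a vendor API key sneaks back into the env.
    leaked = [k for k in os.environ if k.endswith("_API_KEY")]
    if leaked:
        print(f"refusing to run: evals must not depend on external keys ({leaked})")
        return 1
    # Bring up the stack and run the harness; its exit code gates the pipeline.
    subprocess.run(["docker", "compose", "up", "-d", "db"], check=True)
    return subprocess.run(["docker", "compose", "run", "--rm", "harness"]).returncode

if __name__ == "__main__":
    sys.exit(main())
```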
This is how you break the dependency. You do not fix the problem by finding a better vendor benchmark; you fix it by owning the entire evaluation surface area.
The Cost of Subscription Lock-in
Let us run the numbers to see where the vendor lock-in gets personal. Claude Pro is $20 a month, Claude Max is $100 a month, and enterprise tiers require custom contracts. Memory is bundled inside these tiers, meaning you do not get memory without the model, and you do not get memory without the subscription.
Your agent's persistent state lives on Anthropic's infrastructure, behind Anthropic's paywall, and is subject to Anthropic's terms. That costs $240 to $1,200 per year, per seat, for context windows and a key-value store you do not even control.
Compare that to PostgreSQL with pgvector. The setup cost is your time. The hosting cost is the server you are already running. The runtime cost is zero additional dollars. You own the data, the queries, and the uptime. If you are already using Postgres for your app, pgvector is a simple CREATE EXTENSION away. It gives your AI agent memory a home it cannot be evicted from. Your agent memory lives in the same database as your app data, sharing the same backups, monitoring, and queries. There is no split-brain architecture and no separate microservices to babysit.
The argument against self-hosting is usually the fear of DevOps overhead. Enterprise teams often default to vendor solutions to avoid babysitting infrastructure. But let us be real about the actual maintenance burden. Is maintaining a Postgres instance you already run really harder than hoping a vendor does not double their pricing next quarter? Is maintaining an open-source harness you can fork and fix scarier than discovering your core benchmark was fundamentally broken for six months?
The maintenance cost myth is how cloud companies keep you dependent. The reality is simple. If you can run Docker, you can run a sovereign eval stack. If you cannot, you should learn, because the vendors are not coming to save you.
The Sovereign Path
The SWE-Bench collapse is not just a technical story; it is an inflection point. For two years, the industry outsourced evaluation to a single benchmark. That benchmark is now structurally unreliable. The vendor-controlled alternatives offer the exact same conflict of interest, as the company grading your agents is the same company selling you the model.
If you are building agent CI/CD pipelines, your testing infrastructure cannot be a line item you buy from the same company trying to sell you the AI. That is a conflict of interest dressed up as convenience.
The sovereign path is harder in the first week. You have to write your own harness, define your own rubric, and host your own memory layer on PostgreSQL and pgvector. But by week four, you are iterating faster than any team waiting on a vendor API update. By month two, your eval suite is tighter than any generic benchmark because it is built entirely from your actual failures. By quarter two, you are not reading blog posts about benchmark collapses because you are no longer affected by them.
That is the freedom part. That is the sovereignty part.
Get More Articles Like This
Building sovereign AI infrastructure is just the start. I'm documenting every mistake, fix, and lesson learned as I build PhantomByte.
Subscribe to receive updates when we publish new content. No spam, just real lessons from the trenches.