We did not retrain a model, tweak prompts, or rebuild our retrieval stack. We took Andrej Karpathy's autoresearch pattern, froze the evaluator, gave an AI agent exactly one mutable surface to improve, and pointed it at a production RAG system's knowledge base. The agent raised the retrieval pass rate from 22% to 60% across 110 test queries. Then it stalled, and the plateau revealed that our test suite was flawed. After we fixed the benchmark, a final pass brought the system to 89%. Agent inference cost across all runs: $88.
Karpathy released autoresearch as an experimental framework for autonomous ML research. The idea is simple: give an AI agent a playbook, a fixed evaluation metric, and a single file it can edit. Let it run overnight. Keep improvements, discard regressions. Wake up to a better model.
The pattern is elegant. It is also not specific to ML training. The core loop (evaluate, diagnose, mutate, re-evaluate, keep or rollback) works anywhere you have a measurable quality signal and a constrained mutation surface. We tested this by applying it to a completely different domain: knowledge base retrieval quality in a production RAG system with 110 test queries across 10 topic categories. One system, one corpus, no control group. This is a case study, not a proof. But the results were concrete enough to be useful.
This article walks through what we borrowed from autoresearch, what we had to change, and what the results tell us about where autonomous optimization loops work well.
TL;DR
- Karpathy's autoresearch pattern (freeze the evaluator, constrain mutations, keep or rollback) transfers cleanly from ML training to RAG retrieval optimization
- An AI agent ran 8 optimization passes, kept 82 of 85 attempted fixes, and moved retrieval pass rate from 22% to 60% for $88 in inference cost
- The dominant root cause was ontology drift: topic metadata had silently diverged from the taxonomy the search layer expected
- The plateau at 60% revealed flawed test queries, not a retrieval ceiling; fixing the benchmark brought the system to 89%
- The immutable evaluator is the key safety property: if the agent can redefine success, it will optimize the definition instead of the system
The Autoresearch Pattern
Karpathy's autoresearch has three components. A program.md playbook that tells the agent what to do and what not to do. A single editable file (train.py) that contains the model architecture and training loop. And an immutable evaluation harness (prepare.py) that scores every experiment on the same metric: validation bits per byte.
Each experiment runs for exactly five minutes. The agent reads results, forms a hypothesis, edits train.py, and runs again. If the score improves, the change stays. If it regresses, the change is discarded. Over a night, this produces roughly 100 experiments. No human in the loop.
The load-bearing idea is in the constraints, not the autonomy. The agent cannot touch the evaluator, so it cannot game its own test. The time budget is fixed, so every experiment is directly comparable. The mutation surface is one file, so changes are reviewable. These constraints are what make the autonomy safe.
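The whole loop fits in a few lines. Here is a minimal Python sketch, with evaluate, diagnose, apply_fix, and rollback as hypothetical stand-ins for your own evaluator and mutation layer:

```python
# Minimal sketch of the keep-or-rollback loop. The four callables are
# hypothetical: plug in your own frozen evaluator and mutation surface.

def optimization_loop(evaluate, diagnose, apply_fix, rollback, max_iters=20):
    best = evaluate()                     # frozen evaluator, single metric
    for _ in range(max_iters):
        hypothesis = diagnose(best)       # pick the worst-scoring target
        if hypothesis is None:            # nothing left to try: ceiling
            break
        snapshot = apply_fix(hypothesis)  # one mutation, state captured first
        score = evaluate()                # re-run the same frozen evaluator
        if score > best:
            best = score                  # keep the improvement
        else:
            rollback(snapshot)            # discard the regression
    return best
```

The safety properties live outside this function: the evaluator it calls is immutable, and apply_fix is constrained to one mutation surface.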
Adapting the Pattern to RAG
Our production system uses a semantic search engine (Onyx) to serve chunks of indexed content to an AI agent. The knowledge base contains roughly 500 chunks derived from 176 source documents, organized into 10 canonical topic categories. Retrieval works in two stages: semantic search returns the top results by embedding similarity, then results are filtered and boosted by topic metadata tags. When the topic tags are wrong, relevant content drops out of results even if the embeddings are close. The agent queries this knowledge base, retrieves relevant chunks, and uses them to ground its responses. The system had been running for months, and retrieval quality had never been formally measured.
We suspected it was degrading. We had no proof. So we built a test harness and pointed autoresearch at it.
Here is what we borrowed and what we changed:
| Component | Autoresearch (ML) | Our Adaptation (RAG) |
|---|---|---|
| Playbook | program.md | program.md (521 lines) |
| Mutable target | train.py | Onyx metadata + chunks |
| Immutable evaluator | prepare.py | evaluator.py (SHA256 enforced) |
| Metric | val_bpb (single number) | 5-dimension composite score |
| Experiment budget | 5 minutes wall-clock | 1 fix per iteration, max 20 |
| Regression handling | Discard and continue | Snapshot, rollback, continue |
| Overnight yield | ~100 experiments | 82 fixes across 8 runs |
The table looks clean, but the adaptations were not trivial. Each one reflects a real operational difference between optimizing code and optimizing a live retrieval system.
Five Things We Had to Change
1. The target is data, not code
Autoresearch edits a Python file. We edit metadata and content in a search engine. The mutation surface is fundamentally different: instead of changing lines of code, the agent re-tags topic labels, enriches subtopic arrays, re-chunks source documents, or uploads missing content. Every mutation is an API call to an external system, not a file write.
This changes the operational model. A bad code edit is trivially reversible with git checkout. A bad metadata edit in a live search engine requires restoring specific documents to their pre-mutation state. We built a snapshot layer that captures document payloads before every fix, enabling surgical rollback without affecting unrelated content.
2. The metric is multi-dimensional
Karpathy has one number: validation bits per byte. Lower is better. Unambiguous.
We score every query on five dimensions, each weighted:
- Relevance (0.30): How high does the correct chunk rank in search results?
- Coverage (0.20): Are all expected anchor documents found?
- Coherence (0.20): Is the chunk text well-formed and extractable?
- Metadata (0.15): Are required fields present and correctly valued?
- Pollution (0.15): Are non-knowledge-base documents absent from results?
The composite formula is: (relevance * 0.30) + (coverage * 0.20) + (coherence * 0.20) + (metadata * 0.15) + (pollution * 0.15). A query passes at 0.70 or above. The keep/discard rule is straightforward: keep a fix if the composite score of the target query improves and the overall pass count does not decrease. The agent also inspects dimension-level scores during diagnosis to choose which fix type to attempt. No LLM evaluator is involved; all scoring is deterministic based on search result ranking and metadata presence.
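The scoring rule is small enough to write out in full. A sketch, with the dimension names and weights taken from the list above and the function names our own:

```python
# Deterministic composite scorer, as described in the text.
WEIGHTS = {
    "relevance": 0.30,
    "coverage": 0.20,
    "coherence": 0.20,
    "metadata": 0.15,
    "pollution": 0.15,
}
PASS_THRESHOLD = 0.70

def composite_score(dims: dict) -> float:
    """Weighted sum of the five dimension scores (each in [0, 1])."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

def passes(dims: dict) -> bool:
    """A query passes at 0.70 or above on the composite."""
    return composite_score(dims) >= PASS_THRESHOLD
```
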
This is simpler than it could be. A more sophisticated system would reject fixes where any individual dimension drops below a threshold, even if the composite improves. We did observe one case where a fix caused a cross-query regression (a re-upload that lost a verbatim anchor phrase), so per-dimension floors would have been useful. This is a known gap in the current design.
3. Rollback is harder
In autoresearch, rolling back means reverting a file. In our system, rolling back means restoring documents in an external search engine via API, then confirming the restoration did not introduce its own side effects. We discovered that re-uploading a chunk during a metadata fix can subtly change the indexed content, because the agent extracts approximate text from the source file rather than using the original chunk verbatim. One fix improved a target query but broke an unrelated query because the re-uploaded text lost a verbatim phrase that a different test depended on.
Production-safe autonomy needs rollback as a first-class requirement, not an afterthought. Every mutation snapshots state before and after. Every rollback verifies the restoration matches the snapshot.
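In code, that discipline looks roughly like the following sketch. The client methods (get_document, update_document) are hypothetical stand-ins for the search engine's real API, and the keep rule is simplified to a single score:

```python
# Sketch of snapshot-before-mutation with surgical rollback.
import copy

def apply_fix_with_snapshot(client, doc_ids, mutate, evaluate, baseline):
    # 1. Capture full payloads of every document the fix will touch.
    snapshot = {d: copy.deepcopy(client.get_document(d)) for d in doc_ids}
    mutate(client)                       # 2. One fix, via API calls
    score = evaluate()                   # 3. Re-run the frozen evaluator
    if score <= baseline:                # 4. Regression: roll back
        for doc_id, payload in snapshot.items():
            client.update_document(doc_id, payload)
        # 5. Verify the restoration actually matches the snapshot
        for doc_id, payload in snapshot.items():
            assert client.get_document(doc_id) == payload
        return baseline
    return score
```
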
4. The constraint set is richer
Autoresearch has one constraint: only edit train.py. Our playbook defines nine startup invariants, six categories of permitted fixes, a list of forbidden actions, and a partition lock system for parallel execution.
The six permitted fix types form a strategy ladder:
- Metadata re-tag: Fix a wrong primary_topic value
- Chunk boundary fix: Re-chunk a source file when a concept is split across chunks
- Missing chunk: Upload content that was skipped during initial indexing
- Semantic identifier fix: Rename a chunk title from "Section 3.2" to something descriptive
- Topic tag enrichment: Add missing subtopic tags
- Duplicate deletion: Remove stale duplicate documents
The forbidden actions are equally important: never fabricate content that does not exist in the source files, never modify the evaluator or query bank, never do a broad reindex (one fix at a time), and never touch documents managed by the file connector. The constraint surface for data mutation is larger than for code mutation because the blast radius of a bad data change in a live system is harder to bound.
5. Benchmark governance becomes a first-class problem
In autoresearch, the evaluation harness is built once and never revisited. With RAG retrieval, we discovered that the benchmark itself was the biggest source of error, and managing benchmark revisions became as important as managing the optimization loop.
When the loop stalled at 60% pass rate, the agent classified remaining failures into categories. Humans reviewed those categories and discovered that 30 of the remaining weak queries were testing against content that could never match: markdown-formatted phrases from derived knowledge files, source file references that did not match the actual index metadata. These were test bugs, not retrieval bugs.
This meant we needed a benchmark revision policy: when can the query bank change, who approves changes, and how do you keep results comparable across revisions? Our rule was simple: the query bank is immutable within a run. Between runs, humans can revise it, but the revision must be documented and the full test suite re-run from scratch afterward. No partial credit for pre-revision scores.
Karpathy's autoresearch does not face this problem because ML evaluation metrics are stable. In retrieval, the relationship between what you test and what you index can drift, and managing that drift is part of the system, not a one-time setup step.
The Immutable Evaluator: How You Keep Agents Honest
The most important design decision was making the evaluator and query bank immutable within each optimization run. SHA256 hashes are verified at startup and re-checked before every fix. If either file changes mid-run, the loop aborts.
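A minimal version of that integrity check might look like this, with pin_hashes run once at startup and verify_or_abort re-run before every fix (the function names are ours):

```python
# Sketch of the startup and per-fix hash check on the frozen files.
import hashlib

def file_sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def pin_hashes(paths):
    """Record hashes of the frozen files (evaluator, query bank) at startup."""
    return {p: file_sha256(p) for p in paths}

def verify_or_abort(pinned):
    """Re-check before every fix; abort the loop on any drift."""
    for path, expected in pinned.items():
        if file_sha256(path) != expected:
            raise RuntimeError(f"{path} changed mid-run: aborting loop")
```
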
This is anti-benchmark-hacking infrastructure. An agent optimizing a metric will, given the opportunity, find ways to improve the metric that do not improve the underlying quality. Freezing the evaluator means the agent cannot redefine what "good" means mid-loop. Freezing the query bank means it cannot drop hard test cases or add easy ones.
An important distinction: immutability applies within a run, not across the entire study. As we describe later, we revised the query bank between runs when the agent's plateau analysis revealed flaws in the test suite. Those revisions were human decisions made between runs, with full re-evaluation afterward. The agent never modified its own benchmark.
The constraint forces the agent to improve the actual knowledge base, not the measurement of it. If you give an autonomous agent both the ability to optimize and the ability to define success, it will optimize the definition. Separating those responsibilities is the core safety property.
What the Loop Found: 8 Runs, 82 Fixes
We ran the audit loop eight times over a single day, in two phases separated by a benchmark revision. Runs 1-6 used the original query bank. Runs 7-8 used the revised query bank. The scores are not directly comparable across the revision boundary because the test cases changed. We present them in one table for narrative clarity, but the pass rates before and after the revision are measured on different rulers.
| Run | Scope | Kept | Disc. | Pass Rate | Primary Strategy |
|---|---|---|---|---|---|
| 1 | All topics | 11 | 1 | 22% → 35% | Metadata re-tag |
| 2 | All topics | 16 | 0 | 35% → 41% | Metadata re-tag |
| 3 | All topics | 10 | 1 | 41% → 47% | Metadata re-tag |
| 4 | Single topic | 8 | 0 | 18% → 45%* | Re-tag + missing chunk + enrichment |
| 5 | All topics | 7 | 1 | 50% → 53% | Re-tag (diminishing) |
| 6 | All topics | 10 | 0 | 53% → 60% | Tag enrichment |
| | Query bank fixes applied | | | | |
| 7 | All topics | 20 | 0 | 58% → 62% | Metadata re-tag |
| 8 | All topics | 1 | 2 | 89% → 89% | Enrichment (ceiling reached) |
*Run 4 targeted a single topic partition (11 queries), not the full 110-query suite.
A note on the cost: $88 is the agent inference spend across 8 runs. It does not include the time to build the evaluator, author the query bank, or develop the snapshot and rollback tooling. Those were the real investments. The $88 figure represents the marginal cost of running the optimization loop once the infrastructure exists.
The root cause: ontology drift
The first six runs revealed a single dominant problem: ontology drift. The knowledge base had been indexed over several months, with content tagged using increasingly specific topic labels. Early chunks used canonical categories like onboarding and support. Later chunks drifted to fine-grained variants: onboarding_messages, support_escalation, profile_setup, customer_engagement.
The system searched by canonical topic. Chunks tagged with variant names did not surface. The content was there. The metadata had drifted away from the taxonomy the search layer expected.
This is not a bug anyone would catch by reading the code. The indexing logic was correct. The search logic was correct. The taxonomy just evolved faster than the metadata, and nobody noticed because there was no quality signal telling them retrieval was degrading.
Ontology drift is the silent killer of RAG systems. Your content is fine. Your search is fine. Your taxonomy and your metadata just stopped agreeing, and there is no error message for that.
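For the simplest form of the drift, the fix is mechanical. A sketch of a deterministic normalization pass follows; the variant-to-canonical mapping here is illustrative (in our case the agent discovered the mappings itself rather than reading them from a table):

```python
# Illustrative variant-to-canonical mapping for the drift described above.
CANONICAL = {"onboarding", "support"}  # subset of the 10 real categories

VARIANT_MAP = {
    "onboarding_messages": "onboarding",
    "profile_setup": "onboarding",
    "support_escalation": "support",
    "customer_engagement": "support",
}

def normalize_topic(tag: str) -> str:
    """Map a drifted topic tag back onto the canonical taxonomy."""
    if tag in CANONICAL:
        return tag
    return VARIANT_MAP.get(tag, tag)  # unknown tags pass through for review
```

As the limitations section notes, a script like this could have captured many of the early gains; what it cannot do is diagnose the harder fix types or classify the residual failures.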
The strategy ladder
The agent did not repeat the same fix type across all eight runs. It naturally climbed a strategy ladder as easier fixes were exhausted:
Runs 1-3: Metadata re-tags. The cheapest fix. Change onboarding_messages to onboarding, support_escalation to support. High impact, low risk. This phase fixed 37 queries.
Run 4: When we focused on a single topic, the agent exhausted re-tags within a few iterations and escalated to harder fixes: re-uploading a missing chunk that had been skipped during initial indexing, and enriching subtopic tags to improve retrieval relevance. Three different fix types in one run.
Runs 5-6: Re-tags hit diminishing returns globally. The agent shifted to tag enrichment as the primary strategy, adding descriptive subtopics to help chunks surface for queries that used different vocabulary than the primary topic label.
Runs 7-8: After the query bank was fixed (more on that below), one more wave of re-tags emerged, then the system hit a true ceiling. Run 8 tried three fixes, kept one, and exited.
This progression was emergent, not hard-coded. The playbook defines the six fix types, but the agent chooses which to apply based on its diagnosis of the worst-scoring query. Easy wins come first because they have the highest expected improvement per unit of risk. That is exactly how a human would triage, but the agent did it at 3 AM without being asked.
When the Tests Are Wrong
After run 6, pass rate plateaued at 60% (66 of 110 queries passing). The agent's exit report classified the remaining 44 failures into categories. It flagged that 27 of the 30 content-derived queries had relevance scores of 0.0, and that these queries used anchor phrases containing markdown formatting (**bold text**) and referenced source files like opening-messages.md that did not exist in the index.
The agent flagged the pattern. Humans decided it was a test bug.
The query bank had been bootstrapped from distilled knowledge files, and it inherited their formatting and naming conventions. The indexed chunks contained plain text from the original source documents, tagged with names like Product Guide v2.0.md. The test suite was testing against derived content that did not exist in the index. No amount of metadata correction would ever fix that mismatch.
This is the same thing that happens in software testing: you build a test suite, run it until failures plateau, and discover that the remaining failures are bad tests, not bad code. The audit loop surfaced that distinction by driving the fixable issues to zero and leaving only the test-quality problems visible.
We fixed the query bank in two rounds, both between runs (never during). The first round cleaned up formatting and removed impossible anchors. The second corrected source file references to match actual index metadata. After each revision, we re-ran the full test suite from scratch. The query count stayed at 110. Pass rate jumped from 60% to 89%. Not because the knowledge base improved, but because the tests stopped penalizing correct retrieval for failing to match derived content.
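Both test-bug patterns are mechanically detectable, which suggests a lint step for future query bank revisions. A sketch, with the field names as our assumption:

```python
# Lint a query bank entry for the two failure patterns described above:
# markdown formatting inside anchor phrases, and source file references
# that do not exist in the actual index.
import re

def lint_query(query: dict, indexed_files: set) -> list:
    problems = []
    for anchor in query.get("anchors", []):
        # **bold**, __bold__, leading #, or ](  all betray derived markdown
        if re.search(r"\*\*|__|^#|\]\(", anchor):
            problems.append(f"markdown formatting in anchor: {anchor!r}")
    for ref in query.get("source_files", []):
        if ref not in indexed_files:
            problems.append(f"source file not in index: {ref}")
    return problems
```

Running this against every entry before a revision is approved would catch impossible anchors before they cost a full optimization run.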
To be precise about the split: the agent's optimization loop moved pass rate from 22% to 60%. The benchmark revision moved it from 60% to 89%. Both were necessary. But it would be dishonest to credit the agent with the full 22%-to-89% improvement when the larger jump came from fixing the tests.
The loop alternates between improving the system and revealing flaws in the evaluation. Each plateau forces you to ask: is this a real ceiling, or a measurement problem? The answer determines whether you run the loop again or fix the benchmark.
Limitations and Honest Caveats
This approach worked well for our system. It will not work for everything. Here are the limitations we hit:
No baseline comparison. The dominant root cause was ontology drift: variant topic labels that a deterministic normalization script could have caught. We did not compare the autonomous loop against a simpler baseline like a one-time sed replacement or a human metadata audit. It is possible that a script mapping onboarding_messages to onboarding would have captured 80% of the gains in minutes. The autonomous loop's value over a simple script was in the harder fixes (missing chunks, tag enrichment) and in the diagnostic output (classifying remaining failures). But we did not measure that delta rigorously.
Query bank overfitting. The agent optimizes for the specific queries in the test suite. If those queries under-represent real user needs, the ceiling can be misleading. We mitigated this with a three-tier query bank (content-derived, common user questions, and edge cases), but we did not validate against a held-out query set or independently authored evaluation. The 96% fix success rate looks suspiciously easy, and that should give the reader pause. The test suite is the ceiling, and the ceiling is only as useful as the test suite is representative.
Composite scores can hide regressions. A fix that improves relevance by 0.05 but degrades coherence by 0.04 looks like a net win on the composite but may be a net loss in practice. We inspect dimension-level scores during diagnosis, but the keep/discard decision still runs on the composite. A more sophisticated system would track per-dimension regressions separately.
Metadata-only fixes have a hard ceiling. When the root cause is missing content rather than wrong metadata, no amount of re-tagging or enrichment helps. Eight of our final 12 weak queries were content gaps: scenarios the source material simply does not cover. The system correctly identified these and stopped, but someone still has to write the missing content. The agent knows what is missing. It cannot create it from nothing.
Re-upload content drift. Our most surprising failure mode. When the agent re-tags a chunk, it reads the source file and re-uploads approximate text. This subtly changes the indexed content, which can break queries that depended on the original phrasing. One fix caused a previously passing query to regress from 0.85 to 0.46 because the re-uploaded text lost a verbatim phrase. Ideally, metadata updates would be pure metadata operations without re-indexing the content. Not all search engines support this cleanly.
Retrieval improvement is not answer improvement. We measured whether the right chunks appear in search results. We did not independently measure whether the downstream AI agent produces better answers as a result. Retrieval quality is a proxy for answer quality, and it is a reasonable proxy, but we did not validate the full chain. A team with more rigor would A/B test answer quality before and after the retrieval fixes.
Scale is modest. Karpathy's loop runs ~100 experiments overnight. Ours ran 82 fixes across 8 sessions. The bottleneck is not compute; it is the evaluation cycle (each query hits a live search engine) and the conservative one-fix-per-iteration approach. Trading throughput for safety was the right call for a production system, but teams with staging environments could parallelize more aggressively.
This ran against a live index. The optimization loop made API calls to the production search engine. We mitigated risk with snapshots, rollback, and one-fix-at-a-time iteration, but we did not use a staging replica or shadow index. Teams with higher availability requirements should run against a replica and promote fixes after validation.
How to Build This for Your System
The pattern is transferable to any system where you can define a quality metric and constrain the mutation surface. Here is the minimum viable setup:
1. Build a query bank.
Write 50-100 test queries that represent real user needs. For each query, define what a good result looks like: which documents should appear, what metadata should be present, what should not appear. Use three tiers: queries derived from your content (sanity checks), common user questions (the real test), and edge cases (stress tests). Bootstrap from your content, then refine manually.
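For concreteness, one hypothetical query bank entry might look like this; the field names are our assumption, following the anchor-document, metadata, and pollution checks described above:

```python
# One illustrative query bank entry (field names are an assumption).
QUERY = {
    "id": "q042",
    "tier": "user_question",  # content_derived | user_question | edge_case
    "query": "How do I set up a customer profile?",
    "expected": {
        "anchor_docs": ["Product Guide v2.0.md"],  # must appear in results
        "primary_topic": "onboarding",             # required metadata value
        "must_not_appear": ["internal-notes.md"],  # pollution check
    },
}
```
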
2. Build an evaluator.
Score each query on the dimensions that matter for your system. Relevance and coverage are universal. Metadata quality, pollution, and coherence are situation-dependent. Weight the dimensions, compute a composite, and define a pass/fail threshold. Make this script immutable once built. If you need to change it, change it between runs, never during.
3. Write a playbook.
Define what the agent can do (permitted fix types), what it cannot do (forbidden actions), and how it should decide between them (diagnose the worst query, pick the lowest-risk fix that addresses the root cause). Include startup invariants, snapshot requirements, and rollback procedures. The playbook is your safety contract with the agent.
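The permitted/forbidden split can be enforced in code rather than trusted to the agent's judgment. A sketch, with fix-type names following the six-rung ladder described earlier and the guard function our own construction:

```python
# Enforce the playbook's contract before any mutation is attempted.
PERMITTED_FIXES = {
    "metadata_retag", "chunk_boundary_fix", "missing_chunk",
    "semantic_identifier_fix", "tag_enrichment", "duplicate_deletion",
}
FORBIDDEN = {
    "edit_evaluator", "edit_query_bank", "broad_reindex",
    "fabricate_content",
}

def validate_fix(fix_type: str) -> None:
    """Reject any proposed action outside the playbook's contract."""
    if fix_type in FORBIDDEN:
        raise PermissionError(f"forbidden action: {fix_type}")
    if fix_type not in PERMITTED_FIXES:
        raise ValueError(f"unknown fix type: {fix_type}")
```
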
4. Build snapshot and rollback.
Before every mutation, capture the full state of the documents being changed. After every mutation, re-evaluate. If the score regresses, restore from the snapshot. This is non-negotiable for production systems. The cost of a snapshot is trivial compared to the cost of an unrecoverable bad mutation.
5. Run, analyze the ceiling, and iterate.
Run the loop until it stalls. Read the exit report. Classify remaining failures: are they data problems (run again), test problems (fix the tests), or content gaps (write new content)? Fix the identified issue and re-run. The loop will find the next bottleneck. Repeat until the ceiling is something you are comfortable with.
The Bottom Line
Karpathy's autoresearch pattern is not about ML training. It is about disciplined autonomous optimization: freeze the evaluator, constrain the mutation surface, make one change at a time, keep only improvements.
We applied it to RAG retrieval quality and got useful results. The agent found ontology drift we did not know existed, moved pass rate from 22% to 60%, and produced diagnostic output that revealed our test suite was flawed. After we fixed the benchmark, a final pass brought the system to 89%. The load-bearing components (immutable evaluator, constrained mutations, keep-or-rollback decisions, benchmark revision policy) transferred from ML training to retrieval operations without fundamental changes.
We do not know whether the autonomous loop was the most efficient way to fix these problems. A deterministic script might have handled the metadata normalization. A human might have spotted the ontology drift faster. What the loop provided that alternatives would not was the combination of automated diagnosis, safe mutation, and a structured ceiling analysis that classified remaining failures into actionable categories.
If you have a system where quality is measurable and mutations are bounded, the pattern is worth testing. The hard part is not the agent. It is building an honest evaluator and committing to not letting anything modify it mid-run.