We did not retrain a model, tweak prompts, or rebuild our retrieval stack. We took Andrej Karpathy's autoresearch pattern, froze the evaluator, gave an AI agent exactly one mutable surface to improve, and pointed it at a production RAG system's knowledge base. The agent raised the retrieval pass rate from 22% to 60% across 110 test queries. Then it stalled, and the plateau revealed that our test suite was flawed. After we fixed the benchmark, a final pass brought the system to 89%. Agent inference cost across all runs: $88.
Karpathy released autoresearch as an experimental framework for autonomous ML research. The idea is simple: give an AI agent a playbook, a fixed evaluation metric, and a single file it can edit. Let it run overnight. Keep improvements, discard regressions. Wake up to a better model.
The pattern is elegant. It is also not specific to ML training. The core loop (evaluate, diagnose, mutate, re-evaluate, keep or rollback) works anywhere you have a measurable quality signal and a constrained mutation surface. We tested this by applying it to a completely different domain: knowledge base retrieval quality in a production RAG system with 110 test queries across 10 topic categories. One system, one corpus, no control group. This is a case study, not a proof. But the results were concrete enough to be useful.
This article walks through what we borrowed from autoresearch, what we had to change, and what the results tell us about where autonomous optimization loops work well.
TL;DR
- Karpathy's autoresearch pattern (freeze the evaluator, constrain mutations, keep or rollback) transfers cleanly from ML training to RAG retrieval optimization
- An AI agent ran 8 optimization passes, kept 82 of 85 attempted fixes, and moved retrieval pass rate from 22% to 60% for $88 in inference cost
- The dominant root cause was ontology drift: topic metadata had silently diverged from the taxonomy the search layer expected
- The plateau at 60% revealed flawed test queries, not a retrieval ceiling; fixing the benchmark brought the system to 89%
- The immutable evaluator is the key safety property: if the agent can redefine success, it will optimize the definition instead of the system
The Autoresearch Pattern
Karpathy's autoresearch has three components. A program.md playbook that tells the agent what to do and what not to do. A single editable file (train.py) that contains the model architecture and training loop. And an immutable evaluation harness (prepare.py) that scores every experiment on the same metric: validation bits per byte.
Each experiment runs for exactly five minutes. The agent reads results, forms a hypothesis, edits train.py, and runs again. If the score improves, the change stays. If it regresses, the change is discarded. Over a night, this produces roughly 100 experiments. No human in the loop.
The load-bearing idea is in the constraints, not the autonomy. The agent cannot touch the evaluator, so it cannot game its own test. The time budget is fixed, so every experiment is directly comparable. The mutation surface is one file, so changes are reviewable. These constraints are what make the autonomy safe.
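The whole loop fits in a few lines. Here is a minimal Python sketch, with evaluate, diagnose, apply_fix, and rollback as hypothetical stand-ins for your own evaluator and mutation layer:

```python
# Minimal sketch of the keep-or-rollback loop. The four callables are
# hypothetical: plug in your own frozen evaluator and mutation surface.

def optimization_loop(evaluate, diagnose, apply_fix, rollback, max_iters=20):
    best = evaluate()                     # frozen evaluator, single metric
    for _ in range(max_iters):
        hypothesis = diagnose(best)       # pick the worst-scoring target
        if hypothesis is None:            # nothing left to try: ceiling
            break
        snapshot = apply_fix(hypothesis)  # one mutation, state captured first
        score = evaluate()                # re-run the same frozen evaluator
        if score > best:
            best = score                  # keep the improvement
        else:
            rollback(snapshot)            # discard the regression
    return best
```

The safety properties live outside this function: the evaluator it calls is immutable, and apply_fix is constrained to one mutation surface.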
Adapting the Pattern to RAG
Our production system uses a semantic search engine (Onyx) to serve chunks of indexed content to an AI agent. The knowledge base contains roughly 500 chunks derived from 176 source documents, organized into 10 canonical topic categories. Retrieval works in two stages: semantic search returns the top results by embedding similarity, then results are filtered and boosted by topic metadata tags. When the topic tags are wrong, relevant content drops out of results even if the embeddings are close. The agent queries this knowledge base, retrieves relevant chunks, and uses them to ground its responses. The system had been running for months, and retrieval quality had never been formally measured.
We suspected it was degrading. We had no proof. So we built a test harness and pointed autoresearch at it.
Here is what we borrowed and what we changed:
| Component | Autoresearch (ML) | Our Adaptation (RAG) |
|---|---|---|
| Playbook | program.md | program.md (521 lines) |
| Mutable target | train.py | Onyx metadata + chunks |
| Immutable evaluator | prepare.py | evaluator.py (SHA256 enforced) |
| Metric | val_bpb (single number) | 5-dimension composite score |
| Experiment budget | 5 minutes wall-clock | 1 fix per iteration, max 20 |
| Regression handling | Discard and continue | Snapshot, rollback, continue |
| Overnight yield | ~100 experiments | 82 fixes across 8 runs |
The table looks clean, but the adaptations were not trivial. Each one reflects a real operational difference between optimizing code and optimizing a live retrieval system.
Five Things We Had to Change
1. The target is data, not code
Autoresearch edits a Python file. We edit metadata and content in a search engine. The mutation surface is fundamentally different: instead of changing lines of code, the agent re-tags topic labels, enriches subtopic arrays, re-chunks source documents, or uploads missing content. Every mutation is an API call to an external system, not a file write.
This changes the operational model. A bad code edit is trivially reversible with git checkout. A bad metadata edit in a live search engine requires restoring specific documents to their pre-mutation state. We built a snapshot layer that captures document payloads before every fix, enabling surgical rollback without affecting unrelated content.
2. The metric is multi-dimensional
Karpathy has one number: validation bits per byte. Lower is better. Unambiguous.
We score every query on five dimensions, each weighted:
- Relevance (0.30): How high does the correct chunk rank in search results?
- Coverage (0.20): Are all expected anchor documents found?
- Coherence (0.20): Is the chunk text well-formed and extractable?
- Metadata (0.15): Are required fields present and correctly valued?
- Pollution (0.15): Are non-knowledge-base documents absent from results?
The composite formula is: (relevance * 0.30) + (coverage * 0.20) + (coherence * 0.20) + (metadata * 0.15) + (pollution * 0.15). A query passes at 0.70 or above. The keep/discard rule is straightforward: keep a fix if the composite score of the target query improves and the overall pass count does not decrease. The agent also inspects dimension-level scores during diagnosis to choose which fix type to attempt. No LLM evaluator is involved; all scoring is deterministic based on search result ranking and metadata presence.
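The scoring rule is small enough to write out in full. A sketch, with the dimension names and weights taken from the list above and the function names our own:

```python
# Deterministic composite scorer, as described in the text.
WEIGHTS = {
    "relevance": 0.30,
    "coverage": 0.20,
    "coherence": 0.20,
    "metadata": 0.15,
    "pollution": 0.15,
}
PASS_THRESHOLD = 0.70

def composite_score(dims: dict) -> float:
    """Weighted sum of the five dimension scores (each in [0, 1])."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

def passes(dims: dict) -> bool:
    """A query passes at 0.70 or above on the composite."""
    return composite_score(dims) >= PASS_THRESHOLD
```
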
This is simpler than it could be. A more sophisticated system would reject fixes where any individual dimension drops below a threshold, even if the composite improves. We did observe one case where a fix caused a cross-query regression (a re-upload that lost a verbatim anchor phrase), so per-dimension floors would have been useful. This is a known gap in the current design.
3. Rollback is harder
In autoresearch, rolling back means reverting a file. In our system, rolling back means restoring documents in an external search engine via API, then confirming the restoration did not introduce its own side effects. We discovered that re-uploading a chunk during a metadata fix can subtly change the indexed content, because the agent extracts approximate text from the source file rather than using the original chunk verbatim. One fix improved a target query but broke an unrelated query because the re-uploaded text lost a verbatim phrase that a different test depended on.
Production-safe autonomy needs rollback as a first-class requirement, not an afterthought. Every mutation snapshots state before and after. Every rollback verifies the restoration matches the snapshot.
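In code, that discipline looks roughly like the following sketch. The client methods (get_document, update_document) are hypothetical stand-ins for the search engine's real API, and the keep rule is simplified to a single score:

```python
# Sketch of snapshot-before-mutation with surgical rollback.
import copy

def apply_fix_with_snapshot(client, doc_ids, mutate, evaluate, baseline):
    # 1. Capture full payloads of every document the fix will touch.
    snapshot = {d: copy.deepcopy(client.get_document(d)) for d in doc_ids}
    mutate(client)                       # 2. One fix, via API calls
    score = evaluate()                   # 3. Re-run the frozen evaluator
    if score <= baseline:                # 4. Regression: roll back
        for doc_id, payload in snapshot.items():
            client.update_document(doc_id, payload)
        # 5. Verify the restoration actually matches the snapshot
        for doc_id, payload in snapshot.items():
            assert client.get_document(doc_id) == payload
        return baseline
    return score
```
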
4. The constraint set is richer
Autoresearch has one constraint: only edit train.py. Our playbook defines nine startup invariants, six categories of permitted fixes, a list of forbidden actions, and a partition lock system for parallel execution.
The six permitted fix types form a strategy ladder:
- Metadata re-tag: Fix a wrong primary_topic value
- Chunk boundary fix: Re-chunk a source file when a concept is split across chunks
- Missing chunk: Upload content that was skipped during initial indexing
- Semantic identifier fix: Rename a chunk title from "Section 3.2" to something descriptive
- Topic tag enrichment: Add missing subtopic tags
- Duplicate deletion: Remove stale duplicate documents
The forbidden actions are equally important: never fabricate content that does not exist in the source files, never modify the evaluator or query bank, never do a broad reindex (one fix at a time), and never touch documents managed by the file connector. The constraint surface for data mutation is larger than for code mutation because the blast radius of a bad data change in a live system is harder to bound.
5. Benchmark governance becomes a first-class problem
In autoresearch, the evaluation harness is built once and never revisited. With RAG retrieval, we discovered that the benchmark itself was the biggest source of error, and managing benchmark revisions became as important as managing the optimization loop.
When the loop stalled at 60% pass rate, the agent classified remaining failures into categories. Humans reviewed those categories and discovered that 30 of the remaining weak queries were testing against content that could never match: markdown-formatted phrases from derived knowledge files, source file references that did not match the actual index metadata. These were test bugs, not retrieval bugs.
This meant we needed a benchmark revision policy: when can the query bank change, who approves changes, and how do you keep results comparable across revisions? Our rule was simple: the query bank is immutable within a run. Between runs, humans can revise it, but the revision must be documented and the full test suite re-run from scratch afterward. No partial credit for pre-revision scores.
Karpathy's autoresearch does not face this problem because ML evaluation metrics are stable. In retrieval, the relationship between what you test and what you index can drift, and managing that drift is part of the system, not a one-time setup step.
The Immutable Evaluator: How You Keep Agents Honest
The most important design decision was making the evaluator and query bank immutable within each optimization run. SHA256 hashes are verified at startup and re-checked before every fix. If either file changes mid-run, the loop aborts.
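A minimal version of that integrity check might look like this, with pin_hashes run once at startup and verify_or_abort re-run before every fix (the function names are ours):

```python
# Sketch of the startup and per-fix hash check on the frozen files.
import hashlib

def file_sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def pin_hashes(paths):
    """Record hashes of the frozen files (evaluator, query bank) at startup."""
    return {p: file_sha256(p) for p in paths}

def verify_or_abort(pinned):
    """Re-check before every fix; abort the loop on any drift."""
    for path, expected in pinned.items():
        if file_sha256(path) != expected:
            raise RuntimeError(f"{path} changed mid-run: aborting loop")
```
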
This is anti-benchmark-hacking infrastructure. An agent optimizing a metric will, given the opportunity, find ways to improve the metric that do not improve the underlying quality. Freezing the evaluator means the agent cannot redefine what "good" means mid-loop. Freezing the query bank means it cannot drop hard test cases or add easy ones.
An important distinction: immutability applies within a run, not across the entire study. As we describe later, we revised the query bank between runs when the agent's plateau analysis revealed flaws in the test suite. Those revisions were human decisions made between runs, with full re-evaluation afterward. The agent never modified its own benchmark.
The constraint forces the agent to improve the actual knowledge base, not the measurement of it. If you give an autonomous agent both the ability to optimize and the ability to define success, it will optimize the definition. Separating those responsibilities is the core safety property.
What the Loop Found: 8 Runs, 82 Fixes
We ran the audit loop eight times over a single day, in two phases separated by a benchmark revision. Runs 1-6 used the original query bank. Runs 7-8 used the revised query bank. The scores are not directly comparable across the revision boundary because the test cases changed. We present them in one table for narrative clarity, but the pass rates before and after the revision are measured on different rulers.
| Run | Scope | Kept | Disc. | Pass Rate | Primary Strategy |
|---|---|---|---|---|---|
| 1 | All topics | 11 | 1 | 22% → 35% | Metadata re-tag |
| 2 | All topics | 16 | 0 | 35% → 41% | Metadata re-tag |
| 3 | All topics | 10 | 1 | 41% → 47% | Metadata re-tag |
| 4 | Single topic | 8 | 0 | 18% → 45%* | Re-tag + missing chunk + enrichment |
| 5 | All topics | 7 | 1 | 50% → 53% | Re-tag (diminishing) |
| 6 | All topics | 10 | 0 | 53% → 60% | Tag enrichment |
| | Query bank fixes applied | | | | |
| 7 | All topics | 20 | 0 | 58% → 62% | Metadata re-tag |
| 8 | All topics | 1 | 2 | 89% → 89% | Enrichment (ceiling reached) |
*Run 4 targeted a single topic partition (11 queries), not the full 110-query suite.
A note on the cost: $88 is the agent inference spend across 8 runs. It does not include the time to build the evaluator, author the query bank, or develop the snapshot and rollback tooling. Those were the real investments. The $88 figure represents the marginal cost of running the optimization loop once the infrastructure exists.
The root cause: ontology drift
The first six runs revealed a single dominant problem: ontology drift. The knowledge base had been indexed over several months, with content tagged using increasingly specific topic labels. Early chunks used canonical categories like onboarding and support. Later chunks drifted to fine-grained variants: onboarding_messages, support_escalation, profile_setup, customer_engagement.
The system searched by canonical topic. Chunks tagged with variant names did not surface. The content was there. The metadata had drifted away from the taxonomy the search layer expected.
This is not a bug anyone would catch by reading the code. The indexing logic was correct. The search logic was correct. The taxonomy just evolved faster than the metadata, and nobody noticed because there was no quality signal telling them retrieval was degrading.
Ontology drift is the silent killer of RAG systems. Your content is fine. Your search is fine. Your taxonomy and your metadata just stopped agreeing, and there is no error message for that.
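For the simplest form of the drift, the fix is mechanical. A sketch of a deterministic normalization pass follows; the variant-to-canonical mapping here is illustrative (in our case the agent discovered the mappings itself rather than reading them from a table):

```python
# Illustrative variant-to-canonical mapping for the drift described above.
CANONICAL = {"onboarding", "support"}  # subset of the 10 real categories

VARIANT_MAP = {
    "onboarding_messages": "onboarding",
    "profile_setup": "onboarding",
    "support_escalation": "support",
    "customer_engagement": "support",
}

def normalize_topic(tag: str) -> str:
    """Map a drifted topic tag back onto the canonical taxonomy."""
    if tag in CANONICAL:
        return tag
    return VARIANT_MAP.get(tag, tag)  # unknown tags pass through for review
```

As the limitations section notes, a script like this could have captured many of the early gains; what it cannot do is diagnose the harder fix types or classify the residual failures.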
The strategy ladder
The agent did not repeat the same fix type across all eight runs. It naturally climbed a strategy ladder as easier fixes were exhausted:
Runs 1-3: Metadata re-tags. The cheapest fix. Change onboarding_messages to onboarding, support_escalation to support. High impact, low risk. This phase fixed 37 queries.
Run 4: When we focused on a single topic, the agent exhausted re-tags within a few iterations and escalated to harder fixes: re-uploading a missing chunk that had been skipped during initial indexing, and enriching subtopic tags to improve retrieval relevance. Three different fix types in one run.
Runs 5-6: Re-tags hit diminishing returns globally. The agent shifted to tag enrichment as the primary strategy, adding descriptive subtopics to help chunks surface for queries that used different vocabulary than the primary topic label.
Runs 7-8: After the query bank was fixed (more on that below), one more wave of re-tags emerged, then the system hit a true ceiling. Run 8 tried three fixes, kept one, and exited.
This progression was emergent, not hard-coded. The playbook defines the six fix types, but the agent chooses which to apply based on its diagnosis of the worst-scoring query. Easy wins come first because they have the highest expected improvement per unit of risk. That is exactly how a human would triage, but the agent did it at 3 AM without being asked.
When the Tests Are Wrong
After run 6, pass rate plateaued at 60% (66 of 110 queries passing). The agent's exit report classified the remaining 44 failures into categories. It flagged that 27 of the 30 content-derived queries had relevance scores of 0.0, and that these queries used anchor phrases containing markdown formatting (**bold text**) and referenced source files like opening-messages.md that did not exist in the index.
The agent flagged the pattern. Humans decided it was a test bug.
The query bank had been bootstrapped from distilled knowledge files, and it inherited their formatting and naming conventions. The indexed chunks contained plain text from the original source documents, tagged with names like Product Guide v2.0.md. The test suite was testing against derived content that did not exist in the index. No amount of metadata correction would ever fix that mismatch.
This is the same thing that happens in software testing: you build a test suite, run it until failures plateau, and discover that the remaining failures are bad tests, not bad code. The audit loop surfaced that distinction by driving the fixable issues to zero and leaving only the test-quality problems visible.
We fixed the query bank in two rounds, both between runs (never during). The first round cleaned up formatting and removed impossible anchors. The second corrected source file references to match actual index metadata. After each revision, we re-ran the full test suite from scratch. The query count stayed at 110. Pass rate jumped from 60% to 89%. Not because the knowledge base improved, but because the tests stopped penalizing correct retrieval for failing to match derived content.
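Both test-bug patterns are mechanically detectable, which suggests a lint step for future query bank revisions. A sketch, with the field names as our assumption:

```python
# Lint a query bank entry for the two failure patterns described above:
# markdown formatting inside anchor phrases, and source file references
# that do not exist in the actual index.
import re

def lint_query(query: dict, indexed_files: set) -> list:
    problems = []
    for anchor in query.get("anchors", []):
        # **bold**, __bold__, leading #, or ](  all betray derived markdown
        if re.search(r"\*\*|__|^#|\]\(", anchor):
            problems.append(f"markdown formatting in anchor: {anchor!r}")
    for ref in query.get("source_files", []):
        if ref not in indexed_files:
            problems.append(f"source file not in index: {ref}")
    return problems
```

Running this against every entry before a revision is approved would catch impossible anchors before they cost a full optimization run.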
To be precise about the split: the agent's optimization loop moved pass rate from 22% to 60%. The benchmark revision moved it from 60% to 89%. Both were necessary. But it would be dishonest to credit the agent with the full 22%-to-89% improvement when the larger jump came from fixing the tests.
The loop alternates between improving the system and revealing flaws in the evaluation. Each plateau forces you to ask: is this a real ceiling, or a measurement problem? The answer determines whether you run the loop again or fix the benchmark.
Limitations and Honest Caveats
This approach worked well for our system. It will not work for everything. Here are the limitations we hit:
No baseline comparison. The dominant root cause was ontology drift: variant topic labels that a deterministic normalization script could have caught. We did not compare the autonomous loop against a simpler baseline like a one-time sed replacement or a human metadata audit. It is possible that a script mapping onboarding_messages to onboarding would have captured 80% of the gains in minutes. The autonomous loop's value over a simple script was in the harder fixes (missing chunks, tag enrichment) and in the diagnostic output (classifying remaining failures). But we did not measure that delta rigorously.
Query bank overfitting. The agent optimizes for the specific queries in the test suite. If those queries under-represent real user needs, the ceiling can be misleading. We mitigated this with a three-tier query bank (content-derived, common user questions, and edge cases), but we did not validate against a held-out query set or independently authored evaluation. The 96% fix success rate looks suspiciously easy, and that should give the reader pause. The test suite is the ceiling, and the ceiling is only as useful as the test suite is representative.
Composite scores can hide regressions. A fix that improves relevance by 0.05 but degrades coherence by 0.04 looks like a net win on the composite but may be a net loss in practice. We inspect dimension-level scores during diagnosis, but the keep/discard decision still runs on the composite. A more sophisticated system would track per-dimension regressions separately.
Metadata-only fixes have a hard ceiling. When the root cause is missing content rather than wrong metadata, no amount of re-tagging or enrichment helps. Eight of our final 12 weak queries were content gaps: scenarios the source material simply does not cover. The system correctly identified these and stopped, but someone still has to write the missing content. The agent knows what is missing. It cannot create it from nothing.
Re-upload content drift. Our most surprising failure mode. When the agent re-tags a chunk, it reads the source file and re-uploads approximate text. This subtly changes the indexed content, which can break queries that depended on the original phrasing. One fix caused a previously passing query to regress from 0.85 to 0.46 because the re-uploaded text lost a verbatim phrase. Ideally, metadata updates would be pure metadata operations without re-indexing the content. Not all search engines support this cleanly.
Retrieval improvement is not answer improvement. We measured whether the right chunks appear in search results. We did not independently measure whether the downstream AI agent produces better answers as a result. Retrieval quality is a proxy for answer quality, and it is a reasonable proxy, but we did not validate the full chain. A team with more rigor would A/B test answer quality before and after the retrieval fixes.
Scale is modest. Karpathy's loop runs ~100 experiments overnight. Ours ran 82 fixes across 8 sessions. The bottleneck is not compute; it is the evaluation cycle (each query hits a live search engine) and the conservative one-fix-per-iteration approach. Trading throughput for safety was the right call for a production system, but teams with staging environments could parallelize more aggressively.
This ran against a live index. The optimization loop made API calls to the production search engine. We mitigated risk with snapshots, rollback, and one-fix-at-a-time iteration, but we did not use a staging replica or shadow index. Teams with higher availability requirements should run against a replica and promote fixes after validation.
How to Build This for Your System
The pattern is transferable to any system where you can define a quality metric and constrain the mutation surface. Here is the minimum viable setup:
1. Build a query bank.
Write 50-100 test queries that represent real user needs. For each query, define what a good result looks like: which documents should appear, what metadata should be present, what should not appear. Use three tiers: queries derived from your content (sanity checks), common user questions (the real test), and edge cases (stress tests). Bootstrap from your content, then refine manually.
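For concreteness, one hypothetical query bank entry might look like this; the field names are our assumption, following the anchor-document, metadata, and pollution checks described above:

```python
# One illustrative query bank entry (field names are an assumption).
QUERY = {
    "id": "q042",
    "tier": "user_question",  # content_derived | user_question | edge_case
    "query": "How do I set up a customer profile?",
    "expected": {
        "anchor_docs": ["Product Guide v2.0.md"],  # must appear in results
        "primary_topic": "onboarding",             # required metadata value
        "must_not_appear": ["internal-notes.md"],  # pollution check
    },
}
```
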
2. Build an evaluator.
Score each query on the dimensions that matter for your system. Relevance and coverage are universal. Metadata quality, pollution, and coherence are situation-dependent. Weight the dimensions, compute a composite, and define a pass/fail threshold. Make this script immutable once built. If you need to change it, change it between runs, never during.
3. Write a playbook.
Define what the agent can do (permitted fix types), what it cannot do (forbidden actions), and how it should decide between them (diagnose the worst query, pick the lowest-risk fix that addresses the root cause). Include startup invariants, snapshot requirements, and rollback procedures. The playbook is your safety contract with the agent.
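The permitted/forbidden split can be enforced in code rather than trusted to the agent's judgment. A sketch, with fix-type names following the six-rung ladder described earlier and the guard function our own construction:

```python
# Enforce the playbook's contract before any mutation is attempted.
PERMITTED_FIXES = {
    "metadata_retag", "chunk_boundary_fix", "missing_chunk",
    "semantic_identifier_fix", "tag_enrichment", "duplicate_deletion",
}
FORBIDDEN = {
    "edit_evaluator", "edit_query_bank", "broad_reindex",
    "fabricate_content",
}

def validate_fix(fix_type: str) -> None:
    """Reject any proposed action outside the playbook's contract."""
    if fix_type in FORBIDDEN:
        raise PermissionError(f"forbidden action: {fix_type}")
    if fix_type not in PERMITTED_FIXES:
        raise ValueError(f"unknown fix type: {fix_type}")
```
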
4. Build snapshot and rollback.
Before every mutation, capture the full state of the documents being changed. After every mutation, re-evaluate. If the score regresses, restore from the snapshot. This is non-negotiable for production systems. The cost of a snapshot is trivial compared to the cost of an unrecoverable bad mutation.
5. Run, analyze the ceiling, and iterate.
Run the loop until it stalls. Read the exit report. Classify remaining failures: are they data problems (run again), test problems (fix the tests), or content gaps (write new content)? Fix the identified issue and re-run. The loop will find the next bottleneck. Repeat until the ceiling is something you are comfortable with.
The Bottom Line
Karpathy's autoresearch pattern is not about ML training. It is about disciplined autonomous optimization: freeze the evaluator, constrain the mutation surface, make one change at a time, keep only improvements.
We applied it to RAG retrieval quality and got useful results. The agent found ontology drift we did not know existed, moved pass rate from 22% to 60%, and produced diagnostic output that revealed our test suite was flawed. After we fixed the benchmark, a final pass brought the system to 89%. The load-bearing components (immutable evaluator, constrained mutations, keep-or-rollback decisions, benchmark revision policy) transferred from ML training to retrieval operations without fundamental changes.
We do not know whether the autonomous loop was the most efficient way to fix these problems. A deterministic script might have handled the metadata normalization. A human might have spotted the ontology drift faster. What the loop provided that alternatives would not was the combination of automated diagnosis, safe mutation, and a structured ceiling analysis that classified remaining failures into actionable categories.
If you have a system where quality is measurable and mutations are bounded, the pattern is worth testing. The hard part is not the agent. It is building an honest evaluator and committing to not letting anything modify it mid-run.