Every data team has That Person — the one who knows that the user_activity_v2 table actually feeds three downstream jobs through an intermediate field called _tmp_session_hash that nobody documented because "everyone knows that." Then That Person goes on vacation during an incident.
Meta just published a detailed writeup about what happens when you try to solve this problem with brute force: they pointed a swarm of 50+ specialized AI agents at their data pipeline codebase and told them to find everything that isn't written down. The results are both impressive and deeply uncomfortable.
The Problem Nobody Wants to Admit
Meta's data processing infrastructure spans four repositories, three programming languages (Python, C++, and Hack), and over 4,100 files. When they started letting AI coding assistants help with development tasks, the assistants would "guess, explore, guess again, and often produce code that compiled but was subtly wrong."
The issue wasn't model intelligence. The models were capable enough. They just didn't know that removing a "deprecated" identifier value breaks backward compatibility, or that one pipeline stage outputs a temporary field name that a downstream stage silently renames. This stuff existed exclusively in the heads of engineers who built it — sometimes years ago.
Before this project, structured context covered roughly 5% of the codebase. Five files. Everything else was dark territory.
How the Agent Swarm Works
Meta didn't run a single pass over the code. They built a multi-phase orchestration system — an assembly line of specialized agents, each with a narrow job:
- 2 explorer agents mapped the overall repo structure
- 11 module analysts each answered five specific questions per module: what it configures, common modification patterns, non-obvious failure modes, cross-module dependencies, and buried tribal knowledge
- 2 writers synthesized findings into context files
- 10+ critic passes for quality review
- 4 fixers applied corrections from the critics
- 8 upgraders refined routing layers
- 3 prompt testers validated against 55+ real queries
- 4 gap-fillers swept remaining directories
- 3 final critics ran integration tests
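Meta hasn't published the orchestration code, but the assembly-line structure above is easy to picture. Here's a minimal sketch of the phased hand-off, with the roles and counts taken from the writeup; the class names and the shared-artifacts mechanism are my own invention:

```python
from dataclasses import dataclass

# Hypothetical sketch of the phased orchestration described above.
# Roles and counts come from Meta's writeup; everything else
# (Phase, run_swarm, the artifacts dict) is invented for illustration.

@dataclass
class Phase:
    role: str   # e.g. "explorer", "module_analyst"
    count: int  # how many agents run in this phase
    task: str   # one-line description of the phase's job

PHASES = [
    Phase("explorer", 2, "map overall repo structure"),
    Phase("module_analyst", 11, "answer five questions per module"),
    Phase("writer", 2, "synthesize findings into context files"),
    Phase("critic", 10, "review context files for quality"),
    Phase("fixer", 4, "apply corrections from critics"),
    Phase("upgrader", 8, "refine routing layers"),
    Phase("prompt_tester", 3, "validate against real queries"),
    Phase("gap_filler", 4, "sweep remaining directories"),
    Phase("final_critic", 3, "run integration tests"),
]

def run_swarm(artifacts: dict) -> dict:
    """Run phases in order; each phase reads and extends shared artifacts."""
    for phase in PHASES:
        for i in range(phase.count):
            # A real system would dispatch an LLM call with a
            # role-specific prompt here; we just record the work item.
            artifacts.setdefault(phase.role, []).append(
                f"{phase.role}-{i}: {phase.task}")
    return artifacts
```

The point of the shape, not the stub: each phase consumes the previous phase's output, so critics see writer output and fixers see critic output, rather than every agent free-roaming the repo.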
The output: 59 context files, each roughly 1,000 tokens. They follow a strict "compass, not encyclopedia" format — four sections per file: quick commands, key files (3–5 most relevant), non-obvious patterns, and cross-references. Total context footprint: less than 0.1% of a modern LLM's context window.
That last number is key. They didn't try to dump everything into a prompt. They built a routing layer that gives each agent exactly the context it needs for the module it's touching, nothing more.
What They Dug Up
This is the part that should make every data team uncomfortable. Across those 4,100+ files, the agents surfaced 50+ non-obvious patterns that had zero documentation:
Hidden intermediate naming conventions. One pipeline stage outputs a temporary field that a downstream stage renames. Follow the convention and your data flows correctly. Deviate — say, by giving the intermediate field a "better" name — and your output silently produces wrong results. No error, no warning.
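A contrived two-stage illustration of that failure mode (field names and functions are mine, not Meta's):

```python
# Stage B renames a temporary field emitted by stage A purely by
# convention. Nothing enforces the convention -- that's the trap.

def stage_a(rows):
    # Emits sessions under the conventional temporary name.
    return [{**r, "_tmp_session_hash": hash(r["user_id"])} for r in rows]

def stage_b(rows):
    # Silently renames the conventional field; anything else is dropped.
    return [{"user_id": r["user_id"],
             "session_id": r.pop("_tmp_session_hash", None)}
            for r in rows]

# Follow the convention and session_id is populated. Rename
# _tmp_session_hash to something "better" in stage_a and session_id
# comes out None downstream -- no error, no warning, just wrong data.
```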
Append-only identifier rules. Values that look deprecated and unused are actually load-bearing for backward compatibility. Delete them and nothing breaks immediately. It breaks three weeks later when a downstream consumer replays historical data.
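In code, this pattern often looks like an enum with a member nobody writes anymore. A hypothetical example (the event type and values are invented):

```python
from enum import IntEnum

# Append-only identifier rule: the numeric values are persisted in
# historical data, so "unused" members are load-bearing.

class EventType(IntEnum):
    CLICK = 1
    VIEW = 2
    SWIPE = 3      # looks dead: no producer emits SWIPE anymore
    PURCHASE = 4

def decode(raw: int) -> EventType:
    # Replaying old data still yields raw == 3. Delete SWIPE (or
    # renumber) and this raises on replay -- or silently maps 3 to
    # the wrong event if another member takes its value.
    return EventType(raw)
```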
Cross-repo dependency chains. A config change in repository A propagates through a data flow into repository C, but the connection passes through repository B in a way that no single team owns or monitors. The agents found these chains. Humans hadn't mapped them.
None of this lived in READMEs, docstrings, or wikis. Some of it was recoverable from commit history — if you knew which commits to look at and in which of the four repos. Most of it lived in the heads of senior engineers who'd been burned by these patterns in past incidents.
Before and After
| Metric | Before | After |
|---|---|---|
| Context coverage | ~5% (5 files) | 100% (59 files) |
| Files with navigation | ~50 | 4,100+ |
| Documented non-obvious patterns | 0 | 50+ |
| Quality score (human-rated) | 3.65 / 5.0 | 4.20 / 5.0 |
With the pre-computed context, agents used roughly 40% fewer tool calls and tokens per task. Complex workflows that previously required about two days of research completed in around 30 minutes.
What Smaller Teams Can Steal
Fifty-plus agents and Meta-scale compute aren't in your budget. I get it. But the underlying approach — using LLMs to systematically interrogate a codebase and extract undocumented knowledge — is reproducible at smaller scale.
The five questions their module analysts asked are genuinely useful for any pipeline codebase:
1. What does this module configure?
2. What are common modification patterns?
3. What non-obvious patterns cause build failures?
4. What cross-module dependencies exist?
5. What tribal knowledge is buried in code comments?
You could run these against your own repos with a single agent — Claude, GPT, Gemini, whatever. It won't be as thorough as a 50-agent swarm with critic passes and gap-fillers. But even partial coverage beats what most teams actually have: a stale Confluence page from 2023 and a README that says "TODO: add setup instructions."
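The single-agent version fits in a short script. Here's a sketch that packs a module's source plus the five questions into one prompt; the file-sampling logic is a placeholder, and the actual LLM call is left to whatever client you use:

```python
from pathlib import Path

QUESTIONS = [
    "What does this module configure?",
    "What are common modification patterns?",
    "What non-obvious patterns cause build failures?",
    "What cross-module dependencies exist?",
    "What tribal knowledge is buried in code comments?",
]

def build_prompt(module_dir: str, max_files: int = 5) -> str:
    """Pack a module's source plus the five questions into one prompt."""
    sources = []
    for path in sorted(Path(module_dir).rglob("*.py"))[:max_files]:
        sources.append(f"# FILE: {path}\n{path.read_text()}")
    questions = "\n".join(f"{i}. {q}" for i, q in enumerate(QUESTIONS, 1))
    return ("You are documenting a data pipeline module. "
            "Answer each question concisely, citing file names.\n\n"
            + "\n\n".join(sources) + "\n\n" + questions)

# answer = llm_client.complete(build_prompt("pipelines/ingestion"))
# (llm_client is a stand-in for your API of choice)
```

Taking the first five files per module is crude; a smarter sampler would prioritize configs and entry points. But crude is the point: this runs today.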
The design constraint worth copying is "compass, not encyclopedia." Each context file is 25–35 lines. Not comprehensive documentation — just enough to orient someone (or something) about to touch that code. This is the right call. Thick documentation rots faster than thin documentation. A pointer to the three files you actually need to read beats a 50-page architecture doc that was accurate six months ago.
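To make the constraint concrete, here's what a compass-style context file might look like. Meta didn't publish a verbatim example, so every name and command below is invented; only the four-section structure and the 25–35-line budget come from the writeup:

```markdown
# transforms/ — module context

## Quick commands
- `make test-transforms`    <- run before any change

## Key files
- `sessionize.py` — emits the `_tmp_*` intermediate fields
- `schema.py`     — event ids are append-only; never delete
- `config.yaml`   — stage wiring; downstream repos read this

## Non-obvious patterns
- Fields prefixed `_tmp_` are renamed by the next stage. Do not rename.
- "Deprecated" event ids are load-bearing for historical replays.

## Cross-references
- `contexts/serving.md` — consumers of `session_id`
```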
Maybe the Fix Isn't Documentation
Meta's system auto-refreshes every few weeks, which handles staleness. But here's the structural question nobody asked in the blog post: if your pipelines contain 50+ undocumented patterns that trip up both humans and AI, is better documentation really the solution? Or should you make the pipelines less surprising?
Enforcing naming conventions. Making dependencies explicit in config rather than in convention. Adding contract tests at pipeline boundaries. These reduce the surface area of tribal knowledge in the first place. Documentation — whether human-written or AI-generated — is a patch. A useful patch, and sometimes the only practical option at Meta's scale. But a 10-person data team might be better off spending that energy making the code self-documenting instead of documenting the code.
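A contract test at a pipeline boundary can be as small as a schema assertion that fails at write time instead of three weeks later at replay time. A minimal sketch, with the field names invented:

```python
# Turn tribal knowledge into an explicit, enforced contract: assert
# the schema at the boundary between two stages. REQUIRED fields are
# hypothetical -- the technique is the point.

REQUIRED_OUTPUT_FIELDS = {"user_id", "session_id", "event_type"}

def check_stage_contract(rows: list[dict]) -> None:
    """Fail loudly at the boundary if a stage's output drifts."""
    for i, row in enumerate(rows):
        missing = REQUIRED_OUTPUT_FIELDS - row.keys()
        if missing:
            raise ValueError(
                f"row {i} missing contract fields: {sorted(missing)}")

# Call at the end of each stage:
# check_stage_contract(stage_output)
```

Once a check like this exists, the "hidden rename convention" stops being tribal knowledge: breaking it fails CI instead of producing silently wrong numbers.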
That said, I'm going to run an LLM against our pipeline repo this weekend with those five questions. I already know the answers will be embarrassing.