
Introduction
For years, machine learning pipelines followed a predictable rhythm: train a model offline, deploy it, and expose an inference endpoint. Retrieval-Augmented Generation (RAG) has broken that rhythm.
RAG introduces a dynamic retrieval layer that feeds live, domain-specific knowledge into Large Language Models (LLMs). The promise is context-aware answers grounded in real data. But for infrastructure teams, the reality is a new set of operational headaches: keeping systems available, fast, and affordable.
My perspective on this comes from two very different worlds. As a software engineer at Meta, I worked on the Shops logging infrastructure – enforcing rigid data structures on massive streams of user events – and later on high-traffic experimentation platforms for product images. Today, I apply those hyperscale lessons to building RAG systems in the startup world.
This article explores what “boring” infrastructure engineering teaches us about deploying, scaling, and future-proofing RAG.
RAG vs. Traditional ML: The Observability Gap
Traditional ML pipelines are relatively static. Data scientists train models in batch jobs, hand them to engineering, and inference services run behind an API.
RAG inverts this. The LLM is just one component in a real-time orchestration layer involving vector databases, embedding jobs, and retrieval logic. The biggest failure point I see isn’t the model – it’s the data structure.
The Lesson from Meta’s Shops Logging:
When I worked on the logging framework for Meta Shops, our core mandate was “opinionated structure.” We didn’t just let teams log whatever they wanted; we enforced a rigid schema to ensure that end-to-end user flows could be analyzed reliably.
RAG demands this same discipline.
- The Mistake: Dumping raw PDFs into a vector store and hoping for the best.
- The Fix: Enforcing metadata structures at ingestion. Just as we required specific fields for Shop events to ensure downstream data integrity, RAG ingestion pipelines must enforce metadata (timestamp, author, domain) to enable filtering before the vector search happens. If you can’t trace a chunk back to its source, your “intelligent” system is just hallucinating with confidence.
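A minimal sketch of what ingestion-time enforcement could look like, in Python. The `Chunk` fields, `validate_chunk`, and `prefilter` are hypothetical names chosen for illustration, not any particular vector database's API; the point is only that a record without traceable metadata is rejected before it is ever embedded.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Chunk:
    """One retrievable unit of text plus the metadata required at ingestion."""
    text: str
    source_uri: str      # where the chunk came from, for traceability
    author: str
    domain: str
    timestamp: datetime


# Assumed required fields; extend to match your own schema.
REQUIRED_FIELDS = ("source_uri", "author", "domain", "timestamp")


def validate_chunk(raw: dict) -> Chunk:
    """Reject a record at ingestion if text or required metadata is missing."""
    missing = [field for field in REQUIRED_FIELDS if not raw.get(field)]
    if missing:
        raise ValueError(f"chunk rejected, missing metadata: {missing}")
    if not str(raw.get("text", "")).strip():
        raise ValueError("chunk rejected, empty text")
    return Chunk(
        text=raw["text"],
        source_uri=raw["source_uri"],
        author=raw["author"],
        domain=raw["domain"],
        timestamp=raw["timestamp"],
    )


def prefilter(chunks, domain, newer_than):
    """Narrow the candidate set by metadata before any vector search runs."""
    return [c for c in chunks if c.domain == domain and c.timestamp >= newer_than]
```

In a real deployment the filter would usually be pushed down into the vector store's own metadata filtering rather than run in application code; the in-memory list here only keeps the sketch self-contained.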
The Economics of RAG: Latency is the Enemy
In a RAG system, queries traverse multiple hops. Retrieval latency eats into your generation latency budget.
This reminds me of my time working on Product Image Labels at Meta. We were serving dynamic overlays (like “Bestseller” or “Free Shipping”) on millions of images. We had to ideate and experiment on these labels quickly, and serve them with minimal added latency.
The constraints we faced there are identical to RAG:
- Latency is cumulative: Every millisecond spent fetching a label (or a vector chunk) is a millisecond the user waits.
- Caching is mandatory: We couldn’t re-compute labels for every view. Similarly, in RAG, you cannot afford to run GPT-4 on every query.
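To make the cumulative-latency point concrete, here is a small Python sketch of a latency budget. Every number in it is a made-up placeholder, not a measurement; the hop names and the `remaining_budget_ms` and `over_budget` helpers are assumptions for illustration.

```python
# Hypothetical per-hop latencies in milliseconds; replace with measured p95 values.
HOPS_MS = {
    "embed_query": 30,
    "vector_search": 50,
    "rerank": 80,
    "llm_generation": 900,
}


def remaining_budget_ms(total_budget_ms: int, hops: dict, llm_hop: str = "llm_generation") -> int:
    """Return what is left for generation once every non-LLM hop is paid for."""
    retrieval_ms = sum(ms for name, ms in hops.items() if name != llm_hop)
    return total_budget_ms - retrieval_ms


def over_budget(total_budget_ms: int, hops: dict) -> bool:
    """True when the summed hops exceed the end-to-end budget."""
    return sum(hops.values()) > total_budget_ms
```

With these placeholder values and a 1000 ms end-to-end budget, retrieval consumes 160 ms, leaving 840 ms for generation, so a 900 ms generation step puts the request over budget.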
Where the Money Actually Goes in RAG:
A typical cost breakdown shows a clear hierarchy of expense, with LLM inference usually the largest line item (often >60% of the budget).
To control these costs, apply the same caching rigor used in high-traffic experimentation:
- Semantic Caching: Don’t pay for the same answer twice. By caching responses and reusing them when a new query’s embedding has high cosine similarity (e.g., >0.85) to an already-answered query, you can bypass the LLM entirely for common questions.
- Lightweight Reranking: Instead of sending twenty documents to a premium model, use a cheaper, smaller model to score and filter them first.
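The semantic caching idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a production design: `SemanticCache` and `answer` are hypothetical names, embeddings are plain lists of floats supplied by the caller, and the lookup is a linear scan where a real system would use a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Stores (query embedding, response) pairs and serves near-duplicate queries."""

    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response) tuples

    def get(self, query_embedding):
        """Return the response of the most similar cached query, or None on a miss."""
        best_score, best_response = 0.0, None
        for embedding, response in self._entries:
            score = cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def put(self, query_embedding, response):
        self._entries.append((list(query_embedding), response))


def answer(query_embedding, cache, call_llm):
    """Serve from the cache when possible; otherwise call the LLM and store the result."""
    cached = cache.get(query_embedding)
    if cached is not None:
        return cached
    response = call_llm()
    cache.put(query_embedding, response)
    return response
```

The threshold is the main tuning knob: set it too low and users receive answers to a different question, set it too high and the cache rarely hits. Cached entries also need an expiry policy tied to when the underlying documents change, which this sketch omits.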
Navigating the Tooling Landscape
The ecosystem is crowded, but for infrastructure engineers, vendor categories dictate the architecture.
- Vector Database Vendors (Pinecone, Weaviate, Milvus): Excel at retrieval speed but lack orchestration. You will need to build your own reranking and prompt logic.
- Orchestration Frameworks (LangChain, LlamaIndex): Great for prototyping but require significant “hardening” for production (error handling, retries, observability).
- End-to-End Platforms (Amazon Kendra, Vertex AI Search): The “easy button.” They handle standard use cases well but often fail when you need custom embedding models or complex access control.
A Decision Framework:
- Standard Enterprise Search? -> Go End-to-End Platform.
- Custom ML / Specialized Data? -> Build custom orchestration + Managed Vector DB.
- Extreme Scale? -> Full in-house build (the Meta approach).
The Hidden Complexity: Multi-Modal Data
Text is just the beginning. RAG is increasingly ingesting product images, medical scans, and audio.
During my work on product images at Meta, we treated image data and metadata (labels) as distinct but coordinated entities. In distributed RAG systems, this split often causes consistency issues.
Distributed Consistency Risks:
Updating a document that contains both text and images requires coordinated writes across different storage backends (Vector DB for text, S3 for images).
- The Risk: If the image upload fails but the text index updates, the two backends disagree about the document and retrieval serves an inconsistent state.
- The Stack: At enterprise scale, we used coordination services (like ZooKeeper) to manage state. At startup scale, Redis with Lua scripts is often sufficient: a Lua script executes atomically on the Redis server, which is enough to implement a distributed lock that serializes writers across both backends.
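A lock only stops two writers from interleaving; it does not undo a half-finished write. As a complement, here is a Python sketch of a different and simpler technique: write ordering plus compensation. It is not the ZooKeeper or Redis approach named above. `InMemoryStore` is a hypothetical stand-in for both the image store and the vector index, and `update_document` is an illustrative name.

```python
class InMemoryStore:
    """Hypothetical stand-in for a blob store or a vector index."""

    def __init__(self, fail_on_put=False):
        self.data = {}
        self.fail_on_put = fail_on_put  # flip to simulate a backend outage

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        if self.fail_on_put:
            raise IOError(f"write failed for {key}")
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)


def update_document(doc_id, text, image_bytes, image_store, text_index):
    """Write the image first, then the text index; undo the image if the index write fails.

    The text index is what makes a document visible to retrieval, so it is
    written last: a failed image upload leaves the old index entry untouched.
    """
    previous_image = image_store.get(doc_id)
    image_store.put(doc_id, image_bytes)      # step 1: may raise, nothing to undo yet
    try:
        text_index.put(doc_id, text)          # step 2: makes the update visible
    except Exception:
        # Compensate: restore the prior image so both stores agree again.
        if previous_image is None:
            image_store.delete(doc_id)
        else:
            image_store.put(doc_id, previous_image)
        raise
```

The sketch assumes a single writer per document and a compensation step that itself succeeds. Concurrent writers still need the locking described above, and a failed rollback needs a repair job or a retry queue.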
Future-Proofing Your Architecture
RAG is not the final state of AI architecture. Long-context models (1M+ tokens) are shifting the problem from “retrieval” to “filtering.”
Infrastructure teams must prepare for:
- Specialized Retrieval Models: Generic embeddings are being replaced by domain-specific models (legal, code, medical).
- Hybrid Search: Combining vector similarity with keyword search (BM25) and Knowledge Graphs.
- Agentic Patterns: Agents that perform “search-synthesize-search” loops can triple your latency and cost. Strict timeout thresholds are mandatory.
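For the agentic case, a strict threshold can be as simple as a round cap plus a wall-clock deadline around the loop. The Python sketch below assumes hypothetical `search` and `synthesize` callables supplied by the caller; `run_agent` and its parameters are illustrative names, and the default limits are placeholders rather than recommendations.

```python
import time


def run_agent(question, search, synthesize, max_rounds=3, deadline_s=5.0,
              clock=time.monotonic):
    """Run search-synthesize rounds until the answer is final, the round cap
    is reached, or the deadline has passed. Returns the best draft so far."""
    start = clock()
    context, draft = [], None
    for _ in range(max_rounds):
        if clock() - start >= deadline_s:
            break  # out of time: stop looping and return what we have
        context.extend(search(question, draft))
        draft, done = synthesize(question, context)
        if done:
            break
    return draft
```

The deadline is only checked between rounds, so a single slow call can still overrun it. Per-request timeouts on the underlying search and model clients are still needed to bound each individual hop.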
Conclusion
Deploying RAG is ultimately an infrastructure challenge, not just an AI one. Success depends on boring operational fundamentals: observability, cost attribution, and system resilience.
Whether you are enforcing logging schemas at Meta or integrating sensitive private data at a seed-stage startup, the lesson is the same: Structure saves you. Unstructured data leads to unstructured answers. The most successful RAG deployments are the ones that treat data quality as an infrastructure problem, not just an ML one.



