RAG Implementation: What the Demo Never Shows You

A RAG demo takes an afternoon. You load a PDF, split it every 1,000 characters, drop the chunks into a vector store, wire up an LLM, and ask it a question. It answers. Everyone in the room nods. Ship it, right?

Most projects quietly die in the gap between the notebook that works in the meeting and the system that has to work at 2 a.m. when a customer uploads a smudged scan of a 14-page invoice. We've lived in that gap. We've shipped retrieval-augmented generation (RAG) into two of our own products: the AI invoice-parsing engine inside Invoxbooks.com (built in Python), and the RAG features in Bar.Stream, our Java and Spring Boot hospitality automation platform. In Bar.Stream, RAG handles product size identification and product classification, turning inconsistent supplier product names into clean categories so bars and restaurants get complete, accurate inventory reporting. The first Invoxbooks version looked great in a demo and fell apart the moment real plumbing-supply invoices started arriving. Multi-vendor line items. Inconsistent column headers. Totals that didn't reconcile. The occasional fax-quality JPEG. That platform now saves the client roughly $200K a month, and none of that came from the demo. It came from the boring six weeks after.

[Figure: Production RAG pipeline architecture showing document ingestion, chunking, embeddings, a vector database, retrieval with reranking, and LLM generation with an evaluation loop.]
A production RAG pipeline has two halves: an offline indexing path and an online query path. The demo only ever shows you the happy middle.

Here's what a serious RAG implementation actually demands, and where we've watched teams get it wrong. An earlier version of ourselves included.

RAG vs fine-tuning: why we reach for retrieval first

The first question clients ask is usually the RAG vs fine-tuning one: should the model learn our data or look it up? For most business problems we reach for retrieval first. Fine-tuning bakes knowledge into weights, so every time a price list or a policy changes you're retraining. RAG keeps your knowledge in a vector database you can update in seconds, gives you citations back to the source, and costs a fraction to run. Fine-tuning earns its place for changing behavior and tone, not for facts that move. The two aren't rivals, but if your goal is "answer from our documents, accurately, and tell me where it got that," a RAG pipeline is the right tool.

Your chunking strategy is a product decision, not a default

The most common mistake is treating chunking as plumbing. People reach for the default RecursiveCharacterTextSplitter at 1,000 characters with 200 overlap and never look back. For an invoice, that's a disaster. A fixed-size splitter will happily cut a line item in half, separating the part number from its price, or strand a table header three chunks away from the rows it describes.

What worked for invoices was structure-aware chunking. We parse the document into logical regions first (header block, vendor block, line-item table, totals) and chunk along those seams instead of by character count. A line item stays whole. The vendor's name and address travel together. For the tables, we kept each row as its own retrievable unit with the column headers re-attached as a prefix, so a chunk reads like "Item: 1/2 inch copper elbow | Qty: 50 | Unit: $1.20 | Total: $60.00" instead of a naked row that means nothing without its header three chunks up.
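The header-prefix idea is easy to sketch. The snippet below is a minimal illustration, not the production parser: it assumes the table has already been extracted into a list of headers and a list of rows, and the function name `rows_to_chunks` and the second sample row are made up for the example.

```python
from typing import List


def rows_to_chunks(headers: List[str], rows: List[List[str]]) -> List[str]:
    """Turn each table row into a standalone chunk with its headers attached."""
    chunks = []
    for row in rows:
        if len(row) != len(headers):
            # Malformed row: skipped here; a real pipeline would flag it for review.
            continue
        chunks.append(" | ".join(f"{h}: {v}" for h, v in zip(headers, row)))
    return chunks


# Illustrative data only; the second row is invented for the example.
headers = ["Item", "Qty", "Unit", "Total"]
rows = [
    ["1/2 inch copper elbow", "50", "$1.20", "$60.00"],
    ["3/4 inch ball valve", "10", "$8.50", "$85.00"],
]
line_item_chunks = rows_to_chunks(headers, rows)
for chunk in line_item_chunks:
    print(chunk)
```

Each resulting string carries its own column names, so it still makes sense when it is retrieved on its own, far from the table it came from.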

The lesson generalizes. Your chunking strategy is a tradeoff between retrieval precision and context completeness, and the right answer depends entirely on your document type. You have to look at your actual corpus first. Not a sample. The real, messy thing. Legal contracts? Semantic chunking on clause boundaries. Support tickets? One ticket per chunk. There's no universal number, and anyone who hands you one hasn't looked at your data.

Embeddings, vector databases, and the retrieval you can't see

Pick an embedding model and you've made a quiet, expensive commitment, because switching later means re-embedding everything. For Invoxbooks we benchmarked a few options on our own labeled query set rather than trusting the MTEB leaderboard. Leaderboard performance on generic text says little about how a model handles part numbers, SKUs, and abbreviations that look like noise to something trained mostly on Wikipedia.

Here's a gotcha most tutorials skip: pure vector search is bad at exact-match retrieval. Ask "what's the total for invoice #INV-2024-8841" and a dense embedding model will cheerfully return semantically similar invoices while missing the one with that exact ID. Identifiers don't embed meaningfully. The fix is hybrid search. Run BM25 lexical search alongside dense vector search and fuse the results; we used reciprocal rank fusion to combine them. That one change did more for retrieval quality on identifier-heavy queries than any amount of embedding-model shopping.
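Reciprocal rank fusion itself is only a few lines. Here is a minimal sketch, assuming each retriever returns an ordered list of chunk IDs; the IDs are illustrative, and `k = 60` is the commonly used smoothing constant rather than a value taken from the Invoxbooks setup.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Illustrative results: BM25 puts the exact-ID match first, dense search does not.
bm25_hits = ["inv-8841", "inv-8814"]
dense_hits = ["inv-1190", "inv-8841", "inv-8814"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because the fusion works on ranks rather than raw scores, there is no need to normalize BM25 scores against cosine similarities, which is a large part of why it is a convenient default.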

For most clients we run this on a managed vector store rather than self-hosting: pgvector when a client is already on Postgres, Pinecone or Qdrant when they're not. The operational cost of tuning HNSW parameters and babysitting index rebuilds isn't where you want your engineers early on. That's a tradeoff, not a rule. At very large scale the math flips.

Reranking is the cheap win teams skip

Here's a thing we were slow to adopt and now consider close to mandatory: a reranker.

Your retriever's job is recall. It pulls back the 20 or 50 candidate chunks that might be relevant, tuned to not miss things, which means it returns a lot of near-misses. Stuff all 50 into the prompt and you pay for the tokens, you blow past the useful context window, and you bury the one good chunk in noise. The model then answers from the noise.

A cross-encoder reranker scores each candidate against the query directly, and you keep the top 3 to 5. (We've used Cohere's Rerank and bge-reranker, depending on whether the client wanted everything self-hosted.) It's a second model and a few hundred milliseconds of latency. In exchange, answer quality jumps and hallucinations drop, and your generation cost goes down because you're sending fewer, better tokens. It's the highest return-on-effort step in most RAG implementation work, and it's the one people leave out because the demo "worked fine" without it.
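The shape of the rerank step can be sketched without committing to a particular model. In the snippet below, `rerank` takes any scoring function; `token_overlap` is a deliberately crude stand-in so the example runs on its own, not a cross-encoder. In a real system `score_fn` would wrap a cross-encoder model or a hosted rerank API. All names and sample chunks are hypothetical.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the top_k best."""
    scored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def token_overlap(query: str, chunk: str) -> float:
    """Toy stand-in scorer: fraction of query tokens that appear in the chunk."""
    query_tokens = set(query.lower().split())
    chunk_tokens = set(chunk.lower().split())
    return len(query_tokens & chunk_tokens) / len(query_tokens) if query_tokens else 0.0


# Illustrative candidates, as a recall-tuned retriever might return them.
candidates = [
    "Vendor: Acme Plumbing Supply | Address: 12 Main St",
    "Item: 1/2 inch copper elbow | Qty: 50 | Unit: $1.20 | Total: $60.00",
    "Payment terms: net 30",
]
top = rerank("copper elbow unit price", candidates, token_overlap, top_k=2)
for chunk, score in top:
    print(f"{score:.2f}  {chunk}")
```

Keeping the scorer behind a plain function signature also makes it easy to swap a hosted reranker for a self-hosted one without touching the rest of the query path.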

You can't improve what you don't measure: RAG evaluation

This is the wall almost every team hits. The demo has no test set. So when you change the chunk size or swap the embedding model, you judge the result by asking it three questions you happen to remember and eyeballing the answers. That's not engineering. That's vibes.

We built a RAG evaluation harness before we built the second version. A few hundred real queries paired with known-correct answers, scored on three things: did retrieval surface the right chunk (Recall@k, MRR), did the generated answer match the ground truth, and was the answer actually grounded in the retrieved context or invented. We leaned on Ragas for the retrieval and faithfulness metrics, plus an LLM-as-judge for answer correctness, with a human spot-checking the judge because LLM judges carry their own biases.
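The retrieval half of such a harness needs very little code. The following is a minimal sketch of Recall@k and MRR, assuming each test query has a ranked list of retrieved chunk IDs and a set of known-relevant chunk IDs; the IDs and the three sample runs are invented for illustration, and a library like Ragas would cover the generation-side metrics.

```python
from typing import List, Set, Tuple


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Illustrative eval set: (ranked retrieved IDs, known-relevant IDs) per query.
eval_runs: List[Tuple[List[str], Set[str]]] = [
    (["c7", "c2", "c9"], {"c2"}),  # first relevant hit at rank 2
    (["c1", "c4", "c5"], {"c1"}),  # first relevant hit at rank 1
    (["c3", "c8", "c6"], {"c0"}),  # relevant chunk never retrieved
]

mean_recall_at_3 = sum(recall_at_k(r, rel, 3) for r, rel in eval_runs) / len(eval_runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_runs) / len(eval_runs)
print(f"Recall@3 = {mean_recall_at_3:.3f}, MRR = {mrr:.3f}")
```

Run the same script before and after a chunking or embedding change and the argument about whether retrieval got better becomes a diff between two numbers.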

The payoff is that every change becomes a number. When we moved to structure-aware chunking, faithfulness went up by a measurable amount, and we could prove it instead of arguing about it. Without that harness you're flying blind, and "it feels better" is how regressions ship.

Hallucination control is mostly grounding discipline

For an invoice pipeline that drives $200K in monthly savings through automated vendor comparison and order management, a confidently wrong total isn't a cute mistake. It's money.

Three things move the needle, in our experience. First, the prompt has to instruct the model to answer only from the provided context and to say "not found" when the context doesn't contain the answer. Then you have to actually test that it complies, because models love to fill gaps. Second, citations: we make the model return the chunk ID it pulled each value from, so a number can be traced back to a line in the source document. Errors become auditable instead of mysterious. Third, for structured extraction we don't trust free-text output at all. We constrain the model to a strict JSON schema and run deterministic validation on top. Does the sum of the line items equal the stated total? If not, flag for human review rather than auto-approving.

That last check is non-negotiable for financial data. The model is a first pass, not the final authority.
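Here is a minimal sketch of what that deterministic layer can look like. The JSON shape (`line_items`, `total`, `amount`, `source_chunk`) is a hypothetical schema chosen for the example, not the actual Invoxbooks format. The function checks three things: the output parses into the expected shape, every cited chunk ID was actually retrieved, and the line items sum to the stated total.

```python
import json
from decimal import Decimal, InvalidOperation
from typing import List, Set


def validate_extraction(raw_json: str, retrieved_ids: Set[str]) -> List[str]:
    """Return a list of problems; an empty list means the extraction passed."""
    try:
        data = json.loads(raw_json)
        items = data["line_items"]
        total = data["total"]
        # Decimal avoids float rounding errors on currency amounts.
        amounts = [Decimal(item["amount"]) for item in items]
        stated_total = Decimal(total["amount"])
        cited = {item["source_chunk"] for item in items} | {total["source_chunk"]}
    except (json.JSONDecodeError, KeyError, TypeError, InvalidOperation):
        return ["output does not match the expected schema"]

    problems = []
    unknown = sorted(str(chunk_id) for chunk_id in cited - retrieved_ids)
    if unknown:
        problems.append(f"cites chunks that were not retrieved: {unknown}")
    computed = sum(amounts, Decimal("0"))
    if computed != stated_total:
        problems.append(f"line items sum to {computed}, stated total is {stated_total}")
    return problems


# Illustrative model output: the stated total does not match the line items.
model_output = json.dumps({
    "line_items": [
        {"amount": "60.00", "source_chunk": "c12"},
        {"amount": "85.00", "source_chunk": "c13"},
    ],
    "total": {"amount": "150.00", "source_chunk": "c20"},
})
issues = validate_extraction(model_output, retrieved_ids={"c12", "c13", "c20"})
needs_human_review = bool(issues)
print(issues)
```

Anything that comes back with a non-empty problem list goes to a person instead of being auto-approved, which keeps the model in the first-pass role.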

What RAG implementation actually costs

RAG cost surprises people because the demo is free-tier cheap and production is not. The drivers, roughly in order of how often they bite:

Embedding the corpus, especially if you re-embed on every model change or re-chunk. Budget for re-embedding. It will happen more than once.
Per-query generation tokens, which balloon when you skip reranking and shove 50 chunks into every prompt.
Re-indexing and storage as the corpus grows.

We cut Invoxbooks' per-query cost meaningfully by reranking down to five chunks and caching embeddings for repeated documents. Caching matters more than people expect, because in the real world the same vendor sends the same invoice template every single month.
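Embedding caching can be as simple as keying on a hash of the text. The sketch below is an in-memory illustration under stated assumptions: `EmbeddingCache`, `fake_embed`, and the model name are hypothetical, and a production version would more likely back the store with a database or key-value service. The model name is part of the key so that switching embedding models never serves stale vectors.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Cache embeddings by content hash so identical text is embedded only once."""

    def __init__(self, embed_fn: Callable[[str], List[float]], model_name: str):
        self._embed_fn = embed_fn
        self._model_name = model_name
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the real embedding call was made

    def _key(self, text: str) -> str:
        # Include the model name so a model change invalidates old entries.
        payload = f"{self._model_name}\n{text}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.misses += 1
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    """Stand-in for a real embedding call, so the example runs offline."""
    return [float(len(text))]


cache = EmbeddingCache(fake_embed, model_name="example-model-v1")
template_chunk = "Vendor: Acme Plumbing Supply | Terms: net 30"
first = cache.get(template_chunk)   # computed
second = cache.get(template_chunk)  # served from the cache
print(cache.misses)
```

When the same vendor template shows up every month, most of its boilerplate chunks hash to keys that are already in the store, and only the chunks that changed pay for a new embedding call.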

If you're weighing whether to build this in-house or bring in people who've shipped production RAG before, that's a real decision with real tradeoffs, and it's the kind of work our AI and RAG development team does day to day. The demo will always be easy. The harness, the reranker, the validation layer, the cost tuning: that's the part that decides whether the thing survives contact with actual users.

Build the eval set first. Everything else gets easier once you can measure it.