Enterprise adoption of generative AI hit a new threshold in 2026: Gartner's survey shows 71% of large enterprises now run at least one LLM-based application in production. Most of these applications are not a bare model trained on the public internet; under the hood they rely on a Retrieval Augmented Generation (RAG) architecture. RAG grounds the LLM in the organisation's own knowledge base, reducing hallucination and giving every answer a traceable source.
The Three Layers of a RAG System
A RAG system runs as three sequential layers (a minimal end-to-end sketch follows the list):
- Content preparation: enterprise documents, product catalogues, ticket histories and policy PDFs are split into chunks and converted into vector representations through an embedding model.
- Vector retrieval: the user query is also embedded and the closest matches are pulled from a vector database (Pinecone, Weaviate, pgvector, Azure AI Search).
- Answer generation: the LLM receives both the question and the most relevant chunks as context, and produces an answer grounded in that context.
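
A minimal sketch of the three layers in Python. The hash-based `embed` stub and the sample chunks are placeholders for illustration only: in production the embedding call would go to a real model such as text-embedding-3-large, the index would live in a vector database, and the final prompt would be sent to the LLM rather than printed.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy stand-in for a real embedding model (e.g. text-embedding-3-large):
    # hash words into a fixed-size vector, then L2-normalise it.
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Layer 1 - content preparation: chunk and embed documents once, offline.
chunks = [
    "Refunds are processed within 14 days of a return request.",
    "Enterprise support tickets are answered within 4 business hours.",
]
index = np.stack([embed(c) for c in chunks])

# Layer 2 - vector retrieval: embed the query, rank chunks by cosine similarity.
query = "How long do refunds take?"
scores = index @ embed(query)
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:1]]

# Layer 3 - answer generation: ground the LLM in the retrieved context.
prompt = (
    "Answer using only the context below.\n\n"
    "Context:\n" + "\n".join(top_chunks) + "\n\n"
    "Question: " + query
)
print(prompt)  # in production this prompt goes to the LLM
```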
Four Common Mistakes
Four traps recur in enterprise RAG projects:
- Wrong chunk size: 200-token chunks are too short and lose meaning; 2000-token chunks are too long and pollute the context window with irrelevant text. Chunks of 500-800 tokens that respect sentence boundaries give the healthiest results (see the chunker sketch after this list).
- Trusting a single embedding model: OpenAI's text-embedding-3-large works well for general content, but heavily domain-specific terminology benefits dramatically from a fine-tuned embedding model.
- No post-retrieval processing: raw vector search alone is not enough; without a reranker, semantic deduplication and metadata filtering, answer consistency drops fast (a post-processing sketch follows this list).
- No citations: even when the answer is correct, users do not trust it without seeing the source. Attaching a source reference to each statement doubles acceptance rates in our experience.
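
A sentence-boundary-aware chunker along the lines of the first point. The whitespace word count is a rough stand-in for a real tokenizer such as tiktoken, and `chunk_text` is an illustrative name, not a library API.

```python
import re

def chunk_text(text: str, max_tokens: int = 800) -> list[str]:
    # Split on sentence boundaries, then pack sentences into chunks that
    # stay under max_tokens, so no chunk is cut mid-sentence. Token counts
    # are approximated by whitespace word counts; swap in a real tokenizer
    # (e.g. tiktoken) for production.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            # Close the current chunk at a sentence boundary.
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```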
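
And for the post-retrieval point, a sketch of the processing chain: metadata filtering, then semantic deduplication, then truncation where a cross-encoder reranker would slot in. The hit dictionary layout and the `current` flag are assumptions for illustration; vectors are assumed L2-normalised so a dot product equals cosine similarity.

```python
import numpy as np

def postprocess(hits: list[dict], top_k: int = 5,
                dedup_threshold: float = 0.95) -> list[dict]:
    # hits: raw vector-search results, assumed shape (illustrative):
    # {"text": str, "vec": np.ndarray (L2-normalised), "score": float,
    #  "current": bool}.
    # Step 1 - metadata filter: drop chunks not flagged as the current version.
    hits = [h for h in hits if h.get("current", True)]
    # Step 2 - semantic deduplication: skip a hit whose vector is nearly
    # identical (cosine similarity above the threshold) to one already kept.
    kept: list[dict] = []
    for h in sorted(hits, key=lambda h: h["score"], reverse=True):
        if all(float(np.dot(h["vec"], k["vec"])) < dedup_threshold for k in kept):
            kept.append(h)
    # Step 3 - truncate; a cross-encoder reranker would re-score kept hits here.
    return kept[:top_k]
```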
Data Governance Is a Prerequisite
RAG quality cannot exceed the quality of the data feeding it. If three different versions of the same policy document live on the shared drive, the model will produce three different answers. RAG projects must establish a single authoritative version of each source before launch, and pull master data management (MDM) and data governance into the loop.
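
One way to enforce this at ingest time, sketched under the assumption that each document record carries a `doc_id` and a comparable `version` field: keep only the latest version of each document before anything is chunked or embedded.

```python
def authoritative_only(docs: list[dict]) -> list[dict]:
    # docs: [{"doc_id": str, "version": int, "text": str}, ...] (assumed schema).
    # Keep only the highest version of each document so the vector index
    # never holds competing copies of the same policy.
    latest: dict[str, dict] = {}
    for doc in docs:
        held = latest.get(doc["doc_id"])
        if held is None or doc["version"] > held["version"]:
            latest[doc["doc_id"]] = doc
    return list(latest.values())
```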
Cost and Performance Balance
A GPT-4-class model can cost 5-10 cents per query. In high-volume internal scenarios (10K+ queries per day), a hybrid architecture that routes simple questions to a small model (Llama 3.1 8B, Mistral) and complex ones to a large model cuts total cost by 4-5x. A routing sketch follows.
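
A minimal router sketch with assumed model identifiers and a crude keyword-and-length heuristic; production routers usually replace the heuristic with a cheap classifier model.

```python
SMALL_MODEL = "llama-3.1-8b"  # assumed identifier for the cheap model
LARGE_MODEL = "gpt-4-class"   # assumed identifier for the expensive model

def pick_model(query: str) -> str:
    # Crude heuristic: long or analytical questions go to the large model,
    # short factual lookups to the small one.
    complex_markers = ("compare", "why", "explain", "summarise", "analyse")
    if len(query.split()) > 25 or any(m in query.lower() for m in complex_markers):
        return LARGE_MODEL
    return SMALL_MODEL

print(pick_model("How long do refunds take?"))           # -> llama-3.1-8b
print(pick_model("Compare our refund and return SLAs"))  # -> gpt-4-class
```

The arithmetic behind the saving: if, say, 80% of traffic goes to a small model at roughly a tenth of the per-query cost, total spend falls to about 0.8 × 0.1 + 0.2 × 1 ≈ 0.28 of the all-large-model baseline, consistent with the 4-5x figure.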
