What Is Retrieval-Augmented Generation? A Practical RAG Definition

Direct answer

Retrieval-augmented generation, or RAG, is an architecture that retrieves relevant information from an external collection and supplies it as context to a generative model. It can add current or private knowledge without changing the model's base weights. Its quality depends on source freshness, access control, retrieval accuracy, prompt construction, and whether the answer remains supported by the retrieved evidence.

RAG combines retrieval with generation

A language model contains knowledge encoded in its trained parameters, but an application may need information that is newer, private, frequently updated, or specific to one organization. RAG keeps that information outside the model and retrieves selected evidence when a request arrives.

The original RAG research combined a pretrained generator with nonparametric memory accessed by a retriever. Modern product implementations use many variations, including keyword search, dense vector search, hybrid search, metadata filters, rerankers, and multiple retrieval passes. The stable idea is the same: find useful external material before generation and make it available in the model's context.

RAG is an application architecture, not a single model feature or database product.

The common RAG pipeline

A production RAG system usually has an indexing path and a query path.

Indexing path:

1. Collect documents from approved sources.
2. Parse and normalize their contents.
3. Split them into retrievable units or chunks.
4. Create metadata such as source, date, tenant, and access policy.
5. Create vector embeddings when dense retrieval is used.
6. Store searchable representations and source text.
7. Reindex content when documents change.

Query path:

1. Receive and normalize the user's question.
2. Apply authorization and metadata filters.
3. Create a query representation.
4. Retrieve candidate chunks.
5. Optionally rerank or deduplicate candidates.
6. Assemble selected evidence into a prompt.
7. Ask the language model to answer under explicit instructions.
8. Return the answer with source references when appropriate.
9. Log retrieval and generation outcomes for evaluation.

Each stage can fail independently. A fluent final answer does not prove that retrieval found the right evidence.
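The two paths can be sketched in a few lines. This is a minimal illustration, not a production design: it assumes paragraph-level chunks and a toy term-overlap scorer in place of a real search engine, and it stops at prompt assembly without calling a model. The names `Chunk`, `build_index`, `retrieve`, and `build_prompt` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # metadata kept with the chunk so answers can cite it


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {word.strip(".,?!:;").lower() for word in text.split()} - {""}


def build_index(documents: dict[str, str]) -> list[Chunk]:
    """Indexing path: split each document into paragraph chunks."""
    index = []
    for source, body in documents.items():
        for paragraph in body.split("\n\n"):
            if paragraph.strip():
                index.append(Chunk(text=paragraph.strip(), source=source))
    return index


def retrieve(index: list[Chunk], question: str, k: int = 2) -> list[Chunk]:
    """Query path: rank chunks by shared terms and keep the top k matches."""
    query_terms = tokenize(question)
    scored = [(len(query_terms & tokenize(c.text)), c) for c in index]
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in matches[:k]]


def build_prompt(question: str, evidence: list[Chunk]) -> str:
    """Assemble evidence and explicit instructions into one prompt string."""
    lines = [
        "Answer using only the sources below. Say so if they are insufficient.",
        "",
    ]
    for chunk in evidence:
        lines.append(f"[{chunk.source}] {chunk.text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)
```

Because each function is separate, the retrieved chunks can be inspected on their own before any answer is generated, which is how a retrieval miss gets noticed behind a fluent answer.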

RAG is not fine-tuning

Fine-tuning changes model parameters. RAG supplies external information during inference. They solve different problems.

Use RAG when knowledge changes frequently, must remain attributable to source documents, or is private to an application. Consider fine-tuning when the main requirement is behavior, style, format, or task specialization rather than access to changing facts.

The two methods can be combined. A fine-tuned model can still receive retrieved context, and a RAG system can use a base model without any additional training.

Why chunking matters

Retrieval typically operates on portions of documents rather than entire files. A chunk that is too large may include irrelevant material and consume excessive context. A chunk that is too small may lose the surrounding information needed to interpret it.

Chunking should follow document structure where possible. Headings, paragraphs, tables, code boundaries, and semantic sections are often better boundaries than a fixed character count. Overlap can preserve continuity, but it also increases index size and may return duplicate evidence.

There is no universal chunk size. Evaluate retrieval quality and generation quality on representative questions.
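As one illustration of structure-aware chunking, the sketch below packs whole paragraphs into chunks and repeats a configurable number of paragraphs between neighboring chunks. It assumes paragraphs are separated by blank lines and measures size in characters rather than tokens; `chunk_paragraphs`, `max_chars`, and `overlap` are hypothetical names, and the defaults are placeholders, not recommendations.

```python
def chunk_paragraphs(text: str, max_chars: int = 200, overlap: int = 1) -> list[str]:
    """Pack whole paragraphs into chunks, repeating `overlap` paragraphs between neighbors."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for paragraph in paragraphs:
        candidate = current + [paragraph]
        # Close the current chunk when adding this paragraph would exceed the budget.
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the last `overlap` paragraphs forward to preserve continuity.
            current = current[-overlap:] if overlap > 0 else []
        current.append(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The limit is soft: a single paragraph longer than `max_chars` is kept whole rather than cut mid-sentence, and carried-over paragraphs can push a chunk past the budget. With `overlap=1`, the last paragraph of one chunk reappears at the start of the next, which is the index-size and duplicate-evidence trade-off described above.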

Dense, sparse, and hybrid retrieval

Sparse retrieval uses lexical signals such as term occurrence. Dense retrieval maps content into vectors and retrieves nearby representations. Hybrid retrieval combines both approaches.

Dense retrieval can match related meaning even when wording differs. Sparse retrieval can be strong for exact names, identifiers, error codes, and uncommon terms. Hybrid systems attempt to preserve both advantages.
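One common way to combine a sparse ranking and a dense ranking is reciprocal rank fusion (RRF), which merges ranked lists by position rather than by raw score. The sketch below is an illustration of that technique, not something the article prescribes; the document IDs are made up, and `k = 60` is a commonly used default, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list adds 1 / (k + rank) to an item's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for the same query from two retrievers.
sparse_ranking = ["err-504", "timeouts", "retries"]   # exact-term matches first
dense_ranking = ["timeouts", "gateway", "err-504"]    # semantically close first

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)  # ['timeouts', 'err-504', 'gateway', 'retries']
```

Items that both retrievers rank highly rise to the top, while an item found by only one retriever still survives in the fused list. Working on ranks avoids having to normalize lexical scores and vector similarities onto one scale.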

A reranker can score a smaller candidate set with a more expensive model after initial retrieval. This can improve ordering, but it adds latency and cost.

The cost layers of RAG

RAG cost is more than the final language-model call. A complete cost model includes:

  • Document parsing and cleanup
  • Embedding generation during indexing
  • Re-embedding changed content
  • Vector or search storage
  • Query embeddings
  • Search reads
  • Reranking
  • Retrieved input tokens
  • Output tokens
  • Evaluation and observability
  • Data synchronization and access-control logic

The generation prompt can become the largest cost line when many chunks are inserted. Use the RAG cost calculator to keep embeddings, retrieval, storage, reranking, and generation separate.

For token-priced APIs, the retrieved text contributes to input usage. The context-window guide explains why capacity and cost must be modeled together.
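A short worked calculation shows how retrieved text can dominate the generation bill. Every number below is a hypothetical placeholder (query volume, chunk size, and per-million-token prices), not a quote from any provider, and the function covers only the generation layer, not embeddings, storage, or reranking.

```python
def monthly_generation_cost(
    queries: int,
    chunks_per_query: int,
    tokens_per_chunk: int,
    instruction_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> dict[str, float]:
    """Split monthly generation spend into retrieved input, other input, and output."""
    retrieved_tokens = queries * chunks_per_query * tokens_per_chunk
    other_input_tokens = queries * instruction_tokens
    total_output_tokens = queries * output_tokens
    return {
        "retrieved_input": retrieved_tokens / 1_000_000 * input_price_per_million,
        "other_input": other_input_tokens / 1_000_000 * input_price_per_million,
        "output": total_output_tokens / 1_000_000 * output_price_per_million,
    }


# Hypothetical workload and prices, chosen only to make the arithmetic easy to follow.
costs = monthly_generation_cost(
    queries=100_000,
    chunks_per_query=8,
    tokens_per_chunk=500,
    instruction_tokens=200,
    output_tokens=300,
    input_price_per_million=1.0,
    output_price_per_million=4.0,
)
print(costs)  # {'retrieved_input': 400.0, 'other_input': 20.0, 'output': 120.0}
```

Under these assumptions, retrieved chunks account for 400 of the 540 currency units, even though output tokens cost four times as much per token. Halving `chunks_per_query` would cut that line in half, which is why retrieval depth is a cost decision as well as a quality one.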

RAG does not guarantee factual answers

Retrieval can improve grounding, but several failure modes remain:

  • The source collection is incomplete or stale.
  • Parsing removed important structure.
  • The query did not express the user's real intent.
  • Retrieval selected irrelevant chunks.
  • Relevant chunks ranked below the cutoff.
  • Access filters removed necessary evidence.
  • The prompt mixed conflicting sources.
  • The model ignored or misread the evidence.
  • The answer claimed more than the sources supported.

A good RAG system can abstain, disclose missing evidence, and show source references. It should not convert retrieval confidence into unsupported certainty.

How to evaluate a RAG system

Evaluate retrieval and generation separately before combining them.

Retrieval evaluation asks whether relevant evidence appears in the candidate set and whether irrelevant material is controlled. Metrics may include recall at k, precision at k, ranking quality, and human relevance labels.
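Recall at k and precision at k can be computed directly from a ranked result list and a set of human relevance labels. The sketch below uses the standard definitions; the document IDs and labels are invented for illustration, and the function names are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k positions occupied by relevant documents."""
    if k <= 0:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


# Hypothetical ranked results and labels for one test question.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

print(recall_at_k(retrieved, relevant, k=3))     # 0.5
print(precision_at_k(retrieved, relevant, k=4))  # 0.5
```

Here one of the two relevant documents ranks fourth, so recall at 3 is 0.5 and rises to 1.0 at k = 4. That gap corresponds to the "relevant chunks ranked below the cutoff" failure mode.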

Generation evaluation asks whether the answer is correct, supported by the retrieved evidence, complete enough for the task, and appropriately cited.

End-to-end evaluation measures whether real users obtain useful answers within latency and cost targets.

Keep a versioned test set of questions, expected source documents, and answer criteria. Rerun it when chunking, embeddings, filters, reranking, prompts, or models change.

Access control belongs before retrieval output

A multi-tenant RAG system must enforce authorization at retrieval time. Filtering after generation is too late because unauthorized text may already have entered the model context or logs.

Store tenant and permission metadata with indexed content, apply filters before returning candidates, and test isolation explicitly. Treat retrieved documents as untrusted input that can contain misleading or malicious instructions. Application instructions should define how document content may be used.
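The ordering can be made concrete with a small sketch in which the authorization filter runs before any scoring, so unauthorized chunks never become candidates. It assumes a simple tenant-plus-roles permission model and a toy term-overlap scorer; `IndexedChunk`, `authorized_candidates`, and `retrieve_for_user` are hypothetical names, and a real system would push the filter into the search engine's own query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedChunk:
    text: str
    tenant: str                    # owning tenant, stored at indexing time
    allowed_roles: frozenset[str]  # roles permitted to read this chunk


def authorized_candidates(
    index: list[IndexedChunk], tenant: str, roles: set[str]
) -> list[IndexedChunk]:
    """Keep only chunks from the caller's tenant that share at least one role."""
    return [c for c in index if c.tenant == tenant and c.allowed_roles & roles]


def retrieve_for_user(
    index: list[IndexedChunk], question: str, tenant: str, roles: set[str], k: int = 3
) -> list[IndexedChunk]:
    """Filter first, then score, so unauthorized text never reaches the prompt."""
    candidates = authorized_candidates(index, tenant, roles)
    terms = set(question.lower().split())
    scored = [(len(terms & set(c.text.lower().split())), c) for c in candidates]
    ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

An isolation test can then assert that a query from one tenant never returns another tenant's chunk, and that a user without the required role gets an empty result instead of a redacted one.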

Frequently asked questions

Does RAG train the language model?

No. Standard RAG retrieves external information at inference time and supplies it as context. It does not update the base model's weights.

Does every RAG system need a vector database?

No. Retrieval can use lexical search, relational filters, graph queries, dense vectors, or a hybrid. The storage and search method should match the documents and queries.

Can RAG use current private data?

Yes, when the application indexes authorized private sources and enforces access controls. Freshness still depends on the synchronization and reindexing process.

Why can a RAG answer still be wrong?

The source may be wrong, retrieval may miss relevant evidence, or the model may misuse the evidence. RAG reduces some knowledge limitations but does not eliminate evaluation and verification.

Cite this page

What Is Retrieval-Augmented Generation? A Practical RAG Definition. ByteCosts. Updated 2026-06-21. https://bytecosts.com/blog/what-is-retrieval-augmented-generation/
