CTO AI Corner: Don’t set up a RAG pipeline unless you really need to

Retrieval-Augmented Generation (RAG) has become the default choice when organizations want to ground language model responses in internal documentation. That makes sense when there is a lot of content to sift through. But in many real-world use cases, the source material amounts to maybe a dozen documents. At that scale, RAG can be overkill, and it may not even be the most effective solution.

There are several challenges that come with RAG. It adds latency, since each user query needs to be embedded and matched with relevant context via a retrieval step. The quality of the results heavily depends on how documents are chunked and whether important information is included in the retrieved context. If the chunking or overlap settings aren't optimal, you might miss the key information altogether. On top of that, managing vector databases, embeddings, and pipelines adds complexity to your stack.
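
To make the chunking risk concrete, here is a minimal sketch of the retrieval step. Everything is illustrative rather than a specific library’s API: embed() is a toy stub standing in for a real embedding model, and the chunking parameters are arbitrary.

```python
import math

def embed(text: str) -> list[float]:
    # Toy stand-in: a real system would call an embedding model here.
    vec = [0.0] * 64
    for i, ch in enumerate(text.lower()):
        vec[i % 64] += ord(ch) / 1000.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    # If size/overlap are tuned badly, an answer can be split across
    # chunks and never surface in one retrieved piece.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

documents = ["...your internal docs..."]  # placeholder corpus
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

def retrieve(query: str, k: int = 3) -> list[str]:
    # Embed the query, rank all chunks by similarity, return the top k.
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [c for c, _ in ranked[:k]]
```

Every step here is a place where things can go wrong: the embedding, the chunk boundaries, and the choice of k all affect whether the answer actually makes it into the prompt.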

Alternatives to RAG

So what’s the alternative when the data size is small enough? One approach is Cache-Augmented Generation (CAG). If your background information and the user’s question all fit within the model’s context window, you can simply include the entire relevant content directly in the prompt. There’s no need for separate retrieval logic, embedding pipelines, or external storage. The model sees all the information at once, which removes the risk of missing something important due to a retrieval miss.
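
As a sketch, assuming the whole corpus fits in the context window, CAG can be as simple as string concatenation. The document names and call_model() are hypothetical placeholders:

```python
# Minimal sketch of Cache-Augmented Generation: no retrieval step,
# just every document included in the prompt up front.
documents = {
    "onboarding.md": "...",        # hypothetical file names and contents
    "security-policy.md": "...",
    "faq.md": "...",
}

def build_prompt(question: str) -> str:
    # Concatenate all documents into one context block.
    context = "\n\n".join(
        f"## {name}\n{text}" for name, text in documents.items()
    )
    return (
        "Answer the question using only the documents below.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

# call_model() stands in for whatever completion API you use:
# response = call_model(build_prompt("How do I request VPN access?"))
```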

Of course, this approach isn’t perfect either.

With API-based models, you're billed for input tokens as well, so including everything in the prompt can cost more than a retrieval system that adds only what's necessary. Depending on your use case, though, the savings in infrastructure and engineering effort may outweigh the extra token cost.
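
A back-of-envelope comparison makes the tradeoff easy to estimate. All numbers below are hypothetical placeholders; substitute your model’s actual pricing and your own corpus size:

```python
# Hypothetical input-token pricing and volumes; substitute your own.
PRICE_PER_1M_INPUT_TOKENS = 1.00   # USD, placeholder
QUERIES_PER_MONTH = 10_000

def monthly_cost(input_tokens_per_query: int) -> float:
    # Cost of input tokens alone, per month.
    return (input_tokens_per_query * QUERIES_PER_MONTH
            * PRICE_PER_1M_INPUT_TOKENS / 1_000_000)

print(f"CAG (20k-token corpus in every prompt): ${monthly_cost(20_000):.2f}/month")
print(f"RAG (2k tokens of retrieved chunks):    ${monthly_cost(2_000):.2f}/month")
```

On these made-up numbers, CAG costs ten times more per month in input tokens; whether that matters depends on your query volume and on what the retrieval stack costs to build and operate.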

Cache-Augmented Generation also doesn’t scale well to large datasets. If your content doesn’t fit into the model’s context window, you’ll need some way to narrow it down. In those situations, a hybrid approach might make sense: use retrieval to find relevant documents, then include those documents in full in the prompt if they still fit. This can improve answer quality while keeping the system relatively simple.
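
A minimal sketch of that hybrid, assuming a retriever that ranks whole documents and a rough token estimate (both stubbed out as placeholders here):

```python
# Hybrid sketch: retrieval picks *which* documents matter, then the
# selected documents go into the prompt in full, budget permitting.
CONTEXT_BUDGET_TOKENS = 100_000  # hypothetical; depends on the model

documents = {
    "onboarding.md": "...",      # hypothetical corpus
    "security-policy.md": "...",
}

def rank_documents(question: str) -> list[str]:
    # Placeholder: a real retriever would rank by embedding similarity.
    return list(documents)

def estimate_tokens(text: str) -> int:
    # Rough rule of thumb (~4 characters per token for English);
    # a real system would use the model's tokenizer.
    return len(text) // 4

def build_hybrid_prompt(question: str) -> str:
    selected, used = [], 0
    for name in rank_documents(question):
        cost = estimate_tokens(documents[name])
        if used + cost > CONTEXT_BUDGET_TOKENS:
            break  # stop before overflowing the context window
        selected.append(f"## {name}\n{documents[name]}")
        used += cost
    return "\n\n".join(selected) + f"\n\nQuestion: {question}"
```

The key design choice is that retrieval operates on whole documents rather than chunks, so a retrieval hit brings in all of its surrounding context instead of an arbitrary slice.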

April 9, 2025
Author
Tomi Leppälahti
