RAG comes with its own set of challenges. It adds latency, since each user query has to be embedded and matched against relevant context in a retrieval step. The quality of the results depends heavily on how documents are chunked and whether the important information actually ends up in the retrieved context; if the chunk size or overlap settings aren’t well tuned, you can miss the key information altogether. On top of that, managing vector databases, embeddings, and pipelines adds complexity to your stack.
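To make the chunking and retrieval concerns concrete, here is a minimal sketch of the retrieve step that RAG adds before every answer. The chunk size, overlap, and the pre-computed embedding vectors are placeholder assumptions for illustration, not any particular library’s API:

```python
# Minimal sketch of the chunk-and-retrieve step RAG adds before every answer.
# Chunk size, overlap, and the pre-computed embedding vectors are assumptions.
import math
from typing import List, Sequence


def chunk(text: str, size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             chunks: List[str],
             chunk_vecs: List[Sequence[float]],
             k: int = 3) -> List[str]:
    """Return the k chunks whose embeddings are most similar to the query embedding.

    If the relevant passage was split badly, it may simply not be among these k.
    """
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]
```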
So what’s the alternative when the data is small enough? One approach is Cache-Augmented Generation (CAG). If your background information and the user’s question fit together within the model’s context window, you can simply include the entire relevant content directly in the prompt. There’s no need for separate retrieval logic, embedding pipelines, or external storage. The model sees all the information at once, which removes the risk of missing something important because of a retrieval miss.
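A minimal sketch of that idea follows; the 128k-token limit and the roughly four-characters-per-token estimate are assumptions, so use your model’s actual limit and tokenizer in practice:

```python
# CAG in its simplest form: no retrieval, just put everything in the prompt.
# The context limit and the ~4-characters-per-token estimate are rough assumptions.
from typing import List


def build_cag_prompt(documents: List[str], question: str, context_limit: int = 128_000) -> str:
    """Concatenate all background documents and the question into a single prompt."""
    background = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the background below.\n\n"
        f"Background:\n{background}\n\n"
        f"Question: {question}"
    )
    estimated_tokens = len(prompt) // 4  # crude token estimate
    if estimated_tokens > context_limit:
        raise ValueError("Content exceeds the context window; consider retrieval instead.")
    return prompt
```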
Of course, this approach isn’t perfect either.
With API-based models, you’re billed for input tokens as well as output tokens, so including everything in the prompt can cost more than a retrieval system that adds only what’s necessary. Depending on your use case, though, the savings in infrastructure and engineering effort may outweigh the extra token costs.
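As a back-of-the-envelope comparison, where the token counts and the per-million-token price are made up purely for illustration and not real pricing:

```python
def input_cost(tokens: int, price_per_million_tokens: float) -> float:
    """Cost of a single query's input tokens at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million_tokens


# Illustrative numbers only; substitute your own corpus size and provider pricing.
cag_tokens = 80_000   # entire knowledge base sent with every query
rag_tokens = 2_000    # only the top-k retrieved chunks
price = 3.00          # assumed $ per million input tokens

print(f"CAG input cost per query: ${input_cost(cag_tokens, price):.4f}")  # $0.2400
print(f"RAG input cost per query: ${input_cost(rag_tokens, price):.4f}")  # $0.0060
```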
Cache-Augmented Generation also doesn’t scale well to large datasets. If your content doesn’t fit into the model’s context window, you’ll need some way to narrow it down. In those situations, a hybrid approach might make sense: use retrieval to find relevant documents, then include those documents in full in the prompt if they still fit. This can improve answer quality while keeping the system relatively simple.
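A rough sketch of that hybrid, reusing the `cosine` and `build_cag_prompt` helpers from the sketches above (the document list, embeddings, and context limit are again placeholder assumptions):

```python
from typing import List, Sequence


def hybrid_prompt(query_vec: Sequence[float],
                  docs: List[str],
                  doc_vecs: List[Sequence[float]],
                  question: str,
                  context_limit: int = 128_000) -> str:
    """Rank whole documents by similarity, then include as many full documents as fit."""
    ranked = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    selected, budget = [], context_limit
    for doc, _ in ranked:
        doc_tokens = len(doc) // 4  # same crude token estimate as before
        if doc_tokens > budget:
            break
        selected.append(doc)
        budget -= doc_tokens
    # Hand the full documents, not chunks, to the model in a single prompt.
    return build_cag_prompt(selected, question, context_limit)
```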