You Cache Only Once: Cache-Augmented Generation (CAG) Instead Of RAG
Streamlining Knowledge Tasks with Cache-Augmented Generation: A Simpler Alternative to Retrieval-Based Approaches

Large language models (LLMs) have shown impressive capabilities, but they still suffer from well-known limitations: hallucinations and the difficulty of keeping their knowledge up to date. Retrieval-augmented generation (RAG) is one of the techniques most widely used to address these problems: it searches an external memory for relevant information and supplies it to the LLM before generation. RAG is among the most popular approaches today, but it has shortcomings of its own: real-time retrieval adds latency, identifying the right documents is not trivial, the overall system becomes more complex, and it requires careful tuning.
Since newer models can accept much more text as input (their context length has grown considerably), the paper proposes a different strategy: preload the LLM with all the relevant documents in advance and precompute their key-value (KV) cache. With the knowledge already encoded in the cache, no document search is needed at query time and inference is faster:
This approach eliminates retrieval latency, mitigates retrieval errors, and simplifies system architecture, all while maintaining high-quality responses by ensuring the model processes all relevant context holistically. — source
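To make the idea concrete, here is a minimal sketch of how such a preloaded KV cache could be built with the Hugging Face transformers library. The model name, documents, and question are placeholders and the paper's actual implementation may differ; the sketch only illustrates the mechanism of encoding the knowledge base once and reusing the cache for every query.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any decoder-only LLM with a long context would do.
model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# 1) Preload: encode the whole (placeholder) knowledge base once and keep its KV cache.
knowledge = "\n\n".join([
    "<document 1 text>",
    "<document 2 text>",
])
doc_inputs = tokenizer(knowledge, return_tensors="pt").to(model.device)
with torch.no_grad():
    kv_cache = model(**doc_inputs, use_cache=True).past_key_values  # computed once

# 2) Inference: append only the question tokens and reuse the cached documents,
#    so there is no retrieval step and the corpus is never re-encoded.
def answer(question: str, max_new_tokens: int = 128) -> str:
    q_ids = tokenizer(
        question, return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)
    input_ids = torch.cat([doc_inputs["input_ids"], q_ids], dim=-1)
    # Copy the cache so every question starts from the document-only prefix.
    out = model.generate(
        input_ids,
        attention_mask=torch.ones_like(input_ids),
        past_key_values=copy.deepcopy(kv_cache),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(answer("According to the documents, what is X?"))
```

The obvious trade-off is that the entire knowledge base must fit within the model's context window and the precomputed cache occupies memory, which is why this approach targets tasks with a bounded, manageable set of documents.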
In this article, we discuss how this approach works and why the idea is interesting.