You Cache Only Once: Cache-Augmented Generation (CAG) Instead Of RAG
Streamlining Knowledge Tasks with Cache-Augmented Generation: A Simpler Alternative to Retrieval-Based Approaches

Large language models (LLMs) have shown impressive capabilities, but they still suffer from well-known limitations: hallucinations and the difficulty of keeping their knowledge up to date. Retrieval-augmented generation (RAG) is one of the techniques most widely used to address these problems: it searches an external memory for relevant information and supplies it to the LLM before generation. RAG is among the most popular approaches today, but it has shortcomings of its own: real-time retrieval adds latency, identifying the right documents is not trivial, the overall system becomes more complex, and it requires careful tuning.
Since newer models can accept much more text as input (their context length has grown considerably), the paper proposes a different strategy: preload the LLM with all the relevant documents in advance and precompute their key-value (KV) cache. With the knowledge already encoded in the cache, no document search is needed at query time and inference is faster:
This approach eliminates retrieval latency, mitigates retrieval errors, and simplifies system architecture, all while maintaining high-quality responses by ensuring the model processes all relevant context holistically. — source
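To make the idea concrete, here is a minimal sketch of how such a preloaded KV cache could be built with the Hugging Face transformers library. The model name, documents, and question are placeholders and the paper's actual implementation may differ; the sketch only illustrates the mechanism of encoding the knowledge base once and reusing the cache for every query.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any decoder-only LLM with a long context would do.
model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# 1) Preload: encode the whole (placeholder) knowledge base once and keep its KV cache.
knowledge = "\n\n".join([
    "<document 1 text>",
    "<document 2 text>",
])
doc_inputs = tokenizer(knowledge, return_tensors="pt").to(model.device)
with torch.no_grad():
    kv_cache = model(**doc_inputs, use_cache=True).past_key_values  # computed once

# 2) Inference: append only the question tokens and reuse the cached documents,
#    so there is no retrieval step and the corpus is never re-encoded.
def answer(question: str, max_new_tokens: int = 128) -> str:
    q_ids = tokenizer(
        question, return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)
    input_ids = torch.cat([doc_inputs["input_ids"], q_ids], dim=-1)
    # Copy the cache so every question starts from the document-only prefix.
    out = model.generate(
        input_ids,
        attention_mask=torch.ones_like(input_ids),
        past_key_values=copy.deepcopy(kv_cache),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(answer("According to the documents, what is X?"))
```

The obvious trade-off is that the entire knowledge base must fit within the model's context window and the precomputed cache occupies memory, which is why this approach targets tasks with a bounded, manageable set of documents.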
In this article, we discuss how this approach works and why the idea is interesting.