Cache-Augmented Generation (CAG): Is It Really More Powerful Than RAG?

Retrieval-Augmented Generation (RAG) transformed AI by dynamically retrieving external knowledge, but it comes with limitations such as latency and dependence on external resources. To overcome these challenges, Cache-Augmented Generation (CAG) has emerged as a powerful alternative. A CAG implementation focuses on caching relevant information, enabling faster and more efficient responses while improving scalability, accuracy, and reliability. In this CAG vs. RAG comparison, we will explore how CAG addresses RAG's limitations, dig into CAG implementation strategies, and analyze its real-world applications.

What Is Cache-Augmented Generation (CAG)?

Cache-Augmented Generation (CAG) is an approach that enhances language models by preloading relevant knowledge into the context window, eliminating the need for real-time retrieval. By leveraging a precomputed key-value (KV) cache, CAG optimizes knowledge-intensive tasks and delivers faster, more efficient responses.

How Does CAG Work?

When a query is submitted, CAG follows a structured approach to retrieve and generate a response efficiently:

  1. Knowledge preloading: Before inference, relevant information is preprocessed and stored in an extended context or dedicated cache. This ensures frequently accessed knowledge is readily available without real-time retrieval.
  2. Key-value caching: Instead of fetching documents dynamically as RAG does, CAG relies on precomputed inference states. These states serve as a reference, allowing the model to access cached knowledge instantly without external lookups (see the sketch below).
  3. Optimized inference: When a query arrives, the model checks the cache for preloaded knowledge. If a match is found, the model generates the response directly from the stored context. This significantly reduces inference time while keeping the output consistent and fluent.

Figure: How CAG works
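
To make steps 1 and 2 concrete, here is a minimal sketch of KV-cache preloading using Hugging Face transformers. It follows the library's documented pattern for reusing a prompt cache; the model name, knowledge text, and prompt wording are illustrative assumptions, not the implementation behind any specific CAG system.

import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import DynamicCache

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Offline step (preloading): run the knowledge prefix once and keep its key-value states.
knowledge = "Company policy: refunds are accepted within 30 days of purchase."
prefix = f"Use only this knowledge to answer questions.\n{knowledge}\n"
prefix_inputs = tokenizer(prefix, return_tensors="pt").to(model.device)
prefix_cache = DynamicCache()
with torch.no_grad():
    model(**prefix_inputs, past_key_values=prefix_cache, use_cache=True)

def answer(query: str) -> str:
    # Online step: reuse a copy of the preloaded cache so every query starts from
    # the same precomputed state; only the new query tokens are processed.
    full_inputs = tokenizer(prefix + f"Question: {query}\nAnswer:", return_tensors="pt").to(model.device)
    output = model.generate(**full_inputs, past_key_values=copy.deepcopy(prefix_cache), max_new_tokens=64)
    return tokenizer.decode(output[0, full_inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(answer("How long do customers have to request a refund?"))

Because the knowledge prefix is processed only once offline, each query pays only for its own tokens at inference time, which is where CAG's latency advantage comes from.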

Key Differences from RAG

Here is how the CAG approach differs from RAG:

  • No real-time retrieval: Knowledge is preloaded rather than fetched dynamically.
  • Lower latency: Responses are faster because the model does not query external resources during inference.
  • Potential staleness: Cached knowledge can become outdated if it is not refreshed periodically.

CAG Architecture

To generate responses efficiently without real-time retrieval, CAG relies on a structured framework designed for fast, reliable access to information. A CAG system consists of the following components:

Figure: CAG architecture

  1. Knowledge source: A repository of information (such as documents or structured data) that is accessed before inference to preload knowledge.
  2. Offline preloading: Knowledge is extracted and stored in the LLM's knowledge cache ahead of inference, ensuring fast access without real-time retrieval.
  3. LLM (large language model): The core model that generates responses using the knowledge preloaded into the cache.
  4. Query processing: When a query arrives, the model pulls relevant information from the knowledge cache instead of making external requests in real time.
  5. Response generation: The LLM combines the cached knowledge with the query context to produce its output, enabling faster and more efficient responses.

This architecture is best suited to use cases where knowledge changes infrequently and fast responses are required. A minimal sketch of this flow is shown below.
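
The sketch below maps the five components above onto a small program. It reuses the same OpenAI Responses API as the hands-on example later in this article; the documents, prompt wording, and model choice are illustrative assumptions, not a definitive implementation.

import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# 1. Knowledge source: static documents available before inference.
documents = [
    "Refund policy: customers may return items within 30 days of purchase.",
    "Shipping policy: standard delivery takes 3-5 business days.",
]

# 2. Offline preloading: assemble the knowledge once so every query reuses it.
preloaded_context = "\n".join(documents)

def answer(query: str) -> str:
    # 4. Query processing: the query is answered against the preloaded context;
    # no retrieval system is called at inference time.
    response = client.responses.create(
        model="gpt-4o",
        instructions="Answer using only the preloaded knowledge below.\n" + preloaded_context,
        input=query,
    )
    # 3 & 5. The LLM generates the response from the cached knowledge plus the query.
    return response.output_text

print(answer("How long do I have to return an item?"))

Note that this variant preloads knowledge as plain text in the prompt; a lower-level KV-cache version was sketched in the previous section.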

Why Do We Need CAG?

Traditional RAG systems enhance language models by integrating external knowledge sources in real time. However, RAG introduces challenges such as retrieval latency, potential errors in document selection, and increased system complexity. CAG addresses these issues by preloading all relevant resources into the model's context and caching its runtime parameters. This approach eliminates retrieval latency and minimizes retrieval errors while keeping the context relevant.

Applications of CAG

CAG is a technique that enhances language models by preloading relevant knowledge into the context, removing the need for real-time data retrieval. This approach has practical applications across several domains:

  1. Customer service and support: By preloading product information, FAQs, and troubleshooting guides, CAG lets AI-powered customer service platforms deliver instant, accurate responses, improving user satisfaction.
  2. Educational tools: CAG can power educational applications that provide instant explanations and resources on specific topics, supporting an efficient learning experience.
  3. Conversational AI: In chatbots and virtual assistants, CAG enables more coherent, context-aware interactions by maintaining conversation history, leading to more natural dialogue.
  4. Content creation: Writers and marketers can use CAG to generate content aligned with brand guidelines and messaging by preloading relevant materials, ensuring consistency and efficiency.
  5. Healthcare information systems: By preloading medical guidelines and protocols, CAG helps healthcare professionals access critical information quickly, supporting timely decision-making.

By integrating CAG into these applications, organizations can achieve faster response times, higher accuracy, and more efficient operations.

Hands-On with CAG

In this hands-on exercise, we will explore how to handle AI queries efficiently using fuzzy matching and caching to optimize response time.

To do this, we first ask the system "What is Overfitting?" and then follow up with "Explain Overfitting." The system first checks whether a cached response exists. If not, it retrieves the most relevant context from the knowledge base, generates a response with the OpenAI API, and caches it.

Fuzzy matching is a technique for measuring the similarity between queries even when they are not identical, which helps identify slight variations, typos, or rephrased versions of earlier queries. For the second question, instead of making a redundant API call, fuzzy matching recognizes its similarity to the earlier query and immediately returns the cached response, significantly improving speed and reducing cost. The short snippet below illustrates the idea before the full program.
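
As a quick, standalone illustration (not part of the original walkthrough), difflib can score how close a new query is to known topics; the topic list here is a hypothetical subset of the knowledge base used below.

import difflib

topics = ["Overfitting", "Model Evaluation", "Feature Engineering"]
query = "explain overfitting"

# Approximate string matching: no exact match is required.
print(difflib.get_close_matches(query, topics, n=1, cutoff=0.5))   # expected: ['Overfitting']
print(difflib.SequenceMatcher(None, query, "what is overfitting").ratio())  # similarity score in [0, 1]

The full program below combines this kind of matching with a response cache.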

Code:

import os
import hashlib
import time
import difflib

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables from .env file
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Static Knowledge Dataset
knowledge_base = {
    "Data Science": "Data Science is an interdisciplinary field that combines statistics, machine learning, and domain expertise to analyze and extract insights from data.",
    "Machine Learning": "Machine Learning (ML) is a subset of AI that enables systems to learn from data and improve over time without explicit programming.",
    "Deep Learning": "Deep Learning is a branch of ML that uses neural networks with multiple layers to analyze complex patterns in large datasets.",
    "Neural Networks": "Neural Networks are computational models inspired by the human brain, consisting of layers of interconnected nodes (neurons).",
    "Natural Language Processing": "NLP enables machines to understand, interpret, and generate human language.",
    "Feature Engineering": "Feature Engineering is the process of selecting, transforming, or creating features to improve model performance.",
    "Hyperparameter Tuning": "Hyperparameter tuning optimizes model parameters like learning rate and batch size to improve performance.",
    "Model Evaluation": "Model evaluation assesses performance using accuracy, precision, recall, F1-score, and RMSE.",
    "Overfitting": "Overfitting occurs when a model learns noise instead of patterns, leading to poor generalization. Prevention techniques include regularization, dropout, and early stopping.",
    "Cloud Computing for AI": "Cloud platforms like AWS, GCP, and Azure provide scalable infrastructure for AI model training and deployment."
}

# Caches for storing responses:
#   response_cache maps a hash of the matched topic to the generated answer;
#   query_cache maps the normalized query text to the same answer, so that
#   similar future queries can be fuzzily matched against real query strings.
response_cache = {}
query_cache = {}

# Generate a cache key based on normalized text
def get_cache_key(text):
    return hashlib.md5(text.lower().encode()).hexdigest()

# Find the best matching topic in the knowledge base (fuzzy match, no exact match needed)
def find_best_match(query):
    matches = difflib.get_close_matches(query, knowledge_base.keys(), n=1, cutoff=0.5)
    return matches[0] if matches else None

# Process queries with caching & fuzzy matching
def query_with_cache(query):
    normalized_query = query.lower().strip()

    # First, check whether a sufficiently similar query has already been answered
    for cached_query, cached_response in query_cache.items():
        if difflib.SequenceMatcher(None, normalized_query, cached_query).ratio() > 0.8:
            return f"(Cached) {cached_response}"

    # Find the best match in the knowledge base
    best_match = find_best_match(normalized_query)
    if not best_match:
        return "No relevant knowledge found."

    context = knowledge_base[best_match]
    cache_key = get_cache_key(best_match)

    # Check if a response for this topic is already cached
    if cache_key in response_cache:
        query_cache[normalized_query] = response_cache[cache_key]
        return f"(Cached) {response_cache[cache_key]}"

    # If not cached, generate a response
    prompt = f"Context:\n{context}\n\nQuery: {query}\nAnswer:"
    response = client.responses.create(
        model="gpt-4o",
        instructions="You are an AI assistant with expert knowledge.",
        input=prompt
    )
    response_text = response.output_text.strip()

    # Store the response in both caches
    response_cache[cache_key] = response_text
    query_cache[normalized_query] = response_text
    return response_text

if __name__ == "__main__":
    start_time = time.time()
    print(query_with_cache("What is Overfitting"))
    print(f"Response Time: {time.time() - start_time:.4f} seconds\n")

    start_time = time.time()
    print(query_with_cache("Explain Overfitting"))
    print(f"Response Time: {time.time() - start_time:.4f} seconds")

Output:

In the output, we see that the second query is served much faster because it hits the cache through similarity matching, avoiding a redundant API call. The response times confirm this efficiency, showing that caching significantly improves speed and reduces cost.

Figure: CAG with fuzzy matching — output

CAG vs. RAG: A Comparison

CAG and RAG take different approaches to augmenting language models with external knowledge.

Here are their main differences.

| Aspect | Cache-Augmented Generation (CAG) | Retrieval-Augmented Generation (RAG) |
| --- | --- | --- |
| Knowledge integration | Preloads relevant knowledge into the model's extended context during preprocessing; no real-time retrieval. | Dynamically retrieves external information based on the input query and integrates it during inference. |
| System architecture | Simplified architecture with no external retrieval component, reducing potential points of failure. | Requires a more complex system with a retrieval mechanism to fetch relevant information during inference. |
| Response latency | Faster responses, since no real-time retrieval step is needed. | Real-time data retrieval takes time and can add latency. |
| Use cases | Ideal for static or infrequently changing datasets, such as company policies or user manuals. | Suited to applications that need up-to-date information, such as news updates or real-time analytics. |
| System complexity | Fewer components to maintain, lowering operational costs. | Involves managing an external retrieval system, adding complexity and potential maintenance challenges. |
| Performance | Excels at tasks with a stable knowledge domain, delivering efficient and reliable responses. | Well suited to dynamic environments, adapting to the latest information and developments. |
| Reliability | Relies on preloaded, curated knowledge, reducing the risk of retrieval errors. | May suffer retrieval errors due to its dependence on external data sources and real-time fetching. |

CAG or RAG – Which Is Right for Your Project?

When deciding between Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation (CAG), consider factors such as data volatility, system complexity, and the size of the language model's context window.

When to use RAG:

  • Dynamic knowledge bases: RAG is ideal for applications that need up-to-date information, such as news aggregation or real-time analytics where data changes frequently. Its real-time retrieval mechanism ensures the model accesses the latest data.
  • Very large datasets: For knowledge bases that exceed the model's context window, RAG's ability to fetch relevant information dynamically becomes essential, preventing context overload and preserving accuracy.

When to use CAG:

  • Static or stable data: CAG fits scenarios where the dataset rarely changes, such as company policies or educational materials. By preloading knowledge into the model's context, CAG speeds up responses and reduces system complexity.
  • Extended context windows: As language models advance to support larger context windows, CAG can preload substantial amounts of relevant information, making it efficient for tasks with a stable knowledge domain.

Conclusion

By preloading relevant knowledge into the model's context, CAG offers a compelling alternative to traditional RAG. It eliminates real-time retrieval, dramatically reducing latency and improving efficiency. It also simplifies the system architecture, making it well suited to applications with stable knowledge domains, such as customer support, educational tools, and conversational AI.

While RAG remains essential for dynamic, real-time information retrieval, CAG proves to be a powerful solution where speed, reliability, and reduced system complexity are priorities. As language models continue to evolve with larger context windows and better memory mechanisms, CAG's role in optimizing AI-driven applications will only grow. By choosing strategically between RAG and CAG based on the use case, businesses and developers can unlock the full potential of AI-driven knowledge integration.
