Cache-Augmented Generation (CAG): Is It Really More Powerful Than RAG?

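Retrieval-augmented generation (RAG) changed how AI systems use knowledge by retrieving external information dynamically, but it has limitations, such as retrieval latency and dependence on external resources. Cache-augmented generation (CAG) has emerged as an alternative that addresses these challenges. CAG caches relevant information ahead of time, which enables faster and more efficient responses and can improve scalability, accuracy, and reliability. In this comparison of CAG and RAG, we look at how CAG addresses RAG's limitations, examine strategies for implementing CAG, and analyze its real-world applications.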

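What Is Cache-Augmented Generation (CAG)?

Cache-augmented generation (CAG) is a method that enhances a language model by preloading relevant knowledge into its context window, so no real-time retrieval is needed. CAG optimizes knowledge-intensive tasks by reusing a precomputed key-value (KV) cache, which makes responses faster and more efficient.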

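How Does CAG Work?

When a query is submitted, CAG follows a structured approach to produce a response efficiently:

Preloading knowledge: Before inference, the relevant information is preprocessed and stored in an extended context or a dedicated cache. This keeps frequently accessed knowledge available without real-time retrieval.
Key-value caching: Instead of fetching documents dynamically as RAG does, CAG relies on precomputed inference states. These states serve as a reference, allowing the model to access cached knowledge immediately, with no external lookup.
Optimized inference: When a query arrives, the model checks the cache for existing knowledge embeddings. If a match is found, the model generates its response directly from the stored context. This shortens inference time considerably while keeping the generated output consistent and fluent.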
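
The three steps above can be illustrated with a small, dependency-free sketch. It does not build a real transformer KV cache: `encode_context` is a hypothetical stand-in for the expensive pass over the knowledge, and the class name `PreloadedContext` is an assumption made for illustration. The point is only the shape of the flow, in which the costly step runs once before any query and every query reuses the stored state.

```python
from dataclasses import dataclass, field


def encode_context(documents):
    """Hypothetical stand-in for the expensive pass that would build a KV cache.

    Here it only tokenizes on whitespace; a real system would run the model
    over the documents and keep the resulting attention keys and values.
    """
    return [token.lower() for doc in documents for token in doc.split()]


@dataclass
class PreloadedContext:
    documents: list
    encode_calls: int = 0          # how many times the expensive step ran
    state: list = field(default_factory=list)

    def preload(self):
        """Steps 1 and 2: process the knowledge once and keep the state."""
        self.state = encode_context(self.documents)
        self.encode_calls += 1

    def answer(self, query):
        """Step 3: serve a query from the stored state, with no re-encoding."""
        if not self.state:
            raise RuntimeError("call preload() before answering queries")
        query_tokens = [t.strip("?.,!").lower() for t in query.split()]
        hits = sorted({t for t in query_tokens if t in self.state})
        return {"query": query, "matched_terms": hits}


ctx = PreloadedContext(["Overfitting occurs when a model learns noise",
                        "Regularization reduces overfitting"])
ctx.preload()
first = ctx.answer("What is overfitting?")
second = ctx.answer("Does regularization help?")
print(first["matched_terms"], second["matched_terms"], ctx.encode_calls)
```

Both queries are served while `encode_calls` stays at 1, which is the property CAG depends on: the cost of processing the knowledge is paid once, not per query.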

Figure: How CAG works

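Key Differences from RAG

Here is how the CAG approach differs from RAG:

No real-time retrieval: Knowledge is preloaded rather than fetched dynamically.
Lower latency: Because the model does not query external resources during inference, responses are faster.
Potential staleness: Cached knowledge can become outdated if it is not refreshed regularly.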
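
The staleness risk in the last point is usually handled by refreshing the cache on a schedule. The sketch below shows one possible way to do that with a time-to-live (TTL). `RefreshingCache`, `loader`, and the `ttl_seconds` value are hypothetical names chosen for this example, and the clock is injectable so the behavior can be checked without waiting.

```python
import time


class RefreshingCache:
    """Holds preloaded knowledge and reloads it once it is older than the TTL."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader        # callable that returns fresh knowledge
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._loaded_at = None
        self.reloads = 0

    def get(self):
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            self._value = self._loader()
            self._loaded_at = now
            self.reloads += 1
        return self._value


# Simulated clock and knowledge source, so the example runs instantly.
fake_now = [0.0]
versions = iter(["policy v1", "policy v2"])
cache = RefreshingCache(loader=lambda: next(versions),
                        ttl_seconds=60,
                        clock=lambda: fake_now[0])

a = cache.get()          # first call loads the knowledge
fake_now[0] = 30.0
b = cache.get()          # still within the TTL, served from cache
fake_now[0] = 61.0
c = cache.get()          # older than the TTL, reloaded
print(a, b, c, cache.reloads)
```

A short TTL keeps answers fresher at the cost of more frequent preloading, so the value is a trade-off to set per knowledge source.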

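CAG Architecture

To generate responses efficiently without real-time retrieval, CAG relies on a structured framework designed for fast, reliable access to information. A CAG system consists of the following parts:

Figure: CAG architecture

Knowledge source: A repository of information (such as documents or structured data) that is accessed before inference to preload knowledge.
Offline preloading: Knowledge is extracted before inference and stored in a knowledge cache inside the LLM, so it can be accessed quickly without real-time retrieval.
LLM (large language model): The core model, which generates responses using the preloaded knowledge stored in the knowledge cache.
Query processing: When a query arrives, the model draws the relevant information from the knowledge cache instead of making a real-time external request.
Response generation: The LLM produces its output from the cached knowledge and the query context, which makes responses faster and more efficient.

This architecture is best suited to use cases where knowledge changes infrequently and fast responses are required.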
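
One way to picture how these components fit together is the sketch below. It is an outline under stated assumptions, not a reference implementation: `CAGPipeline`, `load_knowledge`, and `echo_llm` are hypothetical names, and the stub `echo_llm` stands in for a real model so the example runs offline. In a production system the preloaded text would be turned into a KV cache inside the model rather than re-sent as a prompt prefix.

```python
from typing import Callable, Dict


def load_knowledge() -> Dict[str, str]:
    """Knowledge source: a hypothetical in-memory document store."""
    return {
        "refunds": "Refunds are issued within 14 days of purchase.",
        "shipping": "Standard shipping takes 3 to 5 business days.",
    }


def echo_llm(prompt: str) -> str:
    """Stub LLM: reports the prompt size instead of calling a real model."""
    return f"[stub answer based on {len(prompt)} prompt characters]"


class CAGPipeline:
    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm
        self.knowledge_cache = ""    # filled once by preload()

    def preload(self, source: Callable[[], Dict[str, str]]) -> None:
        """Offline preloading: read the source once, before any query."""
        docs = source()
        self.knowledge_cache = "\n".join(
            f"{title}: {text}" for title, text in sorted(docs.items())
        )

    def answer(self, query: str) -> str:
        """Query processing and response generation, with no external lookup."""
        prompt = f"Knowledge:\n{self.knowledge_cache}\n\nQuestion: {query}\nAnswer:"
        return self.llm(prompt)


pipeline = CAGPipeline(llm=echo_llm)
pipeline.preload(load_knowledge)
print(pipeline.answer("How long do refunds take?"))
```

Swapping `echo_llm` for a function that calls a real model leaves the rest of the flow unchanged, which reflects how few moving parts the architecture has.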

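Why Do We Need CAG?

Traditional RAG systems enhance language models by integrating external knowledge sources in real time. However, RAG brings challenges of its own, such as retrieval latency, potential errors in document selection, and added system complexity. CAG addresses these issues by preloading all relevant resources into the model's context and caching the model's runtime parameters. This approach removes retrieval latency and minimizes retrieval errors while keeping the context relevant.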

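Applications of CAG

Because CAG preloads relevant knowledge into the context and removes the need for real-time data retrieval, it has practical applications across several domains:

Customer service and support: By preloading product information, FAQs, and troubleshooting guides, CAG lets AI-driven customer service platforms give immediate, accurate answers, which improves user satisfaction.
Educational tools: CAG can power educational applications that offer instant explanations and resources on specific topics, supporting efficient learning.
Conversational AI: In chatbots and virtual assistants, CAG can keep the conversation history in context, which makes interactions more coherent and context-aware and conversations more natural.
Content creation: Writers and marketers can use CAG to preload brand guidelines and messaging so that generated content stays aligned with them, improving both consistency and efficiency.
Healthcare information systems: By preloading medical guidelines and protocols, CAG can help healthcare professionals reach critical information quickly and support timely decisions.

By integrating CAG into these applications, organizations can achieve faster response times, higher accuracy, and more efficient operations.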

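Hands-On with CAG

In this hands-on experiment, we explore how fuzzy matching and caching can handle AI queries efficiently and reduce response time.

To do this, we first ask the system "What is overfitting?" and then follow up with "Explain overfitting." The system first checks whether a cached response exists. If not, it retrieves the most relevant context from the knowledge base, generates a response with the OpenAI API, and caches it.

Fuzzy matching is a technique for measuring the similarity between queries even when they are not identical, which helps recognize minor variations, typos, or rephrasings of earlier queries. For the second question, rather than making a redundant API call, fuzzy matching recognizes that it concerns the same topic as the earlier query, so the system returns the cached response immediately, which improves speed and reduces cost.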
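
To make the idea concrete, here is a short sketch using Python's standard `difflib` module. The sample topics, the helper names `similarity` and `closest_topic`, and the 0.5 cutoff are illustrative choices. `SequenceMatcher.ratio()` returns a similarity between 0 and 1, and `get_close_matches` returns the candidates whose ratio meets the cutoff.

```python
import difflib

topics = ["overfitting", "feature engineering", "model evaluation"]


def similarity(a, b):
    """Similarity of two strings in [0, 1], as computed by SequenceMatcher."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closest_topic(query, cutoff=0.5):
    """Return the topic most similar to the query, or None below the cutoff."""
    matches = difflib.get_close_matches(query.lower(), topics, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for q in ["What is overfitting", "Explain overfitting", "overfiting"]:
    print(f"{q!r} -> {closest_topic(q)} (score {similarity(q, 'overfitting'):.2f})")

# Two differently worded questions resolve to the same topic,
# so one cached answer can serve both.
print(closest_topic("What is overfitting") == closest_topic("Explain overfitting"))

# An unrelated query falls below the cutoff and matches nothing.
print(closest_topic("quantum chromodynamics"))
```

The cutoff controls the trade-off: a lower value tolerates more rephrasing but raises the chance of returning a cached answer for an unrelated question.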

Code:

import os
import hashlib
import time
import difflib

from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables from .env file
load_dotenv()

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Static knowledge dataset
knowledge_base = {
    "Data Science": "Data Science is an interdisciplinary field that combines statistics, machine learning, and domain expertise to analyze and extract insights from data.",
    "Machine Learning": "Machine Learning (ML) is a subset of AI that enables systems to learn from data and improve over time without explicit programming.",
    "Deep Learning": "Deep Learning is a branch of ML that uses neural networks with multiple layers to analyze complex patterns in large datasets.",
    "Neural Networks": "Neural Networks are computational models inspired by the human brain, consisting of layers of interconnected nodes (neurons).",
    "Natural Language Processing": "NLP enables machines to understand, interpret, and generate human language.",
    "Feature Engineering": "Feature Engineering is the process of selecting, transforming, or creating features to improve model performance.",
    "Hyperparameter Tuning": "Hyperparameter tuning optimizes model parameters like learning rate and batch size to improve performance.",
    "Model Evaluation": "Model evaluation assesses performance using accuracy, precision, recall, F1-score, and RMSE.",
    "Overfitting": "Overfitting occurs when a model learns noise instead of patterns, leading to poor generalization. Prevention techniques include regularization, dropout, and early stopping.",
    "Cloud Computing for AI": "Cloud platforms like AWS, GCP, and Azure provide scalable infrastructure for AI model training and deployment.",
}

# Cache of generated responses, keyed by the MD5 hash of the matched topic
response_cache = {}

# Normalized queries answered so far, mapped to their cache key (for fuzzy lookup)
seen_queries = {}


def get_cache_key(text):
    """Return a cache key derived from the lowercased text."""
    return hashlib.md5(text.lower().encode()).hexdigest()


def find_best_match(query):
    """Return the knowledge-base topic closest to the query, or None."""
    # The query is lowercased, so compare against lowercased topic names
    topics = {topic.lower(): topic for topic in knowledge_base}
    matches = difflib.get_close_matches(query, list(topics), n=1, cutoff=0.5)
    return topics[matches[0]] if matches else None


def query_with_cache(query):
    """Answer a query, reusing a cached response when one applies."""
    normalized_query = query.lower().strip()

    # First, check whether a similar query has already been answered
    for cached_query, key in seen_queries.items():
        if difflib.SequenceMatcher(None, normalized_query, cached_query).ratio() > 0.8:
            return f"(Cached) {response_cache[key]}"

    # Find the best match in the knowledge base
    best_match = find_best_match(normalized_query)
    if not best_match:
        return "No relevant knowledge found."

    context = knowledge_base[best_match]
    cache_key = get_cache_key(best_match)

    # Check whether a response for this topic is already cached
    if cache_key in response_cache:
        seen_queries[normalized_query] = cache_key
        return f"(Cached) {response_cache[cache_key]}"

    # If not cached, generate a response
    prompt = f"Context:\n{context}\n\nQuery: {query}\nAnswer:"
    response = client.responses.create(
        model="gpt-4o",
        instructions="You are an AI assistant with expert knowledge.",
        input=prompt,
    )
    response_text = response.output_text.strip()

    # Store the response and remember the query that produced it
    response_cache[cache_key] = response_text
    seen_queries[normalized_query] = cache_key
    return response_text


if __name__ == "__main__":
    start_time = time.time()
    print(query_with_cache("What is Overfitting"))
    print(f"Response Time: {time.time() - start_time:.4f} seconds\n")

    start_time = time.time()
    print(query_with_cache("Explain Overfitting"))
    print(f"Response Time: {time.time() - start_time:.4f} seconds")

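Output:

In the output, the second query is processed faster because similarity matching lets it reuse the cache and avoid a redundant API call. The response times confirm this gain, showing that caching improves speed and lowers cost.

Figure: CAG fuzzy-matching output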

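CAG vs. RAG

CAG and RAG take different approaches to enhancing language models with external knowledge.

Their main differences are listed below.

Aspect	Cache-Augmented Generation (CAG)	Retrieval-Augmented Generation (RAG)
Knowledge integration	Preloads relevant knowledge into the model's extended context during preprocessing, with no real-time retrieval.	Retrieves external information dynamically for each input query and integrates it during inference.
System architecture	Simplified architecture with no external retrieval component, which reduces potential points of failure.	Requires a more complex system with a retrieval mechanism to fetch relevant information during inference.
Response latency	Faster responses, because no real-time retrieval step is needed.	Real-time data retrieval takes time and can add latency.
Use cases	Well suited to static or rarely changing datasets, such as company policies or user manuals.	Suited to applications that need up-to-date information, such as news updates or real-time analytics.
System complexity	Fewer components, easier maintenance, and lower operating cost.	Involves managing an external retrieval system, which adds complexity and potential maintenance challenges.
Performance	Performs well on tasks with a stable knowledge domain, giving efficient and reliable responses.	Suited to dynamic environments and adapts to the latest information and developments.
Reliability	Relies on preloaded, curated knowledge, which lowers the risk of retrieval errors.	Depends on external data sources and real-time fetching, so retrieval errors can occur.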

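CAG or RAG: Which One Fits Your Project?

When choosing between retrieval-augmented generation (RAG) and cache-augmented generation (CAG), consider factors such as data volatility, system complexity, and the size of the language model's context window.

When to use RAG:

Dynamic knowledge bases: RAG suits applications that need up-to-date information, such as news aggregation or real-time analytics, where data changes often. Its real-time retrieval mechanism ensures the model works from the latest data.
Very large datasets: For knowledge bases that exceed the model's context window, RAG's ability to fetch only the relevant information becomes essential, because it prevents context overload and preserves accuracy.

When to use CAG:

Static or stable data: CAG suits scenarios where the dataset changes infrequently, such as company policies or educational material. By preloading knowledge into the model's context, CAG speeds up responses and reduces system complexity.
Extended context windows: As language models support larger context windows, CAG can preload more relevant information, which lets it handle tasks with a stable knowledge domain efficiently.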
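
These criteria can be turned into a rough first-pass check. The function below is a heuristic sketch only: `choose_strategy`, the four-characters-per-token estimate, the 50% context budget, and the one-update-per-day threshold are all assumptions made for illustration and should be tuned against a real tokenizer and workload.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; a real tokenizer should replace this."""
    return int(len(text) / chars_per_token) + 1


def choose_strategy(documents, context_window_tokens, updates_per_day,
                    context_budget=0.5, max_updates_per_day=1.0):
    """Suggest "CAG" when the knowledge is small and stable, else "RAG"."""
    total_tokens = sum(estimate_tokens(doc) for doc in documents)
    fits = total_tokens <= context_window_tokens * context_budget
    stable = updates_per_day <= max_updates_per_day
    reasons = []
    if not fits:
        reasons.append("knowledge exceeds the context budget")
    if not stable:
        reasons.append("knowledge changes too often")
    return ("CAG" if fits and stable else "RAG"), total_tokens, reasons


handbook = ["Company policy text. " * 200]       # 4,200 characters
small_stable = choose_strategy(handbook, context_window_tokens=128_000, updates_per_day=0.1)
small_volatile = choose_strategy(handbook, context_window_tokens=128_000, updates_per_day=50)
large_stable = choose_strategy(handbook * 100, context_window_tokens=128_000, updates_per_day=0.1)
print(small_stable)
print(small_volatile)
print(large_stable)
```

The context budget is deliberately below 100% so that room remains for the query and the generated answer.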

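Summary

By preloading relevant knowledge into the model's context, CAG offers a compelling alternative to traditional RAG. It removes the real-time retrieval step, which reduces latency and improves efficiency. It also simplifies the system architecture, making it a good fit for applications with a stable knowledge domain, such as customer support, educational tools, and conversational AI.

RAG remains essential for dynamic, real-time information retrieval, but CAG is a strong option when speed, reliability, and lower system complexity are the priorities. As language models evolve, with larger context windows and better memory mechanisms, CAG's role in optimizing AI-driven applications is likely to grow. By choosing strategically between RAG and CAG according to the use case, organizations and developers can get the most out of AI-driven knowledge integration.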