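13 Advanced RAG Techniques for Your Next Project

Can AI generate truly relevant answers at scale? How do we make sure it understands complex, multi-turn conversations? And how do we stop it from confidently stating wrong facts? These are the challenges facing modern AI systems, especially those built with retrieval-augmented generation (RAG). RAG combines the power of document retrieval with the fluency of language generation, so a system can answer questions grounded in context. Basic RAG systems handle simple tasks well, but they often struggle with complex queries, hallucinations, and keeping context across long interactions. This is where advanced RAG techniques come in.

In this article, we look at how to level up a RAG pipeline by strengthening each stage of the stack: indexing, retrieval, and generation. We cover several powerful methods, with hands-on code, that help improve relevance, reduce noise, and boost system performance, whether you are building a healthcare assistant, an educational tutor, or an enterprise knowledge bot.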

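Where Does Basic RAG Fall Short?

Let's look at the basic RAG framework:

RAG framework

Source: Dipanjan Sarkar

This architecture shows the basic way chunk embeddings are stored in a vector store. Documents are first loaded, then split into chunks using one of several chunking techniques, and finally embedded with an embedding model so that relevant chunks can later be found by similarity search.

This figure depicts the retrieval and generation steps of RAG: the user asks a question, and the system searches the vector store for results that match it. The retrieved content is then passed to the LLM together with the question, and the LLM produces a structured output.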

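Basic RAG systems have clear limitations, especially in demanding settings.

Hallucinations: A major problem is hallucination. The model produces content that is factually wrong or not supported by the source documents. This undermines reliability, especially in fields such as medicine or law where precision is critical.
Lack of domain specificity: Standard RAG models struggle with specialized topics. Unless retrieval and generation are adapted to the specifics of the domain, the system may surface irrelevant or inaccurate information.
Complex conversations: Basic RAG systems have trouble with complex queries and multi-turn conversations. They often lose context over the course of an interaction, which leads to disjointed or incomplete answers, even as the queries they must handle keep getting more complex.

We will therefore go through advanced RAG techniques for each part of the RAG stack: indexing, retrieval, and generation, and discuss improvements that use open-source libraries and resources. These techniques apply broadly, whether you are building a medical chatbot, an educational bot, or another application, and they will improve most RAG systems.

Let's get started with advanced RAG techniques!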

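Indexing and Chunking: Building a Solid Foundation

Good indexing is essential to any RAG system. The first step concerns how we ingest, split, and store data. Let's explore ways to index data, focusing on indexing and chunking text and on using metadata.

1. HNSW

Hierarchical Navigable Small Worlds (HNSW) is an efficient algorithm for finding similar items in large datasets. It uses a structured, graph-based approach to locate approximate nearest neighbors (ANN) quickly.

Proximity graph: HNSW builds a graph in which every point is connected to nearby points. This structure enables efficient search.
Hierarchical structure: The algorithm organizes points into multiple layers. The top layers connect points that are far apart, while the lower layers connect points that are close together. This speeds up the search.
Greedy routing: HNSW finds neighbors greedily. It starts from a point in a high layer and keeps moving to the closest neighbor until it reaches a local minimum. This shortens the time needed to find similar items.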

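How Does HNSW Work?

HNSW works through several key parts:

Input layer: Each data point is represented as a vector in a high-dimensional space.
Graph construction:
- Nodes are added to the graph one at a time.
- Each node is assigned to a layer by a probability function, which determines how likely the node is to be placed in a higher layer.
- The algorithm balances the number of connections against search speed.
Search process:
- The search starts from a designated entry point in the top layer.
- At each step, the algorithm moves to the closest neighbor.
- Once it reaches a local minimum, it drops to the next layer down and continues until it finds the closest point in the bottom layer.
Parameters:
- M: the number of neighbors connected to each node.
- efConstruction: affects how many neighbors the algorithm considers while building the graph.
- efSearch: affects the search process by determining how many neighbors are evaluated.

HNSW is designed to find similar items quickly and accurately, which makes it a strong choice for tasks that need efficient search over large datasets.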
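
To make the layer assignment and greedy routing more concrete, here is a minimal pure-Python sketch. It is not how FAISS implements HNSW internally. The helper names `assign_level` and `greedy_search` are hypothetical, the level formula (floor of -ln(u) * mL with mL = 1/ln(M)) follows the commonly cited formulation from the HNSW paper, and the graph is a toy single layer of points on a line.

```python
import math
import random

def assign_level(M: int, rng: random.Random) -> int:
    """Draw a layer for a new node; higher layers are exponentially rarer."""
    m_l = 1.0 / math.log(M)        # level normalization factor (assumed choice)
    u = 1.0 - rng.random()         # value in (0, 1], avoids log(0)
    return int(-math.log(u) * m_l)

def greedy_search(graph, vectors, entry, query):
    """Move to the neighbor closest to the query until no neighbor improves."""
    def dist(i):
        return math.dist(vectors[i], query)

    current = entry
    while True:
        best = min(graph[current], key=dist, default=current)
        if dist(best) >= dist(current):
            return current         # local minimum reached
        current = best

# Toy single-layer graph: points on a line, each linked to its direct neighbors.
vectors = {i: (float(i), 0.0) for i in range(10)}
graph = {i: [j for j in (i - 1, i + 1) if j in vectors] for i in vectors}
nearest = greedy_search(graph, vectors, entry=0, query=(6.2, 0.0))

# Most nodes land on layer 0; only a small fraction reach higher layers.
rng = random.Random(0)
levels = [assign_level(32, rng) for _ in range(10_000)]
share_on_layer_0 = levels.count(0) / len(levels)
print(nearest, share_on_layer_0)
```

In a full HNSW index the same greedy walk runs once per layer, and the result of each layer becomes the entry point for the layer below.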

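HNSW search

Source: Link

The figure depicts a simplified HNSW search: starting from the "entry point" (blue), the algorithm navigates through the graph toward the "query vector" (yellow). The "nearest neighbor" (striped) is identified by traversing edges based on proximity. This illustrates the core idea of navigating a graph for efficient approximate nearest neighbor search.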

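HNSW in Practice

Follow the steps below to implement Hierarchical Navigable Small Worlds (HNSW) with FAISS. The guide includes sample output to illustrate the process.

Step 1: Set the HNSW parameters

First, define the parameters of the HNSW index. You need to specify the vector dimension and the number of neighbors per node.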

import faiss
import numpy as np

# Set up HNSW parameters
d = 128  # Dimensionality of the vectors
M = 32   # Number of neighbors for each node

Step 2: Initialize the HNSW index

Create the HNSW index with the parameters defined above.

# Initialize the HNSW index
index = faiss.IndexHNSWFlat(d, M)

Step 3: Set efConstruction

Before adding data to the index, set the `efConstruction` parameter. It controls how many neighbors the algorithm considers while building the index.

efConstruction = 200  # Example value for efConstruction
index.hnsw.efConstruction = efConstruction

Step 4: Generate sample data and build the index

For this example, generate random data to index. Here, `xb` is the dataset to be indexed.

# Generate random dataset of vectors
n = 10000  # Number of vectors to index
xb = np.random.random((n, d)).astype('float32')
# Add data to the index
index.add(xb)  # Build the index

Step 5: Set efSearch

After the index is built, set the `efSearch` parameter, which affects the search process.

efSearch = 100  # Example value for efSearch
index.hnsw.efSearch = efSearch

Step 6: Run the search

You can now search for the nearest neighbors of your query vectors. Here, `xq` holds the query vectors.

# Generate random query vectors
nq = 5  # Number of query vectors
xq = np.random.random((nq, d)).astype('float32')
# Perform a search for the top k nearest neighbors
k = 5  # Number of nearest neighbors to retrieve
distances, indices = index.search(xq, k)
# Output the results
print("Query Vectors:\n", xq)
print("\nNearest Neighbors Indices:\n", indices)
print("\nNearest Neighbors Distances:\n", distances)

Output (illustrative; the data is random, so your values will differ)

Query Vectors:
 [[0.12345678 0.23456789 ... 0.98765432]
 [0.23456789 0.34567890 ... 0.87654321]
 [0.34567890 0.45678901 ... 0.76543210]
 [0.45678901 0.56789012 ... 0.65432109]
 [0.56789012 0.67890123 ... 0.54321098]]

Nearest Neighbors Indices:
 [[ 123  456  789  101  112]
 [ 234  567  890  123  134]
 [ 345  678  901  234  245]
 [ 456  789  012  345  356]
 [ 567  890  123  456  467]]

Nearest Neighbors Distances:
 [[0.123 0.234 0.345 0.456 0.567]
 [0.234 0.345 0.456 0.567 0.678]
 [0.345 0.456 0.567 0.678 0.789]
 [0.456 0.567 0.678 0.789 0.890]
 [0.567 0.678 0.789 0.890 0.901]]

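2. Semantic Chunking

This approach splits text by meaning rather than by a fixed size alone. Each chunk represents one coherent piece of information. We compute the cosine distance between sentence embeddings; if two sentences are semantically similar (their distance is below a threshold), they go into the same chunk. The result is chunks of varying length that follow the meaning of the content.

Pros: Produces more coherent, meaningful chunks, which improves retrieval.
Cons: Requires more computation (for example, a BERT-based encoder).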
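
The thresholding idea can be sketched without any embedding model. In the sketch below, `embed` is a toy bag-of-words stand-in for a real sentence encoder, and the function names and the 0.7 threshold are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    """Toy stand-in for a sentence encoder: a bag-of-words count vector."""
    return Counter(sentence.lower().replace(".", "").split())

def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def semantic_chunks(sentences, threshold=0.7):
    """Start a new chunk when consecutive sentences are at least `threshold` apart."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine_distance(embed(prev), embed(cur)) < threshold:
            chunks[-1].append(cur)      # similar enough: same chunk
        else:
            chunks.append([cur])        # topic shift: open a new chunk
    return [" ".join(c) for c in chunks]

sentences = [
    "The graph links nearby points.",
    "The graph has several layers.",
    "Metadata filters narrow the search.",
    "Metadata filters use dates.",
]
chunks = semantic_chunks(sentences)
for chunk in chunks:
    print(chunk)
```

With a real encoder the structure stays the same; only `embed` changes, and the threshold usually needs tuning on your own data.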

Semantic Chunking in Practice

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# `document` is assumed to be a string holding the raw text to split
text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([document])
print(docs[0].page_content)

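This code uses LangChain's SemanticChunker, which relies on OpenAI embeddings to split a document into semantically related chunks. Each chunk it creates is meant to capture a coherent semantic unit rather than an arbitrary slice of text.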

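3. Language-Model-Based Chunking

This advanced approach uses a language model to create complete statements from the text, so that every chunk is semantically complete. A language model (for example, a 7-billion-parameter model) processes the text and breaks it into statements that make sense on their own. The model then combines these statements into chunks, balancing completeness against context. The method is computationally heavy but highly accurate.

Pros: Adapts to the nuances of the text and produces high-quality chunks.
Cons: Computationally expensive; may need fine-tuning for specific uses.

Language-Model-Based Chunking in Practice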

import asyncio

from openai import AsyncOpenAI

# Assumes the OPENAI_API_KEY environment variable is set
client = AsyncOpenAI()

async def generate_contexts(document, chunks):
    async def process_chunk(chunk):
        # Ask the LLM for a short context that situates this chunk in the document
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Generate a brief context explaining how this chunk relates to the full document."},
                {"role": "user", "content": f"<document> \n{document} \n</document> \nHere is the chunk we want to situate within the whole document \n<chunk> \n{chunk} \n</chunk> \nPlease give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."}
            ],
            temperature=0.3,
            max_tokens=100
        )
        context = response.choices[0].message.content
        # Prepend the generated context to the original chunk
        return f"{context} {chunk}"

    # Process all chunks concurrently
    contextual_chunks = await asyncio.gather(
        *[process_chunk(chunk) for chunk in chunks]
    )
    return contextual_chunks

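This code uses an LLM (GPT-4o, called through client.chat.completions.create) to generate contextual information for each chunk of a document. It processes the chunks asynchronously, prompting the LLM to explain how each chunk relates to the full document. Finally, it returns a list of the original chunks with the generated context prepended, which enriches the chunks and improves search retrieval.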

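4. Using Metadata: Adding Context

Adding and filtering with metadata

Metadata provides additional context, which can improve retrieval accuracy. By including metadata such as dates, patient age, and pre-existing conditions, you can filter out irrelevant information at search time. Filtering narrows the search, making retrieval both more efficient and more relevant. When indexing, store the metadata alongside the text.

For example, healthcare data includes age, visit date, and specific conditions in patient records. Use this metadata to filter search results so that the system retrieves only relevant information. If a query concerns children, for instance, filter out the records of patients over 18. This reduces noise and improves relevance.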
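
A minimal sketch of this kind of pre-filtering is shown below. The `Chunk` class, the `filter_by_metadata` helper, and the patient records are all hypothetical; a real vector database would normally apply the filter inside the query itself.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def filter_by_metadata(chunks, predicate):
    """Keep only chunks whose metadata satisfies the predicate."""
    return [c for c in chunks if predicate(c.metadata)]

# Hypothetical records, stored together with their metadata at indexing time
records = [
    Chunk("Asthma follow-up visit.", {"age": 9, "visit_date": "2024-03-01", "condition": "asthma"}),
    Chunk("Hypertension check.", {"age": 54, "visit_date": "2024-02-11", "condition": "hypertension"}),
    Chunk("Routine vaccination.", {"age": 4, "visit_date": "2023-11-20", "condition": "none"}),
]

# Pediatric query: drop records of patients over 18 before any similarity search
pediatric = filter_by_metadata(records, lambda m: m["age"] <= 18)
pediatric_texts = [c.text for c in pediatric]
print(pediatric_texts)
```

Only the remaining chunks would then be scored against the query embedding, which is where the reduction in noise comes from.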

Example

Chunk #1

Source Metadata: {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:26e9aac7d5494208a56ff0c6cbbfda20', 'source': 'https://plato.stanford.edu/entries/goedel/'}

Source text:

2.2.1 The First Incompleteness Theorem
In his Logical Journey (Wang 1996) Hao Wang published the full text of material Gödel had written (at Wang’s request) about his discovery of the incompleteness theorems. This material had formed the basis of Wang’s “Some Facts about Kurt Gödel,” and was read and approved by Gödel:

Chunk #2

Source Metadata: {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:d15f62c453c64072b768e136080cb5ba', 'source': 'https://plato.stanford.edu/entries/goedel/'}

Source text:

The First Incompleteness Theorem provides a counterexample to completeness by exhibiting an arithmetic statement which is neither provable nor refutable in Peano arithmetic, though true in the standard model. The Second Incompleteness Theorem shows that the consistency of arithmetic cannot be proved in arithmetic itself. Thus Gödel’s theorems demonstrated the infeasibility of the Hilbert program, if it is to be characterized by those particular desiderata, consistency and completeness.

In this example, the metadata contains each chunk's unique ID and source, which gives the chunk additional context and makes it easier to retrieve.

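5. Generating Metadata with GLiNER

Rich metadata is not always available, but a model such as GLiNER can generate it on the fly. GLiNER tags and labels chunks during ingestion, producing metadata as it goes.

Implementation

Give GLiNER the labels to look for in each chunk. When it finds a match for a label, it tags the text accordingly. GLiNER generally works well out of the box, but niche datasets may require fine-tuning. GLiNER can also parse incoming queries and match them against metadata tags for filtering.

GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer

Together, these techniques build a robust RAG system that retrieves efficiently from large datasets. The right choice of chunking and metadata strategy depends on the specific needs and characteristics of your dataset.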

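Retrieval: Finding the Right Information

Now let's focus on the "R" in RAG. How can we improve retrieval from a vector database? The goal is to retrieve all the documents relevant to the query, which greatly increases the chance that the LLM produces high-quality results. Here are several techniques: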

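6. Hybrid Search

Hybrid search combines vector search (which finds semantic meaning) with keyword search (which finds exact matches), drawing on the strengths of both. In AI, many terms are specific keywords: algorithm names, technical terms, LLM names. Vector search alone may miss them, while keyword search makes sure these important terms are taken into account. Combining the two gives a more complete retrieval process. Both searches run at the same time.

The results are merged and ranked with a weighting scheme. With Weaviate, for example, you can adjust the alpha parameter to balance vector and keyword results, producing a single merged, ranked list.

Pros: Balances precision and recall, improving retrieval quality.
Cons: The weights need careful tuning.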
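
The weighting step can be sketched as a simple score fusion. This is one plausible scheme (min-max normalization followed by an alpha-weighted sum) and is not claimed to match Weaviate's own fusion algorithms in detail; the function names, document IDs, and scores are hypothetical.

```python
def min_max(scores: dict) -> dict:
    """Rescale scores to [0, 1] so vector and keyword scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 1.0 for k, v in scores.items()}

def hybrid_rank(vector_scores, keyword_scores, alpha=0.5):
    """alpha=1 uses only vector scores, alpha=0 only keyword scores."""
    v, k = min_max(vector_scores), min_max(keyword_scores)
    fused = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
    # Highest fused score first; ties broken by document ID for determinism
    return sorted(fused, key=lambda d: (-fused[d], d))

vector_scores = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}   # e.g. cosine similarity
keyword_scores = {"doc_b": 7.1, "doc_c": 9.3, "doc_d": 2.0}     # e.g. BM25

balanced = hybrid_rank(vector_scores, keyword_scores, alpha=0.5)
vector_only = hybrid_rank(vector_scores, keyword_scores, alpha=1.0)
keyword_only = hybrid_rank(vector_scores, keyword_scores, alpha=0.0)
print(balanced, vector_only, keyword_only)
```

A document that scores reasonably well in both searches (here `doc_b`) tends to rise in the balanced ranking even if it tops neither list on its own.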

Hybrid Search in Practice

from langchain_community.retrievers import WeaviateHybridSearchRetriever

# `client` is assumed to be an already-connected Weaviate client instance
retriever = WeaviateHybridSearchRetriever(
    client=client,
    index_name="LangChain",
    text_key="text",
    attributes=[],
    create_schema_if_missing=True,
)
retriever.invoke("the ethical implications of AI")

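This code initializes a WeaviateHybridSearchRetriever to retrieve documents from a Weaviate vector database. It relies on Weaviate's hybrid search capability, which combines vector search and keyword search. Finally, it runs the query "the ethical implications of AI" and retrieves relevant documents with this hybrid approach.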

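7. Query Rewriting

Human-written queries may not be the best fit for a database or a language model. Rewriting the query with a language model can noticeably improve retrieval.

Query rewriting for vector databases: This turns the user's initial query into a database-friendly form. For example, "What are AI agents and why are they the next big thing in 2025" might become "AI agents big thing 2025". Any LLM can be used to rewrite the query so that it captures the important aspects.
Prompt rewriting for language models: This means automatically creating prompts that optimize the interaction with the language model, which can improve the quality and accuracy of the results. Frameworks such as DSPy, or any LLM, can help with the rewriting. The rewritten queries and prompts help the search retrieve relevant documents and prompt the language model effectively.

Multi-Query Retrieval

Retrieval can return different results for subtle changes in how a query is worded. The problem becomes more pronounced when the embeddings do not capture the meaning of the data well. Prompt engineering or tuning is commonly used to address this, but the process can be time-consuming.

MultiQueryRetriever simplifies the task. It uses a large language model (LLM) to create several queries from different perspectives based on a single user input. For each generated query, it retrieves a set of relevant documents. By taking the unique union of the results across all queries, the multi-query retriever provides a broader set of potentially relevant documents. This raises the chance of finding useful information without extensive manual tuning.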
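
Independent of any framework, the merge step (a unique union across query variants) can be sketched as follows. The query variants and the tiny in-memory `index` are hypothetical stand-ins for LLM-generated rewrites and a real retriever, and `multi_query_retrieve` is an illustrative helper, not a LangChain function.

```python
def multi_query_retrieve(queries, retrieve):
    """Run every query variant and return the union of results in first-seen order."""
    seen, merged = set(), []
    for q in queries:
        for doc in retrieve(q):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical retriever results for three rewrites of the same question
index = {
    "capital of india": ["New Delhi", "States and union territories of India"],
    "india capital city": ["New Delhi", "Kolkata"],
    "seat of indian government": ["New Delhi"],
}
variants = list(index)
docs = multi_query_retrieve(variants, lambda q: index[q])
print(docs)
```

Each variant contributes documents the others missed, while duplicates such as "New Delhi" appear only once in the merged list.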

import logging

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)

# `chroma_db3` is assumed to be an existing Chroma vector store
similarity_retriever3 = chroma_db3.as_retriever(search_type="similarity",
                                                search_kwargs={"k": 2})
mq_retriever = MultiQueryRetriever.from_llm(
    retriever=similarity_retriever3, llm=chatgpt,
    include_original=True
)

# Enable logging so we can see which queries the LLM generates
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

query = "what is the capital of India?"
docs = mq_retriever.invoke(query)
docs

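This code sets up a multi-query retrieval system with LangChain. It generates several variants of the input query ("what is the capital of India?"). The similarity retriever then uses these variants to query a Chroma vector database (chroma_db3), aiming to widen the search and capture a range of relevant documents. The MultiQueryRetriever finally merges and returns the retrieved documents.

Output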

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7\xa0km. New Delhi has a population of about 9.4 Million people.'), Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content="Kolkata (spelled Calcutta before 1 January 2001) is the capital city of the Indian state of West Bengal. It is the second largest city in India after Mumbai. It is on the east bank of the River Hooghly. When it is called Calcutta, it includes the suburbs. This makes it the third largest city of India. This also makes it the world's 8th largest metropolitan area as defined by the United Nations. Kolkata served as the capital of India during the British Raj until 1911. Kolkata was once the center of industry and education. However, it has witnessed political violence and economic problems since 1954. Since 2000, Kolkata has grown due to economic growth. Like other metropolitan cities in India, Kolkata struggles with poverty, pollution and traffic congestion."), Document(metadata={'article_id': '22215', 'title': 'States and union territories of India'}, page_content='The Republic of India is divided into twenty-eight States,and eight union territories including the National Capital Territory.')]

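8. Contextual Compression Retrieval with LLM Prompts

Contextual compression helps improve the relevance of retrieved documents. It works in two main ways:

Extracting relevant content: Remove the parts of a retrieved document that are unrelated to the query, keeping only the parts that answer the question.
Filtering out irrelevant documents: Exclude documents unrelated to the query without changing the content of the documents themselves.

For the first, we can use LLMChainExtractor, which reviews the initially returned documents and extracts only the content relevant to the query. It can also drop documents that are entirely irrelevant.

Here is how to implement it with LangChain: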

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI
# Initialize the language model
chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)
# Set up a similarity retriever
similarity_retriever = chroma_db3.as_retriever(search_type="similarity", search_kwargs={"k": 3})
# Create the extractor to get relevant content
compressor = LLMChainExtractor.from_llm(llm=chatgpt)
# Combine the retriever and the extractor
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=similarity_retriever)
# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

Output

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi is the capital of India and a union territory of the megacity of Delhi.')]

For a different query:

query = "What is the old capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

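Output

[Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content='Kolkata served as the capital of India during the British Raj until 1911.')]

LLMChainFilter offers a simpler yet effective way to filter documents. It uses an LLM chain to decide which documents to keep and which to discard, without altering the content of the documents.

Here is how to implement the filter: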

from langchain.retrievers.document_compressors import LLMChainFilter
# Set up the filter
_filter = LLMChainFilter.from_llm(llm=chatgpt)
# Combine the retriever and the filter
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=similarity_retriever)
# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

Output

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi is the capital of India and a union territory of the megacity of Delhi.')]

For another query:

query = "What is the old capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

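Output

[Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content='Kolkata served as the capital of India during the British Raj until 1911.')]

These strategies refine retrieval by focusing on relevant content. LLMChainExtractor extracts only the necessary parts of each document, while LLMChainFilter decides which documents to keep. Both improve the quality of the retrieved information and make it more relevant to the user's query.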

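9. Fine-Tuning Embedding Models

Pre-trained embedding models are a good starting point. Fine-tuning them on your own data can greatly improve retrieval.

Choosing the right model: For specialized fields such as medicine, pick a model pre-trained on relevant data. For example, you can use the MedCPT family of query and document encoders, which were pre-trained at scale on 255 million query-article pairs from PubMed search logs.

Fine-tuning with positive and negative pairs: Collect your own data and build pairs of similar (positive) and dissimilar (negative) examples. Fine-tune the model to tell them apart. This helps the model learn domain-specific relationships and so improves retrieval.

Pros: Improves retrieval performance.
Cons: Requires carefully curated training data.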
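
To show what "learning the difference" between positive and negative pairs can mean numerically, here is a small sketch of a triplet margin loss on cosine similarity. This is one common training objective, assumed here for illustration; the article does not say which loss to use, and the two-dimensional vectors are made up.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive beats the negative by at least `margin` in cosine similarity."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

query = (1.0, 0.0)

# Before fine-tuning (hypothetical): the irrelevant document sits closer to the query
loss_before = triplet_loss(query, positive=(0.6, 0.8), negative=(0.8, 0.6))

# After fine-tuning (hypothetical): the relevant document is clearly closer
loss_after = triplet_loss(query, positive=(0.8, 0.6), negative=(0.0, 1.0))
print(loss_before, loss_after)
```

Training nudges the encoder's weights to reduce this loss across many such triplets, which pulls relevant pairs together and pushes unrelated ones apart in the embedding space.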

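Combined, these techniques build a strong retrieval system. It improves the relevance of the objects handed to the LLM, and with it the quality of generation.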

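Generating High-Quality Responses

Finally, let's discuss how to improve the generation quality of the language model (LLM). The goal is to give the LLM context that is as relevant to the prompt as possible, because irrelevant data can trigger hallucinations. Here are techniques for better generation:

10. Autocut: Removing Irrelevant Information

Autocut filters out irrelevant information fetched from the database, which keeps the LLM from being misled.

Retrieve and score similarity: When a query is run, several objects are retrieved, each with a similarity score.
Identify and cut: Use the similarity scores to find the cutoff point where the score drops sharply, and exclude the objects beyond it. This ensures that only the most relevant information reaches the LLM. For example, if you retrieve six objects, the score may fall steeply after the fourth. By looking at the rate of change, you can decide which objects to exclude.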
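
The cutoff logic can be sketched in a few lines. The `autocut` helper below and its fixed `min_drop` threshold are assumptions for illustration; production implementations, including the one in Weaviate, may detect the jump differently. The scores mirror the six-object example above.

```python
def autocut(scored_docs, min_drop=0.1):
    """Keep results up to the first large drop in score between neighbors.

    `scored_docs` is a list of (doc, score) pairs sorted by descending score.
    """
    for i in range(1, len(scored_docs)):
        drop = scored_docs[i - 1][1] - scored_docs[i][1]
        if drop >= min_drop:
            return scored_docs[:i]      # cut just before the sharp drop
    return scored_docs                  # no sharp drop: keep everything

# Hypothetical similarity scores: a steep fall after the fourth object
results = [("d1", 0.91), ("d2", 0.89), ("d3", 0.88), ("d4", 0.86), ("d5", 0.52), ("d6", 0.50)]
kept = [doc for doc, _ in autocut(results)]
print(kept)
```

Because the cut depends on the shape of the score curve rather than a fixed top-k, the number of objects passed to the LLM adapts to each query.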

Hands-on

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# `docs` is assumed to be a list of Document objects to index
vectorstore = PineconeVectorStore.from_documents(
    docs, index_name="sample", embedding=OpenAIEmbeddings()
)

@chain
def retriever(query: str) -> List[Document]:
    # Fetch documents together with their similarity scores
    docs, scores = zip(*vectorstore.similarity_search_with_score(query))
    # Store each score in the document's metadata
    for doc, score in zip(docs, scores):
        doc.metadata["score"] = score
    return list(docs)

result = retriever.invoke("dinosaur")
result

This snippet runs a similarity search with LangChain and Pinecone. It embeds the documents with OpenAI embeddings, stores them in a Pinecone vector store, and defines a retriever function. The retriever searches for documents similar to the given query ("dinosaur"), obtains their similarity scores, and adds the scores to the document metadata before returning the results.

Output

[Document(page_content='In her second book, Dr. Simmons delves deeper into the ethical considerations surrounding AI development and deployment. It is an eye-opening examination of the dilemmas faced by developers, policymakers, and society at large.', metadata={}), Document(page_content='A comprehensive analysis of the evolution of artificial intelligence, from its inception to its future prospects. Dr. Simmons covers ethical considerations, potentials, and threats posed by AI.', metadata={}), Document(page_content="In his follow-up to 'Symbiosis', Prof. Sterling takes a look at the subtle, unnoticed presence and influence of AI in our everyday lives. It reveals how AI has become woven into our routines, often without our explicit realization.", metadata={}), Document(page_content='Prof. Sterling explores the potential for harmonious coexistence between humans and artificial intelligence. The book discusses how AI can be integrated into society in a beneficial and non-disruptive manner.', metadata={})]

The retriever attaches a similarity score to each document, and we can cut the list off based on a threshold.

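11. Reranking Retrieved Objects

Reranking uses a more advanced model to re-evaluate and reorder the objects retrieved initially, which improves the quality of the final retrieved set.

Over-fetch: Initially retrieve more objects than you need.
Apply a ranker model: Re-assess relevance with a higher-latency model, usually a cross-encoder. The model considers the query and each object as a pair to re-evaluate their similarity.
Reorder the results: Reorder the objects according to the new assessment, placing the most relevant results first. This ensures the most relevant documents get priority and improves the data given to the LLM.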
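
The three steps can be sketched without a model. Here `overlap_score` is a toy stand-in for a cross-encoder (it only counts shared terms), and the candidate texts are hypothetical; what matters is the structure: over-fetch, score each (query, candidate) pair jointly, and keep the best few.

```python
def overlap_score(query: str, text: str) -> float:
    """Toy relevance scorer standing in for a cross-encoder: share of query terms found in the text."""
    q_terms = set(query.lower().split())
    return len(q_terms & set(text.lower().split())) / len(q_terms)

def rerank(query, candidates, scorer, top_n=2):
    """Score each (query, candidate) pair and keep the best top_n."""
    return sorted(candidates, key=lambda text: scorer(query, text), reverse=True)[:top_n]

# Over-fetched candidates, in the order the first-stage retriever returned them
candidates = [
    "vector databases store embeddings",
    "cross encoders score query and document together",
    "a cross encoder can rerank retrieved documents",
    "keyword search matches exact terms",
]
top = rerank("rerank documents with a cross encoder", candidates, overlap_score)
print(top)
```

Swapping `overlap_score` for a real cross-encoder keeps the pipeline unchanged while making the second-stage scores far more meaningful, at the cost of extra latency per candidate.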

Reranking Retrieved Objects in Practice

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank

def pretty_print_docs(docs):
    # Helper (assumed): print each document separated by a dashed line
    print(
        f"\n{'-' * 100}\n".join(
            f"Document {i + 1}:\n\n{d.page_content}" for i, d in enumerate(docs)
        )
    )

# `retriever` is assumed to be an existing base retriever (e.g. a vector store retriever)
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
print([doc.metadata["id"] for doc in compressed_docs])
pretty_print_docs(compressed_docs)

This snippet uses FlashrankRerank inside a ContextualCompressionRetriever to improve the relevance of retrieved documents. It reranks the documents fetched by the base retriever (the `retriever` variable) according to their relevance to the query "What did the president say about Ketanji Jackson Brown". Finally, it prints the document IDs and the compressed, reranked documents.

Output

[0, 5, 3]
Document 1:
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.
In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
----------------------------------------------------------------------------------------------------
Document 3:
And tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud.
By the end of this year, the deficit will be down to less than half what it was before I took office.
The only president ever to cut the deficit by more than one trillion dollars in a single year.
Lowering your costs also means demanding more competition.
I’m a capitalist, but capitalism without competition isn’t capitalism.
It’s exploitation—and it drives up prices.

The output shows the retrieved chunks reordered by relevance.

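12. Fine-Tuning the LLM

Fine-tuning an LLM on domain-specific data can greatly improve its performance. Take a model such as Meditron 70B, a version of LLaMA 2 70B fine-tuned on medical data using both of the following methods:

Unsupervised fine-tuning: Continue pre-training on a large body of domain-specific text (such as PubMed literature).

Supervised fine-tuning: Refine the model further with supervised learning on domain-specific tasks (such as medical multiple-choice questions). This specialized training helps the model perform well in the target domain. On specific tasks, the model outperforms both its base model and larger, less specialized models such as GPT-3.5.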

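Supervised fine-tuning

Source: Link

This figure shows the process of fine-tuning on task-specific examples. The approach lets developers specify the desired outputs, encourage certain behaviors, or gain better control over the model's responses.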

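13. Using RAFT: Adapting Language Models to Domain-Specific RAG

RAFT, short for Retrieval-Augmented Fine-Tuning, is a method for improving how large language models (LLMs) work in a specific domain. It helps these models answer questions more accurately using relevant information from documents.

Retrieval-augmented fine-tuning: RAFT combines fine-tuning with retrieval. During training, the model learns from both helpful and unhelpful documents.
Chain-of-thought reasoning: The model generates answers that show its reasoning, which helps it give clear, accurate responses based on the retrieved documents.
Dynamic document handling: RAFT trains the model to find and use the most relevant documents while ignoring those that do not help answer the question.

RAFT Architecture

The RAFT architecture includes several key components:

Input layer: The model receives a question (Q) and a set of retrieved documents (D), which includes both relevant and irrelevant documents.
Processing layer:
- The model analyzes the input to find the important information in the documents.
- It creates an answer (A*) that references the relevant documents.
Output layer: The model generates the final answer from the relevant documents while ignoring the irrelevant ones.
Training mechanism: During training, some examples include both relevant and irrelevant documents, while others include only irrelevant documents. This setup encourages the model to pay attention to the context rather than rely on rote memorization.
Evaluation: The model is evaluated on its ability to answer questions accurately using the retrieved documents.

With this architecture, RAFT strengthens the model's ability to work in a specific domain and offers a reliable way to generate accurate, relevant answers.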
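
The training mechanism described above can be sketched as a data-construction step. The helper `build_raft_example`, the 0.8 probability of including the relevant ("oracle") document, and the placeholder texts are all assumptions for illustration; the right ratio and context size depend on your data.

```python
import random

def build_raft_example(question, oracle_doc, distractors, answer, p_oracle, k, rng):
    """Build one training record; the oracle document is included only with probability p_oracle."""
    if rng.random() < p_oracle:
        context = [oracle_doc] + rng.sample(distractors, k - 1)   # relevant + irrelevant
    else:
        context = rng.sample(distractors, k)                      # irrelevant only
    rng.shuffle(context)                                          # hide the oracle's position
    return {"question": question, "context": context, "answer": answer}

rng = random.Random(42)
distractors = [f"distractor {i}" for i in range(10)]
dataset = [
    build_raft_example("Q?", "oracle passage", distractors, "A* with reasoning", 0.8, 4, rng)
    for _ in range(1000)
]
with_oracle = sum("oracle passage" in ex["context"] for ex in dataset)
print(with_oracle, len(dataset))
```

Each record pairs a question and a mixed context with a reasoned answer (A*), so the fine-tuned model sees both cases: contexts where the answer can be read from a relevant document and contexts made up of distractors only.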

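RAFT architecture

Source: link

The top-left panel depicts the approach of adapting LLMs to read the solution from a set of positive and distractor documents, in contrast to the standard RAG setup, where models are trained on the retriever's output, which mixes memorization and reading. At test time, all methods follow the standard RAG setup, with the top-k retrieved documents provided in the context.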

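Summary

Improving retrieval and generation in RAG systems is essential for better AI applications. The techniques discussed range from low-effort, high-impact methods (query rewriting, reranking) to more intensive processes (fine-tuning embeddings and LLMs). The best technique depends on the specific needs and constraints of your application. Applied thoughtfully, advanced RAG techniques let developers build AI systems that are more accurate, more reliable, and more context-aware, and that can handle complex information needs.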
