13 Advanced RAG Techniques for Your Next Project

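Can AI generate truly relevant answers at scale? How do we make sure it follows complex, multi-turn conversations? And how do we keep it from confidently stating wrong facts? These are the challenges facing modern AI systems, especially those built with retrieval-augmented generation (RAG). RAG combines the power of document retrieval with the fluency of language generation, so a system can answer questions grounded in context. Basic RAG systems handle simple tasks well, but they often struggle with complex queries, hallucinations, and keeping context across long interactions. That is where advanced RAG techniques come in.

In this article, we look at how to level up a RAG pipeline by strengthening each stage of the stack: indexing, retrieval, and generation. We walk through a set of practical methods, with hands-on code, that improve relevance, reduce noise, and raise overall system performance, whether you are building a healthcare assistant, an educational tutor, or an enterprise knowledge bot.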

Where Does Basic RAG Fall Short?

Let's start with the basic RAG framework:

RAG framework

Source: Dipanjan Sarkar

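This architecture shows how chunk embeddings end up in a vector store. Documents are loaded, split into chunks using one of several chunking techniques, and then embedded with an embedding model so that relevant chunks can later be found by similarity search.

The diagram also covers the retrieval and generation steps: the user asks a question, and the system searches the vector store for results that match it. The retrieved content is then passed to the LLM together with the question, and the LLM produces a structured output.

Basic RAG systems have clear limitations, especially in demanding settings.

Hallucinations: A major problem is hallucination, where the model produces content that is factually wrong or unsupported by the source documents. This undermines reliability, particularly in fields such as medicine or law where precision is critical.
Lack of domain specificity: Standard RAG models struggle with specialized topics. Unless retrieval and generation are adapted to the specifics of the domain, the system risks surfacing irrelevant or inaccurate information.
Complex conversations: Basic RAG systems have trouble with complex queries and multi-turn conversations. They often lose context during the interaction, which leads to incoherent or incomplete answers, even as the queries they must handle keep growing more complex.

We will therefore go through each part of the RAG stack (indexing, retrieval, and generation) and discuss improvements that rely on open-source libraries and resources. These advanced techniques apply broadly, whether to a medical chatbot, an educational bot, or another application, and they should improve most RAG systems.

Let's get started with advanced RAG techniques!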

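Indexing and Chunking: Building a Solid Foundation

Good indexing is essential for any RAG system. This first step covers how we ingest, split, and store the data. Let's look at ways to index data, focusing on indexing and chunking text and on using metadata.

1. HNSW

Hierarchical Navigable Small World (HNSW) is an efficient algorithm for finding similar items in large datasets. It uses a structured, graph-based approach to quickly locate approximate nearest neighbors (ANN).

Proximity graph: HNSW builds a graph in which each point is connected to nearby points. This structure enables efficient search.
Hierarchical structure: The algorithm organizes points into multiple layers. The top layers connect points that are far apart, while the lower layers connect close neighbors. This layout speeds up search.
Greedy routing: HNSW searches for neighbors greedily. It starts from a point in a high layer and keeps moving to the nearest neighbor until it reaches a local minimum. This shortens the time needed to find similar items.

How Does HNSW Work?

HNSW involves several key pieces:

Input layer: Each data point is represented as a vector in a high-dimensional space.
Graph construction:
- Nodes are added to the graph one at a time.
- Each node is assigned to a layer by a probability function, which determines how likely the node is to be placed in a higher layer.
- The algorithm balances the number of connections against search speed.
Search process:
- Search starts from a designated entry point in the top layer.
- At each step, the algorithm moves to the nearest neighbor.
- Once it reaches a local minimum, it drops to the next layer down and continues until it finds the closest point in the bottom layer.
Parameters:
- M: the number of neighbors connected to each node.
- efConstruction: how many candidate neighbors the algorithm considers while building the graph.
- efSearch: how many candidate neighbors are evaluated during search.

This design lets HNSW find similar items quickly and with high accuracy, which makes it a strong choice for tasks that need efficient search over large datasets.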
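The "probability function" used during graph construction can be sketched in a few lines. The version below assumes the exponentially decaying rule from the original HNSW paper, where a node's top layer is floor(-ln(u) * mL) with mL = 1 / ln(M). The function name `assign_level` and the sample size are illustrative only, and real libraries such as FAISS implement this internally.

```python
import math
import random
from collections import Counter


def assign_level(M: int, rng: random.Random) -> int:
    """Draw the top layer for a new node (0 is the bottom layer)."""
    m_l = 1.0 / math.log(M)         # level normalization factor (assumed: 1 / ln(M))
    u = 1.0 - rng.random()          # uniform in (0, 1], which avoids log(0)
    return int(-math.log(u) * m_l)  # truncation equals floor for non-negative values


rng = random.Random(0)
levels = [assign_level(32, rng) for _ in range(10_000)]
level_counts = Counter(levels)

# Most nodes stay on layer 0; only about 1 in M reaches layer 1 or higher.
print(sorted(level_counts.items()))
```

Because so few nodes reach the upper layers, those layers stay sparse and act as an express lane, while the dense bottom layer holds every point.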

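HNSW search

Source: Link

The figure shows a simplified HNSW search: starting from the "entry point" (blue), the algorithm navigates the graph toward the "query vector" (yellow). The "nearest neighbor" (striped) is found by traversing edges based on proximity. This illustrates the core idea of navigating a graph for efficient approximate nearest neighbor search.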
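To make the greedy walk concrete, here is a small sketch of the search on a single layer. It is a simplification: a real HNSW search keeps a candidate list of size efSearch and repeats the walk layer by layer, while this version follows one path. The toy graph, the coordinates, and the name `greedy_search` are made up for illustration.

```python
import math


def greedy_search(vectors, neighbors, entry, query):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = math.dist(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for n in neighbors[current]:
            d = math.dist(vectors[n], query)
            if d < best_dist:
                best, best_dist = n, d
        if best == current:  # local minimum: no neighbor is closer to the query
            return current, current_dist
        current, current_dist = best, best_dist


# Toy 2-D points and their adjacency lists (hypothetical data)
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (2.0, 1.0), 4: (3.0, 1.0)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

node, dist = greedy_search(vectors, neighbors, entry=0, query=(2.9, 1.1))
print(node, round(dist, 3))
```

Starting at node 0, the walk hops through nodes 1, 2, and 3 before stopping at node 4, the point closest to the query.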

HNSW in Practice

Follow the steps below to run HNSW with FAISS. The guide includes sample output to illustrate the process.

Step 1: Set the HNSW parameters

First, define the parameters of the HNSW index: the dimensionality of the vectors and the number of neighbors per node.

import faiss
import numpy as np

# Set up HNSW parameters
d = 128  # Dimensionality of the vectors
M = 32   # Number of neighbors for each node

Step 2: Initialize the HNSW index

Create the HNSW index using the parameters defined above.

# Initialize the HNSW index
index = faiss.IndexHNSWFlat(d, M)

Step 3: Set efConstruction

Before adding data to the index, set the `efConstruction` parameter. It controls how many neighbors the algorithm considers while building the index.

efConstruction = 200  # Example value for efConstruction
index.hnsw.efConstruction = efConstruction

Step 4: Generate sample data

For this example, generate random data to index. Here, `xb` is the dataset to be indexed.

# Generate random dataset of vectors
n = 10000  # Number of vectors to index
xb = np.random.random((n, d)).astype('float32')
# Add data to the index
index.add(xb)  # Build the index

Step 5: Set efSearch

Once the index is built, set the `efSearch` parameter. It controls how many candidates are evaluated during search.

efSearch = 100  # Example value for efSearch
index.hnsw.efSearch = efSearch

Step 6: Run a search

You can now search for the nearest neighbors of your query vectors. Here, `xq` holds the query vectors.

# Generate random query vectors
nq = 5  # Number of query vectors
xq = np.random.random((nq, d)).astype('float32')
# Perform a search for the top k nearest neighbors
k = 5  # Number of nearest neighbors to retrieve
distances, indices = index.search(xq, k)
# Output the results
print("Query Vectors:\n", xq)
print("\nNearest Neighbors Indices:\n", indices)
print("\nNearest Neighbors Distances:\n", distances)

Sample output

Query Vectors:
 [[0.12345678 0.23456789 ... 0.98765432]
 [0.23456789 0.34567890 ... 0.87654321]
 [0.34567890 0.45678901 ... 0.76543210]
 [0.45678901 0.56789012 ... 0.65432109]
 [0.56789012 0.67890123 ... 0.54321098]]

Nearest Neighbors Indices:
 [[ 123  456  789  101  112]
 [ 234  567  890  123  134]
 [ 345  678  901  234  245]
 [ 456  789  012  345  356]
 [ 567  890  123  456  467]]

Nearest Neighbors Distances:
 [[0.123 0.234 0.345 0.456 0.567]
 [0.234 0.345 0.456 0.567 0.678]
 [0.345 0.456 0.567 0.678 0.789]
 [0.456 0.567 0.678 0.789 0.890]
 [0.567 0.678 0.789 0.890 0.901]]

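2. Semantic Chunking

This approach splits text by meaning rather than by a fixed size, so that each chunk represents one coherent piece of information. We compute the cosine distance between sentence embeddings. If two sentences are semantically similar (their distance falls below a threshold), they go into the same chunk. The result is chunks of varying length that follow the meaning of the content.

Pros: Produces more coherent, meaningful chunks and improves retrieval.
Cons: Requires more computation (for example, a BERT-based encoder).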
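The grouping rule described above can be sketched without any embedding model. The example below assumes a fixed cosine-distance threshold and uses hand-written 2-D vectors in place of real sentence embeddings. The function names and the threshold value are illustrative, and library implementations may pick their breakpoints differently (for instance, from a percentile of the observed distances).

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def semantic_chunks(sentences, embeddings, threshold=0.3):
    """Start a new chunk whenever adjacent sentences are farther apart than threshold."""
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) < threshold:
            chunks[-1].append(sentences[i])   # similar: extend the current chunk
        else:
            chunks.append([sentences[i]])     # dissimilar: open a new chunk
    return [" ".join(chunk) for chunk in chunks]


sentences = [
    "HNSW builds a layered graph.",
    "Search starts at the top layer.",
    "Metadata narrows the search.",
]
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0)]  # toy vectors, not real embeddings

chunks = semantic_chunks(sentences, embeddings)
print(chunks)
```

The first two sentences point in almost the same direction and share a chunk, while the third is far enough away to start a new one.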

Semantic Chunking in Practice

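from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# `document` is the raw text to split, loaded beforehand
text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([document])
print(docs[0].page_content)

This code uses LangChain's SemanticChunker, which relies on OpenAI embeddings to split a document into semantically related chunks. Each resulting chunk aims to capture a coherent unit of meaning rather than an arbitrary slice of text.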

3. Language Model-Based Chunking

This more advanced approach uses a language model to turn the text into complete, self-contained statements, so that every chunk is semantically complete. A language model (for example, a 7-billion-parameter model) processes the text and breaks it into statements that make sense on their own. The model then combines these statements into chunks, balancing completeness against context. The method is computationally heavy but highly accurate.

Pros: Adapts to the nuances of the text and produces high-quality chunks.
Cons: Computationally expensive, and may need fine-tuning for specific uses.

Language Model-Based Chunking in Practice

import asyncio

from openai import AsyncOpenAI

# Async client; reads the API key from the OPENAI_API_KEY environment variable
client = AsyncOpenAI()

async def generate_contexts(document, chunks):
    async def process_chunk(chunk):
        # Ask the model for a short context that situates the chunk in the document
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Generate a brief context explaining how this chunk relates to the full document."},
                {"role": "user", "content": f"<document> \n{document} \n</document> \nHere is the chunk we want to situate within the whole document \n<chunk> \n{chunk} \n</chunk> \nPlease give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."}
            ],
            temperature=0.3,
            max_tokens=100
        )
        context = response.choices[0].message.content
        # Prepend the generated context to the original chunk
        return f"{context} {chunk}"

    # Process all chunks concurrently
    contextual_chunks = await asyncio.gather(
        *[process_chunk(chunk) for chunk in chunks]
    )
    return contextual_chunks

This code uses an LLM (here OpenAI's GPT-4o, called through client.chat.completions.create) to generate context for each chunk of a document. It processes the chunks asynchronously, prompting the LLM to explain how each one relates to the full document. It returns the list of original chunks with the generated context prepended, which enriches them and improves search retrieval.

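4. Using Metadata: Adding Context

Adding and Filtering with Metadata

Metadata provides additional context, which can improve retrieval accuracy. By including metadata such as dates, patient age, and pre-existing conditions, you can filter out irrelevant information at search time. Filtering narrows the search space and makes retrieval both more efficient and more relevant. Store the metadata alongside the text when you index.

For example, healthcare data includes age, visit date, and specific conditions in patient records. Use this metadata to filter search results so that the system retrieves only relevant information. If a query concerns children, for instance, filter out the records of patients over 18. This reduces noise and improves relevance.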
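A minimal sketch of this kind of pre-filtering, assuming chunks are held in memory with a metadata dictionary next to the text. The `Chunk` class, the field names, and the sample records are invented for illustration. In a real system the same condition would usually be expressed as a filter in the vector database query.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(chunks, predicate):
    """Keep only the chunks whose metadata satisfies the predicate."""
    return [c for c in chunks if predicate(c.metadata)]


# Hypothetical patient-record chunks
chunks = [
    Chunk("Asthma follow-up visit.", {"age": 9, "visit_date": "2024-03-02", "condition": "asthma"}),
    Chunk("Hypertension check.", {"age": 54, "visit_date": "2024-03-05", "condition": "hypertension"}),
    Chunk("Seasonal allergy consult.", {"age": 15, "visit_date": "2024-04-11", "condition": "allergy"}),
]

# A query about children: drop the records of patients over 18
pediatric = filter_by_metadata(chunks, lambda m: m["age"] <= 18)
print([c.text for c in pediatric])
```

Only the filtered chunks would then be passed on to similarity search, so the adult record never competes for a place in the results.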

Example

Chunk #1

Source Metadata: {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:26e9aac7d5494208a56ff0c6cbbfda20', 'source': 'https://plato.stanford.edu/entries/goedel/'}

Source text:

2.2.1 The First Incompleteness Theorem In his Logical Journey (Wang 1996) Hao Wang published the full text of material Gödel had written (at Wang’s request) about his discovery of the incompleteness theorems. This material had formed the basis of Wang’s “Some Facts about Kurt Gödel,” and was read and approved by Gödel:

Chunk #2

Source Metadata: {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:d15f62c453c64072b768e136080cb5ba', 'source': 'https://plato.stanford.edu/entries/goedel/'}

Source text:

The First Incompleteness Theorem provides a counterexample to completeness by exhibiting an arithmetic statement which is neither provable nor refutable in Peano arithmetic, though true in the standard model. The Second Incompleteness Theorem shows that the consistency of arithmetic cannot be proved in arithmetic itself. Thus Gödel’s theorems demonstrated the infeasibility of the Hilbert program, if it is to be characterized by those particular desiderata, consistency and completeness.

Here the metadata holds each chunk's unique ID and its source, which gives the chunk extra context and makes it easier to retrieve.

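5. Generating Metadata with GLiNER

Rich metadata is not always available, but a model such as GLiNER can generate it on the fly. GLiNER tags and labels chunks during ingestion, producing metadata as it goes.

Implementation

Give GLiNER each chunk along with the labels to look for. When it finds matching entities, it tags them. GLiNER generally works well out of the box, but niche datasets may require fine-tuning. GLiNER can also parse incoming queries and match them against the metadata labels for filtering.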
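The tagging step might look like the sketch below. It assumes the open-source `gliner` Python package, whose models expose a `predict_entities(text, labels, threshold=...)` method that returns dictionaries with `text` and `label` keys. The helper name `tag_chunk`, the label set, and the checkpoint named in the comment are illustrative assumptions rather than part of the original walkthrough.

```python
def tag_chunk(model, chunk: str, labels: list[str], threshold: float = 0.5) -> dict:
    """Group the entities found in a chunk by label so they can be stored as metadata."""
    metadata: dict[str, list[str]] = {}
    for entity in model.predict_entities(chunk, labels, threshold=threshold):
        values = metadata.setdefault(entity["label"], [])
        if entity["text"] not in values:  # skip repeated mentions of the same entity
            values.append(entity["text"])
    return metadata


# With the gliner package installed, usage would look roughly like this
# (the checkpoint name is an assumption; any GLiNER checkpoint should work):
#
#   from gliner import GLiNER
#   model = GLiNER.from_pretrained("urchade/gliner_base")
#   tag_chunk(model, "Kurt Gödel published the theorems in 1931.", ["person", "date"])
```

The returned dictionary can be merged into each chunk's metadata at ingestion time, and the same helper can be run on an incoming query to derive filter values.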

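GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer

Together, these techniques lay the groundwork for a robust RAG system and enable efficient retrieval from large datasets. Which chunking and metadata strategy to choose depends on the needs and characteristics of your dataset.

Retrieval: Finding the Right Information

Now let's focus on the "R" in RAG. How can we improve retrieval from a vector database? The goal is to retrieve every document relevant to the query, which greatly increases the chance that the LLM produces a high-quality result. Here are several techniques:

6. Hybrid Search

Hybrid search combines vector search (which matches on meaning) with keyword search (which matches exact terms) and draws on the strengths of both. In AI, many terms are specific keywords: algorithm names, technical terms, LLM names. Vector search alone can miss them, while keyword search makes sure these important terms are taken into account. Combining the two gives a more complete retrieval process. The two searches run in parallel.

The results are merged and ranked with a weighting scheme. With Weaviate, for example, you can adjust the alpha parameter to balance vector and keyword results, which produces a single merged, ranked list.

Pros: Balances precision and recall and improves retrieval quality.
Cons: The weights need careful tuning.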
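To show what the alpha weighting does, here is a simplified score-fusion sketch. It normalizes each result list to the 0..1 range and blends them, with alpha = 1 meaning pure vector search and alpha = 0 meaning pure keyword search. This illustrates the idea and is not Weaviate's exact fusion algorithm. The scores and document IDs are made up.

```python
def min_max(scores):
    """Rescale a {doc: score} mapping to the 0..1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(vector_scores, keyword_scores, alpha=0.5):
    """Blend normalized vector and keyword scores; a missing score counts as 0."""
    v, k = min_max(vector_scores), min_max(keyword_scores)
    fused = {
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


vector_scores = {"doc_a": 0.82, "doc_b": 0.79, "doc_c": 0.40}  # e.g. cosine similarity
keyword_scores = {"doc_b": 7.1, "doc_c": 9.3, "doc_d": 2.0}    # e.g. BM25

ranking = hybrid_rank(vector_scores, keyword_scores, alpha=0.6)
print([doc for doc, _ in ranking])
```

Here doc_b wins because it scores well in both lists, even though it is not the top hit in either one. That is the behavior hybrid search is after.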

Hybrid Search in Practice

from langchain_community.retrievers import WeaviateHybridSearchRetriever
from langchain_core.documents import Document
retriever = WeaviateHybridSearchRetriever(
   client=client,
   index_name="LangChain",
   text_key="text",
   attributes=[],
   create_schema_if_missing=True,
)
retriever.invoke("the ethical implications of AI")

This code initializes a WeaviateHybridSearchRetriever for retrieving documents from a Weaviate vector database. It uses Weaviate's hybrid search, which combines vector search and keyword search. Finally, it runs the query "the ethical implications of AI" and retrieves relevant documents with this hybrid approach.

7. Query Rewriting

Query rewriting starts from the observation that a query written by a human may not be the best input for a database or a language model. Rewriting it with a language model can significantly improve retrieval.

Rewriting for the vector database: This turns the user's initial query into a database-friendly form. For example, "What are AI agents and why are they the next big thing in 2025" might become "AI agents big thing 2025". Any LLM can do the rewriting, as long as it captures the important aspects of the query.
Prompt rewriting for the language model: This means automatically creating prompts that optimize the interaction with the language model, which can improve the quality and accuracy of the results. Frameworks such as DSPy, or any LLM, can help with the rewriting. Rewritten queries and prompts help the search retrieve relevant documents and prompt the language model effectively.

Multi-Query Retrieval

Retrieval can return different results for small changes in how a query is worded, and the problem gets worse when the embeddings do not capture the meaning of the data well. Prompt engineering or tuning is often used to address this, but it can be very time-consuming.

MultiQueryRetriever simplifies the task. It uses a large language model (LLM) to create several queries from different perspectives based on a single user input. For each generated query it retrieves a set of relevant documents. By taking the unique union of the results across all queries, it provides a broader set of potentially relevant documents. This raises the odds of finding useful information without extensive manual tuning.

from langchain_openai import ChatOpenAI
chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)
from langchain.retrievers.multi_query import MultiQueryRetriever
# Set logging for the queries
import logging
similarity_retriever3 = chroma_db3.as_retriever(search_type="similarity",
                                               search_kwargs={"k": 2})
mq_retriever = MultiQueryRetriever.from_llm(
   retriever=similarity_retriever3, llm=chatgpt,
   include_original=True
)
logging.basicConfig()
# so we can see what queries are generated by the LLM
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
query = "what is the capital of India?"
docs = mq_retriever.invoke(query)
docs

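This code sets up multi-query retrieval with LangChain. It generates several variants of the input query ("what is the capital of India?") and runs each one against the Chroma vector database (chroma_db3) through the similarity retriever, which broadens the search and captures a wider range of relevant documents. The multi-query retriever then merges and returns the retrieved documents.

Output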

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi () is the capital of India and a union territory of the megacity of Delhi. It has a very old history and is home to several monuments where the city is expensive to live in. In traditional Indian geography it falls under the North Indian zone. The city has an area of about 42.7\xa0km. New Delhi has a population of about 9.4 Million people.'), Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content="Kolkata (spelled Calcutta before 1 January 2001) is the capital city of the Indian state of West Bengal. It is the second largest city in India after Mumbai. It is on the east bank of the River Hooghly. When it is called Calcutta, it includes the suburbs. This makes it the third largest city of India. This also makes it the world's 8th largest metropolitan area as defined by the United Nations. Kolkata served as the capital of India during the British Raj until 1911. Kolkata was once the center of industry and education. However, it has witnessed political violence and economic problems since 1954. Since 2000, Kolkata has grown due to economic growth. Like other metropolitan cities in India, Kolkata struggles with poverty, pollution and traffic congestion."), Document(metadata={'article_id': '22215', 'title': 'States and union territories of India'}, page_content='The Republic of India is divided into twenty-eight States,and eight union territories including the National Capital Territory.')]

8. Contextual Compression Retrieval with LLM Prompting

Contextual compression helps improve the relevance of retrieved documents. It works in two main ways:

Extracting relevant content: Remove the parts of a retrieved document that have nothing to do with the query, keeping only the parts that answer the question.
Filtering out irrelevant documents: Exclude documents that are unrelated to the query, without changing the content of the documents that remain.

For the first, we can use LLMChainExtractor, which reviews the initially returned documents and extracts only the content relevant to the query. It can also drop documents that are entirely irrelevant.

Here is how to implement it with LangChain:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI
# Initialize the language model
chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)
# Set up a similarity retriever
similarity_retriever = chroma_db3.as_retriever(search_type="similarity", search_kwargs={"k": 3})
# Create the extractor to get relevant content
compressor = LLMChainExtractor.from_llm(llm=chatgpt)
# Combine the retriever and the extractor
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=similarity_retriever)
# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

Output

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi is the capital of India and a union territory of the megacity of Delhi.')]

For a different query:

query = "What is the old capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

Output

[Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content='Kolkata served as the capital of India during the British Raj until 1911.')]

LLMChainFilter offers a simpler yet effective way to filter documents. It uses an LLM chain to decide which documents to keep and which to discard, without altering their content.

Here is how to implement the filter:

from langchain.retrievers.document_compressors import LLMChainFilter
# Set up the filter
_filter = LLMChainFilter.from_llm(llm=chatgpt)
# Combine the retriever and the filter
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=similarity_retriever)
# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

Output

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'}, page_content='New Delhi is the capital of India and a union territory of the megacity of Delhi.')]

Another query:

query = "What is the old capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

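Output

[Document(metadata={'article_id': '4062', 'title': 'Kolkata'}, page_content='Kolkata served as the capital of India during the British Raj until 1911.')]

These strategies refine retrieval by focusing on relevant content. LLMChainExtractor extracts only the necessary parts of each document, while LLMChainFilter decides which documents to keep. Both improve the quality of the retrieved information and make it more relevant to the user's query.

9. Fine-Tuning the Embedding Model

Pretrained embedding models are a good start. Fine-tuning them on your own data can greatly improve retrieval.

Choosing the right model: For specialized fields such as medicine, pick a model pretrained on relevant data. For example, you can use the MedCPT family of query and document encoders, which were pretrained at scale on 255 million query-article pairs from PubMed search logs.

Fine-tuning with positive and negative pairs: Collect your own data and build pairs of similar (positive) and dissimilar (negative) examples. Fine-tune the model to tell them apart. This helps the model learn domain-specific relationships and improves retrieval.

Pros: Improves retrieval performance.
Cons: Requires carefully constructed training data.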
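One common way to train on positive and negative pairs is a triplet-style loss, sketched below with plain Python. This is one possible objective and not necessarily the one a given fine-tuning library uses. The margin value and the toy 2-D vectors are assumptions for illustration. During real fine-tuning the vectors would come from the embedding model and the loss would be minimized by gradient descent.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is more similar to the anchor than the negative by `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


query = (1.0, 0.0)
# Embeddings already separate the pair correctly: nothing left to learn
good = triplet_loss(query, positive=(0.9, 0.1), negative=(0.0, 1.0))
# Embeddings rank the negative above the positive: large loss
bad = triplet_loss(query, positive=(0.0, 1.0), negative=(0.9, 0.1))
print(good, round(bad, 4))
```

The loss only pushes on examples the model currently gets wrong, which is why informative (hard) negatives matter when you build the training set.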

Combined, these techniques make for a strong retrieval system. They improve the relevance of the objects handed to the LLM, which in turn improves generation quality.

Generating High-Quality Responses

Finally, let's discuss how to improve the generation quality of the language model (LLM). The goal is to give the LLM context that is as relevant to the prompt as possible, since irrelevant data can trigger hallucinations. Here are some techniques for better generation:

10. Autocut: Removing Irrelevant Information

Autocut filters out irrelevant information retrieved from the database, which keeps the LLM from being misled.

Retrieve and score similarity: When a query is run, several objects are retrieved along with their similarity scores.
Identify and cut: Use the similarity scores to find the cutoff point where the score drops sharply, and exclude the objects beyond it. This ensures that only the most relevant information reaches the LLM. For example, if you retrieve six objects, the score may drop steeply after the fourth. By looking at the rate of change, you can decide which objects to exclude.

Hands-On

from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# `docs` is the list of Document objects to index, prepared beforehand
vectorstore = PineconeVectorStore.from_documents(
    docs, index_name="sample", embedding=OpenAIEmbeddings()
)

@chain
def retriever(query: str) -> List[Document]:
    # Fetch matches with their similarity scores and record each score in the metadata
    docs, scores = zip(*vectorstore.similarity_search_with_score(query))
    for doc, score in zip(docs, scores):
        doc.metadata["score"] = score
    return list(docs)

result = retriever.invoke("dinosaur")
result

This snippet runs a similarity search with LangChain and Pinecone. It embeds the documents with OpenAI embeddings, stores them in a Pinecone vector store, and defines a retriever function. The retriever searches for documents similar to the given query ("dinosaur"), obtains their similarity scores, and adds the scores to the document metadata before returning the results.

Output

[Document(page_content='In her second book, Dr. Simmons delves deeper into the ethical considerations surrounding AI development and deployment. It is an eye-opening examination of the dilemmas faced by developers, policymakers, and society at large.', metadata={}),
 Document(page_content='A comprehensive analysis of the evolution of artificial intelligence, from its inception to its future prospects. Dr. Simmons covers ethical considerations, potentials, and threats posed by AI.', metadata={}),
 Document(page_content="In his follow-up to 'Symbiosis', Prof. Sterling takes a look at the subtle, unnoticed presence and influence of AI in our everyday lives. It reveals how AI has become woven into our routines, often without our explicit realization.", metadata={}),
 Document(page_content='Prof. Sterling explores the potential for harmonious coexistence between humans and artificial intelligence. The book discusses how AI can be integrated into society in a beneficial and non-disruptive manner.', metadata={})]

Because the retriever stores each similarity score in the document's metadata, we can truncate the list at a threshold.
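The cut itself can be sketched as follows. This is a deliberately simple heuristic that cuts at the single largest drop between consecutive scores. It assumes the scores are sorted best-first and that higher means more similar (for distance metrics, reverse the comparison). Production implementations such as Weaviate's autocut use their own jump detection. The scores here are made up.

```python
def autocut(scores):
    """Return how many results to keep: cut at the largest drop between neighbors."""
    if len(scores) < 2:
        return len(scores)
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return drops.index(max(drops)) + 1


scores = [0.91, 0.89, 0.88, 0.86, 0.52, 0.50]  # hypothetical similarity scores
keep = autocut(scores)
print(keep, scores[:keep])
```

With six results and a steep fall after the fourth, only the first four are passed to the LLM.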

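11. Reranking Retrieved Objects

Reranking uses a more powerful model to re-evaluate and reorder the objects that were initially retrieved, which improves the quality of the final retrieval set.

Over-fetch: Initially retrieve more objects than you need.
Apply a ranker model: Re-evaluate relevance with a slower but more accurate model, typically a cross-encoder. The model looks at the query and each object as a pair to reassess their similarity.
Reorder the results: Reorder the objects according to the new scores, with the most relevant results first. This ensures that the most relevant documents are prioritized and improves the data given to the LLM.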
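The three steps above can be sketched with a pluggable scoring function. The `overlap_score` function below is only a toy stand-in for a cross-encoder, which would score each (query, document) pair with a neural model. The candidate texts and function names are illustrative.

```python
def overlap_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query terms that appear in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms)


def rerank(query, candidates, score_fn, top_n=2):
    """Score every (query, candidate) pair and keep the top_n best candidates."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


# Step 1 (over-fetch): more candidates than we intend to keep
candidates = [
    "the deficit will be down by the end of this year",
    "the president nominated judge ketanji brown jackson",
    "the justice department will name a chief prosecutor",
    "judge jackson will continue the legacy of justice breyer",
]

# Steps 2 and 3: score each pair, then reorder and truncate
top = rerank("what did the president say about judge jackson", candidates, overlap_score)
print(top)
```

Swapping `overlap_score` for a real cross-encoder changes only the scoring function. The over-fetch, score, and reorder flow stays the same.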

Reranking Retrieved Objects in Practice

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank

def pretty_print_docs(docs):
    # Helper assumed by this snippet: print the documents separated by a rule
    print(f"\n{'-' * 100}\n".join(f"Document {i + 1}:\n\n{d.page_content}" for i, d in enumerate(docs)))

# `retriever` is the base retriever built earlier
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
print([doc.metadata["id"] for doc in compressed_docs])
pretty_print_docs(compressed_docs)

This snippet uses FlashrankRerank inside a ContextualCompressionRetriever to improve the relevance of retrieved documents. It reranks the documents fetched by the base retriever (the retriever variable) by their relevance to the query "What did the president say about Ketanji Jackson Brown". Finally, it prints the document IDs and the reranked documents.

Output

[0, 5, 3]
Document 1:

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:

He met the Ukrainian people. From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
----------------------------------------------------------------------------------------------------
Document 3:

And tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud. By the end of this year, the deficit will be down to less than half what it was before I took office. The only president ever to cut the deficit by more than one trillion dollars in a single year. Lowering your costs also means demanding more competition. I’m a capitalist, but capitalism without competition isn’t capitalism. It’s exploitation—and it drives up prices.

The output shows the retrieved chunks reordered by relevance.

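12. Fine-Tuning the LLM

Fine-tuning an LLM on domain-specific data can greatly improve its performance. Take Meditron 70B as an example: it is a version of LLaMA 2 70B fine-tuned on medical data using both of the following methods:

Unsupervised fine-tuning: Continue pretraining on a large body of domain-specific text, such as the PubMed literature.

Supervised fine-tuning: Refine the model further with supervised learning on domain-specific tasks, such as medical multiple-choice questions. This specialized training helps the model perform well in the target domain. On specific tasks, it outperforms both its base model and larger, less specialized models such as GPT-3.5.

Supervised fine-tuning

Source: Link

The figure shows the process of fine-tuning on task-specific examples. This approach lets developers specify the desired output, encourage certain behaviors, or gain more control over the model's responses.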

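13. RAFT: Adapting a Language Model to Domain-Specific RAG

RAFT (Retrieval-Augmented Fine-Tuning) is a method for improving how large language models (LLMs) work in specific domains. It helps these models answer questions more accurately using relevant information from documents.

Retrieval-augmented fine-tuning: RAFT combines fine-tuning with retrieval, so that during training the model learns from both helpful and unhelpful documents.
Chain-of-thought reasoning: The model generates answers that show its reasoning, which helps it give clear, accurate responses grounded in the retrieved documents.
Dynamic document handling: RAFT trains the model to find and use the most relevant documents while ignoring those that do not help answer the question.

RAFT Architecture

The RAFT architecture has several key components:

Input layer: The model receives a question (Q) and a set of retrieved documents (D), which includes both relevant and irrelevant documents.
Processing layer:
- The model analyzes the input to find the important information in the documents.
- It produces an answer (A*) that cites the relevant documents.
Output layer: The model generates the final answer from the relevant documents while ignoring the irrelevant ones.
Training mechanism: During training, some examples contain both relevant and irrelevant documents, while others contain only irrelevant ones. This setup encourages the model to rely on context rather than rote memorization.
Evaluation: The model is evaluated on how accurately it answers questions using the retrieved documents.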
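The training mechanism can be illustrated by how a single training record might be assembled. In the sketch below, a fraction `p_oracle` of the records contain the relevant ("oracle") document mixed with distractors, and the rest contain distractors only. The function name, the 0.8 ratio, the number of documents per record, and the placeholder strings are all assumptions made for illustration.

```python
import random


def build_raft_example(question, oracle_doc, distractor_pool, answer, rng,
                       p_oracle=0.8, num_docs=4):
    """Assemble one training record: question, mixed documents, and target answer."""
    if rng.random() < p_oracle:
        # Relevant document plus distractors
        docs = [oracle_doc] + rng.sample(distractor_pool, num_docs - 1)
    else:
        # Distractors only
        docs = rng.sample(distractor_pool, num_docs)
    rng.shuffle(docs)  # do not let position reveal which document is relevant
    return {"question": question, "documents": docs, "answer": answer}


rng = random.Random(0)
pool = [f"distractor {i}" for i in range(6)]  # placeholder distractor texts
examples = [
    build_raft_example("Q", "oracle", pool, "A* (reasoned answer citing the oracle)", rng)
    for _ in range(1000)
]
with_oracle = sum("oracle" in ex["documents"] for ex in examples)
print(with_oracle, len(examples))
```

Each record would then be rendered into a prompt (question plus documents) with the chain-of-thought answer as the fine-tuning target.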

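With this architecture, RAFT strengthens the model's ability to work within a specific domain and offers a reliable way to generate accurate, relevant answers.

RAFT architecture

Source: link

The top-left panel shows the RAFT approach of adapting LLMs to read the solution from a set of positive and distractor documents. This contrasts with the standard RAG setup, in which models are trained on the retriever's output, a mix of memorization and reading. At test time, all methods follow the standard RAG setting and receive the top-k retrieved documents in context.

Summary

Improving retrieval and generation in a RAG system is essential for better AI applications. The techniques discussed range from low-effort, high-impact methods (query rewriting, reranking) to more intensive processes (fine-tuning the embedding model and the LLM). The best technique depends on the specific needs and constraints of your application. Applied thoughtfully, advanced RAG techniques let developers build AI systems that are more accurate, more reliable, and more context-aware, and that can handle complex information needs.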
