Building a RAG-Based Query Resolution System with LangChain and CrewAI

Businesses today handle a large volume of queries from customers, sales teams, and internal stakeholders. Answering them manually is slow and inefficient, often leading to delays and inconsistent answers. An AI-powered query resolution system ensures fast, accurate, and scalable responses. It works by using Retrieval-Augmented Generation (RAG) to retrieve relevant information and generate precise answers. In this article, I will share my journey of building a RAG-based query resolution system with LangChain, ChromaDB, and CrewAI.

Why Do We Need an AI-Powered Query Resolution System?

Manual responses take time and therefore cause delays. Customers expect instant replies, and businesses need quick access to accurate information. An AI-driven system automates query handling, reducing workload and improving consistency. It boosts productivity, speeds up decision-making, and delivers reliable responses across departments.

An AI-powered query resolution system is especially valuable in customer support, where it can automate replies and improve customer satisfaction. In sales and marketing, it can provide real-time product details and customer insights. Industries such as finance, healthcare, education, and e-commerce all benefit from automated query handling, which ensures smooth operations and a better user experience.

Understanding the RAG Workflow

Before diving into the implementation, let's look at how a Retrieval-Augmented Generation (RAG) system works.

(Figure: the RAG workflow)

The architecture consists of three key stages: indexing, retrieval, and generation.

1. Building the Vector Store (Document Processing and Storage)

The system first processes and stores the relevant documents so they can be searched efficiently. Here is how the indexing stage works:

  • Document chunking: Large documents are split into smaller text chunks for efficient retrieval.
  • Embedding model: An AI-based embedding model converts these chunks into vector representations.
  • Vector storage: The vectorized data is indexed and stored in a database (such as ChromaDB) for fast lookup.

2. Query Processing and Retrieval

When a user submits a query, the system retrieves relevant data before generating a response. The steps are:

  • User query input: The user submits a question or request.
  • Vectorization: The query is converted into a numeric vector using the embedding model.
  • Search and retrieval: The system searches the vector store and retrieves the most relevant chunks.

3. Augmentation and Response Generation

To produce a well-grounded response, the system augments the query with the retrieved data. The generation steps are as follows; a minimal end-to-end sketch of all three stages appears right after this list.

  • Query augmentation: The retrieved document chunks are combined with the original query.
  • LLM processing: A large language model (LLM) generates the final response from the query and the retrieved context.
  • Final response: The system delivers a factually grounded, context-aware answer to the user.
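
Here is a minimal sketch of the three stages wired together with LangChain and ChromaDB, condensed from what we build later in this article. It assumes an OPENAI_API_KEY is set in the environment; the document text and question are toy placeholders.

from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI

# 1. Indexing: chunk a toy document and store its embeddings in ChromaDB
docs = [Document(page_content="Gradient descent iteratively updates model weights to minimize a loss function.")]
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(docs)
store = Chroma.from_documents(chunks, OpenAIEmbeddings())

# 2. Retrieval: embed the query and fetch the most similar chunks
question = "What is gradient descent?"
context = "\n\n".join(d.page_content for d in store.similarity_search(question, k=3))

# 3. Generation: augment the query with the retrieved context and ask the LLM
llm = ChatOpenAI(model_name="gpt-4o-mini")
print(llm.predict(f"Use this context to answer:\n{context}\n\nQuestion: {question}"))

The rest of the article expands each of these stages into a full system.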

Now that you know how a RAG system works, let's learn how to build a RAG-based query resolution system.

Building the RAG-Based Query Resolution System

In this article, I will walk you through building a RAG-based query resolution system that uses AI agents to answer learner queries efficiently. For simplicity, I will demonstrate a simplified version of the project and explain how it works.

Choosing the Right Data for Query Resolution

Before building a RAG-based query resolution system, the most important factor to consider is data; specifically, the kind of data needed for effective retrieval. A well-structured knowledge base is essential, because the accuracy and relevance of responses depend on the quality of the available data. Here are the main data types to consider for different purposes:

  • Customer support data: FAQs, troubleshooting guides, product manuals, and past customer interactions.
  • Sales and marketing data: Product catalogs, pricing details, competitor analyses, and customer inquiries.
  • Internal knowledge bases: Company policies, training documents, and standard operating procedures (SOPs).
  • Financial and legal documents: Compliance guidelines, financial reports, and regulatory policies.
  • User-generated content: Forum discussions, chat logs, and feedback forms that capture real user queries.

Choosing the right data sources was critical for our learner query resolution system to ensure accurate, relevant responses. Initially, I experimented with different types of data to determine which produced the best results. First, I used PowerPoint slides (PPTs), but they did not provide comprehensive answers as expected. Next, I added FAQs, which improved answer accuracy but lacked sufficient context. I then tested past discussions, which improved response relevance by drawing on learners' previous interactions. However, the most effective approach turned out to be the subtitles from course videos, since they provide structured, detailed content directly related to learner queries. This approach delivers fast, relevant answers, making it particularly useful for e-learning platforms and educational support systems.

Designing the Query Resolution System

Before writing any code, it is important to design the query resolution system. The best way to do this is to define the key tasks the system needs to perform.

The system will handle three main tasks:

  1. Extract and store course content from subtitles (SRT files).
  2. Retrieve relevant course data based on learner queries.
  3. Generate structured responses using an AI agent.

To achieve this, the system is divided into three components, each handling a specific function. This ensures efficiency and scalability.

The system consists of:

  • Subtitle processing – extracts text from SRT files, processes it, and stores the embeddings in ChromaDB.
  • Retrieval – searches for and retrieves relevant course data based on learner queries.
  • Query-answering agent – uses CrewAI to generate structured, accurate responses.

Each component ensures efficient query resolution, personalized responses, and smooth content retrieval.

Implementation Steps

Now that we have the structure, let's begin the implementation.

1. Importing Libraries

To build the AI-powered learning support system, we first need to import the necessary libraries.

import os
import pysrt
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from crewai import Agent, Task, Crew
import pandas as pd
import ast

Let's take a look at what each of these libraries does.

  • pysrt – extracts text from SRT subtitle files.
  • langchain.text_splitter.RecursiveCharacterTextSplitter – splits large text into smaller chunks for better retrieval.
  • langchain.schema.Document – represents a structured text document.
  • langchain.embeddings.OpenAIEmbeddings – converts text into numeric vectors for similarity search.
  • langchain.vectorstores.Chroma – stores the embeddings in a vector database for efficient retrieval.
  • crewai (Agent, Task, Crew) – defines the AI agent that handles learner queries.
  • pandas – handles structured data as dataframes.
  • ast – parses string-based data structures into Python objects (see the small example after this list).
  • os – provides system-level operations such as reading environment variables.
  • tqdm – displays progress bars for long-running tasks (used in the full project; not needed in the simplified code shown here).
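
For instance, the past discussion thread for each query is stored in our CSV as a stringified Python list, which ast.literal_eval safely converts back into a real list. The content below is invented for illustration:

import ast

raw = "['learner: What is gradient descent?', 'support: Hi, gradient descent is an optimization method...']"
thread = ast.literal_eval(raw)  # safely parse the string into a Python list
print(thread[0])                # -> learner: What is gradient descent?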

2. Setting Up the Environment

To use OpenAI's API for embeddings, we must load the API key and configure the model settings.

Step 1: Read the API key from a local text file.

with open('/home/janvi/Downloads/openai.txt', 'r') as file:
    openai_api_key = file.read()

Step 2: Store the API key as an environment variable so that other components can access it.

os.environ['OPENAI_API_KEY'] = openai_api_key

Step 3: Specify the OpenAI model to use for processing.

os.environ["OPENAI_MODEL_NAME"] = 'gpt-4o-mini'

With these configurations in place, we ensure seamless integration with the OpenAI API, allowing our system to process and store embeddings efficiently.

3. Extracting and Storing Subtitle Data

Subtitles often capture the valuable insights of video lectures, making them a rich source of structured content for an AI-based retrieval system. Extracting and processing subtitle data effectively enables efficient search and retrieval of relevant information when answering learner queries.

Step 1: Extract Text from SRT Files

To capture the educational content, we use pysrt to read and preprocess the text from SRT files. This ensures the extracted content is well structured for further processing and storage.

def extract_text_from_srt(srt_path):
    """Extracts text from an SRT subtitle file using pysrt."""
    subs = pysrt.open(srt_path)
    text = " ".join(sub.text for sub in subs)
    return text

Since a course can have multiple subtitle files, we organize the course data in predefined folders and iterate over them. This enables seamless text extraction and further processing.

# Define course names and their respective folder paths (raw strings avoid
# backslash-escape issues in Windows paths)
course_folders = {
    "Introduction to Deep Learning using PyTorch": r"C:\M\Code\GAI\Learn_queries\Subtitle_Introduction_to_Deep_Learning_Using_Pytorch",
    "Building Production-Ready RAG systems using LlamaIndex": r"C:\M\Code\GAI\Learn_queries\Subtitle of Building Production-Ready RAG systems using LlamaIndex",
    "Introduction to LangChain - Building Generative AI Apps & Agents": r"C:\M\Code\GAI\Learn_queries\Subtitle_introduction_to_langchain_using_agentic_ai"
}

# Dictionary to store course names and their respective .srt file paths
course_srt_files = {}

# Iterate through course folder mappings
for course, folder_path in course_folders.items():
    srt_files = []
    # Walk through the directory to find .srt files
    for root, _, files in os.walk(folder_path):
        srt_files.extend(os.path.join(root, file) for file in files if file.endswith(".srt"))
    # Add to dictionary if there are .srt files
    if srt_files:
        course_srt_files[course] = srt_files

The extracted text forms the foundation of our AI-powered learning support system, enabling advanced retrieval and query resolution.

Step 2: Storing Subtitles in ChromaDB

In this part, we walk through the process of storing the course subtitles in ChromaDB, covering text chunking, embedding generation, persistence, and cost estimation.

a. A Persistent Directory for ChromaDB

persist_directory is the folder path where the stored data lives, allowing us to keep the embeddings even after restarting the program. Without it, the database would reset after every run.

persist_directory = "./subtitles_db"

ChromaDB serves as the vector database, storing and retrieving embeddings efficiently.

b. Splitting Text into Chunks

Large documents (such as the subtitles of an entire course) exceed the token limits of embedding models. To handle this, we use RecursiveCharacterTextSplitter to split the text into smaller, overlapping chunks, which improves search accuracy.

# Text splitter to break documents into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

Each chunk is 1,000 characters long, ensuring the text is broken into manageable pieces. To preserve context across chunks, the last 200 characters of one chunk are carried over into the next. This overlap helps retain important details and improves retrieval accuracy; a small illustration with toy sizes follows.
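
To see the splitter in action, here is a toy example (the sizes below are illustrative only; the real system uses chunk_size=1000 and chunk_overlap=200):

from langchain.text_splitter import RecursiveCharacterTextSplitter

toy_splitter = RecursiveCharacterTextSplitter(chunk_size=20, chunk_overlap=8)
for chunk in toy_splitter.split_text("Gradient descent minimizes the loss function step by step."):
    print(chunk)  # consecutive chunks repeat a little trailing context from the previous one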

c. Initializing OpenAI Embeddings and the ChromaDB Vector Store

We need to convert the text into numeric vector representations for similarity search. OpenAI's embeddings let us encode the course content in an efficiently searchable format.

# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

Here, OpenAIEmbeddings() initializes the embedding model with the OpenAI API key (openai_api_key). This ensures each text chunk is converted into a high-dimensional vector representation.
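
As a quick sanity check (this assumes a valid API key is set), each text becomes a fixed-length list of floats; for OpenAI's default text-embedding-ada-002 model the vector has 1,536 dimensions:

vector = embeddings.embed_query("What is gradient descent?")
print(len(vector))   # 1536 for text-embedding-ada-002
print(vector[:5])    # the first few components of the embedding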

d. Initializing ChromaDB

Now we store these vector embeddings in ChromaDB.

# Initialize Chroma vectorstore with persistent directory
vectorstore = Chroma(
    collection_name="course_materials",
    embedding_function=embeddings,
    persist_directory=persist_directory
)

collection_name="course_materials" creates a dedicated collection in ChromaDB to organize all course-related embeddings. embedding_function=embeddings specifies the OpenAI embedding function that converts text into numeric vectors. persist_directory=persist_directory ensures all stored embeddings remain available in ./subtitles_db/ even after the program restarts.

Step 3: Estimating the Storage Cost of the Course Data

Before adding documents to the vector database, it is important to estimate the token usage cost. Since OpenAI charges per 1,000 tokens, we calculate the expected cost to manage expenses effectively.

a. Defining the Pricing Parameters

Since OpenAI charges per 1,000 tokens, we estimate the cost before adding the documents.

import time

# OpenAI pricing (adjust based on the model being used)
COST_PER_1K_TOKENS = 0.0001  # Cost per 1K tokens for 'text-embedding-ada-002'
TOKENS_PER_CHUNK_ESTIMATE = 750  # Approximate tokens per 1000-character chunk

# Track total tokens and cost
total_tokens = 0
total_cost = 0

# Start timing
start_time = time.time()

COST_PER_1K_TOKENS = 0.0001 defines the cost per 1,000 tokens for OpenAI embeddings. TOKENS_PER_CHUNK_ESTIMATE = 750 estimates that each 1,000-character chunk contains roughly 750 tokens. The total_tokens and total_cost variables track the total amount of data processed and the cost incurred during the run. The start_time variable records the starting time, used to measure how long the process takes.

b. Checking for and Adding Courses to ChromaDB

We want to avoid reprocessing courses that are already stored in the vector database, so we query ChromaDB to check whether each course already exists. If a course is not found, we extract and store its subtitle data.

# Add new courses to the vectorstore if they don't already exist
for course, srt_list in course_srt_files.items():
    # Check if the course already exists in the vectorstore
    existing_docs = vectorstore._collection.get(where={"course": course})
    if not existing_docs['ids']:
        # Course not found, add it
        srt_texts = [extract_text_from_srt(srt) for srt in srt_list]
        course_text = "\n\n\n\n".join(srt_texts)  # Join SRT texts with four newlines
        doc = Document(page_content=course_text, metadata={"course": course})
        chunks = text_splitter.split_documents([doc])

The subtitles are extracted with the extract_text_from_srt() function. The texts of multiple subtitle files are then joined with four newlines ("\n\n\n\n") for readability. A Document object is created to hold the full subtitle text along with its metadata. Finally, text_splitter.split_documents() splits the text into smaller chunks for efficient processing and retrieval.

c. Estimating Token Usage and Cost

Before adding the chunks to ChromaDB, we estimate the cost.

        # Estimate cost before adding documents
        chunk_count = len(chunks)
        batch_tokens = chunk_count * TOKENS_PER_CHUNK_ESTIMATE
        batch_cost = (batch_tokens / 1000) * COST_PER_1K_TOKENS
        total_tokens += batch_tokens
        total_cost += batch_cost

chunk_count is the number of chunks produced by splitting the text. batch_tokens estimates the total number of tokens from the chunk count. batch_cost computes the estimated cost of processing the current course. total_tokens and total_cost accumulate these values across all courses to track overall processing and expenses.

d. Adding the Chunks to ChromaDB

        vectorstore.add_documents(chunks)
        print(f"Added course: {course} (Chunks: {chunk_count}, Cost: ${batch_cost:.4f})")
    else:
        print(f"Course already exists: {course}")

The processed chunks are stored in ChromaDB for efficient retrieval, and a message is printed showing how many chunks were added and the estimated processing cost.

Once all courses are processed, we compute and display the final results.

# End timing
end_time = time.time()

# Display cost and time
print(f"\nCourse Embeddings Update Completed! 🚀")
print(f"Total Chunks Processed: {total_tokens // TOKENS_PER_CHUNK_ESTIMATE}")
print(f"Estimated Total Tokens: {total_tokens}")
print(f"Estimated Cost: ${total_cost:.4f}")
print(f"Total Time Taken: {end_time - start_time:.2f} seconds")

The total processing time is computed as (end_time - start_time). The script then prints the number of chunks processed, the estimated token usage, and the total cost, ending with a summary of the entire embedding run.

Output:

(Figure: final cost and timing summary)

From the output, we can see that 739 chunks were processed in about 10 seconds, at an estimated cost of $0.0554. The quick arithmetic below checks this against the estimation formula.
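
This is pure arithmetic using the constants defined earlier:

chunks = 739
tokens = chunks * TOKENS_PER_CHUNK_ESTIMATE   # 739 * 750 = 554,250 estimated tokens
cost = (tokens / 1000) * COST_PER_1K_TOKENS   # 554.25 * $0.0001 ≈ $0.0554
print(tokens, round(cost, 4))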

4. Querying and Responding to Learner Queries

Once the subtitles are stored in ChromaDB, the system needs to retrieve relevant content whenever a learner submits a query. This retrieval is handled through similarity search, which identifies the stored text chunks most relevant to the input query.

How It Works

  1. Query input: A learner submits a course-related question.
  2. Filtering by course: The system restricts retrieval to the relevant course materials.
  3. Similarity search in ChromaDB: The query is converted into an embedding, and ChromaDB retrieves the most similar stored chunks.
  4. Returning the top results: The system selects the three most relevant text chunks.
  5. Formatting the output: The retrieved text is formatted and presented as context for further processing.
# Define retrieval tool with metadata filtering
def retrieve_course_materials(query: str, course: str):
    """Retrieves course materials filtered by course name."""
    filter_dict = {"course": course}
    results = vectorstore.similarity_search(query, k=3, filter=filter_dict)
    return "\n\n".join([doc.page_content for doc in results])

An example query:

course_name = "Introduction to Deep Learning using PyTorch"
question = "What is gradient descent?"
context = retrieve_course_materials(query=question, course=course_name)
print(context)

(Figure: retrieved course content for the sample query)

The output shows the content retrieved from ChromaDB, filtered by course name and matched to the question with similarity search to find the most relevant information.

Why Use Similarity Search?

  • Semantic understanding: Unlike keyword search, similarity search finds text that is semantically related to the query.
  • Efficient retrieval: The system retrieves only the most relevant sections instead of scanning entire documents.
  • Better answer quality: Course filtering and relevance ranking give learners highly targeted content.

This mechanism ensures that when learners submit questions, they receive relevant, contextually accurate information from the stored course data. The sketch below illustrates the intuition behind it.
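
To build intuition for what "semantically related" means here (this illustrates the idea, not ChromaDB's internal implementation), related texts map to nearby vectors, so their cosine similarity is higher:

import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = embeddings.embed_query("What is gradient descent?")
related = embeddings.embed_query("Gradient descent updates weights to minimize the loss.")
unrelated = embeddings.embed_query("SRT files store subtitles with timestamps.")
print(cosine(query_vec, related), cosine(query_vec, unrelated))  # expect the first score to be higher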

5. Implementing the AI Query-Answering Agent

Once the relevant course data is retrieved from ChromaDB, the next step is to use an AI-powered agent to craft a meaningful response to the learner's query. CrewAI is used to define an intelligent agent that analyzes the query and generates a well-structured reply.

Now, let's see how this works.

Step 1: Defining the Agent

The query-answering agent is created with a clear role and backstory that guide its behavior when responding to learner queries.

# Define the agent with a well-structured role and backstory
query_answer_agent = Agent(
    role="Learning Support Specialist",
    goal="You help learners with their queries with the best possible response",
    backstory="""You lead the Learners Query resolution department of
    an Ed tech company focused on self paced courses on topics related to
    Data Science, Machine Learning and Generative AI. You respond to learner
    queries related to course content, assignments, technical and administrative issues.
    You are polite, diplomatic and take ownership of things which could be
    improved in your organisation.
    """,
    verbose=False,
)

Let's walk through what happens in this code block. First, we set the role to "Learning Support Specialist", since the agent acts as a virtual tutor answering student questions. We then define the goal, ensuring the agent prioritizes accuracy and clarity in its answers. Finally, we set verbose=False so it runs quietly unless debugging is needed. This well-defined agent role ensures the responses are helpful, structured, and aligned with the tone of an educational platform.

Step 2: Defining the Task

After defining the agent, we need to assign it a task.

query_answering_task = Task(
    description="""
    Answer the learner queries to the best of your abilities. Try to keep your response concise with less than 100 words.
    Here is the query: {query}
    Here is similar content from the course extracted from subtitles, which you should use only when required: {relevant_content} .
    Since this content is extracted from course subtitles, there may be spelling errors, make sure to correct these, while using this information in your response.
    There may be some previous discussion with the learner on this thread. Here is the python list of past discussions: {thread} .
    In this thread, the content which starts with 'learner' is the question by the student and the content which starts with 'support'
    is the response given by you. Use this past discussion appropriately to come up with a great reply.
    This is the full name of the learner: {learner_name}
    Address each learner by their first name, if you are not sure what the first name is, simply start with Hi.
    Also mention some appropriate and encouraging comforting lines at the end of the response, like "hope you found this helpful",
    "I hope this information is useful. Keep up the great work!", "Glad to assist! Feel free to reach out anytime." etc.
    If you are not sure about the answer mention - "Sorry, I am not sure about this, I will get back to you"
    """,
    expected_output="A crisp accurate response to the query",
    agent=query_answer_agent,
)

Let's break down the task given to the AI agent. Query handling revolves around {query}, which represents the learner's question. The response should be concise (under 100 words) and accurate. For course content, {relevant_content} is pulled from the subtitles stored in ChromaDB, and the AI must correct any spelling errors before incorporating that content into its reply.

If past discussions exist, {thread} helps maintain continuity. Learner messages start with "learner", while past replies start with "support", allowing the AI to answer in context. {learner_name} enables personalization: the agent addresses students by their first name, defaulting to "Hi" when unsure. A hypothetical example of the thread format follows.
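
For concreteness, a {thread} value might look like the following (the names and messages here are invented for illustration):

thread = [
    "learner: I get a shape mismatch error when training my CNN.",
    "support: Hi Priya, please check that your input tensor shape matches the first layer of the network...",
    "learner: Thanks! That fixed it, but now the loss is not decreasing.",
]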

To make the responses more engaging, the AI appends a positive closing line such as "Hope you found this helpful!" or "Feel free to reach out anytime." If the AI is unsure about an answer, it explicitly says "Sorry, I am not sure about this, I will get back to you." This approach ensures a polite, clear, and well-structured response format that builds learner engagement and trust.

Step 3: Initializing the CrewAI Instance

Now that we have the agent and its task, we can initialize CrewAI so the agent can process queries dynamically.

# Create the Crew
response_crew = Crew(
    agents=[query_answer_agent],
    tasks=[query_answering_task],
    verbose=False
)

The agents=[query_answer_agent] argument adds the Learning Support Specialist agent to the crew, and tasks=[query_answering_task] assigns it the query-answering task. Setting verbose=False keeps the output minimal unless debugging is needed. CrewAI lets the system handle multiple learner queries, making it scalable and efficient for dynamic query processing.

Why Use CrewAI for Query Answering?

  • Structured responses: Ensures every reply is well organized and informative.
  • Context awareness: Uses the retrieved course materials and past discussions to improve response quality.
  • Scalability: Multiple queries can be handled as tasks within CrewAI, enabling dynamic processing.
  • Efficiency: Reduces response time by streamlining the query resolution workflow.

With this AI-driven answering system in place, learners receive well-informed responses tailored to their specific queries.

Step 4: Generating Responses for Multiple Learner Queries

Once the AI agent is created, it needs to dynamically handle learner queries stored in a structured dataset.

The code below uses the AI agent to process learner queries stored in a CSV file and generate responses. It first loads the dataset containing the learner queries, course details, and discussion threads. The reply_to_query function extracts the relevant details, such as the learner's name, the course name, and the current query. If previous discussions exist, they are retrieved for context. If the query contains an image, it is skipped. The function then fetches the relevant course data from ChromaDB and passes the query, relevant content, and past discussion to the AI agent to generate a structured response.

df = pd.read_csv(filepath_or_buffer=r'C:\M\Code\GAI\Learn_queries\filtered_data_top3_courses.csv')

def reply_to_query(df_new=df, index=1):
    learner_name = df_new.iloc[index]["thread_starter"]
    course_name = df_new.iloc[index]["course"]
    if df_new.iloc[index]['number_of_replies'] > 1:
        thread = ast.literal_eval(df_new.iloc[index]["modified_thread"])
    else:
        thread = []
    question = df_new.iloc[index]["current_query"]
    if df_new.iloc[index]['has_image'] == True:
        return " "
    context = retrieve_course_materials(query=question, course=course_name)
    response_result = response_crew.kickoff(inputs={"query": question, "relevant_content": context, "thread": thread, "learner_name": learner_name})
    print('Q: ', question)
    print('\n')
    print('A: ', response_result)
    print('\n\n')

To test the function, let's run it for a single query (index=1):

reply_to_query(df, index=1)

(Figure: generated response for a single learner query)

From this we can see that it works well for a single index.

Now, let's iterate over all the queries, handling potential errors as we process each one. This ensures efficient automation of query resolution, allowing multiple learner queries to be handled dynamically.

for i in range(len(df)):
    try:
        reply_to_query(df, index=i)
    except Exception:
        print("Error in index number: ", i)
        continue

Why Is This Step Important?

  • Automated query handling: The system can process many learner queries efficiently.
  • Contextual relevance: Responses are generated from the retrieved course data and past discussions.
  • Scalability: The approach lets the AI agent process and respond to thousands of queries dynamically.
  • Better learning support: Learners receive personalized, data-driven responses to their queries.

This step ensures that every learner query is analyzed and answered effectively in context, improving the overall learning experience.

Output:

(Figure: automated responses to learner queries)

As the output shows, the reply process is now fully automated, with each question followed by its generated answer.

Future Improvements

To upgrade the RAG-based query resolution system, several areas could be improved:

  • FAQs and their resolutions: Implementing a structured FAQ system within the query resolution framework would help answer common questions instantly, reducing dependence on live support.
  • Image-processing capabilities: Adding the ability to analyze and extract relevant information from images (such as screenshots, charts, or scanned documents) would make the system more versatile and useful in education and customer support.
  • Better image-column boolean handling: Refining the logic behind the image-column detection so that image-based queries are identified and handled more accurately.
  • Semantic chunking and other chunking techniques: Experimenting with various chunking strategies, such as semantic chunking, fixed-length splitting, and hybrid approaches, could improve retrieval accuracy and the contextual understanding of responses (see the sketch after this list).
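
As a pointer for that last item, here is a sketch of semantic chunking with LangChain's experimental SemanticChunker. This assumes the langchain_experimental package is installed, and the API may change between versions; the input string is a placeholder.

from langchain.embeddings import OpenAIEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

semantic_splitter = SemanticChunker(OpenAIEmbeddings(), breakpoint_threshold_type="percentile")
docs = semantic_splitter.create_documents(["...full subtitle text of a course..."])
# Chunks end where the embedding similarity between adjacent sentences drops,
# rather than at a fixed character count.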

Conclusion

This RAG-based query resolution system uses LangChain, ChromaDB, and CrewAI to automate learner support efficiently. It extracts subtitle text, stores it as embeddings in ChromaDB, and retrieves relevant content via similarity search. A CrewAI agent processes each query, consults past discussions, and generates structured responses, ensuring accuracy and personalization.

The system improves scalability, retrieval efficiency, and response quality, making self-paced learning more interactive. Future improvements include multimodal support, better retrieval optimization, and enhanced response generation. By automating query resolution, the system streamlines learning support, giving learners faster, context-aware responses and improving overall engagement.

Frequently Asked Questions

Q1. What is LangChain, and why is it used in this project?

A. LangChain is a framework for building applications powered by large language models (LLMs). It helps with processing, retrieving, and generating responses from text data. In this project, LangChain is used to split text into chunks, generate embeddings, and retrieve course materials efficiently.

Q2. How does ChromaDB store and retrieve course content?

A. ChromaDB is a vector database that stores and retrieves embeddings. It converts course materials into numeric representations so that, when a learner submits a query, relevant content can be found through similarity-based search.

Q3. What role does CrewAI play in answering learner queries?

A. CrewAI makes it possible to create AI agents that handle tasks dynamically. In this project, it powers the Learning Support Specialist agent, which retrieves course data, processes past discussions, and generates structured responses to learner queries.

Q4. Why are OpenAI embeddings used for text processing?

A. OpenAI embeddings convert text into numeric vectors, making it easy to perform similarity search. This helps retrieve relevant course materials efficiently based on learner queries.

Q5. How does the system process subtitle (SRT) files?

A. The system uses pysrt to extract text from subtitle (SRT) files. The extracted content is then chunked, embedded with OpenAI embeddings, and stored in ChromaDB for retrieval whenever needed.

Q6. Can the system handle multiple queries at once?

A. Yes, the system is scalable and can process multiple learner queries dynamically using CrewAI's task management, ensuring fast and efficient responses.

Q7. What future improvements are planned for this system?

A. Future improvements include multimodal support for images and videos, better retrieval optimization, and improved response-generation techniques for more accurate, context-aware replies.
