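GRPO Fine-Tuning on DeepSeek-7B with Unsloth

DeepSeek has taken the natural language processing field by storm. With its scale and performance, the model does well on tasks such as question answering and text summarization, and its handling of nuance makes it useful across many industries. Fine-tuning extends these capabilities: by training DeepSeek-7B further on a specialized dataset, you turn it from a generalist into a domain expert that returns precise results quickly.

This article explores how GRPO (Group Relative Policy Optimization) improves fine-tuning through reinforcement learning, and how Unsloth optimizes memory management to speed up fine-tuning of large models such as DeepSeek-7B. Together, these methods make fine-tuning faster and more cost-effective, which supports the next generation of AI applications.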

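Learning Objectives

By the end of this article, you should be able to:

Understand the basics of fine-tuning DeepSeek-7B to improve performance on specialized tasks.
Recognize the advantages of GRPO over PPO for efficient fine-tuning.
Use Unsloth and LoRA for fast, memory-efficient fine-tuning of large models.
Set up DeepSeek-7B fine-tuning with Unsloth, vLLM, and Hugging Face, and optimize GPU performance.
Implement reward functions, such as correctness and XML-format rewards, for structured outputs in reinforcement learning.
Save, load, and reload LoRA fine-tuned models for memory-efficient, high-performance inference.
Troubleshoot GPU memory and configuration problems so that fine-tuning runs smoothly.
Explore how to extend GRPO to larger datasets, new reward functions, and multimodal models.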

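Understanding the DeepSeek Model and the GRPO Algorithm

What Is DeepSeek-R1-Distill-Qwen-7B?

DeepSeek-R1-Distill-Qwen-7B is a large language model built on the Qwen architecture. It uses billions of parameters to handle complex NLP tasks such as text generation, question answering, and summarization. The 7B variant is a distilled version of its larger counterpart, which means it retains much of the performance while being more efficient in compute and memory use. This makes it a good fit for deployments where both inference speed and accuracy matter. Its architecture uses transformer layers with self-attention, which makes it effective at handling long-range dependencies in text.

[Figure: DeepSeek-R1 training]

Key Features and Architecture Overview

At its core, DeepSeek-7B is a multi-layer transformer architecture that is highly parallel and can be trained efficiently on large datasets. Each layer consists of a multi-head self-attention module and a feed-forward network. The attention mechanism lets the model focus on the relevant parts of the input sequence, so it can handle tasks that require contextual understanding.

[Figure: model architecture. Source: DeepSeek V3]

DeepSeek-7B processes token embeddings through positional encoding, attention layers, and feed-forward layers, and it scales efficiently to large datasets while maintaining high-quality results. Its context-aware understanding improves cross-domain generalization after fine-tuning. Methods such as LoRA improve training efficiency by applying low-rank updates, which makes fine-tuning possible even with limited compute.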

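Introduction to GRPO and How It Improves Fine-Tuning

GRPO (Group Relative Policy Optimization) is a reinforcement learning technique designed to make fine-tuning of large language models more efficient. It refines the model's behavior with reward signals rather than direct supervision, and it optimizes the model parameters iteratively with a policy-based method.

In a typical fine-tuning setup, the model is trained on a supervised dataset and learns directly from ground-truth labels. GRPO instead follows the reinforcement learning (RL) paradigm: the model is trained to maximize a reward signal that guides its behavior. This lets the model adapt more flexibly to the nuances of a specific task, which can improve accuracy and generalization.

The key formula for policy optimization in GRPO can be expressed as:

[Equation: GRPO objective function]

where:

[Equation: definitions of the terms in the GRPO objective]

This policy-based approach ensures that the model keeps adapting to the feedback provided during training, with a focus on improving the reward signal that corresponds to the task objective.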
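For reference, the GRPO objective as it is commonly written in the literature is sketched below. This is a reconstruction under assumptions, not necessarily the exact expression in the figure above: the symbols G (group size), o_i (the i-th sampled completion for prompt q), R_i (its scalar reward), ρ (probability ratio), ε (clip range), and β (KL weight) are notation chosen for this sketch.

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta) =
\mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}}
\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\Big( \min\big( \rho_{i,t}\, \hat{A}_{i},\
\mathrm{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_{i} \big)
- \beta\, D_{\mathrm{KL}}\!\left[ \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right] \Big) \right]

\rho_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q,\, o_{i,<t})}
                  {\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q,\, o_{i,<t})},
\qquad
\hat{A}_{i} = \frac{R_i - \mathrm{mean}(R_1, \dots, R_G)}{\mathrm{std}(R_1, \dots, R_G)}
```

The advantage \hat{A}_i is computed from the rewards of the group alone, which is why no separate value model is needed.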

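Reward Signals in GRPO

In GRPO, the reward function is defined according to the requirements of the task and steers the model toward the desired behavior. The reward can be a function of several factors, such as accuracy, format, or logical consistency. For example, a correctness reward function R_correct can be defined as:

[Equation: correctness reward function R_correct]

This feedback mechanism lets GRPO refine the model step by step, emphasizing the aspects that matter most for the task.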
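As a minimal sketch of such a correctness reward, assuming exact string matching after trimming whitespace: the function name `correctness_reward` and the default values 2.0 and 0.0 are illustrative choices (they mirror the reward values used later in this article), not a fixed part of GRPO.

```python
def correctness_reward(predicted: str, expected: str,
                       hit: float = 2.0, miss: float = 0.0) -> float:
    """Return `hit` when the trimmed strings match exactly, otherwise `miss`."""
    return hit if predicted.strip() == expected.strip() else miss


# Three candidate answers scored against the reference answer "42"
rewards = [correctness_reward(p, "42") for p in ["42", " 42 ", "41"]]
print(rewards)
```

Exact matching is strict by design; a softer reward (partial credit, numeric tolerance) is a reasonable variation when answers are not single tokens.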

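How Does GRPO Differ from PPO (Proximal Policy Optimization)?

GRPO applies policy-based reinforcement learning to fine-tuning, while PPO (Proximal Policy Optimization) is another widely used reinforcement learning algorithm, especially for fine-tuning large models. PPO is known for its stability and its ability to handle high-dimensional action spaces, which makes it popular for training large-scale models. However, PPO usually needs a lot of data and is sensitive to hyperparameters such as the learning rate.

The main difference between GRPO and PPO lies in how the policy is optimized. In PPO, policy updates use a clipped objective that prevents large deviations from the current policy, since such deviations can destabilize training. The PPO objective function is:

[Equation: PPO clipped objective function]

where:

[Equation: definitions of the terms in the PPO objective]

This clipping mechanism helps avoid large policy updates that could cause instability, but it can also slow down learning, especially for a large model like DeepSeek-7B.

By penalizing large deviations in the policy, the clipped objective ensures that the model does not make large, unstable updates. The cost is a trade-off between stability and learning speed: for larger models in particular, both the number of updates and the learning rate must be tuned carefully.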
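To make the clipping behavior concrete, here is a small sketch of the standard PPO clipped surrogate for a single action. It assumes the usual definition, the minimum of `ratio * advantage` and the clipped ratio times the advantage; the function name and the default `eps=0.2` are illustrative, and the expression in the figure above may use different notation.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one action.

    ratio     : pi_new(a|s) / pi_old(a|s)
    advantage : estimated advantage of the action
    eps       : clip range, so the ratio is limited to [1 - eps, 1 + eps]
    """
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum keeps the objective pessimistic: the policy gains
    # nothing from moving the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Within the clip range the objective is simply ratio * advantage.
inside = clipped_surrogate(1.1, 1.0)
# A large ratio with a positive advantage is capped at (1 + eps) * advantage.
capped = clipped_surrogate(1.5, 1.0)
# A small ratio with a negative advantage is held at (1 - eps) * advantage.
floored = clipped_surrogate(0.5, -1.0)
print(inside, capped, floored)
```

In a real trainer this value is averaged over a batch of tokens and maximized; the scalar version only shows where the clip takes effect.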

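By contrast, GRPO uses a more adaptive reward structure that is computed directly from task-specific metrics. Rather than training a separate value (critic) model as PPO does, it samples a group of completions for each prompt and scores every completion relative to the rest of its group. As originally formulated, GRPO still bounds each update with a clipped probability ratio and a KL penalty toward a reference model, but dropping the critic reduces memory use and compute, which gives a more direct and efficient route to fine-tuning. As a result, GRPO can often reach good performance with less compute.

Gradient Update Rule for the Parameters θ

In GRPO, the gradient used to update the model parameters is a policy gradient: the rewards themselves are not differentiated, but they weight the gradient of the log-probability of the sampled tokens. If a reward R_t is computed from the model output at time step t, the gradient update rule for the parameters θ is:

[Equation: GRPO gradient update rule for θ]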
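The quantity that weights each completion's gradient in GRPO is usually described as a group-relative advantage: the completion's reward, standardized against the other completions sampled for the same prompt. The sketch below illustrates that idea under assumptions; the function name is hypothetical, the population standard deviation is one possible choice (implementations may use the sample standard deviation), and `eps` only guards against division by zero.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standardize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: two earned the correctness reward, two did not.
advantages = group_relative_advantages([2.0, 0.0, 0.0, 2.0])
print(advantages)

# If every completion gets the same reward, no completion is preferred.
flat = group_relative_advantages([1.0, 1.0, 1.0])
print(flat)
```

A practical consequence: when all completions in a group score the same, the advantages are zero and that prompt contributes no learning signal, so reward functions that can distinguish between completions matter.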

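Compared with PPO, which relies on a learned value function to estimate advantages, this update is simpler and cheaper to compute. The main differences between the two algorithms are summarized below:

Feature	GRPO	PPO
Objective	Maximize expected reward, with advantages computed relative to a group of sampled completions.	Maximize a clipped surrogate objective for stable updates.
Reward signal	Task-specific, adaptive rewards normalized within each group.	Advantage estimates from a learned value (critic) model.
Training stability	More flexible and direct; no critic to train.	Ensures stability through the clipping mechanism.
Optimization mechanism	Direct reward maximization with group-relative advantages.	Clipped policy updates with a value model.
Use case	Task-adaptive fine-tuning with custom rewards.	General RL tasks where stability is the priority.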

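Unsloth: Improving Fine-Tuning Efficiency

Fine-tuning large language models such as DeepSeek-7B is computationally expensive and needs a lot of memory and processing power. Unsloth is an optimization framework designed to speed up training while substantially reducing memory consumption. It is especially useful with LoRA (Low-Rank Adaptation) and GRPO, because it makes efficient use of GPU resources and makes fine-tuning feasible on consumer-grade hardware.

How Does Unsloth Optimize Model Training?

Unsloth introduces several optimizations that make fine-tuning more efficient:

Memory-efficient loading: Unsloth supports 4-bit and 8-bit quantization, which reduces the model's memory footprint while preserving performance.
Fast training and inference: By using Flash Attention and paged optimizers, Unsloth speeds up both training and inference.
Gradient checkpointing: Unsloth supports gradient checkpointing, which stores only a subset of activations and recomputes the rest when needed, reducing the GPU memory required.
Seamless LoRA integration: Unsloth supports LoRA natively, so you train only a small set of adapter parameters rather than the entire network.

Loading a model with Unsloth is straightforward and efficient. The process is described in detail later in this article.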
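To see why quantized loading matters, a back-of-the-envelope estimate of weight memory helps. The sketch below assumes a nominal 7 billion parameters and counts only the weights; activations, optimizer state, the KV cache, and quantization overhead are ignored, so real usage will be higher. The function name is illustrative.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7e9  # nominal parameter count; an assumption for illustration

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights shrink from roughly 13 GiB at 16-bit to roughly 3.3 GiB at 4-bit, which is what brings a 7B model within reach of a single mid-range GPU.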

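Advantages of Using Unsloth

Reduces GPU memory usage by up to 50%, which allows training on mid-range GPUs.
Trains faster by integrating optimized attention mechanisms.
Supports vLLM, a high-throughput inference engine, to accelerate generation.
Works seamlessly with GRPO, so reinforcement-learning-based fine-tuning stays resource-efficient.

By adding Unsloth to the fine-tuning pipeline, researchers and engineers can get the most out of DeepSeek-7B without running into the usual compute limits.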

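Fine-Tuning DeepSeek-7B with GRPO

The previous sections covered the architecture of DeepSeek-7B and the GRPO algorithm. Building on that, it is time to look at the practical steps for fine-tuning the model. This section walks through everything from setting up the environment to configuring the GRPO trainer, with code snippets and an explanation of each part of the process.

As discussed earlier, DeepSeek-7B is a capable model for large-scale NLP tasks, and it becomes more effective when paired with GRPO (Group Relative Policy Optimization). With GRPO, we can fine-tune DeepSeek-7B on a specific task inside a reinforcement learning framework. This helps the model produce better results and adapt to new data more effectively than conventional methods.

Let's now go through the detailed steps for fine-tuning DeepSeek-7B with GRPO and Unsloth, using LoRA to keep memory use low during training.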

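Step 1: Setting Up the Environment

To start fine-tuning DeepSeek-7B, first set up the environment. This means installing Unsloth, vLLM, and the other required packages. The commands below are written for a notebook environment (note the leading !):

!pip install unsloth vllm datasets
!pip install git+https://github.com/huggingface/trl.git

Explanation:

unsloth: A library for efficient language model fine-tuning and memory optimization.
vllm: Enables fast inference for large models.
datasets: A library for working with NLP datasets, including those hosted on Hugging Face.
trl: The Hugging Face library that provides GRPOConfig and GRPOTrainer, installed here from source.

Once installation finishes, we can load the model and start fine-tuning.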

Step 2: Loading the Model with Unsloth

Next, we load the DeepSeek-7B model with Unsloth. The model is loaded in 4-bit precision with LoRA support enabled, ready for efficient fine-tuning. Here is the code for this step:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/DeepSeek-R1-Distill-Qwen-7B",
    max_seq_length=512,
    load_in_4bit=True,           # Uses 4-bit quantization for memory efficiency
    fast_inference=True,         # Enables vLLM-backed fast inference
    max_lora_rank=32,            # Largest LoRA rank that will be used
    gpu_memory_utilization=0.6,  # Fraction of GPU memory to use
)

Explanation:

model_name: The model to load, in this case DeepSeek-R1-Distill-Qwen-7B.
max_seq_length: The maximum sequence length, in tokens, that the model will process.
load_in_4bit: Uses 4-bit quantization, which greatly reduces memory usage.
fast_inference: Lets vLLM speed up generation.
max_lora_rank: The maximum rank for LoRA adaptation, which controls the size of the low-rank matrices.
gpu_memory_utilization: Sets how much GPU memory the model may use, which helps avoid out-of-memory errors.

Expected result: The model is loaded into memory with an optimized configuration, ready for fine-tuning with LoRA.

Step 3: Applying LoRA for Efficient Fine-Tuning

LoRA reduces the memory needed to fine-tune large models such as DeepSeek-7B. With LoRA, we update only small low-rank matrices instead of the whole model, which keeps fine-tuning memory-efficient. Here is the code:

model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # Rank of the LoRA layers, which controls capacity and memory use
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # Modules to apply LoRA to
    lora_alpha=32,  # Scaling factor for the LoRA update
    use_gradient_checkpointing="unsloth",  # Gradient checkpointing for long-context fine-tuning
    random_state=3407,  # Seed for reproducibility
)

Explanation:

r: The rank of the LoRA matrices. A higher rank gives the adapter more capacity, but training is slower and uses more memory.
target_modules: The model layers that receive LoRA adapters (for example, q_proj is the query projection).
lora_alpha: A scaling factor that controls how strongly the LoRA update contributes.
use_gradient_checkpointing: Stores only some intermediate activations and recomputes the rest during the backward pass, which reduces memory consumption.
random_state: Makes the fine-tuning run reproducible.

Expected result: The model's memory usage is now optimized, and it can be fine-tuned efficiently on large datasets.
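To get a feel for how small the trainable part is, the number of LoRA parameters can be estimated by hand: an adapter on a linear layer with input size d_in and output size d_out adds r * (d_in + d_out) weights. The sketch below applies this to the seven target modules. The layer dimensions (hidden size 3584, MLP size 18944, key/value projection size 512, 28 layers) are assumptions about the Qwen-7B-style architecture and should be checked against the model's config file.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable weights added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed dimensions; verify against the model's config.json before relying on them.
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 3584, 18944, 512, 28
RANK = 32

module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in module_shapes.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters add on the order of 80 million trainable weights, around 1% of a 7B model, and the count scales linearly with the rank r.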


Step 4: Preparing the Training Dataset

Fine-tuning DeepSeek-7B requires a dataset in a specific format. Here we load the dataset from a JSON file and convert it into a Hugging Face Dataset object. The code is as follows:

import json

from datasets import Dataset

# Assumed placeholder: the original code uses SYSTEM_PROMPT without showing it.
# Replace this with your own instructions for the model.
SYSTEM_PROMPT = "Respond in the format <reasoning>...</reasoning><answer>...</answer>."

def load_and_transform_json(json_path):
    """Load a JSON list of question/response entries and build chat-style prompts."""
    with open(json_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    transformed_data = [
        {"question": entry["question"], "answer": entry["response"],
         "prompt": [{"content": SYSTEM_PROMPT, "role": "system"},
                    {"content": entry["question"], "role": "user"}]}
        for entry in data
    ]
    # Wrap the list so the trainer receives a Hugging Face Dataset
    return Dataset.from_list(transformed_data)

json_file_path = "/content/your_dataset.json"  # Path to your JSON file
dataset = load_and_transform_json(json_file_path)

Explanation:

load_and_transform_json: Loads the JSON file and converts it into the format needed for training.
Each record contains the question and the answer, plus a prompt made of the system message and the user's question.

Expected result: The dataset is now correctly formatted and ready for training. A sample of the dataset is shown below.

[Figure: sample of the transformed dataset]
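The transformation can be tried on a single in-memory record before pointing it at a real file. In the sketch below, the `SYSTEM_PROMPT` text, the helper name `to_grpo_record`, and the sample question are all hypothetical; only the field names `question`, `response`, `answer`, and `prompt` come from the code in this step.

```python
# Hypothetical system prompt; the article does not show the one it used.
SYSTEM_PROMPT = "Respond with <reasoning>...</reasoning> followed by <answer>...</answer>."


def to_grpo_record(entry: dict) -> dict:
    """Convert one raw {question, response} entry into the training format."""
    return {
        "question": entry["question"],
        "answer": entry["response"],
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": entry["question"]},
        ],
    }


raw_entries = [{"question": "What is 2 + 2?", "response": "4"}]
records = [to_grpo_record(entry) for entry in raw_entries]
print(records[0]["prompt"][1]["content"])
print(records[0]["answer"])
```

The `prompt` column holds the chat messages the model will see, while `answer` stays outside the prompt so that reward functions can compare against it.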

Step 5: Designing Reward Functions for Structured Outputs

In reinforcement learning, reward functions guide the model toward the desired outputs. Here we define reward functions that evaluate the model's responses. For example, correctness_reward_func checks whether the extracted answer matches the expected answer. The functions rely on two helpers, extract_xml_answer and count_xml, which must be defined before this code runs.

import re

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    # 2.0 when the extracted answer matches the reference exactly, else 0.0
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    # 0.5 when the extracted answer consists only of digits
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    # 0.5 when the whole response follows the exact tag and newline layout
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets ".*?" span multi-line reasoning and answers
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    # 0.5 when the tags appear in the right order anywhere in the response
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.search(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    # Partial credit based on how many of the expected tags are present
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

Explanation:

correctness_reward_func: Compares the extracted answer with the expected answer. A match earns 2.0, anything else earns 0.0.
int_reward_func: Rewards responses whose extracted answer is numeric.
strict_format_reward_func: Checks that the output follows the XML format exactly and rewards well-formed output.
soft_format_reward_func: Checks whether the output loosely follows the required format.
xmlcount_reward_func: Scores how closely the output follows the XML structure and penalizes poorly structured responses.

Expected result: These reward functions steer the model toward responses that are not only correct but also well structured and in the required format.
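The reward functions above call `extract_xml_answer` and `count_xml`, which this article does not define. A minimal sketch of what they might look like follows; the tag names match the patterns used in the format rewards, while the per-tag score of 0.125 and the trailing-text penalty of 0.001 per character are assumptions chosen for illustration.

```python
def extract_xml_answer(text: str) -> str:
    """Return the text between the last <answer> tag and its closing tag."""
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def count_xml(text: str) -> float:
    """Give partial credit for each expected tag that appears exactly once."""
    score = 0.0
    for tag in ("<reasoning>", "</reasoning>", "<answer>", "</answer>"):
        if text.count(tag) == 1:
            score += 0.125
    # Small penalty for any text that trails the closing </answer> tag
    if "</answer>" in text:
        trailing = text.split("</answer>")[-1].strip()
        score -= 0.001 * len(trailing)
    return score


sample = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>\n"
print(extract_xml_answer(sample))
print(count_xml(sample))
```

Note that `extract_xml_answer` returns the whole trimmed text when no `<answer>` tag is present, so a response without tags can still match the reference by accident; combining it with the format rewards reduces that risk.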

Step 6: Configuring the GRPO Trainer

Now we configure GRPOTrainer to use the training dataset and the reward functions. The GRPOConfig object specifies training parameters such as the learning rate and batch size.

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    num_generations=6,
    max_prompt_length=256,
    max_completion_length=200,
    max_steps=1,  # A single step, enough for a smoke test; raise this for real training
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[correctness_reward_func],  # Add the other reward functions to this list as needed
    args=training_args,
    train_dataset=dataset,
)

trainer.train()

Explanation:

GRPOConfig: Configures the training parameters, such as the learning rate, the batch size, and the number of completions generated per prompt.
GRPOTrainer: The class that runs the actual training. It takes the model, the tokenizer, the reward functions, and the training arguments.

GRPOConfig parameters:

learning_rate: The learning rate for optimization. A low value such as 5e-6 helps keep training stable over many iterations.
per_device_train_batch_size: The batch size per training step on each device. It is set to 1 here, so each GPU processes one example at a time.
num_generations: The number of completions the model samples for each prompt. GRPO compares these completions with one another to compute advantages.
max_prompt_length: The maximum length of the input prompt, in tokens.
max_completion_length: The maximum length of the model's output, in tokens.
max_steps: The number of training steps to run.

Expected result: The model is trained with the GRPO algorithm and the reward functions defined above, so that it performs better on the given dataset.
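Batch size and `num_generations` interact: GRPO needs whole groups of completions, and some TRL releases reject configurations where the number of completions processed per step is not divisible by `num_generations` (Unsloth may instead adjust the batch size for you). The check below is a hedged sketch of that rule, not the library's actual validation code; the helper names and the exact definition of the per-step batch are assumptions, so confirm the behavior against your installed version.

```python
def generation_batch(per_device_batch: int, num_devices: int = 1, grad_accum: int = 1) -> int:
    """Completions processed per optimizer step (assumed definition)."""
    return per_device_batch * num_devices * grad_accum


def groups_fit(per_device_batch: int, num_generations: int,
               num_devices: int = 1, grad_accum: int = 1) -> bool:
    """True when the per-step batch splits evenly into groups of num_generations."""
    return generation_batch(per_device_batch, num_devices, grad_accum) % num_generations == 0


# The values used in this step: batch size 1 with 6 generations per prompt.
print(groups_fit(per_device_batch=1, num_generations=6))
# One way to make the groups fit: a per-device batch equal to num_generations.
print(groups_fit(per_device_batch=6, num_generations=6))
```

If the trainer raises an error or prints a warning about `num_generations`, setting `per_device_train_batch_size` to a multiple of it is a reasonable first thing to try.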


Saving and Reloading the Fine-Tuned Model

Once DeepSeek-7B has been fine-tuned with GRPO and LoRA, the model should be saved to disk or cloud storage for later use. This section shows how to save the fine-tuned model and load it again for inference, so that you keep your progress and avoid retraining from scratch.

Saving the LoRA Fine-Tuned Model

After fine-tuning with LoRA and GRPO, save the model to a storage location so that it can be reloaded later without retraining. Here is how to save the fine-tuned model, including the LoRA-specific weights, to disk:

# Define the path to save the fine-tuned model
model_save_path = "/content/deepseek_lora_finetuned"

# Save the LoRA adapter and the tokenizer
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

Explanation:

model.save_pretrained: For a LoRA (PEFT) model, this saves the adapter weights and configuration, that is, the low-rank adaptation matrices, rather than a full copy of the base model.
tokenizer.save_pretrained: Saves the tokenizer, including the vocabulary and special tokens.
model_save_path: The directory where the model is saved. This is a local path; it can point into a mounted cloud drive such as Google Drive.

Expected result: The model and tokenizer are saved to the specified path for later use. You can then reload the exact fine-tuned version for inference without retraining.
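Before relying on a saved directory, it can be worth checking that the adapter files actually landed there. The sketch below assumes the usual PEFT file names, `adapter_config.json` plus either `adapter_model.safetensors` or `adapter_model.bin`; these names depend on the library version, and the helper name is hypothetical, so adjust both to what your own save produces.

```python
from pathlib import Path


def adapter_dir_problems(save_dir: str) -> list[str]:
    """Return a list of problems found in a saved LoRA adapter directory."""
    root = Path(save_dir)
    problems = []
    if not (root / "adapter_config.json").is_file():
        problems.append("missing adapter_config.json")
    # Assumed weight file names; newer versions typically write safetensors.
    weight_names = ("adapter_model.safetensors", "adapter_model.bin")
    if not any((root / name).is_file() for name in weight_names):
        problems.append("missing adapter weights")
    return problems


print(adapter_dir_problems("/content/deepseek_lora_finetuned"))
```

An empty list means both expected files are present; anything else is worth investigating before the training session is shut down.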

Loading the Model for Future Inference

After saving the fine-tuned model, you can load it back into memory for inference or further fine-tuning. The code below loads the saved model and tokenizer together with the LoRA-specific configuration:

from unsloth import FastLanguageModel

# Define the path where the model is saved
model_save_path = "/content/deepseek_lora_finetuned"

# Reload the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_save_path,
    max_seq_length=512,
    load_in_4bit=True,    # Keep the memory-efficient settings
    fast_inference=True,  # Enable fast inference
    max_lora_rank=32,     # Should match the rank used during fine-tuning
    gpu_memory_utilization=0.6,
)

Explanation:

FastLanguageModel.from_pretrained: Loads the saved model weights and tokenizer from the specified path.
max_lora_rank: The LoRA rank used at inference time should match the rank used during fine-tuning so that the adapter is applied correctly.
load_in_4bit and gpu_memory_utilization: Keep the model memory-efficient when it is loaded for inference.

Expected result: The model and its LoRA configuration are loaded from the saved directory, ready for efficient inference. The model uses the fine-tuned parameters, so you can start generating responses right away without repeating the fine-tuning process.

Below is a sample output for the dataset used for fine-tuning in this blog, which relates to process flow sheets. Notice how the model reasons its way to a response. Fine-tuning with GRPO builds in this reasoning behavior, which is visible in the answer below.

[Figure: sample question and model output]

Advanced Option: Saving to Cloud Storage

If you want to keep the model in cloud storage such as Google Drive or Amazon S3, copy the saved directory there after training. Note that gdown only downloads files from Google Drive and has no upload function, so in a Colab notebook the simplest route is to mount Drive and copy the directory (the destination folder below is an example path):

from google.colab import drive
import shutil

# Mount Google Drive (Colab only), then copy the saved model directory into it
drive.mount("/content/drive")
shutil.copytree(model_save_path, "/content/drive/MyDrive/deepseek_lora_finetuned", dirs_exist_ok=True)

For Amazon S3, you can upload the model with the boto3 library:

!pip install boto3

import os
import boto3

s3 = boto3.client("s3")
local_dir = "/content/deepseek_lora_finetuned"

# upload_file transfers one file at a time, so walk the directory
for root, _, files in os.walk(local_dir):
    for name in files:
        path = os.path.join(root, name)
        key = os.path.join("model_directory/deepseek_lora_finetuned", os.path.relpath(path, local_dir))
        s3.upload_file(path, "your-bucket-name", key)

Explanation:

drive.mount and shutil.copytree: Mount Google Drive inside Colab and copy the model directory from the local environment into it.
boto3: The AWS SDK for Python, used to interact with AWS services such as S3. Its upload_file method uploads a single file, so the loop uploads each file in the model directory to the S3 bucket.

Expected result: The model is stored in the cloud and can be accessed from there, which makes it easier to share and deploy in other environments.

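Common Problems and Troubleshooting

Several common pitfalls can come up when fine-tuning large models such as DeepSeek-7B, especially around GPU memory, training configuration, and reward function tuning. Knowing about them and how to troubleshoot them can save a lot of time.

1. GPU Memory Overload

Fine-tuning large models often exhausts GPU memory, especially with high LoRA ranks or large batch sizes. To mitigate this:

Reduce the batch size by lowering per_device_train_batch_size in GRPOConfig until training fits in GPU memory.
Enable gradient checkpointing by setting use_gradient_checkpointing="unsloth", which recomputes intermediate activations instead of storing all of them.
If memory problems persist, lower the LoRA rank, since a lower rank needs less memory.

2. Incorrect Model Loading

An incorrect loading configuration can cause problems, especially when loading a large model in 4-bit precision or when using LoRA. Be sure to:

Confirm that the LoRA rank and other model-specific settings (such as max_lora_rank and gpu_memory_utilization) are set appropriately for your GPU.
Enable vLLM fast inference when working with large models to avoid unnecessary latency.

3. Reward Function Mismatch

Reward functions need careful design. A reward function that is wrong or too strict can hold back learning and leave the model below its best performance. To troubleshoot:

Check how reward functions such as correctness_reward_func and strict_format_reward_func behave on sample outputs, and make sure they agree with the output you want.
If the model produces unstable or undesirable responses, adjust the reward thresholds and scoring scheme.

4. Data Problems

Data quality and formatting are essential for successful training. If you use a custom dataset, convert it to the Hugging Face Dataset format and make sure any JSON-based input is parsed and preprocessed correctly. Always check the dataset for inconsistencies or missing fields, especially when using reward functions such as correctness_reward_func, which depends on exact answer matching.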
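A quick validation pass over the raw records catches missing or empty fields before they reach the trainer. The sketch below assumes the raw entries use the `question` and `response` keys from the dataset step; the function name `find_bad_records` is illustrative.

```python
REQUIRED_FIELDS = ("question", "response")


def find_bad_records(records: list[dict]) -> list[tuple[int, list[str]]]:
    """Return (index, missing_fields) for every record with absent or blank fields."""
    bad = []
    for index, record in enumerate(records):
        missing = []
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if value is None or not str(value).strip():
                missing.append(field)
        if missing:
            bad.append((index, missing))
    return bad


sample_records = [
    {"question": "What is 2 + 2?", "response": "4"},
    {"question": "What is the boiling point of water?"},  # response missing
    {"question": "   ", "response": "n/a"},               # blank question
]
print(find_bad_records(sample_records))
```

Running a check like this right after `json.load` is cheap, and it avoids a KeyError or a silently unrewardable example partway through training.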

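5. Training Configuration Conflicts

Conflicts in the training configuration, such as mismatched learning rates, optimizer settings, or gradient accumulation steps, can lead to poor performance or slow convergence. Make sure the parameters in the GRPO configuration are tuned to your hardware and training goals. In addition, a low learning rate combined with more gradient accumulation steps helps stabilize training for large models.

By addressing these common problems and monitoring memory usage, data formatting, and the effectiveness of the reward functions, you can streamline fine-tuning and make training run more smoothly.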

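Conclusion

In this guide, we explored GRPO (Group Relative Policy Optimization) fine-tuning on DeepSeek-7B with LoRA (Low-Rank Adaptation), combining the strengths of both techniques to optimize large-model training. We first discussed the architecture of DeepSeek-7B and the GRPO algorithm, and outlined the role Unsloth plays in memory management and efficient training. We then walked through the practical steps: setting up the environment, loading the model with LoRA, and fine-tuning with reinforcement-learning reward functions.

Effective fine-tuning combines GRPO and LoRA: GRPO strengthens learning through policy-based updates, while LoRA keeps training memory-efficient. We showed how to define reward functions, optimize with GRPOTrainer, and keep the model usable by saving and reloading it. The main challenges are scaling to larger datasets and refining the reward functions for better adaptability. Extending GRPO to multimodal models could push capabilities further.

Key takeaways:

DeepSeek-7B and GRPO provide a strong foundation for fine-tuning large models with reinforcement-learning-based optimization.
LoRA reduces memory usage and makes efficient fine-tuning of large models possible through low-rank adaptation.
GRPO differs from traditional methods such as PPO in how it computes policy updates, which improves training efficiency.
Well-structured reward functions are essential in reinforcement learning fine-tuning, because they guide the model toward high-quality output.
Saving and reloading the fine-tuned model makes it reusable over the long term.
Future improvements focus on scaling to larger datasets, trying new reward functions, and applying GRPO to multimodal models (text, image, audio).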
