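Evaluating Language Models with the BLEU Metric

In artificial intelligence, evaluating the performance of language models is a distinctive challenge. Unlike image recognition or numerical prediction, language quality cannot be reduced to a simple binary measurement. BLEU (Bilingual Evaluation Understudy), introduced by IBM researchers in 2002, has become a cornerstone of machine translation evaluation.

BLEU was a breakthrough in natural language processing because it was among the first evaluation methods to correlate reasonably well with human judgment while keeping the efficiency of automation. This article looks at how BLEU works, where it is applied, where it falls short, and what its future holds as AI-driven systems pay ever more attention to the nuances of generated language.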

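The Origins of the BLEU Metric: A Historical Perspective

Before BLEU, machine translation was evaluated largely by hand, a resource-intensive process in which language experts assessed each output individually. The introduction of BLEU by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu of IBM Research marked a paradigm shift. Their 2002 paper, "BLEU: a Method for Automatic Evaluation of Machine Translation," proposed an automatic metric whose scores agree closely with human judgment.

The timing was critical. As statistical machine translation systems matured, the field urgently needed a standardized evaluation method. BLEU filled that gap with a reproducible, language-independent scoring mechanism that made meaningful comparisons between translation systems possible.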

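How Does the BLEU Metric Work?

The core idea behind BLEU is simple: compare a machine-generated translation with reference translations (usually produced by human translators). BLEU scores have been observed to fall as sentence length increases, although the effect varies with the translation model. Implementing the metric, however, draws on several concepts from computational linguistics, covered in the sections that follow.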

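N-gram Precision

BLEU is built on n-gram precision: the percentage of word sequences in the machine translation that also appear in any of the reference translations. Rather than looking only at single words (unigrams), BLEU examines contiguous sequences of several lengths:

Unigram (single-word) modified precision: measures lexical accuracy
Bigram (two-word sequence) modified precision: captures the correctness of basic phrases
Trigram and 4-gram modified precision: assess grammatical structure and word order

BLEU computes the modified precision for each n-gram length as follows:

Count the n-grams that the candidate translation shares with the reference translations
Apply "clipping" so that a repeated word is credited no more often than it appears in any single reference
Divide by the total number of n-grams in the candidate translation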
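
The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper names `ngrams` and `modified_precision` are chosen here for the example, and the input is assumed to be already tokenized. The test sentence is the well-known degenerate case in which a candidate repeats one word.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous n-gram in tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0

    # For each n-gram, the most times it occurs in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    # Clip each candidate count to that maximum, then divide by the total.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
print(modified_precision(candidate, references, 1))  # 2/7 rather than 7/7
```

Without clipping, every "the" in the candidate would count as a match and the unigram precision would be a perfect 1.0; with clipping, only two occurrences are credited, because no single reference contains the word more than twice.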

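Brevity Penalty

To keep systems from gaming the metric with very short translations (which can reach high precision by including only easily matched words), BLEU adds a brevity penalty that lowers the score of translations shorter than the reference.

The penalty is computed as: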

$$
\mathrm{BP} =
\begin{cases}
\exp(1 - r/c) & \text{if } c < r \\
1 & \text{if } c \ge r
\end{cases}
$$

where r is the reference length and c is the length of the candidate translation.
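
As a quick check of the formula, the penalty can be computed directly. The function name `brevity_penalty` and the handling of an empty candidate are assumptions made for this sketch.

```python
import math


def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if c >= r, otherwise exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0  # assumption: an empty candidate gets the maximum penalty
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


print(brevity_penalty(4, 4))            # 1.0
print(round(brevity_penalty(3, 4), 3))  # 0.717
```

A three-token candidate scored against a four-token reference is therefore scaled down to roughly 72% of its precision-based score.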

Final BLEU Score

The final BLEU score combines these components into a single value between 0 and 1 (often expressed as a percentage):

$$
\mathrm{BLEU} = \mathrm{BP} \times \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)
$$

where

BP is the brevity penalty
w_n is the weight given to each n-gram precision (usually uniform)
p_n is the modified precision for n-grams of length n
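
The formula is a weighted geometric mean of the precisions, scaled by the brevity penalty. The sketch below assumes the precisions and the penalty have already been computed; `combine_bleu` is a name invented for the example, and returning 0 when any precision is 0 reflects unsmoothed BLEU.

```python
import math


def combine_bleu(precisions, bp, weights=None):
    """BLEU = BP * exp(sum_n w_n * log p_n)."""
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)  # uniform weights
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    log_mean = sum(w * math.log(p) for w, p in zip(weights, precisions))
    return bp * math.exp(log_mean)


print(combine_bleu([1.0, 1.0, 1.0, 1.0], bp=1.0))            # 1.0
print(round(combine_bleu([0.8, 0.6, 0.4, 0.2], bp=1.0), 4))  # 0.4427
```

Because the mean is geometric, a single zero precision drives the whole score to zero, which is why sentence-level BLEU is usually computed with some form of smoothing.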

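Implementing the BLEU Metric

Understanding BLEU conceptually is one thing; implementing it correctly takes attention to detail. What follows is a practical guide to using BLEU effectively.

Required Inputs

BLEU needs two main inputs:

Candidate translations: the machine-generated translations to be evaluated
Reference translations: one or more human-produced translations of each source sentence

Both inputs must be preprocessed in the same way:

Tokenization: splitting the text into words or subwords
Case normalization: usually lowercasing all text
Punctuation handling: removing punctuation or treating each mark as a separate token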
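
A minimal preprocessing sketch is shown below. It assumes space-delimited text and a simple regular-expression tokenizer; the function name `preprocess` is illustrative, and production pipelines typically rely on an established tokenizer instead.

```python
import re


def preprocess(text, lowercase=True):
    """Lowercase the text and split it into word and punctuation tokens."""
    if lowercase:
        text = text.lower()
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


candidate = preprocess("This is a Test.")
reference = preprocess("this is a test .")
print(candidate)               # ['this', 'is', 'a', 'test', '.']
print(candidate == reference)  # True
```

Applying the same function to candidates and references is the point: the two strings differ in case and spacing, yet they yield identical token sequences.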

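Implementation Steps

A typical BLEU implementation proceeds as follows:

Preprocess all translations: apply consistent tokenization and normalization
Compute the n-gram precision for n = 1 to N (usually N = 4):
- Count all n-grams in the candidate translation
- Count the n-grams that also occur in the reference translations (with clipping)
- Compute the precision as matches / total candidate n-grams
Compute the brevity penalty:
- Determine the effective reference length (in the original BLEU paper, the reference length closest to the candidate length; some implementations use the shortest reference instead)
- Compare it with the candidate length
- Apply the brevity penalty formula
Combine the components into the final score:
- Take the weighted geometric mean of the n-gram precisions
- Multiply by the brevity penalty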
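
With several references, the brevity penalty depends on which reference length is chosen, and implementations differ on this point. The sketch below contrasts two common conventions; the function name, the `strategy` parameter, and the tie-breaking rule are assumptions made for illustration.

```python
def effective_reference_length(candidate_len, reference_lens, strategy="closest"):
    """Pick the reference length used by the brevity penalty."""
    if strategy == "shortest":
        return min(reference_lens)
    if strategy == "closest":
        # Assumption: ties are broken in favour of the shorter reference.
        return min(reference_lens, key=lambda r: (abs(r - candidate_len), r))
    raise ValueError(f"unknown strategy: {strategy}")


reference_lens = [8, 12, 15]
print(effective_reference_length(11, reference_lens, "closest"))   # 12
print(effective_reference_length(11, reference_lens, "shortest"))  # 8
```

For an 11-token candidate, the "closest" convention yields r = 12 and a small penalty, whereas the "shortest" convention yields r = 8 and no penalty at all, so the two can produce different scores from the same data.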

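Common Implementation Tools

Several libraries provide ready-to-use BLEU implementations:

NLTK: Python's Natural Language Toolkit provides a simple BLEU implementation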

from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu, sentence_bleu

# Smoothing avoids zero scores when a higher-order n-gram has no match
smoothie = SmoothingFunction().method1

# Example 1: single reference, perfect match
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate)
print(f"Perfect match BLEU score: {score}")

# Example 2: single reference, partial match (smoothed)
reference = [['this', 'is', 'a', 'test']]
candidate = ['this', 'is', 'test']
score = sentence_bleu(reference, candidate, smoothing_function=smoothie)
print(f"Partial match BLEU score: {score}")

# Example 3: multiple references with corpus_bleu
# corpus_bleu expects one list of tokenized references per candidate:
# [[ref1_tokens, ref2_tokens, ...], ...]
candidates = [['this', 'is', 'an', 'assessment']]
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'an', 'evaluation']]]
score = corpus_bleu(references, candidates, smoothing_function=smoothie)
print(f"Multiple reference BLEU score: {score}")

Output

Perfect match BLEU score: 1.0
Partial match BLEU score: 0.19053627645285995
Multiple reference BLEU score: 0.3976353643835253

SacreBLEU: a standardized BLEU implementation that addresses reproducibility

import sacrebleu

# Sentence-level BLEU: the hypothesis is a string, the references a list of strings
reference = ["this is a test"]  # list containing a single reference
candidate = "this is a test"    # hypothesis string
score = sacrebleu.sentence_bleu(candidate, reference)
print(f"Perfect match SacreBLEU score: {score}")

# Partial match
reference = ["this is a test"]
candidate = "this is test"
score = sacrebleu.sentence_bleu(candidate, reference)
print(f"Partial match SacreBLEU score: {score}")

# Multiple references for one hypothesis
references = ["this is a test", "this is a quiz"]
candidate = "this is an exam"
score = sacrebleu.sentence_bleu(candidate, references)
print(f"Multiple references SacreBLEU score: {score}")

Output

Perfect match SacreBLEU score: BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 4 ref_len = 4)
Partial match SacreBLEU score: BLEU = 45.14 100.0/50.0/50.0/0.0 (BP = 0.717 ratio = 0.750 hyp_len = 3 ref_len = 4)
Multiple references SacreBLEU score: BLEU = 31.95 50.0/33.3/25.0/25.0 (BP = 1.000 ratio = 1.000 hyp_len = 4 ref_len = 4)

Hugging Face Evaluate: a modern implementation that integrates with ML pipelines

from evaluate import load

# Load the BLEU metric; predictions are strings, references a list of lists of strings
bleu = load('bleu')

# Example 1: perfect match
predictions = ["this is a test"]
references = [["this is a test"]]
results = bleu.compute(predictions=predictions, references=references)
print(f"Perfect match HF Evaluate BLEU score: {results}")

# Example 2: multi-sentence (corpus-level) evaluation
predictions = ["the cat is on the mat", "there is a dog in the park"]
references = [["the cat sits on the mat"], ["a dog is running in the park"]]
results = bleu.compute(predictions=predictions, references=references)
print(f"Multi-sentence HF Evaluate BLEU score: {results}")

# Example 3: a longer sentence with two references
predictions = ["The agreement on the European Economic Area was signed in August 1992."]
references = [["The agreement on the European Economic Area was signed in August 1992.", "An agreement on the European Economic Area was signed in August of 1992."]]
results = bleu.compute(predictions=predictions, references=references)
print(f"Complex example HF Evaluate BLEU score: {results}")

Output

Perfect match HF Evaluate BLEU score: {'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 4, 'reference_length': 4}
Multi-sentence HF Evaluate BLEU score: {'bleu': 0.0, 'precisions': [0.8461538461538461, 0.5454545454545454, 0.2222222222222222, 0.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 13, 'reference_length': 13}
Complex example HF Evaluate BLEU score: {'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 13, 'reference_length': 13}

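Interpreting BLEU Output

BLEU scores range from 0 to 1 (or 0 to 100 when expressed as a percentage):

0: no overlap between the candidate and the references
1 (or 100%): an exact match with a reference
Typical ranges:
- 0-15: poor translation
- 15-30: understandable, but flawed
- 30-40: good translation
- 40-50: high-quality translation
- 50+: excellent translation (possibly approaching human quality)

These ranges, however, vary widely across language pairs. For example, English-Chinese translation usually scores lower than English-French translation because of differences between the languages, not because of actual differences in quality.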
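
One practical pitfall is that libraries report BLEU on different scales: the NLTK example earlier returns a value between 0 and 1, whereas SacreBLEU prints a value between 0 and 100. The helper below is a small sketch that normalizes the scale before applying the rough bands listed above. The function names are hypothetical, the band boundaries are assigned to the upper band by assumption, and the labels are rules of thumb rather than a standard.

```python
def to_percent(score):
    """Convert a 0-1 BLEU score to the 0-100 scale."""
    return score * 100


def describe(percent):
    """Map a 0-100 BLEU score to the rough quality bands from the text."""
    bands = [
        (15, "poor"),
        (30, "understandable but flawed"),
        (40, "good"),
        (50, "high quality"),
    ]
    for upper, label in bands:
        if percent < upper:
            return label
    return "excellent"


nltk_style = 0.3976      # 0-1 scale
sacrebleu_style = 31.95  # 0-100 scale
print(describe(to_percent(nltk_style)))  # good
print(describe(sacrebleu_style))         # good
```

The bands are meant for corpus-level scores on a given language pair; applying them to single toy sentences, as here, only demonstrates the scale conversion.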

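Score Variations

Different BLEU implementations can produce different scores, for the following reasons:

Smoothing method: how zero precision values are handled
Tokenization differences: especially important for languages without explicit word boundaries
N-gram weighting scheme: standard BLEU uses uniform weights, but alternatives exist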
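
The effect of tokenization is easy to reproduce without any BLEU library. The sketch below scores the same sentence pair with two tokenizers, using clipped unigram precision as a stand-in for the full metric; all function names are illustrative.

```python
import re
from collections import Counter


def unigram_precision(candidate_tokens, reference_tokens):
    """Clipped unigram precision against a single reference."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    matched = sum(min(count, ref[token]) for token, count in cand.items())
    return matched / len(candidate_tokens)


def whitespace_tokenize(text):
    """Split on whitespace only, leaving punctuation attached to words."""
    return text.split()


def punctuation_tokenize(text):
    """Split words and punctuation marks into separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


candidate = "Hello, world!"
reference = "Hello world!"

for tokenize in (whitespace_tokenize, punctuation_tokenize):
    score = unigram_precision(tokenize(candidate), tokenize(reference))
    print(tokenize.__name__, round(score, 3))
# whitespace_tokenize 0.5
# punctuation_tokenize 0.75
```

Identical text yields 0.5 under one tokenizer and 0.75 under the other, which is why scores should only be compared when the tokenization is known to be the same.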

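Beyond Translation: Extended Applications of BLEU

Although BLEU was designed for machine translation evaluation, its influence has spread across natural language processing:

Text summarization - Researchers have used BLEU to evaluate automatic summarization systems by comparing model-generated summaries with human-written reference summaries. Summarization poses its own challenges, such as the need to preserve meaning rather than exact wording, but modified BLEU variants have proved useful in this area.
Dialogue systems and chatbots - Conversational AI developers use BLEU to measure the response quality of dialogue systems, with some important caveats. Because conversation is open-ended, many different replies can be equally valid, which makes reference-based evaluation especially difficult. Even so, BLEU offers a starting point for judging whether a reply is appropriate.
Image captioning - In multimodal AI, BLEU helps evaluate systems that generate text descriptions of images. By comparing model-generated captions with human annotations, researchers can quantify caption accuracy while acknowledging that description is a creative task.
Code generation - An emerging application is the evaluation of code generation models, where BLEU measures the similarity between AI-generated code and a reference implementation. This use highlights how broadly BLEU applies across different kinds of structured language.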

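Limitations: Why BLEU Is Not Perfect

Despite its wide adoption, BLEU has well-documented limitations that researchers must take into account:

Semantic blindness - Perhaps BLEU's greatest limitation is its inability to capture semantic equivalence. Two translations can express the same meaning in entirely different words, yet BLEU gives a low score to the variant that does not match the reference wording. This surface-level evaluation can penalize valid stylistic choices and alternative phrasings.
Lack of contextual understanding - BLEU treats sentences as isolated units and ignores document-level coherence and contextual appropriateness. This is especially problematic when evaluating texts in which context strongly influences word choice and meaning.
Insensitivity to critical errors - Not all translation errors carry equal weight. A small difference in word order may barely affect comprehension, whereas a mistranslated negation can reverse the meaning of an entire sentence. BLEU treats both alike and cannot tell a trivial error from a critical one.
Reference dependence - BLEU's reliance on reference translations introduces an inherent bias. The metric cannot recognize the merit of a valid translation that differs substantially from the references provided. This dependence also creates practical difficulties for low-resource languages, where multiple high-quality references are hard to obtain.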
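
Two of these limitations, semantic blindness and insensitivity to critical errors, can be seen in a toy example. The sketch below uses clipped n-gram precision against one reference, with invented sentences: one candidate drops a negation and reverses the meaning, the other paraphrases the reference faithfully.

```python
from collections import Counter


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision against a single reference."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = grams(candidate), grams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(c, ref[g]) for g, c in cand.items()) / total


reference = "the treatment is not effective".split()
negation_dropped = "the treatment is effective".split()  # meaning reversed
paraphrase = "this therapy does not work".split()        # meaning preserved

for name, cand in [("negation dropped", negation_dropped), ("paraphrase", paraphrase)]:
    p1 = clipped_precision(cand, reference, 1)
    p2 = clipped_precision(cand, reference, 2)
    print(f"{name}: unigram={p1:.2f} bigram={p2:.2f}")
# negation dropped: unigram=1.00 bigram=0.67
# paraphrase: unigram=0.20 bigram=0.00
```

The candidate that says the opposite of the reference gets near-perfect overlap, while the faithful paraphrase scores close to zero, which is the surface-level behaviour described above.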

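Beyond BLEU: The Evolution of Evaluation Metrics

BLEU's limitations have driven the development of complementary metrics, each addressing a specific shortcoming:

METEOR (Metric for Evaluation of Translation with Explicit ORdering) - METEOR strengthens evaluation through:
- Stem and synonym matching to recognize semantic equivalence
- Explicit assessment of word order
- Parameterized weighting of precision and recall
chrF (character n-gram F-score) - This metric works at the character level rather than the word level, which makes it particularly effective for morphologically rich languages, where small variations in word form are frequent.
BERTScore - Using contextual embeddings from Transformer models such as BERT, this metric captures semantic similarity between a translation and its reference, addressing BLEU's semantic blind spot.
COMET (Crosslingual Optimized Metric for Evaluation of Translation) - COMET uses neural networks trained on human judgments to predict translation quality, and may capture aspects of a translation that matter to human readers but escape traditional metrics.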
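
Of these metrics, chrF is simple enough to sketch in a few lines. The code below is a simplified illustration, not the reference chrF implementation: it averages character n-gram precision and recall over orders 1 to 6, ignores spaces, and combines them with an F-beta score using beta = 2. Those parameter choices and the function names are assumptions made for the example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (an assumption of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_sketch(candidate, reference, max_n=6, beta=2.0):
    """Simplified character n-gram F-score."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue  # skip orders longer than either string
        overlap = sum((cand & ref).values())  # clipped matches
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)


print(chrf_sketch("the cats sat", "the cats sat"))  # 1.0
print(chrf_sketch("the cats sat", "the cat sat"))
```

At the word level, "cats" and "cat" are simply different tokens; at the character level, most n-grams still match, so a minor inflectional difference costs only part of the score.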

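The Future of BLEU in the Era of Neural Machine Translation

As neural machine translation (NMT) systems increasingly produce human-quality output, BLEU faces new challenges and opportunities:

Ceiling effect - On some language pairs, the best NMT systems now achieve BLEU scores that approach or exceed those of human translations. This ceiling effect raises doubts about whether BLEU can still distinguish between high-performing systems.
The human parity debate - Recent claims of "human parity" in machine translation have sparked debate about evaluation methodology. BLEU sits at the center of these discussions, with researchers questioning whether current metrics adequately reflect translation quality near the human level.
Domain customization - Different domains prioritize different aspects of translation quality. Medical translation demands precise terminology, whereas marketing content may value creative adaptation. Future BLEU implementations may incorporate domain-specific weights to reflect these priorities.
Combination with human feedback - The most promising direction may be hybrid evaluation that pairs automatic metrics such as BLEU with targeted human assessment. Such approaches keep BLEU's efficiency while using strategic human intervention to cover its blind spots.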

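Summary

Despite its limitations, BLEU remains foundational to machine translation research and development. Its simplicity, reproducibility, and correlation with human judgment have made it the lingua franca of translation evaluation. Newer metrics address specific weaknesses of BLEU, but none has fully replaced it.

The story of BLEU reflects a broader pattern in AI: the tension between computational efficiency and nuanced evaluation. As language technology advances, evaluation methods must evolve with it. BLEU's greatest contribution may ultimately be as the foundation on which more sophisticated evaluation paradigms are built.

As machines increasingly mediate human communication, metrics such as BLEU are no longer just a research exercise but a safeguard that AI-driven language tools meet human needs. Understanding the strengths and limitations of the BLEU metric is indispensable for anyone working at the intersection of technology and language.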
