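RF-DETR Object Detection Model: Balancing Speed and Accuracy

Today we are introducing a newcomer: RF-DETR. Let's go through everything there is to know about Roboflow's RF-DETR and how it manages to deliver both speed and accuracy in object detection.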

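What Is RF-DETR?

RF-DETR is a Transformer-based real-time object detection model that achieves over 60 mAP on the COCO dataset, an impressive result. Naturally, we are curious: can RF-DETR match YOLO's speed? Can it adapt to the variety of tasks we encounter in the real world?

Those are the questions we explore here. In this article, we look at RF-DETR's core features, real-time capability, strong domain adaptability, and open-source availability, and see how it compares with other models. Let's find out whether this newcomer has what it takes to stand out in real-world applications!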

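Why Is RF-DETR a Game Changer?

Performs strongly on both the COCO and RF100-VL benchmarks.
Designed to handle novel domains and high-speed environments, making it well suited to edge and low-latency applications.
Ranks in the top two in every category when compared with real-time COCO SOTA Transformer models (such as D-FINE and LW-DETR) and SOTA YOLO CNN models (such as YOLOv11 and YOLOv8).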

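Model Performance and a New Benchmark

Object detection models increasingly have to prove their worth beyond COCO, a dataset that has been historically critical but has not been updated since 2017. As a result, many models show only marginal improvements on COCO and turn to other datasets (such as LVIS and Objects365) to demonstrate generalizability.

RF100-VL: Roboflow's new benchmark gathers around 100 diverse datasets (aerial imagery, industrial inspection, and so on) from the more than 500,000 datasets on Roboflow Universe. The benchmark emphasizes domain adaptability, a critical factor in real-world use cases, where the data can look very different from COCO's common objects.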

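Why Do We Need RF100-VL?

Real-world diversity: RF100-VL includes datasets covering scenarios such as lab imaging, industrial inspection, and aerial photography, to test how models perform beyond traditional benchmarks.
Standardized benchmark: By standardizing the evaluation process, RF100-VL allows direct comparison between different architectures, including Transformer-based models and CNN-based YOLO variants.
Adaptability over incremental gains: With COCO saturating, domain adaptability becomes a top-tier consideration alongside latency and raw accuracy.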

RF-DETR Compared with Other Real-Time Object Detection Models

Source: Roboflow

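In the table above, we can see how RF-DETR compares with other real-time object detection models:

COCO: The base variant of RF-DETR reaches 53.3 mAP, on par with other real-time models.
RF100-VL: RF-DETR outperforms the other models (86.7 mAP), showing its exceptional domain adaptability.
Speed: RF-DETR runs at 6.0 ms/img on a T4 GPU; once post-processing is taken into account, it matches or outpaces competing models.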
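To put the latency figure in perspective, a per-image latency can be converted into an approximate throughput. The sketch below is plain arithmetic, assuming images are processed one at a time (batch size 1) and ignoring any I/O overhead; the function name `latency_ms_to_fps` is made up for this illustration.

```python
def latency_ms_to_fps(latency_ms: float) -> float:
    """Convert per-image latency in milliseconds to frames per second."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


fps = latency_ms_to_fps(6.0)
print(f"{fps:.1f} FPS")  # 166.7 FPS
```

At 6.0 ms per image, the model would have headroom well beyond a typical 30 or 60 FPS video stream on the same hardware, under the assumptions above.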

Note: Code and checkpoints are currently available for RF-DETR-large and RF-DETR-base.

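Total Latency Matters Too

NMS in YOLO: YOLO models use non-maximum suppression (NMS) to refine bounding boxes. This step can slightly slow down inference, especially when there are many objects in the frame.

Total latency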

Source: Roboflow

No extra step for DETR: RF-DETR follows the DETR family's approach and avoids the extra NMS step needed to refine bounding boxes.
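To make the cost of that extra step concrete, here is a minimal sketch of greedy NMS in NumPy. It is an illustration of the general technique, not the exact implementation shipped with any YOLO release; the function names and the 0.5 IoU threshold are assumptions chosen for this example.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    order = np.argsort(scores)[::-1]  # candidate indices, best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        # Drop every remaining box that overlaps the kept box too much
        ious = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[ious <= iou_threshold]
    return keep


boxes = np.array([
    [10, 10, 110, 110],    # high-scoring box
    [12, 12, 112, 112],    # near-duplicate of the first box
    [200, 200, 300, 300],  # separate object
], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(greedy_nms(boxes, scores))  # [0, 2]
```

The loop runs once per kept box and compares it against all remaining candidates, which is why the step gets more expensive as the number of objects in a frame grows.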

Latency vs. Accuracy on COCO

Source: Roboflow

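Horizontal axis (latency): Measured in milliseconds (ms) per image on an NVIDIA T4 GPU using TensorRT10 FP16. Lower latency means faster inference 🙂
Vertical axis (mAP @0.50:0.95): Mean Average Precision on the Microsoft COCO benchmark, a standard measure of detection accuracy. A higher mAP indicates better performance.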
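The "@0.50:0.95" suffix means the score is averaged over ten IoU thresholds, from 0.50 to 0.95 in steps of 0.05. The sketch below shows only that final averaging step; the per-threshold AP values are made up for illustration and are assumed to be already averaged over classes.

```python
import numpy as np

# The ten IoU thresholds of the COCO metric: 0.50, 0.55, ..., 0.95
iou_thresholds = np.linspace(0.50, 0.95, 10)

# Hypothetical AP at each threshold (illustrative numbers, not real results)
ap_per_threshold = np.array(
    [0.70, 0.68, 0.65, 0.62, 0.58, 0.53, 0.46, 0.37, 0.25, 0.10]
)

# mAP@0.50:0.95 is the mean of the per-threshold AP values
map_50_95 = float(ap_per_threshold.mean())
print(len(iou_thresholds), round(map_50_95, 3))  # 10 0.494
```

Because the stricter thresholds demand tighter boxes, a model's score under this metric is usually well below its score at IoU 0.50 alone.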

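In this chart, RF-DETR shows accuracy competitive with YOLO models while keeping latency in the same range. RF-DETR crosses the 60 mAP threshold, making it the first documented real-time model to reach this level of performance on COCO.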

Domain Adaptability on RF100-VL

Source: Roboflow

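Here RF-DETR stands out, achieving the highest mAP on RF100-VL, which indicates strong adaptability across domains. In other words, RF-DETR is not only competitive on COCO but also excels on real-world datasets, where domain-specific objects and conditions can differ greatly from COCO's common objects.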

Potential Ranking of RF-DETR

Source: Leaderboard.roboflow.com

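Based on performance metrics from the Roboflow leaderboard, RF-DETR is highly competitive in both accuracy and efficiency.

RF-DETR-Large (128M parameters) would rank 1st, ahead of all existing models, with an estimated mAP 50:95 above 60.5, making it the most accurate model on the leaderboard.
RF-DETR-Base (29M parameters) would rank around 4th, competing closely with models such as DEIM-D-FINE-X (61.7M parameters, 0.548 mAP 50:95) and D-FINE-X (61.6M parameters, 0.541 mAP 50:95). Despite its lower parameter count, it maintains strong accuracy.

This ranking further highlights RF-DETR's efficiency: it delivers high performance with optimized latency while keeping a smaller model size than some of its competitors.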

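RF-DETR Architecture Overview

CNN-based YOLO models have long led the field of real-time object detection. However, CNNs do not always benefit from large-scale pre-training, which is becoming increasingly important in machine learning.

Transformers excel at large-scale pre-training but have often been too bulky or slow for real-time applications. Recent work, however, shows that DETR-based models can match YOLO's speed once the post-processing overhead that YOLO requires is taken into account.

RF-DETR architecture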

Source: Deformable DETR paper (RF-DETR builds on this architecture)

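RF-DETR's Hybrid Advantages

Pre-trained DINOv2 backbone: This helps the model transfer knowledge from large-scale image pre-training, improving performance in new or varied domains. By combining LW-DETR with a pre-trained DINOv2 backbone, RF-DETR offers exceptional domain adaptability and gains significantly from pre-training.
Single-scale feature extraction: Whereas Deformable DETR uses multi-scale attention, RF-DETR simplifies feature extraction to a single scale, striking a balance between speed and performance.
Multi-resolution training: RF-DETR can be trained at multiple resolutions, so you can choose the best trade-off between inference speed and accuracy without retraining the model.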

For more information, read the research paper.

How to Use RF-DETR?

Task 1: Detect Objects in an Image with RF-DETR

Install RF-DETR with the following command:

!pip install rfdetr

You can then load a pre-trained checkpoint (trained on COCO) for immediate use in your application:

import io

import requests
import supervision as sv
from PIL import Image
from rfdetr import RFDETRBase

# Load the base model with its COCO-pretrained checkpoint
model = RFDETRBase()

# Download the example image and decode it with PIL
url = "https://media.roboflow.com/notebooks/examples/dog-2.jpeg"
image = Image.open(io.BytesIO(requests.get(url).content))

# Run inference with a confidence threshold of 0.5
detections = model.predict(image, threshold=0.5)

# Draw boxes and labels on a copy so the original image stays untouched
annotated_image = image.copy()
annotated_image = sv.BoxAnnotator().annotate(annotated_image, detections)
annotated_image = sv.LabelAnnotator().annotate(annotated_image, detections)
sv.plot_image(annotated_image)

Detecting objects in an image with RF-DETR

Source – Link

Task 2: Detect Objects in a Video

Here is a link to my GitHub repository, so you can freely try the model yourself 🙂. Just follow the README.md instructions to run the code. GitHub link.
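The video script in this section reads a classes.json file and looks up each class with `class_names.get(str(class_id), "Unknown")`, which implies a JSON object keyed by class ID strings. The actual contents of the repository's file are not shown in this article, so the sketch below writes a tiny hypothetical file of that shape (the three entries are placeholders for illustration) and reads it back the same way.

```python
import json
import os
import tempfile

# Hypothetical class map: keys are class IDs as strings, values are names
example_classes = {"1": "person", "2": "bicycle", "3": "car"}

# Write the file to a throwaway directory so nothing real is overwritten
path = os.path.join(tempfile.mkdtemp(), "classes.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(example_classes, f, ensure_ascii=False, indent=2)

# Read it back exactly as the video script does
with open(path, "r", encoding="utf-8") as f:
    class_names = json.load(f)

print(class_names.get(str(1), "Unknown"))   # person
print(class_names.get(str(99), "Unknown"))  # Unknown
```

JSON object keys are always strings, which is why the lookup converts the integer class ID with `str()` before calling `get`.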

Code:

import json

import cv2
from rfdetr import RFDETRBase

# Load the model
model = RFDETRBase()

# Read the classes.json file and store class names in a dictionary
with open('classes.json', 'r', encoding='utf-8') as file:
    class_names = json.load(file)

# Open the video file
cap = cv2.VideoCapture('walking.mp4')  # https://www.pexels.com/video/video-of-people-walking-855564/
# For live video streaming:
# cap = cv2.VideoCapture(0)  # 0 refers to the default camera

# Create the output video ('mp4v' matches the .mp4 container)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('output.mp4', fourcc, 20.0, (960, 540))

while True:
    # Read a frame
    ret, frame = cap.read()
    if not ret:
        break  # Exit the loop when the video ends

    # Perform object detection; OpenCV frames are BGR, so convert to RGB first
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    detections = model.predict(rgb_frame, threshold=0.5)

    # Mark the detected objects
    for i, box in enumerate(detections.xyxy):
        x1, y1, x2, y2 = map(int, box)
        class_id = int(detections.class_id[i])
        # Get the class name using class_id
        label = class_names.get(str(class_id), "Unknown")
        confidence = detections.confidence[i]

        # Draw the bounding box (white and thick)
        color = (255, 255, 255)
        thickness = 7
        cv2.rectangle(frame, (x1, y1), (x2, y2), color, thickness)

        # Draw the label and confidence score in red (OpenCV colors are BGR)
        text = f"{label} ({confidence:.2f})"
        font = cv2.FONT_HERSHEY_SIMPLEX
        font_scale = 2
        font_thickness = 7
        text_size = cv2.getTextSize(text, font, font_scale, font_thickness)[0]
        text_x = x1
        # Keep the label inside the frame when the box touches the top edge
        text_y = max(y1 - 10, text_size[1])
        cv2.putText(frame, text, (text_x, text_y), font, font_scale, (0, 0, 255), font_thickness, cv2.LINE_AA)

    # Display the results
    resized_frame = cv2.resize(frame, (960, 540))
    cv2.imshow('Labeled Video', resized_frame)

    # Save the output
    out.write(resized_frame)

    # Exit when 'q' key is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release resources
cap.release()
out.release()  # Release the output video
cv2.destroyAllWindows()

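Output:

Fine-Tuning on Custom Datasets

Fine-tuning is where RF-DETR really shines, especially when working with specialized or smaller datasets:

Use COCO format: Organize your dataset into train/, valid/, and test/ directories, each with its own _annotations.coco.json file.
Use Colab: The Roboflow team provides a detailed Colab notebook that walks you through training on your own dataset.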
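Before starting a training run, it can help to confirm that the dataset folder matches the layout described above. The following is a minimal sketch using only the standard library; the helper name `missing_annotation_files` is made up for this example, and the demo directory is a throwaway one that deliberately lacks the test split.

```python
import tempfile
from pathlib import Path

SPLITS = ("train", "valid", "test")
ANNOTATION_FILE = "_annotations.coco.json"


def missing_annotation_files(dataset_dir):
    """Return the expected annotation files that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [
        f"{split}/{ANNOTATION_FILE}"
        for split in SPLITS
        if not (root / split / ANNOTATION_FILE).is_file()
    ]


# Demo: build a temporary dataset folder with only train/ and valid/
demo_root = Path(tempfile.mkdtemp())
for split in ("train", "valid"):
    (demo_root / split).mkdir()
    (demo_root / split / ANNOTATION_FILE).write_text(
        '{"images": [], "annotations": [], "categories": []}', encoding="utf-8"
    )

print(missing_annotation_files(demo_root))  # ['test/_annotations.coco.json']
```

An empty list means all three splits have their annotation file in place; the check does not validate the JSON contents themselves.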

from rfdetr import RFDETRBase
model = RFDETRBase()
model.train(
    dataset_dir="<DATASET_PATH>",
    epochs=10,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-4
)
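With gradient accumulation, gradients from several small batches are summed before each optimizer step, so the settings above behave like a larger batch while only a small batch sits in GPU memory at once. The arithmetic, assuming a single GPU:

```python
batch_size = 4        # images processed per forward/backward pass
grad_accum_steps = 4  # passes accumulated before each optimizer step

effective_batch_size = batch_size * grad_accum_steps
print(effective_batch_size)  # 16
```

If memory allows a larger `batch_size`, `grad_accum_steps` can be lowered so that the product stays the same.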

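During training, RF-DETR produces:

Regular weights: the standard model checkpoint.
EMA weights: an exponential moving average version of the model, which often gives more stable performance.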
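As a rough illustration of why EMA weights tend to be more stable, here is a minimal sketch of an exponential moving average update in NumPy. The decay value and the oscillating "weights" are assumptions chosen to make the effect visible; the article does not state the EMA settings RF-DETR actually uses.

```python
import numpy as np


def ema_update(ema_weights, weights, decay=0.9):
    """Blend the current weights into the running average."""
    return decay * ema_weights + (1.0 - decay) * weights


# Stand-in for noisy training: a single weight that jumps between 2.0 and 0.0
ema = np.array([1.0])
for step in range(100):
    weights = np.array([2.0 if step % 2 == 0 else 0.0])
    ema = ema_update(ema, weights)

print(weights, ema)
```

The raw value keeps swinging between 0 and 2, while the averaged value stays close to 1, which is the smoothing effect that EMA checkpoints rely on.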

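How to Train RF-DETR on a Custom Dataset?

As an example, the Roboflow team used a mahjong tile recognition dataset, which is part of the RF100-VL benchmark and contains more than 2,000 images. Their guide demonstrates how to download the dataset, install the necessary tools, and fine-tune the model on custom data.

For more information, see the official blog post.

Mahjong tile recognition dataset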

Source – Link

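The results show the ground truth on one side and the model's detections on the other. In our example, RF-DETR correctly identifies most of the tiles, with only a few minor misdetections that could be improved with further training.

Important notes:

Instance segmentation: RF-DETR does not currently support instance segmentation, as pointed out by Piotr Skalski, Roboflow's open source lead.
Pose estimation: Pose estimation support is also coming soon.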

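Final Verdict and Potential Advantages over Other CV Models

RF-DETR is one of the best DETR-based real-time models, striking a good balance between accuracy, speed, and domain adaptability. If you need a Transformer-based real-time detector that avoids post-processing overhead and generalizes beyond COCO, it is a top choice. That said, YOLOv8 still holds an edge in raw speed for some applications.

Where RF-DETR outperforms other CV models:

Specialized domains and custom datasets: RF-DETR excels at domain adaptation (86.7 mAP on RF100-VL), making it well suited to medical imaging, industrial defect detection, and autonomous navigation, areas where COCO-trained models struggle.
Low-latency applications: Because it does not need NMS, it can be faster than YOLO where post-processing adds overhead, such as drone-based detection, video analytics, or robotics.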

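Object recognition from a motorcycle dashcam

Transformer-based and future-proof: Unlike CNN-based detectors (YOLO, Faster R-CNN), RF-DETR benefits from self-attention and large-scale pre-training (DINOv2 backbone), which makes it better suited to multi-object reasoning, occlusion handling, and generalization to unseen environments.
Edge AI and embedded devices: RF-DETR's inference time of 6.0 ms/img on a T4 GPU suggests it could be a strong candidate for real-time edge deployment, where traditional DETR models are too slow.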

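A round of applause for the Roboflow ML team: Peter Robicheaux, James Gallagher, Joseph Nelson, and Isaac Robinson.

Peter Robicheaux, James Gallagher, Joseph Nelson, Isaac Robinson. (Mar 20, 2025). RF-DETR: A SOTA Real-Time Object Detection Model. Roboflow Blog: https://blog.roboflow.com/rf-detr/

Summary

Roboflow's RF-DETR represents a new generation of real-time object detection, combining high accuracy, domain adaptability, and low latency in a single model. Whether you are building a cutting-edge robotics system or deploying on resource-constrained edge devices, RF-DETR offers a versatile, future-proof solution.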
