RF-DETR Object Detection Model: Balancing Speed and Accuracy

Today we introduce a newcomer: RF-DETR. This article covers Roboflow's RF-DETR and how it manages to deliver both speed and accuracy in object detection.

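What Is RF-DETR?

RF-DETR is a Transformer-based real-time object detection model that exceeds 60 mAP on the COCO dataset. That naturally raises two questions: can RF-DETR match YOLO for speed, and can it adapt to the variety of tasks we meet in the real world?

Those are the questions explored here. This article looks at RF-DETR's core features, real-time capability, domain adaptability, and open-source availability, and at how it compares with other models, to see whether it can stand out in practical applications.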

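Why Is RF-DETR a Game Changer?

It performs strongly on both the COCO and RF100-VL benchmarks.
It is designed for novel domains and high-speed environments, which suits edge and low-latency applications.
Compared with real-time COCO SOTA Transformer models (such as D-FINE and LW-DETR) and SOTA YOLO CNN models (such as YOLOv11 and YOLOv8), it ranks in the top two in every category.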

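Model Performance and a New Benchmark

Object detection models increasingly have to prove their worth beyond COCO, a dataset that has been historically essential but has not been updated since 2017. As a result, many models show only marginal gains on COCO and turn to other datasets (such as LVIS and Objects365) to demonstrate their generality.

RF100-VL: Roboflow's new benchmark, which gathers about 100 diverse datasets (aerial imagery, industrial inspection, and so on) from the more than 500,000 datasets on Roboflow Universe. The benchmark emphasizes domain adaptability, a key factor in practical use cases where the data can look very different from COCO's common objects.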

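Why Do We Need RF100-VL?

Real-world diversity: RF100-VL includes datasets covering scenarios such as lab imaging, industrial inspection, and aerial photography, to test how models perform outside traditional benchmarks.
Standardized benchmark: with a standardized evaluation process, RF100-VL allows direct comparison of different architectures, including Transformer-based models and CNN-based YOLO variants.
Adaptability over incremental gains: as COCO saturates, domain adaptability becomes a primary consideration alongside latency and raw accuracy.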

RF-DETR compared with other real-time object detection models

Source: Roboflow

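The table above compares RF-DETR with other real-time object detection models:

COCO: the base variant of RF-DETR reaches 53.3 mAP, on par with other real-time models.
RF100-VL: RF-DETR outperforms the other models (86.7 mAP), showing strong domain adaptability.
Speed: RF-DETR runs at 6.0 ms/img on a T4 GPU; once post-processing is taken into account, it is as fast as competing models or faster.

Note: code and checkpoints are currently available for RF-DETR-large and RF-DETR-base.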
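Because the speed comparison depends on whether post-processing is counted, here is a minimal sketch of how end-to-end latency per image could be timed. The `fake_infer` and `fake_postprocess` callables are hypothetical stand-ins, not part of any library; swap in a real model call and its post-processing step to measure an actual pipeline.

```python
import statistics
import time


def measure_latency_ms(infer, postprocess, sample, warmup=3, runs=20):
    """Return the median end-to-end latency in milliseconds for one input."""
    # Warm-up runs are discarded so one-time setup costs do not skew the result
    for _ in range(warmup):
        postprocess(infer(sample))

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        postprocess(infer(sample))  # time inference and post-processing together
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Hypothetical stand-ins that only sleep; they do not run a real detector
def fake_infer(x):
    time.sleep(0.002)
    return x


def fake_postprocess(y):
    time.sleep(0.001)
    return y


if __name__ == "__main__":
    latency = measure_latency_ms(fake_infer, fake_postprocess, None)
    print(f"{latency:.1f} ms per image")
```

On a GPU, work is typically queued asynchronously, so a real measurement would also need to wait for the device to finish before reading the clock.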

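Total Latency Matters Too

NMS in YOLO: YOLO models use non-maximum suppression (NMS) to prune overlapping bounding boxes. This step slows inference slightly, especially when a frame contains many objects.

Total latency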

Source: Roboflow

No extra step for DETR: RF-DETR follows the DETR family's approach and avoids the extra NMS step otherwise needed to refine bounding boxes.
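To make the cost of that extra step concrete, the following is a minimal NumPy sketch of greedy NMS. It is an illustration of the general technique, not the implementation used by any particular YOLO release, and the boxes and scores are made up.

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; boxes are rows of (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    scores = np.asarray(scores, dtype=float)
    order = scores.argsort()[::-1]  # candidate indices, highest score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))

        # Intersection of the best box with every remaining candidate
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[best] + areas[rest] - inter)

        # Drop candidates that overlap the kept box too much
        order = rest[iou <= iou_threshold]
    return keep


# Made-up example: the first two boxes overlap heavily, the third is separate
boxes = [[10, 10, 110, 110], [12, 12, 112, 112], [200, 200, 300, 300]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The loop runs once per kept box and compares it against every remaining candidate, which is why crowded frames make this step more expensive.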

Latency vs. Accuracy on COCO

Source: Roboflow

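Horizontal axis (latency): measured in milliseconds (ms) per image on an NVIDIA T4 GPU using TensorRT 10 FP16. Lower latency means faster inference 🙂
Vertical axis (mAP @0.50:0.95): mean average precision on the Microsoft COCO benchmark, the standard measure of detection accuracy. Higher mAP means better performance.

In this chart, RF-DETR's accuracy is competitive with the YOLO models while its latency stays in the same range. RF-DETR crosses the 60 mAP threshold, making it the first documented real-time model to reach that level on COCO.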
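The "@0.50:0.95" in the accuracy metric refers to ten IoU thresholds, from 0.50 to 0.95 in steps of 0.05, over which average precision is averaged. The sketch below shows only the IoU part: how one predicted box fares against those thresholds. The boxes are invented for illustration, and a full COCO evaluation involves considerably more (per-class precision-recall curves, matching rules, and so on).

```python
import numpy as np

# The ten IoU thresholds 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = np.linspace(0.50, 0.95, 10)


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def thresholds_passed(pred, truth):
    """Count at how many of the ten thresholds the prediction counts as a match."""
    return int(np.sum(box_iou(pred, truth) >= IOU_THRESHOLDS))


truth = (0, 0, 100, 100)
pred = (0, 0, 100, 82)  # made-up prediction with IoU 0.82
print(box_iou(pred, truth))            # 0.82
print(thresholds_passed(pred, truth))  # 7 (matches at 0.50 through 0.80)
```

A box that is only roughly right still counts at the loose thresholds but not at the strict ones, which is why mAP @0.50:0.95 rewards tight localization more than mAP @0.50 does.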

Domain Adaptability on RF100-VL

Source: Roboflow

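Here RF-DETR stands out with the highest mAP on RF100-VL, which indicates strong adaptability across domains. RF-DETR is therefore not only competitive on COCO but also performs well on real-world datasets, where domain-specific objects and conditions can differ greatly from COCO's common objects.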

Potential Ranking of RF-DETR

Source: Leaderboard.roboflow.com

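Based on the performance metrics of the Roboflow leaderboard, RF-DETR is highly competitive in both accuracy and efficiency.

RF-DETR-Large (128M parameters) would rank first, ahead of all existing models, with an estimated mAP 50:95 above 60.5, making it the most accurate model on the leaderboard.
RF-DETR-Base (29M parameters) would rank around 4th, competing closely with models such as DEIM-D-FINE-X (61.7M parameters, 0.548 mAP 50:95) and D-FINE-X (61.6M parameters, 0.541 mAP 50:95). With fewer than half as many parameters, it still delivers strong accuracy.

This ranking further highlights RF-DETR's efficiency: it keeps a smaller model size than some competitors while delivering high performance at optimized latency.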
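As a rough way to read the efficiency claim, the figures quoted in this article can be turned into accuracy per million parameters. This is a crude back-of-the-envelope ratio, not a metric the leaderboard reports, and it assumes the 53.3 mAP quoted earlier for RF-DETR-Base on COCO is comparable to the leaderboard's mAP 50:95 values for the other two models.

```python
# (parameters in millions, mAP 50:95 in percent), as quoted in this article
MODELS = {
    "RF-DETR-Base": (29.0, 53.3),   # mAP from the COCO comparison; assumed comparable
    "DEIM-D-FINE-X": (61.7, 54.8),
    "D-FINE-X": (61.6, 54.1),
}


def map_per_million_params(params_m, map_pct):
    """Crude efficiency ratio: mAP points per million parameters."""
    return map_pct / params_m


for name, (params_m, map_pct) in MODELS.items():
    ratio = map_per_million_params(params_m, map_pct)
    print(f"{name:14s} {params_m:5.1f}M params  {map_pct:.1f} mAP  {ratio:.2f} mAP per M params")
```

By this ratio RF-DETR-Base comes out at roughly twice the value of the two larger models, although a ratio like this says nothing about latency, which depends on the architecture and not just on parameter count.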

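RF-DETR Architecture Overview

CNN-based YOLO models have long led real-time object detection. However, CNNs do not always benefit from large-scale pre-training, which is becoming increasingly important in machine learning.

Transformers excel at large-scale pre-training but are often too heavy or slow for real-time applications. Recent work, however, shows that once the post-processing overhead YOLO requires is taken into account, DETR-based models can match YOLO's speed.

RF-DETR architecture

Source: Deformable DETR paper (the architecture RF-DETR builds on)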

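RF-DETR's Hybrid Advantages

Pre-trained DINOv2 backbone: this helps the model transfer knowledge from large-scale image pre-training and improves performance in new or different domains. By combining LW-DETR with a pre-trained DINOv2 backbone, RF-DETR offers strong domain adaptability and gains significantly from pre-training.
Single-scale feature extraction: Deformable DETR uses multi-scale attention, whereas RF-DETR simplifies feature extraction to a single scale, balancing speed and performance.
Multi-resolution training: RF-DETR is trained at multiple resolutions, so you can trade inference speed against accuracy at run time without retraining the model.

For more information, read this research paper.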
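To see why input resolution is such a direct lever on speed, it helps to count the patch tokens a ViT-style backbone has to process. The sketch below assumes a patch size of 14, as in the DINOv2 ViT models; whether RF-DETR uses that patch size unchanged, and the three resolutions shown, are assumptions made for illustration.

```python
PATCH_SIZE = 14  # DINOv2 ViT patch size; its use here for RF-DETR is an assumption


def num_tokens(resolution, patch_size=PATCH_SIZE):
    """Number of patch tokens for a square input with the given side length."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be a multiple of the patch size")
    side = resolution // patch_size
    return side * side


baseline = num_tokens(560)
for res in (448, 560, 672):  # illustrative resolutions, not official presets
    tokens = num_tokens(res)
    print(f"{res}px -> {tokens} tokens ({tokens / baseline:.2f}x the 560px count)")
# 448px -> 1024 tokens (0.64x the 560px count)
# 560px -> 1600 tokens (1.00x the 560px count)
# 672px -> 2304 tokens (1.44x the 560px count)
```

Token count grows with the square of the side length, so a modest change in resolution moves the backbone's workload substantially; a model trained at several resolutions can exploit that at inference time.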

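How to Use RF-DETR

Task 1: Detecting Objects in an Image with RF-DETR

Install RF-DETR with pip (the leading ! runs the command from a notebook cell):

!pip install rfdetr

You can then load a pre-trained checkpoint (trained on COCO) for immediate use in your application: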

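import io

import requests
import supervision as sv
from PIL import Image
from rfdetr import RFDETRBase

# Load the base model with its pre-trained COCO checkpoint
model = RFDETRBase()

# Download the sample image and decode it with PIL
url = "https://media.roboflow.com/notebooks/examples/dog-2.jpeg"
response = requests.get(url, timeout=30)
response.raise_for_status()
image = Image.open(io.BytesIO(response.content))

# Run inference with a confidence threshold of 0.5
detections = model.predict(image, threshold=0.5)

# Draw boxes and labels on a copy so the original image stays untouched
annotated_image = image.copy()
annotated_image = sv.BoxAnnotator().annotate(annotated_image, detections)
annotated_image = sv.LabelAnnotator().annotate(annotated_image, detections)

sv.plot_image(annotated_image)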
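The label annotator in the snippet above is called without explicit label text. As a small sketch of one way to build readable labels, the helper below turns class IDs and confidences (the `class_id` and `confidence` arrays that a detections object exposes) into strings. The `CLASS_NAMES` mapping is a hypothetical two-entry example, not the checkpoint's real class list.

```python
def format_labels(class_ids, confidences, class_names):
    """Build one 'name score' string per detection."""
    labels = []
    for class_id, confidence in zip(class_ids, confidences):
        # Fall back to the numeric ID when the mapping has no entry for it
        name = class_names.get(int(class_id), f"class {int(class_id)}")
        labels.append(f"{name} {float(confidence):.2f}")
    return labels


# Hypothetical ID-to-name mapping; use the one that matches your checkpoint
CLASS_NAMES = {1: "person", 18: "dog"}

print(format_labels([18, 1, 99], [0.91, 0.68, 0.5], CLASS_NAMES))
# ['dog 0.91', 'person 0.68', 'class 99 0.50']
```

The resulting list can be passed as `labels=` to `sv.LabelAnnotator().annotate(...)`, with `detections.class_id` and `detections.confidence` as the first two arguments of the helper.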

Detecting objects in an image with RF-DETR

Source – Link

Task 2: Detecting Objects in a Video

The code is also in my GitHub repository, so feel free to run the model yourself 🙂. Just follow the instructions in README.md. GitHub link.

Code:

import json

import cv2
from rfdetr import RFDETRBase

# Load the model
model = RFDETRBase()

# Read classes.json, which maps class-ID strings to class names
with open('classes.json', 'r', encoding='utf-8') as file:
    class_names = json.load(file)

# Open the video file
cap = cv2.VideoCapture('walking.mp4')  # https://www.pexels.com/video/video-of-people-walking-855564/
# For live video streaming:
# cap = cv2.VideoCapture(0)  # 0 refers to the default camera
if not cap.isOpened():
    raise RuntimeError("Could not open the video source")

# Create the output video; 'mp4v' suits the .mp4 container ('XVID' is meant for .avi)
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter('output.mp4', fourcc, 20.0, (960, 540))

while True:
    # Read a frame
    ret, frame = cap.read()
    if not ret:
        break  # Exit the loop when the video ends

    # OpenCV frames are BGR; convert to RGB to match the PIL input of the image example
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

    # Perform object detection
    detections = model.predict(rgb_frame, threshold=0.5)

    # Mark the detected objects
    for i, box in enumerate(detections.xyxy):
        x1, y1, x2, y2 = map(int, box)
        class_id = int(detections.class_id[i])
        # Get the class name using class_id
        label = class_names.get(str(class_id), "Unknown")
        confidence = detections.confidence[i]

        # Draw a thick white bounding box
        color = (255, 255, 255)
        thickness = 7
        cv2.rectangle(frame, (x1, y1), (x2, y2), color, thickness)

        # Draw the label and confidence score in red (OpenCV colors are BGR)
        text = f"{label} ({confidence:.2f})"
        font = cv2.FONT_HERSHEY_SIMPLEX
        font_scale = 2
        font_thickness = 7
        text_size = cv2.getTextSize(text, font, font_scale, font_thickness)[0]
        text_x = x1
        # Keep the text inside the frame when the box touches the top edge
        text_y = max(y1 - 10, text_size[1])
        cv2.putText(frame, text, (text_x, text_y), font, font_scale, (0, 0, 255), font_thickness, cv2.LINE_AA)

    # Display the results
    resized_frame = cv2.resize(frame, (960, 540))
    cv2.imshow('Labeled Video', resized_frame)

    # Save the output
    out.write(resized_frame)

    # Exit when 'q' key is pressed
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# Release resources
cap.release()
out.release()  # Release the output video
cv2.destroyAllWindows()
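The video script reads a `classes.json` file and looks names up with `class_names.get(str(class_id), "Unknown")`, which implies a JSON object keyed by class-ID strings. The file itself is not shown in this article, so the three entries below are an illustrative guess at its shape; the real file in the repository may differ.

```python
import json
import tempfile
from pathlib import Path

# Illustrative subset only; the actual IDs and names depend on the checkpoint
EXAMPLE_CLASSES = {"1": "person", "2": "bicycle", "3": "car"}


def load_class_names(path):
    """Load a JSON object mapping class-ID strings to class names."""
    with open(path, "r", encoding="utf-8") as file:
        names = json.load(file)
    if not isinstance(names, dict):
        raise ValueError("classes file must contain a JSON object")
    return {str(key): str(value) for key, value in names.items()}


# Write the example to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classes.json"
    path.write_text(json.dumps(EXAMPLE_CLASSES, indent=2), encoding="utf-8")
    class_names = load_class_names(path)

print(class_names.get("1", "Unknown"))   # person
print(class_names.get("42", "Unknown"))  # Unknown
```

Keys are strings because JSON object keys always are, which is why the script converts `class_id` with `str()` before the lookup.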

Output: the video, annotated with bounding boxes, class labels, and confidence scores.

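Fine-Tuning on a Custom Dataset

Fine-tuning is where RF-DETR really shines, especially with specialized or small datasets:

Use COCO format: organize the dataset into train/, valid/, and test/ directories, each with its own _annotations.coco.json file.
Use Colab: the Roboflow team provides a detailed Colab notebook that walks you through training on your own dataset.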

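from rfdetr import RFDETRBase

# Start from the pre-trained base checkpoint
model = RFDETRBase()

model.train(
    dataset_dir="<DATASET_PATH>",  # root folder containing train/, valid/ and test/
    epochs=10,
    batch_size=4,
    grad_accum_steps=4,  # effective batch size: 4 * 4 = 16
    lr=1e-4,
)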
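Before starting a training run, it can save time to confirm that the dataset folder follows the layout described above. The helper below is a small standalone check written for this article (it is not part of the rfdetr package); it only verifies that each split has its `_annotations.coco.json` file and does not validate the file contents.

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")
ANNOTATION_FILE = "_annotations.coco.json"


def missing_annotation_files(dataset_dir):
    """Return the expected annotation paths that do not exist under dataset_dir."""
    root = Path(dataset_dir)
    expected = [root / split / ANNOTATION_FILE for split in SPLITS]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Replace "." with the dataset root you intend to pass as dataset_dir
    missing = missing_annotation_files(".")
    if missing:
        print("Missing annotation files:")
        for path in missing:
            print(f"  {path}")
    else:
        print("Dataset layout looks complete.")
```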

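During training, RF-DETR produces two sets of weights:

Regular weights: the standard model checkpoint.
EMA weights: an exponential moving average of the model's weights, which usually gives more stable performance.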
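For readers unfamiliar with EMA weights, the update rule is short enough to show in full. This is a generic sketch of exponential moving averaging with plain numbers standing in for weight tensors; the decay value and the weight sequence are made up, and RF-DETR's actual EMA settings are not stated in this article.

```python
def ema_update(ema_weights, weights, decay=0.99):
    """One EMA step: blend the running average with the latest weights."""
    return decay * ema_weights + (1.0 - decay) * weights


# Made-up scalar "weights" that jump around from step to step
weight_history = [1.0, 1.2, 0.8, 1.1]

ema = weight_history[0]  # start the average at the first value
for step, weights in enumerate(weight_history[1:], start=1):
    ema = ema_update(ema, weights, decay=0.5)
    print(f"step {step}: raw={weights:.3f}  ema={ema:.3f}")
```

The averaged value moves less from step to step than the raw one, which is the intuition behind EMA checkpoints often evaluating more steadily than the final raw weights.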

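How to Train RF-DETR on a Custom Dataset

As an example, the Roboflow team used a mahjong tile recognition dataset, which is part of the RF100-VL benchmark and contains more than 2,000 images. Their guide shows how to download the dataset, install the required tools, and fine-tune the model on custom data.

For more information, see the official blog post.

Mahjong tile recognition dataset

Source – Link

The results show the ground truth on one side and the model's detections on the other. In this example RF-DETR identifies most mahjong tiles correctly, with only a few minor detection errors that further training could reduce.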

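Important notes:

Instance segmentation: RF-DETR does not currently support instance segmentation, as Piotr Skalski, Roboflow's open-source lead, has pointed out.
Pose estimation: pose estimation support is also on the way.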

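Final Verdict and Potential Advantages over Other CV Models

RF-DETR is one of the best DETR-based real-time models, striking a good balance between accuracy, speed, and domain adaptability. If you need a Transformer-based real-time detector that avoids post-processing overhead and generalizes beyond COCO, it is a strong choice. In some applications, though, YOLOv8 still has the edge in raw speed.

Where RF-DETR outperforms other CV models:

Specialized domains and custom datasets: RF-DETR excels at domain adaptation (86.7 mAP on RF100-VL), which makes it well suited to areas where COCO-trained models struggle, such as medical imaging, industrial defect detection, and autonomous navigation.
Low-latency applications: because it needs no NMS, it can be faster than YOLO where post-processing adds overhead, for example in drone-based detection, video analytics, or robotics.
Future-proof Transformer-based technology: unlike CNN-based detectors (YOLO, Faster R-CNN), RF-DETR benefits from self-attention and large-scale pre-training (DINOv2 backbone), making it better suited to multi-object reasoning, occlusion handling, and generalization to unseen environments.
Edge AI and embedded devices: RF-DETR's inference time of 6.0 ms/img on a T4 GPU suggests it can be a strong candidate for real-time edge deployments where traditional DETR models are too slow.

Object detection on motorcycle dashcam footage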

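Kudos to the Roboflow ML team: Peter Robicheaux, James Gallagher, Joseph Nelson, and Isaac Robinson.

Peter Robicheaux, James Gallagher, Joseph Nelson, Isaac Robinson. (March 20, 2025). RF-DETR: A SOTA Real-Time Object Detection Model. Roboflow Blog: https://blog.roboflow.com/rf-detr/

Summary

Roboflow's RF-DETR represents a new generation of real-time object detection, combining high accuracy, domain adaptability, and low latency in a single model. Whether you are building a cutting-edge robotics system or deploying on resource-constrained edge devices, RF-DETR offers a versatile, future-proof solution.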
