CLC number: TP391.41
On-line Access: 2025-06-04
Received: 2024-10-29
Revision Accepted: 2025-02-09
Crosschecked: 2025-09-04
Zuyi WANG, Zhimeng ZHENG, Jun MENG, Li XU. End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2400960
End-to-end object detection using a query-selection encoder with hierarchical feature-aware attention

1 College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
2 Robotics Institute, Zhejiang University, Yuyao 315400, China

Abstract: End-to-end object detection methods have attracted wide attention in recent years because they require no hand-crafted components and simplify the detection pipeline. However, compared with traditional detectors, they suffer from slow training convergence and insufficient detection performance, which stems from limited positive-sample supervision during feature fusion and selection. To address this problem, we propose a query-selection encoder (QSE) for end-to-end object detectors that improves both training convergence speed and detection accuracy. QSE consists of multiple encoder layers, each followed by a lightweight network that progressively refines features in a cascaded manner, providing more sufficient positive-sample supervision for efficient training. In addition, each encoder layer incorporates hierarchical feature-aware attention (HFA), comprising intra-level and cross-level feature attention, to strengthen the interaction and fusion of features across levels. HFA effectively suppresses similar feature representations and highlights discriminative ones, thereby accelerating feature selection. The method can be flexibly applied to both CNN-based and Transformer-based detectors. Extensive experiments on the mainstream object detection benchmarks MS COCO, CrowdHuman, and PASCAL VOC show that both CNN-based and Transformer-based detectors equipped with QSE achieve better end-to-end detection performance within fewer training epochs.

Keywords:
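The full paper is not reproduced on this page, so the listing below is only a minimal sketch of the structure the abstract describes, assuming PyTorch. The class names HFALayer and QSE, the use of standard multi-head attention for the intra-level and cross-level steps, the linear score heads standing in for the "lightweight network" after each encoder layer, and all dimensions are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the QSE/HFA idea from the abstract (PyTorch).
# Names, dimensions, and module choices are assumptions for illustration.
import torch
import torch.nn as nn

class HFALayer(nn.Module):
    """One encoder layer with intra-level and cross-level attention."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.intra_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, levels):
        # levels: list of (B, N_l, C) token sets, one per feature level.
        all_tokens = torch.cat(levels, dim=1)  # pre-update tokens, all levels
        out = []
        for x in levels:
            # Intra-level attention: tokens attend within their own level.
            x = self.norm1(x + self.intra_attn(x, x, x)[0])
            # Cross-level attention: tokens attend to every level jointly.
            x = self.norm2(x + self.cross_attn(x, all_tokens, all_tokens)[0])
            x = self.norm3(x + self.ffn(x))
            out.append(x)
        return out

class QSE(nn.Module):
    """Cascade of HFA layers; a lightweight head after each layer scores
    tokens so intermediate layers also receive their own supervision."""
    def __init__(self, dim: int, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(HFALayer(dim) for _ in range(num_layers))
        self.heads = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_layers))

    def forward(self, levels):
        aux_scores = []  # per-layer objectness scores for auxiliary losses
        for layer, head in zip(self.layers, self.heads):
            levels = layer(levels)
            aux_scores.append([head(x) for x in levels])
        return levels, aux_scores

# Toy usage: three feature levels with different token counts.
feats = [torch.randn(2, n, 256) for n in (400, 100, 25)]
refined, scores = QSE(256)(feats)
print([f.shape for f in refined], scores[-1][0].shape)

In this reading, the per-layer score heads are what give every encoder layer its own supervision signal during training, which is the mechanism the abstract credits for faster convergence.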