CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2021-04-01
Shiliang Pu, Wei Zhao, Weijie Chen, Shicai Yang, Di Xie, Yunhe Pan. Unsupervised object detection with scene-adaptive concept learning[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(5): 638-651.
@article{Pu2021,
title="Unsupervised object detection with scene-adaptive concept learning",
author="Shiliang Pu and Wei Zhao and Weijie Chen and Shicai Yang and Di Xie and Yunhe Pan",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="22",
number="5",
pages="638--651",
year="2021",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.2000567"
}
%0 Journal Article
%T Unsupervised object detection with scene-adaptive concept learning
%A Shiliang Pu
%A Wei Zhao
%A Weijie Chen
%A Shicai Yang
%A Di Xie
%A Yunhe Pan
%J Frontiers of Information Technology & Electronic Engineering
%V 22
%N 5
%P 638-651
%@ 2095-9184
%D 2021
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2000567
TY - JOUR
T1 - Unsupervised object detection with scene-adaptive concept learning
A1 - Shiliang Pu
A1 - Wei Zhao
A1 - Weijie Chen
A1 - Shicai Yang
A1 - Di Xie
A1 - Yunhe Pan
JO - Frontiers of Information Technology & Electronic Engineering
VL - 22
IS - 5
SP - 638
EP - 651
SN - 2095-9184
Y1 - 2021
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2000567
ER -
Abstract: Object detection is one of the most active research directions in computer vision; it has made impressive progress in academia and has many valuable applications in industry. However, mainstream detection methods still have two shortcomings: (1) even a model that is well trained on large amounts of data generally cannot be used across different kinds of scenes; (2) once deployed, a model cannot evolve autonomously as unlabeled scene data accumulate. To address these problems, and inspired by visual knowledge theory, we propose a novel unsupervised video object detection algorithm with scene-adaptive evolution that reduces the impact of scene changes through the concept of object groups. First, we extract a large number of object proposals from unlabeled data using a pre-trained detection model. Second, we build a visual knowledge dictionary of object concepts by clustering the proposals, in which each cluster center represents an object prototype. Third, we examine the relations between different clusters and the object information of different groups, and propose a graph-based group information propagation strategy to determine the category of each object concept, which effectively distinguishes positive from negative proposals. With these pseudo labels, we fine-tune the pre-trained model. Experiments verify the effectiveness of the proposed method and show significant improvements.
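The pipeline outlined in the abstract (proposals, clustering into prototypes, graph-based propagation, pseudo labels) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not name the clustering method, the graph construction, or the decision rule, so plain k-means, a thresholded Gaussian affinity graph, the update rule in `propagate`, the 0.5 threshold, and all function and parameter names below are assumptions made for illustration only. The features and confidences are made-up stand-ins for the output of a pre-trained detector.

```python
import math

def kmeans(points, init_centers, iters=10):
    """Plain Lloyd's k-means (assumed clustering step); returns (centers, assignments)."""
    centers = [list(c) for c in init_centers]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each proposal feature to its nearest prototype.
        assign = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Recompute each prototype as the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assign) if a == k]
            if members:
                centers[k] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assign

def propagate(centers, init_scores, alpha=0.5, sigma=1.0, min_affinity=0.1, steps=20):
    """Smooth cluster scores over a prototype graph (assumed propagation rule)."""
    k = len(centers)
    w = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            if i == j:
                w[i][j] = 1.0  # self-loop keeps isolated clusters at their own score
            else:
                a = math.exp(-math.dist(centers[i], centers[j]) ** 2 / (2 * sigma ** 2))
                w[i][j] = a if a >= min_affinity else 0.0  # drop weak edges
    w = [[v / sum(row) for v in row] for row in w]  # row-normalise
    scores = list(init_scores)
    for _ in range(steps):
        scores = [alpha * sum(w[i][j] * scores[j] for j in range(k))
                  + (1 - alpha) * init_scores[i] for i in range(k)]
    return scores

def pseudo_labels(features, confidences, init_centers, threshold=0.5):
    """Label every proposal by the propagated score of its cluster."""
    centers, assign = kmeans(features, init_centers)
    init = []
    for c in range(len(centers)):
        member_conf = [s for s, a in zip(confidences, assign) if a == c]
        init.append(sum(member_conf) / len(member_conf) if member_conf else 0.0)
    scores = propagate(centers, init)
    return [scores[a] >= threshold for a in assign]

# Toy 2-D "proposal features" with detector confidences (illustrative values only).
features = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2),
            (1.0, 1.0), (1.2, 1.0), (1.0, 1.2),
            (8.0, 8.0), (8.2, 8.0), (8.0, 8.2)]
confidences = [0.9, 0.8, 0.3, 0.9, 0.9, 0.6, 0.1, 0.2, 0.7]
labels = pseudo_labels(features, confidences, [(0.0, 0.0), (1.0, 1.0), (8.0, 8.0)])
print(labels)
```

In this toy run the third proposal has a low detector confidence (0.3) but is labeled positive because its cluster scores high, while the last proposal (confidence 0.7) is labeled negative because it sits in a low-scoring, isolated cluster. That is the intended effect of deciding at the level of object concepts rather than individual proposals; the resulting labels would then serve as pseudo labels for fine-tuning.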