
CLC number: TP391.4
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2023-07-06
Huifen XIA, Yongzhao ZHAN, Honglin LIU, Xiaopeng REN. Enhancing action discrimination via category-specific frame clustering for weakly supervised temporal action localization[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2300024
Enhancing action discrimination via category-specific frame clustering for weakly supervised temporal action localization

1 School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang 212013, China
2 Engineering Research Center of Big Data Ubiquitous Perception and Intelligent Agricultural Applications, Zhenjiang 212013, China
3 Changzhou Vocational Institute of Mechatronic Technology, Changzhou 213164, China

Abstract: Temporal action localization aims to detect the start and end times of actions in untrimmed videos and to classify the action instances. As the number of action categories grows, existing weakly supervised methods that provide only video-level labels can no longer supply sufficient supervision, and single-frame annotation has therefore attracted interest. However, existing single-frame annotation methods model the annotated frames only from the perspective of video snippet sequences; they ignore the action discrimination of the annotated frames and do not fully exploit the correlation among frames of the same action category. Considering that annotated single frames within the same action category exhibit distinctive appearance features and clear action patterns, this paper proposes a novel weakly supervised temporal action localization method that enhances action discrimination via category-specific frame clustering. The method applies the K-means clustering algorithm to aggregate frames of the same action category, and uses the result as the feature representation of that category. Class activation scores are then obtained by computing the similarity between each frame and every action category. This category-specific single-frame modeling provides complementary guidance to the video-snippet-sequence modeling in the main branch. Accordingly, a convex-combination fusion mechanism is proposed for the annotated frames and their corresponding snippet sequences to enforce consistency of action discrimination, yielding more robust class activation sequences for accurate action classification and localization. Owing to this complementary, discrimination-enhancing guidance, the method outperforms existing single-frame-annotation-based action localization methods. Experiments on three datasets (THUMOS14, GTEA, and BEOID) show that the proposed method achieves higher detection performance than state-of-the-art methods.
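The pipeline the abstract describes — cluster each category's annotated frames into prototypes, score every frame by similarity to each category, then convexly fuse frame-level and snippet-level scores — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the choice of cosine similarity, and the fusion weight `lam` are all assumptions for the sketch.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Minimal K-means: cluster the annotated frames of one category
    # and return the k cluster centers (the category's prototypes).
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = np.linalg.norm(X[:, None] - centers[None], axis=2)  # (n, k)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def category_prototypes(frames_by_class, k=2):
    # frames_by_class: {class_id: (n_c, d) array of annotated-frame features}.
    # The cluster centers serve as each category's feature representation.
    return {c: kmeans(f, min(k, len(f))) for c, f in frames_by_class.items()}

def class_activation_scores(frame_feats, prototypes):
    # Cosine similarity between each frame and each category
    # (taking the nearest prototype per category) gives a (T, C) score map.
    def cos(a, b):
        a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
        b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
        return a @ b.T
    classes = sorted(prototypes)
    scores = np.stack(
        [cos(frame_feats, prototypes[c]).max(axis=1) for c in classes], axis=1
    )
    return classes, scores

def convex_fusion(frame_scores, snippet_scores, lam=0.5):
    # Convex combination of the frame-level scores and the main branch's
    # snippet-sequence scores; lam is a hyperparameter in [0, 1].
    assert 0.0 <= lam <= 1.0
    return lam * frame_scores + (1.0 - lam) * snippet_scores
```

The fused scores would then form the class activation sequence from which action proposals are thresholded and classified; the exact proposal-generation step is beyond this sketch.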

