CLC number: TP391.4
On-line Access: 2024-07-05
Received: 2023-01-13
Revision Accepted: 2024-07-05
Crosschecked: 2023-07-06
Huifen XIA, Yongzhao ZHAN, Honglin LIU, Xiaopeng REN. Enhancing action discrimination via category-specific frame clustering for weakly supervised temporal action localization[J]. Frontiers of Information Technology & Electronic Engineering, 2024, 25(6): 809-823.
@article{title="Enhancing action discrimination via category-specific frame clustering for weakly supervised temporal action localization",
author="Huifen XIA, Yongzhao ZHAN, Honglin LIU, Xiaopeng REN",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="25",
number="6",
pages="809-823",
year="2024",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2300024"
}
%0 Journal Article
%T Enhancing action discrimination via category-specific frame clustering for weakly supervised temporal action localization
%A Huifen XIA
%A Yongzhao ZHAN
%A Honglin LIU
%A Xiaopeng REN
%J Frontiers of Information Technology & Electronic Engineering
%V 25
%N 6
%P 809-823
%@ 2095-9184
%D 2024
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2300024
TY - JOUR
T1 - Enhancing action discrimination via category-specific frame clustering for weakly supervised temporal action localization
A1 - Huifen XIA
A1 - Yongzhao ZHAN
A1 - Honglin LIU
A1 - Xiaopeng REN
JO - Frontiers of Information Technology & Electronic Engineering
VL - 25
IS - 6
SP - 809
EP - 823
SN - 2095-9184
Y1 - 2024
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2300024
ER -
Abstract: Temporal action localization (TAL) is the task of detecting the start and end times of action instances in an untrimmed video and classifying them. As the number of action categories per video increases, existing weakly supervised temporal action localization (W-TAL) methods that use only video-level labels cannot provide sufficient supervision, so single-frame supervision has attracted the interest of researchers. Existing paradigms model single-frame annotations from the perspective of video snippet sequences, neglect the action discrimination of annotated frames, and pay insufficient attention to the correlations among annotated frames of the same category. Within a category, the annotated frames exhibit distinctive appearance characteristics or clear action patterns. Thus, a novel method that enhances action discrimination via category-specific frame clustering for W-TAL is proposed. Specifically, the K-means clustering algorithm is employed to aggregate the annotated discriminative frames of the same category, and the resulting cluster centers are regarded as exemplars that exhibit the characteristics of the action category. Then, the class activation scores are obtained by calculating the similarities between a frame and the exemplars of the various categories. This category-specific representation modelling provides complementary guidance to the snippet sequence modelling in the mainline. Accordingly, a convex combination fusion mechanism is presented for annotated frames and snippet sequences to enhance the consistency of action discrimination, which generates a robust class activation sequence for precise action classification and localization. Owing to the supplementary guidance that action discriminative enhancement provides for video snippet sequences, our method outperforms existing single-frame annotation-based methods. Experiments conducted on three datasets (THUMOS14, GTEA, and BEOID) show that our method achieves high localization performance compared with state-of-the-art methods.
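The pipeline outlined in the abstract (per-category K-means exemplars, similarity-based class activation scores, and convex combination fusion) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The toy two-dimensional features, the choice of two clusters per category, the use of cosine similarity with a maximum over exemplars, the fusion weight `alpha`, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns k centroids (lists of floats)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exemplar_scores(frame, exemplars_by_class):
    """Score a frame per class as its best cosine similarity to that class's exemplars.

    Assumption: max-over-exemplars cosine similarity stands in for the
    paper's (unspecified here) similarity measure.
    """
    return {c: max(cosine(frame, e) for e in ex) for c, ex in exemplars_by_class.items()}


def fuse(snippet_scores, frame_scores, alpha=0.5):
    """Convex combination of two per-class score dicts, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {c: alpha * snippet_scores[c] + (1 - alpha) * frame_scores[c] for c in snippet_scores}


# Toy annotated-frame features per category (made up for illustration).
annotated = {
    "run": [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0], [0.8, 0.1]],
    "jump": [[0.1, 1.0], [0.2, 0.9], [0.0, 1.1], [0.1, 0.8]],
}
# Cluster the annotated frames of each category; centroids act as exemplars.
exemplars = {c: kmeans(feats, k=2) for c, feats in annotated.items()}

frame = [0.95, 0.15]
frame_cas = exemplar_scores(frame, exemplars)
snippet_cas = {"run": 0.6, "jump": 0.4}  # hypothetical output of the snippet-sequence branch
fused = fuse(snippet_cas, frame_cas, alpha=0.5)
print(max(fused, key=fused.get))
```

In this toy setting the query frame lies close to the "run" exemplars, so the exemplar branch reinforces the snippet branch and the fused score favors "run". Applying the same per-frame fusion across a whole video would yield a fused class activation sequence. The weighting between the two branches is a free parameter in this sketch.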