CLC number: TP391
On-line Access: 2024-08-27
Received: 2023-10-17
Revision Accepted: 2024-05-08
Crosschecked: 2021-04-01
Ping Li, Chao Tang, Xianghua Xu. Video summarization with a graph convolutional attention network[J]. Frontiers of Information Technology & Electronic Engineering, 2021, 22(6): 902-913.
@article{Li2021GCAN,
title="Video summarization with a graph convolutional attention network",
author="Ping Li and Chao Tang and Xianghua Xu",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="22",
number="6",
pages="902-913",
year="2021",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2000429"
}
%0 Journal Article
%T Video summarization with a graph convolutional attention network
%A Ping Li
%A Chao Tang
%A Xianghua Xu
%J Frontiers of Information Technology & Electronic Engineering
%V 22
%N 6
%P 902-913
%@ 2095-9184
%D 2021
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.2000429
TY - JOUR
T1 - Video summarization with a graph convolutional attention network
A1 - Ping Li
A1 - Chao Tang
A1 - Xianghua Xu
JO - Frontiers of Information Technology & Electronic Engineering
VL - 22
IS - 6
SP - 902
EP - 913
SN - 2095-9184
Y1 - 2021
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2000429
ER -
Abstract: Video summarization has established itself as a fundamental technique for generating compact and concise videos, which eases the management and browsing of large-scale video data. Existing methods fail to fully consider the local and global relations among video frames, which degrades summarization performance. To address this problem, we propose a graph convolutional attention network (GCAN) for video summarization. GCAN consists of two parts, embedding learning and context fusion, where embedding learning includes a temporal branch and a graph branch. In particular, GCAN uses dilated temporal convolution to model local cues and temporal self-attention to exploit global cues for video frames. It learns graph embeddings via a multi-layer graph convolutional network to reveal the intrinsic structure of frame samples. The context fusion part combines the output streams from the temporal branch and the graph branch to create a context-aware representation of frames, on which importance scores are evaluated to select representative frames for the video summary. Experiments on two benchmark databases, SumMe and TVSum, show that the proposed GCAN approach outperforms several state-of-the-art alternatives in three evaluation settings.
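The building blocks named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the cosine-similarity frame graph, the fixed smoothing kernel, the random weights, the single graph layer, and the concatenate-then-sigmoid fusion are all assumptions chosen to keep the example short, and every function name is hypothetical. The graph layer follows the standard propagation rule of Kipf and Welling, ReLU(D^-1/2 (A + I) D^-1/2 X W).

```python
import numpy as np

def dilated_temporal_conv(features, kernel, dilation):
    """Dilated 1-D convolution along time with zero padding (local cues).
    `kernel` has odd length k and is shared across feature channels."""
    k = kernel.shape[0]
    pad = dilation * (k - 1) // 2
    padded = np.pad(features, ((pad, pad), (0, 0)))
    out = np.zeros_like(features)
    for j in range(k):
        start = j * dilation
        out += kernel[j] * padded[start:start + features.shape[0]]
    return out

def temporal_self_attention(features):
    """Scaled dot-product self-attention over all frames (global cues)."""
    scores = features @ features.T / np.sqrt(features.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ features

def gcn_layer(features, adjacency, weight):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0)

def score_frames(features, rng):
    """Fuse a temporal branch and a graph branch into per-frame scores."""
    _, dim = features.shape
    # Assumed graph: non-negative cosine similarity between frame features.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    adjacency = np.maximum(unit @ unit.T, 0.0)
    np.fill_diagonal(adjacency, 0.0)
    # Temporal branch: local smoothing followed by global attention.
    local = dilated_temporal_conv(features, np.array([0.25, 0.5, 0.25]), dilation=2)
    temporal = temporal_self_attention(local)
    # Graph branch: a single layer with random (untrained) weights.
    graph = gcn_layer(features, adjacency, rng.normal(scale=0.1, size=(dim, dim)))
    # Assumed fusion: concatenate both streams, then a linear map and sigmoid.
    fused = np.concatenate([temporal, graph], axis=1)
    logits = fused @ rng.normal(scale=0.1, size=(2 * dim,))
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 8))  # 12 frames with 8-dimensional features
scores = score_frames(frames, rng)
top_frames = np.argsort(scores)[::-1][:3]  # indices of the 3 highest-scoring frames
```

Because the weights here are random, the selected frames carry no meaning; in a trained model the scores would be learned from annotated summaries.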