CLC number: TP391
On-line Access: 2024-05-06
Received: 2022-12-12
Revision Accepted: 2024-05-06
Crosschecked: 2023-06-27
Yuanhong ZHONG, Qianfeng XU, Daidi ZHONG, Xun YANG, Shanshan WANG. FaSRnet: a feature and semantics refinement network for human pose estimation[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2200639
FaSRnet: a feature and semantics refinement network for human pose estimation
Affiliations:
1 School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China
2 School of Bioengineering, Chongqing University, Chongqing 400044, China
3 School of Information Science and Technology, University of Science and Technology of China, Hefei 230039, China
4 Institutes of Physical Science and Information Technology, Anhui University, Hefei 230039, China
Abstract: Multi-frame human pose estimation is a challenging task because of factors such as motion blur, video defocus, and occlusion. Exploiting the temporal consistency between consecutive frames is an effective way to address this problem. Currently, most methods exploit temporal consistency by refining the final heatmaps. Heatmaps carry the semantic information of keypoints, so such refinement can improve detection quality to a certain extent. However, heatmaps are generated from features, and few methods consider refinement at the feature level. This paper proposes a human pose estimation framework that performs refinement at both the feature and semantic levels. Auxiliary-frame features are aligned with the features of the current frame to reduce the loss caused by differing feature distributions, and an attention mechanism fuses the aligned auxiliary features with the current features. At the semantic level, the differences between adjacent heatmaps serve as auxiliary cues to refine the current heatmap. The effectiveness of the method is verified on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018.
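The two refinement ideas summarized in the abstract — attention-based fusion of auxiliary and current features, and using adjacent-heatmap differences to correct the current heatmap — can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's actual formulation: the per-location softmax weighting, the symmetric difference term, and the function names `fuse_features` and `refine_heatmaps` are all assumptions made here for clarity.

```python
import numpy as np

def fuse_features(current, aux):
    """Illustrative attention fusion: weight the current- and
    auxiliary-frame feature maps by a per-location softmax, then sum.
    Both inputs have shape (C, H, W)."""
    logits = np.stack([current, aux])                       # (2, C, H, W)
    w = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    return (w * logits).sum(axis=0)                         # (C, H, W)

def refine_heatmaps(current, prev_aux, next_aux, alpha=0.5):
    """Illustrative semantic-level refinement: the differences between
    adjacent-frame heatmaps and the current heatmap act as an auxiliary
    correction term (alpha is a hypothetical blending weight)."""
    diff = 0.5 * ((next_aux - current) + (prev_aux - current))
    refined = current + alpha * diff
    return np.clip(refined, 0.0, 1.0)                       # keep valid heatmap range
```

In the actual network, both steps would operate on learned feature maps and predicted keypoint heatmaps inside the training loop; the sketch only shows the data flow each refinement stage implies.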
Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou
310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000-2024 Journal of Zhejiang University-SCIENCE