
CLC number: TP391

On-line Access: 2024-05-06

Received: 2022-12-12

Revision Accepted: 2024-05-06

Crosschecked: 2023-06-27


 ORCID:

Yuanhong ZHONG

https://orcid.org/0000-0001-5689-1146


Frontiers of Information Technology & Electronic Engineering 

Accepted manuscript available online (unedited version)


FaSRnet: a feature and semantics refinement network for human pose estimation


Author(s):  Yuanhong ZHONG, Qianfeng XU, Daidi ZHONG, Xun YANG, Shanshan WANG

Affiliation(s):  School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China; more

Corresponding email(s):  zhongyh@cqu.edu.cn, daidi.zhong@cqu.edu.cn

Key Words:  Human pose estimation; Multi-frame refinement; Heatmap and offset estimation; Feature alignment; Multi-person



Yuanhong ZHONG, Qianfeng XU, Daidi ZHONG, Xun YANG, Shanshan WANG. FaSRnet: a feature and semantics refinement network for human pose estimation[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2200639

@article{FITEE.2200639,
title="FaSRnet: a feature and semantics refinement network for human pose estimation",
author="Yuanhong ZHONG, Qianfeng XU, Daidi ZHONG, Xun YANG, Shanshan WANG",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.2200639"
}

%0 Journal Article
%T FaSRnet: a feature and semantics refinement network for human pose estimation
%A Yuanhong ZHONG
%A Qianfeng XU
%A Daidi ZHONG
%A Xun YANG
%A Shanshan WANG
%J Frontiers of Information Technology & Electronic Engineering
%P 513-526
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R https://doi.org/10.1631/FITEE.2200639

TY - JOUR
T1 - FaSRnet: a feature and semantics refinement network for human pose estimation
A1 - Yuanhong ZHONG
A1 - Qianfeng XU
A1 - Daidi ZHONG
A1 - Xun YANG
A1 - Shanshan WANG
J0 - Frontiers of Information Technology & Electronic Engineering
SP - 513
EP - 526
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - https://doi.org/10.1631/FITEE.2200639
ER -


Abstract: 
Multi-frame human pose estimation is a challenging task due to factors such as motion blur, video defocus, and occlusion. Exploiting the temporal consistency between consecutive frames is an effective way to address this issue. Currently, most methods explore temporal consistency by refining the final heatmaps. Heatmaps contain the semantic information of keypoints and can improve detection quality to a certain extent; however, they are generated from features, and feature-level refinement is rarely considered. In this paper, we propose a human pose estimation framework that performs refinement at both the feature and semantic levels. We align auxiliary features with the features of the current frame to reduce the loss caused by differing feature distributions, and then fuse the aligned auxiliary features with the current features using an attention mechanism. At the semantic level, we use the differences between adjacent heatmaps as auxiliary cues to refine the current heatmaps. The method is validated on the large-scale benchmark datasets PoseTrack2017 and PoseTrack2018, and the results demonstrate the effectiveness of our method.
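The two refinement stages described in the abstract can be sketched in a few lines. The NumPy snippet below is a minimal illustration of the idea only, not the authors' network: the cosine-similarity attention weights, the function names, and the `alpha` blending factor are illustrative assumptions standing in for the learned modules of FaSRnet.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array of scores."""
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_features(current, auxiliary):
    """Attention-style feature fusion: weight each (already aligned)
    auxiliary feature map by its cosine similarity to the current frame's
    features, then add the weighted sum to the current features."""
    sims = np.array([
        (current * a).sum() / (np.linalg.norm(current) * np.linalg.norm(a) + 1e-8)
        for a in auxiliary
    ])
    weights = softmax(sims)
    return current + sum(w * a for w, a in zip(weights, auxiliary))

def refine_heatmap(current_hm, prev_hm, next_hm, alpha=0.5):
    """Semantic-level refinement: use the differences between the current
    heatmap and its temporal neighbors as an auxiliary correction term."""
    diff = ((current_hm - prev_hm) + (current_hm - next_hm)) / 2.0
    return current_hm + alpha * diff
```

When the adjacent heatmaps agree with the current one, the difference term vanishes and the heatmap is left unchanged; disagreement between frames produces a correction proportional to it.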

Author affiliations (translated from the Chinese version):

Yuanhong ZHONG 1, Qianfeng XU 1, Daidi ZHONG 2, Xun YANG 3, Shanshan WANG 4
1 School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China
2 School of Bioengineering, Chongqing University, Chongqing 400044, China
3 School of Information Science and Technology, University of Science and Technology of China, Hefei 230039, China
4 Institutes of Physical Science and Information Technology, Anhui University, Hefei 230039, China


Reference

[1]Andriluka M, Pishchulin L, Gehler P, et al., 2014. 2D human pose estimation: new benchmark and state of the art analysis. IEEE Conf on Computer Vision and Pattern Recognition, p.3686-3693.

[2]Andriluka M, Iqbal U, Insafutdinov E, et al., 2018. PoseTrack: a benchmark for human pose estimation and tracking. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.5167-5176.

[3]Bertasius G, Feichtenhofer C, Tran D, et al., 2019. Learning temporal pose estimation from sparsely-labeled videos. Proc 33rd Int Conf on Neural Information Processing Systems, p.3027-3038.

[4]Cai YH, Wang ZC, Luo ZX, et al., 2020. Learning delicate local representations for multi-person pose estimation. 16th European Conf on Computer Vision, p.455-472.

[5]Cao Z, Hidalgo G, Simon T, et al., 2021. OpenPose: realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans Patt Anal Mach Intell, 43(1):172-186.

[6]Chu X, Yang W, Ouyang WL, et al., 2017. Multi-context attention for human pose estimation. IEEE Conf on Computer Vision and Pattern Recognition, p.5669-5678.

[7]Dang YH, Yin JQ, Zhang SJ, et al., 2022a. Learning human kinematics by modeling temporal correlations between joints for video-based human pose estimation.

[8]Dang YH, Yin JQ, Zhang SJ, 2022b. Relation-based associative joint location for human pose estimation in videos. IEEE Trans Image Process, 31:3973-3986.

[9]Doering A, Iqbal U, Gall J, 2018. Joint flow: temporal flow fields for multi person tracking.

[10]Fang HS, Xie SQ, Tai YW, et al., 2017. RMPE: regional multi-person pose estimation. IEEE Int Conf on Computer Vision, p.2353-2362.

[11]Fang HS, Li JF, Tang HY, et al., 2023. AlphaPose: whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans Patt Anal Mach Intell, 45(6):7157-7173.

[12]Fang ZJ, López AM, 2020. Intention recognition of pedestrians and cyclists by 2D pose estimation. IEEE Trans Intell Transp Syst, 21(11):4773-4783.

[13]Girdhar R, Gkioxari G, Torresani L, et al., 2018. Detect-and-track: efficient pose estimation in videos. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.350-359.

[14]Guo HK, Tang T, Luo GZ, et al., 2019. Multi-domain pose network for multi-person pose estimation and tracking. European Conf on Computer Vision, p.209-216.

[15]Hwang J, Lee J, Park S, et al., 2019. Pose estimator and tracker using temporal flow maps for limbs. Int Joint Conf on Neural Networks, p.1-8.

[16]Insafutdinov E, Andriluka M, Pishchulin L, et al., 2017. ArtTrack: articulated multi-person tracking in the wild. Conf on Computer Vision and Pattern Recognition, p.1293-1301.

[17]Iqbal U, Milan A, Gall J, 2017. PoseTrack: joint multi-person pose estimation and tracking. IEEE Conf on Computer Vision and Pattern Recognition, p.4654-4663.

[18]Jin S, Liu WT, Ouyang WL, et al., 2019. Multi-person articulated tracking with spatial and temporal embeddings. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.5657-5666.

[19]Jin S, Liu WT, Xie EZ, et al., 2020. Differentiable hierarchical graph grouping for multi-person pose estimation. 16th European Conf on Computer Vision, p.718-734.

[20]Li DW, Chen XT, Zhang Z, et al., 2018. Pose guided deep model for pedestrian attribute recognition in surveillance scenarios. IEEE Int Conf on Multimedia and Expo, p.1-6.

[21]Lin TY, Maire M, Belongie S, et al., 2014. Microsoft COCO: common objects in context. 13th European Conf on Computer Vision, p.740-755.

[22]Liu ZG, Wu S, Jin SY, et al., 2019. Towards natural and accurate future motion prediction of humans and animals. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.9996-10004.

[23]Liu ZG, Chen HM, Feng RY, et al., 2021. Deep dual consecutive network for human pose estimation. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.525-534.

[24]Liu ZG, Feng RY, Chen HM, et al., 2022. Temporal feature alignment and mutual information maximization for video-based human pose estimation. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.10996-11006.

[25]Luo Y, Ren J, Wang ZX, et al., 2018. LSTM pose machines. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.5207-5215.

[26]Martinez J, Hossain R, Romero J, et al., 2017. A simple yet effective baseline for 3D human pose estimation. IEEE Int Conf on Computer Vision, p.2659-2668.

[27]Pfister T, Charles J, Zisserman A, 2015. Flowing ConvNets for human pose estimation in videos. IEEE Int Conf on Computer Vision, p.1913-1921.

[28]Sapp B, Taskar B, 2013. MODEC: multimodal decomposable models for human pose estimation. IEEE Conf on Computer Vision and Pattern Recognition, p.3674-3681.

[29]Shao ZP, Zhou W, Wang WZ, et al., 2023. A temporal densely connected recurrent network for event-based human pose estimation.

[30]Snower M, Kadav A, Lai F, et al., 2020. 15 keypoints is all you need. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.6737-6747.

[31]Song J, Wang LM, van Gool L, et al., 2017. Thin-slicing network: a deep structured model for pose estimation in videos. IEEE Conf on Computer Vision and Pattern Recognition, p.5563-5572.

[32]Sun K, Xiao B, Liu D, et al., 2019. Deep high-resolution representation learning for human pose estimation. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.5686-5696.

[33]Tian YP, Zhang YL, Fu Y, et al., 2020. TDAN: temporally-deformable alignment network for video super-resolution. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.3357-3366.

[34]Wang J, Long X, Gao Y, et al., 2020. Graph-PCNN: two stage human pose estimation with graph pose refinement. 16th European Conf on Computer Vision, p.492-508.

[35]Wang M, Hong RC, Yuan XT, et al., 2012. Movie2Comics: towards a lively video content presentation. IEEE Trans Multim, 14(3):858-870.

[36]Wang MC, Tighe J, Modolo D, 2020. Combining detection and tracking for human pose estimation in videos. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.11085-11093.

[37]Wang XL, Girshick R, Gupta A, et al., 2018. Non-local neural networks. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.7794-7803.

[38]Wang XT, Chan KCK, Yu K, et al., 2019. EDVR: video restoration with enhanced deformable convolutional networks. IEEE/CVF Conf on Computer Vision and Pattern Recognition Workshops, p.1954-1963.

[39]Weinzaepfel P, Revaud J, Harchaoui Z, et al., 2013. DeepFlow: large displacement optical flow with deep matching. IEEE Int Conf on Computer Vision, p.1385-1392.

[40]Xiao B, Wu HP, Wei YC, 2018. Simple baselines for human pose estimation and tracking. 15th European Conf on Computer Vision, p.472-487.

[41]Xiu YL, Li JF, Wang HY, et al., 2018. Pose flow: efficient online pose tracking.

[42]Yang X, Wang M, Hong RC, et al., 2017. Enhancing person re-identification in a self-trained subspace. ACM Trans Multim Comput Commun Appl, 13(3):27.

[43]Yang X, Wang M, Tao DC, 2018. Person re-identification with metric learning using privileged information. IEEE Trans Image Process, 27(2):791-805.

[44]Yang YD, Ren Z, Li HX, et al., 2021. Learning dynamics via graph neural networks for human pose estimation and tracking. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.8070-8080.

[45]Yu F, Koltun V, 2016. Multi-scale context aggregation by dilated convolutions.

[46]Zhang JB, Zhu Z, Zou W, et al., 2019. FastPose: towards real-time pose estimation and tracking via scale-normalized multi-task networks.

[47]Zheng W, Li L, Zhang ZX, et al., 2019. Relational network for skeleton-based action recognition. IEEE Int Conf on Multimedia and Expo, p.826-831.

[48]Zhu XZ, Hu H, Lin S, et al., 2019. Deformable ConvNets V2: more deformable, better results. IEEE/CVF Conf on Computer Vision and Pattern Recognition, p.9300-9308.


Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2024 Journal of Zhejiang University-SCIENCE