CLC number: TP391.4
On-line Access: 2022-05-19
Received: 2020-12-04
Revision Accepted: 2022-05-19
Crosschecked: 2021-05-26
Wanpeng XU, Ling ZOU, Lingda WU, Yue QI, Zhaoyong QIAN. Depth estimation using an improved stereo network[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2000676
Depth estimation using an improved stereo network

1 Science and Technology on Complex Electronic System Simulation Laboratory, Space Engineering University, Beijing 101416, China
2 Peng Cheng Laboratory, Shenzhen 518055, China
3 School of Digital Media, Beijing Film Academy, Beijing 100088, China

Abstract: Self-supervised depth estimation methods, which exploit view synthesis between target and reference images in the training data, have achieved results comparable to those of fully supervised methods. However, ResNet, the usual backbone network, was originally designed for classification and has structural shortcomings when applied to downstream tasks; low-texture regions in images also severely degrade depth estimation. To address these problems, we propose a series of improvements for more effective depth prediction. First, we improve the network structure to promote information flow through the network and strengthen its ability to learn spatial structure. Second, a binary mask is used to remove pixels in low-texture regions between the target and reference images, so that images can be reconstructed more accurately. Finally, we augment the dataset by randomly swapping the target and reference images as input, and pretrain on ImageNet so that the model acquires a good general feature representation. Using stereo image pairs as input, we achieve state-of-the-art performance on the Eigen split of the KITTI autonomous driving dataset.
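The abstract does not spell out how the binary mask is computed; one plausible minimal sketch, assuming low texture is detected by thresholding the local image-gradient magnitude (the threshold value and helper names here are illustrative, not from the paper), is:

```python
import numpy as np

def low_texture_mask(img, threshold=0.01):
    """Binary mask that is True where the image has enough local texture.

    img: (H, W) grayscale image with values in [0, 1].
    A pixel is kept when its gradient magnitude exceeds `threshold`
    (an illustrative cutoff, not the paper's actual rule).
    """
    gy, gx = np.gradient(img)               # finite-difference gradients
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    return grad_mag > threshold

def masked_photometric_loss(target, reconstructed, mask):
    """Mean absolute reconstruction error over textured pixels only."""
    err = np.abs(target - reconstructed)
    return err[mask].mean() if mask.any() else 0.0

# A reconstruction error inside a flat (low-texture) region is
# excluded from the loss, so it cannot mislead training.
target = np.zeros((8, 8))
target[:, 4:] = 1.0          # a vertical edge supplies texture
recon = target.copy()
recon[0, 0] = 0.5            # error in a textureless corner
mask = low_texture_mask(target)
loss = masked_photometric_loss(target, recon, mask)
```

In a full view-synthesis pipeline the mask would multiply the per-pixel photometric error between the target image and the image warped from the reference view, before spatial averaging.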