CLC number: TP39
On-line Access: 2020-12-10
Received: 2019-12-14
Revision Accepted: 2020-02-21
Crosschecked: 2020-07-22
Saqib Mamoon, Muhammad Arslan Manzoor, Fa-en Zhang, Zakir Ali, Jian-feng Lu. SPSSNet: a real-time network for image semantic segmentation[J]. Frontiers of Information Technology & Electronic Engineering, 2020, 21(12): 1770-1782.
@article{Mamoon2020SPSSNet,
title="SPSSNet: a real-time network for image semantic segmentation",
author="Saqib Mamoon and Muhammad Arslan Manzoor and Fa-en Zhang and Zakir Ali and Jian-feng Lu",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="21",
number="12",
pages="1770-1782",
year="2020",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.1900697"
}
%0 Journal Article
%T SPSSNet: a real-time network for image semantic segmentation
%A Saqib Mamoon
%A Muhammad Arslan Manzoor
%A Fa-en Zhang
%A Zakir Ali
%A Jian-feng Lu
%J Frontiers of Information Technology & Electronic Engineering
%V 21
%N 12
%P 1770-1782
%@ 2095-9184
%D 2020
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.1900697
TY  - JOUR
T1  - SPSSNet: a real-time network for image semantic segmentation
A1  - Saqib Mamoon
A1  - Muhammad Arslan Manzoor
A1  - Fa-en Zhang
A1  - Zakir Ali
A1  - Jian-feng Lu
JO  - Frontiers of Information Technology & Electronic Engineering
VL  - 21
IS  - 12
SP  - 1770
EP  - 1782
SN  - 2095-9184
Y1  - 2020
PB  - Zhejiang University Press & Springer
DO  - 10.1631/FITEE.1900697
ER  -
Abstract: Although deep neural networks (DNNs) have achieved great success in semantic segmentation, deploying them in real-time applications remains challenging. A large number of feature channels, parameters, and floating-point operations makes a network slow and computationally heavy, which is undesirable for real-time tasks such as robotics and autonomous driving. Most approaches sacrifice spatial resolution to reach real-time inference speed, resulting in poor segmentation accuracy. In this paper, we propose a light-weight stage-pooling semantic segmentation network (SPSSN), which efficiently reuses the most important features from early layers at multiple stages and at different spatial resolutions. SPSSN takes full-resolution input of 2048×1024 pixels, uses only 1.42×10^6 parameters, achieves 69.4% mean intersection-over-union (mIoU) without pre-training, and runs at 59 frames/s on the Cityscapes dataset. Owing to its light-weight architecture, SPSSN can run directly on mobile devices in real time. To demonstrate the effectiveness of the proposed network, we compare our results with those of state-of-the-art networks.
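The two headline figures in the abstract, mIoU and frames per second, can be made concrete with a minimal sketch. It is not the authors' evaluation code: the function names `mean_iou` and `frame_budget_ms` are hypothetical, labels are assumed to be flat sequences of integer class IDs, and the ignore label 255 is an assumption. Benchmarks typically accumulate the per-class counts over the whole evaluation set rather than per image.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union for two flat sequences of class IDs.

    Pixels whose target equals ignore_index (an assumed void label) are skipped.
    Classes that never occur in pred or target are left out of the mean.
    """
    intersection = [0] * num_classes  # true positives per class
    union = [0] * num_classes         # TP + FP + FN per class
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 255]
    print(f"mIoU: {mean_iou(pred, target, num_classes=3):.4f}")
    print(f"Budget at 59 frames/s: {frame_budget_ms(59):.2f} ms per frame")
```

In the toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is about 0.72. At the reported 59 frames/s, a network has roughly 17 ms to process each 2048×1024 frame.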