CLC number: TP306

On-line Access: 2021-08-17

Received: 2020-04-06

Revision Accepted: 2020-07-02

Crosschecked: 2021-05-07

 ORCID:

Yining Qi

https://orcid.org/0000-0001-6817-4443

Peng Cheng

https://orcid.org/0000-0002-4221-2162

Frontiers of Information Technology & Electronic Engineering 

Accepted manuscript available online (unedited version)


A survey of cloud network fault diagnostic systems and tools


Author(s):  Yining Qi, Chongrong Fang, Haoyu Liu, Daxiang Kang, Biao Lyu, Peng Cheng, Jiming Chen

Affiliation(s):  State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, China; Alibaba Group, Hangzhou 310024, China

Corresponding email(s):  qyning710@gmail.com, chongrongfang.zju@gmail.com, haoyu_liu@zju.edu.cn, daxiang.kdx@alibaba-inc.com, lubiao.lb@alibaba-inc.com, pcheng@iipc.zju.edu.cn, cjm@zju.edu.cn

Key Words:  Cloud network, Network diagnostics, Network anomaly, Network monitoring


Yining Qi, Chongrong Fang, Haoyu Liu, Daxiang Kang, Biao Lyu, Peng Cheng, Jiming Chen. A survey of cloud network fault diagnostic systems and tools[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.2000153

@article{FITEE.2000153,
title="A survey of cloud network fault diagnostic systems and tools",
author="Yining Qi and Chongrong Fang and Haoyu Liu and Daxiang Kang and Biao Lyu and Peng Cheng and Jiming Chen",
journal="Frontiers of Information Technology \& Electronic Engineering",
year="in press",
publisher="Zhejiang University Press \& Springer",
doi="10.1631/FITEE.2000153"
}

%0 Journal Article
%T A survey of cloud network fault diagnostic systems and tools
%A Yining Qi
%A Chongrong Fang
%A Haoyu Liu
%A Daxiang Kang
%A Biao Lyu
%A Peng Cheng
%A Jiming Chen
%J Frontiers of Information Technology & Electronic Engineering
%P 1031-1045
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2000153

TY - JOUR
T1 - A survey of cloud network fault diagnostic systems and tools
A1 - Yining Qi
A1 - Chongrong Fang
A1 - Haoyu Liu
A1 - Daxiang Kang
A1 - Biao Lyu
A1 - Peng Cheng
A1 - Jiming Chen
JO - Frontiers of Information Technology & Electronic Engineering
SP - 1031
EP - 1045
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2000153
ER -


Abstract: 
Cloud computing has become a vital part of the infrastructure that supports everyday life and industrial production. However, as cloud networks grow more complex, failures occur constantly and cause huge economic losses. To guarantee cloud network performance and prevent the severe effects of failures, cloud network diagnostics has therefore become of great interest to cloud service providers. Because of the characteristics of the cloud network (e.g., virtualization and multi-tenancy), porting traditional network diagnostic tools to the cloud network faces several difficulties. Additionally, many existing tools cannot solve the problems specific to the cloud network. In this paper, we summarize the state-of-the-art cloud diagnostic technologies that can be used in production cloud networks and classify them according to their features. Moreover, we analyze the differences between cloud network diagnostics and traditional network diagnostics based on the characteristics of the cloud network. Considering the operational requirements of the cloud network, we propose the points that deserve attention when designing a cloud network diagnostic tool. We also discuss the challenges that cloud network diagnostics will face in future development.

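A survey of cloud network fault diagnostic systems and tools

Yining Qi1, Chongrong Fang1, Haoyu Liu1, Daxiang Kang2, Biao Lyu2, Peng Cheng1, Jiming Chen1
1State Key Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou 310027, China
2Alibaba Group, Hangzhou 310024, China
Abstract: In recent years, cloud networks have become a key piece of infrastructure supporting everyday life and production. However, as cloud networks grow more complex, network faults occur more and more easily and cause huge economic losses. Therefore, to safeguard cloud network performance and prevent faults from causing severe effects, cloud network fault diagnosis has become one of the key research technologies for cloud service providers. Owing to the characteristics of cloud networks (such as virtualization and multi-tenancy), porting traditional network diagnostic tools to cloud networks faces many difficulties. In addition, many existing tools cannot solve the problems unique to cloud networks. This paper summarizes the state-of-the-art cloud network fault diagnostic systems and tools proposed in recent years that can be used in cloud network production environments, and classifies them by their features. It also analyzes, based on the characteristics of cloud networks, how cloud network fault diagnosis differs from traditional network fault diagnosis. Considering the practical production requirements of cloud networks, it proposes the points to note when designing a cloud network fault diagnostic tool. Finally, it discusses the opportunities and challenges that cloud network fault diagnosis faces in future development.

Key words: Cloud network; Network diagnostics; Network anomaly; Network monitoring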

Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2024 Journal of Zhejiang University-SCIENCE