CLC number: TP311
On-line Access: 2018-01-11
Received: 2016-03-13
Revision Accepted: 2016-06-24
Crosschecked: 2017-11-22
Cited: 0
Clicked: 5967
Ming-hao Hu, Chang-jian Wang, Yu-xing Peng. Meeting deadlines for approximation processing in MapReduce environments[J]. Frontiers of Information Technology & Electronic Engineering, 2017, 18(11): 1754-1772.
@article{title="Meeting deadlines for approximation processing in MapReduce environments",
author="Ming-hao Hu, Chang-jian Wang, Yu-xing Peng",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="18",
number="11",
pages="1754-1772",
year="2017",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1601056"
}
%0 Journal Article
%T Meeting deadlines for approximation processing in MapReduce environments
%A Ming-hao Hu
%A Chang-jian Wang
%A Yu-xing Peng
%J Frontiers of Information Technology & Electronic Engineering
%V 18
%N 11
%P 1754-1772
%@ 2095-9184
%D 2017
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1601056
TY - JOUR
T1 - Meeting deadlines for approximation processing in MapReduce environments
A1 - Ming-hao Hu
A1 - Chang-jian Wang
A1 - Yu-xing Peng
J0 - Frontiers of Information Technology & Electronic Engineering
VL - 18
IS - 11
SP - 1754
EP - 1772
%@ 2095-9184
Y1 - 2017
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/FITEE.1601056
Abstract: To provide timely results for big data analytics, it is crucial to satisfy deadline requirements for mapReduce jobs in today’s production environments. Much effort has been devoted to the problem of meeting deadlines, and typically there exist two kinds of solutions. The first is to allocate appropriate resources to complete the entire job before the specified time limit, where missed deadlines result because of tight deadline constraints or lack of resources; the second is to run a pre-constructed sample based on deadline constraints, which can satisfy the time requirement but fail to maximize the volumes of processed data. In this paper, we propose a deadline-oriented task scheduling approach, named ‘’Dart’, to address the above problem. Given a specified deadline and restricted resources, Dart uses an iterative estimation method, which is based on both historical data and job running status to precisely estimate the real-time job completion time. Based on the estimated time, Dart uses an approach–revise algorithm to make dynamic scheduling decisions for meeting deadlines while maximizing the amount of processed data and mitigating stragglers. Dart also efficiently handles task failures and data skew, protecting its performance from being harmed. We have validated our approach using workloads from OpenCloud and Facebook on a cluster of 64 virtual machines. The results show that Dart can not only effectively meet the deadline but also process near-maximum volumes of data even with tight deadlines and limited resources.
[1]Acharya, S., Gibbons, P., Poosala, V., 1999. Aqua: a fast decision support system using hboxapproximate query hboxanswers. Proc. 25th Int. Conf. on Very Large Data Bases, p.754-757.
[2]Agarwal, S., Mozafari, B., Panda, A., et al., 2013. Blinkdb: queries with bounded errors and bounded response times on very large data. Proc. 8th ACM European Conf. on Computer Systems, p.29-42.
[3]Ananthanarayanan, G., Kandula, S., Greenberg, A.G., et al., 2010. Reining in the outliers in Map-Reduce clusters using Mantri. Proc. 10th USENIX Symp. on Operating Systems Design and Implementation, p.24-38.
[4]Ananthanarayanan, G., Ghodsi, A., Shenker, S., et al., 2013. Effective straggler mitigation: attack of the clones. Proc. 10th USENIX Symp. on Networked Systems Design and Implementation, p.185-198.
[5]Ananthanarayanan, G., Hung, M.C.C., Ren, X., et al., 2014. Grass: trimming stragglers in approximation analytics. Proc. 11th USENIX Symp. on Networked Systems Design and Implementation, p.289-302.
[6]Apache, 2016. The Apache Hadoop Project. http://hadoop.apache.org/
[7]Bates, D.M., Watts, D.G., 1988. Nonlinear regression inference using the linear approximation. In: Jantsch, E., Waddington, C. (Eds.), Nonlinear Regression: Iterative Estimation and Linear Approximations. Wiley Online Library, p.142-167.
[8]Bell Laboratories, 2001. Approximate Query Processing: Taming the Terabytes. http://www.vldb.org/conf/2001/tut4.pdf
[9]Chen, Y., Ganapathi, A., Griggith, R., et al., 2011. The case for evaluating MapReduce performance using workload suites. Proc. IEEE 19th Int. Symp. on Modeling, Analysis ’ Simulation of Computer and Telecommunication Systems.
[10]Chen, Y., Alspaugh, S., Katz, R., 2012. Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads. Proc. VLDB Endow., 5(12):1802-1813.
[11]Chowdhury, M., Zaharia, M., Ma, J., et al., 2011. Managing data transfers in computer clusters with orchestra. SIGCOMM Comput. Commun. Rev., 41(4):98-109.
[12]Chowdhury, M., Zhong, Y., Stoica, I., 2014. Efficient coflow scheduling with varys. SIGCOMM Comput. Commun. Rev., 44(4):443-454.
[13]Cloudera, 2013. Statistical Workload Injector for MapReduce. https://github.com/SWIMProjectUCB/SWIM
[14]Dean, J., Ghemawat, S., 2008. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113.
[15]Ferguson, A.D., Bodik, P., Kandula, S., 2012. Jockey: guaranteed job latency in data parallel clusters. Proc. 7th ACM European Conf. on Computer Systems, p.99-112.
[16]Herodotou, H., Lim, H., Luo, G., 2011. Starfish: a self-tuning system for big data analytics. Proc. 7th Biennial Conf. on Innovative Data Systems Research, p.261-272.
[17]Hu, M., Wang, C., You, P., et al., 2015. Deadline-oriented task scheduling for mapreduce environments. LNCS, 9529:359-372.
[18]Kc, K., Anyanwu, K., 2010. Scheduling Hadoop jobs to meet deadlines. IEEE 2nd Int. Conf. on Cloud Computing Technology and Science, p.388-392.
[19]Li, S., Hu, S., Wang, S., et al., 2014. Woha: deadline-aware Map-Reduce workflow scheduling framework over Hadoop clusters. IEEE 34th Int. Conf. on Distributed Computing Systems, p.93-103.
[20]Liu, J., Shih, K., Lin, W., et al., 1994. Imprecise computations. Proc. IEEE, 82:83-94.
[21]Lohr, S., 2009. Simple probability samples. In: Sampling: Design and Analysis. Addison-Wesley, London, p.35-67.
[22]Marquardt, D.W., 1963. An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Ind. Appl. Math., 11(2):431-441.
[23]Morton, K., Balazinska, M., Grossman, D., 2010a. ParaTimer: a progress indicator for MapReduce dags. Proc. ACM SIGMOD Int. Conf. on Management of Data, p.507-518.
[24]Morton, K., Friesen, A., Balazinska, M., et al., 2010b. Estimating the progress of MapReduce pipelines. Proc. IEEE 26th Int. Conf. on Data Engineering, p.681-684.
[25]Motulsky, H.J., Ransnas, L.A., 1987. Fitting curves to data using nonlinear regression: a practical and nonmathematical review. FASEB J., 1(5):365-374.
[26]OREILLY, 2013. Interactive Big Data Analysis Using Approximate Answers. https://tinyurl.com/k5favda
[27]Polo, J., Carrera, D., Becerra, Y., et al., 2010. Performance-driven task co-scheduling for MapReduce environments. Proc. IEEE Int. Congress on Network Operations and Management Symp., p.373-380.
[28]Ren, K., Kwon, Y., Balazinska, M., et al., 2013. Hadoop’s adolescence: an analysis of Hadoop usage in scientific workloads. Proc. VLDB Endow., 6(10):853-864.
[29]Vavilapalli, V.K., Murthy, A.C., Douglas, C., et al., 2013. Apache Hadoop Yarn: yet another resource negotiator. Proc. 4th Annual Symp. on Cloud Computing, p.5:1-5:16.
[30]Venkataraman, S., Panda, A., Ananthanarayanan, G., et al., 2007. The power of choice in data-aware cluster scheduling. Proc. 11th USENIX Symp. on Operating Systems Design and Implementation, p.301-316.
[31]Verma, A., Cherkasova, L., Campbell, R.H., 2011. Aria: automatic resource inference and allocation for MapReduce environments. Proc. 8th ACM Int. Conf. on Autonomic Computing, p.235-244.
[32]Verma, A., Cherkasova, L., Kumar, V.S., et al., 2012. Deadline-based workload management for MapReduce environments: pieces of the performance puzzle. Proc. IEEE Int. Congress on Network Operations and Management Symp., p.900-905.
[33]Wang, C., Peng, Y., Tang, M., et al., 2014. MapCheckReduce: an improved MapReduce computing model for imprecise applications. Proc. IEEE Int. Congress on Big Data, p.366-373.
[34]Wang, X., Shen, D., Bai, M., et al., 2015. SAMES: deadline-constraint scheduling in MapReduce. Front. Comput. Sci., 9(1):128-141.
[35]Zacheilas, N., Kalogeraki, V., 2014. Real-time scheduling of skewed MapReduce jobs in heterogeneous environments. Proc. 11th Int. Conf. on Autonomic Computing, p.189-200.
[36]Zaharia, M., Konwinski, A., Joseph, A.D., et al., 2008. Improving MapReduce performance in heterogeneous environments. Proc. 8th USENIX Symp. on Operating Systems Design and Implementation, p.7-21.
[37]Zaharia, M., Borthakur, D., Sen, S., et al., 2010. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. Proc. 5th European Conf. on Computer Systems, p.265-278.
Open peer comments: Debate/Discuss/Question/Opinion
<1>