CLC number: TP338.6

On-line Access: 2022-04-22

Received: 2016-08-03

Revision Accepted: 2017-03-03

Crosschecked: 2018-10-09


 ORCID:

Wei Hu

http://orcid.org/0000-0002-8839-7748


Frontiers of Information Technology & Electronic Engineering 

Accepted manuscript available online (unedited version)


FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing


Author(s):  Wei Hu, Guang-ming Liu, Yan-huang Jiang

Affiliation(s):  College of Computer, National University of Defense Technology, Changsha 410073, China

Corresponding email(s):  huwei@nscc-tj.gov.cn, liugm@nscc-tj.gov.cn, yhjiang@nudt.edu.cn

Key Words:  High-performance computing, Proactive fault tolerance, Failure locality, Process replication, Process prefetching



Wei Hu, Guang-ming Liu, Yan-huang Jiang. FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1601450

@article{HuFTRP,
title="FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing",
author="Wei Hu, Guang-ming Liu, Yan-huang Jiang",
journal="Frontiers of Information Technology & Electronic Engineering",
pages="1273-1290",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.1601450"
}



Abstract: 
As the scale of supercomputers grows rapidly, reliability has come to dominate system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot solve this problem effectively. To address this issue, we present a new fault tolerance framework using process replication and prefetching (FTRP), which combines the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve application execution efficiency. The cost model, called the 'work-most' (WM) model, makes runtime decisions that adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. Analogous to program locality, we observe the failure locality phenomenon in supercomputers for the first time. In the new proactive fault tolerance mechanism, process replication with process prefetching is proposed based on failure locality, largely avoiding the losses caused by failures regardless of whether they have been predicted. Simulations with real failure traces demonstrate that FTRP outperforms existing fault tolerance mechanisms, with up to 10% improvement in application efficiency at common failure prediction accuracies, and remains effective for petascale systems and beyond.
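The abstract describes the WM model only at a high level. As a minimal sketch of what a 'work-most'-style runtime decision could look like, the snippet below picks whichever fault-tolerance action maximizes the expected useful work over the next interval, given a predicted failure probability. The action names, overheads, and loss figures are illustrative assumptions, not the paper's actual model.

```python
from dataclasses import dataclass

# Hypothetical illustration of a "work-most"-style decision.
# All costs below are assumed numbers, not values from the paper.

@dataclass
class Action:
    name: str
    overhead: float      # time the fault-tolerance action itself consumes
    loss_if_fail: float  # work lost if a failure strikes before the next decision

def expected_useful_work(action: Action, interval: float, p_fail: float) -> float:
    """Expected useful work in the interval: time not spent on overhead,
    minus the rollback loss weighted by the predicted failure probability."""
    return (interval - action.overhead) - p_fail * action.loss_if_fail

def choose_action(actions: list[Action], interval: float, p_fail: float) -> Action:
    # "Work-most": keep the action with the highest expected useful work.
    return max(actions, key=lambda a: expected_useful_work(a, interval, p_fail))

actions = [
    Action("continue", overhead=0.0, loss_if_fail=600.0),    # risk a full rollback
    Action("checkpoint", overhead=60.0, loss_if_fail=30.0),  # pay I/O cost up front
    Action("replicate", overhead=120.0, loss_if_fail=5.0),   # spawn a replica process
]

print(choose_action(actions, interval=600.0, p_fail=0.05).name)  # -> continue
print(choose_action(actions, interval=600.0, p_fail=0.60).name)  # -> checkpoint
```

A higher predicted failure probability shifts the decision from doing nothing toward paying the up-front cost of checkpointing or replication, which is the trade-off the WM model adjudicates at runtime.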

A fault tolerance framework for high-performance computing based on process replication and prefetching

Summary: As the scale of supercomputers grows rapidly, reliability has become the main factor constraining system availability. Existing fault tolerance mechanisms, including checkpointing and process redundancy, cannot solve this problem effectively. We therefore propose FTRP (fault tolerance framework using process replication and prefetching), a fault tolerance framework for high-performance computing based on process replication and prefetching. The framework combines the advantages of proactive and reactive fault tolerance, introducing a novel cost model and a proactive fault tolerance mechanism that effectively improve application execution efficiency. The novel 'work-most' (WM) cost model makes online, adaptive fault-tolerance decisions from a set of mechanisms, based on failure prediction results and application status. Analogous to locality in program execution, we observe the phenomenon of failure locality in supercomputers for the first time. Based on failure locality, we propose a new fault tolerance mechanism combining process replication with process prefetching, which effectively avoids losses caused by failures whether or not they are predicted. Simulations based on real failure traces and common failure prediction accuracies show that applications using FTRP achieve up to 10% improvement over existing fault tolerance mechanisms, and that FTRP remains effective on petascale and larger systems.

Keywords: High-performance computing; proactive fault tolerance; failure locality; process replication; process prefetching




Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - 2024 Journal of Zhejiang University-SCIENCE