JZUS - Journal of Zhejiang University SCIENCE

Frontiers of Information Technology & Electronic Engineering

Accepted manuscript available online (unedited version)

A vision of post-exascale programming

Author(s): Ji-dong Zhai, Wen-guang Chen
Affiliation(s): Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Corresponding email(s): zhaijidong@tsinghua.edu.cn
Key Words: Computing model, Fault-tolerance, Heterogeneous, Parallelism, Post-exascale


Ji-dong Zhai, Wen-guang Chen. A vision of post-exascale programming[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1800442

@article{title="A vision of post-exascale programming",
author="Ji-dong Zhai, Wen-guang Chen",
journal="Frontiers of Information Technology & Electronic Engineering",
year="in press",
publisher="Zhejiang University Press & Springer",
doi="https://doi.org/10.1631/FITEE.1800442"
}

%0 Journal Article
%T A vision of post-exascale programming
%A Ji-dong Zhai
%A Wen-guang Chen
%J Frontiers of Information Technology & Electronic Engineering
%P 1261-1266
%@ 2095-9184
%D in press
%I Zhejiang University Press & Springer
%R https://doi.org/10.1631/FITEE.1800442

TY - JOUR
T1 - A vision of post-exascale programming
A1 - Ji-dong Zhai
A1 - Wen-guang Chen
JO - Frontiers of Information Technology & Electronic Engineering
SP - 1261
EP - 1266
SN - 2095-9184
Y1 - in press
PB - Zhejiang University Press & Springer
DO - https://doi.org/10.1631/FITEE.1800442
ER -


Abstract: Exascale systems have been under development for some time and are expected to become available within a few years, so it is time to consider the post-exascale systems that will follow. Such systems raise major challenges in processor architecture, programming, storage, and interconnect. In this study, we discuss three significant programming challenges for post-exascale systems: heterogeneity, parallelism, and fault tolerance. Based on our experience of programming current large-scale systems, we propose several potential solutions to these challenges. Nevertheless, more research is needed to solve these problems.

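A vision of programming models for post-exascale systems

Chinese summary (in translation): Exascale high-performance computing systems have been under development for a long time and can be put into use within the next few years. It is now time to consider future post-exascale high-performance computing systems. Post-exascale systems face many major challenges, such as processor architecture, programming models, storage architecture, and interconnection networks. This paper discusses three important challenges facing programming models for post-exascale systems: heterogeneity, parallelism, and fault tolerance. Based on our current experience of programming on large-scale systems, we propose some possible solutions to these challenges. However, more research will be needed in the future to address them.

Key words (in translation): Computing model; Fault tolerance; Heterogeneity; Parallelism; Post-exascale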


