Full Text:   <512>

Summary:  <38>

CLC number: TP319

On-line Access: 2020-06-12

Received: 2019-02-09

Revision Accepted: 2019-07-12

Crosschecked: 2020-05-09

Cited: 0

Clicked: 679

Citations:  Bibtex RefMan EndNote GB/T7714

 ORCID:

Yi-shui Li

https://orcid.org/0000-0001-7826-7504

Jie Liu

https://orcid.org/0000-0003-3745-7541

-   Go to

Article info.
Open peer comments

Frontiers of Information Technology & Electronic Engineering  2020 Vol.21 No.6 P.939-949

10.1631/FITEE.1900075


OHTMA: an optimized heuristic topology-aware mapping algorithm on the Tianhe-3 exascale supercomputer prototype


Author(s):  Yi-shui Li, Xin-hai Chen, Jie Liu, Bo Yang, Chun-ye Gong, Xin-biao Gan, Sheng-guo Li, Han Xu

Affiliation(s):  Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China

Corresponding email(s):   liyishui_lys@163.com, chenxinhai1995@aliyun.com, liujie@nudt.edu.cn

Key Words:  High-performance computing, Topology mapping, Heuristic algorithm


Yi-shui Li, Xin-hai Chen, Jie Liu, Bo Yang, Chun-ye Gong, Xin-biao Gan, Sheng-guo Li, Han Xu. OHTMA: an optimized heuristic topology-aware mapping algorithm on the Tianhe-3 exascale supercomputer prototype[J]. Frontiers of Information Technology & Electronic Engineering, 2020, 21(6): 939-949.

@article{title="OHTMA: an optimized heuristic topology-aware mapping algorithm on the Tianhe-3 exascale supercomputer prototype",
author="Yi-shui Li, Xin-hai Chen, Jie Liu, Bo Yang, Chun-ye Gong, Xin-biao Gan, Sheng-guo Li, Han Xu",
journal="Frontiers of Information Technology & Electronic Engineering",
volume="21",
number="6",
pages="939-949",
year="2020",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.1900075"
}

%0 Journal Article
%T OHTMA: an optimized heuristic topology-aware mapping algorithm on the Tianhe-3 exascale supercomputer prototype
%A Yi-shui Li
%A Xin-hai Chen
%A Jie Liu
%A Bo Yang
%A Chun-ye Gong
%A Xin-biao Gan
%A Sheng-guo Li
%A Han Xu
%J Frontiers of Information Technology & Electronic Engineering
%V 21
%N 6
%P 939-949
%@ 2095-9184
%D 2020
%I Zhejiang University Press & Springer
%DOI 10.1631/FITEE.1900075

TY - JOUR
T1 - OHTMA: an optimized heuristic topology-aware mapping algorithm on the Tianhe-3 exascale supercomputer prototype
A1 - Yi-shui Li
A1 - Xin-hai Chen
A1 - Jie Liu
A1 - Bo Yang
A1 - Chun-ye Gong
A1 - Xin-biao Gan
A1 - Sheng-guo Li
A1 - Han Xu
J0 - Frontiers of Information Technology & Electronic Engineering
VL - 21
IS - 6
SP - 939
EP - 949
%@ 2095-9184
Y1 - 2020
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/FITEE.1900075


Abstract: 
With the rapid increase of the size of applications and the complexity of the supercomputer architecture, topology-aware process mapping becomes increasingly important. High communication cost has become a dominant constraint of the performance of applications running on the supercomputer. To avoid a bad mapping strategy which can lead to terrible communication performance, we propose an optimized heuristic topology-aware mapping algorithm (OHTMA). The algorithm attempts to minimize the hop-byte metric that we use to measure the mapping results. OHTMA incorporates a new greedy heuristic method and pair-exchange-based optimization. It reduces the number of long-distance communications and effectively enhances the locality of the communication. Experimental results on the Tianhe-3 exascale supercomputer prototype indicate that OHTMA can significantly reduce the communication costs.

OHTMA:面向天河三号E级原型机的一种启发式优化拓扑感知映射算法

李翊谁,陈新海,刘杰,杨博,龚春叶,甘新标,李胜国,徐涵
国防科技大学计算机学院并行与分布处理国家重点实验室,中国长沙市,410073

摘要:随着应用程序规模和超级计算机体系结构复杂性的迅速增加,拓扑映射的重要性愈加凸显。高通信成本已成为超级计算机上运行的应用程序性能的主要限制因素。为避免不合适的映射策略可能带来的较差通信性能,提出一种启发式优化拓扑感知映射算法(OHTMA)。该算法旨在最小化用于测量映射结果的字节跳跃度量。OHTMA结合贪婪启发式算法和基于对交换的优化方法,减少了远程通信数量,有效增强了通信局部性。在天河三号E级原型机的实验结果表明,OHTMA算法可显著降低通信成本。

关键词:高性能计算;拓扑映射;启发式算法

Darkslateblue:Affiliate; Royal Blue:Author; Turquoise:Article

Reference

[1]Agarwal T, Sharma A, Laxmikant A, et al., 2006. Topology-aware task mapping for reducing communication contention on large parallel machines. Proc 20th IEEE Int Parallel & Distributed Processing Symp, p.1-10.

[2]Bailey DH, Barszcz E, Barton JT, et al., 1991. The NAS parallel benchmarks—summary and preliminary results. Proc ACM/IEEE Conf on Supercomputing, p.158-165.

[3]Bhatele A, 2010. Automating Topology Aware Mapping for Supercomputers. PhD Thesis, University of Illinois at Urbana-Champaign, Urbana, USA.

[4]Bhatele A, Laxmikant V, 2009. An evaluative study on the effect of contention on message latencies in large supercomputers. Proc IEEE Int Symp on Parallel & Distributed Processing, p.1-8.

[5]Brandfass B, Alrutz T, Gerhold T, 2013. Rank reordering for MPI communication optimization. Comput Fluid, 80:372-380.

[6]Chen X, Liu J, Li S, et al., 2018. TAMM: a new topology-aware mapping method for parallel applications on the Tianhe-2A supercomputer. Proc 18th Int Conf on Algorithms and Architectures for Parallel Processing, p.242-256.

[7]Deveci M, Kaya K, Uccar B, et al., 2015. Fast and high quality topology-aware task mapping. Proc IEEE Int Parallel and Distributed Processing Symp, p.197-206.

[8]Hoefler T, Snir M, 2011. Generic topology mapping strategies for large-scale parallel architectures. Proc Int Conf on Supercomputing, p.75-84.

[9]Hoefler T, Jeannot E, Mercier G, 2014. An overview of topology mapping algorithms and techniques in high-performance computing. In: Jeannot E, Žilinskas J (Eds.), High-Performance Computing on Complex Environments. Wiley, Hoboken, New Jersey, USA.

[10]Jeannot E, Mercier G, 2010. Near-optimal placement of MPI processes on hierarchical NUMA architectures. In: D’Ambra P, Guarracino M, Talia D (Eds.), Euro-Par 2010 Parallel Processing. Springer Berlin Heidelberg, Germany, p.199-210.

[11]Jeannot E, Mercier G, Tessier F, 2014. Process placement in multicore clusters: algorithmic issues and practical techniques. IEEE Trans Parall Distrib Syst, 25(4):993-1002.

[12]Karypis G, Kumar V, 1998. METIS—A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes and Computing Fill-Reducing Ordering of Sparse Matrices. Technical Report, University of Minnesota, Minneapolis, USA.

[13]Liao X, Pang Z, Wang K, et al., 2015. High performance interconnect network for Tianhe system. J Comput Sci Technol, 30(2):259-272.

[14]Mercier G, Clet-Ortega J, 2009. Towards an efficient process placement policy for MPI applications in multicore environments. In: Ropo M, Westerholm J, Dongarra J (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface. Springer Berlin Heidelberg, Germany, p.104-115.

[15]Mirsadeghi SH, Afsahi A, 2016. PTRAM: a parallel topology-and routing-aware mapping framework for large-scale HPC systems. Proc IEEE Int Parallel and Distributed Processing Symp Workshops, p.386-396.

[16]Pellegrini F, Roman J, 1996. SCOTCH: a software package for static mapping by dual recursive bipartitioning of process and architecture graphs. Proc Int Conf and Exhibition on High-Performance Computing and Networking, p.493-498.

[17]Rodrigues E, Madruga F, Navaux P, et al., 2009. Multi-core aware process mapping and its impact on communication overhead of parallel applications. Int Symp on Computers and Communications, p.811-817.

[18]Sahni S, Gonzalez T, 1976. P-complete approximation problems. J ACM, 23(3):555-565.

[19]Sudheer CD, Srinivasan A, 2012. Optimization of the hop-byte metric for effective topology aware mapping. Proc 19th Int Conf on High Performance Computing, p.1-9.

[20]Tuncer O, Leung VJ, Coskun AK, 2015. PaCMap: topology mapping of unstructured communication patterns onto non-contiguous allocations. Proc 29th ACM on Int Conf on Supercomputing, p.37-46.

[21]Walshaw C, Cross M, 2007. JOSTLE–-parallel multilevel graph-partitioning software: an overview. In: Magoul‘es F (Ed.), Mesh Partitioning Techniques and Domain Decomposition Methods. Saxe-Coburg Publications, Stirlingshire, UK, p.22-58.

[22]Wang T, Qing P, Wei D, et al., 2015. Optimization of process-to-core mapping based on clustering analysis. Chin J Comput, 38(5):1044-1055 (in Chinese).

[23]Wylie BJN, Böhme D, Mohr B, et al., 2010. Performance analysis of Sweep3D on Blue Gene/P with the Scalasca toolset. Proc IEEE Int Symp on Parallel & Distributed Processing, Workshops and PhD Forum, p.1-8.

[24]Zerr RJ, Baker RS, 2013. Snap: SN (Discrete Ordinates) Application Proxy-Proxy Description. Technical Report, LA-UR-13-21070, Los Alamos National Laboratory, Los Alamos, USA.

Open peer comments: Debate/Discuss/Question/Opinion

<1>

Please provide your name, email address and a comment





Journal of Zhejiang University-SCIENCE, 38 Zheda Road, Hangzhou 310027, China
Tel: +86-571-87952783; E-mail: cjzhang@zju.edu.cn
Copyright © 2000 - Journal of Zhejiang University-SCIENCE