On-line Access: 2023-12-21

Received: 2023-10-10

Revision Accepted: 2023-10-17


Frontiers of Information Technology & Electronic Engineering, 2023 (in press)

http://doi.org/10.1631/FITEE.2300684


Automatic parallelism strategy generation with minimal memory redundancy


Author(s):  Yanqi SHI, Peng LIANG, Hao ZHENG, Linbo QIAO, Dongsheng LI

Affiliation(s):  National University of Defense Technology, Changsha 410000, China

Corresponding email(s):   yqshi@nudt.edu.cn, peng_leung@nudt.edu.cn, zhengh@nudt.edu.cn, linboqiao@nudt.edu.cn, lds1201@163.com

Key Words:  Deep learning, Automatic parallelism, Minimal memory redundancy


Yanqi SHI, Peng LIANG, Hao ZHENG, Linbo QIAO, Dongsheng LI. Automatic parallelism strategy generation with minimal memory redundancy[J]. Frontiers of Information Technology & Electronic Engineering, 2023 (in press). https://doi.org/10.1631/FITEE.2300684

@article{FITEE.2300684,
title="Automatic parallelism strategy generation with minimal memory redundancy",
author="Yanqi SHI, Peng LIANG, Hao ZHENG, Linbo QIAO, Dongsheng LI",
journal="Frontiers of Information Technology & Electronic Engineering",
year="2023",
note="in press",
publisher="Zhejiang University Press & Springer",
doi="10.1631/FITEE.2300684"
}

%0 Journal Article
%T Automatic parallelism strategy generation with minimal memory redundancy
%A Yanqi SHI
%A Peng LIANG
%A Hao ZHENG
%A Linbo QIAO
%A Dongsheng LI
%J Frontiers of Information Technology & Electronic Engineering
%@ 2095-9184
%D 2023
%I Zhejiang University Press & Springer
%R 10.1631/FITEE.2300684

TY - JOUR
T1 - Automatic parallelism strategy generation with minimal memory redundancy
A1 - Yanqi SHI
A1 - Peng LIANG
A1 - Hao ZHENG
A1 - Linbo QIAO
A1 - Dongsheng LI
JO - Frontiers of Information Technology & Electronic Engineering
SN - 2095-9184
Y1 - 2023
PB - Zhejiang University Press & Springer
DO - 10.1631/FITEE.2300684
ER -


Abstract: 
Large-scale deep learning (DL) models are trained in a distributed manner because of memory and computing resource limitations. Few existing strategy-generation approaches take memory minimization as their objective. To fill this gap, we propose a novel algorithm that generates optimal parallelism strategies under the constraint of minimal memory redundancy. We propose a novel Redundant Memory Cost Model (RMCM) to calculate the memory overhead of each operator under a given parallelism strategy. To generate the optimal parallelism strategy, we formulate the strategy search as an integer linear programming problem and use an efficient solver to find minimal-memory intra-operator parallelism strategies. Furthermore, the proposed algorithm has been extended and implemented in a multi-dimensional parallel training framework, where it combines high throughput with minimal memory redundancy. Experimental results demonstrate that our approach saves up to 67% of memory compared with the latest Megatron-LM strategies while achieving similar throughput. The principal contribution of this work is a memory-efficient algorithm for generating parallelism strategies for large-scale DL models, surpassing existing strategies in reducing memory requirements.
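
The abstract's core technical step is casting parallelism-strategy selection as an integer linear program over per-operator memory costs. The sketch below is a minimal illustration of that kind of formulation, not the paper's implementation: the operator list, the candidate strategies, the cost table standing in for RMCM values, and the resharding penalty are all hypothetical, and PuLP is used only as a convenient off-the-shelf solver interface.

# A minimal sketch (not the authors' implementation) of selecting one
# parallelism strategy per operator via integer linear programming.
# All names and numbers below are hypothetical stand-ins.
import pulp

# Hypothetical workload: each operator may be sharded in different ways.
operators = ["embedding", "attention", "mlp"]
strategies = ["data_parallel", "tensor_parallel_row", "tensor_parallel_col"]

# Hypothetical per-(operator, strategy) redundant-memory costs (GB),
# standing in for values an RMCM-style cost model would produce.
memory_cost = {
    ("embedding", "data_parallel"): 4.0,
    ("embedding", "tensor_parallel_row"): 1.0,
    ("embedding", "tensor_parallel_col"): 1.2,
    ("attention", "data_parallel"): 6.0,
    ("attention", "tensor_parallel_row"): 1.5,
    ("attention", "tensor_parallel_col"): 1.4,
    ("mlp", "data_parallel"): 8.0,
    ("mlp", "tensor_parallel_row"): 2.0,
    ("mlp", "tensor_parallel_col"): 2.2,
}

# Hypothetical linear penalty paid when adjacent operators pick
# different strategies (a stand-in for resharding communication cost).
reshard_penalty = 0.5

prob = pulp.LpProblem("min_memory_parallelism", pulp.LpMinimize)

# x[(op, s)] = 1 iff operator `op` uses strategy `s`.
x = pulp.LpVariable.dicts("x", memory_cost.keys(), cat="Binary")

# y[(i, s)] = 1 iff operators i and i+1 both use strategy s (no resharding).
y = pulp.LpVariable.dicts(
    "y",
    [(i, s) for i in range(len(operators) - 1) for s in strategies],
    cat="Binary",
)

# Objective: total redundant memory plus resharding penalties.
prob += (
    pulp.lpSum(memory_cost[k] * x[k] for k in memory_cost)
    + pulp.lpSum(
        reshard_penalty * (1 - pulp.lpSum(y[(i, s)] for s in strategies))
        for i in range(len(operators) - 1)
    )
)

# Each operator picks exactly one strategy.
for op in operators:
    prob += pulp.lpSum(x[(op, s)] for s in strategies) == 1

# y[(i, s)] may be 1 only if both neighboring operators chose strategy s.
for i in range(len(operators) - 1):
    for s in strategies:
        prob += y[(i, s)] <= x[(operators[i], s)]
        prob += y[(i, s)] <= x[(operators[i + 1], s)]

prob.solve(pulp.PULP_CBC_CMD(msg=False))

for op in operators:
    chosen = [s for s in strategies if pulp.value(x[(op, s)]) > 0.5]
    print(f"{op}: {chosen[0]}")

On this toy instance the solver picks tensor_parallel_row for every operator (total cost 4.5 GB, no resharding), since the per-operator savings of mixing strategies do not outweigh the penalty. In the paper's setting, RMCM would supply the real per-operator costs and the solver would return the minimal-memory intra-operator strategy.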

