CLC number: TP18

On-line Access: 2024-07-05

Received: 2023-08-14

Revision Accepted: 2024-07-05

Crosschecked: 2023-11-24

ORCID:

Weilin YUAN

https://orcid.org/0000-0001-9894-5253

Weiwei ZHAO

https://orcid.org/0009-0002-6989-8536

Frontiers of Information Technology & Electronic Engineering  2024 Vol.25 No.6 P.763-790

http://doi.org/10.1631/FITEE.2300548


Transformer in reinforcement learning for decision-making: a survey


Author(s):  Weilin YUAN, Jiaxing CHEN, Shaofei CHEN, Dawei FENG, Zhenzhen HU, Peng LI, Weiwei ZHAO

Affiliation(s):  College of Information and Communication, National University of Defense Technology, Wuhan 430014, China; College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410072, China; National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410072, China

Corresponding email(s):   yuanweilin12@nudt.edu.cn, zhaozww@163.com

Key Words:  Transformer, Reinforcement learning (RL), Decision-making (DM), Deep neural network (DNN), Multi-agent reinforcement learning (MARL), Meta-reinforcement learning (Meta-RL)


Weilin YUAN, Jiaxing CHEN, Shaofei CHEN, Dawei FENG, Zhenzhen HU, Peng LI, Weiwei ZHAO. Transformer in reinforcement learning for decision-making: a survey[J]. Frontiers of Information Technology & Electronic Engineering, 2024, 25(6): 763-790.

@article{yuan2024transformer,
author="Weilin YUAN, Jiaxing CHEN, Shaofei CHEN, Dawei FENG, Zhenzhen HU, Peng LI, Weiwei ZHAO",
title="Transformer in reinforcement learning for decision-making: a survey",
journal="Frontiers of Information Technology \& Electronic Engineering",
volume="25",
number="6",
pages="763-790",
year="2024",
publisher="Zhejiang University Press \& Springer",
issn="2095-9184",
doi="10.1631/FITEE.2300548"
}


Abstract: 
Reinforcement learning (RL) has become a dominant decision-making paradigm and has achieved notable success in many real-world applications. Notably, deep neural networks play a crucial role in unlocking RL's potential in large-scale decision-making tasks. Inspired by the recent major success of the Transformer in natural language processing and computer vision, combining the Transformer with RL has overcome numerous bottlenecks in decision-making. This paper presents a multiangle systematic survey of Transformer-based RL (TransRL) models applied to decision-making tasks, covering basic models, advanced algorithms, representative implementations, typical applications, and known challenges. Our work aims to provide insight into the problems that inherently arise with current RL approaches and to examine how better TransRL models can address them. To our knowledge, this is the first comprehensive review of recent Transformer research developments in RL for decision-making. We hope that this survey provides a thorough overview of TransRL models and inspires the RL community in its pursuit of future directions. To keep track of the rapid TransRL developments in decision-making domains, we summarize the latest papers and their open-source implementations at https://github.com/williamyuanv0/transformer-in-Reinforcement-Learning-for-Decision-Making-A-Survey.
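As a concrete illustration of the sequence-modeling view that many TransRL models adopt, the sketch below (our own minimal example, not code from the surveyed papers) follows the Decision Transformer recipe: a trajectory is flattened into (return-to-go, state, action) tokens and a causal Transformer predicts each action. All module names, dimensions, and hyperparameters here are illustrative assumptions.

```python
# Minimal Decision-Transformer-style sketch (illustrative only).
import torch
import torch.nn as nn

class TinyDecisionTransformer(nn.Module):
    """Causal Transformer over flattened (return-to-go, state, action) tokens."""
    def __init__(self, state_dim, act_dim, d_model=64, n_layers=2, n_heads=4, max_len=64):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)            # return-to-go token
        self.embed_state = nn.Linear(state_dim, d_model)  # state token
        self.embed_action = nn.Linear(act_dim, d_model)   # action token
        self.embed_time = nn.Embedding(max_len, d_model)  # per-timestep position
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions, timesteps):
        # rtg: (B,T,1)  states: (B,T,state_dim)  actions: (B,T,act_dim)  timesteps: (B,T)
        B, T = timesteps.shape
        pos = self.embed_time(timesteps)                                 # (B,T,d)
        tokens = torch.stack([self.embed_rtg(rtg) + pos,
                              self.embed_state(states) + pos,
                              self.embed_action(actions) + pos], dim=2)  # (B,T,3,d)
        tokens = tokens.reshape(B, 3 * T, -1)   # interleave r_1, s_1, a_1, r_2, ...
        mask = nn.Transformer.generate_square_subsequent_mask(3 * T)     # causal mask
        h = self.backbone(tokens, mask=mask).reshape(B, T, 3, -1)
        return self.predict_action(h[:, :, 1])  # predict a_t from each state token

# Usage: predict actions for a batch of 4 trajectories of length 10.
model = TinyDecisionTransformer(state_dim=8, act_dim=2)
rtg = torch.randn(4, 10, 1); s = torch.randn(4, 10, 8); a = torch.randn(4, 10, 2)
t = torch.arange(10).expand(4, 10)
print(model(rtg, s, a, t).shape)  # torch.Size([4, 10, 2])
```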

Transformer-based reinforcement learning for intelligent decision-making: a survey

Weilin YUAN^1, Jiaxing CHEN^2, Shaofei CHEN^2, Dawei FENG^3, Zhenzhen HU^2, Peng LI^2, Weiwei ZHAO^1
^1 College of Information and Communication, National University of Defense Technology, Wuhan 430014, China
^2 College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410072, China
^3 National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410072, China
Abstract: Reinforcement learning has become a dominant decision-making paradigm, achieving remarkable results in many real-world applications. In large-scale decision-making scenarios, deep neural networks are the key to unlocking the enormous potential of reinforcement learning. Inspired by advanced Transformer methods in natural language processing and computer vision, the combination of the Transformer and reinforcement learning has broken through many bottlenecks in intelligent decision-making. This paper summarizes Transformer-based reinforcement learning methods (TransRL) from the perspectives of basic models, advanced algorithms, representative examples, typical applications, and challenge analysis, aiming to analyze in depth the pain points of current reinforcement learning methods and to discuss how TransRL overcomes the limitations of the reinforcement learning paradigm. To our knowledge, this is the first survey to systematically review the progress of Transformer-based reinforcement learning methods in intelligent decision-making; we hope it provides a comprehensive basis for discussing TransRL and promotes the application of reinforcement learning in this field. To facilitate tracking the frontier of TransRL, we have compiled the latest related papers and their open-source projects at https://github.com/williamyuanv0/Transformer-in-Reinforcement-Learning-for-Decision-Making-A-Survey.

Key words: Transformer; reinforcement learning; intelligent decision-making; deep neural network; multi-agent reinforcement learning; meta-reinforcement learning

