CLC number: TP18
On-line Access: 2021-07-12
Received: 2019-11-30
Revision Accepted: 2020-08-17
Crosschecked: 2021-04-29
Kaiqing Zhang, Zhuoran Yang, Tamer Başar. Decentralized multi-agent reinforcement learning with networked agents: recent advances[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1900661
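Decentralized multi-agent reinforcement learning with networked agents: recent advances
1 Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Illinois 61801, USA
2 Department of Operations Research and Financial Engineering, Princeton University, New Jersey 08544, USA
Abstract: Multi-agent reinforcement learning (MARL) has long been an important research topic in both machine learning and control. Recent progress in (single-agent) deep reinforcement learning has renewed interest in MARL, especially in its theoretical analysis. This paper reviews one sub-area of this broad topic: decentralized MARL with networked agents. In this setting, multiple agents make sequential decisions in a common environment without coordination by a central controller, and each agent is allowed to exchange information with its neighbors over a communication network. Such a model has applications in many areas, including robot control, unmanned vehicle control, mobile sensor network control, and smart grids. This survey aims to cover and organize the related work by us and other researchers in this direction. We hope it will stimulate more research interest in this exciting yet challenging area.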