CLC number: TP18
On-line Access: 2020-12-10
Received: 2019-09-29
Revision Accepted: 2020-03-30
Crosschecked: 2020-06-04
Hao-nan Wang, Ning Liu, Yi-yun Zhang, Da-wei Feng, Feng Huang, Dong-sheng Li, Yi-ming Zhang. Deep reinforcement learning: a survey[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1900533
Deep reinforcement learning: a survey
National Key Laboratory of Parallel and Distributed Processing, National University of Defense Technology, Changsha 410000, China

Abstract: Deep reinforcement learning (DRL) has become one of the most popular topics in artificial intelligence research. It has been widely applied to many fields, including end-to-end control, robotic control, recommendation systems, and natural language dialogue systems. This paper presents a systematic classification and detailed discussion of DRL algorithms and applications, dividing existing DRL algorithms into model-based methods, model-free methods, and advanced DRL methods. It then comprehensively analyzes progress in advanced techniques such as exploration, inverse reinforcement learning, and transfer reinforcement learning. Finally, it outlines representative DRL applications and analyzes four open problems that urgently need to be solved.

Key words:
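To make the model-free/model-based distinction in the abstract concrete, the sketch below shows the tabular Q-learning update that value-based model-free methods such as DQN (Mnih et al., 2015) scale up with neural networks: the agent improves its value estimates from sampled transitions alone, without learning a model of the environment's dynamics. This is a minimal illustrative sketch, not code from the surveyed paper; the toy chain environment and all hyperparameters are assumptions chosen for brevity.

```python
import numpy as np

# Minimal sketch of tabular Q-learning, the model-free
# temporal-difference update underlying value-based DRL methods.
# The 5-state chain environment and hyperparameters are
# illustrative assumptions, not taken from the paper.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # step size, discount, exploration rate
rng = np.random.default_rng(0)

def step(s, a):
    """Toy chain: action 1 moves right, action 0 stays; reward only at the end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else s
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

for episode in range(200):
    s = 0
    for t in range(100):  # cap episode length
        # epsilon-greedy action selection (exploration vs. exploitation)
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # model-free TD target: bootstrap from Q itself, no dynamics model
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break

print(np.round(Q, 2))  # learned action values favor moving right
```

A model-based method would instead fit a dynamics model p(s' | s, a) from the same transitions and plan against it, trading extra modeling machinery for better sample efficiency.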