CLC number: TP18
On-line Access: 2020-12-10
Received: 2019-09-29
Revision Accepted: 2020-03-30
Crosschecked: 2020-06-04
Hao-nan Wang, Ning Liu, Yi-yun Zhang, Da-wei Feng, Feng Huang, Dong-sheng Li, Yi-ming Zhang. Deep reinforcement learning: a survey[J]. Frontiers of Information Technology & Electronic Engineering, in press. https://doi.org/10.1631/FITEE.1900533
Deep reinforcement learning: a survey
National Key Laboratory of Parallel and Distributed Processing, National University of Defense Technology, Changsha 410000, China

Abstract: Deep reinforcement learning (DRL) has become one of the most popular topics in artificial intelligence research. It has been widely applied to many fields, including end-to-end control, robotic control, recommendation systems, and natural language dialogue systems. This paper presents a systematic classification and detailed discussion of DRL algorithms and applications, dividing existing DRL algorithms into model-based methods, model-free methods, and advanced DRL methods. It then comprehensively analyzes progress in advanced techniques such as exploration, inverse reinforcement learning, and transfer reinforcement learning. Finally, it outlines representative DRL applications and analyzes four open problems that urgently need to be solved.

Key words:
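To make the model-free/model-based distinction in the abstract concrete, the sketch below shows the tabular Q-learning update that value-based model-free methods such as DQN (Mnih et al., 2015) scale up with neural networks: the agent improves its value estimates from sampled transitions alone, without learning a model of the environment's dynamics. This is a minimal illustrative sketch, not code from the surveyed paper; the toy chain environment and all hyperparameters are assumptions chosen for brevity.

```python
import numpy as np

# Minimal sketch of tabular Q-learning, the model-free
# temporal-difference update underlying value-based DRL methods.
# The 5-state chain environment and hyperparameters are
# illustrative assumptions, not taken from the paper.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1  # step size, discount, exploration rate
rng = np.random.default_rng(0)

def step(s, a):
    """Toy chain: action 1 moves right, action 0 stays; reward only at the end."""
    s_next = min(s + 1, n_states - 1) if a == 1 else s
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

for episode in range(200):
    s = 0
    for t in range(100):  # cap episode length
        # epsilon-greedy action selection (exploration vs. exploitation)
        a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[s].argmax())
        s_next, r, done = step(s, a)
        # model-free TD target: bootstrap from Q itself, no dynamics model
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
        if done:
            break

print(np.round(Q, 2))  # learned action values favor moving right
```

A model-based method would instead fit a dynamics model p(s' | s, a) from the same transitions and plan against it, trading extra modeling machinery for better sample efficiency.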