
CLC number: TP181

On-line Access: 2024-08-27

Received: 2023-10-17

Revision Accepted: 2024-05-08

Crosschecked: 2023-05-19


 ORCID:

Jian Pu

https://orcid.org/0000-0002-0892-1213

Shihmin WANG

https://orcid.org/0000-0002-7288-8323


Frontiers of Information Technology & Electronic Engineering, 2023, Vol. 24, No. 11, P. 1541-1556

http://doi.org/10.1631/FITEE.2300084


Embedding expert demonstrations into clustering buffer for effective deep reinforcement learning


Author(s):  Shihmin WANG, Binqi ZHAO, Zhengfeng ZHANG, Junping ZHANG, Jian PU

Affiliation(s):  Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai 200433, China

Corresponding email(s):   wangshimin20@fudan.edu.cn, bqzhao20@fudan.edu.cn, jpzhang@fudan.edu.cn, jianpu@fudan.edu.cn

Key Words:  Reinforcement learning, Sample efficiency, Sampling process, Clustering methods, Autonomous driving



Abstract: 
As one of the most fundamental topics in reinforcement learning (RL), sample efficiency is essential to the deployment of deep RL algorithms. Unlike most existing exploration methods that sample an action from different types of posterior distributions, we focus on the policy sampling process and propose an efficient selective sampling approach to improve sample efficiency by modeling the internal hierarchy of the environment. Specifically, we first employ clustering methods in the policy sampling process to generate an action candidate set. Then we introduce a clustering buffer for modeling the internal hierarchy, which consists of on-policy data, off-policy data, and expert data, to evaluate actions from the clusters in the action candidate set during the exploration stage. In this way, our approach is able to take advantage of the supervision information in the expert demonstration data. Experiments on six different continuous locomotion environments demonstrate that selective sampling achieves superior reinforcement learning performance and faster convergence. In particular, on the LGSVL task, our method reduces the number of convergence steps by 46.7% and the convergence time by 28.5%. Furthermore, our code is open-sourced for reproducibility and is available at https://github.com/Shihwin/SelectiveSampling.
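To make the idea described in the abstract more concrete, below is a minimal sketch of one exploration step with clustering-based selective sampling. It is not the authors' implementation (see the GitHub repository for that): the names `selective_sample`, `policy`, and `q_value`, as well as all hyperparameter values, are illustrative assumptions. In the paper, clusters from the action candidate set are evaluated with a clustering buffer mixing on-policy, off-policy, and expert demonstration data; here a generic value estimator stands in for that evaluation.

```python
import numpy as np
from sklearn.cluster import KMeans


def selective_sample(policy, q_value, state, n_candidates=32, n_clusters=4):
    """Illustrative sketch: cluster candidate actions and pick the best cluster.

    `policy(state)` is assumed to return one sampled action as a 1D array, and
    `q_value(state, action)` to return a scalar score; both signatures are
    hypothetical, not the authors' API.
    """
    # 1. Draw a candidate set of actions from the current stochastic policy.
    candidates = np.stack([policy(state) for _ in range(n_candidates)])

    # 2. Cluster the candidates to expose structure (a rough stand-in for the
    #    "internal hierarchy" modeled by the clustering buffer in the paper).
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(candidates)

    # 3. Evaluate one representative per cluster (here: the cluster mean) and
    #    return the highest-scoring one for execution in the environment.
    representatives = [candidates[labels == k].mean(axis=0) for k in range(n_clusters)]
    scores = [q_value(state, a) for a in representatives]
    return representatives[int(np.argmax(scores))]
```

In such a sketch, an agent would call `selective_sample` in place of a raw policy sample during the exploration phase of training, so that the executed action is filtered through the cluster-level evaluation rather than drawn directly from the posterior.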

