CLC number: TP181
On-line Access: 2014-03-05
Received: 2013-09-06
Revision Accepted: 2014-01-14
Crosschecked: 2014-02-21
Cited: 0
Clicked: 7354
Feng-fei Zhao, Zheng Qin, Zhuo Shao, Jun Fang, Bo-yan Ren. Greedy feature replacement for online value function approximation[J]. Journal of Zhejiang University Science C, 2014, 15(3): 223-231.
@article{title="Greedy feature replacement for online value function approximation",
author="Feng-fei Zhao, Zheng Qin, Zhuo Shao, Jun Fang, Bo-yan Ren",
journal="Journal of Zhejiang University Science C",
volume="15",
number="3",
pages="223-231",
year="2014",
publisher="Zhejiang University Press & Springer",
doi="10.1631/jzus.C1300246"
}
%0 Journal Article
%T Greedy feature replacement for online value function approximation
%A Feng-fei Zhao
%A Zheng Qin
%A Zhuo Shao
%A Jun Fang
%A Bo-yan Ren
%J Journal of Zhejiang University SCIENCE C
%V 15
%N 3
%P 223-231
%@ 1869-1951
%D 2014
%I Zhejiang University Press & Springer
%DOI 10.1631/jzus.C1300246
TY - JOUR
T1 - Greedy feature replacement for online value function approximation
A1 - Feng-fei Zhao
A1 - Zheng Qin
A1 - Zhuo Shao
A1 - Jun Fang
A1 - Bo-yan Ren
J0 - Journal of Zhejiang University Science C
VL - 15
IS - 3
SP - 223
EP - 231
%@ 1869-1951
Y1 - 2014
PB - Zhejiang University Press & Springer
ER -
DOI - 10.1631/jzus.C1300246
Abstract: reinforcement learning (RL) in real-world problems requires function approximations that depend on selecting the appropriate feature representations. Representational expansion techniques can make linear approximators represent value functions more effectively; however, most of these techniques function well only for low dimensional problems. In this paper, we present the greedy feature replacement (GFR), a novel online expansion technique, for value-based RL algorithms that use binary features. Given a simple initial representation, the feature representation is expanded incrementally. New feature dependencies are added automatically to the current representation and conjunctive features are used to replace current features greedily. The virtual temporal difference (TD) error is recorded for each conjunctive feature to judge whether the replacement can improve the approximation. Correctness guarantees and computational complexity analysis are provided for GFR. Experimental results in two domains show that GFR achieves much faster learning and has the capability to handle large-scale problems.
[1]Albus, J.S., 1971. A theory of cerebellar function. Math. Biosci., 10(1-2):25-61.
[2]Barto, A.G., Bradtke, S.J., Singh, S.P., 1995. Learning to act using real-time dynamic programming. Artif. Intell., 72(1-2):81-138.
[3]Buro, M., 1999. From simple features to sophisticated evaluation functions. Proc. 1st Int. Conf. on Computers and Games, p.126-145.
[4]de Hauwere, Y.M., Vrancx, P., Nowé, A., 2010. Generalized learning automata for multi-agent reinforcement learning. AI Commun., 23(4):311-324.
[5]Geramifard, A., Doshi, F., Redding, J., et al., 2011. Online discovery of feature dependencies. Proc. 28th Int. Conf. on Machine Learning, p.881-888.
[6]Geramifard, A., Dann, C., How, J.P., 2013. Off-policy learning combined with automatic feature expansion for solving large MDPs. Proc. 1st Multidisciplinary Conf. on Reinforcement Learning and Decision Making, p.29-33.
[7]Kaelbling, L.P., Littman, M.L., Moore, A.W., 1996. Reinforcement learning: a survey. J. Artif. Intell. Res., 4:237-285.
[8]Kolter, J.Z., Ng, A.Y., 2009. Near-Bayesian exploration in polynomial time. Proc. 26th Annual Int. Conf. on Machine Learning, p.513-520.
[9]Lagoudakis, M.G., Parr, R., 2003. Least-squares policy iteration. J. Mach. Learn. Res., 4(6):1107-1149.
[10]Pazis, J., Lagoudakis, M.G., 2009. Binary action search for learning continuous-action control policies. Proc. 26th Annual Int. Conf. on Machine Learning, p.793-800.
[11]Puterman, M.L., 1994. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, NY, p.139-161.
[12]Ratitch, B., Precup, D., 2004. Sparse distributed memories for on-line value-based reinforcement learning. Proc. 15th European Conf. on Machine Learning, p.347-358.
[13]Rummery, G.A., Niranjan, M., 1994. On-line Q-learning Using Connectionist Systems. Technical Report No. cued/ f-infeng/tr166, Engineering Department, Cambridge University.
[14]Singh, S., Jaakkola, T., Littman, M.L., et al., 2000. Convergence results for single-step on-policy reinforcement-learning algorithms. Mach. Learn., 38(3):287-308.
[15]Singh, S.P., Sutton, R.S., 1996. Reinforcement learning with replacing eligibility traces. Mach. Learn., 22(1-3):123-158.
[16]Singh, S.P., Yee, R.C., 1994. An upper bound on the loss from approximate optimal-value functions. Mach. Learn., 16(3):227-233.
[17]Sprague, N., Ballard, D., 2003. Multiple-goal reinforcement learning with modular sarsa(0). Proc. 18th Int. Joint Conf. on Artificial Intelligence, p.1445-1447.
[18]Sturtevant, N.R., White, A.M., 2006. Feature construction for reinforcement learning in hearts. Proc. 5th Int. Conf. on Computers and Games, p.122-134.
[19]Sutton, R.S., 1996. Generalization in reinforcement learning: successful examples using sparse coarse coding. Adv. Neur. Inform. Process. Syst., 8:1038-1044.
[20]Sutton, R.S., Barto, A.G., 1998. Reinforcement Learning: an Introduction. MIT Press, Cambridge, MA, USA, p.3-25.
[21]Tsitsiklis, J.N., 1994. Asynchronous stochastic approximation and Q-learning. Mach. Learn., 16(3):185-202.
[22]Tsitsiklis, J.N., van Roy, B., 1997. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Contr., 42(5):674-690.
[23]Watkins, C.J.C.H., Dayan, P., 1992. Q-learning. Mach. Learn., 8(3-4):279-292.
[24]Whiteson, S., Taylor, M.E., Stone, P., 2007. Adaptive Tile Coding for Value Function Approximation. Technical Report No. AI-TR-07-339, University of Texas at Austin.
Open peer comments: Debate/Discuss/Question/Opinion
<1>