layout: post
title: "Reinforcement Learning 第九周课程笔记"
date: "2015-10-18 02:41:40"
categories: Computer Science
excerpt: "This week Watch Generalization. The readings are Gordon (1995) and Baird..."
auth: conge
This week
- Watch Generalization.
- The readings are Gordon (1995) and Baird (1995).
- When there are a lot of states, we can use features to represent states (e.g., grouping states by some common features, or using some base states to represent other states)
- Inductive bias: refers to the way an algorithm prefers one solution over another.
- The representation can stay unchanged even when the states change (?)
- In RL, the goal is to learn some input-output mapping.
- e.g., a policy maps states -> actions
- a value function maps states/actions to expected returns
- a model maps states/actions to transitions and rewards (T/R)
- Generalization can happen at all of these levels.
- Most researchers are focused on value function approximation.
- Q is represented by a function approximator F(w<sup>a</sup>, f(s)), where w<sup>a</sup> is a set of weights and f(s) takes a state s and generates a set of features.
- An experience tuple <s,a,r,s'> comes in, and we do an update.
- TD error = (reward + discounted value of next state) - (value of current state). The TD error tells us the direction in which to move the parameters w (i.e., whether the prediction is too high, too low, or just right).
- α is the learning rate (a minimal sketch of this update follows below).
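A minimal sketch of this update, not taken from the lecture: `q_value` and `grad_wrt_weights` are hypothetical stand-ins for the approximator F(w<sup>a</sup>, f(s)) and its gradient with respect to w<sup>a</sup>.

```python
# Hypothetical sketch of one TD update for an approximate Q-function.
# q_value(weights, s, a) and grad_wrt_weights(weights, s, a) stand in for
# F(w^a, f(s)) and its gradient with respect to w^a.
def td_update(weights, s, a, r, s_next, actions, q_value, grad_wrt_weights,
              alpha=0.1, gamma=0.9):
    """One update from the experience tuple <s, a, r, s'>."""
    # TD error = (reward + discounted value of next state) - value of current state
    target = r + gamma * max(q_value(weights, s_next, a2) for a2 in actions)
    td_error = target - q_value(weights, s, a)
    # Move the weights for action a in the direction indicated by the TD error
    weights[a] = weights[a] + alpha * td_error * grad_wrt_weights(weights, s, a)
    return weights, td_error
```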
Linear Value Function Approximation
- The Q value for an action is represented by features and a set of weights
- w<sup>a</sup><sub>i</sub> are the weights of the features f<sub>i</sub>(s), representing the importance of each feature (see the linear sketch below)
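A minimal sketch of the linear case, assuming `features(s)` returns a NumPy vector f(s) and `weights[a]` holds w<sup>a</sup> (these names are mine, not the lecture's). Since Q(s,a) is a dot product, the gradient with respect to w<sup>a</sup> is simply f(s).

```python
import numpy as np

def linear_q(weights, s, a, features):
    # Q(s, a) = sum_i w_i^a * f_i(s)
    return float(np.dot(weights[a], features(s)))

def linear_td_update(weights, s, a, r, s_next, actions, features,
                     alpha=0.1, gamma=0.9):
    td_error = (r + gamma * max(linear_q(weights, s_next, a2, features) for a2 in actions)
                - linear_q(weights, s, a, features))
    # For a linear approximator, the gradient w.r.t. w^a is just f(s)
    weights[a] = weights[a] + alpha * td_error * features(s)
    return weights
```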
Success and failure stories
- 3-layer backprop: decision making in Jeopardy
- CMAC (see Wikipedia):
- V(s') = V(7) = 8; V(s) = V(1) = 3; the derivative with respect to the weight is 1; so Δw<sub>0</sub> = 0.1 × (0 + 0.9 × 8 - 3) × 1 = 0.42
- Δw<sub>0</sub> is 0.42, and Δw<sub>i</sub> for i = 0, ..., 6 are all positive, so the weights will increase with each update
- Δw<sub>7</sub> is negative when we update on <7,0,7>, but w<sub>7</sub> will still increase because its original value is >> |Δw<sub>7</sub>|
- So, even though the algorithm is doing the right thing, the updates will never converge (a small sketch of this arithmetic follows below)
- In the case where all w<sub>i</sub> are 0, the temporal differences are all zero, which means no updates (changes) are going to happen; the all-zero solution is "sticky"
- Whenever all values are zero, the algorithm cannot escape that situation unless the reward r is non-zero. If the reward is also zero, the all-zero values are in fact a solution of the MDP.
- This holds when the MDP is deterministic.
- The lesson: with shared weights, the updates can fail to converge!
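A tiny sketch that just reproduces the single update step above (the numbers come from the example; the full shared-weight state setup is not reproduced here):

```python
# Reproduce the single update: r = 0, gamma = 0.9, alpha = 0.1,
# V(s') = V(7) = 8, V(s) = V(1) = 3, and dV/dw0 = 1 for the shared weight w0.
alpha, gamma = 0.1, 0.9
r, v_next, v_current, dv_dw0 = 0, 8, 3, 1

delta_w0 = alpha * (r + gamma * v_next - v_current) * dv_dw0
print(round(delta_w0, 2))  # 0.42: positive, so the shared weight keeps growing
```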
- Problem: given the values at certain (anchor) points, estimate the value function over the line. How do we estimate the points in between?
- Naturally, take a weighted average of the neighbouring anchor points, i.e., a convex combination of the anchor-point values (see the sketch after this list)
- For points that are far from the anchor points, a good estimate is the average of the anchor-point values.
- More anchor points lead to less error in the MDP estimation.
- Using base states, we can represent the original MDP with an MDP over the base states S<sub>b</sub>.
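A minimal one-dimensional sketch of an averager. Inverse-distance weighting is my assumed choice here; the notes only require that the weights form a convex combination (non-negative and summing to 1). The function name and weighting scheme are not from the readings.

```python
import numpy as np

def averager_estimate(x, anchor_points, anchor_values):
    """Estimate V(x) as a convex combination of the anchor-point values."""
    d = np.abs(np.asarray(anchor_points, dtype=float) - x)
    if np.any(d == 0):                 # x coincides with an anchor point
        return float(anchor_values[int(np.argmin(d))])
    w = 1.0 / d                        # closer anchors get larger weights
    w = w / w.sum()                    # non-negative, sums to 1: convex combination
    return float(np.dot(w, anchor_values))

# Far from every anchor the weights become nearly uniform, so the estimate
# approaches the plain average of the anchor values.
print(averager_estimate(2.5, [0.0, 2.0, 4.0], [3.0, 8.0, 5.0]))
```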
What have we learned
- Need for generalization: some problems have zillions of states. Method: apply supervised learning methods.
- Methods: linear function approximation.
- What can be approximated: 1) value functions; 2) policies; 3) models.
- Successes: TD-Gammon, Atari, Fitted Q-Iteration (but there are still cases where these do not work)
- Problem cases: even linear approximation need not work (Baird's counterexample)
- Averagers: can be viewed as an MDP themselves; use anchor points to represent the whole MDP.
- LSPI: Least-Squares Policy Iteration
2015-10-13 first draft, up to quiz 1
2015-10-17 completed
2015-12-04 reviewed and revisited