Three things this week: watch the course videos, read Sutton (1988), and do Homework 3 (HW3).
Below are screenshots from the videos along with my notes:
Temporal Difference Learning

- Read Sutton, read Sutton, read Sutton, because the final project is based on it!

- 1) Model-based
- 2) Model-free
- 3) Policy search
- From 1 --> 3: more direct learning
- From 3 --> 1: more supervised
TD-lambda


- In this case the model is known, so the calculation is easy.

- Remember from the previous lecture: we compute the return from each episode and average over them.

- The rewrite makes the formula look a lot like neural-network learning, and a learning rate alpha is introduced (sketch below).
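
A minimal sketch of both forms under this setup, using made-up returns of my own (the batch average over episodes, and the incremental rewrite with a learning rate):

```python
# Batch form: average the returns observed for one state across episodes.
returns = [2.0, 4.0, 6.0]              # hypothetical per-episode returns for one state
v_batch = sum(returns) / len(returns)  # 4.0

# Incremental rewrite: nudge V toward each new return with learning rate alpha.
# Using alpha_i = 1 / i reproduces the running average exactly.
v = 0.0
for i, g in enumerate(returns, start=1):
    alpha = 1.0 / i
    v = v + alpha * (g - v)

print(v_batch, v)  # 4.0 4.0
```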



- When there are no repeated states, TD(1) is the same as outcome-based updates (i.e., see all the rewards following each state and update toward the actual return).
- When there are repeated states, extra learning happens (see the sketch below).
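
For concreteness, a sketch of the standard tabular TD(λ) update with accumulating eligibility traces, which gives TD(1) when λ = 1; the data layout and names are my own, not from the lecture:

```python
def td_lambda_episode(V, transitions, alpha, lam, gamma=1.0):
    """One pass of tabular TD(lambda) with accumulating eligibility traces.
    `transitions` is a list of (s, r, s_next) triples; V must contain every
    state that appears, with the terminal state's value fixed at 0."""
    e = {s: 0.0 for s in V}                   # eligibility traces
    for s, r, s_next in transitions:
        delta = r + gamma * V[s_next] - V[s]  # one-step TD error
        e[s] += 1.0                           # a repeated state accumulates trace
        for x in V:
            V[x] += alpha * delta * e[x]      # every eligible state gets updated
            e[x] *= gamma * lam               # traces decay by gamma * lambda
    return V
```

With λ = 1 and γ = 1 the traces never decay, so a state visited twice in one episode picks up updates from every later TD error, which is the "extra learning" mentioned above.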

- Under the TD(1) rule, V(s<sub>2</sub>) is estimated by averaging over episodes. We only see s<sub>2</sub> once and its return is 12, so V(s<sub>2</sub>) = 12.
- Under maximum-likelihood estimation, we instead learn the transition model from the data (as sketched below). E.g., in the first 5 episodes we saw s<sub>3</sub> -> s<sub>4</sub> 3 times and s<sub>3</sub> -> s<sub>5</sub> 2 times, so the transition probabilities extracted from the data are 0.6 and 0.4 respectively.
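
A quick sketch of the maximum-likelihood transition estimate from counts (the function name is mine; the state names follow the example above):

```python
from collections import Counter, defaultdict

def ml_transitions(episodes):
    """Estimate P(s' | s) by counting the transitions observed in the data."""
    counts = defaultdict(Counter)
    for episode in episodes:
        for s, s_next in zip(episode, episode[1:]):
            counts[s][s_next] += 1
    return {s: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
            for s, nxt in counts.items()}

# Matches the example above: 3 of 5 episodes go s3 -> s4, the other 2 go s3 -> s5.
episodes = [["s3", "s4"]] * 3 + [["s3", "s5"]] * 2
print(ml_transitions(episodes))  # {'s3': {'s4': 0.6, 's5': 0.4}}
```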

- First of all, if we have infinite data, TD(1) will also do the right thing.
- When we have finite data, we can in effect resample that data infinitely often to recover the maximum-likelihood estimate. This is what TD(0) does (one-step update sketched below).
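
For reference, a minimal sketch of the one-step TD(0) update (variable names and the toy transition are mine):

```python
def td0_update(V, s, r, s_next, alpha, gamma=1.0):
    """One-step TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])
    return V

# Hypothetical transition s2 -> s3 with reward 0, just to show the update direction.
V = {"s2": 0.0, "s3": 1.0}
td0_update(V, "s2", r=0.0, s_next="s3", alpha=0.1)
print(V["s2"])  # 0.1
```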

K-Step Estimators

- E<sub>1</sub> is the one-step estimator (one-step lookahead), i.e., TD(0).
- E<sub>2</sub> is the two-step estimator, and E<sub>k</sub> is the k-step lookahead (sketched below).
- As k goes to infinity, we get TD(1).
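
A sketch of the k-step estimator E<sub>k</sub> under the usual definition (k observed rewards, then bootstrap on the current value estimate); the function name and indexing convention are my own:

```python
def k_step_estimator(rewards, values, t, k, gamma=1.0):
    """E_k at time t: use k observed rewards, then bootstrap on the current
    value estimate V(s_{t+k}). rewards[t + i] is the reward received i+1
    steps after time t; values[j] is the current estimate of V(s_j)."""
    g = sum(gamma ** i * rewards[t + i] for i in range(k))
    return g + gamma ** k * values[t + k]
```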

TD(λ) can be seen as a weighted combination of the k-step estimators, where the weight on E<sub>k</sub> is λ<sup>k-1</sup>(1-λ), so the weights sum to 1. See the sketch below.
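A small self-contained sketch of the λ-return as this weighted combination, truncated at the end of a finite episode (all names and the truncation handling are mine):

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """lambda-return at time t as a weighted combination of k-step estimators.
    E_k gets weight (1 - lam) * lam**(k - 1); in a finite episode the full
    (Monte Carlo) return absorbs the leftover weight lam**(T - t - 1),
    so the weights sum to 1."""
    T = len(rewards)                          # episode ends after reward T - 1
    g_lam = 0.0
    for k in range(1, T - t):                 # bootstrapped estimators E_1 .. E_{T-t-1}
        e_k = sum(gamma ** i * rewards[t + i] for i in range(k)) \
              + gamma ** k * values[t + k]
        g_lam += (1 - lam) * lam ** (k - 1) * e_k
    full_return = sum(gamma ** i * rewards[t + i] for i in range(T - t))
    return g_lam + lam ** (T - t - 1) * full_return
```

With λ = 0 this reduces to the one-step estimator E<sub>1</sub> (TD(0)); with λ = 1 all the weight falls on the full return (TD(1)).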

The best-performing λ is typically not TD(0), but some λ in between 0 and 1.

2015-09-05 First draft
2015-12-03 Reviewed and revised up to the "Connecting TD(0) and TD(1)" slides