layout: post title: "Reinforcement Learning Week 5 Course Notes" date: "2015-09-18 11:05:48" categories: Computer Science excerpt: "Three things this week: watch the Convergence lecture videos, read Littman and Szepesvari (1996), and do Homework 4. no action..."
Three things this week: watch the lecture videos on Convergence, read Littman and Szepesvari (1996), and do Homework 4.
The Q function estimates the value of the current state; taking an action yields a reward and lands us in a new state, giving the transition < s<sub>t-1</sub>, a<sub>t-1</sub>, r<sub>t</sub>, s<sub>t</sub> >. The Q-update rule combines two approximations: 1) if the model is known, it can be used to update Q; 2) if Q* is known, it can also be used to update Q. Both converge.
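To make the update concrete, here is a minimal tabular Q-learning sketch in Python; the table shape, `alpha`, and `gamma` values are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def q_update(Q, s_prev, a_prev, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step for the transition <s_prev, a_prev, r, s_next>."""
    target = r + gamma * np.max(Q[s_next])             # one-step lookahead
    Q[s_prev, a_prev] += alpha * (target - Q[s_prev, a_prev])
    return Q

# Usage: Q is an |S| x |A| table of action-value estimates (sizes made up).
Q = np.zeros((5, 2))
Q = q_update(Q, s_prev=0, a_prev=1, r=1.0, s_next=3)
```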
Since properties ① and ② hold, each step applies the operator B, so that BF<sub>t-1</sub> = F<sub>t</sub>, and the sequence F<sub>t</sub> converges to F* through value iteration.
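A small value-iteration sketch, assuming a tabular MDP with a transition tensor `P[a, s, s']` and reward matrix `R[s, a]` (made-up names), shows F<sub>t</sub> = BF<sub>t-1</sub> being applied until the values stop moving:

```python
import numpy as np

def bellman_operator(V, P, R, gamma=0.9):
    # Q(s,a) = R[s,a] + gamma * sum_s' P[a,s,s'] * V[s']; then take the max over a.
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    return Q.max(axis=1)

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    V = np.zeros(R.shape[0])
    while True:
        V_next = bellman_operator(V, P, R, gamma)   # F_t = B(F_{t-1})
        if np.max(np.abs(V_next - V)) < tol:        # contraction => this gap keeps shrinking
            return V_next
        V = V_next
```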
If there were two fixed points, G and F, applying the B operator to them would not change the distance between them (both are fixed), and this violates the contraction property of B. So the fixed point is unique.
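Written out with the contraction property in the usual sup-norm form, the argument is one line:

```latex
% If F and G were both fixed points of the contraction B (with factor gamma < 1):
\|F - G\|_\infty \;=\; \|BF - BG\|_\infty \;\le\; \gamma \,\|F - G\|_\infty
% Since gamma < 1, this forces ||F - G|| = 0, i.e. F = G: the fixed point is unique.
```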
These three properties must hold for Q<sub>t</sub> to converge to Q*, and they do hold. So Q-learning converges.
B<sub>t</sub> is the operator used to update Q at time step t. Q(s,a) is the Q value of the state we just left, and w is the Q function used to evaluate the state we just arrived in. In the regular Q-learning update, Q and w are the same function; the theorem separates them.
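A plausible way to write that update with the two Q-functions separated (a sketch of the standard Q-learning form; the lecture's exact symbols may differ):

```latex
% Update applied to the transition <s, a, r, s'> at step t:
[B_t(Q, w)](s,a) \;=\; (1 - \alpha_t)\, Q(s,a) \;+\; \alpha_t \Big( r + \gamma \max_{a'} w(s', a') \Big)
% Regular Q-learning is the special case w = Q.
```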
So rule number one says that if we know Q* and use the updating rule (in pink) to update the Q function, the expected value of the one-step lookahead (w = Q*) can be calculated, and the stochasticity is averaged out.
Rule number two: if we hold Q(s,a) fixed and only vary how the one-step lookahead is calculated, the distance between Q and Q* can only shrink with each update.
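Under the operator form assumed above, holding Q fixed and changing only the lookahead function w gives a short derivation of that property:

```latex
\big|[B_t(Q, w_1)](s,a) - [B_t(Q, w_2)](s,a)\big|
  \;=\; \alpha_t \gamma \,\Big| \max_{a'} w_1(s',a') - \max_{a'} w_2(s',a') \Big|
  \;\le\; \alpha_t \gamma \, \max_{s',a'} \big| w_1(s',a') - w_2(s',a') \big|
% A single update can therefore never expand differences coming from the lookahead.
```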
The third condition is the learning-rate condition, which is needed for the stochastic updates to converge to the fixed point of the Bellman equation.
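The standard form of that condition (satisfied, for example, by α<sub>t</sub> = 1/t):

```latex
\sum_{t} \alpha_t = \infty, \qquad \sum_{t} \alpha_t^2 < \infty
```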
A generalized MDP can be seen as redefining the fixed point. "Contraction" might be something like "收敛" (convergence) in Chinese. Q-learning converges to Q*. The Generalized Convergence Theorem uses two Q-functions to prove convergence of the update to the fixed point of the Bellman equation.