

layout: post
title: "Reinforcement Learning Week 5 Course Notes"
date: "2015-09-18 11:05:48"
categories: Computer Science
excerpt: "Three things this week: watch the course videos on Convergence, read Littman and Szepesvari (1996), and do Homework 4. no action..."

auth: conge

Three things this week: watch the course videos on Convergence, read Littman and Szepesvari (1996), and do Homework 4.

Learning without control: estimating the value of states from a stream of transitions, with no actions to choose (prediction only).

Learning with control/actions: the agent chooses actions and learns action values, which is the Q-learning setting (both update rules are sketched below).
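For reference, the two update rules being contrasted can be written as follows (a sketch in my own notation, not copied from the course slides): TD(0) value prediction when there is no control, and the Q-learning update when actions are chosen.

```latex
% Learning without control: TD(0) update of the state-value estimate
V(s_{t-1}) \leftarrow V(s_{t-1}) + \alpha_t \bigl( r_t + \gamma V(s_t) - V(s_{t-1}) \bigr)

% Learning with control/actions: Q-learning update of the action-value estimate
Q(s_{t-1}, a_{t-1}) \leftarrow Q(s_{t-1}, a_{t-1})
  + \alpha_t \bigl( r_t + \gamma \max_{a'} Q(s_t, a') - Q(s_{t-1}, a_{t-1}) \bigr)
```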

The Q function estimates value: from the current state we take an action, get a reward, and end up in a new state, giving the experience tuple < s<sub>t-1</sub>, a<sub>t-1</sub>, r<sub>t</sub>, s<sub>t</sub> >. The Q update rule takes care of two approximations at once: 1) if the model were known, it could be used to update Q; 2) if Q* were known, it could also be used to update Q.

And both will converge.
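A minimal Python sketch of that sampled update (names like `q_learning_update` and the dict-backed Q table are my own choices for illustration, not from the course):

```python
from collections import defaultdict

def q_learning_update(Q, s_prev, a_prev, r, s, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update for the experience tuple <s_prev, a_prev, r, s>.

    The sampled transition stands in for the unknown model (approximation 1),
    and the current Q values at the new state stand in for the unknown Q*
    (approximation 2).
    """
    best_next = max(Q[(s, a)] for a in actions)   # one-step look-ahead
    td_target = r + gamma * best_next             # bootstrapped estimate of Q*(s_prev, a_prev)
    Q[(s_prev, a_prev)] += alpha * (td_target - Q[(s_prev, a_prev)])


# Example usage with a dict-backed Q table (all values start at 0.0).
Q = defaultdict(float)
q_learning_update(Q, s_prev="s0", a_prev="left", r=1.0, s="s1", actions=["left", "right"])
print(Q[("s0", "left")])  # 0.1 after a single update
```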

Quiz 1: Bellman Operator

Contraction mapping definition: B is a contraction mapping if applying the B operator makes the distance between two functions smaller than the distance between the original functions.
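In symbols (a standard statement of the definition, using the max-norm as the distance):

```latex
B \text{ is a contraction mapping if, for all } F, G \text{ and some } 0 \le \gamma < 1:
\qquad \lVert BF - BG \rVert_{\infty} \;\le\; \gamma \, \lVert F - G \rVert_{\infty},
\qquad \text{where } \lVert Q \rVert_{\infty} = \max_{s,a} \lvert Q(s,a) \rvert .
```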

Quiz 2:

Contraction Properties
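A sketch of why these properties matter (the fixed-point argument, in my notation): a contraction has a unique fixed point, and repeatedly applying the operator moves any Q toward it.

```latex
B Q^{*} = Q^{*} \text{ (unique fixed point)}, \qquad
\lVert B Q_{t} - Q^{*} \rVert_{\infty}
  = \lVert B Q_{t} - B Q^{*} \rVert_{\infty}
  \le \gamma \, \lVert Q_{t} - Q^{*} \rVert_{\infty}
  \;\Longrightarrow\; Q_{t} \to Q^{*}.
```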

Quiz 3

Statement of the theorem

Three properties need to hold for Q<sub>t</sub> to converge to Q*, and they do hold.

So, Q-learning converges.

B<sub>t</sub> is the operator we use to update Q at time step t. Q(s,a) is the Q-function value of the state we just left, and w is the Q function used to value the state we just arrived in. In the regular Q-learning update, Q and w are the same function; in the theorem they are kept separate.

So, condition one says that if we know Q* and plug it in as w in the update rule (in pink), the expected value of the one-step look-ahead can be computed exactly and the stochasticity of individual transitions is averaged out.
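Written out under that reading (with w set to Q*, the expectation over the sampled transition reduces to the Bellman optimality equation, so the update is unbiased):

```latex
\mathbb{E}\!\left[ r_{t} + \gamma \max_{a'} Q^{*}(s_{t}, a') \,\middle|\, s_{t-1}, a_{t-1} \right]
  = R(s_{t-1}, a_{t-1}) + \gamma \sum_{s'} T(s_{t-1}, a_{t-1}, s') \max_{a'} Q^{*}(s', a')
  = Q^{*}(s_{t-1}, a_{t-1}).
```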

The second condition: if we hold Q(s,a) fixed and only vary the way we compute the one-step look-ahead, the distance between Q and Q* can only get smaller with each update; in other words, the operator is a contraction in w.
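For the standard Q-learning operator this works out roughly as follows (a sketch in my notation; only the visited state-action pair is updated):

```latex
[B_{t}(Q, w)](s_{t-1}, a_{t-1})
  = (1 - \alpha_{t}) \, Q(s_{t-1}, a_{t-1})
  + \alpha_{t} \bigl( r_{t} + \gamma \max_{a'} w(s_{t}, a') \bigr),

\text{so for fixed } Q:\quad
\bigl| [B_{t}(Q, w_{1}) - B_{t}(Q, w_{2})](s_{t-1}, a_{t-1}) \bigr|
  = \alpha_{t} \gamma \, \bigl| \max_{a'} w_{1}(s_{t}, a') - \max_{a'} w_{2}(s_{t}, a') \bigr|
  \le \alpha_{t} \gamma \, \lVert w_{1} - w_{2} \rVert_{\infty}.
```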

The third condition is the learning-rate condition, which is needed for the stochastic Bellman-equation-based updates to converge.
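These are the usual step-size conditions for this kind of stochastic update:

```latex
0 \le \alpha_{t} \le 1, \qquad \sum_{t} \alpha_{t} = \infty, \qquad \sum_{t} \alpha_{t}^{2} < \infty .
```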

Quiz 4: different ways of summarizing over next states and actions give different kinds of "MDPs" (sketched after this list):

1. Decision making on the expected value over next states, taking the best next action (the regular MDP).
2. The environment puts you in the worst possible next state, and you then choose the best action for that state (risk-averse).
3. The next state is summarized by its expected value, but the action is chosen by its rank rather than always the best one (exploration-sensitive; min, max, or something mediocre in between).
4. An opponent also acts and tries to minimize your value (zero-sum game).
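One way to see the four cases is as different summary operators plugged into a generalized Bellman equation (a sketch; the notation in Littman and Szepesvari (1996) differs, and the zero-sum case is shown in a simplified pure-strategy form):

```latex
\text{1. Regular MDP: } Q(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q(s',a') \\
\text{2. Risk-averse: } Q(s,a) = R(s,a) + \gamma \min_{s' : T(s,a,s') > 0} \max_{a'} Q(s',a') \\
\text{3. Exploration-sensitive: expectation over } s' \text{, with the action summarized by an operator between } \min_{a'} \text{ and } \max_{a'} \\
\text{4. Zero-sum game: } Q(s,a,o) = R(s,a,o) + \gamma \sum_{s'} T(s,a,o,s') \max_{a'} \min_{o'} Q(s',a',o')
```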

Recap

A generalized MDP can be seen as redefining the fixed point. "Contraction" might be something like "收敛" (convergence) in Chinese. Q-learning converges to Q*. The generalized convergence theorem uses two Q-functions to prove that the Bellman-equation-based updates converge.