layout: post
title: "Reinforcement Learning Week 10 Course Notes"
date: "2015-10-31 04:57:13"
categories: Computer Science
excerpt: "This week watch POMDPs. The reading is Littman (2009). POMDP POMDPs g..."
This week covers POMDPs. The reading is Littman (2009).
Solution:
Using belief states, a POMDP can be turned into a belief MDP <b, a, z, b'>:
b'(s') = Pr(s'|b, a, z)
       = Pr(z|b, a, s') Pr(s'|b, a) / Pr(z|b, a)
       = Pr(z|b, a, s') Σ<sub>s</sub> Pr(s'|s, b, a) Pr(s|b, a) / Pr(z|b, a)
       = O(s', z) Σ<sub>s</sub> T(s, a, s') b(s) / Pr(z|b, a)
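A minimal sketch of this update in code (not from the lecture): it assumes tabular arrays `T[s, a, s'] = Pr(s'|s, a)` and `O[s', z] = Pr(z|s')`, which are hypothetical names chosen here for illustration.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """One belief update: b'(s') = O(s', z) * sum_s T(s, a, s') b(s) / Pr(z|b, a).

    b: (S,)      current belief over states
    T: (S, A, S) transition probabilities T[s, a, s'] = Pr(s'|s, a)
    O: (S, Z)    observation probabilities O[s', z] = Pr(z|s')
    """
    # Predicted next-state distribution: sum_s T(s, a, s') b(s)
    predicted = b @ T[:, a, :]
    # Weight each next state by the likelihood of the observation z
    unnormalized = O[:, z] * predicted
    # Normalize by Pr(z | b, a)
    return unnormalized / unnormalized.sum()

# Tiny 2-state, 1-action example with made-up numbers, just to show the shapes
T = np.zeros((2, 1, 2))
T[:, 0, :] = [[0.9, 0.1],
              [0.2, 0.8]]
O = np.array([[0.7, 0.3],
              [0.1, 0.9]])
b = np.array([0.5, 0.5])
print(belief_update(b, a=0, z=1, T=T, O=O))
```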
Note: the belief MDP has an infinite number of belief states, which makes VI, LP, and PI inapplicable, because they can only handle a finite number of states.
In the figure:
X = p * (0.5 * X) + (1 - p) * (0.5 * Y)  =>  X = 0.5 * (1 - p) * Y / (1 - 0.5 * p)
Y = p * (0.5 * X) + (1 - p)              =>  X = 2 * (Y - 1 + p) / p
Z = p * 1 + (1 - p) * (0.5 * Z)          =>  Z = 2p / (1 + p)
V = 1/3 * (X + Y + Z)
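As a numerical check, the coupled equations for X and Y can be solved as a 2x2 linear system and Z in closed form. This is only a sketch of the arithmetic above; the value of p used in the example call is hypothetical, since the notes leave p unspecified.

```python
import numpy as np

def values(p):
    """Solve X = p*(0.5X) + (1-p)*(0.5Y), Y = p*(0.5X) + (1-p),
    Z = p + (1-p)*(0.5Z), and return X, Y, Z and V = (X + Y + Z) / 3."""
    # Rearranged as the linear system A [X, Y]^T = c
    A = np.array([[1 - 0.5 * p, -0.5 * (1 - p)],
                  [-0.5 * p,     1.0          ]])
    c = np.array([0.0, 1 - p])
    X, Y = np.linalg.solve(A, c)
    Z = 2 * p / (1 + p)              # from Z = p + 0.5 * (1 - p) * Z
    return X, Y, Z, (X + Y + Z) / 3

print(values(0.5))  # hypothetical p, just to exercise the formulas
```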
Why go to PSR?
2015-10-23 First draft
2015-11-03 Completed
2015-12-04 Reviewed