layout: post
title: "Reinforcement Learning Week 7 Course Notes"
date: "2015-09-30 02:54:06"
categories: Computer Science
excerpt: "This week's tasks: watch Reward Shaping; read Ng, Harada, Russell (1999) ..."
Given an MDP, the reward function can affect the behavior of the learner/agent, so it ultimately specifies the behavior (or policy) we want for the MDP. Changing the rewards can therefore make the MDP easier to solve and represent.
Given an MDP described by <S, A, R, T, γ>, there are three ways to change R without changing the optimal policy: multiply R by a positive constant, add a scalar constant to R, or apply potential-based reward shaping (sketched in code below). (Note: if we know T, it is no longer an RL problem, so this part of the lecture is about MDPs, not RL specifically.)
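As a quick reference, here is a minimal sketch of those three transformations; the function names, the constants, and the potential ψ are illustrative assumptions, not from the lecture.

```python
GAMMA = 0.9  # discount factor, an assumed value for illustration

def scale_reward(r, a=2.0):
    """Multiply every reward by a positive constant a > 0."""
    return a * r

def shift_reward(r, c=1.0):
    """Add a scalar constant c to every reward."""
    return r + c

def shape_reward(r, s, s_next, psi):
    """Potential-based shaping: R'(s, a, s') = R(s, a, s') - psi(s) + gamma * psi(s')."""
    return r - psi[s] + GAMMA * psi[s_next]

# e.g. shape_reward(1.0, s=0, s_next=1, psi={0: 3.0, 1: -1.0})
```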
Here is why adding a scalar constant c to the reward does not change the optimal policy:
- Q = R + γR + γ<sup>2</sup>R + ... + γ<sup>∞</sup>R
- Q' = R' + γR' + γ<sup>2</sup>R' + ... + γ<sup>∞</sup>R'
- Replace R' with R + c: Q' = (R+c) + γ(R+c) + γ<sup>2</sup>(R+c) + ... + γ<sup>∞</sup>(R+c) = (R + γR + γ<sup>2</sup>R + ... + γ<sup>∞</sup>R) + (c + γc + γ<sup>2</sup>c + ... + γ<sup>∞</sup>c)
- The first part is Q and the second part is a geometric series. So, Q' = Q + c/(1-γ) (a quick numerical check is sketched below).
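To make the c/(1-γ) result concrete, here is a numerical sanity check on a tiny made-up 2-state, 2-action MDP (the transition tensor, rewards, and constant C below are illustrative assumptions): Q-value iteration on R and on R + C should produce Q functions that differ by C/(1-γ) everywhere and share the same greedy policy.

```python
import numpy as np

GAMMA = 0.9
C = 5.0  # constant added to every reward

# T[s, a, s'] = transition probability, R[s, a] = expected immediate reward
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

def q_iteration(R, T, gamma=GAMMA, iters=2000):
    """Compute Q*(s, a) by repeatedly applying the Bellman optimality operator."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        V = Q.max(axis=1)              # V(s') = max_a' Q(s', a')
        Q = R + gamma * T.dot(V)       # Q(s, a) = R(s, a) + gamma * sum_s' T(s, a, s') V(s')
    return Q

Q  = q_iteration(R, T)
Q2 = q_iteration(R + C, T)             # every reward shifted by C

print(np.allclose(Q2, Q + C / (1 - GAMMA)))             # True: Q' = Q + c/(1-γ)
print((Q.argmax(axis=1) == Q2.argmax(axis=1)).all())    # True: greedy policy unchanged
```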
- Q = R + γR + γ<sup>2</sup>R + ... + γ<sup>∞</sup>R
- Q' = R' + γR' + γ<sup>2</sup>R' + ... + γ<sup>∞</sup>R'
- Replace R' with R - ψ(s) + γψ(s'): Q' = (R - ψ(s) + γψ(s')) + γ(R - ψ(s') + γψ(s'')) + γ<sup>2</sup>(R - ψ(s'') + γψ(s''')) + ... + γ<sup>∞</sup>(R - ψ(s<sup>∞</sup>) + γψ(s'<sup>∞</sup>)) = (R + γR + γ<sup>2</sup>R + ... + γ<sup>∞</sup>R) + (-ψ(s) + γψ(s') + γ(-ψ(s') + γψ(s'')) + γ<sup>2</sup>(-ψ(s'') + γψ(s''')) + ... + γ<sup>∞</sup>(-ψ(s<sup>∞</sup>) + γψ(s'<sup>∞</sup>)))
- The first part is Q. In the second part, most of the terms cancel each other out, leaving only the very first and the very last: Q' = Q + (-ψ(s) + γ<sup>∞</sup>ψ(s'<sup>∞</sup>))
- Since γ is in (0,1), γ<sup>∞</sup> = 0, so Q' = Q - ψ(s) (a numerical check is sketched below).
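The same kind of numerical check works for potential-based shaping (again, the toy MDP and the potential ψ are made-up numbers, and q_iteration is the same sketch as above): solving the MDP with the shaped reward should give Q' = Q - ψ(s) and leave the greedy policy untouched.

```python
import numpy as np

GAMMA = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
psi = np.array([3.0, -1.0])            # arbitrary potential ψ(s)

def q_iteration(R, T, gamma=GAMMA, iters=2000):
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * T.dot(Q.max(axis=1))
    return Q

# Expected shaped reward: R'(s, a) = R(s, a) - ψ(s) + γ * E[ψ(s') | s, a]
R_shaped = R - psi[:, None] + GAMMA * T.dot(psi)

Q  = q_iteration(R, T)
Qp = q_iteration(R_shaped, T)

print(np.allclose(Qp, Q - psi[:, None]))                # True: Q' = Q - ψ(s)
print((Q.argmax(axis=1) == Qp.argmax(axis=1)).all())    # True: greedy policy unchanged
```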
Updating the Q function with potential-based reward shaping means running the usual Q-learning update on the shaped reward, i.e. Q(s,a) ← Q(s,a) + α[(r - ψ(s) + γψ(s')) + γ max<sub>a'</sub>Q(s',a') - Q(s,a)].
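A minimal sketch of what that update might look like in code (the tabular representation, the indexing of states/actions by integers, and the parameter values are assumptions, not from the notes):

```python
import numpy as np

def shaped_q_update(Q, s, a, r, s_next, psi, alpha=0.1, gamma=0.9):
    """One Q-learning step on the shaped reward r' = r - psi(s) + gamma * psi(s')."""
    r_shaped = r - psi[s] + gamma * psi[s_next]
    td_target = r_shaped + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Example: a 2-state, 2-action table and an arbitrary potential.
Q = np.zeros((2, 2))
psi = np.array([3.0, -1.0])
Q = shaped_q_update(Q, s=0, a=1, r=1.0, s_next=1, psi=psi)
```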
2015-09-29 first draft
2015-12-04 reviewed and revised