This week
- Should watch CCC.
- The readings are: Ziebart et al. (2008); Babes et al. (2011); Griffith et al. (2013); Cederborg et al. (2015); Roberts (2006); Bhat (2007).



- Dec-POMDP combines perspectives from game theory and from MDPs.
- Multiple agents work toward a common, shared reward. (If each agent has its own separate reward, the model is instead a POSG, a partially observable stochastic game.)


- Example: two agents each know roughly where they are but don't know the other's position; when the two end up in the same room, they win.
- Strategy: go to a shared room. But my knowledge of my own current position could be wrong (the world is partially observable).

- Agent 1 wants to set up some kind of reward function that gets another agent to do something (e.g. get the apple for me).
Inverse Reinforcement Learning

- Inverse Reinforcement Learning: the agent observes the environment and a set of demonstrated behaviors, and then infers a reward function that explains those inputs.
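
To make the idea concrete, here is a minimal sketch on a made-up 3-state chain MDP (this is not the method from the readings; Ziebart et al. use maximum-entropy IRL): sample candidate reward vectors and keep the ones under which the demonstrated policy comes out optimal.

```python
import numpy as np

def greedy_policy(P, R, gamma=0.95, iters=200):
    """Value iteration on a tabular MDP; P[a][s, s'] = transition prob, R[s] = reward."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = np.array([R + gamma * P[a] @ V for a in range(len(P))]).max(axis=0)
    Q = np.array([R + gamma * P[a] @ V for a in range(len(P))])
    return Q.argmax(axis=0)  # greedy action in each state

# Made-up 3-state chain: action 0 = left, action 1 = right (state 2 acts as the "goal").
P = [np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0]], float),   # left
     np.array([[0, 1, 0], [0, 0, 1], [0, 0, 1]], float)]   # right
demo_policy = np.array([1, 1, 1])  # the demonstrations always move right

# Naive IRL: sample candidate rewards, keep those under which the demo is optimal.
rng = np.random.default_rng(0)
consistent = [R for R in rng.uniform(-1, 1, size=(500, 3))
              if np.array_equal(greedy_policy(P, R), demo_policy)]
print(f"{len(consistent)} of 500 sampled reward functions explain the demonstration")
if consistent:
    print("one of them:", consistent[0])
```

Real IRL methods (max-margin, Bayesian, or max-entropy IRL) add criteria for picking one reward out of the many that are consistent with the demonstrations.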



Policy Shaping

- If a human is giving feedback (commentary) about whether the agent's action is good or bad, s/he is doing policy shaping.
- Policy shaping could be realized via reward shaping, which replaces the reward of an action with a new reward(?)
- The agent needs a mechanism to learn from both the environment and the commentary when deciding which policy to follow (not just listening to the commentary, because the commentary might not always be right).

- If the human is always correct, then given the feedback, what is the probability that each action (x, y, or z) is optimal?
- Answers are in the slides above.

- What if the human is right only with probability 0.8?
- Counting method:
- Saying x is optimal is the same as saying y and z are not optimal.
- Since the human is correct with probability 0.8, the (unnormalized) likelihoods of x, y, z being optimal are 0.8, 0.2, 0.2.
- Normalizing these numbers gives 2/3, 1/6, 1/6.
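
A quick check of this arithmetic, using just the quiz numbers above:

```python
C = 0.8                                            # probability the human's label is correct
unnormalized = {"x": C, "y": 1 - C, "z": 1 - C}    # 0.8, 0.2, 0.2
total = sum(unnormalized.values())
print({a: v / total for a, v in unnormalized.items()})
# {'x': 0.666..., 'y': 0.166..., 'z': 0.166...}  i.e. 2/3, 1/6, 1/6
```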

- Δ<sub>a</sub> comes from the feedback data for action a (d<sub>a</sub>): the number of "good" labels minus the number of "bad" labels the human gave for a. C is the probability that the person giving the commentary is correct.
- The formula above gives the probability that action a is optimal; in Griffith et al. (2013) it is p(a is optimal | d<sub>a</sub>) = C<sup>Δ<sub>a</sub></sup> / (C<sup>Δ<sub>a</sub></sup> + (1−C)<sup>Δ<sub>a</sub></sup>) (sketched in code below).
- Note: the final probability needs to be normalized against the probabilities of the other actions.
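
A small sketch of that calculation, assuming (as in Griffith et al. 2013) that Δ<sub>a</sub> is the count of "good" labels minus "bad" labels for action a; the feedback counts below are made up:

```python
def p_optimal(deltas, C):
    """deltas[a] = (# "good" labels) - (# "bad" labels) for action a; C = P(human is correct).
    Returns the normalized probability that each action is the optimal one."""
    unnorm = {a: C**d / (C**d + (1 - C)**d) for a, d in deltas.items()}
    total = sum(unnorm.values())
    return {a: v / total for a, v in unnorm.items()}

# Made-up feedback counts: x got 3 more "good" than "bad" labels, y one more "bad", etc.
print(p_optimal({"x": 3, "y": -1, "z": 0}, C=0.8))
```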

- In the policy shaping case, information comes from multiple sources.
- E.g. π<sub>a</sub> and π<sub>H</sub> are the policy information from the agent exploring the world and from the human giving feedback, respectively.
- Some algorithms decrease the importance of π<sub>H</sub> as time goes on. Note, though, that π<sub>H</sub> already incorporates the human's uncertainty (C).
- The way to combine the two sources is to compute the probability that the two policies agree on an action: a<sub>opt</sub> = argmax<sub>a</sub> p(a|π<sub>1</sub>) · p(a|π<sub>2</sub>) (see the sketch after this list).
- In the quiz, x<sub>opt</sub> = 1/15, y<sub>opt</sub> = 1/60, z<sub>opt</sub> = 2/15, so we should choose z as optimal.
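
A minimal sketch of this combination rule; the two distributions below are made-up placeholders, not the quiz's actual values:

```python
p_agent = {"x": 0.5, "y": 0.2, "z": 0.3}   # p(a | pi_a), from the agent's own exploration
p_human = {"x": 0.2, "y": 0.1, "z": 0.7}   # p(a | pi_H), from the human's feedback

# Probability that both sources pick the same action a; take the argmax.
combined = {a: p_agent[a] * p_human[a] for a in p_agent}
print(combined, "->", max(combined, key=combined.get))   # z wins here: 0.3 * 0.7 = 0.21
```
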
Drama Management

- Ways a human can communicate intent to an agent:
- Demonstration: show the agent the correct actions (inverse RL).
- Reward shaping: give rewards for the agent's actions.
- Policy shaping: give commentary on the agent's actions.
- Drama management: the author conveys his intent to the agent so the agent can guide the story/player experience on the author's behalf.

- A story can be defined as a trajectory through plot points.

- Above is a mapping of standard MDP elements to trajectory-MDP elements.
- Problems:
- The number of possible state sequences is enormous (hyper-exponential).
- Since an MDP maximizes reward, treating the story as a plain MDP would only make the author happy and force the player to experience the author's preferred story.

- p(t'|a,t) is the probability that a player at trajectory t, after action a is taken, ends up at trajectory t'. P(T) is a target distribution over complete trajectories.
- The action here is not the player's action but the story (drama manager) action.
- The optimal policy is the one whose induced distribution over trajectories matches the target distribution P(T) (see the sketch below).
- The computation time is linear in the length of the story.
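
A rough sketch of the per-node computation this implies, for a single trajectory node: given p(t'|a,t) for each drama-manager action and the target probabilities of the child trajectories, solve for a distribution over actions that reproduces the target. Least squares plus clip/renormalize is one simple way to do the matching; all the numbers below are made up. Doing this once per node along the story is what makes the overall computation linear in the story length.

```python
import numpy as np

# Rows = next trajectories t', columns = drama-manager actions a.
# M[i, j] = p(t'_i | a_j, t) at the current trajectory node (made-up numbers).
M = np.array([[0.7, 0.1, 0.3],
              [0.2, 0.8, 0.3],
              [0.1, 0.1, 0.4]])
target = np.array([0.5, 0.3, 0.2])   # desired P(t' | t), derived from the target P(T)

# Solve M @ pi ~= target for a distribution pi over actions,
# then clip/renormalize so it stays a valid probability distribution.
pi, *_ = np.linalg.lstsq(M, target, rcond=None)
pi = np.clip(pi, 0.0, None)
pi /= pi.sum()
print("action distribution at this node:", pi)
print("induced next-trajectory distribution:", M @ pi)
```
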
What have we learned

2015-11-18 First draft completed