

---
layout: post
title: "RL Reading Notes: Littman (1996) Chapters 1-2"
date: "2015-09-01 10:16:07"
categories: Computer Science
excerpt: "Littman, 1996, chapter 1,2 Chapter 1: Introduction 1.1 Core concepts of ..."
auth: conge
---

Littman, 1996, Chapters 1 and 2

Chapter 1: Introduction

1.1 Core Concepts of Sequential Decision Making

Purpose: to produce policies that maximize a measure of long-term reward for an agent following them in a specific environment.

(Figure: Elements of sequential decision making)

In the figure above, s is the state and a is the action. T is the transition function, which changes s given a. R is the reward function, and r is the reward received in a state. O is the perception function (usually the identity function), and z is the agent's perception of the environment.
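To make the loop concrete, here is a minimal sketch (my own, not from Littman's text) of one interaction episode; `T`, `R`, `O`, and `policy` are stand-in callables whose names and signatures are illustrative only.

```python
import random

def run_episode(s, policy, T, R, O, steps=10):
    """Run one interaction episode and return the collected rewards."""
    rewards = []
    for _ in range(steps):
        z = O(s)       # agent perceives the state (often z == s)
        a = policy(z)  # agent chooses an action from its perception
        s = T(s, a)    # environment transitions to a new state
        r = R(s)       # reward for the resulting state
        rewards.append(r)
    return rewards

# Example with a trivial two-state environment.
if __name__ == "__main__":
    T = lambda s, a: (s + a) % 2          # deterministic transition
    R = lambda s: 1.0 if s == 1 else 0.0  # reward 1 only in state 1
    O = lambda s: s                       # identity perception function
    policy = lambda z: random.choice([0, 1])
    print(run_episode(0, policy, T, R, O))
```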

(Figure: Problem scenarios)

1.2 Formal Models

To simplify the problem, the environment model is assumed to satisfy the following conditions:

- finite state space;
- finite actions;
- sequential environment;
- accessible or inaccessible environment;
- Markovian environment (predictions can be made from the environment's state);
- fixed environment;
- stochastic or deterministic transitions;
- synchronous environment (the environment does not change until the agent takes an action);
- one or two agents.

Markovian Models

1.3 Evaluation Criteria

Evaluating policies: an objective function takes all the <s, a> sequences and their probabilities and produces a value; the optimal policy maximizes this objective function.

1. For each <s, a> sequence, the objective function replaces each transition with a "goodness" value.
2. The values are truncated according to the finite horizon.
3. The transition values are summed for each sequence.
4. The sequence values are combined, weighted by their probabilities, to get a policy value.
5. In a multiagent environment, other agents can change the transition values.
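As one concrete instance of steps 1-4, here is a small Monte Carlo sketch (my own, assuming the common choice of exponentially discounted rewards as the per-transition "goodness" values; `sample_sequence` is a hypothetical sampler that returns the rewards along one <s, a> sequence drawn under the policy):

```python
def sequence_value(rewards, gamma=0.9, horizon=20):
    """Replace each transition with a discounted reward and sum them,
    truncating at the finite horizon (steps 1-3)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards[:horizon]))

def policy_value(sample_sequence, n_samples=1000, gamma=0.9, horizon=20):
    """Average the sequence values over sampled <s, a> sequences;
    sampling weights each sequence by its probability (step 4)."""
    total = 0.0
    for _ in range(n_samples):
        rewards = sample_sequence()  # rewards along one sampled sequence
        total += sequence_value(rewards, gamma, horizon)
    return total / n_samples
```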

Evaluating planning algorithms:

1. All of the criteria above.
2. Whether an optimal policy can be reached.
3. Algorithm efficiency.

RL algorithms: RL algorithms try to achieve what planning can do, but without complete knowledge of the environment. If an algorithm can reach the optimal policy (i.e., it converges), it is considered a good one.
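One well-known example of such an algorithm is tabular Q-learning. The sketch below is my own illustration, assuming a discrete environment with a hypothetical `env` object exposing `reset()`, `step(a)`, and a list `actions`; none of these names come from the notes above.

```python
from collections import defaultdict
import random

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection; no model of T or R is needed
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)  # sample one transition from the world
            # move Q(s, a) toward the sampled one-step target
            target = r + gamma * max(Q[(s2, act)] for act in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```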


Chapter 2: Markov Decision Process

The problem to address: given complete and correct dynamics of the environment, find the best way to behave.

The core concepts of MDPs below are extracted from my notes on the course lecture.

MDP

States: the set of tokens that describe every state one could be in.

Model/Transition function: T(s, a, s'), the probability that you end up in state s' given that you are in state s and take action a. This is the "physics" or the rules of the world (not changeable).

Actions: A(s), the things you can do in a given state, a function of the state (the rules you can play by); or simply A, a set of actions that is the same regardless of state.

Reward: a scalar value you get for being in a state. It captures the usefulness of entering a state, R(s); of entering it and taking an action, R(s, a); or of taking an action and ending up in s', R(s, a, s').

S, A, T, and R define the problem; the policy is the solution.

Policy: the solution, a function that takes a state and tells you the action to take; written as <s, a> pairs, where a is the correct action to take in s to maximize the reward.

Special policy: π*, the optimal policy, maximizes the long-term expected reward. (Note: from <s, a, r> samples you know the state, the action, and then the reward; based on the rewards you find π*. Compared with supervised learning, r plays the role of the training signal z, s of the input x, a of the output y, and π of the function f.)
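To tie the definitions together, here is a minimal encoding of the <S, A, T, R> tuple and a policy for a toy two-state MDP (my own example; the state and action names are made up):

```python
S = ["s0", "s1"]
A = ["stay", "go"]

# T[(s, a)] maps each successor state s' to its probability
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}

# R[s] is the scalar reward for being in state s
R = {"s0": 0.0, "s1": 1.0}

# A policy maps each state to the action taken there
pi = {"s0": "go", "s1": "stay"}
```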

Algorithms

Value function algorithm
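I read this as computing the value function V^π of a fixed policy π, i.e., the expected long-term reward of following π from each state. In the discounted setting it is the solution of the Bellman equations (the standard formulation, with discount factor γ; my addition, not copied from the thesis):

$$
V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} T\bigl(s, \pi(s), s'\bigr)\, V^{\pi}(s')
$$

Solving this linear system, or iterating it to a fixed point, gives V^π.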

Value iteration algorithm
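A standard value-iteration sketch (my own, reusing the toy S, A, T, R dictionaries from the MDP example above; `theta` is a convergence threshold):

```python
def value_iteration(S, A, T, R, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Bellman optimality backup: best one-step lookahead value
            v_new = max(
                R[s] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                for a in A
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```

Once the values have converged, the policy that acts greedily with respect to the returned V is optimal.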

Policy iteration algorithm
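A corresponding policy-iteration sketch (again my own, same toy MDP format): it alternates policy evaluation and greedy policy improvement until the policy stops changing.

```python
def policy_iteration(S, A, T, R, gamma=0.9, theta=1e-8):
    pi = {s: A[0] for s in S}
    while True:
        # Policy evaluation: iterate the Bellman equation for the fixed pi
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                v_new = R[s] + gamma * sum(
                    p * V[s2] for s2, p in T[(s, pi[s])].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to V
        stable = True
        for s in S:
            best = max(A, key=lambda a: R[s] + gamma * sum(
                p * V[s2] for s2, p in T[(s, a)].items()))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:
            return pi, V
```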

Linear programming
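The optimal value function can also be found with a linear program. One common (primal) form, written here with a reward R(s, a) for taking a in s and discount γ (the standard textbook formulation, not copied from the thesis):

$$
\begin{aligned}
\text{minimize}\quad & \sum_{s \in S} V(s) \\
\text{subject to}\quad & V(s) \ge R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V(s') \qquad \forall\, s \in S,\ a \in A
\end{aligned}
$$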

Linear programming dual
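Its dual (again the standard formulation) optimizes over variables x(s, a), which can be read as the expected discounted number of times action a is taken in state s when starting from the initial-state distribution α:

$$
\begin{aligned}
\text{maximize}\quad & \sum_{s, a} x(s, a)\, R(s, a) \\
\text{subject to}\quad & \sum_{a} x(s', a) = \alpha(s') + \gamma \sum_{s, a} T(s, a, s')\, x(s, a) \qquad \forall\, s' \in S \\
& x(s, a) \ge 0 \qquad \forall\, s \in S,\ a \in A
\end{aligned}
$$

An optimal policy can be recovered by choosing, in each state s, any action a with x(s, a) > 0.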


2015-08-26: Chapter 1 notes completed.

Chapter 2 to be continued.