---
layout: post
title: "Reinforcement Learning Week 8 Course Notes"
date: "2015-10-13 11:01:08"
categories: 计算机科学
excerpt: "This week: watch Exploration. The readings were Fong (1995) and Li, Lit..."
author: conge
---

This week

Exploration: Specific to RL

Subtopics of Exploration

| Type | State transitions | Stochastic | Solution |
|------|-------------------|------------|----------|
| Bandits | ✘ | ✔ | Hoeffding bound to do stochastic decision making |
| Deterministic MDPs | ✔ | ✘ | Rmax to do sequential decision making |
| Stochastic MDPs | ✔ | ✔ | both the Hoeffding bound and Rmax |

K-armed Bandit Problem

Confidence-based Exploration

![Quiz 1: K-Armed Bandits. Given the observed data, which arm gives the highest expected payoff, and in which arm are we most confident?](/assets/images/计算机科学/118382-6578df35f8c00073.png)
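To answer this kind of quiz programmatically, here is a minimal Python sketch: it computes each arm's sample mean and a Hoeffding confidence radius. The `observations` data below is hypothetical (the actual quiz data is in the image above), and `delta` is an assumed confidence parameter.

```python
import math

# Hypothetical observed payoffs per arm (1 = payoff, 0 = none);
# the real quiz data lives in the image above.
observations = {
    "arm 1": [1, 0, 1, 1],
    "arm 2": [1, 1],
    "arm 3": [0],
}

delta = 0.05  # allowed failure probability (assumed)

for arm, xs in observations.items():
    n = len(xs)
    mean = sum(xs) / n
    # Hoeffding: with probability >= 1 - delta,
    # |true mean - sample mean| <= radius
    radius = math.sqrt(math.log(2 / delta) / (2 * n))
    print(f"{arm}: mean = {mean:.2f} +/- {radius:.2f} (n = {n})")
```

The arm with the highest sample mean answers the payoff question; the arm with the smallest radius (the most pulls) is the one we are most confident about.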


Metrics for Bandits

  1. Identify the optimal arm in the limit: pure exploration.
  2. Maximize (discounted) expected reward: exploration + exploitation, but computationally expensive.
  3. Maximize expected reward over a finite horizon: similar to value iteration.
  4. Identify near-optimal arms (within ε) with high probability (1 - δ) in time τ(k, ε, δ), polynomial in k, 1/ε, 1/δ. This "get there" goal is to find near-optimal arms; it parallels PAC learning: probably approximately correct (formalized below).
  5. Nearly maximize reward (within ε) with high probability (1 - δ) in time τ'(k, ε, δ), polynomial in k, 1/ε, 1/δ. This "how to get there" goal is to accumulate as much reward as possible in the finite time τ'.
  6. Pull a non-near-optimal arm (off by more than ε) no more than τ''(k, ε, δ) times, with high probability (1 - δ). This "control error" goal is to minimize mistakes.
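Metric 4 is essentially a PAC guarantee. Written out in symbols (my notation, not the lecture's), where μ_i is arm i's true mean and â is the arm the algorithm returns:

```latex
\Pr\left[ \mu_{\hat{a}} \ge \max_{i} \mu_i - \epsilon \right] \ge 1 - \delta
\quad \text{within } \tau(k, \epsilon, \delta) \text{ pulls, with } \tau
\text{ polynomial in } k,\, 1/\epsilon,\, 1/\delta .
```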

Find Best Implies Few Mistakes

Few Mistakes Implies Do Well

Do Well Implies Find Best

Putting it together

Hoeffding Bounds
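For reference, the standard statement of the bound (not copied from the slides): for i.i.d. samples X_1, ..., X_n in [0, 1] with sample mean \hat{\mu}_n and true mean μ,

```latex
\Pr\left( |\hat{\mu}_n - \mu| \ge \epsilon \right) \le 2 e^{-2 n \epsilon^2}
\quad\Longleftrightarrow\quad
\text{with probability} \ge 1 - \delta,\;
|\hat{\mu}_n - \mu| \le \sqrt{\tfrac{1}{2n} \ln \tfrac{2}{\delta}} .
```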

Combining Arm Info

How Many Samples?
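Inverting the Hoeffding bound answers the heading's question. To estimate one arm's mean within ±ε with probability at least 1 - δ (standard algebra, my derivation):

```latex
2 e^{-2 n \epsilon^2} \le \delta
\;\Longrightarrow\;
n \ge \frac{1}{2\epsilon^2} \ln \frac{2}{\delta}
\quad \text{samples per arm.}
```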

Quiz 2

Exploring Deterministic MDPs

MDP Optimization Criteria

Exploring MDP

Rmax Analysis

Lower bound using a special kind of MDP: sequential combination lock

General Rmax

For stochastic but non-sequential problems (bandits), the Hoeffding bound handles exploration; for sequential but deterministic MDPs, Rmax solves the problem. Applying both ideas together solves the stochastic MDP, as the sketch below illustrates.
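Here is a minimal Python sketch of the Rmax idea (my paraphrase, not the exact algorithm from the slides). The `env` object and its interface (`states`, `actions`, `reset`, `step`), plus the parameters `m`, `gamma`, `episodes`, and `horizon`, are all assumptions made for the sketch; `r_max` is the maximum one-step reward, supplied by the caller.

```python
from collections import defaultdict

# A minimal Rmax sketch. `env` is a hypothetical finite MDP exposing
# .states, .actions, .reset() -> s, and .step(s, a) -> (next_state, reward).

def rmax(env, r_max, m=5, gamma=0.95, episodes=100, horizon=50):
    counts = defaultdict(int)                      # visits to (s, a)
    trans = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
    rsum = defaultdict(float)                      # total reward at (s, a)

    def model(s, a):
        # Optimism under uncertainty: a pair seen fewer than m times
        # self-loops and pays r_max, so the planner is drawn to explore it.
        if counts[(s, a)] < m:
            return {s: 1.0}, r_max
        n = counts[(s, a)]
        return ({s2: c / n for s2, c in trans[(s, a)].items()},
                rsum[(s, a)] / n)

    def q(s, a, V):
        probs, r = model(s, a)
        return r + gamma * sum(p * V[s2] for s2, p in probs.items())

    def plan():
        # Plain value iteration on the optimistic model.
        V = {s: 0.0 for s in env.states}
        for _ in range(200):
            V = {s: max(q(s, a, V) for a in env.actions) for s in env.states}
        return V

    for _ in range(episodes):
        s = env.reset()
        V = plan()  # a fuller version replans whenever a pair becomes "known"
        for _ in range(horizon):
            a = max(env.actions, key=lambda a: q(s, a, V))
            s2, r = env.step(s, a)
            counts[(s, a)] += 1
            trans[(s, a)][s2] += 1
            rsum[(s, a)] += r
            s = s2

    return plan()  # final value estimates under the learned model
```

The key design choice is giving unexplored pairs a self-loop worth `r_max`: planning itself then drives exploration, steering the agent toward unknown territory until every reachable pair has been tried `m` times.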


Two key ideas make this work: the Simulation Lemma and the Explore-or-Exploit Lemma.

Simulation Lemma

The difference (α) between the estimated MDP and the real MDP is polynomially related to the error (ε) in the value of actions.
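In symbols, one common form of the lemma (following Kearns & Singh; constants vary by presentation, so treat this as a paraphrase): if the estimated model is α-close to the real one, then every policy's value is close too:

```latex
\|\hat{T}(s,a,\cdot) - T(s,a,\cdot)\|_1 \le \alpha,\;
|\hat{R}(s,a) - R(s,a)| \le \alpha \;\;\forall s, a
\;\Longrightarrow\;
|\hat{V}^{\pi}(s) - V^{\pi}(s)| \le \frac{\alpha \, (1 + \gamma V_{\max})}{1 - \gamma},
\quad V_{\max} = \frac{R_{\max}}{1 - \gamma}.
```

Solving for α, a model error of order ε(1-γ)²/R_max suffices for ε-accurate values, which is the polynomial relationship stated above.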

Explore-or-Exploit Lemma
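A paraphrase of the lemma (in the spirit of the E³/Rmax analyses, not verbatim from the slides): at any known state s, one of two things must hold:

```latex
V^{\pi}_{M_{\text{known}}}(s) \ge V^{*}_{M}(s) - \epsilon
\quad \text{or} \quad
\Pr\left[ \pi' \text{ reaches an unknown state within } T \text{ steps} \right]
\ge \frac{\epsilon}{V_{\max}} \;\text{ for some policy } \pi'.
```

Either way the agent wins: it exploits near-optimally, or it explores (reaches an unknown state-action pair) with non-negligible probability.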

What Have We Learned?

Recap

2015-10-06 First draft, up to Quiz 1
2015-10-07 Updated through Exploring Deterministic MDPs
2015-12-04 Reviewed and revised