

layout: post
title: "ML4T笔记 | 03-07 Dyna"
date: "2019-03-31 03:00:00"
categories: 计算机科学
auth: conge

tags: Machine_Learning Trading ML4T OMSCS

1 - Overview

Q-learning is expensive because it takes many experience tuples to converge. Creating an experience tuple means taking a real step to execute a trade, in order to gather information.

To address this problem, Rich Sutton invented Dyna.

Time: 00:00:37

2 - Dyna-Q Big Picture

(figure: Dyna-Q big picture)

Q-learning is model-free, meaning that it does not rely on models of T or R.

Dyna-Q is an algorithm developed by Rich Sutton intended to speed up learning or model convergence for Q learning.

Dyna is a blend of model-free and model-based methods.

  1. First consider plain old Q-learning: initialize our Q table, then begin iterating: observe s, take action a, observe our new state s' and reward r, update the Q table with this experience tuple, and repeat.
  2. When we augment Q-learning with Dyna-Q, we add three new components: first, logic that enables us to learn models of T and R; then, we hallucinate an experience; finally, we use that hallucinated experience to update our Q table.

Let's look at each of these components in a little more detail now.

When updating our model, what we want to do is find new values for T and R.

The model we update consists of T, our transition function, and R, our reward function.

Here is how we hallucinate an experience:

  1. randomly select a state s,
  2. randomly select an action a, so our state and action are chosen totally at random,
  3. infer our new state s' by looking at T,
  4. infer a reward r by looking at big R, our R table.

Now we have a complete experience tuple that we can use to update our Q table.

So the Q table update is our final step, and this is really all there is to Dyna-Q.
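To make the flow concrete, here is a minimal sketch of one Dyna-Q iteration in Python. This is not the course's project code: the table sizes, the learning rate alpha, the discount gamma, and the number of hallucinated updates n_dyna are all assumptions chosen for illustration, and T and R start as trivial placeholder models.

```python
import numpy as np

# Assumed sizes and hyperparameters (illustrative only)
num_states, num_actions = 100, 4
alpha, gamma = 0.2, 0.9
n_dyna = 200  # hallucinated updates per real step

Q = np.zeros((num_states, num_actions))
T = np.ones((num_states, num_actions, num_states)) / num_states  # learned transition model
R = np.zeros((num_states, num_actions))                          # learned reward model

def q_update(s, a, s_prime, r):
    """Standard Q-learning update from one experience tuple."""
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_prime, :].max())

def dyna_planning():
    """Hallucinate experiences from the learned models T and R, then update Q."""
    for _ in range(n_dyna):
        s = np.random.randint(num_states)                   # 1. random state
        a = np.random.randint(num_actions)                   # 2. random action
        s_prime = np.random.choice(num_states, p=T[s, a])    # 3. infer s' from T
        r = R[s, a]                                          # 4. infer r from R
        q_update(s, a, s_prime, r)                           # final Q table update
```

The payoff is that each hallucinated update only queries the learned model, which is far cheaper than gathering a real experience tuple by executing a trade.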

Time: 00:04:14

3 - Learning T

Note: the methods here are the Balch version and might not be Rich Sutton's original version.

Learning T.

T[s,a,s'] represents the probability that, if we are in state s and take action a, we will end up in state s'.

To learn a model of T: just observe how these transitions occur.
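A minimal sketch of that observation step, assuming a NumPy count table T_c initialized to a tiny non-zero value so that the normalization in the next section never divides by zero:

```python
import numpy as np

num_states, num_actions = 100, 4  # assumed sizes

# T_c[s, a, s'] counts how many times the transition (s, a) -> s' was observed.
T_c = np.full((num_states, num_actions, num_states), 1e-5)

def observe_transition(s, a, s_prime):
    """Each real experience tuple simply increments the matching count."""
    T_c[s, a, s_prime] += 1
```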

Time: 00:01:34

4 - How to Evaluate T

$T[s,a,s'] = \frac{T_c[s,a,s']}{\sum\limits_{i}T_c[s,a,i]}$
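In code, that normalization is a single division over the hypothetical count table T_c from the sketch above:

```python
import numpy as np

def evaluate_T(T_c):
    """Turn transition counts into probabilities:
    T[s, a, s'] = T_c[s, a, s'] / sum_i T_c[s, a, i]
    """
    return T_c / T_c.sum(axis=2, keepdims=True)
```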

Time: 00:01:03

6 - Learning R

The last step here is: how do we learn a model for R?
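The notes do not spell out the update, so here is a minimal sketch assuming a Balch-style running average, R'[s,a] = (1 - alpha) * R'[s,a] + alpha * r, where alpha is an assumed learning rate:

```python
import numpy as np

num_states, num_actions = 100, 4  # assumed sizes
alpha = 0.2                       # assumed learning rate

# R[s, a] is a running estimate of the reward for taking action a in state s.
R = np.zeros((num_states, num_actions))

def observe_reward(s, a, r):
    """Blend each newly observed real reward into the running average."""
    R[s, a] = (1 - alpha) * R[s, a] + alpha * r
```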

Time: 00:01:39

7 - Dyna Q Recap

Here is a recap of how Dyna-Q works.

Dyna-Q adds three new components on top of regular Q-learning:

  1. update models of T and R,
  2. then we hallucinate an experience.
  3. update our Q table.

We can repeat steps 2-3 many times (roughly 100 or 200), then return to the top and continue our interaction with the real world.

Time: 00:00:57

Total Time: 00:10:41

Summary

The Dyna architecture consists of a combination of: direct Q-learning updates from real experience, learning models of T and R, and planning updates from experiences hallucinated with those models.

(figure: Dyna learning architecture)

Sutton and Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998. [web]

Resources

2019-03-31 First draft