

layout: post
title: "ML4T Notes | 03-06 Q-Learning"
date: "2019-03-31 03:31:31"
categories: Computer Science
auth: conge

tags: Machine_Learning Trading ML4T OMSCS

01 - Overview


Time: 00:00:38

02 - What is Q


What is Q?

Well, let's dig in and find out. Q[s, a] represents the value of taking action a in state s: the immediate reward plus the discounted reward for all future actions, assuming we act optimally from then on.

How can we use Q to figure out what to do?

Policy ($\Pi$) defines what we do in any particular state

After we run Q learning for long enough, we will eventually converge to the optimal policy ($\Pi^*$).
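Since Q[s, a] captures the value of taking action a in state s, the policy simply picks the action with the highest Q value in that state. For reference, this is the standard relation (not spelled out explicitly in these notes):

$$\Pi(s) = \operatorname{argmax}_{a} Q[s, a]$$

The optimal policy $\Pi^*$ reads off the converged Q table in the same way.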

Time: 00:02:53

03 - Learning Procedure

The big picture of how to train a Q-learner:

  1. Select the data to train on: a time series of stock prices.
  2. Iterate over this data through time. At each step, evaluate the situation for a particular stock; that gives us our state s. Consult our policy to get an action a. Take that action, then observe the next state s' and the reward r. Together, that is the experience tuple <s, a, r, s'>.
  3. Once we get all the way through the training data, test the policy and see how well it performs in a backtest.
  4. If it has converged (i.e., it is not getting any better), we are done. If not, repeat the whole process over the training data.

What does converge mean?

As we cycle through the data, training our Q table and then testing back across that same data, we get some performance, and we expect that performance to get better and better. When the improvement levels off, we say the learner has converged.

In more detail, here is what happens as we iterate over the data (sketched in code after the list):

  1. Start by setting our start time and initializing our Q table.
  2. Observe the features of our stock (or stocks) and, from those, build our state s.
  3. Consult our policy (the Q table) to find the best action a for the current state.
  4. Step forward and get the reward r and the new state s'.
  5. Update the Q table with the complete experience tuple <s, a, r, s'>.

Then repeat.
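Here is a minimal sketch of this loop in Python. It assumes the daily states have already been discretized into integers (`states`), that `daily_returns` holds the matching daily returns, and that actions are coded 0 = do nothing, 1 = buy, 2 = sell; those names and codings are illustrative, not from the lecture.

```python
import numpy as np

def train_one_pass(states, daily_returns, Q, alpha=0.2, gamma=0.9, epsilon=0.1):
    s = states[0]                                  # 2. observe features, build state s
    holding = 0
    for t in range(1, len(states)):
        # 3. consult the policy (Q table), with some random exploration, to get a
        if np.random.rand() < epsilon:
            a = np.random.randint(Q.shape[1])
        else:
            a = int(np.argmax(Q[s]))
        holding = {0: holding, 1: 1, 2: -1}[a]     # 0 = do nothing, 1 = buy, 2 = sell
        # 4. step forward: reward r and new state s'
        r = holding * daily_returns[t]
        s_prime = states[t]
        # 5. update the Q table with the experience tuple <s, a, r, s'>
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * Q[s_prime].max())
        s = s_prime
    return Q
```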

Time: 00:03:22

04 - Update Rule

Once an experience tuple <s, a, r, s'> is generated by interacting with the environment, how does the learner use that information to improve the Q table?

There are two main parts to the update rule.

  1. The old value that we already have: Q[s, a].
  2. The improved estimate from the new experience: the immediate reward r plus the discounted value of the future rewards from s'.

Learning rate: $\alpha \in [0,1]$ is the learning rate; we usually use a value of about 0.2.

Discount rate: $\gamma \in [0, 1]$ is the discount rate. A low value of $\gamma$ means that we value later rewards less.

Future discounted rewards: what is the value of those future rewards if we reach state s' and act appropriately (i.e., optimally) from there?

This is the equation you need to know to implement Q learning.

Time: 00:05:07

05 - Update Rule

The formula for updating Q for a state-action pair <s, a>, given an experience tuple <s, a, r, s'>, is:

$$Q'[s, a] = (1 - \alpha) \cdot Q[s, a] + \alpha \cdot \left( r + \gamma \cdot Q\left[s', \operatorname{argmax}_{a'} Q[s', a'] \right] \right)$$

Here:

  * r is the immediate reward for taking action a in state s.
  * $\gamma \in [0, 1]$ is the discount rate on future rewards.
  * $\alpha \in [0, 1]$ is the learning rate.
  * s' is the resulting state, and $\operatorname{argmax}_{a'} Q[s', a']$ is the best action to take from s'.
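A minimal sketch of this update in Python, assuming the Q table is a 2-D NumPy array indexed as Q[state, action]. Note that $Q[s', \operatorname{argmax}_{a'} Q[s', a']]$ is just $\max_{a'} Q[s', a']$, which is what the code computes.

```python
def q_update(Q, s, a, r, s_prime, alpha=0.2, gamma=0.9):
    """Apply the update rule for one experience tuple <s, a, r, s'>.
    Q is assumed to be a 2-D NumPy array indexed as Q[state, action]."""
    best_future = Q[s_prime].max()   # value of acting optimally from s'
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * best_future)
    return Q
```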

06 - Two Finer Points

  1. Q-learning depends to a large extent on exploration. So we need to explore as much of the state and action space as possible.

Randomness: two flips of the coin (sketched in code after the list below). 1) Do we choose a random action, or the action with the highest Q value? 2) If we chose randomly, flip the coin again to decide which of the actions to select.

By doing this:

  1. we force the system to explore and try different actions in different states;
  2. it also causes us to arrive at states that we might not otherwise reach if we did not try those random actions.
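A minimal sketch of those two flips, assuming the Q table is a NumPy array indexed as Q[state, action]; the exploration probability of 0.3 is an illustrative choice.

```python
import numpy as np

def choose_action(Q, s, random_action_rate=0.3):
    if np.random.rand() < random_action_rate:    # flip 1: explore or exploit?
        return np.random.randint(Q.shape[1])     # flip 2: which random action?
    return int(np.argmax(Q[s]))                  # exploit: the highest-Q action
```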

Time: 00:01:32

07 - The Trading Problem - Actions

Actions

To turn the stock trading problem into a problem that Q learning can solve, we need to define our actions, we need to define our state, and we also need to define our rewards.


There are three actions: buy, sell, or do nothing (sketched in code after the list below).

What happens most frequently is that we do nothing.

  1. We evaluate the factors of the stock (e.g. several technical indicators) and get our state.
  2. We consider that state and do nothing for a while.
  3. Something triggers a buy action, so we buy the stock and hold it.
  4. Then we do nothing until our very intelligent Q-learner says otherwise.
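A minimal sketch of how these actions might drive a holding; the action codes (0 = do nothing, 1 = buy, 2 = sell) and position sizes (+1 long, -1 short) are illustrative assumptions, not from the lecture.

```python
def apply_action(holding, action):
    if action == 1:       # BUY: enter (or keep) a long position
        return 1
    if action == 2:       # SELL: exit the long / go short
        return -1
    return holding        # DO NOTHING: keep the current holding
```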

How does this sort of stepped behavior affect our portfolio value?

Time: 00:02:22

08 - The Trading Problem: Rewards

Now consider the rewards for our learner: 1) short-term rewards in terms of daily returns, or 2) long-term rewards that reflect the cumulative return of a trade cycle, from a buy to a sell (or, for shorting, from a sell to a buy).

Which one of these do you think will result in faster convergence? Solution: The correct answer is daily returns.

Daily returns give more frequent feedback while cumulative rewards need to wait for a trading cycle to end.
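A minimal sketch of a daily-return reward, assuming `prices` is an array of adjusted close prices and `holding` is +1 (long), -1 (short), or 0 (flat); the names are illustrative.

```python
def daily_reward(prices, t, holding):
    daily_return = prices[t] / prices[t - 1] - 1.0   # day-over-day return
    return holding * daily_return                    # reward depends on our position
```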

Time: 00:00:41

09 - The Trading Problem: State

Which of the indicators are good candidates for states?

Solution: Adjusted close and SMA by themselves are not good factors for learning, because they do not generalize across price regimes (from when the stock was low to when it was high). But combining adjusted close and simple moving average into a ratio makes a good factor to use in the state.

Bollinger Band value and P/E ratio are also good candidates.

Whether or not we are holding the stock is important to the actions we can take. The return since we entered the position might be useful for determining exit points.
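A minimal sketch of the ratio-style factors mentioned above, assuming `price` is a pandas Series of adjusted close prices and a 20-day window (an illustrative choice):

```python
def factors(price, window=20):
    sma = price.rolling(window).mean()
    std = price.rolling(window).std()
    price_over_sma = price / sma            # generalizes across price regimes
    bb_value = (price - sma) / (2 * std)    # Bollinger Band value
    return price_over_sma, bb_value
```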

Time: 00:01:34

10 - Creating the State

The state is a single integer so that we can address it in our Q table. We can 1) discretize each factor and 2) combine all of those integers together into a single number.

Now we can stack them one after the other into our overall discretized state.
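A minimal sketch of the stacking step, assuming each factor has already been discretized to a digit in 0-9:

```python
def build_state(discretized_factors):
    state = 0
    for d in discretized_factors:
        state = state * 10 + d   # append each factor as one more digit
    return state

# e.g. build_state([2, 5, 9]) -> 259
```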

Time: 00:01:52

12 - Discretizing

Discretization or discretizing: convert a real number into an integer across a limited scale.
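A minimal sketch of one way to do this: cut the training values into equally populated bins and map any new value to the index of its bin. The names are illustrative.

```python
import numpy as np

def make_discretizer(values, steps=10):
    quantiles = [(i + 1) / steps for i in range(steps - 1)]
    thresholds = np.quantile(values, quantiles)      # bin boundaries from the data
    def discretize(x):
        return int(np.searchsorted(thresholds, x))   # integer in [0, steps - 1]
    return discretize
```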

Time: 00:01:53

13 - Q-Learning Recap

Total Time: 00:24:12

Summary

Advantages

Issues

Resources

2019-03-31 First draft