
Reinforcement Learning

Chapter 13

Introduction
Supervised (inductive) learning is the simplest and most studied type of learning. But how can an agent learn behaviors when it doesn't have a teacher to tell it how to perform?
The agent has a task to perform. It takes some actions in the world, and at some later point it gets feedback telling it how well it did on the task. The agent performs the same task over and over again.

This problem is called reinforcement learning:


The agent gets positive reinforcement for tasks done well.
The agent gets negative reinforcement for tasks done poorly.

Introduction (contd..)
The goal is to get the agent to act in the world so as to maximize its rewards. The agent has to figure out what it did that made it get the reward or punishment. Examples:
- backgammon and chess playing
- job shop scheduling
- controlling robot limbs

Learning from Interaction

Learning Cycle

The learning agent performs action a_t in state s_t.

It receives reward r_{t+1} from the environment and reaches state s_{t+1}.
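
A minimal sketch of this interaction loop in code (the environment's reset/step methods and the agent's select_action/update methods are hypothetical names used only for illustration):

```python
# Minimal agent-environment interaction loop (illustrative sketch only).
# `env` and `agent` are hypothetical objects standing in for any
# environment and learner that follow this interface.

def run_episode(env, agent, max_steps=1000):
    s = env.reset()                      # initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        a = agent.select_action(s)       # choose action a_t in state s_t
        s_next, r, done = env.step(a)    # receive reward r_{t+1}, reach s_{t+1}
        agent.update(s, a, r, s_next)    # let the learner improve from the feedback
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```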

Reward Function
The basic idea is that the learner chooses the action that gets the maximum reward. The reward function tells the learner what the goal is, not how the goal should be achieved (that would be supervised learning). Sub-goals are not presented to the learner.

Selection of reward function is crucial

Reward Function (contd..)


Example 1: consider the following two reward functions for a maze-traversing robot:
1. Receive a reward of +50 when it finds the center of the maze.
2. Receive a reward of -1 for each move and a reward of +50 when it finds the center of the maze.

The first function does not care about the path length, whereas the second rewards the shortest path the most. This example is episodic (i.e. learning is split into episodes that have a definite endpoint, when the robot reaches the center). The reward can be given at the terminal state and propagated back through all the actions that were performed to update the learner.
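
A sketch of the two reward schemes as simple functions (the boolean flag indicating whether the robot has reached the center is an assumption about how the state is encoded):

```python
# Two candidate reward functions for the maze robot (illustrative sketch).

def reward_goal_only(at_center):
    """Scheme 1: +50 at the center of the maze, 0 otherwise (ignores path length)."""
    return 50 if at_center else 0

def reward_per_step(at_center):
    """Scheme 2: -1 for each move, +50 at the center (favours the shortest path)."""
    return 50 if at_center else -1

# Example: a 3-step path that ends at the center, under each scheme.
path = [False, False, True]
print(sum(reward_goal_only(p) for p in path))   # 50
print(sum(reward_per_step(p) for p in path))    # -1 - 1 + 50 = 48
```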

Reward Function (contd..)


Example 2: consider a problem where a child learns to walk. A child can walk successfully when it doesn't fall at all, not when it doesn't fall over for 10 minutes. This problem is continual: there is no terminal or accepting state.

It is hard to assign the reward because there is no terminal state.

Discounting
The solution to the reward problem for continual tasks is discounting.

We take into account how certain we can be about things that happen in the future.
Since there is lots of uncertainty in learning, we should discount our predictions of rewards in the future according to how much chance there is that they are wrong.

Discounting (contd..)
Rewards that we expect to get very soon are probably going to be more accurate predictions than those a long time in the future, because lots of things might change. A discount factor γ is added as an additional parameter, where 0 ≤ γ ≤ 1.

γ discounts future rewards by multiplying them by γ^t, where t is the number of time steps into the future that the reward is from.

The prediction of the total future reward is:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}
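
A small sketch computing this discounted return from a list of future rewards (γ = 0.9 is just an illustrative choice):

```python
# Discounted return: R_t = sum over k of gamma^k * r_{t+k+1} (sketch).

def discounted_return(rewards, gamma=0.9):
    """Sum future rewards, weighting a reward k steps ahead by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# The same +50 goal reward is worth less the further away it is.
print(discounted_return([0, 0, 0, 50], gamma=0.9))    # 50 * 0.9**3 ≈ 36.45
print(discounted_return([-1, -1, -1, 50], gamma=0.9))
```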

Discounting in Child Walking Problem


When the child falls we can give a reward of -1, and otherwise there is no reward. The -1 reward is discounted in the future, so that a fall k steps into the future has reward -γ^k. The learner will therefore try to make k as large as possible, resulting in proper walking.

Action Selection
At every step of the learning process, the algorithm looks at the actions that can be performed in the current state and computes the value of each action: the average reward that has been received each time it was taken in the past. The average reward is represented as Q_{s,t}(a), where s is the state, a is the action, and t is the number of times that the action has been taken before in this state.

Value will eventually converge to the true prediction of the reward for that action.

Action Selection Methods


1. Greedy: pick the action that has the highest value of Q_{s,t}(a), i.e. always exploit your current knowledge.
2. ε-greedy: a modification of greedy that adds some exploration. The greedy choice is used most of the time, but with small probability ε we pick some other action at random.
3. Soft-max: a refinement of the ε-greedy algorithm. When exploration happens, the alternative actions are selected according to their current values rather than uniformly at random (see the sketch below).
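
A minimal sketch of ε-greedy and soft-max selection over an array of Q-values (the ε and temperature τ values are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore at random, otherwise exploit."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))      # explore
    return int(np.argmax(q_values))                  # exploit

def soft_max(q_values, tau=1.0):
    """Pick actions with probability proportional to exp(Q / tau)."""
    prefs = np.exp((np.asarray(q_values) - np.max(q_values)) / tau)  # subtract max for stability
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(q_values), p=probs))
```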

Policy
Once the algorithm has decided on the reward, it needs to choose the action that should be performed in the current state. This is a combination of exploration and exploitation: deciding whether to take the action that gave the highest reward the last time we were in this state, or to try out a different action in the hope of finding something even better.

Markov Property
The Markov property refers to the memoryless property of a stochastic process: the current state provides enough information for the reward to be computed without looking at previous states. The equation below is the probability that the next reward is r and the next state is s′:

Pr(r_{t+1} = r, s_{t+1} = s′ | s_t, a_t)

A process with this property is called a Markov process.

Markov Decision Process


(Figure: agent-environment loop. The agent takes actions a_0, a_1, a_2, … and the environment returns rewards r_0, r_1, r_2, … and new states, giving the sequence s_0, s_1, s_2, s_3, …)

Markov Decision Processes


A Markov decision process is a stochastic process that satisfies the Markov property. The changes of state of the process are called transitions, and the probabilities associated with the various state changes are called transition probabilities. The diagram can be extended into something called a transition diagram, which shows the dynamics of a finite Markov decision process and usually includes information about the transition probabilities and the expected rewards.

Values
The expected reward is known as the value. There are two ways to compute it:
1. The state-value function V(s): we consider the current state and average across all of the actions that can be taken, leaving the policy to sort this out for itself.
2. The action-value function Q(s,a): we consider the current state and each possible action that can be taken separately.
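
For reference, the two are related in the standard way, where π(a|s) is the probability that policy π chooses action a in state s:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s,a),
\qquad
Q^{\pi}(s,a) = \mathbb{E}\left[\, r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s,\ a_t = a \,\right]
```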

The Q-Learning Algorithm


Initialization:
- Set Q(s,a) to small random values for all s and a

Repeat (for each episode):
- Initialize s
- For each step of the current episode:
  o Select action a using ε-greedy or another policy
  o Take action a and receive reward r
  o Sample new state s′
  o Update Q(s,a) ← Q(s,a) + α (r + γ max_{a′} Q(s′,a′) − Q(s,a))
  o Set s ← s′

Until there are no more episodes
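
A compact sketch of this loop as tabular Q-learning (the environment interface and the values of α, γ and ε are illustrative assumptions):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch; `env` is a hypothetical environment
    with reset() -> s and step(a) -> (s_next, r, done)."""
    Q = np.random.uniform(-0.01, 0.01, (n_states, n_actions))  # small random init
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy update: bootstrap from the best next action
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```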

The Sarsa Algorithm


Initialization:
- Set Q(s,a) to small random values for all s and a

Repeat (for each episode):
- Initialize s
- Choose action a using the current policy (e.g. ε-greedy)
- For each step of the current episode:
  o Take action a and receive reward r
  o Sample new state s′
  o Choose action a′ using the current policy
  o Update Q(s,a) ← Q(s,a) + α (r + γ Q(s′,a′) − Q(s,a))
  o Set s ← s′, a ← a′

Until there are no more episodes
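
A matching sketch of the Sarsa loop; the substantive difference from Q-learning above is that the update bootstraps from the action a′ actually chosen by the policy (same hypothetical environment interface and illustrative parameters):

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Sarsa sketch (on-policy): the update uses the next action a'
    actually selected by the epsilon-greedy policy."""
    def policy(Q, s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    Q = np.random.uniform(-0.01, 0.01, (n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = policy(Q, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(Q, s_next)
            # on-policy update: bootstrap from the action the policy will take
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```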

Sarsa Vs Q-Learning
Consider the cliff-walking problem, where the start and goal are as shown in the figure:
- The reward for every move is -1.
- The reward for a move that ends in the cliff is -100, and the agent is put back at the start.
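
A sketch of that reward rule as a function (the boolean flag for landing in the cliff is an assumption about how the grid world from the figure might be encoded):

```python
# Cliff-walking reward rule (illustrative sketch).

def cliff_reward(landed_in_cliff, state, start_state):
    """Return (reward, next_state) for a single move."""
    if landed_in_cliff:
        return -100, start_state   # heavy penalty, agent sent back to the start
    return -1, state               # every ordinary move costs -1
```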

Sarsa Vs Q-Learning
The Sarsa algorithm converges to a much safer route that keeps it well away from the cliff, even though the route is longer.

Sarsa Vs Q-Learning
Q-learning uses the greedy policy in its update and finds the optimal path, which is the shortest one.

Uses of Reinforcement Learning


- Robotics: a robot can be left to attempt to solve a task without intervention
- Games: chess, checkers, backgammon
- Trading: learning to trade by reinforcement
- Vehicle routing
