
Reinforcement Learning

Chapter 13

Introduction
Supervised (inductive) learning is the simplest and most studied type of learning. But how can an agent learn behaviors when it doesn't have a teacher to tell it how to perform?
The agent has a task to perform. It takes some actions in the world, and at some later point it gets feedback telling it how well it did on the task. The agent performs the same task over and over again.

This problem is called reinforcement learning:


The agent gets positive reinforcement for tasks done well.
The agent gets negative reinforcement for tasks done poorly.

Introduction (contd..)
The goal is to get the agent to act in the world so as to maximize its rewards. The agent has to figure out what it did that made it get the reward or punishment. Examples:
- backgammon and chess playing
- job shop scheduling
- controlling robot limbs

Learning from Interaction

Learning Cycle

The learning agent performs action a_t in state s_t.

It receives reward r_{t+1} from the environment and reaches state s_{t+1}.
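
A minimal sketch of this interaction loop in code (the environment's reset/step methods and the agent's select_action/update methods are hypothetical names used only for illustration):

```python
# Minimal agent-environment interaction loop (illustrative sketch only).
# `env` and `agent` are hypothetical objects standing in for any
# environment and learner that follow this interface.

def run_episode(env, agent, max_steps=1000):
    s = env.reset()                      # initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        a = agent.select_action(s)       # choose action a_t in state s_t
        s_next, r, done = env.step(a)    # receive reward r_{t+1}, reach s_{t+1}
        agent.update(s, a, r, s_next)    # let the learner improve from the feedback
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward
```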

Reward Function
The basic idea is that the learner chooses the action that gets the maximum reward. The reward function tells the learner what the goal is, not how the goal should be achieved (that would be supervised learning). Sub-goals are not presented to the learner.

Selection of reward function is crucial

Reward Function (contd..)


Example 1: consider the following two reward functions for a maze-traversing robot:
1. Receive a reward of +50 when it finds the center of the maze.
2. Receive a reward of -1 for each move and a reward of +50 when it finds the center of the maze.

The first function does not care about the path length, whereas the second rewards the shortest path the most. This example is episodic (i.e. learning is split into episodes that have a definite endpoint, when the robot reaches the center). The reward can be given at the terminal state and propagated back through all the actions that were performed to update the learner.
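
A sketch of the two reward schemes as simple functions (the boolean flag indicating whether the robot has reached the center is an assumption about how the state is encoded):

```python
# Two candidate reward functions for the maze robot (illustrative sketch).

def reward_goal_only(at_center):
    """Scheme 1: +50 at the center of the maze, 0 otherwise (ignores path length)."""
    return 50 if at_center else 0

def reward_per_step(at_center):
    """Scheme 2: -1 for each move, +50 at the center (favours the shortest path)."""
    return 50 if at_center else -1

# Example: a 3-step path that ends at the center, under each scheme.
path = [False, False, True]
print(sum(reward_goal_only(p) for p in path))   # 50
print(sum(reward_per_step(p) for p in path))    # -1 - 1 + 50 = 48
```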

Reward Function (contd..)


Example 2: consider a problem where a child learns to walk. A child can walk successfully when it doesn't fall at all, not when it doesn't fall over for 10 minutes. This problem is continual: there is no terminal or accepting state.

It is hard to assign the reward because there is no terminal state.

Discounting
The solution to the reward problem for continual tasks is discounting.

We take into account how certain we can be about things that happen in the future.
Since there is lots of uncertainty in learning, we should discount our predictions of rewards in the future according to how much chance there is that they are wrong.

Discounting (contd..)
Rewards that we expect to get very soon are probably going to be more accurate predictions than those a long time in the future, because lots of things might change. A discount factor γ is added as an additional parameter, where 0 ≤ γ ≤ 1.

γ discounts future rewards by multiplying them by γ^t, where t is the number of time steps into the future that the reward is from.

The prediction of the total future reward is:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}
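
A small sketch computing this discounted return from a list of future rewards (γ = 0.9 is just an illustrative choice):

```python
# Discounted return: R_t = sum over k of gamma^k * r_{t+k+1} (sketch).

def discounted_return(rewards, gamma=0.9):
    """Sum future rewards, weighting a reward k steps ahead by gamma**k."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# The same +50 goal reward is worth less the further away it is.
print(discounted_return([0, 0, 0, 50], gamma=0.9))    # 50 * 0.9**3 ≈ 36.45
print(discounted_return([-1, -1, -1, 50], gamma=0.9))
```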

Discounting in Child Walking Problem


When the child falls we can give a reward of -1, and otherwise there is no reward. The -1 reward is discounted in the future, so that a fall k steps into the future has reward -γ^k. The learner will therefore try to make k as large as possible, resulting in proper walking.

Action Selection
At every step of the learning process, the algorithm looks at the actions that can be performed in the current state and computes the value of each action: the average reward that has been received each time it was taken in the past. The average reward is represented as Q_{s,t}(a), where s is the state, a is the action, and t is the number of times that the action has been taken before in this state.

Value will eventually converge to the true prediction of the reward for that action.

Action Selection Methods


1. Greedy: pick the action that has the highest value of Q_{s,t}(a), i.e. always exploit your current knowledge.
2. ε-greedy: a modification of greedy that adds some exploration. The greedy choice is used most of the time, but with small probability ε we pick some other action at random.
3. Soft-max: a refinement of the ε-greedy algorithm. When exploration happens, the alternative actions are selected according to their current values rather than uniformly at random (see the sketch below).
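
A minimal sketch of ε-greedy and soft-max selection over an array of Q-values (the ε and temperature τ values are illustrative assumptions):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore at random, otherwise exploit."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))      # explore
    return int(np.argmax(q_values))                  # exploit

def soft_max(q_values, tau=1.0):
    """Pick actions with probability proportional to exp(Q / tau)."""
    prefs = np.exp((np.asarray(q_values) - np.max(q_values)) / tau)  # subtract max for stability
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(q_values), p=probs))
```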

Policy
Once the algorithm has decided on the reward, it needs to choose the action that should be performed in the current state. This is a combination of exploration and exploitation: deciding whether to take the action that gave the highest reward the last time we were in this state, or to try out a different action in the hope of finding something even better.

Markov Property
The Markov property refers to the memoryless property of a stochastic process: the current state provides enough information for the reward to be computed without looking at previous states. The equation below is the probability that the next reward is r and the next state is s′:

Pr(r_{t+1} = r, s_{t+1} = s′ | s_t, a_t)

A process with this property is called a Markov process.

Markov Decision Process


(Figure: agent-environment loop. The agent takes actions a_0, a_1, a_2, … and the environment returns rewards r_0, r_1, r_2, … and new states, giving the sequence s_0, s_1, s_2, s_3, …)

Markov Decision Processes


A Markov decision process is a stochastic process that satisfies the Markov property. The changes of state of the process are called transitions, and the probabilities associated with the various state changes are called transition probabilities. The diagram can be extended into something called a transition diagram, which shows the dynamics of a finite Markov decision process and usually includes information about the transition probabilities and the expected rewards.

Values
The expected reward is known as the value. There are two ways to compute it:
1. The state-value function V(s): we consider the current state and average across all of the actions that can be taken, leaving the policy to sort this out for itself.
2. The action-value function Q(s,a): we consider the current state and each possible action that can be taken separately.
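
For reference, the two are related in the standard way, where π(a|s) is the probability that policy π chooses action a in state s:

```latex
V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s,a),
\qquad
Q^{\pi}(s,a) = \mathbb{E}\left[\, r_{t+1} + \gamma\, V^{\pi}(s_{t+1}) \mid s_t = s,\ a_t = a \,\right]
```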

The Q-Learning Algorithm


Initialization:
- Set Q(s,a) to small random values for all s and a

Repeat (for each episode):
- Initialize s
- For each step of the current episode:
  o Select action a using ε-greedy or another policy
  o Take action a and receive reward r
  o Sample new state s′
  o Update Q(s,a) ← Q(s,a) + α (r + γ max_{a′} Q(s′,a′) − Q(s,a))
  o Set s ← s′

Until there are no more episodes
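
A compact sketch of this loop as tabular Q-learning (the environment interface and the values of α, γ and ε are illustrative assumptions):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning sketch; `env` is a hypothetical environment
    with reset() -> s and step(a) -> (s_next, r, done)."""
    Q = np.random.uniform(-0.01, 0.01, (n_states, n_actions))  # small random init
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy update: bootstrap from the best next action
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```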

The Sarsa Algorithm


Initialization:
- Set Q(s,a) to small random values for all s and a

Repeat (for each episode):
- Initialize s
- Choose action a using the current policy (e.g. ε-greedy)
- For each step of the current episode:
  o Take action a and receive reward r
  o Sample new state s′
  o Choose action a′ using the current policy
  o Update Q(s,a) ← Q(s,a) + α (r + γ Q(s′,a′) − Q(s,a))
  o Set s ← s′, a ← a′

Until there are no more episodes
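
A matching sketch of the Sarsa loop; the substantive difference from Q-learning above is that the update bootstraps from the action a′ actually chosen by the policy (same hypothetical environment interface and illustrative parameters):

```python
import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Sarsa sketch (on-policy): the update uses the next action a'
    actually selected by the epsilon-greedy policy."""
    def policy(Q, s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    Q = np.random.uniform(-0.01, 0.01, (n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        a = policy(Q, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(Q, s_next)
            # on-policy update: bootstrap from the action the policy will take
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```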

Sarsa Vs Q-Learning
Consider the cliff-walking problem, where the start and goal are as shown in the figure:
- The reward for every move is -1.
- The reward for a move that ends in the cliff is -100, and the agent is put back at the start.
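
A sketch of that reward rule as a function (the boolean flag for landing in the cliff is an assumption about how the grid world from the figure might be encoded):

```python
# Cliff-walking reward rule (illustrative sketch).

def cliff_reward(landed_in_cliff, state, start_state):
    """Return (reward, next_state) for a single move."""
    if landed_in_cliff:
        return -100, start_state   # heavy penalty, agent sent back to the start
    return -1, state               # every ordinary move costs -1
```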

Sarsa Vs Q-Learning
The Sarsa algorithm converges to a much safer route that keeps it well away from the cliff, even though the route is longer.

Sarsa Vs Q-Learning
Q-learning uses the greedy policy in its update and finds the optimal path, which is the shortest one.

Uses of Reinforcement Learning


- Robotics: a robot can be left to attempt to solve a task without intervention
- Games: chess, checkers, backgammon
- Trading: learning to trade by reinforcement
- Vehicle routing
