Chapter 13
Introduction
Supervised (inductive) learning is the simplest and most studied type of learning. How can an agent learn behaviors when it doesn't have a teacher to tell it how to perform?
The agent has a task to perform. It takes some actions in the world, and at some later point it gets feedback telling it how well it did on the task. The agent performs the same task over and over again.
Introduction (contd..)
The goal is to get the agent to act in the world so as to maximize its rewards. The agent has to figure out what it did that made it get the reward or punishment. Examples:
- backgammon and chess playing
- job-shop scheduling
- controlling robot limbs
Learning Cycle
Reward Function
The basic idea is for the learner to choose the action that gets the maximum reward. The reward function tells the learner what the goal is, not how the goal should be achieved (that would be supervised learning). Sub-goals are not presented to the learner.
The first function does not care about the path length, whereas the second gives the maximum reward to the shortest path. This example is episodic (i.e. learning is split into episodes that have a definite endpoint, when the robot reaches the center). The reward can be given at the terminal state and propagated back through all the actions that were performed, in order to update the learner.
Discounting
The solution to the reward problem for continual (non-episodic) tasks is discounting.
We take into account how certain we can be about things that happen in the future.
Since there is lots of uncertainty in learning, we should discount our predictions of future rewards according to how much chance there is that they are wrong.
Discounting (contd..)
Rewards that we expect to get very soon are probably going to be more accurate predictions than those a long time in the future, because lots of things might change. A discount factor γ is added as an additional parameter, where 0 ≤ γ ≤ 1.
It discounts future rewards by multiplying them by γ^t, where t is the number of time steps into the future that the reward is from.
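The idea can be sketched in a few lines of Python (the function name and the reward sequence are illustrative, not from the text):

```python
# Discounted return: a reward t steps in the future is multiplied by gamma**t,
# so distant (less certain) rewards count for less than immediate ones.
def discounted_return(rewards, gamma):
    return sum(r * gamma**t for t, r in enumerate(rewards))

# A reward of 1 received three steps from now, with gamma = 0.9,
# is worth 0.9**3 = 0.729 of an immediate reward of 1.
print(discounted_return([0, 0, 0, 1], gamma=0.9))
```

With γ = 1 there is no discounting (every future reward counts fully), and with γ = 0 only the immediate reward matters.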
Action Selection
At every step of the learning process, the algorithm looks at the actions that can be performed in the current state and computes the value (average reward) of each action: the average of the rewards that have been received each time that action was taken in the past. This average reward is written Q_{s,t}(a), where s is the state, a is the action, and t is the number of times that the action has been taken before in this state.
The value will eventually converge to the true prediction of the reward for that action.
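The average reward does not require storing every past reward; it can be maintained incrementally. A minimal sketch (the function and variable names are assumptions for illustration):

```python
# Incremental running average: after the action has been taken t times,
# Q_t = Q_{t-1} + (r_t - Q_{t-1}) / t, which equals the mean of r_1..r_t.
def update_average(q_prev, reward, t):
    return q_prev + (reward - q_prev) / t

q = 0.0
for t, r in enumerate([1.0, 0.0, 1.0, 1.0], start=1):
    q = update_average(q, r, t)
print(q)  # the mean of the four rewards, 0.75
```

Each new reward nudges the estimate toward itself by a shrinking step 1/t, which is why the value converges as the action is tried more often.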
Policy
Once the algorithm has estimated the rewards, it needs to choose the action that should be performed in the current state. This is a combination of exploration and exploitation: deciding whether to take the action that gave the highest reward the last time we were in this state, or to try out a different action in the hope of finding something even better.
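One common way to balance the two is ε-greedy selection, sketched below (the parameter value 0.1 is an illustrative choice, not from the text):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (exploration);
    otherwise pick the action with the highest estimated value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```

Setting ε = 0 gives pure exploitation (always the greedy action); larger ε means more exploration.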
Markov Property
The Markov property refers to the memoryless property of a stochastic process: the current state provides enough information for the reward to be computed without looking at any previous states. The expression below is the probability that the next reward is r and the next state is s', conditioned only on the current state and action:

Pr(r_t = r, s_{t+1} = s' | s_t, a_t)
[Figure: a Markov chain over states s0, s1, s2, s3]
Values
The expected reward is known as the value. There are two ways to compute it:
1. the state-value function V(s): we consider the current state and average across all of the actions that can be taken, leaving the policy to sort this out for itself
2. the action-value function Q(s,a): we consider the current state and each possible action that can be taken separately
The Sarsa algorithm:

Repeat (for each episode):
    Initialize s
    Choose action a using the current policy
    Repeat (for each step of the current episode):
        - Take action a and receive reward r
        - Sample new state s'
        - Choose action a' using the current policy
        - Update Q(s,a) <- Q(s,a) + mu * (r + gamma * Q(s',a') - Q(s,a))
        - Set s <- s', a <- a'
Until there are no more episodes
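The update step above can be sketched directly in Python. This is a minimal illustration, assuming a dictionary keyed by (state, action) pairs and illustrative values mu = 0.1, gamma = 0.9:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, mu=0.1, gamma=0.9):
    """One Sarsa step: move Q(s,a) toward r + gamma * Q(s',a'),
    where a' is the action the current policy actually chose in s'
    (this is what makes Sarsa on-policy)."""
    Q[s, a] += mu * (r + gamma * Q[s_next, a_next] - Q[s, a])

# Tiny example: 3 states, 2 actions, all values initialized to zero.
Q = {(s, a): 0.0 for s in range(3) for a in range(2)}
sarsa_update(Q, s=0, a=1, r=-1.0, s_next=1, a_next=0)
print(Q[0, 1])  # moved from 0 toward -1 by a step of mu: -0.1
```

Because the target uses the action the policy really took next, Sarsa's estimates reflect the behavior of the policy, exploration included.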
Sarsa Vs Q-Learning
Consider a cliff-walking problem where the start and the goal are as shown in the figure:
- the reward for every move is -1
- the reward for a move that ends in the cliff is -100, and the agent gets put back at the start
The Sarsa algorithm will converge to a much safer route that keeps it well away from the cliff, even though that route takes longer.
Q-learning uses the greedy policy in its update and finds the optimal path, which is the shortest one, running right along the edge of the cliff.
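The only difference from the Sarsa update is the bootstrap target: Q-learning uses the best value available in the next state, regardless of which action the policy actually takes. A sketch under the same assumptions as before (dictionary keyed by (state, action), illustrative mu and gamma):

```python
def q_learning_update(Q, s, a, r, s_next, actions, mu=0.1, gamma=0.9):
    """One Q-learning step: bootstrap from the greedy value in s',
    max over a' of Q(s',a'), which makes the algorithm off-policy."""
    best_next = max(Q[s_next, a2] for a2 in actions)
    Q[s, a] += mu * (r + gamma * best_next - Q[s, a])

# Tiny example: the next state already has one good action worth 1.0.
Q = {(s, a): 0.0 for s in range(2) for a in range(2)}
Q[1, 0] = 1.0
q_learning_update(Q, s=0, a=0, r=-1.0, s_next=1, actions=[0, 1])
# target is r + gamma * 1.0 = -0.1, so Q(0,0) moves by mu * (-0.1)
```

Because the max ignores exploratory moves, Q-learning learns the values of the greedy path along the cliff edge, while Sarsa, which pays for its own exploration, prefers the safer detour.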