What is the immediate reward in value iteration?
31-10-2019
Question
Suppose you're given an MDP where rewards are attributed for reaching a state, independently of the action. Then when doing value iteration:
$$ V_{i+1} = \max_a \sum_{s'} P_a(s,s') (R_a(s,s') + \gamma V_i(s'))$$
what is $R_a(s,s')$ ?
The problem I'm having is that terminal states have, by default, $V(s_T) = R(s_T)$ (some terminal reward). When I implement value iteration and set $R_a(s,s')$ to be $R(s')$ (which is what I assumed it meant), states neighboring a terminal state end up with a higher value than the terminal state itself, since
$$ P_a(s,s_T) ( R_a(s,s_T) + \gamma V_i(s_T) ) $$
can easily exceed $V_i(s_T)$, which makes no sense in practice. So the only conclusion I can reach is that in my case $R_a(s,s') = R(s)$. Is this correct?
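To make the discrepancy concrete, here is a minimal sketch on a hypothetical deterministic 3-state chain (`0 -> 1 -> T`, all names and rewards invented for illustration), comparing the two candidate conventions $R_a(s,s') = R(s')$ versus $R_a(s,s') = R(s)$:

```python
# Hypothetical 3-state chain: 0 -> 1 -> T (terminal), one deterministic
# action per state, gamma = 0.9. Reward is attributed to each state.
GAMMA = 0.9
R = {0: 0.0, 1: 0.0, "T": 10.0}   # reward attached to each state
NEXT = {0: 1, 1: "T"}             # deterministic successor of each non-terminal state

def value_iteration(reward_of, sweeps=50):
    """Run value iteration; reward_of(s, s_next) selects the reward convention."""
    V = {0: 0.0, 1: 0.0, "T": R["T"]}  # terminal value fixed at its reward
    for _ in range(sweeps):
        for s, s_next in NEXT.items():
            V[s] = reward_of(s, s_next) + GAMMA * V[s_next]
    return V

# Convention 1: reward paid on *arriving in* s'. The terminal reward is
# collected on the transition into T *and* again via V(T), so
# V(1) = R(T) + gamma * V(T) = 19 > V(T) = 10: the double counting described above.
V_arrive = value_iteration(lambda s, sn: R[sn])

# Convention 2: reward paid for *being in* s. No double counting:
# V(1) = R(1) + gamma * V(T) = 9 < V(T) = 10.
V_depart = value_iteration(lambda s, sn: R[s])

print(V_arrive[1], V_arrive["T"])  # 19.0 10.0
print(V_depart[1], V_depart["T"])  # 9.0 10.0
```

The alternative fix, under convention 1, would be to keep $R_a(s,s') = R(s')$ but set $V(s_T) = 0$, since the terminal reward is then already paid on the transition into $s_T$; either way the reward must be counted exactly once.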
No accepted solution