#computer-science #machine-learning #reinforcement-learning
A simple complete example of divergence is Baird's counterexample. Consider the episodic seven-state, two-action MDP shown in Figure 11.1. The dashed action takes the system to one of the six upper states with equal probability, whereas the solid action takes the system to the seventh state. The behavior policy b selects the dashed and solid actions with probabilities 6/7 and 1/7, so that the next-state distribution under it is uniform (the same for all nonterminal states), which is also the starting distribution for each episode. The target policy π always takes the solid action, and so the on-policy distribution (for π) is concentrated in the seventh state. The reward is zero on all transitions. The discount rate is γ = 0.99. Consider estimating the state-value under the linear parameterization indicated by the expression shown in each state circle. For example, the estimated value of the leftmost state is 2w₁ + w₈, where the subscript corresponds to the component of the overall weight vector w ∈ ℝ⁸.

[Figure 11.1: the six upper states have estimated values 2w₁+w₈, 2w₂+w₈, …, 2w₆+w₈; the seventh state has estimated value w₇+2w₈; b(dashed|·) = 6/7, b(solid|·) = 1/7, π(solid|·) = 1, γ = 0.99.]
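To make the divergence concrete, here is a minimal Python sketch of this setup. It assumes a synchronous expected-update (dynamic-programming style) variant of the semi-gradient update rather than sampled transitions, and the step size, sweep count, and weight initialization are illustrative assumptions; the feature vectors are taken directly from the per-state value expressions above.

```python
# Minimal sketch of Baird's counterexample. Assumptions: step size, sweep
# count, and initialization are illustrative; the 7 states, 8 weights,
# feature vectors, zero rewards, gamma = 0.99, and policies are from the text.
import numpy as np

num_states, num_weights = 7, 8
gamma = 0.99
alpha = 0.01        # assumed step size
num_sweeps = 1000   # assumed number of synchronous sweeps

# Feature vectors matching the expressions in the state circles:
# states 1..6 have estimated value 2*w_i + w_8, state 7 has w_7 + 2*w_8.
X = np.zeros((num_states, num_weights))
for i in range(6):
    X[i, i] = 2.0
    X[i, 7] = 1.0
X[6, 6] = 1.0
X[6, 7] = 2.0

w = np.ones(num_weights)
w[6] = 10.0         # assumed initialization (all ones except the seventh weight)

for _ in range(num_sweeps):
    # Synchronous expected semi-gradient update over all states. The state
    # distribution under b is uniform; only the solid action has nonzero
    # target-policy probability, and b(solid) * rho(solid) = (1/7) * 7 = 1,
    # so the importance weighting cancels in expectation.
    new_w = w.copy()
    for s in range(num_states):
        v_s = X[s] @ w
        v_next = X[6] @ w                    # solid action always leads to the seventh state
        delta = 0.0 + gamma * v_next - v_s   # reward is zero on all transitions
        new_w += alpha * delta * X[s]
    w = new_w

print(w)
```

Running this sketch, the components of w grow in magnitude sweep after sweep instead of converging, which is the divergence the counterexample illustrates.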