#computer-science #machine-learning #reinforcement-learning
In off-policy learning, we reweight the state transitions using importance sampling so that they become appropriate for learning about the target policy, but the state distribution is still that of the behavior policy. There is a mismatch. A natural idea is to somehow reweight the states, emphasizing some and de-emphasizing others, so as to return the distribution of updates to the on-policy distribution. There would then be a match, and stability and convergence would follow from existing results.
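A minimal sketch of the mismatch, assuming a hypothetical two-state, two-action MDP whose transition model is fully known (action `a` always leads to state `a`). The names `next_state`, `b`, `pi`, `rho`, and `state_weight` are illustrative and do not come from the note. The per-transition ratio `rho` corrects the action choice, while the per-state ratio `state_weight` is the extra reweighting needed to turn the behavior policy's state distribution into the on-policy one. In practice the on-policy distribution is not known, so this only illustrates the idea and is not a learning algorithm.

```python
def transition_matrix(policy, next_state):
    """State-to-state transition probabilities induced by a policy."""
    n = len(policy)
    P = [[0.0] * n for _ in range(n)]
    for s in range(n):
        for a, prob in enumerate(policy[s]):
            P[s][next_state[s][a]] += prob
    return P


def stationary_distribution(P, iters=200):
    """Approximate the stationary state distribution by power iteration."""
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[s] * P[s][t] for s in range(n)) for t in range(n)]
    return d


# Hypothetical deterministic dynamics: next_state[s][a] is the successor state.
next_state = [[0, 1], [0, 1]]

# Behavior policy b and target policy pi, indexed as policy[s][a].
b = [[0.5, 0.5], [0.5, 0.5]]
pi = [[0.1, 0.9], [0.1, 0.9]]

d_b = stationary_distribution(transition_matrix(b, next_state))
d_pi = stationary_distribution(transition_matrix(pi, next_state))

# Importance sampling ratio for each transition: fixes the action probabilities only.
rho = [[pi[s][a] / b[s][a] for a in range(2)] for s in range(2)]

# State reweighting: emphasizes or de-emphasizes states to recover d_pi from d_b.
state_weight = [d_pi[s] / d_b[s] for s in range(2)]
reweighted = [d_b[s] * state_weight[s] for s in range(2)]

print("behavior state distribution:", d_b)
print("on-policy state distribution:", d_pi)
print("state weights:", state_weight)
```

Here the behavior policy visits both states equally often, whereas the target policy would spend most of its time in state 1, so updates made from behavior data are distributed differently from on-policy updates even after every transition is corrected by `rho`.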