#computer-science #machine-learning #reinforcement-learning
A slightly better algorithm can be derived by doing a few more analytic steps before substituting in $\mathbf{v}_t$. Continuing from (11.29):

$$
\begin{aligned}
\mathbf{w}_{t+1} &= \mathbf{w}_t + \alpha\,\mathbb{E}\!\left[\rho_t(\mathbf{x}_t - \gamma\mathbf{x}_{t+1})\mathbf{x}_t^\top\right]\mathbb{E}\!\left[\mathbf{x}_t\mathbf{x}_t^\top\right]^{-1}\mathbb{E}\!\left[\rho_t\delta_t\mathbf{x}_t\right] \\
&= \mathbf{w}_t + \alpha\left(\mathbb{E}\!\left[\rho_t\mathbf{x}_t\mathbf{x}_t^\top\right] - \gamma\,\mathbb{E}\!\left[\rho_t\mathbf{x}_{t+1}\mathbf{x}_t^\top\right]\right)\mathbb{E}\!\left[\mathbf{x}_t\mathbf{x}_t^\top\right]^{-1}\mathbb{E}\!\left[\rho_t\delta_t\mathbf{x}_t\right] \\
&= \mathbf{w}_t + \alpha\left(\mathbb{E}\!\left[\mathbf{x}_t\mathbf{x}_t^\top\right] - \gamma\,\mathbb{E}\!\left[\rho_t\mathbf{x}_{t+1}\mathbf{x}_t^\top\right]\right)\mathbb{E}\!\left[\mathbf{x}_t\mathbf{x}_t^\top\right]^{-1}\mathbb{E}\!\left[\rho_t\delta_t\mathbf{x}_t\right] \\
&= \mathbf{w}_t + \alpha\left(\mathbb{E}\!\left[\mathbf{x}_t\rho_t\delta_t\right] - \gamma\,\mathbb{E}\!\left[\rho_t\mathbf{x}_{t+1}\mathbf{x}_t^\top\right]\mathbb{E}\!\left[\mathbf{x}_t\mathbf{x}_t^\top\right]^{-1}\mathbb{E}\!\left[\rho_t\delta_t\mathbf{x}_t\right]\right) \\
&\approx \mathbf{w}_t + \alpha\left(\mathbb{E}\!\left[\mathbf{x}_t\rho_t\delta_t\right] - \gamma\,\mathbb{E}\!\left[\rho_t\mathbf{x}_{t+1}\mathbf{x}_t^\top\right]\mathbf{v}_t\right) &&\text{(based on (11.28))} \\
&\approx \mathbf{w}_t + \alpha\,\rho_t\left(\delta_t\mathbf{x}_t - \gamma\,\mathbf{x}_{t+1}\mathbf{x}_t^\top\mathbf{v}_t\right), &&\text{(sampling)}
\end{aligned}
$$

which again is $O(d)$ if the final product ($\mathbf{x}_t^\top\mathbf{v}_t$) is done first. This algorithm is known as either TD(0) with gradient correction (TDC) or, alternatively, as GTD(0).
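To make the sampled update concrete, here is a minimal sketch of one TDC step with linear function approximation. The function name `tdc_update`, the step sizes `alpha` and `beta`, and the argument layout are illustrative assumptions, not from the source; the `w` update follows the final sampled line above, and the secondary vector `v` is nudged toward the least-squares solution it approximates (cf. (11.28)).

```python
import numpy as np

def tdc_update(w, v, x, x_next, reward, rho, gamma=0.99, alpha=0.01, beta=0.005):
    """One TDC / GTD(0) step for off-policy linear value estimation (sketch).

    w      : main weight vector, shape (d,)
    v      : secondary weight vector approximating E[x x^T]^-1 E[rho delta x], shape (d,)
    x      : feature vector of the current state, shape (d,)
    x_next : feature vector of the next state, shape (d,)
    rho    : importance-sampling ratio pi(A_t|S_t) / b(A_t|S_t)
    """
    # TD error for the current transition
    delta = reward + gamma * np.dot(w, x_next) - np.dot(w, x)

    # Main update: compute the scalar (x^T v) first so the whole step is O(d)
    w = w + alpha * rho * (delta * x - gamma * x_next * np.dot(x, v))

    # Secondary update toward the expected-update solution (assumed form, cf. (11.28))
    v = v + beta * rho * (delta - np.dot(v, x)) * x
    return w, v
```

Doing `np.dot(x, v)` before multiplying by `x_next` is what keeps the correction term a vector-scaled-by-scalar rather than an outer product, which is the source of the $O(d)$ cost noted above.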