Algorithm 1 Learn USFA with ε-greedy Q-learning
Require: ε, training tasks M, distribution D_z over ℝ^d, number of policies n_z
1: select initial state s ∈ S
2: for n_s steps do
3:   sample w uniformly at random from M
4:   {sample policies, possibly based on current task}
5:   for i ← 1, 2, ..., n_z do z_i ∼ D_z(·|w)
6:   if Bernoulli(ε) = 1 then a ← Uniform(A)
7:   else a ← [GPI]
8:   Execute action a and observe φ and s′
9:   for i ← 1, 2, ..., n_z do {Update ψ̃}
10:    a′ ← [\(a' \equiv \pi_i(s')\)]
11:    \(\boldsymbol{\theta} \xleftarrow{\alpha} \left[\boldsymbol{\phi} + \gamma \tilde{\boldsymbol{\psi}}\left(s', a', \mathbf{z}_{i}\right) - \tilde{\boldsymbol{\psi}}\left(s, a, \mathbf{z}_{i}\right)\right] \nabla_{\boldsymbol{\theta}} \tilde{\boldsymbol{\psi}}\)
12:  s ← s′
13: return θ
\(\operatorname{argmax}_{b} \max _{i} \tilde{\boldsymbol{\psi}}\left(s, b, \mathbf{z}_{i}\right)^{\top} \mathbf{w}\)
\(\operatorname{argmax}_{b}\tilde{\boldsymbol{\psi}}\left(s', b, \mathbf{z}_{i}\right)^{\top} \mathbf{z}_{i}\)
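
The two hidden expressions and the update in step 11 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the parameterization (ψ̃ linear in the policy embedding z, one weight block per discrete state-action pair), the Gaussian choice for D_z(·|w), the step size `alpha`, and all names (`psi`, `gpi_action`, `greedy_action`, `td_update`, `learn_usfa`, `toy_step`) are hypothetical.

```python
import numpy as np

def psi(theta, s, z):
    """Approximate successor features psi~(s, ., z) for all actions: (n_actions, d)."""
    x = np.append(z, 1.0)   # assumed input: policy embedding z plus a bias term
    return theta[s] @ x     # (n_actions, d, d+1) @ (d+1,) -> (n_actions, d)

def gpi_action(theta, s, zs, w):
    """[GPI]: argmax_b max_i psi~(s, b, z_i)^T w."""
    q = np.stack([psi(theta, s, z) @ w for z in zs])  # (n_z, n_actions)
    return int(q.max(axis=0).argmax())

def greedy_action(theta, s, z):
    """pi_i(s) = argmax_b psi~(s, b, z_i)^T z_i."""
    return int((psi(theta, s, z) @ z).argmax())

def td_update(theta, s, a, phi, s_next, z, gamma, alpha):
    """Steps 10-11 for one z_i: semi-gradient TD step on theta, in place."""
    a_next = greedy_action(theta, s_next, z)  # a' = pi_i(s')
    delta = phi + gamma * psi(theta, s_next, z)[a_next] - psi(theta, s, z)[a]
    # For this linear model, delta^T (d psi~ / d theta[s, a]) = outer(delta, x).
    theta[s, a] += alpha * np.outer(delta, np.append(z, 1.0))
    return delta

def learn_usfa(step, n_states, n_actions, tasks, n_steps,
               n_z=3, eps=0.1, gamma=0.9, alpha=0.1, sigma=0.1, seed=0):
    """Algorithm 1 for a discrete environment given by step(s, a) -> (phi, s')."""
    rng = np.random.default_rng(seed)
    d = tasks.shape[1]
    theta = np.zeros((n_states, n_actions, d, d + 1))
    s = 0                                                # initial state
    for _ in range(n_steps):
        w = tasks[rng.integers(len(tasks))]              # task sampled from M
        zs = w + sigma * rng.standard_normal((n_z, d))   # assumed D_z(.|w) = N(w, sigma^2 I)
        if rng.random() < eps:
            a = int(rng.integers(n_actions))             # explore
        else:
            a = gpi_action(theta, s, zs, w)              # exploit via GPI
        phi, s_next = step(s, a)
        for z in zs:
            td_update(theta, s, a, phi, s_next, z, gamma, alpha)
        s = s_next
    return theta

def toy_step(s, a):
    """Hypothetical 2-state environment: action a leads to state a; phi is one-hot of s'."""
    s_next = a
    return np.eye(2)[s_next], s_next

if __name__ == "__main__":
    theta = learn_usfa(toy_step, n_states=2, n_actions=2, tasks=np.eye(2), n_steps=2000)
    print(gpi_action(theta, 0, [np.array([1.0, 0.0])], np.array([1.0, 0.0])))
```

Note that the behaviour action maximizes over the sampled policies with respect to the task vector w, whereas the bootstrap action a′ is greedy for each z_i with respect to z_i itself, so every sampled policy is evaluated on the same transition.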