# on 19-Sep-2019 (Thu)

#### Flashcard 4395227024652

Tags
#regression #statistics #supervised-learning
Question
How does locally weighted regression work? Think of the locally weighted linear regression cost function and setup.

In contrast, the locally weighted linear regression algorithm does the following:

1. Fit θ to minimize $$\sum_{i} w^{(i)}\left(y^{(i)}-\theta^{T} x^{(i)}\right)^{2}$$, where $$w^{(i)}$$ is a weight determined by training example $$x^{(i)}\text{'s}$$ distance to the query point $$x$$, e.g. $$w^{(i)}=\exp \left(-\frac{\left(x^{(i)}-x\right)^{2}}{2 \tau^{2}}\right)$$

2. Output $$\theta^Tx$$
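The two steps above can be sketched in NumPy. This is my own minimal illustration, not from the card: the helper name is hypothetical, and it solves the weighted least-squares fit in closed form via the weighted normal equations.

```python
import numpy as np

def locally_weighted_regression(X, y, x_query, tau=0.5):
    """Illustrative sketch of the two steps on the card:
    1) fit theta minimising sum_i w_i (y_i - theta^T x_i)^2,
    2) output theta^T x_query.
    """
    # Gaussian kernel weights: examples near the query dominate the fit.
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# Usage: noiseless line y = 2x, with an intercept column prepended.
X = np.column_stack([np.ones(20), np.linspace(0.0, 1.0, 20)])
y = 2 * X[:, 1]
pred = locally_weighted_regression(X, y, np.array([1.0, 0.5]))  # ~ 1.0
```

Because the weights depend on the query point, θ has to be re-fit for every prediction, which is why locally weighted regression is a non-parametric method.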

status measured difficulty not learned 37% [default] 0

#### pdf

cannot see any pdfs

#### Annotation 4395229383948

 #anki #learning #spaced_repitition I began with the AlphaGo paper itself. I began reading it quickly, almost skimming. I wasn't looking for a comprehensive understanding. Rather, I was doing two things. One, I was trying to simply identify the most important ideas in the paper. What were the names of the key techniques I'd need to learn about? Second, there was a kind of hoovering process, looking for basic facts that I could understand easily, and that would obviously benefit me. Things like basic terminology, the rules of Go, and so on.

Augmenting Long-term Memory

#### Annotation 4395231481100

 #anki #learning #spaced_repitition By contrast, had I used conventional note-taking in my original reading of the AlphaGo paper, my understanding would have more rapidly evaporated, and it would have taken longer to read the later papers. And so using Anki in this way gives confidence you will retain understanding over the long term. This confidence, in turn, makes the initial act of understanding more pleasurable, since you believe you're learning something for the long haul, not something you'll forget in a day or a week.

Augmenting Long-term Memory

#### Annotation 4395233053964

 #anki #learning #spaced_repitition This doesn't mean reading every word in the paper. Rather, I'll add to Anki questions about the core claims, core questions, and core ideas of the paper. It's particularly helpful to extract Anki questions from the abstract, introduction, conclusion, figures, and figure captions. Typically I will extract anywhere from 5 to 20 Anki questions from the paper. It's usually a bad idea to extract fewer than 5 questions – doing so tends to leave the paper as a kind of isolated orphan in my memory. Later I find it difficult to feel much connection to those questions. Put another way: if a paper is so uninteresting that it's not possible to add 5 good questions about it, it's usually better to add no questions at all.

Augmenting Long-term Memory

#### Annotation 4395234626828

 #anki #learning #spaced_repitition I said above that I typically spend 10 to 60 minutes Ankifying a paper, with the duration depending on my judgment of the value I'm getting from the paper. However, if I'm learning a great deal, and finding it interesting, I keep reading and Ankifying. Really good resources are worth investing time in. But most papers don't fit this pattern, and you quickly saturate. If you feel you could easily find something more rewarding to read, switch over. It's worth deliberately practicing such switches, to avoid building a counter-productive habit of completionism in your reading. It's nearly always possible to read deeper into a paper, but that doesn't mean you can't easily be getting more value elsewhere. It's a failure mode to spend too long reading unimportant papers.

Augmenting Long-term Memory

#### Annotation 4395236199692

 #anki #learning #spaced_repitition You might suppose the foundation would be a shallow read of a large number of papers. In fact, to really grok an unfamiliar field, you need to engage deeply with key papers – papers like the AlphaGo paper. What you get from deep engagement with important papers is more significant than any single fact or technique: you get a sense for what a powerful result in the field looks like. It helps you imbibe the healthiest norms and standards of the field. It helps you internalize how to ask good questions in the field, and how to put techniques together. You begin to understand what made something like AlphaGo a breakthrough – and also its limitations, and the sense in which it was really a natural evolution of the field. Such things aren't captured individually by any single Anki question. But they begin to be captured collectively by the questions one asks when engaged deeply enough with key papers.

Augmenting Long-term Memory

#### Annotation 4395237772556

Augmenting Long-term Memory

#### Flashcard 4395239345420

Tags
#reinforcement-learning
Question
How does IMPALA achieve stable distributed learning at high throughput?
By combining decoupled acting and learning with a novel off-policy correction method called V-trace.

status measured difficulty not learned 37% [default] 0

#### pdf

cannot see any pdfs

#### Flashcard 4395241704716

Tags
#reinforcement-learning
Question
What does the IMPALA architecture allow?
Distributed acting and learning that uses resources more efficiently in single-machine training but also scales to thousands of machines without sacrificing data efficiency or resource utilisation.

status measured difficulty not learned 37% [default] 0

#### pdf

cannot see any pdfs

#### Flashcard 4395244064012

Tags
#reinforcement-learning
Question
In its paper, what was IMPALA used for?
Distributing training in the multi-task RL setting.

status measured difficulty not learned 37% [default] 0

#### pdf

cannot see any pdfs

#### Flashcard 4395246423308

Question
Why does a distributed training architecture such as IMPALA need an off-policy correction?
Because the policy used to generate a trajectory (a worker's policy) can lag behind the policy on the learner by several updates at the time of gradient calculation.
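A toy sketch of that lag (all names hypothetical, my own illustration): the learner keeps applying updates while trajectories wait in the queue, so a trajectory's behaviour policy can be several versions behind the learner's policy by gradient time.

```python
from collections import deque

learner_version = 0   # how many gradient updates the learner has applied
queue = deque()       # trajectories in flight between actors and learner

def actor_step():
    # Actor retrieves the latest parameters, then generates a trajectory
    # under that (soon-to-be-stale) behaviour policy.
    behaviour_version = learner_version
    trajectory = [("obs", "action", "reward")] * 5
    queue.append((behaviour_version, trajectory))

def learner_step():
    global learner_version
    behaviour_version, trajectory = queue.popleft()
    # Number of updates the behaviour policy lags the target policy by;
    # a nonzero value is what V-trace must correct for.
    staleness = learner_version - behaviour_version
    learner_version += 1  # apply one gradient update
    return staleness

actor_step()                 # trajectory generated under version 0
learner_step()               # consumed at version 0 -> staleness 0
actor_step()
actor_step()                 # two trajectories generated under version 1
learner_step()               # staleness 0; learner moves to version 2
staleness = learner_step()   # consumed at version 2 -> staleness 1
```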

status measured difficulty not learned 37% [default] 0

#### pdf

cannot see any pdfs

#### Flashcard 4395248782604

Tags
#reinforcement-learning
Question
Aside from achieving higher data throughput, in what other ways does the IMPALA training architecture outperform A3C?
Crucially, IMPALA is also more data efficient than A3C based agents and more robust to hyperparameter values and network architectures, allowing it to make better use of deeper neural networks.

status measured difficulty not learned 37% [default] 0

#### pdf

cannot see any pdfs

#### Flashcard 4395251141900

Tags
#has-images #reinforcement-learning
Question
Reproduce (by visualising mentally or by drawing) the layout of learners and trainers in the IMPALA architecture.

Figure 1. Left: Single Learner. Each actor generates trajectories and sends them via a queue to the learner. Before starting the next trajectory, the actor retrieves the latest policy parameters from the learner. Right: Multiple Synchronous Learners. Policy parameters are distributed across multiple learners that work synchronously.

status measured difficulty not learned 37% [default] 0

#### pdf

cannot see any pdfs

#### Flashcard 4395258219788

Tags
#machine-learning
Question
Describe 2 ways in which stale gradients arise when using distributed training of models.
Firstly, the read operation (Algo 1, Line 1) on a worker may be interleaved with updates by other workers to different parameter servers, so the resultant $$\hat{\theta}_k$$ may not be consistent with any parameter incarnation $$\theta(t)$$. Secondly, model updates may have occurred while a worker is computing its stochastic gradient; hence, the resultant gradients are typically computed with respect to outdated parameters.

status measured difficulty not learned 37% [default] 0
Revisiting Distributed Synchronous SGD – arXiv Vanity

#### Flashcard 4395260579084

Question
What is the setup of a synchronous version of distributed stochastic gradient descent?
In synchronous distributed stochastic gradient descent (Sync-SGD), or more generally Synchronous Stochastic Optimization (Sync-Opt), the parameter servers wait for all workers to send their gradients, aggregate them, and send the updated parameters to all workers afterward. This ensures that the actual algorithm is a true mini-batch stochastic gradient descent, with an effective batch size equal to the sum of all the workers' mini-batch sizes.
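A minimal NumPy sketch of that aggregation step (function names are my own): averaging N equal-size mini-batch gradients before the update is the same as taking one gradient step over the pooled batch.

```python
import numpy as np

def sync_sgd_step(theta, worker_batches, grad_fn, lr=0.1):
    # Parameter server waits for a gradient from every worker...
    grads = [grad_fn(theta, X, y) for X, y in worker_batches]
    # ...then aggregates them and applies a single update.
    return theta - lr * np.mean(grads, axis=0)

def grad_fn(theta, X, y):
    # Mean-squared-error gradient for a linear model.
    return 2 * X.T @ (X @ theta - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))
y = X @ np.array([1.0, -2.0, 0.5])
# Split the data across 4 "workers" of 8 examples each.
batches = [(X[i:i + 8], y[i:i + 8]) for i in range(0, 32, 8)]
theta = np.zeros(3)
distributed = sync_sgd_step(theta, batches, grad_fn)
# Equivalent single full-batch step, effective batch size 4 * 8 = 32.
full_batch = theta - 0.1 * grad_fn(theta, X, y)
```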

status measured difficulty not learned 37% [default] 0
Revisiting Distributed Synchronous SGD – arXiv Vanity

#### Flashcard 4395262938380

Tags
#machine-learning
Question
What is the straggler problem of the distributed training algorithm Sync-SGD?
While this synchronous approach solves the staleness problem, it also introduces the potential problem that the actual update time now depends on the slowest worker.

status measured difficulty not learned 37% [default] 0
Revisiting Distributed Synchronous SGD – arXiv Vanity

#### Flashcard 4395265297676

Tags
#machine-learning
Question
In Sync-SGD from the paper 'Revisiting Synchronous SGD', how is the straggler problem overcome?
To alleviate the straggler problem, we introduce backup workers (tail-at-scale) as follows: instead of having only N workers, we add b extra workers, but as soon as the parameter servers receive gradients from any N workers, they stop waiting and update their parameters using the N gradients. The slowest b workers' gradients will be dropped when they arrive.
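A toy sketch of the backup-worker rule (timings and names are hypothetical): with N + b workers, the server uses only the N gradients that arrive first and drops the b stragglers'.

```python
import heapq

def first_n_gradients(arrivals, n):
    """Keep only the n gradients with the earliest arrival times."""
    fastest = heapq.nsmallest(n, arrivals)  # sorted by (time, gradient)
    return [grad for _, grad in fastest]

# N = 4 regular workers plus b = 2 backups; worker "g3" is a slow straggler.
arrivals = [(1.0, "g0"), (1.2, "g1"), (0.9, "g2"), (9.0, "g3"),
            (1.1, "g4"), (1.3, "g5")]
used = first_n_gradients(arrivals, n=4)
# The update proceeds with the 4 fastest; "g3" and "g5" are dropped.
```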

status measured difficulty not learned 37% [default] 0
Revisiting Distributed Synchronous SGD – arXiv Vanity

#### Flashcard 4395269491980

Tags
#reinforcement-learning
Question
How does the IMPALA paper propose to learn off-policy?
They introduce a novel off-policy actor-critic algorithm for the learner, called V-trace.
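For context, a sketch of the n-steps V-trace target, reproduced from memory of the IMPALA paper (notation may differ slightly from the paper's; $$\mu$$ is the behaviour policy, $$\pi$$ the target policy):

$$v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{t-s}\left(\prod_{i=s}^{t-1} c_i\right)\delta_t V, \qquad \delta_t V = \rho_t\left(r_t + \gamma V(x_{t+1}) - V(x_t)\right)$$

with truncated importance sampling weights $$\rho_t=\min\left(\bar{\rho}, \frac{\pi(a_t|x_t)}{\mu(a_t|x_t)}\right)$$ and $$c_i=\min\left(\bar{c}, \frac{\pi(a_i|x_i)}{\mu(a_i|x_i)}\right)$$, where $$\bar{\rho} \geq \bar{c}$$. The truncation level $$\bar{\rho}$$ controls which value function the target converges to: at $$\bar{\rho}=\infty$$ the fixed point is $$V^{\pi}$$, and lowering $$\bar{\rho}$$ moves it towards $$V^{\mu}$$.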

status measured difficulty not learned 37% [default] 0

#### pdf

cannot see any pdfs

#### Flashcard 4395271851276

Tags
#reinforcement-learning
Question
How does the V-trace target used in the IMPALA distributed training architecture optimise for a fixed point (in value function space) that can be interpolated between the behaviour and target value functions?