Linear Learning With Allreduce: John Langford (With Help From Many)
Terascale Linear Learning ACDL11
17B Examples
16M parameters
1K nodes
How long does it take?
MPI-style AllReduce

Initial state: a binary tree of seven nodes; the root holds 7, its two children hold 5 and 6, and the four leaves hold 1, 2, 3, and 4.
Final state: every node holds the sum of all the values, 28.

Reducing, step 1: the leaves send their values to their parents, which now hold the partial sums 8 and 13.
Reducing, step 2: the interior nodes send their partial sums to the root, which now holds the total, 28.
Broadcast, step 1: the root sends 28 back down to the interior nodes.
Finally, the interior nodes forward 28 to the leaves, so every node ends up with 28.

AllReduce = Reduce + Broadcast
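To make the tree picture concrete, here is a minimal single-process sketch of a summing allreduce over a binary tree. It is my own illustration, not the MPI or Hadoop implementation: partial sums flow up to the root, then the total flows back down.

    # Minimal sketch of a tree allreduce (sum), simulated in one process.
    # Node 0 is the root; node i has children 2*i+1 and 2*i+2 (binary-heap layout).
    def allreduce_sum(values):
        n = len(values)
        totals = list(values)
        # Reduce: process children before parents so each parent adds complete sums.
        for i in reversed(range(n)):
            for child in (2 * i + 1, 2 * i + 2):
                if child < n:
                    totals[i] += totals[child]
        # Broadcast: the root's total flows back down, parents before children.
        for i in range(n):
            for child in (2 * i + 1, 2 * i + 2):
                if child < n:
                    totals[child] = totals[i]
        return totals

    # The tree above: root 7, interior nodes 5 and 6, leaves 1, 2, 3, 4.
    print(allreduce_sum([7, 5, 6, 1, 2, 3, 4]))  # -> [28, 28, 28, 28, 28, 28, 28]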
Properties:
1. Easily pipelined, so no latency concerns.
2. Bandwidth ≤ 6n (see the accounting just below).
3. No need to rewrite code!
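One way to read property 2 (an assumption on my part, for a binary tree and a length-n vector): in the reduce phase an interior node receives a vector from each of its two children and sends one partial sum to its parent; in the broadcast phase it receives the result once and forwards it to both children. The traffic through any single node is therefore at most

    2n + n + n + 2n = 6n

with less at the root and at the leaves.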
An Example Algorithm: Weight averaging

n = AllReduce(1)
While (pass number < max)
  1. While (examples left)
       Do online update.
  2. AllReduce(weights)
  3. For each weight w: w <- w / n
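A compact sketch of this loop, simulated for four nodes in one process. numpy, the squared-loss update, and the helper names are my own choices for illustration and not Vowpal Wabbit's API.

    # Weight averaging with an allreduce, simulated for K "nodes" in one process.
    import numpy as np

    def allreduce(vectors):
        # Stand-in for AllReduce: every node receives the elementwise sum.
        total = np.sum(vectors, axis=0)
        return [total.copy() for _ in vectors]

    def online_pass(w, examples, lr=0.05):
        # One pass of squared-loss SGD over this node's shard of examples.
        for x, y in examples:
            w -= lr * (w @ x - y) * x
        return w

    rng = np.random.default_rng(0)
    K, d = 4, 10                                    # nodes, weight dimension
    true_w = rng.normal(size=d)
    shards = [[(x, x @ true_w) for x in rng.normal(size=(50, d))] for _ in range(K)]

    weights = [np.zeros(d) for _ in range(K)]
    n = allreduce([np.ones(1) for _ in range(K)])[0][0]     # n = AllReduce(1)
    for _ in range(5):                                      # while pass number < max
        weights = [online_pass(w, s) for w, s in zip(weights, shards)]
        weights = allreduce(weights)                        # AllReduce(weights)
        weights = [w / n for w in weights]                  # w <- w / n

    print(np.linalg.norm(weights[0] - true_w))              # error of the averaged weights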
[Diagram: four Program/Data pairs, one per node.]
[Plot: y-axis from 0 to 6 versus Nodes from 10 to 100.]
Splice Site Recognition
[Plot: auPRC (0.2 to 0.55) versus iteration (0 to 50) for Online, LBFGS w/ 5 online passes, LBFGS w/ 1 online pass, and LBFGS.]
Splice Site Recognition
[Plot: auPRC from 0.2 to 0.6.]