
Linear Learning with AllReduce

John Langford (with help from many)

NYU Large Scale Learning Class, February 19, 2013


Applying for a fellowship in 1997

Interviewer: So, what do you want to do?

John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!

The worst part: he had a point.
Terascale Linear Learning ACDL11

Given 2.1 terafeatures of data, how can you learn a good linear predictor $f_w(x) = \sum_i w_i x_i$?

17B examples
16M parameters
1K nodes
How long does it take?

70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine, and hence faster than all possible single-machine linear learning algorithms.
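
For concreteness, the prediction itself is just a sparse dot product over whichever features are present. A minimal Python sketch (illustrative only, not VW's code), assuming an example is represented as a dict from feature index to value:

```python
def predict(weights, example):
    """Linear predictor f_w(x) = sum_i w_i * x_i over sparse features.

    weights: indexable array of parameters w
    example: dict mapping feature index i -> feature value x_i (sparse x)
    """
    return sum(weights[i] * x_i for i, x_i in example.items())
```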
MPI-style AllReduce

AllReduce computes a global sum and leaves the result at every node. Example with seven nodes holding the values 7, 5, 6, 1, 2, 3, 4:

1 Create a binary tree over the nodes: root 7, internal nodes 5 and 6, leaves 1, 2, 3, 4.
2 Reducing, step 1: each internal node adds the values received from its children: 5 + 1 + 2 = 8 and 6 + 3 + 4 = 13.
3 Reducing, step 2: the root adds its children's partial sums: 7 + 8 + 13 = 28.
4 Broadcast: the total 28 is passed back down the tree.
5 Allreduce final state: every node holds 28.

AllReduce = Reduce + Broadcast

Properties:
1 Easily pipelined, so no latency concerns.
2 Bandwidth ≤ 6n.
3 No need to rewrite code!
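
As a concrete illustration, here is a toy single-process simulation of the reduce-then-broadcast pattern on the seven-node tree above (a sketch only, not the MPI or VW implementation):

```python
class Node:
    def __init__(self, value, children=()):
        self.value = value            # this node's local contribution
        self.children = list(children)
        self.total = None             # filled in by allreduce_sum

def allreduce_sum(root):
    def reduce_up(node):
        # Reduce: each node sums its own value with its children's partial sums.
        return node.value + sum(reduce_up(c) for c in node.children)

    def broadcast_down(node, total):
        # Broadcast: the global total is pushed back down the tree.
        node.total = total
        for c in node.children:
            broadcast_down(c, total)

    broadcast_down(root, reduce_up(root))

# The example from the slides: root 7, internal nodes 5 and 6, leaves 1-4.
tree = Node(7, [Node(5, [Node(1), Node(2)]), Node(6, [Node(3), Node(4)])])
allreduce_sum(tree)
assert tree.total == 28   # every node now holds the global sum 28
```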
An Example Algorithm: Weight averaging

n = AllReduce(1)
While (pass number < max)
  1 While (examples left): do online update.
  2 AllReduce(weights)
  3 For each weight: w ← w/n (see the sketch below)

Other algorithms implemented:
1 Nonuniform averaging for online learning
2 Conjugate gradient
3 L-BFGS
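
A minimal sketch of the averaging loop, assuming a hypothetical allreduce(vector) callable that sums a numpy array elementwise across all nodes (the real implementation lives in VW's C++ allreduce library):

```python
import numpy as np

def averaged_online_learning(examples, dim, max_passes, allreduce, learning_rate=0.1):
    """Weight averaging: each node trains online on its own shard of examples,
    then weights are summed via allreduce and divided by the node count.

    examples: list of (x, y) pairs, x a sparse dict {feature index: value}
    allreduce: hypothetical primitive summing an array across all nodes
    """
    w = np.zeros(dim)
    n_nodes = allreduce(np.ones(1))[0]          # n = AllReduce(1)
    for _ in range(max_passes):
        for x, y in examples:
            pred = sum(w[i] * v for i, v in x.items())
            for i, v in x.items():              # squared-loss online update
                w[i] -= learning_rate * (pred - y) * v
        w = allreduce(w) / n_nodes              # average weights across nodes
    return w
```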
What is Hadoop AllReduce?

1 Map job moves the program to the data.
2 Delayed initialization: Most failures are disk failures. First read (and cache) all data before initializing allreduce. Failures autorestart on a different node with identical data (sketched below).
3 Speculative execution: In a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers. We use the first to finish reading all data once.

The net effect: reliable execution out to perhaps 10K node-hours.
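
A rough sketch of the resulting mapper control flow (illustrative only, not the actual VW/Hadoop integration; the three callables are hypothetical stand-ins for the real I/O, networking, and optimization code):

```python
def run_mapper(read_and_cache, connect_allreduce, train):
    """Delayed initialization, sketched: most failures are disk failures, so all
    input is read and cached *before* joining the allreduce network. A failed
    mapper is then simply restarted by Hadoop on another node with the same data.
    """
    examples = read_and_cache()        # step 1: read and cache the input split
    net = connect_allreduce()          # step 2: only now initialize allreduce
    return train(examples, net)        # step 3: run the synchronized passes
```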
Approach Used

1 Optimize hard so few data passes are required.
  1 Normalized, adaptive, safe, online gradient descent.
  2 L-BFGS = a batch algorithm that approximates the inverse Hessian.
  3 Use (1) to warmstart (2).
2 Use map-only Hadoop for process control and error recovery.
3 Use AllReduce to sync state.
4 Always save input examples in a cache file to speed later passes.
5 Use the hashing trick to reduce input complexity (sketched below).

In Vowpal Wabbit. Allreduce is a separate, easily linked library.
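
The hashing trick maps raw feature names directly into a fixed-size weight vector, so no feature dictionary has to be built or synchronized across nodes. A minimal sketch (illustrative only; zlib.crc32 here is a stand-in, whereas VW uses a murmurhash-based hash):

```python
import zlib

def hashed_features(named_features, num_bits=18):
    """Map {feature_name: value} into a sparse vector over 2**num_bits indices.

    Collisions are simply tolerated; this keeps the parameter vector a fixed
    size regardless of the raw feature vocabulary.
    """
    mask = (1 << num_bits) - 1
    hashed = {}
    for name, value in named_features.items():
        index = zlib.crc32(name.encode("utf-8")) & mask   # stand-in hash function
        hashed[index] = hashed.get(index, 0.0) + value    # sum colliding features
    return hashed

# Example: a text-style example with categorical features.
print(hashed_features({"word=terascale": 1.0, "word=learning": 1.0, "bias": 1.0}))
```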
Robustness & Speedup

[Figure: "Speed per method" — speedup vs. number of nodes (10 to 100), showing Average_10, Min_10, and Max_10 curves against the linear-speedup reference line.]
Splice Site Recognition

[Figure: auPRC vs. iteration (0 to 50), comparing Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS.]
Splice Site Recognition

[Figure: auPRC vs. effective number of passes over the data (0 to 20), comparing L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.]
Bibliography: VW & Algs

Caching: L. Bottou, Stochastic Gradient Descent Examples on Toy Problems, http://leon.bottou.org/projects/sgd, 2007.
Release: Vowpal Wabbit open source project, http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.
L-BFGS: J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35:773-782, 1980.
Adaptive: H. B. McMahan and M. Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.
Adaptive: J. Duchi, E. Hazan, and Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, COLT 2010.
Safe: N. Karampatziakis and J. Langford, Online Importance Weight Aware Updates, UAI 2011.
Bibliography: Parallel

grad sum: C. Teo, Q. Le, A. Smola, and V. Vishwanathan, A Scalable Modular Convex Solver for Regularized Risk Minimization, KDD 2007.
avg.: G. Mann et al., Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models, NIPS 2009.
ov. avg.: M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized Stochastic Gradient Descent, NIPS 2010.
P. online: D. Hsu, N. Karampatziakis, J. Langford, and A. Smola, Parallel Online Learning, in SUML 2010.
D. Mini: O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, Optimal Distributed Online Prediction Using Mini-Batches, http://arxiv.org/abs/1012.1367.
Tera: A. Agarwal, O. Chapelle, M. Dudik, and J. Langford, A Reliable Effective Terascale Linear Learning System, http://arxiv.org/abs/1110.4198.
