
Linear Learning with AllReduce

John Langford (with help from many)

NYU Large Scale Learning Class, February 19, 2013


Applying for a fellowship in 1997

Interviewer: So, what do you want to do?

John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!

The worst part: he had a point.
Terascale Linear Learning ACDL11

Given 2.1 terafeatures of data, how can you learn a good linear predictor $f_w(x) = \sum_i w_i x_i$?

17B examples
16M parameters
1K nodes
How long does it take?

70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine, and hence faster than all possible single-machine linear learning algorithms.
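
For concreteness, the prediction itself is just a sparse dot product over whichever features are present. A minimal Python sketch (illustrative only, not VW's code), assuming an example is represented as a dict from feature index to value:

```python
def predict(weights, example):
    """Linear predictor f_w(x) = sum_i w_i * x_i over sparse features.

    weights: indexable array of parameters w
    example: dict mapping feature index i -> feature value x_i (sparse x)
    """
    return sum(weights[i] * x_i for i, x_i in example.items())
```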
MPI-style AllReduce

AllReduce computes a global sum and leaves the result at every node. Example with seven nodes holding the values 7, 5, 6, 1, 2, 3, 4:

1 Create a binary tree over the nodes: root 7, internal nodes 5 and 6, leaves 1, 2, 3, 4.
2 Reducing, step 1: each internal node adds the values received from its children: 5 + 1 + 2 = 8 and 6 + 3 + 4 = 13.
3 Reducing, step 2: the root adds its children's partial sums: 7 + 8 + 13 = 28.
4 Broadcast: the total 28 is passed back down the tree.
5 Allreduce final state: every node holds 28.

AllReduce = Reduce + Broadcast

Properties:
1 Easily pipelined, so no latency concerns.
2 Bandwidth ≤ 6n.
3 No need to rewrite code!
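
As a concrete illustration, here is a toy single-process simulation of the reduce-then-broadcast pattern on the seven-node tree above (a sketch only, not the MPI or VW implementation):

```python
class Node:
    def __init__(self, value, children=()):
        self.value = value            # this node's local contribution
        self.children = list(children)
        self.total = None             # filled in by allreduce_sum

def allreduce_sum(root):
    def reduce_up(node):
        # Reduce: each node sums its own value with its children's partial sums.
        return node.value + sum(reduce_up(c) for c in node.children)

    def broadcast_down(node, total):
        # Broadcast: the global total is pushed back down the tree.
        node.total = total
        for c in node.children:
            broadcast_down(c, total)

    broadcast_down(root, reduce_up(root))

# The example from the slides: root 7, internal nodes 5 and 6, leaves 1-4.
tree = Node(7, [Node(5, [Node(1), Node(2)]), Node(6, [Node(3), Node(4)])])
allreduce_sum(tree)
assert tree.total == 28   # every node now holds the global sum 28
```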
An Example Algorithm: Weight averaging

n = AllReduce(1)
While (pass number < max)
  1 While (examples left): do online update.
  2 AllReduce(weights)
  3 For each weight: w ← w/n (see the sketch below)

Other algorithms implemented:
1 Nonuniform averaging for online learning
2 Conjugate gradient
3 L-BFGS
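
A minimal sketch of the averaging loop, assuming a hypothetical allreduce(vector) callable that sums a numpy array elementwise across all nodes (the real implementation lives in VW's C++ allreduce library):

```python
import numpy as np

def averaged_online_learning(examples, dim, max_passes, allreduce, learning_rate=0.1):
    """Weight averaging: each node trains online on its own shard of examples,
    then weights are summed via allreduce and divided by the node count.

    examples: list of (x, y) pairs, x a sparse dict {feature index: value}
    allreduce: hypothetical primitive summing an array across all nodes
    """
    w = np.zeros(dim)
    n_nodes = allreduce(np.ones(1))[0]          # n = AllReduce(1)
    for _ in range(max_passes):
        for x, y in examples:
            pred = sum(w[i] * v for i, v in x.items())
            for i, v in x.items():              # squared-loss online update
                w[i] -= learning_rate * (pred - y) * v
        w = allreduce(w) / n_nodes              # average weights across nodes
    return w
```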
What is Hadoop AllReduce?

1 Map job moves the program to the data.
2 Delayed initialization: Most failures are disk failures. First read (and cache) all data before initializing allreduce. Failures autorestart on a different node with identical data (sketched below).
3 Speculative execution: In a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers. We use the first to finish reading all data once.

The net effect: reliable execution out to perhaps 10K node-hours.
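
A rough sketch of the resulting mapper control flow (illustrative only, not the actual VW/Hadoop integration; the three callables are hypothetical stand-ins for the real I/O, networking, and optimization code):

```python
def run_mapper(read_and_cache, connect_allreduce, train):
    """Delayed initialization, sketched: most failures are disk failures, so all
    input is read and cached *before* joining the allreduce network. A failed
    mapper is then simply restarted by Hadoop on another node with the same data.
    """
    examples = read_and_cache()        # step 1: read and cache the input split
    net = connect_allreduce()          # step 2: only now initialize allreduce
    return train(examples, net)        # step 3: run the synchronized passes
```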
Approach Used

1 Optimize hard so few data passes are required.
  1 Normalized, adaptive, safe, online gradient descent.
  2 L-BFGS = a batch algorithm that approximates the inverse Hessian.
  3 Use (1) to warmstart (2).
2 Use map-only Hadoop for process control and error recovery.
3 Use AllReduce to sync state.
4 Always save input examples in a cache file to speed later passes.
5 Use the hashing trick to reduce input complexity (sketched below).

In Vowpal Wabbit. Allreduce is a separate, easily linked library.
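
The hashing trick maps raw feature names directly into a fixed-size weight vector, so no feature dictionary has to be built or synchronized across nodes. A minimal sketch (illustrative only; zlib.crc32 here is a stand-in, whereas VW uses a murmurhash-based hash):

```python
import zlib

def hashed_features(named_features, num_bits=18):
    """Map {feature_name: value} into a sparse vector over 2**num_bits indices.

    Collisions are simply tolerated; this keeps the parameter vector a fixed
    size regardless of the raw feature vocabulary.
    """
    mask = (1 << num_bits) - 1
    hashed = {}
    for name, value in named_features.items():
        index = zlib.crc32(name.encode("utf-8")) & mask   # stand-in hash function
        hashed[index] = hashed.get(index, 0.0) + value    # sum colliding features
    return hashed

# Example: a text-style example with categorical features.
print(hashed_features({"word=terascale": 1.0, "word=learning": 1.0, "bias": 1.0}))
```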
Robustness & Speedup

[Figure: "Speed per method" — speedup vs. number of nodes (10 to 100), showing Average_10, Min_10, and Max_10 curves against the linear-speedup reference line.]
Splice Site Recognition

[Figure: auPRC vs. iteration (0 to 50), comparing Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS.]
Splice Site Recognition

[Figure: auPRC vs. effective number of passes over the data (0 to 20), comparing L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.]
Bibliography: VW & Algs

Caching: L. Bottou, Stochastic Gradient Descent Examples on Toy Problems, http://leon.bottou.org/projects/sgd, 2007.
Release: Vowpal Wabbit open source project, http://github.com/JohnLangford/vowpal_wabbit/wiki, 2007.
L-BFGS: J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35:773-782, 1980.
Adaptive: H. B. McMahan and M. Streeter, Adaptive Bound Optimization for Online Convex Optimization, COLT 2010.
Adaptive: J. Duchi, E. Hazan, and Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, COLT 2010.
Safe: N. Karampatziakis and J. Langford, Online Importance Weight Aware Updates, UAI 2011.
Bibliography: Parallel

grad sum: C. Teo, Q. Le, A. Smola, and V. Vishwanathan, A Scalable Modular Convex Solver for Regularized Risk Minimization, KDD 2007.
avg.: G. Mann et al., Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models, NIPS 2009.
ov. avg.: M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized Stochastic Gradient Descent, NIPS 2010.
P. online: D. Hsu, N. Karampatziakis, J. Langford, and A. Smola, Parallel Online Learning, in SUML 2010.
D. Mini: O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, Optimal Distributed Online Prediction Using Mini-Batches, http://arxiv.org/abs/1012.1367.
Tera: A. Agarwal, O. Chapelle, M. Dudik, and J. Langford, A Reliable Effective Terascale Linear Learning System, http://arxiv.org/abs/1110.4198.
