Scalable Neural Network


Big Data

Scalable Neural Networks

Prafullata Kiran Auradkar


Department of Computer Science and Engineering
prafullatak@pes.edu
Acknowledgements:
Significant information in the slide deck presented in Unit 4 of the course was created by Prof. Srinivas Katharguppe. I would
like to acknowledge and thank him for the same. I may have supplemented it with content from books and other sources on the Internet,
and I sincerely thank and acknowledge the original authors/publishers, with whom the credit/rights for that material remain.
These slides are intended for classroom presentation only.
Artificial Neural Networks - A Reminiscence of ML

Artificial Neural Networks (ANNs) are multi-layer, fully connected neural networks. They consist of an input layer,
multiple hidden layers, and an output layer. Every node in one layer is connected to every node in the next layer.
We make the network deeper by increasing the number of hidden layers.
Perceptron Training Rule

A perceptron can be thought of as representing a hyperplane decision surface in n-dimensional space. The perceptron
outputs a 1 for instances lying on one side of the hyperplane and a -1 for instances lying on the other side. The
equation for this decision hyperplane is w · x = 0. Perceptrons are capable of separating linearly separable points
in the decision plane.

The weights of the perceptron are updated as follows (see the rule below) -
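The standard perceptron training rule, written here in the usual notation (an assumption about the exact formula the slide figure showed):

    w_i \leftarrow w_i + \Delta w_i, \qquad \Delta w_i = \eta\,(t - o)\,x_i

where t is the target output, o is the perceptron's output, \eta is the learning rate, and x_i is the i-th input value. The weights change only when the perceptron misclassifies an instance (t \neq o).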


Standard vs Stochastic Gradient Descent

● In standard gradient descent, the error is summed over all training examples before the weights are updated,
whereas in stochastic gradient descent the weights are updated upon examining each individual training example.
(A small sketch contrasting the two appears below.)
● In cases where there are multiple local minima with respect to E(w), stochastic gradient descent can sometimes
avoid falling into these local minima, because it uses the various per-example gradients ∇E_d(w), rather than
∇E(w), to guide its search.
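A minimal Python sketch contrasting the two update schemes for a single linear unit trained on the squared error E(w); the function names and the NumPy formulation are illustrative, not taken from the slides.

import numpy as np

def batch_gradient_descent(X, t, eta=0.01, epochs=100):
    # Standard (batch) gradient descent: accumulate the gradient over
    # ALL examples, then apply a single weight update per epoch.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                      # outputs for every example
        grad = -(t - o) @ X            # gradient of E(w) = 1/2 * sum((t - o)^2)
        w -= eta * grad                # one update per pass over the data
    return w

def stochastic_gradient_descent(X, t, eta=0.01, epochs=100):
    # Stochastic gradient descent: update the weights after
    # examining EACH training example.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, t_i in zip(X, t):
            o_i = x_i @ w
            w += eta * (t_i - o_i) * x_i   # per-example update using ∇E_d(w)
    return w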
Multilayer Networks and the Backpropagation Algorithm

The neural network first proceeds through the forward phase, where the current weights are used to compute the
network's outputs. If there are n output nodes, there are n target values to be estimated and n corresponding
estimates.

The overall error is then calculated and backpropagated through the network to modify each weight according to its
contribution to the overall error.

This entire process is iterated until either a fixed number of epochs is reached or the error falls below a
predefined threshold value.

Weight update rule for output layer nodes

Weight update rule for hidden layer nodes
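The standard sigmoid-unit backpropagation updates these two captions refer to, in Mitchell-style notation (an assumption about the formulas shown on the original slide):

    \delta_k = o_k (1 - o_k)(t_k - o_k)                                  for each output unit k
    \delta_h = o_h (1 - o_h) \sum_{k \in \text{outputs}} w_{kh}\,\delta_k    for each hidden unit h
    w_{ji} \leftarrow w_{ji} + \eta\,\delta_j\,x_{ji}

where o is a unit's output, t_k is the target for output unit k, \eta is the learning rate, and x_{ji} is the i-th input to unit j.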


ANN using Map Reduce - Implementation 1 (Zhang)

Let us assume that there are N datanodes in the Map Reduce architecture and P samples to be trained.
The parallel algorithm is reflected in the training method -
● The entire dataset of P samples is split into N batches
● Each batch of samples undergoes complete training on a separate mapper until a certain criterion is reached
● On a given datanode D, the P/N samples are trained in a batch manner – normal forward and backward propagation
and weight updates
● This is done until the error falls below 0.01
● The mappers then send their weight data to a reducer, which averages the weights across all mappers to calculate
the final weights
● The reducer then decides whether or not to perform another iteration of this whole Map Reduce process
ANN using Map Reduce - Implementation 1

Mapper
1. Reads weights from HDFS to initialise the network
2. Reads the samples pertaining to its batch
3. Iteratively trains on its samples until the error falls below 0.01
4. Instantiates a WeightWritable object with the current weights
5. Outputs key-value pairs of <Long, Writable>

Reducer
1. Reads the weights from each mapper and accumulates the values
2. For each weight w_ij, calculates the average of this weight value across all mapper outputs
3. Updates the new weight values
4. Compares the weight differences against the previous iteration
5. If the difference > some predefined threshold, outputs 1 (start a new iteration)
6. Else, outputs 0 (finish training)

A minimal sketch of this mapper/reducer pair follows below.
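A minimal, illustrative Python sketch of this mapper/reducer pair; read_initial_weights, load_batch and train_bpnn are hypothetical stand-ins for the HDFS reads and the local training loop, not code from the paper.

def zhang_mapper(batch_id):
    weights = read_initial_weights()       # hypothetical: reads the shared weights from HDFS
    samples = load_batch(batch_id)         # hypothetical: reads this mapper's batch of samples
    error = float("inf")
    while error > 0.01:                    # local training until the error criterion is met
        weights, error = train_bpnn(weights, samples)   # hypothetical: one forward/backward pass
    yield batch_id, weights                # emitted as <Long, WeightWritable> in the real job

def zhang_reducer(mapper_outputs, previous_weights, threshold=1e-3):
    # Average every weight across all mapper outputs.
    all_weights = [w for _, w in mapper_outputs]
    averaged = [sum(ws) / len(ws) for ws in zip(*all_weights)]
    # Compare against the previous iteration to decide whether to run another MR round.
    max_diff = max(abs(a - b) for a, b in zip(averaged, previous_weights))
    return averaged, (1 if max_diff > threshold else 0)   # 1 = start a new iteration, 0 = done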
ANN using Map Reduce - Implementation 2 (Liu)

Implementing neural networks on extremely large datasets is a computationally difficult task:
● If the training data size is large, algorithm performance drops
● If the classification instance size is large, the algorithm may perform worse

This implementation addresses these problems by parallelising both tasks using Map Reduce over HDFS, improving
efficiency in both the training and the classification phases.

The method uses bootstrapping and ensemble-model techniques in order to combine multiple weak learners, one on each
mapper, into a strong learner.

Classification efficiency is further improved by using a Cascading Model.
ANN using Map Reduce - Implementation 2

Parallelisation of Training Phase


Let there be a mappers that can run tasks in parallel.
● The implementation hence trains a different BPNNs - one on each mapper.
● Bootstrapping of the training data is used to avoid a drop in training accuracy due to too few training
instances in each BPNN.
● Each mapper constructs one BPNN with 3 layers.
● Initialise w, θ ∈ [−1, 1] for each neuron randomly.
● Each BPNN trains on its data in a stochastic, instance-by-instance manner.
● Training completes after all instances in its training set are processed.
ANN using Map Reduce - Implementation 2

Parallelisation of Classification Phase


● Each instance to be classified is input to every BPNN.
● Each BPNN is a weak learner and predicts the output for that instance, which its corresponding mapper emits as a
key-value pair of the form <instance, output>.
● When all mappers have finished, one reducer collects all the mappers' outputs and performs majority voting to
produce the final predicted class.
● In other words, a Map Reduce architecture is used to perform ensemble classification, where multiple weak
learners are combined to form one strong learner. (A small sketch of the voting reducer follows this list.)
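A minimal Python sketch of the majority-voting reducer described above; the (instance, output) pair format follows the slide, while the Counter-based tally is an illustrative assumption.

from collections import Counter

def majority_vote_reducer(mapper_outputs):
    # mapper_outputs: (instance_id, predicted_class) pairs, one per BPNN per instance.
    votes = {}
    for instance_id, predicted_class in mapper_outputs:
        votes.setdefault(instance_id, []).append(predicted_class)
    # The class predicted by the most weak learners wins for each instance.
    return {instance_id: Counter(preds).most_common(1)[0][0]
            for instance_id, preds in votes.items()}

# Example: three BPNNs vote on two instances.
outputs = [("x1", "A"), ("x1", "A"), ("x1", "B"),
           ("x2", "B"), ("x2", "B"), ("x2", "B")]
print(majority_vote_reducer(outputs))   # {'x1': 'A', 'x2': 'B'}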
ANN using Map Reduce - Implementation 2

Cascading Model
The implementation also uses a cascading model in order to improve classification accuracy -
● Let there be cn classes and a mappers grouped into g groups.
● In each iteration, the training data of a certain class is input into the BPNNs of each group, which are trained
on that class.
● Then the entire classification data is input to each of the g groups.
● Any instances that belong to classes the BPNNs have already been trained on will be predicted correctly.
● Any instances from the remaining classes will be predicted incorrectly; these form the error set and are used as
input to the next iteration, until all cn classes have been trained on.
ANN using Map Reduce - Implementation 3 (Liu)

In the realm of artificial neural networks (ANNs), backpropagation neural networks (BPNNs) are the most popular,
and they are known to be capable of approximating complex nonlinear functions with arbitrary precision given a
sufficient number of neurons.
A commonly discussed problem, however, is the computational complexity of the backpropagation algorithm, which may
be alleviated by the use of parallel algorithms.
This paper presents three different Map Reduce based parallel implementations of ANNs to deal with different
data-intensive scenarios -
● MRBPNN 1 - Scenario where the test data to be classified is very large
● MRBPNN 2 - Scenario where the training data is very large
● MRBPNN 3 - Scenario where the number of neurons in the BPNN is very large

In all three scenarios, data is input to the BPNNs in the form ⟨instance_k, target_k, type⟩, where
● instance_k represents the current instance
● target_k represents the desired target class for the current instance
● type indicates whether the instance is a train or test instance (when type = test, the target field is empty)
ANN using Map Reduce - Implementation 3

MRBPNN - 1
This Map Reduce based model is applicable in scenarios where the test data is very large.
● Let there be n mappers; each mapper initialises a BPNN
● Each mapper receives the entire training data as training input but only a subset of the test data as testing
input
● Each mapper stochastically trains on the training instances one by one until all training instances are
processed
● Each mapper then processes the test instances available to it and outputs a key-value pair of the form
<instance_k, o_jm>, where o_jm is the output of the m-th mapper
● The reducer collects all the output key-value pairs from all the mappers, performs majority voting for each key
(instance), and outputs the final classification for each test instance.
ANN using Map Reduce - Implementation 3

MRBPNN - 2
This Map Reduce based model is applicable in scenarios where the training data is very large.
● This model uses balanced bootstrapping to create n bootstrapped sets - one for each of the n mappers (a small
sketch of balanced bootstrapping follows this list)
● This is done because splitting the entire sample set T into n equal parts - one for each mapper - would leave
too few training instances per mapper, reducing classification accuracy
● Each mapper reads its corresponding bootstrapped set from its HDFS file and performs stochastic training on the
instances marked as type = train
● Each mapper then runs the feedforward phase on the instances marked as type = test and produces output in the
form of a key-value pair <instance_k, o_jm>
● The reducer collects the outputs from all mappers and performs majority voting to finally classify each test
instance.
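A minimal sketch of balanced bootstrapping, assuming the common construction in which n copies of the training set are concatenated, shuffled and cut into n equally sized sets so that every instance appears the same number of times overall; the helper is illustrative, not the paper's code.

import random

def balanced_bootstrap(training_set, n, seed=42):
    # Concatenate n copies of the data, shuffle, and cut into n chunks of the original size.
    # Every original instance then appears exactly n times across all chunks,
    # so no mapper is starved of training instances.
    pool = list(training_set) * n
    random.Random(seed).shuffle(pool)
    size = len(training_set)
    return [pool[i * size:(i + 1) * size] for i in range(n)]

# Example: six instances expanded into three balanced bootstrapped sets of size six each.
sets = balanced_bootstrap(["a", "b", "c", "d", "e", "f"], n=3)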
ANN using Map Reduce - Implementation 3

MRBPNN - 3
This Map Reduce based model is applicable in scenarios where a large number of neurons is present in the BPNN.
● In this implementation, the Map Reduce jobs run over a number of iterations - for a network of l layers, l - 1
MR jobs are run
● The feedforward phase runs in all l - 1 iterations, with backpropagation running only in the final iteration
● In each iteration, the mappers read one record from file and generate outputs directed to some reducer k
● These k reducers in turn specify which mapper k' reads their output in the next iteration
● The above steps keep looping until the last round, in which a single reducer computes the new weights and biases
for each layer based on the current instance
● This entire process repeats for each instance in the dataset
ANN using Map Reduce - Implementation 4 (Chen)

ANNs working on large datasets are computationally inefficient, and parallelisation of these algorithms may help in
improving efficiency and accuracy.
Current solutions that implement BPNNs in a parallelised manner have unsolved challenges, such as difficulty in
generating a convergent global BPNN and a training process that gets trapped in local minima.

This paper presents a novel approach that introduces a genetic-algorithm-based Evolution Algorithm, which views
local BPNNs as candidates in a population and efficiently generates the ideal global BPNN candidate.

Gradient Descent, an algorithm that is known to fall into local optima, is combined with the Evolution Algorithm,
an algorithm that can more efficiently land at the global optimum and is much less sensitive to initial conditions.

Finally, Random Projection is introduced to further improve training efficiency. Experiments show that the
algorithm can improve training efficiency and accuracy remarkably on high-dimensional big datasets.
ANN using Map Reduce - Implementation 4

Throughout the entire process a BPNN candidate is also expressed in terms of its weight matrix LM, as a trained ANN
can be defined as the collection of all its weights.
The entire algorithm is split into three main stages -

Local Training Stage
● Each map task reads a split of the entire dataset and an initial global BPNN candidate (randomly weighted
initially)
● There will then be m splits of the data - S1, S2, …, Sm - and m local BPNNs - LM1, LM2, …, LMm
● Before a reduce task pulls the local BPNNs to form the global BPNN, all the local BPNNs on a given node are
merged by averaging their connection weights, in order to reduce I/O.
● Eventually n pairs {<Ki, LMi> | 1 <= i <= n} are written into files, where Ki, corresponding to the ID of the
current node, is the key of the local BPNN LMi, and n is the number of nodes in the current cluster.

Global Evolution Stage
● All the local BPNNs, identified as LMs, are sent to the reduce task, where they make up the initial population of
candidates for the Evolver.
● The Evolver then uses its Selection, Mutation and Crossover operators to produce new individuals; those whose
fitness satisfies the threshold are put into the population of the next generation. (A small sketch of such an
evolver follows below.)
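A minimal, generic sketch of such an evolver over weight matrices; the uniform crossover, Gaussian mutation and truncation selection used here are common illustrative defaults, not the paper's exact operators.

import random
import numpy as np

def crossover(lm_a, lm_b):
    # Uniform crossover: each weight is copied from one of the two parent LMs at random.
    mask = np.random.rand(*lm_a.shape) < 0.5
    return np.where(mask, lm_a, lm_b)

def mutate(lm, rate=0.01, scale=0.1):
    # Perturb a small fraction of the weights with Gaussian noise.
    mask = np.random.rand(*lm.shape) < rate
    return lm + mask * np.random.normal(0.0, scale, lm.shape)

def evolve(population, fitness, threshold, generations=10):
    # population: list of weight matrices (LMs); fitness maps an LM to a score (higher is better).
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[:max(2, len(population) // 2)]
        children = [mutate(crossover(*random.sample(parents, 2))) for _ in population]
        # Only individuals whose fitness satisfies the threshold enter the next generation.
        population = [lm for lm in children if fitness(lm) >= threshold] or parents
    return max(population, key=fitness)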
ANN using Map Reduce - Implementation 4

Test Stage
● The fitness of the candidates in the population is evaluated based on their performance against the testBPNN
function.
● For each LM, the error ei between the expected output and the actual output on the testBPNN function is
calculated.
● The LM that has the smallest ei with ei <= 𝛿 (𝛿 being a predefined threshold) is chosen as the GM, the global
BPNN.
● If no LM has an error ei <= 𝛿, then the LM with the smallest error is chosen as the GM.
● In this way, in the next iteration the mappers read a global BPNN of higher quality than in the previous
iteration.
Cascade SVM

● SVMs are really powerful classification algorithms; however, their storage and compute requirements increase
rapidly with the number of training vectors.

● The crux of an SVM is the Quadratic Programming problem, which scales with the cube of the number of training
vectors, O(k^3).

● To parallelise this computation, we make use of a Cascade SVM. The idea behind this is to split the problem into
independent optimisations which are later combined in a hierarchical fashion.

● Support vectors from two SVMs are combined and adjusted to optimise the combined subset. This step goes on
iteratively until satisfactory accuracy is achieved. (A small sketch of one cascade layer follows below.)

● For a new iteration, the SVMs in the first layer receive all the support vectors of the last layer as input.
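A minimal sketch of one cascade layer, assuming scikit-learn's SVC as the underlying solver (purely for illustration; any SVM trainer would do). Each subset is trained independently, its support vectors are kept, and the survivors are merged pairwise for the next layer.

import numpy as np
from sklearn.svm import SVC

def support_vectors(X, y, **svm_params):
    # Train an SVM on one subset and keep only its support vectors.
    clf = SVC(kernel="linear", **svm_params).fit(X, y)
    return X[clf.support_], y[clf.support_]

def cascade_layer(subsets, **svm_params):
    # subsets: list of (X, y) NumPy array pairs. Train each independently, then merge pairwise.
    svs = [support_vectors(X, y, **svm_params) for X, y in subsets]
    merged = []
    for i in range(0, len(svs) - 1, 2):
        X = np.vstack([svs[i][0], svs[i + 1][0]])
        y = np.concatenate([svs[i][1], svs[i + 1][1]])
        merged.append((X, y))
    if len(svs) % 2:                 # an odd subset left over is carried forward unchanged
        merged.append(svs[-1])
    return merged

# Calling cascade_layer repeatedly until one subset remains gives the final SVM's training set;
# feeding its support vectors back to the first layer starts the next iteration.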
Cascade SVM - Filtering

● How can we be sure that cascading will give us the global optimum?

● If we take a subset S of the whole dataset Q, it will most likely not contain all the support vectors of Q, and
its support vectors may not be support vectors of Q either.

● However, if there is no serious bias/skew in the subset, its support vectors are likely to be support vectors of
Q as well.

● Another way of saying this is that the interior points of the subset are likely to be interior points of the
whole set. Therefore, non-support vectors of a subset are likely to be non-support vectors of the whole set, and we
can eliminate them from further analysis.

(Figure: support vectors of Subset 1 and Subset 2, and of the combined Subset 1 + Subset 2.)
SVMs using Spark - Cascade SVM

● We saw how the Cascade SVM allows parallel computation of support vectors. Do we know anything that's great at
parallel in-memory computations? Spark!

● The data is divided randomly, but in such a way that the ratio of positive and negative classes in each subset is
equal. This is done so that there isn't any extreme update to the global support vectors.

● Subsets are stored as RDDs, and the data corresponding to each partition is trained in parallel using a
per-partition operation such as foreachPartition (a sketch using mapPartitions follows this list).

● The support vectors and non-support vectors of each layer may be stored back to HDFS to be used as input for the
next layers.

● Non-support vectors (NoSV) are those that violate the training rule/results of the other subset.
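A minimal PySpark sketch of the per-partition training step. It uses mapPartitions rather than foreachPartition so that each partition's support vectors can be collected for the next layer, and it assumes scikit-learn's SVC inside each partition; both choices are illustrative assumptions rather than the slides' exact code.

import numpy as np
from pyspark import SparkContext
from sklearn.svm import SVC

def train_partition(rows):
    # rows: iterator of (features, label) pairs belonging to one partition.
    data = list(rows)
    if not data:
        return iter([])
    X = np.array([f for f, _ in data])
    y = np.array([l for _, l in data])
    clf = SVC(kernel="linear").fit(X, y)
    # Emit only this partition's support vectors for the next cascade layer.
    return iter([(tuple(X[i]), int(y[i])) for i in clf.support_])

# Toy, class-balanced data so that every partition sees both classes.
labelled_points = [([0.0, 1.0], 0), ([1.0, 0.0], 1), ([0.2, 0.9], 0), ([0.9, 0.2], 1)] * 10

sc = SparkContext(appName="CascadeSVM")
rdd = sc.parallelize(labelled_points, numSlices=4)
layer1_svs = rdd.mapPartitions(train_partition).collect()   # support vectors of layer 1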
Decision Trees (C4.5) using Map Reduce - Intro

● Decision Trees are classifiers that work by recursive partitioning over an instance space.
● C4.5 is an extension of the ID3 algorithm that takes care of continuous values and handles incomplete data with
missing values.
● Each internal node is a decision node, which represents an attribute/subset of attributes.
● Each edge represents a specific value or range.
● Leaf nodes represent the class label.

Intuitive logic for all algorithms that work on decision trees (a minimal recursive sketch follows this list):

1. On the training data, apply a measurement function to all attributes to find the best splitting attribute.
2. After obtaining a splitting attribute, partition the instance space into several parts.
3. The algorithm terminates if each partition produced by the previous step belongs to the same class.
4. Otherwise, recursively perform the splitting process until each partition is pure (go back to step 1).
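A minimal recursive Python sketch of this generic tree-growing logic; the dict-based tree and the measure parameter (e.g. C4.5's gain ratio) are illustrative choices, not the Map Reduce formulation described on the next slides.

from collections import Counter

def build_tree(instances, attributes, measure):
    # instances: list of (attribute_dict, class_label); measure scores a candidate split.
    labels = [label for _, label in instances]
    if len(set(labels)) == 1:                        # step 3: pure partition -> leaf node
        return labels[0]
    if not attributes:                               # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    # Step 1: apply the measurement function to every attribute and pick the best split.
    best = max(attributes, key=lambda a: measure(instances, a))
    tree = {best: {}}
    # Step 2: partition the instance space by the chosen attribute's values.
    for value in {row[best] for row, _ in instances}:
        subset = [(row, label) for row, label in instances if row[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, measure)   # step 4: recurse
    return tree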
Decision Trees - C4.5 Algorithm
(Figure: the C4.5 algorithm.)
Decision Trees - Data Structures for Map Reduce

Attribute Table:
● Consider an attribute a
● The most basic data structure; it stores data pertaining to the attribute in the form
<row_id, attribute_value[a], class_value>

Count Table:
● This stores the count of instances with specific class labels, if split by attribute a
● Stores data for each attribute as <class_label, count>

Hash Table:
● This is the important one: it stores the link information between tree nodes and row_ids, as well as between a
node and its branches

A small illustration of these structures follows below.
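A small illustration of the three structures as plain Python containers; the field names follow the slide, while the concrete attribute ("outlook") and values are made up.

# Attribute table for attribute "outlook": <row_id, attribute_value, class_value>
attribute_table = {
    "outlook": [(1, "sunny", "no"), (2, "overcast", "yes"), (3, "rain", "yes")],
}

# Count table: <class_label, count> pairs for the instances, if split by the attribute.
count_table = {
    "outlook": [("yes", 2), ("no", 1)],
}

# Hash table: links tree nodes to the row_ids they cover and to their branches.
hash_table = {
    "node_0": {"row_ids": [1, 2, 3],
               "branches": {"sunny": "node_1", "overcast": "node_2", "rain": "node_3"}},
}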
Decision Trees - Data Preparation in Map Reduce

● Before executing the algorithm, the first thing to do is to convert the given table into the data structures
described earlier.
● Here, we use Map Reduce as follows (a small sketch follows this list):
○ A map_attribute function transforms each instance record into the attribute table, with the attribute name a
as the key and the row_id and class label c as the values.
○ Following this, reduce_attribute computes the number of instances with a specific class, if split by attribute
a, which forms the count table.
○ The hash table is set to null at the beginning of the process.
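A minimal sketch of this preparation step as plain Python map/reduce functions; the function names follow the slide, while the bodies are illustrative assumptions.

from collections import defaultdict

def map_attribute(row_id, record, class_label):
    # record: dict mapping attribute name -> value for one training instance.
    # Emit one (attribute_name, payload) pair per attribute, keyed by the attribute name a.
    for attribute, value in record.items():
        yield attribute, (row_id, value, class_label)

def reduce_attribute(attribute, payloads):
    # Build the attribute-table entries and the count table for this attribute.
    attribute_table = list(payloads)                 # rows of <row_id, attribute_value, class_value>
    counts = defaultdict(lambda: defaultdict(int))
    for _, value, class_label in attribute_table:
        counts[value][class_label] += 1              # instances per class, if split by this attribute
    return attribute_table, {v: dict(c) for v, c in counts.items()}

# The hash table is set to null (empty) at the beginning of the process.
hash_table = {}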
Decision Trees - Selection of Splitting Attribute

● Now we come to another MR job that works on selecting the splitting attribute.
● Once we have the attribute table and the count table, we have to select the first best attribute as our root.
● Here is how we can do it in Map Reduce (a sketch of the gain-ratio computation follows this list):
○ An identity mapper is run with a reduce_population function, which takes as input the number of instances for
each (attribute, value) pair and aggregates the total number of records for a given attribute a.

○ Next, another Map Reduce job is launched.

○ The map_computation step computes the information gain and the split information for a given attribute a.

○ The reduce_computation step computes the information gain ratio, as described by the main C4.5 algorithm.

○ Finally, the attribute with the maximum gain ratio is chosen as the splitting attribute.
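A minimal sketch of the gain-ratio computation that map_computation/reduce_computation would perform, written here as ordinary Python over per-value class counts; an illustrative reconstruction of the standard C4.5 measure, not the paper's code.

import math

def entropy(class_counts):
    total = sum(class_counts.values())
    return -sum((c / total) * math.log2(c / total) for c in class_counts.values() if c)

def gain_ratio(value_class_counts, parent_class_counts):
    # value_class_counts: {attribute_value: {class_label: count}} for one attribute.
    total = sum(parent_class_counts.values())
    # Information gain: parent entropy minus the weighted entropy after the split.
    split_entropy = sum((sum(cc.values()) / total) * entropy(cc)
                        for cc in value_class_counts.values())
    gain = entropy(parent_class_counts) - split_entropy
    # Split information penalises attributes with many distinct values.
    split_info = entropy({v: sum(cc.values()) for v, cc in value_class_counts.items()})
    return gain / split_info if split_info else 0.0

# Example: the "outlook" split from a toy weather dataset.
parent = {"yes": 9, "no": 5}
outlook = {"sunny": {"yes": 2, "no": 3}, "overcast": {"yes": 4}, "rain": {"yes": 3, "no": 2}}
print(gain_ratio(outlook, parent))   # ~0.156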
Decision Trees - Updating the Count Table and Hash Table

Based on the split made in the earlier step, we now have to update the count table and the hash table.
Map-only jobs are now started as follows:
● The map_update_count function takes in the records from the attribute table pertaining to the splitting
attribute a_best and emits the counts of the class labels.
● The map_hash function assigns a node_id to the best attribute found in the previous step, to make sure that
records with the same values end up in the same partition.
Decision Trees - Growing the Tree🌲

● From the update step, we have generated nodes for our tree.
● Now we have to grow the tree by building connections between the nodes.
Here, we start a Map-only job to check the node_id of the best attribute given the split:
○ If it remains the same as in the previous iteration, we have reached a leaf node.
○ Otherwise, a new sub-node is attached to the node generated in the previous step, and the hash table is
updated with the information <row_ids in split, node_id, subnode_id>.
THANK YOU

Prafullata Kiran Auradkar


Department of Computer Science and Engineering
prafullatak@pes.edu
