
EE292A Lecture 2

Machine Learning Network to Custom Hardware

Raúl Camposano (Silvaco and Silicon Catalyst), camposan@stanford.edu
Antun Domic (Kepler), domic@stanford.edu
Patrick Groeneveld (AMD), prgr@stanford.edu

Copyright ©2024 by Raúl Camposano, Antun Domic and Patrick Groeneveld

Remainder: Machine Learning Network to Custom Hardware
Mapping machine learning to efficient hardware


Engineering Tech: Main Driving Forces

• Moore's Law transistor supply: transistors doubling every ~24 months.
• Computational demand from AI / machine learning: doubling every ~4 months (see the back-of-the-envelope comparison below).
• [Chart: growth over time, reaching ~100,000,000,000 today; LLMs; the mid-2010's unleash machine learning / AI, which is based on brute-force floating-point multiply-add.]
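To see how quickly those two curves diverge, here is a minimal back-of-the-envelope sketch in Python; the 24-month and 4-month doubling periods are the approximate figures quoted above.

import math  # not strictly needed; plain exponentiation suffices

# Back-of-the-envelope comparison of the two doubling rates quoted above.
SUPPLY_DOUBLING_MONTHS = 24.0   # Moore's Law transistor supply
DEMAND_DOUBLING_MONTHS = 4.0    # AI / ML computational demand

for years in (1, 2, 5):
    months = 12 * years
    supply = 2 ** (months / SUPPLY_DOUBLING_MONTHS)
    demand = 2 ** (months / DEMAND_DOUBLING_MONTHS)
    print(f"{years} yr: supply x{supply:.1f}, demand x{demand:.0f}, gap x{demand / supply:.0f}")

After two years, supply has roughly doubled while demand has grown about 64x, a 32x gap that has to be closed by better hardware and better mapping.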
Squeezing Maximum Performance from Semiconductor Hardware

• Find faster, more effective ML algorithms (duh!)
• Make hardware run faster to get more operations per second (Giga -> Tera -> Peta FLOPS)

• Good circuit synthesis
  • Optimal resource allocation and scheduling, logic minimization, clock-skew minimization, clever transistor sizing, etc.
• Minimize distance between computation and memory
  • Smaller is better.
  • Distribute memory next to the computation.
• Minimize computational complexity
  • Floating-point data formats matter, a lot!
  • FP32 -> FP16 -> BF16 -> FP8 (see the sketch below).
• Run in parallel
  • More computation per clock cycle.
  • Pipelining.
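To make the data-format lever concrete, here is a minimal TensorFlow sketch (my own illustration, not from the slides) that runs the same matrix multiply in FP32, FP16 and BF16 and reports the error the narrower formats introduce.

import tensorflow as tf

# Same matrix multiply in three floating-point formats; the FP32 result
# is the reference. Narrower formats trade accuracy and dynamic range
# for cheaper multipliers and less memory traffic.
tf.random.set_seed(0)
a = tf.random.normal((256, 256))
b = tf.random.normal((256, 256))
ref = tf.matmul(a, b)  # FP32 reference

for dtype in (tf.float16, tf.bfloat16):
    out = tf.matmul(tf.cast(a, dtype), tf.cast(b, dtype))
    err = tf.reduce_max(tf.abs(tf.cast(out, tf.float32) - ref))
    print(f"{dtype.name}: max absolute error {err.numpy():.4f}")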
Compute Architectures for ML

[Diagram: block schematics of each architecture, built from processor (p), weight (w) and memory (m) tiles, with Weights RAM and Data RAM blocks and 40G data links.]

• Custom ASIC datapath: any structure, hard-coded algorithm. Fastest and lowest power, but least flexible and highest NRE.
• TPU/NPU: 10,000 multipliers surrounded by memory. 1,000X-10,000X faster than a CPU, if done right.
• Mesh: 1,000,000 processors, each with local memory.
• GPU: up to 1,000 processors. 10-100X faster than a CPU.
• CPU: 2-16 processors. Most universal and versatile; suitable for algorithms that contain 'if'-statements.
Note that Tesla's FSD chip and smartphone SoCs deploy a hybrid of all options:

• Tesla FSD chip: 12X CPU, 16X GPU, 2X TPU.
• A14 (iPhone SoC): 4X LP CPU, 2X HP CPU, 4X GPU, NPU, cache.
• Cerebras wafer scale: mesh of 1,000,000 processors.

Source: Tesla Autonomy, April 22nd, 2019 (https://www.youtube.com/watch?v=Ucp0TTmvqOE)
Mapping a ML Network on AI Hardware

• The ML network (PyTorch / TensorFlow) is compiled to MLIR, an intermediate representation with basic operations (LLVM-based).
• An automatic synthesis flow maps the MLIR onto custom ML hardware: custom hybrid hardware with a mix of FP32, FP18 and other formats.
• The same MLIR is also compiled to PC hardware (CPU), all in FP32; the PC is about 1000X slower.
• The results of the two runs are compared for verification and debug (sketched below).
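A minimal sketch of that verification idea (my own illustration, not the course's tooling): run the same toy network once as the FP32 reference path and once in a reduced-precision path standing in for the custom hardware, then compare the outputs within a tolerance.

import numpy as np
import tensorflow as tf

def run_model(x, dtype):
    # Toy "network" (one dense layer + RELU), evaluated in the given precision.
    w = tf.cast(tf.fill((784, 256), 0.01), dtype)
    out = tf.nn.relu(tf.matmul(tf.cast(x, dtype), w))
    return tf.cast(out, tf.float32).numpy()

x = tf.random.normal((1, 784))
reference = run_model(x, tf.float32)   # stand-in for the all-FP32 PC path
candidate = run_model(x, tf.float16)   # stand-in for the reduced-precision hardware path

print("max absolute difference:", np.max(np.abs(reference - candidate)))
print("verification",
      "PASSED" if np.allclose(reference, candidate, rtol=1e-2, atol=1e-2)
      else "FAILED: debug")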
Training a Simplified MNIST TensorFlow Model

[Diagram: features (images) and labels feed the network: Input 28x28x1 -> Flatten 784x1 -> FCN (Dense) 256 -> RELU -> FCN (Dense) 10 -> Softmax -> loss. Logits = unnormalized predictions.]

model.py:

import tensorflow as tf

def mnist_model_fn(
    features, labels, mode=tf.estimator.ModeKeys.TRAIN
):
    # Mixed-precision policy for the whole model
    dtype = tf.keras.mixed_precision.experimental.Policy(
        'mixed_float16', loss_scale=None)
    tf.keras.mixed_precision.experimental.set_policy(dtype)

    # Dense 256 + RELU on the flattened 784-pixel input
    network = tf.keras.layers.Dense(
        256,
        activation=tf.nn.relu,
    )(features)

    # Dense 10 producing the logits (unnormalized predictions)
    logits = tf.keras.layers.Dense(
        10,
    )(network)

    # Softmax cross-entropy loss and a plain SGD training op
    loss_op = tf.reduce_mean(
        input_tensor=tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.stop_gradient(labels),
            logits=logits,
        )
    )
    train_op = tf.compat.v1.train.GradientDescentOptimizer(
        learning_rate=0.01).minimize(loss_op)
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss_op, train_op=train_op
    )

data.py:

import tensorflow as tf

def mnist_input_fn(
    mode=tf.estimator.ModeKeys.TRAIN
):
    # MnistLoader is a course-provided helper
    # (a standard-TensorFlow stand-in is sketched after this slide)
    mnist_loader = MnistLoader(
        img_dtype=tf.float32,
        label_dtype=tf.float32,
        flatten=True,
    )
    dset = mnist_loader.get_dataset(
        'train',
        batch_size=1, num_batches=20
    ).repeat()
    return dset

MNIST_training_script.py:

import tensorflow as tf
from data import mnist_input_fn
from model import mnist_model_fn

# declare the network
estimator = tf.estimator.Estimator(
    mnist_model_fn,
)

# run!
estimator.train(
    input_fn=mnist_input_fn,
)
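MnistLoader is the course's own data helper. For readers without it, a rough stand-in using the standard tf.keras.datasets MNIST data could look like this (a sketch, assuming the same flattened float32 images, one-hot float32 labels, and batch size 1):

import tensorflow as tf

def mnist_input_fn(mode=tf.estimator.ModeKeys.TRAIN):
    # Stand-in for MnistLoader: flattened float32 images, one-hot float32 labels.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x = x_train.reshape(-1, 784).astype('float32') / 255.0
    y = tf.one_hot(y_train, 10, dtype=tf.float32)
    dset = tf.data.Dataset.from_tensor_slices((x, y)).batch(1).repeat()
    return dset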
From TensorFlow to Training Hardware: Synthesis Flow

The TensorFlow model (model.py) is taken through an automatic synthesis flow:

• TensorFlow model
• Intermediate Representation (MLIR)
• Match to Computation Kernels
• Shape, Place and Route
• Program Hardware

The framework (driven by data.py) streams the data to the hardware and gets the weights and losses back.

The training script now uses a CerebrasEstimator:

from data import mnist_input_fn
from model import mnist_model_fn

estimator = CerebrasEstimator(
    mnist_model_fn,
)
estimator.train(
    input_fn=mnist_input_fn,
)
Making sense of the Intermediate Representation

[Diagram: the same network, Input 28x28x1 -> Flatten 784x1 -> FCN (Dense) 256 -> RELU -> FCN (Dense) 10 -> Softmax -> loss, with its operations numbered 1-4 in the IR.]

• Dense Layer 1: 784x256 = 200,704 trainable weights + 256 biases.
• Dense Layer 2: 256x10 = 2,560 trainable weights (checked in the sketch below).
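A quick way to check those numbers (a minimal sketch, using plain Keras rather than the course flow): build the same two Dense layers and let Keras count the parameters.

import tensorflow as tf

# Same topology as the MNIST model above: 784 -> Dense(256, relu) -> Dense(10).
inputs = tf.keras.Input(shape=(784,))
x = tf.keras.layers.Dense(256, activation='relu')(inputs)  # 784*256 + 256 = 200,960 params
outputs = tf.keras.layers.Dense(10)(x)                     # 256*10  + 10  =   2,570 params
model = tf.keras.Model(inputs, outputs)
model.summary()  # Total trainable params: 203,530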
From the Intermediate Representation to a Kernel Graph

[Diagram: the network's IR graph (operations 1-4 and the loss) is covered by kernel subgraphs f, g, o and l; model -> IR -> Kernel Graph.]

• Library of 'kernels', e.g. (FCN + RELU).
• Each kernel performs an operation that matches a subgraph.
• Subgraphs of kernels are matched to the IR graph.
• The result is a 'kernel graph' (a toy matching sketch follows).
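To illustrate the matching step, here is a minimal, self-contained sketch (entirely my own illustration, not the actual compiler): a toy IR, represented as a linear list of basic ops, is scanned for a Dense-followed-by-RELU pattern, which is fused into a single (FCN+RELU) kernel node.

# Toy illustration only: fuse a dense op followed by relu into one
# (FCN+RELU) kernel node. Real flows match subgraphs of a dataflow
# graph against a kernel library; here the IR is just a linear op list.
ir_ops = ["flatten", "dense_256", "relu", "dense_10", "softmax_xent_loss"]

def match_kernels(ops):
    kernels, i = [], 0
    while i < len(ops):
        if ops[i].startswith("dense") and i + 1 < len(ops) and ops[i + 1] == "relu":
            kernels.append(f"FCN+RELU({ops[i]})")
            i += 2                      # one kernel covers both IR ops
        else:
            kernels.append(ops[i])
            i += 1
    return kernels

print(match_kernels(ir_ops))
# ['flatten', 'FCN+RELU(dense_256)', 'dense_10', 'softmax_xent_loss']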
Summary

• Computation for machine learning
  • Dense (Fully Connected) and convolution layers
  • Speeding up computation: floating-point formats
• ML hardware classes:
  • CPU, GPU, TPU, custom
• GPU structure
• Layer-pipelined execution of ML
• TensorFlow to hardware flow:
  • IR, Kernel Graph
• Other approaches:
  • FPGA ML flows, Google TPU
References

1. H. T. Kung, C. E. Leiserson: Algorithms for VLSI processor arrays; in: C. Mead, L. Conway (eds.): Introduction to VLSI Systems; Addison-Wesley, 1979.
2. Schwarz, E.M.; Schmookler, M.; Son Dao Trong (July 2005). "Hardware Implementations of Denormalized Numbers". IEEE Transactions on Computers, 54 (7): 825-836. http://www.acsel-lab.com/arithmetic/arith16/papers/ARITH16_Schwarz.pdf
3. IEEE754 float converter: https://www.h-schmidt.net/FloatConverter/IEEE754.html
4. https://www.tensorflow.org
5. https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
6. ACAP architecture: https://www.xilinx.com/products/silicon-devices/acap/versal.html
7. http://www.cerebras.net
