EE292A Lecture 2: ML Hardware 2 - April 9
Today: LLMs
Mid-2010's: 100,000,000,000
Unleashes Machine Learning / AI, which is based on brute-force floating-point multiply-add
Squeezing Maximum Performance from Semiconductor Hardware
• Find faster, more effective ML algorithms (duh!)
• Make hardware run faster to get more Operations per second (Giga->Tera->Peta FLOPS)
• Good circuit synthesis
• Minimize computational complexity
• Minimize distance between computation and memory
• Run in parallel
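As a rough illustration of the "brute force floating-point multiply-add" point (a sketch added here, not from the slides), the multiply-accumulate count for the small MNIST model used later in this lecture; the batch size is an arbitrary example value.

    # Count multiply-accumulate operations (MACs) for a stack of Dense layers.
    # Each Dense layer with n_in inputs and n_out outputs costs n_in * n_out MACs
    # per sample, i.e. 2 * n_in * n_out floating-point operations.
    layers = [(784, 256), (256, 10)]                   # (n_in, n_out) per Dense layer
    macs_per_sample = sum(n_in * n_out for n_in, n_out in layers)
    flops_per_sample = 2 * macs_per_sample

    batch_size = 64
    print(f"MACs per sample: {macs_per_sample:,}")     # 203,264
    print(f"FLOPs per batch: {flops_per_sample * batch_size:,}")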
[Diagram: alternative hardware organizations. Arrays of processors (p), each paired with its own weights (w) and surrounded by local memory (m), contrasted with designs that keep Weights and Data in external RAM and stream them to the processors over 40G links.]
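A back-of-the-envelope sketch (assumed numbers, not from the slides) of why keeping weights next to the processors matters: streaming the first Dense layer's FP32 weight matrix over a 40G link, as in the external-RAM organization above, costs time on every re-read, whereas weights held in local memory are fetched once.

    # Weight streaming over a 40 Gb/s link vs. weights resident in local memory.
    weights_bytes = 784 * 256 * 4                 # FP32 weight matrix, ~0.8 MB
    link_bytes_per_s = 40e9 / 8                   # 40 Gb/s link from the figure

    stream_time_s = weights_bytes / link_bytes_per_s
    print(f"one pass over the 40G link: {stream_time_s * 1e6:.0f} us")
    # With weights stored beside each processor (the p/w pairs above), this
    # transfer happens once at load time instead of on every re-read.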
[Chart (March 28, 2022): relative ML performance of hardware classes - LP CPU, HP CPU with cache, GPU, TPU, NPU - annotated with figures including 2X, 4X, 12X, 16X, and 1,000,000 processors.]
Automatic synthesis flow maps MLIR onto Custom ML Hardware

ML Network (PyTorch / TensorFlow)
  -> MLIR Intermediate Representation with basic operations (LLVM-based)
    -> Compile to Custom Hybrid hardware: mix of FP32, FP18 and others
    -> Compile to PC hardware (CPU): all FP32, 1000X slower
Compare results from the two targets for verification and debug
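A minimal verification sketch (assumed, not from the lecture toolchain) of the "compare results" step: run the same layer in reduced precision and in FP32, then look at the relative error. The dtypes and tolerance here are illustrative.

    import numpy as np

    # Compare a reduced-precision matmul against an FP32 reference,
    # allowing for rounding differences between the two.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((64, 784)).astype(np.float32)
    w = rng.standard_normal((784, 256)).astype(np.float32)

    ref = x @ w                                      # all-FP32 reference ("PC hardware")
    fast = (x.astype(np.float16) @ w.astype(np.float16)).astype(np.float32)   # reduced precision

    rel_err = np.abs(fast - ref) / (np.abs(ref) + 1e-6)
    print("max relative error:", rel_err.max())
    print("within tolerance:", np.allclose(fast, ref, rtol=1e-2, atol=1e-2))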
Training a Simplified MNIST TensorFlow Model
Network: Input 28x28x1 -> Flatten 784x1 -> FCN (Dense) 256 -> RELU -> FCN (Dense) 10 -> Softmax -> loss (logits = unnormalized predictions)

model.py:

    def mnist_model_fn(features, labels, mode):
        # mixed-precision policy: compute in float16, keep variables in float32
        dtype = tf.keras.mixed_precision.experimental.Policy(
            'mixed_float16', loss_scale=None)
        tf.keras.mixed_precision.experimental.set_policy(dtype)

        # 784 -> 256 fully connected layer with ReLU
        network = tf.keras.layers.Dense(
            256,
            activation=tf.nn.relu,
        )(features)

        # 256 -> 10 fully connected layer producing the logits
        logits = tf.keras.layers.Dense(
            10,
        )(network)

        # softmax cross-entropy loss, averaged over the batch
        loss_op = tf.reduce_mean(
            input_tensor=tf.nn.softmax_cross_entropy_with_logits(
                labels=tf.stop_gradient(labels),
                logits=logits,
            )
        )

        # plain SGD training step
        train_op = tf.compat.v1.train.GradientDescentOptimizer(
            learning_rate=0.01).minimize(loss_op)

        return tf.estimator.EstimatorSpec(
            mode=mode, loss=loss_op, train_op=train_op
        )

data.py:

    def mnist_input_fn(
        mode=tf.estimator.ModeKeys.TRAIN
    ):
        mnist_loader = MnistLoader(
            img_dtype=tf.float32,
            label_dtype=tf.float32,
            flatten=True,
        )
        dset = mnist_loader.get_dataset(
            'train',
            batch_size=1,
        ).repeat()
        return dset

MNIST_training_script.py:

    import tensorflow as tf
    from data import mnist_input_fn
    from model import mnist_model_fn

    # declare the network
    estimator = tf.estimator.Estimator(
        mnist_model_fn,
    )

    # run!
    estimator.train(
        input_fn=mnist_input_fn,
        num_batches=20
    )
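For reference, the same layer stack in compact tf.keras form (a sketch added here, not from the slides); the hyperparameters simply mirror the diagram above.

    import tensorflow as tf

    # Input 28x28x1 -> Flatten 784 -> Dense 256 + ReLU -> Dense 10 (logits)
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),     # 784x1
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(10),                             # logits = unnormalized predictions
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # softmax + cross-entropy
    )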
From TensorFlow to Training Hardware: Synthesis Flow
TensorFlow model (model.py + data.py, from the Framework)
  -> MLIR Intermediate Representation
  -> Match to Computation Kernels
  -> Program Hardware

The training script changes only in the Estimator class it instantiates:

    from data import mnist_input_fn
    from model import mnist_model_fn

    estimator = CerebrasEstimator(
        mnist_model_fn,
    )
    estimator.train(
        input_fn=mnist_input_fn,
    )
[Figure: the model.py / data.py sources and the MNIST network diagram (Input 28x28x1 -> Flatten 784x1 -> FCN (Dense) 256 -> RELU -> FCN (Dense) 10 -> Softmax -> loss) feeding into the synthesis flow.]
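To make the "Intermediate Representation with basic operations" step concrete, a small sketch (added here, not from the slides) traces the Dense + ReLU layer with tf.function and lists the primitive operations TensorFlow lowers it to; an MLIR-based flow starts from a similar set of basic ops.

    import tensorflow as tf

    dense = tf.keras.layers.Dense(256, activation=tf.nn.relu)

    @tf.function
    def forward(x):
        return dense(x)

    # Trace the function for a fixed input signature and inspect the resulting graph.
    cf = forward.get_concrete_function(tf.TensorSpec([1, 784], tf.float32))
    for op in cf.graph.get_operations():
        print(op.type)    # e.g. ReadVariableOp, MatMul, BiasAdd, Relu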
From the Intermediate Representation to a Kernel Graph

[Figure: the MNIST network (Input 28x28x1 -> Flatten 784x1 -> FCN (Dense) 256 -> RELU -> FCN (Dense) 10 -> Softmax -> loss) mapped onto a graph of computation kernels.]
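A toy sketch of the kernel-matching idea (the pattern table and kernel names are hypothetical, not the course toolchain): walk the list of IR operations and fuse runs of primitive ops into the coarser kernels the hardware provides, yielding the kernel graph.

    # Hypothetical kernel matching: fuse primitive IR ops into coarser hardware kernels.
    KERNEL_PATTERNS = {
        ("MatMul", "BiasAdd", "Relu"): "dense_relu_kernel",
        ("MatMul", "BiasAdd"): "dense_kernel",
        ("Softmax", "CrossEntropy"): "softmax_loss_kernel",
    }

    def match_kernels(ir_ops):
        """Greedily match runs of IR ops against known kernel patterns."""
        kernels, i = [], 0
        while i < len(ir_ops):
            for pattern, kernel in KERNEL_PATTERNS.items():
                if tuple(ir_ops[i:i + len(pattern)]) == pattern:
                    kernels.append(kernel)
                    i += len(pattern)
                    break
            else:
                kernels.append(ir_ops[i])   # no pattern matched: keep the primitive op
                i += 1
        return kernels

    ops = ["MatMul", "BiasAdd", "Relu", "MatMul", "BiasAdd", "Softmax", "CrossEntropy"]
    print(match_kernels(ops))
    # -> ['dense_relu_kernel', 'dense_kernel', 'softmax_loss_kernel']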