
EE292A Lecture 2

Machine Learning Network to Custom Hardware

Raúl Camposano (Silvaco and Silicon Catalyst), camposan@stanford.edu
Antun Domic (Kepler), domic@stanford.edu
Patrick Groeneveld (AMD), prgr@stanford.edu

Copyright ©2024 by Raúl Camposano, Antun Domic and Patrick Groeneveld

Remainder: Machine Learning Network to Custom Hardware
Mapping machine learning to efficient hardware


Engineering Tech: Main Driving Forces

• Moore's Law transistor supply: transistors doubling every ~24 months.
• Computational demand from AI / machine learning: doubling every ~4 months (see the back-of-the-envelope comparison below).
• [Chart: growth over time, reaching ~100,000,000,000 today; LLMs; the mid-2010's unleash machine learning / AI, which is based on brute-force floating-point multiply-add.]
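To see how quickly those two curves diverge, here is a minimal back-of-the-envelope sketch in Python; the 24-month and 4-month doubling periods are the approximate figures quoted above.

import math  # not strictly needed; plain exponentiation suffices

# Back-of-the-envelope comparison of the two doubling rates quoted above.
SUPPLY_DOUBLING_MONTHS = 24.0   # Moore's Law transistor supply
DEMAND_DOUBLING_MONTHS = 4.0    # AI / ML computational demand

for years in (1, 2, 5):
    months = 12 * years
    supply = 2 ** (months / SUPPLY_DOUBLING_MONTHS)
    demand = 2 ** (months / DEMAND_DOUBLING_MONTHS)
    print(f"{years} yr: supply x{supply:.1f}, demand x{demand:.0f}, gap x{demand / supply:.0f}")

After two years, supply has roughly doubled while demand has grown about 64x, a 32x gap that has to be closed by better hardware and better mapping.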
Squeezing Maximum Performance from Semiconductor Hardware

• Find faster, more effective ML algorithms (duh!)
• Make hardware run faster to get more operations per second (Giga -> Tera -> Peta FLOPS)

• Good circuit synthesis
  • Optimal resource allocation and scheduling, logic minimization, clock-skew minimization, clever transistor sizing, etc.
• Minimize distance between computation and memory
  • Smaller is better.
  • Distribute memory next to the computation.
• Minimize computational complexity
  • Floating-point data formats matter, a lot!
  • FP32 -> FP16 -> BF16 -> FP8 (see the sketch below).
• Run in parallel
  • More computation per clock cycle.
  • Pipelining.
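To make the data-format lever concrete, here is a minimal TensorFlow sketch (my own illustration, not from the slides) that runs the same matrix multiply in FP32, FP16 and BF16 and reports the error the narrower formats introduce.

import tensorflow as tf

# Same matrix multiply in three floating-point formats; the FP32 result
# is the reference. Narrower formats trade accuracy and dynamic range
# for cheaper multipliers and less memory traffic.
tf.random.set_seed(0)
a = tf.random.normal((256, 256))
b = tf.random.normal((256, 256))
ref = tf.matmul(a, b)  # FP32 reference

for dtype in (tf.float16, tf.bfloat16):
    out = tf.matmul(tf.cast(a, dtype), tf.cast(b, dtype))
    err = tf.reduce_max(tf.abs(tf.cast(out, tf.float32) - ref))
    print(f"{dtype.name}: max absolute error {err.numpy():.4f}")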
Compute Architectures for ML

[Diagram: block schematics of each architecture, built from processor (p), weight (w) and memory (m) tiles, with Weights RAM and Data RAM blocks and 40G data links.]

• Custom ASIC datapath: any structure, hard-coded algorithm. Fastest and lowest power, but least flexible and highest NRE.
• TPU/NPU: 10,000 multipliers surrounded by memory. 1,000X-10,000X faster than a CPU, if done right.
• Mesh: 1,000,000 processors, each with local memory.
• GPU: up to 1,000 processors. 10-100X faster than a CPU.
• CPU: 2-16 processors. Most universal and versatile; suitable for algorithms that contain 'if'-statements.
Note that Tesla's FSD chip and smartphone SoCs deploy a hybrid of all options:

• Tesla FSD chip: 12X CPU, 16X GPU, 2X TPU.
• A14 (iPhone SoC): 4X LP CPU, 2X HP CPU, 4X GPU, NPU, cache.
• Cerebras wafer scale: mesh of 1,000,000 processors.

Source: Tesla Autonomy, April 22nd, 2019 (https://www.youtube.com/watch?v=Ucp0TTmvqOE)
Mapping a ML Network on AI Hardware

• The ML network (PyTorch / TensorFlow) is compiled to MLIR, an intermediate representation with basic operations (LLVM-based).
• An automatic synthesis flow maps the MLIR onto custom ML hardware: custom hybrid hardware with a mix of FP32, FP18 and other formats.
• The same MLIR is also compiled to PC hardware (CPU), all in FP32; the PC is about 1000X slower.
• The results of the two runs are compared for verification and debug (sketched below).
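A minimal sketch of that verification idea (my own illustration, not the course's tooling): run the same toy network once as the FP32 reference path and once in a reduced-precision path standing in for the custom hardware, then compare the outputs within a tolerance.

import numpy as np
import tensorflow as tf

def run_model(x, dtype):
    # Toy "network" (one dense layer + RELU), evaluated in the given precision.
    w = tf.cast(tf.fill((784, 256), 0.01), dtype)
    out = tf.nn.relu(tf.matmul(tf.cast(x, dtype), w))
    return tf.cast(out, tf.float32).numpy()

x = tf.random.normal((1, 784))
reference = run_model(x, tf.float32)   # stand-in for the all-FP32 PC path
candidate = run_model(x, tf.float16)   # stand-in for the reduced-precision hardware path

print("max absolute difference:", np.max(np.abs(reference - candidate)))
print("verification",
      "PASSED" if np.allclose(reference, candidate, rtol=1e-2, atol=1e-2)
      else "FAILED: debug")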
Training a Simplified MNIST TensorFlow Model

[Diagram: features (images) and labels feed the network: Input 28x28x1 -> Flatten 784x1 -> FCN (Dense) 256 -> RELU -> FCN (Dense) 10 -> Softmax -> loss. Logits = unnormalized predictions.]

model.py:

import tensorflow as tf

def mnist_model_fn(
    features, labels, mode=tf.estimator.ModeKeys.TRAIN
):
    # Mixed-precision policy for the whole model
    dtype = tf.keras.mixed_precision.experimental.Policy(
        'mixed_float16', loss_scale=None)
    tf.keras.mixed_precision.experimental.set_policy(dtype)

    # Dense 256 + RELU on the flattened 784-pixel input
    network = tf.keras.layers.Dense(
        256,
        activation=tf.nn.relu,
    )(features)

    # Dense 10 producing the logits (unnormalized predictions)
    logits = tf.keras.layers.Dense(
        10,
    )(network)

    # Softmax cross-entropy loss and a plain SGD training op
    loss_op = tf.reduce_mean(
        input_tensor=tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.stop_gradient(labels),
            logits=logits,
        )
    )
    train_op = tf.compat.v1.train.GradientDescentOptimizer(
        learning_rate=0.01).minimize(loss_op)
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss_op, train_op=train_op
    )

data.py:

import tensorflow as tf

def mnist_input_fn(
    mode=tf.estimator.ModeKeys.TRAIN
):
    # MnistLoader is a course-provided helper
    # (a standard-TensorFlow stand-in is sketched after this slide)
    mnist_loader = MnistLoader(
        img_dtype=tf.float32,
        label_dtype=tf.float32,
        flatten=True,
    )
    dset = mnist_loader.get_dataset(
        'train',
        batch_size=1, num_batches=20
    ).repeat()
    return dset

MNIST_training_script.py:

import tensorflow as tf
from data import mnist_input_fn
from model import mnist_model_fn

# declare the network
estimator = tf.estimator.Estimator(
    mnist_model_fn,
)

# run!
estimator.train(
    input_fn=mnist_input_fn,
)
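MnistLoader is the course's own data helper. For readers without it, a rough stand-in using the standard tf.keras.datasets MNIST data could look like this (a sketch, assuming the same flattened float32 images, one-hot float32 labels, and batch size 1):

import tensorflow as tf

def mnist_input_fn(mode=tf.estimator.ModeKeys.TRAIN):
    # Stand-in for MnistLoader: flattened float32 images, one-hot float32 labels.
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x = x_train.reshape(-1, 784).astype('float32') / 255.0
    y = tf.one_hot(y_train, 10, dtype=tf.float32)
    dset = tf.data.Dataset.from_tensor_slices((x, y)).batch(1).repeat()
    return dset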
From TensorFlow to Training Hardware: Synthesis Flow

The TensorFlow model (model.py) is taken through an automatic synthesis flow:

• TensorFlow model
• Intermediate Representation (MLIR)
• Match to Computation Kernels
• Shape, Place and Route
• Program Hardware

The framework (driven by data.py) streams the data to the hardware and gets the weights and losses back.

The training script now uses a CerebrasEstimator:

from data import mnist_input_fn
from model import mnist_model_fn

estimator = CerebrasEstimator(
    mnist_model_fn,
)
estimator.train(
    input_fn=mnist_input_fn,
)
Making sense of the Intermediate Representation

[Diagram: the same network, Input 28x28x1 -> Flatten 784x1 -> FCN (Dense) 256 -> RELU -> FCN (Dense) 10 -> Softmax -> loss, with its operations numbered 1-4 in the IR.]

• Dense Layer 1: 784x256 = 200,704 trainable weights + 256 biases.
• Dense Layer 2: 256x10 = 2,560 trainable weights (checked in the sketch below).
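A quick way to check those numbers (a minimal sketch, using plain Keras rather than the course flow): build the same two Dense layers and let Keras count the parameters.

import tensorflow as tf

# Same topology as the MNIST model above: 784 -> Dense(256, relu) -> Dense(10).
inputs = tf.keras.Input(shape=(784,))
x = tf.keras.layers.Dense(256, activation='relu')(inputs)  # 784*256 + 256 = 200,960 params
outputs = tf.keras.layers.Dense(10)(x)                     # 256*10  + 10  =   2,570 params
model = tf.keras.Model(inputs, outputs)
model.summary()  # Total trainable params: 203,530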
From the Intermediate Representation to a Kernel Graph

[Diagram: the network's IR graph (operations 1-4 and the loss) is covered by kernel subgraphs f, g, o and l; model -> IR -> Kernel Graph.]

• Library of 'kernels', e.g. (FCN + RELU).
• Each kernel performs an operation that matches a subgraph.
• Subgraphs of kernels are matched to the IR graph.
• The result is a 'kernel graph' (a toy matching sketch follows).
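To illustrate the matching step, here is a minimal, self-contained sketch (entirely my own illustration, not the actual compiler): a toy IR, represented as a linear list of basic ops, is scanned for a Dense-followed-by-RELU pattern, which is fused into a single (FCN+RELU) kernel node.

# Toy illustration only: fuse a dense op followed by relu into one
# (FCN+RELU) kernel node. Real flows match subgraphs of a dataflow
# graph against a kernel library; here the IR is just a linear op list.
ir_ops = ["flatten", "dense_256", "relu", "dense_10", "softmax_xent_loss"]

def match_kernels(ops):
    kernels, i = [], 0
    while i < len(ops):
        if ops[i].startswith("dense") and i + 1 < len(ops) and ops[i + 1] == "relu":
            kernels.append(f"FCN+RELU({ops[i]})")
            i += 2                      # one kernel covers both IR ops
        else:
            kernels.append(ops[i])
            i += 1
    return kernels

print(match_kernels(ir_ops))
# ['flatten', 'FCN+RELU(dense_256)', 'dense_10', 'softmax_xent_loss']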
Summary

• Computation for machine learning
  • Dense (Fully Connected) and convolution layers
  • Speeding up computation: floating-point formats
• ML hardware classes:
  • CPU, GPU, TPU, custom
• GPU structure
• Layer-pipelined execution of ML
• TensorFlow to hardware flow:
  • IR, Kernel Graph
• Other approaches:
  • FPGA ML flows, Google TPU
References

1. H. T. Kung, C. E. Leiserson: Algorithms for VLSI processor arrays; in: C. Mead, L. Conway (eds.): Introduction to VLSI Systems; Addison-Wesley, 1979.
2. Schwarz, E.M.; Schmookler, M.; Son Dao Trong (July 2005). "Hardware Implementations of Denormalized Numbers". IEEE Transactions on Computers, 54 (7): 825-836. http://www.acsel-lab.com/arithmetic/arith16/papers/ARITH16_Schwarz.pdf
3. IEEE754 float converter: https://www.h-schmidt.net/FloatConverter/IEEE754.html
4. https://www.tensorflow.org
5. https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
6. ACAP architecture: https://www.xilinx.com/products/silicon-devices/acap/versal.html
7. http://www.cerebras.net
