Vector Processors


Data-level parallelism

Vector, SIMD and GPU architectures


● Data-level parallelism (DLP) arises because there are many data items that
can be operated on at the same time.
● Single instruction stream, multiple data streams (SIMD)—The same
instruction is executed by multiple processors using different data streams.
SIMD computers exploit data-level parallelism by applying the same
operations to multiple items of data in parallel. (Flynn, 1966)
● Three variations of SIMD exploit DLP: vector architectures, multimedia SIMD instruction set extensions, and graphics processing units (GPUs).
Vector architecture
● Grab sets of data elements scattered about memory, place them into large sequential register files, operate on data in those register files, and then disperse the results back into memory.
● A single instruction works on vectors of data, which results in dozens of register-register operations on independent data elements.
Y = a*X + Y: RISC-V (scalar code)
Y = a*X + Y: RV64V (vector code)
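The RISC-V and RV64V slides both implement the same loop. For reference, a minimal C version of it (the classic DAXPY kernel; the names daxpy, X, Y, a, and n are illustrative) is:

    /* DAXPY: Y = a*X + Y over n double-precision elements.
       Every iteration is independent, which is exactly the data-level
       parallelism a vector ISA such as RV64V exploits: the whole loop
       body becomes a handful of vector loads, a multiply-add, and a store. */
    void daxpy(long n, double a, const double *X, double *Y) {
        for (long i = 0; i < n; i++)
            Y[i] = a * X[i] + Y[i];
    }

In scalar RISC-V code every element costs its own load, multiply, add, store, and loop-overhead instructions; in RV64V the loop body turns into a few vector instructions that each operate on many elements, with a vector length register covering the final partial strip.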
SIMD Instruction Set Extensions for Multimedia
● Many media applications operate on narrower data types than the 32-bit processors were optimized for.
● Like vector instructions, a SIMD instruction specifies the same operation on
vectors of data.
● Unlike vector instructions, SIMD instructions tend to specify fewer operands and thus use much smaller register files.
● SIMD extensions have three major omissions: no vector length register, no
strided or gather/scatter data transfer instructions, and no mask registers.
x86 architectures
● MMX (Multimedia Extensions): repurposed the 64-bit floating-point registers, so one instruction performs eight 8-bit operations or four 16-bit operations simultaneously.
● SSE (Streaming SIMD Extensions): separate 128-bit XMM registers, allowing sixteen 8-bit, eight 16-bit, or four 32-bit operations.
● AVX (Advanced Vector Extensions): 256-bit YMM registers, allowing thirty-two 8-bit, sixteen 16-bit, or eight 32-bit operations (see the sketch below).
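To make the widths concrete: a 256-bit YMM register holds eight 32-bit floats, so one AVX instruction performs eight single-precision operations at once. A minimal sketch of a*X + Y using AVX intrinsics (the function name saxpy_avx and the variable names are illustrative; compile with -mavx):

    #include <immintrin.h>

    /* Y = a*X + Y, processing eight floats per AVX instruction. */
    void saxpy_avx(long n, float a, const float *X, float *Y) {
        __m256 va = _mm256_set1_ps(a);              /* broadcast a to all 8 lanes */
        long i;
        for (i = 0; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(&X[i]);     /* load 8 elements of X */
            __m256 vy = _mm256_loadu_ps(&Y[i]);     /* load 8 elements of Y */
            vy = _mm256_add_ps(_mm256_mul_ps(va, vx), vy);
            _mm256_storeu_ps(&Y[i], vy);            /* store 8 results */
        }
        for (; i < n; i++)                          /* scalar tail loop */
            Y[i] = a * X[i] + Y[i];
    }

The explicit scalar tail loop reflects one of the omissions listed earlier: without a vector length register, the SIMD code must itself handle any remainder that is not a multiple of eight.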
GPU
● The CPU is designed to excel at executing a sequence of operations, called a thread, as fast as possible, and can execute a few tens of these threads in parallel.
● The GPU is designed to excel at executing thousands of threads in parallel, amortizing the slower single-thread performance to achieve greater throughput.
CUDA
General-purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.
CUDA Functions

● Allocate memory: cudaMalloc((void **) &d_x, size)


● Transfer memory: cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice)
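Putting the two calls together, a minimal host-side sketch (the names x, d_x, and N are illustrative):

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        int N = 1 << 20;
        size_t size = N * sizeof(float);

        float *x = (float *)malloc(size);        // host memory
        float *d_x;
        cudaMalloc((void **)&d_x, size);         // device memory

        // ... fill x with data here ...

        cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);   // host -> device

        // ... launch kernel(s) that read and write d_x ...

        cudaMemcpy(x, d_x, size, cudaMemcpyDeviceToHost);   // device -> host
        cudaFree(d_x);                           // free device memory
        free(x);
        return 0;
    }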
kernels
● A kernel is defined using the __global__ declaration specifier.
● The number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<...>>> execution configuration syntax.
● Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through built-in variables.
● Threads are grouped into blocks.
● Specify the number of blocks and the number of threads per block (combined in the sketch below).
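A minimal sketch combining these pieces, shown for a single-precision a*X + Y kernel (the kernel name saxpy, the 256-thread block size, and the d_x/d_y pointers are illustrative):

    // Kernel definition: __global__ marks a function that runs on the GPU.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        // Unique thread ID built from the block index, block size,
        // and thread index built-in variables.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                       // the grid may be larger than n
            y[i] = a * x[i] + y[i];
    }

    // Host-side launch: <<<number of blocks, threads per block>>>,
    // assuming d_x and d_y were allocated and copied as in the earlier sketch.
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<numBlocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);

The if (i < n) guard is needed because the number of launched threads (numBlocks times threadsPerBlock) is rounded up and may exceed n.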
Why two levels of threads?
● A grid of thread blocks is easier to manage than one big block of threads.
● A GPU has thousands of cores, grouped into tens of streaming multiprocessors (SMs).
○ Each SM has its own memory and scheduling.
○ Each SM has e.g. 64 cores (P100 architecture).
● GPU can start millions of threads, but they don’t all run simultaneously.
● The scheduler (GigaThread Engine) packs up to ~1000 threads (at most 1024) into one block and assigns the block to an SM.
○ The threads have consecutive IDs.
○ Several thread blocks can be assigned to an SM at same time.
○ Threads in a block don’t execute simultaneously either.
■ They run in warps of 32 threads; more later.
● A thread block assigned to an SM uses resources (registers, shared memory) on the SM.
○ All assigned threads are pre-allocated resources.
■ Since we know the block size when we invoke the kernel, the SM knows how many resources to assign.
○ This makes switching between threads very fast.
■ No dynamic resource allocation.
■ The SM has a huge number of registers (e.g., 64K), so there is no register flush when switching threads.
● Each SM has its own (warp) scheduler to manage threads assigned to it.
● When all threads in a block finish, their resources are freed.
● Then the GigaThread Engine schedules a new block onto the SM, using the freed resources.
● At any time, an SM only needs to manage a few thousand resident threads, instead of the entire grid of millions of threads.
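The per-SM figures mentioned above (number of SMs, warp size, registers, maximum threads per block) vary between GPUs; a minimal sketch that queries them at run time through the CUDA runtime API:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);       // properties of GPU 0

        printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
        printf("Warp size:                 %d\n", prop.warpSize);
        printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
        printf("Registers per SM:          %d\n", prop.regsPerMultiprocessor);
        printf("Shared memory per SM:      %zu bytes\n",
               prop.sharedMemPerMultiprocessor);
        return 0;
    }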
GPU Memory organization
