Vector Processors


Data-level parallelism

Vector, SIMD and GPU architectures


● Data-level parallelism (DLP) arises because there are many data items that
can be operated on at the same time.
● Single instruction stream, multiple data streams (SIMD)—The same
instruction is executed by multiple processors using different data streams.
SIMD computers exploit data-level parallelism by applying the same
operations to multiple items of data in parallel. (Flynn, 1966)
● Three variations of SIMD exploit DLP: vector architectures, multimedia SIMD instruction set extensions, and graphics processing units (GPUs).
Vector architecture
● Grab sets of data elements scattered about memory, place them into large sequential register files, operate on data in those register files, and then disperse the results back into memory.
● A single instruction works on vectors of data, which results in dozens of register-register operations on independent data elements.
Y = a*X + Y: RISC-V (scalar code)
Y = a*X + Y: RV64V (vector code)
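The RISC-V and RV64V slides both implement the same loop. For reference, a minimal C version of it (the classic DAXPY kernel; the names daxpy, X, Y, a, and n are illustrative) is:

    /* DAXPY: Y = a*X + Y over n double-precision elements.
       Every iteration is independent, which is exactly the data-level
       parallelism a vector ISA such as RV64V exploits: the whole loop
       body becomes a handful of vector loads, a multiply-add, and a store. */
    void daxpy(long n, double a, const double *X, double *Y) {
        for (long i = 0; i < n; i++)
            Y[i] = a * X[i] + Y[i];
    }

In scalar RISC-V code every element costs its own load, multiply, add, store, and loop-overhead instructions; in RV64V the loop body turns into a few vector instructions that each operate on many elements, with a vector length register covering the final partial strip.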
SIMD Instruction Set Extensions for Multimedia
● Many media applications operate on narrower data types than the 32-bit processors were optimized for.
● Like vector instructions, a SIMD instruction specifies the same operation on
vectors of data.
● Unlike vector instructions, SIMD instructions tend to specify fewer operands and thus use much smaller register files.
● SIMD extensions have three major omissions: no vector length register, no
strided or gather/scatter data transfer instructions, and no mask registers.
x86 architectures
● MMX (Multimedia Extensions): repurposed the 64-bit floating-point registers, so one instruction performs eight 8-bit operations or four 16-bit operations simultaneously.
● SSE (Streaming SIMD Extensions): separate 128-bit XMM registers, allowing sixteen 8-bit, eight 16-bit, or four 32-bit operations.
● AVX (Advanced Vector Extensions): 256-bit YMM registers, allowing thirty-two 8-bit, sixteen 16-bit, or eight 32-bit operations (see the sketch below).
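To make the widths concrete: a 256-bit YMM register holds eight 32-bit floats, so one AVX instruction performs eight single-precision operations at once. A minimal sketch of a*X + Y using AVX intrinsics (the function name saxpy_avx and the variable names are illustrative; compile with -mavx):

    #include <immintrin.h>

    /* Y = a*X + Y, processing eight floats per AVX instruction. */
    void saxpy_avx(long n, float a, const float *X, float *Y) {
        __m256 va = _mm256_set1_ps(a);              /* broadcast a to all 8 lanes */
        long i;
        for (i = 0; i + 8 <= n; i += 8) {
            __m256 vx = _mm256_loadu_ps(&X[i]);     /* load 8 elements of X */
            __m256 vy = _mm256_loadu_ps(&Y[i]);     /* load 8 elements of Y */
            vy = _mm256_add_ps(_mm256_mul_ps(va, vx), vy);
            _mm256_storeu_ps(&Y[i], vy);            /* store 8 results */
        }
        for (; i < n; i++)                          /* scalar tail loop */
            Y[i] = a * X[i] + Y[i];
    }

The explicit scalar tail loop reflects one of the omissions listed earlier: without a vector length register, the SIMD code must itself handle any remainder that is not a multiple of eight.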
GPU
● The CPU is designed to excel at executing a sequence of operations, called a thread, as fast as possible, and can execute a few tens of these threads in parallel.
● The GPU is designed to excel at executing thousands of threads in parallel, amortizing the slower single-thread performance to achieve greater throughput.
CUDA
General-purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.
CUDA Functions

● Allocate memory: cudaMalloc((void **) &d_x, size)


● Transfer memory: cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice)
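Putting the two calls together, a minimal host-side sketch (the names x, d_x, and N are illustrative):

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        int N = 1 << 20;
        size_t size = N * sizeof(float);

        float *x = (float *)malloc(size);        // host memory
        float *d_x;
        cudaMalloc((void **)&d_x, size);         // device memory

        // ... fill x with data here ...

        cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);   // host -> device

        // ... launch kernel(s) that read and write d_x ...

        cudaMemcpy(x, d_x, size, cudaMemcpyDeviceToHost);   // device -> host
        cudaFree(d_x);                           // free device memory
        free(x);
        return 0;
    }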
kernels
● A kernel is defined using the __global__ declaration specifier.
● The number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<...>>> execution configuration syntax.
● Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through built-in variables.
● Threads are grouped into blocks.
● Specify the number of blocks and the number of threads per block (combined in the sketch below).
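A minimal sketch combining these pieces, shown for a single-precision a*X + Y kernel (the kernel name saxpy, the 256-thread block size, and the d_x/d_y pointers are illustrative):

    // Kernel definition: __global__ marks a function that runs on the GPU.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        // Unique thread ID built from the block index, block size,
        // and thread index built-in variables.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                       // the grid may be larger than n
            y[i] = a * x[i] + y[i];
    }

    // Host-side launch: <<<number of blocks, threads per block>>>,
    // assuming d_x and d_y were allocated and copied as in the earlier sketch.
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<numBlocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);

The if (i < n) guard is needed because the number of launched threads (numBlocks times threadsPerBlock) is rounded up and may exceed n.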
Why two levels of threads?
● A grid of thread blocks is easier to manage than one big block of threads.
● A GPU has thousands of cores, grouped into tens of streaming multiprocessors (SMs).
○ Each SM has its own memory and scheduling.
○ Each SM has e.g. 64 cores (P100 architecture).
● GPU can start millions of threads, but they don’t all run simultaneously.
● The scheduler (GigaThread Engine) packs up to ~1000 threads (at most 1024) into one block and assigns the block to an SM.
○ The threads have consecutive IDs.
○ Several thread blocks can be assigned to an SM at same time.
○ Threads in a block don’t execute simultaneously either.
■ They run in warps of 32 threads; more later.
● A thread block assigned to an SM uses resources (registers, shared memory) on the SM.
○ All assigned threads are pre-allocated resources.
■ Since we know the block size when we invoke the kernel, the SM knows how many resources to assign.
○ This makes switching between threads very fast.
■ No dynamic resource allocation.
■ The SM has a huge number of registers (e.g., 64K), so there is no register flush when switching threads.
● Each SM has its own (warp) scheduler to manage threads assigned to it.
● When all threads in a block finish, their resources are freed.
● Then the GigaThread Engine schedules a new block onto the SM, using the freed resources.
● At any time, an SM only needs to manage a few thousand resident threads, instead of the entire grid of millions of threads.
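The per-SM figures mentioned above (number of SMs, warp size, registers, maximum threads per block) vary between GPUs; a minimal sketch that queries them at run time through the CUDA runtime API:

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);       // properties of GPU 0

        printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
        printf("Warp size:                 %d\n", prop.warpSize);
        printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
        printf("Registers per SM:          %d\n", prop.regsPerMultiprocessor);
        printf("Shared memory per SM:      %zu bytes\n",
               prop.sharedMemPerMultiprocessor);
        return 0;
    }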
GPU Memory organization
