Author: Yifan Yu
Title: Implementing the Generator of DCGAN on FPGA
The project was carried out by implementing a generative model on the Nexys 4 trainer
board with an Artix-7 FPGA from Xilinx. The pre-trained model is the generator of a
Generative Adversarial Network (GAN), a popular class of models that can create realistic
images resembling the training data. The core was written in Verilog, but several Xilinx IPs
were also used to facilitate the design. Xilinx Vivado 2017.4 was used as the development
platform. Both fixed-point and floating-point arithmetic were used to achieve a balance
between efficiency and accuracy.
With simplicity as the main goal of the design, some optimizations were deliberately
avoided. This paper serves as detailed documentation of the design and implementation
process. Transposed convolution, the core operation of the generative model, is described.
A method to map network weights and biases from a high-precision floating-point
representation to a low-precision integer representation, known as quantization, is derived.
The quantization scheme leads to an efficient implementation of the General Matrix
Multiplication (GEMM) operation, which is at the heart of neural network computations.
Finally, possible optimization methods are discussed as future work.
1 Introduction
4 Quantization
5 Hardware Architecture
6 Implementation Details
7 Conclusions
References
List of Abbreviations
IP Intellectual Property.
MAC Multiply-Accumulate.
MLP Multilayer Perceptron.
1 Introduction
Specialized hardware for running deep learning algorithms seems to be a natural step in
the evolution of Artificial Intelligence. Google, for example, developed its own Application-
Specific Integrated Circuit (ASIC) named the Tensor Processing Unit (TPU) to accelerate
tensor computations. The formidable cost of such endeavors limits ASIC development to the
big players in the industry. For tech startups and hobbyists, the Field-Programmable Gate
Array (FPGA) comes to the rescue by filling the gap between high-cost customized ICs and
the need for specialized hardware in certain applications. The programmable logic blocks
contained in an FPGA can be reconfigured, making it ideal for situations where "in the field"
functionality updates are required. It is also a valuable, low-cost tool for fast prototyping
and verification of ASIC designs.
In machine learning, a discriminative model takes data that can be observed from a phenomenon
and outputs data that can only be inferred. For instance, let the phenomenon
be a group of people speaking different languages; a discriminative model can take the
speech data and infer the language being spoken. In other words, the model classifies
the speech samples into different language types or labels. This can be done, e.g., by analyzing
the linguistic model of each speech sample and observing the differences. By contrast, a
generative model outputs both data that can be directly observed and data that can
only be inferred. Therefore, in the previous example, a generative model would have to
actually learn each language and be able to generate speech in it. Probabilistically
speaking, a discriminative model learns the conditional probability distribution P(Y|X)
(the probability of language Y given speech X), while a generative model learns the joint
probability distribution P(X, Y) = P(X|Y)P(Y), which explicitly models the speech generation
process of each language class.
Generative models are interesting since they are capable of creating new data that resembles
real-world data. These models enable the machine to paint new paintings,
compose new music, or write new poems. Many types of generative models exist [1],
including deep belief networks, variational autoencoders, Boltzmann machines, GANs, etc.
A GAN model consists of two different neural networks trained to compete against each
other in order to learn the probability distribution of a particular dataset. The training
process pits the two players in a minimax game so that the performance of both networks
improves over time. Introduced in 2014 by Ian Goodfellow et al. [2], the GAN soon gained
popularity in the machine learning community and kindled a wave of research on improving
its training properties and generation quality.
The marriage of FPGA and GAN seems to be an interesting topic in its own right. This
project explores such possibilities by implementing a pre-trained generator model of the Deep
Convolutional Generative Adversarial Network (DCGAN) proposed by Alec Radford et
al. [3] on an FPGA to generate realistic pictures. Figure 1 shows a group of generated images
of bedrooms from a model trained on the LSUN dataset [4].
Nowadays, High Level Synthesis (HLS) is a popular option for implementing algorithms
on FPGAs. HLS takes a behavioral description written in a high-level programming language
such as C and translates it into a Register-Transfer Level (RTL) Hardware
Description Language (HDL) such as Verilog or VHDL. This approach is particularly
favored by engineers with a software background who wish to quickly convert an algorithmic
description into a hardware implementation. In this project, however, low-level HDL
was chosen to implement the generator model in order to gain finer control over the
implementation details. The architecture was designed with simplicity in mind and abstains
from premature optimizations. This paper serves as a rather detailed documentation of
the design and implementation process. The source code of this project is published on
GitHub [5] under the Apache License 2.0.
In this chapter, a brief description of GAN [2] is presented first, followed by the structure of
the generator in the DCGAN model used in this project. There are three types of layers in
the model: a transposed convolutional layer that upscales its input, a batch normalization
layer that improves model stability and accuracy, and an activation layer that introduces
nonlinearity to the model. The main focus is on the transposed convolutional layer, which
is the core of the generator model.
There are two networks in a GAN model: D, the discriminator, and G, the generator. D is
a discriminative model which computes a function D : x → p, where x is an input example
and p ∈ R is the probability that x came from the real training data rather than from data
generated by G. In a sense, this probability value identifies the input example as "authentic"
or not, so the higher the probability, the better D does at discriminating authentic data
from data "faked" by G. On the other hand, G tries to fool D by generating output that
resembles the real data, and learns the data distribution during the training process. The
input to G is a vector z of random noise, which could be drawn from a normal distribution,
so G : z → x is a mapping from the noise space z to the data space x.
During the training, two types of examples are fed to D: existing training examples and
examples generated by G. The system can be trained with regular Stochastic Gradient
Descent (SGD) and backpropagation. The training process improves the ability of both D
and G, until eventually the output of G is indistinguishable from real examples to D, that
is, the output of D approaches 1/2. Once trained, D can be discarded and G can be used
in different applications.
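For reference, the two-player minimax game introduced in [2] optimizes the value function

min_G max_D V(D, G) = E_(x ∼ p_data) [ log D(x) ] + E_(z ∼ p_z) [ log(1 − D(G(z))) ]

where p_data is the distribution of the training data and p_z is the noise distribution from which z is drawn.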
In the original paper, both networks are Multilayer Perceptrons (MLPs). However, many
different network types have been proposed since then. In this project, D and G are both
deep Convolutional Neural Networks (CNNs) [3], which are well suited to image processing.
The training is done on a GPU with floating-point numbers. Since D is discarded after
training, we will only be concerned with G from this point on.
Figure 2 shows the network structure of G, with five transposed convolutional (TC) layers
and their output dimensions. Each of the first four TC layers is followed by a layer of batch
normalization (BN) and a layer of Rectified Linear Units (ReLU) for activation. The last
TC layer is followed by a Tanh layer for activation. The function of the BN and activation
layers will be explained later. The parameters of each layer are detailed in table 1. The
network structure is rather simple compared with other, much larger networks; ResNet,
for example, contains a deep cascade of 152 layers.
Table 1: DCGAN Layer Details
In CNNs, a convolutional layer extracts various features from the input, essentially performing
a downsampling operation. A transposed convolutional layer [6], also known as a
fractionally strided convolutional layer, or sometimes erroneously as a deconvolutional layer,
on the other hand performs upsampling on the input. Upsampling is needed in the generator
model to successively map the 1 × 100 random noise input z to a much larger 3 × 64 × 64
output. Upsampling of data is often done with interpolation, but transposed convolution
can map the input to a richer space than what can be achieved with interpolation. It also
offers trainability, which makes it useful for neural networks. Conceptually, if a convolutional
layer of stride s is run backwards, it can be seen as a convolutional layer with stride 1/s,
hence the name fractionally strided convolution.
2.3.1 Convolution
As an example, let the input be a 4 × 4 matrix A and the kernel a 2 × 2 matrix K:

A =
| 1 2 3 4 |
| 4 3 2 1 |
| 1 2 3 4 |
| 4 3 2 1 |

K =
| 1 2 |
| 3 4 |
In addition to the kernel size k (2 in this case), a convolution can have some extra parameters.
Padding size p pads the input with p rows or columns of zeros at the borders.
Stride size s is the step size to slide the kernel. Assume the convolution operates with
p = 1 and s = 2, that is, K is slid across the zero-padded matrix B with a step of 2, from
left to right, top to bottom. B is shown below:
B =
| 0 0 0 0 0 0 |
| 0 1 2 3 4 0 |
| 0 4 3 2 1 0 |
| 0 1 2 3 4 0 |
| 0 4 3 2 1 0 |
| 0 0 0 0 0 0 |
When K is slid across B, the entries of B it overlaps with at each step are called a patch;
the first patch, for example, is the 2 × 2 block in the top-left corner of B. At each step, an
element-wise inner product (Frobenius inner product) ⟨K, P⟩_F is computed between K and
the corresponding patch P and stored as an element of the result matrix C:

C =
| 4  18 12 |
| 12 25 13 |
| 8  7  1  |
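The sliding-window description above translates directly into a small C program. The following minimal sketch reproduces the example (it is an illustration only, not code from the project); running it prints the three rows of C.

#include <stdio.h>

#define N  4                          /* input size                  */
#define KS 2                          /* kernel size k               */
#define P  1                          /* padding p                   */
#define S  2                          /* stride s                    */
#define O  ((N + 2*P - KS) / S + 1)   /* output size (3)             */

int main(void)
{
    int a[N][N]     = { {1, 2, 3, 4}, {4, 3, 2, 1},
                        {1, 2, 3, 4}, {4, 3, 2, 1} };
    int ker[KS][KS] = { {1, 2}, {3, 4} };
    int b[N + 2*P][N + 2*P] = {{0}};  /* zero-padded input B         */
    int c[O][O] = {{0}};

    for (int y = 0; y < N; y++)       /* build B by padding A        */
        for (int x = 0; x < N; x++)
            b[y + P][x + P] = a[y][x];

    /* Slide the kernel over B with step S and accumulate the
     * Frobenius inner product of the kernel with each patch.        */
    for (int oy = 0; oy < O; oy++)
        for (int ox = 0; ox < O; ox++)
            for (int r = 0; r < KS; r++)
                for (int t = 0; t < KS; t++)
                    c[oy][ox] += b[oy*S + r][ox*S + t] * ker[r][t];

    for (int oy = 0; oy < O; oy++) {  /* prints 4 18 12 / 12 25 13 / 8 7 1 */
        for (int ox = 0; ox < O; ox++)
            printf("%4d", c[oy][ox]);
        printf("\n");
    }
    return 0;
}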
In practice, the convolution is implemented as a single matrix multiplication. First, the
kernel K is unrolled into a row vector:

Krow = ( 1 2 3 4 )
For each patch in B that K convolves with, the entries in that patch are unrolled into a
matrix Bcol made of column vectors:
Bcol =
| 0 0 0 0 3 1 0 3 1 |
| 0 0 0 4 2 0 4 2 0 |
| 0 2 4 0 2 4 0 0 0 |
| 1 3 0 1 3 0 0 0 0 |
This operation is called im2col, namely, image to columns. Computing the product
Krow · Bcol then yields a 1 × 9 matrix:

( 4 18 12 12 25 13 8 7 1 )
Finally this matrix is “reshaped” to the desired 3 × 3 output C using the operation col2im.
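The im2col step itself is little more than an index calculation. The following minimal C sketch illustrates it; the patch ordering and the row-major layout are assumptions chosen to match the matrices shown above rather than details taken from the project source.

void im2col(const int *b, int bn,   /* padded input, bn x bn             */
            int k, int s, int o,    /* kernel size, stride, output size  */
            int *bcol)              /* result, (k*k) x (o*o), row-major  */
{
    for (int oy = 0; oy < o; oy++)
        for (int ox = 0; ox < o; ox++)
            for (int r = 0; r < k; r++)
                for (int t = 0; t < k; t++) {
                    int row = r * k + t;      /* position inside the patch */
                    int col = oy * o + ox;    /* index of the patch        */
                    bcol[row * (o * o) + col] =
                        b[(oy * s + r) * bn + (ox * s + t)];
                }
}

Calling this on the 6 × 6 padded matrix B with k = 2, s = 2 and o = 3 produces exactly the 4 × 9 matrix Bcol above, after which the convolution reduces to the single product Krow · Bcol, and col2im is merely a reshape of the resulting 1 × 9 row into the 3 × 3 matrix C.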
The procedure just described is how regular convolution would be implemented in practice.
There is, however, an alternative view of the convolution, also performed with a single
matrix multiplication. This alternative view is impractical for implementation, but it makes
it easy to reverse the roles of the input and the output. To see this, first unroll the
zero-padded input B into a 36 × 1 matrix B̃ and the output C into a 9 × 1 matrix C̃:

B̃⊤ = ( 0 0 0 0 0 0 0 1 2 3 4 0 0 4 ... )

C̃⊤ = ( 4 18 12 12 25 13 8 7 1 )
Then the convolution can be represented as a sparse 9 × 36 matrix M with entries from the
kernel K, one patch per row, such that C̃ = M B̃:

M =
| 1 2 0 0 0 0 3 4 0 0 0 0 ... |
| 0 0 1 2 0 0 0 0 3 4 0 0 ... |
| ...                          |

Multiplying by M maps a 36 × 1 vector to a 9 × 1 vector (the forward pass), while multiplying
by M⊤ maps a 9 × 1 vector back to a 36 × 1 vector with the shape of B̃. This backward
pass is exactly the upsampling behaviour wanted from a transposed convolution.
It needs to be pointed out that the term backward pass, rather than inverse operation, was
used: M⊤ does not recover the numerical values of B from C, it only recovers the shape
of B, so it is misleading to call this operation deconvolution. Also, the same kernel K
defines both the forward pass and the backward pass.
The kernel size k, padding p, and stride s of the convolution affect the corresponding
transposed convolution as well. The detailed relationship is well documented by
V. Dumoulin and F. Visin in [7]. For instance, the previous convolution example has an
input size i = 4, which satisfies the condition (i + 2p − k) mod s = 0. The associated
transposed convolution (input size i′ = 3) can then be described by a regular convolution on
input size ĩ′ with parameters k′ = k, s′ = 1 and p′ = k − p − 1, where ĩ′ is the size obtained
when the input is dilated with s − 1 zeros between each element. The output size is

o′ = s(i′ − 1) + k − 2p

Here, ĩ′ = 5, k′ = 2, s′ = 1 and p′ = 0. The output size is o′ = 4, which equals the original
input size i of the associated convolution. To illustrate this with a numerical example, let
the input 3 × 3 matrix be
A′ =
| 1 2 3 |
| 3 2 1 |
| 1 2 3 |

Dilating A′ with s − 1 = 1 zeros between each element gives the 5 × 5 matrix

A′′ =
| 1 0 2 0 3 |
| 0 0 0 0 0 |
| 3 0 2 0 1 |
| 0 0 0 0 0 |
| 1 0 2 0 3 |
Convolving A′′ (without padding, since p′ = 0) with the same kernel K using stride s′ = 1
gives the result

C′ =
| 1 4 2 6  |
| 9 8 6 4  |
| 3 4 2 2  |
| 3 8 6 12 |
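The dilate-then-convolve procedure of this example can be sketched in a few lines of C. The sizes are fixed to the 3 × 3 example for brevity, and the code is an illustration rather than the project implementation; running it prints the four rows of C′.

#include <stdio.h>

#define IN 3                            /* input size i'                     */
#define KS 2                            /* kernel size k                     */
#define ST 2                            /* stride s of the original conv     */
#define PD 1                            /* padding p of the original conv    */
#define PP (KS - PD - 1)                /* padding p' of the equivalent conv */
#define DS (ST * (IN - 1) + 1 + 2*PP)   /* dilated-and-padded size           */
#define OS (ST * (IN - 1) + KS - 2*PD)  /* output size o' (4)                */

int main(void)
{
    int a[IN][IN]   = { {1, 2, 3}, {3, 2, 1}, {1, 2, 3} };
    int ker[KS][KS] = { {1, 2}, {3, 4} };
    int dil[DS][DS] = {{0}};            /* dilated (and padded) input A''    */
    int c[OS][OS]   = {{0}};

    /* Insert s - 1 zeros between elements and p' zeros at the borders.      */
    for (int y = 0; y < IN; y++)
        for (int x = 0; x < IN; x++)
            dil[PP + y*ST][PP + x*ST] = a[y][x];

    /* Regular stride-1 convolution with the same kernel K.                  */
    for (int oy = 0; oy < OS; oy++)
        for (int ox = 0; ox < OS; ox++)
            for (int r = 0; r < KS; r++)
                for (int t = 0; t < KS; t++)
                    c[oy][ox] += dil[oy + r][ox + t] * ker[r][t];

    for (int oy = 0; oy < OS; oy++) {   /* prints the rows of C'             */
        for (int ox = 0; ox < OS; ox++)
            printf("%4d", c[oy][ox]);
        printf("\n");
    }
    return 0;
}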
In practice, the transposed convolution can also be implemented with a single matrix
multiplication followed by a col2im operation. The details are explained in chapter 3.
A batch normalization layer normalizes its input x using the mean x̄ and variance σ²(x):

y = γ · (x − x̄) / √(σ²(x) + eps) + β          (2)

Here eps is a small constant added for numerical stability, and γ and β can be regarded as
a trained weight and bias. The normalized term (x − x̄)/√(σ²(x) + eps) has zero mean and
unit standard deviation, which γ and β then scale and shift. Batch normalization optimizes
the training of the network, making it converge faster with higher learning rates. It can
also improve the overall accuracy of the model. For deep networks like DCGAN, it is a key
ingredient.
However, this operation involves an inverse square root, 1/√(σ²(x) + eps), which is
rather difficult to compute with fixed-point numbers. Luckily, Xilinx Vivado includes an IP
core which can both convert between fixed-point and floating-point representations and
compute the inverse square root.
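For reference, the following minimal C sketch shows what inference-time batch normalization, equation (2), looks like on dequantized floating-point data. The assumption of one set of parameters (mean, variance, γ, β) per channel follows common practice and is not a detail taken from the thesis.

#include <math.h>

void batch_norm(float *data, int channels, int plane_size,
                const float *mean, const float *var,
                const float *gamma, const float *beta)
{
    const float eps = 1e-5f;                 /* small constant for stability */

    for (int c = 0; c < channels; c++) {
        /* 1/sqrt(var + eps): the operation handled by the Xilinx
         * floating-point IP in the hardware implementation.                 */
        float inv_std = 1.0f / sqrtf(var[c] + eps);
        for (int i = 0; i < plane_size; i++) {
            float x = data[c * plane_size + i];
            data[c * plane_size + i] = gamma[c] * (x - mean[c]) * inv_std + beta[c];
        }
    }
}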
An activation function, also known as a transfer function, adds non-linearity to the network.
It is often attached to the output of a layer, typically mapping the results to the range (0, 1)
or (−1, 1), although other possibilities exist. Assuming the function maps the input to the
range (0, 1), a value close to 0 would be seen as "off" or "no", while a value close to 1 would
be seen as "on" or "yes". This indicates whether the following connection should see this
output as activated or not, hence the name activation function.
Many different types of activation functions exist, falling into two main categories: linear
and nonlinear. A commonly used nonlinear sigmoid (S-shaped) function is y = 1/(1 + e^(−x)),
shown in figure 3.
The Rectified Linear Unit (ReLU) is defined as

y = max(0, x)          (3)
As shown in figure 4, it is worth noting that ReLU is in fact nonlinear. When compared
with the sigmoid function, the gradient of ReLU does not saturate when x gets large,
which makes SGD converge faster. In addition, since all negative values are converted
to zero, it adds the desirable feature of sparsity to the network, leading to more efficient
computation. The implementation of ReLU is of course straightforward.
The hyperbolic tangent function tanh is another widely used activation function. It works
similarly to the sigmoid function with the additional property of being symmetrical with
respect to the origin. In the DCGAN model, the Tanh layer is used as the last layer to
output the final generated image.
It is defined as

y = tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
            = (e^(2x) − 1) / (e^(2x) + 1)          (4)
            = (1 − e^(−2x)) / (1 + e^(−2x))
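One practical consequence of form (4) is that tanh can be evaluated with a single exponential, which maps naturally onto the exponential operator of the floating-point IP used later. A minimal C sketch (not the project code):

#include <math.h>

static float tanh_single_exp(float x)
{
    if (x >  10.0f) return  1.0f;      /* saturate to avoid exp overflow */
    if (x < -10.0f) return -1.0f;
    float e2x = expf(2.0f * x);        /* only one exponential needed    */
    return (e2x - 1.0f) / (e2x + 1.0f);
}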
The previous chapter discussed different ways to view the convolution operation and its
cousin, the transposed convolution, both conceptually and from an implementation point
of view. When it comes down to implementation, we have seen that both operations can
be implemented with a single matrix multiplication. In other words, matrix multiplication
is the main computational burden of these operations. Meanwhile, real-world applications
often involve very large matrices. Therefore, as a low-level operation, the implementation
of matrix multiplication is often heavily optimized.
GEMM is defined as

C ← α · op(A) · op(B) + β · C,

where op(X) is either X or its transpose X⊤, and α and β are scalar coefficients.
A subtle detail in the implementation of GEMM is the order in which matrix entries are
stored. There are two different orders for storing the same matrix: row-major order and
column-major order. In row-major order, the entries of each row are stored contiguously in
memory, while in column-major order, the entries of each column are stored contiguously.
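A straightforward, deliberately unoptimized GEMM can be sketched as follows, assuming row-major storage and the simplified case α = 1, β = 0; this is an illustration rather than the exact listing used in the project.

void gemm(const float *a, const float *b, float *c,
          int m, int k, int n)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)              /* dot product of row i */
                acc += a[i * k + p] * b[p * n + j];  /* of A and column j of B */
            c[i * n + j] = acc;
        }
}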
This implementation is straightforward and not very efficient, but it is a good starting point
to guide us through the rest of the design.
Another operation used by transposed convolution is col2im. It cherry-picks the elements
computed by the GEMM and places them in the destination image. To describe col2im, it is
necessary to introduce the quantities used for transposed convolution:
• the height and width of each input channel (sometimes also called an input plane);
• the number of input channels; the input is thus a 3D tensor with dimensions
(input channels) × (input height) × (input width);
• with m and k set accordingly (one dimension indexing the input channels and the other
the spatial positions within a channel), the input tensor is stored as a matrix A ∈ R^(m×k);
• the height and width of each kernel channel;
• each kernel is also a 3D tensor, with dimensions (input channels) × (kernel height) × (kernel width).
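As a concrete illustration of how the GEMM and col2im can be merged into one loop nest, consider the following minimal C sketch. The data layout (channel-major, row-major within each channel), the weight indexing, and all identifier names are assumptions made for illustration; the scatter rule follows the dilate-and-convolve convention of chapter 2 rather than the project source.

void transposed_conv(const int *input,   /* c_in  x h_in  x w_in            */
                     const int *weight,  /* c_in x c_out x k x k            */
                     int *output,        /* c_out x h_out x w_out           */
                     int c_in, int h_in, int w_in,
                     int c_out, int k, int stride, int pad)
{
    int h_out = stride * (h_in - 1) + k - 2 * pad;
    int w_out = stride * (w_in - 1) + k - 2 * pad;
    int pp = k - pad - 1;                      /* padding p' of the          */
                                               /* equivalent convolution     */
    for (int i = 0; i < c_out * h_out * w_out; i++)
        output[i] = 0;                         /* col2im accumulates         */

    for (int co = 0; co < c_out; co++)         /* three loops over the       */
      for (int r = 0; r < k; r++)              /* kernel-related GEMM        */
        for (int t = 0; t < k; t++)            /* dimension                  */
          for (int y = 0; y < h_in; y++)       /* two loops over the         */
            for (int x = 0; x < w_in; x++) {   /* spatial GEMM dimension     */
                int oy = y * stride + pp - r;  /* col2im target position     */
                int ox = x * stride + pp - t;
                if (oy < 0 || oy >= h_out || ox < 0 || ox >= w_out)
                    continue;
                int acc = 0;                   /* GEMM dot product over      */
                for (int ci = 0; ci < c_in; ci++)   /* the input channels    */
                    acc += input[(ci * h_in + y) * w_in + x] *
                           weight[((ci * c_out + co) * k + r) * k + t];
                output[(co * h_out + oy) * w_out + ox] += acc;
            }
}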
The key to understanding the merged code is to notice that the GEMM loop running over
the kernel-related dimension of the result (output channels × kernel height × kernel width)
can be split into three nested for-loops: one over the output channels and two over the
kernel rows and columns. Similarly, the GEMM loop running over the spatial dimension
(input height × input width) is equivalent to two nested for-loops over the input rows and
columns, just as in the col2im code.
The merged C code in listing 3 serves as the blueprint for implementing transposed convolution
on the FPGA with Verilog. It can be seen from the code that the output channels
can be computed in groups, or even each channel individually. This is a convenient fact
when the weights of a layer cannot all be fit into the Block RAM (BRAM) on the FPGA chip
at once.
4 Quantization
Quantization brings the additional benefit of compressing the model size. When 32-bit
floating-point weights are quantized to 8 bits, the model size can be reduced by up to
75%. This cuts down the memory bandwidth requirement and improves memory efficiency.
Consequently, the power consumption can be greatly reduced.
One could imagine that quantization would cause a tremendous loss to the precision of
a model; for example, when 32-bit floating-point numbers are converted to 8-bit unsigned
integers, the range drops from (1.175494351 × 10^(−38), 3.402823466 × 10^38) to (0, 255).
However, one of the peculiar properties of neural networks is that they are very resilient
to noise. Of course, this is also one of the reasons why they are so successful in real-world
applications. We can view the precision loss of quantization as a form of internal noise, and
it turns out that the precision loss incurred is rather small [8].
There exist several different quantization methods. This project adopted the quantization
scheme implemented in Google's gemmlowp library [9] and described in [10]. gemmlowp is
a low-precision GEMM implementation for fixed-point numbers. The original weights,
represented as 32-bit single-precision floating-point numbers, are mapped to 8-bit unsigned
integers. GEMM is then carried out on the FPGA board with these 8-bit integers.
Intermediate Multiply-Accumulate (MAC) results are 32-bit integers to accommodate the
accumulation. Eventually, the final output of the model is converted back to floating-point.
The quantization works by first identifying the range of the data. In neural networks, the
values are usually distributed within a relatively small range. For instance, for 8-bit
quantization, if the minimum value of the weights is −10.0 and the maximum value is 20.0,
then −10.0 would be mapped to 0 while 20.0 would be mapped to 255. However, there is an
additional requirement that can bring great benefits to subsequent computations: the real
value 0 should be exactly representable. As will be shown, this can be achieved with a
small modification to the mapping equation.
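A minimal C sketch of this min/max-based 8-bit quantization is shown below. It follows the gemmlowp scheme [9][10] as described here; the structure and names are illustrative and do not come from the project source.

#include <math.h>
#include <stdint.h>

typedef struct {
    float   scale;       /* S: real step size per integer step            */
    uint8_t zero_point;  /* Z: the integer that represents real 0.0       */
} qparams;

static qparams choose_qparams(float min, float max)
{
    qparams p;
    if (min > 0.0f) min = 0.0f;        /* make sure 0.0 is in the range    */
    if (max < 0.0f) max = 0.0f;

    p.scale = (max - min) / 255.0f;
    if (p.scale == 0.0f) p.scale = 1.0f;   /* guard against a degenerate range */

    /* Pick Z so that real 0.0 maps exactly to an integer in [0, 255].     */
    float z = -min / p.scale;
    if (z < 0.0f)   z = 0.0f;
    if (z > 255.0f) z = 255.0f;
    p.zero_point = (uint8_t)lroundf(z);
    return p;
}

static uint8_t quantize(float f, qparams p)
{
    /* q = f / S + Z, clamped to the 8-bit range.                          */
    float q = f / p.scale + (float)p.zero_point;
    if (q < 0.0f)   q = 0.0f;
    if (q > 255.0f) q = 255.0f;
    return (uint8_t)lroundf(q);
}

static float dequantize(uint8_t q, qparams p)
{
    /* f = S * (q - Z)                                                     */
    return p.scale * ((float)q - (float)p.zero_point);
}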
The quantization scheme can be derived with basic algebra. Define a real number f as the
result of an affine mapping f = Sq + B, where q is the corresponding quantized integer,
S ∈ R is a constant scaling factor, and B ∈ R is a constant shift. Next, modify the mapping
a bit so that the real value 0 is exactly representable: writing the shift as B = −SZ with an
integer zero point Z gives f = S(q − Z), so that f = 0 corresponds exactly to q = Z. Solving
for q:

q = f / S + Z          (6)
Now consider the matrix product C = AB, where the entries of A are quantized with
parameters (SA, ZA) to integers pi, the entries of B with (SB, ZB) to integers qi, and the
output C with (SC, ZC). For an entry fC of C, fC = Σi Ai Bi, where i indexes the entries
of the corresponding row of A and column of B. Substituting the quantized representations:

fC = Σi Ai Bi
   = Σi SA (pi − ZA) SB (qi − ZB)          (7)
   = SA SB Σi (pi − ZA)(qi − ZB)
Clearly the term Σi (pi − ZA)(qi − ZB) is the computational core here; let it be K. Then

rC = fC / SC + ZC
   = (SA SB / SC) K + ZC          (8)
K = Σi (pi − ZA)(qi − ZB)
  = Σi (pi qi − ZB pi − ZA qi + ZA ZB)
  = Σi pi qi − ZB Σi pi − ZA Σi qi + Σi ZA ZB          (9)
  = Σi pi qi − ZB Σi pi − ZA Σi qi + k ZA ZB
Note that Σi 1 = k, which is the number of columns of A and the number of rows of B.
Four terms are obtained here. The first term, Σi pi qi, is the main term. The second term,
−ZB Σi pi, is the product of −ZB and the sum of the current row of A; it only needs to be
calculated once and added to each result of the first term. The third term, −ZA Σi qi, and
the fourth term, k ZA ZB, are similar. The computation can be carried out by computing
the first term as the core and then adding the other three terms one by one.
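The following minimal C sketch shows the four-term decomposition of equation (9) applied to an entire GEMM, leaving the results as 32-bit accumulators (the rescaling of equation (8) and the bias addition are omitted). Names and layout are illustrative only.

#include <stdint.h>

void quantized_gemm(const uint8_t *a, const uint8_t *b, int32_t *c,
                    int m, int k, int n, int32_t za, int32_t zb)
{
    for (int i = 0; i < m; i++) {
        /* Second term: -ZB * (sum of row i of A), computed once per row.   */
        int32_t row_sum = 0;
        for (int p = 0; p < k; p++)
            row_sum += a[i * k + p];

        for (int j = 0; j < n; j++) {
            /* Third term: -ZA * (sum of column j of B).  For clarity it is
             * recomputed here; in practice the column sums would be
             * precomputed once for the whole matrix.                        */
            int32_t col_sum = 0;
            int32_t dot = 0;
            for (int p = 0; p < k; p++) {
                dot     += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
                col_sum += b[p * n + j];
            }
            /* K = sum(p*q) - ZB*sum(p) - ZA*sum(q) + k*ZA*ZB               */
            c[i * n + j] = dot - zb * row_sum - za * col_sum
                           + (int32_t)k * za * zb;
        }
    }
}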
Biases are added to the 32-bit accumulator results. Recalling that fC = Σi Ai Bi = SA SB K,
it is clear that biases should be quantized with S = SA SB and Z = 0. With a zero point of 0,
negative numbers remain negative; consequently, the bias values should be stored as
signed 32-bit integers.
5 Hardware Architecture
The development board for this project is Nexys 4 from Digilent with an Artix-7 FPGA chip
from Xilinx. Table 2 summarizes the hardware resources available on the Nexys 4 board.
Table 2: Nexys 4 Resource Specifications
BRAM will be the main active memory for input, output, weight, and bias data. The FPGA
chip in use contains 4860 Kbits of BRAM or 607.5 KB. The original model size is around
13.7 MB, which is considered very compact when compared with other deep learning
models but still very demanding for our system. Once quantized, the model size is roughly
3.5 MB. Therefore, it is not possible to fit the entire model into BRAM, and interacting with
external storage devices, e.g., the 16 MB Quad-SPI Flash or the 16 MB Pseudo SRAM
(PSRAM) on the Nexys 4 board, is unavoidable.
In order to allocate BRAM properly, it is necessary to list the detailed memory requirements
of each layer.
In table 3, the numbers in parentheses are the bit widths of the values. The table indicates
that the maximum input/output size is 65536 values at 32 bits, the maximum weight size is
2097152 values at 8 bits, and the maximum bias size is 512 values at 32 bits. However,
notice that when the input size is 65536 at 32 bits, the corresponding layers are batch
normalization and ReLU. For these layers the same buffer can be used to store both the
input and the output, because the input can be updated in place. Consequently, the input
buffer only needs to store 65536 values at 8 bits.
To conclude, three main BRAM buffers are needed: an input buffer (65536 bytes), an output
buffer (262144 bytes), and a weight buffer (2097152 bytes). Some auxiliary buffers are
needed to store other data such as biases. Because the resources on the board are quite
limited, it is easiest to compute the model in a layer-by-layer fashion without exploiting
layer-wise parallelism. First, the weights of a transposed convolutional layer are read from
the external storage device into the weight buffer; the layer is then computed, followed by
one or more auxiliary layers such as a batch normalization or activation layer. A batch
normalization layer reads and writes the same output buffer. A ReLU layer reads this buffer
and writes its output to the input buffer, forming a circular structure. Then the weights of
the next transposed convolutional layer are read and the process continues until it reaches
the final Tanh layer.
5.3 Architecture
The system stores the weights of the transposed convolutional layers in the 16 MB Quad-SPI
Flash. The input is forwarded into a computation unit which contains a transposed
convolution module TC, a batch normalization module BN, a ReLU module ReLU, and a
tanh module Tanh. The unit can be configured as TC-BN-ReLU or TC-Tanh. The network
computes one layer at a time, writing the output to the output buffer (except for the ReLU
module). Before reaching the final layer, the output of the computation unit (the output of
the ReLU module) is written to the input buffer, overwriting the previous input. The final
output of the Tanh module is sent to a PC via a Universal Asynchronous Receiver-Transmitter
(UART) (serial port). Figure 6 shows the structure of the computation unit.
The inputs to the TC module are quantized 8-bit integers. The outputs of the TC module are
32-bit integers, which are the inputs to the BN module. The BN module first dequantizes
the 32-bit input to 32-bit single-precision floating-point numbers according to the IEEE 754
standard; the computation is then carried out in floating point and the results are fed to the
ReLU module, which simply sets negative values to 0.0, quantizes the output to 8-bit
integers again, and overwrites the input buffer with the results. This quantization is done
by calculating the quantization parameters on the fly. When the flow reaches the final Tanh
module, the 32-bit input integers are dequantized just as in the BN module, tanh is
performed on the floating-point numbers, and the final floating-point result is sent to the
computer through UART and converted to PNG format with a simple conversion script.
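The resulting control flow can be summarized with the following C-style pseudocode. The function names are hypothetical stand-ins for the Verilog control logic; only the buffer circulation described above is taken from the design.

void load_weights_from_flash(int layer);   /* flash -> weight buffer        */
void run_transposed_conv(int layer);       /* input buffer -> output buffer */
void run_batch_norm(int layer);            /* in place in the output buffer */
void run_relu_quantize(int layer);         /* output buffer -> input buffer */
void run_tanh_send_uart(int layer);        /* output buffer -> PC (UART)    */

void run_generator(void)
{
    /* The input buffer initially holds the quantized noise vector z.       */
    for (int layer = 0; layer < 5; layer++) {
        load_weights_from_flash(layer);
        run_transposed_conv(layer);
        if (layer < 4) {                   /* TC-BN-ReLU configuration      */
            run_batch_norm(layer);
            run_relu_quantize(layer);      /* closes the buffer "ring"      */
        } else {                           /* TC-Tanh configuration         */
            run_tanh_send_uart(layer);
        }
    }
}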
This combination of floating-point and fixed-point layers achieves a balance between
efficiency and accuracy. The overhead introduced by on-chip quantization and dequantization
is acceptable with the use of DSP48 slices.
6 Implementation Details
Xilinx Vivado version 2017.4 was used as the development platform for the FPGA. A typical
workflow involves design entry in HDL, behavioral simulation, synthesis, implementation,
and bitstream generation.
The first step in implementing the transposed convolutional layer is to reproduce the nested
C loops described in listing 3. For a for-loop such as the one shown in listing 4, the
corresponding Verilog code is shown in listing 5.
The Verilog code in listing 5 is separated into a sequential block and a combinational
next-state logic block for clarity [11]. The merged C code shown in listing 3 can then be
translated into Verilog using this structure as a template.
The transposed convolution module is connected to the input, weight, and output BRAMs.
A single 8-bit or 32-bit value is read or written in every cycle, which is quite inefficient. On
the other hand, although the data bus width of a BRAM can be configured, a maximum of
two read ports are available. It is possible to increase the data bus width to read more
bytes in each cycle, but this is left as future work for optimization. Indeed, I/O is the main
bottleneck of the design, as in many other systems.
The DSP48 Macro IP is utilized to accomplish computations with multiplication and
accumulation. The first block implements the operations shown in table 4. The second block
is the core MAC unit, which only contains P ← A ∗ B + P.

Table 4: DSP48 0 Operations

Select  Operation
0       P ← A ∗ B − C
1       P ← A ∗ B + PCIN
2       P ← A ∗ B + C
The DSP48 slice can be used without pipelining. However, that is essentially a combinational
circuit and the propagation delay will significantly reduce Fmax. To run the FPGA at a
higher frequency, it is necessary to use pipeline registers. In the first version of the design,
only the register after the multiplier was used and the result was delayed by one clock cycle.
This caused a negative Worst Negative Slack (WNS) at a 100 MHz clock frequency, which
indicates that timing requirements are not met on certain paths. By using more pipeline
registers, the result has a larger clock latency but Fmax can be improved.
The batch normalization layer, ReLU layer, and Tanh layer are implemented with floating-point
numbers. Xilinx Vivado ships with a Floating-point IP which supports many
common operations, including conversion between fixed-point and floating-point formats,
multiplication, accumulation, the exponential function, the inverse square root, and many
others. The input precision can be configured as half, single, or double precision, or even a
customized width of exponent and fraction. It is also possible to choose the DSP slice usage,
from logic only to maximum utilization of DSP48 slices. Finally, the user is able to decide
the goal of optimization: minimum resources or maximum clock frequency Fmax. Choosing
maximum clock frequency helps with the eventual timing closure, just as in the case of
DSP48 pipelining. An example floating-point multiplier is shown in figure 10.
6.4 Testbench
Writing testbenches and performing simulations are key steps in FPGA development.
In this project, each module was developed as a separate project with a corresponding
testbench module and a synthesizable top module. The contents of these two modules
are largely similar, but the testbench module usually operates on relatively simple data
to make simulation easier. Once a module has passed simulation and the on-board test, it
is imported and integrated into the main project.
7 Conclusions
This section concludes the project by briefly discussing possible optimization methods. As
mentioned before, the main goal of the design was simplicity, so many optimizations were
not applied. The main bottleneck in the current design, as in many computing systems,
is the I/O bandwidth. During each clock cycle, only one unit of data is read or written,
which is rather inefficient. Therefore, most of the optimization methods naturally focus on
improving the I/O efficiency.
The first obvious optimization is to transfer all weight data from the Quad-SPI Flash to the
PSRAM once the FPGA is configured. Weights would subsequently be loaded from the
PSRAM, which has a parallel interface and results in much faster loading of weights. Burst
transfer can be used to read data from the PSRAM and further reduce I/O latency: the
starting address is placed on the address bus, and a fixed amount of consecutive data is
then read in a single "burst".
Another optimization mentioned earlier is to widen the data buses connected to the BRAM
buffers. Currently, the data port of the input buffer is only 8 bits wide. This can be increased
to 32 bits or even more, so that multiple values can be loaded at the same time and processed
in parallel. The current design is highly sequential and only utilizes around 15% of the
DSP slices.
Yet another optimization is to introduce several small caches using distributed RAM.
These caches are implemented with Lookup Tables (LUTs) and are faster than BRAM in
the sense that they can be read asynchronously. Once inputs and weights are loaded into
these caches, more parallelism can be achieved, utilizing more DSP slices.
Finally, on an FPGA chip with more BRAM capacity, layer-level parallelism can be exploited,
i.e., several layers can be calculated at the same time. This involves modifying the
current ring structure and forwarding data to subsequent layers in a single pass. Each
layer would work on its input concurrently, but some global coordination and scheduling is
needed in order to avoid data overruns.
More discussion on the efficient processing of deep neural networks can be found in [12],
which includes a section on hardware acceleration. Due to the limitations of the development
board used in this project, the design runs in a way that resembles a temporal
architecture; however, it can easily be modified into a spatial architecture. A memory
hierarchy and an optimized dataflow could then be utilized to greatly improve overall
efficiency and reduce power consumption.
References
1 Ian Goodfellow, Yoshua Bengio and Aaron Courville. Deep Learning. MIT Press; 2016.
http://www.deeplearningbook.org
4 Fisher Yu, Yinda Zhang, Shuran Song, et al. LSUN: Construction of a Large-scale Image
Dataset using Deep Learning with Humans in the Loop. arXiv preprint arXiv:1506.03365; 2015.
6 Jonathan Long, Evan Shelhamer and Trevor Darrell. Fully Convolutional Networks for
Semantic Segmentation. CoRR. 2014;abs/1411.4038. http://arxiv.org/abs/1411.4038
8 Vincent Vanhoucke, Andrew Senior and Mark Z. Mao. Improving the speed of neural
networks on CPUs. Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011; 2011.
12 Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang and Joel S. Emer. Efficient Processing of Deep
Neural Networks: A Tutorial and Survey. CoRR. 2017;abs/1703.09039. http://arxiv.org/abs/1703.09039