
electronics

Article
An FPGA-Based CNN Accelerator Integrating
Depthwise Separable Convolution
Bing Liu * , Danyin Zou, Lei Feng, Shou Feng, Ping Fu and Junbao Li
School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China;
17S101133@stu.hit.edu.cn (D.Z.); hitfenglei@hit.edu.cn (L.F.); fengshou@hit.edu.cn (S.F.);
fuping@hit.edu.cn (P.F.); lijunbao@hit.edu.cn (J.L.)
* Correspondence: liubing66@hit.edu.cn; Tel.: +86-0451-86413532

Received: 29 December 2018; Accepted: 23 February 2019; Published: 3 March 2019 

Abstract: The Convolutional Neural Network (CNN) has been applied in many fields and has achieved
remarkable results, such as image classification, face detection, and speech recognition. Compared
to the GPU (graphics processing unit) and ASIC, an FPGA (field programmable gate array)-based
CNN accelerator has great advantages due to its low power consumption and reconfigurability.
However, the extremely limited resources of an FPGA and the huge number of parameters and
computations of a CNN pose great challenges to the design. Based on the ZYNQ heterogeneous
platform, and coordinating resource and bandwidth issues with the roofline model, the CNN
accelerator we designed can accelerate both standard convolution and depthwise separable
convolution with high hardware resource utilization. The accelerator can handle network layers of
different scales through parameter configuration, maximizes bandwidth, and achieves a fully pipelined
design by using a data stream interface and a ping-pong on-chip cache. The experimental results show
that the accelerator designed in this paper achieves 17.11 GOPS for 32-bit floating point while also
accelerating depthwise separable convolution, which has obvious advantages compared with
other designs.

Keywords: convolutional neural network (CNN); field programmable gate array (FPGA); depthwise
separable convolution; accelerator

1. Introduction
Inspired by biological vision systems, Convolutional Neural Network (CNN) is a well-known deep
learning algorithm extended from Artificial Neural Network (ANN) that has become one of the research
hotspots in many scientific fields [1,2]. It has achieved great success in image classification [3], object
detection [4], and speech recognition [5]. This technique has also been widely used in the industry,
such as monitoring and surveillance, autonomous robot vision, and smart camera technologies [6–9].
With the development of consumer electronics and the Internet of Things (IoT), embedded devices
have taken on an important role. However, most current image processing devices are still based on
the PC architecture, which is inconvenient in some scenarios; in other cases, embedded devices only
perform image acquisition and display, while the data processing is still done by a PC in the background.
Consumer-grade IoT devices often rely on high-quality Internet connections, which are only available
in some areas and cost more. Therefore, there is a great demand for running CNNs with high
performance directly on embedded devices.
The implementation of high performance relies on the computing platform. Because CNN is
computationally intensive, it is not suitable for general-purpose processors, such as traditional CPUs.
Many researchers have proposed CNN accelerators for implementation in the Field-programmable gate
array (FPGA) [10,11], graphics processing unit (GPU) [3], and application-specific integrated circuit




(ASIC) [12]. These accelerators provide an order of magnitude performance improvement and energy
advantage over general-purpose processors [13]. Although the GPU has superior computational
efficiency for deep learning, it is expensive and consumes a large amount of power, which raises many
problems for large-scale deployment and operation. For the same functional design, the power
consumption of a single GPU is often tens or even hundreds of times that of an FPGA. Compared to
ASICs, FPGAs have a short design cycle and can be reconfigured. In recent years, due to the
reconfigurable, customizable, and energy-efficient features of FPGAs [14], as well as the rapid
development of high-performance devices and more flexible architecture design, more and more
researchers are focusing on FPGA-based CNN hardware accelerator implementations. On the other
hand, many efficient network structures have been proposed which effectively reduce the computational
complexity and parameter quantities of the model. Among them, depthwise separable convolution is
typical and widely used; it has been applied in MobileNet V1 [15] and later in MobileNet V2 [16].
In general, deploying CNNs on FPGA-based hardware platforms through reliable and efficient
hardware acceleration solutions has become a research focus. The literature [7,17,18] implements
complete CNN applications on FPGAs with high performance by exploiting different parallelism
opportunities. Works [7,17] mainly use the parallelism within feature maps and convolution kernels,
while work [18] uses "inter-output" and "intra-output" parallelism. However, these three improve
performance with high bandwidth and dynamic reconfiguration instead of using the on-chip buffer
for data reuse. Reference [19] aims to design an efficient accelerator under limited external storage
bandwidth by maximizing data reuse; however, it does not consider computational performance.
Further, it is necessary to reprogram the FPGA when computing the next layer, which greatly increases
the whole running time. The literature [20] studies the data parallelism of deep learning algorithms
using six FPGAs for cloud acceleration; however, it requires a well-coordinated control program and
a large system. None of the above studies take into account the deployment requirements of
mobile devices, such as storage bandwidth, resource constraints, and flexible portability. Work [21]
presents an FPGA implementation of CNN designed for portability and power efficiency.
The implementation is as efficient as a general-purpose 16-core CPU and is almost 15 times faster than
an SoC GPU for mobile applications. The SqueezeNet DCNN is accelerated using an SoC FPGA in order
for the offered object recognition resource to be employed in a robotic application [22]. In [23], under
the roofline model, considering resources and bandwidth, a CNN accelerator was implemented on
the VC707 FPGA board. The literature [24] proposes many optimization methods and uses the Xilinx
SDAccel tool to accelerate a convolution layer under the OpenCL framework with a performance
improvement of 14.4 times. In [25], the authors present a systematic methodology for maximizing the
throughput of an FPGA-based accelerator; in that work, an entire CNN model is proposed consisting
of all CNN layers: convolution, normalization, pooling, and classification layers. Work [26] proposes an
FPGA accelerator with a scalable architecture of deeply pipelined OpenCL kernels. However, none of
the above works [21–26] implement depthwise separable convolution, so they cannot be applied to
network families such as MobileNet. This paper makes the following major contributions:

1. A configurable system architecture is proposed based on the ZYNQ heterogeneous platform.
   Under this architecture, the optimal design of the accelerator is completed with the roofline
   model, and the accelerator is scalable.
2. Based on the single-computation-engine model, the CNN hardware accelerator we designed
   efficiently integrates standard convolution and depthwise separable convolution.
3. A ping-pong on-chip buffer maximizes the bandwidth, and the CNN accelerator we designed is
   fully pipelined.

The rest of this article is organized as follows: Section 2 introduces the basic principles of CNN
and depthwise separable convolution. Section 3 describes the architecture of this implementation
and elaborates on the design details of the accelerator. Section 4 describes the experimental results
of implementing the accelerator on the ZYNQ platform, completing design verification and analysis.
Section 5 summarizes the content of this article.

2. Background

2.1. Convolutional Neural Network


A typical CNN contains multiple computation layers which are concatenated together. The most
common network layers are the convolutional layer, the pooling layer, and the fully connected layer.
The details are as follows.

2.1.1. Convolution Layer


The convolutional layer is the most important layer in a CNN. It extracts features from the input
image or from the output feature maps of the previous layer. The operation is a two-dimensional
convolution between the input data and a number of different convolution kernels, and a new
two-dimensional output is obtained through the activation function. The calculation formula for a
single two-dimensional convolution is given by Equation (1):

$$O_{xy} = f\left(\sum_{j=0}^{k-1}\sum_{i=0}^{k-1} p_{x+i,\,y+j}\, w_{ij} + b\right), \quad 0 \le x \le W,\ 0 \le y \le H \qquad (1)$$

where $p_{x+i,\,y+j}$ is the pixel value of the input feature map at the point $(x+i, y+j)$, $k$ is the size
of the convolution kernel, $W$ and $H$ are the width and height of the input feature map, $w_{ij}$ is the
corresponding weight in the convolution kernel, $b$ is the bias, $f$ is the activation function (e.g.,
ReLU, Sigmoid, Tanh, etc.), and $O_{xy}$ is the output value of a two-dimensional convolution
with a convolution window of size $k \times k$ centered on the point $(x, y)$.
The calculation of the convolutional layer is composed of many such two-dimensional convolution
operations, as given in Equation (2):

$$X_j^{n} = f\left(\sum_{i \in N} X_i^{n-1} * k_{j,i}^{n} + b_j^{n}\right), \quad j \in M \qquad (2)$$

where $X_j^n$ is the $j$th feature map output by the $n$th convolution layer, $N$ is the number of input
feature map channels, $k_{j,i}^{n}$ is the corresponding convolution kernel, $M$ is the number of
convolution kernels, $b_j^n$ is the bias term, $*$ denotes the convolution operation, and $f$ is the activation function.
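To make the indexing in Equations (1) and (2) concrete, the following is a minimal C++ sketch of a standard convolution layer written as a plain loop nest (stride 1, no padding, ReLU as the activation f). The array shapes and names are illustrative assumptions only and do not reflect the accelerator implementation described later.

```cpp
#include <vector>
#include <algorithm>

// Naive standard convolution implementing Equation (2): for each of the M output
// feature maps, accumulate 2-D convolutions over all N input channels, add the
// bias, and apply ReLU as the activation f. Stride 1, no padding (illustrative only).
void conv_layer(const std::vector<std::vector<std::vector<float>>>& in,                 // [N][H][W]
                const std::vector<std::vector<std::vector<std::vector<float>>>>& w,     // [M][N][K][K]
                const std::vector<float>& bias,                                         // [M]
                std::vector<std::vector<std::vector<float>>>& out)                      // [M][H-K+1][W-K+1]
{
    const int N = in.size(), H = in[0].size(), W = in[0][0].size();
    const int M = w.size(), K = w[0][0].size();
    for (int m = 0; m < M; ++m)
        for (int y = 0; y + K <= H; ++y)
            for (int x = 0; x + K <= W; ++x) {
                float acc = bias[m];
                for (int n = 0; n < N; ++n)          // sum over the N input channels
                    for (int i = 0; i < K; ++i)
                        for (int j = 0; j < K; ++j)
                            acc += in[n][y + i][x + j] * w[m][n][i][j];
                out[m][y][x] = std::max(acc, 0.0f);  // ReLU activation
            }
}
```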

2.1.2. Pool Layer


The pooling layer, also called the down-sampling layer, reduces feature map redundancy and network
computational complexity by reducing the feature map dimensions, and it effectively prevents overfitting.
The formula for the pooling layer is shown in Equation (3):

$$X_j^{n} = f\left(\mathrm{down}\left(X_j^{n-1}\right)\right) \qquad (3)$$

where $X_j^n$ is the $j$th feature map output by the $n$th layer, $\mathrm{down}$ is the pooling method
(average pooling and maximum pooling are commonly used), and $f$ is the activation function.
When the pooling stride is 2, the process of 2 × 2 maximum pooling is shown in Figure 1.
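As a small illustration of Equation (3) with maximum pooling (the case shown in Figure 1), a C++ sketch of 2 × 2 max pooling with stride 2 on a single feature map follows; it is only meant to show the windowing, with f taken as the identity, and the function name is illustrative.

```cpp
#include <vector>
#include <algorithm>

// 2x2 max pooling with stride 2 on a single feature map: each output element is
// the maximum of a 2x2 window, halving both spatial dimensions (Equation (3) with
// down = max and f = identity). For example, max(15, 27, 58, 105) = 105 as in Figure 1.
std::vector<std::vector<float>> maxpool2x2(const std::vector<std::vector<float>>& in)
{
    const int H = in.size(), W = in[0].size();
    std::vector<std::vector<float>> out(H / 2, std::vector<float>(W / 2));
    for (int y = 0; y < H / 2; ++y)
        for (int x = 0; x < W / 2; ++x)
            out[y][x] = std::max(std::max(in[2 * y][2 * x],     in[2 * y][2 * x + 1]),
                                 std::max(in[2 * y + 1][2 * x], in[2 * y + 1][2 * x + 1]));
    return out;
}
```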
Figure 1. Maxpool.
2.1.3. Fully-Connected Layer

The fully-connected layer is generally placed at the end of the convolutional neural network. The
high-level two-dimensional feature maps extracted by the previous convolutional layers are converted
into a one-dimensional feature vector output. In the fully connected layer, each neuron is connected to
all neurons of the previous layer and there is no weight sharing.

2.2. Depthwise Separable Convolution
connected to all neurons of the previous layer and there is no weight sharing.
In recent years, in order to run high-quality CNN models on mobile terminals with strict memory and
computing budgets, many innovative network models have been proposed, such as MobileNet and
ShuffleNet. These models include depthwise separable convolution, which effectively reduces the
number of parameters and calculations of the network with limited loss of precision.

Standard convolution uses convolution kernels with the same number of channels as the input data
and sums the results after channel-by-channel convolution. As Figure 2 shows, depthwise separable
convolution is divided into depthwise convolution and pointwise convolution. The former uses a set of
two-dimensional (single-channel) kernels to perform the convolution on each channel of the input
feature maps individually. The latter is equivalent to a standard convolution with a 1 × 1 kernel size;
in the following text, it is implemented as a standard convolution.

(a) Standard convolution. (b) Depthwise convolution. (c) Pointwise convolution.

Figure 2. Standard convolution and depthwise separable convolution.

Assuming that the size of the input feature map is $F \times F \times N$, the size of the convolution
kernels is $K \times K \times M \times N$, and the stride is 1, the parameter quantity of the standard
convolution layer is:

$$W_{sc} = K \cdot K \cdot M \cdot N \qquad (4)$$

The amount of calculation is:

$$O_{sc} = F \cdot F \cdot K \cdot K \cdot N \cdot M \qquad (5)$$

The parameter quantity of depthwise separable convolution is:

$$W_{sdc} = K \cdot K \cdot N + M \cdot N \qquad (6)$$

The amount of calculation is:

$$O_{sdc} = F \cdot F \cdot K \cdot K \cdot N + F \cdot F \cdot N \cdot M \qquad (7)$$

Thus, the reduction factors on weights and operations are calculated in Equation (8):

$$F_w = \frac{W_{sdc}}{W_{sc}} = \frac{1}{M} + \frac{1}{K^2}, \qquad F_o = \frac{O_{sdc}}{O_{sc}} = \frac{1}{M} + \frac{1}{K^2} \qquad (8)$$

For example, with $K = 3$ and $M = 64$, both factors are about $1/64 + 1/9 \approx 0.13$, i.e., roughly an
8× reduction in weights and operations.
3. Architecture and Accelerator Design

In AI application scenarios, the CPU is highly flexible but not computationally efficient, while a
dedicated accelerator is computationally efficient but not flexible enough. Therefore, the architecture
currently widely used for deep learning usually combines a CPU with an accelerator, called a
heterogeneous system. We choose the Xilinx ZYNQ 7100 heterogeneous chip as the hardware platform
to complete the design of the system architecture and accelerator.

3.1. Design Overview


There are currently two different implementation modes for CNNs due to their hierarchical structure:
one is the streaming architecture and the other is the single computation engine. The former, which
allocates corresponding hardware resources to each network layer, has the following three
characteristics: (1) it can realize inter-layer parallelism and flexibly control the parallelism within
each layer; (2) it is highly customized and inflexible; (3) its demand for resources is high and it only
applies to small networks. The latter means that different network layers share the same accelerator
through resource reuse; it is a non-highly-customized architecture, is more flexible, and is easier
to migrate between platforms. Therefore, considering the limited resources of the hardware platform
and fully developing the parallelism of the single-layer network structure, we design the system
architecture in the single computation engine mode, as shown in Figure 3.

Figure 3. FPGA (field programmable gate array) architecture of system implementation.

The system architecture mainly includes the external memory Double Data Rate (DDR), the processing
system (PS), the on-chip buffers, the accelerator in programmable logic (PL), and the on-chip and off-chip bus
interconnection. The initial image data and weights are pre-stored in the external memory DDR. PS and
PL are interconnected through the AXI4 bus. The accelerator receives configuration signals from
the CPU through the AXI4-Lite bus (e.g., convolution kernel size, stride, whether to perform standard
convolution or depthwise convolution, etc.). Under the control of the DDR controller in the PS, the weights and
input data of the current layer required by the accelerator are read from the DDR and are converted
from the AXI4 memory-map format to the AXI4-Stream format into the on-chip buffers of the
accelerator by the Direct Memory Access (DMA). The buffers of the IP core use the
AXI4-Stream interface and the ping-pong mode. One buffer acts as a producer, receiving data
from the DMA, and the other is used to participate in the current calculation of the IP core, called
the consumer. In the next stage, the producer and the consumer swap. After being processed by the
accelerator, the output is sent back to the DDR through the AXI4 bus, and the above operation is
repeated until the calculation of the entire network model is completed.

It can be seen that, under this architecture, the accelerator has two data exchanges with the external
memory: receiving the weights and input feature map, and sending the output feature map back
off-chip. Frequent data exchange imposes high requirements on the bandwidth of the platform.
Therefore, taking MobileNet + SSD as an example, we use the roofline model to jointly consider the
computing platform resources and storage bandwidth to seek the optimal design for the accelerator.
3.2. Accelerator Design under the Roofline Model

3.2.1. Accelerator overview


Electronics 2019, 8, 281 7 of 18

Electronics 2019, 8, x FOR PEER REVIEW 7 of 19


3.2. Accelerator
Electronics 2019, 8, xDesign under
FOR PEER the Roofline
REVIEW Model 7 of 19
The structure of the accelerator is shown in Figure 4. The on-chip buffer is divided into three
3.2.1.The
Accelerator
structure Overview
parts (1) Input of the accelerator
buffer for storingisinputshown in Figure
feature map,4. (2)
TheWeights
on-chip buffer
buffer for
is divided
storing into three (3)
weights,
parts (1)
Output Input
The structure buffer
buffer for for
ofstoring storing
the accelerator input feature
is shown
intermediate map,
in and
results Figure (2)
the 4. Weights
The on-chip
output buffer for storing
bufferInisorder
feature map. weights,
divided into three
to maximize (3)the
Output
parts (1)buffer
external Input for storing
buffer
storage intermediate
for storing
bandwidth, input
the results
feature
three andAXI_Streaming
map,
all use the
(2) output
Weightsfeature
buffer map. Inand
order
for storing
interfaces thetoping
maximize
weights, (3) the
Output
pong mode.
external
buffer forstorage
storing bandwidth,
intermediate the three
results all
and use
the AXI_Streaming
output feature interfaces
map. In orderand
to
The three processes of inputting input feature map data and weights, calculating convolution, and the ping
maximize pong
the mode.
external
The three
storage
outputting processes
bandwidth, of inputting
the threeresult
the calculation input
all use feature mapflowed.
AXI_Streaming
are completely data and
Theweights,
interfaces and thecalculating
compute ping
engine canconvolution,
pong mode. toand
The three
be selected work
outputting
processes the
of calculation
inputting inputresult are
feature completely
map data flowed.
and The
weights, compute
calculating engine can
convolution,
in standard convolution or depthwise convolution modes under the control of the CPU through the be selected
and to work
outputting
in
thestandard
calculation
AXI_Lite convolution
bus.result are or depthwise
completely convolution
flowed. modesengine
The compute under canthe control of the
be selected to CPU
workthrough
in standardthe
AXI_Lite bus.
convolution or depthwise convolution modes under the control of the CPU through the AXI_Lite bus.

AXI_Streaming
Output_buf set1 Output_buf set2 AXI_Streaming
Output_buf set1 Output_buf set2
… …
… …

Crossbar
Crossbar
+ +
++ + … + + + AXI_Lite

+  … + +  … + … AXI_Lite
… 


Compute  …     … 
Engine
Compute
Engine Crossbar
Crossbar
Weight_buf set1 Weight_buf set2 Input_buf set1 Input_buf set2
Weight_buf set1 Weight_buf set2 Input_buf set1 Input_buf set2
… … … …
… … … …
AXI_Streaming
AXI_Streaming

Figure 4. Structure of FPGA accelerator.


Structure of FPGA accelerator.
Figure 4. Structure
3.2.2.
3.2.2. TheThe roofline
Roofline model
Model of of ZYNQ
ZYNQ 7100
7100
3.2.2. The roofline model of ZYNQ 7100
In order to solve the performance prediction problem of a deep learning model on a specific
hardware platform, the roofline model proposed in [27] provides a method of quantitative analysis using
operational intensity, which calculates how fast the floating-point calculation speed can be achieved
under the limitations of external storage bandwidth and computing resources on a hardware platform.
This is shown in Figure 5.

Figure 5. Roofline model.
Equation (9) formulates the attainable throughput of an application on a specific hardware
platform. Giga floating-point operations per second (GFLOPS) is a measure of computing
performance. The roofline is divided into two regions: compute bound and memory bound. The
computational performance achievable on the platform by the network model cannot exceed the
minimum of the two regional bottlenecks. In the compute-bound case, the bottleneck is the
computational roof (i.e., all the floating-point operations that the computing platform can complete
per second). In the memory-bound case, the bottleneck is the computation-to-communication (CTC)
ratio (i.e., operations per byte of DRAM traffic) multiplied by the I/O memory bandwidth (BW):

$$\text{Attainable performance} = \min\{\text{Computational roof},\ \text{CTC} \times \text{BW}\} \qquad (9)$$

In our work, we calculate the computational roof and the I/O memory maximum bandwidth roof
of the Xilinx ZYNQ 7100 computing platform according to Equation (10):

$$\text{Computational roof} = 2 \times \frac{N_{DSP}}{5} \times f = 80.8\ \text{GFLOPS}, \qquad
\text{Bandwidth roof} = \frac{64}{8} \times N_{HP} \times f = 3.2\ \text{GByte/s} \qquad (10)$$
where $N_{DSP}$ is the number of DSPs of the hardware platform, divided by 5 because five DSPs are
required to complete a 32-bit floating-point multiply-add operation, $N_{HP}$ is the number of High
Performance (HP) ports, and $f$ is the system clock frequency (assumed to be 100 MHz). For the
ZYNQ 7100, with 2020 DSP slices and four HP ports, this gives 80.8 GFLOPS and 3.2 GByte/s.
The constructed roofline model is shown in Figure 6.

Figure 6. The roofline model of ZYNQ 7100.

3.2.3. Data Partition and Exploring the Design Space


Since the on-chip cache resources are extremely limited, it is usually not possible to cache all input
feature maps and weights on the chip, so the data must be partitioned. As shown in Figure 7, since the
convolution kernel size K itself is small, it is not divided in this dimension. R, C, M, N, and Ni are,
respectively, the width and height of the output feature map, the number of convolution kernels
(which is also the number of channels of the output feature map), the number of channels of each
convolution kernel, and the number of channels of the input feature map. Tr, Tc, Tm, and Tn are the
block factors of the width and height of the output feature map and of the number and channels of the
convolution kernels, respectively. Tri, Tci, and Tni are the block factors of the width, height, and
channels of the input feature map, respectively. The above block factors take both standard convolution
and depthwise convolution into account.

Figure 7. Data block diagram.
We use the example in Figure 8 to illustrate how block convolution works. In this example, the input
tensor consists of three separate channels of size 8 × 8 with additional zero padding, the kernel size is
3 × 3, and each input feature map is divided into four independent tiles. Since inter-tile dependencies
are eliminated in block convolution, it is not possible to obtain an output tile of size 4 × 4 directly from
the three input tiles at the corresponding position. As shown in Figure 8a, when the stride is 1, an input
tile of size 6 × 6 is required to get an output tile of size 4 × 4. In Figure 8b, an input tile of size 5 × 5 is
required to get an output tile of size 2 × 2 when the stride is 2. In block convolution, the relationship
between $T_r$ and $T_{ri}$ can be determined as in Equation (11):

$$T_{ri} = S \times T_r + K - S \qquad (11)$$

Figure 8. An example of block convolution: (a) the stride of the convolution is one; (b) the stride of the
convolution is two.

Block convolution affects the external memory access of the model, which in turn affects the CTC ratio.
Equation (12) establishes the mathematical connection between the block factors and the CTC ratio:
$$\mathrm{CTC\ Ratio} = \frac{\text{number of operations}}{\text{number of external access bytes}}
= \frac{2 \times R \times C \times M \times N \times K \times K}{4 \times (\alpha_{in} B_{in} + \alpha_{w} B_{w} + \alpha_{out} B_{out})} \qquad (12)$$

where

$$B_{in} = T_{ni}\, T_{ri}\, T_{ci} = T_{ni}\,(S T_r + K - S)(S T_c + K - S), \qquad
B_{w} = T_m T_n K^2, \qquad B_{out} = T_m T_r T_c$$

$$\alpha_{in} = \alpha_{w} = \frac{R}{T_r} \cdot \frac{C}{T_c} \cdot \frac{M}{T_m} \cdot \frac{N}{T_n}, \qquad
\alpha_{out} = \frac{R}{T_r} \cdot \frac{C}{T_c} \cdot \frac{M}{T_m}$$
In particular, for standard convolution, $N = N_i$ and $T_{ni} = T_n$; however, for depthwise convolution,
$N = 1$, $M = N_i$, and $T_{ni} = T_m$.

The hardware acceleration effect for a CNN depends largely on the degree to which the parallelism of
the algorithm is exploited. A CNN is a feedforward multi-layer network, and its interlayer structure,
intra-layer operations, and data stream drive all have certain similarities. Therefore, the CNN topology
itself offers many kinds of parallelism. These mainly include: (1) multi-channel operation of the input
feature map and convolution kernel; (2) the same convolution window and different convolution
kernels can simultaneously perform convolution operations; (3) multiple convolution windows and
the same convolution kernel can simultaneously perform convolution operations; (4) in a convolution
window, the parameters corresponding to all convolution kernels of all neuron nodes can be operated
on simultaneously. The above four kinds of parallelism correspond to the dimensions $T_n$, $T_m$, $T_r$,
and $K$, respectively. Exploiting computational parallelism not only places certain requirements on
computing resources, but also requires an on-chip cache structure to provide the data needed for
parallel computing; this, in turn, increases the required on-chip cache bandwidth. The Vivado HLS
development tool makes it very easy to partition an array in a particular dimension. However, if the
parallelism of (3) and (4) is developed, the cache structure of the data is as shown in Figure 9.

Figure 9. Array partitioning in Tr and K dimensions.
As can be seen from the figure, if the data in a convolution window is to be distributed across K × K
buffers, the data is not continuous in the Tr dimension, which is difficult to implement with array
partitioning in Vivado HLS. Moreover, it would repeatedly store the overlapping data between
convolution windows, which greatly increases the consumption of on-chip cache resources. Therefore,
we develop the parallelism of the calculations in the dimensions Tm and Tn.

The calculation engine diagram is shown in Figure 10. When calculating a standard convolution,
the Tn channels of the input feature map are simultaneously multiplied by the weights of the
corresponding channels, and the intermediate results are then continuously accumulated, which
greatly reduces the delay through pipelining. At the same time, the same operation is performed for a
group of Tm different convolution kernels. When dealing with depthwise convolution, the channels of
the convolution kernel are padded with zeros up to Tm in order to efficiently integrate the two kinds of
convolution without destroying the computational parallelism, as shown in Figure 10b.

Tm
Tn

.


Tn

.

Tn

.


Tr


Tc
…….

Buf_w_Tm Tn
Buf_w_Tm 1

Buf_w_Tm 2
Buf_w_1 Tn
Buf_px Tn
Buf_px 1

Buf_px 2

Buf_w_1 1

Buf_w_1 2
……. ……. ……. …….

  …    … 
+ +


…….
+ +
Tm

(a) Standard convolution.

Tm
Tm
.

0000 0000

Tn
.

……

Tn ……
.

Tr

0000 0000
0000 0000
…… ……
Tc 0000 0000
…….
Buf_w_Tm 1
Buf_px Tm
Buf_px 1

Buf_w_1 1

……. …….
0 0 0 0

  …    … 
+ +

…….
+ Tm
+

(b) Depthwise separable convolution.

Figure
Figure 10.10. Computeengine
Compute engine diagram.
diagram.

It can be seen from the above analysis that, under the calculation engine we designed, the physical
computation roof can be calculated by Equation (13) for given block factors $T_r$, $T_c$, $T_m$, and $T_n$:

$$\text{physical computation roof} = \frac{\text{total number of operations} \times \text{system clock frequency}}{\text{number of execution cycles}}
= \frac{(2 \times R \times C \times M \times N \times K \times K) \times 0.1}{\lceil M/T_m \rceil \times \lceil N/T_n \rceil \times \lceil R/T_r \rceil \times \lceil C/T_c \rceil \times (T_r \times T_c \times K \times K + P)}
\approx \frac{(2 \times M \times N) \times 0.1}{\lceil M/T_m \rceil \times \lceil N/T_n \rceil} \qquad (13)$$

where $P = \text{pipeline depth} - 1$.

Due to array partitioning and the ping-pong buffers, the consumption of on-chip cache resources is
increased. To associate the on-chip buffers with the block factors, we need to satisfy Equation (14):

$$\text{Weight Buffer} + \text{Input Buffer} + \text{Output Buffer}
= T_m \times T_n + \frac{T_n \times T_{ri} \times T_{ci}}{512} + \frac{T_m \times T_r \times T_c}{512} < \text{number of BRAMs} \qquad (14)$$

Combining the above-mentioned CTC ratio, the physical computation roof, and the analyzed on-chip
cache relationship with the roofline model of the ZYNQ 7100 under the block factors $T_r$, $T_c$, $T_m$,
and $T_n$, we seek the best design and find the optimal block factors, as shown at point A of Figure 11,
under the constraints shown in Equation (15):

$$\max\ \text{physical computation roof} \approx \frac{(2 \times M \times N) \times 0.1}{\lceil M/T_m \rceil \times \lceil N/T_n \rceil}
\quad \text{s.t.} \quad
\begin{cases}
\text{physical computation roof} \le \min\{\text{Computational roof},\ \text{CTC} \times \text{BW}\} \\
T_m \times T_n + \dfrac{T_n \times T_{ri} \times T_{ci}}{512} + \dfrac{T_m \times T_r \times T_c}{512} < \text{number of BRAMs} \\
0 \le T_r \le R, \quad 0 \le T_c \le C, \quad 0 \le T_n \le N, \quad 0 \le T_m \le M
\end{cases} \qquad (15)$$

Figure 11. Design exploration.
According to the above ideas, the optimal block factors Tr_opt , Tc_opt , Tm_opt , and Tn_opt of
the current network layer can be obtained by programming with Matlab. However, the optimal
block coefficients obtained from the different layers are different. In particular, Tm and Tn affect the
computational parallelism. If Tm and Tn are allowed to be variable, complex hardware architectures
need to be designed to support reconfiguration of computational engines and interconnects. So, we will
solve the global optimal Tm and Tn under the whole network model, as shown in Formula (16).

$$\min\ \text{global number of execution cycles}
= \min \sum_{i=1}^{N} \frac{\text{total number of operations} \times \text{system clock frequency}}{\text{physical computation roof of the } i\text{th layer}}$$
$$\text{s.t.} \quad T_{m\_\min} \le T_m \le T_{m\_\max}, \qquad T_{n\_\min} \le T_n \le T_{n\_\max} \qquad (16)$$

where N is the number of network layers. Tm_min and Tm_max are the minimum and maximum values
of Tm_opt sought by all network layers. Tn_min and Tn_max are the minimum and maximum values of
Tn_opt sought by all network layers.
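A minimal C++ sketch of how such a global search could be organized is given below. It assumes a simple brute-force enumeration of (Tm, Tn) with a per-layer evaluation of execution cycles following the form of Equation (13); the layer list, fixed tile sizes, resource limit, and helper function are illustrative assumptions, not the authors' Matlab code (which also accounts for the BRAM and bandwidth constraints of Equations (14) and (15)).

```cpp
#include <cmath>
#include <cstdio>
#include <vector>
#include <limits>

// Simplified per-layer description (illustrative).
struct Layer { int R, C, M, N, K; };

// Execution cycles of one layer for block factors (Tm, Tn), following Equation (13):
// ceil(M/Tm) * ceil(N/Tn) * ceil(R/Tr) * ceil(C/Tc) * (Tr*Tc*K*K + P).
// Tr, Tc and the pipeline depth P are fixed here for brevity (assumed values).
static double layer_cycles(const Layer& l, int Tm, int Tn, int Tr, int Tc, int P) {
    double tiles = std::ceil(1.0 * l.M / Tm) * std::ceil(1.0 * l.N / Tn)
                 * std::ceil(1.0 * l.R / Tr) * std::ceil(1.0 * l.C / Tc);
    return tiles * (1.0 * Tr * Tc * l.K * l.K + P);
}

int main() {
    // Toy network: the first two layers of MobileNet + SSD as listed in Table 3.
    std::vector<Layer> net = { {150, 150, 32, 3, 3}, {150, 150, 32, 1, 3} };
    const int Tr = 30, Tc = 30, P = 10;   // assumed fixed tile size / pipeline depth
    const int mac_budget = 64 * 6;        // assumed parallel multiplier budget (Tm*Tn limit)

    double best = std::numeric_limits<double>::max();
    int best_tm = 0, best_tn = 0;
    for (int Tm = 1; Tm <= 64; ++Tm)
        for (int Tn = 1; Tn <= 16; ++Tn) {
            if (Tm * Tn > mac_budget) continue;      // crude resource constraint
            double total = 0;
            for (const Layer& l : net) total += layer_cycles(l, Tm, Tn, Tr, Tc, P);
            if (total < best) { best = total; best_tm = Tm; best_tn = Tn; }
        }
    std::printf("best (Tm, Tn) = (%d, %d), total cycles = %.0f\n", best_tm, best_tn, best);
}
```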
The final global optimal solution is obtained:

$$[T_m \quad T_n] = [64 \quad 6], \qquad \text{global number of execution cycles} = 5602706 \qquad (17)$$

Since Tn and Tm have been determined, the configurable parameters of the accelerator are shown
in Table 1.

Table 1. Configurable parameters.

Parameter Description
width The width of the input feature map
height The height of the input feature map
channels_in Number of channels of convolution kernels
channels_out Number of channels of the output feature map
Tr Block factor of the width of the output feature map
Tc Block factor of the height of the output feature map
Tri Block factor of the width of the input feature map
Tci Block factor of the height of the input feature map
Tni Block factor of the channels of the input feature map
kernel The size of convolution kernels
stride The stride of convolution
pad Whether to pad or not
depthwise Whether it is a depthwise convolution or not
relu Whether to relu or not
split Whether it is a split layer (detection layer) or not
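As an illustration of how these parameters might be grouped on the software side before being written to the accelerator over the AXI4-Lite interface, a hedged C++ sketch follows; the struct and field names are assumptions for illustration, not the actual register map.

```cpp
// Hypothetical per-layer configuration mirroring Table 1; in a real design these values
// would be written to the accelerator's AXI4-Lite control registers before starting a layer.
struct AcceleratorConfig {
    int  width;         // width of the input feature map
    int  height;        // height of the input feature map
    int  channels_in;   // number of channels of the convolution kernels
    int  channels_out;  // number of channels of the output feature map
    int  Tr, Tc;        // block factors of the output feature map (width, height)
    int  Tri, Tci, Tni; // block factors of the input feature map (width, height, channels)
    int  kernel;        // convolution kernel size K
    int  stride;        // convolution stride S
    bool pad;           // whether to zero-pad
    bool depthwise;     // depthwise convolution (true) or standard convolution (false)
    bool relu;          // whether to apply ReLU
    bool split;         // whether this is a split (detection) layer
};
```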

4. Experimental Evaluation and Results


The accelerator is implemented with Vivado HLS (v2016.4), which allows the accelerator to be
described in C++ and converted to RTL as a Vivado IP core, greatly shortening the development cycle.
The parallelism described previously is achieved by adding Vivado HLS pragmas to the C++ design,
such as pipeline, array partition, and dataflow. After that, the IP core is imported into the Vivado
(v2016.4) project to complete the synthesis and verification on the FPGA.
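To illustrate the kind of HLS code and pragmas referred to here, a simplified sketch of the Tn × Tm multiply-accumulate core with pipeline and array-partition directives is shown below. It is a sketch under assumed buffer shapes, not the authors' source; Tm = 64 and Tn = 6 follow Equation (17).

```cpp
const int Tm = 64, Tn = 6;  // global block factors from Equation (17)

// One inner step of the compute engine: Tn input values are multiplied by the weights
// of the corresponding channels for Tm kernels in parallel, and the partial sums are
// accumulated. Unrolling plus II=1 pipelining yields Tm*Tn parallel multiplications.
void compute_engine_step(float in_px[Tn], float weight[Tm][Tn], float acc[Tm]) {
#pragma HLS ARRAY_PARTITION variable=in_px  complete dim=1
#pragma HLS ARRAY_PARTITION variable=weight complete dim=0
#pragma HLS ARRAY_PARTITION variable=acc    complete dim=1
#pragma HLS PIPELINE II=1
    for (int m = 0; m < Tm; ++m) {
#pragma HLS UNROLL
        float sum = 0;
        for (int n = 0; n < Tn; ++n) {
#pragma HLS UNROLL
            sum += in_px[n] * weight[m][n];   // Tn multiplications per kernel
        }
        acc[m] += sum;                        // accumulate the intermediate result
    }
}
```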

4.1. Resource Utilization


The hardware resources consumed by the accelerator are shown in Table 2.

Table 2. Resource utilization of the accelerator.

Name          Bram_18k   DSP     FF        LUT       Power (pipelined)   Power (unpipelined)
Total 708 1926 187,146 142,291 4.083 W 3.993 W
Available 1510 2020 554,800 277,400 - -
Utilization (%) 46 95 38 51 - -

It can be seen from the table that the implemented accelerator achieves high hardware resource
utilization, which also verifies the analysis results of the previous design exploration and
computational parallelism.

4.2. Comparisons of Pipelined and Non-Pipelined Implementations


MobileNet + SSD has a total of 47 network layers, containing both standard convolution and
depthwise separable convolution. We use the first network layer, which is a standard convolution,
and the second network layer, which is a depthwise convolution, as examples to compare the running
time of the fully pipelined and non-pipelined versions in combination with the layer parameters.
Table 3 lists the parameters of the two network layers, and more detailed comparisons are shown in
Tables 4 and 5, respectively.

Table 3. Parameter of the two network layers.

Name First Layer First Layer (Block) Second Layer Second Layer (Block)
Output_fm row (R) 150 150 150 150
Output_fm col (C) 150 150 150 150
Output_fm channel (M) 32 64 32 64
Input_fm channel (Ni) 3 6 32 64
Kernel channel (N) 3 6 1 6
Kernel (K) 3 3 3 3
Stride (S) 2 2 1 1

Table 4. Comparisons of the first layer.

Name No Pipelined Full Pipelined


Latency (clock cycles) 1,921,492 934,612
Clock frequency (MHz) 100 100
Time (ms) 19.21 9.35
Calculated amount (GFLOPs) 0.16 0.16
Performance (GFLOPS) 8.32 17.11
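As a quick check of the numbers in Table 4, the time and throughput follow directly from the latency, the clock frequency, and the calculated amount:

$$t = \frac{934{,}612\ \text{cycles}}{100\ \text{MHz}} \approx 9.35\ \text{ms}, \qquad
\frac{0.16\ \text{GFLOP}}{9.35\ \text{ms}} \approx 17.1\ \text{GFLOPS}, \qquad
\frac{0.16\ \text{GFLOP}}{19.21\ \text{ms}} \approx 8.3\ \text{GFLOPS}$$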

Table 5. Comparisons of the second layer.

Name No Pipelined Full Pipelined


Latency (clock cycles) 2,816,722 1,904,020
Clock frequency (MHz) 100 100
Time (ms) 28.2 19
Calculated amount (GFLOPs) 0.16 0.16
Performance (GFLOPS) 5.67 8.42

The parameters and structure of each network layer are different, so the calculation throughput
also differs. The performance bottleneck of MobileNet + SSD is the bandwidth roof rather than
the computation roof. Therefore, the latency of the fully pipelined design achieved by the stream data
and the ping-pong technique is greatly reduced. Combined with the resource report, the pipelined
version has higher energy efficiency than the non-pipelined version.

4.3. Comparisons with CPU Implementation


The CPU version is completed by the Cortex-A9 core of the ZYNQ 7100. The compiler is
"arm_xilinx_eabigcc" in the Xilinx Software Development Kit, and the software is compiled with the
-O3 optimization flag for comparison with the accelerator. The calculation amount of each layer is
shown in Figure 12. The running time results of using the CPU and the accelerator to complete the
MobileNet + SSD network are shown in Figures 13 and 14, respectively.
The running time of each layer in Figures 13 and 14 includes the time for fetching the data from
the external memory, computing, and sending the results back to the external memory for each layer.
Combining Figures 13 and 14, the time consumption of the accelerator is considerably smaller: it is
more than 150 times faster than the software process.
Figure 12. The calculation amount of MobileNet + SSD.

Figure 13. CPU running time results.

Figure 14. Accelerator running time results.
results.
4.4. Comparisons with Others
We compare our implementation of the accelerator with other FPGA-based accelerators, as shown in
Table 6.

Table 6. Comparison to previous implementations.

                       [7]              [17]             [18]             [19]             Ours
Precision              16 bits fixed    fixed point      48 bits fixed    fixed point      32 bits float
Frequency (MHz)        115              125              200              150              100
FPGA chip              Virtex5 LX330T   Virtex5 SX240T   Virtex5 SX240T   Virtex6 SX240T   ZYNQ 7100
Performance (GMACS)    3.37             3.50             8.00             8.50             8.55
Performance (GOPS)     6.74             7.00             16.00            17.00            17.11

Since one Multiply-Accumulate (MACC) contains two operations, we convert the performance
indicators into GOPS (Giga operations per second) and compare them. The other accelerator
implementations do not include depthwise separable convolution, so the performance of our
standard convolution is involved in the comparison. If using a fixed-point calculation engine, our method can
Electronics 2019, 8, 281 16 of 18

better perform because the fixed-point processing unit uses fewer resources. It can be seen that the
accelerator we have achieved has certain performance advantages.
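As a quick illustration of that conversion (and nothing more), the short program below reproduces the GOPS column of Table 6 from its GMACS column using GOPS = 2 × GMACS; the only inputs are the numbers already listed in the table, and the small rounding difference for our design (17.10 versus the reported 17.11 GOPS) simply reflects the fact that the GMACS entries are themselves rounded.

```c
/* Reproduces the GOPS column of Table 6 from the GMACS column, using the
 * convention that one multiply-accumulate (MACC) counts as two operations. */
#include <stdio.h>

int main(void)
{
    const char  *design[] = { "[7]", "[17]", "[18]", "[19]", "Ours" };
    const double gmacs[]  = { 3.37, 3.50, 8.00, 8.50, 8.55 };

    for (int i = 0; i < 5; i++)
        printf("%-5s %5.2f GMACS -> %5.2f GOPS\n",
               design[i], gmacs[i], 2.0 * gmacs[i]);
    return 0;
}
```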
5. Conclusions
In this article, we implemented a CNN accelerator on the Xilinx ZYNQ 7100 hardware platform that accelerates both standard convolution and depthwise separable convolution. Thanks to the heterogeneous architecture of ZYNQ, the accelerator, which is based on a single computing engine, can accelerate network layers of different scales under the configurable architecture we designed. Taking the MobileNet + SSD network as an example, the globally optimal computational parallelism parameters of the entire network were determined under the roofline model of the ZYNQ 7100. In order to maximize bandwidth and reduce the delay caused by on-chip/off-chip data exchange, the three on-chip stream buffers use the data stream interface and operate in ping-pong mode. Whether processing standard convolution or depthwise separable convolution, these techniques keep the accelerator fully pipelined, with a much lower delay than the non-pipelined case. In the end, the accelerator achieved a computing performance of 17.11 GFLOPS at a clock frequency of 100 MHz with high resource utilization, which is superior to previous designs. Our current system clock frequency is only 100 MHz, which is lower than that of other designs; if the system clock can be increased, the performance of the accelerator will improve significantly.
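To make the ping-pong idea concrete, a purely conceptual C sketch is shown below; it is not the accelerator's HLS code, and load_tile(), compute_tile(), the tile size, and the tile count are all illustrative placeholders. In plain C the load and compute calls run back to back, whereas in the hardware implementation (for example under a Vivado HLS dataflow directive) the prefetch of the next tile overlaps the computation on the current one, which is what hides the off-chip transfer latency.

```c
/* Conceptual ping-pong (double) buffering: while one on-chip buffer is being
 * consumed by the compute engine, the next tile is loaded into the other.
 * All names and sizes here are illustrative, not taken from the paper. */
#include <stdio.h>

#define TILE_WORDS 1024
#define NUM_TILES  8

static float ext_mem[NUM_TILES][TILE_WORDS];   /* stands in for external DDR */

/* Placeholder for a DMA burst from external memory into on-chip RAM. */
static void load_tile(float *dst, int tile_idx)
{
    for (int i = 0; i < TILE_WORDS; i++)
        dst[i] = ext_mem[tile_idx][i];
}

/* Placeholder for the processing-element array consuming one tile. */
static void compute_tile(const float *src, int tile_idx)
{
    float acc = 0.0f;
    for (int i = 0; i < TILE_WORDS; i++)
        acc += src[i];
    printf("tile %d processed, checksum %f\n", tile_idx, acc);
}

int main(void)
{
    static float buf[2][TILE_WORDS];   /* the two on-chip (ping-pong) buffers */

    load_tile(buf[0], 0);              /* prime the first buffer */
    for (int t = 0; t < NUM_TILES; t++) {
        int cur = t & 1;               /* buffer holding tile t */
        int nxt = cur ^ 1;             /* buffer that will hold tile t + 1 */
        if (t + 1 < NUM_TILES)
            load_tile(buf[nxt], t + 1);/* prefetch next tile (overlaps in HW) */
        compute_tile(buf[cur], t);     /* consume the current tile */
    }
    return 0;
}
```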
Author Contributions: Funding acquisition, J.L.; Investigation, D.Z.; Methodology, B.L.; Resources, P.F.; Experiment, D.Z. and S.F.; Formal analysis, L.F.; Writing-original draft, D.Z.; Writing-review and editing, B.L. and S.F.
Funding: This research was funded by the National Natural Science Foundation of China under Grant No. 61671170 and by the Open Projects Program of the National Laboratory of Pattern Recognition under Grant No. 201700019.
Acknowledgments: The authors would like to thank the Editor and the anonymous reviewers for their valuable
comments and suggestions.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
CNN Convolutional Neural Network
FPGA Field Programmable Gate Array
IoT Internet of Things
GPU Graphics Processing Unit
ASIC Application-Specific Integrated Circuit
OpenCL Open Computing Language
DDR Double Data Rate
AXI Advanced eXtensible Interface
DRAM Dynamic Random Access Memory
PS Processing System
PL Programmable Logic
DMA Direct Memory Access
HP High Performance
GFLOPS Giga FLoating-point Operations Per Second
CTC Computing To Communication
BW BandWidth
DSP Digital Signal Processing
FF Flip Flop
LUT Look-Up-Table
MACC Multiply-Accumulate
RTL Register Transfer Level
References
1. Sivaramakrishnan, R.; Sema, C.; Incheol, K.; George, T.; Sameer, A. Visualization and Interpretation
of Convolutional Neural Network Predictions in Detecting Pneumonia in Pediatric Chest Radiographs.
Appl. Sci. 2018, 8, 1715. [CrossRef]
2. Yinghua, L.; Bin, S.; Xu, K.; Xiaojiang, D.; Mohsen, G. Vehicle-Type Detection Based on Compressed Sensing
and Deep Learning in Vehicular Networks. Sensors 2018, 18, 4500. [CrossRef]
3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks.
Commun. ACM 2017, 60, 84–90. [CrossRef]
4. Ren, S.; He, K.M.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
5. Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Penn, G. Applying convolutional neural networks concepts to
hybrid NN-HMM model for speech recognition. In Proceedings of the 2012 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4277–4280.
6. Farabet, C.; Poulet, C.; Han, J.Y.; Le, C.Y. CNP: An FPGA-based processor for convolutional networks.
In Proceedings of the International Conference on Field Programmable Logic and Applications, Prague,
Czech Republic, 31 August–2 September 2009; pp. 32–37.
7. Sankaradas, M.; Jakkula, V.; Cadambi, S.; Chakradhar, S.; Durdanovic, I.; Cosatto, E.; Graf, H.P. A massively
parallel coprocessor for convolutional neural networks. In Proceedings of the IEEE International Conference
on Application-specific Systems, Architectures and Processors, New York, NY, USA, 6–7 July 2009; pp. 53–60.
8. Hadsell, R.; Sermanet, P.; Ben, J.; Erkan, A.; Scoffier, M.; Kavukcuoglu, K.; Muller, U.; Le, C.Y. Learning
long-range vision for autonomous off-road driving. J. Field Robot. 2009, 26, 120–144. [CrossRef]
9. Maria, J.; Amaro, J.; Falcao, G.; Alexandre, L.A. Stacked autoencoders using low-power accelerated
architectures for object recognition in autonomous systems. Neural Process Lett. 2016, 43, 445–458. [CrossRef]
10. Wei, Z.; Zuchen, J.; Xiaosong, W.; Hai, W. An FPGA Implementation of a Convolutional Auto-Encoder.
Appl. Sci. 2018, 8, 504. [CrossRef]
11. Zhiling, T.; Siming, L.; Lijuan, Y. Implementation of Deep Learning-Based Automatic Modulation Classifier on FPGA SDR Platform. Electronics 2018, 7, 122. [CrossRef]
12. Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M.A.; Dally, W.J. EIE: Efficient inference engine
on compressed deep neural network. In Proceedings of the 2016 International Symposium on Computer
Architecture, Seoul, Korea, 18–22 June 2016; pp. 243–254.
13. Chen, T.S.; Du, Z.D.; Sun, N.H.; Wang, J.; Wu, C.Y.; Chen, Y.J.; Temam, O. DianNao: A small-footprint
high-throughput accelerator for ubiquitous machine-learning. ACM SIGPLAN Notices 2014, 49, 269–284.
14. Song, L.; Wang, Y.; Han, Y.H.; Zhao, X.; Liu, B.S.; Li, X.W. C-brain: A deep learning accelerator that tames the
diversity of CNNs through adaptive data-level parallelization. In Proceedings of the 53rd Annual Design
Automation Conference, Austin, TX, USA, 5–9 June 2016.
15. Andrew, G.H.; Menglong, Z.; Bo, C.; Dmitry, K.; Weijun, W.; Tobias, W.; Marco, A.; Hartwig, A. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017; arXiv:1704.04861.
16. Mark, S.; Andrew, G.H.; Menglong, Z.; Andrey, Z.; Liang-Chieh, C. MobileNetV2: Inverted residuals and linear bottlenecks. arXiv 2018; arXiv:1801.04381v3.
17. Cadambi, S.; Majumdar, A.; Becchi, M.; Chakradhar, S.; Graf, H.P. A programmable parallel accelerator for
learning and classification. In Proceedings of the 19th international conference on Parallel architectures and
compilation techniques, Vienna, Austria, 11–15 September 2010; pp. 273–284.
18. Chakradhar, S.; Sankaradas, M.; Jakkula, V.; Cadambi, S. A dynamically configurable coprocessor for
convolutional neural networks. In Proceedings of the 37th International Symposium on Computer Architecture, Saint-Malo, France, 19–23 June 2010; pp. 247–257.
19. Peemen, M.; Setio, A.A.; Mesman, B.; Corporaal, H. Memory-centric accelerator design for convolutional
neural networks. In Proceedings of the 2013 IEEE 31st International Conference on Computer Design (ICCD), Asheville, NC, USA,
6–9 October 2013; pp. 13–19.
20. Alhamali, A.; Salha, N.; Morcel, R. FPGA-Accelerated Hadoop Cluster for Deep Learning Computations.
In Proceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW), Atlantic
City, NJ, USA, 14–17 November 2015; pp. 565–574.
21. Bettoni, M.; Urgese, G.; Kobayashi, Y.; Macii, E.; Acquaviva, A. A Convolutional Neural Network Fully
Implemented on FPGA for Embedded Platforms. In Proceedings of the 2017 New Generation of CAS
(NGCAS), Genoa, Italy, 6–9 September 2017.
22. Mousouliotis, P.G.; Panayiotou, K.L.; Tsardoulias, E.G.; Petrou, L.P.; Symeonidis, A.L. Expanding a robot’s life:
Low power object recognition via FPGA-based DCNN deployment. In Proceedings of the 2018 7th International
Conference on Modern Circuits and Systems Technologies (MOCAST), Thessaloniki, Greece, 7–9 May 2018.
23. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015.
24. Wang, Z.R.; Qiao, F.; Liu, Z.; Shan, Y.X.; Zhou, X.Y.; Luo, L.; Yang, H.Z. Optimizing convolutional neural
network on FPGA under heterogeneous computing framework with OpenCL. In Proceedings of the IEEE
Region 10 Conference (TENCON), Singapore, 22–25 November 2016; pp. 3433–3438.
25. Naveen, S.; Vikas, C.; Ganesh, D.; Abinash, M.; Yufei, M. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016;
pp. 16–25.
26. Xu, K.; Wang, X.; Fu, S.; Wang, D. A Scalable FPGA Accelerator for Convolutional Neural Networks.
Commun. Comput. Inf. Sci. 2018, 908, 3–14.
27. Williams, S.; Waterman, A.; Patterson, D. Roofline: An insightful visual performance model for floating-point
and multicore architectures. Commun. ACM 2009, 52, 65–76. [CrossRef]
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).