Você está na página 1de 14


A Real-Time Convolutional Neural Network for Super-Resolution on FPGA with Applications to 4K UHD 60 fps Video Services

Yongwoo Kim, Jae-Seok Choi, and Munchurl Kim, Senior Member, IEEE

Abstract— In this paper, we present a novel hardware-friendly super-resolution (SR) method based on convolutional neural networks (CNN) and its dedicated hardware (HW) on a Field Programmable Gate Array (FPGA). Although CNN-based SR methods have shown very promising results for SR, their computational complexities are prohibitive for hardware implementation. To the best of our knowledge, we are the first to implement a real-time CNN-based SR HW that upscales 2K full high-definition (FHD) video to 4K ultra high-definition (UHD) video at 60 frames per second (fps). In our dedicated CNN-based SR HW, low-resolution (LR) input frames are processed line-by-line, and the number of convolutional filter parameters is reduced significantly by incorporating depth-wise separable convolutions with a residual connection. Our CNN-based SR HW incorporates a cascade of 1D convolutions having large receptive fields along horizontal lines while keeping vertical receptive fields minimal, which allows saving the required line memory space while achieving SR performance comparable to that of full 2D convolution operations. For efficient HW implementation, we use a simple and effective quantization method with little peak signal-to-noise ratio (PSNR) degradation. Also, we propose a compression method to efficiently store intermediate feature map data and reduce the number of line memories used in the HW. Our HW implementation on the FPGA generates 4K UHD frames of higher PSNR values at 60 fps and shows better visual quality, compared to conventional CNN-based SR methods that are trained and tested in software.

Index Terms—Super-resolution, 4K UHD, deep learning, CNN, real-time, FPGA.

This work was supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00419, Intelligent High Realistic Visual Processing for Smart Broadcasting Media). Yongwoo Kim, Jae-Seok Choi and Munchurl Kim* are with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea. E-mail: {yongwoo.kim, jschoi14, mkimee}@kaist.ac.kr, *: corresponding author; Tel: +82-42-350-7419, Fax: +82-42-350-7619.

I. INTRODUCTION

Ultra high-definition (UHD) video is becoming prevalent in UHD TV and IPTV services, and in smartphone applications. While many high-end TVs and smartphones support 4K UHD video, there are still many video streams with full high-definition (FHD) resolution (1,920×1,080) due to legacy acquisition devices and services. Therefore, a delicate up-scaling technique, which is able to convert low-resolution (LR) contents into high-resolution (HR) ones, is essential, especially when it comes to video up-scaling for 2K FHD to 4K UHD conversions [1-2].

Up-scaling methods are classified into two types. Single-image up-scaling methods [8-18], [19-24] exploit local spatial correlations within one LR image to recover lost high-frequency details. On the other hand, video up-scaling methods [3-4] employ an additional data dimension (time) to improve performance, but with higher computational costs. In this paper, we consider a single-image up-scaling method for lower-complexity hardware. Single-image up-scaling algorithms can be divided into two categories: interpolation methods and super-resolution (SR) methods.

- Interpolation methods utilize simpler interpolation kernels such as bilinear or bicubic kernels. Many interpolation methods using variants of bilinear or bicubic interpolation have been proposed for up-scaling [5-7], [43].
- SR methods [8-18] have shown better performance than the interpolation-based methods. The basic idea of the learning-based approach is to learn mapping functions from LR to HR images or videos. The learning-based methods can be grouped into two types: learning LR-to-HR mappings using surrounding information from the LR image itself (internal-based [8-9]), and learning from external LR-HR image pairs (external-based [10-11]). A variety of machine learning algorithms, such as sparse coding [12-13], anchored neighbors [14-16] and linear mapping kernels [17-18], have been proposed for SR. However, these learning-based SR methods mostly require a number of frame buffers for saving intermediate images, which makes it difficult for the SR methods to be implemented in low-complexity hardware for real-time FHD to 4K UHD video conversion.

Recently, deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have demonstrated superior performance in a variety of computer vision areas, including image classification, object detection, and segmentation. While the machine learning-based methods find features in a hand-crafted design and learn mappings by using these hand-crafted features, deep neural networks learn the best features and mappings by themselves, making the overall learning simpler and more effective. Elaborate CNN-based SR methods [19-24] have been proposed to enhance the visual qualities of HR reconstruction.


These CNN structures comprise multiple layers of convolutions and nonlinear functions, and are designed to perform SR and produce HR images or videos of high quality. Due to their excessive multiplications and other computations, it is known that these conventional CNNs are difficult to implement in hardware (HW) of low complexity for real-time applications. In addition, the analysis of computational complexity and execution time for all the elaborate CNN-based SR methods [19-24] was made at the software (SW) level on CPU and/or GPU platforms. Additionally, these CNN structures require multiple frame buffers to store intermediate feature maps when they are implemented in SW and HW, which hinders them from being implemented in real-time.

In this paper, we propose a hardware-friendly CNN-based SR method and its dedicated hardware, which can convert 2K FHD to 4K UHD at 60 fps on an FPGA. Overall, the contributions of this paper are summarized as follows:

i) We propose a novel network structure that performs SR effectively in hardware with limited computation and memory space, where LR input data are processed line-by-line and the parameter values of the convolutional filters are kept small in number.
ii) A cascade structure of 1D convolutions is proposed, which can maintain large receptive fields along horizontal lines while keeping vertical receptive fields small in size, thus saving the required line memories.
iii) By incorporating depth-wise separable convolutions with a residual connection, the number of filter parameters of our network is reduced, while preserving good SR performance with much reduced computations.
iv) A simple and effective quantization method is presented, which can convert 32-bit floating-point data into fixed-point data with little PSNR degradation.
v) We propose a compression method to compress intermediate feature maps in order to reduce the line memories required to store the feature map data.

The remainder of the paper is organized as follows: Section II examines various existing SR methods using CNN in terms of their weaknesses and strengths from a hardware perspective. Section III details our proposed CNN architecture for SR with the quantization of parameters and activations and the compression of feature maps. Section IV describes the hardware architecture for the proposed hardware-friendly SR method. In Section V, various experiment results are presented for the conventional SR methods and our proposed SR method with our HW implementation on FPGA. Finally, we conclude our work in Section VI.

II. RELATED WORKS

Many SR methods have been proposed based on various CNN structures [19-24]. Dong et al. [19] proposed a simple network structure consisting of three convolutional layers, called SRCNN, and showed better performance than the existing machine learning-based SR methods [8-18]. SRCNN [19] directly learns an end-to-end mapping between LR input images and the corresponding HR output images, where bicubic interpolation of the LR input is performed as a pre-processing step, followed by the extraction of feature maps via convolutions. Next, the nonlinear mappings between the feature maps of LR inputs and those of HR target images are learned. Finally, these HR feature maps are aggregated in the form of overlapped patches to obtain a final HR image. In their extended work [20], they investigated the impact of network depth on super-resolution, and empirically showed that the difficulty of training deeper models impedes the performance improvement of CNN-based SR methods. In addition, a faster version of SRCNN, called FSRCNN [23], improves upon SRCNN in terms of computation speed. In order to decrease the overall computational complexity, FSRCNN replaces the pre-processing bicubic interpolation step of SRCNN with a post-processing step using deconvolutions. In doing so, its pipeline has four convolutional layers which work as feature extraction, shrinking, mapping and expanding, respectively. However, the use of deconvolution layers is known to cause problems such as checkerboard artifacts [33].

Kim et al. [21] proposed a very deep convolutional network, called VDSR, inspired by VGG-net [25] used for the ImageNet classification contest. Using 20 convolution layers in a very deep network, VDSR is able to reconstruct HR images superior to those of SRCNN. This is due to its very large receptive field obtained from using many layers, with which the network can exploit context information over large image regions. Slow training convergence, which is a common problem in very deep networks, was mitigated by learning residuals between bicubic-interpolated images and HR target images, and by using high learning rates with adaptive gradient clipping.

Shi et al. [24] proposed an SR method using an efficient sub-pixel convolutional network (ESPCN), which can directly generate HR images from LR-sized feature maps by using a sub-pixel convolutional layer. They showed that the overall computational complexity can be reduced, compared to that of other SR networks such as SRCNN [19, 20] and VDSR [21].

When it comes to hardware implementation, however, there are some obstacles to implementing the aforementioned SR methods in low-complexity hardware for real-time applications. In this paper, we consider the following three points: 1) reduction of CNN filter parameters; 2) quantization of filter parameters and activations; and 3) hardware architectures for dedicated CNN implementation.

A. Reduction of CNN filter parameters

To begin with, the aforementioned CNN-based SR methods [19-24] have a large number of convolutional filter parameters in order to maximize PSNR performance. SRCNN [19] and SRCNN-Ex [20] have about 8K and 57K parameters, respectively. ESPCN [24] uses about 27K parameters. VDSR [21] has about 680K parameters, which is too large to be implemented on HW with limited resources. SR networks with a smaller number of parameters have also been proposed. FSRCNN [23] has about 12K parameters. A simpler version of FSRCNN, called FSRCNN-s, uses 4,016 parameters.


We argue that the current SR network structures still have too many parameters to be implemented in low-complexity hardware, and the number of parameters should be further reduced. In particular, the deconvolution layers used in FSRCNN and FSRCNN-s [23] are known to be difficult to handle with stream processing in hardware.

An efficient network structure for classification, called MobileNets [33], has been proposed, which has a small number of parameters but exhibits good performance. In MobileNets, a depth-wise separable convolution is proposed, in which a conventional convolution is decomposed into a depth-wise convolution and a pointwise convolution. MobileNets with depth-wise separable convolutions can achieve similar performance with only about one-ninth of the total number of parameters, compared to networks using standard (non-separable) convolutions.
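To see where the roughly one-ninth saving comes from, consider a simple parameter count (our own illustration, with K denoting the kernel size and C_in, C_out the input and output channel counts): a standard K×K convolution needs K² · C_in · C_out weights, whereas a depth-wise separable convolution needs K² · C_in + C_in · C_out, so the ratio is

(K² · C_in + C_in · C_out) / (K² · C_in · C_out) = 1/C_out + 1/K²,

which is close to 1/9 for 3×3 kernels when C_out is reasonably large.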
B. Quantization of filter parameters and activations

Recent works [35-38] have shown that more computationally efficient CNN networks can be constructed by quantizing filter parameters. Their quantization methods can be applied to both the training and testing phases. More specifically, quantization can be applied to weights and biases, and also to feature maps after nonlinear activations, in which zeros are dominant.

It is well known that 32-bit floating-point (single-precision) data is used in most deep learning platforms (e.g., PyTorch [52], Caffe [53] and TensorFlow [54]). To handle 32-bit floating-point data in hardware, dedicated computation hardware with a floating-point unit (FPU) is often used to perform the addition and multiplication operations of convolution layers. Because of this, converting floating-point data to fixed-point data is essential for real-time hardware implementation with limited resources. Therefore, it is important to find the optimal bit depth that meets the trade-off between visual quality and computational complexity in the quantization process.

Over the past few years, a considerable number of studies on computer vision tasks such as classification and object detection have aimed at reducing computational complexity through quantization. Gysel et al. [35] proposed representing both CNN weights and activations using minifloat (i.e., floating-point numbers with shorter bit-widths), since fixed-point arithmetic is more hardware-efficient than floating-point arithmetic. Gupta et al. [36] reported the accuracy of using different fixed-point rounding schemes. Judd et al. [37] demonstrated that the minimum required precision of data varies not only across different networks, but also across different layers of the same network structure. Lin et al. [38] presented a fixed-point quantization methodology that can be used to identify the optimal data precision for all layers in a network. The aforementioned studies target classification tasks. What is lacking, however, is research on quantization for regression problems such as SR. Inspired by the quantization method in [36], we employ a simple yet effective quantization method for our proposed SR network.

C. Hardware architectures for CNN implementation

Many hardware architectures have been proposed to support CNN [26-31, 62-64] in various areas of research. These HW architectures were designed as accelerators that can increase computation speed for CNN operations. Generally, it is reported that large frame memory is essentially needed to store the parameters and intermediate feature maps of convolution layers and pooling layers, and dynamic random-access memory (DRAM) is frequently used as the frame memory. According to the work in [32], when a CNN hardware operates, the DRAM used as frame memory consumes the most power. However, low power consumption is important in the display field (e.g., timing controllers (T-Con)), where a high-performance CPU or GPU and DRAM cannot be used in many cases. Therefore, it is necessary to design a novel CNN structure that is apt to perform line-based stream processing instead of using a frame buffer.

In this paper, we propose a novel hardware-friendly SR method that is suitable for real-time hardware realization without using frame memory, and therefore can increase data throughput with much reduced power consumption. As a result, we present a fully pipelined dedicated CNN-based SR hardware, which performs real-time FHD-to-4K UHD conversion at 60 fps on an FPGA.

III. PROPOSED HARDWARE-FRIENDLY CNN-BASED SR METHOD

We propose a novel network structure that performs SR effectively in hardware with limited resources, where LR data should be processed in a line-by-line manner and the number of convolutional filter parameters must be significantly reduced compared to those of conventional CNNs. We now describe the details of our network architecture along with the quantization of filter parameters and activations, and a proposed compression method for intermediate feature maps.

A. Details of Our Proposed CNN Architecture

Fig. 1 shows a block diagram of our hardware-friendly CNN-based SR network. As commonly done in [19-24], the RGB channels of the LR input images are converted to YCbCr channels, and only the Y-channel is used as the input to our CNN network. The color channels (Cb and Cr) are up-scaled by simple nearest-neighbor interpolation for hardware efficiency. It is worthwhile to mention that the PSNR performance of networks trained with only Y channels is similar to that of networks trained with RGB channels [20]. We also incorporated the residual learning technique from VDSR [21]. In order to further reduce complexity, bicubic interpolation is replaced with nearest-neighbor interpolation in our network structure. Here, the final HR image, YF, is calculated by adding the interpolated LR image, YN, to the network output, YC, as

YF = YN + YC.        (1)

In order to achieve a small usage of convolutional filter parameters and line memories, our network incorporates (i) depth-wise separable convolutions, (ii) 1D horizontal convolutions and (iii) residual connections, each of which is described in detail in the following subsections.


Fig. 1. A block diagram of our hardware-friendly CNN-based SR network.

As a result, the number of parameters used in our proposed network is about 21 times smaller than that of SRCNN-Ex [20], about 4.5 times smaller than that of FSRCNN [23], and 1.56 times smaller than that of FSRCNN-s [23], while maintaining PSNR and SSIM performance similar to that of SRCNN-Ex [20].

Depth-wise separable convolutions: MobileNets [33], which is a network using depth-wise separable convolutions (DSC), has been shown to achieve similar classification performance with only about one-ninth of the number of parameters, compared to the cases with conventional non-separable convolutions. A DSC comprises a depth-wise (DW) convolution, a rectified linear unit (ReLU) and a pointwise (PW) convolution in a cascade. However, DSC is known to yield low performance when used in regression problems such as SR. Since batch normalization (BN) may degrade performance in regression [22] and requires relatively high computation for computing means and variances, we first remove the BN from the DSC. Secondly, we also remove the ReLU from the DSC. We found from our experiments that when a small number of convolution filters is used with ReLU in the DSC, the resulting feature maps tend to become too sparse after ReLU, which may hinder training and lead to lower PSNR performance. Fig. 2-(a) shows a conventional DSC [33] and Fig. 2-(b) illustrates our modified DSC. Additionally, Table I shows a PSNR comparison of our two networks for the Set-5 dataset, with and without ReLU between the DW and PW convolutions in the DSC.

Table I. PSNR comparison of our two networks with and without ReLU between the DW and PW convolutions in DSC for the Set-5 dataset.
              | Average PSNR | Average SSIM
with ReLU     | 33.54 dB     | 0.9544
without ReLU  | 33.66 dB     | 0.9548
Difference    | +0.12 dB     | +0.0004

1D horizontal convolutions: 3×3-sized filters are often used for DW convolutions in conventional DSC. However, in some display applications such as T-Con, the excessive usage of line memories is not encouraged, thus limiting the use of 3×3-sized filters in networks. On the contrary, it is well known that a large receptive field, obtained by using 3×3 or larger filters, is necessary to obtain high performance in deep learning [20]. Therefore, as a compromise, we utilize 1D horizontal convolutions for some convolutional layers, which makes our network more compact and suitable for hardware where the LR input data comes through line-by-line streaming. As a result, our network has a rectangular receptive field with a longer length along the horizontal direction and a shorter length in the vertical direction. As shown in Fig. 1, our proposed network has 3×3 convolutions in the first and the last two layers, and two 1D horizontal convolution layers with 1×5-sized convolutions, resulting in a total receptive field size of 7×15 for the proposed network. In doing so, our network can effectively reduce the line memories required for storing intermediate feature maps to as few as possible.

Residual connections: For an efficient HW implementation, a network should keep its convolutional filter parameters as few as possible. With a small number of filter parameters, however, we found that our network with DSC and 1D horizontal convolutions suffers from poor training. This is due to the fact that the lack of filter parameters leads to very sparse inter-layer connections in the network, resulting in poor training for image restoration. By inserting a residual connection into our network, we were able to reduce the number of filters significantly while keeping good SR performance. It is noted that, from an HW perspective, implementing residual connections around 3×3-sized convolutions requires additional line memories for saving the input of the residual connection, which is needed again at the end of the connection. Therefore, we only use 1D horizontal convolutions in the residual connection, which can be easily implemented in HW by using delay buffers. Fig. 2-(c) illustrates our final DSC structure with 1D horizontal convolutions and a residual connection.

Fig. 2. Comparison of different separable convolution layers: (a) Depth-wise separable convolution of MobileNets [33]; (b) Proposed depth-wise separable convolution; (c) Proposed residual layer of depth-wise separable convolution.
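For readers who want to relate Fig. 1 and Fig. 2-(c) to code, the following PyTorch sketch shows one way to express the proposed residual block of 1×5 depth-wise and 1×1 pointwise convolutions (no BN, no ReLU between DW and PW). The channel counts follow Table III, but the snippet is an illustrative approximation, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparable1D(nn.Module):
    """1x5 depth-wise conv followed by a 1x1 point-wise conv (no BN, no intermediate ReLU)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, kernel_size=(1, 5), padding=(0, 2),
                            groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pw(self.dw(x))

class ResidualBlock1D(nn.Module):
    """Residual block of Fig. 2-(c) / Table III: DSC (32->16), ReLU, DSC (16->32), skip add."""
    def __init__(self, ch=32, mid_ch=16):
        super().__init__()
        self.dsc1 = DepthwiseSeparable1D(ch, mid_ch)
        self.dsc2 = DepthwiseSeparable1D(mid_ch, ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.dsc2(self.relu(self.dsc1(x)))
        # All convolutions in the block are 1-D horizontal, so the skip connection
        # can be realized in HW with short delay buffers instead of extra line memories.
        return x + y

# Example: one FHD luminance feature map of 32 channels passes through the block unchanged in size.
feat = torch.randn(1, 32, 1080, 1920)
out = ResidualBlock1D()(feat)
```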


B. A Quantization Method for Converting Floating-point to Fixed-point

In our network, we use fixed-point number representation for low complexity. There are some quantization methods [36, 39] that can be used to convert floating-point data to fixed-point data. A fixed-point number can be defined by [IL, FL], where IL and FL indicate the integer length and the fractional length, respectively. The number of integer bits (IL) plus the number of fractional bits (FL) yields the total number of bits used to represent the number; their sum, IL + FL, is referred to as the word length, WL. The fixed-point format [IL, FL] limits the precision of the data to FL bits, and sets the range to [−2^(IL−1), 2^(IL−1) − 2^(−FL)]. To convert floating-point to fixed-point, the rounding scheme used here is round-to-nearest. The round-to-nearest method [36] is defined as

Round(x, [IL, FL]) = ⌊x⌋,       if ⌊x⌋ ≤ x ≤ ⌊x⌋ + ε/2,
                     ⌊x⌋ + ε,   if ⌊x⌋ + ε/2 < x ≤ ⌊x⌋ + ε,        (2)

where ⌊x⌋ is the largest integer multiple of ε (= 2^(−FL)) less than or equal to x. If x lies outside the range of [IL, FL], we saturate the result to either the lower or upper limit of [IL, FL]. Finally, the formula [36] to convert from floating-point to fixed-point is formulated as

Convert(x, [IL, FL]) = −2^(IL−1),            if x ≤ −2^(IL−1),
                       2^(IL−1) − 2^(−FL),   if x ≥ 2^(IL−1) − 2^(−FL),
                       Round(x, [IL, FL]),   otherwise.             (3)

To minimize the PSNR degradation when applying quantization to floating-point data (filter parameters and activation values), we conducted many experiments to find and apply the optimal WL, IL and FL values for our proposed SR network. Interestingly, the degradation due to the quantization method is very small in our proposed hardware-friendly CNN-based SR network. More detailed results for quantization are discussed in Section V.
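A minimal Python sketch of (2)-(3) is given below, assuming per-tensor quantization with a shared [IL, FL] format; the function and variable names are ours, not from the authors' code, and the midpoint tie-breaking may differ marginally from (2).

```python
import torch

def quantize_fixed_point(x: torch.Tensor, il: int, fl: int) -> torch.Tensor:
    """Round-to-nearest fixed-point quantization with saturation, following Eqs. (2)-(3).

    il: integer length (including sign), fl: fractional length; word length WL = il + fl.
    """
    eps = 2.0 ** (-fl)                      # quantization step (epsilon in Eq. (2))
    lo = -(2.0 ** (il - 1))                 # lower saturation limit
    hi = 2.0 ** (il - 1) - eps              # upper saturation limit
    y = torch.floor(x / eps + 0.5) * eps    # round to the nearest multiple of eps
    return torch.clamp(y, lo, hi)           # saturate out-of-range values (Eq. (3))

# Example: quantize trained weights to the 10-bit [IL, FL] = [2, 8] format used for weights in this paper.
w = torch.randn(32, 1, 3, 3)
w_q = quantize_fixed_point(w, il=2, fl=8)
```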
C. Compression of Intermediate Feature Maps

As mentioned in Section III-A, the size of the receptive field greatly affects the overall performance. More precisely, both the horizontal and vertical receptive fields are important. However, in order to perform the 3×3 convolutions of the next layer in hardware, the feature map data of the current layer need to be stored in line memories, and this operation (3×3 convolution) requires twice the number of line memories needed to store the output feature maps of the current layer. It should be noted that the usage of many line memories may cause problems in chip design: (i) it increases the chip size due to the increased number of power rings used in the line memories; (ii) it causes routing congestion in place and route (P&R); and (iii) voltage drops occur when memory block boundaries lack power rings. To solve these problems, a new approach that reduces the required line memories and also compresses the feature maps is needed.

In this paper, we also propose a novel compression method for compressing the intermediate feature maps of our network, based on some guidelines from a hardware viewpoint. First, the compression algorithm must be very simple to implement; for the compression to be worthwhile, the logic size of the compression method must be smaller than the line memory size it saves for storing intermediate feature maps. Secondly, because the use of residual learning and ReLU leads to many zero and close-to-zero values in the feature maps, an efficient compression algorithm for such data characteristics should be considered. Finally, the data should be compressed using only adjacent data in the horizontal direction to use the line memories effectively.

We present an efficient compression method by modifying the DXT5 method [40-42] in a way suitable for CNN structures. Note that the DXT5 method [40-42] was developed for image compression. Fig. 3-(a) shows details of the conventional DXT5 algorithm. To perform compression, RGB input pixels are organized into 4×4 blocks. Each color channel of the RGB inputs is compressed independently. The maximum (MAX) and minimum (MIN) values in each color channel are first computed. Six intermediate points are generated by applying interpolation to the MAX and MIN values. The MAX, the MIN and the six computed intermediate points are used as reference colors for compression. To encode the pixel data, we assign to each pixel the index value of the closest reference color. Encoding is completed by storing the index, MAX and MIN values of the 4×4 block. There are eight reference values for each pixel in a 4×4 block, and each index is therefore expressed with 3 bits. Decoding is the reverse of the encoding process, and can be easily performed using the MAX, MIN and index values. If the bit depth per pixel (bpp) of the RGB inputs is 8 bits, the DXT5 algorithm has a fixed compression ratio of 2:1 for each 4×4 block. The compression ratio (CR) can be calculated as

CR = uncompressed bits / compressed bits
   = (3 × bpp × block_size) / (3 × (max + min + block_size × index)).        (4)

Fig. 3-(b) shows our proposed compression method.

Fig. 3. Compression methods: (a) Conventional DXT5 method; (b) Proposed compression method for intermediate feature maps.


TABLE II. DIFFERENCES BETWEEN THE CONVENTIONAL DXT5 AND THE PROPOSED COMPRESSION METHOD.
                  | Conventional DXT5 | Proposed method
input             | RGB               | intermediate feature maps
bits              | 24 bits           | 14 bits (quantized)
block             | 4×4               | 1×32
max value         | computed          | computed
min value         | computed          | zero (fixed)
bits per index    | 3                 | 5
divisor value     | 7                 | 32 (approximate)
compression ratio | 2:1               | 2.58:1

Table II presents the main differences between our method and the original DXT5 algorithm. In our method, the MIN value is fixed to 0 and only the MAX value is computed, based on the observation that the feature map data are often zero or close-to-zero values. By fixing the MIN value to 0, we can remove the bits for storing the MIN value and eliminate the corresponding logic. Since the intermediate feature map data must be processed line by line in hardware, the block size of the feature map data is set to 1×32. Also, a 5-bit index is assigned as the quantization level for each data point in a 1×32 block of a feature map. The 5-bit length of the indices was found empirically by checking the PSNR performance with respect to the bit lengths of the data point indices, which will be discussed in Section V with the experiment results. The compression ratio of our proposed compression method (PCR) can be calculated as

PCR = (bits of quantized feature map × block_size) / (bits of max + block_size × bits of index).        (5)

If the word length (WL) of the feature map data is set to a 14-bit depth after quantization of the activations, the compression ratio becomes 2.58:1 (= 14 × (1×32) / (14 + 5 × (1×32))). That is, the number of line memories for storing feature map data can be reduced by a factor of about 2.58. As shown in Table II, we set the divisor value to 32 (a power of two) instead of 31 (2^5 − 1) to reduce the hardware complexity of calculating the intermediate points. By doing so, the computation of the intermediate points can be done only with shift and add operators. The operation of our proposed compression method is summarized in Algorithm 1.

Algorithm 1. Proposed Algorithm for Feature Map Compression
1.  Input: feature map data in a line.
2.  Initialization: generate 1×32 blocks from the feature map data in the line.
3.  For each 1×32 block, do
4.      Compute the max value and assign a zero min value
            1st point = zero value
            32nd point = max value
5.      Compute 30 intermediate points through interpolation
            2nd point = (1/32) × max value
            3rd point = (2/32) × max value
            ...
            31st point = (30/32) × max value
6.      For each feature map value in the 1×32 block, do
7.          Find the index of the closest reference value.
8.      End For
9.  End For
10. Store the max value and the index values of each 1×32 block in the line.
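The following PyTorch sketch mirrors Algorithm 1 for one line of (non-negative, post-ReLU) feature map data. The names and tensor layout are our own choices, and the bit-packing performed by the HW is omitted.

```python
import torch

def compress_blocks(line: torch.Tensor):
    """Compress one feature-map line (length a multiple of 32) as in Algorithm 1.

    Returns, for every 1x32 block, its MAX value and a 5-bit index (0..31) per element,
    where level k corresponds to (k/32)*MAX for k = 0..30 and level 31 is MAX itself (MIN fixed to 0).
    """
    blocks = line.reshape(-1, 32)                      # split the line into 1x32 blocks
    max_vals = blocks.max(dim=1).values                # per-block MAX (MIN is fixed to 0)
    k = torch.arange(32, dtype=line.dtype)
    refs = max_vals[:, None] * (k / 32.0)              # reference levels 0, 1/32, ..., 30/32 of MAX ...
    refs[:, 31] = max_vals                             # ... and the 32nd point is MAX itself
    # 5-bit index of the closest reference value for every element in the block
    idx = (blocks[:, :, None] - refs[:, None, :]).abs().argmin(dim=2).to(torch.uint8)
    return max_vals, idx

def decompress_blocks(max_vals: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    """Reconstruct the line from the per-block MAX values and 5-bit indices."""
    k = torch.arange(32, dtype=max_vals.dtype)
    refs = max_vals[:, None] * (k / 32.0)
    refs[:, 31] = max_vals
    return torch.gather(refs, 1, idx.long()).reshape(-1)

# Example: a 1920-sample post-ReLU feature line compresses to 60 MAX values plus 5-bit indices.
line = torch.relu(torch.randn(1920))
max_vals, idx = compress_blocks(line)
recon = decompress_blocks(max_vals, idx)
```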
IV. PROPOSED CNN-BASED SR HW ARCHITECTURE

This section describes our SR HW architecture, in which our proposed HW-friendly SR network is implemented for real-time FHD-to-4K UHD up-scaling applications. Fig. 4 illustrates our proposed pipelined HW architecture for SR. In addition, our HW is designed in two types: Type-1 is without compression of the intermediate feature maps and Type-2 is with compression. Table III shows the detailed specification of our proposed SR HW architecture. Our HW consists of an RGB-to-YCbCr converter, four input line buffers, a data aligner, depth-wise and pointwise convolution operators, ReLU operators, compressors and decompressors, four output line buffers, and weight buffers.

TABLE III. OUR PROPOSED HW ARCHITECTURE FOR SR.
Type / Stride / Padding  | Filter Shape   | Input / Output Size | Remark
Input                    | -              | 1920×1080×1         | Input Y
Conv / (1,1) / (1,1)     | 3×3×1×32       | 1920×1080×32        |
ReLU                     | -              | 1920×1080×32        |
DW Conv / (1,1) / (0,2)  | 1×5×32 dw      | 1920×1080×32        | Residual Block
PW Conv / (1,1) / (0,0)  | 1×1×32×16      | 1920×1080×16        | Residual Block
ReLU                     | -              | 1920×1080×16        | Residual Block
DW Conv / (1,1) / (0,2)  | 1×5×16 dw      | 1920×1080×16        | Residual Block
PW Conv / (1,1) / (0,0)  | 1×1×16×32      | 1920×1080×32        | Residual Block
ReLU                     | -              | 1920×1080×32        |
DW Conv / (1,1) / (1,1)  | 3×3×32 dw      | 1920×1080×32        |
PW Conv / (1,1) / (0,0)  | 1×1×32×16      | 1920×1080×16        |
ReLU                     | -              | 1920×1080×16        |
DW Conv / (1,1) / (1,1)  | 3×3×16 dw      | 1920×1080×16        |
PW Conv / (1,1) / (0,0)  | 3×3×16×4       | 1920×1080×4         |
Pixel Shuffle            | depth-to-space | 3840×2160×1         | YC
Nearest Neighbor         | 2× up-sample   | 3840×2160×1         | YN
Residual Network         | YN + YC        | 3840×2160×1         | Output YF

Fig. 4. Block diagram of the proposed pipelined SR hardware.


As shown in Fig. 4, our HW has a pipeline structure for up-scaling, where the arrows marked as the Type-1 SR HW datapath indicate the data paths of the Type-1 SR HW, which does not have compressors and decompressors, and the arrows marked as the Type-2 SR HW datapath indicate the data paths of the Type-2 SR HW, which has them. The remaining black arrows are the data paths shared by both the Type-1 and Type-2 SR HWs. All the weight parameters in our HW are quantized according to (3) from 32-bit floating-point to 10-bit fixed-point, and the quantized weight parameters are stored in the weight buffers. The processing of our proposed pipelined SR hardware in Fig. 4 is performed as follows:

(i) The convolutional filter parameters in our HW are loaded from the weight buffers. Afterwards, we extract YCbCr values from the RGB input stream. From the YCbCr LR input image, we store four rows into line buffers, which will be used for the nearest-neighbor up-sampling that produces the interpolated image for the residual connection at the end of our network.
(ii) The data aligner rearranges the data in both the four line buffers and the input stream, and generates 3×3-sized YCbCr LR patches. Here, the Y channels of the LR patches are sent to the first 3×3 convolution layer. After the first convolution operator, the feature map data go through the ReLU activation function. After that, the output is passed to the 1D horizontal residual block.
(iii) The intermediate feature maps that have passed through the residual block and ReLU are compressed by the compressor and stored in the line memories. The data stored in the line memories are then read and decompressed at the one-line-delayed data enable (DE) timing. The decompressed data are convolved by the 3×3 depth-wise and 1×1 pointwise convolutions.
(iv) After passing the 1×1 pointwise convolution, the number of feature map channels is reduced by half from 32 to 16. The operation in step (iii) is then repeated for the 16-channel feature maps in the same manner, with compression, storage into line memories and decompression in sequence. It should be noted from this repeated operation that the feature maps at the output (1×1 pointwise) convolution layer have four channels, which are used to construct a 2×2-sized HR patch in a similar way to sub-pixel convolution [24].
(v) Lastly, the final Y (YF) is obtained by adding the 2×2-sized super-resolved Y data (YC) and the 2× up-sampled data obtained by the nearest-neighbor interpolation method (YN). To synchronize the timings of the YC and YN data, the YN data is stored in a first-in-first-out (FIFO) block and is read at the same timing as YC. The delayed CbCr data from the FIFO are also up-sampled by a factor of 2 using nearest-neighbor interpolation and are transferred to the YCbCr-to-RGB converter to obtain RGB pixels. Two output line buffers store the generated 2×2 RGB HR patches, which are then delivered to display devices per output clock cycle at the output timing. Here, we use four output line buffers in total to avoid read/write conflicts for the 2×2 RGB HR output patches, using a double-buffering scheme for stream processing.
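As a software reference for steps (iv)-(v), the following PyTorch lines show how a 4-channel LR-resolution output is rearranged into a 2× HR Y image and added to the nearest-neighbor up-sampled input. This is a behavioural sketch of the golden-model output stage (our own names), not the RTL.

```python
import torch
import torch.nn.functional as F

def reconstruct_hr_y(y_lr: torch.Tensor, feat4: torch.Tensor) -> torch.Tensor:
    """y_lr: (N, 1, H, W) LR luminance; feat4: (N, 4, H, W) output of the last 1x1 convolution."""
    y_c = F.pixel_shuffle(feat4, upscale_factor=2)              # depth-to-space: 4 channels -> one 2x2 HR patch (YC)
    y_n = F.interpolate(y_lr, scale_factor=2, mode="nearest")   # nearest-neighbor up-sampled input (YN)
    return y_n + y_c                                            # Eq. (1): YF = YN + YC

# Example: an FHD Y-channel and its 4-channel network output produce a 3840x2160 HR Y image.
y_hr = reconstruct_hr_y(torch.rand(1, 1, 1080, 1920), torch.rand(1, 4, 1080, 1920))
```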
V. EXPERIMENT RESULTS

We evaluate the performance of our method on popular datasets in comparison with bicubic interpolation and conventional CNN-based SR methods [19-21], [23]. In Section V-A, we describe the datasets used for training and testing, and present our experiment setup. In Section V-B, we investigate variants (combinations of various WL and IL values) of our method for quantization. Additionally, the effect of quantization bit depths on our network is thoroughly analyzed with respect to PSNR performance. We also present the experiment results for our feature map compression. In Section V-C, we compare our proposed SW-based and HW-friendly SR networks with other CNN-based SR networks, which are all SW-based methods, including the SRCNN [19], SRCNN-Ex [20], FSRCNN [23], FSRCNN-s [23] and VDSR [21] methods. We also compare our real-time SR HW with other real-time SR HW [43, 44, 58] in terms of gate counts and operating frequency.

A. Experiment Setup

1) Train/Test Data Sets, Metrics

We used publicly available benchmark datasets for training and testing. Specifically, our SR network was trained using 291 images, which consist of 91 images from Yang et al. [45] and 200 images from the Berkeley Segmentation Dataset [46]. We used two sets of images for performance comparisons: TestSet-1 and TestSet-2. TestSet-1 was constructed from Set5 [47], Set14 [48], B100 [46] and Urban100 [49], which are often used as SR benchmarks. TestSet-2 consists of 8 images of 4K UHD resolution from [17] and is used only for testing. All experiments were performed for SR with a scale factor of 2. PSNR and SSIM [50] were used as evaluation metrics. Since SR was performed on the luminance channel of the YCbCr color space, PSNR and SSIM were calculated on the Y-channels of the reconstructed and original HR images.

2) Implementation Details

For training and testing, LR input images were created from the original HR images by down-sampling with bicubic interpolation at a scale factor of 2. For training, 128×128-sized sub-images were randomly cropped from the HR images. Furthermore, the LR-HR training image pairs were augmented using rotation, mirroring and scaling. The weights were initialized using a uniform distribution, and no biases were used, to reduce the number of parameters. The L1 loss was used as the cost function instead of the L2 loss. Our proposed SR networks were trained using the Adam optimizer [51]. The learning rate was initially set to 0.0001 and was decreased by a factor of 10 after every 50 epochs. A mini-batch size of 2 was used during training. An NVIDIA Titan X GPU and an Intel Core i7-6700 CPU at 3.4 GHz were used for training and testing.
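The training recipe above maps onto a few lines of PyTorch; the optimizer, scheduler and loss settings follow the text, while the model and data are stand-ins for the proposed network and the 128×128 LR-HR patch loader.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, 3, padding=1)   # stand-in for the proposed SR network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)                          # initial learning rate 0.0001
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)    # x0.1 every 50 epochs
criterion = nn.L1Loss()                                                            # L1 cost function instead of L2

for epoch in range(2):                                     # the paper trains for many more epochs
    for _ in range(4):                                     # stand-in for the LR-HR patch loader
        lr_patch = torch.randn(2, 1, 64, 64)               # mini-batch size of 2
        hr_patch = torch.randn(2, 1, 64, 64)
        optimizer.zero_grad()
        loss = criterion(model(lr_patch), hr_patch)
        loss.backward()
        optimizer.step()
    scheduler.step()
```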
It should be noted that while we used floating-point computation during the training phase, the weight parameters of our SR network were quantized according to (3) from floating-point to fixed-point in the testing phase. Additionally, we quantized the activations of all convolution layers and compressed the feature maps of only the 3×3-sized depth-wise convolution layers using our compression method. The optimal quantization bits for the weights and activations were found empirically, and the quantized weight parameters were used in our proposed SR HW. The compressed intermediate feature maps and final SR images are used as the output data of a golden model to compare with the hardware simulation results. Fig. 5 shows our verification framework for the proposed SR HW.


Fig. 5. Verification framework for our proposed SR hardware.

B. Quantization and Feature Map Compression Results

As explained in Section III, the weight parameters and activations are quantized for the hardware implementation. Since quantization of the weight parameters and activations significantly affects the quality of the output images, it is important to find an appropriate quantization bit depth. In other words, we need to find appropriate values for the three parameters word length (WL), integer length (IL), and fractional length (FL), as described in Section III-B. We performed experiments by varying the values of WL, IL, and FL on various datasets.

Fig. 6 shows PSNR plots with respect to WL and IL when quantizing the weight parameter values for the Set5 dataset. As shown in Fig. 6, when the bit depth of WL is 10 or higher, the PSNR performance of our SR network is similar to the baseline, which is the case without quantization of the weights and activations. It should be noted in Fig. 6 that for the 10-bit WL, the PSNR performance is greatly decreased at IL = 4 or higher.

Fig. 6. PSNR performance plot with respect to varying values of WL and IL for quantizing weights in the Set-5 dataset.

Fig. 7. PSNR performance plot with respect to varying values of WL and IL for quantizing activations in the Set-5 dataset. The weight parameters are fixed to WL=10, IL=2 bits before quantizing activations.

Fig. 8. PSNR performance plot with respect to varying values of block size and index for intermediate feature map compression in the Set-5 dataset.


This indicates that the FL bit depth has a greater effect on PSNR performance than the IL bit depth, due to the use of the residual connection in (1). We set the WL bit depth to 10 bits and the IL bit depth to 2 bits for weight parameter quantization, and these settings are also used in the experiments on activation quantization and intermediate feature map compression.

Fig. 7 shows the PSNR performance of our SR network with respect to the WL and IL bit depths for activation quantization. The experiment results in Fig. 7 show a trend similar to that of the weight parameter quantization in Fig. 6. However, to alleviate performance degradation, more bits should be allocated for activation quantization than for weight parameter quantization. Based on the experiment results in Fig. 7, we set WL to 14 bits and IL to 2 bits for activation quantization.

Fig. 8 shows the experiment results for our proposed compression applied to the quantized feature maps to reduce line memory usage. Various block sizes and index sizes (numbers of quantization levels) were inspected with respect to PSNR performance. As expected, a smaller number of quantization levels for compression brings higher compression but results in lower PSNR performance. Based on the experiment results in Fig. 8, a block size of 32 and an index size (the number of quantization levels) of 5 bits were chosen as a compromise between the required line memories and the resulting PSNR performance.

In general, when implementing SR networks with large numbers of channels in HW with frame memory and high processing capability, it is worthwhile to study more elaborate techniques for feature map compression to effectively store interim feature maps (activations) in limited memory space.

C. Comparison with other CNN-based SR methods

To validate the performance of our proposed SR network before hardware implementation, we compared it with bicubic interpolation and other CNN-based SR methods, including SRCNN [19], SRCNN-Ex [20], FSRCNN [23], FSRCNN-s [23] and VDSR [21]. We utilized the publicly available open MATLAB source codes for SRCNN, SRCNN-Ex, FSRCNN and FSRCNN-s, while our proposed method was implemented using PyTorch [52]. For a fair comparison, the borders of the HR reconstructions and original images were excluded in the PSNR/SSIM calculation, as done in [19]. All methods were executed on a CPU platform. Because the official code for VDSR is only executable on GPU platforms, we used a third-party code [55] running on CPU platforms to measure PSNR/SSIM performance and execution time. The execution time for our proposed method was measured based on the software implementation using PyTorch.

TABLE IV. PERFORMANCE COMPARISONS OF VARIOUS SR METHODS: USED BITS OF WEIGHT PARAMETERS AND ACTIVATIONS, AVERAGE PSNR AND SSIM OF FOUR BENCHMARKS WITH A SCALE FACTOR OF 2 FOR TESTSET-1.

Method               | # of Params | Weight bits | Activation bits | Set-5 PSNR/SSIM | Set-14 PSNR/SSIM | B100 PSNR/SSIM | Urban100 PSNR/SSIM
Bicubic              | -           | -           | -               | 33.66 / 0.9299  | 30.24 / 0.8688   | 29.56 / 0.8431 | 26.88 / 0.8403
SRCNN [19]           | 8K          | 32-bit      | 32-bit          | 36.34 / 0.9521  | 32.18 / 0.9039   | 31.11 / 0.8835 | 29.09 / 0.8897
SRCNN-Ex [20]        | 57K         | 32-bit      | 32-bit          | 36.66 / 0.9542  | 32.42 / 0.9063   | 31.36 / 0.8870 | 29.50 / 0.8946
FSRCNN [23]          | 12K         | 32-bit      | 32-bit          | 37.00 / 0.9557  | 32.63 / 0.9086   | 31.50 / 0.8909 | 29.85 / 0.9010
FSRCNN-s [23]        | 4K          | 32-bit      | 32-bit          | 36.57 / 0.9531  | 32.28 / 0.9049   | 31.23 / 0.8866 | 29.23 / 0.8914
VDSR [21]            | 665K        | 32-bit      | 32-bit          | 37.53 / 0.9587  | 33.03 / 0.9124   | 31.90 / 0.8960 | 30.76 / 0.9140
Ours (baseline)      | 2.56K       | 32-bit      | 32-bit          | 36.66 / 0.9548  | 32.52 / 0.9073   | 31.32 / 0.8880 | 29.34 / 0.8943
Ours (W only)        | 2.56K       | 10-bit      | 32-bit          | 36.64 / 0.9544  | 32.52 / 0.9071   | 31.31 / 0.8876 | 29.33 / 0.8942
Our HW Type-1 (W+A)  | 2.56K       | 10-bit      | 14-bit          | 36.64 / 0.9543  | 32.47 / 0.9070   | 31.31 / 0.8877 | 29.32 / 0.8939
Our HW Type-2 (W+A)  | 2.56K       | 10-bit      | 14-bit          | 36.51 / 0.9520  | 32.46 / 0.9055   | 31.27 / 0.8864 | 29.28 / 0.8916

TABLE V. COMPARISON OF FHD-TO-4K UHD SR PERFORMANCE FOR TESTSET-2 IN TERMS OF COMPUTATION TIME, PSNR AND SSIM.

Method / Image     | Bicubic        | SRCNN [19]     | SRCNN-Ex [20]  | FSRCNN [23]    | FSRCNN-s [23]  | VDSR [21]      | Ours (baseline) | Ours (W only)   | Our HW Type-1 (W+A) | Our HW Type-2 (W+A)
Realization of HW  | FPGA, ASIC     | N/A            | N/A            | N/A            | N/A            | N/A            | N/A             | N/A             | FPGA                | FPGA
Avg. comp. time (s), CPU / GPU | -  | 277.6 / 1.052  | 288.0 / 1.256  | 324.4 / 0.583  | 146.4 / 0.518  | 124.3 / 2.851  | 2.53 / 0.050    | 2.62 / 0.057    | 0.0166 (60 fps)     | 0.0166 (60 fps)
Balloon            | 33.79 / 0.9409 | 35.37 / 0.9582 | 35.55 / 0.9594 | 35.74 / 0.9607 | 35.47 / 0.9591 | 35.99 / 0.9608 | 35.59 / 0.9601  | 35.589 / 0.9599 | 35.587 / 0.9599     | 35.550 / 0.9589
Children           | 33.56 / 0.9123 | 34.91 / 0.9272 | 34.97 / 0.9280 | 35.09 / 0.9291 | 34.92 / 0.9275 | 35.28 / 0.9296 | 35.00 / 0.9288  | 34.997 / 0.9283 | 34.995 / 0.9283     | 34.947 / 0.9265
Constance          | 31.98 / 0.9271 | 32.84 / 0.9436 | 32.94 / 0.9447 | 33.01 / 0.9460 | 32.88 / 0.9439 | 33.22 / 0.9470 | 32.93 / 0.9449  | 32.926 / 0.9446 | 32.924 / 0.9445     | 32.911 / 0.9440
Lake               | 30.10 / 0.8527 | 31.47 / 0.9004 | 31.68 / 0.9033 | 31.74 / 0.9047 | 31.58 / 0.9019 | 31.87 / 0.9057 | 31.57 / 0.9019  | 31.565 / 0.9016 | 31.561 / 0.9015     | 31.532 / 0.9006
Louvre             | 35.63 / 0.9476 | 38.20 / 0.9666 | 38.39 / 0.9677 | 38.93 / 0.9699 | 38.35 / 0.9673 | 39.36 / 0.9708 | 38.33 / 0.9680  | 38.320 / 0.9677 | 38.312 / 0.9676     | 38.234 / 0.9668
Medieval           | 29.68 / 0.9128 | 31.49 / 0.9424 | 31.76 / 0.9453 | 31.88 / 0.9464 | 31.60 / 0.9431 | 32.17 / 0.9482 | 31.61 / 0.9443  | 31.598 / 0.9437 | 31.596 / 0.9437     | 31.561 / 0.9427
Skyscrapers        | 29.48 / 0.9103 | 32.23 / 0.9434 | 32.75 / 0.9470 | 33.04 / 0.9488 | 32.67 / 0.9458 | 33.16 / 0.9496 | 32.57 / 0.9462  | 32.562 / 0.9458 | 32.550 / 0.9457     | 32.479 / 0.9439
Supercars          | 29.63 / 0.9453 | 32.22 / 0.9668 | 32.81 / 0.9699 | 33.00 / 0.9703 | 32.55 / 0.9679 | 33.20 / 0.9705 | 32.38 / 0.9681  | 32.370 / 0.9677 | 32.362 / 0.9676     | 32.308 / 0.9658
Average            | 31.74 / 0.9199 | 33.59 / 0.9436 | 33.86 / 0.9457 | 34.05 / 0.9470 | 33.75 / 0.9446 | 34.28 / 0.9477 | 33.75 / 0.9453  | 33.740 / 0.9449 | 33.736 / 0.9448     | 33.690 / 0.9437


(a) (b) (c) (d) (e)

(f) (g) (h) (i) (j)


Fig. 9. Comparison of HR output images (×2 on Monarch). (a) Original HR image. (b) Bicubic. (c) SRCNN [19]. (d) SRCNN-Ex [20]. (e)
FSRCNN [23]. (f) FSRCNN-s [23]. (g) VDSR [21]. (h) Our proposed SR method with only quantized weights (W only). (i) Our proposed SR HW
with quantized weights and activations (W+A). (j) Our proposed SR HW with weights, activations (W+A) and compression of intermediate
feature maps.

Table IV shows the average PSNR and SSIM values for the HW-friendly SR network can reconstruct HR images of
SR methods under comparison with the four benchmark comparable quality compared to the other SR methods.
datasets. As shown in Table IV, our proposed HW-friendly SR Regarding the execution time, SRCNN, SRCNN-Ex, FSRCNN
method outperforms FSRCNN-s, while having only 64-percent and FSRCNN-s take relatively longer execution times because
of the number of filter parameters in FSRCNN-s. Also, almost their public codes are implemented in MATLAB and may not
no performance degradation is observed for our method even be optimized in the CPU platform. For fair comparison, our
after quantizing the weight parameter values and activations. network was re-implemented in TensorFlow [54], as other
When applying the feature map compression for our codes [59]-[61] were written in TensorFlow to measure
HW-friendly SR network, there is only about 0.1dB execution time on GPU platform. Table V also shows the
performance degradation in PSNR, but the required line execution time measured by GPU for various CNN-based SR
memory space is reduced by about 2.58 times. methods including ours. The execution time of our proposed
Table V shows the performance comparison for our CNN SR network running on GPU was measured to take about
proposed SR network and other CNN-based SR methods in 50ms, which is about three times slower than our FPGA
terms of PSNR, SSIM and average computation time for implementation.
Test-Set2, which consists of 4K UHD test images. Our Fig. 9 shows HR reconstructed images and their cropped
TABLE VI. CHARACTERISTICS OF HARDWARE IMPLEMENTATIONS FOR SR.
Publication Lee [43] Yang [44] Kim [58] Ours
CNN
Edge Orientation
Methods Sharp Filter Lagrange ANR Type-1 HW: Type-2 HW:
Learn Linear Mappings
without compression with compression
FPGA Device or Altera Xilinx Xilinx
0.13 ㎛ 90 nm 0.13 ㎛
CMOS Tech. EP4SGX530 XCKU040 XCKU040
Slice LUTs : 3,395 Slice LUTs : 110K Slice LUTs : 151K
FPGA Resources
5.1 K N/A 1,985K Slice Regs : 1,952 159K Slice Regs : 102K Slice Regs : 121K
or Equivalent Gate Counts*
DSP Blocks : 108 DSP Blocks : 1920 DSP Blocks : 1920
5 (input), 4 (input), 4 (input),
2 (input)
Line Buffers 4 (input) 24 (internal) 96 (internal), 38 (internal),
4 (output)
8 (output) 4 (output) 4 (output)
Memory Size (Bytes) N/A 235K 92K 392K 194K
Max. Freq. (MHz): 431 | 124.4 | 150 | 220 | 150
Throughput (Mpixels/s): 431 | 124.4 | 600 | 880 | 600
Supported Scale: 2X, 3X | 2X | 2X | 2X | 2X
PSNR (dB) on Set5: N/A | 33.83 | 34.78 | 36.64 | 36.51
PSNR (dB) on Set14: N/A | 29.77 | 31.63 | 32.47 | 32.46
Power (W): - | - | - | 4.791 | 5.686
Target Resolution: 4K UHD (30 fps) | FHD (60 fps) | 4K UHD (60 fps) | 4K UHD (60 fps) | 4K UHD (60 fps)
* A 2-input NAND gate is counted as one equivalent gate.
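The throughput entries in Table VI follow directly from the pixel parallelism and the clock rate. As a quick sanity check (assuming the nominal 3840×2160 active-pixel count is what the Mpixels/s figures refer to), four pixels per clock at 150 MHz comfortably covers 4K UHD at 60 fps:

# Required and provided pixel rates for FHD -> 4K UHD 60 fps up-scaling.
active_4k_pixels_per_frame = 3840 * 2160
required_mpixels_per_s = active_4k_pixels_per_frame * 60 / 1e6   # ~497.7 Mpixels/s

pixels_per_clock = 4          # output pixels produced per clock cycle
clock_mhz = 150               # target operating frequency of the SR HW
provided_mpixels_per_s = pixels_per_clock * clock_mhz            # 600 Mpixels/s

print(f"required : {required_mpixels_per_s:.1f} Mpixels/s")
print(f"provided : {provided_mpixels_per_s:.1f} Mpixels/s")
assert provided_mpixels_per_s > required_mpixels_per_s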


Fig. 10. Comparison of HR output images (×2 on Children). (a) Original HR image. (b) Bicubic. (c) SRCNN [19]. (d) SRCNN-Ex [20]. (e) FSRCNN [23]. (f) FSRCNN-s [23]. (g) VDSR [21]. (h) Our proposed SR method with only quantized weights (W only). (i) Our proposed SR HW with quantized weights and activations (W+A). (j) Our proposed SR HW with quantized weights, activations (W+A) and compression of intermediate feature maps.
Fig. 9 shows HR reconstructed images and their cropped regions using the bicubic and the five CNN-based SR methods including our HW-friendly SR network. Although our proposed method utilizes the smallest number of parameters, the resulting HR images are still perceptually plausible with sharp edges and few artifacts. Fig. 10 shows the cropped regions of reconstructed HR images for the Children image of 4K UHD resolution. As shown in Fig. 10, the visual qualities of the HR images using our HW-friendly SR network and its SR HW are comparable to those by the other CNN-based SR methods.
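The Type-2 SR HW referenced in panel (j) of Figs. 9 and 10 additionally compresses the intermediate feature maps before they are written to line memories; the compressor itself is described earlier in the paper. Purely as an illustration of the kind of fixed-rate, block-based codec that suits line buffers, the sketch below packs each small block of feature values into a min/max pair plus 2-bit indices (a generic block-truncation-style scheme, not necessarily the authors' exact method):

import numpy as np

def compress_block(block):
    """Encode a 1-D block of feature values as (min, max, 2-bit indices).

    Generic block-truncation-style codec used only to illustrate fixed-rate
    feature-map compression; it is not the paper's exact method.
    """
    lo, hi = float(block.min()), float(block.max())
    levels = np.linspace(lo, hi, 4) if hi > lo else np.full(4, lo)
    idx = np.argmin(np.abs(block[:, None] - levels[None, :]), axis=1)
    return lo, hi, idx.astype(np.uint8)

def decompress_block(lo, hi, idx):
    levels = np.linspace(lo, hi, 4) if hi > lo else np.full(4, lo)
    return levels[idx]

# Toy example: compress one 8-sample segment of a feature-map line.
segment = np.array([0.1, 0.12, 0.5, 0.48, 0.9, 0.88, 0.11, 0.49])
lo, hi, idx = compress_block(segment)
restored = decompress_block(lo, hi, idx)
print("max abs error:", np.max(np.abs(segment - restored)))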

D. FPGA Prototyping
Fig. 11. Demonstration of an FPGA prototype of our SR HW.
Table VI shows details of the HW implementations for Lee's [43], Yang's [44], our previous Super-Interpolation (SI) HW [58] and our proposed SR HW. Lee's HW [43] is implemented using an interpolation-based method, while our SI HW [58] and Yang's SR HW [44] are machine learning-based methods. Lee et al. [43] presented an HW using a Lagrange interpolation method with a sharpening algorithm, which can obtain 4K UHD video streams from HD and FHD streams at 30 fps. Yang's HW architecture [44] is based on Anchored Neighborhood Regression (ANR), which requires an intermediate image of a target resolution to generate high-frequency patches using dictionaries, and can obtain FHD at 60 fps. Our previous machine learning-based SI HW architecture [58] is based on linear mapping using edge orientation analysis, which does not require such an intermediate image and directly reconstructs HR images with high-frequency restoration.
Our HW-friendly SR method was implemented using SystemVerilog on the FPGA. In our SR HW, the output clock speed is four times the input clock speed, because the operating pixel-rate ratio of FHD over 4K UHD is one fourth. Our SR HW processes four pixels per clock cycle to support 4K UHD video streams at 60 fps, and was implemented in accordance with both the 150 MHz target operating frequency and the constraints applied in the synthesis stage and the place-and-route (P&R) stage with Vivado Design Suite 2015.4. The estimated power consumptions were derived from the Xilinx Vivado power estimator tool. The power dissipations of the Type-1 and Type-2 SR HW are 4.791 W and 5.686 W, respectively. The Type-2 SR HW consumes more power than the Type-1 SR HW because the logic of the compression algorithm in the Type-2 SR HW consumes additional power. In addition, to verify our implemented SR HW, we used a Xilinx Kintex UltraScale FPGA KCU105 evaluation board [56] and a TED HDMI 2.0 expansion card [57] to support the FHD input and 4K UHD output video interfaces.
Fig. 11 shows an FPGA prototype system of our SR HW. Here, we show two types of the FPGA prototype system for our proposed SR HW: the Type-1 SR HW, where the feature map compression is not applied, and the Type-2 SR HW, with the feature map compression applied. The Type-1 SR HW used 110K Slice LUTs and 102K Slice Registers, which take 45.38% of the total Slice LUTs and 21.08% of the total Slice Registers in the XCKU040 FPGA device. The Type-2 SR HW consumed 151K
Slice LUTs and 121K Slice Registers, which correspond to 62.6% of the total Slice LUTs and 24.97% of the total Slice Registers. Furthermore, both the Type-1 SR HW and the Type-2 SR HW make full use of the 1,920 DSP blocks in the XCKU040 FPGA device of the KCU105 evaluation board. By incorporating the feature map compression, the Type-2 SR HW reduces the on-chip memory usage (e.g., block RAM in the FPGA) to about 50% of that of the Type-1 SR HW, while utilizing about 38% more Slice LUTs and about 18% more Slice Registers than the Type-1 SR HW in implementing the two compressors and six decompressors.
Compared to the non-CNN-based SR methods [43], [44], [58], our CNN-based SR HW implementation requires a larger number of line memories and gates, but it can reconstruct 4K UHD HR images with significantly higher quality in real time at 60 fps. To the best of our knowledge, our proposed SR HW is the first CNN-based SR implementation on low-complexity hardware that supports real-time 2X up-scaling from FHD to 4K UHD at 60 fps.

VI. CONCLUSION

In this paper, we presented an efficient and novel CNN-based hardware-friendly SR network and its dedicated hardware that can reconstruct 4K UHD from FHD at 60 fps. Our hardware-friendly SR network does not use any frame buffer and uses a small number of filter parameters. Additionally, we use a simple and effective quantization scheme for weight parameters and activations to further reduce computational complexity and memory usage. Also, a compression method for intermediate feature maps was proposed to reduce the number of required line memories. As a result, our SR HW is able to reconstruct 4K UHD from FHD at 60 fps. As a future research topic, we will extend our proposed CNN-based SR to a more generalized hardware-friendly SR network in order to support more than two-times up-scaling applications. We believe that our work may be beneficial to those who try to implement real-time CNN-based image restoration algorithms on low-complexity hardware.

REFERENCES
[1] E. Perez-Pellitero, J. Salvador, J. Ruiz-Hidalgo, and B. Rosenhahn, “Accelerating super-resolution for 4K upscaling,” in Proc. IEEE Int. Conf. Consum. Electron., Las Vegas, NV, USA, Jan. 2015, pp. 317-320.
[2] M. Sakurai, Y. Sakuta, M. Watanabe, T. Goto, and S. Hirano, “Super-resolution through non-linear enhancement filters,” Proc. 20th IEEE Int. Conf. Image Process., Melbourne, VIC, Australia, Sept. 2013, pp. 854-858.
[3] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi, “Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation,” Proc. IEEE Conf. Comput. Vision and Pattern Recog. (CVPR), Hawaii, USA, July 21-26, 2017, pp. 2846-2857.
[4] A. Kappeler, S. Yoo, Q. Dai, and A. K. Katsaggelos, “Video Super-Resolution with Convolutional Neural Networks,” IEEE Trans. Computat. Imaging, vol. 2, no. 2, pp. 109-122, June 2016.
[5] C. Huang, P. Chen, and C. Ma, “A Novel Interpolation Chip for Real-Time Multimedia Applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 10, pp. 1512–1525, Oct. 2012.
[6] S.-L. Chen, “A Low-Cost High-Quality Adaptive Scalar for Real-Time Multimedia Applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 11, pp. 1600–1611, Nov. 2011.
[7] S.-L. Chen, “VLSI Implementation of a Low-Cost High-Quality Image Scaling Processor,” IEEE Trans. Circuits Syst. II Exp. Briefs, vol. 60, no. 1, pp. 31–35, Jan. 2013.
[8] C.-Y. Yang and M.-H. Yang, “Fast direct super-resolution by simple functions,” in Proc. IEEE Int. Conf. Comput. Vision, 2013, pp. 561-568.
[9] G. Freedman and R. Fattal, “Image and video upscaling from local self-examples,” ACM Transactions on Graphics, vol. 30, no. 2, p. 12, 2011.
[10] W. T. Freeman, T. R. Jones, and E. C. Pasztor, “Example-based super-resolution,” IEEE Computer Graphics and Applications, vol. 22, no. 2, pp. 56-65, 2002.
[11] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
[12] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in Curves and Surfaces, 2010, pp. 711-730.
[13] J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang, “Coupled dictionary training for image super-resolution,” IEEE Trans. Image Process., vol. 21, no. 8, pp. 3467–3478, Aug. 2012.
[14] H. Chang, D.-Y. Yeung, and Y. Xiong, “Super-resolution through neighbor embedding,” Proc. IEEE Conf. Comput. Vision and Pattern Recog., Washington, DC, USA, vol. 1, 2004, pp. 275–282.
[15] R. Timofte, V. De Smet, and L. Van Gool, “Anchored neighborhood regression for fast example-based super-resolution,” Proc. IEEE Int. Conf. Comput. Vision, Sydney, Australia, Dec. 1-8, 2013, pp. 1920-1927.
[16] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” Proc. Asian Conf. Comput. Vision, Singapore, Nov. 1-5, 2014, pp. 111–126.
[17] J.-S. Choi and M. Kim, “Super-Interpolation With Edge-Orientation-Based Mapping Kernels for Low Complex 2× Upscaling,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 469–483, Jan. 2016.
[18] J.-S. Choi and M. Kim, “Single Image Super-Resolution Using Global Regression Based on Multiple Local Linear Mappings,” IEEE Trans. Image Process., vol. 26, no. 3, pp. 1300–1314, Mar. 2017.
[19] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” Proc. Eur. Conf. Comput. Vis., Zurich, Switzerland, Sep. 2014, pp. 184–199.
[20] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2015.
[21] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate Image Super-Resolution Using Very Deep Convolutional Networks,” Proc. IEEE Conf. Comput. Vision and Pattern Recog. (CVPR), Las Vegas, USA, June 27-30, 2016, pp. 1646–1654.
[22] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee, “Enhanced Deep Residual Networks for Single Image Super-Resolution,” Proc. IEEE Conf. Comput. Vision and Pattern Recog. Workshops, Hawaii, USA, July 21-26, 2017, pp. 1132-1140.
[23] C. Dong, C. C. Loy, and X. Tang, “Accelerating the Super-Resolution Convolutional Neural Network,” Proc. European Conf. Computer Vision (ECCV), Part II, Springer, LNCS 9906, Amsterdam, Netherlands, Oct. 8-16, 2016, pp. 391-407.
[24] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” Proc. IEEE Conf. Comput. Vision and Pattern Recog. (CVPR), Las Vegas, USA, June 27-30, 2016, pp. 1874–1883.
[25] K. Simonyan and A. Zisserman. (2015). “Very Deep Convolutional Networks for Large-Scale Image Recognition.” [Online]. Available: https://arxiv.org/abs/1409.1556, Accessed: 2018-08-04.
[26] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-Layer CNN Accelerators,” Proc. 49th Annual IEEE/ACM Int. Sympo. Microarchitecture (MICRO), Taipei, Taiwan, Oct. 15-19, 2016.
[27] Y.-H. Chen, T. Krishna, J.-S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Nov. 2016.
[28] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks,” IEEE Int. Sympo. High Performance Comput. Architect. (HPCA), Austin, USA, Feb. 4-8, 2017, pp. 553-564.
[29] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers. (2016). “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference.” [Online]. Available: https://arxiv.org/abs/1612.07119, Accessed: 2018-08-04.
[30] L. Cavigelli and L. Benini, “Origami: A 803-GOp/s/W Convolutional Network Accelerator,” IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 11, pp. 2461–2475, Nov. 2017.
[31] H. Kim, J. Sim, Y. Choi, and L.-S. Kim, “A Kernel Decomposition Architecture for Binary-weight Convolutional Neural Networks,” Proc. 54th Design Automation Conference (DAC), Austin, USA, June 18-22, 2017.
[32] M. Horowitz, “Computing’s energy problem (and what we can do about it),” IEEE Int. Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb. 2014.
[33] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. (2017). “MobileNets: Efficient convolutional neural networks for mobile vision applications.” [Online]. Available: https://arxiv.org/abs/1704.04861, Accessed: 2018-08-04.
[34] A. Aitken, C. Ledig, L. Theis, J. Caballero, Z. Wang, and W. Shi. (2017). “Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize.” [Online]. Available: https://arxiv.org/abs/1707.02937, Accessed: 2018-08-04.
[35] P. Gysel, M. Motamedi, and S. Ghiasi. (2016). “Hardware-oriented Approximation of Convolutional Neural Networks.” [Online]. Available: https://arxiv.org/abs/1604.03168, Accessed: 2018-08-04.
[36] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. (2015). “Deep Learning with Limited Numerical Precision.” [Online]. Available: https://arxiv.org/abs/1502.02551, Accessed: 2018-08-04.
[37] P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. E. Jerger, R. Urtasun, and A. Moshovos. (2015). “Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets.” [Online]. Available: https://arxiv.org/abs/1511.05236, Accessed: 2018-08-04.
[38] D. D. Lin, S. S. Talathi, and V. S. Annapureddy. (2015). “Fixed Point Quantization of Deep Convolutional Networks.” [Online]. Available: https://arxiv.org/abs/1511.06393, Accessed: 2018-08-04.
[39] M. Shabany, “Floating-point to Fixed-point conversion.” [Online]. Available: http://ee.sharif.edu/~digitalvlsi/Docs/Fixed-Point.pdf, Accessed: 2018-08-04.
[40] J. M. P. van Waveren, “Real-Time DXT Compression.” [Online]. Available: http://www.gamedev.no/projects/MegatextureCompression/324337_324337.pdf, Accessed: 2018-08-04.
[41] J. M. P. van Waveren and I. Castano, “Real-Time YCoCg-DXT Compression.” [Online]. Available: http://www.nvidia.com/object/real-time-ycocg-dxt-compression.html, Accessed: 2018-08-04.
[42] P. Holub, M. Srom, M. Pulec, J. Matela, and M. Jirman, “GPU-accelerated DXT and JPEG compression schemes for low-latency network transmissions of HD, 2K, and 4K video,” Future Gener. Comput. Syst., vol. 29, pp. 1991–2006, Oct. 2013.
[43] J. Lee and I. C. Park, “High-performance low-area video up-scaling architecture for 4K-UHD video,” IEEE Trans. Circuits Syst. II Exp. Briefs, vol. 64, no. 4, pp. 437–441, May 2016.
[44] M.-C. Yang, K.-L. Liu, and S.-Y. Chien, “A Real-Time FHD Learning-Based Super-Resolution System without a Frame Buffer,” IEEE Trans. Circuits Syst. II Exp. Briefs, vol. 64, no. 12, pp. 1407–1411, Dec. 2017.
[45] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010.
[46] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” Proc. 8th IEEE Int. Conf. Comput. Vis., vol. 2, Jul. 2001, pp. 416–423.
[47] M. Bevilacqua et al., “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” Proc. 23rd Brit. Mach. Vis. Conf. (BMVC), Guildford, U.K., Sep. 2012, pp. 135.1–135.10.
[48] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” Proc. 7th Int. Conf. Curves Surf., Jun. 2010, pp. 711–730.
[49] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Boston, MA, USA, Jun. 2015, pp. 5197–5206.
[50] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[51] D. P. Kingma and J. Ba. (2014). “Adam: A Method for Stochastic Optimization.” [Online]. Available: https://arxiv.org/abs/1412.6980, Accessed: 2018-08-04.
[52] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” Neural Information Processing Systems (NIPS), Dec. 4-9, 2017. [Online]. Available: https://openreview.net/forum?id=BJJsrmfCZ, Accessed: 2018-08-04.
[53] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. (2014). “Caffe: Convolutional Architecture for Fast Feature Embedding.” [Online]. Available: https://arxiv.org/abs/1408.5093, Accessed: 2018-08-04.
[54] M. Abadi et al. (2016). “TensorFlow: Large-scale machine learning on heterogeneous systems.” [Online]. Available: https://www.tensorflow.org, Accessed: 2018-08-04.
[55] Huangzehao, “Caffe_VDSR.” [Online]. Available: https://github.com/huangzehao/caffe-vdsr, Accessed: 2018-08-04.
[56] Xilinx, KCU105 Board User Guide. [Online]. Available: https://www.xilinx.com/support/documentation/boards_and_kits/kcu105/ug917-kcu105-eval-bd.pdf, Accessed: 2018-08-04.
[57] Inrevium, TB-FMCH-HDMI4K Hardware User Manual Rev. 2.03. [Online]. Available: http://solutions.inrevium.com/products/pdf/TB_FMCH_HDMI4K_HWUserManual_2.03e.pdf, Accessed: 2018-08-04.
[58] Y. Kim, J.-S. Choi, and M. Kim, “2X Super-Resolution Hardware using Edge-Orientation-based Linear Mapping for Real-Time 4K UHD 60 fps Video Applications,” IEEE Trans. Circuits Syst. II Exp. Briefs, to be published. [Online]. Available: http://ieeexplore.ieee.org/document/8274961, Accessed: 2018-08-04.
[59] Sam, SRCNN TensorFlow implementation, GitHub. [Online]. Available: https://github.com/kweisamx/TensorFlow-SRCNN, Accessed: 2018-08-04.
[60] Yifanw90, FSRCNN TensorFlow implementation, GitHub. [Online]. Available: https://github.com/yifanw90/FSRCNN-TensorFlow, Accessed: 2018-08-04.
[61] Sam, VDSR TensorFlow implementation, GitHub. [Online]. Available: https://github.com/kweisamx/TensorFlow-VDSR, Accessed: 2018-08-04.
[62] S. Han et al., “EIE: Efficient Inference Engine on Compressed Deep Neural Network,” 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 243-254.
[63] Y. Wang, H. Li, and X. Li, “Re-architecting the on-chip memory sub-system of machine-learning accelerator for embedded devices,” 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Austin, TX, 2016, pp. 1-6.
[64] S. Han et al., “ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA,” Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, California, USA, 2017, pp. 75-84.

Yongwoo Kim received the B.S. and M.S. degrees from Inha University, Incheon, Republic of Korea, in 2007 and 2009, respectively. From 2009 to 2017, he was a senior engineer at Silicon Works Co., Ltd., Daejeon, Korea. He is currently pursuing the Ph.D. degree at the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea. His current research interests include image/video processing algorithms, super-resolution, and deep learning hardware architectures for low-level vision processing.

Jae-Seok Choi received the B.S. degree in electrical engineering from Hanyang University, Seoul, South Korea, in 2014, and the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea, in 2016. He is currently pursuing the Ph.D. degree in electrical engineering at KAIST. His research interests include super-resolution, deep learning, and image processing.


Munchurl Kim (M’99, SM’13) received the B.E. degree in electronics from Kyungpook National University, Daegu, Korea, in 1989, and the M.E. and Ph.D. degrees in Electrical and Computer Engineering from the University of Florida, Gainesville, in 1992 and 1996, respectively. After his graduation, he joined the Electronics and Telecommunications Research Institute, Daejeon, Korea, as a Senior Research Staff Member, where he led the Realistic Broadcasting Media Research Team. In 2001, he was an Assistant Professor with the School of Engineering, Information and Communications University (ICU), Daejeon. Since 2009, he has been with the School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, where he is now a Full Professor. He has been involved with scalable video coding and high efficiency video coding (HEVC) in the JCT-VC standardization activities of ITU-T VCEG and ISO/IEC MPEG. His current research interests include deep learning for image restoration and visual quality enhancement, deep video compression, perceptual video coding, visual quality assessments, computational photography, machine learning and pattern recognition.
