Abstract—This paper proposes a Progressive Soft Filter Pruning (PSFP) method to prune the filters of deep neural networks, which can thus be accelerated at inference time. Specifically, the
Nevertheless, most of the previous works on filter pruning still suffer from two problems: 1) reduced model capacity and 2) dependence on a pre-trained model. Specifically, as shown in Figure 1, most previous works conduct "hard filter pruning", which directly deletes the pruned filters. The discarded filters reduce the model capacity of the original model and thus inevitably harm the performance. Moreover, to maintain a reasonable performance with respect to the full model, previous works [15]–[17] always fine-tune the hard-pruned model after pruning the filters of a pre-trained model, which has low training efficiency and often requires much more training time than the traditional training schema.

To address the above two problems, we propose a novel Soft Filter Pruning (SFP) approach. SFP dynamically prunes the filters in a soft manner. Particularly, before the first training epoch, the filters with small ℓ2-norm in almost all layers are selected and set to zero. Then the training data is used to update the pruned model. Before the next training epoch, SFP prunes a new set of filters with small ℓ2-norm. This training process continues until convergence. Finally, some filters are selected and pruned without further updating. The SFP algorithm enables the compressed network to have a larger model capacity, and thus to achieve higher accuracy than others.

Intuitively, for a pre-trained network, all the filters of the model carry information about the training set; some of them are important, and others are less important and could therefore be pruned. If we directly prune a large number of filters at an early epoch, some informative filters might not be recovered by retraining even if we use the soft pruning approach. This is because there exist filters with similar ℓ2-norm in the neural network, which makes it difficult to choose the right filters to prune. Therefore, we propose to prune the network progressively, i.e., with a small pruning rate at early epochs and a large pruning rate at later epochs, which we call progressive soft filter pruning (PSFP). In this way, after pruning a small number of filters at the early epochs and retraining the model, the distribution of the ℓ2-norms of the filters changes, and the information of the training set concentrates on some important filters. In the meantime, less informative filters become much easier to recognize. In this situation, if we use a larger pruning rate at a later epoch, it is easy to select the unimportant filters because they have become discriminative, and it is less likely that the wrong filters are chosen.

Contributions. We highlight four contributions:

(1) We propose a soft pruning manner that allows the pruned filters to be updated during the training procedure. This soft manner largely maintains the model capacity and thus achieves superior performance.

(2) Our acceleration approach can train a model from scratch and achieve better performance compared to the state-of-the-art. In this way, the fine-tuning procedure and the overall training time are saved. Moreover, using the pre-trained model can further enhance the performance of our approach and advance the state-of-the-art in model acceleration.

(3) To reduce the possibility of pruning the wrong filters, we propose to prune the network progressively, i.e., with a small pruning rate at early epochs and a large pruning rate at later epochs. In this way, the filters with less information (in other words, smaller norm) become distinct after pruning and retraining at early epochs, and it is less likely that the wrong filters are pruned at later epochs when a large pruning rate is used.

(4) Extensive experiments on three benchmark datasets demonstrate the effectiveness and efficiency of our PSFP. We accelerate ResNet-110 by two times with about 4% relative accuracy improvement on CIFAR-10, and also achieve state-of-the-art results on ILSVRC-2012.

II. RELATED WORKS

Most previous works on accelerating CNNs can be roughly divided into three categories, namely matrix decomposition, low-precision weights, and pruning. In particular, matrix decomposition approximates the tensors of a deep CNN by the product of two low-rank matrices [18]–[22], which saves computational cost. Some works [23], [24] focus on compressing CNNs by using low-precision weights. Pruning-based approaches aim to remove the unnecessary connections of the neural network [14], [15]. Essentially, the work of this paper is based on the idea of pruning; the approaches of matrix decomposition and low-precision weights are orthogonal but potentially useful here – it may still be worth simplifying the weight matrix after pruning filters, which we leave as future work.

A. Matrix Decomposition

To reduce the computation cost of the convolutional layers, past works have proposed to represent the weight matrix of the convolutional network as a low-rank product of two smaller matrices [18]–[21], [25]. The multiplication by one large matrix then becomes two multiplications by smaller matrices. However, the computation cost of the tensor decomposition operation is large, which is not friendly to the training of neural networks. In addition, 1 × 1 convolution kernels are increasingly used in recent networks, such as the bottleneck building block structure of ResNet [8], which is difficult for matrix decomposition.

B. Low Precision

Some other researchers focus on low-precision implementations to compress and accelerate CNN models [23], [24], [26]–[28]. Han et al. [26] proposed to prune the connections of the neural network and then fine-tune the network to recover the accuracy drop, after which the weights are quantized to reduce the number of bits per parameter of a single layer to 5. Zhu et al. [23] propose trained ternary quantization (TTQ), a method that reduces the precision of weights in neural networks to ternary values. In [24], the authors present incremental network quantization (INQ), a method that efficiently converts a pre-trained full-precision CNN model into a low-precision version whose weights are constrained to be either powers of two or zero. In this situation, only low-precision weights are stored and used for inference, so the storage and computation costs are dramatically reduced.
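To make the low-precision idea above concrete, the snippet below shows only the weight constraint that INQ-style quantization enforces: each weight becomes a signed power of two or zero. It is a minimal sketch for illustration, not the INQ algorithm itself; the incremental, group-by-group quantization and retraining of [24] are omitted, and the zero_threshold value is an assumption rather than a setting from that work.

```python
import torch

def quantize_power_of_two(weight: torch.Tensor, zero_threshold: float = 1e-3) -> torch.Tensor:
    """Map each weight to a signed power of two, or to zero if it is tiny.

    Only the weight constraint of INQ-style quantization is shown here;
    the incremental retraining that makes INQ lossless is omitted.
    """
    sign = torch.sign(weight)
    magnitude = weight.abs()
    # Round the exponent so that |w| is replaced by the nearest 2^k (in log scale).
    exponent = torch.round(torch.log2(magnitude.clamp(min=1e-12)))
    quantized = sign * torch.pow(2.0, exponent)
    # Very small weights are mapped to zero instead of a power of two.
    quantized[magnitude < zero_threshold] = 0.0
    return quantized
```

With such a constraint, multiplications by weights can in principle be replaced by shifts, which is the source of the storage and computation savings mentioned above.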
In each training epoch, the full model is optimized and trained on the training data. After each epoch, the ℓ2-norms of all filters are computed for each weighted layer and used as the criterion of our filter selection strategy. Then we prune the selected filters by setting the corresponding filter weights to zero, which is followed by the next training epoch. Finally, the original deep CNN is pruned into a compact and efficient model.

The second row of Figure 2 illustrates the process of SFP. First, some of the filters with small ℓp-norm (marked in blue) are selected and set to zero. During retraining, we allow the pruned filters to be updated. In this way, a pruned filter whose ℓp-norm equals zero might become nonzero again due to retraining, which is represented by the second and fourth filters in SFP-b and SFP-c of Figure 2. After several iterations of pruning and reconstruction that converge the model, a compact model is obtained to accelerate the inference.

D. Progressive Soft Filter Pruning (PSFP)

Since all the filters of the pre-trained model carry information about the training set, if we prune a large number of these informative filters, the information the neural network has obtained from the training set is dramatically lost. Even if we prune the filters in the soft manner, it is still difficult for some important filters to recover by retraining. Therefore, we propose to prune the neural network progressively.

Different from the setting in Algorithm 1, where the pruning rate during all retraining epochs equals the goal pruning rate Pi, we change the pruning rate at each retraining epoch.
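The epoch-wise soft pruning step described above can be sketched in a few lines of PyTorch (the framework used in our experiments [41]). This is a simplified illustration of the idea rather than the released implementation: the function name, the restriction to Conv2d layers, and the single global pruning rate are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

def soft_prune_step(model: nn.Module, pruning_rate: float) -> None:
    """Zero the filters with the smallest L2-norm in every Conv2d layer.

    The zeroed filters stay in the model and keep receiving gradient
    updates in the following epoch, so they may become non-zero again;
    this is the "soft" part of soft filter pruning.
    """
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            weight = module.weight.data            # shape: (N_out, N_in, K, K)
            n_filters = weight.size(0)
            n_pruned = int(n_filters * pruning_rate)
            if n_pruned == 0:
                continue
            # One L2-norm per filter (i.e., per output channel).
            norms = weight.view(n_filters, -1).norm(p=2, dim=1)
            # Indices of the filters with the smallest norms.
            _, idx = torch.topk(norms, n_pruned, largest=False)
            weight[idx] = 0.0                      # soft pruning: zeroed but still trainable

# Typical usage: call once at the end of every training epoch.
# for epoch in range(num_epochs):
#     train_one_epoch(model, loader, optimizer)
#     soft_prune_step(model, pruning_rate=0.3)
```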
A network trained from scratch with PSFP obtains results competitive with those of other methods that start from a well-trained model. By leveraging the pre-trained model, our PSFP obtains much higher performance and advances the state-of-the-art.

4) Obtaining the compact model: PSFP iterates over the filter selection, filter pruning and reconstruction steps. After the model converges, we obtain a sparse model containing many "zero filters". One "zero filter" corresponds to one feature map. The feature maps corresponding to those "zero filters" are always zero during the inference procedure, so removing these filters as well as the corresponding feature maps has no influence on the output. Specifically, for the pruning rate Pi in the i-th layer, only N_{i+1}(1 − P_i) filters are non-zero and have an effect on the final prediction. Considering that the previous layer is also pruned, the number of input channels of the i-th layer changes from N_i to N_i(1 − P_{i−1}). We can thus re-build the i-th layer into a smaller one. Finally, a compact model {W*(i) ∈ R^{N_{i+1}(1−P_i) × N_i(1−P_{i−1}) × K × K}} is obtained.

E. Computation Complexity Analysis

1) Theoretical speedup analysis: Suppose the filter pruning rate of the i-th layer is P_i, which means that N_{i+1} × P_i filters are set to zero and pruned from the layer while the other N_{i+1} × (1 − P_i) filters remain unchanged, and suppose the sizes of the input and output feature maps of the i-th layer are H_i × W_i and H_{i+1} × W_{i+1}. Then, after filter pruning, the dimension of the useful output feature map of the i-th layer decreases from N_{i+1} × H_{i+1} × W_{i+1} to N_{i+1}(1 − P_i) × H_{i+1} × W_{i+1}. Note that the output of the i-th layer is the input of the (i+1)-th layer. If we further prune the (i+1)-th layer with a filter pruning rate P_{i+1}, the computation of the (i+1)-th layer decreases from N_{i+2} × N_{i+1} × k² × H_{i+2} × W_{i+2} to N_{i+2}(1 − P_{i+1}) × N_{i+1}(1 − P_i) × k² × H_{i+2} × W_{i+2}. In other words, a proportion 1 − (1 − P_{i+1}) × (1 − P_i) of the original computation is removed, which makes the neural network inference much faster.

2) Realistic speedup analysis: In the theoretical speedup analysis, other operations such as batch normalization and pooling are negligible compared to the convolution operations. Therefore, we consider the FLOPs of the convolution operations for the computation complexity comparison, which is common in previous work [15], [17]. However, reduced FLOPs cannot bring the same level of realistic speedup because non-tensor layers (e.g., batch normalization and pooling layers) also need inference time on the GPU [17]. In addition, the limitations of IO delay, buffer switching and the efficiency of BLAS libraries also contribute to the wide gap between the theoretical and realistic speedup ratio. We compare the theoretical and realistic speedup in Section IV-D.

IV. EXPERIMENT

In this section, we conduct experiments on several benchmark datasets to validate the effectiveness of our acceleration method.

A. Benchmark Datasets and Experimental Setting

Our method is evaluated on three benchmarks: MNIST [37], CIFAR-10 [38] and ILSVRC-2012 [39]. MNIST has a training set of 60,000 examples and a test set of 10,000 examples. The CIFAR-10 dataset contains 50,000 training images and 10,000 testing images, which are categorized into 10 classes. ILSVRC-2012 is a large-scale dataset containing 1.28 million training images and 50k validation images of 1,000 classes. As discussed in [16], [17], [36], ResNet is less redundant, and thus more difficult to accelerate, than VGGNet [40]. Therefore, we focus on pruning the challenging ResNet models.

In the MNIST experiments, we use the default parameter setting of [37], and use the PyTorch implementation [41] of the network and training schedule. We train the network for 20 epochs to get a more stable result.

In the CIFAR-10 experiments, we use the default parameter setting of [42] and follow the training schedule of [43]. For the CIFAR-10 dataset, we test our SFP on ResNet-20, 32, 56 and 110. We use several different pruning rates, and also analyze the difference between using the pre-trained model and training from scratch. In addition, we test our PSFP on ResNet-56 and 110 with different pruning rates.

On ILSVRC-2012, we follow the same parameter settings as [8], [42]. We use the same data augmentation strategies as the PyTorch implementation [41]. We test our SFP and PSFP on ResNet-18, 34, 50 and 101, and we use the same pruning rate of 30% for all the models. All the convolutional layers of ResNet are pruned with the same pruning rate at the same time. (We do not prune the projection shortcuts for simplicity; they only need negligible time and do not affect the overall cost.)

For progressive pruning, we set D in Equation 3 to 1/8. For example, if we use a goal pruning rate of 30%, the pruning rate increases exponentially as shown in Figure 3.

Fig. 3: Progressively changed pruning rate when the goal pruning rate is 30%. The three blue points are the three pairs used to generate the pruning-rate function, and the solid curve is the exponential pruning-rate function (fitted function: f(x) = 30.000 − 30.000 e^(−0.055x); x-axis: epoch, y-axis: pruning rate (%)).

We conduct our SFP and PSFP operations at the end of every training epoch. For pruning a model from scratch, we use the normal training schedule.
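Since Equation 3 is not reproduced in this excerpt, the sketch below only illustrates the kind of exponential schedule plotted in Figure 3: the pruning rate starts near zero and saturates at the goal rate. The functional form and the interpretation of D as a fraction of the total epochs are assumptions made for illustration, not the paper's exact formula.

```python
import math

def progressive_pruning_rate(epoch: int, total_epochs: int,
                             goal_rate: float = 0.30, D: float = 1 / 8) -> float:
    """Exponentially increasing pruning rate that saturates at ``goal_rate``.

    A plausible schedule matching the shape of Figure 3 (small rate at
    early epochs, goal rate at later epochs); Equation 3 may differ.
    """
    # Time constant: the curve saturates after roughly D * total_epochs epochs.
    tau = max(D * total_epochs, 1.0)
    return goal_rate * (1.0 - math.exp(-epoch / tau))

# Example: with 200 training epochs and a 30% goal rate, the schedule is
# close to 0% at epoch 0 and close to 30% near the end of training.
# rates = [progressive_pruning_rate(e, 200) for e in range(200)]
```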
TABLE I: Comparison of ResNet on CIFAR-10. In the "Fine-tune?" column, "Y" and "N" indicate whether the pre-trained model is used as initialization or not. "Accu. Drop" is the accuracy of the baseline model minus that of the accelerated model, so a negative number means the accelerated model has higher accuracy than the baseline model. A smaller "Accu. Drop" is better. We run every experiment three times to get the mean and standard deviation of the accuracy.
Depth | Method | Fine-tune? | Baseline Accu. (%) | Accelerated Accu. (%) | Accu. Drop (%) | FLOPs | Pruned FLOPs (%)
20 | Dong et al. [36] | N | 91.53 | 91.43 | 0.10 | 3.20E7 | 20.3
20 | SFP(10%) | N | 92.20 ± 0.18 | 92.24 ± 0.33 | -0.04 | 3.44E7 | 15.2
20 | SFP(20%) | N | 92.20 ± 0.18 | 91.20 ± 0.30 | 1.00 | 2.87E7 | 29.3
20 | SFP(30%) | N | 92.20 ± 0.18 | 90.83 ± 0.31 | 1.37 | 2.43E7 | 42.2
32 | Dong et al. [36] | N | 92.33 | 90.74 | 1.59 | 4.70E7 | 31.2
32 | SFP(10%) | N | 92.63 ± 0.70 | 93.22 ± 0.09 | -0.59 | 5.86E7 | 14.9
32 | SFP(20%) | N | 92.63 ± 0.70 | 90.63 ± 0.37 | 0.00 | 4.90E7 | 28.8
32 | SFP(30%) | N | 92.63 ± 0.70 | 90.08 ± 0.08 | 0.55 | 4.03E7 | 41.5
56 | Li et al. [15] | N | 93.04 | 91.31 | 1.75 | 9.09E7 | 27.6
56 | Li et al. [15] | Y | 93.04 | 93.06 | -0.02 | 9.09E7 | 27.6
56 | He et al. [16] | N | 92.80 | 90.90 | 1.90 | - | 50.0
56 | He et al. [16] | Y | 92.80 | 91.80 | 1.00 | - | 50.0
56 | SFP(10%) | N | 93.59 ± 0.58 | 93.89 ± 0.19 | -0.30 | 1.070E8 | 14.7
56 | SFP(20%) | N | 93.59 ± 0.58 | 93.47 ± 0.24 | 0.12 | 8.98E7 | 28.4
56 | SFP(30%) | N | 93.59 ± 0.58 | 93.10 ± 0.20 | 0.49 | 7.40E7 | 41.1
56 | SFP(40%) | N | 93.59 ± 0.58 | 92.26 ± 0.31 | 1.33 | 5.94E7 | 52.6
56 | SFP(30%) | Y | 93.59 ± 0.58 | 93.78 ± 0.22 | -0.19 | 7.40E7 | 41.1
56 | SFP(40%) | Y | 93.59 ± 0.58 | 93.35 ± 0.31 | 0.24 | 5.94E7 | 52.6
56 | PSFP(30%) | N | 93.59 ± 0.58 | 93.05 ± 0.19 | 0.54 | 7.40E7 | 41.1
56 | PSFP(40%) | N | 93.59 ± 0.58 | 92.44 ± 0.07 | 1.15 | 5.94E7 | 52.6
56 | PSFP(30%) | Y | 93.59 ± 0.58 | 93.25 ± 0.23 | 0.34 | 7.40E7 | 41.1
56 | PSFP(40%) | Y | 93.59 ± 0.58 | 93.12 ± 0.20 | 0.47 | 5.94E7 | 52.6
110 | Li et al. [15] | N | 93.53 | 92.94 | 0.61 | 1.55E8 | 38.6
110 | Li et al. [15] | Y | 93.53 | 93.30 | 0.20 | 1.55E8 | 38.6
110 | Dong et al. [36] | N | 93.63 | 93.44 | 0.19 | - | 34.2
110 | SFP(10%) | N | 93.68 ± 0.32 | 93.83 ± 0.19 | -0.15 | 2.16E8 | 14.6
110 | SFP(20%) | N | 93.68 ± 0.32 | 93.93 ± 0.41 | -0.25 | 1.82E8 | 28.2
110 | SFP(30%) | N | 93.68 ± 0.32 | 93.38 ± 0.30 | 0.30 | 1.50E8 | 40.8
110 | SFP(40%) | N | 93.68 ± 0.32 | 92.62 ± 0.60 | 1.04 | 1.21E8 | 52.3
110 | SFP(30%) | Y | 93.68 ± 0.32 | 93.86 ± 0.21 | -0.18 | 1.50E8 | 40.8
110 | SFP(40%) | Y | 93.68 ± 0.32 | 92.90 ± 0.18 | 0.78 | 1.21E8 | 52.3
110 | PSFP(10%) | N | 93.68 ± 0.32 | 93.70 ± 0.55 | -0.02 | 2.16E8 | 14.6
110 | PSFP(20%) | N | 93.68 ± 0.32 | 93.94 ± 0.56 | -0.24 | 1.82E8 | 28.2
110 | PSFP(30%) | N | 93.68 ± 0.32 | 93.29 ± 0.45 | 0.39 | 1.50E8 | 40.8
110 | PSFP(40%) | N | 93.68 ± 0.32 | 93.20 ± 0.10 | 0.48 | 1.21E8 | 52.3
110 | PSFP(10%) | Y | 93.68 ± 0.32 | 93.46 ± 0.12 | -0.22 | 2.16E8 | 14.6
110 | PSFP(20%) | Y | 93.68 ± 0.32 | 93.38 ± 0.06 | -0.30 | 1.82E8 | 28.2
110 | PSFP(30%) | Y | 93.68 ± 0.32 | 93.37 ± 0.12 | 0.31 | 1.50E8 | 40.8
110 | PSFP(40%) | Y | 93.68 ± 0.32 | 93.10 ± 0.06 | 0.58 | 1.21E8 | 52.3
For pruning a pre-trained model, we reduce the learning rate by a factor of 10 compared to the schedule for the scratch model. We run each experiment three times and report the "mean ± std". We compare the performance with other state-of-the-art acceleration algorithms [15]–[17], [36].

B. LeNet-5 on MNIST

As stated in Section III-A, a fully-connected layer can be viewed as a convolutional layer with kernel size k = 1. Suppose the pruning rate for the i-th layer is P_i; filter pruning then changes the size of the fully-connected layer from {W(i) ∈ R^{N_{i+1} × N_i}} to {W(i) ∈ R^{N_{i+1}(1−P_i) × N_i(1−P_{i−1})}}. To test the pruning effect of the algorithm on the fully-connected layer, we prune the first fully-connected layer of LeNet-5. The baseline accuracy on MNIST is 98.63 ± 0.04, and the pruning result is shown in Table III. After pruning 30% of the neurons of the fully-connected layer, the accuracy drop is only 0.07%, which means the pruning algorithm works well on the fully-connected layer. We find that the SFP and PSFP algorithms have similar results on MNIST. Since previous works on pruning on MNIST always prune the 'weights' of the network instead of the 'filters', the comparison results are not listed.

C. ResNet on CIFAR-10

Table I shows the results on CIFAR-10. Our SFP achieves better performance than the other state-of-the-art hard filter pruning methods. For example, [15] uses hard pruning to accelerate ResNet-110 with a 38.6% speedup ratio and a 0.61% accuracy drop without fine-tuning; when the pre-trained model and fine-tuning are used, the accuracy drop becomes 0.20%. In contrast, we can accelerate the inference of ResNet-110 to a 40.8% speed-up with only a 0.30% accuracy drop without fine-tuning. When using the pre-trained model, we can even outperform the original model by 0.18% with more than 40% of the FLOPs reduced.

PSFP achieves better performance than SFP when the pruning rate is large. For example, at a pruning rate of 40% on ResNet-110 without fine-tuning, PSFP reaches 93.20% accuracy versus 92.62% for SFP (Table I).
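The shape change described above (from W(i) ∈ R^{N_{i+1} × N_i} to the smaller matrix) can be realized after convergence by slicing away the zeroed output rows and the columns that correspond to pruned inputs. The sketch below shows this for a fully-connected layer; the helper name and the way surviving outputs are detected (rows that are not entirely zero) are illustrative assumptions, and the analogous operation for a Conv2d layer slices the first two weight dimensions instead.

```python
import torch
import torch.nn as nn

def rebuild_compact_linear(layer: nn.Linear,
                           kept_in_idx: torch.Tensor) -> tuple[nn.Linear, torch.Tensor]:
    """Build a smaller Linear layer by dropping zeroed output units.

    ``kept_in_idx`` lists the input features that survived pruning of the
    previous layer; the function returns the compact layer together with
    the indices of its surviving outputs, to be reused for the next layer.
    """
    weight = layer.weight.data                                   # shape: (N_out, N_in)
    # Outputs whose entire weight row is zero correspond to "zero filters".
    kept_out_idx = torch.nonzero(weight.abs().sum(dim=1) > 0).flatten()
    compact = nn.Linear(kept_in_idx.numel(), kept_out_idx.numel(),
                        bias=layer.bias is not None)
    compact.weight.data = weight[kept_out_idx][:, kept_in_idx].clone()
    if layer.bias is not None:
        compact.bias.data = layer.bias.data[kept_out_idx].clone()
    return compact, kept_out_idx
```

Applying this layer by layer yields the compact model whose weight shapes match the N_{i+1}(1−P_i) × N_i(1−P_{i−1}) dimensions given in Section III.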
TABLE II: Overall performance of pruning ResNet on ImageNet. In the "Fine-tune?" column, "Y" and "N" indicate whether the pre-trained model is used as initialization or not. "Accu. Drop" is the accuracy of the baseline model minus that of the accelerated model, so a negative number means the accelerated model has higher accuracy than the baseline model. A smaller "Accu. Drop" is better. We run some ResNet-18 experiments three times to get the mean and standard deviation of the accuracy; for the other depths of ResNet, we list the one-view accuracy.
Depth | Method | Fine-tune? | Top-1 Baseline (%) | Top-1 Accelerated (%) | Top-5 Baseline (%) | Top-5 Accelerated (%) | Top-1 Drop (%) | Top-5 Drop (%) | Pruned FLOPs (%)
18 | Dong et al. [36] | N | 69.98 | 66.33 | 89.24 | 86.94 | 3.65 | 2.30 | 34.6
18 | SFP(30%) | N | 70.23 ± 0.06 | 67.25 ± 0.13 | 89.51 ± 0.10 | 87.76 ± 0.06 | 2.98 | 1.75 | 41.8
18 | SFP(30%) | Y | 70.23 ± 0.06 | 60.79 | 89.51 ± 0.10 | 83.11 | 9.44 | 6.40 | 41.8
18 | PSFP(30%) | N | 70.23 ± 0.06 | 67.41 | 89.51 ± 0.10 | 87.89 | 2.82 | 1.62 | 41.8
18 | PSFP(30%) | Y | 70.23 ± 0.06 | 68.02 | 89.51 ± 0.10 | 88.19 | 2.21 | 1.32 | 41.8
34 | Li et al. [15] | Y | 73.23 | 72.17 | - | - | 1.06 | - | 24.2
34 | SFP(30%) | N | 73.92 | 71.83 | 91.62 | 90.33 | 2.09 | 1.29 | 41.1
34 | SFP(30%) | Y | 73.92 | 72.29 | 91.62 | 90.90 | 1.63 | 0.72 | 41.1
34 | PSFP(30%) | N | 73.92 | 71.72 | 91.62 | 90.65 | 2.20 | 0.97 | 41.1
34 | PSFP(30%) | Y | 73.92 | 72.53 | 91.62 | 91.04 | 1.39 | 0.58 | 41.1
50 | He et al. [16] | Y | - | - | 92.20 | 90.80 | - | 1.40 | 50.0
50 | Luo et al. [17] | Y | 72.88 | 72.04 | 91.14 | 90.67 | 0.84 | 0.47 | 36.7
50 | SFP(30%) | N | 76.15 | 74.61 | 92.87 | 92.06 | 1.54 | 0.81 | 41.8
50 | SFP(30%) | Y | 76.15 | 62.14 | 92.87 | 84.60 | 14.01 | 8.27 | 41.8
50 | PSFP(30%) | N | 76.15 | 74.88 | 92.87 | 92.39 | 1.27 | 0.48 | 41.8
50 | PSFP(30%) | Y | 76.15 | 75.53 | 92.87 | 92.73 | 0.62 | 0.14 | 41.8
101 | SFP(30%) | N | 77.37 | 77.03 | 93.56 | 93.46 | 0.34 | 0.10 | 42.2
101 | SFP(30%) | Y | 77.37 | 77.51 | 93.56 | 93.71 | -0.14 | -0.15 | 42.2
101 | PSFP(30%) | N | 77.37 | 77.01 | 93.56 | 93.43 | 0.36 | 0.13 | 42.2
101 | PSFP(30%) | Y | 77.37 | 77.28 | 93.56 | 93.69 | 0.09 | -0.13 | 42.2
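To connect the "Pruned FLOPs" column with the theoretical analysis of Section E, the toy calculation below evaluates the per-layer reduction 1 − (1 − P_{i+1})(1 − P_i) for the 30% rate used on ImageNet. A middle convolution whose input and output channels are both pruned loses about 51% of its FLOPs; the whole-network figures in the table (roughly 41–42%) are lower, at least in part because the first layer's input channels, the projection shortcuts and the classifier are not pruned. This is an illustrative sketch, not the script used to produce the tables.

```python
def conv_flops(n_in: int, n_out: int, k: int, h_out: int, w_out: int) -> int:
    """Multiply-accumulate count of a k x k convolution with n_in input
    channels and n_out filters, producing an h_out x w_out feature map."""
    return n_out * n_in * k * k * h_out * w_out

def pruned_fraction(p_prev: float, p_curr: float) -> float:
    """Fraction of FLOPs removed from a layer whose input channels were
    pruned at rate p_prev and whose own filters are pruned at rate p_curr."""
    return 1.0 - (1.0 - p_prev) * (1.0 - p_curr)

if __name__ == "__main__":
    # Toy middle layer: 3x3 convolution, 40 -> 40 channels, 32x32 output map.
    full = conv_flops(40, 40, 3, 32, 32)
    kept = conv_flops(int(40 * 0.7), int(40 * 0.7), 3, 32, 32)
    print(f"per-layer FLOPs reduction: {1 - kept / full:.2%}")   # 51.00%
    print(f"closed-form estimate:      {pruned_fraction(0.3, 0.3):.2%}")  # 51.00%
```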
Fig. 4: The influence of pruning ResNet-50 on ImageNet under different pruning methods: (a) Soft Filter Pruning, (b) Progressive Filter Pruning. The performance gap is calculated by subtracting the performance before pruning from the performance after pruning (x-axis: epoch, y-axis: accuracy (%)).

TABLE V: Accuracy of CIFAR-10 on ResNet-110 under different pruning rates with different filter selection criteria.

Criterion | Pruning rate 10% | Pruning rate 20% | Pruning rate 30%
ℓ1-norm | 93.68 ± 0.60 | 93.68 ± 0.76 | 93.34 ± 0.12
ℓ2-norm | 93.89 ± 0.19 | 93.93 ± 0.41 | 93.38 ± 0.30

In Figure 4, the curves show the accuracy before and after pruning, and the black line is the performance gap due to pruning, calculated by subtracting the accuracy before pruning from the accuracy after pruning. We find that the performance gap of SFP is not stable over the retraining epochs. In contrast, the performance gap of PSFP is unstable only during the first 30 epochs and becomes stable afterwards. This is because SFP directly prunes a large number of filters, which loses a large quantity of the pre-trained information of the network, so the influence of pruning is large and hard to recover by retraining. In contrast, PSFP prunes a small number of filters at early epochs and then increases the number of pruned filters at later epochs; this progressive pruning gives the network time to recover by retraining, thus stabilizing the pruning process.

To test the realistic speedup ratio, we measure the forward time of the pruned models on one GTX 1080 GPU with a batch size of 64. The results are shown in Table IV. The gap between the theoretical and realistic speedup may come from non-tensor layers and the limitations of IO delay, buffer switching and the efficiency of BLAS libraries.

E. Ablation Study

An extensive ablation study is also conducted to further analyze each component of our model.

1) Filter selection criteria: Magnitude-based criteria such as the ℓp-norm are widely used for filter selection because their computational cost is small [15]. We compare the ℓ2-norm and the ℓ1-norm, and the results are shown in Table V. We find that the performance of the ℓ2-norm criterion is slightly better than that of the ℓ1-norm criterion. The result of the ℓ2-norm is dominated by the largest elements, while the result of the ℓ1-norm is also largely affected by the other, small elements. Therefore, filters with some large weights are preserved by the ℓ2-norm criterion. Consequently, the corresponding discriminative features are kept, so the performance of the pruned model is better.

2) Varying pruning rates: We evaluate the accuracy under different pruning rates for ResNet-110, shown in Figure 5(a). As the pruning rate increases, the accuracy of the pruned model first rises above the baseline model and then drops approximately linearly. For pruning rates between 0% and about 23%, the accuracy of the accelerated model is higher than that of the baseline model. This shows that our SFP has a regularization effect on the neural network, because SFP reduces the over-fitting of the model.

3) Sensitivity of SFP interval: By default, we conduct our SFP operation at the end of every training epoch. However, different SFP intervals may lead to different performance, so we explore the sensitivity of the SFP interval. We use ResNet-110 under a pruning rate of 30% as a baseline, and change the SFP interval from one epoch to ten epochs, as shown in Figure 5(b). The model accuracy shows no large fluctuation across the different SFP intervals. Moreover, the model accuracy of most (80%) intervals surpasses the accuracy of the one-epoch interval. Therefore, we could even achieve better performance by fine-tuning this parameter.

4) Selection of pruned layers: Previous works always prune a portion of the layers of the network. Besides, different layers always have different pruning rates. For example, [15] only prunes insensitive layers, [17] skips the last layer of every block of the ResNet, and [16] prunes shallower layers more aggressively and deeper layers less. Similarly, we compare the performance of pruning the first and the second layer of all basic blocks of ResNet-110, with the pruning rate set to 30%. The model with all the first layers of the blocks pruned has an accuracy of 93.96 ± 0.13%, while the model with all the second layers of the blocks pruned has an accuracy of 93.38 ± 0.44%. Therefore, different layers have different sensitivities to SFP. Consequently, careful selection of the pruned layers would potentially lead to further performance improvement, although more hyper-parameters are needed.

5) Sensitivity of the parameter D of the PSFP algorithm: We change the parameter D in Equation 3 to comprehensively understand PSFP, and the results are shown in Figure 5(c). We use ResNet-56 on CIFAR-10 and set the pruning rate to 40%. When changing the parameter D from 7 to 16, we find that the model accuracy shows no large fluctuation across the different values of D, which shows that D has little effect on the final result of pruning.
Fig. 5: Ablation study of SFP and PSFP. (a) Accuracy of ResNet-110 on CIFAR-10 regarding different pruning rates (baseline model vs. accelerated model). (b) Accuracy of ResNet-110 on CIFAR-10 regarding different pruning intervals (one-epoch interval vs. other intervals). (c) Accuracy of ResNet-56 on CIFAR-10 when the pruning rate is 40%, regarding different values of the parameter D in Equation 3 (PSFP from a pre-trained model vs. PSFP from scratch). Solid lines and shaded regions denote the mean and standard deviation of three experiments, respectively.
V. CONCLUSION & FUTURE WORK

In this paper, we propose PSFP, a progressive soft filter pruning approach, to accelerate deep CNNs. During the training procedure, we prune the filters progressively and allow the pruned filters to be updated. The progressive pruning makes the pruning process more stable, and the soft manner maintains the model capacity and thus achieves superior performance. Remarkably, without using a pre-trained model, our PSFP achieves competitive performance compared to the state-of-the-art approaches. Moreover, by leveraging a pre-trained model, our PSFP achieves a better result and advances the state-of-the-art in the field of model acceleration. The results show that our PSFP performs better than pruning without the progressive schedule. Furthermore, PSFP can be combined with other acceleration algorithms, e.g., matrix decomposition and low-precision weights, to further improve performance. We will explore this in our future work.

REFERENCES

[1] J. Li, X. Mei, D. Prokhorov, and D. Tao, "Deep neural network for structural prediction and lane detection in traffic scene," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 690–703, 2017.
[2] H. Cecotti, M. P. Eckstein, and B. Giesbrecht, "Single-trial classification of event-related potentials in rapid serial visual presentation tasks using supervised spatial filtering," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 11, pp. 2030–2042, 2014.
[3] J. Liu, M. Gong, K. Qin, and P. Zhang, "A deep convolutional coupling network for change detection based on heterogeneous optical and radar images," IEEE Transactions on Neural Networks and Learning Systems, 2016.
[4] N. Liu, J. Han, T. Liu, and X. Li, "Learning to predict eye fixations via multiresolution convolutional neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2016.
[5] T. Chen, L. Lin, L. Liu, X. Luo, and X. Li, "DISC: Deep image saliency computing via progressive representation learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 6, pp. 1135–1149, 2016.
[6] C. Xia, F. Qi, and G. Shi, "Bottom–up visual saliency estimation with deep autoencoder-based sparse reconstruction," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 6, pp. 1227–1240, 2016.
[7] M. Shao, Y. Zhang, and Y. Fu, "Collaborative random faces-guided encoders for pose-invariant face representation learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 4, pp. 1019–1032, 2018.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[10] J. Wang, C. Xu, X. Yang, and J. M. Zurada, "A novel pruning algorithm for smoothing feedforward neural networks based on group lasso method," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 2012–2024, 2018.
[11] L.-W. Kim, "DeepX: Deep learning accelerator for restricted Boltzmann machine artificial neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1441–1453, 2018.
[12] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems, 1993, pp. 164–171.
[13] R. Reed, "Pruning algorithms—a survey," IEEE Transactions on Neural Networks, vol. 4, no. 5, pp. 740–747, 1993.
[14] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. NIPS, 2015, pp. 1135–1143.
[15] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," in ICLR, 2017.
[16] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1398–1406.
[17] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5068–5076.
[18] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," in Proc. BMVC, 2014.
[19] X. Zhang, J. Zou, K. He, and J. Sun, "Accelerating very deep convolutional networks for classification and detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
[20] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, "Efficient and accurate approximations of nonlinear convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 1984–1992.
[21] C. Tai, T. Xiao, Y. Zhang, X. Wang et al., "Convolutional neural networks with low-rank regularization," in ICLR, 2016.
[22] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey, "Faster CNNs with direct sparse convolutions and guided pruning," in ICLR, 2017.
[23] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," in ICLR, 2017.
[24] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental network quantization: Towards lossless CNNs with low-precision weights," in ICLR, 2017.
[25] J.-T. Chien and Y.-T. Bao, "Tensor-factorized neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1998–2011, 2018.
[26] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2015.
[27] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 4107–4115.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
[29] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in Proc. NIPS, 2016, pp. 1379–1387.
[30] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Proc. NIPS, 2016, pp. 2074–2082.
[31] V. Lebedev and V. Lempitsky, "Fast convnets using group-wise brain damage," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 2554–2564.
[32] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2755–2763.
[33] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, "Pruning convolutional neural networks for resource efficient transfer learning," in ICLR, 2017.
[34] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft filter pruning for accelerating deep convolutional neural networks," in International Joint Conference on Artificial Intelligence, 2018.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[36] X. Dong, J. Huang, Y. Yang, and S. Yan, "More is less: A more complicated network with less inference complexity," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1895–1903.
[37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[38] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009.
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[40] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in Proc. ECCV, 2016, pp. 630–645.
[43] S. Zagoruyko and N. Komodakis, "Wide residual networks," in Proc. BMVC, 2016, pp. 87.1–87.12.

Yanwei Fu received the PhD degree from Queen Mary University of London in 2014, and the MEng degree from the Department of Computer Science & Technology at Nanjing University, China, in 2011. He worked as a post-doctoral researcher at Disney Research, Pittsburgh, from 2015 to 2016. He is currently an Assistant Professor at Fudan University. His research interests are image and video understanding, and lifelong learning.