Abstract—This paper proposes a Progressive Soft Filter Pruning (PSFP) method to prune the filters of deep neural networks, which can thus be accelerated at inference time. Specifically, the
Nevertheless, most of the previous works on filter pruning still suffer from two problems: 1) reduced model capacity and 2) dependence on a pre-trained model. Specifically, as shown in Figure 1, most previous works conduct "hard filter pruning", which directly deletes the pruned filters. The discarded filters reduce the model capacity of the original model and thus inevitably harm the performance. Moreover, to maintain a reasonable performance with respect to the full model, previous works [15]–[17] always fine-tune the hard-pruned model after pruning the filters of a pre-trained model, which has low training efficiency and often requires much more training time than the traditional training schema.

To address the above two problems, we propose a novel Soft Filter Pruning (SFP) approach. SFP dynamically prunes the filters in a soft manner. Particularly, before the first training epoch, the filters with small ℓ2-norm in almost all layers are selected and set to zero. Then the training data is used to update the pruned model. Before the next training epoch, SFP prunes a new set of filters with small ℓ2-norm. This training process continues until convergence. Finally, some filters are selected and pruned without further updating. The SFP algorithm enables the compressed network to have a larger model capacity, and thus to achieve higher accuracy than others.

Intuitively, for a pre-trained network, all the filters of the model carry information about the training set; some of them are important, and others are less important and could therefore be pruned. If we directly prune a large number of filters at an early epoch, some informative filters might not be recovered by retraining even if we use the soft pruning approach. This is because there exist filters with similar ℓ2-norm in the neural network, which makes it difficult to choose the right filters to prune. Therefore, we propose to prune the network progressively, i.e., with a small pruning rate at early epochs and a large pruning rate at later epochs, which we call progressive soft filter pruning (PSFP). In this way, after pruning a small number of filters at the early epochs and retraining the model, the distribution of the ℓ2-norms of the filters changes, and the information of the training set concentrates on some important filters. In the meantime, less informative filters become much easier to recognize. In this situation, if we use a larger pruning rate at a later epoch, it is easy to select the unimportant filters because they have become discriminative, and it is less likely that the wrong filters are chosen.

Contributions. We highlight four contributions:

(1) We propose a soft pruning manner that allows the pruned filters to be updated during the training procedure. This soft manner largely maintains the model capacity and thus achieves superior performance.

(2) Our acceleration approach can train a model from scratch and achieve better performance compared to the state-of-the-art. In this way, the fine-tuning procedure and the overall training time are saved. Moreover, using the pre-trained model can further enhance the performance of our approach and advance the state-of-the-art in model acceleration.

(3) To reduce the possibility of pruning the wrong filters, we propose to prune the network progressively, i.e., with a small pruning rate at early epochs and a large pruning rate at later epochs. In this way, the filters with less information (in other words, smaller norm) become distinct after pruning and retraining at early epochs, and it is less likely that the wrong filters are pruned at later epochs when a large pruning rate is used.

(4) Extensive experiments on three benchmark datasets demonstrate the effectiveness and efficiency of our PSFP. We accelerate ResNet-110 by two times with about 4% relative accuracy improvement on CIFAR-10, and also achieve state-of-the-art results on ILSVRC-2012.

II. RELATED WORKS

Most previous works on accelerating CNNs can be roughly divided into three categories, namely matrix decomposition, low-precision weights, and pruning. In particular, matrix decomposition approximates the tensors of a deep CNN by the product of two low-rank matrices [18]–[22], which saves computational cost. Some works [23], [24] focus on compressing CNNs by using low-precision weights. Pruning-based approaches aim to remove the unnecessary connections of the neural network [14], [15]. Essentially, the work of this paper is based on the idea of pruning; the approaches of matrix decomposition and low-precision weights are orthogonal but potentially useful here – it may still be worth simplifying the weight matrix after pruning filters, which we leave as future work.

A. Matrix Decomposition

To reduce the computation cost of the convolutional layers, past works have proposed to represent the weight matrix of the convolutional network as a low-rank product of two smaller matrices [18]–[21], [25]. The multiplication by one large matrix then becomes two multiplications by smaller matrices. However, the computation cost of the tensor decomposition operation is large, which is not friendly to the training of neural networks. In addition, 1 × 1 convolution kernels are increasingly used in recent networks, such as the bottleneck building block structure of ResNet [8], which is difficult for matrix decomposition.

B. Low Precision

Some other researchers focus on low-precision implementations to compress and accelerate CNN models [23], [24], [26]–[28]. Han et al. [26] proposed to prune the connections of the neural network and then fine-tune the network to recover the accuracy drop, after which the weights are quantized to reduce the number of bits per parameter of a single layer to 5. Zhu et al. [23] propose trained ternary quantization (TTQ), a method that reduces the precision of weights in neural networks to ternary values. In [24], the authors present incremental network quantization (INQ), a method that efficiently converts a pre-trained full-precision CNN model into a low-precision version whose weights are constrained to be either powers of two or zero. In this situation, only low-precision weights are stored and used for inference, so the storage and computation costs are dramatically reduced.
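To make the low-precision idea above concrete, the snippet below shows only the weight constraint that INQ-style quantization enforces: each weight becomes a signed power of two or zero. It is a minimal sketch for illustration, not the INQ algorithm itself; the incremental, group-by-group quantization and retraining of [24] are omitted, and the zero_threshold value is an assumption rather than a setting from that work.

```python
import torch

def quantize_power_of_two(weight: torch.Tensor, zero_threshold: float = 1e-3) -> torch.Tensor:
    """Map each weight to a signed power of two, or to zero if it is tiny.

    Only the weight constraint of INQ-style quantization is shown here;
    the incremental retraining that makes INQ lossless is omitted.
    """
    sign = torch.sign(weight)
    magnitude = weight.abs()
    # Round the exponent so that |w| is replaced by the nearest 2^k (in log scale).
    exponent = torch.round(torch.log2(magnitude.clamp(min=1e-12)))
    quantized = sign * torch.pow(2.0, exponent)
    # Very small weights are mapped to zero instead of a power of two.
    quantized[magnitude < zero_threshold] = 0.0
    return quantized
```

With such a constraint, multiplications by weights can in principle be replaced by shifts, which is the source of the storage and computation savings mentioned above.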
In each training epoch, the full model is optimized and trained on the training data. After each epoch, the ℓ2-norms of all filters are computed for each weighted layer and used as the criterion of our filter selection strategy. Then we prune the selected filters by setting the corresponding filter weights to zero, which is followed by the next training epoch. Finally, the original deep CNN is pruned into a compact and efficient model.

The second row of Figure 2 illustrates the process of SFP. First, some of the filters with small ℓp-norm (marked in blue) are selected and set to zero. During retraining, we allow the pruned filters to be updated. In this way, a pruned filter whose ℓp-norm equals zero might become nonzero again due to retraining, which is represented by the second and fourth filters in SFP-b and SFP-c of Figure 2. After several iterations of pruning and reconstruction that converge the model, a compact model is obtained to accelerate the inference.

D. Progressive Soft Filter Pruning (PSFP)

Since all the filters of the pre-trained model carry information about the training set, if we prune a large number of these informative filters, the information the neural network has obtained from the training set is dramatically lost. Even if we prune the filters in the soft manner, it is still difficult for some important filters to recover by retraining. Therefore, we propose to prune the neural network progressively.

Different from the setting in Algorithm 1, where the pruning rate during all retraining epochs equals the goal pruning rate Pi, we change the pruning rate at each retraining epoch.
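The epoch-wise soft pruning step described above can be sketched in a few lines of PyTorch (the framework used in our experiments [41]). This is a simplified illustration of the idea rather than the released implementation: the function name, the restriction to Conv2d layers, and the single global pruning rate are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

def soft_prune_step(model: nn.Module, pruning_rate: float) -> None:
    """Zero the filters with the smallest L2-norm in every Conv2d layer.

    The zeroed filters stay in the model and keep receiving gradient
    updates in the following epoch, so they may become non-zero again;
    this is the "soft" part of soft filter pruning.
    """
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            weight = module.weight.data            # shape: (N_out, N_in, K, K)
            n_filters = weight.size(0)
            n_pruned = int(n_filters * pruning_rate)
            if n_pruned == 0:
                continue
            # One L2-norm per filter (i.e., per output channel).
            norms = weight.view(n_filters, -1).norm(p=2, dim=1)
            # Indices of the filters with the smallest norms.
            _, idx = torch.topk(norms, n_pruned, largest=False)
            weight[idx] = 0.0                      # soft pruning: zeroed but still trainable

# Typical usage: call once at the end of every training epoch.
# for epoch in range(num_epochs):
#     train_one_epoch(model, loader, optimizer)
#     soft_prune_step(model, pruning_rate=0.3)
```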
A network trained from scratch with PSFP obtains results competitive with those of other methods that start from a well-trained model. By leveraging the pre-trained model, our PSFP obtains much higher performance and advances the state-of-the-art.

4) Obtaining the compact model: PSFP iterates over the filter selection, filter pruning and reconstruction steps. After the model converges, we obtain a sparse model containing many "zero filters". One "zero filter" corresponds to one feature map. The feature maps corresponding to those "zero filters" are always zero during the inference procedure, so removing these filters as well as the corresponding feature maps has no influence on the output. Specifically, for the pruning rate Pi in the i-th layer, only N_{i+1}(1 − P_i) filters are non-zero and have an effect on the final prediction. Considering that the previous layer is also pruned, the number of input channels of the i-th layer changes from N_i to N_i(1 − P_{i−1}). We can thus re-build the i-th layer into a smaller one. Finally, a compact model {W*(i) ∈ R^{N_{i+1}(1−P_i) × N_i(1−P_{i−1}) × K × K}} is obtained.

E. Computation Complexity Analysis

1) Theoretical speedup analysis: Suppose the filter pruning rate of the i-th layer is P_i, which means that N_{i+1} × P_i filters are set to zero and pruned from the layer while the other N_{i+1} × (1 − P_i) filters remain unchanged, and suppose the sizes of the input and output feature maps of the i-th layer are H_i × W_i and H_{i+1} × W_{i+1}. Then, after filter pruning, the dimension of the useful output feature map of the i-th layer decreases from N_{i+1} × H_{i+1} × W_{i+1} to N_{i+1}(1 − P_i) × H_{i+1} × W_{i+1}. Note that the output of the i-th layer is the input of the (i+1)-th layer. If we further prune the (i+1)-th layer with a filter pruning rate P_{i+1}, the computation of the (i+1)-th layer decreases from N_{i+2} × N_{i+1} × k² × H_{i+2} × W_{i+2} to N_{i+2}(1 − P_{i+1}) × N_{i+1}(1 − P_i) × k² × H_{i+2} × W_{i+2}. In other words, a proportion 1 − (1 − P_{i+1}) × (1 − P_i) of the original computation is removed, which makes the neural network inference much faster.

2) Realistic speedup analysis: In the theoretical speedup analysis, other operations such as batch normalization and pooling are negligible compared to the convolution operations. Therefore, we consider the FLOPs of the convolution operations for the computation complexity comparison, which is common in previous work [15], [17]. However, reduced FLOPs cannot bring the same level of realistic speedup because non-tensor layers (e.g., batch normalization and pooling layers) also need inference time on the GPU [17]. In addition, the limitations of IO delay, buffer switching and the efficiency of BLAS libraries also contribute to the wide gap between the theoretical and realistic speedup ratio. We compare the theoretical and realistic speedup in Section IV-D.

IV. EXPERIMENT

In this section, we conduct experiments on several benchmark datasets to validate the effectiveness of our acceleration method.

A. Benchmark Datasets and Experimental Setting

Our method is evaluated on three benchmarks: MNIST [37], CIFAR-10 [38] and ILSVRC-2012 [39]. MNIST has a training set of 60,000 examples and a test set of 10,000 examples. The CIFAR-10 dataset contains 50,000 training images and 10,000 testing images, which are categorized into 10 classes. ILSVRC-2012 is a large-scale dataset containing 1.28 million training images and 50k validation images of 1,000 classes. As discussed in [16], [17], [36], ResNet is less redundant, and thus more difficult to accelerate, than VGGNet [40]. Therefore, we focus on pruning the challenging ResNet models.

In the MNIST experiments, we use the default parameter setting of [37], and use the PyTorch implementation [41] of the network and training schedule. We train the network for 20 epochs to get a more stable result.

In the CIFAR-10 experiments, we use the default parameter setting of [42] and follow the training schedule of [43]. For the CIFAR-10 dataset, we test our SFP on ResNet-20, 32, 56 and 110. We use several different pruning rates, and also analyze the difference between using the pre-trained model and training from scratch. In addition, we test our PSFP on ResNet-56 and 110 with different pruning rates.

On ILSVRC-2012, we follow the same parameter settings as [8], [42]. We use the same data augmentation strategies as the PyTorch implementation [41]. We test our SFP and PSFP on ResNet-18, 34, 50 and 101, and we use the same pruning rate of 30% for all the models. All the convolutional layers of ResNet are pruned with the same pruning rate at the same time. (We do not prune the projection shortcuts for simplicity; they only need negligible time and do not affect the overall cost.)

For progressive pruning, we set D in Equation 3 to 1/8. For example, if we use a goal pruning rate of 30%, the pruning rate increases exponentially as shown in Figure 3.

Fig. 3: Progressively changed pruning rate when the goal pruning rate is 30%. The three blue points are the three pairs used to generate the pruning-rate function, and the solid curve is the exponential pruning-rate function (fitted function: f(x) = 30.000 − 30.000 e^(−0.055x); x-axis: epoch, y-axis: pruning rate (%)).

We conduct our SFP and PSFP operations at the end of every training epoch. For pruning a model from scratch, we use the normal training schedule.
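Since Equation 3 is not reproduced in this excerpt, the sketch below only illustrates the kind of exponential schedule plotted in Figure 3: the pruning rate starts near zero and saturates at the goal rate. The functional form and the interpretation of D as a fraction of the total epochs are assumptions made for illustration, not the paper's exact formula.

```python
import math

def progressive_pruning_rate(epoch: int, total_epochs: int,
                             goal_rate: float = 0.30, D: float = 1 / 8) -> float:
    """Exponentially increasing pruning rate that saturates at ``goal_rate``.

    A plausible schedule matching the shape of Figure 3 (small rate at
    early epochs, goal rate at later epochs); Equation 3 may differ.
    """
    # Time constant: the curve saturates after roughly D * total_epochs epochs.
    tau = max(D * total_epochs, 1.0)
    return goal_rate * (1.0 - math.exp(-epoch / tau))

# Example: with 200 training epochs and a 30% goal rate, the schedule is
# close to 0% at epoch 0 and close to 30% near the end of training.
# rates = [progressive_pruning_rate(e, 200) for e in range(200)]
```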
TABLE I: Comparison of ResNet on CIFAR-10. In the "Fine-tune?" column, "Y" and "N" indicate whether the pre-trained model is used as initialization or not. "Accu. Drop" is the accuracy of the baseline model minus that of the accelerated model, so a negative number means the accelerated model has higher accuracy than the baseline model. A smaller "Accu. Drop" is better. We run every experiment three times to get the mean and standard deviation of the accuracy.
Depth | Method | Fine-tune? | Baseline Accu. (%) | Accelerated Accu. (%) | Accu. Drop (%) | FLOPs | Pruned FLOPs (%)
20 | Dong et al. [36] | N | 91.53 | 91.43 | 0.10 | 3.20E7 | 20.3
20 | SFP(10%) | N | 92.20 ± 0.18 | 92.24 ± 0.33 | -0.04 | 3.44E7 | 15.2
20 | SFP(20%) | N | 92.20 ± 0.18 | 91.20 ± 0.30 | 1.00 | 2.87E7 | 29.3
20 | SFP(30%) | N | 92.20 ± 0.18 | 90.83 ± 0.31 | 1.37 | 2.43E7 | 42.2
32 | Dong et al. [36] | N | 92.33 | 90.74 | 1.59 | 4.70E7 | 31.2
32 | SFP(10%) | N | 92.63 ± 0.70 | 93.22 ± 0.09 | -0.59 | 5.86E7 | 14.9
32 | SFP(20%) | N | 92.63 ± 0.70 | 90.63 ± 0.37 | 0.00 | 4.90E7 | 28.8
32 | SFP(30%) | N | 92.63 ± 0.70 | 90.08 ± 0.08 | 0.55 | 4.03E7 | 41.5
56 | Li et al. [15] | N | 93.04 | 91.31 | 1.75 | 9.09E7 | 27.6
56 | Li et al. [15] | Y | 93.04 | 93.06 | -0.02 | 9.09E7 | 27.6
56 | He et al. [16] | N | 92.80 | 90.90 | 1.90 | - | 50.0
56 | He et al. [16] | Y | 92.80 | 91.80 | 1.00 | - | 50.0
56 | SFP(10%) | N | 93.59 ± 0.58 | 93.89 ± 0.19 | -0.30 | 1.070E8 | 14.7
56 | SFP(20%) | N | 93.59 ± 0.58 | 93.47 ± 0.24 | 0.12 | 8.98E7 | 28.4
56 | SFP(30%) | N | 93.59 ± 0.58 | 93.10 ± 0.20 | 0.49 | 7.40E7 | 41.1
56 | SFP(40%) | N | 93.59 ± 0.58 | 92.26 ± 0.31 | 1.33 | 5.94E7 | 52.6
56 | SFP(30%) | Y | 93.59 ± 0.58 | 93.78 ± 0.22 | -0.19 | 7.40E7 | 41.1
56 | SFP(40%) | Y | 93.59 ± 0.58 | 93.35 ± 0.31 | 0.24 | 5.94E7 | 52.6
56 | PSFP(30%) | N | 93.59 ± 0.58 | 93.05 ± 0.19 | 0.54 | 7.40E7 | 41.1
56 | PSFP(40%) | N | 93.59 ± 0.58 | 92.44 ± 0.07 | 1.15 | 5.94E7 | 52.6
56 | PSFP(30%) | Y | 93.59 ± 0.58 | 93.25 ± 0.23 | 0.34 | 7.40E7 | 41.1
56 | PSFP(40%) | Y | 93.59 ± 0.58 | 93.12 ± 0.20 | 0.47 | 5.94E7 | 52.6
110 | Li et al. [15] | N | 93.53 | 92.94 | 0.61 | 1.55E8 | 38.6
110 | Li et al. [15] | Y | 93.53 | 93.30 | 0.20 | 1.55E8 | 38.6
110 | Dong et al. [36] | N | 93.63 | 93.44 | 0.19 | - | 34.2
110 | SFP(10%) | N | 93.68 ± 0.32 | 93.83 ± 0.19 | -0.15 | 2.16E8 | 14.6
110 | SFP(20%) | N | 93.68 ± 0.32 | 93.93 ± 0.41 | -0.25 | 1.82E8 | 28.2
110 | SFP(30%) | N | 93.68 ± 0.32 | 93.38 ± 0.30 | 0.30 | 1.50E8 | 40.8
110 | SFP(40%) | N | 93.68 ± 0.32 | 92.62 ± 0.60 | 1.04 | 1.21E8 | 52.3
110 | SFP(30%) | Y | 93.68 ± 0.32 | 93.86 ± 0.21 | -0.18 | 1.50E8 | 40.8
110 | SFP(40%) | Y | 93.68 ± 0.32 | 92.90 ± 0.18 | 0.78 | 1.21E8 | 52.3
110 | PSFP(10%) | N | 93.68 ± 0.32 | 93.70 ± 0.55 | -0.02 | 2.16E8 | 14.6
110 | PSFP(20%) | N | 93.68 ± 0.32 | 93.94 ± 0.56 | -0.24 | 1.82E8 | 28.2
110 | PSFP(30%) | N | 93.68 ± 0.32 | 93.29 ± 0.45 | 0.39 | 1.50E8 | 40.8
110 | PSFP(40%) | N | 93.68 ± 0.32 | 93.20 ± 0.10 | 0.48 | 1.21E8 | 52.3
110 | PSFP(10%) | Y | 93.68 ± 0.32 | 93.46 ± 0.12 | -0.22 | 2.16E8 | 14.6
110 | PSFP(20%) | Y | 93.68 ± 0.32 | 93.38 ± 0.06 | -0.30 | 1.82E8 | 28.2
110 | PSFP(30%) | Y | 93.68 ± 0.32 | 93.37 ± 0.12 | 0.31 | 1.50E8 | 40.8
110 | PSFP(40%) | Y | 93.68 ± 0.32 | 93.10 ± 0.06 | 0.58 | 1.21E8 | 52.3
For pruning a pre-trained model, we reduce the learning rate by a factor of 10 compared to the schedule for the scratch model. We run each experiment three times and report the "mean ± std". We compare the performance with other state-of-the-art acceleration algorithms [15]–[17], [36].

B. LeNet-5 on MNIST

As stated in Section III-A, a fully-connected layer can be viewed as a convolutional layer with kernel size k = 1. Suppose the pruning rate for the i-th layer is P_i; filter pruning then changes the size of the fully-connected layer from {W(i) ∈ R^{N_{i+1} × N_i}} to {W(i) ∈ R^{N_{i+1}(1−P_i) × N_i(1−P_{i−1})}}. To test the pruning effect of the algorithm on the fully-connected layer, we prune the first fully-connected layer of LeNet-5. The baseline accuracy on MNIST is 98.63 ± 0.04, and the pruning result is shown in Table III. After pruning 30% of the neurons of the fully-connected layer, the accuracy drop is only 0.07%, which means the pruning algorithm works well on the fully-connected layer. We find that the SFP and PSFP algorithms have similar results on MNIST. Since previous works on pruning on MNIST always prune the 'weights' of the network instead of the 'filters', the comparison results are not listed.

C. ResNet on CIFAR-10

Table I shows the results on CIFAR-10. Our SFP achieves better performance than the other state-of-the-art hard filter pruning methods. For example, [15] uses hard pruning to accelerate ResNet-110 with a 38.6% speedup ratio and a 0.61% accuracy drop without fine-tuning; when the pre-trained model and fine-tuning are used, the accuracy drop becomes 0.20%. In contrast, we can accelerate the inference of ResNet-110 to a 40.8% speed-up with only a 0.30% accuracy drop without fine-tuning. When using the pre-trained model, we can even outperform the original model by 0.18% with more than 40% of the FLOPs reduced.

PSFP achieves better performance than SFP when the pruning rate is large. For example, at a pruning rate of 40% on ResNet-110 without fine-tuning, PSFP reaches 93.20% accuracy versus 92.62% for SFP (Table I).
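The shape change described above (from W(i) ∈ R^{N_{i+1} × N_i} to the smaller matrix) can be realized after convergence by slicing away the zeroed output rows and the columns that correspond to pruned inputs. The sketch below shows this for a fully-connected layer; the helper name and the way surviving outputs are detected (rows that are not entirely zero) are illustrative assumptions, and the analogous operation for a Conv2d layer slices the first two weight dimensions instead.

```python
import torch
import torch.nn as nn

def rebuild_compact_linear(layer: nn.Linear,
                           kept_in_idx: torch.Tensor) -> tuple[nn.Linear, torch.Tensor]:
    """Build a smaller Linear layer by dropping zeroed output units.

    ``kept_in_idx`` lists the input features that survived pruning of the
    previous layer; the function returns the compact layer together with
    the indices of its surviving outputs, to be reused for the next layer.
    """
    weight = layer.weight.data                                   # shape: (N_out, N_in)
    # Outputs whose entire weight row is zero correspond to "zero filters".
    kept_out_idx = torch.nonzero(weight.abs().sum(dim=1) > 0).flatten()
    compact = nn.Linear(kept_in_idx.numel(), kept_out_idx.numel(),
                        bias=layer.bias is not None)
    compact.weight.data = weight[kept_out_idx][:, kept_in_idx].clone()
    if layer.bias is not None:
        compact.bias.data = layer.bias.data[kept_out_idx].clone()
    return compact, kept_out_idx
```

Applying this layer by layer yields the compact model whose weight shapes match the N_{i+1}(1−P_i) × N_i(1−P_{i−1}) dimensions given in Section III.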
TABLE II: Overall performance of pruning ResNet on ImageNet. In the "Fine-tune?" column, "Y" and "N" indicate whether the pre-trained model is used as initialization or not. "Accu. Drop" is the accuracy of the baseline model minus that of the accelerated model, so a negative number means the accelerated model has higher accuracy than the baseline model. A smaller "Accu. Drop" is better. We run some ResNet-18 experiments three times to get the mean and standard deviation of the accuracy; for the other depths of ResNet, we list the one-view accuracy.
Depth | Method | Fine-tune? | Top-1 Baseline (%) | Top-1 Accelerated (%) | Top-5 Baseline (%) | Top-5 Accelerated (%) | Top-1 Drop (%) | Top-5 Drop (%) | Pruned FLOPs (%)
18 | Dong et al. [36] | N | 69.98 | 66.33 | 89.24 | 86.94 | 3.65 | 2.30 | 34.6
18 | SFP(30%) | N | 70.23 ± 0.06 | 67.25 ± 0.13 | 89.51 ± 0.10 | 87.76 ± 0.06 | 2.98 | 1.75 | 41.8
18 | SFP(30%) | Y | 70.23 ± 0.06 | 60.79 | 89.51 ± 0.10 | 83.11 | 9.44 | 6.40 | 41.8
18 | PSFP(30%) | N | 70.23 ± 0.06 | 67.41 | 89.51 ± 0.10 | 87.89 | 2.82 | 1.62 | 41.8
18 | PSFP(30%) | Y | 70.23 ± 0.06 | 68.02 | 89.51 ± 0.10 | 88.19 | 2.21 | 1.32 | 41.8
34 | Li et al. [15] | Y | 73.23 | 72.17 | - | - | 1.06 | - | 24.2
34 | SFP(30%) | N | 73.92 | 71.83 | 91.62 | 90.33 | 2.09 | 1.29 | 41.1
34 | SFP(30%) | Y | 73.92 | 72.29 | 91.62 | 90.90 | 1.63 | 0.72 | 41.1
34 | PSFP(30%) | N | 73.92 | 71.72 | 91.62 | 90.65 | 2.20 | 0.97 | 41.1
34 | PSFP(30%) | Y | 73.92 | 72.53 | 91.62 | 91.04 | 1.39 | 0.58 | 41.1
50 | He et al. [16] | Y | - | - | 92.20 | 90.80 | - | 1.40 | 50.0
50 | Luo et al. [17] | Y | 72.88 | 72.04 | 91.14 | 90.67 | 0.84 | 0.47 | 36.7
50 | SFP(30%) | N | 76.15 | 74.61 | 92.87 | 92.06 | 1.54 | 0.81 | 41.8
50 | SFP(30%) | Y | 76.15 | 62.14 | 92.87 | 84.60 | 14.01 | 8.27 | 41.8
50 | PSFP(30%) | N | 76.15 | 74.88 | 92.87 | 92.39 | 1.27 | 0.48 | 41.8
50 | PSFP(30%) | Y | 76.15 | 75.53 | 92.87 | 92.73 | 0.62 | 0.14 | 41.8
101 | SFP(30%) | N | 77.37 | 77.03 | 93.56 | 93.46 | 0.34 | 0.10 | 42.2
101 | SFP(30%) | Y | 77.37 | 77.51 | 93.56 | 93.71 | -0.14 | -0.15 | 42.2
101 | PSFP(30%) | N | 77.37 | 77.01 | 93.56 | 93.43 | 0.36 | 0.13 | 42.2
101 | PSFP(30%) | Y | 77.37 | 77.28 | 93.56 | 93.69 | 0.09 | -0.13 | 42.2
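To connect the "Pruned FLOPs" column with the theoretical analysis of Section E, the toy calculation below evaluates the per-layer reduction 1 − (1 − P_{i+1})(1 − P_i) for the 30% rate used on ImageNet. A middle convolution whose input and output channels are both pruned loses about 51% of its FLOPs; the whole-network figures in the table (roughly 41–42%) are lower, at least in part because the first layer's input channels, the projection shortcuts and the classifier are not pruned. This is an illustrative sketch, not the script used to produce the tables.

```python
def conv_flops(n_in: int, n_out: int, k: int, h_out: int, w_out: int) -> int:
    """Multiply-accumulate count of a k x k convolution with n_in input
    channels and n_out filters, producing an h_out x w_out feature map."""
    return n_out * n_in * k * k * h_out * w_out

def pruned_fraction(p_prev: float, p_curr: float) -> float:
    """Fraction of FLOPs removed from a layer whose input channels were
    pruned at rate p_prev and whose own filters are pruned at rate p_curr."""
    return 1.0 - (1.0 - p_prev) * (1.0 - p_curr)

if __name__ == "__main__":
    # Toy middle layer: 3x3 convolution, 40 -> 40 channels, 32x32 output map.
    full = conv_flops(40, 40, 3, 32, 32)
    kept = conv_flops(int(40 * 0.7), int(40 * 0.7), 3, 32, 32)
    print(f"per-layer FLOPs reduction: {1 - kept / full:.2%}")   # 51.00%
    print(f"closed-form estimate:      {pruned_fraction(0.3, 0.3):.2%}")  # 51.00%
```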
Fig. 4: The influence of pruning ResNet-50 on ImageNet under different pruning methods: (a) Soft Filter Pruning, (b) Progressive Filter Pruning. The performance gap is calculated by subtracting the performance before pruning from the performance after pruning (x-axis: epoch, y-axis: accuracy (%)).

TABLE V: Accuracy of CIFAR-10 on ResNet-110 under different pruning rates with different filter selection criteria.

Criterion | Pruning rate 10% | Pruning rate 20% | Pruning rate 30%
ℓ1-norm | 93.68 ± 0.60 | 93.68 ± 0.76 | 93.34 ± 0.12
ℓ2-norm | 93.89 ± 0.19 | 93.93 ± 0.41 | 93.38 ± 0.30

In Figure 4, the curves show the accuracy before and after pruning, and the black line is the performance gap due to pruning, calculated by subtracting the accuracy before pruning from the accuracy after pruning. We find that the performance gap of SFP is not stable over the retraining epochs. In contrast, the performance gap of PSFP is unstable only during the first 30 epochs and becomes stable afterwards. This is because SFP directly prunes a large number of filters, which loses a large quantity of the pre-trained information of the network, so the influence of pruning is large and hard to recover by retraining. In contrast, PSFP prunes a small number of filters at early epochs and then increases the number of pruned filters at later epochs; this progressive pruning gives the network time to recover by retraining, thus stabilizing the pruning process.

To test the realistic speedup ratio, we measure the forward time of the pruned models on one GTX 1080 GPU with a batch size of 64. The results are shown in Table IV. The gap between the theoretical and realistic speedup may come from non-tensor layers and the limitations of IO delay, buffer switching and the efficiency of BLAS libraries.

E. Ablation Study

An extensive ablation study is also conducted to further analyze each component of our model.

1) Filter selection criteria: Magnitude-based criteria such as the ℓp-norm are widely used for filter selection because their computational cost is small [15]. We compare the ℓ2-norm and the ℓ1-norm, and the results are shown in Table V. We find that the performance of the ℓ2-norm criterion is slightly better than that of the ℓ1-norm criterion. The result of the ℓ2-norm is dominated by the largest elements, while the result of the ℓ1-norm is also largely affected by the other, small elements. Therefore, filters with some large weights are preserved by the ℓ2-norm criterion. Consequently, the corresponding discriminative features are kept, so the performance of the pruned model is better.

2) Varying pruning rates: We evaluate the accuracy under different pruning rates for ResNet-110, shown in Figure 5(a). As the pruning rate increases, the accuracy of the pruned model first rises above the baseline model and then drops approximately linearly. For pruning rates between 0% and about 23%, the accuracy of the accelerated model is higher than that of the baseline model. This shows that our SFP has a regularization effect on the neural network, because SFP reduces the over-fitting of the model.

3) Sensitivity of SFP interval: By default, we conduct our SFP operation at the end of every training epoch. However, different SFP intervals may lead to different performance, so we explore the sensitivity of the SFP interval. We use ResNet-110 under a pruning rate of 30% as a baseline, and change the SFP interval from one epoch to ten epochs, as shown in Figure 5(b). The model accuracy shows no large fluctuation across the different SFP intervals. Moreover, the model accuracy of most (80%) intervals surpasses the accuracy of the one-epoch interval. Therefore, we could even achieve better performance by fine-tuning this parameter.

4) Selection of pruned layers: Previous works always prune a portion of the layers of the network. Besides, different layers always have different pruning rates. For example, [15] only prunes insensitive layers, [17] skips the last layer of every block of the ResNet, and [16] prunes shallower layers more aggressively and deeper layers less. Similarly, we compare the performance of pruning the first and the second layer of all basic blocks of ResNet-110, with the pruning rate set to 30%. The model with all the first layers of the blocks pruned has an accuracy of 93.96 ± 0.13%, while the model with all the second layers of the blocks pruned has an accuracy of 93.38 ± 0.44%. Therefore, different layers have different sensitivities to SFP. Consequently, careful selection of the pruned layers would potentially lead to further performance improvement, although more hyper-parameters are needed.

5) Sensitivity of the parameter D of the PSFP algorithm: We change the parameter D in Equation 3 to comprehensively understand PSFP, and the results are shown in Figure 5(c). We use ResNet-56 on CIFAR-10 and set the pruning rate to 40%. When changing the parameter D from 7 to 16, we find that the model accuracy shows no large fluctuation across the different values of D, which shows that D has little effect on the final result of pruning.
Fig. 5: Ablation study of SFP and PSFP. (a) Accuracy of ResNet-110 on CIFAR-10 regarding different pruning rates (baseline model vs. accelerated model). (b) Accuracy of ResNet-110 on CIFAR-10 regarding different pruning intervals (one-epoch interval vs. other intervals). (c) Accuracy of ResNet-56 on CIFAR-10 when the pruning rate is 40%, regarding different values of the parameter D in Equation 3 (PSFP from a pre-trained model vs. PSFP from scratch). Solid lines and shaded regions denote the mean and standard deviation of three experiments, respectively.
V. CONCLUSION & FUTURE WORK

In this paper, we propose PSFP, a progressive soft filter pruning approach, to accelerate deep CNNs. During the training procedure, we prune the filters progressively and allow the pruned filters to be updated. The progressive pruning makes the pruning process more stable, and the soft manner maintains the model capacity and thus achieves superior performance. Remarkably, without using a pre-trained model, our PSFP achieves competitive performance compared to the state-of-the-art approaches. Moreover, by leveraging a pre-trained model, our PSFP achieves a better result and advances the state-of-the-art in the field of model acceleration. The results show that our PSFP performs better than pruning without the progressive schedule. Furthermore, PSFP can be combined with other acceleration algorithms, e.g., matrix decomposition and low-precision weights, to further improve performance. We will explore this in our future work.

REFERENCES

[1] J. Li, X. Mei, D. Prokhorov, and D. Tao, "Deep neural network for structural prediction and lane detection in traffic scene," IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 3, pp. 690–703, 2017.
[2] H. Cecotti, M. P. Eckstein, and B. Giesbrecht, "Single-trial classification of event-related potentials in rapid serial visual presentation tasks using supervised spatial filtering," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 11, pp. 2030–2042, 2014.
[3] J. Liu, M. Gong, K. Qin, and P. Zhang, "A deep convolutional coupling network for change detection based on heterogeneous optical and radar images," IEEE Transactions on Neural Networks and Learning Systems, 2016.
[4] N. Liu, J. Han, T. Liu, and X. Li, "Learning to predict eye fixations via multiresolution convolutional neural networks," IEEE Transactions on Neural Networks and Learning Systems, 2016.
[5] T. Chen, L. Lin, L. Liu, X. Luo, and X. Li, "DISC: Deep image saliency computing via progressive representation learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 6, pp. 1135–1149, 2016.
[6] C. Xia, F. Qi, and G. Shi, "Bottom–up visual saliency estimation with deep autoencoder-based sparse reconstruction," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 6, pp. 1227–1240, 2016.
[7] M. Shao, Y. Zhang, and Y. Fu, "Collaborative random faces-guided encoders for pose-invariant face representation learning," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 4, pp. 1019–1032, 2018.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1–9.
[10] J. Wang, C. Xu, X. Yang, and J. M. Zurada, "A novel pruning algorithm for smoothing feedforward neural networks based on group lasso method," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 2012–2024, 2018.
[11] L.-W. Kim, "DeepX: Deep learning accelerator for restricted Boltzmann machine artificial neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1441–1453, 2018.
[12] B. Hassibi and D. G. Stork, "Second order derivatives for network pruning: Optimal brain surgeon," in Advances in Neural Information Processing Systems, 1993, pp. 164–171.
[13] R. Reed, "Pruning algorithms—a survey," IEEE Transactions on Neural Networks, vol. 4, no. 5, pp. 740–747, 1993.
[14] S. Han, J. Pool, J. Tran, and W. Dally, "Learning both weights and connections for efficient neural network," in Proc. NIPS, 2015, pp. 1135–1143.
[15] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," in ICLR, 2017.
[16] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1398–1406.
[17] J.-H. Luo, J. Wu, and W. Lin, "ThiNet: A filter level pruning method for deep neural network compression," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5068–5076.
[18] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Speeding up convolutional neural networks with low rank expansions," in Proc. BMVC, 2014.
[19] X. Zhang, J. Zou, K. He, and J. Sun, "Accelerating very deep convolutional networks for classification and detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
[20] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, "Efficient and accurate approximations of nonlinear convolutional networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 1984–1992.
[21] C. Tai, T. Xiao, Y. Zhang, X. Wang et al., "Convolutional neural networks with low-rank regularization," in ICLR, 2016.
[22] J. Park, S. Li, W. Wen, P. T. P. Tang, H. Li, Y. Chen, and P. Dubey, "Faster CNNs with direct sparse convolutions and guided pruning," in ICLR, 2017.
[23] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," in ICLR, 2017.
[24] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental network quantization: Towards lossless CNNs with low-precision weights," in ICLR, 2017.
[25] J.-T. Chien and Y.-T. Bao, "Tensor-factorized neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 1998–2011, 2018.
[26] S. Han, H. Mao, and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," in ICLR, 2015.
[27] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 4107–4115.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
[29] Y. Guo, A. Yao, and Y. Chen, "Dynamic network surgery for efficient DNNs," in Proc. NIPS, 2016, pp. 1379–1387.
[30] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Proc. NIPS, 2016, pp. 2074–2082.
[31] V. Lebedev and V. Lempitsky, "Fast convnets using group-wise brain damage," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2016, pp. 2554–2564.
[32] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, "Learning efficient convolutional networks through network slimming," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2755–2763.
[33] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, "Pruning convolutional neural networks for resource efficient transfer learning," in ICLR, 2017.
[34] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, "Soft filter pruning for accelerating deep convolutional neural networks," in International Joint Conference on Artificial Intelligence, 2018.
[35] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in ICLR, 2015.
[36] X. Dong, J. Huang, Y. Yang, and S. Yan, "More is less: A more complicated network with less inference complexity," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1895–1903.
[37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[38] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009.
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[40] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in ICLR, 2015.
[41] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Identity mappings in deep residual networks," in Proc. ECCV, 2016, pp. 630–645.
[43] S. Zagoruyko and N. Komodakis, "Wide residual networks," in Proc. BMVC, 2016, pp. 87.1–87.12.

Yanwei Fu received the PhD degree from Queen Mary University of London in 2014, and the MEng degree from the Department of Computer Science & Technology at Nanjing University, China, in 2011. He worked as a post-doctoral researcher at Disney Research, Pittsburgh, from 2015 to 2016. He is currently an Assistant Professor at Fudan University. His research interests are image and video understanding, and lifelong learning.