Robust Estimator with the SCAD Function in Penalized Linear Regression

Kang-Mo Jung*

*Professor, Department of Statistics and Computer Science, Kunsan National University, Kunsan, Chonbuk, SOUTH KOREA. E-Mail: kmjung@kunsan.ac.kr

The SIJ Transactions on Computer Science Engineering & its Applications (CSEA), Vol. 2, No. 4, June 2014
ISSN: 2321-2381 © 2014 | Published by The Standard International Journals (The SIJ)
Abstract: Penalized regression procedures have recently received much attention because they yield estimation and variable selection simultaneously. Applications are increasing in bioinformatics research, which deals with large numbers of variables. However, the performance of these procedures can be severely influenced by outliers in either the response or the covariate space. This paper proposes a weighted regression estimator with the Smoothly Clipped Absolute Deviation (SCAD) penalty function, which has two advantages: sparsity and unbiasedness for large coefficients. The estimator addresses robust variable selection and robust estimation together. We develop a unified algorithm for the proposed estimator, including the SCAD estimate and the tuning parameter, based on the Local Quadratic Approximation (LQA) and the Local Linear Approximation (LLA) of the non-convex SCAD penalty function. We compare the robustness of the proposed algorithm with other penalized regression estimators. Numerical simulation results show that the proposed estimator is effective for analyzing contaminated data.
Keywords: Linear Regression; Local Linear Approximation; Local Quadratic Approximation; Robust; Penalized Function; Smoothly Clipped Absolute Deviation; Tuning Parameter; Weight.
Abbreviations: Akaike Information Criterion (AIC); Bayesian Information Criterion (BIC); Least Absolute Deviation (LAD); LAD and the L1-type penalty (LAD-L1); LAD and the SCAD penalty (LAD-SCAD); Least Absolute Shrinkage and Selection Operator (LASSO); Local Linear Approximation (LLA); Local Quadratic Approximation (LQA); Least Squares Estimator (LSE); Smoothly Clipped Absolute Deviation (SCAD); the Weighted LAD and the SCAD penalty (WLAD-SCAD).

I. INTRODUCTION
In recent years we can easily obtain large-sample, high-dimensional data sets from cheap sensors. However, the growth in the number of variables can prevent us from constructing a parsimonious model that provides a good interpretation of the system. One important stream of statistical research requires effective variable selection procedures to improve both the accuracy and the interpretability of the learning technique [Kittler, 1986]. Variable selection is an important research topic in linear regression, especially for model selection in high-dimensional data situations [Jung, 2008].
Tibshirani (1996) proposed the least absolute shrinkage and selection operator (LASSO), which can simultaneously select valuable covariates and estimate regression parameters. Traditional model selection criteria such as the Akaike Information Criterion (AIC) [Akaike, 1973] and the Bayesian Information Criterion (BIC) [Schwarz, 1978] have the major drawback that parameter estimation and model selection are two separate processes. The LASSO is a regularisation with an L1-type penalty, and it has become extremely popular because it shrinks the regression coefficients toward zero with the possibility of setting some coefficients exactly to zero, resulting in simultaneous estimation and variable selection. Many studies report successful applications of the LASSO. However, the LASSO can be biased for coefficients whose absolute values are large. Fan & Li (2001) proposed a penalized regression with the Smoothly Clipped Absolute Deviation (SCAD) penalty function and showed that it has better theoretical properties than the LASSO with its L1-type penalty. The penalized regression with the SCAD not only selects important covariates consistently but also produces parameter estimators as efficient as if the true model were known, i.e., the oracle property. The LASSO does not satisfy the oracle property. The SCAD function is a non-convex penalty function designed to make up for the deficiencies of the LASSO.
The penalized regression consists of a loss function and a penalty function. In the traditional regression setting it is well known that the least squares method is sensitive to even a single outlier. An alternative to the least squares method is the Least Absolute Deviation (LAD) method. Wang et al. (2007) developed a robust algorithm with the LAD loss function and the L1-type penalty (LAD-L1). They showed that the LAD-L1 is resistant to non-normal error distributions and outliers. Jung (2007, 2012) proposed a robust method with the LAD loss function and the SCAD penalty (the LAD-SCAD estimate), and showed that the SCAD penalty function is more efficient than the LAD-L1.
Recently statisticians often treat data sets with a non-normal response variable or covariates that may contain multiple outliers or leverage points. Even though the LAD is more robust than the least squares method, the loss function of the LAD is unbounded, so extreme observations can still strongly affect the LAD estimator. In this paper we consider a weighting method that bounds the loss function. The weight in our algorithm attenuates the influence of outliers on the estimator while giving non-outliers the same influence as in the un-weighted LAD-SCAD method. The weighted LAD loss function with the SCAD penalty function (WLAD-SCAD) improves on the error performance of the LAD-SCAD. The proposed method combines the robustness of the weighted LAD and the oracle property of the SCAD penalty. The tuning parameter controls the model complexity and plays an important role in the variable selection procedure. We propose a BIC-type tuning parameter selector using a data-driven method; the WLAD-SCAD with this tuning parameter can identify the most parsimonious correct model.
The paper is organized as follows. Section 2 describes the related previous work. Section 3 provides our proposed algorithm for the weighted LAD with the SCAD penalty function. Since the SCAD function is not convex, we use approximation algorithms to solve the non-differentiable and non-convex objective function in penalized regression with the SCAD penalty. We use approximation methods, the Local Quadratic Approximation (LQA) and the Local Linear Approximation (LLA), to solve the optimization problem for the non-convex SCAD penalty function. We provide two results: a Newton-Raphson type solution and an LAD-criterion formulation. Section 4 illustrates simulation results, which show that the proposed algorithm is superior to other methods from the viewpoint of robustness and model parsimony.
II. RELATED WORKS
Consider the linear regression model
$$y_i = \beta_0 + x_i^T\beta + \epsilon_i, \quad i = 1, \ldots, n,$$
where $x_i = (x_{i1}, \ldots, x_{ip})^T$ is the $p$-dimensional covariate vector, $\beta = (\beta_1, \ldots, \beta_p)^T$, $p$ is the number of covariates and $n$ is the number of observations. Let $y = (y_1, \ldots, y_n)^T$ and let $X$ be an $n \times (p+1)$ matrix whose $i$-th row is $(1, x_i^T)$. The Least Squares Estimator (LSE) $(X^T X)^{-1} X^T y$, which minimizes the sum of squared residuals, can be distorted by a heavy-tailed error distribution or even a single outlier [Rousseeuw & Leroy, 1987]. The criterion can be written as
$$\sum_{i=1}^{n} (y_i - \beta_0 - x_i^T\beta)^2.$$
One alternative to the least squares method is the LAD method, which minimises the criterion function given by the sum of the absolute deviations of the errors,
$$\sum_{i=1}^{n} \bigl|y_i - (\beta_0 + x_i^T\beta)\bigr|.$$
The major advantage of the LAD method lies in its robustness relative to the LSE. The LAD estimates are less affected by the presence of a few outliers or influential observations. However, neither the LSE nor the LAD is useful for model selection, especially when the number of covariates is very large.
Tibshirani (1996) proposed a penalty based on the L1 norm $\sum_{j=1}^{p} |\beta_j|$ for automatically deleting unnecessary covariates. The LASSO criterion is simply penalized least squares with the L1 penalty,
$$\sum_{i=1}^{n} (y_i - \beta_0 - x_i^T\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (1)$$
where $\lambda > 0$ is the tuning parameter which controls the trade-off between model fitting and model sparsity.
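For readers who want to reproduce a criterion of the form (1) numerically, a minimal sketch using the standard R package glmnet is given below. The package, the toy data and the use of cross-validation to pick the tuning parameter are our illustration, not part of the paper; note that glmnet scales the squared-error loss by 1/(2n), so its lambda is on a slightly different scale than in (1).

```r
# Minimal sketch (not from the paper): a LASSO fit of the form (1) with glmnet;
# alpha = 1 requests the L1 penalty.
library(glmnet)

set.seed(1)
n <- 50; p <- 7
x <- matrix(rnorm(n * p), n, p)
y <- 3 * x[, 1] + 2 * x[, 5] + rnorm(n)

cv  <- cv.glmnet(x, y, alpha = 1)     # cross-validation over a lambda path
fit <- glmnet(x, y, alpha = 1)
coef(fit, s = cv$lambda.min)          # sparse coefficients at the selected lambda
```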
When the tuning parameter is large, the criterion focuses on model sparsity. Traditionally in model selection, cross-validation and information criteria, including the AIC [Akaike, 1973] and the BIC [Schwarz, 1978], are widely applied. Shao (1997) showed that the BIC can identify the true model consistently in linear regression with fixed-dimensional covariates, but the AIC may fail due to over-fitting. Yang (2005) showed that cross-validation is asymptotically equivalent to the AIC, and so they behave similarly.
Leng et al. (2006) showed that the LASSO is not asymptotically consistent and so the LASSO can be biased for large coefficients. Fan & Li (2001) addressed this problem and proposed the SCAD penalty function. They described the conditions for a good penalty function: (a) unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large; (b) sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero; (c) continuity: the resulting estimator is continuous in the data.
The LSE with the SCAD penalty function minimizes the criterion function
$$\sum_{i=1}^{n} (y_i - \beta_0 - x_i^T\beta)^2 + \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (2)$$
where, for $\theta \ge 0$,
$$p_\lambda(\theta) = \begin{cases} \lambda\theta, & 0 \le \theta < \lambda, \\ -\dfrac{\theta^2 - 2a\lambda\theta + \lambda^2}{2(a-1)}, & \lambda \le \theta < a\lambda, \\ \dfrac{(a+1)\lambda^2}{2}, & \theta \ge a\lambda, \end{cases}$$
and so its derivative becomes
$$p_\lambda'(\theta) = \begin{cases} \lambda, & 0 \le \theta < \lambda, \\ \dfrac{a\lambda - \theta}{a-1}, & \lambda \le \theta < a\lambda, \\ 0, & \theta \ge a\lambda, \end{cases}$$
where $a$ can be chosen using cross-validation or generalized cross-validation. However, the simulations of Fan & Li (2001) suggest that $a = 3.7$ is approximately optimal, and in this article we set $a = 3.7$.
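As a concrete illustration, a small R sketch of the penalty $p_\lambda$ and its derivative $p_\lambda'$ exactly as defined above is given below; the function names are ours.

```r
# Sketch of the SCAD penalty and its derivative as defined above (a = 3.7).
scad_pen <- function(theta, lambda, a = 3.7) {
  theta <- abs(theta)
  ifelse(theta < lambda, lambda * theta,
         ifelse(theta < a * lambda,
                -(theta^2 - 2 * a * lambda * theta + lambda^2) / (2 * (a - 1)),
                (a + 1) * lambda^2 / 2))
}

scad_deriv <- function(theta, lambda, a = 3.7) {
  theta <- abs(theta)
  ifelse(theta < lambda, lambda,
         ifelse(theta < a * lambda, (a * lambda - theta) / (a - 1), 0))
}

scad_pen(c(0.1, 1, 5), lambda = 0.5)   # small, moderate and large coefficients
```

The derivative vanishes for large arguments, which is what makes large coefficients nearly unbiased under the SCAD penalty.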
Just as the LAD is more robust than the LSE in the non-penalized regression model, Wang et al. (2007) proposed the LAD with the SCAD penalty, which minimises the criterion function
$$\sum_{i=1}^{n} |y_i - \beta_0 - x_i^T\beta| + \sum_{j=1}^{p} p_\lambda(|\beta_j|). \qquad (3)$$
Even though the LAD is robust, its breakdown point is also $1/n$, the same as that of the LSE. Jung (2012) proposed the weighted LAD with the SCAD penalty, because the simulation results of Giloni et al. (2006) show that in non-penalized linear regression the performance of the weighted LAD estimator is competitive with that of high-breakdown regression estimators, particularly in the presence of outliers located at leverage points. Jung (2011) proposed a weighted LAD penalized estimator which combines the weighted LAD estimator with the L1 penalty. Jung (2012) used the criterion function
$$\sum_{i=1}^{n} w_i\,|y_i - \beta_0 - x_i^T\beta| + \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (4)$$
where the weight $w_i$ depends on the covariate space. We call the solution of (4) the weighted LAD with the SCAD (WLAD-SCAD). The proposed method uses the weights to obtain robustness against leverage points or influential observations, because the weights reduce the effects of observations having large deviations or lying at leverage points. The objective function thus combines the robustness of weighting methods with the advantages of the SCAD penalty function; a sketch of one possible covariate-based weight is given below.
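The paper states only that $w_i$ depends on the covariate space. Following the robust-distance idea used in the weighted LAD literature (Giloni et al., 2006; Jung, 2011, 2012), one plausible choice downweights leverage points through robust Mahalanobis distances. The sketch below is a hypothetical construction: the function name, the MCD scatter estimate and the chi-squared cutoff are our assumptions, not the paper's definition.

```r
# Hypothetical sketch of covariate-based weights w_i that downweight leverage
# points; the exact weight used in the paper is not reproduced here.
library(MASS)   # cov.rob: robust multivariate location and scatter

leverage_weights <- function(X) {
  rob <- cov.rob(X, method = "mcd")
  d2  <- mahalanobis(X, center = rob$center, cov = rob$cov)   # robust distances
  cut <- qchisq(0.975, df = ncol(X))                          # assumed cutoff
  pmin(1, cut / d2)   # w_i = 1 for ordinary points, < 1 for leverage points
}

# Example: w <- leverage_weights(X); the w_i then enter criterion (4).
```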
III. METHODS
The solution of (4) could be obtained by a standard optimization program if the criterion function were convex and differentiable. Unfortunately, the absolute value function in (4) is not differentiable at zero and the SCAD penalty function is not convex in $\beta$. Approximating the absolute value function and the SCAD function transforms the objective function into linear equations, so we can efficiently obtain an iterative solution of the WLAD-SCAD estimator.
The absolute value function $|u|$ can be approximated by
$$|u| \approx \frac{u^2}{2|u_0|} + \frac{|u_0|}{2}$$
for a nonzero $u_0$ near $u$. Applying this to the loss term gives
$$|y_i - \beta_0 - x_i^T\beta| \approx \frac{(y_i - \beta_0 - x_i^T\beta)^2}{2\,|y_i - \beta_0^{(0)} - x_i^T\beta^{(0)}|} + \frac{|y_i - \beta_0^{(0)} - x_i^T\beta^{(0)}|}{2}, \qquad (5)$$
for initial values $\beta_0^{(0)}, \beta^{(0)}$ near the minimiser of (3). Also, the Taylor expansion at a non-zero $\beta_j^{(0)}$ yields
$$p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_j^{(0)}|) + \frac{1}{2}\,\frac{p_\lambda'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\,\bigl(\beta_j^2 - \beta_j^{(0)2}\bigr), \qquad (6)$$
and we set $\beta_j = 0$ if $\beta_j^{(0)}$ is near zero. Assume that the log-likelihood function is smooth with respect to $\beta$ and that its first two partial derivatives are continuous. The Local Quadratic Approximation (LQA) is a modification of the Newton-Raphson algorithm [Fan & Li, 2001]. The LQA is broadly useful for solving optimization problems with non-differentiable criterion functions. With (5) and (6), the criterion function (4) becomes
$$\frac{1}{2}\sum_{i=1}^{n} w_i\,\frac{(y_i - \beta_0 - x_i^T\beta)^2}{|y_i - \beta_0^{(0)} - x_i^T\beta^{(0)}|} + \frac{1}{2}\sum_{i=1}^{n} w_i\,|y_i - \beta_0^{(0)} - x_i^T\beta^{(0)}| + \sum_{j=1}^{p}\Bigl[p_\lambda(|\beta_j^{(0)}|) + \frac{1}{2}\,\frac{p_\lambda'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\bigl(\beta_j^2 - \beta_j^{(0)2}\bigr)\Bigr], \qquad (7)$$
and, up to additive constants, we obtain the criterion function
$$\frac{1}{2}\,(y - \tilde{X}\beta^*)^T W (y - \tilde{X}\beta^*) + \frac{1}{2}\,\beta^{*T}\Sigma_\lambda\beta^*, \qquad (8)$$
where $\beta^* = (\beta_0, \beta^T)^T$, $W = \mathrm{diag}\bigl(w_i/|y_i - \beta_0^{(0)} - x_i^T\beta^{(0)}|\bigr)$, $\Sigma_\lambda = \mathrm{diag}\bigl(p_\lambda'(|\beta_j^{(0)}|)/|\beta_j^{(0)}|\bigr)$ and $\tilde{X}$ is the $n \times (p+1)$ data matrix whose first column is the vector of ones of length $n$. Thus the Newton-Raphson solution yields the iterative solution of (8),
$$\beta^{*(k+1)} = \bigl(\tilde{X}^T W^{(k)}\tilde{X} + \Sigma_\lambda^{(k)}\bigr)^{-1}\tilde{X}^T W^{(k)} y, \qquad (9)$$
for the $k$-th solution $\beta^{*(k)}$, the matrix $\Sigma_\lambda^{(k)}$ and the $k$-th weight matrix $W^{(k)}$.
When $W^{(k)} = I_n$, the solution (9) reduces to the LSE-SCAD estimate, which is the solution of the criterion (2). For the $k$-th weight matrix $W^{(k)} = \mathrm{diag}\bigl(1/|y_i - \beta_0^{(k-1)} - x_i^T\beta^{(k-1)}|\bigr)$ and the matrix $\Sigma_\lambda^{(k)} = \mathrm{diag}\bigl(p_\lambda'(|\beta_j^{(k-1)}|)/|\beta_j^{(k-1)}|\bigr)$, the solution of (9) becomes the LAD-SCAD. And for the $k$-th weight matrix $W^{(k)} = \mathrm{diag}\bigl(w_i/|y_i - \beta_0^{(k-1)} - x_i^T\beta^{(k-1)}|\bigr)$ with the same matrix $\Sigma_\lambda^{(k)}$, the solution of (9) is called the WLAD-SCAD. In Section 4 the iteration stops when the maximum difference in $\beta^*$ between the previous and the current solution is less than $10^{-4}$.
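A compact R sketch of iteration (9) is given below. It assumes the weights $w_i$ are supplied (for example from the covariate-based weights sketched in Section II), leaves the intercept unpenalized, uses the LSE as the starting value, and sets a coefficient to zero once it falls below a small threshold; these implementation details are our assumptions. The function scad_deriv is the one sketched after the SCAD definition.

```r
# Sketch of the LQA iteration (9) for the WLAD-SCAD (intercept unpenalized).
wlad_scad_lqa <- function(X, y, w, lambda, a = 3.7, eps = 1e-4, maxit = 100) {
  Xt   <- cbind(1, X)                        # n x (p+1) design with intercept
  beta <- as.vector(qr.solve(Xt, y))         # LSE as the initial value
  for (k in 1:maxit) {
    r  <- as.vector(y - Xt %*% beta)
    W  <- diag(w / pmax(abs(r), eps))        # weight matrix W^(k)
    pj <- scad_deriv(beta[-1], lambda, a) / pmax(abs(beta[-1]), eps)
    S  <- diag(c(0, pj))                     # Sigma_lambda^(k); zero for the intercept
    beta_new <- as.vector(solve(t(Xt) %*% W %*% Xt + S, t(Xt) %*% W %*% y))
    conv <- max(abs(beta_new - beta)) < eps
    beta <- beta_new
    if (conv) break
  }
  beta[-1][abs(beta[-1]) < eps] <- 0         # enforce sparsity
  beta                                       # (beta_0, beta_1, ..., beta_p)
}
```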
We now provide another approach to solve the minimization of the criterion function (4). By a Taylor expansion of $p_\lambda(|\beta_j|)$ we obtain the relationship
$$p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_j^{(0)}|) + p_\lambda'(|\beta_j^{(0)}|)\bigl(|\beta_j| - |\beta_j^{(0)}|\bigr), \quad \text{for } \beta_j \text{ near } \beta_j^{(0)}. \qquad (10)$$
The constant terms in (10) do not affect the minimisation of (3) or (4) and so they can be eliminated. The criterion function (4) can then be written as
$$\sum_{i=1}^{n} w_i\,|y_i - \beta_0 - x_i^T\beta| + \sum_{j=1}^{p} p_\lambda'(|\beta_j^{(0)}|)\,|\beta_j|, \qquad (11)$$
whose regularization part is the linearized SCAD penalty function [Zou & Li, 2008]. This is the Local Linear Approximation (LLA) method for the SCAD penalty function.
Computationally, it is very easy to find the solution of (11). Specifically, we can construct the augmented data set $\{(x_i^*, y_i^*),\ i = 1, \ldots, n+p\}$ as
$$(x_i^*, y_i^*) = (x_i, y_i) \ \text{for } i = 1, \ldots, n, \qquad (x_{n+j}^*, y_{n+j}^*) = \bigl(p_\lambda'(|\beta_j^{(0)}|)\,e_j,\ 0\bigr) \ \text{for } j = 1, \ldots, p,$$
where $e_j$ is the unit vector whose elements are all 0 except the $j$-th element, which is one [Wang et al., 2007]. Then the WLAD-SCAD estimator of (11) can be obtained by minimising the criterion function
$$\sum_{i=1}^{n+p} w_i^*\,\bigl|y_i^* - (\beta_0 + x_i^{*T}\beta)\bigr|, \qquad (12)$$
with $w_i^* = w_i$ for the original observations and $w_i^* = 1$ for the augmented observations.
Consequently we can use a standard LAD program (e.g., the function rq in the R package quantreg) with little computational effort.
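A sketch of one LLA step, solving (11) through the augmented data and quantreg::rq, is shown below. Here beta0 denotes the current estimate of $(\beta_1, \ldots, \beta_p)$; the explicit intercept column (set to zero in the penalty rows) and the unit weights on those rows are our assumptions for making (12) match (11), and scad_deriv is the function sketched in Section II.

```r
# Sketch of one LLA step for (11): weighted median regression on augmented data.
library(quantreg)

wlad_scad_lla_step <- function(X, y, w, beta0, lambda, a = 3.7) {
  n <- nrow(X); p <- ncol(X)
  pen <- scad_deriv(beta0, lambda, a)      # p'_lambda(|beta_j^(0)|), j = 1, ..., p
  Xa  <- rbind(cbind(1, X),                # original rows with intercept column
               cbind(0, diag(pen)))        # p penalty rows: pen_j * e_j, no intercept
  ya  <- c(y, rep(0, p))
  wa  <- c(w, rep(1, p))
  fit <- rq(ya ~ Xa - 1, tau = 0.5, weights = wa)
  unname(coef(fit))                        # (beta_0, beta_1, ..., beta_p)
}
```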
Finding a good tuning parameter is an important step in penalized estimation methods. Fan & Li (2001) chose the values of the tuning parameters by optimizing performance via cross-validation and generalized cross-validation. Zou (2006), in the adaptive LASSO, set the tuning parameters to the reciprocal of the absolute value of the LSE. Wang et al. (2007) chose the tuning parameter by minimising a BIC-type criterion function. In this paper we propose the tuning parameter obtained by minimizing
$$GCV(\lambda) = \frac{\sum_{i=1}^{n}(y_i - \hat\beta_0 - x_i^T\hat\beta)^2 / n}{\bigl(1 - df(\lambda)/n\bigr)^2},$$
where $df(\lambda) = \mathrm{tr}\bigl[X(X^T W X + \Sigma_\lambda)^{-1}X^T W\bigr]$, $X$ is the data matrix excluding the constant term, and $W = \mathrm{diag}(w_i)$. For linear models, the generalized cross-validation is asymptotically equivalent to Mallows' $C_p$, the AIC and leave-one-out cross-validation [Hastie et al., 2001].
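A sketch of this selector over a grid of $\lambda$ values is given below; it reuses wlad_scad_lqa() and scad_deriv() from the earlier sketches and follows the $df(\lambda)$ formula above, which is itself a reconstruction, so treat the details as assumptions rather than the paper's exact procedure.

```r
# Sketch of the generalized cross-validation selector for lambda.
gcv_select <- function(X, y, w, lambdas, a = 3.7, eps = 1e-4) {
  n <- nrow(X)
  scores <- sapply(lambdas, function(lam) {
    beta <- wlad_scad_lqa(X, y, w, lam, a)
    res  <- as.vector(y - beta[1] - X %*% beta[-1])
    W    <- diag(w)
    Slam <- diag(scad_deriv(beta[-1], lam, a) / pmax(abs(beta[-1]), eps))
    dfl  <- sum(diag(X %*% solve(t(X) %*% W %*% X + Slam) %*% t(X) %*% W))
    mean(res^2) / (1 - dfl / n)^2                     # GCV(lambda)
  })
  lambdas[which.min(scores)]
}

# Example: lam_hat <- gcv_select(X, y, w, lambdas = seq(0.05, 1, by = 0.05))
```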
IV. SIMULATION RESULTS
This section presents simulations in various situations to show the robustness of the method proposed in Section III. We numerically compare the proposed WLAD-SCAD estimates with the LASSO estimates [Tibshirani, 1996], the LSE with the SCAD penalty function [Fan & Li, 2001] and the LAD-SCAD estimate [Jung, 2007]. All simulations are performed in R. We consider the linear regression model
$$y_i = x_i^T\beta + \epsilon_i, \quad i = 1, \ldots, n,$$
where $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$ and the errors $\epsilon_i$ are standard normally distributed. The covariate vector follows the multivariate normal distribution with zero mean vector and correlation matrix whose $(i,j)$-th element is $0.5^{|i-j|}$. This setup is similar to Tibshirani (1996) and Fan & Li (2001).
We consider several situations. The sample sizes are $n = 20, 40, 100$. Two different values of the error standard deviation are used, $\sigma = 1, 2$. The error distributions are the standard normal distribution, the double exponential distribution and the t distribution with 2 degrees of freedom; the last two distributions have thick-tailed probability density functions. To show the robustness of our proposed algorithm we also consider data contaminated with about 20% leverage points.
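A sketch of the data-generating mechanism for one replication is given below. The coefficient vector follows the Tibshirani (1996) / Fan & Li (2001) design referred to above, and the leverage-point mechanism (inflating a random 20% of the covariate rows) is an illustrative assumption, since the paper does not spell out the contamination scheme.

```r
# Sketch of one simulated data set under the design described above.
library(MASS)   # mvrnorm

make_data <- function(n, sigma = 1, err = c("normal", "dexp", "t2"), contam = 0) {
  p <- 8
  beta_true <- c(3, 1.5, 0, 0, 2, 0, 0, 0)
  Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))               # correlation 0.5^|i-j|
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  e <- switch(match.arg(err),
              normal = rnorm(n),
              dexp   = rexp(n) * sample(c(-1, 1), n, replace = TRUE),
              t2     = rt(n, df = 2))
  y <- drop(X %*% beta_true) + sigma * e
  if (contam > 0) {                                     # hypothetical leverage points
    idx <- sample(n, ceiling(contam * n))
    X[idx, ] <- 10 * X[idx, ]
  }
  list(X = X, y = y, beta = beta_true)
}

train <- make_data(40, sigma = 1, err = "dexp", contam = 0.2)
```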
The simulation data consist of training data and independent test data. The regression coefficients are estimated on the training data, and the performance is evaluated on test data of sample size 1000 generated from the multivariate normal distribution with the covariance structure defined above. We conducted 200 simulation iterations and evaluated the fit of each algorithm by the Average of the Mean Absolute Deviations (AMAD) on the test data. The sparseness of each algorithm is evaluated by the average number of correctly estimated zero coefficients, reported in the column labelled Correct in the tables. Analogously, the column labelled Incorrect gives the average number of zero estimates whose true coefficients are not zero, which measures the inaccuracy of each algorithm. The number in parentheses is the sample standard deviation for each algorithm. Thus a model is better when its AMAD and Incorrect values are smaller. For the true model the number of zero coefficients is 5, and thus the Correct value should be close to 5 if the algorithm selects variables well.
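The three performance measures just described can be computed per replication as sketched below; AMAD is then the average of the MAD values over the 200 replications. The function and argument names are ours.

```r
# Sketch of the per-replication performance measures described above.
perf_measures <- function(beta_hat, beta_true, X_test, y_test) {
  # beta_hat = (intercept, slopes); beta_true contains the true slopes only
  mad  <- mean(abs(y_test - beta_hat[1] - X_test %*% beta_hat[-1]))
  corr <- sum(beta_hat[-1] == 0 & beta_true == 0)   # true zeros estimated as zero
  inc  <- sum(beta_hat[-1] == 0 & beta_true != 0)   # non-zeros wrongly set to zero
  c(MAD = mad, Correct = corr, Incorrect = inc)
}
```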
Table 1 summarizes the simulation results for the data without outliers. We considered three error models, and report only the results for the double exponential error distribution. It shows that even though the LASSO gives the sparsest model, it has large model errors. As expected, the standard deviation of the AMAD values becomes larger when $\sigma$ becomes larger. When $\sigma = 2$, the Correct value of the WLAD-SCAD is slightly larger than that of the LASSO. This means that the proposed estimator yields a sparse model when the error distribution is not normal.
Table 1: Simulation Results with Double Exponential Errors without Outliers

DE Errors             LASSO           SCAD            LAD-SCAD        WLAD-SCAD
σ = 1    AMAD         0.210 (0.094)   0.202 (0.085)   0.176 (0.077)   0.176 (0.078)
         Correct      4.825 (0.430)   4.565 (0.606)   4.345 (0.761)   4.410 (0.751)
         Incorrect    0.000 (0.000)   0.000 (0.000)   0.000 (0.000)   0.000 (0.000)
σ = 2    AMAD         0.903 (0.354)   0.870 (0.346)   0.787 (0.364)   0.837 (0.393)
         Correct      4.410 (0.846)   4.185 (0.845)   4.420 (0.915)   4.475 (0.838)
         Incorrect    0.520 (0.549)   0.455 (0.538)   0.435 (0.517)   0.500 (0.540)
Table 2 summarizes the simulation results for standard normal errors with 20% leverage points. The Correct column shows that, in terms of model sparsity, the WLAD-SCAD is the best among all the estimators. The difference between the WLAD-SCAD and the other estimators is meaningful, even though the Incorrect value of the proposed method is somewhat large.
Table 2: Simulation Results with the Normal Errors with 20% Outliers

Normal Errors         LASSO           SCAD            LAD-SCAD        WLAD-SCAD
         AMAD         2.917 (0.215)   2.920 (0.188)   3.014 (0.239)   3.293 (0.219)
         Correct      3.260 (1.408)   2.985 (0.842)   3.325 (0.966)   4.470 (0.929)
         Incorrect    1.525 (0.749)   1.365 (0.611)   1.545 (0.707)   2.540 (0.742)

         AMAD         2.978 (0.226)   2.973 (0.198)   3.089 (0.257)   3.293 (0.215)
         Correct      3.320 (1.045)   3.085 (0.837)   3.380 (0.995)   4.510 (0.868)
         Incorrect    1.585 (0.803)   1.385 (0.639)   1.590 (0.724)   2.555 (0.728)
Table 3 summarizes the simulation results for the double exponential error distribution with 20% leverage points. The WLAD-SCAD is the best of the four estimators from the viewpoint of variable selection, because its Correct value is the closest to 5 among all estimators. Table 3 also shows that the WLAD-SCAD is the most efficient estimator in terms of model error regardless of the outliers, the spread of the errors and the sample size, since the method is designed to reduce the influence of leverage points.
Table 3: Simulation Results with the Double Exponential Errors with 20% Outliers

DE Errors             LASSO           SCAD            LAD-SCAD        WLAD-SCAD
         AMAD         2.952 (0.225)   2.955 (0.216)   3.043 (0.278)   3.329 (0.212)
         Correct      3.175 (1.068)   2.950 (0.861)   3.110 (0.971)   4.550 (0.878)
         Incorrect    1.480 (0.770)   1.380 (0.598)   1.500 (0.642)   2.600 (0.730)

         AMAD         2.757 (0.111)   2.779 (0.098)   2.845 (0.129)   3.360 (0.129)
         Correct      2.575 (1.058)   2.620 (0.824)   2.950 (0.884)   4.850 (0.582)
         Incorrect    1.125 (0.657)   1.115 (0.532)   1.305 (0.603)   2.860 (0.511)

         AMAD         3.099 (0.272)   3.078 (0.255)   3.202 (0.314)   3.353 (0.188)
         Correct      3.720 (1.071)   3.340 (0.805)   3.655 (1.010)   4.670 (0.717)
         Incorrect    1.845 (0.821)   1.585 (0.636)   1.835 (0.742)   2.695 (0.628)
V. CONCLUSION
In this paper we proposed a robust algorithm for the penalized regression model with the LAD loss function and the SCAD penalty function. We used a weight function to robustify the loss function and demonstrated the effectiveness of the proposed algorithm through numerical simulations. We derived two approximations of the objective function to treat the non-convex optimization problem: one is the LLA and the other is the LQA. Since the former is linear, it is easy to implement. Both methods are robust to outliers and influential observations. The numerical simulations show that the proposed method is more robust than other methods from the viewpoint of identifying the non-zero coefficients exactly. Thus the proposed method provides a means of variable selection from the thousands of input variables that appear in biometrical applications. In further work we will consider the Huber function as the loss function with the SCAD penalty function.
ACKNOWLEDGEMENTS
This research was supported by Basic Science Research
Program through the National Research Foundation of Korea
(NRF) funded by the Ministry of Education, Science and
Technology (NRF-2012R1A1A4A01004594).
REFERENCES
[1] H. Akaike (1973), "Information Theory and an Extension of the Maximum Likelihood Principle", Proceedings of the Second International Symposium on Information Theory, Editors: B.N. Petrov & F. Csaki, Akademiai Kiado, Budapest.
[2] G. Schwarz (1978), "Estimating the Dimension of a Model", Annals of Statistics, Vol. 6, Pp. 461-464.
[3] J. Kittler (1986), "Feature Selection and Extraction", Handbook of Pattern Recognition and Image Processing, Editors: T.Y. Young & K.-S. Fu, Academic Press, New York.
[4] P.J. Rousseeuw & A.M. Leroy (1987), "Robust Regression and Outlier Detection", John Wiley, New York.
[5] R. Tibshirani (1996), "Regression Shrinkage and Selection via the LASSO", Journal of the Royal Statistical Society, Series B, Vol. 58, Pp. 267-288.
[6] J. Shao (1997), "An Asymptotic Theory for Linear Model Selection", Statistica Sinica, Vol. 7, Pp. 221-264.
[7] T. Hastie, R. Tibshirani & J. Friedman (2001), "The Elements of Statistical Learning: Data Mining, Inference, and Prediction", Springer, New York.
[8] J. Fan & R. Li (2001), "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties", Journal of the American Statistical Association, Vol. 96, Pp. 1348-1360.
[9] Y. Yang (2005), "Can the Strengths of AIC and BIC be Shared? A Conflict between Model Identification and Regression Estimation", Biometrika, Vol. 92, Pp. 937-950.
[10] H. Zou (2006), "The Adaptive Lasso and its Oracle Properties", Journal of the American Statistical Association, Vol. 101, Pp. 1418-1429.
[11] C. Leng, Y. Lin & G. Wahba (2006), "A Note on the LASSO and Related Procedures in Model Selection", Statistica Sinica, Vol. 16, Pp. 1273-1284.
[12] A. Giloni, J.S. Simonoff & B. Sengupta (2006), "Robust Weighted LAD Regression", Computational Statistics and Data Analysis, Vol. 50, Pp. 3124-3140.
[13] H. Wang, G. Li & G. Jiang (2007), "Robust Regression Shrinkage and Consistent Variable Selection through the LAD-Lasso", Journal of Business & Economic Statistics, Vol. 25, Pp. 347-355.
[14] K.-M. Jung (2007), "A Robust Estimator in Ridge Regression", Journal of the Korean Data Analysis Society, Vol. 9, Pp. 535-543.
[15] K.-M. Jung (2008), "Robust Statistical Methods in Variable Selection", Journal of the Korean Data Analysis Society, Vol. 10, Pp. 3057-3066.
[16] H. Zou & R. Li (2008), "One-step Sparse Estimates in Nonconcave Penalized Likelihood Models", Annals of Statistics, Vol. 36, Pp. 1509-1566.
[17] K.-M. Jung (2011), "Weighted Least Absolute Deviation Lasso Estimator", Communications of the Korean Statistical Society, Vol. 18, Pp. 733-739.
[18] K.-M. Jung (2012), "Weighted Least Absolute Deviation Regression Estimator with the SCAD Function", Journal of the Korean Data Analysis Society, Vol. 14, Pp. 2305-2312.
