Robust Estimator with the SCAD Function in Penalized Linear Regression
Kang-Mo Jung*

The SIJ Transactions on Computer Science Engineering & its Applications (CSEA), Vol. 2, No. 4, June 2014
ISSN: 2321-2381 © 2014 | Published by The Standard International Journals (The SIJ)
Abstract: Penalized regression procedures have recently received a lot of attention because they yield estimation and variable selection simultaneously. Applications are increasing in bioinformatics research, which treats large numbers of variables. However, the performance of these procedures can be severely influenced by outliers in either the response or the covariate space. This paper proposes a weighted regression estimator with the Smoothly Clipped Absolute Deviation (SCAD) penalty function, which has two advantages: sparsity and unbiasedness for large coefficients. The estimator addresses robust variable selection and robust estimation jointly. We develop a unified algorithm for the proposed estimator, covering both the SCAD estimate and the tuning parameter, based on the Local Quadratic Approximation (LQA) and the Local Linear Approximation (LLA) of the non-convex SCAD penalty function. We compare the robustness of the proposed algorithm with other penalized regression estimators. Numerical simulation results show that the proposed estimator is effective for analyzing contaminated data.
Keywords: Linear Regression; Local Linear Approximation; Local Quadratic Approximation; Robust; Penalized Function; Smoothly Clipped Absolute Deviation; Tuning Parameter; Weight.
Abbreviations: Akaike Information Criterion (AIC); Bayesian Information Criterion (BIC); Least Absolute Deviation (LAD); LAD with the L1-type penalty (LAD-L1); LAD with the SCAD penalty (LAD-SCAD); Least Absolute Shrinkage and Selection Operator (LASSO); Local Linear Approximation (LLA); Local Quadratic Approximation (LQA); Least Squares Estimator (LSE); Smoothly Clipped Absolute Deviation (SCAD); the Weighted LAD with the SCAD penalty (WLAD-SCAD).
I. INTRODUCTION
In recent days we can easily obtain large-sample, high-dimensional data sets from cheap sensors. However, the growth in the number of variables can prevent us from
constructing a parsimonious model which provides good
interpretation about the system. One important stream of
statistical research requires effective variable selection
procedures to improve both accuracy and interpretability of
the learning technique [Kitter, 1986]. Variable selection is an
important research topic in linear regression especially for
model selection in high-dimensional data situation [Jung,
2008].
Tibshirani (1996) proposed the least absolute shrinkage
and selection operator (LASSO), which can simultaneously
select valuable covariates and estimate regression parameters.
Traditional model selection criteria such as Akaike
Information Criterion (AIC) [Akaike, 1973] and Bayesian
Information Criterion (BIC) [Schwarz, 1978] have the major drawback that parameter estimation and model selection are two separate processes. The LASSO is a regularisation method with an L1-type penalty, and it has become extremely popular because it shrinks the regression coefficients toward zero, with the possibility of setting some coefficients exactly to zero, resulting in simultaneous estimation and variable selection. Many studies report successful applications of the LASSO. However, the LASSO can be biased for coefficients whose absolute values are large. Fan & Li (2001) proposed a penalized regression with the Smoothly Clipped Absolute Deviation (SCAD) penalty function and showed that it has better theoretical properties than the LASSO with its L1-type penalty. The penalized regression with SCAD not only selects important covariates consistently but also produces parameter estimators as efficient as if the true model were known, i.e., the oracle property. The LASSO does not satisfy this oracle property. The SCAD function is a non-convex penalty function designed to remedy the deficiencies of the LASSO.
The penalized regression consists of a loss function and a penalty function. In the traditional regression setting it is well known that the least squares method is sensitive to even a single outlier. An alternative to the least squares method is the Least Absolute Deviation (LAD) method. Wang et al., (2007) developed a robust algorithm with the LAD loss function and the L1-type penalty (LAD-L1). They showed that the LAD-L1 is resistant to non-normal error distributions and outliers. Jung (2007, 2012) proposed a robust method with the LAD loss function and the SCAD penalty (LAD-SCAD) estimate, and showed that the SCAD penalty function is more efficient than the LAD-L1.
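The sensitivity of least squares to a single outlier is visible even in the simplest location model, where the LSE of a constant is the sample mean and the LAD estimate is the median. A minimal numeric sketch (with made-up data, not from the paper):

```python
import numpy as np

# Location model: the LSE of a constant is the mean, the LAD fit is the median.
clean = np.array([1.0, 1.1, 0.9, 1.0, 1.05])
contaminated = np.append(clean, 100.0)   # a single gross outlier

lse_clean, lse_bad = clean.mean(), contaminated.mean()
lad_clean, lad_bad = np.median(clean), np.median(contaminated)
```

A single outlier drags the mean far from the bulk of the data, while the median barely moves, which is the robustness the LAD loss brings to regression.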
*Professor, Department of Statistics and Computer Science, Kunsan National University, Kunsan, Chonbuk, SOUTH KOREA. E-Mail: kmjung@kunsan.ac.kr
Recently, statisticians often treat data sets with a non-normal response variable or covariates that may contain multiple outliers or leverage points. Even though the LAD is more robust than the least squares method, the unbounded loss function of the LAD means that extreme observations still strongly affect the LAD estimator. In this paper we consider a weighting method that bounds the loss function. The weight in our algorithm attenuates the influence of outliers on the estimator, while leaving non-outlying observations with the same influence as in the un-weighted LAD-SCAD method. The weighted LAD loss function with the SCAD penalty function (WLAD-SCAD) improves the error performance of the LAD-SCAD. The proposed method combines the robustness of the weighted LAD and the oracle property of the SCAD penalty. The tuning parameter controls the model complexity and plays an important role in the variable selection procedure. We propose a BIC-type tuning parameter selector using a data-driven method; the WLAD-SCAD with the BIC tuning parameter can identify the most parsimonious correct model.
The paper is organized as follows. Section 2 reviews related work. Section 3 provides our proposed algorithm of the weighted LAD with the SCAD penalty function. Since the SCAD function is not convex, we use an approximation algorithm to solve the non-differentiable and non-convex objective function in penalized regression with the SCAD penalty. We use approximation methods, the Local Quadratic Approximation (LQA) and the Local Linear Approximation (LLA), to solve the optimization problem for the non-convex SCAD penalty function. We provide two results: a Newton-Raphson-type solver and a reformulation as a standard LAD criterion. Section 4 illustrates simulation results, showing that the proposed algorithm is superior to other methods in terms of robustness and model parsimony.
II. RELATED WORKS
Consider the linear regression model
$$ y_i = \beta_0 + \mathbf{x}_i^T \boldsymbol{\beta} + \epsilon_i, \qquad i = 1, \ldots, n, $$
where $\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^T$ is the covariate vector, and let $\mathbf{X}$ be an $n \times (p+1)$ matrix whose $i$-th row is $(1, \mathbf{x}_i^T)$. The Least Squares Estimator (LSE) minimises the sum of squared errors
$$ \sum_{i=1}^{n} \left( y_i - \beta_0 - \mathbf{x}_i^T \boldsymbol{\beta} \right)^2 . $$
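As a concrete illustration of the model and the LSE, the following sketch (with hypothetical simulated data, not from the paper) builds the $n \times (p+1)$ design matrix and solves the least squares problem:

```python
import numpy as np

# Hypothetical toy data: n = 50 observations, p = 3 covariates.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.5])
y = 1.0 + X @ beta_true + 0.1 * rng.normal(size=n)

# Design matrix with an intercept column: the n x (p+1) matrix above.
X1 = np.column_stack([np.ones(n), X])

# LSE: minimise sum_i (y_i - beta0 - x_i^T beta)^2.
beta_lse, *_ = np.linalg.lstsq(X1, y, rcond=None)
```

With small Gaussian noise the fitted coefficients land close to the generating values, but this accuracy degrades badly once outliers contaminate the data, which motivates the robust losses below.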
One alternative to the least squares method is the LAD method, which minimises the criterion function given by the sum of absolute deviations of the errors,
$$ \sum_{i=1}^{n} \left| y_i - (\beta_0 + \mathbf{x}_i^T \boldsymbol{\beta}) \right| . $$
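The LAD criterion can be minimised exactly as a linear program; the sketch below (our illustration, not an algorithm from the paper) splits each residual into positive and negative parts:

```python
import numpy as np
from scipy.optimize import linprog

def lad_fit(X1, y):
    """Least Absolute Deviation fit by linear programming: write each
    residual y_i - x_i^T beta as u_i - v_i with u_i, v_i >= 0 and
    minimise sum_i (u_i + v_i) = sum_i |y_i - x_i^T beta|."""
    n, d = X1.shape
    # Decision variables: [beta (free), u (n), v (n)].
    c = np.concatenate([np.zeros(d), np.ones(2 * n)])
    A_eq = np.hstack([X1, np.eye(n), -np.eye(n)])   # y = X1 beta + u - v
    bounds = [(None, None)] * d + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:d]
```

Fitting a line to data lying exactly on y = 1 + 2x except for one gross outlier recovers the clean line, since the LP optimum passes through the majority of the points.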
The major advantage of the LAD method lies in its
robustness relative to the LSE. The LAD estimates are less
affected by the presence of a few outliers or influential
observations. However, neither the LSE nor the LAD is useful for model selection, especially when the number of covariates is very large.
Tibshirani (1996) proposed a penalty based on the L1 norm $\sum_{j=1}^{p} |\beta_j|$ for automatically deleting unnecessary covariates. The LASSO criterion is simply the penalized least squares with the L1 penalty
$$ \sum_{i=1}^{n} \left( y_i - \beta_0 - \mathbf{x}_i^T \boldsymbol{\beta} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (1) $$
where $\lambda > 0$ is the tuning parameter which controls the trade-off between model fitting and model sparsity.
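To see how the L1 penalty in (1) sets coefficients exactly to zero, recall the standard fact that for an orthonormal design the coordinate-wise minimiser of (1) is the soft-thresholding rule; a minimal sketch:

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO solution for a single coefficient under an orthonormal design:
    shrink the least-squares estimate z toward zero by lam, and set any
    value within lam of zero exactly to zero (sparsity)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```

The uniform shrinkage by `lam`, even for large `z`, is exactly the bias for large coefficients that the SCAD penalty below is designed to remove.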
When the tuning parameter is large, the criterion focuses on model sparsity. Traditionally in model selection, cross-validation and information criteria, including the AIC [Akaike, 1973] and the BIC [Schwarz, 1978], are widely applied. Shao (1997) showed that the BIC can identify the true model consistently in linear regression with fixed-dimensional covariates, but the AIC may fail due to over-fitting. Yang (2005) showed that cross-validation is asymptotically equivalent to the AIC, and so they behave similarly.
Leng et al., (2006) showed that the LASSO is not
asymptotically consistent and so the LASSO can be biased
for large coefficients. Fan & Li (2001) addressed this
problem and proposed the SCAD penalty function. They described three conditions for a good penalty function: (a)
unbiasedness: the resulting estimator is nearly unbiased when
the true unknown parameter is large; (b) sparsity: the
resulting estimator is a thresholding rule, which automatically
sets small estimated coefficients to be zero; (c) continuity: the
resulting estimator is continuous in the data. The LSE with
the SCAD penalty function minimizes the criterion function
$$ \sum_{i=1}^{n} \left( y_i - \beta_0 - \mathbf{x}_i^T \boldsymbol{\beta} \right)^2 + \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (2) $$
where
$$ p_\lambda(\theta) = \begin{cases} \lambda \theta, & 0 < \theta \le \lambda, \\[4pt] -\dfrac{\theta^2 - 2a\lambda\theta + \lambda^2}{2(a-1)}, & \lambda < \theta \le a\lambda, \\[4pt] \dfrac{(a+1)\lambda^2}{2}, & \theta > a\lambda, \end{cases} $$
and so its derivative becomes
$$ p_\lambda'(\theta) = \begin{cases} \lambda, & 0 < \theta \le \lambda, \\[4pt] \dfrac{a\lambda - \theta}{a-1}, & \lambda < \theta \le a\lambda, \\[4pt] 0, & \theta > a\lambda, \end{cases} $$
where $a$ can be chosen using cross-validation or generalized cross-validation. However, the simulation study of Fan & Li (2001) suggests that $a = 3.7$ is approximately optimal. In this article we set $a = 3.7$.
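The SCAD penalty and its derivative can be coded directly from the display above; a sketch with the paper's choice $a = 3.7$ as the default:

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(theta) for theta >= 0: linear near zero,
    quadratic blending, then constant for theta > a*lam."""
    theta = np.asarray(theta, dtype=float)
    return np.where(
        theta <= lam,
        lam * theta,
        np.where(
            theta <= a * lam,
            -(theta**2 - 2 * a * lam * theta + lam**2) / (2 * (a - 1)),
            (a + 1) * lam**2 / 2,
        ),
    )

def scad_deriv(theta, lam, a=3.7):
    """Derivative p'_lambda(theta): equal to lam near zero (sparsity),
    decaying linearly, and exactly zero for theta > a*lam, which gives
    unbiasedness for large coefficients."""
    theta = np.asarray(theta, dtype=float)
    return np.where(theta <= lam, lam, np.maximum(a * lam - theta, 0.0) / (a - 1))
```

Note that the three pieces agree at the knots $\theta = \lambda$ and $\theta = a\lambda$, so the penalty is continuous, and the vanishing derivative beyond $a\lambda$ is what distinguishes SCAD from the L1 penalty.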
Just as the LAD is more robust than the LSE in the non-penalized regression model, Wang et al., (2007) proposed the LAD with the SCAD penalty, which minimises the criterion function
$$ \sum_{i=1}^{n} \left| y_i - \beta_0 - \mathbf{x}_i^T \boldsymbol{\beta} \right| + \sum_{j=1}^{p} p_\lambda(|\beta_j|). \qquad (3) $$
Even though the LAD is robust, its breakdown point is also 1/n, the same as that of the LSE. Jung (2012) proposed the weighted LAD with the SCAD penalty, because the simulation results of Giloni et al., (2006) show that in non-penalized linear regression the performance of the weighted LAD estimator is competitive with that of high-breakdown regression estimators, particularly in the presence of outliers located at leverage points. Jung (2011) proposed a weighted LAD penalized estimator which combines the weighted LAD estimator with the L1 penalty. Jung (2012) used the criterion function
$$ \sum_{i=1}^{n} w_i \left| y_i - \beta_0 - \mathbf{x}_i^T \boldsymbol{\beta} \right| + \sum_{j=1}^{p} p_\lambda(|\beta_j|), \qquad (4) $$
where the weight $w_i$ downweights observations located at leverage points.

Let $r_i(\boldsymbol\beta) = y_i - \beta_0 - \mathbf{x}_i^T\boldsymbol\beta$ and let $r_i^{(0)} = r_i(\boldsymbol\beta^{(0)})$ for initial values $(\beta_0^{(0)}, \boldsymbol\beta^{(0)})$ near the minimiser of (3). For nonzero $r_i^{(0)}$, the quadratic approximation of the absolute loss near $r_i^{(0)}$ gives
$$ |r_i(\boldsymbol\beta)| \approx \frac{r_i(\boldsymbol\beta)^2}{2\,|r_i^{(0)}|} + \frac{|r_i^{(0)}|}{2} . \qquad (5) $$
Also the Taylor expansion at the non-zero $\beta_j^{(0)}$ yields
$$ p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_j^{(0)}|) + \frac{1}{2}\,\frac{p_\lambda'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\left(\beta_j^2 - \beta_j^{(0)\,2}\right), \qquad (6) $$
and we set $\beta_j = 0$ if $\beta_j^{(0)}$ is near zero. Assume that the log-likelihood function is smooth with respect to $\boldsymbol\beta$ and its first two partial derivatives are continuous. The Local Quadratic Approximation (LQA) is a modification of the Newton-Raphson algorithm [Fan & Li, 2001]. The LQA is broadly useful for solving optimization problems with a non-differentiable criterion function. Then the criterion function (4) becomes
$$ \frac{1}{2}\sum_{i=1}^{n}\frac{w_i\,r_i(\boldsymbol\beta)^2}{|r_i^{(0)}|} + \sum_{i=1}^{n}\frac{w_i\,|r_i^{(0)}|}{2} + \frac{1}{2}\sum_{j=1}^{p}\frac{p_\lambda'(|\beta_j^{(0)}|)}{|\beta_j^{(0)}|}\,\beta_j^2 + \left[\sum_{j=1}^{p}p_\lambda(|\beta_j^{(0)}|) - \frac{1}{2}\sum_{j=1}^{p}p_\lambda'(|\beta_j^{(0)}|)\,|\beta_j^{(0)}|\right], \qquad (7) $$
and, up to constants, we obtain the criterion function
$$ \frac{1}{2}\,(\mathbf{y} - \mathbf{X}\boldsymbol\beta)^T\,\mathbf{W}\mathbf{D}\,(\mathbf{y} - \mathbf{X}\boldsymbol\beta) + \frac{1}{2}\,\boldsymbol\beta^T\boldsymbol\Sigma_\lambda\boldsymbol\beta, \qquad (8) $$
where $\mathbf{y} = (y_1, \ldots, y_n)^T$, $\mathbf{W} = \mathrm{diag}(w_i)$, $\mathbf{D} = \mathrm{diag}(1/|r_i^{(0)}|)$ and $\boldsymbol\Sigma_\lambda = \mathrm{diag}\!\left(0,\; p_\lambda'(|\beta_j^{(0)}|)/|\beta_j^{(0)}|\right)$, so that the Newton-Raphson update is
$$ \boldsymbol\beta^{(k+1)} = \left(\mathbf{X}^T\mathbf{W}\mathbf{D}^{(k)}\mathbf{X} + \boldsymbol\Sigma_\lambda^{(k)}\right)^{-1}\mathbf{X}^T\mathbf{W}\mathbf{D}^{(k)}\mathbf{y}, \qquad (9) $$
for the $k$-th solution $\boldsymbol\beta^{(k)}$, with $\mathbf{D}^{(k)}$ and $\boldsymbol\Sigma_\lambda^{(k)}$ evaluated at $\boldsymbol\beta^{(k)}$.
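The LQA update (9) amounts to iteratively reweighted ridge-type least squares. A minimal sketch of one update step (our illustration; the small `eps` guarding the divisions is an implementation detail we assume, and the intercept is left unpenalized):

```python
import numpy as np

def lqa_step(X1, y, w, beta, lam, a=3.7, eps=1e-8):
    """One LQA / Newton-Raphson update: weighted ridge-type least squares
    with observation weights w_i / |r_i| and a SCAD-based diagonal penalty."""
    r = y - X1 @ beta
    d = w / np.maximum(np.abs(r), eps)          # diagonal of W D
    # SCAD derivative p'_lam(|beta_j|); the intercept (column 0) is not penalized.
    th = np.abs(beta)
    deriv = np.where(th <= lam, lam, np.maximum(a * lam - th, 0.0) / (a - 1))
    sigma = deriv / np.maximum(th, eps)
    sigma[0] = 0.0
    A = X1.T @ (d[:, None] * X1) + np.diag(sigma)
    return np.linalg.solve(A, X1.T @ (d * y))
```

Iterating this step from an initial estimate drives small coefficients toward zero while leaving large coefficients essentially unshrunken, mirroring the sparsity and unbiasedness properties of the SCAD penalty.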
Alternatively, for the LLA, a first-order Taylor expansion of the SCAD penalty at the non-zero initial value $\beta_j^{(0)}$ gives
$$ p_\lambda(|\beta_j|) \approx p_\lambda(|\beta_j^{(0)}|) + p_\lambda'(|\beta_j^{(0)}|)\left(|\beta_j| - |\beta_j^{(0)}|\right). \qquad (10) $$
The constant terms in (10) cannot affect the minimisation of (3) or (4) and so they can be eliminated. The criterion function (4) can then be written as
$$ \sum_{i=1}^{n} w_i \left| y_i - \beta_0 - \mathbf{x}_i^T \boldsymbol{\beta} \right| + \sum_{j=1}^{p} p_\lambda'(|\beta_j^{(0)}|)\,|\beta_j|, \qquad (11) $$
whose regularization part is the linearized SCAD penalty function [Zou & Li, 2008]. This is the Local Linear Approximation (LLA) method for the SCAD penalty function. Computationally, it is very easy to find the solution of (11).
Specifically, we can construct the augmented data set $\{(\mathbf{x}_i^*, y_i^*)\}$ as
$$ (\mathbf{x}_i^*, y_i^*) = (w_i\,\mathbf{x}_i,\; w_i\,y_i) \quad \text{for } i = 1, \ldots, n, \qquad (\mathbf{x}_{n+j}^*, y_{n+j}^*) = \left(p_\lambda'(|\beta_j^{(0)}|)\,\mathbf{e}_j,\; 0\right) \quad \text{for } i = n+1, \ldots, n+p, $$
where $\mathbf{e}_j$ denotes the $j$-th unit vector, so that minimising (11) is equivalent to minimising the un-penalized LAD criterion
$$ \sum_{i=1}^{n+p} \left| y_i^* - \mathbf{x}_i^{*T}\boldsymbol\beta \right|. \qquad (12) $$
Consequently we can use a standard LAD program (the function rq in the R package quantreg) without extra computational effort.
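The augmented-data trick can be sketched as follows. Here we solve the resulting plain LAD problem with a linear program rather than the R function rq, and the construction of the augmented rows follows our reading of (12); treat it as an illustrative sketch, not the paper's exact code:

```python
import numpy as np
from scipy.optimize import linprog

def wlad_scad_lla(X1, y, w, beta0, lam, a=3.7):
    """One LLA step for WLAD-SCAD: build the augmented data set of (12)
    and solve the resulting un-penalized LAD problem as a linear program."""
    n, d = X1.shape
    th = np.abs(beta0)
    deriv = np.where(th <= lam, lam, np.maximum(a * lam - th, 0.0) / (a - 1))
    deriv[0] = 0.0                                # intercept is not penalized
    # Augmented rows: (p'_lam(|beta_j^(0)|) e_j, 0) for each coefficient.
    X_aug = np.vstack([w[:, None] * X1, np.diag(deriv)])
    y_aug = np.concatenate([w * y, np.zeros(d)])
    # Plain LAD by LP: residual = u - v with u, v >= 0.
    m = X_aug.shape[0]
    c = np.concatenate([np.zeros(d), np.ones(2 * m)])
    A_eq = np.hstack([X_aug, np.eye(m), -np.eye(m)])
    bounds = [(None, None)] * d + [(0, None)] * (2 * m)
    res = linprog(c, A_eq=A_eq, b_eq=y_aug, bounds=bounds, method="highs")
    return res.x[:d]
```

Because the penalty enters only through p extra pseudo-observations, any off-the-shelf LAD solver handles (11) unchanged, which is the computational appeal of the LLA.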
Finding a good tuning parameter is an important step in penalized estimation methods. Fan & Li (2001) chose the values of tuning parameters by optimizing the performance via cross-validation and generalized cross-validation. Zou (2006), for the LASSO, set the tuning parameters to the reciprocal of the absolute value of the LSE. Wang et al., (2007) chose the tuning parameter by minimising a BIC-type criterion function. In this paper we propose a tuning parameter which can be obtained by minimizing
$$ \mathrm{GCV}(\lambda) = \frac{\sum_{i=1}^{n}\left(y_i - \beta_0 - \mathbf{x}_i^T\boldsymbol\beta\right)^2 / n}{\left(1 - \mathrm{df}(\lambda)/n\right)^2}, $$
where $\mathrm{df}(\lambda) = \mathrm{tr}\left[\mathbf{X}\left(\mathbf{X}^T\mathbf{X} + \boldsymbol\Sigma_\lambda\right)^{-1}\mathbf{X}^T\right]$, $\mathbf{X}$ is the data matrix excluding the constant term, and $\boldsymbol\Sigma_\lambda = \mathrm{diag}\left(p_\lambda'(|\beta_j|)/|\beta_j|\right)$.
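The effective degrees of freedom and the GCV score can be computed directly from these formulas; a sketch, assuming $\Sigma_\lambda$ is evaluated at the current estimate (the `eps` floor for near-zero coefficients is our assumption):

```python
import numpy as np

def gcv_score(X, y, beta0, beta, lam, a=3.7, eps=1e-8):
    """GCV criterion sketch: mean squared residual divided by the squared
    (1 - df/n) factor, with df from the ridge-type effective hat matrix."""
    n = X.shape[0]
    th = np.abs(beta)
    deriv = np.where(th <= lam, lam, np.maximum(a * lam - th, 0.0) / (a - 1))
    sigma = np.diag(deriv / np.maximum(th, eps))
    # df(lam) = tr[X (X^T X + Sigma_lam)^{-1} X^T]
    df = np.trace(X @ np.linalg.solve(X.T @ X + sigma, X.T))
    rss = np.mean((y - beta0 - X @ beta) ** 2)
    return rss / (1.0 - df / n) ** 2
```

As a sanity check, with $\lambda = 0$ the penalty matrix vanishes, df reduces to the number of covariates, and the score reduces to the classical GCV of ordinary least squares.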
For linear models, the generalized cross-validation is asymptotically equivalent to the Mallows $C_p$ criterion.