Data Analysis

ICFAI UNIVERSITY
DEHRADUN
Name: Gopal Krishan

Enrollment No. 09BS0002792
IUD No: 0901202792
Course Code: SL IT 609
Course Name: Business Intelligence
Faculty Name: Prof. Prashobhan Palakkeel
Date: November 02, 2010
Title of the assignment: Multiple regressions of
copper prices on the fundamental variables.
Sign Student Sign Faculty

Multiple regression analysis of copper prices
Fundamentals: Copper prices are determined by many fundamentals, such as the dollar index, copper
consumption, the housing index, industrial production, the world stock of copper, copper ore
production and imports of copper by different countries. For the past few years, China and the USA
have been the largest importers of copper in the world, and import quantities also have an impact
on copper prices. This study determines the impact of different variables on copper prices using
multiple regression analysis.
In this regression analysis, copper price is the dependent variable, and the dollar index (DX), China's
imports of copper, USA imports of copper, the total stock of copper and world consumption of copper
are the independent variables. The variables were selected on the basis of a correlation analysis:
the least correlated variables are taken into the analysis as independent variables.
The process flow for the analysis is as follows:
Data partition: The entire data set is divided into three partitions: 80% training data, 10% testing
data and 10% validation data. After partitioning, the insight analysis of the data is done using
Enterprise Miner. The data is checked against the assumptions of linear regression (normality,
outliers and the need to transform variables) before it is put through the regression analysis.
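The 80/10/10 partition described above can be sketched in Python. This is only an illustration under stated assumptions: the split here is a simple seeded random shuffle, the `partition` function and the 100 stand-in rows are hypothetical, and the exact method used by Enterprise Miner's partition step is not given in this report.

```python
import random

def partition(rows, train=0.8, valid=0.1, seed=42):
    """Shuffle the rows and split them into training, validation and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# Hypothetical data: 100 row indices standing in for the observations.
train_set, valid_set, test_set = partition(range(100))
print(len(train_set), len(valid_set), len(test_set))
```

Whatever remains after the training and validation shares goes to the test set, so no observation is dropped or used twice.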
Dollar Index (DX)
[Figure: box plot, histogram with density curve, and normal Q-Q plot of DX]
The box plot of the dollar index shows that the data is slightly positively skewed. The
Anderson-Darling and Kolmogorov-Smirnov tests reject the null hypothesis that the data is normally
distributed. The Q-Q plot gives a similar impression: the data is skewed. A transformation of the
data is therefore required to make it linear and normally distributed. The skewness is 0.31. Taking
the log of the data is the transformation that brings it closest to normality.
Moments
N           102.0000     Sum Wgts    102.0000
Mean        94.1722      Sum         9605.5615
Std Dev     13.6329      Variance    185.8550
Skewness    0.3119       Kurtosis    -1.0547
USS         923347.937   CSS         18771.3510
CV          14.4765      Std Mean    1.3499
Quantiles
100% Max    119.4700     99.0%    119.1025
75% Q3      105.9950     97.5%    118.5225
50% Med     89.9988      95.0%    116.8100
25% Q1      83.4375      90.0%    115.4350
0% Min      72.1715      10.0%    77.3838
Range       47.2985      5.0%     74.8753
Q3-Q1       22.5575      2.5%     72.6353
Mode        .            1.0%     72.5500
Tests for Normality
Test Statistic        Value       p-value
Shapiro-Wilk          0.945263    0.0004
Kolmogorov-Smirnov    0.129178    <.0100
Cramer-von Mises      0.290728    <.0050
Anderson-Darling      1.720633    <.0050
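Several entries in the DX Moments table follow directly from N, the mean and the standard deviation. The short Python check below assumes the usual definitions (variance as the square of the standard deviation, CV as the standard deviation in percent of the mean, Std Mean as the standard error of the mean); the variable names are illustrative only.

```python
import math

# Values copied from the DX Moments table.
n = 102
mean = 94.1722
std_dev = 13.6329

variance = std_dev ** 2                  # square of the standard deviation
cv = 100 * std_dev / mean                # coefficient of variation, in percent
std_err_mean = std_dev / math.sqrt(n)    # standard error of the mean

print(f"Variance {variance:.4f}  CV {cv:.4f}  Std Mean {std_err_mean:.4f}")
```

Small differences in the last decimal place against the table are expected, because the inputs are themselves rounded to four decimals.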
China Imports (ChinaIm)
[Figure: box plot, histogram with density curve, and normal Q-Q plot of ChinaIm]
From the box plot we can see that the data is positively skewed and that there are outliers in the
data. An outlier filter is therefore necessary before the regression analysis is done. A
transformation of the data is also needed to make it normally distributed. The skewness is 1.71.
Taking the square root of the data is the transformation that brings it closest to normality.
Moments
N           102.0000      Sum Wgts    102.0000
Mean        94454.1863    Sum         9634327.00
Std Dev     70632.4364    Variance    4.989E+09
Skewness    1.7109        Kurtosis    4.2016
USS         1.414E+12     CSS         5.039E+11
CV          74.7796       Std Mean    6993.6533
Quantiles
100% Max    387943.000    99.0%    337230.000
75% Q3      191536.000    97.5%    317947.000
50% Med     86533.0000    95.0%    244013.000
25% Q1      58246.0000    90.0%    148679.000
0% Min      554.0000      10.0%    14204.0000
Range       378389.000    5.0%     1711.0000
Q3-Q1       66690.0000    2.5%     1490.0000
Mode        1711.0000     1.0%     1129.0000
Tests for Normality
Test Statistic        Value       p-value
Shapiro-Wilk          0.852678    0.0000
Cramer-von Mises      0.516829    <.0050
Anderson-Darling      3.495119    <.0050
Total Stock of Copper
The data is slightly positively skewed, and this can be seen in the histogram and the Q-Q plot.
Also, the Kolmogorov-Smirnov test rejects the null hypothesis of normality (skewness 0 and
kurtosis 3).
Transforming the data with the fourth root brought it closest to a normal distribution.
World Consumption (WRCONS)
[Figure: box plot, histogram with density curve, and normal Q-Q plot of WRCONS]
The data is close to a normal distribution but not exactly normally distributed. Taking the log of
the data would increase the normality and bring the data closer to a normal distribution.
In the normality tests, the Cramer-von Mises test does not reject the null hypothesis that the data
is normally distributed (p = 0.0520), but the K-S test rejects it at the 95% confidence level
(p = 0.0177). The K-S result is relied on here because it is a non-parametric test, which makes
the analysis stronger. Hence a transformation is necessary for this variable as well.
Moments
N           102.0000      Sum Wgts    102.0000
Mean        1341071.24    Sum         136789266
Std Dev     137480.610    Variance    1.890E+10
Skewness    0.2806        Kurtosis    -0.7836
USS         1.854E+14     CSS         1.909E+12
CV          10.2516       Std Mean    13612.6088
Quantiles
100% Max    1630529.00    99.0%    1624403.00
75% Q3      1433603.00    97.5%    1609591.00
50% Med     1320721.00    95.0%    1565573.00
25% Q1      1232745.00    90.0%    1546745.00
0% Min      1063148.00    10.0%    1172451.00
Range       567381.000    5.0%     1149296.00
Q3-Q1       200858.000    2.5%     1120606.00
Mode        .             1.0%     1089342.00
Tests for Normality
Test Statistic        Value       p-value
Shapiro-Wilk          0.971119    0.0246
Kolmogorov-Smirnov    0.097676    0.0177
Cramer-von Mises      0.124501    0.0520
Anderson-Darling      0.859306    0.0265
Prices of Copper (Prices)
[Figure: box plot, histogram with density curve, and normal Q-Q plot of Prices]
The price data is positively skewed. Although it does not have a time trend, it is highly skewed.
Taking the log of the data is the transformation that brings it closest to normality in this case.
The Q-Q plot clearly shows that the data is positively skewed and not normally distributed,
although even taking the log does not increase the normality much.
Moments
N           102.0000     Sum Wgts    102.0000
Mean        3664.2462    Sum         373753.110
Std Dev     2413.9230    Variance    5827024.35
Skewness    0.8235       Kurtosis    -0.9189
USS         1.958E+09    CSS         588529459
CV          65.8778      Std Mean    239.0140
Quantiles
100% Max    8492.2500    99.0%    8341.2500
75% Q3      5849.7500    97.5%    8309.0000
50% Med     2714.1250    95.0%    7927.6250
25% Q1      1711.7500    90.0%    7694.2500
0% Min      1430.1250    10.0%    1519.5000
Range       7062.1250    5.0%     1495.2500
Q3-Q1       4138.0000    2.5%     1468.8750
Mode        1762.0000    1.0%     1455.5000
Tests for Normality
Test Statistic        Value       p-value
Shapiro-Wilk          0.799977    0.0000
Cramer-von Mises      1.366637    <.0050
Anderson-Darling      8.129960    <.0050
The above table shows the transformations applied to the different variables used in the analysis.
It can be seen that after transformation the skewness of the data has decreased to a great extent
for most of the variables. After the variables are transformed, the outlier filter is run to
remove the outliers that were seen in some of the variables in the distribution analysis.
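The effect of the log, square root and fourth root transformations on skewness can be illustrated with a small Python sketch. The sample below is hypothetical and is not the copper data, and the `skewness` helper uses the simple moment-based formula, which differs slightly from the sample-adjusted skewness that SAS reports.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 3/2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample, not the copper data.
sample = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]

transforms = {
    "raw": lambda v: v,
    "square root": math.sqrt,
    "fourth root": lambda v: v ** 0.25,
    "log": math.log,
}

# Apply each transformation and compare the resulting skewness.
skews = {name: skewness([f(v) for v in sample]) for name, f in transforms.items()}
for name, value in skews.items():
    print(f"{name:12s} skewness = {value:.3f}")
```

For a positively skewed variable, each of these concave transformations pulls in the long right tail, which is why the transformed skewness is smaller than the raw one in this sample.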
When the regression analysis was run, the dollar index came out as having the most impact on copper
prices, followed by imports of copper by China. Total consumption of copper does not clear the
t-test here and thus cannot be classified as a variable having an impact on copper prices. Also,
fundamentally, Chinese imports and the dollar index already discount the impact of total
consumption, as the USA and China are the largest users of copper.
Time also stands out as a variable having some impact on copper prices. The total stock of copper
has the least impact on prices. Overall, the regression model explains close to 90% of the
variation in copper prices.
The SAS System 23:51 Friday, October 31, 2008 7

The DMREG Procedure
Model Information
Training Data Set _EMSPDE.SP_DGM00001.DATA

DMDB Catalog EMPROJ.SP_DGM00001
Target Variable PRIC_9VG (Prices: Maximize normality)
Target Measurement Level Interval
Error Normal
Link Function Identity
Number of Model Parameters 5
Number of Observations 102
Analysis of Variance
                         Sum of
Source             DF    Squares      Mean Square    F Value    Pr > F
Model              4     35.966193    8.991548       218.47     <.0001
Error              97    3.992232     0.041157
Corrected Total    101   39.958425
Model Fit Statistics
R-Square    0.9001       Adj R-Sq    0.8960
AIC         -320.5435    BIC         -318.0333
SBC         -307.4186    C(p)        5.0000
Analysis of Maximum Likelihood Estimates
                               Standard
Parameter    DF    Estimate    Error       t Value    Pr > |t|
Intercept    1     -12.2845    6.0880      -2.02      0.0464
CHIN_SE3     1     -0.00175    0.000234    -7.48      <.0001
DX_3RMO7     1     -1.9050     0.2611      -7.30      <.0001
WRCO_904     1     2.0530      0.4243      4.84       <.0001
Time         1     0.00744     0.00143     5.21       <.0001
To remove the variable world consumption from the list of independent variables, the regression
analysis was run again with 4 variables, and this gave the final model. The F-test gives a value of
218.47 with a p-value below 0.0001, which rejects the null hypothesis that the model is not a good
fit for the analysis. Hence we accept the alternative hypothesis that the model is a good fit.
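The F value and the R-square figures can be reproduced from the sums of squares in the Analysis of Variance table. The Python check below assumes the standard definitions (F as the ratio of the model and error mean squares, adjusted R-square from the error and total mean squares); the variable names are illustrative.

```python
# Values copied from the Analysis of Variance table of the DMREG output.
ss_model, df_model = 35.966193, 4
ss_error, df_error = 3.992232, 97
ss_total, df_total = 39.958425, 101

ms_model = ss_model / df_model          # mean square of the model
ms_error = ss_error / df_error          # mean square of the error
f_value = ms_model / ms_error           # F statistic
r_square = ss_model / ss_total          # share of variation explained
adj_r_square = 1 - ms_error / (ss_total / df_total)

print(f"F = {f_value:.2f}, R-Square = {r_square:.4f}, Adj R-Sq = {adj_r_square:.4f}")
```

The adjusted R-square is slightly lower than the R-square because it penalises the model for the four estimated slope parameters.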
The above table gives the results for the different partitions of the data. The results are quite
close for average standard error and maximum absolute error. It is thus concluded that China's
imports and the dollar index are the top two variables having an impact on the price of copper.
These variables are followed by the time factor and the world total stock of copper.
[Figure: histogram of the regression residuals]
Series: RESID
Sample 1 127
Observations    127
Mean            1.49E-14
Median          -0.025230
Maximum         0.533643
Minimum         -0.385356
Std. Dev.       0.202947
Skewness        0.391824
Kurtosis        2.364538
Jarque-Bera     5.386476
Probability     0.067662
Jarque-Bera test for normality of residuals: The JB test is used to test the normality assumption
for the residuals, which is an important assumption of the model. With a p-value of 0.0677, the JB
test does not reject the null hypothesis that the residuals are normally distributed at the 5%
level, and thus the assumption holds good.
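The Jarque-Bera statistic can be recomputed from the skewness and kurtosis reported for the residuals. The Python sketch below uses the standard formula JB = n/6 * (S^2 + (K - 3)^2 / 4) and the fact that a chi-square distribution with 2 degrees of freedom has upper-tail probability exp(-x/2); the variable names are illustrative.

```python
import math

# Values copied from the residual summary (Series: RESID).
n = 127
skewness = 0.391824
kurtosis = 2.364538

# Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4)
jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)

# Under the null hypothesis JB is asymptotically chi-square with 2 degrees
# of freedom, whose upper-tail probability has the closed form exp(-x / 2).
p_value = math.exp(-jb / 2)

print(f"Jarque-Bera = {jb:.4f}, p-value = {p_value:.4f}")
```

Because the p-value is above 0.05, normality of the residuals is not rejected at the 5% level, although it would be rejected at the 10% level.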
White heteroskedasticity test for residuals: The White test checks for heteroskedasticity, that is,
non-constant variance, in the residual terms. If there is heteroskedasticity, forecasting with the
model becomes a problem because the variance of the error terms keeps changing; the error terms
should stay within a stable range. The test takes homoskedasticity as its null hypothesis, so a
high p-value and a small F-statistic mean that the null hypothesis is not rejected. In this test
the assumption of homoskedastic residuals holds good.
White Heteroskedasticity Test:
F-statistic       1.556945    Probability    0.102774
Obs*R-squared     20.68987    Probability    0.109848
Test Equation:
Dependent Variable: RESID^2
Method: Least Squares
Date: 11/04/10 Time: 10:20
Sample: 1 127
Included observations: 127
Newey-West HAC Standard Errors & Covariance (lag truncation=4)
Variable      Coefficient    Std. Error    t-Statistic    Prob.
C             -83.13552      332.3502      -0.250144      0.8029
CHINA         -0.031927      0.020330      -1.570442      0.1191
CHINA^2       -3.55E-07      6.41E-07      -0.554688      0.5802
CHINA*LW      0.002098       0.001367      1.534175       0.1278
CHINA*LDX     0.000573       0.000788      0.726770       0.4689
CHINA*TIME    -2.93E-06      5.33E-06      -0.550778      0.5829
LW            10.22270       46.34195      0.220593       0.8258
LW^2          -0.426168      1.641817      -0.259571      0.7957
LW*LDX        0.288200       1.327384      0.217119       0.8285
LW*TIME       -0.002887      0.009718      -0.297043      0.7670
LDX           6.001456       18.48112      0.324734       0.7460
LDX^2         -1.069046      0.839763      -1.273033      0.2056
LDX*TIME      -0.006501      0.007509      -0.865759      0.3885
TIME          0.072793       0.137160      0.530716       0.5967
TIME^2        -6.93E-06      1.89E-05      -0.367772      0.7137
R-squared             0.162912    Mean dependent var       0.040863
Adjusted R-squared    0.058276    S.D. dependent var       0.047923
S.E. of regression    0.046506    Akaike info criterion    -3.187958
Sum squared resid     0.242230    Schwarz criterion        -2.852030
Log likelihood        217.4353    F-statistic              1.556945
Durbin-Watson stat    1.147569    Prob(F-statistic)        0.102774
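The two White test statistics can be recomputed from the R-squared of the test equation. The Python sketch below assumes the usual forms of the test (Obs*R-squared compared with a chi-square distribution whose degrees of freedom equal the 14 regressors other than the constant, and the overall F-statistic of the test equation). The helper `chi2_upper_tail_even_df` is a hypothetical name; it uses a closed form that is valid only for an even number of degrees of freedom.

```python
import math

# Values copied from the White test output.
n_obs = 127
r_squared = 0.162912
n_regressors = 14   # terms in the test equation, excluding the constant C

# LM form of the test: Obs*R-squared
lm_stat = n_obs * r_squared

# F form: joint significance of all regressors in the test equation
f_stat = (r_squared / n_regressors) / ((1 - r_squared) / (n_obs - n_regressors - 1))

def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    half = x / 2
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms

lm_p_value = chi2_upper_tail_even_df(lm_stat, n_regressors)
print(f"Obs*R-squared = {lm_stat:.4f}, p = {lm_p_value:.4f}, F = {f_stat:.4f}")
```

Both forms lead to the same decision here: with p-values above 0.10, the null hypothesis of homoskedasticity is not rejected.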
