
Analytics Assignment

JTA
October 29, 2015

Data Analysis and Modelling


Preliminary Analysis
The data under consideration were obtained from Barilliant, a company that operates multiple
pubs across England and Wales and is currently undertaking a project to improve the selection of
future sites by determining the key drivers of cost and revenue. In particular, we consider whether
factors such as the quality of management and staff, local income, competition, type of establishment
and location explain the variation in revenues (R) for 47 of the existing pubs.
Table 1 shows a descriptive summary of the revenues generated by the pubs over the past year.
These figures vary between 0.61 and 3.53 million, with a mean of 1.89 million and a standard
deviation of 0.66 million. Figure 1 illustrates this information with a histogram, which appears
roughly symmetric about the mean and bell-shaped, suggesting that the revenues are approximately
normally distributed.

Revenue (mil)
Statistic             Value
Mean                   1.89
Standard Error         0.10
Median                 1.84
Mode                   2.06
Standard Deviation     0.66
Sample Variance        0.43
Kurtosis               0.01
Skewness               0.31
Range                  2.92
Minimum                0.61
Maximum                3.53
Sum                   88.74
Count                 47

Table 1: Descriptive statistics for revenue (millions)

Figure 1: Histogram of revenues (millions)
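The entries of Table 1 are linked by simple identities, so they can be cross-checked against one another. The sketch below is only a consistency check on the rounded figures reported in the table, not a recomputation from the raw data, which are not reproduced here.

```python
import math

# Values copied from Table 1 (revenue in millions).
n = 47
total = 88.74
std_dev = 0.66
minimum, maximum = 0.61, 3.53

mean = total / n                    # Mean = Sum / Count
std_error = std_dev / math.sqrt(n)  # Standard Error = SD / sqrt(Count)
value_range = maximum - minimum     # Range = Maximum - Minimum

print(f"mean           = {mean:.2f}")
print(f"standard error = {std_error:.2f}")
print(f"range          = {value_range:.2f}")
```

Each result agrees with the table to two decimal places. Squaring the rounded standard deviation gives about 0.44 rather than the reported variance of 0.43, which is consistent with the table having rounded an unrounded standard deviation slightly below 0.66.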


Table 2 below presents the correlation matrix for all the variables under consideration. From
these results we can glean that management quality and revenue have a strong positive linear
relationship, with $\rho_{M,R} = 0.87$. On the other hand, the linear correlation between revenue
and each of the other variables is relatively weak, with $|\rho| < 0.4$ in all cases. Linear
correlation among the explanatory variables is also weak ($|\rho| \le 0.4$), which suggests that
they carry largely distinct information and will add little or no redundant information to the model.

        R      MT     LI     M      S      C      L2     L3     L4     L5
R      1.00
MT     0.35   1.00
LI    -0.34  -0.29   1.00
M      0.87  -0.08  -0.10   1.00
S     -0.38  -0.04   0.14  -0.39   1.00
C      0.12   0.40   0.00  -0.04  -0.14   1.00
L2    -0.16   0.11   0.10  -0.16   0.19  -0.10   1.00
L3     0.35   0.11  -0.23   0.28  -0.17   0.24  -0.30   1.00
L4     0.10  -0.14  -0.02   0.10  -0.07   0.00  -0.20  -0.26   1.00
L5     0.25  -0.02  -0.01   0.26  -0.20  -0.24  -0.20  -0.26  -0.18   1.00

Table 2: Correlation matrix for revenue and the explanatory variables
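The two claims drawn from Table 2 (which variable is most strongly correlated with revenue, and how strong the largest correlation among explanatory variables is) can be read off mechanically. The following is a small sketch that stores the lower triangle exactly as printed; the helper name `corr` is an illustrative choice.

```python
labels = ["R", "MT", "LI", "M", "S", "C", "L2", "L3", "L4", "L5"]

# Lower triangle of Table 2, one row per variable in the order of `labels`.
lower = [
    [1.00],
    [0.35, 1.00],
    [-0.34, -0.29, 1.00],
    [0.87, -0.08, -0.10, 1.00],
    [-0.38, -0.04, 0.14, -0.39, 1.00],
    [0.12, 0.40, 0.00, -0.04, -0.14, 1.00],
    [-0.16, 0.11, 0.10, -0.16, 0.19, -0.10, 1.00],
    [0.35, 0.11, -0.23, 0.28, -0.17, 0.24, -0.30, 1.00],
    [0.10, -0.14, -0.02, 0.10, -0.07, 0.00, -0.20, -0.26, 1.00],
    [0.25, -0.02, -0.01, 0.26, -0.20, -0.24, -0.20, -0.26, -0.18, 1.00],
]


def corr(a, b):
    """Look up the correlation between two variables by label."""
    i, j = labels.index(a), labels.index(b)
    return lower[max(i, j)][min(i, j)]


explanatory = labels[1:]

# Variable with the strongest linear correlation with revenue.
strongest = max(explanatory, key=lambda v: abs(corr("R", v)))

# Most strongly correlated pair of explanatory variables.
pairs = [(a, b) for k, a in enumerate(explanatory) for b in explanatory[k + 1:]]
top_pair = max(pairs, key=lambda p: abs(corr(*p)))

print(strongest, corr("R", strongest))
print(top_pair, corr(*top_pair))
```

The largest correlation among explanatory variables is between modern type and competition, which sits exactly at the 0.4 boundary.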
Figure 2 below tells the story behind the numbers in the first column of Table 2 graphically, using
scatter plots. Here again we see a clear positive linear pattern between revenue and management
quality, which implies that when management quality improves, revenues tend to increase as well.
A similar but weaker argument can be made for the relationships between revenue and location
and between revenue and type of establishment: a positive linear trend exists, but the points are
more widely scattered. The scatter plots for revenue versus local income and revenue versus staff
quality seem to suggest a weak negative linear relationship between revenue and these two
variables, while the scatter plot for revenue versus competition shows a horizontal line of fit,
which suggests that competition does not affect revenues. The latter observations defy our usual
expectations, since we would expect better-quality staff and higher local income to translate into
increased sales. Likewise, we would expect greater competition to have a negative impact on revenue.
Figure 2: Scatter plots of Revenue (mil) versus Management Quality, Local Income, Staff Quality, Location, Modern Type and Competition
We note that, although the correlation matrix and scatter plots give us some insight into the
behaviour of revenue with respect to the individual explanatory variables, we will not rely fully on
them, since they give us no information on how the explanatory variables affect revenue when they
are combined. For this reason, we proceed to perform a multiple regression analysis.

Model Fitting and Diagnostics


The aim of this analysis is to construct a Multiple Linear Regression Model (MLRM) that
explains variations in the revenues. The variables under consideration are:
Dependent variable: $R_i$, Revenue (millions)
Explanatory variables: $M_i$, Management Quality (on a 1 to 9 scale); $S_i$, Staff Quality (on a 1 to 9
scale); $LI_i$, Local Income (000s); $C_i$, Competition (number of pubs within 2 miles of the site); $MT_i$,
a dummy variable indicating whether the establishment is modern (the default type is conventional); L2, L3,
L4, L5, dummy variables for northern, eastern, southern and western locations respectively (midlands is the default/reference location).
Residual: $\varepsilon_i$, unexplained variation in revenue
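The dummy coding described above can be sketched as follows. The category strings ("modern", "north", and so on) and the function name `encode` are assumptions made for illustration; the text specifies only which levels serve as the reference (conventional type, midlands location).

```python
# Map each non-reference location to its dummy column. "midlands" has no
# column of its own: it is represented by all four dummies being 0.
LOCATION_DUMMIES = {"north": "L2", "east": "L3", "south": "L4", "west": "L5"}


def encode(pub_type, location):
    """Return the dummy variables for one pub as a dict of 0/1 values."""
    row = {"MT": 1 if pub_type == "modern" else 0}
    for name, column in LOCATION_DUMMIES.items():
        row[column] = 1 if location == name else 0
    return row


print(encode("modern", "east"))
print(encode("conventional", "midlands"))
```

Leaving out a column for the reference level avoids perfect collinearity with the intercept; each location coefficient is then read as a difference relative to a midlands pub.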
We proceed by first fitting an MLRM with all the proposed explanatory variables. The regression
statistics and coefficients are summarised in Tables 3 and 4.

Regression Statistics
Multiple R            0.979
R Square              0.958
Adjusted R Square     0.948
Standard Error        0.150
Observations          47

Table 3: Regression statistics for the full model
Variable       Coefficients   Standard Error   t-Stat    P-value
Intercept          0.702          0.188         3.737     0.001
ModernType         0.511          0.053         9.640     0.000
LocalIncome       -0.022          0.007        -3.191     0.003
Management         0.239          0.012        19.696     0.000
Staff              0.006          0.014         0.476     0.637
Competition       -0.002          0.009        -0.197     0.845
Location 2         0.028          0.070         0.392     0.697
Location 3         0.167          0.073         2.289     0.028
Location 4         0.215          0.079         2.736     0.009
Location 5         0.160          0.086         1.875     0.069

Table 4: Coefficient estimates for the full model
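Each t-statistic in Table 4 is the coefficient divided by its standard error. The sketch below recomputes the ratios from the rounded table entries and flags variables with |t| below 1.96. Because the inputs are rounded to three decimals, the recomputed ratios differ slightly from the reported t-statistics.

```python
# (coefficient, standard error) pairs copied from Table 4.
table4 = {
    "Intercept": (0.702, 0.188),
    "ModernType": (0.511, 0.053),
    "LocalIncome": (-0.022, 0.007),
    "Management": (0.239, 0.012),
    "Staff": (0.006, 0.014),
    "Competition": (-0.002, 0.009),
    "Location 2": (0.028, 0.070),
    "Location 3": (0.167, 0.073),
    "Location 4": (0.215, 0.079),
    "Location 5": (0.160, 0.086),
}

THRESHOLD = 1.96  # approximate two-sided 5% critical value

t_stats = {name: coef / se for name, (coef, se) in table4.items()}
insignificant = [name for name, t in t_stats.items() if abs(t) < THRESHOLD]

for name, t in t_stats.items():
    flag = "" if abs(t) >= THRESHOLD else "  <- insignificant"
    print(f"{name:<12} t = {t:7.2f}{flag}")
```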
We observe that the fitted model above explains a large portion of the variance in revenues
(adjusted R Square of 94.8 %), but it contains a number of variables (Staff, Competition, Location 2
and Location 5) that are insignificant (|t stat| < 1.96). In other words, we cannot reject the
hypothesis that the coefficients of these variables are equal to zero, that is, that these variables
have no impact on revenue. To obtain a more suitable model we proceed by discarding the
insignificant variables, one at a time, and refitting the model with the remaining variables. This
process is repeated until no variable with an insignificant t-stat is left in the model (a full summary
of all the intermediate models is presented in the appendix).
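The elimination procedure described above can be sketched as a loop. This is a minimal illustration assuming NumPy is available: the data are synthetic (generated in the script, not the pub data), and dropping the variable with the smallest |t| first is an assumed ordering, since the text does not state which insignificant variable is removed at each step.

```python
import numpy as np


def ols_t_stats(X, y):
    """Ordinary least squares; X must already contain an intercept column."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)          # residual variance
    se = np.sqrt(sigma2 * np.diag(xtx_inv))   # coefficient standard errors
    return beta, beta / se


def backward_eliminate(columns, y, threshold=1.96):
    """Drop the weakest variable and refit until every |t| >= threshold."""
    kept = list(columns)
    while kept:
        X = np.column_stack([np.ones(len(y))] + [columns[k] for k in kept])
        _, t = ols_t_stats(X, y)
        abs_t = np.abs(t[1:])                 # ignore the intercept
        worst = int(np.argmin(abs_t))
        if abs_t[worst] >= threshold:
            break
        kept.pop(worst)
    return kept


# Synthetic example: two real effects plus one pure-noise variable.
rng = np.random.default_rng(0)
n = 47
cols = {
    "management": rng.integers(1, 10, n).astype(float),
    "modern": rng.integers(0, 2, n).astype(float),
    "noise": rng.normal(size=n),
}
y = (0.7 + 0.24 * cols["management"] + 0.5 * cols["modern"]
     + rng.normal(scale=0.15, size=n))

kept = backward_eliminate(cols, y)
print("variables kept:", kept)
```

Refitting after every removal matters because the t-statistics of the remaining variables change once a variable is dropped.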

After a few iterations of the aforementioned process, the resulting model features Management Quality, Local Income, Modern Type and Locations 3, 4 and 5 as significant variables, with an adjusted R Square of 95.1 %.
The regression statistics and coefficients are summarised in
Tables 5 and 6 below.
Regression Statistics
Multiple R            0.979
R Square              0.957
Adjusted R Square     0.951
Standard Error        0.146
Observations          47

Table 5: Regression statistics for the reduced model

Variable       Coefficients   Standard Error   t Stat    P-value
Intercept          0.75           0.16          4.84      0.00
ModernType         0.51           0.05         11.20      0.00
LocalIncome       -0.02           0.01         -3.37      0.00
Management         0.24           0.01         21.49      0.00
Location 3         0.15           0.06          2.42      0.02
Location 4         0.20           0.07          2.88      0.01
Location 5         0.14           0.07          1.99      0.05

Table 6: Coefficient estimates for the reduced model
Overall, this model provides a (marginally) better fit than the initial model containing all the proposed explanatory variables: the adjusted R Square increased from 94.8 % to 95.1 %,
while the residual standard deviation decreased from 0.150 to 0.146.
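The adjusted R Square values quoted above follow from the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where k is the number of explanatory variables. The sketch below applies it to the rounded R Square values from Tables 3 and 5; the function name is an illustrative choice.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k explanatory variables and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


n = 47
full = adjusted_r2(0.958, n, k=9)     # all nine explanatory variables
reduced = adjusted_r2(0.957, n, k=6)  # after removing Staff, Competition, Location 2

print(f"full model:    {full:.3f}")
print(f"reduced model: {reduced:.3f}")
```

The reduced model has a slightly lower R Square (0.957 against 0.958) but a higher adjusted value, because the penalty for the three discarded variables outweighs the small loss in explained variance.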

Figure 3: Residuals versus predicted values


We now check the validity of this model by examining the residual plots, looking for any patterns
that might suggest the residuals are dependent or heteroscedastic (non-constant variance). Figure
3 shows the residuals against the predicted values, while Figure 4 shows the residuals against the explanatory variables. The residuals in Figure 3 seem to be randomly and evenly scattered around
0, with no observable pattern, which is consistent with independence and homoscedasticity (constant variance).
However, in Figure 4 we notice a parabolic pattern in the residuals when they are plotted against local
income. This implies that revenue might have a second-order polynomial relationship with local
income which is not captured by the model.
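To see why a missing squared term produces a parabolic residual pattern, consider the noise-free sketch below. The data are purely illustrative (a made-up quadratic in `x`, not the pub data): a straight line is fitted by least squares, and the residuals come out positive at both ends of the range and negative in the middle.

```python
def fit_line(xs, ys):
    """Simple least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Illustrative curved relationship with no noise (hypothetical values).
xs = [10, 15, 20, 25, 30, 35]
ys = [0.01 * (x - 20) ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(f"x = {x:2d}  residual = {r:+.3f}")
```

A U-shaped run of residuals like this is the signature of curvature that the linear term alone cannot absorb, which motivates adding a squared term.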
Figure 4: Residual plots for Local Income, Management, ModernType, Location 3, Location 4 and Location 5
We adjust the model by adding a second-degree polynomial term for local income. Including
this term in the model caused the coefficients for Locations 3, 4 and 5 to become insignificant (see
the appendix for this result). After discarding the insignificant variables (one at a time) from the model,
we obtained a model with Management Quality, type of establishment, and first- and second-degree
polynomial terms for Local Income as significant explanatory variables. This model has an adjusted
R Square of 97.7 % and a remarkably low residual standard deviation of 0.099. The regression
statistics and coefficients, along with the residual plots, are presented in Tables 7 and 8 and Figures 5
and 6 below.

Regression Statistics
Multiple R            0.990
R Square              0.979
Adjusted R Square     0.977
Standard Error        0.099
Observations          47

Table 7: Regression statistics for the final model
Variable         Coefficient   Standard Error   t Stat    P-value
Intercept           4.208          0.423         9.952     0.000
Management          0.241          0.007        36.213     0.000
LocalIncome        -0.348          0.039        -8.817     0.000
ModernType          0.502          0.030        16.455     0.000
LocalIncome^2       0.008          0.001         8.204     0.000

Table 8: Coefficient estimates for the final model

Figure 5: Residuals versus predicted revenue

Figure 6: Residual plots for Management, Modern Type, LocalIncome and LocalIncome^2
Inspection of the residuals against the predicted values and the explanatory variables reveals no clear
patterns, with residuals randomly and evenly scattered in all cases, which indicates that the residuals
are independent and homoscedastic. This model is therefore satisfactory, since it has very strong
explanatory power and shows no evidence of violating the modelling assumptions. The equation which
describes this model is given by:

$$\text{Revenue}_i = 4.208 + 0.241\,\text{Management}_i - 0.348\,\text{LocalIncome}_i + 0.008\,\text{LocalIncome}_i^2 + 0.502\,\text{ModernType}_i + \varepsilon_i$$
This model can be interpreted as follows:
For pub i, if Management quality, Local Income and Modern Type are all 0, then the expected
revenue is 4.208 million (this is not possible in reality since Management
quality, for example, is measured on a scale of 1 to 9). Ceteris paribus, a 1 unit improvement
in Management quality increases expected Revenue by 0.241 million. Because the model includes a
squared term, the effect of Local Income depends on its level: starting from zero, a 1000 increase in
Local Income is expected to decrease revenue by about 0.348 million (ceteris paribus), and for every
successive 1000 increase in Local Income this reduction is lessened by 0.016 (2 × 0.008) million (ceteris
paribus). Finally, modern/themed pubs on average make 0.502 million more than conventional
pubs (all things being equal).
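The interpretation above can be made concrete with a short sketch that evaluates the fitted equation. The coefficients are those of Table 8; the function names and the example inputs (management 5, local income 20, modern type 1) are hypothetical, chosen only for illustration.

```python
# Coefficients from Table 8 (revenue in millions, local income in 000s).
COEF = {
    "intercept": 4.208,
    "management": 0.241,
    "local_income": -0.348,
    "local_income_sq": 0.008,
    "modern": 0.502,
}


def predict_revenue(management, local_income, modern):
    """Expected revenue (millions) from the final fitted model."""
    return (COEF["intercept"]
            + COEF["management"] * management
            + COEF["local_income"] * local_income
            + COEF["local_income_sq"] * local_income ** 2
            + COEF["modern"] * modern)


def local_income_marginal_effect(local_income):
    """Derivative of expected revenue with respect to local income."""
    return COEF["local_income"] + 2 * COEF["local_income_sq"] * local_income


# Local income level at which the marginal effect changes sign.
turning_point = -COEF["local_income"] / (2 * COEF["local_income_sq"])

print(f"predicted revenue: {predict_revenue(5, 20, 1):.3f}")
print(f"marginal effect at 20: {local_income_marginal_effect(20):+.3f}")
print(f"turning point: {turning_point:.2f}")
```

Because the squared-term coefficient is reported to only one significant figure, the turning point computed here is approximate and should be read as indicative rather than exact.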

Appendix
