
BMGT 430: Linear Statistical Models in Business

Shawn Mankad
4316 VMH
smankad@rhsmith.umd.edu
Welcome!
Computers are incredibly fast, accurate, and stupid. Human beings are
incredibly slow, inaccurate, and brilliant. Together they are powerful
beyond imagination.
-Albert Einstein
This course is the foundation for all data analytics and hence, is the most
important Statistics course you will take at UMD!
Faculty: Shawn Mankad
My background

Assistant Professor in DOIT - Decisions, Operations & Information Technology

Ph.D. in Statistics

Research specializes in linear models

2nd year at UMD and 2nd time teaching this exact course
Syllabus
Everyone have one?
Faculty and Staff
Lectures

Shawn Mankad, Tues/Thurs 5 - 6:15 pm, 1333 VMH

Office hours

Shawn Mankad, Mon/Wed 4-5pm, 4316 VMH

Lisa Patti, TBA (depends on student demand)

Materials
Course websites:

Canvas (lecture notes, assignments, solutions)


Software:

Minitab, Excel, SPSS (see SmithApps)

R (use is encouraged, but definitely not mandatory or even covered extensively)

Other mathematical / statistical software is okay

Assessment
Approximately biweekly homeworks, 1 project, 2 midterms, 1 final

Homeworks 25%, Project 15%, Midterms 30%, Final 30%

Write up your own homework

Late homework will be assessed a penalty of at least one letter grade, commensurate with how late it is

No cell phone use during lecture, please

Learning Objectives
You should leave this course able to perform the important aspects of regression:

Develop a facility for structuring a problem as a regression model

Gain a comprehensive understanding of how the method of least squares is used to estimate coefficients

Recognize the robustness and limitations of the model with diagnostic techniques

Perform statistical inference, with a major emphasis on building parsimonious models

Use regression analysis to help make better decisions

Concept Review
Outline
Concept Review
Linear Models
Throughout the semester, we will always ask:
Does X have an effect on Y? If so, to what extent?
Example 1:

Y is stock returns for Walmart

X is the Consumer Price Index

Example 2:

Y is sales for Walmart in different locations

X is the marketing cost in each location

[Figure: Walmart store locations in Aug 2010]
Population and Samples

Y is sales for Walmart in different locations

X is the marketing cost in each location

Because there are so many Walmart stores, to save time and money we
might ask only some of the locations for their sales and marketing
information, i.e., we take a random sample of 100 stores
\( (y_1, y_2, \ldots, y_{100}) \).
Random Variable
From a random sample, we measure some characteristic, like the Walmart
store sales or number of shoppers.

Let \( y_i \) represent the sales of store \( i \).

Y is called a random variable, because \( y_i \) will change slightly if you
obtain another sample from the population of Walmart stores (see p.
13, 24-25 of your textbook).
Expected Value - Population Parameter
E(Y) stands for the expected value of random variable Y (see textbook
p. 15).

This is just the population-level average value!

It is an example of a population-level parameter

The textbook occasionally uses \( \mu \) instead of E(Y).

Example:

E(Y) is the true average sales over all Walmart locations.

Average - Sample Statistic
\( \bar{y} \) is an average calculated from your data sample, i.e., \( \bar{y} \) is a statistic (see
textbook p. 11)

Main idea in all of statistics and data mining: With enough data, \( \bar{y} \) is
very close to E(Y), i.e., \( \bar{y} \approx E(Y) \) with lots of data.
Example:

E(Y) is the true average sales over all Walmart locations.

\( \bar{y} \) is the sample statistic (calculated from the random sample) that
approximates the true average sales over all Walmart locations.
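
The idea that the sample average approaches E(Y) as the sample grows can be seen in a small simulation. This is only a sketch: the "population" below is a made-up normal distribution with mean 50 and standard deviation 10, not real Walmart data, and the names `TRUE_MEAN`, `TRUE_SD`, and `sample_mean` are illustrative.

```python
import random
import statistics

# Hypothetical population: store sales ~ Normal(mean 50, sd 10).
# These numbers are illustrative assumptions, not real data.
TRUE_MEAN = 50.0
TRUE_SD = 10.0


def sample_mean(n, rng):
    """Average of n simulated store sales (the statistic y-bar)."""
    return statistics.fmean(rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n))


rng = random.Random(430)  # fixed seed so repeated runs give the same numbers
results = {n: sample_mean(n, rng) for n in (10, 1_000, 100_000)}

for n, y_bar in results.items():
    print(f"n = {n:>6}: y-bar = {y_bar:.3f}, error = {abs(y_bar - TRUE_MEAN):.3f}")
```

With larger n the printed error typically shrinks, which is the sense in which the sample average approximates E(Y).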
Example
Say we have a random variable X with mean 10.
Let Y = 3 + 5X.
What is E(Y), the expected value of Y?
What is the type of relationship between Y and X?
Example Solution
E(Y) = E(3 + 5X)
= E(3) + E(5X)
= 3 + 5E(X)
= 3 + 5(10)
= 53.
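
The same answer can be checked numerically. The sketch below assumes a small discrete distribution for X whose mean is 10 (the values and probabilities are invented for illustration) and compares E(3 + 5X) computed directly with 3 + 5E(X).

```python
from fractions import Fraction

# Hypothetical discrete distribution for X with mean 10
# (values and probabilities are illustrative assumptions).
x_dist = {5: Fraction(1, 4), 10: Fraction(1, 2), 15: Fraction(1, 4)}


def expected_value(dist, g=lambda v: v):
    """E[g(X)] for a discrete distribution given as {value: probability}."""
    return sum(p * g(v) for v, p in dist.items())


e_x = expected_value(x_dist)                               # E(X)
e_y_direct = expected_value(x_dist, lambda v: 3 + 5 * v)   # E(3 + 5X), term by term
e_y_linear = 3 + 5 * e_x                                   # 3 + 5 E(X)

print(e_x, e_y_direct, e_y_linear)
```

Both routes give 53 because expectation is linear; the particular distribution of X does not matter as long as its mean is 10.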
Conditional Expectations E(Y|X)
E(Y|X) stands for the expected value of random variable Y for a
particular value of X.

This is just the population-level average at a particular X value.

The textbook occasionally uses \( \mu_{Y|X} \) instead of E(Y|X).
We will learn the sample statistic for E(Y|X) is the estimated regression
model
\[ \hat{E}(Y|X) = \hat{\mu}_{Y|X} = b_0 + b_1 X. \]
Conditional Expectations E(Y|X)
Example:
If Y = 3 + 5X, then what is E(Y|X = 5)?
Conditional Expectations E(Y|X)
Example:

E(Y|X) is the true average sales over all Walmart locations with
advertising expenses equal to X.

E(Y|X = 100) is the true average sales over all Walmart locations
with advertising expenses equal to $100.

The regression model says the true average sales over all Walmart
locations with advertising expenses equal to $100 is
\[ E(Y|X = 100) = \beta_0 + \beta_1 \cdot 100. \]
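
As a rough preview of how such a model is estimated from a sample, the sketch below fits a straight line by least squares to five invented data points and evaluates the fitted line at X = 100. The data, the units, and the helper name `least_squares` are assumptions for illustration; the formulas are the standard simple-regression least-squares estimates, which these slides have not yet introduced.

```python
# Hypothetical sample: marketing cost (dollars) and sales (thousands of dollars)
# for five stores. The numbers are invented for illustration.
marketing = [60, 80, 100, 120, 140]
sales = [55, 98, 160, 195, 252]


def least_squares(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx              # estimated slope
    b0 = y_bar - b1 * x_bar     # estimated intercept
    return b0, b1


b0, b1 = least_squares(marketing, sales)
estimate_at_100 = b0 + b1 * 100   # sample estimate of E(Y | X = 100)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, estimate at X = 100: {estimate_at_100:.1f}")
```

Here b0 and b1 play the role of sample statistics that estimate the unknown population parameters in the regression model.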
Variance and Standard Deviation
Variance measures how spread out a collection of numbers is.

A small variance indicates that the data points tend to be very close
together;

A high variance indicates that the data points are spread out.

See textbook p. 18.

Variance and Standard Deviation
The square root of variance is called the standard deviation, and it has
the same basic interpretation

\( \sqrt{\text{variance}} = \text{standard deviation} \).

Population Parameters:

\( \sigma^2_X \) is used to denote the true, population-level variance of random
variable X

\( \sigma_X \) is used to denote the true standard deviation of X.

These values are almost always unknown to us and the goal is then
to estimate them.
Sample Statistics:

We estimate the population parameters with the sample statistics

\[ s^2 = \hat{\sigma}^2_X = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \]

\[ s = \sqrt{\hat{\sigma}^2_X} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}. \]

See p. 11-12 of the textbook
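
A minimal sketch of these two sample statistics in Python, computed once from the definition and once with the standard library's `statistics.variance` and `statistics.stdev` (both divide by n - 1). The five data values are used purely as an illustration.

```python
import math
import statistics

# Illustrative sample of five measurements.
x = [92, 110, 115, 103, 98]

n = len(x)
x_bar = sum(x) / n

# Sample variance from the definition: squared deviations divided by (n - 1).
s2_manual = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s_manual = math.sqrt(s2_manual)

# The standard library uses the same (n - 1) divisor.
s2_lib = statistics.variance(x)
s_lib = statistics.stdev(x)

print(f"s^2 = {s2_manual:.2f}, s = {s_manual:.2f}")
```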


Distributions
Recall from a population (like all Walmart stores), we collect a random
sample.
For each item in the random sample, we measure some characteristic (like
store sales) to create a random variable.
We create a histogram of the random variable to look at its values.
Normal Distributions
Stunningly, many variables look bell-shaped! This distributional pattern
is formally defined by a normal distribution.
Normal Distributions
Gauss invented the normal distribution and
linear models around the year 1800!
Normal Distributions
To define a normal distribution, you must specify its mean \( \mu \) and
variance \( \sigma^2 \). (Don't lose points on exams and homeworks for little mistakes
like this!)
We usually use shorthand and write \( N(\mu, \sigma) \) or \( N(\mu, \sigma^2) \).

As soon as you know the mean \( \mu \) and variance \( \sigma^2 \), you know what the
distribution of values looks like!
Other Distributions
We will also use other distributions in this class, like the t-distribution and
F-distribution.

Always remember to specify the name of the distribution and its parameters.

You will need to know how to use the tables in the back of your book.
Example
The stock price for a large retailer in the 4th quarter is assumed to be
normally distributed with mean = 45 and standard deviation = 5.
What is the probability that the stock price in the 4th quarter exceeds 50?
Example
Let X represent the stock price.
Then we are interested in calculating P(X > 50).
\[ P(X > 50) = P\left( \frac{X - 45}{5} > \frac{50 - 45}{5} \right) = P(Z > 1) = 0.5 - 0.3413 = 0.1587. \]
(Numbers come from Table B.1)
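
The table lookup can be cross-checked in software. The sketch below uses `statistics.NormalDist` from the Python standard library, once on the original scale and once after standardizing to Z; the variable names are illustrative.

```python
from statistics import NormalDist

# Stock price X ~ Normal(mean 45, standard deviation 5), as in the example.
price = NormalDist(mu=45, sigma=5)

# Direct calculation on the original scale: P(X > 50) = 1 - P(X <= 50).
p_direct = 1 - price.cdf(50)

# Same probability after standardizing: Z = (X - 45) / 5, so P(Z > 1).
z = (50 - 45) / 5
p_standardized = 1 - NormalDist().cdf(z)

print(f"P(X > 50) = {p_direct:.4f}, P(Z > {z:.0f}) = {p_standardized:.4f}")
```

Both calculations agree with the table value of 0.1587 to four decimal places.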
Example
We will use this trick throughout the semester (see textbook p. 19-23):

If \( X \sim N(\mu, \sigma) \), then \( Z = \frac{X - \mu}{\sigma} \sim N(0, 1) \).

If the estimate of \( \sigma \) is used, then \( t = \frac{X - \mu}{s} \sim t(n - 1 \text{ degrees of freedom}) \).

Once the variable is transformed to N(0, 1) or a t distribution, then we
can use the table in the back of the book.
Confidence Intervals
Suppose we are interested in estimating the population-level mean \( \mu \).
We would then calculate the sample average \( \bar{y} \).
But how accurate is \( \bar{y} \)? It depends on the variance or spread of the
distribution of values.
Confidence Intervals
An interval estimate called a confidence interval is used to get a sense of
how accurate the statistic \( \bar{y} \) is.
The formula for the \( 100(1 - \alpha)\% \) confidence interval about the mean is
\[ \left( \bar{y} - t_{(\alpha/2,\, n-1)} \frac{s}{\sqrt{n}},\ \ \bar{y} + t_{(\alpha/2,\, n-1)} \frac{s}{\sqrt{n}} \right). \]
Confidence Intervals
Notice that we are using the t distribution, which always requires knowing
the degrees of freedom (n - 1). In shorthand, we write t(n - 1).
\( t_{(\alpha/2,\, n-1)} \) defines the critical value, a value on the t distribution that
corresponds to area under the curve to its right equaling \( \alpha/2 \).
Example:
A manufacturer wants to estimate the average life span of an expensive
electrical component. Because the test to be used destroys the
component, a small sample is desired. The lifetimes in hours of five
randomly selected components are
92, 110, 115, 103, 98.
Find a point estimate and 95% confidence interval estimate of the
population average lifetime of the components.
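The interval to be used is
\[ \left( \bar{y} - t_{(0.025,\, 4)} \frac{s}{\sqrt{n}},\ \ \bar{y} + t_{(0.025,\, 4)} \frac{s}{\sqrt{n}} \right). \]
So, we need
\[ \bar{y} = 103.6, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} = 9.18, \qquad t_{(0.025,\, 4)} = 2.776. \]
After plugging all the numbers into the formula, we get (92.2, 115.0).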
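
The arithmetic can be reproduced with a few lines of Python. This sketch takes the critical value 2.776 from the t table, exactly as the slide does, rather than computing it; the variable names are illustrative.

```python
import math
import statistics

lifetimes = [92, 110, 115, 103, 98]   # hours, from the example
T_CRIT = 2.776                        # t(0.025, 4), read from a t table

n = len(lifetimes)
y_bar = statistics.mean(lifetimes)    # point estimate of the population mean
s = statistics.stdev(lifetimes)       # sample standard deviation (n - 1 divisor)

margin = T_CRIT * s / math.sqrt(n)    # half-width of the interval
lower, upper = y_bar - margin, y_bar + margin

print(f"y-bar = {y_bar:.1f}, s = {s:.2f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```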
Hypothesis Testing
Hypothesis testing is extremely important in this class. Here are the key
definitions.

Null hypothesis \( H_0 \): states the hypothesis to be tested.

Alternative hypothesis \( H_a \): includes values of the population
parameter not in the null hypothesis.

Test statistic: A number computed from the sample, usually put in a
form to use the table from the back of the book.
Example
Suppose now the manufacturer wishes to test whether the population
average life of the components is 110 hours or more. If it is less,
adjustments are made to the manufacturing process.
The lifetimes in hours of five randomly selected components are
92, 110, 115, 103, 98.
Example
The hypotheses of interest are
\( H_0: \mu \geq 110 \);
\( H_a: \mu < 110 \).
We always use a t-statistic in this class when testing a mean. The form of
the t-statistic is
\[ t = \frac{\bar{y} - \mu_0}{s / \sqrt{n}}. \]
Plugging in values, we get
\[ \bar{y} = 103.6, \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} = 9.18, \qquad t = \frac{103.6 - 110}{9.18 / \sqrt{5}} = -1.56. \]
Now we use the following decision rule (see p. 35 for more rules)

Reject \( H_0 \) if: \( t < -t_{\alpha,\, n-1} = -2.132 \);

Do not reject \( H_0 \) if: \( t \geq -t_{\alpha,\, n-1} = -2.132 \).
or calculate the p-value using the back of the book.
Since the t-statistic t = -1.56 > -2.132, we fail to reject the null
hypothesis at the 5% significance level and conclude that the data are
consistent with a population average life of 110 hours or more.
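
The test can be reproduced in a short script. This sketch assumes a lower-tailed test of a hypothesized mean of 110 and takes the critical value 2.132 from the t table, as the slide does; `MU_0` and `T_CRIT` are illustrative names.

```python
import math
import statistics

lifetimes = [92, 110, 115, 103, 98]   # hours, from the example
MU_0 = 110                            # hypothesized mean under H0
T_CRIT = 2.132                        # t(0.05, 4), read from a t table

n = len(lifetimes)
y_bar = statistics.mean(lifetimes)
s = statistics.stdev(lifetimes)

# t-statistic: distance of y-bar from MU_0 in units of s / sqrt(n).
t_stat = (y_bar - MU_0) / (s / math.sqrt(n))

# Lower-tailed test: reject H0 only when t falls below -T_CRIT.
reject_h0 = t_stat < -T_CRIT

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Note that the statistic is negative because the sample average (103.6) is below the hypothesized value of 110.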
p-values
A p-value is the probability of obtaining a test statistic result at least as
extreme as what was actually observed, assuming that the null hypothesis
is true.
Example: If t = -1.56, then the test statistic was 1.56 standard deviations
away from 0.

If \( H_a: \mu \neq 110 \), then p-value = P(t < -1.56 or t > 1.56)

If \( H_a: \mu < 110 \), then p-value = P(t < -1.56)

If \( H_a: \mu > 110 \), then p-value = P(t > -1.56)
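
For readers without the table at hand, the three p-values can be approximated numerically. The sketch below is one possible approach, not the course's method: it integrates the Student's t density with Simpson's rule using only the standard library, assuming t = -1.56 with 4 degrees of freedom from the component-lifetime example. The helper names `t_pdf` and `t_cdf` are made up for this illustration.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), via Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3               # approximates P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area


t_stat, df = -1.56, 4

p_lower = t_cdf(t_stat, df)                 # H_a: mu < 110
p_upper = 1 - t_cdf(t_stat, df)             # H_a: mu > 110
p_two_sided = 2 * t_cdf(-abs(t_stat), df)   # H_a: mu != 110

print(f"lower: {p_lower:.3f}, upper: {p_upper:.3f}, two-sided: {p_two_sided:.3f}")
```

The lower-tail value comes out near 0.097, and the two-sided value is twice that because the t distribution is symmetric about 0.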
If the p-value is 0.079, do you reject or fail to reject the null hypothesis at
the 10% significance level?
At the 5% significance level?
At the 1% significance level?
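
The comparison behind each answer is the usual rule of rejecting the null hypothesis when the p-value is smaller than the significance level. A minimal sketch of that rule, with illustrative variable names:

```python
p_value = 0.079

# Reject H0 when the p-value is smaller than the significance level alpha.
decisions = {}
for alpha in (0.10, 0.05, 0.01):
    decisions[alpha] = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: {decisions[alpha]}")
```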
If the 95% confidence interval is (92.2, 115.0), how would you evaluate
the following hypotheses at the 5% significance level?
\( H_0: \mu = 92.0 \); \( H_a: \mu \neq 92.0 \)
Reject!! Because 92.0 is not contained in the 95% interval.
A 5% significance level corresponds to a (100-5)% confidence interval.
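
This link between a two-sided test and a confidence interval can be written as a one-line check. The function name `two_sided_decision` is made up for this sketch, and it applies only to two-sided alternatives at the matching significance level.

```python
def two_sided_decision(ci, mu_0):
    """Two-sided test at level alpha using the matching (1 - alpha) interval."""
    lower, upper = ci
    return "fail to reject H0" if lower <= mu_0 <= upper else "reject H0"


ci_95 = (92.2, 115.0)   # 95% interval from the component-lifetime example

print(two_sided_decision(ci_95, 92.0))    # 92.0 lies outside the interval
print(two_sided_decision(ci_95, 110.0))   # 110.0 lies inside the interval
```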