
Full-time MBA:

Business Statistics (BST510)

Presenting Data and Descriptive Statistics

Section one
Paul Bottomley
Bottomleypa@cf.ac.uk
Silver, pp.24-26, 35-38, 45-48, 50-68
The Nature of Statistics
A statistic is anything we calculate from a set of data.

Data and information


Can be graphical, numerical or textual.
Can be national (RPI), firm (P/E ratio), or individual
(attitudes, opinions) level.

Statistics as a subject tries to answer questions like:


What does our data (not) tell us?
Summarize & interpret information to aid understanding
How can we obtain useful data?
Designing and conducting surveys and experiments.
Types of Data
Primary: commissioned to solve this problem.
Secondary: commissioned by somebody else.

Cross-sectional: snapshot, same point in time.


A market research report of the UK car market in 2014.
GfK's television dataset, which we will be analysing.
Time-series: movie, over several periods of time.
Price of petrol between 2006 and 2014.
Household panel survey tracking purchase behaviour.

But the majority of data are derived from surveys, so we must consider possible sources of error.
Sources of Survey Error (1)
Mis-recording: human error linked with data entry.
Interviewer records respondent's age as 32, not 23.
Sampling error
Occurs naturally, depends on the sampling method.
Can be calculated (estimated, e.g. +/- 3%).
Declines as sample size rises, but not proportionately.
Response error
Arises because questions are asked in a social context.
Importance of question wording.
People second guess the interviewer.
E.g. attribute ratings: 10 point scale, allocate 100 points?
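The point that sampling error declines as the sample size rises, but not proportionately, can be illustrated with a rough sketch. It assumes simple random sampling, a 95% confidence level (multiplier 1.96) and the worst-case proportion p = 0.5; none of these values come from the slides, and `margin_of_error` is a name made up for the illustration.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumptions of this sketch: simple random sampling, worst-case
    proportion p = 0.5 and the usual 95% normal multiplier z = 1.96.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Sample sizes chosen only for illustration.
for n in (250, 1000, 4000):
    print(n, round(100 * margin_of_error(n), 1), "%")
```

Under these assumptions, a sample of about 1000 gives roughly +/- 3%, and quadrupling the sample size only halves the margin of error, because the error shrinks with the square root of n.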
Sources of Survey Error (2)
Non-response error
Low response rates are normal (40% excellent); reduces
precision (increases sampling error).
Non-response bias: responders differ from non-responders.
Scott Armstrong: compare early and late responders on key
questions.
Design error
Arises because of inappropriate sampling methods.
Choice of sampling frame (list).
Problems with quota sampling.
E.g.: Members of CIM; Cardiff high-street Tuesday a.m.
Television Dataset 1
(see end of handout)

Highlights from the GfK monthly TV panel dataset. Bar-coded sales from leading retailers, 1996. Sample of 31 models of TV; 10 shown here.

Nr    Sound   Sales   Price    Size
38    4       31      301.81   17
47    2       34      756.66   24
48    4       42      460.00   21
60    2       49      150.19   14
63    4       50      635.30   27
..    ..      ..      ..       ..
321   2       1079    386.64   21
353   2       1104    590.54   24
360   4       1433    158.33   14
408   2       1439    423.23   24
411   4       1622    456.85   24
Scales of Measurement
(Types of Numerical Data)
Determines the amount of information contained in the
data and influences the methods of analysis.
Nominal (categorical)
Numbers used to label things; can group values together.
E.g. Country-of-origin, brand choice
E.g., Boys = 1, Girls = 2; countries, brands.
Allowed operations: can test for equality (=).
Ordinal (rank order)
Numbers used to order values.
E.g., F1 results: (1) Red Bull, (2) Ferrari, (3) McLaren
No indication of the gaps (how close was Ferrari to Red Bull?).
Allowed operations: test for equality; greater/less than (= > <).
Scales of Measurement Cont.
Interval (metric)
Numbers have all the properties of ordinal data, plus the gaps
between values are equal sized (meaningful). Can add / subtract.
40°C is 20° hotter than 20°C, but it is NOT twice as hot.
But the zero point is arbitrary, not a true zero.
Allowed operations: can test for equality; greater/less than;
and addition/subtraction (= > < + -).
Ratio (metric)
Same as interval, except zero is non-arbitrary (meaningful).
Can find ratios of values.
A TV costing £1000 is twice as expensive as a £500 set.
E.g., My bike has twice as many gears as yours.
Allowed operations: can test for equality; greater/less than;
add/subtract; and multiplication/division (= > < + - × ÷).
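A small sketch of why ratios are not meaningful on an interval scale: the ratio of two temperatures changes when the (arbitrary) zero changes, while the Celsius difference of 20 degrees stays meaningful. The helper names are made up for this illustration; the conversion formulas are the standard ones.

```python
def celsius_to_fahrenheit(c):
    """Standard Celsius to Fahrenheit conversion."""
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    """Standard Celsius to Kelvin conversion (Kelvin has a true zero)."""
    return c + 273.15

hot, mild = 40.0, 20.0

difference_c = hot - mild  # differences are meaningful on an interval scale
ratio_c = hot / mild       # looks like "twice as hot", but depends on the zero
ratio_f = celsius_to_fahrenheit(hot) / celsius_to_fahrenheit(mild)
ratio_k = celsius_to_kelvin(hot) / celsius_to_kelvin(mild)

print(difference_c, ratio_c, round(ratio_f, 2), round(ratio_k, 2))
```

The same two temperatures give three different "ratios" depending on the scale, which is why only the Kelvin figure (a ratio scale) can be read as a genuine ratio.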
Television Dataset #1 (n=31)

Nr    Sound   Sales   Price    Size
38    4       31      301.81   17
47    2       34      756.66   24
48    4       42      460.00   21
60    2       49      150.19   14
63    4       50      635.30   27
..    ..      ..      ..       ..
321   2       1079    386.64   21
353   2       1104    590.54   24
360   4       1433    158.33   14
408   2       1439    423.23   24
411   4       1622    456.85   24

Nominal data (Nr, Sound): no mathematical properties.
Ratio data: Price of TVs. Non-negative continuous variable.
Ratio data: Unit sales & screen size. Non-negative integer values.
Supermarket Shopping Questionnaire

Which supermarket do you prefer? (Nominal)
Tesco 1
Asda 2
Morrisons 3
Waitrose 4

Rank these supermarkets in order of your preference: (Ordinal)
Tesco 1
Asda 2
Morrisons 3
Waitrose 4

How satisfied were you with your last shopping experience at each of
these supermarkets? (1 = not at all satisfied, 5 = very satisfied). (Interval)
Tesco 1 2 3 4 5
Asda 1 2 3 4 5
Morrisons 1 2 3 4 5
Waitrose 1 2 3 4 5

How much did you spend last time you went supermarket shopping
(to the nearest £)? ___47___ (Ratio)
Cardiff 10K Road Race
Imagine that you enter the Cardiff 10 km. You get your race number, train
hard and complete the course. Yippee, but you won't beat Mo Farah!
What type of data is each of the following variables?
What statistical operations are OK?

Scale      Example                  Typical Value   Operations
Ratio      Duration of 10K race     45:12           = > < + - × ÷
Interval   Time when you finished   11:45 AM        = > < + -
Ordinal    Position at finish       33rd            = > <
Nominal    Race competitor #        3312            =
Data Reduction
Data -> Useful information + Irrelevant information

It is like diagnosing a broken arm by X-ray: flesh is less interesting than bone in this context.
Data mining: vast quantities of information crunched to reveal a small number of high value nuggets.
Value depends on viewpoint: stock picking, returns or risk?
Frequency Distribution
A frequency distribution is a tabular summary of data showing
the number of items in several non-overlapping categories.

Sound System   Frequency (Count)   Relative Frequency
#1             6                   0.19
#2             11                  0.35
#3             2                   0.06
#4             12                  0.39

Aside: Important when we come to probability.

The classes are already defined (TV sound systems).
Relative frequency = proportion of items in each category.
Relative frequency = frequency of class / no. of data points
Percentage frequency = relative frequency x 100
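The two formulas above can be checked against the sound-system table with a short sketch. The counts are taken from the table; the variable names are just for illustration.

```python
# Frequency (count) of each sound system among the 31 TV models.
counts = {"#1": 6, "#2": 11, "#3": 2, "#4": 12}

n = sum(counts.values())  # number of data points

# Relative frequency = frequency of class / no. of data points
relative = {system: count / n for system, count in counts.items()}

# Percentage frequency = relative frequency x 100
percentage = {system: 100 * rel for system, rel in relative.items()}

for system in counts:
    print(system, counts[system], round(relative[system], 2),
          round(percentage[system], 1))
```

The relative frequencies sum to 1, which is the property that matters once they are read as probabilities.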
Presenting Data: Bar Charts
Bar charts offer a graphical summary of nominal (categorical)
data (e.g. voting preferences, brand choices).
Height of each bar is proportional to its frequency (count).
Bars separated to stress the data's categorical nature.

[Figure: bar chart of Frequency (0 to 14) by Sound System (#1 to #4), with the modal category labelled.]

Classes with smaller frequencies (<5%) can be grouped to form an aggregate class, labelled "other".
Changes in Professions 1996-2008
[Figure: changes in professions, 1996-2008, covering: Doctors; Judges, lawyers; Nurses; Teachers; Publicans; Prison Officers; Shop-workers; Pilots; Carpenters; Bricklayers; Miners; Farmers; Fishermen.]

Beware changes in definitions. The number of people working in retailing is down, while the number working in fisheries is up!
Small Ads: What Men Want

Surprising similarities between men and women: sense, humour, fun and intelligent.
Beware problem of proportionality with font size.
Frequency Distributions (2)
A frequency distribution shows the number of items in
each of several non-overlapping classes (intervals).
More difficult with metric data (TV prices, Dataset #1).

301.61 756.66 460.00 150.19 635.30 239.99 904.05 206.60


417.82 882.05 176.97 466.69 259.47 478.90 173.66 333.79
673.69 1216.95 579.74 429.06 195.98 352.33 222.56 334.46
444.27 237.47 386.64 590.54 158.33 423.23 456.85

How many classes? (5 to 20)
No overlaps between class intervals, or data get counted twice.
No gaps between class intervals, or data get missed out.
Width of each class? (Max. - Min.) / no. of classes = good starting place.
Ideally the same width: reduces misunderstanding (see later).
Average of data in any interval as close to mid-point as possible.
Presenting Data: Histograms (1)
To build a histogram, place the bars over the class intervals.
The area (not height) of a bar is proportional to its frequency.
Choose classes carefully (5 to 20 ok).

Class Interval        Frequency
0 to under 150        0
150 to under 300      10
300 to under 450      9
450 to under 600      6
600 to under 750      2
750 to under 900      2
900 to under 1050     1
1050 to under 1200    0
1200 to under 1350    1

[Figure: histogram of Price of TVs; vertical axis Frequency (0 to 12), horizontal axis from 150 to 1350 in steps of 150.]
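As a rough sketch, the class frequencies can be reproduced by binning the 31 prices listed under Frequency Distributions (2) into equal classes of width 150. The "lower limit included, upper limit excluded" rule matches the "to under" wording; the variable names are illustrative only.

```python
# The 31 TV prices from Dataset #1, as listed on the earlier slide.
prices = [
    301.61, 756.66, 460.00, 150.19, 635.30, 239.99, 904.05, 206.60,
    417.82, 882.05, 176.97, 466.69, 259.47, 478.90, 173.66, 333.79,
    673.69, 1216.95, 579.74, 429.06, 195.98, 352.33, 222.56, 334.46,
    444.27, 237.47, 386.64, 590.54, 158.33, 423.23, 456.85,
]

width = 150      # equal class width
n_classes = 9    # covers 0 to under 1350

counts = [0] * n_classes
for price in prices:
    # Floor division puts each price in exactly one class:
    # no overlaps (nothing counted twice) and no gaps (nothing missed).
    counts[int(price // width)] += 1

for i, count in enumerate(counts):
    print(f"{i * width} to under {(i + 1) * width}: {count}")
```

Because every price falls in exactly one class, the frequencies add up to the sample size of 31.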
Histograms (2)
Increasing class width can lead to loss of information.
Decreasing class width can lead to scattered information.

[Figure: two histograms of Price of TVs, vertical axis Frequency. Left: wider classes (horizontal axis 250 to 1250), typical value 250 - 500. Right: narrower classes (horizontal axis 100 to 1300), typical value 400 - 500.]

Adv.: Appeals to the visually minded; don't get lost in detail.
Adv.: Meaningful patterns can be seen at a glance.
Disadv.: Loss of info. which could turn out to be useful.
Histograms (3)
Beware class intervals with unequal widths.
Age, 11-20, 21-30, 31-40, 41-50, 51-60, 61-70, 71-90.
Assuming all classes have the same frequency, the bar
associated with the 71-90 class will be:
a) The same width but half as tall.
b) The same width but twice as tall.
c) Twice as wide and half as tall.
d) Twice as wide and twice as tall.

Beware open-ended class intervals
Especially if you don't have the raw data. By convention,
we assume the open class is twice as wide as the typical class interval.
Example: car-part delivery times: less than 1 hr, 1 to less than 3,
3 to less than 5, 5 to less than 7, 7 or more hrs.
Open-class interval = 7 to less than 11 hours
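Since area, not height, represents frequency, the bar height for a class is its frequency divided by its width. A minimal sketch for the age classes above, using a made-up frequency of 20 per class (the slides only say the frequencies are equal) and treating ages as whole years:

```python
# Age classes from the slide, as inclusive (low, high) pairs in whole years.
age_classes = [(11, 20), (21, 30), (31, 40), (41, 50),
               (51, 60), (61, 70), (71, 90)]

frequency = 20  # hypothetical: the same count in every class

def bar_height(freq, low, high):
    """Height that makes the bar's area equal to its frequency."""
    width = high - low + 1  # e.g. ages 11..20 span 10 years
    return freq / width

heights = [bar_height(frequency, low, high) for low, high in age_classes]

for (low, high), height in zip(age_classes, heights):
    print(f"{low}-{high}: width {high - low + 1}, height {height}")
```

With equal frequencies, the 71-90 class is twice as wide as the others, so its bar must be half as tall to keep the area the same.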
TV Data Set 2
Manufacturer: Bang & Olufsen
Nr Sound Sales Price Size Brandn
174 1 284 800 27.00 7
183 2 248 891 21.00 7
303 2 79 1295 27.00 7
316 2 74 1451 27.00 7
332 4 62 580 26.00 7
343 2 56 1192 24.00 7
366 2 48 1285 27.00 7
408 2 34 757 24.00 7
Dot Plot: Price of B&O Televisions

[Figure: dot plot of B&O television prices; horizontal axis Price (£), 0 to 2000.]

Central tendency refers to the centre or middle of a distribution of values. Data tend to cluster around here.
What is the typical price? Where is the balancing point?
Measures of Central Tendency: Mean
The mean is the average value of a variable.
Simply add up all the data values, count how many data
points there are and divide by this count.
Example: {4, 8, 9}, mean = (4 + 8 + 9) / 3 = 21 / 3 = 7
Sum of negative deviations = sum of positive deviations:
4 lies 3 below the mean of 7, while 8 and 9 lie 1 + 2 = 3 above it.

The sample mean is denoted $\bar{X}$ and pronounced "x bar".
In mathematical notation, for n data points, the mean is

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum X}{n}$$
Central Tendency: Mean Cont.
Let's return to B&O prices: 800, 891, 1295, 1451, 580, 1192, 1285, 757 (sum = 8251).
Mean = (800 + 891 + ... + 757) / 8 = 8251 / 8 = 1031.38
Typical B&O costs about £1031 (in 1996)!
Remember the units when telling the story.
Adv: It is easily understood and unique.
Calculation based on all data, so no information wasted.
Disadv: may give strange results (2.4 kids).
Only appropriate for interval/ratio data.
Can be distorted by outlying values.
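A minimal sketch of the B&O calculation, which also checks the "balancing point" property that the deviations below the mean cancel those above it. The prices come from TV Data Set 2; the variable names are illustrative.

```python
# Prices of the eight B&O televisions (TV Data Set 2).
prices = [800, 891, 1295, 1451, 580, 1192, 1285, 757]

# Add up all the data values and divide by the number of data points.
mean = sum(prices) / len(prices)

# Deviation of each price from the mean.
deviations = [price - mean for price in prices]
negative = -sum(d for d in deviations if d < 0)  # total distance below the mean
positive = sum(d for d in deviations if d > 0)   # total distance above the mean

print(round(mean, 2), negative, positive)
```

The two totals are equal, which is what makes the mean the balancing point of the data.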
Central Tendency: The Median
The median (Md) is the middle value when the data
are placed in ascending order.
50% of the data values are smaller than the median.
50% of the data values are larger than the median.
With n data points, the median position is (n + 1)/2.

Example: {46, 54, 42, 32, 45}


Put the data in ascending order {32, 42, 45, 46, 54}
Position of median is (5 + 1)/2 = 3; so median = 45.

Don't confuse the median's position with its value.


Central Tendency: Median Cont.
When the number of data points is even,
the Md position will not be a whole number.
Median = average of two middle values.
Example: Price of B&O televisions, in ascending order:
580, 757, 800, 891, 1192, 1285, 1295, 1451.
Position of median: (8 + 1)/2 = 4.5
Half way between 4th and 5th data values.
Median = (891 + 1192) / 2 = 1041.50
Median = 891 + 0.5 × (1192 - 891) = 1041.50
Median not sensitive to outlying values.
Doesn't use all data points; can be misunderstood.
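The position rule (n + 1)/2 can be sketched as a small function that returns both the position and the value, to keep the two apart. The function name is made up for this illustration, and it assumes a non-empty list.

```python
def median_with_position(values):
    """Return (position, median) using the (n + 1) / 2 rule.

    Assumes a non-empty list; positions are counted from 1.
    """
    ordered = sorted(values)              # ascending order first
    position = (len(ordered) + 1) / 2     # e.g. 3.0 or 4.5
    lower = ordered[int(position) - 1]    # value at the whole-number part
    if position.is_integer():
        return position, lower
    upper = ordered[int(position)]        # next value up
    # Half way between the two middle values.
    return position, lower + 0.5 * (upper - lower)

print(median_with_position([46, 54, 42, 32, 45]))
print(median_with_position([800, 891, 1295, 1451, 580, 1192, 1285, 757]))
```

For the five-value example the position is 3 and the median is 45; for the eight B&O prices the position is 4.5 and the median is 1041.5.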
Central Tendency: The Mode
The mode is the value that occurs most frequently.
It is an actual value, not a frequency of occurrence.
B&O sound systems: #1 occurs once, #2 six times, #4 once. Mode = #2.

Useful for nominal (categorical) data, but can also be applied to ordinal and metric variables. But:
Sometimes data are multi-modal {1, 2, 1, 3, 5, 3, 4}
Sometimes no mode at all {1, 2, 4, 6, 8, 7, 5, 3}
Works well for bell-shaped distributions.
Unaffected by outlying values; lacks power.
Slight changes in the data can easily change the mode.
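The three situations above (one mode, several modes, no mode) can be sketched with a counter. Treating "every value occurs once" as "no mode" is a convention chosen for this sketch, and `modes` is a hypothetical helper name.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s), or [] if nothing repeats.

    Convention assumed here: when every distinct value occurs exactly
    once (and there is more than one value), there is no mode.
    """
    counts = Counter(values)
    top = max(counts.values())
    if top == 1 and len(counts) > 1:
        return []
    return sorted(value for value, count in counts.items() if count == top)

# Sound systems of the eight B&O models (TV Data Set 2).
print(modes([1, 2, 2, 2, 4, 2, 2, 2]))
print(modes([1, 2, 1, 3, 5, 3, 4]))     # multi-modal
print(modes([1, 2, 4, 6, 8, 7, 5, 3]))  # no mode at all
```

Note that the function returns values, not counts, in line with the point that the mode is an actual value rather than a frequency of occurrence.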
Comparing Measures of Central Tendency

Criteria                Mode                      Median            Mean
Type of data (scale)    Nominal, Ordinal, Metric  Ordinal, Metric   Metric
Unique                  No                        Yes               Yes
Uses all data (power)   No                        No                Yes
Resistant to outliers   Yes                       Yes               No
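To illustrate the "resistant to outliers" row, the sketch below adds one extreme, entirely hypothetical price of 10000 to the eight B&O prices and compares how far the mean and the median move. It uses Python's standard `statistics` module.

```python
from statistics import mean, median

# Prices of the eight B&O televisions (TV Data Set 2).
prices = [800, 891, 1295, 1451, 580, 1192, 1285, 757]

# Hypothetical outlier, invented for this illustration only.
with_outlier = prices + [10000]

mean_shift = mean(with_outlier) - mean(prices)
median_shift = median(with_outlier) - median(prices)

print(round(mean_shift, 2), median_shift)
```

The mean moves by nearly 1000 while the median moves by about 150, which is the sense in which the median is resistant and the mean is not.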
