
Presenting Data

Week 2

Objectives
On completion of this module you should be able to:
- produce a stem-and-leaf plot (by hand and using Excel/PHStat2)
- construct a frequency distribution (by hand and using Excel/PHStat2)
- plot a histogram, ogive and scatterplot (by hand and using Excel/PHStat2)
- graph a bar chart, pie chart and grouped (side-by-side) bar chart (by hand and using Excel/PHStat2)

Objectives
On completion of this module you should be able to:
- interpret the data presentations listed above, and apply the results and conclusions in real-world examples
- discover and describe common graphical errors, and explain how to overcome these.

Example 2-1
The following data represent the actual weight of potato chips found in bags labelled 50 grams. The manufacturer aims to overfill the bags by 5 grams to allow for settling and dehydrating of the chips prior to sale. The results of fill weights in a sample of 20 consecutive 50-gram bags are listed below (reading from left to right in the order of being filled):
59.4 56.8 56.0 57.9 59.2 51.7 57.5 54.8 52.6 51.5 51.6 55.7 53.7 54.1 59.6 52.4 55.6 54.4 50.2 56.1

(a) Stem-and-leaf
First create an ordered array (order data from smallest to largest).
50.2 51.5 51.6 51.7 52.4 52.6 53.7 54.1 54.4 54.8
55.6 55.7 56.0 56.1 56.8 57.5 57.9 59.2 59.4 59.6

Choose the stems. It is probably easiest to use the first two digits: 50, 51, 52, 53, 54, ...
The leaves will then be the digits after the decimal point: 2, 5, 6, 7, 4, ...

Stem-and-leaf
Write the stems down the left-hand side:
50
51
52
53
54
55
56
57
58
59

Stem-and-leaf
First data point is 50.2, so add 2 after 50.
50 | 2
51 |
52 |
53 |
54 |
55 |
56 |
57 |
58 |
59 |

Stem-and-leaf
Next data point is 51.5, so add 5 after 51.
50 | 2
51 | 5
52 |
53 |
54 |
55 |
56 |
57 |
58 |
59 |

Stem-and-leaf
Continue until all the data is added.
50 | 2
51 | 567
52 | 46
53 | 7
54 | 148
55 | 67
56 | 018
57 | 59
58 |
59 | 246

Stem-and-leaf display using PHStat2


Stem-and-Leaf Display
Stem unit: 1
50 | 2
51 | 567
52 | 46
53 | 7
54 | 148
55 | 67
56 | 018
57 | 59
58 |
59 | 246
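The hand method above can also be sketched in a few lines of Python. This is only an illustration, not a replacement for PHStat2: the function name `stem_and_leaf` is made up for this example, and the values are those of the ordered array in part (a).

```python
from collections import defaultdict

# Fill weights (grams), taken from the ordered array in part (a).
weights = [50.2, 51.5, 51.6, 51.7, 52.4, 52.6, 53.7, 54.1, 54.4, 54.8,
           55.6, 55.7, 56.0, 56.1, 56.8, 57.5, 57.9, 59.2, 59.4, 59.6]

def stem_and_leaf(values):
    """Return {stem: [leaves]} with stem = whole grams and leaf = tenths of a gram."""
    leaves = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # e.g. 51.5 -> 515, avoids floating-point noise
        stem, leaf = divmod(tenths, 10)   # 515 -> stem 51, leaf 5
        leaves[stem].append(leaf)
    # Keep empty stems (58 here) so that gaps in the data stay visible.
    return {s: leaves.get(s, []) for s in range(min(leaves), max(leaves) + 1)}

display = stem_and_leaf(weights)
for stem, row in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in row)}")
```

Sorting first means the leaves on each stem come out in increasing order, which is how PHStat2 lists them.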

Choosing stems and leaves for messy data


Often there are many possible ways to choose the stems. Let's look at some examples.
Data set 1: 0.0149, 0.9832, 0.2532, 0.4501, 0.7019, ...
One suggestion is to round the numbers to 2 decimal places: 0.01, 0.98, 0.25, 0.45, 0.70, ...
Use the 1st digit after the decimal point as the stem (0, 9, 2, 4, 7, ...) and the 2nd as the leaf (1, 8, 5, 5, 0, ...).

Choosing stems and leaves for messy data


Data set 2: 394.235, 388.583, 392.891, 393.998, 397.852, ...
Round the numbers to 1 decimal place: 394.2, 388.6, 392.9, 394.0, 397.9, ...
Notice how rounding has affected these numbers (e.g. 393.998 rounds to 394.0).
Use the 1st three digits as the stems (394, 388, 392, 394, 397, ...) and the 1st digit after the decimal point as the leaves (2, 6, 9, 0, 9, ...).

Choosing stems and leaves for messy data


Data set 3: 190653, 121987, 154028, 161923, ...
Round the numbers to 3 significant figures: 191000, 122000, 154000, 162000, ...
Use the 1st two digits as the stems (19, 12, 15, 16, ...) and the 3rd digits as the leaves (1, 2, 4, 2, ...).
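All three data sets follow the same recipe: round to the place value chosen for the leaf, then split off the last digit. A minimal Python sketch of that recipe is shown here; the helper name `split_stem_leaf` and its `leaf_unit` parameter are assumptions made for this illustration.

```python
def split_stem_leaf(value, leaf_unit):
    """Round value to the nearest leaf_unit, then split it into (stem, leaf).

    leaf_unit is the place value of the leaf digit:
    0.01 for data set 1, 0.1 for data set 2, 1000 for data set 3.
    """
    scaled = round(value / leaf_unit)   # e.g. 393.998 / 0.1 -> 3940
    return divmod(scaled, 10)           # 3940 -> (394, 0)

data_set_1 = [0.0149, 0.9832, 0.2532, 0.4501, 0.7019]
data_set_2 = [394.235, 388.583, 392.891, 393.998, 397.852]
data_set_3 = [190653, 121987, 154028, 161923]

pairs_1 = [split_stem_leaf(x, 0.01) for x in data_set_1]
pairs_2 = [split_stem_leaf(x, 0.1) for x in data_set_2]
pairs_3 = [split_stem_leaf(x, 1000) for x in data_set_3]

print(pairs_1)
print(pairs_2)
print(pairs_3)
```

Note that the stem for data set 1 is the single digit after the decimal point, so 0.01 has stem 0 and leaf 1.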

(b) Construct the frequency distribution

Data range: 59.6 - 50.2 = 9.4
This is a small data set so we choose a small number of classes: 8.
Width of interval: 9.4 / 8 = 1.175

It is easier to round this number to 1.2 (since the data is given to 1 decimal place). Read the information on class and boundary points in the study guide (p. 2-7).
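The class-width arithmetic can be checked with a short Python sketch. The variable names are chosen for this example only; the values are those used above (8 classes, width rounded up to 1.2).

```python
smallest, largest = 50.2, 59.6
n_classes = 8

data_range = round(largest - smallest, 1)   # 9.4 (rounded to remove floating-point noise)
raw_width = data_range / n_classes          # 1.175
class_width = 1.2                           # rounded up to 1 decimal place, as in the slides

# Class boundaries: start at the smallest value and step by the class width.
boundaries = [round(smallest + i * class_width, 1) for i in range(n_classes + 1)]

print(data_range, raw_width)
print(boundaries)
```

Rounding the width up rather than down matters: 8 classes of width 1.2 span 9.6, which covers the range of 9.4, so the last boundary (59.8) lies above the largest value.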

(b) Frequency distribution

Now construct a table and tally the data:

Weight of bag (g)        Tally   Number of bags
50.2 to less than 51.4   /       1
51.4 to less than 52.6   ////    4
52.6 to less than 53.8   //      2
53.8 to less than 55.0   ///     3
55.0 to less than 56.2   ////    4
56.2 to less than 57.4   /       1
57.4 to less than 58.6   //      2
58.6 to less than 59.8   ///     3

(b) Frequency distribution & percentage distribution


Weight of bag (g)        Number of bags   Percentage of bags
50.2 to less than 51.4   1                1/20 × 100 = 5%
51.4 to less than 52.6   4                4/20 × 100 = 20%
52.6 to less than 53.8   2                2/20 × 100 = 10%
53.8 to less than 55.0   3                3/20 × 100 = 15%
55.0 to less than 56.2   4                4/20 × 100 = 20%
56.2 to less than 57.4   1                1/20 × 100 = 5%
57.4 to less than 58.6   2                2/20 × 100 = 10%
58.6 to less than 59.8   3                3/20 × 100 = 15%
Total                    20               100%
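The tally and the percentage column can be reproduced with a short Python sketch. Working in tenths of a gram is a choice made for this example (it keeps boundary values such as 52.6 in the correct class despite floating-point rounding); it is not something the slides prescribe.

```python
# Fill weights (grams), taken from the ordered array in part (a).
weights = [50.2, 51.5, 51.6, 51.7, 52.4, 52.6, 53.7, 54.1, 54.4, 54.8,
           55.6, 55.7, 56.0, 56.1, 56.8, 57.5, 57.9, 59.2, 59.4, 59.6]

START_TENTHS = 502   # first class starts at 50.2 g
WIDTH_TENTHS = 12    # class width 1.2 g
N_CLASSES = 8

counts = [0] * N_CLASSES
for w in weights:
    # "x to less than y": a value on a boundary falls into the higher class.
    index = (round(w * 10) - START_TENTHS) // WIDTH_TENTHS
    counts[index] += 1

percentages = [100 * c / len(weights) for c in counts]

for i, (c, p) in enumerate(zip(counts, percentages)):
    lower = (START_TENTHS + i * WIDTH_TENTHS) / 10
    upper = (START_TENTHS + (i + 1) * WIDTH_TENTHS) / 10
    print(f"{lower:.1f} to less than {upper:.1f}: {c} bags, {p:.0f}%")
```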

(c) Frequency histogram


[Figure: Histogram; Frequency (0 to 5) on the y-axis, Bins (class boundaries 50.2, 51.4, 52.6, 53.8, 55, 56.2, 57.4, 58.6, 59.8) on the x-axis]

(c) Frequency histogram


[Figure: Histogram of Bag Weights; Frequency (0 to 4.5) on the y-axis, class Midpoints (50.8, 52, 53.2, 54.4, 55.6, 56.8, 58, 59.2) on the x-axis]

(c) Frequency histogram

We will discuss how to produce histograms using Excel and PHStat2 during workshops. Instructions are in the text and Excel Handbook sections included in the text. Make sure you can produce histograms (and other graphs) by hand as well!!!


(d) Percentage distribution
[Figure: Percentage Polygon; percentage of bags (0% to 25%) on the y-axis, class midpoints (50.8, 52, 53.2, 54.4, 55.6, 56.8, 58, 59.2) on the x-axis]

(e) Cumulative percentage distribution


Weight of bag (g)        Percentage of bags   Cumulative percentage
50.2 to less than 51.4   5                    5
51.4 to less than 52.6   20                   25
52.6 to less than 53.8   10                   35
53.8 to less than 55.0   15                   50
55.0 to less than 56.2   20                   70
56.2 to less than 57.4   5                    75
57.4 to less than 58.6   10                   85
58.6 to less than 59.8   15                   100
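The cumulative column is a running total of the percentage column. A minimal Python sketch follows; pairing each cumulative percentage with the upper boundary of its class, and starting the curve at 0% at the lowest boundary, is an assumption about how the ogive points are laid out.

```python
from itertools import accumulate

percentages = [5, 20, 10, 15, 20, 5, 10, 15]
upper_boundaries = [51.4, 52.6, 53.8, 55.0, 56.2, 57.4, 58.6, 59.8]

# Running total: each entry is the percentage of bags below that class's upper boundary.
cumulative = list(accumulate(percentages))

# Points for the ogive: 0% at the lowest boundary, then one point per class.
ogive_points = [(50.2, 0)] + list(zip(upper_boundaries, cumulative))

print(cumulative)
print(ogive_points)
```

Reading a point such as (55.0, 50) gives a direct statement about the data: half of the bags weigh less than 55.0 grams.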

(f) Cumulative percentage polygon (ogive)


[Figure: Cumulative Percentage Polygon; cumulative percentage (0% to 120%) on the y-axis, class boundaries (50.19, 51.39, 52.59, 53.79, 54.99, 56.19, 57.39, 58.59, 59.79) on the x-axis]

Solution 2-1
(g) On the basis of the results of (a) through (f), does there appear to be any concentration of the bag weights around specific values?
There are no obvious outliers and no obvious patterns, and the data seem fairly evenly distributed from about 50 to about 60 grams.

Solution 2-1
(h) If you had to make a prediction of the weight of potato chips in the next bag, what would you predict? Why?
The best prediction would be somewhere around the middle of the data (because there is no obvious trend or pattern): we could predict about 55 grams.
Note: we will learn how to make more accurate forecasts later in the course.

Example 2-2
In recent years, the cost of holiday accommodation on a particular island has been increasing. There was, however, a reduction as a reaction to reduced air travel in the aftermath of the attacks of September 11, 2001. Since then, rising fuel costs have increased the cost of commercial flights and so further discouraged travel to the island, but despite this, the cost of accommodation has continued to increase.

Example 2-2
This data represents the cost of a double room for one night's accommodation on the island for the years 1995 to 2006.

Year   Cost of double room ($)
1995   100
1996   120
1997   130
1998   145
1999   170
2000   230
2001   200
2002   195
2003   180
2004   185
2005   190
2006   205

(a) Set up a scatter diagram with cost of the double room on the y-axis and year on the x-axis.

(a) Scatterplot
[Figure: scatterplot titled "Cost of double room"; Cost ($), 0 to 250, on the y-axis and Year, 1994 to 2008, on the x-axis]

Time series plot

Time series data is data that is recorded at regular time intervals (in our example it was years). A time series plot has time on the x-axis and connects the data with straight lines. Since this particular example records the data annually (i.e. at regular intervals), a time series plot is more appropriate than a scatterplot.

Time series plot


[Figure: time series plot titled "Cost of double room"; Cost ($), 0 to 250, on the y-axis and Year, 1994 to 2008, on the x-axis, with the points joined by straight lines]

(b) Patterns in the data

There is a clear upward trend in room cost from 1995 to 2000. Around the time of the September 11 attacks, the cost decreases in each of the next three years (2001 to 2003). From 2004, the cost begins to increase again, but does not (yet) return to the peak reached before September 11.
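The patterns described in (b) can be checked by computing the year-on-year changes. This Python sketch assumes the costs pair with the years 1995 to 2006 in the order in which they are listed in the table.

```python
years = list(range(1995, 2007))
costs = [100, 120, 130, 145, 170, 230, 200, 195, 180, 185, 190, 205]

# Change in cost compared with the previous year, keyed by the later year.
changes = {year: now - before
           for year, before, now in zip(years[1:], costs, costs[1:])}

peak_year = years[costs.index(max(costs))]
falling_years = [year for year, change in changes.items() if change < 0]

for year, change in changes.items():
    print(f"{year}: {change:+d}")
print("Peak year:", peak_year)
print("Years in which the cost fell:", falling_years)
```

A table of differences like this is a useful companion to the time series plot: the sign of each change shows where the trend reverses.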

Example 2-3
A DVD hire company deals with a number of complaints regarding their rental DVDs. The number of times each complaint occurred is given in the table.

Complaint                 Frequency
Scratched disc            125
Dirty disc                116
Cracked disc              21
Wrong DVD                 54
Too expensive             39
Coarse language           26
Explicit content          41
Boring                    29
Too violent               18
Too soppy                 27
Not funny                 33
Rental period too short   14
Bad movie                 12
Rude staff                9
Store closed              4

(a) Construct a bar chart

Normally word categories (as in this example) are listed up the y-axis, and number categories (for example years, months, pay classification scales etc.) are listed across the x-axis. Make sure you are confident preparing a bar chart by hand! Remember to always label the axes and give every graph a title!

(a) Bar chart


[Figure: Bar Chart; Complaint categories listed alphabetically up the y-axis (Bad movie at the bottom, Wrong DVD at the top), frequency from 0 to 140 on the x-axis]

(b) Construct a pie chart


[Figure: Pie Chart of the 15 complaint categories, each slice labelled with its category name and percentage of complaints; several labels overlap]

(b) Pie chart

The default view of this pie chart is difficult to read! You'll often have to work with default graphs to improve their look (especially for assignments!). Sometimes just resizing the graph can help!

(b) Pie chart


To produce a pie chart by hand you need a protractor (to measure degrees). In the exam you will only get easy category sizes (e.g. multiples of 45°). To calculate the degrees for each category, divide its frequency by the total frequency (568) and multiply by 360:

Complaint        Frequency   %                      Degrees
Scratched disc   125         125/568 × 100 = 22%    125/568 × 360 = 79
Dirty disc       116         116/568 × 100 = 20%    116/568 × 360 = 74
Cracked disc     21          21/568 × 100 = 4%      21/568 × 360 = 13
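The percentage and angle for every slice can be computed in the same way. In this Python sketch the total is obtained by summing the frequencies listed in the complaint table; the helper name `pie_share` is made up for the example.

```python
complaints = {
    "Scratched disc": 125, "Dirty disc": 116, "Cracked disc": 21,
    "Wrong DVD": 54, "Too expensive": 39, "Coarse language": 26,
    "Explicit content": 41, "Boring": 29, "Too violent": 18,
    "Too soppy": 27, "Not funny": 33, "Rental period too short": 14,
    "Bad movie": 12, "Rude staff": 9, "Store closed": 4,
}

total = sum(complaints.values())

def pie_share(frequency, total):
    """Return (percentage, degrees) for one slice, rounded to whole numbers."""
    return round(100 * frequency / total), round(360 * frequency / total)

slices = {name: pie_share(freq, total) for name, freq in complaints.items()}

print("Total complaints:", total)
for name, (percent, degrees) in slices.items():
    print(f"{name}: {percent}%, {degrees} degrees")
```

Because each angle is rounded separately, the rounded angles need not add up to exactly 360; when drawing by hand, any small discrepancy is usually absorbed into the last slice.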

(c) Pareto diagram

[Figure: Pareto Diagram; Complaint categories on the x-axis ordered from most to least frequent (Scratched disc, Dirty disc, Wrong DVD, Explicit content, Too expensive, Not funny, Boring, Too soppy, Coarse language, Cracked disc, Too violent, Rental period too short, Bad movie, Rude staff, Store closed), percentage of complaints (0% to 25%) on the left axis and cumulative percentage (0% to 100%) on the right axis]

(c) Pareto diagram

Often Pareto charts have both vertical axes with the same scale. Excel and PHStat2 do not do this easily. On the following slide, the left-hand axis allows for the total frequency value (568), and this lines up exactly with 100% of the cumulative percentage on the right-hand side. This graph also groups the very small categories (in this case only the last category), calling them "Other".

(c) Pareto diagram


[Figure: Pareto Chart of Complaint; Count (0 to 600) on the left axis, Percent (0 to 100) on the right axis, Complaint categories on the x-axis]

Complaint                 Count   Percent   Cum %
Scratched disc            125     22        22
Dirty disc                116     20        42
Wrong DVD                 54      10        52
Explicit content          41      7         59
Too expensive             39      7         66
Not funny                 33      6         72
Boring                    29      5         77
Too soppy                 27      5         82
Coarse language           26      5         86
Cracked disc              21      4         90
Too violent               18      3         93
Rental period too short   14      2         96
Bad movie                 12      2         98
Rude staff                9       2         99
Other                     4       1         100
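The Count, Percent and Cum % columns of a Pareto chart can be sketched in Python as follows. This is an illustration of the calculation only, not the procedure any particular package uses, and it leaves "Store closed" under its own name rather than relabelling it "Other".

```python
from itertools import accumulate

complaints = {
    "Scratched disc": 125, "Dirty disc": 116, "Cracked disc": 21,
    "Wrong DVD": 54, "Too expensive": 39, "Coarse language": 26,
    "Explicit content": 41, "Boring": 29, "Too violent": 18,
    "Too soppy": 27, "Not funny": 33, "Rental period too short": 14,
    "Bad movie": 12, "Rude staff": 9, "Store closed": 4,
}

# A Pareto chart orders the categories from most to least frequent.
ordered = sorted(complaints.items(), key=lambda item: item[1], reverse=True)
total = sum(complaints.values())
running = list(accumulate(count for _, count in ordered))

pareto_table = [
    (name, count, round(100 * count / total), round(100 * cum / total))
    for (name, count), cum in zip(ordered, running)
]

for name, count, percent, cum_percent in pareto_table:
    print(f"{name:<25}{count:>5}{percent:>5}{cum_percent:>5}")
```

The cumulative percentage is computed from the running count and rounded once, rather than by adding the already rounded percentages, so it always finishes at exactly 100.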

(d) Which graphical method do you think is best to portray this data?

Pie chart: too cramped with so many categories. The similar-sized segments are difficult to compare. In some views, the category labels overlap each other.
The Pareto chart is preferred over the bar chart since it orders the categories from largest to smallest, includes the cumulative percentage polygon and makes it easy to see the most common complaints.

(e) Conclusions about most common complaints

The two most common complaints are scratched and dirty discs (22% and 20% respectively, or 42% of complaints in total). The third most common is wrong DVD (10%).

Additional Example: Grouped bar chart


Given the following two-way cross-classification table, construct a side-by-side bar chart comparing men and women for each of the three categories on the vertical axis. Discuss the resulting graph.
                    Men   Women
Junior accountant   40    35
Accountant          50    30
Senior accountant   25    5
Total               115   70

Grouped bar chart


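[Figure: grouped bar chart titled "Gender and job position"; job positions on the vertical axis (Accountant, Junior accountant, Senior accountant, from bottom to top), number of people (up to 60) on the horizontal axis, with separate bars for Male and Female]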

Grouped bar chart

Because there are clearly more men in each of the three job positions (junior accountant, accountant and senior accountant), it is difficult to comment on the ratio of men to women in each class. Note that PHStat2 has changed the order of the three categories (to alphabetical order from bottom to top). It seems as if the relative number of women drops as the job position becomes more senior. It might be better to use relative frequencies to compare gender differences.
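The suggestion to use relative frequencies can be sketched in Python. The senior accountant count for women (5) is an assumption here: it is the value implied by the women's total of 70 together with the other two counts (35 and 30).

```python
men = {"Junior accountant": 40, "Accountant": 50, "Senior accountant": 25}
# Assumption: 5 senior accountants, implied by the row total of 70 (35 + 30 + 5).
women = {"Junior accountant": 35, "Accountant": 30, "Senior accountant": 5}

# Proportion of women within each job position (relative frequency by position).
share_of_women = {
    position: women[position] / (men[position] + women[position])
    for position in men
}

for position, share in share_of_women.items():
    print(f"{position}: {share:.1%} women")
```

Comparing proportions within each position removes the effect of the different group sizes, which is what makes the raw grouped bar chart hard to interpret.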

Features of graphical data


Basic features of an ideal graph include:
- Showing the data
- Getting the viewer to focus on the substance of the graph rather than on how the graph was developed
- Avoiding distortion
- Encouraging comparisons of data
- Serving a clear purpose
- Being integrated with the statistical and verbal descriptions of the graph
Source: Levine et al., 2005.

Principles of graphical excellence


Graphical excellence:
- is a well-designed presentation of data that provides substance, statistics, and design.
- communicates complex ideas with clarity, precision, and efficiency.
- gives the viewer the largest number of ideas in the shortest time with the least ink.
- almost always involves several dimensions.
- requires telling the truth about the data.
Source: Levine et al., 2005.

Data-Ink Ratio

The data-ink ratio is the proportion of the graphic's ink that is devoted to the non-redundant display of data information.
Data-ink ratio = data-ink / total ink used to print the graphic
Aim: maximise the proportion of ink used in a graph that is devoted to data.
Source: Levine et al., 2005.

Graphical excellence

Chartjunk: decoration that is non-data-ink or redundant data-ink.
Lie factor: the ratio of the size of the effect shown in the graph to the size of the effect in the data.
The aim is to reduce chartjunk and to keep the lie factor close to 1! We will discuss examples in more detail in tutorials, so it is important to be there!
Source: Levine et al., 2005.
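As a rough illustration of the lie factor, the Python sketch below measures each "size of effect" as a relative change between two values. That choice follows Tufte's usual formulation and is an assumption here, since the slide gives only the ratio; the numbers in the example are hypothetical.

```python
def relative_change(first, second):
    """Size of effect measured as the change relative to the first value."""
    return abs(second - first) / abs(first)

def lie_factor(graph_first, graph_second, data_first, data_second):
    """Size of the effect shown in the graph divided by the size of the effect in the data."""
    return relative_change(graph_first, graph_second) / relative_change(data_first, data_second)

# Hypothetical example: the data rise from 100 to 150 (a 50% increase),
# but the bars are drawn 2 cm and 6 cm tall (a 200% increase).
example = lie_factor(graph_first=2, graph_second=6, data_first=100, data_second=150)
print(example)
```

A value of 1 means the graph shows the effect at its true size; in this hypothetical case the graph exaggerates the change by a factor of 4.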

After the lecture each week

- Review the lecture material
- Complete all readings
- Complete all of the recommended problems (listed in the SG) from the textbook
- Complete at least some of the additional problems
- Consider (briefly) the discussion points prior to tutorials
