
Data Analysis & Data Fluency

Topics of today's discussion

Reading Data Tables to make Conclusions
Measures of Central Tendency: Mean, Median, Mode
Measures of Dispersion: Range, Standard Deviation
Looking at data over a period of time as a trend
Correlation and Causality

Reading Data Tables: Situation 1

Let us assume a city has 4 modern format stores (named Store 1, Store 2, etc.) of a single retail player.
They are of more or less similar size and have similar monthly Sales.
However, the Sales by category differ: for example, one Store might have higher Sales of FMCG and another higher Sales of Staples.
In such a scenario, let us look at the buyers of instant noodles in these 4 stores.

Reading Data Tables: Situation 1

Brands Purchased in each Store: Instant Noodles
(Among buyers of Instant Noodles in each Store)

                                   Store 1   Store 2   Store 3   Store 4
% Buying Maggi variants only         70%       75%       55%       85%
% Buying Yippee, Top Ramen, etc.     20%       20%       35%       10%
% Buying both                        10%        5%       10%        5%

Reading Data Tables: Situation 1

% Contribution of Buyers of Instant Noodles Brands from each Store

                                   Store 1   Store 2   Store 3   Store 4
% Buying Maggi variants only         18%       27%       42%       13%
% Buying Yippee, Top Ramen, etc.     13%       18%       66%        4%
% Buying both                        20%       14%       60%        6%

Reading Data Tables: Situation 1 (Assignment)

Reading the 2 tables, what will you conclude about Instant Noodles sales from the 4 stores?

Can you make some guesses about the differences in the catchment profiles of these stores?

Reading Data Tables: Situation 1 (Hint)

Brands Purchased in each Store: Instant Noodles

                                                    Store 1   Store 2   Store 3   Store 4
Base: Buyers of Instant Noodles in each Store         500       700      1500       300
% Buying Maggi variants only                          70%       75%       55%       85%
% Buying Yippee, Top Ramen, etc.                      20%       20%       35%       10%
% Buying both                                         10%        5%       10%        5%

Reading Data Tables: Situation 1 (Hint)

% Contribution of Buyers of Instant Noodles Brands from each Store

                                   Store 1   Store 2   Store 3   Store 4
% of ALL INSTANT NOODLES BUYERS      17%       23%       50%       10%
% Buying Maggi variants only         18%       27%       42%       13%
% Buying Yippee, Top Ramen, etc.     13%       18%       66%        4%
% Buying both                        20%       14%       60%        6%
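
The link between the two Hint tables can be checked with a short calculation. The sketch below is only an illustration: the variable and function names are made up for this example, and the figures are copied from the tables. It turns each store's base and within-store percentages into buyer counts, then works out what share of each buyer group comes from each store.

```python
# Buyer bases and within-store brand splits, copied from the Hint tables.
bases = {"Store 1": 500, "Store 2": 700, "Store 3": 1500, "Store 4": 300}
within_store_pct = {
    "Maggi only":   {"Store 1": 70, "Store 2": 75, "Store 3": 55, "Store 4": 85},
    "Other brands": {"Store 1": 20, "Store 2": 20, "Store 3": 35, "Store 4": 10},
    "Both":         {"Store 1": 10, "Store 2": 5,  "Store 3": 10, "Store 4": 5},
}


def contribution_by_store(segment_pct, store_bases):
    """Share (in %) of one buyer group that comes from each store."""
    # Convert within-store percentages into actual buyer counts.
    counts = {s: store_bases[s] * pct / 100 for s, pct in segment_pct.items()}
    total = sum(counts.values())
    return {s: round(100 * n / total) for s, n in counts.items()}


# Share of all instant noodles buyers coming from each store.
all_buyers = sum(bases.values())
store_share = {s: round(100 * n / all_buyers) for s, n in bases.items()}

# Share of each buyer group coming from each store.
contribution = {
    segment: contribution_by_store(pct, bases)
    for segment, pct in within_store_pct.items()
}

for segment, shares in contribution.items():
    print(segment, shares)
print("All buyers", store_share)
```

Store 3 has the lowest within-store share of Maggi-only buyers, yet it contributes the most Maggi-only buyers because its base is the largest.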

Reading Data Tables: Situation 1

SOME SIMPLE CONCLUSIONS
Store 3 has by far the largest number of buyers of Instant Noodles as a category; Store 4 has the fewest.
Among their respective buyers, Stores 1, 2 and 4 have a high share (70%+) of solus Maggi buyers (those buying only Maggi), especially Store 4 (85%).
Store 3 has a lower share (55%) of solus Maggi buyers. But, being the largest seller of instant noodles, it contributes the most to Maggi sales as well as to the other brands' sales.

Reading Data Tables: Situation 1

SOME SIMPLE CONCLUSIONS
Same-sized Stores, yet Store 3 has
- higher Instant Noodles sales, and
- a higher % of new brand (Yippee, Smoodles, etc.) sales
So, the catchment profile might be
- younger, with more double-income households, bachelors, etc.
- also, psychographically, more open to trying new brands
- more exposed to media, hence more aware of new brands
Similarly, the Store 4 catchment profile might be just the opposite.

Sales in a Hyper Store: Month-wise, last Quarter

Sales (in Rs. Lakhs):

                      Oct       Nov       Dec    Total Qtr
STAPLES             30.40     28.63     29.05      88.08
F&V                 20.00     20.72     18.69      59.41
FISH & MEAT          9.73     10.68     11.02      31.44
BAKERY               6.39      6.54      9.78      22.72
DAIRY & FROZEN       8.84      8.72      8.22      25.78
FMCG                46.65     47.33     48.55     142.53
LIQUOR              29.68     28.73     39.02      97.43
APPAREL              6.17      5.16      6.51      17.84
E&E                  0.84      0.48      0.66       1.98
HWP                  9.55      9.28     10.06      28.90
COMMON               0.38      0.39      0.43       1.20
TOTAL              168.65    166.66    181.99     517.30

Col%: Each month, what is the contribution of each category?

Sales:

                      Oct       Nov       Dec    Total Qtr
STAPLES             18.0%     17.2%     16.0%      17.0%
F&V                 11.9%     12.4%     10.3%      11.5%
FISH & MEAT          5.8%      6.4%      6.1%       6.1%
BAKERY               3.8%      3.9%      5.4%       4.4%
DAIRY & FROZEN       5.2%      5.2%      4.5%       5.0%
FMCG                27.7%     28.4%     26.7%      27.6%
LIQUOR              17.6%     17.2%     21.4%      18.8%
APPAREL              3.7%      3.1%      3.6%       3.4%
E&E                  0.5%      0.3%      0.4%       0.4%
HWP                  5.7%      5.6%      5.5%       5.6%
COMMON               0.2%      0.2%      0.2%       0.2%

Row%: For each category, what is the contribution from each month?
Turnaround Hyper Store

Sales:

                      Oct       Nov       Dec    Total Qtr
STAPLES             34.5%     32.5%     33.0%     100.0%
F&V                 33.7%     34.9%     31.5%     100.0%
FISH & MEAT         31.0%     34.0%     35.1%     100.0%
BAKERY              28.1%     28.8%     43.1%     100.0%
DAIRY & FROZEN      34.3%     33.8%     31.9%     100.0%
FMCG                32.7%     33.2%     34.1%     100.0%
LIQUOR              30.5%     29.5%     40.1%     100.0%
APPAREL             34.6%     28.9%     36.5%     100.0%
E&E                 42.6%     24.2%     33.3%     100.0%
HWP                 33.1%     32.1%     34.8%     100.0%
COMMON              31.8%     32.4%     35.8%     100.0%
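
The Col% and Row% views are two different ways of dividing the same sales figures. The sketch below is a minimal illustration with variable names invented for this example; the sales values are copied from the Rs. Lakhs table. Because those values are already rounded to two decimals, a few results may differ from the slide figures in the last digit.

```python
# Monthly sales (Rs. lakhs) by category, copied from the Sales table.
months = ["Oct", "Nov", "Dec"]
sales = {
    "STAPLES":        [30.40, 28.63, 29.05],
    "F&V":            [20.00, 20.72, 18.69],
    "FISH & MEAT":    [9.73, 10.68, 11.02],
    "BAKERY":         [6.39, 6.54, 9.78],
    "DAIRY & FROZEN": [8.84, 8.72, 8.22],
    "FMCG":           [46.65, 47.33, 48.55],
    "LIQUOR":         [29.68, 28.73, 39.02],
    "APPAREL":        [6.17, 5.16, 6.51],
    "E&E":            [0.84, 0.48, 0.66],
    "HWP":            [9.55, 9.28, 10.06],
    "COMMON":         [0.38, 0.39, 0.43],
}

# Total store sales in each month (the column totals).
month_totals = [sum(row[i] for row in sales.values()) for i in range(len(months))]

# Col%: within one month, each category's share of that month's total.
col_pct = {
    cat: [round(100 * value / total, 1) for value, total in zip(row, month_totals)]
    for cat, row in sales.items()
}

# Row%: within one category, each month's share of the category's quarter total.
row_pct = {
    cat: [round(100 * value / sum(row), 1) for value in row]
    for cat, row in sales.items()
}

print("Col% STAPLES:", col_pct["STAPLES"])
print("Row% BAKERY: ", row_pct["BAKERY"])
```

Col% answers "how important is this category within the month?", while Row% answers "in which month did this category sell most?". Bakery, for instance, is a small share of any month's sales but did a large part of its quarter in December.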

Measures of Central Tendency

If we have a lot of data and want to say something about it without giving all the data, we need to describe it with a single number.

For example, if a shopkeeper has sold 10 soap bars in a day and someone asks "At what price did you sell the soaps?", the person wants to hear one number. How does the shopkeeper respond?
- Maybe he will give the typical price of a soap bar sold
- Maybe he will give some number that represents the middle of all the prices
- Maybe he will give the most frequent price at which they have been sold
In all cases, he is trying to give a number that somehow represents the centre of all the prices at which the soap bars have been sold (i.e. a Measure of Central Tendency).

Mean, Median and Mode

There are three commonly used Measures of Central Tendency: Arithmetic Mean, Median and Mode.

Let us take the sales of 10 soap bars to explain the Measures of Central Tendency. Suppose the prices (in Rs) of the soap bars sold throughout the day were as follows:
Rs 20, Rs 25, Rs 15, Rs 20, Rs 10, Rs 10, Rs 40, Rs 20, Rs 20, Rs 30

Arithmetic Mean
- Most commonly used Measure of Central Tendency
- Gives the centre of the data by computing the sum of the numbers and dividing it by the count of numbers
- So, in our example, it is (Sum of prices of 10 soap bars)/10, i.e. Rs 210/10 = Rs 21

Mean, Median and Mode

Median
- Gives the middle number from a set of data.
It is computed in two steps:
STEP 1: The numbers are ordered from lowest to highest (or the reverse)
STEP 2:
- if the count of numbers in the set is odd, the single middle-most number is the median
- if the count of numbers in the set is even, the average of the two middle numbers is the median

Mean, Median and Mode

Let us compute the Median price of the soap bars sold.
STEP 1: Prices of 10 soap bars arranged in order (ascending)
Rs 10, Rs 10, Rs 15, Rs 20, Rs 20, Rs 20, Rs 20, Rs 25, Rs 30, Rs 40

STEP 2: Since 10 soap bars were sold (i.e. an even number), there are two middle numbers: the 5th and the 6th ones
Rs 10, Rs 10, Rs 15, Rs 20, [Rs 20, Rs 20,] Rs 20, Rs 25, Rs 30, Rs 40

So the Median will be the mean of Rs 20 and Rs 20, i.e. Rs 20

NOTE: The Median is the middle number (or mid-point) and not the centre of gravity. For example, if the highest price was Rs 400 instead of Rs 40, the Median would still be Rs 20. The Mean would change.
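
The NOTE above can be verified with a few lines of Python. This is a minimal sketch using the standard `statistics` module on the ten soap prices; the list name `with_outlier` is made up for this example.

```python
from statistics import mean, median

# Prices (Rs) of the 10 soap bars sold during the day.
prices = [20, 25, 15, 20, 10, 10, 40, 20, 20, 30]

print(mean(prices))    # sum of prices divided by the number of prices
print(median(prices))  # average of the 5th and 6th values once sorted

# Same data, but the highest price is Rs 400 instead of Rs 40.
with_outlier = [400 if p == 40 else p for p in prices]

print(mean(with_outlier))    # the Mean is pulled up by the single large value
print(median(with_outlier))  # the Median is unchanged
```

With the original prices the Mean is Rs 21 and the Median is Rs 20. With the Rs 400 price the Mean rises to Rs 57 while the Median stays at Rs 20.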

Mean, Median and Mode

Mode
- The most typical number, i.e. the one that shows up the greatest number of times
For example, in the case of the 10 soap bars sold
- One soap each has been sold at the price points of Rs 15, Rs 25, Rs 30 and Rs 40
- Two soaps have been sold at Rs 10 each
- Four soaps have been sold at Rs 20 each

Since the largest number of soaps (four) has been sold at Rs 20, the Mode will be Rs 20.
A set of data can have one or more Modes
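
The counting behind the Mode can be sketched as follows. The soap prices come from the example above; the second list, `two_peaks`, is hypothetical data invented only to show a set with more than one Mode. `statistics.multimode` requires Python 3.8 or later.

```python
from collections import Counter
from statistics import multimode

# Prices (Rs) of the 10 soap bars sold during the day.
prices = [20, 25, 15, 20, 10, 10, 40, 20, 20, 30]

# How many soaps were sold at each price point.
counts = Counter(prices)
print(counts.most_common())

# multimode returns every value that shares the highest count.
modes = multimode(prices)
print(modes)

# Hypothetical data with two equally frequent values, hence two Modes.
two_peaks = [10, 10, 20, 20, 30]
print(multimode(two_peaks))
```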

Mean, Median and Mode

Arithmetic Mean: the most commonly used Measure
However, the Arithmetic Mean should be avoided when the data has outliers
Some very large or very small values in the data can skew the Mean
In such cases, the Median can be a better one-number summary
The Mode could also be used, especially when one value occurs far more frequently than the others in the data set

Exercise 1: Measures of Central Tendency

In a shop selling wrist watches, around 500 wrist watches were sold in a month
- 90% of the watches were sold at prices ranging from Rs 700 to Rs 3000
- 9% of the watches were sold at prices below Rs 700
- 2 very high-end, expensive watches were sold at prices above Rs 2 lakhs during the month

If you are asked for one number to answer the question "At what prices are wrist watches sold from this shop?", which Measure of Central Tendency will you use: Mean, Median or Mode?

Exercise 2: Measures of Central Tendency

In another shop selling wrist watches in a relatively down-market area, around 200 wrist watches were sold in a month.
All the wrist watches were in the price range of Rs 500 to Rs 2000.
If you are asked for one number to answer the question "At what prices are wrist watches sold from this shop?", which Measure of Central Tendency will you use: Mean, Median or Mode?

Exercise 3: Measures of Central Tendency

In a modern format outlet, the sales of the largest SKU of potato wafers on a particular day were as follows:

Brands (selling at Rs X)          Units sold
Lays (at Rs 20)                       43
Bingo (at Rs 20)                      32
Pringles (at Rs 80)
Other local brands (at Rs 20)         31

If you are asked for one number to answer the question "At what price are branded potato wafers sold from this outlet?", which Measure of Central Tendency will you use: Mean, Median or Mode?

Exercise 4: Measures of Central Tendency

In an outlet selling footwear in Kolkata, the sales by Size of footwear during a particular week (out of 350 units of footwear sold) were as follows:

Size of footwear    Units sold
3 or below              30
4                       40
5                       75
6                       50
7                       75
8                       60
9 or above              20

If you are asked for one number to answer the question "What is the Size of footwear of Kolkatans, in general?", which Measure of Central Tendency will you use: Mean, Median or Mode?
Are there any other conclusions that you can draw from this data? Would you want to look at the given data in some other way?

The concept of Weighted Mean

Brands Purchased in each Store: Instant Noodles

                                                   All    Store 1   Store 2   Store 3   Store 4
Base: Buyers of Instant Noodles in each Store     3000      500       700      1500       300
% Buying Maggi variants only                                70%       75%       55%       85%
% Buying Yippee, Top Ramen, etc.                            20%       20%       35%       10%
% Buying both                                               10%        5%       10%        5%

The concept of Weighted Mean

Brands Purchased in each Store: Instant Noodles

                                                   All    Store 1   Store 2   Store 3   Store 4
Base: Buyers of Instant Noodles in each Store     3000      500       700      1500       300
% Buying Maggi variants only                       65%      70%       75%       55%       85%
% Buying Yippee, Top Ramen, etc.                   27%      20%       20%       35%       10%
% Buying both                                       8%      10%        5%       10%        5%
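
The "All" column is a Weighted Mean: each store's percentage counts in proportion to its base of buyers, so it is not the simple average of the four store percentages. A minimal sketch of the calculation for the Maggi-only row follows; the function name `weighted_mean` is illustrative.

```python
# Bases and Maggi-only percentages for Stores 1 to 4, copied from the table.
bases = [500, 700, 1500, 300]
maggi_only_pct = [70, 75, 55, 85]


def weighted_mean(values, weights):
    """Mean of values in which each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Simple mean treats all four stores as equally important.
simple = sum(maggi_only_pct) / len(maggi_only_pct)

# Weighted mean lets Store 3, with 1500 buyers, count the most.
weighted = weighted_mean(maggi_only_pct, bases)

print(f"Simple mean:   {simple:.2f}%")
print(f"Weighted mean: {weighted:.2f}%")
```

The simple mean is 71.25%, while the weighted mean is about 65%, matching the "All" column. The gap arises because Store 3, the largest store, has the lowest Maggi-only share.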

Measures of Dispersion

Often, a single number (Mean, Median or Mode) is not enough to draw conclusions. Let us look at the following data:

From 3 outlets selling wrist watches, the average price of watches sold is as follows:

                               Outlet 1   Outlet 2   Outlet 3
Mean price of watches sold      Rs 890     Rs 886     Rs 891

Looking at the Arithmetic Mean, can we conclude that the 3 outlets sell similarly priced watches and have a similar profile of customers?

The answer is NO!
So, we need some Measures to know how the data is spread around the Arithmetic Mean.

Range, Variance and Standard Deviation

To understand how far the data is spread from the Mean, three measures are used: Range, Variance and Standard Deviation
Range is the difference between the highest value and the lowest value in the data set. It is a simple measure and, because it depends only on the two extreme values, it quickly flags the presence of outliers
Variance is the average of the squared differences between each data point and the Mean
Standard Deviation, very simply put, is the Measure that shows by how much the individual members of a data set differ from the Mean value of the data. It is the square root of the Variance, so it is in the same units as the data
So, high Standard Deviation (or SD): individual data points are very different from, or spread far apart from, the Mean
Low SD: individual data points are close to the Mean
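
As a worked illustration, the three measures can be computed for the ten soap bar prices used earlier. The sketch assumes the population formulas (dividing by the number of data points), since the slides do not say whether a population or sample formula is intended.

```python
from statistics import pstdev, pvariance

# Prices (Rs) of the 10 soap bars from the Mean, Median and Mode example.
prices = [20, 25, 15, 20, 10, 10, 40, 20, 20, 30]

# Range: highest value minus lowest value.
price_range = max(prices) - min(prices)

# Variance: average squared difference from the Mean (population formula).
variance = pvariance(prices)

# Standard Deviation: square root of the Variance, in the same units (Rs).
std_dev = pstdev(prices)

print(f"Range: Rs {price_range}")
print(f"Variance: {variance}")
print(f"Standard Deviation: Rs {std_dev:.2f}")
```

For these prices the Range is Rs 30, the Variance is 74 and the Standard Deviation is about Rs 8.60.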

Example: Measures of Dispersion

Let us look at the 3 outlets selling wrist watches once again, now with the Measures of Dispersion

                               Outlet 1     Outlet 2     Outlet 3
Mean price of watches sold      Rs 890       Rs 886       Rs 891
Range                           Rs 2500      Rs 99,500    Rs 7,500
Standard Deviation              Rs 22.40     Rs 255.90    Rs 98.20

The kinds of Sales happening in the 3 outlets are completely different.

Outlet 1 caters to a very homogeneous segment, all buying watches within a narrow price range.

Outlet 2 has Sales at a wide range of price points (and may have outliers too). It caters to diverse demographic profiles: some very affluent ones and some from middle/lower income groups.

Assignment 2
In the catchment area of a store, 1000 people were asked some questions in a survey. All 1000 people shop for day-to-day household items themselves, belong to the age group of 21-45 yrs, and are housewives or single earning members.
They were asked to agree or disagree, on a five-point scale, with the statement "I love buying day-to-day items from modern format outlets rather than going to traditional Kirana stores":

Of the 1000 people responding to this question, the mean score obtained was 3.1 out of 5. What can you conclude from this?

Assignment 2
If you are now given the following distribution:

What would you conclude?
Can you make some hypotheses on the sub-groups of people giving this opinion?
Would you want the data to be analyzed by some other sub-groups?

Assignment 2
We may need to look at an output like:

to check: is the polarization of findings due to different attitudes among different age groups and different stages of life?

Data over a period of time as a trend

Often, data has to be looked at over a period of time, as a trend (e.g. Day-to-day Sales Reports, Weekly Reports or Monthly Sales Reports).

Following is the monthly sales data (for 10 months) of consumer durables from a store:

MONTH    SALES (No. of Units)
  1           110
  2           100
  3           120
  4           140
  5           170
  6           150
  7           160
  8           190
  9           200
 10           190

Data over a period of time as a trend

Since there can be a lot of month-on-month fluctuation in the data, it can be difficult to see the direction in which the data is moving.
To observe the medium- or long-term trend, one way of looking at the data is to plot these Sales graphically as a Scatter Plot and then draw a trend-line.
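
The slides do not say how the trend-line is drawn. One common choice, assumed here, is an ordinary least-squares straight line. The sketch below fits such a line to the 10 months of sales; the function name `fit_trend_line` is made up for this example.

```python
# Monthly unit sales for months 1 to 10, from the table above.
months = list(range(1, 11))
sales = [110, 100, 120, 140, 170, 150, 160, 190, 200, 190]


def fit_trend_line(xs, ys):
    """Least-squares straight line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_trend_line(months, sales)

# Points on the trend-line, one per month.
trend = [intercept + slope * m for m in months]

print(f"Sales change by about {slope:.1f} units per month on average")
```

A positive slope (roughly 11 units per month for this data) indicates an upward trend despite the month-on-month dips.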

Data over a period of time as a trend

The other way of smoothing the fluctuations and looking at the real picture is by computing Moving Averages. The following table shows 3-monthly moving averages:

MONTH    SALES    SALES (Moving Avg)
  1       110
  2       100
  3       120          110
  4       140          120
  5       170          143
  6       150          153
  7       160          160
  8       190          167
  9       200          183
 10       190          193
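
The moving-average column can be reproduced as follows. This is a minimal sketch: each value is the average of the current month and the two months before it, which is why the first two months have no moving average. The function name `moving_average` is illustrative.

```python
# Monthly unit sales for months 1 to 10.
sales = [110, 100, 120, 140, 170, 150, 160, 190, 200, 190]


def moving_average(values, window=3):
    """Trailing moving average; the first window - 1 entries are None."""
    result = [None] * (window - 1)
    for end in range(window, len(values) + 1):
        result.append(sum(values[end - window:end]) / window)
    return result


smoothed = moving_average(sales)

# Round to whole units, as in the table.
rounded = [None if v is None else round(v) for v in smoothed]

for month, (actual, avg) in enumerate(zip(sales, rounded), start=1):
    print(month, actual, "" if avg is None else avg)
```

The actual sales dip in months 2, 6 and 10, but the moving average rises every month, which makes the underlying upward trend easier to see.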

Data over a period of time as a trend


The Moving Average shows that there is a clear and consistent
upward trend in Sales over this period of 10 months.
This is evident from the following graph:

Correlation between two variables

When we look at data for two or more variables, we sometimes see that the data for two variables move in the same direction.
For example, if we record the heights and weights of a large number of people, we would observe that many taller people also weigh more and, similarly, many shorter people weigh less.

Also, there can be two variables that move mostly in opposite directions.
For example, the Power of the engine and the Mileage of a car: mostly, cars with higher Power have lower Mileage.

We use the term Correlation to describe the strength of linear association between two variables, and a coefficient r is used to measure this strength.
The value of r can range between -1 and +1.

Correlation between two variables

r value closer to +1: a strong positive association
r value closer to -1: a strong negative association
r value closer to 0: weak association between the two variables
In Retail Sales data too, it will be interesting to observe, by looking at long-term Sales data, the Correlation of Sales of certain categories over time
- Is there a high positive Correlation between Sales of Shampoos and Conditioners?
- Is there a negative Correlation between Sales of Shower Gels and Soaps?
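
The coefficient r referred to above is usually the Pearson correlation coefficient, which can be computed as sketched below. All four sales series are hypothetical numbers invented purely to illustrate a positive and a negative correlation; they are not from the store data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_movement = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_movement / sqrt(spread_x * spread_y)


# Hypothetical monthly unit sales over six months (illustration only).
shampoo = [120, 135, 150, 160, 175, 190]
conditioner = [60, 66, 77, 80, 85, 97]
shower_gel = [40, 45, 55, 60, 70, 80]
soap = [200, 195, 185, 180, 170, 160]

r_pos = pearson_r(shampoo, conditioner)  # both rise together: r near +1
r_neg = pearson_r(shower_gel, soap)      # one rises as the other falls: r near -1

print(f"Shampoo vs Conditioner: r = {r_pos:.2f}")
print(f"Shower Gel vs Soap:     r = {r_neg:.2f}")
```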

Correlation does not imply Causality

However, one must note: CORRELATION DOES NOT IMPLY CAUSALITY
Meaning, a high positive Correlation between A and B does not mean that A causes B or that A leads to B
e.g. Brand Imagery vs Brand Usage:
there is generally a high positive correlation, but does it mean that an increase in Brand Imagery would lead to an increase in Brand Usage?
NO!

Assignment 3: Correlations
Correlations between the Sales of product categories:

What can you conclude from this?

Assignment 3: Correlations
Correlations between the Sales of product categories:

CORRELATIONS      STAPLES     F&V    FISH & M   BAKERY     D&F     FMCG    LIQUOR
STAPLES             1.00     0.63      0.62      0.37     0.68     0.96    -0.18
F&V                 0.63     1.00      0.39     -0.24     0.83     0.73    -0.29
FISH & MEAT         0.62     0.39      1.00      0.61     0.15     0.61     0.15
BAKERY              0.37    -0.24      0.61      1.00    -0.32     0.31     0.56
DAIRY & FROZEN      0.68     0.83      0.15     -0.32     1.00     0.71    -0.39
FMCG                0.96     0.73      0.61      0.31     0.71     1.00    -0.06
LIQUOR             -0.18    -0.29      0.15      0.56    -0.39    -0.06     1.00

Correlation does not imply Causality

Let us look at the following article:
EATING BREAKFAST MAY BEAT TEEN OBESITY
"In the study, published in Pediatrics, researchers analyzed the dietary and weight
patterns of a group of 2,216 adolescents over a five-year period (1998-1999 to 2003-2004) from public schools in Minneapolis-St. Paul, Minn.
The researchers write that teens who ate breakfast regularly had a lower percentage of
total calories from saturated fat and ate more fiber and carbohydrates than those who
skipped breakfast. In addition, regular breakfast eaters seemed more physically active
than breakfast skippers.
Over time, researchers found teens who regularly ate breakfast tended to gain less
weight and had a lower body mass index than breakfast skippers."
[Source: WebMD Health News, March 2008]
From the data and argument given in the article, is it fair to conclude that "not eating
breakfast leads to obesity" or "eating breakfast reduces the chance of obesity"? Why?

Open to Discussion

Correlation does not imply Causality

The title, and part of the content, tries to say:
Eating breakfast → No Obesity, Breakfast Skipping → Obesity
Also, Breakfast eating → Physically active
Is it really so? Or is it that these two things just go together?
Maybe it's the other way round: people who have high body fat are less likely to get hungry in the morning, so Obesity → Breakfast Skipping!
Or maybe Physical Activity → Breakfast (as you are hungry) and Physical Activity → No Obesity
So, a high Correlation by no means establishes Causality


THANK YOU!
