Intro To Data Analysis Project

Data Analysis & Data Fluency
Topics of today's discussion

Reading Data Tables to draw Conclusions
Measures of Central Tendency: Mean, Median, Mode
Measures of Dispersion: Range, Standard Deviation
Looking at data over a period of time as a trend
Correlation and Causality

Reading Data Tables: Situation 1

Let us assume a city has 4 modern format stores (named Store 1, Store 2, etc.) of a single retail player.

They are of more or less similar size and have similar monthly Sales.
However, Sales by category differ: for example, one Store might have higher Sales of FMCG and another higher Sales of Staples.
In such a scenario, let us look at the buyers of instant noodles in these 4 stores.

Brands Purchased in each Store: Instant Noodles
(Among buyers of Instant Noodles in each Store)

                                   Store 1   Store 2   Store 3   Store 4
% Buying Maggi variants only         70%       75%       55%       85%
% Buying Yippee, Top Ramen, etc      20%       20%       35%       10%
% Buying both                        10%        5%       10%        5%

% Contribution of Buyers of Instant Noodles Brands from each Store

                                   Store 1   Store 2   Store 3   Store 4
% Buying Maggi variants only         18%       27%       42%       13%
% Buying Yippee, Top Ramen, etc      13%       18%       66%        4%
% Buying both                        20%       14%       60%        6%
Reading Data Tables: Situation 1, Assignment

Reading the 2 tables, what will you conclude about Instant Noodles sales from the 4 stores?

Can you make some guesses about the differences in the catchment profiles of these stores?
Reading Data Tables: Situation 1, Hint

                                                  Store 1   Store 2   Store 3   Store 4
Base: Buyers of Instant Noodles in each Store       500       700      1500       300
% Buying Maggi variants only                        70%       75%       55%       85%
% Buying Yippee, Top Ramen, etc                     20%       20%       35%       10%
% Buying both                                       10%        5%       10%        5%
Reading Data Tables: Situation 1, Hint

% Contribution of Buyers of Instant Noodles Brands from each Store

                                   Store 1   Store 2   Store 3   Store 4
% of ALL INSTANT NOODLES BUYERS      17%       23%       50%       10%
% Buying Maggi variants only         18%       27%       42%       13%
% Buying Yippee, Top Ramen, etc      13%       18%       66%        4%
% Buying both                        20%       14%       60%        6%
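The second table can be derived from the first once the base of each Store is known. The sketch below is one way to do that arithmetic; the variable and function names are illustrative and not part of the original material.

```python
# Bases and within-store percentages, copied from the Hint tables above.
stores = ["Store 1", "Store 2", "Store 3", "Store 4"]
base = [500, 700, 1500, 300]
within_store_pct = {
    "Maggi variants only": [70, 75, 55, 85],
    "Yippee, Top Ramen, etc": [20, 20, 35, 10],
    "Both": [10, 5, 10, 5],
}

def contribution(pcts, bases):
    """Share of a buyer group that comes from each store, in whole percent."""
    counts = [p * b / 100 for p, b in zip(pcts, bases)]  # buyers per store
    total = sum(counts)
    return [round(100 * c / total) for c in counts]

# Share of all instant noodles buyers coming from each store.
share_of_all = [round(100 * b / sum(base)) for b in base]
print("All buyers:", share_of_all)            # [17, 23, 50, 10]

contrib = {seg: contribution(p, base) for seg, p in within_store_pct.items()}
for seg, shares in contrib.items():
    print(seg, shares)
```

The printed rows should match the contribution table: the within-store percentages only become comparable across stores after they are converted back into buyer counts.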

SOME SIMPLE CONCLUSIONS
Store 3 has by far the largest number of buyers of Instant Noodles as a category; Store 4 has the fewest.
Among their respective buyers, Stores 1, 2 and 4 have a high share (70%+) of solus (exclusive) Maggi buyers, especially Store 4 (85%).

Store 3 has a smaller share (55%) of solus Maggi buyers. But, being the largest seller of instant noodles, it contributes the most to Maggi sales as well as to the sales of the other brands.

The Stores are of the same size, yet Store 3 has
- higher Instant Noodles sales, and
- a higher % of new-brand (Yippee, Smoodles, etc.) Sales
So, its catchment profile might be
- younger, with more double-income households, bachelors, etc.
- also, psychographically, more open to trying new brands
- more exposed to media, hence more aware of new brands
Similarly, the Store 4 catchment profile might be just the opposite.
Sales in a Hyper Store: Month-wise, last Quarter
Sales (in Rs. Lakhs)

                      Oct       Nov       Dec    Total Qtr
STAPLES             30.40     28.63     29.05      88.08
F&V                 20.00     20.72     18.69      59.41
FISH & MEAT          9.73     10.68     11.02      31.44
BAKERY               6.39      6.54      9.78      22.72
DAIRY & FROZEN       8.84      8.72      8.22      25.78
FMCG                46.65     47.33     48.55     142.53
LIQUOR              29.68     28.73     39.02      97.43
APPAREL              6.17      5.16      6.51      17.84
E&E                  0.84      0.48      0.66       1.98
HWP                  9.55      9.28     10.06      28.90
COMMON               0.38      0.39      0.43       1.20
TOTAL              168.65    166.66    181.99     517.30
Col% - Each month, what is the contribution of each category?
Sales (Col%)

                      Oct       Nov       Dec    Total Qtr
STAPLES             18.0%     17.2%     16.0%     17.0%
F&V                 11.9%     12.4%     10.3%     11.5%
FISH & MEAT          5.8%      6.4%      6.1%      6.1%
BAKERY               3.8%      3.9%      5.4%      4.4%
DAIRY & FROZEN       5.2%      5.2%      4.5%      5.0%
FMCG                27.7%     28.4%     26.7%     27.6%
LIQUOR              17.6%     17.2%     21.4%     18.8%
APPAREL              3.7%      3.1%      3.6%      3.4%
E&E                  0.5%      0.3%      0.4%      0.4%
HWP                  5.7%      5.6%      5.5%      5.6%
COMMON               0.2%      0.2%      0.2%      0.2%
Row% - For each category, what is the contribution from each month?
Turnaround Hyper Store
Sales (Row%)

                      Oct       Nov       Dec    Total Qtr
STAPLES             34.5%     32.5%     33.0%    100.0%
F&V                 33.7%     34.9%     31.5%    100.0%
FISH & MEAT         31.0%     34.0%     35.1%    100.0%
BAKERY              28.1%     28.8%     43.1%    100.0%
DAIRY & FROZEN      34.3%     33.8%     31.9%    100.0%
FMCG                32.7%     33.2%     34.1%    100.0%
LIQUOR              30.5%     29.5%     40.1%    100.0%
APPAREL             34.6%     28.9%     36.5%    100.0%
E&E                 42.6%     24.2%     33.3%    100.0%
HWP                 33.1%     32.1%     34.8%    100.0%
COMMON              31.8%     32.4%     35.8%    100.0%
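Col% and Row% are two different divisions of the same Sales figures. As a minimal sketch (the dictionary layout and variable names are assumptions made for illustration), both can be computed from the month-wise Sales table like this:

```python
# Sales in Rs. Lakhs, copied from the month-wise Sales table.
sales = {
    "STAPLES":        {"Oct": 30.40, "Nov": 28.63, "Dec": 29.05},
    "F&V":            {"Oct": 20.00, "Nov": 20.72, "Dec": 18.69},
    "FISH & MEAT":    {"Oct": 9.73,  "Nov": 10.68, "Dec": 11.02},
    "BAKERY":         {"Oct": 6.39,  "Nov": 6.54,  "Dec": 9.78},
    "DAIRY & FROZEN": {"Oct": 8.84,  "Nov": 8.72,  "Dec": 8.22},
    "FMCG":           {"Oct": 46.65, "Nov": 47.33, "Dec": 48.55},
    "LIQUOR":         {"Oct": 29.68, "Nov": 28.73, "Dec": 39.02},
    "APPAREL":        {"Oct": 6.17,  "Nov": 5.16,  "Dec": 6.51},
    "E&E":            {"Oct": 0.84,  "Nov": 0.48,  "Dec": 0.66},
    "HWP":            {"Oct": 9.55,  "Nov": 9.28,  "Dec": 10.06},
    "COMMON":         {"Oct": 0.38,  "Nov": 0.39,  "Dec": 0.43},
}
months = ["Oct", "Nov", "Dec"]

# Column totals: total store Sales in each month.
month_total = {m: sum(row[m] for row in sales.values()) for m in months}

# Col%: share of each category within a month (each column adds up to 100%).
col_pct = {cat: {m: 100 * row[m] / month_total[m] for m in months}
           for cat, row in sales.items()}

# Row%: share of each month within a category (each row adds up to 100%).
row_pct = {cat: {m: 100 * row[m] / sum(row.values()) for m in months}
           for cat, row in sales.items()}

print(round(col_pct["LIQUOR"]["Dec"], 1))   # 21.4
print(round(row_pct["BAKERY"]["Dec"], 1))   # 43.1
```

Col% answers "how important is this category in the month?", while Row% answers "how important is this month for the category?". The December jump in Liquor and Bakery shows up in both views.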

Measures of Central Tendency: Mean, Median, Mode

If we have a lot of data and want to say something about it without giving all the data, we need to describe it with a single number.

For example, if a shopkeeper has sold 10 soap bars in a day and someone asks "At what price did you sell the soaps?", the person wants to hear one number. How does the shopkeeper respond?
- Maybe he will give the typical price of a soap bar sold
- Maybe he will give some number that represents the middle of all the prices
- Maybe he will give the most frequent price at which they have been sold
In all cases, he is trying to give a number that somehow represents the centre of all the prices at which the soap bars have been sold (i.e. a Measure of Central Tendency).
Mean, Median and Mode

Three Measures of Central Tendency are commonly used: Arithmetic Mean, Median and Mode.

Let us take the sale of 10 soap bars to explain the Measures of Central Tendency. Suppose the prices (in Rs) of the soap bars sold throughout the day were as follows:
Rs 20, Rs 25, Rs 15, Rs 20, Rs 10, Rs 10, Rs 40, Rs 20, Rs 20, Rs 30
Arithmetic Mean
- The most commonly used Measure of Central Tendency
- Gives the centre of the data by computing the sum of the numbers and dividing it by the count of numbers
- So, in our example, it is (Sum of prices of 10 soap bars)/10, i.e. Rs 210/10 = Rs 21

Median
- Gives the middle number from a set of data.
It is computed in two steps:
STEP 1: The numbers are ordered from lowest to highest (or the reverse)
STEP 2:
- if the count of numbers in the set is odd, the single middle-most number is the median
- if the count of numbers in the set is even, the average of the two middle numbers is the median

Let us compute the Median price of the soap bars sold.
STEP 1: Prices of the 10 soap bars arranged in ascending order:
Rs 10, Rs 10, Rs 15, Rs 20, Rs 20, Rs 20, Rs 20, Rs 25, Rs 30, Rs 40
STEP 2: Since 10 soap bars were sold (i.e. an even number), there are two middle numbers, the 5th and the 6th ones (shown in brackets):
Rs 10, Rs 10, Rs 15, Rs 20, [Rs 20, Rs 20], Rs 20, Rs 25, Rs 30, Rs 40
So the Median will be the mean of Rs 20 and Rs 20, i.e. Rs 20

NOTE: The Median is the middle number (or mid-point) and not the centre of gravity. For example, if the highest price were Rs 400 instead of Rs 40, the Median would still be Rs 20, whereas the Mean would change.

Mode
- The most typical number, i.e. the one that shows up the most number of times
For example, in the case of the 10 soap bars sold:
- One soap each has been sold at the price points of Rs 15, Rs 25, Rs 30 and Rs 40
- Two soaps have been sold at Rs 10 each
- Four soaps have been sold at Rs 20 each
Since the maximum number (four) of soaps have been sold at Rs 20, the Mode will be Rs 20.
A set of data can have one or more Modes.
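The three measures for the soap bar prices can be checked with Python's standard `statistics` module. This is only a quick sketch of the calculations described above; `multimode` is used because a data set can have more than one Mode.

```python
import statistics

# Prices (in Rs) of the 10 soap bars sold during the day.
prices = [20, 25, 15, 20, 10, 10, 40, 20, 20, 30]

mean_price = statistics.mean(prices)      # sum of prices / number of prices
median_price = statistics.median(prices)  # average of the 5th and 6th sorted values
modes = statistics.multimode(prices)      # list of all most-frequent values

print(mean_price, median_price, modes)
```

For this data the Mean is Rs 21, the Median is Rs 20 and the only Mode is Rs 20.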

Arithmetic Mean: the most commonly used Measure
However, the Arithmetic Mean should be avoided in cases where the data has some outliers.

A few very large or very small values in the data can skew the Mean.
In such cases, the Median can be a better one-number conclusion.
The Mode could also be used, especially when one value clearly occurs far more often than the others in the data set.
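To see how a single outlier affects each measure, the soap bar example can be rerun with the highest price changed from Rs 40 to Rs 400, as in the note on the Median. The sketch below assumes nothing beyond that one changed value.

```python
import statistics

prices = [20, 25, 15, 20, 10, 10, 40, 20, 20, 30]
# Same day's sales, but the Rs 40 soap is replaced by a Rs 400 one (the outlier).
prices_with_outlier = [400 if p == 40 else p for p in prices]

mean_with_outlier = statistics.mean(prices_with_outlier)
median_with_outlier = statistics.median(prices_with_outlier)

print(statistics.mean(prices), "->", mean_with_outlier)      # Mean moves from 21 to 57
print(statistics.median(prices), "->", median_with_outlier)  # Median stays at 20
```

One extreme value almost triples the Mean, while the Median does not move at all.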
Exercise 1: Measures of Central Tendency

In a shop selling wrist watches, around 500 wrist watches were sold in a month:
- 90% of the watches were sold at prices ranging from Rs 700 to Rs 3000
- 9% of the watches were sold at prices below Rs 700
- 2 very high-end, expensive watches were sold at prices above Rs 2 lakhs during the month
If you are asked for one number to answer the question "At what prices are wrist watches sold from this shop?", which Measure of Central Tendency will you use: Mean, Median or Mode?

Exercise 2: Measures of Central Tendency

In another shop selling wrist watches in a relatively down-market area, around 200 wrist watches were sold in a month.
All the wrist watches were in the price range of Rs 500 to Rs 2000.
If you are asked for one number to answer the question "At what prices are wrist watches sold from this shop?", which Measure of Central Tendency will you use: Mean, Median or Mode?

Exercise 3: Measures of Central Tendency

In a modern format outlet, the sales of the largest SKU of potato wafers on a particular day were as follows:

Brands (selling at Rs X)          Units sold
Lays (at Rs 20)                       43
Bingo (at Rs 20)                      32
Pringles (at Rs 80)
Other local brands (at Rs 20)         31

If you are asked for one number to answer the question "At what price are branded potato wafers sold from this outlet?", which Measure of Central Tendency will you use: Mean, Median or Mode?

Exercise 4: Measures of Central Tendency

In an outlet selling footwear in Kolkata, the sales (out of 350 pairs of footwear sold) by Size of footwear during a particular week were as follows:

Size of footwear     Units sold
3 or below               30
4                        40
5                        75
6                        50
7                        75
8                        60
9 or above               20

If you are asked for one number to answer the question "What is the Size of footwear of Kolkatans, in general?", which Measure of Central Tendency will you use: Mean, Median or Mode?
Are there any other conclusions that you can draw from this data? Would you want to look at the given data in some other way?
The concept of Weighted Mean

What should the figures in the "All" column be?

                                                  All    Store 1   Store 2   Store 3   Store 4
Base: Buyers of Instant Noodles in each Store    3000      500       700      1500       300
% Buying Maggi variants only                                70%       75%       55%       85%
% Buying Yippee, Top Ramen, etc                             20%       20%       35%       10%
% Buying both                                               10%        5%       10%        5%

The "All" figure is not the simple average of the four Store percentages: each Store's percentage has to be weighted by its base.

                                                  All    Store 1   Store 2   Store 3   Store 4
Base: Buyers of Instant Noodles in each Store    3000      500       700      1500       300
% Buying Maggi variants only                      65%       70%       75%       55%       85%
% Buying Yippee, Top Ramen, etc                   27%       20%       20%       35%       10%
% Buying both                                      8%       10%        5%       10%        5%
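The "All" column is a weighted mean of the Store percentages, with each Store's base as the weight. A minimal sketch of the arithmetic for the Maggi-only row follows; the variable names are illustrative.

```python
# Base (number of instant noodles buyers) and % buying Maggi variants only, per store.
base = [500, 700, 1500, 300]
maggi_only_pct = [70, 75, 55, 85]

# Simple (unweighted) mean treats all four stores as equally large.
simple_mean = sum(maggi_only_pct) / len(maggi_only_pct)

# Weighted mean: each store's percentage counts in proportion to its base.
weighted_mean = sum(p * b for p, b in zip(maggi_only_pct, base)) / sum(base)

print(simple_mean)            # 71.25
print(round(weighted_mean))   # 65
```

The simple mean (about 71%) overstates the figure because it gives Store 4, with only 300 buyers, the same importance as Store 3, with 1500. The weighted mean reproduces the 65% shown in the "All" column.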

Measures of Dispersion
Often, a single number (Mean, Median or Mode) does not lead to sufficient conclusions. Let us look at the following data:

From 3 outlets selling wrist watches, the average price of watches sold is as follows:

                               Outlet 1   Outlet 2   Outlet 3
Mean price of watches sold      Rs 890     Rs 886     Rs 891

Looking at the Arithmetic Mean, can we conclude that the 3 outlets sell similarly priced watches and have a similar profile of customers?

The answer is NO!
So, we need some Measures to know how the data is spread around the Arithmetic Mean.
Range, Variance and Standard Deviation

To understand how spread out the data is around the Mean, three measures are used: Range, Variance and Standard Deviation.
Range is the difference between the highest value and the lowest value in the data set. It is a simple measure and very sensitive to outliers, so a large Range hints that extreme values are present.
Variance is the average of the squared differences of the individual data points from the Mean; Standard Deviation is its square root.
Standard Deviation, very simply put, is the Measure that shows by how much the individual members of a data set differ from the Mean value of the data.
So, a high Standard Deviation (or SD) means the individual data points are very different from, or spread far apart from, the Mean.
A low SD means the individual data points are close to the Mean.
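As an illustration, the three Measures of Dispersion can be computed for the soap bar prices used earlier. The sketch uses the population formulas (dividing by the number of data points); the source does not say whether population or sample formulas are intended, so that choice is an assumption.

```python
import statistics

# Prices (in Rs) of the 10 soap bars from the Central Tendency example.
prices = [20, 25, 15, 20, 10, 10, 40, 20, 20, 30]

price_range = max(prices) - min(prices)   # highest value minus lowest value
variance = statistics.pvariance(prices)   # mean of squared deviations from the Mean
std_dev = statistics.pstdev(prices)       # square root of the variance

print(price_range)         # 30
print(variance)            # 74
print(round(std_dev, 2))   # 8.6
```

With a Mean of Rs 21, an SD of about Rs 8.6 says that a typical soap price lies within roughly Rs 9 of the Mean.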
Example: Measures of Dispersion

Let us look at the 3 outlets selling wrist watches once again, now with the Measures of Dispersion:

                               Outlet 1    Outlet 2     Outlet 3
Mean price of watches sold      Rs 890      Rs 886       Rs 891
Range                           Rs 2500     Rs 99,500    Rs 7,500
Standard Deviation              Rs 22.40    Rs 255.90    Rs 98.20

The kinds of Sales happening in the 3 outlets are completely different.

Outlet 1 caters to a very homogeneous segment, all buying watches within a narrow price range.

Outlet 2 has Sales at a wide range of price points (and may have outliers too). It caters to diverse demographic profiles: some very affluent ones and some from middle/lower income groups.
Assignment 2
In the catchment area of a store, 1000 people were asked some questions in a survey. All 1000 people shop for day-to-day household items themselves, belong to the age group of 21-45 yrs, and are housewives or single earning members.
They were asked to agree or disagree with the statement "I love buying day-to-day items from modern format outlets rather than going to traditional Kirana stores" on a five-point scale:
Of the 1000 people responding to this question, the mean score obtained was 3.1 out of 5. What can you conclude from this?

If you are now given the following distribution:
What would you conclude?

Can you make some hypotheses on the sub-groups of people giving this opinion?
Would you want the data to be analyzed by some other sub-groups?

We may need to look at an output like:
to check whether the polarization of findings is due to different attitudes among different age groups and different stages of life.
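A mean score of 3.1 can come from very different patterns of answers. The sketch below uses two purely hypothetical distributions of the 1000 responses (they are not the survey's actual data) and assumes the scale is coded from 1 (strongly disagree) to 5 (strongly agree).

```python
def mean_and_sd(counts):
    """Mean and (population) standard deviation from a {score: respondents} table."""
    n = sum(counts.values())
    mean = sum(score * c for score, c in counts.items()) / n
    variance = sum(c * (score - mean) ** 2 for score, c in counts.items()) / n
    return mean, variance ** 0.5

# Hypothetical: opinions split between strong disagreement and strong agreement.
polarized = {1: 300, 2: 150, 3: 50, 4: 150, 5: 350}
# Hypothetical: most people are neutral.
clustered = {1: 50, 2: 150, 3: 500, 4: 250, 5: 50}

mean_p, sd_p = mean_and_sd(polarized)
mean_c, sd_c = mean_and_sd(clustered)

print(round(mean_p, 1), round(sd_p, 2))
print(round(mean_c, 1), round(sd_c, 2))
```

Both tables give a mean of 3.1, but the polarized one has roughly twice the Standard Deviation. The Mean alone cannot tell a neutral population apart from a sharply divided one.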

Data over a period of time as a trend

Often, data has to be looked at over a period of time, as a trend (e.g. Day-to-day Sales Reports, Weekly Reports or Monthly Sales Reports).
Following is the monthly sales data (for 10 months) of consumer durables from a store:

MONTHS    SALES (No. of Units)
   1          110
   2          100
   3          120
   4          140
   5          170
   6          150
   7          160
   8          190
   9          200
  10          190

Since there can be a lot of month-on-month fluctuation in the data, it can be difficult to understand the direction in which the data is moving.
To observe the medium or long-term trend, one way of looking at the data is to plot these Sales as a Scatter Plot and then draw a trend-line.
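One common way to draw such a trend-line is a least-squares straight line through the points. The source does not say which method is meant, so treat the following as one possible sketch, applied to the 10 months of Sales given in the table.

```python
# Monthly unit sales for months 1..10, from the table above.
months = list(range(1, 11))
sales = [110, 100, 120, 140, 170, 150, 160, 190, 200, 190]

n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n

# Least-squares slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
sxx = sum((x - mean_x) ** 2 for x in months)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"trend: sales = {intercept:.1f} + {slope:.1f} * month")
```

A positive slope of about 11 units per month confirms the upward direction despite the dips in months 2 and 6.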

The other way of smoothing the fluctuations and looking at the real picture is by computing Moving Averages. The following table shows 3-monthly moving averages:

MONTHS    SALES    SALES (Moving Avg)
   1       110
   2       100
   3       120          110
   4       140          120
   5       170          143
   6       150          153
   7       160          160
   8       190          167
   9       200          183
  10       190          193
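Each moving-average figure is the mean of the current month and the two months before it. A short sketch of that calculation (the variable names are illustrative):

```python
# Monthly unit sales for months 1..10.
sales = [110, 100, 120, 140, 170, 150, 160, 190, 200, 190]
window = 3  # 3-monthly moving average

# The first value is available only from month 3, once three months of data exist.
moving_avg = [
    round(sum(sales[i - window + 1 : i + 1]) / window)
    for i in range(window - 1, len(sales))
]

print(moving_avg)   # [110, 120, 143, 153, 160, 167, 183, 193]
```

The list starts at month 3 and matches the Moving Avg column. Unlike the raw Sales, it never decreases from one month to the next.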

The Moving Average shows that there is a clear and consistent
upward trend in Sales over this period of 10 months.
This is evident from the following graph:

Correlation between two variables

When we look at data for two or more variables, we sometimes see that the data for two variables move in the same direction.
For example, if we record the heights and weights of a large number of people, we would observe that many taller people also weigh more and, similarly, many shorter people also weigh less.
Also, there can be two variables that move mostly in opposite directions.
For example, the Power of the engine and the Mileage of a car: mostly, cars with higher Power would have lower Mileage.
We use the term Correlation for the strength of the linear association between two variables, and a coefficient r is used to measure this strength.
The value of r can range between -1 and +1.
r value closer to +1: a strong positive association
r value closer to -1: a strong negative association
r value closer to 0: a weak association between the two variables
In Retail Sales data too, it will be interesting to observe the Correlation of Sales of certain categories over a period of time, by looking at long-term Sales data:
- Is there a high positive Correlation between Sales of Shampoos and Conditioners?
- Is there a negative Correlation between Sales of Shower Gels and Soaps?
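The coefficient r is usually computed as Pearson's correlation coefficient. The sketch below implements that formula directly; the monthly sales figures in it are made up purely to show the two extreme cases and do not come from any store.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical monthly unit sales, invented for illustration only.
shampoo = [10, 12, 15, 18, 20]
conditioner = [5, 6, 7.5, 9, 10]    # always half of shampoo: moves with it
shower_gel = [20, 18, 15, 12, 10]   # falls exactly as shampoo rises

r_same = pearson_r(shampoo, conditioner)
r_opposite = pearson_r(shampoo, shower_gel)
print(round(r_same, 2), round(r_opposite, 2))
```

The first pair gives r of about +1 and the second about -1. Real Sales data would normally fall somewhere in between.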
Correlation does not imply Causality
However, one must note: CORRELATION DOES NOT IMPLY CAUSALITY.
Meaning, a high positive Correlation between A and B does not mean that A causes B or that A leads to B.
e.g. Brand Imagery vs Brand Usage: there is generally a high positive correlation, but does it mean that an increase in Brand Imagery would lead to an increase in Brand Usage?
NO!
Assignment 3: Correlations
Correlations between the Sales of product categories:

CORRELATIONS      STAPLES    F&V    FISH & M   BAKERY    D&F     FMCG    LIQUOR
STAPLES             1.00     0.63     0.62      0.37     0.68    0.96    -0.18
F&V                 0.63     1.00     0.39     -0.24     0.83    0.73    -0.29
FISH & MEAT         0.62     0.39     1.00      0.61     0.15    0.61     0.15
BAKERY              0.37    -0.24     0.61      1.00    -0.32    0.31     0.56
DAIRY & FROZEN      0.68     0.83     0.15     -0.32     1.00    0.71    -0.39
FMCG                0.96     0.73     0.61      0.31     0.71    1.00    -0.06
LIQUOR             -0.18    -0.29     0.15      0.56    -0.39   -0.06     1.00

What can you conclude from this?
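When a correlation matrix has many cells, it can help to list the category pairs mechanically before interpreting them. The sketch below is one assumed way of scanning the matrix; it only looks at each pair once, because the matrix is symmetric.

```python
labels = ["STAPLES", "F&V", "FISH & MEAT", "BAKERY", "D&F", "FMCG", "LIQUOR"]
corr = [
    [1.00, 0.63, 0.62, 0.37, 0.68, 0.96, -0.18],
    [0.63, 1.00, 0.39, -0.24, 0.83, 0.73, -0.29],
    [0.62, 0.39, 1.00, 0.61, 0.15, 0.61, 0.15],
    [0.37, -0.24, 0.61, 1.00, -0.32, 0.31, 0.56],
    [0.68, 0.83, 0.15, -0.32, 1.00, 0.71, -0.39],
    [0.96, 0.73, 0.61, 0.31, 0.71, 1.00, -0.06],
    [-0.18, -0.29, 0.15, 0.56, -0.39, -0.06, 1.00],
]

# Take only the cells above the diagonal: each pair once, no self-correlations.
pairs = [
    (corr[i][j], labels[i], labels[j])
    for i in range(len(labels))
    for j in range(i + 1, len(labels))
]

strongest_positive = max(pairs)
strongest_negative = min(pairs)
print(strongest_positive)   # (0.96, 'STAPLES', 'FMCG')
print(strongest_negative)   # (-0.39, 'D&F', 'LIQUOR')
```

The scan only points to where the strongest associations are. Deciding what they mean for the store is still a matter of judgement.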

Let us look at the following article:
EATING BREAKFAST MAY BEAT TEEN OBESITY
"In the study, published in Pediatrics, researchers analyzed the dietary and weight patterns of a group of 2,216 adolescents over a five-year period (1998-1999 to 2003-2004) from public schools in Minneapolis-St. Paul, Minn.
The researchers write that teens who ate breakfast regularly had a lower percentage of total calories from saturated fat and ate more fiber and carbohydrates than those who skipped breakfast. In addition, regular breakfast eaters seemed more physically active than breakfast skippers.
Over time, researchers found teens who regularly ate breakfast tended to gain less weight and had a lower body mass index than breakfast skippers."
[Source: WebMD Health News, March 2008]
From the data and argument given in the article, is it fair to conclude that "Not eating breakfast leads to obesity" or "Eating breakfast reduces the chance of obesity"? Why?

Open to Discussion

The title and part of the content try to say:
Eating breakfast -> No Obesity; Breakfast Skipping -> Obesity
Also, Breakfast eating -> Physically active
Is that so? Or is it just that these things go together?
Maybe it is the other way round: people who have high body fat are less likely to get hungry in the morning, so Obesity -> Breakfast Skipping!
Or maybe Physical Activity -> Breakfast (as you are hungry) and Physical Activity -> No Obesity
So a high Correlation, by itself, can by no means establish Causality.
THANK YOU!
