
Why would you neglect outliers?

The Green Cow

Some time ago your cow was reading a book written for beginning amateur ornithologists.1 In general, that book was quite interesting and contained a lot of practical advice. One piece of advice, however, should be neglected. The author of the book gives advice about gathering, recording and interpreting data on birds. He advises counting the different species in a region of your choice. For clarity, you make a report on the daily mean number of birds of each observed species. The author states that you should neglect observations that are not in line with the general trend. For instance, assume that, on an average day, you see two storks in your region of interest. One day, you suddenly see 30 storks go by. Following the author's suggestion, you should disregard this observation as a measurement failure. Let us take a look to see why that is not such a good idea.

Outliers influencing the mean


First of all, it is important to know where the advice comes from. Let us set up an imaginary scenario for a birdwatcher. Assume you go birdwatching once a month in a park in your neighborhood and record the number of birds you see. Table 1 gives a possible example of the monthly numbers. In most months you see roughly 20 to 30 birds during your observation session. Only in October do you see 56, which is a lot more. If you want to report how many birds you have seen on average, you calculate a mean of 25.4 birds per session.2 Following the advice of the book, you then delete the observation in October and calculate the mean again. Now you find an average of 22.6 birds per session. According to the advice you got, you report this number. In figure 1 you can see what number of birds you could expect to observe, based on the averages with and without the observation in October. These plots should be interpreted as follows: on each day you go observing birds, the probability of seeing the number of birds on the horizontal axis can be read on the vertical axis.3 You see that there is quite a difference between the two graphs.

1 Cows see a lot of birds coming and going while they are out in the fields. Therefore, they can provide a lot of observations to biologists, on condition that they are well trained.

2 Anyone wondering how you can see 0.4 birds in a session? Imagine that the average number of observed birds is somewhere between 25 and 26. The 0.4 suggests that out of 10 days you can expect to see 25 birds on six days and 26 on four days. Of course this is what you could expect, not necessarily what you will actually observe.

Month        Observed birds
January      20
February     23
March        19
April        25
May          21
June         20
July         24
August       26
September    23
October      56
November     27
December     21

Table 1: Imaginary number of birds observed every month
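The two averages quoted above can be checked with a few lines of Python. This is only a minimal sketch: the counts are copied from Table 1, and the variable names are chosen here for illustration.

```python
from statistics import mean

# Monthly counts copied from Table 1 (imaginary data).
observed_birds = {
    "January": 20, "February": 23, "March": 19, "April": 25,
    "May": 21, "June": 20, "July": 24, "August": 26,
    "September": 23, "October": 56, "November": 27, "December": 21,
}

# Mean over all twelve sessions.
mean_all = mean(list(observed_birds.values()))

# Mean after deleting the October observation, as the book advises.
mean_without_october = mean(
    [count for month, count in observed_birds.items() if month != "October"]
)

print(f"Mean with all observations: {mean_all:.1f}")              # 25.4
print(f"Mean without October:       {mean_without_october:.1f}")  # 22.6
```

A single session out of twelve shifts the reported average by almost three birds, which is why the choice to keep or drop it matters.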

Please, don't neglect the outliers


Which of the two graphs is the right one? Probably neither. Neglecting the outliers makes you assume that all distributions are normal. However, not everything is normally distributed.4 The outlying observation in October should have been explained by the yearly bird migration. The number of birds that are observed is not constant over the year. The outlying observation is not the result of some kind of measurement failure, but of a true variation in the number of birds that are present. The explanation of outliers is not always as easy as in this case. It might be necessary to repeat your observations or to let someone else repeat your experiment. One should never assume that outlying observations result from measurement failures. Therefore, they should not be neglected.

3 The data on the horizontal axis are continuous. However, as you can't observe non-integer numbers of birds, you have to take the sum of the probabilities from one integer to the next.

4 See previous messages of your cow.

Figure 1: Probability densities. (a) Distribution of the expected number of birds with all observations; (b) distribution of the expected number of birds without the observation in October.
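The reading rule from footnote 3 can be sketched in Python. The article does not say how the curves in Figure 1 were constructed, so this sketch assumes a normal curve fitted with the sample mean and sample standard deviation of the Table 1 counts. The helper names `fit_normal` and `probability_of_count` are hypothetical and exist only for this illustration.

```python
from statistics import NormalDist, mean, stdev

# Monthly counts from Table 1; October (56) is the tenth value.
all_counts = [20, 23, 19, 25, 21, 20, 24, 26, 23, 56, 27, 21]
without_october = [c for i, c in enumerate(all_counts) if i != 9]


def fit_normal(counts):
    # ASSUMPTION: a normal curve with the sample mean and sample standard
    # deviation; the article does not state how Figure 1 was produced.
    return NormalDist(mu=mean(counts), sigma=stdev(counts))


def probability_of_count(curve, k):
    # Footnote 3: the curve is continuous, so the probability of a whole
    # number k is taken as the area under the curve from k to k + 1.
    return curve.cdf(k + 1) - curve.cdf(k)


curve_all = fit_normal(all_counts)
curve_without_october = fit_normal(without_october)

# Compare both curves for a typical count and for the October count.
for k in (22, 25, 56):
    p_all = probability_of_count(curve_all, k)
    p_without = probability_of_count(curve_without_october, k)
    print(f"{k} birds: with October {p_all:.4f}, without October {p_without:.4f}")
```

Under this assumption, the curve fitted without October treats a session with 56 birds as practically impossible, even though such a session was actually observed.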
