http://msenux.redwoods.edu/math/R/boxplot.php

Department of Mathematics

College of the Redwoods


Boxplots in R

In this activity we show our readers how to create a boxplot in R. In preparation for this activity, we must first explore what statisticians call "measures of central tendency," specifically the mean and median of a data set.

We first create a set of data that we will use throughout this activity. Although our data set is somewhat artificial, all of what we explain in this activity (as it relates to our data set) can also be applied to any set of data chosen by our readers. With this thought in mind, we enter our data set at the R prompt.

> x=c(0,4,15, 1, 6, 3, 20, 5, 8, 1, 3)

We can examine the contents of the variable x with the following command:

> x
[1]  0  4 15  1  6  3 20  5  8  1  3

The Mean

One of the most important measures of central tendency in statistics is the mean, which is found by summing the elements of the data set, then dividing by the number of elements in the set. We can sum the elements in the data set contained in the variable x with the R-command sum.

> sum(x)
[1] 66

Readers should convince themselves (get out the pencil and paper) that the individual elements in the list stored in x do indeed sum to 66. To find the number of elements in the list stored in the variable x, we can get out our abacus and count them, or we can use R's length command.

> length(x)
[1] 11

Readers should check that the list stored in x does indeed have 11 elements. Count them! To find the average of the list stored in x, we divide the sum by the number of elements in the list.

> sum(x)/length(x)
[1] 6


10/05/2011 11:30


Recall that the sum of the elements in x was 66, the length was 11, so the average (or mean) is 66/11=6, as verified by our R-command sum(x)/length(x). However, because finding the mean is such a common requirement in most statistical analysis, it should come as no surprise that R has a command for finding the mean of a data set.

> mean(x)
[1] 6

The Median

The mean of a data set can be strongly influenced by "outliers" in the data. Consider anew the data stored in the variable x.

> x
[1]  0  4 15  1  6  3 20  5  8  1  3

Let's sketch a quick histogram of the data stored in x. The following command produces the histogram shown in Figure 1.

> hist(x)

Figure 1. A histogram of the data stored in the variable x. Note that the data is badly skewed to the right.

Note the long tail to the right in Figure 1. Statisticians say that the data is "skewed to the right." Imagine that the bars of the histogram represent masses of equal density. If we were to place a fulcrum or "knife-edge" at the mean (at x = 6), the masses would balance. Outliers can greatly affect the placement of the mean. It's like an old-fashioned "teeter-totter": a child seated at a greater distance from the fulcrum is able to balance a much heavier child seated closer to the fulcrum.

To pursue this line of reasoning a bit further, imagine that the numbers contained in the variable x represent speakers' fees in thousands of dollars. Let's sort the data in ascending order.

> sort(x)
[1]  0  1  1  3  3  4  5  6  8 15 20

It is probably unfair to say that the "average speaking fee is $6,000." Although statistically correct, the average (mean) speaking fee ($6,000) does not reflect a common speaking charge for this collection of speakers. Indeed, the two outliers (the speakers charging $15,000 and $20,000) unduly influence the mean.
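To see how strongly the two largest fees pull on the mean, the following sketch recomputes the mean and median with those two values left out. It also checks the "balance point" idea by summing the deviations from the mean. The variable name trimmed is chosen here for illustration and is not part of the original activity.

```r
# Speakers' fees in thousands of dollars (the data set used throughout).
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

# Deviations from the mean cancel out: this is the "balance point" idea.
sum(x - mean(x))        # 0

# Drop the two largest fees (15 and 20) and recompute both statistics.
trimmed <- x[x < 15]    # hypothetical name, used only in this sketch

mean(x)                 # 6
mean(trimmed)           # 31/9, roughly 3.44
median(x)               # 4
median(trimmed)         # 3
```

Removing the two outliers moves the mean from 6 to about 3.44, while the median only moves from 4 to 3.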


A second measure of central tendency, a statistic called the median, will be seen to more closely resemble what a group might be charged should they hire one of the speakers represented in the data set stored in x. The median is defined to be the data item that is precisely in the middle of the sorted data set; that is, half (50%) of the data occurs to the left of the median, and half (50%) occurs to the right of the median.

In the case that the data set has an odd number of elements, it is a simple matter to spot the data item that lies precisely in the middle. The data set stored in the variable x has 11 elements. Hence, the sixth element of the sorted data lies exactly in the middle of this data set. Thus, the median is 4. Note that this number represents a speaking fee of $4,000, which is probably more representative of a "middling fee" that a group might expect should they use one of the speakers represented by the data stored in x. Of course, R finds the median with ease.

> median(x)
[1] 4

If a data set has an even number of elements, the median is found by averaging the two "middle elements." For example, the following data set has six elements.

> y=1:6
> y
[1] 1 2 3 4 5 6

The median is found by averaging the third and fourth elements; that is, the median is (3 + 4)/2 = 3.5. R is completely aware of the even case.

> median(y)
[1] 3.5
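The odd and even cases described above can be sketched as a small function. The name manual_median is hypothetical and exists only for this illustration; in practice R's own median command is the one to use.

```r
# A minimal sketch of the median rule described in the text.
# manual_median is a hypothetical helper, not a built-in R function.
manual_median <- function(v) {
  s <- sort(v)          # the median is defined on the sorted data
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                   # odd length: the single middle element
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2    # even length: average the two middle elements
  }
}

x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)
y <- 1:6

manual_median(x)   # 4, the sixth of eleven sorted elements
manual_median(y)   # 3.5, the average of 3 and 4
```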

Quantiles

The median of a data set is located so that 50% of the data occurs to the left of the median (and 50% of the data occurs to the right of the median). There is no reason to restrict our attention to the 50% level. For example, we can find a point where 25% of the data occurs on its left (and 75% to its right). This point is known as the first "quartile" and is found with the following R command:

> sort(x)
[1]  0  1  1  3  3  4  5  6  8 15 20
> quantile(x,0.25)
25%
  2

To help explain, we've listed the data set in ascending order. R provides nine different algorithms for computing the 25% quantile, which can be viewed by typing the command ?quantile. The default technique is to use linear interpolation to find the entry in the position given by the formula 1 + p(n - 1), where p is the required percentage (written as a decimal) and n is the length of the data set. In this particular case, p = 0.25 and n = 11, so 1 + p(n - 1) = 3.5. Thus, R will interpolate (linearly) a number that is exactly halfway between the third and fourth entries (1 and 3), arriving at 1 + 0.5(3 - 1) = 2. In similar fashion, R will approximate the 75% quantile with the following command:

> sort(x)
[1]  0  1  1  3  3  4  5  6  8 15 20
> quantile(x,0.75)
75%
  7

Note that 1 + p(n - 1) = 1 + 0.75(11 - 1) = 8.5, so R reports the number that is exactly halfway between the eighth and ninth entries, namely 6 + 0.5(8 - 6) = 7. In general, the p% quantile will be a number that finds p% of the data to its left. For the remainder of this activity, the most important statistics are the minimum, first quartile, median, third quartile, and the maximum. We can use the quantile command to compute all of these at once.

> quantile(x,c(0,0.25,0.5,0.75,1))
  0%  25%  50%  75% 100%
   0    2    4    7   20
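The interpolation rule 1 + p(n - 1) can be checked by hand with a short sketch. The function interp_quantile below is a hypothetical helper written for this illustration; it mirrors the description of the default technique given above and is not R's own implementation.

```r
# A minimal sketch of the interpolation rule described in the text.
# interp_quantile is a hypothetical helper, not a built-in R function.
interp_quantile <- function(v, p) {
  s  <- sort(v)
  n  <- length(s)
  h  <- 1 + p * (n - 1)   # (possibly fractional) position in the sorted data
  lo <- floor(h)          # entry just at or below that position
  hi <- ceiling(h)        # entry just at or above that position
  s[lo] + (h - lo) * (s[hi] - s[lo])   # linear interpolation between the two
}

x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)
probs <- c(0, 0.25, 0.5, 0.75, 1)

sapply(probs, function(p) interp_quantile(x, p))   # 0 2 4 7 20
```

These five values agree with the output of the quantile command for the same percentages.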


However, R's summary command will report each of these quantiles with descriptive headers, and throw in the mean for good measure.

> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       2       4       6       7      20

Two pairs of numbers in the summary for our data set give the user a sense of the "spread" of the data involved. The first is the range of the data set.

> range(x)
[1]  0 20

Note that R's range command reports the minimum and maximum entries in the data set. In addition, R's IQR command gives the interquartile range.

> IQR(x)
[1] 5

The interquartile range reports the difference between the 75% quantile and the 25% quantile. In this case, IQR = 7 - 2 = 5.
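As a quick check of the arithmetic, the sketch below computes the same spread directly from the two quartiles and compares it with R's IQR command. The variable name q is chosen here for illustration.

```r
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

# First and third quartiles, computed with the quantile command.
q <- quantile(x, c(0.25, 0.75))   # q is a name chosen for this sketch

unname(q[2] - q[1])   # 5, that is, 7 - 2
IQR(x)                # 5, the same number from the built-in command

# The width of the full range, for comparison with the IQR.
diff(range(x))        # 20, that is, 20 - 0
```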

It is easier to explain the boxplot if we first have a picture to which we can refer in the discussion. So, without any further ado, here is how R produces a boxplot for the data in the variable x.

> boxplot(x,range=0)

The above command was used to produce the boxplot shown in Figure 2.

Figure 2. The minimum, quartiles, median, and maximum are used to construct a "box and whisker plot."

So, how is this boxplot constructed? First, recall the summary data for the data in the variable x.

> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       2       4       6       7      20

Here are the steps for creating the standard box and whiskers plot.


1. Draw a thick, dark, horizontal segment at the median, that is, at 4. See Figure 2.
2. Draw horizontal lines at the first and third quartiles, that is, at 2 and at 7. Use these to draw the "box." See Figure 2.
3. From the bottom edge of the box, draw a "whisker" that extends to the minimum data value, namely 0. See Figure 2.
4. From the top edge of the box, draw a "whisker" that extends to the maximum data value, namely 20. See Figure 2.

There are several important points that need making in regard to our box and whisker plot.

1. 50% of the data occurs between the lower and upper edges of the box, namely, between the first and third quartiles located at 2 and 7, respectively.
2. The lower 50% of the data occurs below the median, the dark horizontal line in the box in Figure 2. Likewise, the upper 50% of the data occurs above the median line in the box.
3. The lower 25% of the data occurs between the bottom edge of the lower whisker and the bottom edge of the box. Likewise, the upper 25% of the data occurs between the top edge of the box and the top edge of the upper whisker.
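The five numbers used in the construction steps above can also be read back from R without drawing anything. The sketch below assumes the plot = FALSE argument of the boxplot command, which returns the computed statistics instead of a picture; the variable name five is chosen here for illustration. R's boxplot works from "hinges" rather than the interpolated quartiles, but for this data set the two coincide.

```r
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

# Ask boxplot for its numbers only; range = 0 sends the whiskers
# all the way to the minimum and maximum, as in the standard boxplot.
five <- as.vector(boxplot(x, range = 0, plot = FALSE)$stats)

five   # 0 2 4 7 20: lower whisker, box bottom, median, box top, upper whisker
```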

The standard boxplot does not pay special attention to outliers that might be present. The modified boxplot is constructed so as to highlight outliers. As with the standard boxplot described above, let's begin with a picture. Note that the modified boxplot is the default in R, and requires no special parameters.

> boxplot(x)

The above command was used to produce the modified "box and whiskers" plot shown in Figure 3.

Figure 3. A modified boxplot marks outliers for "special attention."

Before explaining the construction, let's repeat the sorted data and the summary information.

> sort(x)
[1]  0  1  1  3  3  4  5  6  8 15 20
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0       2       4       6       7      20

Here are the steps required to construct a modified box and whiskers plot:

1. The median and the quartiles are used to construct the box in exactly the same manner used to construct the standard boxplot.
2. Multiply the IQR by 1.5. In this case, 1.5 x IQR = 1.5(5) = 7.5. Let's call this result the STEP. That is, STEP = 7.5.
3. Add the STEP to the third quartile, obtaining 3rd Quartile + STEP = 7 + 7.5 = 14.5. Use this to perform two tasks:
   a. Any data above 14.5 is plotted using an empty circle. This explains the two circles at 15 and 20 in Figure 3.
   b. Locate the largest data point below 14.5. This is the number 8. This is where the end of the upper whisker is drawn.
4. Subtract the STEP from the first quartile, obtaining 1st Quartile - STEP = 2 - 7.5 = -5.5. Use this to perform two tasks:
   a. Any data below -5.5 is plotted using an empty circle. There are no such data points in Figure 3.
   b. Locate the smallest data point that occurs above -5.5. This is the number 0. This is where the end of the lower whisker is drawn.
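The steps above can be sketched in a few lines of R. The names step, upper_fence, lower_fence, outliers, and whiskers are chosen for this illustration. The last two lines compare the hand computation with boxplot.stats, the command R uses to compute the numbers behind a boxplot.

```r
x <- c(0, 4, 15, 1, 6, 3, 20, 5, 8, 1, 3)

q1 <- unname(quantile(x, 0.25))   # 2
q3 <- unname(quantile(x, 0.75))   # 7

step        <- 1.5 * (q3 - q1)    # STEP = 7.5
upper_fence <- q3 + step          # 14.5
lower_fence <- q1 - step          # -5.5

# Data beyond either fence is drawn as an empty circle.
outliers <- x[x > upper_fence | x < lower_fence]   # 15 20

# The whiskers end at the most extreme data points inside the fences.
whiskers <- range(x[x >= lower_fence & x <= upper_fence])   # 0 8

# R's own computation, for comparison.
boxplot.stats(x)$out     # 15 20
boxplot.stats(x)$stats   # 0 2 4 7 8
```

The hand computation and boxplot.stats agree: the outliers are 15 and 20, and the whiskers end at 0 and 8.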

Enjoy!

We hope you enjoyed this introduction to the R system. R provides a strong interactive interface for exploration in statistics. We encourage you to explore further. Use the command ?boxplot to learn more about what you can do with the boxplot command.
