Table of Contents
1. Introduction to Statistical Computing
2. Data Analysis and Statistical Concepts
   Concept 1: Measurements of Central Tendency
   Concept 2: Measurements of Dispersion
   Concept 3: Visualization of Univariate Data
   Concept 4: Visualization of Multivariate Data
   Concept 5: Random Number Generation and Simple Sampling
   Concept 6: Confidence Intervals
3. Excel (2003)
   Concept 1: Measurements of Central Tendency
   Concept 2: Measurements of Dispersion
   Concept 3: Visualization of Univariate Data
   Concept 4: Visualization of Multivariate Data
   Concept 5: Random Number Generation and Simple Sampling
   Concept 6: Confidence Intervals
   Excel 2003 Lagniappe
4. Excel (2007)
   Concept 1: Measurements of Central Tendency
   Concept 2: Measurements of Dispersion
   Concept 3: Visualization of Univariate Data
   Concept 4: Visualization of Multivariate Data
   Concept 5: Random Number Generation and Simple Sampling
   Concept 6: Confidence Intervals
   Excel Lagniappe
Developed and maintained by the Mathematics and Statistics Department of Kennesaw State University
5. SPSS
   Concept 1: Measurements of Central Tendency
   Concept 2: Measurements of Dispersion
   Concept 3: Visualization of Univariate Data
   Concept 4: Visualization of Multivariate Data
   Concept 5: Random Number Generation and Simple Sampling
   Concept 6: Confidence Intervals
   SPSS Lagniappe
6. Minitab
   Concept 1: Measurements of Central Tendency
   Concept 2: Measurements of Dispersion
   Concept 3: Visualization of Univariate Data
   Concept 4: Visualization of Multivariate Data
   Concept 5: Random Number Generation and Simple Sampling
   Concept 6: Confidence Intervals
   Minitab Lagniappe
7. SAS
   Concept 1: Measurements of Central Tendency
   Concept 2: Measurements of Dispersion
   Concept 3: Visualization of Univariate Data
   Concept 4: Visualization of Multivariate Data
   Concept 5: Random Number Generation and Simple Sampling
   Concept 6: Confidence Intervals
   SAS Lagniappe
1 Readers of this manual are assumed to have completed some introductory statistics course. For individuals wishing to review statistical concepts, we recommend Introduction to Stats by DeVeaux, Velleman and Bock.
Minitab
Minitab was developed in 1972 by Statistics professors at Penn State University (where it is still headquartered). These professors were looking for a better way to teach undergraduate statistics in the classroom. From this starting point, Minitab is now used in over 4,000 universities around the world, in 80 countries, and by hundreds of companies ranging from Fortune 500 firms to start-ups. Of the main statistical computing packages, Minitab has the strongest graphics and visualization capabilities. The package is most heavily used in Six Sigma and other quality design initiatives. Minitab's customer list includes a large number of manufacturing and product design firms such as Ford, GE, GM, HP and Whirlpool. For product information regarding Minitab, please visit: http://www.minitab.com/

SAS
Statistical Analysis System, or SAS, is typically considered to be the most complete statistical analysis package on the market. (Professional Tip: please pronounce this as "sass". If you pronounce the package as "S-A-S", people will think you are a poser.) This is the package of choice of most applied statisticians. Although the most recent version of SAS (version 9) includes some point and click options, SAS uses a scripting language to tell the computer what data manipulations and computations to perform. We will be demonstrating how to actually write the code for SAS rather than defaulting to the point and click functionality in v.9, SAS Enterprise Guide, SAS Enterprise Miner and other more user-friendly GUI SAS products. Our rationale here is this: if you learn to drive a manual transmission, you can drive anything. Similarly, if you can program in Base SAS, you can use (and understand) just about any statistical analysis package. The learning curve for SAS is longer and steeper than for the other packages, but the package is considered the benchmark for statistical computing.

SAS is used in 110 countries, at 2,200 universities, and at 96 of the Fortune 100 companies. For product information regarding SAS, please visit: http://www.sas.com/
It is familiar to most people; It reflects the inclusion of every item in the dataset; It always exists; It is unique; It is easily used with other statistical measurements.
However, the mean should ONLY be used when the data is quantitative (interval or ratio scale) and when there are no extreme values.
The formula for the calculation of a mean is

\[ \bar{X} = \frac{\sum_{i=1}^{N} X_i}{N} \]

(We know how everyone LOVES formulas with Greek letters!)

Where
X_i = every observation in the dataset
N = the number of observations in the dataset
The answer is on the next page. Don't cheat: do it first to make sure that you understand how to calculate this foundational concept by hand.
Did you get 9.66? Well done.

A second measurement of central tendency of a dataset is the median. The median is, literally, the middle of the dataset: It is the central value of an array of numbers sorted in ascending (or descending) order; 50% of the observations lie below the median and 50% of the observations lie above the median; It represents the second quartile (Q2); It is unique.
As with the mean, the median is used when the data is quantitative (interval or ratio scale). However, unlike the mean, the median can accommodate extreme values.
Take the men in the Norcross plant (n=10) again, and determine the median years they have spent in their current job. The answer is on the next page. Did you cheat last time? You can redeem yourself by doing this one by hand.
Did you get 9.5? Well done.

The mean and the median are pretty close: 9.66 and 9.50, respectively. But which one is right? Which one should be reported as the central tendency or the most representative value of the years on the job for the men in the Norcross plant? Mathematically they are both correct. However, if there are no extreme values (defined here as observations which are more than three standard deviations from the mean), then we would typically report the mean rather than the median (as we would here).

However, what if there are extreme values? Consider the men in Norcross again. What if employee 082 had 30 years with the company instead of 14 years? How would the mean and median be affected? The mean would increase to 11.26 while the median would remain the same at 9.50 (do this by hand to convince yourself of this concept). Go back and look at the formula for the mean and think about why the mean was so heavily affected, while the median was not.

A third measurement of central tendency is the mode. The mode is the most frequently occurring value in a dataset: There can be multiple modes; It is not influenced by extreme observations; It can be used with both qualitative and quantitative data.

Go back to the WidgeOne.xls dataset and the men in the Norcross plant. What is the mode for their years on the job? Hmmm... there is no mode. There is no value that appears more than one time. That's OK, sometimes there is no mode in a dataset. What if employee 077 had 4 years on the job instead of 5 years? Then we would have a mode: 4 years. This is a measurement of central tendency. But 4 years is different (a lot different) from 9.66 and 9.50 years. Is it correct? Technically yes, this would be mathematically correct, but it is not the most appropriate measurement to report as the central tendency of the dataset.

Typically, the mode is considered to be the weakest of the three measurements of central tendency for quantitative data and is ONLY used if the mean or median is not available. When would that be? Calculate the mean and median gender of the dataset. Go ahead. We will wait. It can't be done. When the data in question is qualitative (e.g., gender, plant, position), the ONLY measurement of central tendency that is available is the mode.

What you need to know
When representing the central tendency of quantitative data, default to the mean. If the data has extreme values, use the median. If the data is qualitative, use the mode.
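The effect of an extreme value on the mean and the median can be checked with a few lines of Python. This is only a sketch: the tenure values and plant codes below are hypothetical illustrations, not the WidgeOne.xls data, and Python's standard `statistics` module stands in for the packages covered in this manual.

```python
from statistics import mean, median, multimode

# Hypothetical tenure values (years); NOT the WidgeOne.xls data.
tenure = [2, 4, 5, 7, 9, 10, 11, 12, 14, 16]

# The same list with one value (14) replaced by an extreme value (30).
tenure_extreme = [30 if years == 14 else years for years in tenure]

mean_before, median_before = mean(tenure), median(tenure)
mean_after, median_after = mean(tenure_extreme), median(tenure_extreme)

# Hypothetical qualitative data: only the mode is meaningful here.
plant = ["D", "N", "D", "D", "N"]
plant_modes = multimode(plant)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
print(f"mode of plant: {plant_modes}")
```

In this made-up example the mean moves from 9 to 10.6 while the median stays at 9.5, which is the same behavior described for employee 082 above.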
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}} \]

Where,
X_i = each individual observation
\bar{X} = the mean of the dataset
N = the number of observations in the dataset

Note: if calculating the standard deviation of a sample rather than a population, the denominator becomes n-1. We subtract one degree of freedom.

The standard deviation tells us, roughly, how far a typical observation lies from the mean, in the same units as the data. If this number is large, the data is very spread out (i.e., the observations are different). If this number is small, the data is very compact (i.e., the observations are very similar).
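As a rough illustration of the difference between the population denominator (N) and the sample denominator (n-1), the sketch below computes the standard deviation by hand and with Python's `statistics` module. The observations are hypothetical and chosen only to keep the arithmetic simple.

```python
import math
from statistics import mean, pstdev, stdev

# Hypothetical observations; NOT the WidgeOne.xls data.
observations = [2, 4, 4, 4, 5, 5, 7, 9]

# By hand: sum of squared deviations from the mean, divided by N.
n = len(observations)
x_bar = mean(observations)
sum_sq = sum((x - x_bar) ** 2 for x in observations)
manual_population_sd = math.sqrt(sum_sq / n)

# Library equivalents.
population_sd = pstdev(observations)  # denominator N
sample_sd = stdev(observations)       # denominator n - 1

print(f"population SD: {population_sd:.3f}")
print(f"sample SD:     {sample_sd:.3f}")
```

The sample version is always slightly larger than the population version for the same data, because it divides the same sum of squares by a smaller number.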
Did you get 3.30? Well done.

What does this number MEAN? 3.30 what? It means that the standard deviation of the dataset is 3.30 years. The average deviation (in either direction) of each individual's tenure is 3.30 years from the mean of 9.66. Relative to the mean, we would consider this data to be fairly compact, meaning that the data is not very spread out (this will be seen more clearly in the next section when a graphical representation is created).

You may recall from your earlier Statistics course(s) a second statistical calculation that provides a second measurement of dispersion: the variance. The variance is simply the square of the standard deviation. Although variance is an important concept to statisticians, it is not typically used by practitioners. This is because variance is not very user friendly in terms of interpretation. In the case of the men in Norcross, the variance would be reported as 10.88 years squared.

There is another application of the term "variance" that has a more generic meaning and is heavily used by practitioners. It is the difference, either in absolute numbers or percentages, of each observation from some base value. For example, it is common for individuals to refer to a budget variance, where this number would be the actual number minus the budgeted number:

Project #   Budget Hours   Actual Hours   Variance   Variance %
123         150            175            +25        +17%

Remember: when calculating the variance percentage in this context, you take the difference (175 - 150) divided by the budgeted number (150), not the actual number (many professionals make this mistake once).
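The budget variance arithmetic can be written as a small helper. The function name `budget_variance` is a hypothetical choice for this sketch; the numbers are the Project #123 figures from the table.

```python
def budget_variance(budget, actual):
    """Return (variance, variance as a percentage of the budgeted number)."""
    variance = actual - budget
    return variance, 100 * variance / budget


variance_hours, variance_pct = budget_variance(budget=150, actual=175)
print(f"Variance: {variance_hours:+d} hours ({variance_pct:+.0f}%)")
```

Dividing by the actual hours (175) instead of the budgeted hours (150) would give roughly 14% rather than 17%, which is the mistake the text warns about.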
Another method of representing the dispersion of a dataset is to provide the frequency counts for observations across specified ranges.
Here is how your answer should appear:

Category             Frequency   Relative Frequency   Cumulative Frequency
Less than 5 years         9            22.50%                22.50%
5-10 years               16            40.00%                62.50%
More than 10 years       15            37.50%               100.00%
Total                    40           100.00%
It is important to note that the categories are mutually exclusive (no observation can occur in two categories simultaneously) and collectively exhaustive (every observation is accommodated). This representation of the dispersion of the data is referred to as a frequency table and is the most common and one of the most useful representations of data. In this instance, we converted a quantitative variable into a qualitative variable for the purposes of developing a frequency table. We do this frequently to take a different kind of look at a quantitative variable.
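The relative and cumulative frequencies follow directly from the counts. The sketch below starts from the three category counts reported in this section and rebuilds the other two columns; the variable names are illustrative only.

```python
# Frequency counts for the Job Tenure categories reported in this section.
counts = {"Less than 5 years": 9, "5-10 years": 16, "More than 10 years": 15}

total = sum(counts.values())
table = []
running = 0
for category, count in counts.items():
    running += count
    relative = 100 * count / total      # share of all observations
    cumulative = 100 * running / total  # share in this and all earlier categories
    table.append((category, count, relative, cumulative))

for category, count, relative, cumulative in table:
    print(f"{category:<20}{count:>4}{relative:>9.2f}%{cumulative:>9.2f}%")
```

Because the categories are mutually exclusive and collectively exhaustive, the relative frequencies sum to 100% and the last cumulative frequency is exactly 100%.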
If we had a qualitative variable that we wanted to better understand, we would generate the appropriate measurement of central tendency (Mode) and the measurement of dispersion (frequencies) through the application of a frequency table.
What you need to know
Measurements of dispersion provide information regarding how spread out or compact the data is. Typically this is communicated through the computation of the standard deviation AND some display of the frequency counts of the observations across specified categories. If the data is qualitative, the only measurement of dispersion comes from the frequency table.
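[Figure: histogram of the Job Tenure variable with a cumulative percentage line; the chart image is not reproduced here]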
Note in this graphic that the left axis represents the actual frequency counts for each category and the right axis represents the cumulative percentage for all categories. From this graphic, it is easy to see that the data is roughly bell-shaped (approximately normally distributed), with the mean, median and mode in the 7-10 year category. This histogram was developed using Excel.
Pie Charts
Pie charts can be useful for displaying the relative frequency of observations by category, if used properly. Consider these two guidelines:
1. Use 5 or fewer slices; if more than 5 slices are needed, use a table;
2. Order the relative frequencies in ascending (or descending) order.
Using the same Job Tenure data, the associated pie chart, generated using Excel, would look like this:
This pie chart was developed using Excel.

It should probably be noted at this point that approximately 8% of all men and 0.5% of all women are colorblind. Although colorblindness comes in many different forms, the most common forms involve the colors red, green, yellow and brown. Individuals who are colorblind cannot distinguish among these colors. Therefore, when constructing pie charts or any other type of colored visual representation of your analysis, avoid placing these colors adjacent to each other.

Bar Charts
Bar Charts ARE NOT Histograms! Bar Charts are intended to represent the frequency counts of QUALITATIVE data. The plant information from WidgeOne.xls would look like this:
This bar chart was developed using Excel. A very technical point: Bar Charts are always displayed horizontally. Column Charts are always displayed vertically. They are different charts. Column Charts are typically used with a time element, where each column is a different point in time. When there is no time element (as with the WidgeOne data), Bar Charts and Pie Charts are the primary tools used to display qualitative data.
Stem and Leaf Plots
Stem and leaf plots, like histograms, provide a visual representation of the shape of the data and the central tendency of the dataset. Here is the stem and leaf plot for the Job Tenure variable:

Stem  Leaf    Frequency
 17   0       1
 16
 15   0       1
 14   001     3
 13   000     3
 12   1       1
 11   011     3
 10   0115    4
  9   000     3
  8   00016   5
  7   016     3
  6   1       1
  5   007     3
  4   01      2
  3   00      2
  2   001     3
  1   0       1
  0   1       1
When reading a stem and leaf plot, the first number represents the stem and the numbers to the right represent the leaves, while the number to the far right represents the frequency of the stem. For example, the first stem of the plot above is a 17 and the first (and only) leaf is 0. This means that there is one observation that has 17.0 years on the job. To the far right of the 17, there is a 1. This indicates that there is only one employee with 17.x years on the job.
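The way stems and leaves are split can be sketched in Python. The 40 values below are read off the stem and leaf plot in this section (stem = whole years, leaf = tenths of a year); treating those readings as the raw data is an assumption, and `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

# Job Tenure values (years) as read off the stem and leaf plot in this
# section; using them as the raw data is an assumption.
tenure = [
    17.0, 15.0, 14.0, 14.0, 14.1, 13.0, 13.0, 13.0, 12.1, 11.0,
    11.1, 11.1, 10.0, 10.1, 10.1, 10.5, 9.0, 9.0, 9.0, 8.0,
    8.0, 8.0, 8.1, 8.6, 7.0, 7.1, 7.6, 6.1, 5.0, 5.0,
    5.7, 4.0, 4.1, 3.0, 3.0, 2.0, 2.0, 2.1, 1.0, 0.1,
]


def stem_and_leaf(values):
    """Map each stem (whole years) to its sorted leaves (tenths of a year)."""
    plot = defaultdict(list)
    for value in values:
        # Work in whole tenths to avoid floating-point surprises.
        stem, leaf = divmod(round(value * 10), 10)
        plot[stem].append(leaf)
    return {stem: sorted(leaves) for stem, leaves in plot.items()}


plot = stem_and_leaf(tenure)
for stem in range(max(plot), min(plot) - 1, -1):
    leaves = plot.get(stem, [])
    print(f"{stem:>3} | {''.join(map(str, leaves)):<6} {len(leaves) or ''}")
```

Stems with no observations (such as 16) still get a row, so the vertical spacing of the plot keeps reflecting the shape of the data.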
Boxplots
The last tool described in this manual for visualizing univariate data is the boxplot. The boxplot builds on the information displayed in a stem-and-leaf plot, focuses particular attention on the symmetry of the distribution, and incorporates numerical measures of tendency and location. Prior to creating a boxplot, you need to be familiar with the concept of quartiles. The boxplot incorporates the median, the mean and the four quartiles of a variable. The quartiles of a dataset are the points below which 25%, 50% (the same as the median), 75% and 100% (the max value) of the data lie. Quartiles are typically written as Q1, Q2, Q3, Q4, respectively. The range between Q1 and Q3 is referred to as the Interquartile Range or IQR. This is the center 50% of the dataset.

Below is the boxplot for the Job Tenure variable from WidgeOne.xls. The boxplot is typically placed next to the stem-and-leaf plot for context:

Stem  Leaf    Frequency   Boxplot
 17   0       1              |
 16                          |
 15   0       1              |
 14   001     3              |
 13   000     3              |
 12   1       1              |
 11   011     3           +-----+
 10   0115    4           |     |
  9   000     3           |     |
  8   00016   5           *--+--*
  7   016     3           |     |
  6   1       1           |     |
  5   007     3           +-----+
  4   01      2              |
  3   00      2              |
  2   001     3              |
  1   0       1              |
  0   1       1              |
From this boxplot, you can see that Q1 begins at 5, Q2 (also the median) begins at 8 (the actual median of the dataset is 8.35), Q3 begins at 11 and the highest value of the dataset is 17.0. Since the median (the *--+--* line, where the + marks the mean) is approximately in the center of the IQR box, we would conclude that this dataset is relatively symmetric. This boxplot was developed using SAS.

What you need to know
Many individuals who are analytically very strong often place insufficient emphasis on graphics and visual representations of data. Many individuals who are not strong analytically, but need analysis to support their decision-making, often place an overemphasis on graphics and visualization. Individuals who can execute both well will go far. Histograms, Stem and Leaf Plots and Boxplots are used with QUANTITATIVE DATA. Bar Charts, Pie Charts and Column Charts are used with QUALITATIVE DATA.
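The quartiles and the IQR behind the boxplot can be approximated in Python. The values are again read off the stem and leaf plot (an assumption about the raw data), and `statistics.quantiles` uses its own default interpolation rule, so results from other packages may differ slightly.

```python
from statistics import quantiles

# Job Tenure values (years) as read off the stem and leaf plot in this
# section; using them as the raw data is an assumption.
tenure = [
    17.0, 15.0, 14.0, 14.0, 14.1, 13.0, 13.0, 13.0, 12.1, 11.0,
    11.1, 11.1, 10.0, 10.1, 10.1, 10.5, 9.0, 9.0, 9.0, 8.0,
    8.0, 8.0, 8.1, 8.6, 7.0, 7.1, 7.6, 6.1, 5.0, 5.0,
    5.7, 4.0, 4.1, 3.0, 3.0, 2.0, 2.0, 2.1, 1.0, 0.1,
]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(tenure, n=4)
iqr = q3 - q1  # width of the box: the center 50% of the data

print(f"Q1 = {q1:.2f}, median = {q2:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"max (Q4) = {max(tenure):.1f}")
```

With these values the cut points land at about 5, 8.35 and 11.1, in line with the box edges and median line described above.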
Count of Gender    Plant
Gender             D         N         Grand Total
F                  13        7         20
M                  10        10        20
Grand Total        23        17        40
This table displays the frequency of the number of females and males at each plant. We could also display this table as percentages rather than as frequencies:
Count of Gender    Plant
Gender             D         N         Grand Total
F                  65.00%    35.00%    100.00%
M                  50.00%    50.00%    100.00%
Grand Total        57.50%    42.50%    100.00%
Here the percentages are given as a percentage of each gender (row percentages). Specifically, the interpretation of the first cell would be: of all of the female employees, 65% work in Dallas. The percentages could easily be reversed to represent the percentage of individuals at each plant (column percentages):

Count of Gender    Plant
Gender             D         N         Grand Total
F                  56.52%    41.18%    50.00%
M                  43.48%    58.82%    50.00%
Grand Total        100.00%   100.00%   100.00%

In this version of the table, the first cell now communicates: of all of the Dallas employees, 56.52% are female. Finally, we can also represent the data as overall percentages:

Count of Gender    Plant
Gender             D         N         Grand Total
F                  32.50%    17.50%    50.00%
M                  25.00%    25.00%    50.00%
Grand Total        57.50%    42.50%    100.00%

In this version of the table, the first cell now communicates: of all employees, 32.50% are females in Dallas.
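The three kinds of percentages differ only in the denominator. The sketch below uses cell counts that are consistent with the percentage tables in this section (for example, 65% of 20 females is 13); the dictionary layout and the helper `pct` are illustrative choices.

```python
# Employees by gender (rows) and plant (columns: D = Dallas, N = Norcross).
# These counts are inferred from the percentage tables in this section.
counts = {"F": {"D": 13, "N": 7}, "M": {"D": 10, "N": 10}}
plants = ["D", "N"]

row_totals = {g: sum(row.values()) for g, row in counts.items()}
col_totals = {p: sum(counts[g][p] for g in counts) for p in plants}
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Percentage rounded to two decimals, as shown in the tables."""
    return round(100 * part / whole, 2)


# Row percentages: each cell divided by its gender total.
row_pct = {g: {p: pct(counts[g][p], row_totals[g]) for p in plants} for g in counts}
# Column percentages: each cell divided by its plant total.
col_pct = {g: {p: pct(counts[g][p], col_totals[p]) for p in plants} for g in counts}
# Overall percentages: each cell divided by the grand total.
overall_pct = {g: {p: pct(counts[g][p], grand_total) for p in plants} for g in counts}

print("row:    ", row_pct["F"]["D"])
print("column: ", col_pct["F"]["D"])
print("overall:", overall_pct["F"]["D"])
```

The same cell (females in Dallas) reads as 65%, 56.52% or 32.5% depending on whether it is compared with all females, all Dallas employees, or all employees.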
Before moving on, please ensure that you fully understand the differences across these three tables. They are subtle, but important.

Both gender and plant are categorical variables. We could incorporate a quantitative variable into this table, such as job tenure:

[Table: Average of YRONJOB by Gender (F, M, Grand Total) and Plant; the cell values are not reproduced here]

This table now provides information about the average job tenure for each gender and each plant, and for each gender at each plant. For example, the first cell now communicates, "The females in Dallas have an average job tenure of 8.85 years."

These tables were developed in Excel using Pivot Tables.
Stacked Bar Charts
Stacked bars are a convenient way to display percentages or proportions, such as might be done in a pie chart, for multiple variables. For example, the proportion of each gender at each plant would be displayed like this in a stacked bar chart:
This stacked bar chart could be reversed, where gender is displayed as the bars and the segments are the plants:
The second graphic is fine. However, when the population sizes differ, particularly by a lot, stacked bar charts are less informative. It is difficult to understand how the groups compare. For example, the difference in the number of Dallas and Norcross employees is not dramatic, but even here it is difficult to discern which plant has a greater proportion of men.
100% Stacked Bar Charts
To solve this problem, we can apply a 100% stacked bar chart. This visualization tool simply calibrates the populations of interest, like the two plants, to both be evaluated out of a total of 100%. You can almost think of 100% Stacked Bar Charts as side-by-side pie charts.
Compare this graphic to the first Stacked Bar Graph. They are different. They communicate subtly different messages.
Scatter Plots
What if we wanted to better understand whether there is a meaningful relationship between two quantitative variables, such as the possible relationship between job tenure and productivity? This question can be addressed using a scatter plot, where one quantitative variable is plotted on the y-axis and the second quantitative variable is plotted on the x-axis:
If two variables are related, we would expect to see some pattern within the scatter plot, such as a line. If job tenure and productivity were positively related, then we would expect to see the points rise along a line from the SW corner to the NE corner (the slope need not be exactly 45 degrees). This would indicate that as job tenure goes up, productivity goes up. If job tenure and productivity were negatively related, then we would expect to see the points fall along a line from the NW corner to the SE corner. This would indicate that as job tenure goes up, productivity goes down.
In this scatter plot, neither of these linear patterns (or any other pattern) is clearly reflected. This cloud is referred to as a Null Plot. As a result, we would conclude that job tenure and productivity are not meaningfully related.

We can derive additional information from this scatter plot. Specifically, we can determine the best fit line in the form y=mx+b. This is the linear equation that minimizes the sum of the squared distances between the predicted values and the actual values, where y = the predicted value of an employee's productivity and x = the actual number of years of an employee's job tenure: y = -0.5715x + 89.318. This equation generates an R² value of 0.1124, where this value represents the proportion of the variance of the dependent variable (productivity) that can be explained by the independent variable (job tenure), here about 11%. Detailed explanations of these concepts are outside of the scope of this document, but are heavily used in Statistics and form the basis of Regression Modeling. For a more detailed explanation of Regression Modeling, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.
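For readers who want to see where m, b and R² come from, here is a minimal least-squares sketch. The five (tenure, productivity) pairs are hypothetical and are not the WidgeOne.xls data, so the fitted line will not match the equation above; `fit_line` is an illustrative helper name.

```python
def fit_line(xs, ys):
    """Return slope m, intercept b and R^2 of the least-squares line y = m*x + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx                       # slope
    b = y_bar - m * x_bar               # intercept
    r_squared = sxy ** 2 / (sxx * syy)  # share of variance in y explained by x
    return m, b, r_squared


# Hypothetical (tenure, productivity) pairs; NOT the WidgeOne.xls data.
tenure = [1, 2, 3, 4, 5]
productivity = [2, 4, 5, 4, 5]

m, b, r_squared = fit_line(tenure, productivity)
print(f"y = {m:.2f}x + {b:.2f}, R^2 = {r_squared:.2f}")
```

An R² close to 0 corresponds to the shapeless cloud described above, while an R² close to 1 means the points fall nearly on the fitted line.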
What you need to know Stacked Bar charts are used to display the counts within groupings of qualitative variables. When those groupings are of different sizes, a 100% Stacked Bar Chart is preferred. You can think of 100% Stacked Bar Charts as side by side Pie Charts. Scatterplots are used to communicate if a relationship exists between two quantitative variables.
is used to generate these random numbers, this process is referred to as simple random sampling, where every element has a 1/n probability of selection. Simple random sampling using random number generation is a very common technique used by analysts to select a subset of a population of elements for analysis.
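A simple random sample can be sketched with Python's `random` module. The population of 40 employee ID numbers, the sample size of 10, the seed, and the choice to sample without replacement are all assumptions made for illustration.

```python
import random

# Hypothetical population: employee ID numbers 1 through 40.
population = list(range(1, 41))

# A fixed seed makes the pseudo-random sample reproducible between runs.
rng = random.Random(2024)

# Draw 10 distinct employees; every employee is equally likely to be chosen.
sample = rng.sample(population, k=10)
print(sorted(sample))
```

Dropping the seed (or changing it) produces a different but equally valid simple random sample.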
Concept 6 Confidence Intervals therefore is different from the first four concepts reviewed in this manual, because we are moving from descriptive statistics to inferential statistics. Simply stated, a confidence interval is an estimation of some unknown population parameter (usually the mean), based on sample statistics, where the acceptable margin of error and/or confidence level is preestablished.
2 Inferential statistics is based on the Central Limit Theorem. Readers are assumed to have a working knowledge of this theorem. For a refresher on the Central Limit Theorem, we suggest Statistical Methods and Data Analysis by Ott and Longnecker.
The formula used to estimate a two-sided confidence interval for a population mean is

\[ \bar{X} \pm Z \cdot \frac{s_X}{\sqrt{n}} \]

where
\bar{X} = the sample mean;
Z = the number of standard deviations, using the sampling distribution and the Central Limit Theorem, associated with the established confidence level:
    90% confidence = 1.645
    95% confidence = 1.96
    99% confidence = 2.575
s_X = the sample standard deviation;
n = the number of elements in the sample.

The formula used to estimate a two-sided confidence interval for a population proportion is

\[ p \pm Z \cdot \sqrt{\frac{pq}{n}} \]

where p = the sample proportion and q = 1 - p.
In both formulas, the expression after the ± sign is referred to as the Margin of Error.
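Both interval formulas can be sketched as short Python functions. The function names and the sample statistics passed to them are hypothetical; the default Z of 1.96 is the 95% value listed in this section.

```python
import math


def mean_interval(sample_mean, sample_sd, n, z=1.96):
    """Two-sided interval for a mean: sample_mean +/- z * sample_sd / sqrt(n)."""
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


def proportion_interval(p, n, z=1.96):
    """Two-sided interval for a proportion: p +/- z * sqrt(p * q / n), q = 1 - p."""
    q = 1 - p
    margin = z * math.sqrt(p * q / n)
    return p - margin, p + margin


# Hypothetical sample statistics, chosen only to keep the arithmetic simple.
mean_low, mean_high = mean_interval(sample_mean=50.0, sample_sd=10.0, n=100)
prop_low, prop_high = proportion_interval(p=0.5, n=100)

print(f"mean:       ({mean_low:.2f}, {mean_high:.2f})")
print(f"proportion: ({prop_low:.3f}, {prop_high:.3f})")
```

Because n sits under a square root in both margins of error, quadrupling the sample size roughly halves the margin of error at the same confidence level.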
What you really need to know
When calculating confidence intervals, use a 95% default unless you know something about the decision maker. If the decision maker is conservative, use a 99% interval. If the decision maker is risk tolerant, use a 90% interval. To both increase confidence and decrease the margin of error, increase the sample size.
If you needed to enter data into a new spreadsheet, you would simply type the data values into each cell, with labels in row 1. Excel will accept most characters (letters and numbers) in the cells. However, only numbers (with a few exceptions) can be subjected to the kinds of analysis outlined in Chapter 2.

At this point, we need to access the WidgeOne.xls dataset in Excel. To access the dataset, click on File and then Open. A Microsoft explorer box will pop up. Go to the folder or drive where you have saved the WidgeOne.xls file:
Note that this explorer box is looking for Excel files. If you need to change the file type, click on the drop down menu.
Once you have opened the WidgeOne.xls file in Excel, you should see this:
Recall from the initial description of the dataset that there are three worksheets in the file. The Plant_Survey sheet is currently open (if it is not, please click on that tab at the bottom of the page). We will be executing most of our analysis in this sheet. However, if you click on one of the two other tabs (Employees or Attendance), you will move to one of those two sheets. Return to the Plant_Survey sheet and we will begin to execute the six statistical concepts from Chapter 2.
Functions in Excel are organized into categories, based upon different specializations. We will be using functions in the Statistical category.
Go ahead and click on the Statistical category. You will see this box:
As you scroll through this box, you will see a wide variety of statistical functions.
Click on Cancel and go back to the dataset. Before we perform any analysis, let's insert an additional column, where we will insert the labels "Mean", "Median" and "Mode". To do this, first place your cursor on the "A" at the top of the first column and click, so that the entire column is highlighted. Now, click on Insert>Column.
At this point, the entire dataset should have shifted to the right, and the new column A is blank:
Now, go to the bottom of the dataset to cell A43. In cells A43, A44 and A45, type Mean, Median and Mode, respectively:
Not all variables will lend themselves to these calculations. Remember that we only execute mean and median calculations on quantitative variables. So, it would be helpful if we could see the column headers to remind us what is in each column. This can be done using a split screen. Go to cell A2 (the row just below the headings) and then click on Window>Split. At this point you should see this:
Now, you can use the toggle bar on the right to scroll back to your labels, and still see the column headers. For which columns should we report the measurements of central tendency? The quantitative variables include JOBGRADE, SOCREL (social relations score; see footnote 3), YRONJOB (number of years on the job), PRDCTY (productivity) and JOBSAT (job satisfaction). The calculation of the mode for the qualitative variables (PLANT, GENDER and POSITION) will be addressed below.

Move your cursor to position F43. This is where we will place the mean for the JOBGRADE variable. With your cursor in this cell, click on the fx button. From the dialogue box, select Statistical. From the list of function names, click on the second entry, AVERAGE. You will see this:
This dialogue box is effectively asking for what array of numbers do you want to calculate an average? Excel is pretty clever. You may already have the array populated in the first field (Number 1). For the JOBGRADE variable, this will be cell F2 through cell F41. If it is not already populated for you, simply click on the little spreadsheet button and highlight the cells F2 through F41. Note that cell F42 is empty. If it is included, it will be ignored. However, if there was a 0 in cell F42, it would be includedand a different mean would be
Psychology, Sociology and Marketing Majors will recognize that this is Likert Data. For the purposes of this manual, Likert Data will be treated as quantitative. However, it should be noted that pure mathematicians treat Likert Data as qualitative.
3
51
Developed and maintained by the Mathematics and Statistics Department of Kennesaw State University
calculated. It is always best to only include the relevant cells in your calculations. After you have selected cells F2 through F41 as the array for the mean calculation, click OK. You should now see 6.6. Now, lets copy this function across to column J. With your cursor in cell F43, go to Edit>Copy, then highlight cells G43 through J43. Go to Edit>Paste. You should now see this:
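The difference between an empty cell and a cell containing 0 can be sketched outside of Excel. The short Python example below is only an illustration, not part of the Excel workflow: the `average` helper and the scores are hypothetical, and a blank cell is represented by `None` or an empty string.

```python
def average(cells):
    """Mean of the filled-in cells; blanks (None or "") are skipped, zeros are kept."""
    values = [c for c in cells if c is not None and c != ""]
    return sum(values) / len(values)


# Hypothetical scores, not taken from WidgeOne.xls.
scores = [6, 8, 7, 5]

print(average(scores))            # mean of the four scores
print(average(scores + [None]))   # a trailing blank cell does not change the mean
print(average(scores + [0]))      # a trailing zero is counted and lowers the mean
```

The blank leaves the mean untouched, while the zero is treated as a fifth observation, which is why it is safest to select only the relevant cells.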
To populate the Median cells, we will use the same process. Place your cursor in cell F44 and click on the function button. From the Statistical functions, select MEDIAN and select the same array F2:F41. Click OK. Copy and paste the function in cell F44, across to cell J44. You should now see this:
Although the mode is not typically the best measurement of central tendency for quantitative data, you can provide it for these variables using the same process: MODE is a function listed in the Statistical category of functions.
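For readers who want to check a median or mode by hand, here is a minimal Python sketch using the standard `statistics` module. The job grades are hypothetical values chosen for illustration, not the WidgeOne.xls data.

```python
import statistics

# Hypothetical job grades, not taken from WidgeOne.xls.
grades = [5, 6, 6, 7, 8, 6, 9, 5]

# With an even number of observations, the median is the average
# of the two middle values after sorting.
middle = statistics.median(grades)

# The mode is the value that occurs most often.
most_common = statistics.mode(grades)

print(middle, most_common)
```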
You probably noticed that the words Standard Deviation do not fit neatly into cell A47; they spilled over into B47 and C47. Remember that what you see in Excel is not necessarily what Excel sees. In reality, cells B47 and C47 are still empty from Excel's perspective. But, this looks a little untidy. There are several ways to tidy this.
We can expand column A until the words are visually contained within the column. This is accomplished by positioning the cursor between the A and the B at the top of the spreadsheet until the cursor changes to a double-headed arrow, and then double clicking. Column A will widen enough to accommodate the longest string of characters in the column (in this case, Standard Deviation).
A second method of accommodating the text is wrapping the text within the cell. This is accomplished by selecting Format>Cells>Wrap text:
After the text has been wrapped, you can slightly widen the columns or narrow the rows (using the same process as for the columns), as needed. Once the label has been established, select the function button, then within the Statistical category, select the STDEV option, enter the same range as before (F2:F41) and click OK. You should see 1.54919334. This is the standard deviation of the JOBGRADE variable. As before, copy this formula across to column J. You should now see this:
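Excel's STDEV function computes the sample standard deviation, which divides by n - 1 rather than n. The Python sketch below, using hypothetical values rather than the JOBGRADE column, shows how the sample and population versions differ.

```python
import statistics

# Hypothetical values, not taken from WidgeOne.xls.
values = [4, 6, 8, 10]

# Sample standard deviation: squared deviations are divided by n - 1.
sample_sd = statistics.stdev(values)

# Population standard deviation: squared deviations are divided by n.
population_sd = statistics.pstdev(values)

print(round(sample_sd, 4), round(population_sd, 4))
```

The sample version is always the larger of the two, and the gap shrinks as the number of observations grows.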
We now have the basic descriptive statistics for the quantitative variables. You may notice that some of the values have no decimal places, some have one, and some have many. We think this looks a little untidy (as statisticians, we like things to be tidy). To make this spreadsheet look a bit more professional, let's format all of the data to have a consistent number of decimal places. To do this, click on the cell in the far upper left corner, as circled above. This will highlight the entire spreadsheet. Then click on Format>Cells. Then select the Number category as indicated:
Isn't that better?
In practice, if you need to provide multiple descriptive statistics on a variable, this is not the process that you would go through. For multiple descriptive statistics, you would select Tools>Data Analysis>Descriptive Statistics (see footnote 4). This path will bring up the following:
Footnote 4: If you go to Tools and do not see the Data Analysis option, do not let your heart be troubled. Simply go to the Add-Ins option under Tools and select the Analysis ToolPak. Then go back into Tools. You should now see the Data Analysis option. WARNING: if you have an unauthorized copy of Excel you will not have access to this very important functionality.
Again, pretty untidy. Format the spreadsheet to have two decimal places for all values and expand the columns to accommodate the labels. Your tidy version should look like this:
Notice that we reproduced all of the measurements from before, and several more (see footnote 5). This is a more efficient way to produce the descriptive statistics of one or more variables.
Footnote 5: For detailed information on the additional statistics produced, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.
In Chapter 2, we presented the concept of a frequency table as another method of displaying the spread of a dataset. As discussed, frequency tables are one of the most commonly used methods to display data, so understanding how to create a frequency table is a critical skill. The table on page 17 was created in Excel. We will reproduce it here.
The first step in creating a frequency table is to determine the categories that need to be developed for the quantitative variable (this process will effectively transform a quantitative ratio-scale variable into a qualitative categorical variable). Previously, we determined that the job tenure variable (YRONJOB) should be categorized into three levels: less than 5 years, 5-10 years and more than 10 years. Recall that the categories must be mutually exclusive and collectively exhaustive. To accommodate these categories in Excel, we will create bins, where the TOP of each category identifies each bin.
In our WidgeOne.xls dataset, let's create a bin range for YRONJOB in column L:
These are the bins for the Histogram for Job Tenure. Category 1 is 0-4.99, Category 2 is 5-10.00 and Category 3 (which does not need to be entered) is everything above 10.00.
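To make the idea of "the TOP of each category" concrete, here is a small Python sketch of the same counting rule. It assumes that a value is counted in a bin when it is less than or equal to that bin's top, and that anything above the last top falls into an extra overflow category. The `bin_counts` helper and the tenure values are hypothetical.

```python
from bisect import bisect_left


def bin_counts(values, bin_tops):
    """Count values per bin, where each bin holds values up to and including its top.

    The last count is the overflow category for everything above the final top.
    bin_tops must be sorted in ascending order.
    """
    counts = [0] * (len(bin_tops) + 1)
    for value in values:
        # bisect_left returns the index of the first top that is >= value.
        counts[bisect_left(bin_tops, value)] += 1
    return counts


# Hypothetical years on the job, not taken from WidgeOne.xls.
tenure = [1, 4.5, 5, 7, 10, 10.5, 12, 3]
bin_tops = [4.99, 10.00]  # the two bin tops used in this section

print(bin_counts(tenure, bin_tops))  # [3, 3, 2]
```

Note that a tenure of exactly 10 lands in the 5-10 category, and only two tops are needed to produce three categories.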
Click OK. This will bring up a dialogue box, asking for information regarding the quantitative variable to be analyzed, and the associated Bin Range:
Input Range: highlight the range of the YRONJOB variable (including the label).
Bin Range: highlight the Bin Range (including the label).
Again, a little untidy, but this is the base of what we need for the frequency table. Let's clean this up and add some columns to reproduce the table from Chapter 2. First, replace the bin titles with the real category labels: Less than 5 years, 5-10 years and More than 10 years. Second, expand the columns as needed. Third, total the bottom of the frequency column using the SUM function: in cell B5, type =SUM(B2:B4) (the SUM function can be found in the Math & Trig category of functions). Next, create two additional column headers: Relative Frequency and Cumulative Frequency. Your sheet should look like this:
The Relative Frequency column will display the percentage of observations in each category. This is an important piece of information, particularly when comparing populations of different sizes. It is calculated by simply taking each frequency and dividing it by the total. For example, in cell C2, we would type =B2/B5. This would result in .2250 (9/40).
Rather than typing this same formula again and again to capture the relative frequencies of the next two categories, we would like to copy this formula into cells C3 and C4. Do this now. Did you get #DIV/0!? The problem is that when the formula =B2/B5 is copied down one cell, it becomes =B3/B6. There is nothing in cell B6, and Excel treats the empty cell as 0. Since any number divided by 0 is undefined, we receive this error message.
If we want to copy the formula into the cells below, we need to nail down the reference to the Total cell and prevent the reference from changing. To do this, we place a $ in front of the B and another $ in front of the 5, so that the reference reads $B$5 instead of B5. This can also be accomplished by placing the cursor between the B and the 5 in the formula and pressing the F4 key. Once you have nailed down the Total cell as a reference cell, you can copy the formula into cells C3 and C4.
The Cumulative Frequency column will display the cumulative percentage of observations from 0 to the top of the category in question. This is accomplished by adding the relative frequency of a category to all of the relative frequencies before it. In Excel, we would type =C2 in cell D2; the first entry in the Cumulative Frequency column will always equal the first entry in the Relative Frequency column. In cell D3, we would enter =D2+C3. This will add the previous cumulative value (D2) to the Relative Frequency for the category (C3). And we can now copy this formula into cell D4.
Clearly, this is a lot of manual work in Excel for a relatively small table. However, our focus is on helping to build the Excel skills necessary to execute this kind of analysis for any size table or dataset.
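The arithmetic behind the two new columns can also be sketched in a few lines of Python. In this illustration, only the first frequency (9 of 40) comes from the text; the other two counts are hypothetical placeholders chosen so that the total is 40.

```python
from itertools import accumulate

labels = ["Less than 5 years", "5-10 years", "More than 10 years"]

# 9 of 40 is from the text; the remaining two counts are hypothetical.
frequencies = [9, 20, 11]

total = sum(frequencies)                      # same role as =SUM(B2:B4)
relative = [f / total for f in frequencies]   # same role as =B2/$B$5 copied down
cumulative = list(accumulate(relative))       # same role as =C2, then =D2+C3 copied down

for label, f, r, c in zip(labels, frequencies, relative, cumulative):
    print(f"{label:<20}{f:>4}{r:>9.2%}{c:>9.2%}")
```

Because `total` is computed once and reused for every category, it plays the part of the fixed $B$5 reference. The last cumulative value should always come out to 100%, which is a quick check that no category was missed.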
You can probably guess what is next: let's make it a bit more tidy and presentable. First, let's convert the decimals to percentages, since that is the way most people would expect to see the data. Highlight cells C2 through D5. Then select Format>Cells>Percentage:
Click OK.
Second, let's format the text to ensure that it is all the same (right now some text is italicized and may not be in the same font). Click on the cell in the upper left-hand corner; this will highlight the entire spreadsheet. Select Format>Cells>Font. Select your preferred font and size (we are particularly fond of Century Gothic). Finally, let's get rid of the border line between rows 4 and 5. Highlight the entire dataset again, and select Format>Cells>Border>None. Then go back to your table, highlight JUST the table, and select Format>Cells>Border:
If you want gridlines in your table (helpful when the table has many categories), click on Outline and Inside.
You can change the appearance of the line and the color of the lines.
If you created the gridlines, your table should look something like this:
For nascent users of Excel, we understand that this seems like a lot of work. To this mild protest, we have two points. First, most recipients of your analysis will ONLY see your tables and/or graphics (next section). So you need to spend as much time making your analysis look clean and professional as you do ensuring that it is mathematically and logically correct. Second, as you will see in the subsequent chapters, some of these executions, which appear awkward in Excel, are quite easy in other software applications. Going through the labor in Excel will make you better appreciate the higher-level packages.
Remember that when creating bins in Excel, we identify the TOP of each category, and the highest category does not need to be identified.
Click OK.
Following the same process as was used to create the frequency table, identify the Input Range and the Bin Range, and ensure that the Labels box is checked. This time, also check the Cumulative Percentage and Chart Output options:
Selecting these two options will convert the frequency table into a histogram.
You guessed it: a little untidy. Let's format a few things on our histogram.
First, as before, let's change the Bin names to what we really want: Less than 3, 3-6, 7-10, 11-14 and 15+. These changes can be made in the frequency table; the histogram will be automatically updated because the graphic is dynamically linked to the table. Second, double click on the right axis and go to the Scale tab. Indicate that the Maximum should be 1.0. Third, highlight the legend and delete it (it does not really communicate any meaningful information). Fourth, double click on the x-axis and format the font as needed (we prefer Century Gothic). Do the same for the other two axes. Finally, click on the area of the graphic and then right click. Select Chart Options and rename the labels as needed. Your final histogram should look something like this:
To reproduce the pie chart in Chapter 2, begin by bringing up the sheet that contains the frequency chart you created in the previous section:
This is the Chart Wizard button, which is used to create most graphics in Excel.
Click on the Chart Wizard button. The Chart Wizard will take you through four steps. The first step is to select the graphic:
After you have selected the Pie chart type, click on Next. The second step is to identify which data is to be charted. Assuming your frequency chart looks like the chart on the previous page, you will indicate cells A1 through B4. Although the actual data is in column B, we need to include the data labels from column A:
After the data has been correctly identified, click on Next. In the third step, we are including the appropriate labels for the Pie Chart:
After you have changed the Chart Title, go to the Data Labels tab and select Show Percent. Ultimately, what kind of label you select for your Pie Chart is personal preference, but typically Pie Charts are used to communicate the percentages of each category. After you have completed the third step, you can either click Finish or click on Next to identify WHERE in the workbook you want to place your Pie Chart.
Well done! As noted in Chapter 2, some recipients of your data may be colorblind. Excel typically does not place colors such as green, red and brown together. Should you need to override the default colors provided in Excel (or include patterns to accommodate printing in black and white), simply click on the Pie. Then right click and select Format Data Series.
You can now go through each slice and change the color. If you need to change the solid colors to patterns, simply select the Fill Effects button.
Bar charts are created using a very similar process. We will create a bar chart of the same information.
Click on the Chart Wizard button. Select the Bar chart type:
Select Next. As before, identify the data range, which will include the category names:
Click Next.
In step 3, change the title, axis names and other formatting as needed:
This dialogue box is the first of three steps in a Wizard. In step one, simply click Next (we will use the default selections). In step two, select the entire dataset (including labels):
Now click Next. The third step requires a bit more thought. The first screen in step three looks like this:
Click on Layout.
We have four possible positions in which to place any of the variables on the right.
Let's start by reproducing the first Contingency Table in Chapter 2, which provides the number of individuals by Plant and by Gender. To do this, drag the Plant variable to the Column position and then drag the Gender variable to the Row position. Then, drag the Gender variable a second time (notice that after you placed the Gender variable in the Row position, it was still listed on the right) into the Data position. For now, we will leave the Page position unpopulated.
This is referred to as an Excel pivot table. This is the easiest way to determine the MODE for qualitative variables in Excel (e.g., D is the MODE for Plant). If we wanted to convert the counts into percentages (which is typically more meaningful), we would click on the Pivot Table drop down box circled above. Select Field Settings, which will result in the following:
Select the Show Data As drop down menu and select % of row. Then Click OK.
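What the % of row setting computes (and its % of column counterpart) can be sketched in Python. This illustration uses the Gender by Plant counts from the pivot table in this section; the two helper functions are hypothetical names, not Excel features.

```python
# Counts from the Gender (rows) by Plant (columns) pivot table in this section.
counts = {
    "F": {"D": 13, "N": 7},
    "M": {"D": 10, "N": 10},
}


def percent_of_row(table):
    """Divide each cell by the total of its row."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: n / row_total for col, n in cells.items()}
    return result


def percent_of_column(table):
    """Divide each cell by the total of its column."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }


print(percent_of_row(counts)["F"])     # share of women at each plant
print(percent_of_column(counts)["F"])  # share of each plant's staff who are women
```

The same 13 women in plant D are 65% of all women (13 of 20) but about 57% of plant D (13 of 23), so it matters which direction you report.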
This provides us with the breakdown of Plant by Gender. If we need to reverse this, and report the breakdown of Gender by Plant, we simply go back to Pivot Table>Field Settings>Show Data As>% of Columns. This set of executions will provide us with:
If we want to incorporate an additional piece of information like the average job tenure by plant and by gender, we could do this by substituting the YRONJOB variable in the Data position. Select Pivot Table>Wizard>Layout. This series of executions will bring us back to the layout page:
Drag the Count of Gender back to the list on the right and then drag YRONJOB to the Data position. When the YRONJOB variable is in the Data position, the box will read Sum of YRONJOB. Let's change the sum to an average. Double click on the Sum of YRONJOB box and select Summarize by>Average.
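Switching the summary from Sum to Average amounts to dividing each group's total by its count. Here is a minimal Python sketch of that grouping step; the `average_by_group` helper and the six records are hypothetical and are not rows from WidgeOne.xls.

```python
from collections import defaultdict

# Hypothetical (plant, gender, years on job) records, not from WidgeOne.xls.
records = [
    ("D", "F", 9.0), ("D", "F", 8.0), ("D", "M", 4.0),
    ("N", "F", 6.0), ("N", "M", 10.0), ("N", "M", 12.0),
]


def average_by_group(rows):
    """Average years on the job for each (gender, plant) combination."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count] per group
    for plant, gender, years in rows:
        cell = totals[(gender, plant)]
        cell[0] += years
        cell[1] += 1
    return {group: total / n for group, (total, n) in totals.items()}


averages = average_by_group(records)
for (gender, plant), value in sorted(averages.items()):
    print(gender, plant, value)
```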
As before, Excel's output is a little untidy. To format the decimals to be consistent, highlight the data in the table and then select Format>Cells>Number>OK. Now you should see this:
Much better! This table now provides information such as: in the Dallas plant, women have an average of 8.85 years on the job. Now you can copy and paste this table into other documents or into another Excel sheet. As you can see, Pivot Tables are very useful and very flexible. However, because they are so flexible, they do require a bit of manipulation. Mastering Pivot Tables in Excel is a great differentiating skill, but will require practice (and patience).
Stacked bar charts are easy to create and manipulate. To reproduce the stacked bar chart from the previous chapter, we will use the first Pivot Table created above, which gave the frequency counts by gender and by plant. Most graphics in Excel are created using the Chart Wizard, as we did with the pie chart and the bar chart. Go to the table, copy these cells and paste them into another part of the spreadsheet:
Count of Gender   Plant
Gender            D        N        Grand Total
F                 13.00    7.00     20.00
M                 10.00    10.00    20.00
Grand Total       23.00    17.00    40.00
Click on the Chart Wizard button and select the Bar chart type:
Select the second Chart sub-type option. This is the stacked bar chart. Select Next. Identify the data range which will NOT include the totals, but WILL include the labels. Then add titles.
Let's clean this up a bit. We need to change the labels from single letters to the actual names of the plants and the genders, and change the axis to have no decimals (when dealing with discrete data like people, it's best not to have any decimals). First, place your cursor on one of the blue bars and right click. Select Source Data and then Series:
Highlight each series and then change the name to Female and Male.
Click on OK. To change the decimals, double click on the axis and select Number and then change the number of decimals to 0. Finally, to change the N and the D to Norcross and Dallas, go back to the original data in the spreadsheet and replace the N with Norcross and the D with Dallas. See?! The data is integrated into the graphic! You should now see this:
Well done! A 100% Stacked Bar Chart is created in exactly the same way, except that you would select the third bar chart sub-type.
The final visualization in this section is the scatter plot. Scatter plots are typically used to determine if there is a meaningful relationship between two variables. To reproduce the scatter plot in the previous section, let's return to the Plant_Survey sheet. To examine the relationship between Job Productivity and Job Tenure, we will plot these two variables in a scatter plot. It is important to note that in a scatter plot, we are NOT trying to establish any causation, only correlation.
We create scatter plots using the same tool that was used for all of the other graphics: the Chart Wizard. Click on the Chart Wizard button and select the Chart type XY (Scatter):
Select Next. In the data range, select the variables PRDCTY and YRONJOB (including labels). In the Chart Options, create titles and axis labels as appropriate (Productivity is on the y-axis).
Again, pretty untidy. We will do three things to clean up the appearance of this graph: delete the legend, rescale the y-axis and take away the decimals. To delete the legend, simply click on it and then press the Delete key on your keyboard. This not only deletes the legend, which was not meaningful, it also creates space. The y-axis needs to be rescaled because the data does not actually start until above 50; there is a great deal of wasted space. To rescale the y-axis, double click on the y-axis, select Scale, type 50 in the box for Minimum and select OK. To resize the graphic, simply highlight the chart and drag one of the corners until the graphic is the desired size. Finally, to delete the decimals, double click on the x-axis. Select the Number tab. Set the number of decimals to 0 and click OK. Do the same thing for the y-axis.
As highlighted in Chapter 2, we can derive additional information from this graphic by adding a trendline to the data. To add a trendline, click on the dots in the graph and then right click. Select Add trendline.
Select the Linear trend and then click on the Options tab.
Ensure that the Display equation on chart and Display R-squared value on chart options are selected. Then click OK.
As explained in Chapter 2, this information now provides us with the linear equation that best fits the relationship between Productivity (y) and Job Tenure (x). The R-squared value of .1124 indicates that this is not a particularly strong relationship: Job Tenure explains only 11.24% of the variation in Productivity. These concepts form the basis of Regression Modeling. For a more detailed explanation of Regression Modeling, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.
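For the curious, a linear trendline and its R-squared value can be computed with ordinary least squares. The Python sketch below shows that calculation. The `linear_trend` helper and the data points are hypothetical, so the output will not match the .1124 reported for WidgeOne.xls.

```python
def linear_trend(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-variable linear fit, R-squared is the squared correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical tenure (x) and productivity (y) values, not from WidgeOne.xls.
tenure = [1, 3, 5, 8, 12, 15]
productivity = [62, 70, 66, 75, 71, 80]

slope, intercept, r_squared = linear_trend(tenure, productivity)
print(f"y = {slope:.3f}x + {intercept:.3f}, R-squared = {r_squared:.4f}")
```

An R-squared near 1 means the points sit close to the line; a value near 0 means the line explains very little of the variation in y.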
3.5 Using Excel 2003 for: Random Number Generation and Simple Random Sampling
Our WidgeOne.xls dataset is fairly small: only 40 observations. As a result, it would be unusual to extract a sample from such a small dataset. However, for the purposes of demonstrating random number generation in Excel, let's assume that we want to randomly select ten individuals with whom we want to conduct in-depth interviews.
Let's begin by assigning a random number to each individual. Go back to the Plant_Survey sheet and create a new column labeled RANDOM. Place your cursor in the first cell under the column label (row 2). Click on the function (fx) button. Ensure that ALL is selected as the Function Category. Scroll down through the Function Names until you see RAND. Select RAND and click OK. This will generate the following:
There are three pieces of information you need to understand from this box:
1. The function takes no arguments, which means that we do not need to provide any information;
2. The function will return an evenly distributed (uniform distribution) random number between 0 and 1;
3. The function is volatile, which means that the value returned will change EVERY time the spreadsheet is manipulated.
Click OK. You should see some number between 0 and 1 in this cell (your result will be different each time since the random number is generated using your computer's internal clock). Remember that Excel reads this cell as =RAND(), not as the number that you see. Now copy the formula in this cell down to the bottom of the dataset. Did you notice that your original number in row 2 changed? This is because the function is volatile. Sometimes we need to have volatile arguments in (not with) Excel. Most of the time we do not. To convert the numbers you see from volatile to stable (unchanging), highlight the entire column, then select Edit>Copy followed by Edit>Paste Special>Values. Now, you should have a column of unchanging random numbers (your numbers will be different from ours):
Now, sort the entire dataset on the random numbers just created. Select DATA>SORT>RANDOM (since the numbers are completely random, it does not matter if you select Ascending or Descending).
Click OK. Then, select the first 10 individuals for the interviews. This is a fairly simple, but very useful process.
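The same three steps (assign a random number to each row, sort on it, keep the first ten) can be sketched in Python. In this illustration the `simple_random_sample` helper and the employee IDs 1 through 40 are hypothetical, and the optional seed exists only so that the example can be repeated.

```python
import random


def simple_random_sample(ids, k, seed=None):
    """Pick k ids by pairing each with a uniform random number and sorting on it."""
    rng = random.Random(seed)
    # One random number per row; once drawn it stays fixed, like Paste Special>Values.
    keyed = [(rng.random(), i) for i in ids]
    keyed.sort()  # same idea as sorting the dataset on the RANDOM column
    return [i for _, i in keyed[:k]]


# Hypothetical employee IDs 1 through 40 standing in for the 40 rows.
employees = list(range(1, 41))
selected = simple_random_sample(employees, 10, seed=2003)
print(selected)
```

Each employee appears at most once in the result, so this is sampling without replacement.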
After you click OK, you will see this dialogue box:
Alpha is 1 minus the Confidence Level.
The STD (standard deviation) would have been previously calculated.
Size is the sample size (in this case, 40).
If we are computing a 95% confidence interval, we would enter .05 for the alpha value (you can think of alpha as the probability you are willing to accept of being wrong). The standard deviation, which was computed previously for job satisfaction, was 1.026 (see footnote 6). The (sample) size is 40. Once this information is entered, the resulting computation should be .32. This is the margin of error for job satisfaction at a confidence level of 95%. You would then add and subtract this to/from the mean (6.85) to create the full interval. The full interval would then be reported as: "Based on a representative sample of 40 employees, we are 95% confident that mean job satisfaction among all employees is estimated to be between 6.53 and 7.17."
Footnote 6: An important note in spreadsheet development: you could enter 1.026 in this box or enter the cell reference J47. You would generate the same answer. However, you are almost ALWAYS better off entering the cell reference rather than hard coding a number. This makes the formula more portable.
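The margin of error above can be checked by hand. The Python sketch below assumes that the dialogue box belongs to Excel's CONFIDENCE function, which takes these same three inputs and uses the normal distribution: the margin is the z-value for 1 - alpha/2 multiplied by the standard deviation divided by the square root of the sample size. The `confidence_margin` name is our own label, not an Excel or Python built-in.

```python
from math import sqrt
from statistics import NormalDist


def confidence_margin(alpha, standard_dev, size):
    """Normal-based margin of error: z * standard_dev / sqrt(size)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha is .05
    return z * standard_dev / sqrt(size)


# Values from the text: alpha .05, standard deviation 1.026, n = 40, mean 6.85.
margin = confidence_margin(0.05, 1.026, 40)
mean = 6.85
lower, upper = mean - margin, mean + margin

print(f"margin = {margin:.2f}, interval = {lower:.2f} to {upper:.2f}")
```

Rounded to two decimals, this reproduces the .32 margin and the interval from 6.53 to 7.17.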
For example, go into a clean Excel spreadsheet. Type Jan in cell A1 and then Feb in cell B1. Now, highlight these two cells. Place your cursor on the little square handle on the bottom right corner:
Drag this handle 10 spaces to the right. You now have the months of the year!
You can move easily through much of the functionality in Excel 2007 by clicking on the tab headers at the top. Most of the time, you will be on the Home tab.
At this point, we need to access the WidgeOne.xls dataset in Excel 2007. To access the dataset, click on the Microsoft Office button at the top left of the sheet and select Open. You should see the following screen:
Once you have opened the WidgeOne.xls file in Excel, you should see this:
At this point, the entire dataset should have shifted to the right, and the new column A is blank. Now, go to the bottom of the dataset to cell A43. In cells A43, A44 and A45, type Mean, Median and Mode, respectively:
Not all variables will lend themselves to these calculations. Remember that we only execute mean and median calculations on quantitative variables. So, it would be helpful if we could see the column headers to remind us what is in each column. This can be done using a split screen. To do this, click on the View tab and then, within the Window tools, select Freeze Panes>Freeze Top Row:
For which columns should we report the measurements of central tendency? The quantitative variables include JOBGRADE, SOCREL (social relations score), YRONJOB (number of years on the job), PRDCTY (productivity) and JOBSAT (job satisfaction). The calculation of the mode for the qualitative variables (PLANT, GENDER and POSITION) will be addressed below. Move your cursor to cell F43. This is where we will place the mean for the JOBGRADE variable. With your cursor in this cell, click on the fx button. From the dialogue box, select Statistical. From the list of function names, click on the second entry, AVERAGE. You will see this:
Psychology, Sociology and Marketing Majors will recognize that this is Likert Data. For the purposes of this manual, Likert Data will be treated as quantitative. However, it should be noted that pure mathematicians treat Likert Data as qualitative.
Once you select AVERAGE, the next dialogue box will request an array of numbers. Excel is pretty clever: you may already have the array populated in the first field (Number 1). For the JOBGRADE variable, this will be cell F2 through cell F41. If it is not already populated for you, simply click on the little spreadsheet button and highlight the cells F2 through F41. Note that cell F42 is empty. If it is included, it will be ignored. However, if there were a 0 in cell F42, it would be included, and a different mean would be calculated. It is always best to include only the relevant cells in your calculations. After you have selected cells F2 through F41 as the array for the mean calculation, click OK. You should now see 6.6.
Now, let's copy this function across to column J. With your cursor in cell F43, go to the Home tab and select Copy from the Clipboard tools. Highlight cells G43 through J43. Then select Paste from the Clipboard tools.
To populate the Median cells, we will use the same process. Place your cursor in cell F44 and click on the function button. From the Statistical functions, select MEDIAN and select the same array, F2:F41. Click OK. Copy and paste the function in cell F44 across to cell J44.
Although the mode is not typically the best measurement of central tendency for quantitative data, you can provide it for these variables using the same process: MODE is a function listed in the Statistical category of functions.
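For readers who want to check the three measurements outside Excel, here is a minimal sketch using Python's standard statistics module. The ten values below are hypothetical stand-ins for cells F2:F41, chosen only so that the mean happens to equal 6.6; they are not the WidgeOne data.

```python
from statistics import mean, median, mode

# Hypothetical job-grade values standing in for cells F2:F41.
job_grade = [6, 7, 7, 5, 8, 7, 6, 7, 5, 8]

grade_mean = mean(job_grade)      # counterpart of AVERAGE
grade_median = median(job_grade)  # counterpart of MEDIAN
grade_mode = mode(job_grade)      # counterpart of MODE
print(grade_mean, grade_median, grade_mode)

# A blank cell is skipped, but a cell holding 0 counts as an observation
# and pulls the mean down.
mean_with_zero = mean(job_grade + [0])
print(mean_with_zero)
```

The last two lines illustrate the warning about cell F42: adding a single 0 to the range changes the mean, whereas an empty cell would not.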
We now have the basic descriptive statistics for the quantitative variables. You may notice that some of the values have no decimal places, some have one decimal place and some have many decimal places. We think this looks a little untidy (as Statisticians, we like things to be tidy). To make this spreadsheet look a bit more professional, let's format all of the data to have a consistent number of decimal places. To do this, click on the cell in the far upper left corner (the cell to the LEFT of the A column and ABOVE the first row). This will highlight the entire spreadsheet. Then, from the Home tab, select the comma button from the Number tools. This will make all of the numbers in the spreadsheet have two decimal places.
Note that if you needed to add or remove decimal places, you could easily do so by selecting the cells of interest and then clicking on the Increase Decimal or Decrease Decimal button, as circled above. In practice, if you need to provide multiple descriptive statistics on a variable, this is not the process that you would go through. For multiple descriptive statistics, you would go to the Data tab and, from the Analysis Tools, select Data Analysis. This path will bring up the following:
In the event that you do not see the Analysis Tools under the Data tab, click on the MS Office Button in the upper left corner. Select Excel Options at the bottom, then Add-Ins, and then Go. Ensure that the Analysis ToolPak is ticked and click on OK. It should be there now. Note that if you have an unauthorized copy of Excel, you probably won't have access to this very important functionality.
Highlight the quantitative variable(s) of interest (F1:J41).
Identify that you have labels in the first row.
Again, pretty untidy. Format the spreadsheet to have two decimal points for all values and expand the columns to accommodate the labels. Your tidy version should look like this:
Notice that we reproduced all of the measurements from before, and several more. This is a more efficient way to produce the descriptive statistics of one or more variables.
In Chapter 2, we presented the concept of a frequency table as another method of displaying the spread of a dataset. As discussed, frequency tables are one of the most commonly used methods to display data, so understanding how to create a frequency table from a quantitative variable is a critical skill. The table presented in Chapter 2 was created in Excel. We will reproduce it here.
The first step in creating a frequency table from a quantitative variable is to determine the categories that need to be developed for the variable (this process will effectively transform a quantitative ratio-scale variable into a qualitative ordinal variable). Previously, we determined that the job tenure variable (YRONJOB) should be categorized into three levels: less than 5 years, 5-10 years and more than 10 years. Recall that the categories must be mutually exclusive and collectively exhaustive. To accommodate these categories in Excel, we will create bins, where the TOP of each category identifies each bin.
For detailed information on the additional statistics produced, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.
In our WidgeOne.xls dataset, let's create a bin range for YRONJOB in column L:
These are the bins for the Histogram for Job Tenure. Category 1 is 0-4.99, Category 2 is 5-10.00 and Category 3 (which does not need to be entered) is everything above 10.00.
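To make the "top of each category" idea concrete, here is a minimal sketch in Python. It assumes, as we understand the Histogram tool to behave, that a value falls into the first bin whose top is greater than or equal to it, and that anything above the last bin goes into a final overflow category. The tenure values are hypothetical, not the WidgeOne data.

```python
from bisect import bisect_left
from collections import Counter

bin_tops = [4.99, 10.00]  # the top of Category 1 and of Category 2
labels = ["Less than 5 years", "5-10 years", "More than 10 years"]


def category(years):
    """Return the label of the first bin whose top is >= years."""
    return labels[bisect_left(bin_tops, years)]


# Hypothetical tenure values used only to exercise the bin edges.
years_on_job = [0.5, 4.99, 5, 7.25, 10, 10.5, 22]
counts = Counter(category(y) for y in years_on_job)
for label in labels:
    print(label, counts[label])
```

Note that only two bin tops are listed; the third category needs no entry because it simply catches everything above 10.00.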
Once these bins have been created, select the Data Tab, and then from the Analysis Tools, select Data Analysis and then Histogram:
Click OK. This will bring up a dialogue box, asking for information regarding the quantitative variable to be analyzed, and the associated Bin Range:
Highlight the range of the YRONJOB variable (including the label).
Highlight the Bin Range (including the label).
Again, a little untidy, but this is the base of what we need for the frequency table. Let's clean this up and add some columns to reproduce the table from Chapter 2. First, replace the bin titles with the real category labels of Less than 5 years, 5-10 years and More than 10 years. Second, expand the columns as needed. Third, total the bottom of the frequency column using the SUM function: in cell B5, type =SUM(B2:B4) (the SUM function can be found in the Math & Trig category of functions). Next, create two additional column headers: Relative Frequency and Cumulative Frequency. Your sheet should look like this:
The Relative Frequency column will display the percentage of observations in each category, an important piece of information, particularly when comparing populations of different sizes. This is done by simply taking each frequency and dividing it by the total. For example, in cell C2, we would type =B2/B5. This would result in .2750 (11/40). Rather than typing this same formula again and again to capture the relative frequencies of the next two categories, we would like to copy this formula into cells C3 and C4. Do this now. Did you get #DIV/0!? The problem is that when the formula =B2/B5 is copied down one cell, it becomes =B3/B6. There is nothing in cell B6, so Excel treats it as 0. Since division by 0 is undefined, we receive this error message.
If we want to copy the formula into the cells below, we need to nail down the reference to the Total cell and prevent the reference from changing. To do this, we place a $ in front of the B and another $ in front of the 5, giving $B$5 instead of B5. This can also be accomplished by placing the cursor in between the B and the 5 and pressing the F4 key. Once you have nailed down the Total cell as a reference cell, you can copy the formula into cells C3 and C4.
The Cumulative Frequency column will display the cumulative percentage of observations from 0 to the top of the category in question. This is accomplished by adding the relative frequency of a category to all of the relative frequencies before it. In Excel, we would type =C2 in cell D2, because the first entry in the Cumulative column will always equal the first entry in the Relative Frequency column. In cell D3, we would enter =D2+C3. This adds the previous cumulative value (D2) to the Relative Frequency for the category (C3). We can now copy this formula into cell D4.
Clearly, this is a lot of manual work in Excel for a relatively small table. However, our focus is on helping to build the Excel skills necessary to execute this kind of analysis for any size table or dataset.
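The same table can be sketched in a few lines of Python, which may help clarify what each Excel formula is doing. Only the first count (11 of 40) comes from the text; the split of the remaining 29 observations between the other two categories is hypothetical.

```python
from itertools import accumulate

labels = ["Less than 5 years", "5-10 years", "More than 10 years"]
# 11 of 40 in the first category comes from the text; the other two
# counts are hypothetical and chosen only so the total is 40.
frequency = [11, 16, 13]

total = sum(frequency)                     # =SUM(B2:B4)
relative = [f / total for f in frequency]  # =B2/$B$5, copied down
cumulative = list(accumulate(relative))    # =C2, then =D2+C3, copied down

for label, f, r, c in zip(labels, frequency, relative, cumulative):
    print(f"{label:<20}{f:>4}{r:>8.1%}{c:>8.1%}")
```

Dividing every frequency by the single variable total plays the role of the anchored $B$5 reference: the denominator does not move as the calculation moves down the column.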
You can probably guess what is next: let's make it a bit more tidy and presentable. First, let's convert the decimals to percentages, since that is the way most people would expect to see the data. Highlight cells C2 through D5. Then go to the Home tab and click on the % sign in the Number tools. This should have converted all of the numbers to percentages with no decimals. If you would like to see decimals, you can increase them by selecting the Increase Decimal button in the Number tools on the Home tab.
Second, let's format the text to ensure that it is all the same (right now some text is italicized and may not be in the same font). Highlight the entire table of data (cells A1:D5). From the Font tools, select a common font (we prefer Century Gothic). Also, you can take off the italics by clicking on the I in the Font tools (you may have to click it twice).
Finally, if you want to standardize the appearance of the gridlines, select the Border Box from the Font tools. From the pull-down menu, identify that you want no borders. Then, go back and identify that you want a Thick Box Border.
For nascent users of Excel, we understand that this seems like a lot of work. To this mild protest, we have two points. First, most recipients of your analysis will ONLY see your tables and/or graphics (next section). So you need to spend as much time making your analysis look clean and professional as you do ensuring that it is mathematically and logically correct. Second, as you will see in the subsequent chapters, some of these executions, which appear awkward in Excel, are quite easy in other software applications.
Remember that when creating bins in Excel, we identify the TOP of each category, and the highest category does not need to be identified.
Once these bins have been created, select the Data Tab, and then from the Analysis Tools, select Data Analysis and then Histogram:
Click OK.
Following the same process as was used to create the frequency table, identify the Input Range and the Bin Range, and ensure that the Labels box is checked. This time, also check the Cumulative Percentage and Chart Output options:
YOU MUST IDENTIFY THAT YOU WANT THE OUTPUT IN A NEW WORKBOOK. WHY? WE DON'T KNOW. ASK THE BRAIN TRUST AT MICROSOFT.
Selecting these two options will convert the frequency table into a histogram.
You guessed it: a little untidy. Let's format a few things on our histogram. First, as before, let's change the Bin names to what we really want: Less than 3, 3-6, 7-10, 11-14 and 15+. These changes can be made in the frequency table; the histogram will be automatically updated because the graphic is dynamically linked to the table. Second, highlight the legend and delete it (it does not really communicate any meaningful information). Third, double click on the x-axis and format the font as needed (we prefer Century Gothic). Do the same for the other two axes. Finally, click on the area of the graphic and then right click. Select Chart Options and rename the labels as needed. Your final histogram should look something like this:
[Figure: histogram with Frequency on the y-axis (0 to 10) and the categories Less than 3, 3-6 Years, 7-10 Years, 10-14 Years and 15+ Years on the x-axis]
Well Done! To reproduce the pie chart in Chapter 2, begin by bringing up the sheet that contains the frequency table that you created in the previous section. Go to the Insert tab:
Highlight cells A1:B4. Do not include the total. Then, select Pie Chart from the Chart tools:
Select the first option, the basic 2D chart. You should now see this:
Now, the primary issue with this pie chart is that we have no information regarding the percentages that comprise each slice, which is the whole reason to use a pie chart. To insert percentage values, go to the Layout tab and, from the Labels tools, select Data Labels. Then go to the bottom of the drop-down and select More Label Options. You should see this:
After you have identified that you want a percentage value, click on Close. You should now see this:
You can click on the Frequency title and change it to something more meaningful, like Years on Job. Well done! As noted in Chapter 2, some recipients of your data may be colorblind. Although Excel typically does not place colors such as green, red and brown together, should you need to override the default colors provided in Excel (or include patterns to accommodate printing in black and white), simply go to the Design tab and make an alternative selection.
Bar charts are created using a very similar process. We will create a bar chart of the same information (as noted earlier, bar charts are not histograms on their sides). To reproduce the bar chart in Chapter 2, begin by bringing up the sheet that contains the frequency table that you used to create the pie chart. Go to the Insert tab. Highlight the same data as before, A1:B4 (be sure not to include the totals). From the Charts tools, select Bar and then select the first option.
Whereas Pie Charts are used to explain relative proportions (percentages), Bar Charts are used to communicate counts. So, the units in this chart are fine. You may want to double click on the Frequency title and give it a more meaningful name. Also, you may want to delete the frequency legend, since it does not communicate any meaningful information.
You can leave everything as its default, but for the Table/Range box, click on the little spreadsheet button as circled and highlight the entire dataset. It is particularly important to make sure that you include the titles from the first row. Select OK. At this point you should see something that looks like this:
In the event that your sheet does not look like this, do not let your heart be troubled. We can fix this. If your pivot table template DOES NOT look like this, place your cursor inside the table and right click. Go to Pivot Table Options. Go to the Display tab, tick the box for Classic Pivot Table Layout and click OK. You should now have the screen as shown above. You can think of this as an empty template with rows and columns, waiting to be filled with data. Let's begin by placing the Plant variable in the column and the Gender variable in the row:
You can click and drag the variables into the right position either in the listing or in the table itself.
Now, if we are simply trying to ascertain counts for a basic contingency table, should we place the Plant or the Gender variable in the center of the table? The answer is that it does not matter: the counts will be the same. I placed the Gender variable in the center and generated the following table:
Try placing the Plant variable in that position; you should generate the same numbers. Cool. From this table it is easy to see that the Mode of the Plant variable is Dallas and that there is no mode for the Gender variable, since we have the same number of Males and Females. As we did before, let's look at this data in a few different ways. First, change the data to be a percentage of row. This can be done by clicking on the Count of Gender entry, as circled above. Select Value Field Settings. You should see the following:
Click on the Show values as tab and select Show values as % of row as indicated below:
Select OK.
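The "% of row" setting chosen above can be sketched outside Excel as follows. The cell counts are not read from the dataset; they are inferred from figures quoted in this chapter (40 employees, equal numbers of males and females, 65% of females in Dallas) and should be treated as illustrative.

```python
from collections import Counter

# Counts inferred from figures quoted in the text; treat as illustrative.
cell_counts = {
    ("F", "Dallas"): 13, ("F", "Norcross"): 7,
    ("M", "Dallas"): 10, ("M", "Norcross"): 10,
}

# Expand to one (gender, plant) record per employee, as in the dataset.
records = [key for key, n in cell_counts.items() for _ in range(n)]

counts = Counter(records)                    # Count of Gender by Plant
row_totals = Counter(g for g, _ in records)  # total per gender (row)

# Each cell divided by its row total, expressed as a percentage.
row_percent = {
    (g, p): 100 * n / row_totals[g] for (g, p), n in counts.items()
}

for (g, p), pct in sorted(row_percent.items()):
    print(f"{g} {p:<9}{pct:5.1f}%")
```

Dividing by column totals or by the grand total instead would give the "% of column" and "% of total" views, each of which answers a slightly different question.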
As we discussed before, this now tells us that "Of all of the females, 65% work in Dallas." You could change this display to be the percent of columns or the percent of totals; they all communicate subtly different messages. If we want to incorporate an additional piece of information, like the average job tenure by plant and by gender, we could do this by substituting the YRONJOB variable in the Data position. Do this by dragging the Gender (or Plant) variable out of the Values box and placing the YRONJOB variable in the same place:
The problem with our table at this point is that we really wanted the average Years on Job, but this is the summation of the total years on job for each intersection. To change the summation to an average, click on the Sum of YRONJOB in the Values box, and select Value Field Settings. From the Summarize By tab, change the default from Sum to Average and click OK. You should now have the following screen:
As before, this is a little untidy. You can format the cells to have consistently two decimal places. Much better! This table now provides information such as "In the Dallas plant, women have an average of 8.85 years on the job." Now you can copy and paste this table into other documents or into another Excel sheet. As you can see, Pivot Tables are very useful and very flexible. However, because they are so flexible, they do require a bit of manipulation. Mastering Pivot Tables in Excel is a great differentiating skill, but will require practice (and patience).
Stacked bar charts are easy to create and manipulate. To reproduce the stacked bar chart from Chapter 2, we will use the first Pivot Table created above, which indicated the frequency counts by gender and by plant. Go to the Pivot Table, convert the data back to counts, copy these cells and paste them into another part of the spreadsheet:
Note: DO NOT COPY THE PORTION OF THE PIVOT TABLE WITH THE DROP DOWN ARROWS.
This will disengage the data from the Pivot Table. You will see why this is helpful soon. Now, highlight all of the data EXCEPT for the totals. With the data highlighted, go to the Insert tab and the Bar option in the Charts tools. Select the second option. You should see this:
[Figure: stacked bar chart of counts by plant (N, D) and gender (F, M)]
This chart is fine, but a little untidy. Because the chart is dynamically linked to the table, you can update the N to read Norcross and the D to read Dallas, and do the same for the genders. We should also apply a title. Go to the Layout tab and the Labels tools and select Chart Title. Add the title.
Well Done! Whenever you have different population sizes, as is the case with the Dallas and Norcross Plants, it is sometimes helpful to scale both populations to 100% to compare the two more easily. This is the purpose of a 100% Stacked Bar Chart. To execute this chart, you start the same way: highlight the data, go to the Insert tab and click on the Bar option, but this time select the third option (all the bars are the same length).
[Figure: 100% stacked bar chart of gender proportions by plant (Dallas, Norcross), with the axis running from 0% to 100%]
You can think of this visualization as side by side pie charts. This graphic communicates the proportion of males and females within each plant. It is easy to see from this graphic that there are proportionately fewer women in Norcross than in Dallas.
The final visualization in this section is the scatter plot. Scatter plots are typically used to determine if there is a meaningful relationship between two variables. To reproduce the scatter plot in the previous section, let's return to the Plant_Survey sheet. To examine the relationship between Job Productivity and Job Tenure, we will plot these two variables in a scatter plot. It is important to note that in a scatter plot, we are NOT trying to establish any causation, only correlation.
First, it is helpful to have the two variables next to each other, which they are not. To move the PRDCTY variable next to the YRONJOB variable, click on the G at the top of the column where the PRDCTY variable is located. This will select the entire column. Now, press Ctrl and X together. There should be chasing lights around the variable column. Click on the I at the top of the YRONJOB variable, so that column is highlighted. Now press Ctrl, Shift and + at the same time. Cool. The variables should now be next to each other (this is actually a Lagniappe).
Once the variables of interest are side-by-side, highlight both. Go to the Insert tab and select the Scatter option from the Charts tools. Delete the legend. You should see something like this:
[Figure: unformatted scatter plot with PRDCTY on the y-axis (0.00 to 120.00) and years on job on the x-axis (0.00 to 20.00)]
Again, pretty untidy. We will do three things to clean up the appearance of this graph: rescale the y-axis, add titles and take away the decimals. The y-axis needs to be rescaled because the data does not actually start until above 50; there is a great deal of wasted space. To rescale the y-axis, select the Layout tab. From the Axes tools, select Axes. Select Primary Vertical Axis>More Primary Vertical Axis Options. You should see this:
Click on Close. Now, recall that graphics are typically dynamically linked to data in Excel. So, if we change the data, we change the graphic. In the Plant_Survey sheet, highlight both variables and decrease the decimals. This will change the appearance of the scatterplot as well. Finally, to add titles to the axes, select the Layout tab. Choose the Axis Title option from the Labels tools. Begin by assigning a Primary Vertical Axis title, and then a Primary Horizontal Axis title. Name them appropriately. Then, rename your chart title. Your scatter plot should look something like this:
[Figure: scatter plot of Productivity (y-axis) against Years on Job (x-axis)]
As highlighted in Chapter 2, we can derive additional information from this graphic by adding a trendline to the data. To add a trendline, click on the dots in the graph and then right click. Select Add trendline.
Identify that you want the Equation and the R-squared value displayed on the chart.
Select Close.
[Figure: scatter plot of Productivity (y-axis) against Years on Job (x-axis) with the trendline, its equation and the R-squared value displayed]
As explained in Chapter 2, this information now provides us with the best linear equation that fits the relationship between Productivity (y) and Job Tenure (x). The R-squared value of .1124 indicates that this is not a particularly strong relationship: Job Tenure explains only 11.24% of the variation in Productivity. These concepts form the basis of Regression Modeling. For a more detailed explanation of Regression Modeling, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.
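For readers curious about what the trendline option computes, here is a minimal sketch of an ordinary least-squares line and its R-squared value in Python. The function name fit_line and the eight data pairs are our own, hypothetical choices; they are not the WidgeOne data, so the printed equation will not match the chart.

```python
def fit_line(x, y):
    """Least-squares slope, intercept and R-squared for y = slope*x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # share of variation in y explained by x
    return slope, intercept, r_squared


# Hypothetical (years on job, productivity) pairs.
years = [1, 3, 4, 6, 8, 10, 12, 15]
productivity = [78, 85, 70, 92, 81, 88, 76, 95]

slope, intercept, r_squared = fit_line(years, productivity)
print(f"y = {slope:.3f}x + {intercept:.3f}, R-squared = {r_squared:.4f}")
```

An R-squared near 0 means the line explains little of the variation in y, and a value near 1 means the points lie close to the line.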
4.5 Using Excel 2007 for: Random Number Generation and Simple Random Sampling
Our WidgeOne.xls dataset is fairly small, with only 40 observations. As a result, it would be unusual to want to extract a sample from such a small dataset. However, for the purposes of demonstrating random number generation in Excel, let's assume that we want to randomly select ten individuals with whom we want to conduct in-depth interviews. Let's begin by assigning random numbers to each individual. Go back to the Plant_Survey sheet and create a new column labeled RANDOM. Place your cursor in the first cell under the column label (row 2). Click on the formula button. Ensure that ALL is selected as the Function Category. Scroll down through the Function Names until you see RAND. Select RAND and click OK. This will generate the following:
There are three pieces of information you need to understand from this box:
1. The function takes no arguments, which means that we do not need to provide any information;
2. The function will return an evenly distributed (uniform distribution) random number between 0 and 1;
3. The function is volatile, which means that the value returned will change EVERY time the spreadsheet is manipulated.
Click OK. You should see some number between 0 and 1 in this cell (your result will be different each time, since the random number is generated using your computer's internal clock). Remember that Excel reads this cell as =RAND(), not as the number that you see. Now copy the formula in this cell down to the bottom of the dataset. Did you notice that your original number in row 2 changed? This is because it is volatile. Sometimes we need to have volatile arguments in (not with) Excel. Most of the time we do not. To convert the numbers you see from volatile to stable (unchanging), highlight the entire column and, from the Home tab, select Copy and then Paste>Paste Values.
Now, you should have a column of unchanging random numbers (your numbers will be different from ours):
Now, sort the entire dataset on the random numbers just created. Highlight the entire dataset. Then go to the Data tab. From the Sort and Filter tools, select Sort. When the dialogue box appears, click on the down arrow in the Sort by option. You will get a drop-down of all of the variables. Select Random. It does not matter if you sort smallest to largest or largest to smallest; it's random.
Click OK. Then, select the first 10 individuals for the interviews. This is a fairly simple, but very useful process.
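The same sort-on-a-random-column idea can be sketched outside Excel. The Python below is only an illustration under stated assumptions: the employee IDs are hypothetical placeholders for the 40 rows of the Plant_Survey sheet, and Python's random generator is not the one Excel uses, so the numbers will not match a spreadsheet.

```python
import random

# Hypothetical employee IDs standing in for the 40 rows of Plant_Survey.
employees = [f"EMP{i:02d}" for i in range(1, 41)]

rng = random.Random()  # unseeded, so results differ on every run

# Step 1: attach a uniform random number in [0, 1) to each row (like =RAND()).
keyed = [(rng.random(), emp) for emp in employees]

# Step 2: sort the whole dataset on the random column.
keyed.sort()

# Step 3: take the first 10 rows as the interview sample.
interviewees = [emp for _, emp in keyed[:10]]
print(interviewees)
```

Because every ordering of the random column is equally likely, the first ten rows after sorting form a simple random sample without replacement.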
Alpha is 1 - the Confidence Level.
The STD would have been previously calculated.
Size is the sample size (in this case, 40).
If we are computing a 95% confidence interval, we would enter .05 for the alpha value (you can think of alpha as the probability you are willing to accept of being wrong). The standard deviation, which was computed previously for job satisfaction, was 1.0210. The (sample) size is 40. Once this information is entered, the resulting computation should be .32. This is the margin of error for job satisfaction at a confidence level of 95%. You would then add this to and subtract it from the mean (6.85) to create the full interval. The full interval would then be reported as: Based on a representative sample of 40 employees, we are 95% confident that job satisfaction among all employees is estimated to be between 6.53 and 7.17.
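The arithmetic behind the .32 margin can be checked with a short sketch. It assumes the dialogue box computes the normal-based margin z * s / sqrt(n), which is what Excel's CONFIDENCE function does; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05        # 1 - confidence level
std_dev = 1.0210    # standard deviation of job satisfaction (from the text)
size = 40           # sample size
mean = 6.85         # sample mean of job satisfaction (from the text)

# Two-sided critical value from the standard normal distribution.
z = NormalDist().inv_cdf(1 - alpha / 2)

margin = z * std_dev / sqrt(size)
lower, upper = mean - margin, mean + margin
print(f"margin = {margin:.2f}, interval = ({lower:.2f}, {upper:.2f})")
# prints: margin = 0.32, interval = (6.53, 7.17)
```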
An important note in spreadsheet development: you could enter 1.02 in this box or enter the cell reference J47. You would generate the same answer. However, you are almost ALWAYS better off entering the cell reference rather than hard coding a number. This makes the formula more portable.
Find a new, clean column on the right of the dataset. Title the column Productivity Category. You should see this:
The quantitative productivity score is in column G. In cell L2, enter the formula =IF(G2<90,"LOW","HIGH"). The text values must be enclosed in double quotation marks. Copy this formula to the bottom of the dataset. You should see the following:
Cool. Now, let's assume that we want to add a third category: medium. Let's define medium productivity as a productivity score of at least 80 but less than 90, and low productivity as a score less than 80 (high is 90 or above). Replace the formula in cell L2 with this: =IF(G2<80,"LOW",IF(G2<90,"Medium","HIGH")). You should see the following:
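For readers who find nested IF formulas hard to follow, the same decision logic can be written out step by step. This is a sketch in Python, not part of the Excel workflow; the function name and the sample scores are hypothetical.

```python
def productivity_category(score: float) -> str:
    """Mirror the nested IF: below 80 is LOW, below 90 is Medium, otherwise HIGH."""
    if score < 80:
        return "LOW"
    if score < 90:
        return "Medium"
    return "HIGH"

# Hypothetical productivity scores, not taken from WidgeOne.xls.
scores = [72.4, 80.0, 84.6, 90.0, 95.1]
categories = [productivity_category(s) for s in scores]
print(categories)
# prints: ['LOW', 'Medium', 'Medium', 'HIGH', 'HIGH']
```

Note how the boundaries behave: a score of exactly 80 is Medium and a score of exactly 90 is HIGH, because each test uses a strict less-than comparison.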
Chapter 5 SPSS
What is SPSS?
SPSS was first developed in 1968 by social science researchers at Stanford University as a tool to help them with quantitative research. In fact, the acronym SPSS initially stood for Statistical Package for the Social Sciences. As with IBM and AT&T, the company (and its software) is now simply known by its initials, in part as a testament to its diverse user base. Although the software is most heavily used in social science contexts (particularly in psychology, political science and academia), it is also used in medicine, marketing, and many other fields. SPSS is appealing to many users from less technical or mathematical disciplines because it has a particularly user-friendly interface, consisting of an Excel-like spreadsheet for the data and menus and buttons for manipulations and analyses. Although this point-and-click interface makes SPSS particularly attractive for statistical computing novices, individuals who require greater statistical functionality may find the application limiting. This document has been written using SPSS version 15.4.
When you open SPSS you should see the following screen:
As shown above, there are two tabs in the new file: a Variable View tab and a Data View tab. The Data View tab will display the data much the same way as an Excel spreadsheet. We must import the data from the Excel spreadsheet WidgeOne. Do this by clicking on File>Open>Data. Then click the computer icon>Computer>C$(\\Client)(V:):
Note that if you are accessing SPSS through Citrix, all of your drive names will change. For example, your C: drive will become your V: drive. Make sure that the File type is set to .xls to find an Excel file.
Browse to where the WidgeOne file is located. When you open it, you will get a dialogue box like this:
This is one of two possible views of your dataset. This is the Variable View. Note at the bottom of the screen, the Variable View tab is highlighted. This view lists the variables in your dataset. In our case, the column names in the WidgeOne file were converted to variable names in this SPSS file. The qualitative variables (e.g., GENDER and PLANT) are called string variables and the quantitative variables (e.g., PRDCTY and YRONJOB) are called numeric variables. For later displays it will be nice to create user-friendly labels for each of the values in these variables, instead of indicators like D for the Dallas plant. To create labels that will make our output easier to read, click on the Values cell in the PLANT row:
You will be prompted for a value and a label. In the Value box, enter the value that appears in the actual data (for example, D) and, in the Label box, enter the text that you want to see in the output instead (for example, Dallas):
Click the Add button. Next assign the label Norcross for the value N and click Add again. Click OK. Do this for the other string variables, Gender and Position. Please note that this is NOT affecting your actual data; it will only change the way that the output appears.
Go back to the Data View tab at the bottom of the screen. You will see the actual data:
If you needed to create a new dataset from scratch, you would begin by defining your variables in the Variable View window and then return to the Data View window and input the values.
To expand the columns, simply place your cursor in between the column headers (variable names) and drag the column to its desired width just like you would in Excel. At this point, we could convert the other worksheets from the WidgeOne dataset into SPSS files. Each would be converted to a separate SPSS file. These files could be merged into one file using the Merge Files option in the Data Menu (not available in Student Version). However, since we will only be using the variables in the Plant_Survey worksheet for our statistical analyses, we will not execute a merge at this time.
We need to choose the variables for which we are interested in finding the mean and median. We will choose only the quantitative variables (those with the ruler icon next to them): JOBGRADE, SOCREL, YRONJOB, PRDCTY and JOBSAT. We make these selections by clicking on the variable from the list on the left and then clicking on the right arrow button circled above to place it on the variable list on the right. Almost every option in SPSS has this type of interface for selecting variables for analysis. You can choose more than one variable at a time by holding the Ctrl key down as you make your selections. Please make sure that the Display frequency tables option is UNTICKED. This will be more meaningful later.
After you have identified the variables for analysis, click on the Statistics option button as circled above. You should see this screen:
This should look pretty familiar. This is almost the same list of statistical information that was produced when we executed Tools>Data Analysis>Descriptive Statistics in Excel. Hmmm, that must mean that this stuff really is important. For now, just tick Mean and Median and select Continue.
We obtain the following display containing the means and medians of our five variables in our SPSS Output window:
Statistics
             JOBGRADE   PRDCTY    SOCREL   YRONJOB   JOBSAT
N  Valid     40         40        40       40        40
   Missing   0          0         0        0         0
Mean         6.6000     84.5798   5.3000   8.2900    6.8500
Median       6.5000     84.8114   5.0000   8.3500    6.6000
Notice that these figures are consistent with what we had generated using Excel and what we had computed by hand. Isn't it nice when numbers match? What if we were only interested in a subset of the data? For example, what if we wanted to know the measurements of central tendency of these variables by gender and by plant?
We would select the Compare Means>Means option from the Analyze menu as shown:
Typically, quantitative variables go in the Dependent List and qualitative variables go in the Independent List.
Choose the same five variables as before. Place these variables in the Dependent List. Then, place the variables Plant and Gender in the Independent List. This will enable us to better understand if there are differences between the genders and the plants across the quantitative variables like Productivity (PRDCTY). Once the variable lists have been populated, select the Options button. From the list, identify that you want the Mean and the Median. Select Continue and OK.
This output is much more explanatory than the first set of output. Look at the differences between the plants. Which plant is more productive? Which plant has a higher Job Satisfaction score? Now look at the differences between the genders. Which gender has a higher social relations score? Is there a difference in productivity between the genders? Sometimes looking at an average by itself is misleading. For example, let's assume that a friend of yours just read an article about lung cancer. He goes on to tell you that 1% of all Americans will die of lung cancer. I should probably mention that your friend is a member of the great statistical unwashed. Does this mean that you have a 1% chance of dying of lung cancer? Of course not. It depends upon a lot of things, like: do you smoke? If you re-evaluate that number by smokers/non-smokers, the values are very different. That's the point: an overall average can be very misleading. You need to look at the average (or median) by different groupings to better understand the rest of the story.
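To make the point about grouped averages concrete, here is a small Python sketch of what Compare Means does conceptually: split the records by a qualitative variable and summarize each group separately. The records below are hypothetical and are not the WidgeOne values.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (plant, productivity) records; not the real WidgeOne data.
records = [
    ("Dallas", 88.0), ("Dallas", 91.5), ("Dallas", 86.5),
    ("Norcross", 79.0), ("Norcross", 82.0), ("Norcross", 81.0),
]

# Collect the scores for each plant.
by_plant = defaultdict(list)
for plant, score in records:
    by_plant[plant].append(score)

overall = mean(score for _, score in records)
print(f"Overall mean: {overall:.2f}")
for plant, scores in sorted(by_plant.items()):
    print(f"{plant}: mean={mean(scores):.2f}, median={median(scores):.2f}")
```

In this made-up example the overall mean sits between the two plant means and describes neither plant well, which is exactly why the grouped view is more informative.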
As a rule, we do not use the mode as a Measurement of Central Tendency with quantitative data. If the data is qualitative (Plant, Gender, Position), it is the ONLY Measurement of Central Tendency available. We can determine the mode of variables such as these by selecting Analyze>Descriptive Statistics>Frequencies again. This time choose the qualitative variables Plant, Gender and Position. Check the box next to Display frequency tables. Then click OK.
You will see the following frequency tables, from which it is easy to determine if there is a modal value (isn't this easier than what we had to go through in Excel?):
Plant
                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Dallas      23          57.5      57.5            57.5
       Norcross    17          42.5      42.5            100.0
       Total       40          100.0     100.0

Gender
                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid  F           20          50.0      50.0            50.0
       M           20          50.0      50.0            100.0
       Total       40          100.0     100.0

POSITION
                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid  HRLY        20          50.0      50.0            50.0
       MGT         20          50.0      50.0            100.0
       Total       40          100.0     100.0
You can also see here that we are reaping the rewards of the work that we did earlier when we created the labels. It is much easier to understand these tables with the full labels.
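Reading the mode off a frequency table amounts to finding the category with the largest count. As a rough illustration in Python (not SPSS), the lists below are rebuilt from the Plant and Gender counts in the frequency tables above, and `multimode` from the standard library returns every value that ties for the highest frequency.

```python
from statistics import multimode

# Category counts taken from the Plant and Gender frequency tables.
plant = ["Dallas"] * 23 + ["Norcross"] * 17
gender = ["F"] * 20 + ["M"] * 20

print(multimode(plant))   # ['Dallas']: a single modal plant
print(multimode(gender))  # ['F', 'M']: a tie, so there is no single mode
```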
Obviously, we could have included many more statistics in our analysis simply by choosing the ones we want from the Statistics screen. The second Measurement of Dispersion discussed in Chapter 2 was the frequency table. To execute a basic frequency table for a qualitative variable, go to Analyze>Descriptive Statistics>Frequencies. Select the qualitative variables for analysis. Ensure that the Display frequency tables option is ticked at the bottom of the page. Click OK.
In the previous chapters, we explained how to categorize a quantitative variable into a qualitative variable. For example, when we created a frequency table for the job tenure variable, we created three categories: < 5 years, 5-10 years and more than 10 years. To create these same categories in SPSS, we need to recode our YRONJOB variable into a new variable called JOBTEN. To do this, go to the Transform menu and choose the option Recode into Different Variables:
Tick this box to tell SPSS that you are creating a qualitative variable
First we define the category New. In the screen above, you must indicate that the Range of this new value is from 0 to 4.9 (we wanted values less than 5 and the data had only one decimal place of accuracy). Check in the box that specifies that the new output variable will be of type String. We also name the new values New. Click on the Add button to add this new output value.
Note that the values of YRONJOB between 0 and 4.9 will represent the category New in the new variable. Continue this same process creating the category Experienced (5-10 years on the job) and the category Mature (10+ years on the job). Note: since the value Experienced has 11 characters, change the Width from 8 to 11. After you have completed this process, click on Continue.
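The recode rules described here can be summarized as a small function. This is a Python sketch of the logic only, not SPSS syntax; the function name and the sample values are hypothetical, and it assumes that exactly 10 years counts as Experienced, following the "5-10 years" and "more than 10 years" definitions.

```python
def job_tenure(years_on_job: float) -> str:
    """Recode YRONJOB into the three JOBTEN categories described in the text."""
    if years_on_job < 5:
        return "New"
    if years_on_job <= 10:
        return "Experienced"
    return "Mature"

# Hypothetical YRONJOB values with one decimal place.
sample_years = [0.5, 4.9, 5.0, 10.0, 10.1, 17.2]
recoded = [job_tenure(y) for y in sample_years]
print(recoded)
# prints: ['New', 'New', 'Experienced', 'Experienced', 'Mature', 'Mature']
```

Whatever tool is used, the ranges should not overlap and should not leave gaps, so that every observation lands in exactly one category.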
The Name is what will appear in the dataset. The Label is what will appear in the output. Select Change and then select OK.
Now we can easily generate a frequency table for the new variable JOBTEN. As before, go to Analyze>Descriptive Statistics>Frequencies. Ensure that the frequency table option is ticked and select your new Jobten variable:
Job Tenure
                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Experienced   16          40.0      40.0            40.0
       Mature        15          37.5      37.5            77.5
       New           9           22.5      22.5            100.0
       Total         40          100.0     100.0
Well Done!
Click Continue. On the Explore dialogue box, make sure that the Both option is selected for the Display. Click OK.
YRONJOB Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     2.00        0 .  01
     5.00        0 .  22233
     5.00        0 .  44555
     4.00        0 .  6777
     8.00        0 .  88888999
     7.00        1 .  0000111
     4.00        1 .  2333
     4.00        1 .  4445
     1.00        1 .  7

 Stem width:   10.00
 Each leaf:    1 case(s)
Here is the Stem and Leaf plot. If you imagine rotating this graphic clockwise 90 degrees, it is basically a Histogram on its side. The plot tells us that each stem has a width of 10.00. This means that the stem digit should be interpreted in units of 10 and each leaf digit in units of 1. Let's start in the middle with the frequency of 7.00. Here, we have four values that are 10.x and three more values that are 11.x. The next line indicates a frequency of 4.00. In the dataset, we have an observation that is 12.x and three observations that are 13.x. The greatest observation is 17.x.
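To see how such a plot is assembled, the following Python sketch rebuilds it from the whole-year values that can be read off the plot itself (the decimal parts are not shown in the plot, so they are dropped here). It assumes the layout visible in the output: a stem width of 10, with each stem split into five rows of two leaf digits each.

```python
from collections import defaultdict

# Whole-year values read off the stem-and-leaf plot (decimals are dropped there).
years = ([0, 1] + [2] * 3 + [3] * 2 + [4] * 2 + [5] * 3 + [6] + [7] * 3
         + [8] * 5 + [9] * 3 + [10] * 4 + [11] * 3 + [12] + [13] * 3
         + [14] * 3 + [15] + [17])

rows = defaultdict(list)
for value in sorted(years):
    stem, leaf = divmod(value, 10)        # stem width of 10
    rows[(stem, leaf // 2)].append(leaf)  # five rows per stem, two leaf digits each

# Print frequency, stem and leaves for each row, lowest values first.
for (stem, _), leaves in sorted(rows.items()):
    print(f"{len(leaves):5.2f}  {stem} . {''.join(map(str, leaves))}")
```

Each printed row shows the frequency, the stem, and the leaves, in the same order as the SPSS display.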
This is a boxplot. Here, the center line is the median. The box is the interquartile range (IQR): the high end of the box is the 75th percentile and the low end is the 25th percentile. The whiskers that extend in either direction tell us the full range of the data. If there were any outliers (defined as observations more than 1.5*IQR beyond either end of the box), they would be identified here. Lots and lots of output, with relatively little work. That's what I'm talking about!
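The outlier check behind a boxplot can be sketched in a few lines of Python. This uses the common convention of measuring the 1.5*IQR distance from the edges of the box (the quartiles). The data are hypothetical, and quartile conventions differ slightly between packages, so the fences computed here may not match SPSS to the last decimal.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
outliers = iqr_outliers(data)
print(outliers)
# prints: [50]
```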
Because Pie Charts communicate proportions, drag the Count default back to the left and drag the percent option into this space.
The four tabs across the top of this dialogue box will take you through four different sets of options. The only other thing we really need to do is to give our Pie Chart a title. So, click on the Titles tab and title the chart Job Tenure of WidgeOne Employees. Feel free to explore the other tabs.
Let's say that you wanted to understand how the overall productivity of the company was allocated by plant, that is, what percentage of the productivity comes from Dallas versus Norcross. This is easy to do in a Pie Chart in SPSS. Go back to Graph>Legacy Dialogues>Interactive>Pie>Simple. Make the following changes:
Don't forget to change the title. Click OK. You should see the following Pie Chart:
To correctly obtain the percentages, double click the graph to see this:
Click show data labels and then close the properties window and the chart editor to obtain the following graph:
This pie chart now provides information regarding the percentage of WidgeOne's productivity by plant (Norcross needs to step it up).
The next univariate visualization tool is a Bar chart. This is done in a very similar way to the Pie Chart.
As before, change the title to something meaningful. Peruse the other tabs as you see fit. You should generate something like this:
We should probably note at this point that if the definitions that you assigned when you transformed the quantitative variable into a qualitative variable are not universally known, you should include a legend or key at the bottom of your graphic to ensure that the reader understands the definition of New and Mature.
As with Excel Pivot Tables, Crosstabs in SPSS are very flexible. If you wish to include more than just the frequency counts in the cells of your table, click on Cells. You will see the following window:
In the percentages section, select Row, Column and Total. Click Continue and then OK.
Wow, look how much output was created in a single table! That was so much easier than Excel! The output table contains the conditional probabilities described in Chapter 2. In the first cell (the intersection of Female and Dallas) we have four pieces of information. We know that there are 13 women who work in Dallas. We know that of all of the Dallas employees, 56.5% are female. We know that of all of the women, 65% are in Dallas. Finally, we know that of all employees, 32.5% are females in Dallas.
Plant * Gender Crosstabulation
                                     Gender
                                 Female    Male      Total
Plant  Dallas    Count           13        10        23
                 % within Plant  56.5%     43.5%     100.0%
                 % within Gender 65.0%     50.0%     57.5%
                 % of Total      32.5%     25.0%     57.5%
       Norcross  Count           7         10        17
                 % within Plant  41.2%     58.8%     100.0%
                 % within Gender 35.0%     50.0%     42.5%
                 % of Total      17.5%     25.0%     42.5%
Total            Count           20        20        40
                 % within Plant  50.0%     50.0%     100.0%
                 % within Gender 100.0%    100.0%    100.0%
                 % of Total      50.0%     50.0%     100.0%
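The three percentages in each cell come from dividing the same count by three different totals. The short Python sketch below, which uses only the four cell counts from the crosstabulation, reproduces the Female/Dallas cell; the function name is illustrative.

```python
# Cell counts from the Plant * Gender crosstabulation.
counts = {
    ("Dallas", "Female"): 13, ("Dallas", "Male"): 10,
    ("Norcross", "Female"): 7, ("Norcross", "Male"): 10,
}

# Row (plant), column (gender) and grand totals.
total = sum(counts.values())
plant_totals, gender_totals = {}, {}
for (plant, gender), n in counts.items():
    plant_totals[plant] = plant_totals.get(plant, 0) + n
    gender_totals[gender] = gender_totals.get(gender, 0) + n

def cell_percentages(plant, gender):
    """Return the row, column and overall percentages for one cell."""
    n = counts[(plant, gender)]
    return {
        "% within Plant": round(100 * n / plant_totals[plant], 1),
        "% within Gender": round(100 * n / gender_totals[gender], 1),
        "% of Total": round(100 * n / total, 1),
    }

print(cell_percentages("Dallas", "Female"))
# prints: {'% within Plant': 56.5, '% within Gender': 65.0, '% of Total': 32.5}
```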
If you need to subset this information further (e.g. by Job Tenure), there is an easy way to do that. Go back to the Analyze>Descriptive Statistics>Crosstabs screen. Press Reset to return to the default settings.
Click OK.
This time, the table will only show the cell counts (we could have included the percentages as before by following the same steps in the Cell Display screen):
Plant * Gender * Job Tenure Crosstabulation
Count
                                      Gender
Job Tenure                        Female   Male   Total
Experienced   Plant   Dallas      3        5      8
                      Norcross    3        5      8
              Total               6        10     16
Mature        Plant   Dallas      6        3      9
                      Norcross    2        4      6
              Total               8        7      15
New           Plant   Dallas      4        2      6
                      Norcross    2        1      3
              Total               6        3      9
Notice that the same information on Plant and Gender counts has now been provided by each level of Job Tenure: Experienced, Mature and New (the levels are reported in alphabetical order rather than by order of magnitude). Cool.
Stacked Bar Charts can be generated in SPSS using the same basic executions that you did for simple Bar Charts in the previous section. Select Graphs>Interactive>Bar:
Place the Plant variable here. Place the Gender variable here.
Select OK. You should see the following Stacked Bar Chart:
Because these groups are of different sizes, it might be better to plot this information in a 100% Stacked Bar Chart instead. To do this, go back to Graph>Legacy Dialogues>Interactive>Bar:
The last multivariate visualization technique is the Scatter Plot. Again, SPSS provides us with flexibility to subset our analysis if needed.
What variables might have a relationship? What about Productivity and Job Satisfaction? A Scatter Plot of these variables can be generated by selecting Graph>Legacy Dialogues>Interactive>Scatterplot:
Don't forget to change the title! You should see the following graphic:
So, what do you think? It appears that there might be a positive relationship between the two variables, because the points roughly move in a linear fashion from the SW corner to the NE corner.
5.5 Using SPSS for: Random Number Generation and Simple Random Sampling
Like the other software applications, SPSS will generate random numbers using the internal clock in the computer. As a result, every time a command is given to SPSS to generate some set of random numbers, a different set of random numbers will be generated. However, sometimes we may need to replicate a set of random numbers exactly the way they were previously generated. To accomplish this replication, SPSS allows the analyst to define a seed number that will ensure a consistent set of random numbers. The numbers are still random and can be used to ensure statistical independence of samples. If you need to set the seed number so you can replicate your results, simply go to the Transform menu. Choose the Random Number Generators option. You should see the following screen:
This system is set to have a Starting Point of 1234567. This starting point is referred to as a seed. You can set the starting point value prior to each analysis that uses the random numbers. The value must be a positive integer. To create a string of random numbers, which is uniformly distributed between 0 and 1, go to the Transform menu and choose Compute Variable. We will call the new random number variable Group as shown in the screen below. Look at the menu for Function Group. In this menu, select Random Numbers. You will then see a long list appear in the Functions and Special Variables menu. This is a list of distributions that you could use to generate the new random variable Group. This time double click on Rv.Uniform:
Every distribution has parameters that must be specified. For the uniform distribution, the only parameters are the two values between which we want our random numbers to fall. The ?s in the expression RV.UNIFORM(?,?), which appears in the Numeric Expression box, are asking you to fill in these two values for your random numbers. Change this expression to read RV.UNIFORM(0,1), so the random numbers will be between 0 and 1 (as they were in Excel). Click OK. The new variable Group should appear in your Data View. Here is what a typical result would look like:
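The role of the seed is easy to demonstrate in any language with a seedable generator. The sketch below is in Python rather than SPSS, and the function name is hypothetical. Python's generator is not the one SPSS uses, so the actual numbers will not match SPSS output even with the same starting point; only the behaviour (same seed, same sequence) carries over.

```python
import random

def uniform_column(seed: int, n: int = 40) -> list[float]:
    """Generate n uniform(0, 1) random numbers from a fixed starting point."""
    rng = random.Random(seed)
    return [rng.uniform(0, 1) for _ in range(n)]

first = uniform_column(1234567)
second = uniform_column(1234567)
print(first == second)                   # True: same seed, same numbers
print(first == uniform_column(7654321))  # a different seed gives a different set
```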
Remember that your results will vary since this variable was randomly generated. One of the primary reasons for generating random numbers is to assign observations into statistically independent groups. Using the random numbers, let's assign the 40 observations into two groups: a test group and a control group. Just like we did in section 5.1, select Transform>Recode Into Different Variables. Select the new variable Group to be transformed. Click on Old and New Values. Set it up so that the values between 0 and .5 are put into the Control group and the values from .5 to 1 are put into the Test group:
Click on Continue. Give the new variable a name like Assignment and then click OK.
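The assignment rule can be sketched as follows. This is an illustrative Python version, not SPSS syntax; the seed is arbitrary, and the choice to send a value of exactly .5 to the Test group is an assumption, since the text leaves that boundary open.

```python
import random

rng = random.Random(1234567)  # arbitrary seed so the run can be repeated
group = [rng.uniform(0, 1) for _ in range(40)]

# Values below .5 go to Control; values of .5 or more go to Test (assumed boundary).
assignment = ["Control" if g < 0.5 else "Test" for g in group]
print(assignment.count("Control"), assignment.count("Test"))
```

Note that this method assigns each case independently, so the two groups will usually be close to, but not exactly, 20 cases each.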
Now, you have two groups of randomly assigned employees. This is a very important concept in Statistical Testing. Because the process of selecting a random sample from a set of data is so common, there is a very straightforward way to accomplish this in SPSS. Suppose we wish to select a simple random sample of 30 individuals from this dataset. Select Data>Select Cases>Random Sample of Cases>Sample:
You could choose to sample a certain percentage of the cases or sample 30 out of the first 40 cases. Do the latter. Click on Continue and then OK.
Note there is a new variable in the list: filter_$. It assigns the value 1 to those cases selected for the random sample and the value 0 to all others. Cases not selected for the sample are now slashed in the first column. Remember that samples will differ unless the same seed is used to generate them. At this point, you can execute all of your analysis as before, but only those cases with a filter_$ value of 1 will be analyzed. You can go back to all cases by selecting Data>Select Cases>All Cases.
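Conceptually, selecting a random sample of cases produces a 0/1 indicator for every row. The Python sketch below imitates that indicator under stated assumptions: the seed is arbitrary, the case numbers are simply 0 to 39, and the sampling routine is Python's, not SPSS's.

```python
import random

rng = random.Random(1234567)  # arbitrary seed; change or remove it to get a new sample
n_cases, n_sample = 40, 30

# Draw 30 distinct case numbers out of 40, without replacement.
selected = set(rng.sample(range(n_cases), n_sample))

# 1 marks a selected case and 0 an unselected one, like the filter_$ variable.
filter_flag = [1 if i in selected else 0 for i in range(n_cases)]
print(sum(filter_flag))  # 30
```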
T-tests are very common tests used to determine if two sample means differ significantly or if one sample mean differs from some established value. For more detailed information on t-tests, we suggest Statistical Methods and Data Analysis by Ott and Longnecker.
Click Continue and then OK. You will see the following output:
One-Sample Test
Test Value = 0
                                                             95% Confidence Interval
                                                             of the Difference
          t        df   Sig. (2-tailed)   Mean Difference    Lower      Upper
JOBSAT    42.440   39   .000              6.85000            6.5235     7.1765
As stated previously in Chapter 2, these results would be reported as: Based on a representative sample of 40 employees, we are 95% confident that job satisfaction among all employees is estimated to be between 6.52 and 7.18. Strictly speaking, the 95% describes the method rather than this one interval: if we repeated the sampling many times, about 95% of the intervals built this way would contain the true mean job satisfaction of all employees, which is unknown, and about 5% would miss it (the true mean would be < the lower limit or > the upper limit).
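The SPSS limits (6.5235 and 7.1765) are slightly wider than the 6.53 to 7.17 interval obtained in Excel because the one-sample t procedure uses a critical value from the t distribution rather than the normal distribution. The Python sketch below reproduces the limits from the summary figures in the text. The critical value 2.0227 is an assumption taken from a standard t table (97.5th percentile, 39 degrees of freedom), hard-coded here because the Python standard library has no t quantile function.

```python
from math import sqrt

mean = 6.85        # sample mean of JOBSAT (from the output)
std_dev = 1.0210   # sample standard deviation (from the text)
n = 40             # sample size
t_crit = 2.0227    # assumed table value: 97.5th percentile of t with 39 df

std_error = std_dev / sqrt(n)
margin = t_crit * std_error
lower, upper = mean - margin, mean + margin
print(f"95% CI: ({lower:.4f}, {upper:.4f})")
```

With these inputs the limits agree with the Lower and Upper columns of the SPSS output to about three decimal places; any small remaining difference comes from the rounding of the standard deviation and the critical value.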
Did you notice that? Probably not. Most people use SPSS because they don't have to write code to have the software do what they want. However, in the event that you find the point-and-click environment of SPSS too restricting, know that you always have the option to write custom syntax to have SPSS do more specifically what you want. To run syntax in SPSS, select File>New>Syntax. In the blank syntax screen, type (or paste) the syntax above. Then select Run>All. You will generate the same output as before! Cool!
Chapter 6 Minitab
What is Minitab?
Minitab was first developed in 1972 at Penn State University (Go Nittany Lions!). Initially, it was developed as a teaching tool to help professors teach basic statistics. It is still used for that purpose at more than 4000 colleges and universities around the world. One reason for its popularity in this venue is that it has a user-friendly, menu-driven interface, much like SPSS. Because it offers accurate and customizable analysis tools for quality control and other important business and industry functions, it is also now widely used by companies of all sizes. It is currently the package of choice at many Fortune 500 manufacturing companies, including Ford Motor Company, 3M, Honeywell International, and Samsung.
When you open Minitab you should see the following screen:
As shown above, the display consists of two windows. The Session Window is at the top and is where you will see commands and results displayed. The Data Window is at the bottom. It is the worksheet where the data is displayed in a spreadsheet format. You now need to open up the WidgeOne dataset in Minitab. In the File menu choose Open Worksheet as shown below:
You will next see a typical window for opening a file. Remember the WidgeOne.xls dataset is an Excel file. Your computer will initially be looking for Minitab files only. You have the option of looking for files of any type. The window below shows the system being instructed to look for Excel files. It also shows the WidgeOne dataset being selected:
Click Open.
The three worksheets in the Excel file WidgeOne.xls have all been converted to separate Minitab worksheets, named Attendance, Employees and Plant_Survey. The statistical analyses in this guide use only the Plant_Survey worksheet. You can close the others now. Make sure to go to the File menu and save the Plant_Survey worksheet for future use.
We need to choose the variables for which we are interested in finding the mean and median. We will choose only the quantitative variables: JOBGRADE, SOCREL, YRONJOB, PRDCTY and JOBSAT. We make these selections by clicking on the variables while holding down the Control Key and then clicking on the Select button. This button will appear darker once a variable has been highlighted.
As with SPSS, this interface is common to almost every function in Minitab. The Select button will not activate until at least one variable has been highlighted for selection:
After you have selected the variables for analysis, your screen should look like this:
Now click on the Statistics button. This will show you the statistics selected for display. There are many more statistics on this list than you have been exposed to in this course.
We can select several different descriptive statistics. Statistics are selected by clicking in the box next to each until a check mark appears. In this case, we have selected only the mean and median. Once your selections are made, click the OK button and then click OK again in the Display Descriptive Statistics window.
We obtain the following display containing the means and medians of our five variables. The display appears in our Minitab Session window:
Results for: Plant_Survey

Descriptive Statistics: JOBGRADE, SOCREL, YRONJOB, PRDCTY, JOBSAT

Variable     Mean    Median
JOBGRADE    6.600     6.500
SOCREL      5.300     5.000
YRONJOB     8.290     8.350
PRDCTY      84.58     84.81
JOBSAT      6.850     6.600
Notice that these figures are consistent with what we had generated using Excel and SPSS and what we had computed by hand. Again, it is nice when numbers match!
Before we go on to look at subsets of the data, let's recode the values of the qualitative variables we will be using with meaningful labels. The variables Plant and Gender are coded with single letters (N = Norcross, D = Dallas, etc.). We wish to recode these variables so that our displays and graphs will have more descriptive names. These are the steps we use to accomplish this. In the Data menu, select Code and then Text-to-Text as shown below:
We choose Text-to-Text because we wish to change text values (like D) to other text values (like Dallas). We see a box like the one below. In this example, we have chosen the variable Plant as the column to code the data from and also as the column to code the data into. This means we will recode the data within the same column rather than choosing another one for the recoded values. Fill in the rest of the box as below:
Click OK. The Minitab Data Window should now look like this:
Perform the same type of recode for the Gender variable (M = Male, F = Female). Your data will then appear as follows:
Now we are ready to look at subsets of the data that will be determined by these qualitative variables. For example, what if we wanted to know the measurements of central tendency of these variables by gender and by plant? We would proceed exactly as before: Stat>Basic Statistics>Display Descriptive Statistics. We again choose the same five variables.
This time we will click inside the box called By variables. Once we click inside this box, the menu of variable choices grows to include the qualitative variables from our set. Minitab knew we could not calculate means and medians for qualitative variables and did not include those in the variable selection box. However, when we wish to subset the data, these variables do come in as options. Please note that you should only place qualitative variables in the By variables box. For the first analysis, choose Plant and then click on the Select button. You should see the following display:
Click on the Statistics button to choose Mean and Median again. Also check the N Total box this time, so we get the frequency of each subset. Click OK and OK.
Follow the same series of steps, only this time select Gender for the By variables box. Your output should look like this:
Descriptive Statistics: JOBGRADE, SOCREL, YRONJOB, PRDCTY, JOBSAT

                      Total
Variable    Gender    Count     Mean    Median
JOBGRADE    Female       20    6.300     6.000
            Male         20    6.900     7.000
SOCREL      Female       20    6.000     6.000
            Male         20    4.600     5.000
YRONJOB     Female       20     8.19      8.50
            Male         20    8.395     8.350
PRDCTY      Female       20    83.97     84.90
            Male         20    85.19     84.81
JOBSAT      Female       20    6.980     6.700
            Male         20    6.720     6.400
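As a quick consistency check on the by-Gender output, the overall mean of a variable can be recovered from the group means by weighting each one by its group count. The short Python sketch below does this for three of the variables, using only the counts and means printed above; the function name overall_mean is ours, not Minitab's. Note that medians cannot be combined this way.

```python
# Group summaries copied from the by-Gender output:
# variable -> {group: (total count, mean)}
by_gender = {
    "JOBGRADE": {"Female": (20, 6.300), "Male": (20, 6.900)},
    "SOCREL": {"Female": (20, 6.000), "Male": (20, 4.600)},
    "JOBSAT": {"Female": (20, 6.980), "Male": (20, 6.720)},
}


def overall_mean(groups):
    """Combine group means into one mean, weighting each by its count."""
    total_n = sum(n for n, _ in groups.values())
    return sum(n * mean for n, mean in groups.values()) / total_n


for variable, groups in by_gender.items():
    print(variable, round(overall_mean(groups), 3))
```

The three results agree with the overall means reported earlier for the full sample (6.600, 5.300 and 6.850).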
As a rule, we do not use the mode as a Measurement of Central Tendency with quantitative data. If the data is qualitative (Plant, Gender, Position), the mode is the ONLY Measurement of Central Tendency available.
We can determine the mode of variables such as these by choosing the Stat menu and within that menu selecting Tables and Tally Individual Variables as shown here:
You will see the window below. Select the variables Plant, Gender and Position as shown:
It is easy to see that Dallas is the modal value for the Plant variable. It is also easy to see that there is no mode for the other two variables in this example.
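For readers who want to see the logic behind a tally, here is a minimal Python sketch. It rebuilds the Plant and Gender columns from the counts reported elsewhere in this chapter (23 Dallas, 17 Norcross; 20 Female, 20 Male) and follows this guide's convention that a variable has no mode when every category is equally frequent. The function name modes is an illustrative choice.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when all categories tie."""
    tally = Counter(values)
    top = max(tally.values())
    winners = [value for value, count in tally.items() if count == top]
    if len(tally) > 1 and len(winners) == len(tally):
        return []  # every category is equally common: no mode
    return winners


# Columns rebuilt from the counts reported in this chapter.
plant = ["Dallas"] * 23 + ["Norcross"] * 17
gender = ["Female"] * 20 + ["Male"] * 20

print(Counter(plant))   # the tally itself
print(modes(plant))     # the modal plant
print(modes(gender))    # empty list: the two genders tie
```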
We could have obviously included lots of statistics in our analysis simply by choosing the ones we want from the Statistics screen. The second Measurement of Dispersion discussed in Chapter 2 was the frequency table. In Chapters 2 and 3, when we created a frequency table for the job tenure variable, we created three categories: < 5 years, 5-10 years and more than 10 years. To create these same categories in Minitab, we need to recode our YRONJOB variable into a new variable called JOBTEN.
We have selected Numeric to Text because we are changing the numerical variable YRONJOB to a qualitative variable that we will call JOBTEN.
Select the YRONJOB variable as it is the one to be recoded. Type the name of the new variable JOBTEN in the Into Columns box (It is a new name, so it is not a choice to be selected from the left-hand menu). Then fill in the old and new values as we have them above. Note that values of YRONJOB between 0 and 4.9 will be coded as New. Values between 5 and 10 are coded as Experienced, and values over 10 are coded as Mature. Click OK.
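The recode rule is simple enough to sketch in a few lines of Python. The cut points come from the text above; the sample values are hypothetical and are not taken from the WidgeOne data. One assumption to flag: the text lists "0 to 4.9" and "5 to 10", so this sketch treats anything below 5 as New.

```python
def recode_tenure(years):
    """Map years on the job to the three JOBTEN labels used in this guide."""
    if years < 5:        # assumption: values such as 4.95 count as New
        return "New"
    if years <= 10:      # 5 through 10, inclusive
        return "Experienced"
    return "Mature"      # anything over 10


# Hypothetical YRONJOB values, for illustration only.
hypothetical_years = [0.5, 4.9, 5.0, 8.35, 10.0, 10.1, 17.2]
jobten = [recode_tenure(y) for y in hypothetical_years]

for years, label in zip(hypothetical_years, jobten):
    print(years, label)
```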
Your Data Window should now have the new text variable JOBTEN in it:
Now we can easily generate a frequency table for the new variable JOBTEN. Once more go to the Stat menu, select Tables>Tally Individual Variables. Select the variable JOBTEN. Click OK.
Well Done!
Here is the screen just before we click OK to draw our pie chart for JOBTEN:
Select the Labels button. Under the Titles tab, give your pie chart a meaningful title.
Identify that you want the slices to be labeled with the Category name and the Percent (remember that the reason that we create Pie Charts is to visually represent the proportions). Select OK and OK.
[Pie chart of JOBTEN. Slice labels: New 22.5%, Experienced 40.0%, Mature 37.5%]
Nice! You will probably find that, of all of the packages, Minitab has the strongest graphics. (If you are saying "Hey, how do I get back to my datasheet!?", just go to Window>Plant_Survey.)

If you need to create a pie chart to understand a quantitative variable (e.g. productivity) relative to a qualitative variable (e.g. Plant), in Minitab you must begin by getting some summary statistics for the quantitative variable with the qualitative one used to subset it. You can do this by selecting Stat>Basic Statistics>Store Descriptive Statistics. This process will store (save) our descriptive statistics in a table in our Data Window, so the Pie Chart command can use the results to make a chart. Replicate the window below. We are asking for statistics on PRDCTY by Plant.
Click on the Statistics button and choose the statistic Sum. Click OK.
You can think of this information as a little table within your datasheet that tells you the total amount of (summation of) productivity attributed to each plant.
Now, you are ready to make the pie chart for this data. Choose Graph>Pie Chart. You should indicate this time that your chart values are in a table. The categorical variable for your table was named ByVar1 (change this in the worksheet if you need to). The Summary variable was named Sum1 (again change it if you need to).
Click on Labels and title the chart "Chart of Productivity by Plant". Warning, Warning, Warning, Will Robinson: if you do not do this, then labels made for other charts will likely display on this new one. It happens to the best of us (well, not us, but other people)! Select the Slice Labels as necessary. Click OK.
[Pie chart of Productivity by Plant. Slice labels: Norcross 39.9%, Dallas 60.1%]
Next, we wish to replicate the bar chart from Chapter 2, which displayed the frequency count for each value of the variable JOBTEN.
Again, go to the Graph menu. This time choose Bar Chart. This action will produce:
Select the Simple chart as above and click on OK. Select JOBTEN as the categorical variable. Provide a title by clicking on the Labels button. Then click OK.
If you need to generate a different style of bar chart such as the one with horizontal bars in Chapter 2, you can play with some of the options in the Bar Chart panel. For example, to obtain horizontal bars, click on the Scale button before clicking OK. On the Axes and Tick screen select Transpose values and category scales as shown below:
[Horizontal bar chart of JOBTEN. Category axis: Experienced, Mature, New. Value axis: Count]
To create a stem and leaf display for the variable YRONJOB, go to the Graph menu and select Stem and Leaf (Notice that only the quantitative variables are available for this graphic). Select YRONJOB as your variable. Click OK.
You will get a Stem and Leaf Display like the one below in your Session Window:
Stem-and-Leaf Display: YRONJOB

Stem-and-leaf of YRONJOB  N = 40
Leaf Unit = 1.0

  2   0  01
  7   0  22233
 12   0  44555
 16   0  6777
 (8)  0  88888999
 16   1  0000111
  9   1  2333
  5   1  4445
  1   1  7
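A quick note on interpreting this messy output: the first column is a running count of observations, accumulated from the top down and from the bottom up toward the middle. The (8) is in parentheses to signify that this row contains the median; the 8 is the number of observations in that row (here it also happens to be the row with the most observations). The values in this row are 8.x, 8.x, 8.x, 8.x, 8.x, 9.x, 9.x, 9.x years of service.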
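To make the layout of the display concrete, the Python sketch below rebuilds it from the leaves shown above. Because the leaf unit is 1.0, the values are whole years read off the display (the decimals in the worksheet are not recoverable from it). The function stem_and_leaf is our own illustration of the general depth rule, not Minitab's code: each row shows the smaller of the two running counts, except the row holding the median, which shows its own count in parentheses. Rows with no observations are simply skipped in this sketch.

```python
def stem_and_leaf(values, lines_per_stem=5):
    """Return display rows as text: depth, stem (tens), leaves (units)."""
    data = sorted(values)
    n = len(data)
    width = 10 // lines_per_stem          # how many leaf digits share a row
    row_keys = sorted({v // width for v in data})
    # Row(s) holding the middle value(s) of the sorted data.
    median_rows = {data[(n - 1) // 2] // width, data[n // 2] // width}
    lines = []
    seen = 0
    for key in row_keys:
        row = [v for v in data if v // width == key]
        seen += len(row)                   # running count from the top
        from_bottom = n - seen + len(row)  # running count from the bottom
        if median_rows == {key}:
            depth = f"({len(row)})"
        else:
            depth = str(min(seen, from_bottom))
        leaves = "".join(str(v % 10) for v in row)
        lines.append(f"{depth:>4} {row[0] // 10:>2} {leaves}")
    return lines


# Whole years read from the stems and leaves of the display above.
years = ([0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7]
         + [8] * 5 + [9] * 3 + [10] * 4 + [11] * 3
         + [12, 13, 13, 13, 14, 14, 14, 15, 17])

for line in stem_and_leaf(years):
    print(line)
```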
To get a Boxplot for YRONJOB, go to the Graph menu and select Boxplot. You will see a screen like the one below where you can choose the style of Boxplot you need. Choose Simple for this first one.
[Boxplot of YRONJOB. Vertical axis: YRONJOB, scaled from 0 to 18]
Recall that in a box plot, the box represents the middle 50% of the dataset (from Q1 at the bottom of the box to Q3 at the top), and the line inside the box represents Q2 or the median.
Place the Plant variable in the Row position. Place the Gender variable in the Column position.
Click OK. The contingency table below should appear in your Session Window:
Tabulated statistics: Plant, Gender

Rows: Plant   Columns: Gender

            Female      Male       All

Dallas          13        10        23
             56.52     43.48    100.00
             65.00     50.00     57.50
             32.50     25.00     57.50

Norcross         7        10        17
             41.18     58.82    100.00
             35.00     50.00     42.50
             17.50     25.00     42.50

All             20        20        40
             50.00     50.00    100.00
            100.00    100.00    100.00
             50.00     50.00    100.00

Cell Contents:   Count
                 % of Row
                 % of Column
                 % of Total
Notice the key at the bottom indicating that the cell contents have the count (frequency) on the top, followed by row percents, column percents and total percents. Wow, look how much output was created in a single table! That was so much easier than Excel! The output table contains the conditional probabilities described in Chapter 2. In the first cell (the intersection of Female and Dallas) we have four pieces of information. We know that there are 13 women who work in Dallas. We know that of all of the Dallas employees, 56.52% are female. We know that of all of the women, 65.00% are in Dallas. Finally, we know that of all employees, 32.50% are females in Dallas.
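The three percentages in each cell differ only in the total used as the denominator. The Python sketch below recomputes them from the four counts in the table; the function name cell_percentages and the dictionary layout are illustrative choices of ours.

```python
# Counts from the Plant by Gender table above.
counts = {
    "Dallas": {"Female": 13, "Male": 10},
    "Norcross": {"Female": 7, "Male": 10},
}


def cell_percentages(table, row, col):
    """Return the count plus row, column and total percents for one cell."""
    n = table[row][col]
    row_total = sum(table[row].values())                    # e.g. all Dallas
    col_total = sum(r[col] for r in table.values())         # e.g. all Female
    grand_total = sum(sum(r.values()) for r in table.values())
    return {
        "count": n,
        "row_pct": round(100 * n / row_total, 2),
        "col_pct": round(100 * n / col_total, 2),
        "total_pct": round(100 * n / grand_total, 2),
    }


print(cell_percentages(counts, "Dallas", "Female"))
print(cell_percentages(counts, "Norcross", "Male"))
```

The first call reproduces the four numbers discussed above for women in Dallas: 13, 56.52, 65.00 and 32.50.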
If you need to subset this information further (e.g., by Job Tenure), there is an easy way to do that. Go back to the Stat>Tables>Crosstabulation and Chi-Square screen. This table will be a little busy, so let's just choose the Counts this time. Make your selections of the three variables as follows:
Click OK.
Notice that the same information on Plant and Gender counts has now been provided for each level of Job Tenure: Experienced, Mature and New (the levels are reported in alphabetical order rather than by order of magnitude).

The stacked bar charts developed in Chapter 2 can be easily developed in Minitab. Start in the Graph menu. Choose the option Bar Chart. You will see:
Make sure you choose the Stacked option as shown and click OK. Then select the variables so that the category axis is Plant and the bars are stacked by Gender. This is done by selecting Gender last and making sure the stack categories of last categorical variable box is checked. See below:
[Stacked bar chart "Plant by Gender". Category axis: Plant (Dallas, Norcross). Y-axis: Count. Bars stacked by Gender (Female, Male)]
The 100% Stacked Bar Chart is a little less straightforward to generate. Select Graph>Bar Chart>Stack>OK as before. Assign the variables Plant and Gender as before. Then select Chart Options. You will see the following screen:
To generate the 100% calibration of the bars within each plant value, set the Y-axis to be shown as a % value and accumulate the values within each category.
Select OK.
[100% stacked bar chart. Category axis: Plant (Dallas, Norcross). Y-axis: Percent]
The last multivariate visualization technique is the scatter plot. Again, Minitab provides us with flexibility to subset our analysis if needed. Consider the relationship between Job Satisfaction and Productivity as we did with SPSS in Chapter 5. This plot can be replicated in Minitab by going into the Graph menu and choosing Scatterplot. A choice of types of Scatterplots follows.
Next, choose PRDCTY for the Y-axis variable and JOBSAT for the X-axis variable as shown below:
Click OK. Click on the Labels button and add an appropriate title. Click OK and OK.
[Scatterplot. Y-axis: PRDCTY. X-axis: JOBSAT]
As we saw in Chapter 5, there is a slightly positive relationship between these two variables; it appears as if Job Satisfaction and Productivity are related.
6.5 Using Minitab for: Random Number Generation and Simple Random Sampling
Like the other software applications, Minitab will generate random numbers using the internal clock in the computer. As a result, every time a command is given to Minitab to generate some set of random numbers, a different set of random numbers will be generated. The software normally chooses its own starting point for the generation process by using the time of day to choose a random starting point in the string. Sometimes, however, you may wish to control where Minitab starts its string. For example, you may wish to repeat a sequence by generating the same set of random data. In this case, the BASE command tells the random number generator where to start. The generator will use this base until you set a new BASE or exit Minitab. If you need to set the base number so you can replicate your results, simply go to the Calc menu. Choose the Set Base option. You should see the following screen:
Here, we have not chosen a base. We could have chosen a positive integer as our base. In doing so, we could replicate our results anytime we wish to do so by going back in and resetting the base to that value.
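The idea of a base is the same as a seed in most programming languages. The Python sketch below is only an analogy (Python's generator is not Minitab's, so the actual numbers will differ from anything Minitab produces): starting from the same seed replays the same sequence, while a different seed gives a different one. The seed values are arbitrary.

```python
import random


def draw(seed, n=5):
    """Return n uniform(0, 1) values from a generator started at `seed`."""
    rng = random.Random(seed)   # a private generator, like setting a base
    return [rng.random() for _ in range(n)]


first = draw(12345)
second = draw(12345)    # same seed: the sequence is replayed exactly
other = draw(54321)     # different seed: a different sequence

print(first == second)
print(first == other)
```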
To create a string of random numbers that is uniformly distributed between 0 and 1, go to Calc>Random Data>Uniform. You may note here that Minitab has a lengthy list of distributions, as did SPSS, which can be used to generate random samples. Indeed, this sort of procedure is quite easy and versatile with this software. We will generate 40 values from this uniform distribution, one value for every observation. We will name our new data column Group. Every distribution has parameters that must be specified. For the uniform distribution, the only parameters are the two values between which we want our random numbers to fall. We choose these values to be 0 and 1. Fill in your window like the one below:
Click OK. The new variable Group should appear in your Data Window. Here is what a typical result would look like:
Remember that your results will vary since this variable was randomly generated. One of the primary reasons for generating random numbers is to assign observations into statistically independent groups. Using the random numbers, let's assign the 40 observations into 3 groups. Here is one way to accomplish this: Choose Data>Code>Numeric to Text.
Note that since the distribution of the random numbers is uniform, each random value has an equal probability of occurrence. This is very useful information for assignment of groups. If you are interested in assigning groups of approximately equal size, then you should allocate the values of 0 through .33 to one group, .34 to .66 to another, etc. If you want the first group to have approximately 25% of the population, then allocate the random values of 0 through .25 to the first group, etc. Cool.
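The same assignment can be sketched in Python. The cut points 0.33 and 0.66 follow the text above; the group labels and the seed are arbitrary choices for illustration, and the random numbers come from Python's generator rather than Minitab's.

```python
import random
from collections import Counter


def assign_group(u):
    """Map a uniform(0, 1) value to one of three roughly equal groups."""
    if u <= 0.33:
        return "Group 1"
    if u <= 0.66:
        return "Group 2"
    return "Group 3"


rng = random.Random(2024)                      # arbitrary seed, for repeatability
uniforms = [rng.random() for _ in range(40)]   # one value per observation
groups = [assign_group(u) for u in uniforms]

# Group sizes will be close to, but usually not exactly, equal.
print(Counter(groups))
```

For a first group of about 25% of the observations, the first cut point would simply move to 0.25.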
You should see something like the following in your Data Window:
Again, remember results will vary due to the randomness (pun intended) of this procedure. This procedure has taken the 40 observations and assigned them into 3 groups based upon the random numbers created in the previous procedure. Each of the 40 employees is now in one of these randomly assigned, independent groups.

Because this process of selecting a random sample from a set of data is so common, there is a very straightforward way to accomplish this in Minitab. Because we will be subsetting this dataset, go ahead NOW and save the Minitab file that you are working in: File>Save Current Worksheet As (this will allow you to save it as an Excel spreadsheet again). Now, suppose we wish to select a simple random sample of 30 individuals from this dataset. Go to Calc>Random Data>Sample from Columns.
Specify that you wish to select a random sample of 30 cases. In the From columns box, identify all of the variables. Then, identify all of the variables again for the Store samples in box. This will effectively take a random sample of size 30 from our dataset and discard the observations that were not selected (did you save your file?). Select OK. You should now be left with 30 observations.
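A simple random sample is one in which every set of 30 rows is equally likely to be chosen and no row is chosen twice. The Python sketch below illustrates the idea on row numbers 1 to 40; it assumes sampling without replacement, and the seed is arbitrary. Sampling row numbers rather than individual columns keeps each employee's values together, which is what selecting all of the variables at once accomplishes in the steps above.

```python
import random

rng = random.Random(7)            # arbitrary seed, for repeatability
row_ids = list(range(1, 41))      # the 40 observations in the worksheet

# Draw 30 distinct rows; the other 10 are left out of the sample.
sample_ids = sorted(rng.sample(row_ids, 30))
left_out = sorted(set(row_ids) - set(sample_ids))

print(sample_ids)
print(left_out)
```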
T-tests are very common tests used to determine if two sample means differ significantly or if one sample mean differs from some established value. For more detailed information on T-tests, we suggest Statistical Methods and Data Analysis by Ott and Longnecker.
Click on the Options button. The default setup is the following. This selection will produce a complete 95% confidence interval.
Click OK and then OK. You will see the following output in your Session Window:

One-Sample T: JOBSAT

Variable     N       Mean      StDev    SE Mean    95% CI
JOBSAT      40    6.85000    1.02081    0.16140    (6.52353, 7.17647)
As stated in previous chapters, these results would be reported as: Based on a representative sample of 40 employees, we are 95% confident that job satisfaction among all employees is estimated to be between 6.52 and 7.18.
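The interval in this output can be checked by hand from the summary statistics: SE Mean is the standard deviation divided by the square root of N, and the interval is the mean plus or minus a t critical value times SE Mean. The Python sketch below does the arithmetic. The critical value 2.0227 is an assumption taken from a standard t table (97.5th percentile, 39 degrees of freedom); it is not printed in the Minitab output.

```python
import math

# Summary statistics from the One-Sample T output for JOBSAT.
n = 40
mean = 6.85000
stdev = 1.02081

# Assumption: t table value for a 95% interval with n - 1 = 39 df.
t_critical = 2.0227


def t_interval(mean, stdev, n, t_critical):
    """Return (standard error, lower limit, upper limit)."""
    se = stdev / math.sqrt(n)
    margin = t_critical * se
    return se, mean - margin, mean + margin


se, lower, upper = t_interval(mean, stdev, n, t_critical)
print(f"SE Mean = {se:.5f}")
print(f"95% CI  = ({lower:.2f}, {upper:.2f})")
```

To the precision shown, this agrees with the SE Mean and the interval limits in the output.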
The 95% describes the method rather than this one interval: if we repeatedly drew samples of 40 employees and built an interval from each, about 95% of those intervals would contain the true mean job satisfaction of all employees, which is unknown, and about 5% would miss it.

Another option here, which is only available for the 95% Interval (the most common), is the Interval Chart. Let's look at the 95% Interval graphic for Job Satisfaction by Plant. To do this, go to Graph>Interval Plot. Since we have one quantitative variable (Job Satisfaction) that we want to evaluate by two groups within a qualitative variable (Plant), select One Y With Groups:
Select OK. Assign JobSat to the Graph variable and Plant to the Categorical variables for grouping:
Add a Title to your graphic as appropriate through the Labels button. Select OK.
[Interval plot of JOBSAT by Plant. Y-axis: JOBSAT. Category axis: Plant (Dallas, Norcross)]
[Horizontal bar chart of JOBTEN. Category axis, in alphabetical order: Experienced, Mature, New. Value axis: Count]
There is nothing really wrong with this, but it would be better if we could order the bars in a more logical way like New/Experienced/Mature.
To reorder the values, go back to the Plant_Survey sheet. Click on any value in the JobTen column. Now, right click. Select Column>Value Order:
Click OK.
[Horizontal bar chart of JobTen after reordering. Category axis: New, Experienced, Mature. Value axis: Count]
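The difference between the default order and a user-defined value order can be sketched in Python. The counts below are implied by the pie chart percentages reported earlier for the 40 employees (22.5%, 40.0% and 37.5%); the variable names are illustrative.

```python
# Counts implied by the JOBTEN percentages of the 40 employees.
counts = {"Experienced": 16, "Mature": 15, "New": 9}

# Default behaviour: text categories sort alphabetically.
alphabetical = sorted(counts)

# User-defined order: list the categories in the order that makes sense.
value_order = ["New", "Experienced", "Mature"]
custom = sorted(counts, key=value_order.index)

for label in custom:
    print(f"{label:<12}{counts[label]:>3}")
```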
Chapter 7 SAS
What is SAS?
Unlike the other three packages, SAS uses a programming language (really a library of pre-written statistical algorithms) to execute analysis. If you are not a programmer at heart, do not let your heart be troubled; we have provided all of the code that you need to execute the necessary commands to generate the prescribed output. After a few successful executions, you will be able to generate your own programs! Of the four packages, we acknowledge that SAS typically represents the greatest challenge for students. However, SAS is the most analytically comprehensive, the most widely used and the most flexible of the statistical software applications.
Let's start by getting oriented with the SAS interface for Windows:

[Screenshot of the SAS interface, with callouts for the Log Window, the Editor Window, and the buttons used to move among the Log, Editor and Output (not shown) Windows]
After you launch SAS, you have access to five SAS windows: the Editor, Log, Output, Explorer and Results windows. The Output and Results windows are not visible when you first start SAS. Of these windows, we will be focused on the Editor, Log and Output windows.

The Editor window
The Editor window is where the programs will be written. In a typical SAS session, this is where you will spend most of your time.

The Log window
The Log is a file generated by SAS that contains your SAS program (in black), along with a listing of notes (in blue or green), error messages (in red) and other information pertaining to your program. After every execution, you should get into the habit of checking your Log window to ensure that the program ran correctly, regardless of how the output looks.

The Output window
The Output window contains the results of your analysis generated by your program.

We won't be using the Explorer and Results windows much. For additional information on using the SAS software, we recommend Step-by-Step Basic Statistics Using SAS by Larry Hatcher.

SAS programs have two parts: the Data Step and the Procedure (or Proc) Steps. Data Steps are used to manipulate data: add, change or delete variables or observations, format data, etc. Statistical analysis is not done within a Data Step, and Data Steps typically do not generate any output. Procedure steps begin with the term Proc followed by a SAS-defined command, which will execute a specific algorithm. Procedure steps are where we will do most of our analysis, and they typically do generate output. All SAS statements end with a semicolon (most of your errors in the beginning will be because you did not insert a semicolon).
The good news is that SAS is generally not case sensitive and is smart enough to correct some of the most common mistakes (except for the semicolon omission). If you make a programming error, the Log will provide you with guidance regarding where the problem is and how to fix it.

Let's get started. Once you have launched the SAS System, maximize the Editor Window by clicking on the Maximize Button in the upper right corner. Then close the Explorer and Results Windows by clicking on the Close Button in the upper right corner of the Explorer Window.
Within the Editor Window, any SAS statement that starts with an asterisk (and ends with a semicolon) will be ignored by SAS, but can be very useful for the programmer. We will provide you with these comments (notes written after an asterisk) to help explain the logic behind the code. Before we start, save WidgeOne.xls somewhere convenient where you can easily access it. To get the data into SAS, we will execute a Wizard (this is the only one that you will get in SAS, so enjoy it). Select File>Import Data:
The first step in the Wizard will ask for the type of file that you are importing. It should default to Excel (don't worry about the version). Click Next. You will then get a browse box: SAS wants you to point to where the Excel file is saved. Easy enough, but if you are running SAS from a remote location (as would be the case with Citrix), please remember that your drive names change. For example, your C:\ drive will be read as your V:\ drive:
The C$ on Client(V:) is my local C:\ drive. The T: and U: drives are USB ports where a flash drive might be. Citrix can see the flash drives as long as the flash drive was inserted into the port BEFORE Citrix was accessed.
Once you have located your file, select Next. The next step in the Wizard will ask you which table you want to import. These are the various sheets in the workbook. Select the Plant_Survey sheet. The next step in the Wizard will look like this:
Basically, SAS just wants you to give the new SAS file a name. Enter the name WidgeOne in the box called Member. Select Next. On the last step of the Wizard, SAS will ask you if you want to save the code that it just created for you. Just ignore this and select Finish.
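For readers who would rather type the import than click through the Wizard, the same steps can be sketched with Proc Import. This is a minimal sketch, not the code the Wizard itself generates: the file path is hypothetical, and it assumes your SAS installation can read .xls files (DBMS=XLS needs the SAS/ACCESS Interface to PC Files).

```sas
*Hypothetical path - point DATAFILE= at the folder where you saved WidgeOne.xls;
Proc Import datafile="C:\MyData\WidgeOne.xls"
     out=WidgeOne
     dbms=xls
     replace;
  *Read the Plant_Survey worksheet and take variable names from the first row;
  Sheet="Plant_Survey";
  Getnames=yes;
Run;
```

As with the Wizard, the result is a SAS dataset called WidgeOne. Check the Log afterwards to confirm how many observations and variables were read.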
You will be back at your Editor screen, which is blank. It may not seem like it, but SAS just did a lot of work. It accessed the WidgeOne.xls file, isolated the Plant_Survey worksheet and saved it as a SAS file that we can now access. Let's take a look at the file that we just created. In your Editor window, type the following SAS code (if you are accessing this electronically, you can just copy and paste):
Proc Print data = WidgeOne;
Run;
This is a very simple, but representative, set of code in SAS. The PROC PRINT command is a procedure command telling SAS to print the data, not to a printer, but to the screen so that we can see it. The DATA= part of the statement tells SAS which dataset to print. Notice that the statement ends in a semicolon. All SAS statements end with a semicolon. The final statement in this module of code is Run;. All of your code will end with a Run statement. To SAS, this is like a period at the end of a sentence.
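With a larger dataset you may not want every row and column on the screen. The following is a minimal sketch of a narrower Proc Print, assuming the WidgeOne dataset created by the import step; the OBS= dataset option and the Var statement are standard SAS features, but they are not part of the original example.

```sas
*Print only the first 10 observations and only four of the variables;
Proc Print data=WidgeOne (obs=10);
  Var Plant Gender Position JobSat;
Run;
```

The choice of 10 rows and of these four variables is arbitrary; adjust both to suit what you want to inspect.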
After you have this typed into your Editor screen, click on the little running man in the tool bar:
This is the WidgeOne dataset as a SAS file. You see the bottom of the file; use the scroll bar on the right to move to the top of the file. Now you should see this:
Now, look at your Log window: click on the middle button at the bottom of the screen. You should see this:
Congratulations, you successfully ran your first SAS program. Piece of cake! Take a minute, go get a soft drink, and celebrate! From this point forward, we will simply provide you with the code necessary to execute the specified statistical analysis, followed by the appropriate output. We will include comments in the code in such a way that you can copy it EXACTLY as it is written here into your Editor window.
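The first analysis is the set of measurements of central tendency. As a minimal sketch, assuming the same statistics and variables that the Class example and the code library in this chapter use, the basic Proc Means call would look like this:

```sas
*Proc Means reports the statistics named on the Proc Means statement, in the order requested;
Proc Means data=WidgeOne Mean Median Max Min;
  *The Var statement lists the quantitative variables to summarize;
  Var JobSat Prdcty YRONJOB SOCREL JOBGRADE;
Run;
```

Mean and Median have to be requested by name here because Median is not part of the default Proc Means output.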
Output:
What if we are only interested in a subset of the data? For example, what if we wanted to know the measurements of central tendency of these variables by gender and by plant? We simply need to add a Class statement (the only additional statement) to the code:

Proc Means data=WidgeOne Mean Median Max Min;
Var JobSat Prdcty YRONJOB SOCREL JOBGRADE;
Class Plant Gender;
Run;
At this point, we are not going to cover how to convert the F and M values into proper names, because it might be confusing this early on. It will be covered in the Lagniappe. As a rule, we do not use the mode as a Measurement of Central Tendency with quantitative data. If the data is qualitative (Plant, Gender, Position), it is the ONLY Measurement of Central Tendency available. Here is the code used to determine frequency counts (and mode) for qualitative variables:

*Proc Freq is the SAS command to use when determining the frequency counts for qualitative data;
Proc Freq data=WidgeOne;
Tables Plant Gender Position;
Run;
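Because the mode of a qualitative variable is simply its most frequent value, it can help to have Proc Freq list the values from most to least frequent. The following is a minimal sketch using two standard Proc Freq options that are not used elsewhere in this manual: ORDER=FREQ sorts each table by descending count, and NOCUM drops the cumulative columns.

```sas
*With order=freq, the first row of each table is the mode of that variable;
Proc Freq data=WidgeOne order=freq;
  Tables Plant Gender Position / nocum;
Run;
```

If two values tie for the highest count, the variable has more than one mode and both appear at the top of the table.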
Notice the inclusion of the standard deviation on the far right: the descriptive statistics are listed in the order they are requested in the Proc Means statement. The second Measurement of Dispersion was the frequency table. When we created a frequency table for the job tenure variable, we created three categories: < 5 years, 5-10 years and more than 10 years. To create these same categories in SAS, we need to execute a Data step:

*In the following data statement, we are creating a categorized version of the YRONJOB variable for the purposes of generating a frequency table. Note that this is done through the creation of a NEW variable that will be ADDED to the dataset - we are NOT overwriting the YRONJOB variable. The new qualitative variable is JOBTEN. We should also note that we are creating a second dataset WidgeOne1. While you do not HAVE to create a new dataset, we recommend it. It is typically best to keep your original dataset in its original form in the event that you have to start your analysis from scratch;
Data WidgeOne1;
Set WidgeOne;
If YRONJOB < 5 Then JOBTEN = 'New';
Else If YRONJOB >= 5 AND YRONJOB <= 10 Then JOBTEN = 'Experienced';
Else If YRONJOB > 10 Then JOBTEN = 'Mature';
Run;

*Print the new dataset to ensure that the new variable was created properly;
Proc Print data=WidgeOne1;
Run;

*Now we run the frequency table on JobTen...which will provide us with the dispersion of the YRONJOB variable across the specified categories;
Proc Freq data=WidgeOne1;
Tables JobTen;
Run;

Note in the Data step that we named the new dataset WidgeOne1, and that this new dataset was based on the original dataset WidgeOne (see the Set statement). Adding the number 1 to the end of a dataset name and then incrementing it by one every time you alter the dataset is a convenient way to keep track of your manipulations. Here is the associated output:
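A note on the JOBTEN values: SAS fixes the length of a character variable the first time it is assigned, and in the Data step above the first assignment is 'New' (three characters). The longer labels are therefore stored as Exp and Mat, which is how JOBTEN appears in the output later in this chapter. If you would prefer the full labels, a minimal sketch is to declare the length before the first assignment. The Length statement is standard SAS, but it is an addition to the manual's code, and using it will make your output show Experienced and Mature rather than Exp and Mat.

```sas
Data WidgeOne1;
  Set WidgeOne;
  *Reserve 11 characters, enough for the longest label (Experienced);
  Length JOBTEN $ 11;
  If YRONJOB < 5 Then JOBTEN = 'New';
  Else If YRONJOB >= 5 AND YRONJOB <= 10 Then JOBTEN = 'Experienced';
  Else If YRONJOB > 10 Then JOBTEN = 'Mature';
Run;
```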
If you need to create a pie chart to understand a quantitative variable (e.g., productivity) relative to a qualitative variable (e.g., Plant), the code would be modified appropriately:

*This code will produce a pie chart, where the qualitative variable listed after the "pie" command will identify the slices. Here, we have asked for the percentage of total productivity by Plant;
proc gchart data=WidgeOne1;
pie Plant / sumvar=prdcty percent=Inside;
legend;
Run;
Quit;
To replicate the bar chart in Chapter 2, execute the following code:

*This code will produce a Bar Chart, where the qualitative variable listed after the "HBAR" command will identify the categories for the bars. Here, we have simply asked for the frequency count (type = freq) for each value of "JOBTEN";
proc gchart data=WidgeOne1;
HBAR JOBTEN / type = freq;
legend;
Run;
Quit;

You can change this graphic to be vertical by substituting VBAR for HBAR.
The histogram, stem-and-leaf and the box plot are all generated in SAS using the same command, Proc Univariate:

*To create Stem and Leaf and Box Plots in SAS, we use the Proc Univariate command with the "plots" option. This procedure is only valid for quantitative variables. So, for Job Tenure, we will reference the YRONJOB variable;
Proc Univariate data=WidgeOne1 plots;
Var YRONJOB;
Histogram;
Run;
This procedure in SAS is a very comprehensive univariate analysis. Notice that Proc Univariate provides everything Proc Means provided, and more. Key results from the output screen have been highlighted:

The UNIVARIATE Procedure
Variable: YRONJOB (YRONJOB)

Moments
N                          40    Sum Weights                40
Mean                     8.29    Sum Observations        331.6
Std Deviation      4.25656657    Variance            18.118359
Skewness           -0.0806474    Kurtosis           -0.7479985
Uncorrected SS        3455.58    Corrected SS          706.616
Coeff Variation     51.345797    Std Error Mean     0.67302227

Basic Statistical Measures
Location                   Variability
Mean      8.290000         Std Deviation          4.25657
Median    8.350000         Variance              18.11836
Mode      8.000000         Range                 16.90000
                           Interquartile Range    6.10000

NOTE: The mode displayed is the smallest of 3 modes with a count of 3.

Tests for Location: Mu0=0
Test    -Statistic-
t       12.31757
M       20
S       410

Quantiles (Definition 5)
Quantile      Estimate
100% Max         17.00
99%              17.00
95%              14.55
90%              14.00
75% Q3           11.10
50% Median        8.35
25% Q1            5.00
10%               2.05
5%                1.50
1%                0.10
0% Min            0.10

Extreme Observations
----Lowest----      ----Highest---
Value    Obs        Value    Obs
  0.1      1         14.0     36
  1.0      2         14.0     37
  2.0      4         14.1     38
  2.0      3         15.0     39
  2.1      5         17.0     40

Here are the Stem and Leaf and Box Plots:

Stem Leaf       #
  17 0          1
  16
  15 0          1
  14 001        3
  13 000        3
  12 1          1
  11 011        3
  10 0115       4
   9 000        3
   8 00016      5
   7 016        3
   6 1          1
   5 007        3
   4 01         2
   3 00         2
   2 001        3
   1 0          1
   0 1          1
     ----+----+----+----+
[Normal Probability Plot for Variable: YRONJOB (YRONJOB); vertical axis from 0.5 to 17.5, horizontal axis from -2 to +2]
The histogram is found in the graphic output; you may need to scroll down to the bottom.

[Histogram: Percent (vertical axis, up to 30) by PRDCTY (horizontal axis, 70 to 95)]
Another option, which is a bit more surgical, for creating a boxplot is to run:
Proc Sort data=WidgeOne;
By Position;
Run;
Proc Boxplot data=WidgeOne;
Plot Jobsat*Position;
Run;
[Box plot: JOBSAT (vertical axis) by POSITION (HRLY and MGT)]
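The contingency table discussed next is produced by Proc Freq. A minimal sketch, assuming the WidgeOne1 dataset and following the description in this chapter's code library (name the two variables in the Tables statement, separated by an asterisk):

```sas
*The first variable (Plant) forms the rows and the second (Gender) forms the columns;
Proc Freq data=WidgeOne1;
  Tables Plant*Gender;
Run;
```

By default each cell of the resulting table reports four numbers: Frequency, Percent, Row Pct and Col Pct.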
Notice that the conditional percentages that were discussed in Chapter 2 are already embedded in this contingency table. Look at the legend in the upper left corner of the matrix. This legend provides a guide to the numbers in the matrix. In the first cell (the intersection of Female and Dallas) we have four pieces of information. We know that there are 13 women who work in Dallas. We know that of all of the Dallas employees, 56.5% are female. We know that of all of the women, 65% are in Dallas. Finally, we know that of all employees, 32.50% are females in Dallas. If you need to subset this information further (e.g., by Job Tenure), sort the data by that variable and then add a By statement to the code:

Proc Sort data=WidgeOne1;
By Jobten;
Run;
Proc Freq data=WidgeOne1;
Tables Plant*Gender;
By Jobten;
Run;
JOBTEN=Exp
The FREQ Procedure
Table of Plant by Gender
(each cell lists Frequency, Percent, Row Pct, Col Pct)

Plant       F        M        Total
Dallas      3        5        8
            18.75    31.25    50.00
            37.50    62.50
            50.00    50.00
Norcross    3        5        8
            18.75    31.25    50.00
            37.50    62.50
            50.00    50.00
Total       6        10       16
            37.50    62.50    100.00

JOBTEN=Mat
The FREQ Procedure
Table of Plant by Gender
(each cell lists Frequency, Percent, Row Pct, Col Pct)

Plant       F        M        Total
Dallas      6        3        9
            40.00    20.00    60.00
            66.67    33.33
            75.00    42.86
Norcross    2        4        6
            13.33    26.67    40.00
            33.33    66.67
            25.00    57.14
Total       8        7        15
            53.33    46.67    100.00

JOBTEN=New
The FREQ Procedure
Table of Plant by Gender
(each cell lists Frequency, Percent, Row Pct, Col Pct)

Plant       F        M        Total
Dallas      4        2        6
            44.44    22.22    66.67
            66.67    33.33
            66.67    66.67
Norcross    2        1        3
            22.22    11.11    33.33
            66.67    33.33
            33.33    33.33
Total       6        3        9
            66.67    33.33    100.00

Notice that the same information on Plant and Gender counts has now been provided by each level of Job Tenure: Experienced, Mature and New (the levels are reported in alphabetical order). The stacked charts developed in Chapter 2 can be easily developed in SAS using the same code as was used with the single variable analysis, with the addition of a subgroup statement:
*To create a stacked chart using two variables, simply use the same code as before when analyzing a single variable and add a "subgroup" statement;
proc gchart data=WidgeOne1;
HBAR Plant / subgroup=gender;
legend;
Run;
Quit;
To create a 100% stacked bar chart, you need to run a similar set of statements, with a few changes:
Proc gchart data=WidgeOne;
HBAR Plant / subgroup=Gender type=pct group=Plant nozero g100 gaxis=axis1;
Run;
Quit;
[100% stacked horizontal bar chart: one bar per Plant, subdivided by Gender (legend), on a PERCENT axis running to 100. The summary columns report FREQ. 23 and CUM. FREQ. 23 for the first plant and FREQ. 17 and CUM. FREQ. 17 for the second, each with PCT. 100 and CUM. PCT. 100.]
The last multivariate visualization technique is the scatter plot. Again, SAS provides us with flexibility to subset our analysis if needed. Consider the Job Tenure and Productivity plot in Chapter 2. This plot can be replicated in SAS using the following code:

*To create a scatter plot using two quantitative variables use the Proc plot command...the first variable stated will appear on the y-axis...typically this is the dependent variable;
Proc Plot data=WidgeOne1;
Plot Prdcty*Yronjob;
Run;

Here is the associated output:

[Scatter plot: PRDCTY (vertical axis, 60 to 100) against JOBSAT (horizontal axis, 5 to 9)]
7.5 Using SAS for: Random Number Generation and Simple Random Sampling
Like the other software applications, SAS can generate random numbers using the internal clock in the computer. When the ranuni function is given a seed of 0, the clock supplies the seed, so a different set of random numbers is generated every time the code is run. However, sometimes we may need to replicate a set of random numbers exactly the way they were previously generated. To accomplish this replication, SAS allows the analyst to define a seed number that will ensure a consistent set of random numbers. The numbers are still random and can be used to ensure statistical independence of samples. The random numbers follow a uniform distribution of outcomes that lie between 0 and 1. Here is the code:

*To create a string of random numbers, use the ranuni function. Because this process is effectively creating a new variable (of random numbers), this process is executed within the context of a Data statement;
Data WidgeOne2;
Set WidgeOne1;
Group = ranuni(123456);
Run;
Proc Print data=WidgeOne2;
Run;

The number inside the parentheses (123456) is an arbitrary number that is provided as the seed.
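To see what the seed buys you, the following minimal sketch builds the random-number column twice with the same seed and then once with a seed of 0. The dataset names Seeded1, Seeded2 and ClockSeeded are hypothetical, and Proc Compare is used here only as a convenient way to check that two datasets match.

```sas
*Same seed twice: the Group column should be identical in both datasets;
Data Seeded1;
  Set WidgeOne1;
  Group = ranuni(123456);
Run;
Data Seeded2;
  Set WidgeOne1;
  Group = ranuni(123456);
Run;

*Proc Compare reports any values that differ between the two datasets;
Proc Compare base=Seeded1 compare=Seeded2;
Run;

*A seed of 0 takes the seed from the clock, so Group changes on every run;
Data ClockSeeded;
  Set WidgeOne1;
  Group = ranuni(0);
Run;
```

Use a fixed positive seed whenever someone else needs to reproduce your groups, and a seed of 0 only when reproducibility does not matter.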
One of the primary reasons for generating random numbers is to assign observations into statistically independent groups. Using the random numbers, let's assign the 40 observations into 3 groups. Here is the code that would execute this:

*This code will take the 40 observations and assign them into 3 groups based upon the random numbers created in the data statement above. The out= statement creates a new file that has these assignments;
Proc Rank data=WidgeOne2 Groups=3 Out=Samples;
Var Group;
Run;

*Notice in this code, we are referencing the "Samples" file;
Proc Print data=Samples;
Run;
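A quick way to confirm that the assignment worked is to count how many observations landed in each group. This is a minimal sketch, assuming the Samples dataset created by the Proc Rank step above, in which Group now holds the values 0, 1 and 2.

```sas
*Count the observations assigned to each of the three groups;
Proc Freq data=Samples;
  Tables Group;
Run;
```

With 40 observations split three ways, the groups should be close to equal in size (about 13 or 14 each).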
Notice that individuals were assigned to group 0, 1 or 2, completely at random, based upon the random numbers that were generated in the previous step. Because this process of selection for randomization of groups is so common, there is a more parsimonious set of code within SAS to execute the same process.

*Another way to create a sample (or samples) from a population is through Proc Surveyselect;
Proc surveyselect data=WidgeOne1 out=Sample2 Method=SRS Sampsize=30 Seed=123;
Run;
Proc Print data=Sample2;
Run;

In this code, Method=SRS indicates to SAS that we are interested in using a Simple Random Sampling methodology (other statistical sampling methodologies include Stratified, Systematic and Cluster). The Sampsize= option tells SAS how big to make the sample (clearly this number must be smaller than the total size of the dataset). Finally, the Seed= option provides SAS with a seed from which to draw the random sample. This seed is important if, for example, another analyst needs to replicate your results. If this option is deleted, a different sample will be generated every time the code is executed.
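As an illustration of one of the other methodologies mentioned above, here is a minimal sketch of a stratified sample, which draws a simple random sample separately within each Plant. The dataset names WidgeSorted and Sample3 and the choice of 10 employees per plant are hypothetical; the Strata statement is a standard part of Proc Surveyselect and expects the data to be sorted by the strata variable.

```sas
*Sort by the strata variable first, keeping WidgeOne1 itself untouched;
Proc Sort data=WidgeOne1 out=WidgeSorted;
  By Plant;
Run;

*With a Strata statement, Sampsize= is the number drawn from EACH stratum;
Proc Surveyselect data=WidgeSorted out=Sample3 Method=SRS Sampsize=10 Seed=123;
  Strata Plant;
Run;

Proc Print data=Sample3;
Run;
```

Stratifying in this way guarantees that both plants are represented by the same number of employees, which a single simple random sample does not.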
Here is the output:

The SAS System                                   08:58 Friday, May 19, 2006  29

Obs  Emp ID  Plant     Gender  POSITION  JOBGRADE  SOCREL  YRONJOB  PRDCTY   JOBSAT  JOBTEN
  1  011     Norcross  F       HRLY      4         6         5.0    78.2421  5.4     Exp
  2  077     Norcross  M       HRLY      5         5         5.0    76.0067  6.2     Exp
  3  088     Dallas    M       MGT       8         6         5.7    93.0348  8.1     Exp
  4  086     Dallas    M       MGT       6         5         6.1    93.2102  8.0     Exp
  5  019     Dallas    F       MGT       7         5         7.6    93.9137  8.5     Exp
  6  090     Dallas    M       MGT       9         6         8.0    93.2102  7.6     Exp
  7  069     Norcross  M       MGT       7         5         8.0    82.1495  7.1     Exp
  8  063     Norcross  M       MGT       7         4         8.6    81.0000  7.9     Exp
  9  009     Norcross  F       MGT       7         5         9.0    79.8586  5.8     Exp
 10  062     Norcross  M       MGT       8         5         9.0    86.3210  6.1     Exp
 11  024     Dallas    F       HRLY      7         6        10.1    91.8112  6.5     Mat
 12  091     Dallas    M       HRLY      9         1        10.1    83.6393  5.7     Mat
 13  006     Norcross  F       MGT       7         6        10.5    67.6506  7.9     Mat
 14  061     Norcross  M       MGT       6         0        11.0    80.1839  7.3     Mat
 15  097     Dallas    M       HRLY      7         5        11.1    90.4228  6.2     Mat
 16  100     Dallas    M       HRLY      9         5        11.1    85.9835  6.5     Mat
 17  058     Norcross  F       HRLY      6         5        12.1    91.8112  6.7     Mat
 18  010     Dallas    F       HRLY      6         6        13.0    74.9012  5.5     Mat
 19  071     Norcross  M       HRLY      4         5        13.0    78.7253  5.4     Mat
 20  078     Norcross  M       HRLY      6         4        14.0    78.0813  6.3     Mat
 21  082     Norcross  M       HRLY      5         5        14.0    78.2421  5.0     Mat
 22  028     Dallas    F       HRLY      7         7        14.1    86.9980  6.5     Mat
 23  016     Dallas    F       MGT       9         5         0.1    91.8112  8.5     New
 24  095     Dallas    M       HRLY      6         5         1.0    92.5094  6.1     New
 25  029     Dallas    F       HRLY      5         5         2.0    87.5075  6.5     New

Obs 26 to 30 continue on the next page of the listing: Gender F, F, F, M, M, followed by two numeric columns reading 4, 6, 9, 8, 4 and 5, 10, 6, 6, 5. The remaining columns for these rows are not shown.
This output is a subset of 30 of the 40 observations in the dataset. If you utilized the Seed=123 option, you should have generated the exact same sample. If you did not utilize the Seed= option (or used a different seed), your sample will be different although you should still have 30 observations.
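The paragraph that follows describes the CLM option of Proc Means. As a minimal sketch of the kind of call it refers to, assuming the WidgeOne dataset and the JobSat variable, the request could be written as follows (N, Mean, CLM and alpha= are all standard Proc Means options):

```sas
*CLM requests the lower and upper confidence limits for the mean;
*alpha=0.05 gives a 95 percent interval and could be omitted because it is the default;
Proc Means data=WidgeOne N Mean CLM alpha=0.05;
  Var JobSat;
Run;
```

Changing alpha to 0.10 or 0.01 would produce a 90% or 99% interval instead.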
In this code, the CLM option creates the confidence interval around the mean of the Job Satisfaction variable. As stated in Chapter 2, confidence intervals are typically created at a 90%, 95% or 99% confidence level, where the level represents how confident we are that an interval calculated this way contains the TRUE population mean. Statisticians refer to a complementary term, alpha, to indicate the chance that an interval calculated this way DOES NOT contain the TRUE population mean. Effectively, alpha is the probability that we are wrong. The alpha value is calculated as 1 minus the confidence level. So, typically, alpha is established to be .01, .05 or .10. In SAS, the default value is .05 (if the alpha= option is not provided, SAS will create a 95% confidence interval).
As stated previously in Chapter 2, these results would be reported as: Based on a representative sample of 40 employees, we are 95% confident that mean job satisfaction among all employees is between 6.53 and 7.17. This means that the interval comes from a procedure that captures the true mean job satisfaction of all employees, which is unknown, 95% of the time. It also means that 5% of the intervals built this way miss the true mean, in which case it lies outside of this range (< 6.53 or > 7.17).
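To see where the two limits come from, the interval can also be rebuilt by hand as the mean plus or minus a t critical value times the standard error. This is a minimal sketch under the usual t-interval assumptions; the dataset names JobSatStats and JobSatCI and the variable names are hypothetical, while the Output statement of Proc Means and the TINV function are standard SAS.

```sas
*Save the sample size, mean and standard error of JobSat to a dataset;
Proc Means data=WidgeOne noprint;
  Var JobSat;
  Output out=JobSatStats n=JobSatN mean=JobSatMean stderr=JobSatSE;
Run;

*Interval = mean +/- t critical value (N-1 degrees of freedom) x standard error;
Data JobSatCI;
  Set JobSatStats;
  Alpha = 0.05;
  TCrit = tinv(1 - Alpha/2, JobSatN - 1);
  Lower = JobSatMean - TCrit*JobSatSE;
  Upper = JobSatMean + TCrit*JobSatSE;
Run;

Proc Print data=JobSatCI;
  Var JobSatN JobSatMean JobSatSE TCrit Lower Upper;
Run;
```

Lower and Upper should agree with the limits that the CLM option reports for the same alpha.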
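The first Lagniappe sends SAS output to a file that can be opened in Word. A minimal sketch of the ODS statements involved, assuming a hypothetical output path and using a Proc Means call as the code being captured:

```sas
*Open the RTF destination - the file path here is hypothetical;
ods rtf file="C:\MyData\JobSatOutput.rtf";

*Any procedure output produced between the two ODS statements goes to the file;
Proc Means data=WidgeOne N Mean CLM;
  Var JobSat;
Run;

*Close the RTF destination so that the file is written and released;
ods rtf close;
```

Any procedure (Proc Freq, Proc Univariate and so on) can be placed between the two ODS statements in the same way.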
RTF stands for Rich Text Format. By sandwiching your code between these ODS statements, your output will be written to a rich text file.
After you run the code, you will see the following screen popup:
If you save it and then open it in Word, the table will operate like any other table in Word, so you can move it, change the font, etc.
The second Lagniappe that we wanted to share with you is how to create labels, like converting F and M to Female and Male. The process is to create the formatting logic, and then go back and apply it. You have to do this in two steps. Here is the code:
Proc Format;
Value $Gencode "M" = "MALE" "F" = "FEMALE";
Value $Plantcode "D" = "DALLAS" "N" = "NORCROSS";
Run;

Data WidgeOne1;
Set WidgeOne;
Format Gender $Gencode. Plant $Plantcode.;
Run;

Proc Print data=Widgeone1;
Run;

Proc Format creates the formatting for the labels. It does not apply the formats; it just creates them. The $ sign is used with qualitative formats. The formats created above are applied in a Data statement. The logic in the Format statement is to name the variable to be formatted (Gender) and then reference the format ($Gencode.).
This will change the dataset (and associated output) to be more user friendly:
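If you only want the labels in one piece of output and would rather leave the dataset alone, a Format statement can also be placed inside a procedure. A minimal sketch, assuming the $Gencode and $Plantcode formats have already been created with Proc Format in the current session:

```sas
*The formats apply to this Proc Freq output only - the WidgeOne dataset is not changed;
Proc Freq data=WidgeOne;
  Format Gender $Gencode. Plant $Plantcode.;
  Tables Plant Gender;
Run;
```

Formats created by Proc Format in this way last only for the current SAS session unless they are saved to a permanent format library.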
One secret that SAS Jocks tend to keep to themselves is that individuals who are highly proficient in SAS rarely develop SAS code from scratch. They maintain libraries of code, and then modify these lines of code as needed. Here is a complete outline of the SAS code used in this manual (including notations) to begin your library:
*Proc Print will print the dataset to your output screen - not to your printer;
Proc Print data=WidgeOne;
Run;

*Proc Means will provide two of the three measurements of central tendency (Mean and Median)...these measurements must be specified. The "Var" statement tells SAS which variables you are interested in - remember that mean and median are only relevant for quantitative variables;
Proc Means data=WidgeOne Mean Median Max Min STD;
Var JobSat Prdcty YRONJOB SOCREL JOBGRADE;
*Class Plant Gender;
Run;

*Proc Freq is the SAS command to use when determining the frequency counts for qualitative data;
Proc Freq data=WidgeOne;
Tables Plant Gender Position;
Run;

Data WidgeOne1;
Set WidgeOne;
If YRONJOB < 5 Then JOBTEN = 'New';
Else If YRONJOB >= 5 AND YRONJOB <= 10 Then JOBTEN = 'Experienced';
Else If YRONJOB > 10 Then JOBTEN = 'Mature';
*To apply the Plant labels here, define $Plantcode with Proc Format first and then remove the asterisk from the next line;
*Format Plant $Plantcode.;
Run;

*Print the new dataset to ensure that the new variable was created properly;
Proc Print data=WidgeOne1;
Run;

*Now we run the frequency table on JobTen...which will provide us with the dispersion of the YRONJOB variable across the specified categories;
Proc Freq data=WidgeOne1;
Tables JobTen;
Run;

*This code will produce a pie chart, where the qualitative variable listed after the "pie" command will identify the slices. Here, we have simply asked for the percent (type = pct) for each value of "JOBTEN";
proc gchart data=WidgeOne1;
pie JOBTEN / type = pct;
legend;
Run;
Quit;

*This code will produce a pie chart, where the qualitative variable listed after the "pie" command will identify the slices. Here, we have asked for the percentage of total productivity by Plant;
proc gchart data=WidgeOne1;
pie Plant / sumvar=prdcty percent=Inside;
legend;
Run;
Quit;

*This code will produce a Bar Chart, where the qualitative variable listed after the "HBAR" command will identify the categories for the bars. Here, we have simply asked for the frequency count (type = freq) for each value of "JOBTEN";
proc gchart data=WidgeOne1;
HBAR JOBTEN / type = freq;
legend;
Run;
Quit;

*To create Stem and Leaf and Box Plots in SAS, we use the Proc Univariate command with the "plots" option. This procedure is only valid for quantitative variables. So, for Job Tenure, we will reference the YRONJOB variable;
Proc Univariate data=WidgeOne1 plots;
Var YRONJOB;
Histogram;
Run;

*To create a contingency table, simply run the Proc Freq command. Identify the variables of interest in the Tables statement separated by an *. Then add "by" statements where necessary to subset the analysis. Note - whenever you add a "by" statement, you will need to sort the data by that variable first;
Proc Sort data=WidgeOne1;
by JobTen;
Run;
Proc Freq data=WidgeOne1;
Tables Plant*Gender;
by JobTen;
Run;

*To create a stacked chart using two variables, simply use the same code as before when analyzing a single variable and add a "subgroup" statement;
proc gchart data=WidgeOne1;
HBAR Plant / subgroup=gender;
legend;
Run;
Quit;

*To create a scatter plot using two quantitative variables use the Proc plot command...the first variable stated will appear on the y-axis...typically this is the dependent variable;
Proc Plot data=WidgeOne1;
Plot Prdcty*Yronjob;
Run;
Quit;