Reference Manual For Statistical Software

Reference Manual for Statistical Software: A gentle overview of Excel, SPSS, Minitab, SAS
Developed and maintained by the Mathematics and Statistics Department of Kennesaw State University

Table of Contents

1. Introduction to Statistical Computing
2. Data Analysis and Statistical Concepts
   Concept 1 - Measurements of Central Tendency
   Concept 2 - Measurements of Dispersion
   Concept 3 - Visualization of Univariate Data
   Concept 4 - Visualization of Multivariate Data
   Concept 5 - Random Number Generation and Simple Sampling
   Concept 6 - Confidence Intervals
3. Excel (2003)
   Concept 1 - Measurements of Central Tendency
   Concept 2 - Measurements of Dispersion
   Concept 3 - Visualization of Univariate Data
   Concept 4 - Visualization of Multivariate Data
   Concept 5 - Random Number Generation and Simple Sampling
   Concept 6 - Confidence Intervals
   Excel 2003 Lagniappe
4. Excel (2007)
   Concept 1 - Measurements of Central Tendency
   Concept 2 - Measurements of Dispersion
   Concept 3 - Visualization of Univariate Data
   Concept 4 - Visualization of Multivariate Data
   Concept 5 - Random Number Generation and Simple Sampling
   Concept 6 - Confidence Intervals
   Excel Lagniappe
5. SPSS
   Concept 1 - Measurements of Central Tendency
   Concept 2 - Measurements of Dispersion
   Concept 3 - Visualization of Univariate Data
   Concept 4 - Visualization of Multivariate Data
   Concept 5 - Random Number Generation and Simple Sampling
   Concept 6 - Confidence Intervals
   SPSS Lagniappe
6. Minitab
   Concept 1 - Measurements of Central Tendency
   Concept 2 - Measurements of Dispersion
   Concept 3 - Visualization of Univariate Data
   Concept 4 - Visualization of Multivariate Data
   Concept 5 - Random Number Generation and Simple Sampling
   Concept 6 - Confidence Intervals
   Minitab Lagniappe
7. SAS
   Concept 1 - Measurements of Central Tendency
   Concept 2 - Measurements of Dispersion
   Concept 3 - Visualization of Univariate Data
   Concept 4 - Visualization of Multivariate Data
   Concept 5 - Random Number Generation and Simple Sampling
   Concept 6 - Confidence Intervals
   SAS Lagniappe
Chapter 1. Introduction to Statistical Computing

This reference manual has been developed to assist students in the basics of statistical computing: sort of a "Statistical Computing for Dummies." It is not our intention to use this manual to teach statistical concepts¹ but rather to demonstrate how to apply previously taught statistical and data analysis concepts the way that professionals and practitioners apply them: through the able assistance of computing. Proficiency in software allows students to focus more on the interpretation of the output and on the application of results than on the mathematical computations. We should pause here and strongly make the point that computers should serve as a medium of expediency of calculation, not as a substitute for the ability to execute a calculation. Throughout this manual, we will present statistical concepts, context for their use, and formulas where appropriate. We provide exercises to execute these concepts by hand. Then, each concept will be applied in a consistent manner using each of the four major statistical computing packages: Excel (2003 and 2007), SPSS, Minitab and SAS.

¹ Readers of this manual are assumed to have completed some introductory statistics course. For individuals wishing to review statistical concepts, we recommend Introduction to Stats by DeVeaux, Velleman and Bock.
1.1 Statistical Packages Used in this Manual

We have chosen to incorporate the four most widely used statistical computing packages in this manual: Excel, SPSS, Minitab and SAS. While each of these packages can be used for basic data analysis, they each have specializations. Any individual who can represent themselves as knowledgeable and proficient in any subset or all of these packages will possess a marketable and differentiating skill set.

Excel
The first package covered in this manual is Microsoft's Excel. This spreadsheet software package is ubiquitous. It represents a very basic and efficient way to organize, analyze and present data. Employers today expect that, at a minimum, new hires with college degrees will have a working knowledge of Excel. Excel is used anywhere that data is available, which is everywhere. Excel is found in offices, libraries, schools, universities, home offices and everywhere in between. In addition to its role as a data analysis package, Excel is often used as a starting point to capture and organize data, which is then imported into more sophisticated analysis packages such as SPSS, Minitab or SAS. And, after analysis is complete, datasets can be exported back to Excel and shared with others who may not have access to (or have the ability to use) other analysis packages (we gently refer to this group as "the great statistical unwashed"). For product information regarding Excel, please visit: http://office.microsoft.com/en-us

SPSS
The Statistical Package for the Social Sciences, or SPSS, is one of the most heavily used statistical computing packages in industry. SPSS has over 250,000 customers in 60 countries and is particularly heavily used in Medicine, Psychology, Marketing, Political Science and other social sciences. Because of its more "point and click" orientation, SPSS has become one of the preferred packages of non-statisticians. For product information regarding SPSS, please visit: http://www.spss.com/

Minitab
Minitab was developed by Statistics professors at Penn State University (where it is still headquartered) in 1972. These professors were looking for a better way to teach undergraduate statistics in the classroom. From this starting point, Minitab is now used in over 4,000 universities around the world, in 80 countries, and by hundreds of companies ranging from Fortune 500 to start-up companies. Of the main statistical computing packages, Minitab has the strongest graphics and visualization capabilities. The package is most heavily used in Six Sigma and other quality design initiatives. Minitab's customer list includes a large number of manufacturing and product design firms such as Ford, GE, GM, HP and Whirlpool. For product information regarding Minitab, please visit: http://www.minitab.com/

SAS
SAS (originally short for Statistical Analysis System) is typically considered to be the most complete statistical analysis package on the market (Professional Tip: please pronounce this as "sass"; if you pronounce the package as "S-A-S", people will think you are a poser). This is the package of choice of most applied statisticians. Although the most recent version of SAS (version 9) includes some point and click options, SAS uses a scripting language to tell the computer what data manipulations and computations to perform. We will be demonstrating how to actually write the code for SAS rather than defaulting to the point and click functionality in v.9, SAS Enterprise Guide, SAS Enterprise Miner and other more user-friendly GUI SAS products. Our rationale here is this: if you learn to drive a manual transmission, you can drive anything. Similarly, if you can program in Base SAS, you can use (and understand) just about any statistical analysis package. The learning curve for SAS is longer and steeper than for the other packages, but the package is considered the benchmark for statistical computing. SAS is used in 110 countries, at 2,200 universities, and at 96 of the Fortune 100 companies. For product information regarding SAS, please visit: http://www.sas.com/
1.2 Organization of Manual

After a brief review of the most common (and, we believe, essential) statistical/data analysis concepts that every college-educated person, regardless of discipline, should know, we will explain how each of these concepts is executed in Excel (2003 and 2007), SPSS (v.15.4), Minitab (v.15) and SAS (v. 9.0). We have taken a software-oriented approach rather than a statistical concept-oriented approach, because it is the software application rather than the statistical concepts that represents the focus of this document. For example, our first concept is descriptive statistics. Rather than explaining descriptive statistics through each package and then moving into the second analysis concept, we focus on all of the concepts in Excel, then move to a focus on all of the concepts in SPSS, etc. Yes, we understand that from the reader's perspective this may be a bit monotonous. After you finish your Ph.D. in Statistics, you can write your manual your way.

Throughout each chapter, we have used screenshots from the various packages, and have developed easy-to-follow examples using a common dataset. At the end of each chapter, we have included a section titled "Lagniappe". This word derives from New World Spanish la ñapa, "the gift". The word came into the Creole dialect of New Orleans and there acquired a French spelling. It is still used in the Gulf States, especially southern Louisiana, to denote a little bonus that a friendly shopkeeper might add to a purchase. Our lagniappe for our readers includes the extra and interesting things that we have learned to do with each of these software programs that might not be easily found or well known. A little extra information at no extra cost!
1.3 Overview of Dataset

Throughout this manual, we will use a common dataset taken from a small manufacturing company: the WidgeOne company. The WidgeOne dataset:

- An Excel file: WidgeOne.xls
- Both qualitative and quantitative variables
- 23 variables total
- Three sheets in one workbook
  o Plant_Survey
  o Employees
  o Attendance
- 40 observations

VARIABLE     MEANING                                  VARIABLE TYPE   SHEET
EMPID        Employee ID                              Qualitative     ALL
PLANT        Plant ID                                 Qualitative     Plant_Survey
GENDER       Gender                                   Qualitative     Plant_Survey
POSITION     Job Type                                 Qualitative     Plant_Survey
JOBSAT       Job Satisfaction (1-10)                  Quantitative    Plant_Survey
YRONJOB      Years in current job                     Quantitative    Plant_Survey
JOBGRADE     Job Level (1-10)                         Quantitative    Plant_Survey
SOCREL       HR Social Relationship Score (0-10)      Quantitative    Plant_Survey
PRDCTY       HR Productivity Rating (out of 100)      Quantitative    Plant_Survey
Last Name    Employee Last Name                       Qualitative     Employees
First Name   Employee First Name                      Qualitative     Employees
JAN          Attendance in January (%)                Quantitative    Attendance
Here is a screen shot taken of WidgeOne.xls:
Chapter 2: Data Analysis and Statistical Concepts

As former practitioners who used statistics on an almost daily basis in our professions in finance, marketing, engineering, manufacturing and medicine, we have developed our TOP 6 list of the most common and most useful applications of Statistics and Data Analysis. After a brief explanation of each concept, examples will be provided for how to execute these concepts by hand (with a calculator). We cannot emphasize strongly enough that the calculation of the concepts needs to be mastered and fully understood before they can be effectively outsourced to a software application.
Concept 1 - Measurements of Central Tendency

The most common application of Statistics is the measurement of the central tendency of a dataset, of which there are three measurements. Central tendency is a geeky way of answering the question "What is the most representative value?" The mean, or average, is the first and most popular measurement of central tendency because:

- It is familiar to most people;
- It reflects the inclusion of every item in the dataset;
- It always exists;
- It is unique;
- It is easily used with other statistical measurements.

However, the mean is ONLY used when the data is ratio scale (quantitative) and when there are no extreme values.
The formula for the calculation of a mean is

$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

(We know how everyone LOVES formulas with Greek letters!)

Where
$X_i$ = every observation in the dataset
$N$ = the number of observations in the dataset
FUN MANUAL CALCULATION!!

Using the WidgeOne.xls dataset, calculate the mean years that men in the Norcross plant (n=10) have been in their current job (YRONJOB).
The answer follows. Don't cheat: do it first to make sure that you understand how to calculate this foundational concept by hand.
Did you get 9.66? Well done.

A second measurement of central tendency of a dataset is the median. The median is, literally, the middle of the dataset:

- It is the central value of an array of numbers sorted in ascending (or descending) order;
- 50% of the observations lie below the median and 50% of the observations lie above the median;
- It represents the second quartile (Q2);
- It is unique.
As with the mean, the median is used when the data is ratio scale (quantitative). However, unlike the mean, the median can accommodate extreme values.
Take the men in the Norcross plant (n=10) again, and determine the median years they have spent in their current job. The answer follows. Did you cheat last time? You can redeem yourself by doing this one by hand.

Did you get 9.5? Well done.

The mean and the median are pretty close: 9.66 and 9.50, respectively. But which one is right? Which one should be reported as the central tendency or the most representative value of the years on the job for the men in the Norcross plant? Mathematically they are both correct. However, if there are no extreme values (defined here as observations which are more than three standard deviations from the mean), then we would typically report the mean rather than the median (as we would here). However, what if there are extreme values? Then what? Consider the men in Norcross again. What if employee 082 had 30 years with the company instead of 14 years? How would the mean and median be affected? The mean would increase to 11.26 while the median would remain the same at 9.50 (do this by hand to convince yourself of this concept). Go back and look at the formula for the mean and think about why the mean was so heavily affected, while the median was not.

A third measurement of central tendency is the mode. The mode is the most frequently occurring value in a dataset:

- There can be multiple modes;
- It is not influenced by extreme observations;
- It can be used with both qualitative and quantitative data.
Go back to the WidgeOne.xls dataset and the men in the Norcross plant. What is the mode for their years on the job? Hmmm... there is no mode. There is no value that appears more than one time. That's OK; sometimes there is no mode in a dataset. What if employee 077 had 4 years on the job instead of 5 years? Then we would have a mode: 4 years. This is a measurement of central tendency. But 4 years is different (a lot different) from 9.66 and 9.50 years. Is it correct? Technically yes, this would be mathematically correct, but it is not the most appropriate measurement to report as the central tendency of the dataset. Typically, the mode is considered to be the weakest of the three measurements of central tendency for quantitative data and is ONLY used if the mean or median is not available. When would that be? Calculate the mean and median gender of the dataset. Go ahead. We will wait. It can't be done. When the data in question is qualitative (e.g., gender, plant, position), the ONLY measurement of central tendency that is available is the mode.

What you need to know
When representing the central tendency of quantitative data, default to the mean. If the data has extreme values, use the median. If the data is qualitative, use the mode.
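As a cross-check on the hand calculations, the three measurements can be sketched in a few lines of Python using the standard library's statistics module. The tenure values below are hypothetical (they are NOT the WidgeOne data); they were chosen only to show how a single extreme value moves the mean while leaving the median alone.

```python
import statistics

# Hypothetical years-on-job values (NOT the WidgeOne data).
tenure = [4.0, 5.5, 7.0, 8.5, 9.0, 10.0, 11.0, 12.5, 13.0, 14.0]

mean_before = statistics.mean(tenure)
median_before = statistics.median(tenure)

# Replace the largest value (14.0) with an extreme one (30.0) and recompute.
with_outlier = tenure[:-1] + [30.0]
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")

# multimode returns every most-frequent value. When no value repeats,
# it returns all of them, which is the "there is no mode" situation.
print("modes of tenure:", statistics.multimode(tenure))

# The mode also works for qualitative data, where mean and median do not.
plants = ["D", "N", "D", "D", "N"]
plant_mode = statistics.mode(plants)
print("mode of plant:", plant_mode)
```

The mean shifts because every observation enters the sum in its numerator; the median only depends on the values in the middle of the sorted list.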
Concept 2 - Measurements of Dispersion

When describing a dataset to someone, it's generally not enough to just provide the measurement of central tendency. You should also provide some measurement of dispersion. We use measurements of dispersion to describe how spread out the data is. We can provide this information in two ways: calculating the standard deviation of the dataset and providing the frequency counts across different ranges of the data. You can think of the standard deviation of a dataset as, roughly, the average distance of each observation from the mean.
Here is the formula:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$$

Where
$X_i$ = each individual observation
$\bar{X}$ = the mean of the dataset
$N$ = the number of observations in the dataset

Note: if calculating the standard deviation of a sample rather than a population, the denominator becomes n-1. We subtract one degree of freedom.

The standard deviation tells us, roughly, how far a typical observation lies from the mean, in the units of the data. If this number is large, the data is very spread out (i.e., the observations are different). If this number is small, the data is very compact (i.e., the observations are very similar).
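The population and sample versions of the calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the data values are hypothetical (NOT the WidgeOne data) and the function name std_dev is ours. The result is cross-checked against the standard library's statistics.pstdev and statistics.stdev.

```python
import math
import statistics

# Hypothetical observations (NOT the WidgeOne data).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

def std_dev(data, sample=False):
    """Square root of the summed squared deviations from the mean,
    divided by N (population) or by N - 1 (sample)."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    denominator = n - 1 if sample else n
    return math.sqrt(squared_deviations / denominator)

population_sd = std_dev(values)           # divides by N
sample_sd = std_dev(values, sample=True)  # divides by N - 1

# Cross-check against the standard library.
assert math.isclose(population_sd, statistics.pstdev(values))
assert math.isclose(sample_sd, statistics.stdev(values))

print(f"population SD: {population_sd:.4f}")
print(f"sample SD:     {sample_sd:.4f}")
print(f"population variance (SD squared): {population_sd ** 2:.4f}")
```

The sample version is always slightly larger than the population version because it divides the same sum by a smaller number.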

Refer back to the WidgeOne.xls dataset. Calculate the standard deviation of the number of years on the job for the men in Norcross (n=10). Remember that the mean was 9.66 years. The answer follows. Don't cheat: do it first to make sure that you understand how to calculate this foundational concept by hand.

Did you get 3.30? Well done.

What does this number MEAN? 3.30 what? It means that the standard deviation of the dataset is 3.30 years. The average deviation (in either direction) of each individual's tenure is 3.30 years from the mean of 9.66. Relative to the mean, we would consider this data to be fairly compact, meaning that the data is not very spread out (this will be seen more clearly in the next section when a graphical representation is created).

You may recall from your earlier Statistics course(s) a second statistical calculation that provides a second measurement of dispersion: the variance. The variance is simply the square of the standard deviation. Although variance is an important concept to statisticians, it is not typically used by practitioners. This is because variance is not very user friendly in terms of interpretation. In the case of the men in Norcross, the variance would be reported as 10.88 years squared.

There is another application of the term "variance" that has a more generic meaning and is heavily used by practitioners. It is the difference, either in absolute numbers or percentages, of each observation from some base value. For example, it is common for individuals to refer to a "budget variance", where this number would be the actual number minus the budgeted number:

Project #   Budget Hours   Actual Hours   Variance   Variance %
123         150            175            +25        +17%

Remember, when calculating the variance percentage in this context, you take the difference (175 - 150) divided by the budgeted number (150), not the actual number (many professionals make this mistake... once).
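The budget variance arithmetic can be sketched as a small Python function. The function name budget_variance is ours, used only for illustration; the 150 and 175 hour figures come from the example above.

```python
def budget_variance(budget, actual):
    """Return (variance, variance_pct), with the percentage taken
    relative to the BUDGETED number, not the actual number."""
    variance = actual - budget
    variance_pct = variance / budget * 100
    return variance, variance_pct

variance, pct = budget_variance(budget=150, actual=175)
print(f"Variance: {variance:+d} hours ({pct:+.0f}%)")

# The common mistake: dividing by the actual number instead.
wrong_pct = variance / 175 * 100
print(f"Dividing by actual hours would give {wrong_pct:+.0f}% instead")
```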
Another method of representing the dispersion of a dataset is to provide the frequency counts for observations across specified ranges.

Using the WidgeOne.xls dataset, determine the number of individuals with job tenure (YRONJOB) in the following categories:

- Less than 5 years
- 5-10 years
- More than 10 years

Here is how your answer should appear:

Category             Frequency   Relative Frequency   Cumulative Frequency
Less than 5 years    9           22.50%               22.50%
5-10 years           16          40.00%               62.50%
More than 10 years   15          37.50%               100.00%
Total                40          100.00%
It is important to note that the categories are mutually exclusive (no observation can occur in two categories simultaneously) and collectively exhaustive (every observation is accommodated). This representation of the dispersion of the data is referred to as a frequency table and is the most common and one of the most useful representations of data. In this instance, we converted a quantitative variable into a qualitative variable for the purposes of developing a frequency table. We do this frequently to take a different kind of look at a quantitative variable.
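A frequency table of this kind can be sketched in Python with a counter and a running total. The tenure values below are hypothetical (NOT the WidgeOne data), and the helper name category is ours. Because the function returns exactly one label for any value, the categories are mutually exclusive and collectively exhaustive by construction.

```python
from collections import Counter

# Hypothetical years-on-job values (NOT the WidgeOne data).
tenure = [0.5, 2.0, 4.9, 5.0, 6.3, 8.0, 10.0, 10.1, 12.4, 17.0]

def category(years):
    """Assign each value to exactly one category."""
    if years < 5:
        return "Less than 5 years"
    if years <= 10:
        return "5-10 years"
    return "More than 10 years"

order = ["Less than 5 years", "5-10 years", "More than 10 years"]
counts = Counter(category(y) for y in tenure)
total = sum(counts.values())

rows = []
cumulative = 0.0
for name in order:
    relative = counts[name] / total * 100
    cumulative += relative
    rows.append((name, counts[name], relative, cumulative))
    print(f"{name:<20}{counts[name]:>5}{relative:>10.2f}%{cumulative:>10.2f}%")

print(f"{'Total':<20}{total:>5}{100:>10.2f}%")
```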
If we had a qualitative variable that we wanted to better understand, we would generate the appropriate measurement of central tendency (Mode) and the measurement of dispersion (frequencies) through the application of a frequency table.
What you need to know
Measurements of dispersion provide information regarding how spread out or compact the data is. Typically this is communicated through the computation of the standard deviation AND some display of the frequency counts of the observations across specified categories. If the data is qualitative, the only measurement of dispersion comes from the frequency table.
Concept 3 - Visualization of Univariate Data

Typically, data analysis includes BOTH the computational analysis and some visual representation of the analysis. Many recipients of your work will never look at your actual calculations, only your tables and graphs (remember the reference above to the great statistical unwashed?). As a result, visual representation of your analysis should receive the same amount of attention and dedication as your computational analysis. Edward Tufte has published several books and articles on the topic of the visualization of data. We recommend his seminal work The Visual Display of Quantitative Information as an excellent reference on the topic. See https://www.edwardtufte.com/.

When developing a visual representation of a single variable, the most common tools include Histograms, Pie Charts, Bar Charts, Box Plots and Stem and Leaf Plots. Each of these will be discussed briefly in turn.

Histograms
Histograms visually communicate the shape, central tendency and dispersion of the dataset. For this reason, Histograms are heavily used in conjunction with the measurements of central tendency and the measurements of dispersion to describe a particular variable. Histograms are used with QUANTITATIVE DATA. For most of the packages that we will discuss below, you can simply reference the quantitative variable directly and a Histogram will be generated. In Excel, however, you will have to manually convert the data from quantitative into categorized qualitative data (ordinal data). Returning to the WidgeOne.xls dataset and the job tenure variable, if we want to create 5 categorical intervals using Excel, we calculate the range (16.9) and divide this number by 5 (3.38). If we massage this interval number to 3, we will have the following intervals:

- Less than 3 years
- 3-6 years
- 7-10 years
- 11-14 years
- 15+ years

These intervals would result in the following histogram:

Note in this graphic that the left axis represents the actual frequency counts for each category and the right axis represents the cumulative percentage for all categories. From this graphic, it is easy to see that the data is roughly bell-shaped (approximately normally distributed), with the mean, median and mode in the 7-10 year category. This histogram was developed using Excel.
Pie Charts
Pie charts can be useful for displaying the relative frequency of observations by category, if used properly. Consider these two guidelines:
1. Use 5 or fewer slices; if more than 5 slices are needed, use a table;
2. Order the relative frequencies in ascending (or descending) order.

Using the same Job Tenure data, the associated pie chart, generated using Excel, would look like this:

This pie chart was developed using Excel. It should probably be noted at this point that approximately 8% of all men and 0.5% of all women are colorblind. Although colorblindness comes in many different forms, the most common forms involve the colors red, green, yellow and brown. Individuals who are colorblind may not be able to distinguish among these colors. Therefore, when constructing pie charts or any other type of colored visual representation of your analysis, avoid placing these colors adjacent to each other.

Bar Charts
Bar Charts ARE NOT Histograms! Bar Charts are intended to represent the frequency counts of QUALITATIVE data. The plant information from WidgeOne.xls would look like this:

This bar chart was developed using Excel. A very technical point: in Excel's terminology, Bar Charts are always displayed horizontally and Column Charts are always displayed vertically. They are different charts. Column Charts are typically used with a time element, where each column is a different point in time. When there is no time element (as with the WidgeOne data), Bar Charts and Pie Charts are the primary tools used to display qualitative data.
Stem and Leaf Plots
Stem and leaf plots, like histograms, provide a visual representation of the shape of the data and the central tendency of the dataset. Here is the stem and leaf plot for the Job Tenure variable:

    Stem  Leaf    Frequency
    17    0       1
    16
    15    0       1
    14    001     3
    13    000     3
    12    1       1
    11    011     3
    10    0115    4
     9    000     3
     8    00016   5
     7    016     3
     6    1       1
     5    007     3
     4    01      2
     3    00      2
     2    001     3
     1    0       1
     0    1       1
When reading a stem and leaf plot, the first number represents the stem and the numbers to the right represent the leaves, while the number to the far right represents the frequency of the stem. For example, the first stem of the plot above is a 17 and the first (and only) leaf is 0. This means that there is one observation that has 17.0 years on the job. To the far right of the 17, there is a 1. This indicates that there is only one employee with 17.x years on the job.
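The stem/leaf split described above can be sketched in Python. This is an illustrative sketch, not how any of the packages build the plot: the values are hypothetical (NOT the WidgeOne data), the function name stem_and_leaf is ours, and we assume the stem is the whole number of years and the leaf is the tenths digit.

```python
from collections import defaultdict

# Hypothetical years-on-job values (NOT the WidgeOne data).
tenure = [0.1, 1.0, 2.0, 2.1, 5.7, 8.0, 8.6, 10.5, 17.0]

def stem_and_leaf(values):
    """Map each stem (whole years) to its sorted leaves (tenths digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # e.g. 8.6 -> 86
        stem, leaf = divmod(tenths, 10)   # 86 -> stem 8, leaf 6
        plot[stem].append(leaf)
    return plot

plot = stem_and_leaf(tenure)

# Print from the highest stem down to 0, keeping empty stems in place,
# with the frequency of each stem at the far right.
for stem in range(max(plot), -1, -1):
    leaves = plot.get(stem, [])
    leaf_text = "".join(str(d) for d in leaves)
    freq = len(leaves)
    print(f"{stem:>2} | {leaf_text:<6} {freq if freq else ''}")
```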
Boxplots
The last tool described in this manual for visualizing univariate data is the boxplot. The boxplot builds on the information displayed in a stem-and-leaf plot, focuses particular attention on the symmetry of the distribution, and incorporates numerical measures of tendency and location. Prior to creating a boxplot, you need to be familiar with the concept of quartiles. The boxplot incorporates the median, the mean and the four quartiles of a variable. The quartiles of a dataset are the points below which 25%, 50% (the same as the median), 75% and 100% (the max value) of the data lie. Quartiles are typically written as Q1, Q2, Q3, Q4, respectively. The span between Q1 and Q3 is referred to as the Interquartile Range or IQR. It contains the center 50% of the dataset. Below is the boxplot for the Job Tenure variable from WidgeOne.xls. The boxplot is typically placed next to the stem-and-leaf plot for context:

    Stem  Leaf    Frequency   Boxplot
    17    0       1              |
    16                           |
    15    0       1              |
    14    001     3              |
    13    000     3              |
    12    1       1              |
    11    011     3           +-----+
    10    0115    4           |     |
     9    000     3           |     |
     8    00016   5           *--+--*
     7    016     3           |     |
     6    1       1           |     |
     5    007     3           +-----+
     4    01      2              |
     3    00      2              |
     2    001     3              |
     1    0       1              |
     0    1       1              |
From this boxplot, you can see that Q1 begins at 5, Q2 (also the median) begins at 8 (the actual median of the dataset is 8.35), Q3 begins at 11 and the highest value of the dataset is 17.0. Since the median (the *--+--* line; the + on that line marks the mean) is approximately in the center of the IQR box, we would conclude that this dataset is relatively symmetric. This boxplot was developed using SAS.

What you need to know
Many individuals who are analytically very strong often place insufficient emphasis on graphics and visual representations of data. Many individuals who are not strong analytically, but need analysis to support their decision-making, often place an overemphasis on graphics and visualization. Individuals who can execute both well will go far. Histograms, Stem and Leaf Plots and Boxplots are used with QUANTITATIVE DATA. Bar Charts, Pie Charts and Column Charts are used with QUALITATIVE DATA.
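The quartiles and the IQR behind a boxplot can be sketched with Python's standard library. The values below are hypothetical (NOT the WidgeOne data). Note that statistical packages use slightly different interpolation rules for quartiles, so the numbers from this sketch may not match a given package exactly.

```python
import statistics

# Hypothetical values (NOT the WidgeOne data).
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# With n=4, statistics.quantiles returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print("Q1:", q1)
print("Q2 (median):", q2)
print("Q3:", q3)
print("Q4 (max):", max(values))
print("IQR:", iqr)

# A rough symmetry check: how far is the median from the middle of the box?
box_midpoint = (q1 + q3) / 2
print("median minus box midpoint:", q2 - box_midpoint)
```

A difference near zero in the last line corresponds to a median sitting in the center of the IQR box, which is the visual cue for symmetry described above.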
Concept 4 - Organization/Visualization of Multivariate Data

Frequently, we need to understand and report the relationships between and among variables within a dataset. When developing visual representations of multiple variables, the most common tools include Contingency Tables (qualitative and quantitative data), Stacked Bar Charts (qualitative data), 100% Stacked Bar Charts (qualitative data), and Scatter Plots (quantitative data). Each of these will be discussed briefly in order.

Contingency Tables
One of the most common and useful methods of displaying the relationships between two or more variables is the contingency table. This table is highly versatile and easily constructed. As an example, let's take the GENDER and PLANT variables from the WidgeOne.xls dataset. A contingency table of these two variables would look like this:

Count of Gender   Plant
Gender            D         N         Grand Total
F                 13        7         20
M                 10        10        20
Grand Total       23        17        40

This table displays the number of females and males at each plant. We could also display this table as percentages rather than as frequencies:

Count of Gender   Plant
Gender            D         N         Grand Total
F                 65.00%    35.00%    100.00%
M                 50.00%    50.00%    100.00%
Grand Total       57.50%    42.50%    100.00%
Here the percentages are given as a percentage of each gender (row percentages). Specifically, the interpretation of the first cell would be "of all of the female employees, 65% work in Dallas". The percentages could easily be reversed to represent the percentage of individuals at each plant (column percentages):

Count of Gender   Plant
Gender            D         N         Grand Total
F                 56.52%    41.18%    50.00%
M                 43.48%    58.82%    50.00%
Grand Total       100.00%   100.00%   100.00%

In this version of the table, the first cell now communicates "of all of the Dallas employees, 56.52% are female". Finally, we can also represent the data as overall percentages:

Count of Gender   Plant
Gender            D         N         Grand Total
F                 32.50%    17.50%    50.00%
M                 25.00%    25.00%    50.00%
Grand Total       57.50%    42.50%    100.00%

In this version of the table, the first cell now communicates "of all employees, 32.50% are females in Dallas".
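The difference between row, column and overall percentages comes down to which total sits in the denominator. A minimal Python sketch, using the GENDER by PLANT counts from the contingency table above (the variable and function names are ours):

```python
# Counts from the GENDER x PLANT contingency table (D = Dallas, N = Norcross).
counts = {
    "F": {"D": 13, "N": 7},
    "M": {"D": 10, "N": 10},
}
plants = ["D", "N"]

row_totals = {g: sum(row.values()) for g, row in counts.items()}
col_totals = {p: sum(counts[g][p] for g in counts) for p in plants}
grand_total = sum(row_totals.values())

def pct(part, whole):
    """Percentage of whole, rounded to two decimals."""
    return round(part / whole * 100, 2)

# Same counts, three different denominators.
row_pct = {g: {p: pct(counts[g][p], row_totals[g]) for p in plants} for g in counts}
col_pct = {g: {p: pct(counts[g][p], col_totals[p]) for p in plants} for g in counts}
overall_pct = {g: {p: pct(counts[g][p], grand_total) for p in plants} for g in counts}

print("Of all females, % in Dallas:      ", row_pct["F"]["D"])
print("Of all Dallas employees, % female:", col_pct["F"]["D"])
print("Of all employees, % Dallas female:", overall_pct["F"]["D"])
```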
Before moving on, please ensure that you fully understand the differences across these three tables. They are subtle, but important. Both gender and plant are categorical variables. We could incorporate a quantitative variable into this table, such as job tenure:

Average of YRONJOB   Plant
Gender               D        N        Grand Total
F                    8.85     6.94     8.19
M                    7.13     9.66     8.40
Grand Total          8.10     8.54     8.29

This table now provides information about the average job tenure for each gender and each plant, and for each gender at each plant. For example, the first cell now communicates, "The females in Dallas have an average job tenure of 8.85 years." These tables were developed in Excel using Pivot Tables.
Stacked Bar Charts
Stacked bars are a convenient way to display percentages or proportions, such as might be done in a pie chart, for multiple variables. For example, the proportion of each gender at each plant would be displayed like this in a stacked bar chart:
This stacked bar chart could be reversed, where gender is displayed as the bars and the segments are the plants:
The second graphic is fine. However, when the population sizes differ, particularly by a lot, stacked bar charts are less informative. It is difficult to understand how the groups compare. For example, the difference in the number of Dallas and Norcross employees is not dramatic, but even here it is difficult to discern which has a greater proportion of men.
100% Stacked Bar Charts
To solve this problem, we can apply a 100% stacked bar chart. This visualization tool simply rescales the populations of interest (like the two plants) so that each is evaluated out of a total of 100%. You can almost think of 100% Stacked Bar Charts as side-by-side pie charts.
Compare this graphic to the first Stacked Bar Graph. They are different. They communicate subtly different messages.
Scatter Plots
What if we wanted to better understand whether there is a meaningful relationship between two quantitative variables, such as the possible relationship between job tenure and productivity? This question can be addressed using a scatter plot, where one quantitative variable is plotted on the y-axis and the second quantitative variable is plotted on the x-axis:

If two variables are related, we would expect to see some pattern within the scatter plot, such as a line. If job tenure and productivity were positively related, then we would expect to see the points cluster around an upward-sloping line moving from the SW corner to the NE corner. This would indicate that as job tenure goes up, productivity goes up. If job tenure and productivity were negatively related, then we would expect to see the points cluster around a downward-sloping line moving from the NW corner to the SE corner. This would indicate that as job tenure goes up, productivity goes down.

In this scatter plot, neither of these linear patterns (or any other pattern) is reflected. This cloud is referred to as a Null Plot. As a result, we would conclude that job tenure and productivity are not meaningfully related. We can derive additional information from this scatter plot. Specifically, we can determine the best fit line in the form y=mx+b. This is the linear equation that minimizes the sum of the squared distances between the predicted values and the actual values, where y = the predicted value of an employee's productivity and x = the actual number of years of an employee's job tenure: y = -0.5715x + 89.318. This equation generates an R^2 value of 0.1124, where this value represents the proportion of the variance of the dependent variable (productivity) that can be explained by the independent variable (job tenure); here, about 11%. Detailed explanations of these concepts are outside of the scope of this document, but they are heavily used in Statistics and form the basis of Regression Modeling. For a more detailed explanation of Regression Modeling, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.
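For readers curious how a best fit line and its R^2 are obtained, here is a minimal least-squares sketch in plain Python. The function name fit_line and the (tenure, productivity) pairs are ours and hypothetical (NOT the WidgeOne data); only the equation y = -0.5715x + 89.318 in the last step comes from the text.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = m*x + b.

    Returns (m, b, r_squared), where r_squared is the share of the
    variance in ys explained by the fitted line.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return m, b, 1 - ss_res / ss_tot

# Hypothetical (tenure, productivity) pairs, NOT the WidgeOne data.
tenure = [1.0, 3.0, 5.0, 8.0, 12.0]
productivity = [90.0, 84.0, 88.0, 79.0, 85.0]

m, b, r2 = fit_line(tenure, productivity)
print(f"y = {m:.4f}x + {b:.3f}, R^2 = {r2:.4f}")

# Using the equation reported in the text, y = -0.5715x + 89.318,
# the predicted productivity for an employee with 10 years of tenure is:
predicted = -0.5715 * 10 + 89.318
print(f"predicted productivity at 10 years: {predicted:.3f}")
```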
What you need to know
Stacked Bar Charts are used to display the counts within groupings of qualitative variables. When those groupings are of different sizes, a 100% Stacked Bar Chart is preferred. You can think of 100% Stacked Bar Charts as side-by-side Pie Charts. Scatter plots are used to communicate whether a relationship exists between two quantitative variables.
Concept 5 - Random Number Generation and Simple Random Sampling

The statistical concepts covered up to this point fall under the heading of Data Analysis or Basic Descriptive Statistics. These concepts enable us to describe or represent a given dataset to other people. They represent a critical, albeit simple, set of analytical tools. Concepts 1-4 are employed once the data has been gathered. Now let's take a step back: what if the data NEEDS to be gathered? Entire disciplines exist in the areas of experimental design and sampling. Although the scope of this document does not include an examination of these areas, we will address one of their foundational concepts, random number generation to support simple random sampling, using statistical software.
Humans are woefully deficient in our ability to generate truly random numbers. We are subject to latent biases and to laziness, in the sense that we tend to repeat numbers. In fact, human random number generation is so NOT random that computer programs have been written that accurately predict the random numbers that humans will select.
Randomly generated numbers can be forced to follow a particular probability distribution and/or fall between an established minimum and maximum value. We will be generating numbers that follow a uniform distribution, where every number has the same probability of occurrence. This is the most common execution of random number generation. It should be noted that random numbers could follow any probability distribution (e.g., normal, binomial, Poisson, etc.).
One of the primary rationales for generating a string of random numbers is to select a sample of observations for analysis (another common rationale for random number generation is simulation analysis). Often, researchers do not have the time, the access, or the money to analyze every element in a dataset.
Assigning a random number to every element in a dataset and then selecting, for example, the first 50 elements when sorted by the random number, is a statistically valid method of sampling. When a uniform distribution is used to generate these random numbers, this process is referred to as simple random sampling, where every element has the same probability of selection (1/n on any single draw from n elements). Simple random sampling using random number generation is a very common execution used by analysts to select a subset of a population of elements for analysis.
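The assign, sort and select procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simple_random_sample`, the population of 400 employee IDs and the seed value are all hypothetical.

```python
import random

def simple_random_sample(elements, k, seed=None):
    """Pick k elements by tagging each with a uniform random number and sorting."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    # Step 1: pair every element with a uniform random number in [0, 1).
    keyed = [(rng.random(), element) for element in elements]
    # Step 2: sort on the random number only.
    keyed.sort(key=lambda pair: pair[0])
    # Step 3: keep the first k elements.
    return [element for _, element in keyed[:k]]

employee_ids = list(range(1, 401))  # hypothetical population of 400 employees
sample = simple_random_sample(employee_ids, 50, seed=2024)
print(len(sample), "employees selected; first five:", sample[:5])
```

Because the random numbers are uniform, every employee is equally likely to land among the first 50 rows after sorting.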
Concept 6 - Confidence Intervals

As stated previously, Concepts 1-4 fall under the heading of Descriptive Statistics, where the analyst has access to the entire dataset and is simply providing a description or visual representation of the central tendency or the dispersion of the dataset. Concept 5 (Random Number Generation) is an important tool that analysts use to subset a dataset or assign elements for survey or additional analysis. When a sample is analyzed for the purposes of better understanding a population, the process is referred to as Inferential Statistics (see footnote 2). Here is a brief comparison of Descriptive Statistics and Inferential Statistics:

Descriptive Statistics
  Dataset: Population (entire dataset)
  Accuracy: 100% accurate (assuming calculations were done correctly)
  Confidence: 100%
  Example: Measurements of Central Tendency
  Preference?: ALWAYS preferred!

Inferential Statistics
  Dataset: Sample from a population
  Accuracy: Some margin of error will be expected
  Confidence: Typically 90%, 95% or 99%
  Example: Confidence Intervals around a population parameter
  Preference?: Never preferred, but accepted as a trade-off for cost and/or time.
Concept 6 (Confidence Intervals) therefore differs from the first four concepts reviewed in this manual, because we are moving from descriptive statistics to inferential statistics. Simply stated, a confidence interval is an estimate of some unknown population parameter (usually the mean), based on sample statistics, where the acceptable margin of error and/or confidence level is pre-established.
Footnote 2: Inferential statistics is based on the Central Limit Theorem. Readers are assumed to have a working knowledge of this theorem. For a refresher on the Central Limit Theorem, we suggest Statistical Methods and Data Analysis by Ott and Longnecker.
The formula used to estimate a two-sided confidence interval for a population mean is

$$\bar{X} \pm Z \cdot \frac{s_X}{\sqrt{n}}$$

where
$\bar{X}$ = the sample mean;
$Z$ = the number of standard deviations, using the sampling distribution and the Central Limit Theorem, associated with the established confidence level: 90% confidence = 1.645, 95% confidence = 1.96, 99% confidence = 2.575;
$s_X$ = the sample standard deviation;
$n$ = the number of elements in the sample.

The formula used to estimate a two-sided confidence interval for a population proportion is

$$p \pm Z \cdot \sqrt{\frac{pq}{n}}$$

where
$p$ = the sample proportion;
$q$ = 1 - p;
$Z$ = same as above;
$n$ = same as above.

In both formulas, the expression after the ± sign is referred to as the Margin of Error.
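As an illustration of the proportion formula, here is a minimal Python sketch. The function name and the survey result (24 of 40 employees) are hypothetical values chosen only to show the calculation; they do not come from the WidgeOne.xls dataset.

```python
import math

def proportion_confidence_interval(p, n, z=1.96):
    """Two-sided interval p +/- z * sqrt(p*q/n), with q = 1 - p."""
    q = 1 - p
    margin = z * math.sqrt(p * q / n)  # the Margin of Error
    return p - margin, p + margin, margin

# Hypothetical example: 24 of 40 sampled employees (p = 0.60) answer "yes".
low, high, margin = proportion_confidence_interval(24 / 40, 40, z=1.96)
print(f"95% interval: {low:.3f} to {high:.3f} (margin of error {margin:.3f})")
```

Passing z=1.645 or z=2.575 instead gives the 90% or 99% interval for the same sample.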

Let's assume that the WidgeOne.xls dataset is a representative sample of a larger manufacturing firm with hundreds of employees in Norcross, GA and Dallas, TX. Let's also assume that the HR department at WidgeOne has been charged with understanding the level of job satisfaction among employees. For cost reasons, they were unable to survey the entire organization, so they surveyed the 40 employees in our dataset. Report the job satisfaction for all WidgeOne employees, using the sample of 40. Use a 95% level of confidence. From the WidgeOne.xls dataset, the mean Job Satisfaction is 6.85 (where 1 = low satisfaction and 10 = high satisfaction) and the standard deviation is 1.02. Using the formula above, the confidence interval calculation is: 6.85 ± 1.96*(1.02/SQRT(40)), or 6.85 ± .32. If you actually gave this number to most people, they would have no idea what it meant. The proper way to communicate this information is: Based on a representative sample of 40 employees, we are 95% confident that mean job satisfaction among all employees is between 6.53 and 7.17. This means that if we repeated this sampling process many times, about 95% of the intervals constructed in this way would contain the true mean job satisfaction of all employees, which is unknown. It also means that about 5% of such intervals would miss the true mean, which would then lie outside the reported range (< 6.53 or > 7.17).
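The arithmetic in this example can be checked with a short Python sketch. The mean (6.85), standard deviation (1.02), sample size (40) and Z value (1.96) are the figures quoted above; the function name `mean_confidence_interval` is our own.

```python
import math

def mean_confidence_interval(mean, std_dev, n, z=1.96):
    """Two-sided interval mean +/- z * (std_dev / sqrt(n))."""
    margin = z * std_dev / math.sqrt(n)  # the Margin of Error
    return mean - margin, mean + margin, margin

low, high, margin = mean_confidence_interval(6.85, 1.02, 40, z=1.96)
print(f"margin of error = {margin:.2f}")      # prints 0.32
print(f"95% CI: {low:.2f} to {high:.2f}")     # prints 6.53 to 7.17
```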
What you really need to know
When calculating confidence intervals, use a 95% default unless you know something about the decision maker. If the decision maker is conservative, use a 99% interval. If the decision maker is risk tolerant, use a 90% interval. To both increase confidence and decrease the margin of error, increase the sample size.
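The trade-off among confidence level, margin of error and sample size can be seen by tabulating the margin of error for the mean. The sketch below reuses the standard deviation of 1.02 and the three Z values given in this chapter; the sample sizes of 160 and 640 are hypothetical.

```python
import math

# Z values for each confidence level, as listed in this chapter.
Z_VALUES = {"90%": 1.645, "95%": 1.96, "99%": 2.575}

def margin_of_error(std_dev, n, z):
    """Margin of error for a mean: z * (std_dev / sqrt(n))."""
    return z * std_dev / math.sqrt(n)

for n in (40, 160, 640):  # 160 and 640 are hypothetical sample sizes
    row = ", ".join(
        f"{level}: {margin_of_error(1.02, n, z):.3f}"
        for level, z in Z_VALUES.items()
    )
    print(f"n = {n:>3} -> {row}")
```

Reading across a row, higher confidence widens the interval; reading down a column, quadrupling the sample size cuts the margin of error in half.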
Chapter 3 - Microsoft's Excel 2003

What is Excel?
As mentioned in Chapter 1, Microsoft's Excel has become the standard for basic data analysis. And, again, individuals with a college education in the 21st century will be expected to have a working knowledge of this foundational package. Excel is critical to understand not only because it facilitates basic data analysis, but also because it is typically the starting point for PC-based data, which can then be analyzed using more sophisticated packages like SPSS (Chapter 5), Minitab (Chapter 6) or SAS (Chapter 7). Because there are substantive differences between Excel 2003 and Excel 2007, executions of the six concepts outlined in Chapter 2 will be demonstrated in the two versions of Excel separately. If you are working with Excel 2007, please skip this chapter and proceed to Chapter 4 - Microsoft's Excel 2007. When you open Excel, the interface includes rows and columns, with cells at the intersections. You can input data or formulas into the individual cells.
Here is a screen shot of a blank Excel page:
The cursor in this page is in cell F10
If you needed to enter data into a new spreadsheet, you would simply type the data values into each cell, with labels in row 1. Excel will accept most characters (letters and numbers) in the cells. However, only numbers (with a few exceptions) can be subjected to the kinds of analysis outlined in Chapter 2. At this point, we need to access the WidgeOne.xls dataset in Excel. To access the dataset, click on File and then Open. A Microsoft explorer box will pop up. Go to the folder or drive where you have saved the WidgeOne.xls file:
Note that this explorer box is looking for Excel files. If you need to change the file type, click on the drop down menu.
Once you have opened the WidgeOne.xls file in Excel, you should see this:
Recall from the initial description of the dataset that there are three worksheets in the file. The Plant_Survey sheet is currently open (if it is not, please click on that tab at the bottom of the page). We will be executing most of our analysis in this sheet. However, if you click on one of the two other tabs (Employees or Attendance), you will move to one of those two sheets. Return to the Plant_Survey sheet and we will begin to execute the six statistical concepts from Chapter 2.
3.1 Using Excel 2003 for: Measurements of Central Tendency

The three measurements of central tendency can be executed in Excel using pre-programmed formulas. Notice in the screen shot from the previous page there is a circle around a button with an fx. Click on this button. You will see this box:
Functions in Excel are organized into categories, based upon different specializations. We will be using functions in the Statistical category.
Go ahead and click on the Statistical category. You will see this box:
As you scroll through this box, you will see a wide variety of statistical functions.
Click on Cancel and go back to the dataset. Before we perform any analysis, let's insert an additional column, where we will insert the labels Mean, Median and Mode. To do this, first place your cursor on the A at the top of the first column and click, so that the entire column is highlighted. Now, click on Insert>Column.
At this point, the entire dataset should have shifted to the right, and the new column A is blank:
Now, go to the bottom of the dataset to cell A43. In cells A43, A44 and A45, type Mean, Median and Mode, respectively:
Not all variables will lend themselves to these calculations. Remember that we only execute mean and median calculations on quantitative variables. So, it would be helpful if we could see the column headers to remind us what is in each column. This can be done using a split screen. Go to cell A2 (the row just below the headings) and then click on Window>Split. At this point you should see this:
Now, you can use the scroll bar on the right to scroll back to your labels, and still see the column headers. For which columns should we report the measurements of central tendency? The quantitative variables include JOBGRADE, SOCREL (social relations score; see footnote 3), YRONJOB (number of years on the job), PRDCTY (productivity) and JOBSAT (job satisfaction). The calculation of the mode for the qualitative variables (PLANT, GENDER and POSITION) will be addressed below. Move your cursor to cell F43. This is where we will place the mean for the JOBGRADE variable. With your cursor in this cell, click on the fx button. From the dialogue box, select Statistical. From the list of function names, click on the second entry, AVERAGE. You will see this:
This dialogue box is effectively asking: for what array of numbers do you want to calculate an average? Excel is pretty clever. You may already have the array populated in the first field (Number 1). For the JOBGRADE variable, this will be cell F2 through cell F41. If it is not already populated for you, simply click on the little spreadsheet button and highlight the cells F2 through F41. Note that cell F42 is empty. If it is included, it will be ignored. However, if there was a 0 in cell F42, it would be included, and a different mean would be calculated. It is always best to only include the relevant cells in your calculations. After you have selected cells F2 through F41 as the array for the mean calculation, click OK. You should now see 6.6. Now, let's copy this function across to column J. With your cursor in cell F43, go to Edit>Copy, then highlight cells G43 through J43. Go to Edit>Paste. You should now see this:
Footnote 3: Psychology, Sociology and Marketing majors will recognize that this is Likert Data. For the purposes of this manual, Likert Data will be treated as quantitative. However, it should be noted that pure mathematicians treat Likert Data as qualitative.
To populate the Median cells, we will use the same process. Place your cursor in cell F44 and click on the function button. From the Statistical functions, select MEDIAN and select the same array F2:F41. Click OK. Copy and paste the function in cell F44, across to cell J44. You should now see this:
Although it is not typically used as the best measurement of central tendency of quantitative data, you can provide the mode for these variables using the same process: MODE is a function listed in the Statistical category of functions.
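For comparison, the same three measurements can be sketched in Python using its standard statistics module. The eight job grade values are hypothetical, and `None` is our stand-in for an empty cell; the helper `average` mimics the behavior described above, where an empty cell is ignored but a 0 is included.

```python
from statistics import mean, median, mode

job_grades = [5, 6, 6, 7, 8, 6, 9, 4]  # hypothetical values, not WidgeOne.xls

print("Mean:", mean(job_grades))       # 6.375
print("Median:", median(job_grades))   # 6.0
print("Mode:", mode(job_grades))       # 6

def average(cells):
    """Average a range, skipping empty cells (represented here as None)."""
    values = [c for c in cells if c is not None]
    return mean(values)

with_blank = job_grades + [None]  # an empty cell is skipped
with_zero = job_grades + [0]      # a zero is a real value and pulls the mean down
print(average(with_blank), average(with_zero))
```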
3.2 Using Excel 2003 for: Measurements of Dispersion

Recall from Chapter 2 that the measurements most commonly used to describe the dispersion of a variable include the standard deviation and the frequency table. The standard deviation will be calculated in Excel using the function button. Returning to the WidgeOne.xls dataset, enter a label for the standard deviation below the measurements of central tendency:
You probably noticed that the words Standard Deviation do not fit neatly into cell A47; they spilled over into B47 and C47. Remember that what you see in Excel is not necessarily what Excel sees. In reality, cells B47 and C47 are still empty from Excel's perspective. But this looks a little untidy. There are several ways to tidy this. We can expand column A until the words are visually contained within the column. This is accomplished by placing the cursor on the boundary between the A and the B at the top of the spreadsheet until the cursor changes to a double-headed arrow, and then double clicking. Column A will widen enough to accommodate the longest string of characters in the column (in this case, Standard Deviation). A second method of accommodating the text is by wrapping the text into the cell. This is accomplished by selecting Format>Cells>Wrap text:
To wrap the text, ensure that this option is checked.
After the text has been wrapped, you can then slightly widen the columns or narrow the rows (using the same process as for the columns), as needed. Once the label has been established, select the function button, then within the Statistical category, select the STDEV option and the same range as before (F2:F41), and click OK. You should see 1.54919334. This is the standard deviation of the JOBGRADE variable. As before, copy this formula across to column J. You should now see this:
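Excel's STDEV function computes the sample standard deviation, which divides by n - 1 rather than n. The Python sketch below shows that calculation by hand and compares it with the standard library's `stdev` (sample) and `pstdev` (population) functions. The data are hypothetical and the helper name `sample_std_dev` is our own.

```python
import math
from statistics import pstdev, stdev

job_grades = [5, 6, 6, 7, 8, 6, 9, 4]  # hypothetical values, not WidgeOne.xls

def sample_std_dev(values):
    """Sample standard deviation: sqrt(sum of squared deviations / (n - 1))."""
    n = len(values)
    avg = sum(values) / n
    return math.sqrt(sum((v - avg) ** 2 for v in values) / (n - 1))

print("By hand (n - 1):  ", sample_std_dev(job_grades))
print("statistics.stdev: ", stdev(job_grades))    # same n - 1 divisor
print("statistics.pstdev:", pstdev(job_grades))   # divides by n, so slightly smaller
```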
We now have the basic descriptive statistics for the quantitative variables. You may notice that some of the values have no decimal places, some have one decimal place, and some have many decimal places. We think this looks a little untidy (as Statisticians, we like things to be tidy). To make this spreadsheet look a bit more professional, let's format all of the data to have a consistent number of decimal places. To do this, click on the cell in the far upper left corner, as circled above. This will highlight the entire spreadsheet. Then click on Format>Cells. Then select the Number category as indicated:
Then click OK.
Now, your very tidy spreadsheet should look like this:
Isn't that better? In practice, if you need to provide multiple descriptive statistics on a variable, this is not the process that you would go through. For multiple descriptive statistics, you would select Tools>Data Analysis>Descriptive Statistics (see footnote 4). This path will bring up the following:
Select the Descriptive Statistics option.
Footnote 4: If you go to Tools and do not see the Data Analysis option, do not let your heart be troubled. Simply go to the Add-Ins option under Tools and select the Analysis ToolPak. Then go back into Tools. You should now see the Data Analysis option. WARNING: if you have an unauthorized copy of Excel, you will not have access to this very important functionality.
You will then see the following dialogue box:

For the input range, highlight all of the values for the quantitative variables, including the column headings.
Ensure that this option is checked.
Ensure that this option is checked.
Now click OK.
You should now see this:
Again, pretty untidy. Format the spreadsheet to have two decimal points for all values and expand the columns to accommodate the labels. Your tidy version should look like this:
Notice that we reproduced all of the measurements from before, and several more (see footnote 5). This is a more efficient way to produce the descriptive statistics of one or more variables. In Chapter 2, we presented the concept of a frequency table as another method of displaying the spread of a dataset. As discussed, frequency tables are one of the most commonly used methods to display data, so understanding how to create a frequency table is a critical skill. The frequency table in Chapter 2 was created in Excel. We will reproduce it here. The first step in creating a frequency table is to determine the categories that need to be developed for the quantitative variable (this process will effectively transform a quantitative ratio-scale variable into a qualitative categorical variable). Previously, we determined that the job tenure variable (YRONJOB) should be categorized into three levels: less than 5 years, 5-10 years and more than 10 years. Recall that the categories must be mutually exclusive and collectively exhaustive. To accommodate these categories in Excel, we will create bins, where the TOP of each category identifies each bin.
Footnote 5: For detailed information on the additional statistics produced, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.
In our WidgeOne.xls dataset, let's create a bin range for YRONJOB in column L:
These are the bins for the Histogram for Job Tenure. Category 1 is 0-4.99, Category 2 is 5-10.00 and Category 3 (which does not need to be entered) is everything above 10.00.
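To make the "top of each category" rule concrete, the following Python sketch counts values into bins defined only by their upper bounds, with one extra overflow bin for everything above the last bound. The bin tops (4.99 and 10.00) are the ones given above; the nine tenure values and the function name `bin_counts` are hypothetical.

```python
from bisect import bisect_left

BIN_TOPS = [4.99, 10.0]  # upper bound of Category 1 and Category 2
LABELS = ["Less than 5 years", "5-10 years", "More than 10 years"]

def bin_counts(values, bin_tops):
    """Count values per bin; a value equal to a bin top falls in that bin."""
    counts = [0] * (len(bin_tops) + 1)  # the last slot is the overflow bin
    for value in values:
        # bisect_left returns the index of the first bin top >= value.
        counts[bisect_left(bin_tops, value)] += 1
    return counts

years_on_job = [0.5, 2, 4.99, 5, 7.5, 10, 10.5, 12, 21]  # hypothetical values
for label, count in zip(LABELS, bin_counts(years_on_job, BIN_TOPS)):
    print(f"{label}: {count}")
```

Note that a value of exactly 10 is counted in the 5-10 years category, while 10.5 falls into the overflow category that never needed to be entered.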
Once these bins have been created, select Tools>Data Analysis>Histogram:
Click OK. This will bring up a dialogue box, asking for information regarding the quantitative variable to be analyzed, and the associated Bin Range:
Highlight the range of the YRONJOB variable (including the label)
Highlight the Bin Range (including the label)
Ensure that the Labels option is checked
Now you should see this:
Again, a little untidy, but this is the base of what we need for the frequency table. Let's clean this up and add some columns to reproduce the table from Chapter 2. First, replace the bin titles with the real category labels of Less than 5 years, 5-10 years and More than 10 years. Second, expand the columns as needed. Third, total the bottom of the frequency column using the SUM option: in cell B5, type =SUM(B2:B4) (the SUM function can be found in the Math & Trig category of functions). Next, create two additional column headers: Relative Frequency and Cumulative Frequency. Your sheet should look like this:
The Relative Frequency column will display the percentage of observations in each category, an important piece of information, particularly when comparing populations of different sizes. This is done by simply taking each frequency and dividing it by the total. For example, in cell C2, we would type =B2/B5. This would result in .2250 (9/40). Rather than typing this same formula again and again to capture the relative frequencies of the next two categories, we would like to copy this formula into cells C3 and C4. Do this now. Did you get a #DIV/0! error? The problem is that when the formula =B2/B5 is copied down one cell, it becomes =B3/B6. There is nothing in cell B6, and Excel treats an empty cell as 0. Since division by 0 is undefined, we receive this error message. If we want to copy the formula into the cells below, we need to nail down the reference to the Total cell and prevent the reference from changing. To do this, we place a $ in front of the B and another $ in front of the 5: $B$5 instead of B5. This can also be accomplished by placing the cursor in between the B and the 5 and hitting the F4 button on your computer. Once you have nailed down the Total cell as a reference cell, you can copy the formula into cells C3 and C4. The Cumulative Frequency column will display the cumulative percentage of observations from 0 to the top of the category in question. This is accomplished by adding the relative frequency of a category to all of the relative frequencies before it. In Excel, we would type =C2 in cell D2; the first entry in the Cumulative column will always equal the first entry in the Relative Frequency column. In cell D3, we would enter =D2+C3. This will add the cumulative value (D2) to the Relative Frequency for the category (C3). And we can now copy this formula into cell D4. Clearly, this is a lot of manual work in Excel for a relatively small table. However, our focus is on helping to build the Excel skills necessary to execute this kind of analysis for any size table or dataset.
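The relative and cumulative frequency calculations can be sketched in Python as follows. Only the first count (9 of 40) comes from the text above; the counts of 18 and 13 are hypothetical fillers chosen so that the total is 40.

```python
from itertools import accumulate

labels = ["Less than 5 years", "5-10 years", "More than 10 years"]
counts = [9, 18, 13]  # 9 is from the text; 18 and 13 are hypothetical

total = sum(counts)  # plays the role of the fixed Total cell, $B$5
# Every category is divided by the same total (the job of the $ signs).
relative = [count / total for count in counts]
# Running sum of the relative frequencies, like =D2+C3 copied downward.
cumulative = list(accumulate(relative))

for label, count, rel, cum in zip(labels, counts, relative, cumulative):
    print(f"{label:<20} {count:>3} {rel:>8.2%} {cum:>8.2%}")
```

The last cumulative value should always be 100%, which is a quick check that no category was missed.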
You should now have this:
You can probably guess what is next: let's make it a bit more tidy and presentable. First, let's convert the decimals to percentages, since that is the way most people would expect to see the data. Highlight cells C2 through D5. Then select Format>Cells>Percentage:
Click OK.
Second, let's format the text to ensure that it is all the same (right now some text is italicized and may not be in the same font). Click on the cell in the upper left-hand corner; this will highlight the entire spreadsheet. Select Format>Cells>Font. Select your preferred font and size (we are particularly fond of Century Gothic). Finally, let's get rid of the border line between rows 4 and 5. Highlight the entire dataset again, and select Format>Cells>Border>None. Then go back to your table, highlight JUST the table, and select Format>Cells>Border:
If you want gridlines in your table (helpful when the table has many categories), click on Outline and Inside.
You can change the appearance of the line and the color of the lines.
If you created the gridlines, your table should look something like this:
For nascent users of Excel, we understand that this seems like a lot of work. To this mild protest, we have two points. First, most recipients of your analysis will ONLY see your tables and/or graphics (next section). So you need to spend as much time making your analysis look clean and professional as you do ensuring that it is mathematically and logically correct. Second, as you will see in the subsequent chapters, some of these executions, which appear awkward in Excel, are quite easy in other software applications. Going through the labor in Excel will make you better appreciate the higher level packages.
3.3 Using Excel 2003 for: Visualization/Organization of Univariate Data

In this section, we will provide the steps needed to create Histograms, Pie Charts and Bar Charts. The Stem and Leaf Plot and the Box Plot as outlined in Chapter 2, while important, are not easily executed in Excel. These visualization tools are, however, easily executed in the other software applications and will be addressed in subsequent chapters. To reproduce the Histogram in Chapter 2, we will follow most of the same process that was used to create the frequency table in the previous section. Starting with the Plant_Survey sheet open, go to the Bins that were developed in the previous exercise. Let's create five categories instead of just three:
Remember that when creating bins in Excel, we identify the TOP of each category, and the highest category does not need to be identified.
Now, just as before, select Tools>Data Analysis>Histogram:
Click OK.
Following the same process as was used to create the frequency table, identify the Input Range and the Bin Range, and ensure that the Labels box is checked. This time, also check the Cumulative Percentage and Chart Output options:
Selecting these two options will convert the frequency table into a histogram.
Now click OK.
You should see this:
You guessed it: a little untidy. Let's format a few things on our histogram.
First, as before, let's change the Bin names to what we really want: Less than 3, 3-6, 7-10, 11-14 and 15+. These changes can be made in the frequency table; the histogram will be automatically updated because the graphic is dynamically linked to the table. Second, double click on the right axis and go to the Scale tab. Indicate that the Maximum should be 1.0. Third, highlight the legend and delete it (it does not really communicate any meaningful information). Fourth, double click on the x-axis and format the font as needed (we prefer Century Gothic). Do the same for the other two axes. Finally, click on the area of the graphic and then right click. Select Chart Options and rename the labels as needed. Your final histogram should look something like this:
To reproduce the pie chart in Chapter 2, begin by bringing up the sheet, which contains the frequency chart that you created in the previous section:
This is the Chart Wizard button, which is used to create most graphics in Excel.
Click on the Chart Wizard button. The Chart Wizard will take you through four steps. The first step is to select the graphic:
After you have selected the Pie chart type, click on Next. The second step is to identify which data is to be charted. Assuming your frequency chart looks like the chart on the previous page, you will indicate cells A1 through B4. Although the actual data is in column B, we need to include the data labels from column A:
After the data has been correctly identified, click on Next. In the third step, we are including the appropriate labels for the Pie Chart:
After you have changed the Chart Title, go to the Data Labels tab and select Show Percent. Ultimately, what kind of label you select for your Pie Chart is personal preference, but typically Pie Charts are used to communicate the percentages of each category. After you have completed the third step, you can either click Finish or click on Next to identify WHERE in the workbook you want to place your Pie Chart.
Your completed Pie Chart should look like this:
Well done! As noted in Chapter 2, some recipients of your data may be colorblind. Although Excel typically does not place colors such as green, red and brown together, should you need to override the default colors provided in Excel (or include patterns to accommodate printing in black and white), simply click on the Pie. Then right click and select Format Data Series.
The following box will appear:
You can now go through each slice and change the color. If you need to change the solid colors to patterns, simply select the Fill Effects button.
Bar charts are created using a very similar process. We will create a bar chart of the same information.
Click on the Chart Wizard button. Select the Bar chart type:
Select Next. As before, identify the data range, which will include the category names:
Click Next.
In step 3, change the title, axis names and other formatting as needed:
Then click Finish to place the chart in the current worksheet:
3.4 Using Excel 2003 for: Visualization/Organization of Multivariate Data

In this section, we will provide the steps necessary to create Contingency Tables, Stacked Bar Charts, 100% Stacked Bar Charts and Scatter Plots in Excel. As stated in Chapter 2, Contingency Tables are one of the most common and useful methods of communicating the relationships between and among variables in a dataset. In Excel, the tool used to create these tables is particularly useful and very flexible (this is one of the few examples where Excel may rival or outperform the more sophisticated applications). To reproduce the Contingency Tables in Chapter 2, return to the Plant_Survey page of the WidgeOne.xls dataset. Select Data>PivotTable and PivotChart Report. You should see this:
This dialogue box is the first of three steps in a Wizard. In step one, simply click Next (we will use the default selections). In step two, select the entire dataset (including labels):
Now click Next. The third step requires a bit more thought. The first screen in step three looks like this:
Click on Layout.
You will see this:
We have four possible positions in which to place any of the variables on the right.
Here are all of the variables in our dataset.
Let's start by reproducing the first Contingency Table in Chapter 2, which provides the number of individuals by Plant and by Gender. To do this, drag the Plant variable to the Column position and then drag the Gender variable to the Row position. Then, drag the Gender variable a second time (notice that after you placed the Gender variable in the Row position, it was still listed on the right) into the Data position. For now, we will leave the Page position unpopulated.
Now click OK and then Finish.
This is referred to as an Excel pivot table. This is the easiest way to determine the MODE for qualitative variables in Excel (e.g., D is the MODE for Plant). If we wanted to convert the counts into percentages (which is typically more meaningful), we would click on the Pivot Table drop down box circled above. Select Field Settings, which will result in the following:
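The counting that a pivot table performs can be sketched in Python with the standard library's Counter. The list of 40 (gender, plant) records below is rebuilt from the cell counts shown in this chapter's pivot table (13 and 7 for F, 10 and 10 for M); the individual row order is our assumption, since only the counts matter here.

```python
from collections import Counter

# One (gender, plant) pair per employee, rebuilt from the pivot table counts.
records = (
    [("F", "D")] * 13 + [("F", "N")] * 7
    + [("M", "D")] * 10 + [("M", "N")] * 10
)

cell_counts = Counter(records)                          # Gender x Plant cells
plant_totals = Counter(plant for _, plant in records)   # column totals
gender_totals = Counter(gender for gender, _ in records)  # row totals

print("Gender  D   N   Total")
for gender in ("F", "M"):
    d, n = cell_counts[(gender, "D")], cell_counts[(gender, "N")]
    print(f"{gender:<7}{d:<4}{n:<4}{gender_totals[gender]}")
print(f"Total  {plant_totals['D']:<4}{plant_totals['N']:<4}{len(records)}")

# The mode of a qualitative variable is simply its most frequent category.
mode_plant = plant_totals.most_common(1)[0][0]
print("Mode for Plant:", mode_plant)  # D
```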
The Options page will bring up a dialogue box.
Select the Show Data As drop down menu and select % of row. Then Click OK.
This provides us with the breakdown of Plant by Gender. If we need to reverse this, and report the breakdown of Gender by Plant, we simply go back to Pivot Table>Field Settings>Show Data As>% of Columns. This set of executions will provide us with:
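The difference between the % of row and % of column settings can be illustrated with a short Python sketch. The counts are the ones from this chapter's Gender by Plant pivot table; the two function names are our own.

```python
# Rows are Gender (F, M); columns are Plant (D, N).
counts = {"F": {"D": 13, "N": 7}, "M": {"D": 10, "N": 10}}

def percent_of_row(table):
    """Divide every cell by its row total."""
    return {
        row: {col: 100 * v / sum(cols.values()) for col, v in cols.items()}
        for row, cols in table.items()
    }

def percent_of_column(table):
    """Divide every cell by its column total."""
    col_totals = {}
    for cols in table.values():
        for col, v in cols.items():
            col_totals[col] = col_totals.get(col, 0) + v
    return {
        row: {col: 100 * v / col_totals[col] for col, v in cols.items()}
        for row, cols in table.items()
    }

for name, result in (("% of row", percent_of_row(counts)),
                     ("% of column", percent_of_column(counts))):
    print(name)
    for row, cols in result.items():
        print("  ", row, {col: round(pct, 1) for col, pct in cols.items()})
```

With % of row, each gender's percentages sum to 100 across the plants; with % of column, each plant's percentages sum to 100 across the genders.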
If we want to incorporate an additional piece of information like the average job tenure by plant and by gender, we could do this by substituting the YRONJOB variable in the Data position. Select Pivot Table>Wizard>Layout. This series of executions will bring us back to the layout page:
Drag the Count of Gender back to the list on the right and then drag YRONJOB to the Data position. When the YRONJOB variable is in the Data position, the box will read Sum of YRONJOB. Let's change the sum to an average. Double click on the Sum of YRONJOB box and select Summarize by>Average.
Then select OK>OK>Finish.
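Summarizing by average instead of by sum amounts to grouping the rows by Plant and Gender and averaging YRONJOB within each group. A minimal Python sketch follows; the six records are hypothetical and are not taken from WidgeOne.xls.

```python
from collections import defaultdict
from statistics import mean

# (plant, gender, years on job): hypothetical rows for illustration only.
records = [
    ("D", "F", 8.0), ("D", "F", 10.0), ("D", "M", 4.0),
    ("N", "F", 6.0), ("N", "M", 12.0), ("N", "M", 3.0),
]

# Collect the tenure values for every Plant x Gender combination.
groups = defaultdict(list)
for plant, gender, years in records:
    groups[(plant, gender)].append(years)

# One average per cell of the pivot table.
averages = {key: mean(values) for key, values in groups.items()}

for (plant, gender), avg in sorted(averages.items()):
    print(f"Plant {plant}, Gender {gender}: {avg:.2f} years")
```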
As before, Excel's output is a little untidy. To format the decimals to be consistent, highlight the data in the table and then select Format>Cells>Number>OK. Now you should see this:
Much better! This table now provides information such as in the Dallas plant, women have an average of 8.85 years on the job. Now you can copy and paste this table into other documents or into another Excel sheet. As you can see, Pivot Tables are very useful and very flexible. However, because they are so flexible, they do require a bit of manipulation. Mastering Pivot Tables in Excel is a great differentiating skill, but will require practice (and patience). Stacked bar charts are easy to create and manipulate. To reproduce the stacked bar chart from the previous chapter, we will use the first Pivot Table created above that indicated the frequency counts by gender and by plant. Most graphics in Excel are created using the Chart Wizard as we did with the pie chart and the bar chart. Go to the table and copy these cells and paste them into another part of the spreadsheet:
Count of Gender    Plant
Gender             D        N        Grand Total
F                  13.00    7.00     20.00
M                  10.00    10.00    20.00
Grand Total        23.00    17.00    40.00

Gender             D        N        Grand Total
F                  13.00    7.00     20.00
M                  10.00    10.00    20.00
Grand Total        23.00    17.00    40.00
This will disengage the data from the Pivot Table.
Click on the Chart Wizard button and select the Bar chart type:
Select the second Chart sub-type option. This is the stacked bar chart. Select Next. Identify the data range which will NOT include the totals, but WILL include the labels. Then add titles.
You should see something like this:
Let's clean this up a bit. We need to change the labels from single letters to the actual names of the plants and the genders, and change the axis to have no decimals (when dealing with discrete data like people, it's best not to have any decimals). First, place your cursor on one of the blue bars and right click. Select Source Data and then Series:
Highlight each series and then change the name to Female and Male.
Click on OK. To change the decimals, double click on the axis and select Number and then change the number of decimals to 0. Finally, to change the N and the D to Norcross and Dallas, go back to the original data in the spreadsheet and replace the N with Norcross and the D with Dallas. See?! The data is integrated into the graphic! You should now see this:
Well Done! A 100% Stacked Bar Chart is executed in exactly the same way, only you would select the third bar chart sub-type.
The final visualization in this section is the scatter plot. Scatter plots are typically used to determine if there is a meaningful relationship between two variables. To reproduce the scatter plot in the previous section, let's return to the Plant_Survey sheet. To examine the relationship between Job Productivity and Job Tenure, we will plot these two variables in a scatter plot. It is important to note that in a scatter plot, we are NOT trying to establish any causation, only correlation.
We create scatter plots using the same process that was used for all of the other graphics: the Chart Wizard. Click on the Chart Wizard button and select the Chart type XY (Scatter):
Select Next. In the data range, select the variables PRDCTY and YRONJOB (including labels). In the Chart Options, create titles and axis labels as appropriate (Productivity is on the y-axis).
Now, you should have something that looks like this:
Again, pretty untidy. We will do three things to clean up the appearance of this graph: delete the legend, rescale the y-axis and take away the decimals. To delete the legend, simply click on it and then hit the Delete key on your keyboard. This not only deletes the legend, which was not meaningful, it also creates space. The y-axis needs to be rescaled because the data does not actually start until above 50, so there is a great deal of wasted space. To rescale the y-axis, double click on the y-axis, select Scale, type 50 in the box for Minimum and select OK. To resize the graphic, simply highlight the chart and drag one of the corners until the graphic is the desired size. Finally, to delete the decimals, double click on the x-axis. Select the Number tab. Set the number of decimals to 0 and click OK. Do the same thing for the y-axis.
Your scatter plot should look something like this:
As highlighted in Chapter 2, we can derive additional information from this graphic by adding a trendline to the data. To add a trendline, click on the dots in the graph and then right click. Select Add trendline.
Select the Linear trend and then click on the Options tab.
Ensure that the Display equation on chart and Display R-squared value on chart options are selected. Then click OK.
As explained in Chapter 2, this information now provides us with the best linear equation that fits the relationship between Productivity (y) and Job Tenure (x). The R-squared value of .1124 indicates that this is not a particularly strong relationship: Job Tenure only explains 11.24% of the variation in Productivity. These concepts form the basis of Regression Modeling. For a more detailed explanation of Regression Modeling, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.
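To make the trendline less of a black box, the sketch below computes a least-squares line and its R-squared value by hand in Python. It assumes the linear trendline is an ordinary least-squares fit, and the five data points are hypothetical stand-ins for YRONJOB and PRDCTY, so the result will not match the .1124 reported above.

```python
from statistics import mean

# Hypothetical (x, y) pairs standing in for YRONJOB and PRDCTY.
tenure = [1, 2, 3, 4, 5]
productivity = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(tenure), mean(productivity)

# Sums of squares and cross-products around the means.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(tenure, productivity))
s_xx = sum((x - x_bar) ** 2 for x in tenure)
s_yy = sum((y - y_bar) ** 2 for y in productivity)

slope = s_xy / s_xx                    # change in y per one unit of x
intercept = y_bar - slope * x_bar      # value of y when x is 0
r_squared = s_xy ** 2 / (s_xx * s_yy)  # share of variation in y explained by x

print(f"y = {slope:.4f}x + {intercept:.4f}, R-squared = {r_squared:.4f}")
```

For a simple linear fit with one x variable, R-squared is just the squared correlation between x and y, which is why it can be computed from the same three sums.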
3.5 Using Excel 2003 for: Random Number Generation and Simple Random Sampling
Our WidgeOne.xls dataset is fairly small, with only 40 observations. As a result, it would be unusual for us to want to extract a sample from such a small dataset. However, for the purposes of demonstrating random number generation in Excel, let's assume that we want to randomly select ten individuals with whom we want to conduct in-depth interviews. Let's begin by assigning random numbers to each individual. Go back to the Plant_Survey sheet and create a new column labeled RANDOM. Place your cursor in the first cell under the column label (row 2). Click on the formula button. Ensure that ALL is selected as the Function Category. Scroll down through the Function Names until you see RAND. Select RAND and click OK. This will generate the following:
There are three pieces of information you need to understand from this box:
1. The function takes no arguments, which means that we do not need to provide any information;
2. The function will return an evenly distributed (uniform distribution) random number between 0 and 1;
3. The function is volatile, which means that the value returned will change EVERY time the spreadsheet is manipulated.
Click OK. You should see some number between 0 and 1 in this cell (your result will be different each time since the random number is generated using your computer's internal clock). Remember that Excel reads this cell as =RAND(), not as the number that you see. Now copy the formula in this cell down to the bottom of the dataset. Did you notice that your original number in row 2 changed? This is because it is volatile. Sometimes we need to have volatile arguments in (not with) Excel. Most of the time we do not. To convert the numbers you see from volatile to stable (unchanging), highlight the entire column, then select Edit>Copy>Edit>Paste Special>Values. Now, you should have a column of unchanging random numbers (your numbers will be different from ours):
Now, sort the entire dataset on the random numbers just created. Select DATA>SORT>RANDOM (since the numbers are completely random, it does not matter if you select Ascending or Descending).
Click OK. Then, select the first 10 individuals for the interviews. This is a fairly simple, but very useful process.
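The same three steps (assign a random number, sort on it, keep the first ten rows) can be sketched in Python. The employee IDs 1 through 40 and the seed value are assumptions made for illustration; fixing the seed plays roughly the same role as Paste Special>Values, since it stops the numbers from changing on every run.

```python
import random

rng = random.Random(2003)          # fixed seed (an arbitrary choice) so the draw is repeatable
employee_ids = list(range(1, 41))  # hypothetical IDs for the 40 observations

# Step 1: attach a uniform random number in [0, 1) to each row, as RAND() does.
keyed = [(rng.random(), emp) for emp in employee_ids]

# Step 2: sort on the random column (ascending or descending makes no difference).
keyed.sort()

# Step 3: keep the first 10 rows for the interviews.
interviewees = [emp for _, emp in keyed[:10]]
print(interviewees)
```

Because every ordering of the random keys is equally likely, the first ten rows form a simple random sample drawn without replacement.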
3.6 Using Excel 2003 for: Confidence Intervals

The penultimate section in this chapter will aid in the calculation of Confidence Intervals, one of the most commonly used techniques in Inferential Statistics.
From Chapter 2: Let's assume that the WidgeOne.xls dataset is a representative sample of a larger manufacturing firm with hundreds of employees in Norcross, GA and Dallas, TX (if we had access to the entire organization's data, we would not calculate confidence intervals of any population parameter; we would report the descriptive statistics). Let's also assume that the HR department at WidgeOne has been charged with understanding the level of job satisfaction among employees. For cost reasons, they were unable to survey the entire organization, so they surveyed the 40 employees in our dataset. Report the job satisfaction for all WidgeOne employees, using the sample of 40. Use a 95% level of confidence.
Go back to the Plant_Survey sheet. We previously calculated the mean job satisfaction to be 6.85 and the standard deviation to be 1.02. Using this information, we can use Excel to compute the confidence interval. To execute this computation, go into a blank portion of the spreadsheet and click on the function button. Ensure that the Statistical function category is selected and then scroll through the function names until you get to the CONFIDENCE function. Click OK.
After you click OK, you will see this dialogue box:
Alpha is 1 minus the Confidence Level
The STD (standard deviation) would have been previously calculated
Size is the sample size (in this case 40)
If we are computing a 95% confidence interval, we would enter .05 for the alpha value (you can think of alpha as the probability you are willing to accept of being wrong). The standard deviation, which was computed previously for job satisfaction, was 1.02 (see footnote 6). The (sample) size is 40. Once this information is entered, the resulting computation should be .32. This is the margin of error for job satisfaction at a confidence level of 95%. You would then add and subtract this to/from the mean (6.85) to create the full interval. The full interval would then be reported as: "Based on a representative sample of 40 employees, we are 95% confident that job satisfaction among all employees is estimated to be between 6.53 and 7.17."
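As a cross-check on the dialogue box, here is a minimal Python sketch of the same calculation. It assumes, as we understand Excel's CONFIDENCE function to do, that the margin of error is based on the normal distribution: the z value for alpha/2 times the standard deviation divided by the square root of the sample size. The function name confidence is our own, chosen to mirror the Excel function.

```python
from math import sqrt
from statistics import NormalDist

def confidence(alpha, standard_dev, size):
    """Normal-based margin of error: z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha is .05
    return z * standard_dev / sqrt(size)

mean_jobsat = 6.85                   # sample mean of job satisfaction
margin = confidence(0.05, 1.02, 40)  # alpha, standard deviation, sample size
lower, upper = mean_jobsat - margin, mean_jobsat + margin

print(round(margin, 2), round(lower, 2), round(upper, 2))  # 0.32 6.53 7.17
```

Written out, the interval is the mean plus or minus the margin, which is why the lower bound is reported first.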
Footnote 6: An important note in spreadsheet development: you could enter 1.02 in this box or enter the cell reference J47. You would generate the same answer. However, you are almost ALWAYS better off entering the cell reference rather than hard-coding a number. This makes the formula more portable.
Excel 2003 Lagniappe

We could spend hundreds of pages providing Excel Lagniappe. Don't worry, we won't. As you may have already discerned, Excel is an incredibly underutilized data analysis package. Most Excel users don't use the majority of the functionality that is available. As you continue to further develop your Excel skills, you will find more and more functionality not explored here. In terms of extra interesting Excel shortcuts, we would like to share three:
1. Transposing Columns: Occasionally you may find that you need to have the columns in a dataset rearranged. For example, from the WidgeOne.xls dataset, you may want to create a scatter plot of Productivity (PRDCTY) and Social Relations score (SOCREL). Recall that when we selected the Chart Wizard, and selected the data range, we had to indicate a single range. What if the variables are not next to each other? To move PRDCTY next to SOCREL, first highlight the PRDCTY column by clicking on the column letter. Then press Ctrl and x. Then highlight the SOCREL column by clicking on the column letter. Now hold Ctrl and Shift down together, then press +. Cool, huh?
2. Highlighting data: In the WidgeOne.xls dataset, we only have 40 observations. So, highlighting all of the observations is not particularly difficult. But what if we had 4,000 or 400,000 observations? Highlighting all of the rows might take some time. To highlight an entire column (or columns) of data, position your cursor at the top of the dataset. Press down the Ctrl button. With the Ctrl button pressed, hit the End key and then the down arrow key. This will highlight all of the data to the end.
3. Autofill: Often we need to create a pattern in Excel. Examples might include days of the week, months of the year, a consecutive series of numbers or a geometric series (2, 4, 8, 16). If you can provide Excel with the first few entries in a common pattern, it will then autofill the remaining entries.
For example, go into a clean Excel spreadsheet. Type Jan in cell A1 and then Feb in cell B1. Now, highlight these two cells. Place your cursor on the little square handle on the bottom right corner:
Drag this handle 10 spaces to the right. You now have the months of the year!
Chapter 4 Microsoft's Excel 2007

What is Different in Excel 2007?
Here is the official Microsoft site that provides information regarding the differences between 2003 and 2007: http://office.microsoft.com/en-us/Excel/HA100738731033.aspx
Much of the functionality is similar, but the layout is very different (it is more logical once you get the hang of it). And the graphics are much improved.
Here is a screen shot of a blank Excel 2007 page:
You can move easily through much of the functionality in Excel 2007 by clicking on the tab headers at the top. Most of the time, you will be on the Home tab.
At this point, we need to access the WidgeOne.xls dataset in Excel 2007. To access the dataset, click on the Microsoft Office button at the top left of the sheet and select Open. You should see the following screen:
Once you have opened the WidgeOne.xls file in Excel, you should see this:
4.1 Using Excel 2007 for: Measurements of Central Tendency

The three measurements of central tendency can be executed in Excel 2007 using pre-programmed formulas and the fx button as before. Prior to executing the Mean, Median and Mode, let's insert an additional column on the left hand side. To do this, first place your cursor on the A in the first column and click, so that the entire column is highlighted. Now, under the Home tab, in the Cells tools, select Insert:
At this point, the entire dataset should have shifted to the right, and the new column A is blank. Now, go to the bottom of the dataset to cell A43. In cells A43, A44 and A45, type Mean, Median and Mode, respectively:
Not all variables will lend themselves to these calculations. Remember that we only execute mean and median calculations on quantitative variables. So, it would be helpful if we could see the column headers to remind us what is in each column. This can be done using a split screen. To do this, click on the View tab and then within the Window tools, select Freeze Panes>Freeze Top Row:
For which columns should we report the measurements of central tendency? The quantitative variables include JOBGRADE, SOCREL (social relations score; see footnote 7), YRONJOB (number of years on the job), PRDCTY (Productivity) and JOBSAT (job satisfaction). The calculation of the mode for the qualitative variables (PLANT, GENDER and POSITION) will be addressed below. Move your cursor to position F43. This is where we will place the mean for the JOBGRADE variable. With your cursor in this cell, click on the fx button. From the dialogue box, select Statistical. From the list of function names, click on the second entry, AVERAGE. You will see this:
Footnote 7: Psychology, Sociology and Marketing majors will recognize that this is Likert Data. For the purposes of this manual, Likert Data will be treated as quantitative. However, it should be noted that pure mathematicians treat Likert Data as qualitative.
Once you select AVERAGE, the next dialogue box will request an array of numbers. Excel is pretty clever. You may already have the array populated in the first field (Number 1). For the JOBGRADE variable, this will be cell F2 through cell F41. If it is not already populated for you, simply click on the little spreadsheet button and highlight the cells F2 through F41. Note that cell F42 is empty. If it is included, it will be ignored. However, if there was a 0 in cell F42, it would be included, and a different mean would be calculated. It is always best to only include the relevant cells in your calculations.
After you have selected cells F2 through F41 as the array for the mean calculation, click OK. You should now see 6.6. Now, let's copy this function across to column J. With your cursor in cell F43, go to the Home tab and select Copy from the Clipboard tools. Highlight cells G43 through J43. Then select Paste from the Clipboard tools.
To populate the Median cells, we will use the same process. Place your cursor in cell F44 and click on the function button. From the Statistical functions, select MEDIAN and select the same array, F2:F41. Click OK. Copy and paste the function in cell F44 across to cell J44. Although it is not typically used as the best measurement of central tendency of quantitative data, you can provide the mode for these variables using the same process: MODE is a function listed in the Statistical category of functions.
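The point about empty cells versus zeros is easy to test for yourself. The sketch below uses Python's statistics module on a short, hypothetical column of job grades (not the actual JOBGRADE values), where None stands in for an empty cell.

```python
from statistics import mean, median, mode

# Hypothetical job grades; None marks an empty cell such as F42.
column = [5, 6, 6, 7, 8, 6, 9, None]

# Empty cells are skipped, which mirrors how AVERAGE ignores them.
values = [v for v in column if v is not None]
print(mean(values), median(values), mode(values))

# A 0 is a real value, so including it produces a different (lower) mean.
with_zero = values + [0]
print(mean(with_zero))
```

The median and mode here are both 6, while the mean drops once the 0 is counted, which is exactly why it pays to select only the relevant cells.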
Your screen should now look like this:
4.2 Using Excel 2007 for: Measurements of Dispersion

Recall from Chapter 2 that the most commonly used measurements to describe the dispersion of a variable include the standard deviation and the frequency table. The standard deviation will be calculated in Excel using the function button. Returning to the WidgeOne.xls dataset, enter a label for the Standard Deviation below the measurements of central tendency.
You probably noticed that the words Standard Deviation do not fit neatly into cell A47; they spilled over into B47 and C47. Remember that what you see in Excel is not necessarily what Excel sees. In reality, cells B47 and C47 are still empty from Excel's perspective. But this looks a little untidy. There are several ways to tidy this. We can expand column A until the words are visually contained within the column. This is accomplished by aligning the cursor between the A and the B at the top of the spreadsheet until the cursor changes to a double-headed arrow, and then double clicking. Column A will widen enough to accommodate the longest string of characters in the column (in this case Standard Deviation). A second method of accommodating the text is by wrapping the text into the cell. This is accomplished by going to the Home tab, and the Alignment tools, and selecting Wrap Text. After the text has been wrapped, you can then slightly widen the columns or narrow the rows (using the same process as for the columns), as needed.
Once the label has been established, select the function button, then within the Statistical category, select the STDEV option and the same range as before (F2:F41) and click OK. You should see 1.54919334. This is the standard deviation of the JOBGRADE variable. As before, copy this formula across to column J.
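It helps to know which standard deviation you are getting. STDEV is the sample version (it divides by n - 1), while Excel's STDEVP is the population version (it divides by n). A minimal Python sketch of the difference, using hypothetical values rather than the JOBGRADE column:

```python
from statistics import pstdev, stdev

# Hypothetical values, not the JOBGRADE column.
values = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = stdev(values)       # divides by n - 1, like STDEV
population_sd = pstdev(values)  # divides by n, like STDEVP

print(round(sample_sd, 4), population_sd)
```

The sample version is always a little larger, and the gap shrinks as the number of observations grows.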
We now have the basic descriptive statistics for the quantitative variables. You may notice that some of the values have no decimal places, some have one decimal place and some have many decimal places. We think this looks a little untidy (as Statisticians, we like things to be tidy). To make this spreadsheet look a bit more professional, let's format all of the data to have a consistent number of decimal places. To do this, click on the cell in the far upper left corner: the cell to the LEFT of the A column and ABOVE the first row. This will highlight the entire spreadsheet. Then from the Home tab, select the comma button from the Number tools. This will make all of the numbers in the spreadsheet have two decimal places.
Now, your very tidy spreadsheet should look like this:
Note that if you needed to add or subtract decimal places, you could easily do so by selecting the cells of interest and then clicking on the Increase Decimal or Decrease Decimal button, as circled above. In practice, if you need to provide multiple descriptive statistics on a variable, this is not the process that you would go through. For multiple descriptive statistics, you would go to the Data tab and from the Analysis Tools, select Data Analysis (see footnote 8). This path will bring up the following:
Select the Descriptive Statistics option.
Footnote 8: In the event that you do not see the Analysis Tools under the Data tab, click on the MS Office Button in the upper left corner. Select Excel Options at the bottom and then Add-Ins and then GO. Ensure that the Analysis ToolPak is ticked and click on OK. It should be there now. Note that if you have an unauthorized copy of Excel, you probably won't have access to this very important functionality.
You will then see the following dialogue box:
Highlight the quantitative variable(s) of interest (F1:J41)
Identify that you have labels in the first row
Identify that you want to produce summary statistics
Now click OK.
Again, pretty untidy. Format the spreadsheet to have two decimal points for all values and expand the columns to accommodate the labels. Your tidy version should look like this:
Notice that we reproduced all of the measurements from before, and several more (see footnote 9). This is a more efficient way to produce the descriptive statistics of one or more variables.
In Chapter 2, we presented the concept of a frequency table as another method of displaying the spread of a dataset. As discussed, frequency tables are one of the most commonly used methods to display data, so understanding how to create a frequency table from a quantitative variable is a critical skill. The table in Chapter 2 was created in Excel. We will reproduce it here.
The first step in creating a frequency table from a quantitative variable is to determine the categories that need to be developed for the quantitative variable (this process will effectively transform a quantitative ratio-scale variable into a qualitative ordinal variable). Previously, we determined that the job tenure variable (YRONJOB) should be categorized into three levels: less than 5 years, 5-10 years and more than 10 years. Recall that the categories must be mutually exclusive and collectively exhaustive. To accommodate these categories in Excel, we will create bins, where the TOP of each category identifies each bin.
Footnote 9: For detailed information on the additional statistics produced, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.
In our WidgeOne.xls dataset, let's create a bin range for YRONJOB in column L:
These are the bins for the Histogram for Job Tenure. Category 1 is 0-4.99, Category 2 is 5-10.00 and Category 3 (which does not need to be entered) is everything above 10.00.
Once these bins have been created, select the Data Tab, and then from the Analysis Tools, select Data Analysis and then Histogram:
Click OK. This will bring up a dialogue box, asking for information regarding the quantitative variable to be analyzed, and the associated Bin Range:
Highlight the range of the YRONJOB variable (including the label)
Highlight the Bin Range (including the label)
Ensure that the Labels option is checked
Now you should see this:
Again, a little untidy, but this is the base of what we need for the frequency table. Let's clean this up and add some columns to reproduce the table from Chapter 2. First, replace the bin titles with the real category labels of Less than 5 years, 5-10 years and More than 10 years. Second, expand the columns as needed. Third, total the bottom of the frequency column using the SUM option: in cell B5, type =SUM(B2:B4) (the SUM function can be found in the Math & Trig category of functions). Next, create two additional column headers: Relative Frequency and Cumulative Frequency. Your sheet should look like this:
The Relative Frequency column will display the percentage of observations in each category. This is an important piece of information, particularly when comparing populations of different sizes. It is calculated by simply taking each frequency and dividing it by the total. For example, in cell C2, we would type =B2/B5. This would result in .2750 (11/40). Rather than typing this same formula again and again to capture the relative frequencies of the next two categories, we would like to copy this formula into cells C3 and C4. Do this now. Did you get #DIV/0!? The problem is that when the formula =B2/B5 is copied down one cell, it becomes =B3/B6. There is nothing in cell B6, so Excel treats it as 0. Since any number divided by 0 is undefined, we receive this error message. If we want to copy the formula into the cells below, we need to nail down the reference to the Total cell and prevent the reference from changing. To do this, we place a $ in front of the B and another $ in front of the 5, giving $B$5 instead of B5. This can also be accomplished by placing the cursor in between the B and the 5 and hitting the F4 button on your computer. Once you have nailed down the Total cell as a reference cell, you can copy the formula into cells C3 and C4.
The Cumulative Frequency column will display the cumulative percentage of observations from 0 to the top of the category in question. This is accomplished by adding the relative frequency of a category to all of the relative frequencies before it. In Excel, we would type =C2 in cell D2, since the first entry in the Cumulative column will always equal the first entry in the Relative Frequency column. In cell D3, we would enter =D2+C3. This will add the cumulative value (D2) plus the Relative Frequency for the category (C3). And we can now copy this formula into cell D4.
Clearly, this is a lot of manual work in Excel for a relatively small table. However, our focus is on helping to build the Excel skills necessary to execute this kind of analysis for any size table or dataset.
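The whole table (bins, frequencies, relative frequencies and cumulative frequencies) can be sketched in a few lines of Python, which may help clarify what each spreadsheet formula is doing. The tenure values below are hypothetical, so the counts will not match the 11 of 40 above; the bin tops follow the same "top of each category" convention, with a value landing in the first bin whose top it does not exceed.

```python
from bisect import bisect_left
from itertools import accumulate

# Hypothetical years-on-job values, not the WidgeOne data.
tenure = [1, 3, 4.5, 5, 7, 10, 10.5, 12, 2, 8]
bin_tops = [4.99, 10]  # top of each bin; the last category is everything above 10
labels = ["Less than 5 years", "5-10 years", "More than 10 years"]

# Count each value in the first bin whose top it does not exceed.
frequency = [0] * len(labels)
for years in tenure:
    frequency[bisect_left(bin_tops, years)] += 1

total = sum(frequency)                     # plays the role of $B$5
relative = [f / total for f in frequency]  # =B2/$B$5 copied down
cumulative = list(accumulate(relative))    # =C2, then =D2+C3 copied down

for row in zip(labels, frequency, relative, cumulative):
    print("{:<20}{:>4}{:>8.1%}{:>8.1%}".format(*row))
```

Note that total is computed once and reused for every row, which is the same idea as anchoring the reference with $B$5.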
You can probably guess what is next: let's make it a bit more tidy and presentable. First, let's convert the decimals to percentages, since that is the way most people would expect to see the data. Highlight cells C2 through D5. Then go to the Home tab and click on the % sign in the Number tools. This should have converted all of the numbers to percentages with no decimals. If you would like to see decimals, you can increase the decimals by selecting the Increase Decimal button in the Number tools on the Home tab. Second, let's format the text to ensure that it is all the same (right now some text is italicized and may not be the same font). Highlight the entire table of data (cells A1:D5). From the Font tools, select a common font (we prefer Century Gothic). Also, you can take off the italics by clicking on the I in the Font tools (you may have to click it twice). Finally, if you want to standardize the appearance of the gridlines, from the Font tools, select the Border Box. From the pull down menu, identify that you want no borders. Then, go back and identify that you want a Thick Box Border.
At this point, your table should look something like this:
For nascent users of Excel, we understand that this seems like a lot of work. To this mild protest, we have two points. First, most recipients of your analysis will ONLY see your tables and/or graphics (next section). So you need to spend as much time making your analysis look clean and professional as you do ensuring that it is mathematically and logically correct. Second, as you will see in the subsequent chapters, some of these executions, which appear awkward in Excel, are quite easy in other software applications.
4.3 Using Excel 2007 for: Visualization/Organization of Univariate Data

In this section, we will provide the steps needed to create Histograms, Pie Charts and Bar Charts. The Stem and Leaf Plot and the Box Plot as outlined in Chapter 2, while important, are not easily executed in Excel. These visualization tools are, however, easily executed in the other software applications and will be addressed in subsequent chapters.
To reproduce the histogram from Chapter 2, we will follow most of the same process that was used to create the frequency table in the previous section. Starting with the Plant_Survey sheet open, go to the Bins that were developed from the previous exercise. Let's create five categories instead of just three:
Remember that when creating bins in Excel, we identify the TOP of each category, and the highest category does not need to be identified.
Once these bins have been created, select the Data Tab, and then from the Analysis Tools, select Data Analysis and then Histogram:
Click OK.
Following the same process as was used to create the frequency table, identify the Input Range and the Bin Range, and ensure that the Labels box is checked. This time, also check the Cumulative Percentage and Chart Output options:
YOU MUST IDENTIFY THAT YOU WANT THE OUTPUT IN A NEW WORKBOOK. WHY? WE DON'T KNOW. ASK THE BRAIN TRUST AT MICROSOFT.
Selecting these two options will convert the frequency table into a histogram.
Now click OK.
You guessed it: a little untidy. Let's format a few things on our histogram. First, as before, let's change the Bin names to what we really want: Less than 3, 3-6, 7-10, 11-14 and 15+. These changes can be made in the frequency table; the histogram will be automatically updated because the graphic is dynamically linked to the table. Second, highlight the legend and delete it (it does not really communicate any meaningful information). Third, double click on the x-axis and format the font as needed (we prefer Century Gothic). Do the same for the other two axes. Finally, click on the area of the graphic and then right click. Select Chart Options and rename the labels as needed. Your final histogram should look something like this:
[Figure: histogram titled "Job Tenure of WidgeOne Employees". Y-axis: Frequency (0 to 14). X-axis: Years on the Job, with categories Less than 3, 3-6 Years, 7-10 Years, 10-14 Years and 15+ Years.]
Well Done! To reproduce the pie chart in Chapter 2, begin by bringing up the sheet that contains the frequency table you created in the previous section. Go to the Insert tab:
Highlight cells A1:B4. Do not include the total. Then, select Pie Chart from the Chart tools:
Select the first option, the basic 2D chart. You should now see this:
Now, the primary issue with this pie chart is that we have no information regarding the percentages that comprise each slice, which is the whole reason to use a pie chart. To insert percentage values, go to the Layout tab and from the Labels tools, select Data Labels and then go to the bottom of the drop down and select More Label Options. You should see this:
Deselect value and select percentage.
After you have identified that you want a percentage value, click on Close. You should now see this:
You can click on the Frequency title and change it to something more meaningful like Years on Job. Well done! As noted in Chapter 2, some recipients of your data may be colorblind. Although Excel typically does not place colors such as green, red and brown together, should you need to override the default colors provided in Excel (or include patterns to accommodate printing in black and white), simply go to the Design tab and make an alternative selection.
Bar charts are created using a very similar process. We will create a bar chart of the same information (as noted earlier, bar charts are not histograms on their sides). To reproduce the bar chart in Chapter 2, begin by bringing up the sheet that contains the frequency table you used to create the pie chart. Go to the Insert tab. Highlight the same data as before, A1:B4 (be sure not to include the totals). From the Charts tools, select Bar and then select the first option.
You should see the following chart:
Where Pie Charts are used to explain relative proportions (percentages), Bar Charts are used to communicate counts. So, the units in this chart are fine. You may want to double click on the Frequency title and give it a more meaningful name. Also, you may want to delete the frequency legend, since it does not communicate any meaningful information.
4.4 Using Excel 2007 for: Visualization/Organization of Multivariate Data

In this section, we will provide the steps necessary to create Contingency Tables, Stacked Bar Charts, 100% Stacked Bar Charts and Scatterplots in Excel 2007. As stated in Chapter 2, Contingency Tables are one of the most common and useful methods of communicating the relationships between and among variables in a dataset. In Excel 2007, the tool used to create these tables is particularly useful and very flexible (this is one of the few examples where Excel may rival or outperform the more sophisticated applications). To reproduce the Contingency Tables in Chapter 2, return to the Plant_Survey page of the WidgeOne.xls dataset. From the Insert tab, select Pivot Table from the Tables tools. You should see this:
You can leave everything as its default, but for the Table/Range box, click on the little spreadsheet button as circled and highlight the entire dataset. It is particularly important to make sure that you include the titles from the first row. Select OK. At this point you should see something that looks like this:
In the event that your sheet does not look like this, do not let your heart be troubled. We can fix this. If your pivot table template DOES NOT look like this, place your cursor inside the table and right click. Go to Pivot Table Options. Go to the Display tab, tick the box for Classic Pivot Table Layout and click OK. You should now have the screen as shown above. You can think of this as an empty template with rows and columns, waiting to be filled with data. Let's begin by placing the Plant variable in the column and the Gender variable in the row:
You can click and drag the variables into the right position either in the listing or in the table itself.
Now, if we are simply trying to ascertain counts for a basic contingency table, should we place the Plant or the Gender variable in the center of the table? The answer is that it does not matter; the counts will be the same. I placed the Gender variable in the center and generated the following table:
Try placing the Plant variable in that position; you should generate the same numbers. Cool. From this table it is easy to see that the Mode of the Plant variable is Dallas and that there is no mode for the Gender variable, since we have the same number of Males and Females. As we did before, let's look at this data in a few different ways. First, change the data to be a percentage of row. This can be done by clicking on the Count of Gender entry as circled above. Select Value Field Settings. You should see the following:
Click on the Show values as tab and select Show values as % of row as indicated below:
Select OK.
As we discussed before, this now tells us that, of all of the females, 65% work in Dallas. You could change this display to be the percent of columns or the percent of totals; they all communicate subtly different messages. If we want to incorporate an additional piece of information, like the average job tenure by plant and by gender, we could do this by substituting the YRONJOB variable in the Data position. Do this by dragging the Gender (or Plant) variable out of the Values box and placing the YRONJOB variable in the same place:
The problem with our table at this point is that we really wanted the average Years on Job, but this is the summation of the total years on job for each intersection. To change the summation to an average, click on the Sum of YRONJOB in the Values box, and select Value Field Settings. From the Summarize By tab, change the default from Sum to Average and click OK. You should now have the following screen:
As before, this is a little untidy. You can format the cells to show two decimal places consistently. Much better! This table now provides information such as: in the Dallas plant, women have an average of 8.85 years on the job. Now you can copy and paste this table into other documents or into another Excel sheet. As you can see, Pivot Tables are very useful and very flexible. However, because they are so flexible, they do require a bit of manipulation. Mastering Pivot Tables in Excel is a great differentiating skill, but it will require practice (and patience).

Stacked bar charts are easy to create and manipulate. To reproduce the stacked bar chart from Chapter 2, we will use the first Pivot Table created above, which indicated the frequency counts by gender and by plant. Go to the Pivot Table, convert the data back to counts, copy these cells and paste them into another part of the spreadsheet:
Count of Gender   Plant
Gender            D      N      Grand Total
F                 13     7      20
M                 10     10     20
Grand Total       23     17     40

Note: DO NOT COPY THE PORTION OF THE PIVOT TABLE WITH THE DROP DOWN ARROWS.
This will disengage the data from the Pivot Table. You will see why this is helpful soon. Now, highlight all of the data EXCEPT for the totals. With the data highlighted, go to the Insert tab and the Bar option in the Charts tools. Select the second option. You should see this:
[Figure: stacked bar chart of counts for plants D and N, stacked by gender (F, M)]
This chart is fine, but a little untidy. Because the chart is dynamically linked to the table, you can update the N to read Norcross and the D to read Dallas, and do the same for the genders. We should also apply a title. Go to the Layout tab and the Labels tools and select Chart Title. Add the title.
You should now have something like this:
[Figure: stacked bar chart titled "Men and Women at the Two Plants"; bars for Norcross and Dallas, stacked by Female and Male]
Well Done! Whenever you have different population sizes, as is the case with the Dallas and Norcross Plants, it is sometimes helpful to scale both populations to 100% to compare the two more easily. This is the purpose of a 100% Stacked Bar Chart. To execute this chart, you start the same way: highlight the data, go to the Insert tab and click on the Bar option, but this time select the third option (all the bars are the same length).
[Figure: 100% stacked bar chart; bars for Norcross and Dallas, split by Female and Male, on an axis from 0% to 100%]
You can think of this visualization as side by side pie charts. This graphic communicates the proportion of males and females within each plant. It is easy to see from this graphic that there are proportionately fewer women in Norcross than in Dallas.
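The scaling behind a 100% Stacked Bar Chart can be sketched outside of Excel. The following is a minimal Python illustration, not part of the Excel workflow; it uses the counts from the Pivot Table above, and the function name within_group_shares is our own invention for this sketch.

```python
# Counts taken from the Pivot Table above: plant -> gender -> count.
counts = {
    "Dallas": {"Female": 13, "Male": 10},
    "Norcross": {"Female": 7, "Male": 10},
}


def within_group_shares(table):
    """Rescale each group's counts so that they sum to 100 percent."""
    shares = {}
    for group, cells in table.items():
        total = sum(cells.values())
        shares[group] = {key: 100.0 * value / total for key, value in cells.items()}
    return shares


shares = within_group_shares(counts)
for plant, cells in shares.items():
    # Print each plant's gender split, rounded to one decimal place.
    print(plant, {key: round(value, 1) for key, value in cells.items()})
```

Each bar in the 100% chart corresponds to one inner dictionary here: the Female share is smaller for Norcross than for Dallas, which is the comparison the graphic makes visible.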
The final visualization in this section is the scatter plot. Scatter plots are typically used to determine if there is a meaningful relationship between two variables. To reproduce the scatter plot in the previous section, let's return to the Plant_Survey sheet. To examine the relationship between Job Productivity and Job Tenure, we will plot these two variables in a scatter plot. It is important to note that in a scatter plot, we are NOT trying to establish any causation, only correlation.

First, it is helpful to have the two variables next to each other, which they are not. To move the PRDCTY variable next to the YRONJOB variable, click on the G at the top of the column where the PRDCTY variable is located. This will select the entire column. Now, press the Ctrl key and the X key together. There should be chasing lights around the variable column. Click on the I at the top of the YRONJOB variable, so that column is highlighted. Now press Ctrl/Shift/+ at the same time. Cool. The variables should now be next to each other (this is actually a Lagniappe).

Once the variables of interest are side-by-side, highlight both. Go to the Insert tab and select the Scatterplot option from the Chart tools. Delete the legend. You should see something like this:
[Figure: default scatter plot titled PRDCTY; vertical axis from 0 to 120.00, horizontal axis from 0 to 20.00]
Again, pretty untidy. We will do three things to clean up the appearance of this graph: rescale the y-axis, add titles and take away the decimals. The y-axis needs to be rescaled because the data does not actually start until above 50; there is a great deal of wasted space. To rescale the y-axis, select the Layout tab. From the Axes tools, select Axes. Select Primary Vertical Axis>Primary Vertical Axis More Options. You should see this:
Change the Minimum Default from 0 to 60.
Click on Close. Now, recall that graphics are typically dynamically linked to data in Excel. So, if we change the data, we change the graphic. In the Plant_Survey sheet, highlight both variables and decrease the decimals. This will change the appearance of the scatterplot as well. Finally, to add titles to the axes, select the Layout tab. Choose the Axis Title option from the Labels tools. Begin by assigning a Primary Vertical Axis title, and then a Primary Horizontal Axis title. Name them appropriately. Then, rename your chart title. Your scatter plot should look something like this:
[Figure: scatter plot titled "Productivity versus Years on Job"; vertical axis Productivity (60 to 100), horizontal axis Years on Job]
As highlighted in Chapter 2, we can derive additional information from this graphic by adding a trendline to the data. To add a trendline, click on the dots in the graph and then right click. Select Add trendline.
Identify that the trend is linear.
Identify that you want the Equation and the R-squared values on the chart.
Select Close.
[Figure: scatter plot titled "Productivity versus Years on Job" with a linear trendline; vertical axis Productivity (60 to 100), horizontal axis Years on Job; displayed equation y = -0.571x + 89.31, R-squared = 0.112]
As explained in Chapter 2, this information now provides us with the best linear equation to fit the relationship between Productivity (y) and Job Tenure (x). The R-squared value of .1124 indicates that this is not a particularly strong relationship: Job Tenure explains only 11.24% of the variation in Productivity. These concepts form the basis of Regression Modeling. For a more detailed explanation of Regression Modeling, we recommend Statistical Methods and Data Analysis by Ott and Longnecker.
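For readers who want to see what a linear trendline computes, here is a minimal Python sketch of an ordinary least-squares fit. It is an illustration only, not Excel's internal code. The tenure and productivity lists are hypothetical values made up for the example (they are NOT the WidgeOne data), and the names fit_line and predict are our own. The default coefficients in predict are the ones displayed on the chart.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line through (x, y) pairs.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def predict(years_on_job, slope=-0.571, intercept=89.31):
    """Apply the trendline equation shown on the chart."""
    return slope * years_on_job + intercept


# Hypothetical data for illustration only.
tenure = [1.0, 4.0, 8.0, 12.0, 16.0]
productivity = [90.0, 84.0, 88.0, 80.0, 82.0]

slope, intercept, r_squared = fit_line(tenure, productivity)
print(round(slope, 3), round(intercept, 2), round(r_squared, 3))
print(round(predict(10), 2))  # trendline estimate at 10 years on the job
```

An R-squared close to 0 means the line explains little of the variation in y; a value close to 1 means the points lie nearly on the line.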
4.5 Using Excel 2007 for: Random Number Generation and Simple Random Sampling
Our WidgeOne.xls dataset is fairly small, with only 40 observations. As a result, it would be unusual to want to extract a sample from such a small dataset. However, for the purposes of demonstrating random number generation in Excel, let's assume that we want to randomly select ten individuals with whom we want to conduct in-depth interviews. Let's begin by assigning random numbers to each individual. Go back to the Plant_Survey sheet and create a new column labeled RANDOM. Place your cursor in the first cell under the column label (row 2). Click on the formula button. Ensure that ALL is selected as the Function Category. Scroll down through the Function Names until you see RAND. Select RAND and click OK. This will generate the following:
There are three pieces of information you need to understand from this box:
1. The function takes no arguments, which means that we do not need to provide any information;
2. The function will return an evenly distributed (uniform distribution) random number between 0 and 1;
3. The function is volatile, which means that the value returned will change EVERY time the spreadsheet is manipulated.
Click OK. You should see some number between 0 and 1 in this cell (your result will be different each time since the random number is generated using your computer's internal clock). Remember that Excel reads this cell as =RAND(), not as the number that you see. Now copy the formula in this cell down to the bottom of the dataset. Did you notice that your original number in row 2 changed? This is because it is volatile. Sometimes we need to have volatile arguments in (not with) Excel. Most of the time we do not. To convert the numbers you see from volatile to stable (unchanging), highlight the entire column and, from the Home tab, select Copy and then Paste>Paste Values.
Now, you should have a column of unchanging random numbers (your numbers will be different from ours):
Now, sort the entire dataset on the random numbers just created. Highlight the entire dataset. Then go to the Data tab. From the Sort and Filter tools, select Sort. When the dialogue box appears, click on the down arrow in the Sort by option. You will get a drop down of all of the variables. Select Random. It does not matter if you sort smallest to largest or largest to smallest; it's random.
Click OK. Then, select the first 10 individuals for the interviews. This is a fairly simple, but very useful process.
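The same three steps (attach a random number to each row, sort on it, keep the first ten) can be sketched in Python. This is only an illustration of the logic, not a replacement for the Excel steps; the employee IDs 1 to 40 and the function name simple_random_sample are assumptions made for the example.

```python
import random


def simple_random_sample(ids, k, seed=None):
    """Pick k ids by attaching a uniform random key to each and sorting."""
    rng = random.Random(seed)
    # One random number between 0 and 1 per id, like a RAND column
    # that has been pasted as values so it no longer changes.
    keyed = [(rng.random(), item) for item in ids]
    keyed.sort()  # the sort direction does not matter; the keys are random
    return [item for _, item in keyed[:k]]


# Hypothetical employee ids 1..40, standing in for the 40 rows of the sheet.
employees = list(range(1, 41))
chosen = simple_random_sample(employees, 10, seed=2007)
print(sorted(chosen))
```

Passing a seed makes the selection repeatable, which plays the same role as Paste Values in the spreadsheet; leaving it out gives a fresh sample on every run.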
4.6 Using Excel 2007 for: Confidence Intervals

The penultimate section in this chapter will aid in the calculation of Confidence Intervals, one of the most commonly used techniques in Inferential Statistics. From Chapter 2: let's assume that the WidgeOne.xls dataset is a representative sample of a larger manufacturing firm with hundreds of employees in Norcross, GA and Dallas, TX (if we had access to the entire organization's data, we would not calculate confidence intervals of any population parameter; we would report the descriptive statistics). Let's also assume that the HR department at WidgeOne has been charged with understanding the level of job satisfaction among employees. For cost reasons, they were unable to survey the entire organization, so they surveyed the 40 employees in our dataset. Report the job satisfaction for all WidgeOne employees, using the sample of 40. Use a 95% level of confidence.

Go back to the Plant_Survey sheet. We previously calculated the mean job satisfaction to be 6.85 and the standard deviation to be 1.02. Using this information, we can use Excel to compute the confidence interval. To execute this computation, go to a blank portion of the spreadsheet and click on the function button. Ensure that the Statistical function category is selected and then scroll through the function names until you get to the CONFIDENCE function. Click OK.
You should see the following:
Alpha is 1 minus the Confidence Level.
The STD would have been previously calculated.
Size is the sample size, in this case 40.
If we are computing a 95% confidence interval, we would enter .05 for the alpha value (you can think of alpha as the probability you are willing to accept of being wrong). The standard deviation, which was computed previously for job satisfaction, was 1.0210. The (sample) size is 40. Once this information is entered, the resulting computation should be .32. This is the margin of error for job satisfaction at a confidence level of 95%. You would then subtract this from and add this to the mean (6.85) to create the full interval. The full interval would then be reported as: based on a representative sample of 40 employees, we are 95% confident that job satisfaction among all employees is between 6.53 and 7.17.
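The arithmetic behind this margin of error can be checked with a short Python sketch. It assumes the normal-distribution formula, margin = z * standard deviation / sqrt(size), which is what the CONFIDENCE function computes; the function name confidence_margin is our own.

```python
from math import sqrt
from statistics import NormalDist


def confidence_margin(alpha, std_dev, size):
    """Margin of error for a mean, using the standard normal distribution."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha is .05
    return z * std_dev / sqrt(size)


mean_jobsat = 6.85
margin = confidence_margin(0.05, 1.0210, 40)
lower = mean_jobsat - margin
upper = mean_jobsat + margin
print(round(margin, 2), round(lower, 2), round(upper, 2))
```

Entering a smaller alpha (for example .01 for 99% confidence) gives a larger z value and therefore a wider interval.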
An important note in spreadsheet development: you could enter 1.02 in this box or enter the cell reference J47. You would generate the same answer. However, you are almost ALWAYS better off entering the cell reference rather than hard coding a number. This makes the formula more portable.
Excel 2007 Lagniappe

We don't have much Lagniappe that is unique to 2007. One concept that is valuable to understand in Excel (2003 or 2007) is the If statement and the Nested If statement. Let's discuss If statements in the context of converting a quantitative variable into a qualitative variable. Let's say that we want to categorize employees into high productivity (defined as a productivity score of 90 or greater) and low productivity (defined as a productivity score less than 90). One way to do this is through the application of an If statement.
Find a new, clean column on the right of the dataset. Title the column Productivity Category. You should see this:
The quantitative productivity score is in column G. In cell L2, enter the formula =IF(G2<90,"LOW","HIGH"). Copy this formula to the bottom of the dataset. You should see the following:
Cool. Now, let's assume that we want to add a third categorization: medium. Let's define medium productivity as a productivity score of at least 80 but less than 90, and low productivity as a score less than 80 (high covers scores of 90 and above). Replace the formula in cell L2 with this: =IF(G2<80,"LOW",IF(G2<90,"Medium","HIGH")). You should see the following:
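The logic of the Nested If reads the same way in any language: test the lowest cutoff first, then the next one, and fall through to the last category. As a minimal sketch in Python (the function name productivity_category is our own, chosen for illustration):

```python
def productivity_category(score):
    """Mirror of =IF(G2<80,"LOW",IF(G2<90,"Medium","HIGH"))."""
    if score < 80:
        return "LOW"
    if score < 90:
        return "Medium"
    return "HIGH"


# A few hypothetical scores to show where the boundaries fall.
for score in (79.9, 80.0, 89.9, 90.0):
    print(score, productivity_category(score))
```

Note that a score of exactly 80 falls in Medium and a score of exactly 90 falls in HIGH, because both tests use "less than".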
Chapter 5 SPSS
What is SPSS?
SPSS was first developed in 1968 by social science researchers at Stanford University as a tool to help them with quantitative research. In fact, the acronym SPSS initially stood for Statistical Package for the Social Sciences. As with IBM and AT&T, the company (and its software) is now known simply by its initials, in part as a testament to its diverse user base. Although the software is most heavily used in social science contexts (particularly in psychology, political science and academia), it is also used in medicine, marketing, and many other contexts. SPSS is appealing to many users from less technical and/or mathematical disciplines because it has a particularly user-friendly interface consisting of an Excel-like spreadsheet for the data and menus and buttons for manipulations and analyses. Although this point-and-click interface makes SPSS particularly attractive for statistical computing novices, individuals who require greater statistical functionality may find the application limiting. This document has been written using SPSS version 15.4.
Getting data into SPSS

Prior to actually executing any of the statistical concepts from Chapter 2, we first need to get the WidgeOne.xls dataset into the SPSS system and convert it into an SPSS file.
When you open SPSS you should see the following screen:
As shown above, there are two tabs in the new file: a Variable View tab and a Data View tab. The Data View tab will display the data in much the same way as an Excel spreadsheet. We must import the data from the Excel spreadsheet WidgeOne. Do this by clicking on File>Open>Data. Then click the computer icon>Computer>C$(\\Client)(V:):
Note that if you are accessing SPSS through Citrix, all of your drive names will change. For example, your C: drive will become your V: drive. Make sure that the File type is set to .xls to find an Excel file.
Browse to where the WidgeOne file is located. When you open it, you will get a dialogue box like this:
Make sure that you select the Plant_Survey worksheet.
The following window should appear:
This is one of two possible views of your dataset. This is the Variable View. Note at the bottom of the screen, the Variable View tab is highlighted. This view lists the variables in your dataset. In our case, the column names in the WidgeOne file were converted to variable names in this SPSS file. The qualitative variables (e.g., GENDER and PLANT) are called string variables and the quantitative variables (e.g., PRDCTY and YRONJOB) are called numeric variables. For later displays it will be nice to create user-friendly labels for each of the values in these variables, instead of indicators like D for the Dallas plant. To create labels that will make our output easier to read, click on the Values cell in the PLANT row:
You will be prompted for a value and a label. In the Value box, enter the value that appears in the actual data that you want to read differently in the output:
Click the Add button. Next assign the label Norcross for the value N and click Add again. Click OK. Do this for the other string variables, Gender and Position. Please note that this is NOT affecting your actual data; it will only change the way that the output appears.
Go back to the Data View tab at the bottom of the screen. You will see the actual data. (If you needed to create a new dataset from scratch, you would begin by defining your variables in the Variable View window and then return to the Data View window and input the values.)
To expand the columns, simply place your cursor in between the column headers (variable names) and drag the column to its desired width just like you would in Excel. At this point, we could convert the other worksheets from the WidgeOne dataset into SPSS files. Each would be converted to a separate SPSS file. These files could be merged into one file using the Merge Files option in the Data Menu (not available in Student Version). However, since we will only be using the variables in the Plant_Survey worksheet for our statistical analyses, we will not execute a merge at this time.
5.1 Using SPSS for: Measurements of Central Tendency

SPSS is a menu-based system. Thus it is only a matter of finding what you want to do on the menus and customizing your request. For most computations, you should find SPSS to be easier than Excel. In order to find the two most predominant measures of central tendency (the mean and median) we start in the Analyze menu. Within that menu, choose Descriptive Statistics and Frequencies as shown:
Next you will see the following:
We need to choose the variables for which we are interested in finding the mean and median. We will choose only the quantitative variables (those with the ruler icon next to them): JOBGRADE, SOCREL, YRONJOB, PRDCTY and JOBSAT. We make these selections by clicking on the variable from the list on the left and then clicking on the right arrow button circled above to place it on the variable list on the right. Almost every option in SPSS has this type of interface for selecting variables for analysis. You can choose more than one variable at a time by holding the Ctrl key down as you make your selections. Please make sure that the Display frequency tables option is UNTICKED. This will be more meaningful later.
After you have identified the variables for analysis, click on the Statistics option button as circled above. You should see this screen:
This should look pretty familiar. This is almost the same list of statistical information that was produced when we executed Tools>Data Analysis>Descriptive Statistics in Excel. Hmmm, that must mean that this stuff really is important. For now, just tick Mean and Median and select Continue.
We obtain the following display containing the means and medians of our five variables in our SPSS Output window:
Statistics
                JOBGRADE   PRDCTY     SOCREL    YRONJOB   JOBSAT
N   Valid       40         40         40        40        40
    Missing     0          0          0         0         0
Mean            6.6000     84.5798    5.3000    8.2900    6.8500
Median          6.5000     84.8114    5.0000    8.3500    6.6000
Notice that these figures are consistent with what we had generated using Excel and what we had computed by hand. Isn't it nice when numbers match? What if we were only interested in a subset of the data? For example, what if we wanted to know the measurements of central tendency of these variables by gender and by plant?
We would select the Compare Means>Means option from the Analyze menu as shown:
You should see the following screen:
Typically, quantitative variables go in the Dependent List and qualitative variables go in the Independent List.
Choose the same five variables as before. Place these variables in the Dependent List. Then, place the variables Plant and Gender in the Independent List. This will enable us to better understand if there are differences between the genders and the plants across the quantitative variables like Productivity (PRDCTY). Once the variable lists have been populated, select the Options button. From the list, identify that you want the Mean and the Median. Select Continue and OK.
This output is much more explanatory than the first set of output. Look at the differences between the plants. Which plant is more productive? Which plant has a higher Job Satisfaction score? Now look at the differences between the genders. Which gender has a higher social relations score? Is there a difference in productivity between the genders?

Sometimes looking at an average by itself is misleading. For example, let's assume that a friend of yours just read an article about lung cancer. He goes on to tell you that 1% of all Americans will die of lung cancer. I should probably mention that your friend is a member of the great statistical unwashed. Does this mean that you have a 1% chance of dying of lung cancer? Of course not. It depends upon a lot of things, like whether you smoke. If you re-evaluate that number by smokers/non-smokers, the values are very different. That's the point: averages can be very misleading. You need to look at the average (or median) by different groupings to better understand the rest of the story.
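What Compare Means does can be sketched in a few lines of Python: split the values by a grouping variable, then compute the mean and median within each group. The records below are hypothetical numbers made up for illustration (they are NOT the WidgeOne data), and the function name summarize_by_group is our own.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_group(rows):
    """Return {group: (mean, median)} for (group, value) pairs."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {group: (mean(values), median(values)) for group, values in groups.items()}


# Hypothetical (plant, productivity) records for illustration only.
records = [
    ("Dallas", 88.0), ("Dallas", 91.0), ("Dallas", 85.0),
    ("Norcross", 79.0), ("Norcross", 83.0), ("Norcross", 75.0),
]

overall = mean(value for _, value in records)
by_plant = summarize_by_group(records)
print(round(overall, 2))
for plant, (avg, med) in by_plant.items():
    print(plant, round(avg, 2), med)
```

In this made-up example the overall mean sits between the two plant means and describes neither plant well, which is the reason for looking at averages by grouping.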
As a rule, we do not use the mode as a Measurement of Central Tendency with quantitative data. If the data is qualitative (Plant, Gender, Position), it is the ONLY Measurement of Central Tendency available. We can determine the mode of variables such as these by selecting Analyze>Descriptive Statistics>Frequencies again. This time choose the qualitative variables Plant, Gender and Position. Check the box next to Display frequency tables. Then click OK.
You will see the following frequency tables, from which it is easy to determine if there is a modal value (isn't this easier than what we had to go through in Excel?):
Plant
                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Dallas     23          57.5      57.5            57.5
        Norcross   17          42.5      42.5            100.0
        Total      40          100.0     100.0

Gender
                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid   F          20          50.0      50.0            50.0
        M          20          50.0      50.0            100.0
        Total      40          100.0     100.0

POSITION
                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid   HRLY       20          50.0      50.0            50.0
        MGT        20          50.0      50.0            100.0
        Total      40          100.0     100.0
You can also see here that we are reaping the work that we did earlier when we created the labels. It is much easier to understand these tables with the full labels.
5.2 Using SPSS for: Measurements of Dispersion

To represent the dispersion of a quantitative variable (Measurements of Dispersion are not relevant for qualitative variables), we typically report the standard deviation. To do this in SPSS, return to the Analyze menu. Select Descriptive Statistics>Frequencies and select the quantitative variables as before. Turn off the display for frequency tables and click on the Statistics button. Select Standard Deviation. Click Continue and then OK. You should see the following output:
Statistics
                  JOBGRADE   PRDCTY    SOCREL    YRONJOB   JOBSAT
N   Valid         40         40        40        40        40
    Missing       0          0         0         0         0
Std. Deviation    1.54919    7.25633   1.65173   4.25657   1.02081
We could have obviously included lots of statistics in our analysis simply by choosing the ones we want from the Statistics screen. The second Measurement of Dispersion discussed in Chapter 2 was the frequency table. To execute a basic frequency table for a qualitative variable, go to Analyze> Descriptive Statistics>Frequencies. Select the qualitative variables for analysis. Ensure that the Display frequency tables is ticked at the bottom of the page. Click OK.
In the previous chapters, we explained how to categorize a quantitative variable into a qualitative variable. For example, when we created a frequency table for the job tenure variable, we created three categories: < 5 years, 5-10 years and more than 10 years. To create these same categories in SPSS, we need to recode our YRONJOB variable into a new variable called JOBTEN. To do this, go to the Transform menu and choose the option Recode into Different Variables:
You should see the following:
Click on the Old and New Variables button.
You should now see the following screen:
Identify the name of the first category
Identify the range of values for the first category.
Tick this box to tell SPSS that you are creating a qualitative variable
First we define the category New. In the screen above, you must indicate that the Range of this new value is from 0 to 4.9 (we wanted values less than 5 and the data had only one decimal place of accuracy). Check in the box that specifies that the new output variable will be of type String. We also name the new values New. Click on the Add button to add this new output value.
These actions will produce the following:
Note that the values of YRONJOB between 0 and 4.9 will represent the category New in the new variable. Continue this same process creating the category Experienced (5-10 years on the job) and the category Mature (10+ years on the job). Note: since the value Experienced has 11 characters, change the Width from 8 to 11. After you have completed this process, click on Continue.
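The recode rule itself is just a set of range tests. As a minimal Python sketch of the same mapping (not SPSS syntax): the function name job_tenure_category and the sample values are our own, and we assume that a value of exactly 10 years belongs to Experienced, following the "more than 10 years" definition of the top category.

```python
from collections import Counter


def job_tenure_category(years_on_job):
    """Recode years on the job into the three JOBTEN categories."""
    if years_on_job < 5:
        return "New"
    if years_on_job <= 10:  # assumption: exactly 10 counts as Experienced
        return "Experienced"
    return "Mature"


# Hypothetical tenure values for illustration only.
sample_years = [0.1, 4.9, 5.0, 8.3, 10.0, 10.4, 17.0]
categories = [job_tenure_category(years) for years in sample_years]
print(Counter(categories))
```

Counting the recoded values, as the Counter does here, is the same operation as running a frequency table on the new variable.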
The Name is what will appear in the dataset. The Label is what will appear in the output. Select Change and then select OK.
Now we can easily generate a frequency table for the new variable JOBTEN. As before, go to Analyze>Descriptive Statistics>Frequencies. Ensure that the frequency table option is ticked and select your new Jobten variable:
Job Tenure
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Experienced   16          40.0      40.0            40.0
        Mature        15          37.5      37.5            77.5
        New           9           22.5      22.5            100.0
        Total         40          100.0     100.0
Well Done!
5.3 Using SPSS for: Visualization/Organization of Univariate Data

For professional presentations or for formal documents, we recommend the use of a graphics package (e.g. Microsoft PowerPoint). However, SPSS has some nice graphs available in the Graphs menu, which can be used less formally. In addition, it is very useful to develop graphics for your own purposes, because doing so enables you to see things about your data that you might not have otherwise seen. As with Excel, let's begin with a Histogram. We will also execute a Stem and Leaf plot, which we were not able to do with Excel. To create a Histogram of the YRONJOB variable, select Analyze>Descriptive Statistics>Explore:
Assign YRONJOB to the Dependent list. Select the Plots button:
Tick the Stem and Leaf and Histogram options.
Click Continue. On the Explore dialogue box, make sure that the Both option is selected for the Display. Click OK.
This set of executions will generate the following output:

Descriptives: YRONJOB
                                                  Statistic   Std. Error
Mean                                              8.2900      .67302
95% Confidence Interval for Mean   Lower Bound    6.9287
                                   Upper Bound    9.6513
5% Trimmed Mean                                   8.2917
Median                                            8.3500
Variance                                          18.118
Std. Deviation                                    4.25657
Minimum                                           .10
Maximum                                           17.00
Range                                             16.90
Interquartile Range                               6.10
Skewness                                          -.081       .374
Kurtosis                                          -.748       .733

YRONJOB Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     2.00        0 .  01
     5.00        0 .  22233
     5.00        0 .  44555
     4.00        0 .  6777
     8.00        0 .  88888999
     7.00        1 .  0000111
     4.00        1 .  2333
     4.00        1 .  4445
     1.00        1 .  7

 Stem width:   10.00
 Each leaf:    1 case(s)
Here is the Stem and Leaf plot. If you imagine rotating this graphic clockwise 90 degrees, it is basically a Histogram on its side. The plot tells us that each stem has a width of 10.00. This means that the stems should be interpreted in units of 10, with each leaf supplying the units digit. Let's start in the middle with the frequency of 7.00. Here, we have four values that are 10.x and three more values that are 11.x. The next line indicates a frequency of 4.00. In the dataset, we have an observation that is 12.x and three observations that are 13.x. The greatest observation is 17.
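To make the reading rule concrete, here is a minimal Python sketch that turns the stems and leaves printed in the plot back into numbers. The rows are transcribed from the SPSS output; the function name decode_stem_and_leaf is our own. Because the plot drops the decimal part of each observation, the decoded values are truncated (for example, the minimum of .10 comes back as 0).

```python
# Rows transcribed from the SPSS stem-and-leaf plot: (stem, leaves).
ROWS = [
    (0, "01"), (0, "22233"), (0, "44555"), (0, "6777"), (0, "88888999"),
    (1, "0000111"), (1, "2333"), (1, "4445"), (1, "7"),
]


def decode_stem_and_leaf(rows, stem_width=10):
    """Rebuild the (truncated) data values from stems and leaf digits."""
    leaf_unit = stem_width // 10  # each leaf digit is one tenth of the stem width
    values = []
    for stem, leaves in rows:
        for leaf in leaves:
            values.append(stem * stem_width + int(leaf) * leaf_unit)
    return values


values = decode_stem_and_leaf(ROWS)
print(len(values), min(values), max(values))
print(values.count(10), values.count(11))  # the row with frequency 7.00
```

The number of leaves on each row equals the frequency printed beside it, so the decoded list has one entry per employee.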
This is a boxplot. Here, the center line is the median. The box is the Interquartile Range: the high end of the box is the 75th percentile and the low end is the 25th percentile. The whiskers that extend in either direction tell us the full range of the data. If there were any outliers (defined as observations more than 1.5*IQR beyond either end of the box), they would be identified here. Lots and lots of output, with relatively little work. That's what I'm talking about!
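The pieces of a boxplot can be computed by hand, as in the following minimal Python sketch. The data are hypothetical values chosen so that one observation is flagged, and the function name box_plot_summary is our own. Software packages use slightly different rules for computing quartiles, so do not expect the quartiles here to match SPSS to the last decimal place.

```python
from statistics import quantiles


def box_plot_summary(data):
    """Quartiles, IQR, outlier fences and outliers for a list of numbers."""
    q1, med, q3 = quantiles(data, n=4)  # 25th, 50th and 75th percentiles
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "low_fence": low_fence, "high_fence": high_fence,
        "outliers": outliers,
    }


# Hypothetical data for illustration only; 40 sits far above the rest.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
summary = box_plot_summary(sample)
print(summary)
```

Any value outside the two fences would be drawn as a separate point rather than being covered by a whisker.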
To replicate the pie chart developed in Chapter 2, go to Graph>Legacy Dialogues>Interactive>Pie>Simple:
You should see this screen:
Drag the qualitative variable of interest in this box.
Because Pie Charts communicate proportions, drag the Count default back to the left and drag the percent option into this space.
The four tabs across the top of this dialogue box will take you through four different sets of options. The only other thing we really need to do is to give our Pie Chart a title. So, click on the Titles tab and title the chart Job Tenure of WidgeOne Employees. Feel free to explore the other tabs.
You should have generated the following:
Let's say that you wanted to understand how the overall productivity of the company was allocated by plant, that is, what percentage of the productivity comes from Dallas versus Norcross. This is easy to do in a Pie Chart in SPSS. Go back to Graph>Legacy Dialogues>Interactive>Pie>Simple. Make the following changes:
Don't forget to change the title. Click OK. You should see the following Pie Chart:
To correctly obtain the percentages, double click the graph to see this:
Click show data labels and then close the properties window and the chart editor to obtain the following graph:
This pie chart now provides information regarding the percentage of WidgeOne's productivity by plant (Norcross needs to step it up).
The next univariate visualization tool is a Bar chart. This is done in a very similar way to the Pie Chart.
Select Graph>Legacy Dialogues>Interactive>Bar. You should see the following screen:
Change this to Horizontal.
Think of this layout as the axes in a graphic.
As before, change the title to something meaningful. Peruse the other tabs as you see fit. You should generate something like this:
We should probably note at this point that if the definitions that you assigned when you transformed the quantitative variable into a qualitative variable are not universally known, you should include a legend or key at the bottom of your graphic to ensure that the reader understands the definition of New and Mature.
5.4 Using SPSS for: Visualization/Organization of Multivariate Data

Contingency tables, Stacked Bar Charts, 100% Stacked Bar Charts and Scatter Plots can be easily generated in SPSS. To reproduce the Contingency Tables that were created in earlier chapters that included the variables Plant and Gender, select Analyze>Descriptive Statistics>Cross Tabs:
Place the Plant variable in the Row position.
Place the Gender variable in the Column position.
As with Excel Pivot Tables, Crosstabs in SPSS are very flexible. If you wish to include more than just the frequency counts in the cells of your table, click on Cells. You will see the following window:
In the percentages section, select Row, Column and Total. Click Continue and then OK.
Wow, look how much output was created in a single table! That was so much easier than Excel! The output table contains the conditional probabilities described in Chapter 2. In the first cell (the intersection of Female and Dallas) we have four pieces of information. We know that there are 13 women who work in Dallas. We know that of all of the Dallas employees, 56.5% are female. We know that of all of the women, 65% are in Dallas. Finally, we know that of all employees, 32.5% are females in Dallas.
Plant * Gender Crosstabulation

                                        Gender
                                   Female     Male      Total
Plant   Dallas     Count              13        10         23
                   % within Plant   56.5%     43.5%     100.0%
                   % within Gender  65.0%     50.0%      57.5%
                   % of Total       32.5%     25.0%      57.5%
        Norcross   Count               7        10         17
                   % within Plant   41.2%     58.8%     100.0%
                   % within Gender  35.0%     50.0%      42.5%
                   % of Total       17.5%     25.0%      42.5%
Total              Count              20        20         40
                   % within Plant   50.0%     50.0%     100.0%
                   % within Gender 100.0%    100.0%     100.0%
                   % of Total       50.0%     50.0%     100.0%
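To see where the four numbers in each cell come from, here is a minimal sketch in Python (not one of the packages covered in this manual, so treat it only as an illustration). The counts are taken from the crosstab; the function name crosstab_percents is our own invention.

```python
# Cell counts taken from the Plant * Gender crosstab.
counts = {
    "Dallas": {"Female": 13, "Male": 10},
    "Norcross": {"Female": 7, "Male": 10},
}

def crosstab_percents(counts, row, col):
    """Return (count, % within row, % within column, % of total) for one cell."""
    cell = counts[row][col]
    row_total = sum(counts[row].values())
    col_total = sum(r[col] for r in counts.values())
    grand_total = sum(sum(r.values()) for r in counts.values())
    return (
        cell,
        round(100 * cell / row_total, 1),    # % within Plant
        round(100 * cell / col_total, 1),    # % within Gender
        round(100 * cell / grand_total, 1),  # % of Total
    )

print(crosstab_percents(counts, "Dallas", "Female"))  # (13, 56.5, 65.0, 32.5)
```

Each percentage is the same count divided by a different total, which is why one cell can legitimately carry three different percentages.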
If you need to subset this information further (e.g. by Job Tenure), there is an easy way to do that. Go back to the Analyze>Descriptive Statistics>Crosstabs screen. Press Reset to return to the default settings.
Make your selections of the three variables as follows:
Click OK.
This time, the table will only show the cell counts (we could have included the percentages as before by following the same steps in the Cell Display screen):
Plant * Gender * Job Tenure Crosstabulation
Count

                                    Gender
Job Tenure    Plant         Female    Male    Total
Experienced   Dallas           3        5        8
              Norcross         3        5        8
              Total            6       10       16
Mature        Dallas           6        3        9
              Norcross         2        4        6
              Total            8        7       15
New           Dallas           4        2        6
              Norcross         2        1        3
              Total            6        3        9
Notice that the same information on Plant and Gender counts has now been provided by each level of Job Tenure: Experienced, Mature and New (the levels are reported in alphabetical order rather than by order of magnitude). Cool.
Stacked Bar Charts can be generated in SPSS using the same basic steps that you used for simple Bar Charts in the previous section. Select Graph>Legacy Dialogues>Interactive>Bar:
Change the Title. Select a vertical orientation.
Place the Plant variable here. Place the Gender variable here.
Select OK. You should see the following Stacked Bar Chart:
Because these groups are of different sizes, it might be better to plot this information in a 100% Stacked Bar Chart instead. To do this, go back to Graph>Legacy Dialogues>Interactive>Bar:
Tick the 100% stacked option.
You should see the following graphic:
The last multivariate visualization technique is the Scatter Plot. Again, SPSS provides us with flexibility to subset our analysis if needed.
What variables might have a relationship? What about Productivity and Job Satisfaction? A Scatter Plot of these variables can be generated by selecting Graph>Legacy Dialogues>Interactive>Scatterplot:
Don't forget to change the title! You should see the following graphic:
So, what do you think? It appears that there might be a positive relationship between the two variables, because the points roughly follow a line from the SW corner to the NE corner of the graphic.
5.5 Using SPSS for: Random Number Generation and Simple Random Sampling
Like the other software applications, SPSS will generate random numbers using the internal clock in the computer. As a result, every time a command is given to SPSS to generate some set of random numbers, a different set of random numbers will be generated. However, sometimes we may need to replicate a set of random numbers exactly the way they were previously generated. To accomplish this replication, SPSS allows the analyst to define a seed number that will ensure a consistent set of random numbers; the numbers are still random and can be used to ensure statistical independence of samples. If you need to set the seed number so you can replicate your results, simply go to the Transform menu. Choose the Random Number Generators option. You should see the following screen:
This system is set to have a Starting Point of 1234567. This starting point is referred to as a seed. You can set the starting point value prior to each analysis that uses the random numbers. The value must be a positive integer. To create a string of random numbers, which is uniformly distributed between 0 and 1, go to the Transform menu and choose Compute Variable. We will call the new random number variable Group as shown in the screen below. Look at the menu for Function Group. In this menu, select Random Numbers. You will then see a long list appear in the Functions and Special Variables menu. This is a list of distributions that you could use to generate the new random variable Group. This time double click on Rv.Uniform:
Every distribution has parameters that must be specified. For the uniform distribution, the only parameters are the two values between which we want our random numbers to fall. The ?s in the expression RV.UNIFORM(?,?), which appears in the Numeric Expression box, are asking you to fill in these two values for your random numbers. Change this expression to read RV.UNIFORM(0,1), so the random numbers will be between 0 and 1 (as they were in Excel). Click OK. The new variable Group should appear in your Data View. Here is what a typical result would look like:
Remember that your results will vary since this variable was randomly generated. One of the primary reasons for generating random numbers is to assign observations into statistically independent groups. Using the random numbers, let's assign the 40 observations into 2 groups: a test group and a control group. Just like we did in section 5.1, select Transform>Recode Into Different Variables. Select the new variable Group to be transformed. Click on Old and New Values. Set it up so that the values between 0 and .5 are put into the Control group and the values from .5 to 1 are put into the Test group:
Click on Continue. Give the new variable a name like Assignment and then click OK.
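The same idea (seed, uniform random numbers, recode at .5) can be sketched outside SPSS. The Python below is only an illustration: Python's random number generator is not the one SPSS uses, so the numbers will not match an SPSS run even with the same seed, and the function name assign_groups is hypothetical.

```python
import random

def assign_groups(n, seed):
    """Draw n uniform(0, 1) numbers and recode each one into Control or Test."""
    rng = random.Random(seed)  # a fixed seed reproduces the same numbers every run
    draws = [rng.random() for _ in range(n)]
    # Values below .5 go to Control; values from .5 up to 1 go to Test.
    labels = ["Control" if u < 0.5 else "Test" for u in draws]
    return draws, labels

draws, labels = assign_groups(40, seed=1234567)
print(labels.count("Control"), labels.count("Test"))
```

Calling assign_groups again with the same seed returns exactly the same assignment, which is the replication behavior described above. The two groups will usually be close to, but not exactly, 20 and 20.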
You should see the following on your Data View:
Now, you have two groups of randomly assigned employees. This is a very important concept in Statistical Testing. Because the process of selecting a random sample from a set of data is so common, there is a very straightforward way to accomplish this in SPSS. Suppose we wish to select a simple random sample of 30 individuals from this dataset. Select Data>Select Cases>Random Sample of Cases>Sample:
You could choose to sample a certain percentage of the cases or sample 30 out of the first 40 cases. Do the latter. Click on Continue and then OK.
Your Data view will now look like this:
Cases with a slash were not selected in the sample
Note there is a new variable in the list: filter_$. It assigns the value 1 to those cases selected for the random sample and the value 0 to all others. Cases not selected for the sample are now slashed in the first column. Remember that samples will differ unless the same seed is used to generate them. At this point, you can execute all of your analysis as before, but only those cases with a filter_$ value of 1 will be analyzed. You can go back to all cases by selecting Data>Select Cases>All Cases.
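As a rough illustration of what the filter variable holds, here is a short Python sketch that draws a simple random sample of 30 cases out of 40 and marks each case with 1 (selected) or 0 (not selected). The function name sample_filter and the seed value are our own choices, and the selection will not match the cases SPSS picks.

```python
import random

def sample_filter(n_cases, sample_size, seed=None):
    """Return a list of 1/0 flags marking a simple random sample of the cases."""
    rng = random.Random(seed)
    # rng.sample draws without replacement, so no case is selected twice.
    selected = set(rng.sample(range(n_cases), sample_size))
    return [1 if i in selected else 0 for i in range(n_cases)]

flags = sample_filter(40, 30, seed=2024)
print(sum(flags))  # 30 cases carry the value 1
```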
5.6 Using SPSS for: Confidence Intervals

Generating confidence intervals in SPSS is very easy. For example, if we wish to compute a 95% confidence interval for the mean Job Satisfaction rating of all employees, we would go to the Analyze menu and choose Compare Means and then choose One-Sample T Test (see note 12). Once the Job Satisfaction variable has been assigned, select Options and ensure that the CI will be generated at a 95% Confidence:
Note 12: T-tests are very common tests used to determine if two sample means differ significantly or if one sample mean differs from some established value. For more detailed information on T-tests, we suggest Statistical Methods and Data Analysis by Ott and Longnecker.
Click Continue and then OK. You will see the following output:
One-Sample Test (Test Value = 0)

                                                            95% Confidence Interval
                                                            of the Difference
          t        df   Sig. (2-tailed)   Mean Difference   Lower      Upper
JOBSAT    42.440   39   .000              6.85000           6.5235     7.1765
As stated previously in Chapter 2, these results would be reported as: Based on a representative sample of 40 employees, we are 95% confident that the mean job satisfaction among all employees is between 6.52 and 7.18. Strictly speaking, the 95% describes the method rather than this one interval: if we repeatedly drew samples of 40 employees and built an interval from each in the same way, about 95% of those intervals would contain the true mean job satisfaction of all employees, which is unknown, and about 5% would miss it (by lying entirely below or entirely above the true mean).
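The interval can be checked by hand with the usual formula, mean plus or minus t times the standard error. The Python sketch below is an illustration only; it assumes the JOBSAT standard deviation of 1.021 reported elsewhere in this manual and a t critical value of 2.0227 (the 97.5th percentile of the t distribution with 39 degrees of freedom, as read from a t table).

```python
import math

n = 40           # sample size
mean = 6.85      # sample mean of JOBSAT
sd = 1.021       # sample standard deviation of JOBSAT (assumed, see lead-in)
t_crit = 2.0227  # assumed t critical value for 95% confidence, 39 df

se = sd / math.sqrt(n)   # standard error of the mean
margin = t_crit * se     # half-width of the interval
lower, upper = mean - margin, mean + margin
print(round(lower, 2), round(upper, 2))  # 6.52 7.18
```

The result agrees with the Lower and Upper values in the SPSS output to the precision of the inputs.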
5.7 SPSS Lagniappe

You may have noticed that in your output screen in SPSS, you have been generating what is called Syntax. For example, when you executed the measurements of central tendency for the quantitative variables, SPSS wrote the following syntax:
FREQUENCIES VARIABLES=JOBGRADE PRDCTY SOCREL YRONJOB JOBSAT
  /FORMAT=NOTABLE
  /STATISTICS=MEAN MEDIAN
  /ORDER=ANALYSIS.
Did you notice that? Probably not. Most people use SPSS because they don't have to write code to have the software do what they want. However, in the event that you find the point-and-click environment of SPSS too restricting, know that you always have the option to write custom syntax to have SPSS do more specifically what you want. To run syntax in SPSS, select File>New>Syntax. In the blank syntax screen type (or paste) the syntax above. Then select Run>All. You will generate the same output as before! Cool!
Chapter 6 Minitab
What is Minitab?
Minitab was first developed in 1972 at Penn State University (Go Nittany Lions!). Initially, it was developed as a teaching tool to help professors teach basic statistics. It is still used for that purpose at more than 4000 colleges and universities around the world. One reason for its popularity in this venue is that it has a user-friendly, menu-driven interface much like SPSS. Because it offers accurate and customizable analysis tools for quality control and other important business and industry functions, it is also now widely used by companies of all sizes. It is currently the package of choice at many manufacturing Fortune 500 companies including Ford Motor Company, 3M, Honeywell International, and Samsung.
Getting data into Minitab

Prior to actually executing any of the statistical concepts from Chapter 2, we first need to get the WidgeOne.xls dataset into the Minitab system and convert it into a Minitab file.
When you open Minitab you should see the following screen:
This is the Session Window.
This is the Data Window.
As shown above, the display consists of two windows. The Session Window is at the top and is where you will see commands and results displayed. The Data Window is at the bottom. It is the worksheet where the data is displayed in a spreadsheet format. You now need to open up the WidgeOne dataset in Minitab. In the File menu choose Open Worksheet as shown below:
You will next see a typical window for opening a file. Remember the WidgeOne.xls dataset is an Excel file. Your computer will initially be looking for Minitab files only. You have the option of looking for files of any type. The window below shows the system being instructed to look for Excel files. It also shows the WidgeOne dataset being selected:
Click Open.
You should then see the following display:
The three worksheets in the Excel file WidgeOne.xls have all been converted to separate Minitab worksheets, named Attendance, Employees and Plant_Survey. The statistical analyses for this guide are performed exclusively on the Plant_Survey worksheet. You can close out of the others now. Make sure to go to the File menu and save the Plant_Survey worksheet for future use.
6.1 Using Minitab for: Measurements of Central Tendency

Minitab is a menu-based system. Thus it is only a matter of finding what you want to do on the menus and customizing your request. For most computations, you should find Minitab (like SPSS) to be easier than Excel. In order to find the two most predominant measures of central tendency (the mean and median) we start in the Stat menu. Within that menu, choose Basic Statistics and Display Descriptive Statistics as shown:
Next you will see the following:
We need to choose the variables for which we are interested in finding the mean and median. We will choose only the quantitative variables: JOBGRADE, SOCREL, YRONJOB, PRDCTY and JOBSAT. We make these selections by clicking on the variables while holding down the Control Key and then clicking on the Select button. This button will appear darker once a variable has been highlighted.
As with SPSS, this interface is common to almost every function in Minitab. The Select button will not activate until at least one variable has been highlighted for selection:
After you have selected the variables for analysis, your screen should look like this:
Now click on the Statistics button. This will show you the statistics selected for display. There are many more statistics on this list than you have been exposed to in this course.
The following dialogue box will appear:
We can select several different descriptive statistics. Statistics are selected by clicking in the box next to each until a check mark appears. In this case, we have selected only the mean and median. Once your selections are made, click the OK button and then click OK again in the Display Descriptive Statistics window.
We obtain the following display containing the means and medians of our five variables. The display appears in our Minitab Session window:
Results for: Plant_Survey

Descriptive Statistics: JOBGRADE, SOCREL, YRONJOB, PRDCTY, JOBSAT

Variable    Mean     Median
JOBGRADE    6.600    6.500
SOCREL      5.300    5.000
YRONJOB     8.290    8.350
PRDCTY      84.58    84.81
JOBSAT      6.850    6.600
Notice that these figures are consistent with what we had generated using Excel and SPSS and what we had computed by hand. Again, it is nice when numbers match!
Before we go on to looking at subsets of the data, let's recode the values of the qualitative variables we will be using with meaningful labels. The variables Plant and Gender are coded with single letters (N = Norcross, D = Dallas, etc.). We wish to recode these variables, so our displays and graphs will have more descriptive names. These are the steps we use to accomplish this. In the Data menu, select Code and then Text-to-Text as shown below:
We choose Text-to-Text because we wish to change text values (like D) to other text values (like Dallas). We see a box like the one below. In this example, we have chosen the variable Plant as the column to code the data from and also as the column to code the data into. This means we will recode the data within the same column rather than choosing another one for the recoded values. Fill in the rest of the box as below:
Click OK. The Minitab Data Window should now look like this:
Perform the same type of recode for the Gender variable (M = Male, F = Female). Your data will then appear as follows:
Now we are ready to look at subsets of the data that will be determined by these qualitative variables. For example, what if we wanted to know the measurements of central tendency of these variables by gender and by plant? We would proceed exactly as before: Stat>Basic Statistics>Display Descriptive Statistics. We again choose the same five variables.
This time we will click inside the box called By variables. Once we click inside this box, the menu of variable choices grows to include the qualitative variables from our set. Minitab knew we could not calculate means and medians for qualitative variables and did not include those in the variable selection box. However, when we wish to subset the data, these variables do come in as options. Please note that you should only place qualitative variables in the By variables box. For the first analysis, choose Plant and then click on the Select button. You should see the following display:
Click on the Statistics button to choose Mean and Median again. Also check the N Total box this time, so we get the frequency of each subset. Click OK and OK.
The following display will appear in your Session Window:

Descriptive Statistics: JOBGRADE, SOCREL, YRONJOB, PRDCTY, JOBSAT
Variable    Plant       Total Count    Mean     Median
JOBGRADE    Dallas           23        6.870    7.000
            Norcross         17        6.235    6.000
SOCREL      Dallas           23        5.522    5.000
            Norcross         17        5.000    5.000
YRONJOB     Dallas           23        8.104    8.000
            Norcross         17        8.541    9.000
PRDCTY      Dallas           23        88.34    90.42
            Norcross         17        79.49    79.86
JOBSAT      Dallas           23        7.148    7.000
            Norcross         17        6.447    6.300
Follow the same series of steps, only this time select Gender for the By variables box. Your output should look like this:
Variable    Gender    Total Count    Mean     Median
JOBGRADE    Female         20        6.300    6.000
            Male           20        6.900    7.000
SOCREL      Female         20        6.000    6.000
            Male           20        4.600    5.000
YRONJOB     Female         20        8.19     8.50
            Male           20        8.395    8.350
PRDCTY      Female         20        83.97    84.90
            Male           20        85.19    84.81
JOBSAT      Female         20        6.980    6.700
            Male           20        6.720    6.400
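A quick way to check subset output like this is to recombine the group means, weighted by the group counts, and confirm that they give back the overall mean. The Python sketch below does this for JOBGRADE using the counts and means reported by plant and by gender; the function name pooled_mean is our own.

```python
def pooled_mean(groups):
    """Combine (count, mean) pairs into the mean of the whole dataset."""
    total_n = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_n

# (count, mean) of JOBGRADE for each subset
by_plant = [(23, 6.870), (17, 6.235)]   # Dallas, Norcross
by_gender = [(20, 6.300), (20, 6.900)]  # Female, Male

print(round(pooled_mean(by_plant), 3), round(pooled_mean(by_gender), 3))
```

Both ways of splitting the data recover the overall JOBGRADE mean of 6.600, up to rounding in the reported group means. Note that medians cannot be recombined this way.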
As a rule, we do not use the mode as a Measurement of Central Tendency with quantitative data. If the data is qualitative (Plant, Gender, Position), the mode is the ONLY Measurement of Central Tendency available.
We can determine the mode of variables such as these by choosing the Stat menu and within that menu selecting Tables and Tally Individual Variables as shown here:
You will see the window below. Select the variables Plant, Gender and Position as shown:
Click OK. This output appears in the Session Window:

Tally for Discrete Variables: Plant, Gender, POSITION
Plant       Count    Percent
Dallas         23      57.50
Norcross       17      42.50
N=             40

Gender      Count    Percent
Female         20      50.00
Male           20      50.00
N=             40

POSITION    Count    Percent
HRLY           20      50.00
MGT            20      50.00
N=             40
It is easy to see that Dallas is the modal value for the Plant variable. It is also easy to see that there is no mode for the other two variables in this example.
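Reading the mode off a tally is just a matter of finding the category with the largest count. A minimal Python sketch, using the Plant and Gender counts from the tally (the function name modes is hypothetical):

```python
from collections import Counter

def modes(values):
    """Return every category that ties for the highest count, sorted by name."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

plant = ["Dallas"] * 23 + ["Norcross"] * 17
gender = ["Female"] * 20 + ["Male"] * 20

print(modes(plant))   # ['Dallas']
print(modes(gender))  # ['Female', 'Male']
```

When more than one category comes back, as with Gender, the counts are tied and there is no single modal value.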
6.2 Using Minitab for: Measurements of Dispersion

To represent the dispersion of a quantitative variable (Measurements of Dispersion are not relevant for qualitative variables), we typically report the standard deviation. To do this in Minitab, we return to the Stat menu. Again we choose Basic Statistics and Display Descriptive Statistics. Select the 5 quantitative variables as before. Do not select any variables in the By variables box. Click on Statistics and select Standard deviation:
Click OK and then OK.
Here is the output you should see in the Session Window:

Variable     N    N*    StDev
JOBGRADE    40     0    1.549
SOCREL      40     0    1.652
YRONJOB     40     0    4.257
PRDCTY      40     0    7.26
JOBSAT      40     0    1.021
We could have obviously included lots of statistics in our analysis simply by choosing the ones we want from the Statistics screen. The second Measurement of Dispersion discussed in Chapter 2 was the frequency table. In Chapters 2 and 3, when we created a frequency table for the job tenure variable, we created three categories: < 5 years, 5-10 years and more than 10 years. To create these same categories in Minitab, we need to recode our YRONJOB variable into a new variable called JOBTEN.
Go to the Data menu and choose Code>Numeric to Text:
We have selected Numeric to Text because we are changing the numerical variable YRONJOB to a qualitative variable that we will call JOBTEN.
You will see a screen like the one below:
Select the YRONJOB variable as it is the one to be recoded. Type the name of the new variable JOBTEN in the Into Columns box (It is a new name, so it is not a choice to be selected from the left-hand menu). Then fill in the old and new values as we have them above. Note that values of YRONJOB between 0 and 4.9 will be coded as New. Values between 5 and 10 are coded as Experienced, and values over 10 are coded as Mature. Click OK.
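The recoding rule itself is simple enough to sketch in a few lines. The Python below is an illustration only; it assumes years on the job are recorded to one decimal place, so that "between 0 and 4.9" can be written as "less than 5". The function name job_tenure and the sample values are hypothetical.

```python
def job_tenure(years):
    """Recode years on the job into the three JOBTEN categories."""
    if years < 5:
        return "New"          # 0 to 4.9 years
    if years <= 10:
        return "Experienced"  # 5 to 10 years
    return "Mature"           # more than 10 years

# Hypothetical YRONJOB values, chosen to sit on and around the cut points.
sample = [0.5, 4.9, 5.0, 10.0, 10.1]
print([job_tenure(y) for y in sample])
```

Testing values right at the cut points (4.9, 5, 10) is a good habit with any recode, because that is where a mistyped range shows up.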
Your Data Window should now have the new text variable JOBTEN in it:
Now we can easily generate a frequency table for the new variable JOBTEN. Once more go to the Stat menu, select Tables>Tally Individual Variables. Select the variable JOBTEN. Click OK.
You should see output like this in your Session Window:

Tally for Discrete Variables: JOBTEN
JOBTEN         Count    Percent
Experienced       16      40.00
Mature            15      37.50
New                9      22.50
N=                40
Well Done!
6.3 Using Minitab for: Visualization/Organization of Univariate Data

As stated previously, for professional presentation or for formal documents, we recommend the use of a graphics package (e.g. Microsoft Power Point). However, Minitab (like SPSS) has some nice graphs available in the Graph menu, which can be used less formally. To replicate the pie chart developed in Chapter 2, we go to the Graph menu and select Pie Chart. Our first choice is whether we are charting raw data or values from a table. Our data is raw data (meaning that it is coming straight from the dataset), so we check this box. We then must click inside the Categorical Variables box. Once this is done, the box on the far left will fill with variable choices from our data set (yes, we know, this is one of the more annoying aspects of Minitab). Select JOBTEN as your variable for this graph.
Here is the screen just before we click OK to draw our pie chart for JOBTEN:
Select the Labels button. Under the Titles tab, give your pie chart a meaningful title.
Then select the Slice Labels tab:
Identify that you want the slices to be labeled with the Category name and the Percent (remember that the reason that we create Pie Charts is to visually represent the proportions). Select OK and OK.
You should see the following chart:

[Pie chart: Job Tenure of Employees at WidgeOne. Slices: Experienced 40.0%, Mature 37.5%, New 22.5%]
Nice! You will probably find that, of all of the packages, Minitab has the strongest graphics. (If you are saying "Hey, how do I get back to my datasheet!?", just go to Window>Plant_Survey.) If you need to create a pie chart to understand a quantitative variable (e.g. productivity) relative to a qualitative variable (e.g. Plant), in Minitab you must begin by getting some summary statistics for the quantitative variable with the qualitative one used to subset it. You can do this by selecting Stat>Basic Statistics>Store Descriptive Statistics. This process will store (save) our descriptive statistics in a table in our Data Window, so the Pie Chart command can use the results to make a chart. Replicate the window below. We are asking for statistics on PRDCTY by Plant.
Click on the Statistics button and choose the statistic Sum. Click OK.
You will see the following in your Data Window:
You can think of this information as a little table within your datasheet that tells you the total amount of (summation of) productivity attributed to each plant.
Now, you are ready to make the pie chart for this data. Choose Graph>Pie Chart. You should indicate this time that your chart values are in a table. The categorical variable for your table was named ByVar1 (change this in the worksheet if you need to). The Summary variable was named Sum1 (again change it if you need to).
Fill out the screen as below:
Click on Labels and title the chart Chart of Productivity by Plant. Warning, Warning, Warning, Will Robinson: if you do not do this, then labels made for other charts will likely display on this new one. It happens to the best of us (well, not us, but other people)! Select the Slice Labels as necessary. Click OK.
You should see a pie chart similar to the following:

[Pie chart: Chart of Productivity by Plant. Slices: Dallas 60.1%, Norcross 39.9%]
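The slice percentages are each plant's summed productivity divided by the company total. The Python sketch below rebuilds the sums as count times mean, using the PRDCTY figures reported by plant earlier in this chapter; because those means are rounded, the sums are approximations of what Minitab stores, and the function name shares is hypothetical.

```python
def shares(totals):
    """Convert category totals into percentages of the grand total."""
    grand = sum(totals.values())
    return {k: round(100 * v / grand, 1) for k, v in totals.items()}

# Approximate productivity sums: number of employees times mean PRDCTY.
totals = {"Dallas": 23 * 88.34, "Norcross": 17 * 79.49}
print(shares(totals))  # {'Dallas': 60.1, 'Norcross': 39.9}
```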
Next, we wish to replicate the bar chart from Chapter 2, which displayed the frequency count for each value of the variable JOBTEN.
Again, go to the Graph menu. This time choose Bar Chart. This action will produce:
Select the Simple chart as above and click on OK. Select JOBTEN as the categorical variable. Provide a title by clicking on the Labels button. Then click OK.
[Bar chart: Bar Chart of Job Tenure. Vertical axis: Count (0 to 18); horizontal axis: JOBTEN (Experienced, Mature, New)]
If you need to generate a different style of bar chart such as the one with horizontal bars in Chapter 2, you can play with some of the options in the Bar Chart panel. For example, to obtain horizontal bars, click on the Scale button before clicking OK. On the Axes and Tick screen select Transpose values and category scales as shown below:
Click OK and OK.
[Horizontal bar chart of Job Tenure. Vertical axis: JOBTEN (Experienced, Mature, New); horizontal axis: Count]
To create a stem and leaf display for the variable YRONJOB, go to the Graph menu and select Stem and Leaf (Notice that only the quantitative variables are available for this graphic). Select YRONJOB as your variable. Click OK.
You will get a Stem and Leaf Display like the one below in your Session Window:
Stem-and-Leaf Display: YRONJOB

Stem-and-leaf of YRONJOB  N = 40
Leaf Unit = 1.0

  2   0  01
  7   0  22233
 12   0  44555
 16   0  6777
 (8)  0  88888999
 16   1  0000111
  9   1  2333
  5   1  4445
  1   1  7
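A quick note on interpretation of this messy output: the (8) is in parentheses to signify that this row contains the median, and the 8 is the number of observations in that row. The other numbers in the first column count the observations from that row out to the nearer end of the data. Here, the values in the median row include 8.x, 8.x, 8.x, 8.x, 8.x, 9.x, 9.x, 9.x years of service.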
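To make the first column of the display less mysterious, here is a small Python sketch that rebuilds it from the leaves alone. It rests on our understanding that Minitab puts parentheses around the count for the row holding the median and otherwise reports the cumulative count from the nearer end of the data; the rule coded here is a simplification (it does not handle a median that falls between two rows), and the function name depth_column is hypothetical.

```python
def depth_column(leaf_rows):
    """Rebuild the left-hand column of a stem-and-leaf display from its leaves."""
    n = sum(len(leaves) for leaves in leaf_rows)
    middle = (n + 1) / 2  # position of the median in the sorted data
    depths, seen = [], 0
    for leaves in leaf_rows:
        before, seen = seen, seen + len(leaves)
        if before < middle <= seen:
            # Median row: show the row's own count in parentheses.
            depths.append(f"({len(leaves)})")
        else:
            # Otherwise: count from this row to the nearer end of the data.
            depths.append(str(min(seen, n - before)))
    return depths

# Leaves of the YRONJOB display, one string per row, top to bottom.
rows = ["01", "22233", "44555", "6777", "88888999", "0000111", "2333", "4445", "7"]
print(depth_column(rows))
```

Run on the YRONJOB leaves, this gives 2, 7, 12, 16, (8), 16, 9, 5, 1, the same column shown in the Minitab display.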
To get a Boxplot for YRONJOB, go to the Graph menu and select Boxplot. You will see a screen like the below where you can choose the style of Boxplot you need. Choose Simple for this first one.
Click OK. Choose YRONJOB for your variable. Click OK again.
You will see the following:
[Boxplot of YRONJOB. Vertical axis: YRONJOB, scaled from 0 to 18]
Recall that in a box plot, the box represents the middle 50% of the dataset (from Q1 at the bottom of the box to Q3 at the top), and the line inside the box represents Q2, or the median.
6.4 Using Minitab for: Visualization/Organization of Multivariate Data

Contingency tables, stacked bar charts, 100% stacked bar charts and scatter plots can be easily generated in Minitab. In order to use Minitab to reproduce the contingency table examining plant and gender from Chapter 2, simply go to the Stat menu. Choose Tables then choose Cross Tabulation and Chi Square (although we are not actually calculating the Chi Square stats, some of the information that we need is under this option). Choose Plant for your rows and Gender for your columns. The Cross Tabulations function in Minitab is quite flexible. If you wish to include more than just the frequency counts in the cells of your table, place a check next to row percents, column percents and total percents, as we have below:
Place the Plant variable in the Row position. Place the Gender variable in the Column position.
Click OK. The contingency table below should appear in your Session Window:
Tabulated statistics: Plant, Gender

Rows: Plant   Columns: Gender

            Female      Male       All
Dallas          13        10        23
             56.52     43.48    100.00
             65.00     50.00     57.50
             32.50     25.00     57.50
Norcross         7        10        17
             41.18     58.82    100.00
             35.00     50.00     42.50
             17.50     25.00     42.50
All             20        20        40
             50.00     50.00    100.00
            100.00    100.00    100.00
             50.00     50.00    100.00

Cell Contents:   Count
                 % of Row
                 % of Column
                 % of Total
Notice the key at the bottom indicating that the cell contents have the count (frequency) on the top, followed by row percents, column percents and total percents. Wow, look how much output was created in a single table! That was so much easier than Excel! The output table contains the conditional probabilities described in Chapter 2. In the first cell (the intersection of Female and Dallas) we have four pieces of information. We know that there are 13 women who work in Dallas. We know that of all of the Dallas employees, 56.5% are female. We know that of all of the women, 65% are in Dallas. Finally, we know that of all employees, 32.5% are females in Dallas.
If you need to subset this information further (e.g., by Job Tenure), there is an easy way to do that. Go back to the Stat>Tables>Cross Tabulation and Chi-Square screen. This table will be a little busy, so let's just choose the Counts this time. Make your selections of the three variables as follows:
Click OK.
The table below will appear in the Session Window:

Tabulated statistics: Plant, Gender, JOBTEN

Results for JOBTEN = Experienced

Rows: Plant   Columns: Gender

            Female    Male    All
Dallas          3       5       8
Norcross        3       5       8
All             6      10      16

Cell Contents:   Count

Results for JOBTEN = Mature

Rows: Plant   Columns: Gender

            Female    Male    All
Dallas          6       3       9
Norcross        2       4       6
All             8       7      15

Cell Contents:   Count

Results for JOBTEN = New

Rows: Plant   Columns: Gender

            Female    Male    All
Dallas          4       2       6
Norcross        2       1       3
All             6       3       9
Notice that the same information on Plant and Gender counts has now been provided by each level of Job Tenure: Experienced, Mature and New (the levels are reported in alphabetical order rather than by order of magnitude). The stacked bar charts developed in Chapter 2 can be easily developed in Minitab. Start in the Graph menu. Choose the option Bar Chart. You will see:
Make sure you choose the Stacked option as shown and click OK. Then select the variables so that the category axis is Plant and the bars are stacked by Gender. This is done by selecting Gender last and making sure the stack categories of last categorical variable box is checked. See below:
Add a title if you like. Then, as always, click OK.
[Stacked bar chart: Plant by Gender. Vertical axis: Count (0 to 25); horizontal axis: Plant (Dallas, Norcross); bars stacked by Gender (Female, Male)]
The 100% Stacked Bar Chart is a little less straightforward to generate. Select Graph>Bar Chart>Stack>OK as before. Assign the variables Plant and Gender as before. Then select Chart Options. You will see the following screen:
To generate the 100% calibration of the bars within each plant value, set the Y-axis to be shown as a % value and accumulate the values within each category.
Select OK.

[100% stacked bar chart: Chart of Plant, Gender. Vertical axis: Percent (0 to 100), percent within levels of Plant; horizontal axis: Plant (Dallas, Norcross); bars stacked by Gender (Male, Female)]
The last multivariate visualization technique is the scatter plot. Again, Minitab provides us with flexibility to subset our analysis if needed. Consider the relationship between Job Satisfaction and Productivity as we did with SPSS in Chapter 5. This plot can be replicated in Minitab by going into the Graph menu and choosing Scatterplot. A choice of types of Scatterplots follows.
Choose the Simple scatter plot:
Then click on OK.
Next, choose PRDCTY for the Y-axis variable and JOBSAT for the X-axis variable as shown below:
Click OK. Click on the Labels button and add an appropriate title. Click OK and OK.
Here is the associated output:

[Scatterplot: Productivity versus Job Satisfaction. Vertical axis: PRDCTY (70 to 100); horizontal axis: JOBSAT (5 to 9)]
As we saw in Chapter 5, there is a slightly positive relationship between these two variables; it appears as if Job Satisfaction and Productivity are related.
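One common way to put a number on "slightly positive" is the correlation coefficient, which is not computed in this chapter. The Python sketch below shows the calculation on a handful of made-up values; the data are hypothetical and are not taken from the WidgeOne dataset, and the function name pearson_r is our own.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, for illustration only.
jobsat = [5.0, 6.0, 7.0, 8.0, 9.0]
prdcty = [74.0, 82.0, 80.0, 91.0, 95.0]
print(round(pearson_r(jobsat, prdcty), 2))
```

A value near +1 means the points hug a line rising from SW to NE, a value near 0 means no linear pattern, and a value near -1 means a line falling from NW to SE.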
6.5 Using Minitab for: Random Number Generation and Simple Random Sampling
Like the other software applications, Minitab will generate random numbers using the internal clock in the computer. As a result, every time a command is given to Minitab to generate some set of random numbers, a different set of random numbers will be generated. The software normally chooses its own starting point for the generation process by using the time of day to choose a random starting point in the string. Sometimes, however, you may wish to control where Minitab starts its string. For example, you may wish to repeat a sequence by generating the same set of random data. In this case, the BASE command tells the random number generator where to start. The generator will use this base until you set a new BASE or exit Minitab. If you need to set the base number so you can replicate your results, simply go to the Calc menu. Choose the Set Base option. You should see the following screen:
Here, we have not chosen a base. We could have chosen a positive integer as our base. In doing so, we could replicate our results anytime we wish to do so by going back in and resetting the base to that value.
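The effect of fixing a base can be illustrated outside Minitab. The following is a minimal Python sketch, not Minitab code: it assumes Python's standard random module as a stand-in for Minitab's generator, and the base values 12345 and 54321 and the helper name uniform_column are hypothetical. It shows that the same base reproduces the same 40 uniform values, while a different base gives a different set.

```python
import random

def uniform_column(base, n=40):
    """Return n pseudo-random values drawn uniformly from [0, 1)."""
    rng = random.Random(base)  # fixing the base (seed) fixes the sequence
    return [rng.random() for _ in range(n)]

first = uniform_column(12345)    # hypothetical base value
second = uniform_column(12345)   # same base, so the same 40 numbers
other = uniform_column(54321)    # different base, different numbers

print(first == second)  # True
print(first == other)   # False
```

The numbers themselves will not match the ones Minitab produces for the same base, since the two programs use different generators; only the idea of replication carries over.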
To create a string of random numbers that is uniformly distributed between 0 and 1, go to Calc>Random Data>Uniform. You may note here that Minitab has a lengthy list of distributions, as did SPSS, which can be used to generate random samples. Indeed, this sort of procedure is quite easy and versatile with this software. We will generate 40 values from the uniform distribution, one value for every observation. We will name our new data column Group. Every distribution has parameters that must be specified. For the uniform distribution, the only parameters are the two values between which we want our random numbers to fall. We choose these values to be 0 and 1. Fill in your window like the one below:
Click OK. The new variable Group should appear in your Data Window. Here is what a typical result would look like:
Remember that your results will vary since this variable was randomly generated. One of the primary reasons for generating random numbers is to assign observations into statistically independent groups. Using the random numbers, let's assign the 40 observations into 3 groups. Here is one way to accomplish this: Choose Data>Code>Numeric to Text.
The Code Numeric to Text screen should look like this:
Note that since the distribution of the random numbers is uniform, every interval of the same width between 0 and 1 is equally likely to contain a random value. This is very useful information for assignment of groups. If you are interested in assigning groups of approximately equal size, then you should allocate the values from 0 through .33 to one group, the values above .33 through .66 to another, and the remaining values to the third. If you want the first group to have approximately 25% of the population, then allocate the random values of 0 through .25 to the first group, etc. Cool.
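The cut-point logic described above can be sketched in Python. This is an illustration under stated assumptions, not what Minitab runs internally: the function name assign_group, the seed 2024, and the exact cut points 1/3 and 2/3 (in place of the rounded .33 and .66) are all hypothetical choices.

```python
import random

def assign_group(u, cutoffs=(1 / 3, 2 / 3)):
    """Map a uniform value u in [0, 1) to group 1, 2, ... by cut point."""
    for group, upper in enumerate(cutoffs, start=1):
        if u < upper:
            return group
    return len(cutoffs) + 1

rng = random.Random(2024)                    # hypothetical seed, for repeatability
values = [rng.random() for _ in range(40)]   # one uniform value per observation
groups = [assign_group(u) for u in values]
counts = {g: groups.count(g) for g in (1, 2, 3)}
print(counts)   # group sizes are only approximately equal
```

Passing cutoffs=(0.25,) instead would place roughly 25% of the observations in the first group and the rest in the second.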
You should see something like the following in your Data Window:
Again, remember results will vary due to the randomness (pun intended) of this procedure. This procedure has taken the 40 observations and assigned them into 3 groups based upon the random numbers created in the previous procedure. Each of the 40 employees is now in one of these randomly assigned, independent groups. Because this process of selecting a random sample from a set of data is so common, there is a very straightforward way to accomplish this in Minitab. Because we will be subsetting this dataset, go ahead NOW and save the Minitab file that you are working in File>Save Current Worksheet As (this will allow you to save it as an Excel spreadsheet again). Now, suppose we wish to select a simple random sample of 30 individuals from this dataset. Go to Calc>Random Data>Sample from Columns.
You should see a screen like the following:
Specify that you wish to select a random sample of 30 cases. In the From columns box, identify all of the variables. Then, identify all of the variables again for the Store samples in box. This will effectively take a random sample of size 30 from our dataset and discard the observations that were not selected (did you save your file?). Select OK. You should now be left with 30 observations.
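For readers who want to see the idea behind Sample from Columns, here is a minimal Python sketch of a simple random sample drawn without replacement. It is an assumption-laden illustration: the row numbers 1 to 40 stand in for the 40 observations, and the seed 123 is a hypothetical value used only so the draw can be repeated.

```python
import random

rows = list(range(1, 41))                    # row numbers of the 40 observations
rng = random.Random(123)                     # hypothetical seed so the draw can be repeated
sample_rows = sorted(rng.sample(rows, 30))   # 30 distinct rows, no repeats
dropped = sorted(set(rows) - set(sample_rows))

print(len(sample_rows), len(dropped))  # 30 10
```

Every row has the same chance of being selected, and the 10 rows that were not drawn correspond to the observations that are discarded.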
6.6 Using Minitab for: Confidence Intervals

Generating confidence intervals in Minitab is very easy. For example, if we wish to compute a 95% confidence interval for the mean Job Satisfaction rating of all employees, we would go to Stat>Basic Statistics>One-Sample T (see footnote 13). We would see the following screen on which we could choose the variable(s) we want to include in our analysis. This time we choose JOBSAT as the Test Variable:
Footnote 13: T-tests are very common tests used to determine if two sample means differ significantly or if one sample mean differs from some established value. For more detailed information on T-tests, we suggest Statistical Methods and Data Analysis by Ott and Longnecker.
Click on the Options button. The default setup is the following. This selection will produce a complete 95% confidence interval.
Click OK and then OK. You will see the following output in your Session Window:
One-Sample T: JOBSAT

Variable   N     Mean    StDev  SE Mean        95% CI
JOBSAT    40  6.85000  1.02081  0.16140  (6.52353, 7.17647)
As stated in previous chapters, these results would be reported as: Based on a representative sample of 40 employees, we are 95% confident that the mean job satisfaction among all employees is between 6.52 and 7.18.
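The interval in the Session Window output can be reproduced by hand from the summary statistics. The Python sketch below is a cross-check, not Minitab's own routine; the critical value 2.0227 is an assumption taken from a t table (two-sided 95%, 39 degrees of freedom), because Python's standard library has no t distribution.

```python
import math

n = 40            # N from the output
mean = 6.85       # Mean from the output
stdev = 1.02081   # StDev from the output
t_crit = 2.0227   # assumed: two-sided 95% critical value of t with 39 df, from a t table

se_mean = stdev / math.sqrt(n)    # standard error of the mean
margin = t_crit * se_mean         # half-width of the interval
lower, upper = mean - margin, mean + margin

print(round(se_mean, 4))                  # 0.1614
print(round(lower, 2), round(upper, 2))   # 6.52 7.18
```

The standard error matches the SE Mean column, and the limits agree with the reported interval to two decimal places.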
The 95% confidence level describes the method: if we drew many samples of 40 employees and built an interval from each one in the same way, about 95% of those intervals would contain the true mean job satisfaction of all employees, which is unknown. About 5% of them would miss it, with the true mean falling below the lower limit or above the upper limit. Another option here, which is only available for the 95% Interval (the most common), is the Interval Chart. Let's look at the 95% Interval graphic for Job Satisfaction by Plant. To do this, go to Graph>Interval Plot. Since we have one quantitative variable (Job Satisfaction) that we want to evaluate by two groups within a qualitative variable (Plant), select One Y With Groups:
Select OK. Assign JobSat to the Graph variable and Plant to the Categorical variables for grouping:
Add a Title to your graphic as appropriate through the Labels button. Select OK.
You should see the following Plot:

[Figure: "Interval Plot of JOBSAT", 95% CI for the Mean. Vertical axis: JOBSAT (6.0 to 7.6); horizontal axis: Plant (Dallas, Norcross).]
Now that's what I'm talking about!
6.7 Minitab Lagniappe

When you create graphics in Minitab, such as a bar graph, you may have noticed that the bars are ordered alphabetically. For example, our Job Tenure Bar graph looked like this:
[Figure: horizontal bar chart of Count (axis ticks from 10 to 18) by JOBTEN, with the bars in alphabetical order: Experienced, Mature, New.]
There is nothing really wrong with this, but it would be better if we could order the bars in a more logical way like New/Experienced/Mature.
To reorder the values, go back to the Plant_Survey sheet. Click on any value in the JobTen column. Now, right click. Select Column>Value Order:
You should now see this screen:
Reorder these values manually into your preferred order, then click OK.
Now your graph looks like this:

[Figure: horizontal bar chart of Count (axis ticks from 10 to 18) by JobTen, with the bars now ordered New, Experienced, Mature.]
Chapter 7 SAS
What is SAS?
Unlike the other three packages, SAS uses a programming language (really a library of pre-written statistical algorithms) to execute analysis. If you are not a programmer at heart, do not let your heart be troubled: we have provided all of the code that you need to execute the necessary commands to generate the prescribed output. After a few successful executions, you will be able to generate your own programs! Of the four packages, we acknowledge that SAS typically represents the greatest challenge for students. However, SAS is the most analytically comprehensive, the most widely used and the most flexible of the statistical software applications.
Getting data into SAS

Prior to actually executing any of the statistical concepts from Chapter 2, we first need to get the WidgeOne.xls dataset into the SAS system and convert it into a SAS file.
Let's start by getting oriented with the SAS interface for Windows:
[Screenshot: the SAS interface for Windows, with callouts marking the Log Window, the Editor Window, the Explorer and Results Windows, and the buttons used to move among the Log, Editor and Output (not shown) Windows.]
After you launch SAS, you have access to five SAS windows: the Editor, Log, Output, Explorer and Results windows. The Output and Results windows are not visible when you first start SAS. Of these windows, we will be focused on the Editor, Log and Output windows.

The Editor window
The Editor window is where the programs will be written. In a typical SAS session, this is where you will spend most of your time.

The Log window
The Log is a file generated by SAS that contains your SAS program (in black), along with a listing of notes (in blue or green), error messages (in red) and other information pertaining to your program. After every execution, you should get into the habit of checking your Log window to ensure that the program ran correctly, regardless of how the output looks.

The Output window
The Output window contains the results of your analysis generated by your program. We won't be using much of the Explorer and Results windows. For additional information on using the SAS software, we recommend Step-by-Step Basic Statistics Using SAS by Larry Hatcher.

SAS programs have two parts: the Data Step and the Procedure or Proc Steps. Data Steps are used to manipulate data, add, change or delete variables or observations, format data, etc. Statistical analysis is not done within the context of a Data Step, and Data Steps typically do not generate any output. Procedure steps begin with the term Proc followed by a SAS-defined command, which will execute a specific algorithm. Procedure steps are where we will do most of our analysis, and they typically do generate output. All SAS statements end with a semicolon (most of your errors in the beginning will be because you did not insert a semicolon).
The good news is that SAS is generally not case sensitive and is smart enough to correct some of the most common mistakes (except for the semicolon omission). If you make a programming error, the Log will provide you with guidance regarding where the problem is and how to fix it. Let's get started. Once you have launched the SAS System, maximize the Editor Window by clicking on the Maximize Button in the upper right corner. Then close the Explorer and Results Windows by clicking on the Close Button in the upper right corner of the Explorer Window.
Now, your SAS interface should look like this:
Within the context of the Editor Window, any SAS statement that starts with an asterisk and ends with a semicolon will be ignored by SAS, but it can be very useful for the programmer. We will provide you with these comments (notes written after an asterisk) to help explain the logic behind the code. Before we start, save WidgeOne.xls somewhere convenient where you can easily access it. To get the data into SAS, we will execute a Wizard (this is the only one that you will get in SAS, so enjoy it). Select File>Import Data:
The first step in the Wizard will ask for the type of file that you are importing. It should default to Excel (don't worry about the version). Click Next. You will then get a browse box: SAS wants you to point to where the Excel file is saved. Easy enough, but if you are running SAS from a remote location (as would be the case with Citrix), please remember that your drive names change. For example, your C:\ drive will be read as your V:\ drive:
The C$ on Client(V:) is my local C:\ drive. The T: and U: drives are USB ports where a flash drive might be. Citrix can see the flash drives as long as the flash drive was inserted into the port BEFORE Citrix was accessed.
Once you have located your file, select Next. The next step in the Wizard will ask you which table you want to import. These are the various sheets in the workbook. Select the Plant_Survey sheet. The next step in the Wizard will look like this:
Basically, SAS just wants you to give the new SAS file a name. Enter the name WidgeOne in the box called Member. Select Next. On the last step of the Wizard, SAS will ask you if you want to save the code that it just created for you. Just ignore this and select Finish.
You will be back at your Editor screen, which is blank. It may not seem like it, but SAS just did a lot of stuff. It accessed the WidgeOne.xls file, isolated the Plant_Survey worksheet and saved it as a SAS file that we can now access. Let's take a look at the file that we just created. In your Editor window, type the following SAS code (if you are accessing this electronically, you can just copy and paste):

Proc Print data = WidgeOne;
Run;
This is a very simple but representative set of code in SAS. The PROC PRINT command is a procedure command telling SAS to print the data, not to a printer, but to the screen so that we can see it. The DATA= part of the statement is telling SAS which dataset to print. Notice that the statement ends in a semicolon. All SAS statements end with a semicolon. The final statement in this module of code is Run;. All of your code will end with a Run statement. To SAS, this is like a period at the end of a sentence.
After you have this typed into your Editor screen, click on the little running man in the tool bar:
If everything ran successfully, you should now see this:
This is the WidgeOne dataset as a SAS file. You see the bottom of the file; use the scroll bar on the right to move to the top of the file. Now you should see this:
Now, look at your Log window: click on the middle button at the bottom of the screen. You should see this:
Congratulations, you successfully ran your first SAS program. Piece of cake! Take a minute, go get a soft drink, and celebrate! From this point forward, we will simply provide you with the code necessary to execute the specified statistical analysis, followed by the appropriate output. We will include comments in the code in such a way that you can copy it EXACTLY the way it is written here into your Editor Window.
7.1 Using SAS for: Measurements of Central Tendency

Here is the code to execute mean and median for quantitative variables in SAS:

*Proc Means will provide two of the three measurements of central tendency - Mean and Median...these measurements must be specified. The "Var" statement tells SAS which variables you are interested in - remember that mean and median are only relevant for quantitative variables;
Proc Means data=WidgeOne Mean Median Max Min;
Var JobSat Prdcty YRONJOB SOCREL JOBGRADE;
Run;
Output:
What if we are only interested in a subset of the data? For example, what if we wanted to know the measurements of central tendency of these variables by gender and by plant? We simply need to add a Class statement to the code (the Class statement is the only additional statement):

Proc Means data=WidgeOne Mean Median Max Min;
Var JobSat Prdcty YRONJOB SOCREL JOBGRADE;
Class Plant Gender;
Run;
This code will generate the following output:
At this point, we are not going to cover how to convert the F and M values into proper names, since it might be confusing at this stage. It will be covered in the Lagniappe. As a rule, we do not use the mode as a Measurement of Central Tendency with quantitative data. If the data is qualitative (Plant, Gender, Position), it is the ONLY Measurement of Central Tendency available. Here is the code used to determine frequency counts (and mode) for qualitative variables (see footnote 14):

*Proc Freq is the SAS command to use when determining the frequency counts for qualitative data;
Proc Freq data=WidgeOne;
Tables Plant Gender Position;
Run;

Footnote 14: This was the late Rick James' favorite Proc in SAS.
Here is the output:
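To see what a frequency table and mode amount to, here is a small Python sketch. It is not SAS code; the Plant values are rebuilt from the plant totals that appear later in this chapter (23 Dallas employees and 17 Norcross employees), and the variable names are illustrative only.

```python
from collections import Counter

# Plant values rebuilt from the totals reported in this chapter:
# 23 Dallas employees and 17 Norcross employees.
plant = ["Dallas"] * 23 + ["Norcross"] * 17

freq = Counter(plant)
n = sum(freq.values())
for level, count in sorted(freq.items()):
    print(level, count, round(100 * count / n, 1))   # level, frequency, percent

mode_level, mode_count = freq.most_common(1)[0]
print("Mode:", mode_level)   # Mode: Dallas
```

The mode is simply the level with the largest frequency count, which is why a frequency table is all that is needed for qualitative variables.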
7.2 Using SAS for: Measurements of Dispersion

To represent the dispersion of a quantitative variable (Measurements of Dispersion are not relevant for qualitative variables), we report the standard deviation. To do this in SAS, we go back to the Proc Means code and add another option:

Proc Means data=WidgeOne Mean Median Max Min STD;
Var JobSat Prdcty YRONJOB SOCREL JOBGRADE;
*Class Plant Gender;
Run;

Notice the addition of the standard deviation option (STD). Now, when we run Proc Means, we will have a fairly comprehensive set of descriptive statistics. Also notice that the Class Plant Gender; statement has been commented out of the code. This is a handy way of suppressing code that you may not want to delete, but don't want to run.
Here is the new output:
Notice the inclusion of the standard deviation on the far right; the descriptive statistics will be included in the order they are requested in the Proc Means statement. The second Measurement of Dispersion was the frequency table. When we created a frequency table for the job tenure variable, we created three categories: less than 5 years, 5-10 years and more than 10 years. To create these same categories in SAS, we need to execute a Data step:

*In the following data step, we are creating a categorized version of the YRONJOB variable for the purposes of generating a frequency table. Note that this is done through the creation of a NEW variable that will be ADDED to the dataset - we are NOT overwriting the YRONJOB variable. The new qualitative variable is JOBTEN. We should also note that we are creating a second dataset WidgeOne1. While you do not HAVE to create a new dataset, we recommend it. It is typically best to keep your original dataset in its original form in the event that you have to start your analysis from scratch;
Data WidgeOne1;
Set WidgeOne;
If YRONJOB < 5 Then JOBTEN = 'New';
Else If YRONJOB >= 5 AND YRONJOB <= 10 Then JOBTEN = 'Experienced';
Else If YRONJOB > 10 Then JOBTEN = 'Mature';
Run;

*Print the new dataset to ensure that the new variable was created properly;
Proc Print data=WidgeOne1;
Run;

*Now we run the frequency table on JobTen...which will provide us with the dispersion of the YRONJOB variable across the specified categories;
Proc Freq data=WidgeOne1;
Tables JobTen;
Run;

Note in the Data step that we named the new dataset WidgeOne1, and this new dataset was based on the original dataset WidgeOne (see the Set statement). Adding the number 1 to the end of a dataset name and then incrementing it by one every time you alter the dataset is a convenient way to keep track of your manipulations. Also note that SAS sets the length of a new character variable from its first assignment in the code. Because 'New' has three characters, the longer values are stored as 'Exp' and 'Mat', which is how they appear in the output in this chapter; placing a statement such as Length JOBTEN $ 11; before the If statements would keep the full labels. Here is the associated output:
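The categorization logic in the Data step above can be sketched in Python for readers who want to check the cut points by hand. This is an illustration, not a translation that SAS uses; the function name job_tenure is hypothetical, and the four test values are YRONJOB values that appear in output elsewhere in this chapter.

```python
def job_tenure(yronjob):
    """Categorize years on the job using the same cut points as the Data step."""
    if yronjob < 5:
        return "New"
    elif yronjob <= 10:
        return "Experienced"
    else:
        return "Mature"

# YRONJOB values taken from output shown in this chapter.
for years in (0.1, 5.0, 10.1, 17.0):
    print(years, job_tenure(years))
```

Note that exactly 5 years falls in the Experienced category and anything above 10 years, such as 10.1, falls in the Mature category.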
7.3 Using SAS for: Visualization/Organization of Univariate Data

As stated previously, for professional presentation or for formal documents, we recommend the use of a graphics package (e.g., Microsoft PowerPoint). However, we have provided the code necessary to create basic graphics in SAS. To replicate the pie chart developed in Chapter 2, execute the following code:

*This code will produce a pie chart, where the qualitative variable listed after the "pie" command will identify the slices. Here, we have simply asked for the percentages (type = pct) for each value of "JOBTEN";
proc gchart data=WidgeOne1;
pie JOBTEN / type = pct;
legend;
Run;
Quit;
Using this code, you will create the following output:
If you need to create a pie chart to understand a quantitative variable (e.g., productivity) relative to a qualitative variable (e.g., Plant), the code would be modified appropriately:

*This code will produce a pie chart, where the qualitative variable listed after the "pie" command will identify the slices. Here, we have asked for the percentage of total productivity by Plant;
proc gchart data=WidgeOne1;
pie Plant / sumvar=prdcty percent=Inside;
legend;
Run;
Quit;
To replicate the bar chart in Chapter 2, execute the following code:

*This code will produce a Bar Chart, where the qualitative variable listed after the "HBAR" command will identify the categories for the bars. Here, we have simply asked for the frequency count (type = freq) for each value of "JOBTEN";
proc gchart data=WidgeOne1;
HBAR JOBTEN / type = freq;
legend;
Run;
Quit;

You can change this graphic to be vertical by substituting VBAR for HBAR.
The histogram, stem-and-leaf and the box plot are all generated in SAS using the same command, Proc Univariate:

*To create Stem and Leaf and Box Plots in SAS, we use the Proc Univariate command with the "plots" option. This procedure is only valid for quantitative variables. So, for Job Tenure, we will reference the YRONJOB variable;
Proc Univariate data=WidgeOne1 plots;
Var YRONJOB;
Histogram;
Run;
This procedure in SAS is a very comprehensive univariate analysis. Notice that Proc Univariate provides everything Proc Means provided, and more. Key results from the output screen have been highlighted:

The UNIVARIATE Procedure
Variable: YRONJOB (YRONJOB)

Moments
N                 40            Sum Weights        40
Mean              8.29          Sum Observations   331.6
Std Deviation     4.25656657    Variance           18.118359
Skewness          -0.0806474    Kurtosis           -0.7479985
Uncorrected SS    3455.58       Corrected SS       706.616
Coeff Variation   51.345797     Std Error Mean     0.67302227

Basic Statistical Measures
Location                Variability
Mean      8.290000      Std Deviation          4.25657
Median    8.350000      Variance              18.11836
Mode      8.000000      Range                 16.90000
                        Interquartile Range    6.10000

NOTE: The mode displayed is the smallest of 3 modes with a count of 3.

Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t   12.31757     Pr > |t|     <.0001
Sign           M         20     Pr >= |M|    <.0001
Signed Rank    S        410     Pr >= |S|    <.0001

Quantiles (Definition 5)
Quantile      Estimate
100% Max         17.00
99%              17.00
95%              14.55
90%              14.00
75% Q3           11.10
50% Median        8.35
25% Q1            5.00
10%               2.05
5%                1.50
1%                0.10
0% Min            0.10

Extreme Observations
----Lowest----     ----Highest---
Value    Obs       Value    Obs
  0.1      1        14.0     36
  1.0      2        14.0     37
  2.0      4        14.1     38
  2.0      3        15.0     39
  2.1      5        17.0     40

Here are the Stem and Leaf and Box Plots:

Stem Leaf      #
  17 0         1
  16
  15 0         1
  14 001       3
  13 000       3
  12 1         1
  11 011       3
  10 0115      4
   9 000       3
   8 00016     5
   7 016       3
   6 1         1
   5 007       3
   4 01        2
   3 00        2
   2 001       3
   1 0         1
   0 1         1
     ----+----+----+----+

[Boxplot: character box plot of YRONJOB printed beside the stem-and-leaf display.]

[Normal Probability Plot: character plot of YRONJOB with the vertical axis running from 0.5 to 17.5 and the horizontal axis from -2 to +2.]
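Several entries in the Moments table above are tied together by simple formulas, and they can be cross-checked from just three of the reported numbers. The Python sketch below is our own arithmetic check, not part of the SAS output; it starts from N, Sum Observations and Uncorrected SS and rebuilds the other quantities.

```python
import math

n = 40
total = 331.6               # Sum Observations
uncorrected_ss = 3455.58    # Uncorrected SS (sum of squared values)

mean = total / n
corrected_ss = uncorrected_ss - n * mean ** 2   # sum of squared deviations from the mean
variance = corrected_ss / (n - 1)               # sample variance
std_dev = math.sqrt(variance)
std_err = std_dev / math.sqrt(n)                # Std Error Mean
coeff_var = 100 * std_dev / mean                # Coeff Variation, in percent
t_stat = mean / std_err                         # Student's t for Mu0 = 0

print(round(mean, 2), round(corrected_ss, 3), round(variance, 4))
print(round(std_dev, 4), round(std_err, 4), round(coeff_var, 2), round(t_stat, 2))
```

The results agree with the Mean, Corrected SS, Variance, Std Deviation, Std Error Mean and Coeff Variation entries, and with the Student's t statistic in the Tests for Location table.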
The histogram is found in the graphic output; you may need to scroll down to the bottom.
[Figure: histogram with Percent on the vertical axis (0 to 30) and PRDCTY on the horizontal axis (70 to 95).]
Another option, which is a bit more surgical, for creating a boxplot is to run:
Proc Sort data=WidgeOne;
By Position;
Run;

Proc Boxplot data=WidgeOne;
Plot Jobsat*Position;
Run;
This will produce side-by-side boxplots of employee Job Satisfaction by Position:

[Figure: side-by-side boxplots of JOBSAT (vertical axis, 5 to 9) by POSITION (HRLY, MGT).]
7.4 Using SAS for: Visualization/Organization of Multivariate Data

As with the other packages, contingency tables, stacked bar charts, 100% stacked bar charts and scatter plots can be easily generated. However, we cannot emphasize enough that these are analytical packages, not graphics packages. SAS provides great flexibility in the creation of contingency tables. Here is the basic code:

*To create a contingency table, simply run the Proc Freq command. Identify the variables of interest in the Tables statement separated by an *. Then add "by" statements where necessary to subset the analysis. Note - whenever you add a "by" statement, you will need to sort the data by that variable first;
Proc Sort data=WidgeOne1;
by JobTen;
Run;

Proc Freq data=WidgeOne1;
Tables Plant*Gender;
Run;
Notice that the conditional percentages that were discussed in Chapter 2 are already embedded in this contingency table. Look at the legend in the upper left corner of the matrix. This legend provides a guide to the numbers in the matrix. In the first cell, the intersection of Female and Dallas, we have four pieces of information. We know that there are 13 women who work in Dallas. We know that of all of the Dallas employees, 56.5% are female. We know that of all of the women, 65% are in Dallas. Finally, we know that of all employees, 32.50% are females in Dallas. If you need to subset this information further (e.g., by Job Tenure), add a by statement to the code:

Proc Freq data=WidgeOne1;
Tables Plant*Gender;
By Jobten;
Run;
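The four numbers in the Female/Dallas cell described above can be cross-checked with a few lines of Python. This is our own sketch, not SAS output: the count of 13 Dallas women comes from the text, and the other three cell counts are derived from the quoted percentages (23 Dallas employees, 20 women and 40 employees in total). The function name cell_stats is hypothetical.

```python
# Cell counts for the Plant by Gender table. 13 comes from the text; the
# other cells are derived from the percentages quoted there.
counts = {
    ("Dallas", "F"): 13, ("Dallas", "M"): 10,
    ("Norcross", "F"): 7, ("Norcross", "M"): 10,
}

def cell_stats(plant, gender):
    """Return frequency, overall percent, row percent and column percent."""
    freq = counts[(plant, gender)]
    total = sum(counts.values())
    row_total = sum(v for (p, _), v in counts.items() if p == plant)
    col_total = sum(v for (_, g), v in counts.items() if g == gender)
    return (freq,
            round(100 * freq / total, 2),
            round(100 * freq / row_total, 2),
            round(100 * freq / col_total, 2))

print(cell_stats("Dallas", "F"))   # (13, 32.5, 56.52, 65.0)
```

The row percent divides by the plant total and the column percent divides by the gender total, which is the same order used in the SAS legend (Frequency, Percent, Row Pct, Col Pct).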
JOBTEN=Exp

The FREQ Procedure
Table of Plant by Gender

Plant      Statistic    F        M        Total
Dallas     Frequency    3        5        8
           Percent      18.75    31.25    50.00
           Row Pct      37.50    62.50
           Col Pct      50.00    50.00
Norcross   Frequency    3        5        8
           Percent      18.75    31.25    50.00
           Row Pct      37.50    62.50
           Col Pct      50.00    50.00
Total      Frequency    6        10       16
           Percent      37.50    62.50    100.00

JOBTEN=Mat

The FREQ Procedure
Table of Plant by Gender

Plant      Statistic    F        M        Total
Dallas     Frequency    6        3        9
           Percent      40.00    20.00    60.00
           Row Pct      66.67    33.33
           Col Pct      75.00    42.86
Norcross   Frequency    2        4        6
           Percent      13.33    26.67    40.00
           Row Pct      33.33    66.67
           Col Pct      25.00    57.14
Total      Frequency    8        7        15
           Percent      53.33    46.67    100.00

JOBTEN=New

The FREQ Procedure
Table of Plant by Gender

Plant      Statistic    F        M        Total
Dallas     Frequency    4        2        6
           Percent      44.44    22.22    66.67
           Row Pct      66.67    33.33
           Col Pct      66.67    66.67
Norcross   Frequency    2        1        3
           Percent      22.22    11.11    33.33
           Row Pct      66.67    33.33
           Col Pct      33.33    33.33
Total      Frequency    6        3        9
           Percent      66.67    33.33    100.00

Notice that the same information on Plant and Gender counts has now been provided by each level of Job Tenure: Experienced, Mature and New (the levels are reported in alphabetical order). The stacked charts developed in Chapter 2 can be easily developed in SAS using the same code as was used with the single variable analysis, with the addition of a subgroup statement:
*To create a stacked chart using two variables, simply use the same code as before when analyzing a single variable and add a "subgroup" statement;
proc gchart data=WidgeOne1;
HBAR Plant / subgroup=gender;
legend;
Run;
Quit;
To create a 100% stacked bar chart, you need to run a similar set of statements, with a few changes:
Proc gchart data=WidgeOne;
HBAR Plant / subgroup = Gender type=pct group = Plant nozero g100 gaxis=axis1;
Run;
Quit;
[Figure: 100% stacked horizontal bar chart of PERCENT (0 to 100) by Plant, subdivided by Gender. The statistics printed beside the two bars are FREQ. 23 and 17, CUM. FREQ. 23 and 17, PCT. 100 and 100, and CUM. PCT. 100 and 100.]
The last multivariate visualization technique is the scatter plot. Again, SAS provides us with flexibility to subset our analysis if needed. Consider the Job Tenure and Productivity plot in Chapter 2. This plot can be replicated in SAS using the following code:

*To create a scatter plot using two quantitative variables use the Proc plot command...the first variable stated will appear on the y-axis...typically this is the dependent variable;
Proc Plot data=WidgeOne1;
Plot Prdcty*Yronjob;
Run;

Here is the associated output:
[Figure: scatter plot with PRDCTY on the vertical axis (60 to 100) and JOBSAT on the horizontal axis (5 to 9).]
7.5 Using SAS for: Random Number Generation and Simple Random Sampling
Like the other software applications, SAS can generate random numbers using the internal clock in the computer. When the ranuni function is given a seed of 0, the clock supplies the starting point, so every time the code is run, a different set of random numbers will be generated. However, sometimes we may need to replicate a set of random numbers exactly the way they were previously generated. To accomplish this replication, SAS allows the analyst to define a positive seed number that will ensure a consistent set of random numbers; the numbers still behave as random numbers and can be used to ensure statistical independence of samples. The random numbers follow a uniform distribution of outcomes that lie between 0 and 1. Here is the code:

*To create a string of random numbers, use the ranuni function. Because this process is effectively creating a new variable (of random numbers), this process is executed within the context of a Data step;
Data WidgeOne2;
Set WidgeOne1;
Group = ranuni(123456);
run;

Proc Print data=WidgeOne2;
run;

The number inside the parentheses (123456) is an arbitrary number that is provided as the seed.
Here is the output:
One of the primary reasons for generating random numbers is to assign observations into statistically independent groups. Using the random numbers, let's assign the 40 observations into 3 groups. Here is the code that would execute this:

*This code will take the 40 observations and assign them into 3 groups based upon the random numbers created in the data step above. The out= option creates a new file that has these assignments;
Proc Rank data=WidgeOne2 Groups=3 Out=Samples;
Var Group;
Run;

*Notice in this code, we are referencing the "Samples" file;
Proc Print data=Samples;
Run;
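The grouping step can be sketched in Python to show why the three groups come out nearly equal in size. This is a sketch under an assumption: it uses the rule floor(rank * k / (n + 1)), which is our understanding of how the Groups= option assigns group values, so treat it as illustrative rather than as the definitive SAS algorithm. The function name rank_groups is hypothetical, and the random values come from Python's generator, not from ranuni, so individual assignments will differ from the SAS output.

```python
import math
import random

def rank_groups(values, k):
    """Group values into k bins by rank, using floor(rank * k / (n + 1))."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    groups = [0] * n
    for rank, i in enumerate(order, start=1):   # rank 1 = smallest value
        groups[i] = math.floor(rank * k / (n + 1))
    return groups

rng = random.Random(123456)                  # seed chosen for illustration
values = [rng.random() for _ in range(40)]   # Python's generator, not ranuni
groups = rank_groups(values, 3)
print([groups.count(g) for g in range(3)])   # [13, 14, 13]
```

Because the groups are formed from ranks rather than from fixed cut points, the group sizes depend only on the number of observations, not on the particular random values drawn.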
Here is the associated output for the Samples file:
Notice that individuals were assigned to group 0, 1 or 2, completely at random, based upon the random numbers that were generated in the previous step. Because this process of selection for randomization of groups is so common, there is a more parsimonious set of code within SAS to execute the same process.

*Another way to create a sample (or samples) from a population is through Proc Surveyselect;
Proc surveyselect data=WidgeOne1 out=Sample2 Method=SRS Sampsize=30 Seed=123;
Run;

Proc Print data=Sample2;
Run;

In this code, Method=SRS indicates to SAS that we are interested in using a Simple Random Sampling methodology (other statistical sampling methodologies include Stratified, Systematic and Cluster). The Sampsize= option tells SAS how big to make the sample (clearly this number cannot be larger than the total size of the dataset). Finally, the Seed= option provides SAS with a seed from which to draw the random sample. This seed is important if, for example, another analyst needs to replicate your results. If this option is deleted, a different sample will be generated every time the code is executed.
Here is the output:

The SAS System                                   08:58 Friday, May 19, 2006

Obs  Emp ID  Plant     Gender  POSITION  JOBGRADE  SOCREL  YRONJOB  PRDCTY   JOBSAT  JOBTEN
  1  011     Norcross  F       HRLY      4         6        5.0     78.2421  5.4     Exp
  2  077     Norcross  M       HRLY      5         5        5.0     76.0067  6.2     Exp
  3  088     Dallas    M       MGT       8         6        5.7     93.0348  8.1     Exp
  4  086     Dallas    M       MGT       6         5        6.1     93.2102  8.0     Exp
  5  019     Dallas    F       MGT       7         5        7.6     93.9137  8.5     Exp
  6  090     Dallas    M       MGT       9         6        8.0     93.2102  7.6     Exp
  7  069     Norcross  M       MGT       7         5        8.0     82.1495  7.1     Exp
  8  063     Norcross  M       MGT       7         4        8.6     81.0000  7.9     Exp
  9  009     Norcross  F       MGT       7         5        9.0     79.8586  5.8     Exp
 10  062     Norcross  M       MGT       8         5        9.0     86.3210  6.1     Exp
 11  024     Dallas    F       HRLY      7         6       10.1     91.8112  6.5     Mat
 12  091     Dallas    M       HRLY      9         1       10.1     83.6393  5.7     Mat
 13  006     Norcross  F       MGT       7         6       10.5     67.6506  7.9     Mat
 14  061     Norcross  M       MGT       6         0       11.0     80.1839  7.3     Mat
 15  097     Dallas    M       HRLY      7         5       11.1     90.4228  6.2     Mat
 16  100     Dallas    M       HRLY      9         5       11.1     85.9835  6.5     Mat
 17  058     Norcross  F       HRLY      6         5       12.1     91.8112  6.7     Mat
 18  010     Dallas    F       HRLY      6         6       13.0     74.9012  5.5     Mat
 19  071     Norcross  M       HRLY      4         5       13.0     78.7253  5.4     Mat
 20  078     Norcross  M       HRLY      6         4       14.0     78.0813  6.3     Mat
 21  082     Norcross  M       HRLY      5         5       14.0     78.2421  5.0     Mat
 22  028     Dallas    F       HRLY      7         7       14.1     86.9980  6.5     Mat
 23  016     Dallas    F       MGT       9         5        0.1     91.8112  8.5     New
 24  095     Dallas    M       HRLY      6         5        1.0     92.5094  6.1     New
 25  029     Dallas    F       HRLY      5         5        2.0     87.5075  6.5     New
 26  013     Norcross  F       HRLY      4         5        2.0     75.3740  5.9     New
 27  027     Dallas    F       HRLY      6         10       2.1     91.2893  6.5     New
 28  004     Norcross  F       MGT       9         6        3.0     81.8202  6.7     New
 29  094     Dallas    M       HRLY      8         6        3.0     88.0185  6.0     New
 30  075     Norcross  M       HRLY      4         5        4.0     74.9012  5.9     New
This output is a subset of 30 of the 40 observations in the dataset. If you utilized the Seed=123 option, you should have generated the exact same sample. If you did not utilize the Seed= option (or used a different seed), your sample will be different although you should still have 30 observations.
7.6 Using SAS for: Confidence Intervals

Generating confidence intervals in SAS could not be easier. We simply add an option to the Proc Means code that we used previously:

*To create confidence intervals in SAS, use the CLM option in Proc Means. CIs are only executed on quantitative data;
Proc Means data=WidgeOne CLM alpha=.05;
Var JobSat;
Run;
In this code, the CLM option creates the confidence interval for the mean of the Job Satisfaction variable. As stated in Chapter 2, confidence intervals are typically created at a 90%, 95% or 99% confidence level, where the level is the proportion of intervals built this way, over repeated samples, that would contain the TRUE population mean. Statisticians refer to a complementary term, alpha, for the proportion of such intervals that DO NOT contain the TRUE population mean. Effectively, alpha is how often the method is wrong. The alpha value is calculated as 1 minus the confidence level. So, typically, alpha is established to be .10, .05 or .01. In SAS, the default value is .05 (if the alpha= option is not provided, SAS will create a 95% confidence interval).
As stated previously in Chapter 2, these results would be reported as: Based on a representative sample of 40 employees, we are 95% confident that job satisfaction among all employees is estimated to be between 7.17 and 6.53. This means that the probability that the true mean job satisfaction of all employees, which is unknown, falls between 7.17 and 6.53 is 95%. It also means that there is a 5% probability that the true mean job satisfaction is outside of this range (< 6.53 or > 7.17).
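To see how the alpha= option maps onto the confidence level, the following sketch reruns the same Proc Means step at the 90% and 99% levels. It assumes only the WidgeOne dataset and the JobSat variable already used in this section. The N and Mean keywords are added here so that the sample size and the mean print alongside the limits.

```sas
*90% confidence interval: alpha = 1 - .90 = .10;
Proc Means data=WidgeOne N Mean CLM alpha=.10;
Var JobSat;
Run;

*99% confidence interval: alpha = 1 - .99 = .01;
Proc Means data=WidgeOne N Mean CLM alpha=.01;
Var JobSat;
Run;
```

The 99% interval should come out wider than the 95% interval, and the 90% interval narrower, because a higher confidence level requires a wider range.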
7.7 SAS Lagniappe

As with the other packages, there is an almost infinite amount of Lagniappe that we could provide here. This might be more true for SAS than for the other packages, since SAS is so much bigger and more complex. The first Lagniappe is output. Let's talk about output. You probably noticed that after we generate output and copy/paste it into Word, it looks, well, awful. We can do better. To make the output look a little more presentable when it is copied into another package, we can use the Output Delivery System, or ODS. To use the ODS, you just need to open it at the front end of your code and then close it at the end of your code. Like this:
ODS RTF;
Proc Means data=WidgeOne maxdec=2;
Var PRDCTY JobSat YRONJOB;
Run;
ODS RTF Close;
RTF stands for Rich Text Format. By sandwiching your code between these ODS statements, your output will be put into a rich text file.
After you run the code, you will see the following screen popup:
Open the file. You should see this:

Variable   Label      N    Mean   Std Dev   Minimum   Maximum
PRDCTY     PRDCTY    40   84.58      8.29     67.65     97.47
JOBSAT     JOBSAT    40    6.85      1.02      5.00      8.60
YRONJOB    YRONJOB   40    7.26      4.26      0.10     17.00
If you save it and then open it within a Word document, the table will operate like any other table in Word, so you can move it, change the font, etc.
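If you would rather choose where the rich text file is saved, the ODS RTF statement also accepts a file= option. The following is a minimal sketch. The path is a placeholder assumed for illustration, not one used in this manual, so replace it with a folder on your own machine.

```sas
*Send the output to a named RTF file - the path below is a placeholder;
ODS RTF file="C:\Temp\WidgeOneMeans.rtf";
Proc Means data=WidgeOne maxdec=2;
Var PRDCTY JobSat YRONJOB;
Run;
ODS RTF Close;
```

Everything between the opening and closing ODS statements is written to that one file, so several procedures can share a single document.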
The second Lagniappe that we wanted to share with you is how to create labels, like converting F and M to Female and Male. The process is to create the formatting logic and then go back and apply it. You have to do this in two steps. Here is the code:
Proc Format;
Value $Gencode "M" = "MALE" "F" = "FEMALE";
Value $Plantcode "D" = "DALLAS" "N" = "NORCROSS";
Run;
Data WidgeOne1;
Set WidgeOne;
Format Gender $Gencode. Plant $Plantcode.;
Run;
Proc Print data=Widgeone1;
Run;
Proc Format will create the formatting for the labels. It does not apply the formats; it just creates them. The $ sign is used with qualitative formats.
The formats created above are applied in a Data step. The logic in the Format statement is to name the variable to be formatted (Gender) and then reference the format ($Gencode.).
This will change the dataset (and associated output) to be more user friendly:
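A format can also be applied for the duration of a single procedure, without creating a new dataset. The sketch below assumes that the $Gencode and $Plantcode formats have already been created with Proc Format in the current SAS session.

```sas
*Apply the formats inside Proc Freq only - the WidgeOne dataset itself is left unchanged;
Proc Freq data=WidgeOne;
Tables Gender Plant;
Format Gender $Gencode. Plant $Plantcode.;
Run;
```

This is handy when you want friendly labels in one table but prefer to keep the original codes in the stored data.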
One secret that SAS Jocks tend to keep to themselves is that individuals who are highly proficient in SAS rarely develop SAS code from scratch. They maintain libraries of code and then modify these lines of code as needed. Here is a complete outline of the SAS code used in this manual (including notations) to begin your library:
*Proc Print will print the dataset to your output screen - not to your printer;
Proc Print data=WidgeOne;
Run;

*Proc Means will provide two of the three measurements of central tendency, Mean and Median...these measurements must be specified. The "Var" statement tells SAS which variables you are interested in - remember that mean and median are only relevant for quantitative variables;
Proc Means data=WidgeOne Mean Median Max Min STD;
Var JobSat Prdcty YRONJOB SOCREL JOBGRADE;
*Class Plant Gender;
Run;

*Proc Freq is the SAS command to use when determining the frequency counts for qualitative data;
Proc Freq data=WidgeOne;
Tables Plant Gender Position;
Run;

Data WidgeOne1;
Set WidgeOne;
*Set the length of JOBTEN first - otherwise SAS sizes it from the first value assigned ('New') and truncates the longer labels;
Length JOBTEN $ 11;
If YRONJOB <5 Then JOBTEN = 'New';
Else If YRONJOB >=5 AND YRONJOB<=10 Then JOBTEN = 'Experienced';
Else If YRONJOB >10 Then JOBTEN = 'Mature';
Format Plant Plant.;
Run;

*Print the new dataset to ensure that the new variable was created properly;
Proc Print data=WidgeOne1;
Run;

*Now we run the frequency table on JobTen...which will provide us with the dispersion of the YRONJOB variable across the specified categories;
Proc Freq data=WidgeOne1;
Tables JobTen;
Run;

*This code will produce a pie chart, where the qualitative variable listed after the "pie" command will identify the slices. Here, we have simply asked for the percent (type = pct) for each value of "JOBTEN";
proc gchart data=WidgeOne1;
pie JOBTEN / type = pct;
legend;
Run;
Quit;

*This code will produce a pie chart, where the qualitative variable listed after the "pie" command will identify the slices. Here, we have asked for the percentage of total productivity by Plant;
proc gchart data=WidgeOne1;
pie Plant / sumvar=prdcty percent=Inside;
legend;
Run;
Quit;

*This code will produce a horizontal bar chart, where the qualitative variable listed after the "HBAR" command will identify the categories for the bars. Here, we have simply asked for the frequency count (type = freq) for each value of "JOBTEN";
proc gchart data=WidgeOne1;
HBAR JOBTEN / type = freq;
legend;
Run;
Quit;

*To create Stem and Leaf and Box Plots in SAS, we use the Proc Univariate command with the "plots" option. This procedure is only valid for quantitative variables. So, for Job Tenure, we will reference the YRONJOB variable;
Proc Univariate data=WidgeOne1 plots;
Var YRONJOB;
Histogram;
Run;

*To create a contingency table, simply run the Proc Freq command. Identify the variables of interest in the Tables statement separated by an *. Then add "by" statements where necessary to subset the analysis. Note - whenever you add a "by" statement, you will need to sort the data by that variable first;
Proc Sort data=WidgeOne1;
by JobTen;
Run;
Proc Freq data=WidgeOne1;
Tables Plant*Gender;
by JobTen;
Run;

*To create a stacked chart using two variables, simply use the same code as before when analyzing a single variable and add a "subgroup" option;
proc gchart data=WidgeOne1;
HBAR Plant / subgroup=gender;
legend;
Run;
Quit;

*To create a scatter plot using two quantitative variables use the Proc Plot command...the first variable stated will appear on the y-axis...typically this is the dependent variable;
Proc Plot data=WidgeOne1;
Plot Prdcty*Yronjob;
Run;
Quit;
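The Class statement is commented out in the Proc Means step of the library above. As a sketch of how it might be switched on, the step below reports the same statistics separately for each combination of Plant and Gender. The shorter Var list is chosen here only to keep the output small.

```sas
*Same Proc Means step as in the library, with the Class statement active;
Proc Means data=WidgeOne Mean Median Max Min STD;
Var JobSat Prdcty YRONJOB;
Class Plant Gender;
Run;
```

Unlike a "by" statement, the Class statement does not require the data to be sorted first.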
Congratulations. You are now a Geek. Take a bow.