
An introduction to basic statistics in R

Big Data, Big Opportunities

Figure 1 - p-values, according to Randall Munroe of xkcd.com. Don't take this entirely seriously…

R is a free, open source statistical programming language that is becoming increasingly widely used in the scientific world. It is extremely capable, with many "bleeding edge" features. It can seem a little intimidating at first, particularly if you haven't used a command-line driven program before, but fear not: this will be a very gentle introduction. We're going to focus on some very simple R commands here, but I hope these will be skills that you can build on during your scientific education and beyond. There are all manner of novel uses for R! As a point of note, I will give examples of how to complete most parts of the exercise in one fashion, but there are often several ways to solve a particular problem. At several points in the exercise I will ask you to find a different route to the solution. This will probably require a little research on your part…

This exercise is designed to give a gentle introduction to the two main foci of this course: R programming and the practical application of statistics. The exercises this week provide a quick refresher on some basic statistical concepts, as well as starting to build familiarity with R. It will be useful to have the R Reference Card that I supplied to hand – the abridged Hitchhiker's Guide to R (DON'T PANIC).

When you are coding in R, you will have to be very careful that your syntax is exactly correct. A
misplaced, missing, or substituted comma will lead to a command that doesn’t run, or runs incorrectly.
Check this carefully before asking for assistance. Also, I’m only human, so there is a small chance
there are typos in the commands in the handouts. If so, mea culpa.
Background to the Problem

This week, in preparation for the kinds of problems that we'll be facing for the bulk of the course, we are going to look at some agricultural data covering the whole US. We're going to use these data as an example to review the following important statistical concepts, and how to deal with them in R:

• Continuous, discrete, ordinal and categorical variables
• Confidence intervals
• Distributions of data (or how I learnt to stop worrying and love the normal distribution)
• Statistical significance (i.e. alpha and p-values)
• Some basic statistical tests comparing samples

Exercise

As with all of my exercises, the questions you will be expected to answer are interspersed throughout
the text and are highlighted in grey. Please read carefully to make sure you don’t miss anything, and be
sure to include answers to every question in your write-up.

Part 1 – Preparation

When you work in R, all of the files, variables and analyses that you create during one session
(providing you don’t overwrite them) and backups of your commands are saved in one particular
directory. It’s good practice to have a directory for each set of analyses on which you are working, so
create a directory called “Big Data R Introduction” somewhere convenient so that we can tell R to use
it as your working directory.

Now start R. You should be faced with a wall of text and a blinking cursor. We could start off working
from the command line (prepare for that Stranger Things vibe), but instead we’ll use a GUI (Graphical
User Interface) that is available for R - RStudio.

When you open RStudio you should see the same wall of text in the bottom left of the screen, plus three more panes. Top left is a text editor that allows you to save your command history, top right lets you keep track of the data that you are importing into/exporting from R, and bottom right gives you a range of options (the most commonly used of which are "Help" and "Plots", where your visualisations will appear).

Good coding practice in R is to keep your commented code in its final form in a text file displayed in the source window (top left), and then to run it in the bottom left window from there. That way you have a permanent record of your work and you can reconstruct your progress in seconds if need be. I'm going to provide the code in raw R for these exercises, but I'd like you to compile your code into such a text file to submit with your packet.

R commands take the format command_name(command_inputs). To find out what your current
working directory is, type the following into R (omit the “>”, that just shows you where each line
starts, and note that cutting and pasting will result in some errors due to the “fancy” formatting of this
document):

> getwd()
R will tell you the name of your working directory. Note that R did not need any command inputs for
the getwd() command. The command to change your working directory is setwd(). Try to change it
now…

Did that work for you? If not, it’s probably because setwd() needs a command input so that R knows
where to change the directory to. You can see the format needed to input a directory from the output
from getwd(). Note that the directory structure is in inverted commas. This tells R that it is receiving
text string input, rather than numeric or variable name input (on which more later). An appropriate
command would look something like:

> setwd("/Users/MyName/MyDirectory")

Another use of getwd() will show that you’ve successfully executed the change. Note that you can use
the up and down arrows to cycle through past commands and save a bit of keyboard work. You can
also do this by clicking “Misc” and “Change Working Directory” from the menu bar, and then finding
the directory you’ve just created, but we’re getting familiar with R proper, here. Note that after each
session you’ll want to save your commands and results either as you exit, or manually by clicking
“Workspace” and then “Save Workspace File”. In RStudio you can use the “Session” menu, or the top
right box to do the same thing.

Part 2 – Analysis of US Farm Numbers

Download the file US.farmland.number.csv from Learn to your computer and save it to your working directory.

Now load these data into R as a data frame called US.farmland.number. You can do this easily using
the read.csv command:

> US.farmland.number <- read.csv("US.farmland.number.csv")

Here we're adding some more functionality to R. We are reading an external comma-separated values file (or csv, a simple spreadsheet-style document format that all Excel-type programs should be able to save as) called US.farmland.number.csv from the working directory (if it's not there, you're out of luck), and then storing it as a data frame (the R variable type used to store raw data) called US.farmland.number using the <- operator. Note that, in almost all cases, <- is equivalent to = in R, but, for clarity, we will use <- for file and variable manipulation, and = for mathematical operations.

It is a good idea to settle on a naming convention for your variables in R, to aid your recollection while you are working on a project.

Now check that the file has read correctly:

> US.farmland.number[1,]

If you’d just typed US.farmland.number, you would have output the whole file to the screen. With
large datasets you would have been pretty swamped. Instead, we’ve just output the first row of the
dataset by appending the square brackets. If you wanted to output a single column, you could type its
reference after the comma in the brackets. A single cell can be specified using row and column
references.
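
For example, here's a quick sketch of the syntax (the indices are arbitrary, chosen purely for illustration):

> US.farmland.number[,2]
> US.farmland.number[3,2]

The first command outputs all rows of the second column; the second outputs the single cell in row 3, column 2.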
These data are sufficiently small that we can look at the whole lot at once. View the whole of the
US.farmland.number data frame on your screen. As you can see, this looks good, except that the names
of the rows of your table (the states) appear as the first column of the data frame. There’s a quick way
around this – re-run your previous read.csv() command, except add “, row.names = 1” after the file
name. Now look at the data frame again. Looking better?
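
In full, the re-run would look something like this (assuming the same file name as before):

> US.farmland.number <- read.csv("US.farmland.number.csv", row.names = 1)

The row.names = 1 input tells read.csv() to treat the first column of the file as row names rather than data.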

Question 1) How would you output the 32nd row of the data frame? How about the value in the fourth column of the seventh row? The first five rows? How would you show the last two columns of the last 4 rows of the data frame? Provide a screen capture of each of your results and the commands you used to achieve them.

For this question I haven’t given you some of the commands that you’ll need, so you will have to refer
to your R quick reference guide to figure it out. If your group can’t find what you’re after, ask me and
I’ll guide you in the right direction.

Returning to the whole data frame, you should see that we have data concerning the number of farms
that are present in each of the 50 states, grouped by size. This data frame provides examples of three of
the four different scales on which variables can be measured: categorical, ordinal, and discrete. A
categorical variable is solely qualitative – a datum being classified according to a categorical variable
will belong to one of the categories, but cannot be quantified further. None of the categories are
“greater” or “lesser” than the others. In our example here, the categorical variable would be “State”
(assertions of the greatness, or otherwise, of Texas, aside).

An ordinal variable would be the categories of farm size – these categories can be ranked in order, but we cannot precisely quantify the differences in size between them. Due to the way the categories have been arranged, it is not possible to say that a farm of "greater than 2000 acres" is twice as large as a farm of "1000-1999 acres", only that it is larger (note that parts of this scale can be quantified, which makes this example a little trickier than normal).

Finally, an example of a discrete variable would be "Number of farms". In a step up from ordinal variables, these can be ordered and the differences between the values that the variable can take are meaningful (i.e. the increase from 10 to 20 farms involves a doubling in number). Discrete variables cannot, however, take all values on a scale. There is no such thing (in reality) as having 1.73 farms.

The variable type that we have not encountered is a continuous variable, which is the same as a discrete
variable except that it can take all values on a scale – for example a farm could measure 118.29 acres in
area.
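
As an aside, these four scales map loosely onto R's own data types – factors for categorical variables, ordered factors for ordinal ones, and plain numbers for discrete and continuous ones. A minimal sketch, using made-up toy values:

> state <- factor(c("Texas", "Vermont", "Alaska"))   # categorical: an unordered factor
> size <- factor(c("1-9", "10-49", ">2000"), levels = c("1-9", "10-49", ">2000"), ordered = TRUE)   # ordinal
> farms <- c(10L, 20L, 762L)   # discrete: whole numbers only
> acres <- c(118.29, 5.5, 2000.01)   # continuous: any value on the scale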

Question 2) Provide three examples of each of the four types of variable (continuous, discrete,
ordinal, and categorical). These can be drawn from anywhere, so long as your instructor
is able to recognise them!

Some useful data are missing, however. Let’s rectify that problem:

Question 3) Add a column to your dataset, name it "Totals" and fill it with the total number of farms in each state. Provide a screen capture of the last column of your final data frame and the commands you used to produce and fill it. Provide a description of the function of each part of each command.
To do this, you will need to use the following commands:

> US.farmland.number <- cbind(US.farmland.number, 0)

This adds (or “binds” in R terminology) a column full of zeros to the right-hand edge of the
US.farmland.number dataframe.

> names(US.farmland.number)[13]<-c("Totals")

Using this command, you have renamed the thirteenth entry in the stored list of column names to "Totals".
The names() command provides you with the vector (i.e. series) of column names stored for
US.farmland.number, which can be modified. The c() command (one of the most common and most
useful in R) creates a data vector that can be input into R. In this case, your vector is only a single
variable long, but it could be as long as is needed. Note that, as before, you have to surround text by
quotation marks, otherwise R looks for existing variables with that name to work with, rather than
adding new text.

One way to fill the Totals column would be to fill each row separately using commands like this:

> US.farmland.number[1,13]<-sum(US.farmland.number[1,1:12])

We can do this a little more quickly, however. If you don’t give R a reference for the columns or rows
to work with, by default it will apply your action to all columns or rows. Try:

> US.farmland.number[,13]<-rowSums(US.farmland.number[,1:12])

Easier?
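
As I said at the outset, there are usually several routes to the same answer. Here's one more, as a sketch: the general-purpose apply() command, which applies a function across the rows (MARGIN = 1) or columns (MARGIN = 2) of a data frame:

> US.farmland.number[,13] <- apply(US.farmland.number[,1:12], 1, sum)   # same result as rowSums()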

Question 4) Calculate the mean number of farms in a state by two different methods in R. Provide
the commands you used for each, and the output.

This mean (should be around 42000) helps us describe the characteristics of farms in the US. It only
tells us part of the story, however. Going back to our original data, you can see that Texas is the farm
capital of the US, with over 200,000 farms. Alaska, on the other hand, only has 762 farms!
Consequently, we need to calculate another statistic to measure this variability. This is the standard deviation, and it can be calculated using the R command sd(). The standard deviation measures the dispersion of your data from the average – a high standard deviation means dispersed data (lots of data points a long way from the mean), a low standard deviation means data that are not spread far from the mean. It is important to note that the standard deviation is measured in the same units as your variable.
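
To see how sd() behaves before you let it loose on the farm data, here's a toy sketch with made-up numbers:

> tight <- c(9, 10, 11)   # values huddled around the mean of 10
> spread <- c(1, 10, 19)   # same mean, much more dispersed
> sd(tight)   # 1
> sd(spread)   # 9
> sqrt(var(spread))   # also 9 - the sd is just the square root of the variance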

Question 5) Now calculate the standard deviation (and variance if you’re feeling adventurous) of
number of farms in a state. Provide your results and the commands you used for each.
What does this tell us about the data?

These data suffer from (at least) one major bias that produces the observed standard deviation. What
might that be? How can you correct for it?

Question 6) Seek out new data to allow you to account for the bias that you have identified. Input these data into a new column in your dataframe (using our friend c()) and use the data to calculate a new metric quantifying the number of farms in a state. Calculate the mean and standard deviation of this new metric. How do they differ from your answers to questions 4 and 5? Show your data and working.

The standard deviation has another useful property – it allows us to calculate confidence intervals for our data. A confidence interval is the range of values in which there is an x% probability that a new observation drawn from the same population would lie (strictly speaking, an interval for a new observation is a prediction interval – a confidence interval in the narrow sense describes our uncertainty about a parameter such as the mean – but we'll use the looser, everyday sense here). In our example dataset, this would mean that if we looked at the number of farms in a hypothetical 51st state, it would have an x% chance of lying between the two stated values of the interval. How do we calculate these intervals?

Assuming that our data are normally distributed (about which more later), there is a 68.27% chance that this new data point will lie within one standard deviation (herein shortened to σ) of the mean, a 95.45% chance of it lying within 2σ, and a 99.73% chance of it lying within 3σ of the mean. You can calculate the number of standard deviations corresponding to other confidence intervals using the qnorm(x) command in R, where x = 1 - (0.5*alpha), and alpha = your desired significance level (here, the probability of a new observation falling outside the interval), expressed as a proportion of 1 (as alpha is a probability). For example, for a desired significance level of 10% (i.e. the 90% confidence interval), alpha = 0.1, x = 0.95, and your confidence interval (CI) spans ± 1.645σ.
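
A sketch of that worked example in R (m and s below stand in for whatever mean and standard deviation you have calculated):

> alpha <- 0.1   # 10% significance level, i.e. the 90% CI
> x <- 1 - (0.5 * alpha)   # 0.95
> z <- qnorm(x)   # ~1.645
> # the interval itself then runs from m - z*s to m + z*s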

Question 7) Calculate alpha, x, σ, and hence values (in numbers of farms) for the 75% and 98%
confidence intervals for the number of farms in a US state. Show the commands you
used and your answers.

You will remember, however, that we have been assuming that our data are normally distributed – i.e. that they follow a bell curve around the mean, with the relationships of σ given above. Do they, though? Now we get to experiment with another of R's cool features – plot production!

Question 8) Plot a histogram of your original data on the number of farms in each state. Does it look normally distributed, or is it skewed to one side? How about a histogram of your newly created metric? Does that look normally distributed or not? What do these observations mean? Provide screen captures of your histograms.

The R command to plot a histogram is, helpfully, hist(). Command inputs include the data to be plotted (essential) and the number of bins you want the histogram to be broken into – breaks = x (optional). You can play around with some different options to see what best captures the distribution of your data.
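
As a sketch of the syntax, using simulated data rather than our farm numbers (so that Question 8 stays yours to answer):

> fake <- rnorm(1000, mean = 42000, sd = 10000)   # 1000 made-up, normally distributed values
> hist(fake)   # default binning
> hist(fake, breaks = 30)   # finer binning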

If our data are not normally distributed, then we cannot use the above method to calculate their CIs. In fact, there are several approaches we can use. If we know the distribution into which our data fall (see http://en.wikipedia.org/wiki/List_of_probability_distributions for a pretty comprehensive list), we can use a distribution-specific method to calculate the CI (for example, replace qnorm() with qchisq() for a chi-squared distribution). See your R Short Reference Card for more details. If we don't know anything about the distribution of our data other than the mean and variance, we can use the most general form to calculate a CI – Chebyshev's inequality (which guarantees that at least 1 - 1/k² of any distribution lies within k standard deviations of the mean) – or build a Monte Carlo model to estimate the CI through raw number crunching. I'm not going to go into detail about the latter two at the moment as (a) much of the data you will be dealing with can be approximated to a normal distribution, and (b) we don't have the time now. Be aware of the problem, and we'll deal with it later should it become necessary.

Moving back to our farm data, one of our main goals with statistical analysis is to determine whether
particular datasets show a certain hypothesised relationship. For example, we might be interested in
whether Montana shows the same proportion of farms of different sizes as Vermont. We could just plot
these data and eyeball it, but there is much bias inherent to that approach (see the Nate Silver reading
from today), so we can do better. Enter statistics!

Question 9) Calculate the proportion of farms of different sizes in Montana and Vermont. Eyeball it
– do they look different to you? Show your code and output.

I hope that, as you were completing question 9, a significant (ahem…) problem occurred to you. How different do two sets of data have to be in order to be considered different? Would a single farm's difference in one size category be enough, or do all categories have to be different by a factor of 10? The formal way in which we answer this question is by stating an alpha (α) and then calculating a p-value from a statistical test: the probability of seeing data at least as different as ours if both samples really did come from the same underlying distribution. If our data would be unsurprising under that assumption (i.e. if p > α), we assume that the two display the same relationship, but if they would be very unlikely (i.e. if p < α), we conclude they are different.

Convention in much of the sciences (physics and engineering aside) is to use α = 0.05, so two relationships are declared different if data this far apart would arise less than 1 time in 20 when the samples really are the same. This value is, however, arbitrary – being able to interpret the meaning of p is much more important. If a statistical test produces a p of 0.051, the result only just misses that cutoff – there could still be a difference, and it's probably still worth investigating further if such a difference makes intuitive sense. Note that many people do not use α, and will discuss relationships as being similar or different with respect to certain p-levels (your instructor included). This is because p tells you how surprising your data would be if there were no real difference, and, with this more nuanced interpretation of p-values, we don't need to set an a priori α in order to interpret relationships, so long as we're careful not to bias our results towards our desired outcome.

Question 10) Head online and find three articles quoting p-values to support or falsify a relationship.
What are the p-values, how significant are they, and what relationship do they support?
Provide screenshots of the data/plots for each.

I will add here that there is much hand-wringing going on in the statistical community at the moment about whether p-values are even worthwhile any more, as they are hard to define exactly, much misunderstood, and frequently abused. We will stick with them here, but be aware that alternatives exist.

Now to some calculation and interpretation of p-values from your own data. In question 9 we wanted to know if there was a significant difference between the proportions of farms of different sizes in Vermont and Montana. For count data grouped into categories like these, there are several appropriate tests that we could use. The first (with which I hope you are familiar) is the chi-squared test. You can carry out a chi-squared test using the following command:

> chisq.test()

Note that your data have to be arranged with your observations as a pair of columns to be compared for this to work correctly. Firstly, you'll have to extract the two relevant rows from US.farmland.number and save them as a new data frame. Then you can use the R command t() to rotate ("transpose") a matrix or data frame so that the rows become columns and vice versa. Once you've done that, you should be ready to analyse…
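
A sketch of those preparatory steps, assuming the state names are stored as row names (as we set up with row.names = 1 earlier) and leaving off the Totals column:

> MT.VT <- t(US.farmland.number[c("Montana", "Vermont"), 1:12])

MT.VT is now a 12-row, 2-column matrix, ready to be passed to the test.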
Question 11) Compare the farm size proportions in Vermont and Montana using a chi-squared test.
Show your output and comment on the p-value. Does this agree with your eyeballed
estimate?

Another option for doing the same is the G-test. This is actually preferred to the chi-squared, as chi-squared is a computationally simple approximation of G that gives less precise results. With the advent of modern computers, such simplification is no longer needed. In order to carry out a G-test in R, you'll have to download and then load another set of code (known as a package) called RVAideMemoire. We can do this all within R, using the following two commands. You may have to select a CRAN mirror to download from. Any will do, but the nearer, the better.

> install.packages("RVAideMemoire")

> library(RVAideMemoire)

You can then use the command G.test in place of chisq.test to carry out a G-test.
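
Schematically, and assuming G.test() accepts the same two-column input as chisq.test() (it is designed to mirror it), reusing the transposed table sketched above:

> G.test(MT.VT)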

Question 12) Compare the farm size proportions in Vermont and Montana using a G-test. Show your
output and comment on the p-value. How does this compare to your chi-squared results?

If you want, you can save all the data in the R console using “File”, “Save” from the menu. If you don’t
do this, as soon as you quit, everything that is not a saved variable will disappear forever. Now quit R,
being sure to save your workspace as you go so your variables don’t disappear.

This has been a quick refresher on some of the basics of statistics. Next week we’ll take these and head
into the world of multivariate stats…
