GRAPHICAL METHODS
R Graphics
a. R has two independent graphics subsystems
Traditional graphics
Grid graphics
recent (2000)
Low-level tool, flexible
Creating Graphs in R
Plotting
The functions plot(), points(), lines(), text(), mtext(), axis(), identify() etc. form a suite
that plots points, lines and text.
On startup, R initiates a graphics device driver which opens a special graphics window for the
display of interactive graphics. If a new graphics window needs to be opened, dev.new() can be
issued on any platform; on Windows, the win.graph() or windows() command can also be used. Once the
device driver is running, R plotting commands can be used to produce a variety of graphical
displays and to create entirely new kinds of display.
Plot of Vector(s)
1. One vector x (plots the vector against the index vector)
> x <- 1:10
> plot(x)
2. Scatterplot of two vectors x and y
> set.seed(13)
> x <- -30:30
> y <- 3 * x + 2 + rnorm(length(x), sd = 20)
> plot(x, y)
Plot of data.frame elements
If the first argument to plot() is a data.frame with exactly 2 columns (variables), the result is
the same as plot(x, y) on those columns; with more than 2 columns, plot() draws a scatterplot
matrix of all pairs of variables.
Let’s look at the data.frame airquality, which records 6 variables (4 air-quality measurements
plus month and day) in New York, on a daily basis from May to September 1973. In total there are
153 observations (days).
> airquality[1:2, ]
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
> plot(airquality)
> attach(airquality)
> plot(Ozone, Temp, main = "plot(Ozone, Temp)")
> detach(airquality)
1. BOX PLOT
A box plot is a graphical summary of the distribution of a numerical variable. It shows the
middle of the data, the quartiles, and the smallest and largest values, and so gives a quick
picture of how the data are spread out. The quartiles divide the ordered data into four equal
parts. The graph represents the minimum, maximum, median, first quartile and third quartile of
the data set. It is also useful for comparing distributions across data sets by drawing a box
plot for each of them.
The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the
distribution of data based on the five number summary: minimum, first quartile, median, third
quartile, and maximum.
In the simplest box plot the central rectangle spans the first quartile to the third quartile (the
interquartile range or IQR). A segment inside the rectangle shows the median and "whiskers"
above and below the box show the locations of the minimum and maximum.
This simplest possible box plot displays the full range of variation (from min to max), the
likely range of variation (the IQR), and a typical value (the median).
Not uncommonly, real datasets will display surprisingly high maximums or surprisingly
low minimums, called outliers.
Outliers are either 3×IQR or more above the third quartile or 3×IQR or more below the
first quartile.
Suspected outliers are slightly more central versions of outliers: either 1.5×IQR or
more above the third quartile or 1.5×IQR or more below the first quartile.
If either type of outlier is present, the whisker on the appropriate side is taken to 1.5×IQR
from the quartile (the "inner fence") rather than the max or min, and individual outlying
data points are displayed as unfilled circles (for suspected outliers) or filled circles (for
outliers). (The "outer fence" is 3×IQR from the quartile.)
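The fence rules above can be checked numerically. The following is a minimal sketch, assuming the Ozone column of the built-in airquality data as the sample (that choice, and all variable names, are illustrative and not part of the original notes). Note that fivenum() reports hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Ozone readings from the built-in airquality data, with missing values dropped
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(ozone)

# Quartiles and interquartile range (IQR)
q1  <- quantile(ozone, 0.25, names = FALSE)
q3  <- quantile(ozone, 0.75, names = FALSE)
iqr <- q3 - q1

# Inner fences (1.5 x IQR) and outer fences (3 x IQR) on each side of the box
inner_lower <- q1 - 1.5 * iqr
inner_upper <- q3 + 1.5 * iqr
outer_lower <- q1 - 3 * iqr
outer_upper <- q3 + 3 * iqr

# Outliers lie at or beyond the outer fences
outliers <- ozone[ozone <= outer_lower | ozone >= outer_upper]

# Suspected outliers lie at or beyond the inner fences but inside the outer fences
suspected <- ozone[(ozone <= inner_lower | ozone >= inner_upper) &
                   ozone > outer_lower & ozone < outer_upper]

print(five)
print(suspected)
print(outliers)
```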
How to construct a box plot?
Syntax
boxplot(arguments)
notch : takes a Boolean value; if TRUE, a notch is drawn on each side of the boxes.
sub : used to display the subtitle (if any) for the box plot
Solution :
This problem can be demonstrated by using some of the parameters of the box plots.
Case (a) Considering the parameters data and x
(i) The boxplot() function is used to create the box plot.
(ii) Here we display the speed and dist variables of the cars dataset using the x and data
parameters.
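The code for this case is not present in these notes. A minimal sketch of what such a call could look like follows, assuming the same formula (speed ~ dist) that the later notch example uses; the variable name bp is illustrative.

```r
# x is given as a formula and data names the data frame holding the variables.
# ASSUMPTION: the formula speed ~ dist is borrowed from the notch example in
# these notes; it draws one box of speed values per distinct stopping distance.
bp <- boxplot(speed ~ dist, data = cars)

# boxplot() invisibly returns the statistics behind the plot:
# bp$stats holds the five summary values per box, bp$n the group sizes
print(bp$n)
```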
Output :
Case (c) Considering the parameters notch, col, x, data, xlab, ylab
(i) Here the value of the notch argument is given as TRUE, to draw a notch on each side of the
box.
(ii) The col argument is used to assign color to the boxes; here we are passing green and red to
col. With the formula speed ~ dist, the groups on the x-axis are distances and the y-axis shows
speed, so the axis labels are set accordingly.
boxplot(speed ~ dist, data = cars, xlab = "Distance taken to stop: ft", ylab = "Speed of cars: mph",
        main = "Cars DataSet", col = c("green", "red"), notch = TRUE)
Output :
Exercise - 2
> boxplot(airquality)
Note: if you give plot() a factor and a numeric vector, as plot(factor, vector) or
plot(vector ~ factor), it will produce a boxplot.
> title("Equivalent plots")
> plot(factor(airquality$Month), airquality$Ozone, col = 2:6, xlab = "month",
+ ylab = "ozone", sub = "plot(factor(airquality$Month), airquality$Ozone)")
2. BAR CHART / BARPLOT
A bar chart represents data as rectangular bars, with the length of each bar proportional to the
value of the variable. The chart has two axes: x is used to represent the groups and y the
corresponding values. R uses the function barplot() to create bar charts, and it can draw both
vertical and horizontal bars. Each of the bars can be given a different color.
Application
Bar Charts are used to show data related to finance , marketing , and others .
Syntax
barplot(arguments)
Example :
barplot(part,names.arg,xlab,ylab,main,col)
names.arg : a vector of names used to display a label below each bar
horiz : takes a Boolean value; FALSE (the default) draws vertical bars, TRUE draws horizontal bars
sub : used to display the subtitle (if any) for the bar chart
xlim : used to specify the limits for the x-axis, example c(0,10)
ylim : used to specify the limits for the y-axis, example c(0,10)
To be given to students as LAB EXERCISE
Exercise -1
Using an R bar chart, show how the modes of transport used for travelling to office (bike, car,
bus, auto, and train) are distributed.
Solution :
This problem can be demonstrated by using different parameters of the bar chart.
Case (a) Considering the parameters part and main
(i) Numerical values 20,10,16,4,10 are assigned to the part variable.
(ii) The barplot function is used to create the bar chart; part supplies the height of each
bar and main is used to assign the title “Strategies utilized for travelling to office”.
# values from (i)
part <- c(20,10,16,4,10)
#function to create the bar chart with x and y axis names
barplot(part, main = "Strategies utilized for travelling to office", xlab = "Vehicles", ylab = "Numbers")
Output :
(iv) To create a horizontal barchart , the horiz parameter is used with its value as TRUE.
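A minimal sketch of a horizontal bar chart for the same exercise is given here. It assumes the values 20,10,16,4,10 from case (a) and assumes they correspond to bike, car, bus, auto and train in that order; the names modes and mids are illustrative.

```r
# Values from case (a); the order of the labels is an assumption
part  <- c(20, 10, 16, 4, 10)
modes <- c("Bike", "Car", "Bus", "Auto", "Train")

# horiz = TRUE draws the bars horizontally, so the values run along the x-axis
# and the group names are written along the y-axis
mids <- barplot(part, names.arg = modes, horiz = TRUE,
                main = "Strategies utilized for travelling to office",
                xlab = "Numbers", ylab = "Vehicles")

# barplot() returns the midpoint of each bar, which is useful for adding text later
print(mids)
```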
Output :
vec <- c(1,2,3,4,5,6,7,8,9)
values <- matrix(vec,nrow = 3, ncol = 3)
barplot(values,col = c("red","blue","yellow"))
> values
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Exercise- 2:
3. PIE CHART
R has numerous libraries to create charts and graphs.
A pie chart is drawn as a circle divided into slices, and each slice is labeled. In other
words, a pie chart is a representation of values as slices of a circle with different colors,
where the size of each slice is proportional to the value it represents. The slices always add
up to 100 percent of the total. Keep in mind that people judge lengths more accurately than
angles or areas, so a bar chart is usually easier to read when values must be compared precisely.
Application
A pie chart is used to display information related to marketing, weather reports, and finance
among others.
In R the pie chart is created using the pie() function, which takes a vector of non-negative
numbers as input. The additional parameters are used to control labels, color, title etc.
Syntax
pie(arguments)
pie(part,labels,edges,radius,clockwise,init.angle,density,angle,col,main)
Following is the description of the arguments/parameters used –
part : contains a vector of non-negative numeric values and tells the size of parts
labels : used to give a description to the slices / used to assign names to each part
edges : the circular outline of the pie is approximated by a polygon with this many edges;
the default value is 200
radius : indicates the radius of the circle of the pie chart (value between −1 and +1)
clockwise : a logical value indicating if the slices are drawn clockwise or
counterclockwise (TRUE / FALSE)
init.angle : used to specify the starting angle (in degrees) for the slices; the default
is 0 when clockwise is FALSE and 90 when clockwise is TRUE
density : the density of the shading lines, in lines per inch; by default no shading lines are drawn
angle : used to change the angle of the shading lines inside the chart
col : indicates the color palette / used to show colors in chart
main : indicates the title of the chart
Bus, cycle, and train are assigned to the lbls variable.
(ii) The title ‘Strategies utilized for travelling to office’ is assigned using the main argument;
the edges argument is set to 5, so the outline of the pie is drawn as a polygon with 5 edges.
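The code for this case is missing from these notes. A minimal sketch follows, assuming the same part and lbls values that the later cases use.

```r
# ASSUMPTION: the values and labels are the ones used in the other pie chart cases
part <- c(8, 12, 16, 4, 10)
lbls <- c("Walking", "Car", "Bus", "Cycle", "Train")

# edges = 5 approximates the outline of the pie with a 5-sided polygon
# instead of the default 200-sided one, so the "circle" looks angular
pie(part, labels = lbls, edges = 5,
    main = "Strategies utilized for travelling to office")
```

Raising edges back towards the default of 200 makes the outline look like a smooth circle again.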
Output :
Case (d) Considering parameters part, labels , clockwise and main
(i) Numerical values 8,12,16,4,10 are assigned to the part variable and the string values
Walking, Car, Bus, Cycle, and Train are assigned to the lbls variable.
(ii) A Boolean TRUE or FALSE value is assigned to the clockwise parameter to change the
direction in which the parts are drawn, that is, clockwise or counterclockwise; the title
‘Strategies utilized for travelling to office’ is assigned using the main argument.
part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
pie (part,labels=lbls,clockwise = TRUE,main='Strategies utilized for travelling to office')
Output (Clockwise):
pie (part,labels=lbls,clockwise = FALSE,main='Strategies utilized for travelling to office')
Output (Counterclockwise):
Case (f) Considering parameters part, labels ,density, angle and main
(i) The density parameter shades each part with lines (here 20 lines per inch), and the angle
parameter sets the slope of those shading lines. With angle set to 90 degrees the lines are
vertical, as shown below.
part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
pie(part, labels = lbls, density = 20, angle = 90, main = 'Strategies utilized for travelling to office')
Output :
Case (h) Considering parameters part, labels, border, col and main
(i) The border parameter is used to change the border color of each part of the circle.
(ii) Red color is assigned to the border parameter and a separate fill color is passed to each
part through col, as shown below.
part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
# fill colors for the parts
clor <- c("red","blue","green","yellow","white")
# assigning red border color to each block
pie(part, labels = lbls, border = "red", col = clor, main = 'Strategies utilized for travelling to office')
Output :
Chart Legend
Chart legends are used to provide a small description of each part; we can specify where on the
chart the legend should be displayed, that is, top-left, top-right, bottom-left, or bottom-right, etc.
In a pie chart, the legend is included using the legend() function.
Syntax
legend(arguments)
Example: legend(position, labels, fill)
Here
position : states the position of the legend
labels : defines the label of blocks
fill : defines the colors of the boxes drawn beside the labels
(i) Walking is assigned red color , car – blue color , bus – yellow color , cycle –
green color , and train – white color , all these values are assigned through cols
and lbls variables and the legend function.
(ii) The fill parameter is used to assign colors to the legend.
(iii) Legend is added to the top-right side of the chart , by assigning the value as
topright ; it can take different values such as topleft , bottomright , and bottomleft.
part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
# assigning color to cols variable
cols <- c("red","blue","yellow","green","white")
# col = cols makes the colors of the parts match the legend
pie(part, labels = lbls, col = cols, main = 'Strategies utilized for travelling to office')
#creating legend
legend("topright", legend = lbls, fill = cols)
Output :
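Example :
# OzMonthMean is not defined in these notes; it is assumed here to be the mean Ozone per month
OzMonthMean <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
pie(OzMonthMean, col = rainbow(5))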
3D Pie Chart
A pie chart drawn with a three-dimensional (3D) appearance is called a 3D pie chart. It can be
created using the pie3D() function from the plotrix package.
Syntax
pie3D(parameters)
Example : For the previous example construct a 3D pie chart.
Note : Install the plotrix package first, using install.packages("plotrix").
library(plotrix)
part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
# creating 3D pie
pie3D(part, labels = lbls, explode = 0.1, main = 'Strategies utilized for travelling to office')
Output :
4. HISTOGRAMS
A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is
similar to a bar chart, but the difference is that it groups the values into continuous ranges. The
height of each bar in a histogram represents the number of values present in that range.
R creates histogram using hist() function. This function takes a vector as an input and uses some
more parameters to plot histograms.
In other words , it takes a vector (i.e. column) of data, breaks it up into intervals, then
plots as a vertical bar the number of instances within each interval.
Histograms can be plotted from
(i) an input vector
(ii) built-in datasets
(i) Plot histogram using an input vector
Syntax
The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used −
v is a vector containing numeric values used in histogram.
main indicates title of the chart.
col is used to set color of the bars.
border is used to set border color of each bar.
xlab is used to give description of x-axis.
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks is used to control the bins: either a single number suggesting how many bars to draw, or a
vector of break points between the bars.
Example
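A simple histogram is created using input vector, label, col and border parameters.
The script given below will create and save the histogram in the current R working directory.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
png(file = "histogram.png")
# the hist() call is missing from the source; the label and colors below are illustrative
hist(v, xlab = "Value", col = "yellow", border = "blue")
dev.off()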
Range of X and Y values
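To specify the range of values allowed in X axis and Y axis, we can use the xlim and ylim
parameters.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
png(file = "histogram_lim_breaks.png")
# only "breaks = 5" survives from the original call; the other argument values are illustrative
hist(v, xlab = "Value", col = "green", border = "red", xlim = c(0,50), ylim = c(0,5),
   breaks = 5)
dev.off()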
(ii)Plot histogram from built-in datasets
# Simple Histogram
hist(mtcars$mpg)
# Add a Normal Curve
x <- mtcars$mpg
h <- hist(x, breaks=10, col="red", xlab="Miles Per Gallon",
   main="Histogram with Normal Curve")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
A histogram can be a poor method for determining the shape of a distribution because it is so
strongly affected by the number of bins used.
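The effect of the bin count can be seen by drawing the same data twice. This is a small sketch using mtcars$mpg, the variable from the example above; the two breaks values and the names h_few and h_many are arbitrary illustrative choices.

```r
x <- mtcars$mpg

# Put two histograms side by side and remember the old layout
op <- par(mfrow = c(1, 2))

# Few, wide bins hide detail ...
h_few  <- hist(x, breaks = 3,  col = "red", xlab = "Miles Per Gallon",
               main = "breaks = 3")
# ... many, narrow bins show more bumps, some of which may only be noise
h_many <- hist(x, breaks = 15, col = "red", xlab = "Miles Per Gallon",
               main = "breaks = 15")

# Restore the previous layout
par(op)

# breaks is only a suggestion, so check how many bins were actually used
print(length(h_few$counts))
print(length(h_many$counts))
```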
5. LINE GRAPH / LINE CHART
A line chart is a graph that connects a series of points by drawing line segments between them.
The points are ordered by one of their coordinates (usually the x-coordinate). Line charts
are usually used to identify trends in data, for example in marketing, weather reports,
finance, and other areas.
Syntax
plot(arguments)
Example
A simple line chart is created using the input vector and the type parameter as "o". The script
below will create and save a line chart in the current R working directory.
v <- c(7,12,28,3,41)
# png() writes a PNG image, so the file name uses the .png extension
png(file = "line_chart.png")
plot(v,type = "o")
dev.off()
Line Chart Title, Color and Labels
The features of the line chart can be expanded by using additional parameters. We add color to
the points and lines, give a title to the chart and add labels to the axes.
Example
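v <- c(7,12,28,3,41)
png(file = "line_chart_label_colored.png")
# the plot() call is missing from the source; the color, labels and title below are illustrative
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall", main = "Rain fall chart")
dev.off()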
Multiple Lines in a Line Chart
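More than one line can be drawn on the same chart by using the lines() function.
After the first line is plotted, the lines() function can use an additional vector as input to draw the
second line in the chart.
v <- c(7,12,28,3,41)
t <- c(14,7,6,19,3)
png(file = "line_chart_2_lines.png")
# the plotting calls are missing from the source; the colors and labels below are illustrative
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall", main = "Rain fall chart")
lines(t, type = "o", col = "blue")
dev.off()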
6. SCATTER PLOT
Scatter plots are used to show the relation between two variables of a data set. Data
are displayed as a collection of points in the Cartesian plane, and each point represents the
values of two variables. They can be used whether or not one of the variables depends on the
other. One variable is placed on the horizontal axis and the other on the vertical axis.
Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
Example
We use the data set "mtcars" available in the R environment to create a basic scatterplot. Let's
use the columns "wt" and "mpg" in mtcars.
input <- mtcars[,c("wt","mpg")]
print(head(input))
png(file = "scatterplot.png")
# Plot the chart for cars with weight between 2.5 to 5 and mileage between 15 and 30.
plot(x = input$wt, y = input$mpg,
   xlab = "Weight",
   ylab = "Mileage",
   xlim = c(2.5,5),
   ylim = c(15,30)
)
# Save the file.
dev.off()
Scatterplot Matrices
When we have more than two variables and we want to see how each variable relates to
the remaining ones, we use a scatterplot matrix. We use the pairs() function to create matrices
of scatterplots.
Syntax
The basic syntax for creating scatterplot matrices in R is −
pairs(formula, data)
formula represents the series of variables used in pairs.
data represents the data set from which the variables will be taken.
Example
Each variable is paired up with each of the remaining variables. A scatterplot is plotted for each
pair.
png(file = "scatterplot_matrices.png")
# Plot the matrices between 4 variables giving 12 plots.
pairs(~wt+mpg+disp+cyl, data = mtcars)
dev.off()
points(x, y)
lines(x, y)
    Adds points or connected lines to the current plot.
text(x, y, labels, ...)
    Adds text to a plot at points given by x, y. Normally labels is an integer or character
    vector, in which case labels[i] is plotted at point (x[i], y[i]). The default is 1:length(x).
    Note: This function is often used in the sequence
    > plot(x, y, type="n"); text(x, y, names)
    The graphics parameter type="n" suppresses the points but sets up the axes, and the text()
    function supplies special characters, as specified by the character vector names for the points.
polygon(x, y, ...)
    Draws a polygon defined by the ordered vertices in (x, y) and (optionally) shades it in with
    hatch lines, or fills it if the graphics device allows the filling of figures.
legend(x, y, legend, ...)
    Adds a legend to the current plot at the specified position. Plotting characters, line
    styles, colors etc., are identified with the labels in the character vector legend.
    At least one other argument v (a vector the same length as legend) with the corresponding
    values of the plotting unit must also be given, as follows:
    legend( , fill=v)   Colors for filled boxes
    legend( , col=v)    Colors in which points or lines will be drawn
    legend( , lty=v)    Line styles
    legend( , lwd=v)    Line widths
    legend( , pch=v)    Plotting characters
title(main, sub)
    Adds a title main to the top of the current plot in a large font and (optionally) a
    sub-title sub at the bottom in a smaller font.
axis(side, ...)
    Adds an axis to the current plot on the side given by the first argument (1 to 4, counting
    clockwise from the bottom). Other arguments control the positioning of the axis within or
    beside the plot, and tick positions and labels. Useful for adding custom axes after calling
    plot() with the axes=FALSE argument.
Example 1– Using points , lines , legend .
attach(cars)
plot(cars, type = "n", xlab = "Speed [mph]", ylab = "Distance [ft]")
points(speed[speed < 15], dist[speed < 15], pch = "s", col = "blue")
points(speed[speed >= 15], dist[speed >= 15], pch = "f", col = "green")
lines(lowess(cars), col = "red")
legend(5, 120, pch = c("s", "f"), col = c("blue", "green"), legend = c("Slow", "Fast"))
title("Braking distance of old cars")
detach(cars)
Example 2 – Generate the following (25 symbols that you can use to produce points in your
graphs)
Solution :
# Make an empty chart
plot(1, 1, xlim=c(1,5.5), ylim=c(0,7), type="n", ann=FALSE)
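Only the empty chart survives from the solution. One way to complete it is sketched below, assuming a 5 x 5 grid layout that fits the axis limits given above; the layout and the variable names are illustrative, not the original solution.

```r
# Make an empty chart (same call as in the notes)
plot(1, 1, xlim = c(1, 5.5), ylim = c(0, 7), type = "n", ann = FALSE)

# Symbol numbers 1 to 25, laid out row by row on a hypothetical 5 x 5 grid
pch_id <- 1:25
xs <- rep(1:5, times = 5)   # columns 1..5, repeated for every row
ys <- rep(6:2, each = 5)    # rows from top (y = 6) to bottom (y = 2)

# Draw each symbol; bg only affects the fillable symbols 21 to 25
points(xs, ys, pch = pch_id, cex = 2, bg = "grey")

# Write the symbol number to the right of each symbol
text(xs + 0.3, ys, labels = pch_id)
```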
III. Interactive graphics functions
R provides functions which allow users to extract or add information to a plot using a mouse via
locator() and identify() functions respectively.
Example 1
> plot(1:20, rt(20, 1))
> text(locator(1), "outlier", adj = 0)
Waits for the user to select locations on the current plot using the left mouse button.
Example 2
Identify members in a hierarchical cluster analysis of distances between European cities
Dataset used: eurodist
> hca <- hclust(eurodist)
> plot(hca, main = "Distance between European Cities")
> (x <- identify(hca))
> x
ADVANCED GRAPHICS
a) Lattice Graphs
What is Lattice?
Lattice is a powerful and elegant high-level data visualization system inspired by Trellis
graphics. It is designed with an emphasis on multivariate data and allows easy
conditioning to produce “small multiple” plots.
Lattice Graphs
The lattice package was written by Deepayan Sarkar. It tries to improve on base R graphics by
providing better defaults and the ability to display multivariate relationships.
This package supports the creation of trellis graphs –
graphs that display a variable or
the relationship between variables, conditioned on one or
more other variables.
The typical format is:
graph_type(formula, data=)
We can select graph_type from the functions listed below. The formula specifies the variable(s)
to display and any conditioning variables.
For example:
~x means display numeric variable x alone;
~x|A means display numeric variable x for each level of factor A;
y~x | A*B means display the relationship between numeric variables y and x for every
combination of the levels of factors A and B.
Main functions in the lattice package
Function Description
histogram() Histogram
# Install
install.packages("lattice")
# Load
library("lattice")
Example – 1 (mtcars dataset used here)
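The code for this example is missing from these notes. The sketch below shows the two conditioned formula forms described earlier on mtcars; the choice of variables (mpg, wt, cyl) and the titles are assumptions made for illustration.

```r
library(lattice)

# ~x | A : histogram of mpg, one panel per number of cylinders
p_hist <- histogram(~ mpg | factor(cyl), data = mtcars,
                    xlab = "Miles per gallon",
                    main = "MPG by number of cylinders")
print(p_hist)

# y ~ x | A : mpg against weight, one panel per number of cylinders
p_xy <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
               xlab = "Weight", ylab = "Miles per gallon",
               main = "MPG vs weight by number of cylinders")
print(p_xy)
```

Lattice functions return a "trellis" object, so the plot appears only when the object is printed (which happens automatically at the console but not inside functions or loops).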
Output
b)ggplot2 in R
Install package ggplot2.
ggplot2 is a data visualization package for the statistical programming language R. In other
words, ggplot2 is an R library for creating graphics. It can greatly improve the quality
and aesthetics of your graphics, and will make you much more efficient in creating them. ggplot2
allows you to build almost any type of graphic. The ggplot2 package, created by Hadley
Wickham, offers a powerful graphics language for creating elegant and complex plots. Its
popularity in the R community has exploded in recent years. Originally based on Leland
Wilkinson's The Grammar of Graphics, ggplot2 allows you to create graphs that represent both
univariate and multivariate numerical and categorical data in a straightforward manner. Grouping
can be represented by color, symbol, size, and transparency. It serves as a general scheme for
data visualization which breaks up graphs into semantic components such as scales and layers.
ggplot2 can serve as a replacement for the base graphics in R and contains a number of defaults
for web and print display of common scales.
In contrast to base R graphics, ggplot2 allows the user to add, remove or alter components in a plot
at a high level of abstraction. This abstraction comes at a cost, with ggplot2 being slower than
lattice graphics.
library(ggplot2)
ggplot(diamonds) # if only the dataset is known.
ggplot(diamonds, aes(x=carat)) # if only X-axis is known. The Y-axis can be specified in respective geoms.
ggplot(diamonds, aes(x=carat, y=price)) # if both X and Y axes are fixed for all layers.
ggplot(diamonds, aes(x=carat, color=cut)) # Each category of the 'cut' variable will now have a distinct color, once a geom is added.
The aes argument stands for aesthetics. ggplot2 considers the X and Y axis of the plot to be
aesthetics as well, along with color, size, shape, fill etc. If you want to have the color, size etc
fixed (i.e. not vary based on a variable from the dataframe), you need to specify it outside
the aes(), like this.
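The code that followed this sentence is missing. A minimal sketch of fixed aesthetics is given here; the particular color, size and shape values are illustrative assumptions.

```r
library(ggplot2)

# color, size and shape are set OUTSIDE aes(), so they are constants:
# every point gets the same look and no legend is created for them
p_fixed <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(color = "steelblue", size = 1.5, shape = 4)

print(p_fixed)
```

Had color been written inside aes(), ggplot2 would treat it as a variable to map rather than a literal color.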
2. The Layers
The layers in ggplot2 are also called ‘geoms’. Once the base setup is done, you can append the
geoms one on top of the other.
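The plot this section refers to is not included in these notes. A sketch consistent with the description (two geoms inheriting x, y and color from the ggplot() setup) could look like this:

```r
library(ggplot2)

# Base setup: x, y and color are defined once and inherited by every layer
p_layers <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +    # layer 1: scatterplot
  geom_smooth()     # layer 2: one smoothing line per level of cut

print(p_layers)
```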
We have added two layers (geoms) to this plot - the geom_point() and geom_smooth(). Since the
X axis, the Y axis and the color were defined in the ggplot() setup itself, these two layers inherit
those aesthetics.
Alternatively, you can specify those aesthetics inside the geom layer also as shown below.
Notice the X and Y axis and how the color of the points varies based on the value of the cut variable.
The legend was automatically added.
Instead of having multiple smoothing lines for each level of cut, let us integrate them all under
one line.
How to do that?
Removing the color aesthetic from geom_smooth() layer would accomplish that.
library(ggplot2)
ggplot(diamonds) + geom_point(aes(x=carat, y=price, color=cut)) + geom_smooth(aes(x=carat, y=price)) # Remove color from geom_smooth
ggplot(diamonds, aes(x=carat, y=price)) + geom_point(aes(color=cut)) + geom_smooth() # same but simpler
Now make the shape of the points vary with the color variable of the diamonds data.
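One possible answer is sketched below. It assumes that "color feature" means the color column of diamonds, which has 7 levels; because the default shape palette in ggplot2 offers fewer shapes than that, the sketch supplies 7 shapes by hand with scale_shape_manual() (an addition not mentioned in the notes).

```r
library(ggplot2)

# Map the diamond's color grade (7 levels) to the point shape,
# while the cut is still shown by the point color
p_shape <- ggplot(diamonds, aes(x = carat, y = price, color = cut, shape = color)) +
  geom_point() +
  scale_shape_manual(values = 1:7)   # one plotting symbol per color grade

print(p_shape)
```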
3. The Labels
Now that you have drawn the main parts of the graph, you might want to add the plot’s main
title and perhaps change the X and Y axis titles. This can be accomplished using the labs layer,
meant for specifying the labels. However, manipulating the size and color of the labels is the job of
the ‘Theme’.
library(ggplot2)
gg <- ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + labs(title="Scatterplot", x="Carat", y="Price") # add axis labels and plot title.
print(gg)
The plot’s main title is added and the X and Y axis labels are capitalized.
Note: If you are showing a ggplot inside a function, you need to explicitly save it and
then print using the print(gg), like we just did above.
4. The Theme
Almost everything is set, except that we want to increase the size of the labels and change the
legend title. Adjusting the size of labels can be done using the theme() function by setting
the plot.title, axis.text.x and axis.text.y. They need to be specified inside the element_text(). If
you want to remove any of them, set it to element_blank() and it will vanish entirely.
Adjusting the legend title is a bit tricky. If your legend is that of a color attribute and it varies
based on a factor, you need to set the name using scale_color_discrete(), where the color part
refers to the color attribute and the discrete part to the fact that the legend is based on a
factor variable.
# the opening of this call is missing from the source; the plot.title setting is an illustrative reconstruction
gg1 <- gg + theme(plot.title=element_text(size=30, face="bold"),
                  axis.text.x=element_text(size=15),
                  axis.text.y=element_text(size=15),
                  axis.title.x=element_text(size=25),
                  axis.title.y=element_text(size=25)) +
  scale_color_discrete(name="Cut of diamonds") # add title and axis text, change legend title.
print(gg1) # print the plot
If the legend shows a shape attribute based on a factor variable, you need to change it
using scale_shape_discrete(name="legend title").
A continuous variable cannot be mapped to shape, so the shape legend has no continuous
counterpart; for a color legend based on a continuous variable, use
scale_color_continuous(name="legend title") instead.
What is the function to use if your legend is based on a fill attribute on a continuous
variable?
The answer is scale_fill_continuous(name="legend title").
5. The Facets
In the previous chart, we had the scatterplot for all different values of cut plotted in the same
chart.
What if you want one chart for one cut?
facet_wrap(formula) takes in a formula as the argument, for example ~ cut for one panel per value
of cut. In facet_grid(formula), the item on the RHS corresponds to the columns and the item on the
LHS defines the rows.
In facet_wrap, the scales of the X and Y axis are fixed to accommodate all points by default. This
makes comparison of attributes meaningful because they are on the same scale.
However, it is possible to let the scales vary freely between panels, making the charts look more
evenly distributed, by setting the argument scales="free".
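The facet_wrap() call itself is missing from these notes. A self-contained sketch follows; it rebuilds a simple scatterplot rather than reusing gg1, and ncol = 3 is an illustrative choice.

```r
library(ggplot2)

# Base scatterplot, rebuilt here so the example stands on its own
gg <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()

# One panel per cut, all panels sharing the same X and Y scales (the default)
p_wrap <- gg + facet_wrap(~ cut, ncol = 3)
print(p_wrap)

# Same panels, but each panel chooses its own X and Y scales
p_free <- gg + facet_wrap(~ cut, ncol = 3, scales = "free")
print(p_free)
```

With free scales the points fill each panel better, but values can no longer be compared across panels by position alone.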
For comparison purposes, you can put all the plots in a grid as well using facet_grid(formula).
gg1 + facet_grid(color ~ cut) # In a grid
VISUALISING DISTRIBUTIONS
How you visualise the distribution of a variable will depend on whether the variable is
categorical or continuous.
A variable is categorical if it can only take one of a small set of values. In R, categorical
variables are usually saved as factors or character vectors. To examine the distribution of a
categorical variable, use a bar chart:
A variable is continuous if it can take any of an infinite set of ordered values. Numbers and
date-times are two examples of continuous variables. To examine the distribution of a
continuous variable, use a histogram:
In both bar charts and histograms, tall bars show the common values of a variable, and shorter
bars show less-common values. Places that do not have bars reveal values that were not seen
in your data.
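The two plots referred to above are missing from these notes. A minimal sketch with ggplot2 and the diamonds data follows; the choice of the variables cut and carat and the bin width of 0.5 are assumptions made for illustration.

```r
library(ggplot2)

# Categorical variable: a bar chart counts the rows in each category of cut
p_bar <- ggplot(diamonds) +
  geom_bar(aes(x = cut))
print(p_bar)

# The heights of the bars are the same counts that table() reports
cut_counts <- table(diamonds$cut)
print(cut_counts)

# Continuous variable: a histogram counts the rows falling into each bin of carat
p_hist <- ggplot(diamonds) +
  geom_histogram(aes(x = carat), binwidth = 0.5)
print(p_hist)
```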
SAVING GRAPHS
If you are working with RStudio, the plot can be exported from the menu in the Plots panel
(lower-right panel).
Plots panel –> Export –> Save as Image or Save as PDF
Plots can also be saved from R code in three steps:
1. Specify the file to save your image to, using a function such as jpeg(), png(), svg() or pdf().
Additional arguments indicating the width and the height of the image can also be used.
2. Create the plot
3. Close the file with dev.off()
Function Output to
pdf("mygraph.pdf") pdf file
win.metafile("mygraph.wmf") windows metafile
png("mygraph.png") png file
jpeg("mygraph.jpg") jpeg file
bmp("mygraph.bmp") bmp file
postscript("mygraph.ps") postscript file
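The three steps (open a file device, draw, close the device) can be sketched as follows. The example writes a PDF into R's temporary directory so it does not touch the working directory; the file name and the plot are illustrative.

```r
# 1. Open a file device; a full path saves the graph outside the working directory.
#    For pdf(), width and height are given in inches.
out_file <- file.path(tempdir(), "mygraph.pdf")   # hypothetical file name
pdf(out_file, width = 7, height = 5)

# 2. Create the plot; it is drawn into the file, not on screen
plot(cars, main = "Speed and stopping distance", xlab = "Speed", ylab = "Distance")

# 3. Close the device so the file is written completely
dev.off()

# The file now exists on disk
print(file.exists(out_file))
```

The same pattern works with png(), jpeg(), bmp() and the other devices in the table, whose width and height are measured in pixels by default.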
Use a full path in the file name to save the graph outside of the current working directory.
Example:
Or use this:
The R code above saves the file in the current working directory.
REGRESSION ANALYSIS
Regression analysis is a very widely used statistical tool to establish a relationship model between
two variables. One of these variables is called the predictor variable, whose value is gathered through
experiments. The other variable is called the response variable, whose value is derived from the
predictor variable.
i) R- LINEAR REGRESSION
In Linear Regression these two variables are related through an equation, where exponent (power)
of both these variables is 1. Mathematically a linear relationship represents a straight line when
plotted as a graph. A non-linear relationship where the exponent of any variable is not equal to 1
creates a curve.
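y = ax + b
Following is the description of the parameters used −
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
A simple example of regression is predicting the weight of a person when the height is known.
The steps are:
Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.
Create a relationship model using the lm() function in R.
Find the coefficients from the model created and create the mathematical equation using
these.
Get a summary of the relationship model to know the average error in prediction, also
called residuals.
To predict the weight of new persons, use the predict() function in R.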
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
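Following is the description of the parameters used −
formula is a symbol presenting the relation between x and y.
data is the data set on which the formula will be applied.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
print(relation)
Coefficients:
(Intercept)            x
   -38.4551       0.6746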
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
print(summary(relation))
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
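object is the model which is already created using the lm() function.
newdata is the data frame containing the new values for the predictor variable.
Predict the weight of new persons
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y~x)
# the prediction step is missing from the source; a height of 170 is an illustrative value
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)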
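x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Give the chart file a name.
png(file = "linearregression.png")
# Plot weight (y) on the horizontal axis and height (x) on the vertical axis.
plot(y, x, col = "blue", main = "Height & Weight Regression",
     xlab = "Weight", ylab = "Height")
# Add the fitted line for these axes (height regressed on weight).
abline(lm(x ~ y))
# Save the file.
dev.off()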
We create the regression model using the lm() function in R. The model determines the value of
the coefficients using the input data. Next we can predict the value of the response variable for a
given set of predictor variables using these coefficients.
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in multiple regression is −
lm(y ~ x1+x2+x3...,data)
Following is the description of the parameters used −
formula is a symbolic description of the relation between the response variable and the
predictor variables.
data is the data frame on which the formula will be applied.
Example
Input Data
Consider the data set "mtcars" available in the R environment. It compares different car
models in terms of miles per gallon ("mpg"), engine displacement ("disp"), horsepower ("hp"),
weight of the car ("wt") and some more parameters.
The goal of the model is to establish the relationship between "mpg" as a response variable with
"disp","hp" and "wt" as predictor variables. We create a subset of these variables from the mtcars
data set for this purpose.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
print(head(input))
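# Create the relationship model.
model <- lm(mpg ~ disp + hp + wt, data = mtcars)
print(model)
# Get the intercept and the coefficients as vector elements.
a <- coef(model)[1]
print(a)
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)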
Coefficients:
(Intercept)         disp           hp           wt
  37.105505    -0.000937    -0.031157    -3.800891

         hp
-0.03115655
       wt
-3.800891
For a car with disp = 221, hp = 102 and wt = 2.91, the predicted mileage from the
coefficients above is −
Y = 37.1055 + (-0.000937)*221 + (-0.031157)*102 + (-3.800891)*2.91 = 22.6598
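The hand calculation can be checked against predict(). This is a minimal sketch assuming the model is fitted on the built-in mtcars data; the names new_car, by_hand and by_predict are illustrative.

```r
# Fit mpg on disp, hp and wt using the built-in mtcars data set.
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# The car described in the text.
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)

# By hand: intercept * 1 plus each coefficient times its predictor value.
by_hand <- sum(coef(model) * c(1, 221, 102, 2.91))

# The same prediction through predict(); unname() drops the row label.
by_predict <- unname(predict(model, newdata = new_car))

print(by_hand)
print(by_predict)
```

Both values agree because predict() applies the same linear equation with the unrounded coefficients.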
Logistic regression is a regression model in which the response variable (dependent variable)
has categorical values such as True/False or 0/1. It models the probability of a binary
response as a function of the predictor variables.
The function used to create the regression model is the glm() function.
Syntax
The basic syntax for glm() function in logistic regression is −
glm(formula,data,family)
family is an R object that specifies the details of the model. Its value is binomial for
logistic regression.
Example
The in-built data set "mtcars" describes different models of a car with their
various engine specifications. In "mtcars" data set, the transmission mode
(automatic or manual) is described by the column am which is a binary value
(0 or 1). We can create a logistic regression model between the columns "am"
and 3 other columns - hp, wt and cyl.
input <- mtcars[, c("am", "cyl", "hp", "wt")]
print(head(input))
am.data <- glm(formula = am ~ cyl + hp + wt, data = mtcars, family = binomial)
print(summary(am.data))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion
In the summary, the p-values in the last column are greater than 0.05 for the variables "cyl"
and "hp", so we consider them not statistically significant in predicting the variable "am".
Only weight (wt) has a significant effect on the "am" value in this regression model.
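The fitted model can be turned into predicted probabilities for each car. The sketch below assumes the same am ~ cyl + hp + wt model on mtcars; the 0.5 cut-off used to classify cars is an illustrative choice, not part of the original example.

```r
am.data <- glm(am ~ cyl + hp + wt, data = mtcars, family = binomial)

# type = "link" gives the linear predictor (log-odds) for each car.
eta <- predict(am.data, type = "link")

# type = "response" gives the probability that am = 1.
prob <- predict(am.data, type = "response")

# plogis() is the logistic function 1 / (1 + exp(-eta)), so this matches prob.
prob_by_hand <- plogis(eta)

# Classify with an assumed 0.5 cut-off and compare with the observed values.
predicted_am <- as.integer(prob > 0.5)
print(table(actual = mtcars$am, predicted = predicted_am))
```

The table counts how many cars are classified correctly and incorrectly on the data used to fit the model, so it is an optimistic check rather than a test on new cars.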
The function used to create the Poisson regression model is the glm() function.
Syntax
The basic syntax for glm() function in Poisson regression is −
glm(formula,data,family)
family is an R object that specifies the details of the model. Its value is poisson for
Poisson regression.
Example
We have the in-built data set "warpbreaks" which describes the effect of wool
type (A or B) and tension (low, medium or high) on the number of warp
breaks per loom. Let's consider "breaks" as the response variable, which is a
count of the number of breaks. The columns "wool" and "tension" are taken as
predictor variables.
Input Data
input <- warpbreaks
print(head(input))
# Create the Poisson regression model.
output <- glm(formula = breaks ~ wool + tension,
              data = warpbreaks,
              family = poisson)
print(summary(output))
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In the summary, a predictor is considered to have an impact on the response variable when
the p-value in the last column is less than 0.05. Here wool type B and tension levels M and H
all have p-values far below 0.05, and their negative coefficients indicate fewer expected
breaks than the baseline (wool A, tension L).
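Poisson coefficients are on the log scale, so exponentiating them gives multiplicative effects on the expected count. The following is a sketch assuming the same breaks ~ wool + tension model; the name rate_ratios is illustrative.

```r
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# exp() converts log-scale coefficients into rate ratios relative to the
# baseline levels (wool A, tension L); the intercept becomes the baseline mean.
rate_ratios <- exp(coef(output))
print(round(rate_ratios, 3))
```

A rate ratio below 1 means fewer expected breaks than the baseline; for example, the ratio for woolB is the expected count for wool B divided by that for wool A at the same tension.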
---------------------------------