
SAVEETHA INSTITUTE OF MEDICAL AND TECHNICAL SCIENCES

SAVEETHA SCHOOL OF ENGINEERING

CS01201 – STATISTICS WITH R PROGRAMMING

UNIT V PLOTTING AND REGRESSION ANALYSIS IN R


Creating Graphs in R, Categorization of Plotting - High-Level Plotting Functions - Box
Plot, Bar Plot, Pie Chart, Histogram, Line Chart, Scatter Plot, Low-Level Plotting
Functions, Interactive Graphics Functions, Advanced Graphics - Lattice Graphs, ggplot2,
Visualizing Distributions, Saving Graphs. Regression Analysis - Simple Linear Regression,
Multiple Regression, Logistic Regression, Poisson Regression.

GRAPHICAL METHODS

R Graphics
a. R has two independent graphics subsystems:

Traditional graphics
• Available in R from the beginning
• Rich collection of tools
• Not very flexible

Grid graphics
• More recent (2000)
• Low-level tool, flexible

b. Grid forms the basis of two high-level graphics systems:
• Lattice: based on Trellis graphics (Cleveland)
• ggplot2: inspired by the "Grammar of Graphics" (Wilkinson)

Creating Graphs in R

Plotting
The functions plot(), points(), lines(), text(), mtext(), axis(), identify(), etc. form a suite
that plots points, lines and text.

To see some of the basic plots that R offers, enter

demo(graphics)

Press the Enter key to move to each new graph.

R initiates a graphics device driver, which opens a special graphics window for the display of
interactive graphics. If a new graphics window needs to be opened, dev.new() can be issued on
any platform; on Windows, win.graph() or windows() does the same. Once the device driver is
running, R plotting commands can be used to produce a variety of graphical displays and to
create entirely new kinds of display.
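
As a minimal sketch of how a device is opened and closed explicitly, the lines below send a plot to a PDF file instead of the screen. The file name demo_plot.pdf and the use of tempdir() are assumptions made only for illustration.

```r
# Hypothetical output path; tempdir() keeps the working directory clean.
out <- file.path(tempdir(), "demo_plot.pdf")

pdf(file = out, width = 6, height = 4)   # open a PDF device (size in inches)
plot(1:10, main = "Written to a PDF device")
dev.off()                                 # close the device so the file is completed

file.exists(out)                          # check that the file was written
```

The same pattern applies to the other file devices such as png() or jpeg(): open the device, plot, then call dev.off().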

Plotting commands are divided into three basic groups:

I. High-level plotting functions create a new plot on the graphics device, possibly with
axes, labels, titles and so on.
II. Low-level plotting functions add more information to an existing plot, such as extra
points, lines and labels.
III. Interactive graphics functions allow you to interactively add information to, or extract
information from, an existing plot.

In addition, R maintains a list of graphical parameters which can be manipulated to customize
your plots.
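
The graphical parameters mentioned above are read and set with par(). The following is a small sketch, with made-up data, of changing two parameters temporarily and then restoring the previous values.

```r
# par() returns the previous values of the parameters it changes,
# so they can be restored afterwards.
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

x <- 1:10
plot(x, x^2, main = "Points")              # left panel
plot(x, x^2, type = "l", main = "Line")    # right panel

par(old_par)                                # restore the saved settings
```

Saving and restoring the old settings keeps a change such as mfrow from leaking into later plots.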
I. High-level plotting functions
High-level plotting functions are designed to generate a complete plot of the data passed as
arguments to the function. Where appropriate, axes, labels and titles are automatically generated
(unless you request otherwise). High-level plotting commands always start a new plot, erasing
the current plot if necessary.
I.1 The R function plot()
The plot() function is one of the most frequently used plotting functions in R.
IMPORTANT: This is a generic function, that is the type of plot produced is dependent on the
class of the first argument.
The following both plot y against x:
plot(y ~ x) # Use a formula to specify the graph
plot(x, y) # Give the x and y vectors directly
x and y must be the same length.
Example :
plot((0:20)*pi/10, sin((0:20)*pi/10))
plot((1:30)*0.92, sin((1:30)*0.92))
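
Because plot() is generic, the same call produces different displays for different classes of argument. The sketch below uses small made-up vectors (x and g are illustrative names, not part of any data set) to show four cases.

```r
x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3)              # numeric vector
g <- factor(c("a", "b", "a", "b", "a", "b"))      # factor with two levels

plot(x)                # numeric vector: values against their index
plot(g)                # factor: bar plot of the level counts
plot(g, x)             # factor and numeric vector: one box plot per level
plot(sin, 0, 2 * pi)   # function: curve over the interval [0, 2*pi]

class(x)
class(g)
```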

Plot of vector(s)
1. One vector x (plots the vector against the index vector)
> x <- 1:10
> plot(x)
2. Scatterplot of two vectors x and y
> set.seed(13)
> x <- -30:30
> y <- 3 * x + 2 + rnorm(length(x), sd = 20)
> plot(x, y)

Plot of data.frame elements
If the first argument to plot() is a data.frame with exactly two columns, the result is the
same as plot(x, y) for those two variables. With more than two columns, plot() draws a
scatterplot for every pair of variables.
Let's look at the data.frame airquality, which holds daily air quality measurements made in
New York from May to September 1973. It has 6 variables and 153 observations (days).
> airquality[1:2, ]
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
> plot(airquality)

Multiple plots in the same window, attach/detach

> par(mfrow = c(2, 1))


> plot(airquality$Ozone, airquality$Temp, main = "airquality$Ozone,airquality$Temp")

> attach(airquality)
> plot(Ozone, Temp, main = "plot(Ozone, Temp)")
> detach(airquality)
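
As an alternative sketch, the same plot can be drawn without attach() and detach() by using with() or the data argument of the formula interface. Both leave the search path unchanged.

```r
# with() evaluates the expression inside the data frame
with(airquality, plot(Ozone, Temp, main = "with(airquality, plot(Ozone, Temp))"))

# The formula interface takes the data frame through the data argument
plot(Temp ~ Ozone, data = airquality, main = "plot(Temp ~ Ozone, data = airquality)")
```

This avoids the risk of forgetting detach() and leaving the columns of airquality visible to later code.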

I.2 Other high-level graphics functions

1. BOX PLOT

A box plot is a graphical summary of a numeric data set. It shows the median, the quartiles,
and the smallest and largest values, and so gives a quick picture of how the data are spread.
The three quartiles divide the data set into four equal parts. The graph represents the
minimum, maximum, median, first quartile and third quartile of the data set. It is also useful
for comparing the distribution of data across data sets by drawing a box plot for each of them.

The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the
distribution of data based on the five number summary: minimum, first quartile, median, third
quartile, and maximum.
In the simplest box plot the central rectangle spans the first quartile to the third quartile (the
interquartile range or IQR). A segment inside the rectangle shows the median and "whiskers"
above and below the box show the locations of the minimum and maximum.

• This simplest possible box plot displays the full range of variation (from min to max), the
likely range of variation (the IQR), and a typical value (the median).
• Real datasets quite often contain surprisingly high maximums or surprisingly low minimums,
called outliers.
• Outliers are either 3×IQR or more above the third quartile or 3×IQR or more below the
first quartile.
• Suspected outliers are slightly more central versions of outliers: either 1.5×IQR or
more above the third quartile or 1.5×IQR or more below the first quartile.
• If either type of outlier is present, the whisker on the appropriate side is taken to 1.5×IQR
from the quartile (the "inner fence") rather than the max or min, and individual outlying
data points are displayed as unfilled circles (for suspected outliers) or filled circles (for
outliers). (The "outer fence" is 3×IQR from the quartile.)
• R's boxplot() applies the 1.5×IQR rule by default (argument range = 1.5) and draws every
point beyond the whiskers with the same symbol, without separating the two kinds of outlier.
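
The fences described above can be computed directly. This is a minimal sketch on made-up data; it uses fivenum(), whose hinges are what boxplot() uses for the box and can differ slightly from the quartiles returned by quantile().

```r
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 30)   # made-up data with one large value

five <- fivenum(x)                    # min, lower hinge, median, upper hinge, max
iqr  <- five[4] - five[2]             # spread of the box

inner <- c(five[2] - 1.5 * iqr, five[4] + 1.5 * iqr)   # inner fences
outer <- c(five[2] - 3 * iqr,   five[4] + 3 * iqr)     # outer fences

suspected <- x[x < inner[1] | x > inner[2]]   # beyond the inner fences
extreme   <- x[x < outer[1] | x > outer[2]]   # beyond the outer fences

boxplot.stats(x)$out                  # points boxplot() draws individually
boxplot(x, horizontal = TRUE)
```

Here the hinges are 4 and 8, so the inner fences are -2 and 14 and the outer fences are -8 and 20; the value 30 lies beyond both.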
How to construct a box plot?

Syntax

boxplot(arguments)

Some of the most commonly used arguments of boxplot() are as follows:

• x : a numeric vector, a list of vectors, or a formula giving the values to plot.
• data : a data frame from which the variables in a formula are taken.
• xlab : label for the x-axis.
• ylab : label for the y-axis.
• main : title of the box plot.
• sub : subtitle (if any) of the box plot.
• col : colors used to fill the boxes.
• border : color of the box outlines.
• notch : a Boolean argument; if TRUE, a notch is drawn on each side of the boxes around the
median.
• horizontal : a Boolean argument; if TRUE, the boxes are drawn horizontally; if FALSE,
vertically.
• varwidth : a Boolean argument; if TRUE, boxes are drawn with widths proportional to the
square roots of the number of observations in the groups.
• subset : a vector selecting the observations to be used for plotting.



To be given to students as LAB EXERCISE
Exercise -1
Using box plots, demonstrate the relation between the speed of cars and the distance taken to
stop. Use the built-in R data set cars, which records car speeds (mph) and stopping distances
(ft).

Solution :
This problem can be demonstrated by using some of the arguments of boxplot().
Case (a) Considering the parameters data and x
(i) The boxplot() function is used to create the box plot.
(ii) The formula dist ~ speed, passed as x, draws one box of stopping distances for each speed;
the data argument names the data set that holds both variables.

#create simple box plot
boxplot(dist~speed,data=cars)

Output :

Case (b) Considering the parameters main, xlab, ylab

(i) The main argument is used to display the title.
(ii) The xlab argument is used to display the label of the x-axis; here it is 'Speed of cars:mph'.
(iii) The ylab argument is used to display the label of the y-axis; here it is 'Distance taken to
stop:ft'.
boxplot(dist~speed,data=cars,xlab = "Speed of cars:mph",ylab = "Distance taken to stop:ft",
main="Cars DataSet")
Output :

Case (c) Considering the parameters notch, col, x, data, xlab, ylab
(i) Here the notch argument is set to TRUE, which draws a notch on each side of every box
around the median. With groups as small as those in this data set, R may warn that some
notches extend outside the hinges.
(ii) The col argument is used to assign colors to the boxes; here green and red are passed to
col and are reused in turn for successive boxes.
boxplot(dist~speed,data=cars,xlab = "Speed of cars:mph",ylab = "Distance taken to stop:ft",
main="Cars DataSet",col=c("green","red"),notch = TRUE)
Output :

Case (d) Considering the parameters varwidth, border, data, main, xlab, ylab

(i) The border argument is used to draw the box outlines in blue.
(ii) The varwidth argument takes only Boolean values; with varwidth = TRUE the width of each
box is made proportional to the square root of the number of observations in its group.
boxplot(dist~speed,data=cars,xlab = "Speed of cars:mph",ylab = "Distance taken to stop:ft",
main="Cars DataSet",col=c("green","red"),border = "blue",varwidth = TRUE,notch=FALSE)
Output :

Exercise - 2

> boxplot(airquality)

Note: if you give plot() a factor and a numeric vector, as plot(factor, vector) or
plot(vector ~ factor), it will produce a boxplot.

> par(mfrow = c(2, 2))
> boxplot(airquality$Ozone ~ airquality$Month, col = 2:6, xlab = "month",
+         ylab = "ozone", sub = "boxplot(airquality$Ozone ~ airquality$Month)")
> title("Equivalent plots")
> plot(factor(airquality$Month), airquality$Ozone, col = 2:6, xlab = "month",
+      ylab = "ozone", sub = "plot(factor(airquality$Month), airquality$Ozone)")
> plot(airquality$Ozone ~ factor(airquality$Month), col = 2:6,
+      sub = "plot(airquality$Ozone ~ factor(airquality$Month))")

2. BAR CHART / BARPLOT

A bar chart represents data as rectangular bars whose lengths are proportional to the values of
the variable. It has two axes: x, used to represent the groups, and y, used to represent the
corresponding values. R uses the function barplot() to create bar charts and can draw both
vertical and horizontal bars. Each of the bars can be given a different color.
Application
Bar charts are used to show data related to finance, marketing, and other areas.

Syntax
barplot(arguments)

Example :
barplot(part, names.arg, xlab, ylab, main, col)

The following are the most commonly used arguments:

• part : a vector or matrix of numeric values giving the bar heights (the height argument of
barplot()).
• xlab : label for the x-axis.
• ylab : label for the y-axis.
• main : title of the bar chart.
• sub : subtitle (if any) of the bar chart.
• names.arg : a vector of names displayed below each bar.
• col : colors used to fill the bars.
• border : color of the bar borders.
• space : amount of space left between bars.
• horiz : a Boolean argument; if TRUE, the bars are drawn horizontally; if FALSE, vertically.
• density : density of the shading lines drawn inside the bars.
• xlim : limits for the x-axis, for example c(0,10).
• ylim : limits for the y-axis, for example c(0,10).
• legend.text : a vector of text used to build a legend; further arguments for the legend()
function can be passed as a list through args.legend.

To be given to students as LAB EXERCISE
Exercise -1
Using an R bar chart, demonstrate the distribution of the various modes of travel to the
office, such as bike, car, bus, auto, and train.
Solution :
This problem can be demonstrated by using different parameters of the bar chart.
Case (a) Considering the parameters part and main
(i) The numerical values 20, 10, 16, 4, 10 are assigned to the vector part.
(ii) The barplot() function is used to create the bar chart; part supplies the height of each
bar and main assigns the title "Strategies utilized for travelling to office".

part <- c(20,10,16,4,10)


#function to create the bar chart
barplot(part, main = "Strategies utilized for travelling to office")
Output :

Case (b) Considering the parameters part, main, xlab, ylab

(i) The numerical values 20, 10, 16, 4, 10 are assigned to the vector part.
(ii) The barplot() function is used to create the bar chart; part supplies the height of each
bar and main assigns the title "Strategies utilized for travelling to office".
(iii) The xlab argument is used to label the x-axis as Vehicles.
(iv) The ylab argument is used to label the y-axis as Numbers.

part <- c(20,10,16,4,10)
#function to create the bar chart with x and y axis names
barplot(part,main = "Strategies utilized for travelling to office",xlab="Vehicles",
ylab="Numbers")
Output :

Case (c) Considering the parameters part ,main,xlab,ylab , names.arg


(i) The numerical values 20, 10, 16, 4, 10 are assigned to the vector part.
(ii) The labels Bike, Car, Bus, Auto and Train are assigned to the lbls variable and passed to
the barplot() function through the names.arg argument, to display the names below each bar.
part <- c(20,10,16,4,10)
lbls <- c("Bike","Car","Bus","Auto","Train")
#Assign names to each bar
barplot(part,main = "Strategies utilized for travelling to office",xlab=
"Vehicles",ylab="Numbers",names.arg = lbls)

Output :

Case (d) Considering the parameters part ,main,xlab,ylab , names.arg,density,horiz


(i) The numerical values 20, 10, 16, 4, 10 are assigned to the vector part.
(ii) The labels Bike, Car, Bus, Auto and Train are assigned to the lbls variable.
(iii) Shading lines are drawn inside the bars using the density argument, with 10 as the input
value.
(iv) To create a horizontal bar chart, the horiz parameter is used with its value as TRUE.
Because the bars now run along the x-axis, the two axis labels are swapped.

part <- c(20,10,16,4,10)
lbls <- c("Bike","Car","Bus","Auto","Train")
#Assign names to each bar
barplot(part,main = "Strategies utilized for travelling to office",xlab="Numbers",
ylab="Vehicles",names.arg = lbls,density = 10,horiz = TRUE)

Output :

Case (e) Considering the color parameter


(i) The col argument is used to set the color of each bar.
(ii) Bike is assigned red, car is assigned yellow, bus is assigned blue, auto is assigned
black, and train is assigned white using the col argument.

part <- c(20,10,16,4,10)


lbls <- c("Bike","Car","Bus","Auto","Train")
barplot(part,main = "Strategies utilized for travelling to office",xlab=
"Vehicles",ylab="Numbers",names.arg = lbls,col = c("red","yellow","blue","black","white"))
Output :

Case (f) Designing a stacked barchart using matrix values


(i) The bar heights are supplied as a 3 X 3 matrix created with the matrix() function; each
column becomes one stacked bar and each row one segment within the bars.
(ii) Red, blue, and yellow colors are assigned to the segments through the col argument.

vec <- c(1,2,3,4,5,6,7,8,9)
values <- matrix(vec,nrow = 3, ncol = 3)
barplot(values,col = c("red","blue","yellow"))

> values
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
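
A matrix can be drawn either stacked or grouped. The sketch below contrasts the two using the beside argument of barplot(); the row and column names r1..r3 and A..C are hypothetical labels added only so that a legend can be shown.

```r
values <- matrix(1:9, nrow = 3, ncol = 3,
                 dimnames = list(c("r1", "r2", "r3"), c("A", "B", "C")))
cols <- c("red", "blue", "yellow")

old_par <- par(mfrow = c(1, 2))
# Stacked: one bar per column, one segment per row
barplot(values, col = cols, main = "Stacked", legend.text = TRUE)
# Grouped: bars of each column drawn side by side
mids <- barplot(values, col = cols, beside = TRUE, main = "Grouped")
par(old_par)

colSums(values)   # total heights of the stacked bars
dim(mids)         # barplot() returns the bar midpoints on the x-axis
```

The returned midpoints are useful when low-level functions such as text() are used to annotate the bars afterwards.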

Exercise- 2:

Plot a bar plot of the mean ozone quality by month.

First use tapply to calculate the mean of ozone by month

> OzMonthMean <- tapply(airquality$Ozone, factor(airquality$Month),mean, na.rm = TRUE)


> par(mfrow = c(1, 2))
> barplot(OzMonthMean, col = 2:6, main = "Mean Ozone by month")

3. PIE CHART
R has numerous libraries for creating charts and graphs.

A pie chart represents values as slices of a circle, drawn in different colors. The slices are
labeled, and the number corresponding to each slice can also be shown in the chart. The slices
of a pie chart always add up to 100 percent. Pie charts are less precise than bar charts,
because people judge lengths more accurately than areas or angles, so a bar chart is often the
better choice when exact comparison matters.

This chart is easy to read: the size of each slice shows the share of the corresponding data
element.
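
Since the slices always add up to 100 percent, the share of each slice can be computed and shown in the labels. This is a small sketch with illustrative counts; the names part, lbls, pct and pct_lbls are chosen here for the example.

```r
part <- c(8, 12, 16, 4, 10)                           # illustrative counts
lbls <- c("Walking", "Car", "Bus", "Cycle", "Train")

pct <- round(100 * part / sum(part))                  # share of each slice in percent
pct_lbls <- paste0(lbls, " ", pct, "%")               # e.g. "Walking 16%"

pie(part, labels = pct_lbls, main = "Share of each mode of travel")
```

With these counts the shares are 16, 24, 32, 8 and 20 percent. For other data, rounding can make the displayed percentages add up to slightly more or less than 100.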

Application

A pie chart is used to display information related to marketing, weather reports, and finance
among others.

How to Create Pie Chart in R ?

In R the pie chart is created using the pie() function, which takes a vector of non-negative
numbers as input. Additional arguments are used to control labels, color, title, etc.

Syntax

The general syntax for creating a pie chart using R is as follows:

pie(arguments)

Let us discuss some of the arguments of the pie chart.

pie(part, labels, edges, radius, clockwise, init.angle, density, angle, col, main)
Following is the description of the arguments/parameters used:
• part : a vector of non-negative numeric values giving the sizes of the slices.
• labels : names for the slices.
• edges : the number of edges of the polygon that approximates the circle; the default value
is 200.
• radius : the radius of the circle. The pie is drawn in a square whose sides run from -1 to
+1; the default radius is 0.8.
• clockwise : a logical value (TRUE / FALSE) indicating whether the slices are drawn clockwise
or counterclockwise.
• init.angle : the starting angle (in degrees) for the slices; the default is 0 for
counterclockwise and 90 for clockwise charts.
• density : the density of the shading lines inside the slices, in lines per inch.
• angle : the angle of the shading lines inside the slices.
• col : the colors used to fill the slices.
• main : the title of the chart.

To be given to students as LAB EXERCISE


Using an R pie chart, demonstrate the distribution of the various modes of travel to the
office, such as walking, car, bus, cycle, and train.
Solution :
This problem can be demonstrated by using different parameters of the pie chart.
Case (a) Considering only the first two parameters, that is, part and labels
(i) The numerical values 8, 12, 16, 4, 10 are assigned to the vector part, and the strings
Walking, Car, Bus, Cycle, and Train are assigned to the vector lbls.
(ii) The pie() function is used to create the pie chart; part supplies the size of each slice
and lbls, passed through the labels argument, supplies its label.

#assigning name and value to each blocks


part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
#function to create pie chart
pie(part,labels=lbls)
Output :

Case (b) Considering parameters part, labels ,edges, main


(i) The numerical values 8, 12, 16, 4, 10 are assigned to the vector part, and the strings
Walking, Car, Bus, Cycle, and Train are assigned to the vector lbls.
(ii) The title 'Strategies utilized for travelling to office' is assigned using the main
argument; the edges argument is set to 5, so the outline is drawn as a coarse polygon instead
of a smooth circle, as shown below.

part <- c(8,12,16,4,10)


lbls <- c("Walking","Car","Bus","Cycle","Train")
# providing edge value and assigning name to chart
pie (part,labels=lbls,edges=5,main='Strategies utilized for travelling to office')

Output :

Case (c) Considering parameters part, labels , radius and main


(i) The numerical values 8, 12, 16, 4, 10 are assigned to the vector part, and the strings
Walking, Car, Bus, Cycle, and Train are assigned to the vector lbls.
(ii) The radius parameter changes the size of the circle; here it is set to -1 (the default is
0.8). The title 'Strategies utilized for travelling to office' is assigned using the main
argument.

part <- c(8,12,16,4,10)


lbls <- c("Walking","Car","Bus","Cycle","Train")
# providing radius value
pie (part,labels=lbls,radius=-1,main='Strategies utilized for travelling to office')
Output :

Case (d) Considering parameters part, labels, clockwise and main
(i) The numerical values 8, 12, 16, 4, 10 are assigned to the vector part, and the strings
Walking, Car, Bus, Cycle, and Train are assigned to the vector lbls.
(ii) The Boolean value TRUE or FALSE is assigned to the clockwise parameter to change the
direction in which the slices are drawn, that is, clockwise or counterclockwise; the title
'Strategies utilized for travelling to office' is assigned using the main argument.
part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
pie (part,labels=lbls,clockwise = TRUE,main='Strategies utilized for travelling to office')
Output (Clockwise):

part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
pie (part,labels=lbls,clockwise = FALSE,main='Strategies utilized for travelling to office')

Output (Counterclockwise):

Case (e) Considering parameters part, labels , density and main


(i) The density parameter is used in this case to show lines inside the circle ; if the value of
density is high , say 20 , there will be more lines compared to say , 5 as the density value

part <- c(8,12,16,4,10)


lbls <- c("Walking","Car","Bus","Cycle","Train")
pie (part,labels=lbls,density = 20,main='Strategies utilized for travelling to office')

Output :

Case (f) Considering parameters part, labels ,density, angle and main
(i) The angle parameter is used in this case to set the angle of the shading lines inside the
circle. The value of the angle parameter is 90 degrees, so the lines are vertical, as shown
below.

part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
pie (part,labels=lbls,density = 20,angle=90,main='Strategies utilized for travelling to office')
Output :

Case (g) Considering parameters part, labels ,col and main


(i) The colors red,blue,green,yellow, and white are assigned to the col variable in this case.
(ii) The col parameter is used to fill color inside the parts of the circle as shown below.

part <- c(8,12,16,4,10)


lbls <- c("Walking","Car","Bus","Cycle","Train")
# assigning color to each block
clor <- c ("red","blue","green","yellow","white")
pie (part,labels=lbls,col=clor,main='Strategies utilized for travelling to office')
Output :

Case (h) Considering parameters part, labels, border, col and main

(i) The border parameter is used to change the border color of the circle and of each slice.
(ii) Red is assigned to the border parameter, and the fill colors in clor are passed through
col, as shown below.
part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
# assigning fill colors and a red border color to each block
clor <- c ("red","blue","green","yellow","white")
pie (part,labels=lbls,col=clor,border="red",main='Strategies utilized for travelling to office')
Output :

Chart Legend
Chart legends are used to provide a small description of each part; we can specify where on the
chart the legend should be displayed, that is, top-left, top-right, bottom-left, bottom-right,
etc. In a pie chart, the legend is added using the legend() function.
Syntax
legend(arguments)
Example: legend(position, labels, fill)
Here
• position : states the position of the legend
• labels : defines the labels of the blocks
• fill : gives the colors of the boxes shown beside the labels

To be given to students as LAB EXERCISE


Using a chart legend, show the distribution of the various modes of travel to the office, such
as walking, car, bus, cycle, and train.

(i) Walking is assigned red, car blue, bus yellow, cycle green, and train white; these values
are held in the cols and lbls variables and passed to both pie() and legend(), so that the
legend matches the slices.
(ii) The fill parameter is used to assign colors to the legend.
(iii) The legend is added to the top-right corner of the chart by giving the position as
"topright"; other possible values include "topleft", "bottomright", and "bottomleft".
part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
# assigning color to cols variable
cols <- c ("red","blue","yellow","green","white")
pie (part,labels=lbls,col=cols,main='Strategies utilized for travelling to office')
#creating legend
legend("topright",fill = cols,legend = lbls)
Output :

Example :
pie(OzMonthMean, col = rainbow(5))

3D Pie Chart
A pie chart drawn with a three-dimensional (3D) appearance is called a 3D pie chart. A 3D pie
chart can be created using the pie3D() function of the plotrix package.
Syntax
pie3D(parameters)
Example : For the previous example construct a 3D pie chart.
Note : Install the plotrix package with install.packages("plotrix") before running the example.
library(plotrix)
part <- c(8,12,16,4,10)
lbls <- c("Walking","Car","Bus","Cycle","Train")
# creating 3D pie
pie3D(part,labels=lbls,explode=0.1,main='Strategies utilized for travelling to office')
Output :

4. HISTOGRAMS
A histogram represents the frequencies of the values of a variable bucketed into ranges. A
histogram is similar to a bar chart, but it groups the values into continuous ranges. The
height of each bar represents the number of values that fall in that range.
R creates histograms using the hist() function. This function takes a vector as input and uses
some more parameters to control the plot.
In other words, it takes a vector (i.e. a column) of data, breaks it up into intervals, then
plots the number of instances within each interval as a vertical bar.
Histograms can be plotted from
(i) an input vector
(ii) built-in datasets
(i) Plot histogram using an input vector
Syntax
The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used −
v is a vector containing numeric values used in histogram.
main indicates title of the chart.
col is used to set color of the bars.
border is used to set border color of each bar.
xlab is used to give description of x-axis.
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks gives either the break points between the bars or a suggested number of bars, and so
controls the width of each bar.
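
The effect of breaks can be inspected without drawing anything, because hist() returns an object that holds the break points and counts. The sketch below reuses the sample vector from the example that follows; h1 and h2 are names chosen here.

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# A single number is only a suggestion; R picks "pretty" break points near it
h1 <- hist(v, breaks = 5, plot = FALSE)

# A vector gives the break points exactly
h2 <- hist(v, breaks = seq(0, 50, by = 10), plot = FALSE)

h1$breaks   # break points chosen by R
h2$breaks   # 0 10 20 30 40 50, as requested
h2$counts   # number of values in each interval
```

With the explicit break points, the intervals (0,10], (10,20], (20,30], (30,40] and (40,50] hold 2, 3, 2, 3 and 1 values respectively.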

Example
A simple histogram is created using input vector, label, col and border parameters.

The script given below will create and save the histogram in the current R working directory.

# Create data for the graph.

v <- c(9,13,21,8,36,22,12,41,31,33,19)

# Give the chart file a name.

png(file = "histogram.png")

# Create the histogram.

hist(v,xlab = "Weight",col = "yellow",border = "blue")

# Save the file.

dev.off()

When we execute the above code, it produces the following result −

Range of X and Y values
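To specify the range of values shown on the X axis and Y axis, we can use the xlim and ylim
parameters.

The width of the bars can be controlled through breaks. A single number such as breaks = 5 is
treated as a suggestion for the number of bars, not an exact count.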

# Create data for the graph.

v <- c(9,13,21,8,36,22,12,41,31,33,19)

# Give the chart file a name.

png(file = "histogram_lim_breaks.png")

# Create the histogram.

hist(v,xlab = "Weight",col = "green",border = "red", xlim = c(0,40), ylim = c(0,5),

breaks = 5)

# Save the file.

dev.off()

When we execute the above code, it produces the following result –

(ii)Plot histogram from built-in datasets
# Simple Histogram
hist(mtcars$mpg)

# Colored Histogram with Different Number of Bins


hist(mtcars$mpg, breaks=12, col="red")

# Add a Normal Curve
x <- mtcars$mpg
h<-hist(x, breaks=10, col="red", xlab="Miles Per Gallon",
main="Histogram with Normal Curve")
xfit<-seq(min(x),max(x),length=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)

Histograms can be a poor method for determining the shape of a distribution because they are
strongly affected by the number of bins used.
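
One common alternative that does not depend on a choice of bins is a kernel density estimate. The following sketch overlays the output of density() on a histogram of the same mtcars$mpg data; the default Gaussian kernel and bandwidth are used, which is itself a choice that affects the smoothness of the curve.

```r
x <- mtcars$mpg
d <- density(x)          # kernel density estimate with default settings

# freq = FALSE puts the histogram on the density scale so the curve is comparable
hist(x, breaks = 10, freq = FALSE, col = "grey90",
     xlab = "Miles Per Gallon", main = "Histogram with kernel density")
lines(d, col = "blue", lwd = 2)
rug(x)                   # mark the individual observations on the x-axis
```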
5. LINE GRAPH / LINE CHART

A line chart is a graph that connects a series of points by drawing line segments between them.
The points are ordered by one of their coordinates (usually the x-coordinate). Line charts
are usually used for identifying trends in data, for example in marketing, weather reports,
finance, and other areas.

The plot() function in R is used to create the line graph.

Syntax

plot(arguments)

The following arguments are commonly used:

• data : numeric values used to create the graph
• type : used to show lines in different forms, as follows:
    o l : show only the line
    o p : show only points
    o o : show both points and line
    o h : show vertical lines
    o s or S : show the line in square wave format (top-down or bottom-up)
• main : title of the line graph
• col : used to assign color to the graph
• xlab : used to label the x-axis
• ylab : used to label the y-axis

Example
A simple line chart is created using the input vector and the type parameter as "o". The
script below will create and save a line chart in the current R working directory.

# Create the data for the chart.

v <- c(7,12,28,3,41)

# Give the chart file a name.

png(file = "line_chart.png")

# Plot the line chart.

plot(v,type = "o")

# Save the file.

dev.off()

When we execute the above code, it produces the following result −

Line Chart Title, Color and Labels
The features of the line chart can be expanded by using additional parameters. We add color to
the points and lines, give a title to the chart and add labels to the axes.

Example

# Create the data for the chart.

v <- c(7,12,28,3,41)

# Give the chart file a name.

png(file = "line_chart_label_colored.png")

# Plot the line chart.

plot(v,type = "o", col = "red", xlab = "Month", ylab = "Rain fall",

main = "Rain fall chart")

# Save the file.

dev.off()

When we execute the above code, it produces the following result −

Multiple Lines in a Line Chart
More than one line can be drawn on the same chart by using the lines()function.

After the first line is plotted, the lines() function can use an additional vector as input to draw the
second line in the chart,

# Create the data for the chart.

v <- c(7,12,28,3,41)

t <- c(14,7,6,19,3)

# Give the chart file a name.

png(file = "line_chart_2_lines.png")

# Plot the line chart.

plot(v,type = "o",col = "red", xlab = "Month", ylab = "Rain fall",

main = "Rain fall chart")

lines(t, type = "o", col = "blue")

# Save the file.

dev.off()

When we execute the above code, it produces the following result –
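
When several series are drawn on one chart, the axes are fixed by the first plot() call, so a later series with larger values would be clipped. The sketch below, an illustrative variation on the example above, sets ylim from the range of both vectors and adds a legend.

```r
v <- c(7, 12, 28, 3, 41)
t <- c(14, 7, 6, 19, 3)

yr <- range(v, t)        # common y range covering both series

plot(v, type = "o", col = "red", ylim = yr,
     xlab = "Month", ylab = "Rain fall", main = "Rain fall chart")
lines(t, type = "o", col = "blue")
legend("topleft", legend = c("v", "t"), col = c("red", "blue"), lty = 1, pch = 1)
```

In this data set v already spans the values of t, so the result looks the same as before; the explicit ylim matters when the second series goes beyond the first.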

6. SCATTER PLOT

Scatter plots are used to show the relation between two variables of a given data set. The data
are displayed as a group of points in the Cartesian plane, and each point represents the values
of two variables. They are useful for checking whether the two variables are related to or
independent of each other. One variable is plotted on the horizontal axis and the other on the
vertical axis.

The simple scatterplot is created using the plot() function.

Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)

Following is the description of the parameters used −

• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• main is the title of the graph.
• xlab is the label on the horizontal axis.
• ylab is the label on the vertical axis.
• xlim gives the limits of the values of x used for plotting.
• ylim gives the limits of the values of y used for plotting.
• axes indicates whether both axes should be drawn on the plot.

Example
We use the data set "mtcars" available in the R environment to create a basic scatterplot. Let's
use the columns "wt" and "mpg" in mtcars.

input <- mtcars[,c('wt','mpg')]

print(head(input))

When we execute the above code, it produces the following result −


wt mpg
Mazda RX4 2.620 21.0
Mazda RX4 Wag 2.875 21.0
Datsun 710 2.320 22.8
Hornet 4 Drive 3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant 3.460 18.1

Creating the Scatterplot


The below script will create a scatterplot graph for the relation between wt(weight) and mpg(miles
per gallon).

# Get the input values.

input <- mtcars[,c('wt','mpg')]

# Give the chart file a name.

png(file = "scatterplot.png")

# Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.

plot(x = input$wt,y = input$mpg,
   xlab = "Weight",
   ylab = "Mileage",
   xlim = c(2.5,5),
   ylim = c(15,30),
   main = "Weight vs Mileage"
)

# Save the file.

dev.off()

When we execute the above code, it produces the following result –

Scatterplot Matrices
When we have more than two variables and we want to find the correlation between one variable
versus the remaining ones we use scatterplot matrix. We use pairs() function to create matrices
of scatterplots.

Syntax
The basic syntax for creating scatterplot matrices in R is −
pairs(formula, data)

Following is the description of the parameters used −

• formula represents the series of variables used in pairs.
• data represents the data set from which the variables will be taken.

Example
Each variable is paired up with each of the remaining variables. A scatterplot is plotted for each
pair.

# Give the chart file a name.

png(file = "scatterplot_matrices.png")

# Plot the matrices between 4 variables giving 12 plots.

# One variable with 3 others and total 4 variables.

pairs(~wt+mpg+disp+cyl,data = mtcars,

main = "Scatterplot Matrix")

# Save the file.

dev.off()

When the above code is executed we get the following output.
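
The scatterplot matrix shows the pairwise relationships visually. The corresponding correlation coefficients can be computed with cor(). The following is a minimal sketch for the same four mtcars columns; the names vars and cor_matrix are chosen here only for illustration.

```r
# Columns shown in the scatterplot matrix above.
vars <- c("wt", "mpg", "disp", "cyl")

# Pairwise Pearson correlations between the four variables.
cor_matrix <- cor(mtcars[, vars])
print(round(cor_matrix, 2))

# Correlation of one variable (mpg) with each of the remaining ones.
print(round(cor_matrix["mpg", setdiff(vars, "mpg")], 2))
```

A coefficient near +1 or -1 corresponds to a panel in which the points lie close to a straight line.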

II. Low-level plotting functions


Sometimes the high-level plotting functions don't produce exactly the kind of plot you desire. In
this case, low-level plotting commands can be used to add extra information (such as points,
lines or text) to the current plot.
Some of the more useful low-level plotting functions are:

points(x, y)
lines(x, y)
        Add points or connected lines to the current plot.

text(x, y, labels, ...)
        Adds text to a plot at the points given by x, y. Normally labels is an
        integer or character vector, in which case labels[i] is plotted at point
        (x[i], y[i]). The default is 1:length(x).
        Note: This function is often used in the sequence
            plot(x, y, type="n"); text(x, y, names)
        The graphics parameter type="n" suppresses the points but sets up
        the axes, and the text() function supplies special characters, as
        specified by the character vector names for the points.

abline(a, b)
        Adds a line of slope b and intercept a to the current plot.

abline(h=y)
        Adds a horizontal line.

abline(v=x)
        Adds a vertical line.

polygon(x, y, ...)
        Draws a polygon defined by the ordered vertices in (x, y) and
        (optionally) shades it in with hatch lines, or fills it if the graphics
        device allows the filling of figures.

legend(x, y, legend, ...)
        Adds a legend to the current plot at the specified position. Plotting
        characters, line styles, colors etc., are identified with the labels in
        the character vector legend.
        At least one other argument v (a vector the same length as legend)
        with the corresponding values of the plotting unit must also be given,
        as follows:
            legend( , fill=v)   Colors for filled boxes
            legend( , col=v)    Colors in which points or lines will be drawn
            legend( , lty=v)    Line styles
            legend( , lwd=v)    Line widths
            legend( , pch=v)    Plotting characters

title(main, sub)
        Adds a title main to the top of the current plot in a large font and
        (optionally) a sub-title sub at the bottom in a smaller font.

axis(side, ...)
        Adds an axis to the current plot on the side given by the first
        argument (1 to 4, counting clockwise from the bottom). Other
        arguments control the positioning of the axis within or beside the
        plot, and tick positions and labels. Useful for adding custom axes
        after calling plot() with the axes=FALSE argument.
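
As a brief sketch of how several of these functions combine, the code below builds a plot of the built-in cars data set step by step with axis(), abline(), title() and legend(). The choice of data set, tick positions and labels is an assumption made for illustration only.

```r
# Draw to a null PDF device so the sketch also runs without a display;
# omit this line (and dev.off()) to see the plot in an interactive session.
pdf(NULL)

# High-level call: set up the plot without axes.
plot(cars$speed, cars$dist, axes = FALSE,
     xlab = "Speed [mph]", ylab = "Distance [ft]")

# Custom axes: side 1 is the bottom, side 2 is the left.
axis(1, at = seq(5, 25, by = 5))
axis(2, at = seq(0, 120, by = 20))

# Dashed reference lines at the mean of each variable.
mean_speed <- mean(cars$speed)
mean_dist <- mean(cars$dist)
abline(v = mean_speed, lty = 2)
abline(h = mean_dist, lty = 2)

# Least-squares line: abline(a, b) takes intercept a and slope b.
fit <- coef(lm(dist ~ speed, data = cars))
abline(a = fit[1], b = fit[2], col = "red")

title(main = "Stopping distance vs speed", sub = "Data set: cars")
legend("topleft", legend = c("Means", "Least-squares line"),
       lty = c(2, 1), col = c("black", "red"))

dev.off()
```

Each low-level call draws on top of what is already there, so the order of the calls determines what ends up in front.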

Example 1 – Using points, lines and legend.
attach(cars)
plot(cars, type = "n", xlab = "Speed [mph]", ylab = "Distance [ft]")
points(speed[speed < 15], dist[speed < 15], pch = "s", col = "blue")
points(speed[speed >= 15], dist[speed >= 15], pch = "f", col = "green")
lines(lowess(cars), col = "red")
legend(5, 120, pch = c("s", "f"), col = c("blue", "green"),
       legend = c("Slow", "Fast"))
title("Braking distance of old cars")
detach(cars)

Example 2 – Generate a chart of the 26 plotting symbols (pch values 0 to 25) that you can use to
produce points in your graphs

Solution :
# Make an empty chart
plot(1, 1, xlim=c(1,5.5), ylim=c(0,7), type="n", ann=FALSE)

# Plot digits 0-4 with increasing size and color


text(1:5, rep(6,5), labels=c(0:4), cex=1:5, col=1:5)

# Plot symbols 0-4 with increasing size and color


points(1:5, rep(5,5), cex=1:5, col=1:5, pch=0:4)
text((1:5)+0.4, rep(5,5), cex=0.6, (0:4))

# Plot symbols 5-9 with labels


points(1:5, rep(4,5), cex=2, pch=(5:9))
text((1:5)+0.4, rep(4,5), cex=0.6, (5:9))

# Plot symbols 10-14 with labels


points(1:5, rep(3,5), cex=2, pch=(10:14))
text((1:5)+0.4, rep(3,5), cex=0.6, (10:14))

# Plot symbols 15-19 with labels


points(1:5, rep(2,5), cex=2, pch=(15:19))
text((1:5)+0.4, rep(2,5), cex=0.6, (15:19))

# Plot symbols 20-25 with labels


points((1:6)*0.8+0.2, rep(1,6), cex=2, pch=(20:25))
text((1:6)*0.8+0.5, rep(1,6), cex=0.6, (20:25))

III. Interactive graphics functions
R provides functions that let the user extract information from a plot, or add information to it,
with the mouse: locator() returns the coordinates of the positions clicked, and identify() labels
the plotted points nearest to the clicks.
Example 1
> plot(1:20, rt(20, 1))
> text(locator(1), "outlier", adj = 0)
locator(1) waits for the user to select one location on the current plot using the left mouse
button; text() then writes the label "outlier" at that position.

Example 2
Identify members in a hierarchical cluster analysis of distances between European cities
Dataset used: eurodist
> hca <- hclust(eurodist)
> plot(hca, main = "Distance between European Cities")
> (x <- identify(hca))
> x

ADVANCED GRAPHICS
a) Lattice Graphs
What is Lattice?
Lattice is a powerful and elegant high-level data visualization system inspired by Trellis
graphics. It is designed with an emphasis on multivariate data and allows easy conditioning to
produce “small multiple” plots.

Lattice Graphs
The lattice package was written by Deepayan Sarkar. It aims to improve on base R graphics by
providing better defaults and the ability to display multivariate relationships easily.
This package supports the creation of trellis graphs: graphs that display a variable, or the
relationship between variables, conditioned on one or more other variables.
The typical format is:
graph_type(formula, data=)
graph_type is one of the functions listed below. The formula specifies the variable(s) to display
and any conditioning variables.
For example:
~x|A means display numeric variable x for each level of factor A;
y~x | A*B relationship between numeric variables y and x for every combination of factor A and
B levels;
~x means display numeric variable x alone.
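
The three formula shapes above can be sketched as follows. This assumes the lattice package is installed and uses the mtcars data set with cyl and am converted to factors; the names cars_df, p1, p2 and p3 are illustrative only.

```r
library(lattice)

# Treat cylinders and transmission as factors so they can condition the plots.
cars_df <- transform(mtcars,
                     cyl = factor(cyl),
                     am = factor(am, labels = c("automatic", "manual")))

# ~x : distribution of one numeric variable.
p1 <- histogram(~ mpg, data = cars_df)

# ~x | A : the same display, one panel per level of cyl.
p2 <- histogram(~ mpg | cyl, data = cars_df)

# y ~ x | A * B : mpg against wt for every combination of cyl and am.
p3 <- xyplot(mpg ~ wt | cyl * am, data = cars_df)

# Lattice functions return "trellis" objects; printing one draws it.
# print(p3)
```

Because the plots are ordinary R objects, they can be stored, modified with update() and printed later.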

Main functions in the lattice package

Function         Description

xyplot()         Scatter plot
splom()          Scatter plot matrix
cloud()          3D scatter plot
stripplot()      Strip plot (1-D scatter plot)
bwplot()         Box plot
dotplot()        Dot plot
barchart()       Bar chart
histogram()      Histogram
densityplot()    Kernel density plot
qqmath()         Theoretical quantile plot
qq()             Two-sample quantile plot
contourplot()    3D contour plot of surfaces
levelplot()      False color level plot of surfaces
parallelplot()   Parallel coordinates plot (named parallel() in older versions of lattice)
wireframe()      3D wireframe graph

Installing and loading the lattice package

# Install
install.packages("lattice")
# Load
library("lattice")

Example – 1 (mtcars dataset used here)

# Customized lattice example

library(lattice)
panel.smoother <- function(x, y) {
  panel.xyplot(x, y)  # show points
  panel.loess(x, y)   # show smoothed line
}
attach(mtcars)
hp <- cut(hp, 3)  # divide horsepower into three bands
xyplot(mpg ~ wt | hp, scales = list(cex = .8, col = "red"),
       panel = panel.smoother,
       xlab = "Weight", ylab = "Miles per Gallon",
       main = "MPG vs Weight by Horse Power")
detach(mtcars)

Output

b)ggplot2 in R
Install package ggplot2.
ggplot2 is a data visualization package for the statistical programming language R. In other
words, ggplot2 is an R library for creating graphics. It can greatly improve the quality
and aesthetics of your graphics, and will make you much more efficient in creating them. ggplot2
allows you to build almost any type of graphic. The ggplot2 package, created by Hadley
Wickham, offers a powerful graphics language for creating elegant and complex plots. Its
popularity in the R community has exploded in recent years. Originally based on Leland
Wilkinson's The Grammar of Graphics, ggplot2 allows you to create graphs that represent both
univariate and multivariate numerical and categorical data in a straightforward manner. Grouping
can be represented by color, symbol, size, and transparency. It serves as a general scheme for
data visualization which breaks up graphs into semantic components such as scales and layers.
ggplot2 can serve as a replacement for the base graphics in R and contains a number of defaults
for web and print display of common scales.

In contrast to base R graphics, ggplot2 allows the user to add, remove or alter components in a plot
at a high level of abstraction. This abstraction comes at a cost, with ggplot2 being slower than
lattice graphics.

Note : At present, ggplot2 cannot be used to create 3D graphs or mosaic plots.

The process of making any ggplot is as follows.


1. The Setup
First, you need to tell ggplot what dataset to use. This is done using the ggplot(df) function,
where df is a dataframe that contains all features needed to make the plot. This is the most basic
step. Unlike base graphics, ggplot doesn’t take vectors as arguments.
Optionally you can add whatever aesthetics you want to apply to your ggplot
(inside aes() argument) - such as X and Y axis by specifying the respective variables from the
dataset. The variable based on which the color, size, shape and stroke should change can also be
specified here itself. The aesthetics specified here will be inherited by all the geom layers you
will add subsequently.
If you intend to add more layers later on, maybe a bar chart on top of a line graph, you can
specify the respective aesthetics when you add those layers.
Below are a few examples of how to set up ggplot using the diamonds dataset that
comes with ggplot2 itself.
However, no data will be drawn until you add the geom layers.
Examples:

library(ggplot2)
ggplot(diamonds) # if only the dataset is known.
ggplot(diamonds, aes(x=carat)) # if only X-axis is known. The Y-axis can be specified in respective geoms.
ggplot(diamonds, aes(x=carat, y=price)) # if both X and Y axes are fixed for all layers.
ggplot(diamonds, aes(x=carat, color=cut)) # Each category of the 'cut' variable will now have a distinct color, once a geom is added.

The aes argument stands for aesthetics. ggplot2 considers the X and Y axis of the plot to be
aesthetics as well, along with color, size, shape, fill etc. If you want to have the color, size etc.
fixed (i.e. not vary based on a variable from the dataframe), you need to specify it outside
aes(), as an argument of the geom layer it applies to, like this.

ggplot(diamonds, aes(x=carat)) + geom_density(color="steelblue")

2. The Layers
The layers in ggplot2 are also called ‘geoms’. Once the base setup is done, you can append the
geoms one on top of the other.

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + geom_smooth() # Adding scatterplot geom (layer1) and smoothing geom (layer2).

We have added two layers (geoms) to this plot - the geom_point() and geom_smooth(). Since the
X axis Y axis and the color were defined in ggplot() setup itself, these two layers inherited those
aesthetics.
Alternatively, you can specify those aesthetics inside the geom layer also as shown below.

ggplot(diamonds) + geom_point(aes(x=carat, y=price, color=cut)) + geom_smooth(aes(x=carat, y=price, color=cut)) # Same as above but specifying the aesthetics inside the geoms.

Notice the X and Y axis and how the color of the points vary based on the value of cut variable.
The legend was automatically added.
Instead of having multiple smoothing lines for each level of cut, let us integrate them all under
one line.
How to do that?
Removing the color aesthetic from geom_smooth() layer would accomplish that.

library(ggplot2)
ggplot(diamonds) + geom_point(aes(x=carat, y=price, color=cut)) + geom_smooth(aes(x=carat, y=price)) # Remove color from geom_smooth
ggplot(diamonds, aes(x=carat, y=price)) + geom_point(aes(color=cut)) + geom_smooth() # same but simpler

Now make the shape of the points vary with the color feature (the color column of the diamonds
dataset).

ggplot(diamonds, aes(x=carat, y=price, color=cut, shape=color)) + geom_point()

3. The Labels
Now that you have drawn the main parts of the graph, you might want to add the plot’s main
title and perhaps change the X and Y axis titles. This can be accomplished using the labs layer,
meant for specifying the labels. However, manipulating the size and color of the labels is the job
of the ‘Theme’.

library(ggplot2)
gg <- ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + labs(title="Scatterplot", x="Carat", y="Price") # add axis labels and plot title.
print(gg)

The plot’s main title is added and the X and Y axis labels capitalized.
Note: If you are showing a ggplot inside a function, you need to explicitly save it and
then print using the print(gg), like we just did above.

4. The Theme
Almost everything is set, except that we want to increase the size of the labels and change the
legend title. Adjusting the size of labels can be done using the theme() function by setting
the plot.title, axis.text.x and axis.text.y. They need to be specified inside the element_text(). If
you want to remove any of them, set it to element_blank() and it will vanish entirely.
Adjusting the legend title is a bit tricky. If your legend is that of a color attribute and it varies
based on a factor, you need to set the name using scale_color_discrete(), where the color part
refers to the color attribute and discrete reflects that the legend is based on a factor variable.

gg1 <- gg + theme(plot.title=element_text(size=30, face="bold"),
                  axis.text.x=element_text(size=15),
                  axis.text.y=element_text(size=15),
                  axis.title.x=element_text(size=25),
                  axis.title.y=element_text(size=25)) +
  scale_color_discrete(name="Cut of diamonds") # set title and axis text sizes, change legend title.
print(gg1) # print the plot

If the legend shows a shape attribute based on a factor variable, you need to change it
using scale_shape_discrete(name="legend title").
A continuous variable cannot be mapped to shape, so shape legends have no continuous
counterpart. For aesthetics that do accept continuous variables, use the matching continuous
scale, for example scale_color_continuous(name="legend title").
What is the function to use if your legend is based on a fill attribute on a continuous
variable?
The answer is scale_fill_continuous(name="legend title").

5. The Facets
In the previous chart, we had the scatterplot for all different values of cut plotted in the same
chart.
What if you want a separate chart for each cut?

gg1 + facet_wrap( ~ cut, ncol=3) # columns defined by 'cut'

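facet_wrap(formula) takes in a formula as the argument. One panel is drawn for each combination
of the variables in the formula, and the panels are wrapped into rows and columns in sequence;
ncol (or nrow) controls the layout.

gg1 + facet_wrap(color ~ cut) # one panel per combination of color and cut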

In facet_wrap, the scales of the X and Y axis are fixed to accommodate all points by default. This
makes comparison between panels meaningful because they share the same scale.
However, it is possible to let the scales vary freely between panels, which makes the charts look
more evenly distributed, by setting the argument scales="free".

gg1 + facet_wrap(color ~ cut, scales="free") # one panel per combination, each with its own scales

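For comparison purposes, you can put all the plots in a grid as well using facet_grid(formula).
Here the item on the LHS of the formula defines the rows and the item on the RHS defines the
columns.

gg1 + facet_grid(color ~ cut) # In a grid. row: color, column: cut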

VISUALISING DISTRIBUTIONS
How you visualise the distribution of a variable will depend on whether the variable is
categorical or continuous.
A variable is categorical if it can only take one of a small set of values. In R, categorical
variables are usually saved as factors or character vectors. To examine the distribution of a
categorical variable, use a bar chart.

A variable is continuous if it can take any of an infinite set of ordered values. Numbers and
date-times are two examples of continuous variables. To examine the distribution of a
continuous variable, use a histogram.

In both bar charts and histograms, tall bars show the common values of a variable, and shorter
bars show less-common values. Places that do not have bars reveal values that were not seen
in your data.
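
The two cases can be sketched with base graphics as follows. The choice of the mtcars columns cyl (treated as categorical) and mpg (continuous), and the bin boundaries, are assumptions made for illustration.

```r
pdf(NULL)  # null device; drop this line and dev.off() in an interactive session

# Categorical variable: number of cylinders. table() counts each category,
# and barplot() draws one bar per category.
cyl_counts <- table(mtcars$cyl)
barplot(cyl_counts, xlab = "Cylinders", ylab = "Number of cars")

# Continuous variable: miles per gallon. hist() groups the values into bins
# and returns the bin counts along with drawing the histogram.
mpg_hist <- hist(mtcars$mpg, breaks = seq(10, 35, by = 5),
                 xlab = "Miles per gallon", main = "Distribution of mpg")

dev.off()

print(cyl_counts)
print(mpg_hist$counts)
```

The printed counts are the bar heights, so the same information can be read numerically as well as from the charts.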

SAVING GRAPHS
If you are working with RStudio, the plot can be exported from the menu in the plot panel (lower
right panel).
Plots panel –> Export –> Save as Image or Save as PDF

It’s also possible to save the graph using R code as follows:

1. Specify the file to save your image to using a function such as jpeg(), png(), svg() or pdf().
Additional arguments indicating the width and the height of the image can also be used.
2. Create the plot.
3. Close the file with dev.off().

Function                       Output to
pdf("mygraph.pdf")             pdf file
win.metafile("mygraph.wmf")    windows metafile
png("mygraph.png")             png file
jpeg("mygraph.jpg")            jpeg file
bmp("mygraph.bmp")             bmp file
postscript("mygraph.ps")       postscript file

Use a full path in the file name to save the graph outside of the current working directory.

# example - output graph to jpeg file


jpeg("c:/mygraphs/myplot.jpg")
plot(x)
dev.off()

When only a file name is given, without a path, the file is saved in the current working directory.
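
The three steps can be sketched as follows. The file name, the temporary directory and the plotted data are assumptions made for illustration; pdf() is used here, and the other device functions follow the same pattern.

```r
# Write to a temporary directory so the sketch leaves no files behind;
# replace out_file with any path you like.
out_file <- file.path(tempdir(), "myplot.pdf")

# 1. Open the file device; width and height are in inches for pdf().
pdf(out_file, width = 7, height = 5)

# 2. Create the plot.
plot(mtcars$wt, mtcars$mpg, xlab = "Weight", ylab = "Mileage")

# 3. Close the device; the file is only complete after dev.off().
dev.off()

print(file.exists(out_file))
```

If dev.off() is forgotten, later plots keep going to the same file instead of the screen.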

File formats for exporting plots:


 pdf("rplot.pdf"): pdf file
 png("rplot.png"): png file
 jpeg("rplot.jpg"): jpeg file
 postscript("rplot.ps"): postscript file
 bmp("rplot.bmp"): bmp file
 win.metafile("rplot.wmf"): windows metafile (Windows only)

REGRESSION ANALYSIS
Regression analysis is a very widely used statistical tool for establishing a relationship model
between two variables. One of these variables is called the predictor variable, whose value is
gathered through experiments. The other is called the response variable, whose value is derived
from the predictor variable.

i) R- LINEAR REGRESSION

In Linear Regression these two variables are related through an equation, where exponent (power)
of both these variables is 1. Mathematically a linear relationship represents a straight line when
plotted as a graph. A non-linear relationship where the exponent of any variable is not equal to 1
creates a curve.

The general mathematical equation for a linear regression is −

y = ax + b
Following is the description of the parameters used −

 y is the response variable.

 x is the predictor variable.

 a and b are constants which are called the coefficients.

Steps to Establish a Regression


A simple example of regression is predicting weight of a person when his height is known. To do
this we need to have the relationship between height and weight of a person.

The steps to create the relationship are −

 Carry out the experiment of gathering a sample of observed values of height and
corresponding weight.

 Create a relationship model using the lm() function in R.

 Find the coefficients from the model created and create the mathematical equation using
these.

 Get a summary of the relationship model to know the average error in prediction. The
individual prediction errors are called residuals.

 To predict the weight of new persons, use the predict() function in R.

Input Data
Below is the sample data representing the observations −

# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131

# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48

lm() Function
This function creates the relationship model between the predictor and the response variable.

Syntax
The basic syntax for lm() function in linear regression is −

lm(formula,data)
Following is the description of the parameters used −

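 formula is a symbol presenting the relation between x and y.

 data is the data frame containing the variables used in the formula. It can be omitted when the
variables are vectors in the workspace, as in the example below.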

Create Relationship Model & get the Coefficients

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.

relation <- lm(y~x)

print(relation)

When we execute the above code, it produces the following result −


Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
   -38.4551       0.6746
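
For simple linear regression the least-squares coefficients also have a closed form: the slope is cov(x, y) / var(x), and the intercept is mean(y) minus the slope times mean(x). The sketch below checks this against lm() for the same data; the names slope and intercept are illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)  # height
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)            # weight

# Least-squares estimates for y = ax + b.
slope <- cov(x, y) / var(x)              # a
intercept <- mean(y) - slope * mean(x)   # b

print(c(intercept = intercept, slope = slope))

# The same values as reported by lm().
print(coef(lm(y ~ x)))
```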

Get the Summary of the Relationship

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.

relation <- lm(y~x)

print(summary(relation))

When we execute the above code, it produces the following result −


Call:
lm(formula = y ~ x)

Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.253 on 8 degrees of freedom


Multiple R-squared: 0.9548, Adjusted R-squared: 0.9491
F-statistic: 168.9 on 1 and 8 DF, p-value: 1.164e-06
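
Two of the figures in this summary can be recomputed from the residuals, which may help in reading the output. This is a minimal sketch; the names res, rse and r2 are illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# Residuals: observed weight minus the weight fitted by the model.
res <- residuals(relation)

# Residual standard error: typical size of a residual,
# using the residual degrees of freedom (n - 2 here).
rse <- sqrt(sum(res^2) / df.residual(relation))

# Multiple R-squared: share of the variation in y explained by the model.
r2 <- 1 - sum(res^2) / sum((y - mean(y))^2)

print(round(c(rse = rse, r.squared = r2), 4))
```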

predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)

Following is the description of the parameters used −

 object is the model which is already created using the lm() function.

 newdata is the data frame containing the new values for the predictor variable.

Predict the weight of new persons

# The predictor vector.

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.

relation <- lm(y~x)

# Find weight of a person with height 170.

a <- data.frame(x = 170)

result <- predict(relation,a)

print(result)

When we execute the above code, it produces the following result −


1
76.22869
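
predict() also accepts several new values at once and can return an interval around each prediction. The following sketch assumes three illustrative heights; the column in newdata must carry the same name as the predictor used in the model, here x.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)

# New heights to predict for (illustrative values).
new_heights <- data.frame(x = c(160, 170, 180))

# interval = "prediction" adds lower and upper bounds (95% by default)
# for the weight of a single new person at each height.
pred <- predict(relation, newdata = new_heights, interval = "prediction")
print(pred)
```

The fit column holds the point predictions; lwr and upr hold the interval bounds.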

Visualize the Regression Graphically

# Create the predictor and response variable.

x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

relation <- lm(y~x)

# Give the chart file a name.

png(file = "linearregression.png")

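# Plot the chart: weight (y) on the horizontal axis, height (x) on the vertical axis.

plot(y, x, col = "blue", main = "Height & Weight Regression",
     cex = 1.3, pch = 16, xlab = "Weight in Kg", ylab = "Height in cm")

# Add the least-squares line for height as a function of weight.

abline(lm(x ~ y))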

# Save the file.

dev.off()

When we execute the above code, it produces the following result −

ii) R - MULTIPLE REGRESSION


Multiple regression is an extension of linear regression into relationship between more than two
variables. In simple linear relation we have one predictor and one response variable, but in
multiple regression we have more than one predictor variable and one response variable.

The general mathematical equation for multiple regression is −


y = a + b1x1 + b2x2 +...bnxn

Following is the description of the parameters used −

 y is the response variable.

 a, b1, b2...bn are the coefficients.

 x1, x2, ...xn are the predictor variables.

We create the regression model using the lm() function in R. The model determines the value of
the coefficients using the input data. Next we can predict the value of the response variable for a
given set of predictor variables using these coefficients.

lm() Function
This function creates the relationship model between the predictor and the response variable.

Syntax
The basic syntax for lm() function in multiple regression is −

lm(y ~ x1+x2+x3...,data)
Following is the description of the parameters used −

 formula is a symbol presenting the relation between the response variable and predictor
variables.

 data is the data frame on which the formula will be applied.

Example
Input Data
Consider the data set "mtcars" available in the R environment. It gives a comparison between
different car models in terms of mileage per gallon ("mpg"), cylinder displacement ("disp"), horse
power ("hp"), weight of the car ("wt") and some more parameters.

The goal of the model is to establish the relationship between "mpg" as a response variable with
"disp","hp" and "wt" as predictor variables. We create a subset of these variables from the mtcars
data set for this purpose.

input <- mtcars[,c("mpg","disp","hp","wt")]

print(head(input))

When we execute the above code, it produces the following result −


mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
Hornet Sportabout 18.7 360 175 3.440
Valiant 18.1 225 105 3.460

Create Relationship Model & get the Coefficients

input <- mtcars[,c("mpg","disp","hp","wt")]

# Create the relationship model.

model <- lm(mpg~disp+hp+wt, data = input)

# Show the model.

print(model)

# Get the Intercept and coefficients as vector elements.

cat("# # # # The Coefficient Values # # # ","\n")

a <- coef(model)[1]

print(a)

Xdisp <- coef(model)[2]

Xhp <- coef(model)[3]

Xwt <- coef(model)[4]

print(Xdisp)

print(Xhp)

print(Xwt)

When we execute the above code, it produces the following result −


Call:
lm(formula = mpg ~ disp + hp + wt, data = input)

Coefficients:
(Intercept) disp hp wt
37.105505 -0.000937 -0.031157 -3.800891

# # # # The Coefficient Values # # #


(Intercept)
37.10551
disp
-0.0009370091
hp
-0.03115655
wt
-3.800891

Create Equation for Regression Model


Based on the above intercept and coefficient values, we create the
mathematical equation.
Y = a+Xdisp.x1+Xhp.x2+Xwt.x3
or
Y = 37.1055+(-0.000937)*x1+(-0.03116)*x2+(-3.8009)*x3

Apply Equation for predicting New Values


We can use the regression equation created above to predict the mileage
when a new set of values for displacement, horse power and weight is
provided.

For a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage is

Y = 37.1055+(-0.000937)*221+(-0.03116)*102+(-3.8009)*2.91 = 22.66 (approximately)
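
The same prediction can be obtained without typing rounded coefficients by passing the new values to predict(). The sketch below also repeats the hand calculation with the unrounded coefficients; the names new_car, predicted and manual are illustrative.

```r
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# New car: column names must match the predictors in the model.
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)
predicted <- predict(model, newdata = new_car)

# Hand calculation: intercept + sum of coefficient * value.
manual <- sum(coef(model) * c(1, 221, 102, 2.91))

print(c(predict = unname(predicted), manual = manual))
```

Both numbers agree, since predict() evaluates the same equation with full-precision coefficients.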

iii) R - LOGISTIC REGRESSION

The Logistic Regression is a regression model in which the response variable (dependent variable)
has categorical values such as True/False or 0/1. It actually measures the probability of a binary
response as the value of response variable based on the mathematical equation relating it with the
predictor variables.

The general mathematical equation for logistic regression is −


y = 1/(1+e^-(a+b1x1+b2x2+b3x3+...))

Following is the description of the parameters used −

 y is the response variable.

 x is the predictor variable.

 a and b are the coefficients which are numeric constants.
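
To get a feel for the equation, the sketch below evaluates it for one predictor with made-up coefficients (a = -4 and b = 2 are assumptions for illustration, not fitted values). The result always lies between 0 and 1 and equals 0.5 where a + bx = 0.

```r
# Logistic function for a single predictor x.
logistic <- function(x, a, b) 1 / (1 + exp(-(a + b * x)))

# Hypothetical coefficients, chosen only for illustration.
a <- -4
b <- 2

x <- c(0, 1, 2, 3, 4)
p <- logistic(x, a, b)
print(round(p, 3))

# R's built-in plogis() computes the same function of a + b * x.
print(all.equal(p, plogis(a + b * x)))
```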

The function used to create the regression model is the glm() function.

Syntax

The basic syntax for glm() function in logistic regression is −
glm(formula,data,family)

Following is the description of the parameters used −

 formula is the symbol presenting the relationship between the variables.

 data is the data set giving the values of these variables.

 family is an R object to specify the details of the model. Its value is binomial for
logistic regression.

Example
The in-built data set "mtcars" describes different models of a car with their
various engine specifications. In "mtcars" data set, the transmission mode
(automatic or manual) is described by the column am which is a binary value
(0 or 1). We can create a logistic regression model between the columns "am"
and 3 other columns - hp, wt and cyl.

# Select some columns from mtcars.

input <- mtcars[,c("am","cyl","hp","wt")]

print(head(input))

When we execute the above code, it produces the following result −


am cyl hp wt
Mazda RX4 1 6 110 2.620
Mazda RX4 Wag 1 6 110 2.875
Datsun 710 1 4 93 2.320
Hornet 4 Drive 0 6 110 3.215
Hornet Sportabout 0 8 175 3.440
Valiant 0 6 105 3.460

Create Regression Model


We use the glm() function to create the regression model and get its
summary for analysis.

input <- mtcars[,c("am","cyl","hp","wt")]

am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

print(summary(am.data))

When we execute the above code, it produces the following result −


Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)

Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 43.2297 on 31 degrees of freedom


Residual deviance: 9.8415 on 28 degrees of freedom
AIC: 17.841

Number of Fisher Scoring iterations: 8

Conclusion
In the summary, the p-value in the last column is more than 0.05 for the variables "cyl" and
"hp", so we consider them to be insignificant in contributing to the value of the variable "am".
Only weight (wt) has a significant effect on the "am" value in this regression model; its negative
coefficient means that heavier cars are less likely to have a manual transmission (am = 1).
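
The fitted model can also return a probability for each car. The sketch below refits the model and classifies each car as manual when its probability exceeds 0.5; that cut-off is an assumption chosen for illustration, and the names am_model, prob_manual and predicted_am are hypothetical.

```r
am_model <- glm(am ~ cyl + hp + wt, data = mtcars, family = binomial)

# type = "response" returns probabilities of am = 1 (manual)
# instead of values on the log-odds scale.
prob_manual <- predict(am_model, type = "response")

# Classify with an illustrative cut-off of 0.5.
predicted_am <- as.integer(prob_manual > 0.5)

# Compare observed and predicted transmission types.
print(table(observed = mtcars$am, predicted = predicted_am))
```

The table counts how many cars fall into each observed/predicted combination, which gives a rough picture of how well the model separates the two groups on the data it was fitted to.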

iv) R - POISSON REGRESSION


Poisson Regression involves regression models in which the response variable is in the form of
counts and not fractional numbers. For example, the count of number of births or number of wins
in a football match series. Also the values of the response variables follow a Poisson distribution.

The general mathematical equation for Poisson regression is −


log(y) = a + b1x1 + b2x2 + ... + bnxn

Following is the description of the parameters used −

 y is the response variable.

 a and b are the numeric coefficients.

 x is the predictor variable.

The function used to create the Poisson regression model is
the glm() function.

Syntax
The basic syntax for glm() function in Poisson regression is −
glm(formula,data,family)

Following is the description of the parameters used in above functions −

 formula is the symbol presenting the relationship between the variables.

 data is the data set giving the values of these variables.

 family is an R object to specify the details of the model. Its value is poisson for
Poisson regression.

Example
We have the in-built data set "warpbreaks" which describes the effect of wool
type (A or B) and tension (low, medium or high) on the number of warp
breaks per loom. Let's consider "breaks" as the response variable which is a
count of number of breaks. The wool "type" and "tension" are taken as
predictor variables.

Input Data
input <- warpbreaks

print(head(input))

When we execute the above code, it produces the following result −


breaks wool tension
1 26 A L
2 30 A L
3 54 A L
4 25 A L
5 70 A L
6 52 A L

Create Regression Model

output <-glm(formula = breaks ~ wool+tension,

data = warpbreaks,

family = poisson)

print(summary(output))

When we execute the above code, it produces the following result −


Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)

Deviance Residuals:
Min 1Q Median 3Q Max
-3.6871 -1.6503 -0.4269 1.1902 4.2616

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.69196 0.04541 81.302 < 2e-16 ***
woolB -0.20599 0.05157 -3.994 6.49e-05 ***
tensionM -0.32132 0.06027 -5.332 9.73e-08 ***
tensionH -0.51849 0.06396 -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 297.37 on 53 degrees of freedom


Residual deviance: 210.39 on 50 degrees of freedom
AIC: 493.06

Number of Fisher Scoring iterations: 4

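In the summary we look for the p-value in the last column to be less than 0.05 to consider an
impact of the predictor variable on the response variable. As seen, wool type B and tension
types M and H all have a significant impact on the count of breaks; their negative coefficients
mean fewer expected breaks than for the baseline levels, wool type A and tension L.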
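
Because the model is linear in log(y), exponentiating a coefficient gives a multiplicative effect on the expected count. The sketch below computes these rate ratios and the expected number of breaks for one combination; the names rate_ratios, new_loom and expected_breaks are illustrative.

```r
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# exp(coefficient) = factor by which the expected count changes
# relative to the baseline (wool A, tension L).
rate_ratios <- exp(coef(output))
print(round(rate_ratios, 3))

# Expected number of breaks for wool B at high tension.
new_loom <- data.frame(
  wool = factor("B", levels = levels(warpbreaks$wool)),
  tension = factor("H", levels = levels(warpbreaks$tension))
)
expected_breaks <- predict(output, newdata = new_loom, type = "response")
print(expected_breaks)
```

A rate ratio below 1 corresponds to a negative coefficient, that is, fewer expected breaks than at the baseline level.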

---------------------------------
