
Predictive Analytics: RStudio Study Notes

I. Data Investigation & Preparation

Basic Terms
class(𝑥) displays which class the object 𝑥 is in
summary(𝑥) provides a summary of the object 𝑥
table(𝑥) provides a frequency of each value in 𝑥
table(𝑥, 𝑦) provides a two-dimensional frequency table for 𝑥 and 𝑦
levels(𝑥) displays all of the levels of 𝑥 in the data
head(𝑥) returns the first several rows of a data.frame
remove(𝑥) or rm(𝑥) removes the declared variable 𝑥 from memory
# is used at the beginning of a line for comments
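
A minimal usage sketch of these functions on a small, hypothetical data.frame called dat:
dat <- data.frame(gender = c("male", "female", "male"), age = c(34, 61, 45))
class(dat) #"data.frame"
summary(dat) #summary of each column
table(dat$gender) #frequency of each value
table(dat$gender, dat$age) #two-dimensional frequency table
head(dat) #first rows of the data.frame
rm(dat) #remove dat from memory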

Reading in a CSV file


(1) MPG <- read.csv("MilesPerGallon.csv")

Removing variables
(1) Remove the columns “income” and “face”
data.new <- data.old[ , setdiff(colnames(data.old), c("income", "face"))]

(2) Remove the 7th and 9th columns


data.new <- data.old[ , -c(7, 9)]

(3) Select only the columns “gender” and “age” and the rows where the gender is male and the age is at least 60
data.new <- data.old[data.old$gender == "male" & data.old$age >= 60, c("gender", "age")]

Handling NA and missing observations


(1) Remove observations that have NA for Age.
data.new <- data.old[!is.na(data.old$Age), ]

(2) Replacing missing values with the median


age.median <- median(data.new$Age, na.rm = TRUE)
data.new$Age[is.na(data.new$Age)] <- age.median

Create a log version of a column


(1) Use ifelse to handle zeroes, since log(0) returns -Inf
logIncome <- ifelse(data.new$Income == 0, 0, log(data.new$Income))

Create a compound variable from two other variables


(1) Combine the variables “smoker” and “gender” into a single variable
data.new$smoker_gender <- paste(data.new$smoker, data.new$gender, sep = "_")

Binarize a factor column into multiple columns based on each possible factor
(1) Binarize the column “continent” using predict and dummyVars
library(caret)
data.binarized <- predict(dummyVars(~ continent, data.new, sep = "_"), data.new)
data.new <- cbind(data.new, data.binarized)

Create a flag/indicator field for modeling


(1) data.new$flag <- as.factor(ifelse(data.new$amount >= 10, "P", "F"))

Display combinations of scatterplots for multiple columns
(1) pairs(data.new[ , c(2, 4, 7)])

Creating cuts for a column – bucketing amounts together


(1) Creating specific break points
data.new$AgeBand <- cut(x = data.new$Age, breaks = c(0, 25, 50))

(2) Creating 10 cuts (intervals of equal width)


data.new$AgeBand <- cut(x = data.new$Age, 10)

Split the data into training and testing datasets


(1) Use a random uniform number between 0 and 1 to determine the datasets
set.seed(1000)
data.full$random <- runif(nrow(data.full))
data.train <- data.full[data.full$random < 0.8 , ]
data.test <- data.full[data.full$random >= 0.8 , ]

(2) Alternative method of partitioning the data, using createDataPartition


library(caret)
set.seed(1000)
partition <- createDataPartition(data.full$flag, list = FALSE, p = 0.8)
data.train <- data.full[partition, ]
data.test <- data.full[-partition, ]

II. Plotting With ggplot2

(1) Scatterplot of age and income


library(ggplot2)
p1 <- ggplot(data.new, aes(x = age, y = income)) + geom_point()

(2) Scatterplot of age and income, where the color varies by continent, the x-axis is restricted to [40, 80], and the y-axis is logarithmic
library(ggplot2)
p2 <- ggplot(data.new, aes(x = age, y = income, color = continent)) + geom_point() +
coord_cartesian(xlim = c(40, 80)) + scale_y_log10()

(3) Add labels to graph p2


p3 <- p2 + labs(x = "x Label", y = "y Label", title = "Graph Title")

(4) Divide graph p1 into multiple graphs, based on continent, using facet_wrap
p4 <- p1 + facet_wrap(~continent)

(5) Show graphs p2 and p3 side-by-side using grid.arrange


library(gridExtra)
grid.arrange(p2, p3, ncol = 2)

(6) View a bar graph for every variable in the dataset against the target variable “flag”
for (i in c(1:ncol(data.new)))
{
plot <- ggplot(data = data.new, aes(x = data.new[ , i], fill = flag)) + geom_bar(position = "fill") +
labs(x = colnames(data.new)[i])
print(plot)
}

#Other common layers are geom_smooth, geom_boxplot, and geom_histogram.


#There are many more features available, listed on the ggplot2 cheat sheet, available during the exam

III. Principal Component Analysis & Clustering

Principal Component Analysis


(1) First, subset the data so it contains only numeric columns. Then use prcomp.
pca <- prcomp(data.new, center = TRUE, scale. = TRUE)

(2) Display the standard deviation and proportion of variance for each principal component
summary(pca)
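
The proportion of variance explained can also be computed directly from the component standard deviations; a minimal sketch using the pca object above:
prop.var <- pca$sdev^2 / sum(pca$sdev^2) #proportion of variance explained by each component
cumsum(prop.var) #cumulative proportion of variance explained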

(3) Display the coefficients for each principal component


pca$rotation

Clustering on two variables


(1) First, scale the variables. Then calculate kmeans.
df$a <- scale(df$a)
df$b <- scale(df$b)
km3 <- kmeans(df, 3)

(2) Add a column that shows which cluster each observation belongs in
df$group <- as.factor(km3$cluster)

(3) Determine the best number of clusters k using the elbow method: first, calculate kmeans for
several values of k. Then compute the following ratio for each fit and stop increasing k once the
additional variance explained is no longer substantial (see the sketch below).
km3$betweenss / km3$totss
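
A minimal sketch of this loop, assuming df contains only the scaled clustering variables:
set.seed(1000)
var.explained <- sapply(1:10, function(k) {
km <- kmeans(df, centers = k, nstart = 10)
km$betweenss / km$totss #proportion of total variance explained by k clusters
})
plot(1:10, var.explained, type = "b", xlab = "k (number of clusters)", ylab = "betweenss / totss")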

IV. Generalized Linear Model

Create the generalized linear model


(1) Standard generalized linear model using glm
model <- glm(target ~ x + y + z, data = data.new, family = Gamma(link = "log"))

(2) Use Ridge Regression. Alpha = 0 implies Ridge Regression; alpha = 1 implies Lasso, where coefficients can
become exactly 0. Try various values for alpha and lambda (see the sketch below).
library(glmnet)
mm <- model.matrix(target ~ x + y + z, data = data.new)
model.ridge <- glmnet(x = mm, y = data.new$target, family = "gaussian", alpha = 0, lambda = 0.1)
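
A hedged sketch of one way to compare several alpha values via cross-validated error, reusing mm and the target column from item (2):
set.seed(1000)
for (a in c(0, 0.25, 0.5, 0.75, 1)) {
cv.fit <- cv.glmnet(x = mm, y = data.new$target, family = "gaussian", alpha = a)
print(c(alpha = a, cv.error = min(cv.fit$cvm), lambda.min = cv.fit$lambda.min))
}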

(3) Use cv.glmnet to identify the best lambda, then incorporate this value
library(glmnet)
mm <- model.matrix(target ~ x + y + z, data = data.train)
model.cv <- cv.glmnet(x = mm, y = data.train$target, family = "gaussian", alpha = 1)
model.best <- glmnet(x = mm, y = data.train$target, family = "gaussian",
lambda = model.cv$lambda.min, alpha = 1)
model.best$beta

View results and determine the best model


(1) View the coefficients using summary; choose the model with the lowest AIC.
summary(model)

(2) View confidence intervals for the model coefficients


confint.default(model)

(3) Remove unnecessary variables using stepAIC from the MASS library
library(MASS)
stepAIC(model, direction = "backward")
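
stepAIC returns the final reduced model, so the result is usually captured for later use; a minimal sketch:
model.reduced <- stepAIC(model, direction = "backward")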

(4) Predict the new values; should perform this on both training and testing sets
predictions <- predict(model, newdata = data.new, type = "response")
data.predict <- cbind(data.new, predictions)

(5) Calculate the sum of squared errors; choose the model with the lowest SSE
sse <- sum((data.predict$target - data.predict$predictions)^2)

(6) View the confusionMatrix; choose the model with the highest accuracy
library(caret)
pred.class <- as.factor(ifelse(predictions > 0.5, 1, 0)) #convert predicted probabilities to classes; assumes a binary 0/1 target
confusionMatrix(pred.class, as.factor(data.test$target))

V. Decision Trees

Create the decision tree


(1) Standard decision tree
library(rpart)
set.seed(1000)
d.tree <- rpart(target ~ x + y + z, data = data.train)

(2) Decision tree with a minimum of 5 observations per leaf, complexity parameter is 0.001, maximum depth of 7,
and reduces the Gini impurity measure
library(rpart)
set.seed(1000)
d.tree <- rpart(target ~ ., data = data.train, control = rpart.control(minbucket = 5, cp = .001,
maxdepth = 7), parms = list(split = "gini"))

(3) Regression decision tree that uses anova for assessment and a cp value of 0 for maximum complexity
library(rpart)
set.seed(1000)
d.tree <- rpart(target ~ ., data = data.train, method = "anova", control = rpart.control(minbucket = 10,
cp = 0, maxdepth = 10))

View results and prune the tree
(1) Use rpart.plot to view the decision tree
library(rpart.plot)
rpart.plot(d.tree)

(2) Use printcp and plotcp to view results regarding various cp values used in the decision tree
printcp(d.tree)
plotcp(d.tree)

(3) Can also use the following method to identify the best cp value
cp.best <- d.tree$cptable[which.min(d.tree$cptable[ , “xerror”]), “CP”]

(4) Use the best cp value to prune the tree


d.tree2 <- prune(d.tree, cp = cp.best)

(5) Predict the values and compute the confusionMatrix


library(caret)
predictions <- predict(d.tree2, newdata = data.test, type = "class") #rpart uses type = "class" for classification predictions
confusionMatrix(predictions, as.factor(data.test$flag))

VI. Random Forests (Bagging)

Create the random forest


(1) Random forest that creates 50 trees, 3 variables per tree, and a minimum of 100 observations per terminal node
library(randomForest)
set.seed(1000)
model.rf <- randomForest(target ~ ., data = data.train, ntree = 50, mtry = 3, nodesize = 100,
sampsize = floor(0.6 * nrow(data.train)), importance = TRUE)

(2) Random forest that uses under-sampling with caret; use sampling = "up" for over-sampling
library(caret)
set.seed(1000)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, sampling = "down")
tune_grid <- expand.grid(mtry = c(15:25)) #number of features
model.rf <- train(target ~ ., data = data.train, method = "rf", ntree = 50, importance = TRUE,
trControl = ctrl, tuneGrid = tune_grid)

Evaluate the results of the random forest


(1) Calculate the area under the curve (AUC)
library(pROC)
predictions <- predict(model.rf, newdata = data.test, type = "raw")
auc(as.numeric(data.test$target), as.numeric(predictions))
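
As a hedged complement to the AUC value, the full ROC curve can be plotted with pROC; class-probability predictions from the caret model and a binary target are assumed:
pred.prob <- predict(model.rf, newdata = data.test, type = "prob")[ , 2] #probability of the second class level
roc.obj <- roc(data.test$target, pred.prob)
plot(roc.obj)
auc(roc.obj)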

(2) View the top 10 important features with varImp from caret
library(caret)
rf.importance <- varImp(model.rf)
plot(rf.importance, top = 10)

(3) View partial dependence plots


library(pdp)
partial(model.rf, train = data.train, pred.var = "x", plot = TRUE, rug = TRUE, smooth = TRUE) #pred.var should be a predictor of interest (here "x" is a placeholder), not the target
(4) Compute accuracy with confusionMatrix
confusionMatrix(as.factor(predictions), as.factor(data.test$target))

#Random forests generally reduce variance without affecting the bias. However, they are less interpretable.

VII. Report Structure

(1) Executive Summary
• Give background information
• List the most important features relating to the targets

(2) Data Exploration, Preparation, and Cleaning
• Explain the target variables and whether they require classification or regression
• Mention that we need to consult the legal department to decide if we can implement all the data into our models
• Convey NA values, negatives, or other values that are anomalies
• General explanation of the original dataset including number of rows, columns, and general characteristics
• Display the distribution of the target variables and important features
• Give the dimensions of the dataset after preparing the data

(3) Feature Selection
• Explain any variables that were created and why
• Mention how Principal Component Analysis and/or Clustering affected which features were selected

(4) Model Selection and Validation
• Explain how the training and testing sets were determined
• Demonstrate which parameters were modified and why
• Display a visual of the model as well as the calculations resulting from the model
• Visuals can include the coefficient table for the GLM, residual plots, Q-Q plot, decision tree, a confusionMatrix, etc.

(5) Findings
• Compare and contrast the results of the models
• Mention how decision trees and random forests may predict better but are less interpretable
• List the most important features and which direction they move the target
• Give ideas for future research and analysis

(6) Appendices
• Display a summary of the dataset, using summary(dataset)
• Show any relevant graphs

