
Predictive Analytics: RStudio Study Notes

I. Data Investigation & Preparation

Basic Terms
class(𝑥) displays which class the object 𝑥 is in
summary(𝑥) provides a summary of the object 𝑥
table(𝑥) provides a frequency of each value in 𝑥
table(𝑥, 𝑦) provides a two-dimensional frequency table for 𝑥 and 𝑦
levels(𝑥) displays all of the levels of 𝑥 in the data
head(𝑥) returns the first several rows of a data.frame
remove(𝑥) or rm(𝑥) removes the declared variable 𝑥 from memory
# is used at the beginning of a line for comments
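
A minimal usage sketch of these functions on a small, hypothetical data.frame called dat:
dat <- data.frame(gender = c("male", "female", "male"), age = c(34, 61, 45))
class(dat) #"data.frame"
summary(dat) #summary of each column
table(dat$gender) #frequency of each value
table(dat$gender, dat$age) #two-dimensional frequency table
head(dat) #first rows of the data.frame
rm(dat) #remove dat from memory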

Reading in a CSV file


(1) MPG <- read.csv("MilesPerGallon.csv")

Removing variables
(1) Remove the columns “income” and “face”
data.new <- data.old[ , setdiff(colnames(data.old), c("income", "face"))]

(2) Remove the 7th and 9th columns


data.new <- data.old[ , -c(7, 9)]

(3) Select only the columns “gender” and “age” and the rows where the gender is male and the age is at least 60
data.new <- data.old[data.old$gender == "male" & data.old$age >= 60, c("gender", "age")]

Handling NA and missing observations


(1) Remove observations that have NA for Age.
data.new <- data.old[!is.na(data.old$Age), ]

(2) Replacing missing values with the median


age.median <- median(data.new$Age, na.rm = TRUE)
data.new$Age[is.na(data.new$Age)] <- age.median

Create a log version of a column


(1) Use ifelse to handle zeroes, since log(0) returns -Inf
logIncome <- ifelse(data.new$Income == 0, 0, log(data.new$Income))

Create a compound variable from two other variables


(1) Combine the variables “smoker” and “gender” into a single variable
data.new$smoker_gender <- paste(data.new$smoker, data.new$gender, sep = "_")

Binarize a factor column into multiple columns based on each possible factor
(1) Binarize the column “continent” using predict and dummyVars
library(caret)
data.binarized <- predict(dummyVars(~ continent, data.new, sep = "_"), data.new)
data.new <- cbind(data.new, data.binarized)

Create a flag/indicator field for modeling


(1) data.new$flag <- as.factor(ifelse(data.new$amount >= 10, "P", "F"))

Display combinations of scatterplots for multiple columns
(1) pairs(data.new[ , c(2, 4, 7)])

Creating cuts for a column – bucketing amounts together


(1) Creating specific break points
data.new$AgeBand <- cut(x = data.new$Age, breaks = c(0, 25, 50))

(2) Creating 10 cuts (intervals of equal width)


data.new$AgeBand <- cut(x = data.new$Age, 10)

Split the data into training and testing datasets


(1) Use a random uniform number between 0 and 1 to determine the datasets
set.seed(1000)
data.full$random <- runif(nrow(data.full))
data.train <- data.full[data.full$random < 0.8 , ]
data.test <- data.full[data.full$random >= 0.8 , ]

(2) Alternative method of partitioning the data, using createDataPartition


library(caret)
set.seed(1000)
partition <- createDataPartition(data.full$flag, list = FALSE, p = 0.8)
data.train <- data.full[partition, ]
data.test <- data.full[-partition, ]

II. Plotting With ggplot2

(1) Scatterplot of age and income


library(ggplot2)
p1 <- ggplot(data.new, aes(x = age, y = income)) + geom_point()

(2) Scatterplot of age and income, where the color varies by continent, the x-axis is restricted to [40, 80], and the y-axis is logarithmic
library(ggplot2)
p2 <- ggplot(data.new, aes(x = age, y = income, color = continent)) + geom_point() +
coord_cartesian(xlim = c(40, 80)) + scale_y_log10()

(3) Add labels to graph p2


p3 <- p2 + labs(x = "x Label", y = "y Label", title = "Graph Title")

(4) Divide graph p1 into multiple graphs, based on continent, using facet_wrap
p4 <- p1 + facet_wrap(~continent)

(5) Show graphs p2 and p3 side-by-side using grid.arrange


library(gridExtra)
grid.arrange(p2, p3, ncol = 2)

(6) View a bar graph for every variable in the dataset against the target variable “flag”
for (i in c(1:ncol(data.new)))
{
plot <- ggplot(data = data.new, aes(x = data.new[ , i], fill = flag)) + geom_bar(position = "fill") +
labs(x = colnames(data.new)[i])
print(plot)
}

#Other common layers are geom_smooth, geom_boxplot, and geom_histogram.


#There are many more features available, listed on the ggplot2 cheat sheet, available during the exam

III. Principal Component Analysis & Clustering

Principal Component Analysis


(1) First, subset the data so it contains only numeric columns. Then use prcomp.
pca <- prcomp(data.new, center = TRUE, scale. = TRUE)

(2) Display the standard deviation and proportion of variance for each principal component
summary(pca)
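
The proportion of variance explained can also be computed directly from the component standard deviations; a minimal sketch using the pca object above:
prop.var <- pca$sdev^2 / sum(pca$sdev^2) #proportion of variance explained by each component
cumsum(prop.var) #cumulative proportion of variance explained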

(3) Display the coefficients for each principal component


pca$rotation

Clustering on two variables


(1) First, scale the variables. Then calculate kmeans.
df$a <- scale(df$a)
df$b <- scale(df$b)
km3 <- kmeans(df, 3)

(2) Add a column that shows which cluster each observation belongs in
df$group <- as.factor(km3$cluster)

(3) Determine the best number of clusters k using the elbow method: first, calculate kmeans for
several values of k. Then compute the following ratio for each fit and stop increasing k once the
additional variance explained is no longer substantial (see the sketch below).
km3$betweenss / km3$totss
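
A minimal sketch of this loop, assuming df contains only the scaled clustering variables:
set.seed(1000)
var.explained <- sapply(1:10, function(k) {
km <- kmeans(df, centers = k, nstart = 10)
km$betweenss / km$totss #proportion of total variance explained by k clusters
})
plot(1:10, var.explained, type = "b", xlab = "k (number of clusters)", ylab = "betweenss / totss")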

IV. Generalized Linear Model

Create the generalized linear model


(1) Standard generalized linear model using glm
model <- glm(target ~ x + y + z, data = data.new, family = Gamma(link = "log"))

(2) Use Ridge Regression. Alpha = 0 implies Ridge Regression; alpha = 1 implies Lasso, where coefficients can
become exactly 0. Try various values for alpha and lambda (see the sketch below).
library(glmnet)
mm <- model.matrix(target ~ x + y + z, data = data.new)
model.ridge <- glmnet(x = mm, y = data.new$target, family = "gaussian", alpha = 0, lambda = 0.1)
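
A hedged sketch of one way to compare several alpha values via cross-validated error, reusing mm and the target column from item (2):
set.seed(1000)
for (a in c(0, 0.25, 0.5, 0.75, 1)) {
cv.fit <- cv.glmnet(x = mm, y = data.new$target, family = "gaussian", alpha = a)
print(c(alpha = a, cv.error = min(cv.fit$cvm), lambda.min = cv.fit$lambda.min))
}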

(3) Use cv.glmnet to identify the best lambda, then incorporate this value
library(glmnet)
mm <- model.matrix(target ~ x + y + z, data = data.train)
model.cv <- cv.glmnet(x = mm, y = data.train$target, family = "gaussian", alpha = 1)
model.best <- glmnet(x = mm, y = data.train$target, family = "gaussian",
lambda = model.cv$lambda.min, alpha = 1)
model.best$beta

View results and determine the best model


(1) View the coefficients using summary; choose the model with the lowest AIC.
summary(model)

(2) View confidence intervals for the model coefficients


confint.default(model)

(3) Remove unnecessary variables using stepAIC from the MASS library
library(MASS)
stepAIC(model, direction = "backward")
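
stepAIC returns the final reduced model, so the result is usually captured for later use; a minimal sketch:
model.reduced <- stepAIC(model, direction = "backward")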

(4) Predict the new values; should perform this on both training and testing sets
predictions <- predict(model, newdata = data.new, type = "response")
data.predict <- cbind(data.new, predictions)

(5) Calculate the sum of squared errors; choose the model with the lowest SSE
sse <- sum((data.predict$target - data.predict$predictions)^2)

(6) View the confusionMatrix; choose the model with the highest accuracy
library(caret)
pred.class <- as.factor(ifelse(predictions > 0.5, 1, 0)) #convert predicted probabilities to classes; assumes a binary 0/1 target
confusionMatrix(pred.class, as.factor(data.test$target))

V. Decision Trees

Create the decision tree


(1) Standard decision tree
library(rpart)
set.seed(1000)
d.tree <- rpart(target ~ x + y + z, data = data.train)

(2) Decision tree with a minimum of 5 observations per leaf, complexity parameter is 0.001, maximum depth of 7,
and reduces the Gini impurity measure
library(rpart)
set.seed(1000)
d.tree <- rpart(target ~ ., data = data.train, control = rpart.control(minbucket = 5, cp = .001,
maxdepth = 7), parms = list(split = "gini"))

(3) Regression decision tree that uses anova for assessment and a cp value of 0 for maximum complexity
library(rpart)
set.seed(1000)
d.tree <- rpart(target ~ ., data = data.train, method = "anova", control = rpart.control(minbucket = 10,
cp = 0, maxdepth = 10))

View results and prune the tree
(1) Use rpart.plot to view the decision tree
library(rpart.plot)
rpart.plot(d.tree)

(2) Use printcp and plotcp to view results regarding various cp values used in the decision tree
printcp(d.tree)
plotcp(d.tree)

(3) Can also use the following method to identify the best cp value
cp.best <- d.tree$cptable[which.min(d.tree$cptable[ , “xerror”]), “CP”]

(4) Use the best cp value to prune the tree


d.tree2 <- prune(d.tree, cp = cp.best)

(5) Predict the values and compute the confusionMatrix


library(caret)
predictions <- predict(d.tree2, newdata = data.test, type = "class") #rpart uses type = "class" for classification predictions
confusionMatrix(predictions, as.factor(data.test$flag))

VI. Random Forests (Bagging)

Create the random forest


(1) Random forest that creates 50 trees, 3 variables per tree, and a minimum of 100 observations per terminal node
library(randomForest)
set.seed(1000)
model.rf <- randomForest(target ~ ., data = data.train, ntree = 50, mtry = 3, nodesize = 100,
sampsize = floor(0.6 * nrow(data.train)), importance = TRUE)

(2) Random forest that uses under-sampling with caret; use sampling = "up" for over-sampling
library(caret)
set.seed(1000)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, sampling = "down")
tune_grid <- expand.grid(mtry = c(15:25)) #number of features
model.rf <- train(target ~ ., data = data.train, method = "rf", ntree = 50, importance = TRUE,
trControl = ctrl, tuneGrid = tune_grid)

Evaluate the results of the random forest


(1) Calculate the area under the curve (AUC)
library(pROC)
predictions <- predict(model.rf, newdata = data.test, type = "raw")
auc(as.numeric(data.test$target), as.numeric(predictions))
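
As a hedged complement to the AUC value, the full ROC curve can be plotted with pROC; class-probability predictions from the caret model and a binary target are assumed:
pred.prob <- predict(model.rf, newdata = data.test, type = "prob")[ , 2] #probability of the second class level
roc.obj <- roc(data.test$target, pred.prob)
plot(roc.obj)
auc(roc.obj)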

(2) View the top 10 important features with varImp from caret
library(caret)
rf.importance <- varImp(model.rf)
plot(rf.importance, top = 10)

(3) View partial dependence plots


library(pdp)
partial(model.rf, train = data.train, pred.var = "x", plot = TRUE, rug = TRUE, smooth = TRUE) #pred.var should be a predictor of interest (here "x" is a placeholder), not the target
(4) Compute accuracy with confusionMatrix
confusionMatrix(as.factor(predictions), as.factor(data.test$target))

#Random forests generally reduce variance without affecting the bias. However, they are less interpretable.

VII. Report Structure

(1) Executive Summary
• Give background information
• List the most important features relating to the targets

(2) Data Exploration, Preparation, and Cleaning
• Explain the target variables and whether they require classification or regression
• Mention that we need to consult the legal department to decide if we can implement all the data into our models
• Convey NA values, negatives, or other values that are anomalies
• General explanation of the original dataset including number of rows, columns, and general characteristics
• Display the distribution of the target variables and important features
• Give the dimensions of the dataset after preparing the data

(3) Feature Selection
• Explain any variables that were created and why
• Mention how Principal Component Analysis and/or Clustering affected which features were selected

(4) Model Selection and Validation
• Explain how the training and testing sets were determined
• Demonstrate which parameters were modified and why
• Display a visual of the model as well as the calculations resulting from the model
• Visuals can include the coefficient table for the GLM, residual plots, Q-Q plot, decision tree, a confusionMatrix, etc.

(5) Findings
• Compare and contrast the results of the models
• Mention how decision trees and random forests may predict better but are less interpretable
• List the most important features and which direction they move the target
• Give ideas for future research and analysis

(6) Appendices
• Display a summary of the dataset, using summary(dataset)
• Show any relevant graphs

