Basic Terms
class(𝑥) returns the class of the object 𝑥
summary(𝑥) provides a summary of the object 𝑥
table(𝑥) provides a frequency count of each value in 𝑥
table(𝑥, 𝑦) provides a two-dimensional frequency table for 𝑥 and 𝑦
levels(𝑥) displays all of the levels of the factor 𝑥
head(𝑥) returns the first rows of a data.frame (six by default)
remove(𝑥) or rm(𝑥) removes the declared variable 𝑥 from memory
# is used at the beginning of a line for comments
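A minimal sketch of these basic functions on a small invented data.frame (the column names and values here are hypothetical):

```r
# Toy data.frame; the columns are invented for illustration
df <- data.frame(gender = factor(c("male", "female", "male")),
                 age    = c(34, 51, 42))

class(df)                      # "data.frame"
summary(df$age)                # min, quartiles, mean, max
table(df$gender)               # female: 1, male: 2
table(df$gender, df$age > 40)  # two-dimensional frequency table
levels(df$gender)              # "female" "male"
head(df, 2)                    # first two rows
rm(df)                         # remove df from memory
```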
Removing variables
(1) Remove the columns “income” and “face”
data.new <- data.old[ , setdiff(colnames(data.old), c("income", "face"))]
(3) Select only the columns “gender” and “age” and the rows where the gender is male and the age is at least 60
data.new <- data.old[data.old$gender == "male" & data.old$age >= 60, c("gender", "age")]
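Both subsetting patterns can be checked on a small invented data.frame (the column names mirror the examples above; the values are made up):

```r
data.old <- data.frame(gender = c("male", "female", "male"),
                       age    = c(65, 70, 40),
                       income = c(50, 60, 55),
                       face   = c(1, 2, 3))

# Drop the columns "income" and "face"
kept <- data.old[ , setdiff(colnames(data.old), c("income", "face"))]
colnames(kept)   # "gender" "age"

# Keep only "gender" and "age" for males aged 60 or over
males60 <- data.old[data.old$gender == "male" & data.old$age >= 60,
                    c("gender", "age")]
nrow(males60)    # 1
```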
Binarize a factor column into multiple columns based on each possible factor
(1) Binarize the column “continent” using predict and dummyVars
library(caret)
data.binarized <- predict(dummyVars( ~ continent, data.new, sep = "_"), data.new)
data.new <- cbind(data.new, data.binarized)
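If caret is not available, base R's model.matrix can build the same one-hot columns (a sketch; it assumes "continent" is a factor):

```r
df <- data.frame(continent = factor(c("Asia", "Europe", "Asia")))

# "- 1" drops the intercept so every factor level gets its own column
binarized <- model.matrix(~ continent - 1, df)
colnames(binarized)   # "continentAsia" "continentEurope"

df <- cbind(df, binarized)
```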
Page 1 of 6
A. Reed 11/26/2018
Display combinations of scatterplots for multiple columns
(1) pairs(data.new[ , c(2, 4, 7)])
(2) Scatterplot of age and income, where the color varies by continent, the x-axis is restricted to [40, 80], and the y-axis is logarithmic
library(ggplot2)
p2 <- ggplot(data.new, aes(x = age, y = income, color = continent)) + geom_point() +
  coord_cartesian(xlim = c(40, 80)) + scale_y_log10()
(4) Divide graph p1 into multiple graphs, based on continent, using facet_wrap
p4 <- p1 + facet_wrap(~continent)
(6) View a bar graph for every variable in the dataset against the target variable “flag”
for (i in c(1:ncol(data.new))) {
  plot <- ggplot(data = data.new, aes(x = data.new[ , i], fill = flag)) +
    geom_bar(position = "fill") +
    labs(x = colnames(data.new)[i])
  print(plot)
}
(2) Display the standard deviation and proportion of variance for each principal component
summary(pca)
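For context, a sketch of producing the pca object with base R's prcomp, here on the built-in USArrests data (the dataset is illustrative; scaling the variables first is usually advisable):

```r
# PCA on a built-in dataset; scale. = TRUE standardizes each variable
pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)   # standard deviation and proportion of variance per component
```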
(2) Add a column that shows which cluster each observation belongs in
df$group <- as.factor(km3$cluster)
(3) Analyze the best number of clusters k with the elbow method: first, compute kmeans for several
values of k; then compute the ratio below for each fit and stop increasing k once the added
explained variance is no longer substantial.
km3$betweenss / km3$totss
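The elbow loop might be sketched as follows, again on the built-in USArrests data (the dataset and the range 1:6 are illustrative choices):

```r
set.seed(1000)
df <- scale(USArrests)   # standardize before clustering

# Between-cluster share of total variance for k = 1..6
ratios <- sapply(1:6, function(k) {
  km <- kmeans(df, centers = k, nstart = 25)
  km$betweenss / km$totss
})
round(ratios, 2)   # choose the k where the gains flatten out
```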
(2) Use Ridge Regression. Alpha = 0 implies Ridge Regression; alpha = 1 implies Lasso where coefficients can
become 0. Try various values for alpha and lambda.
library(glmnet)
mm <- model.matrix(target ~ x + y + z, data = data.new)
model.ridge <- glmnet(mm, y = data.new$target, family = "gaussian", alpha = 0, lambda = 0.1)
(3) Use cv.glmnet to identify the best lambda, then incorporate this value
library(glmnet)
mm <- model.matrix(target ~ x + y + z, data = data.train)
model.cv <- cv.glmnet(x = mm, y = data.train$target, family = "gaussian", alpha = 1)
model.best <- glmnet(x = mm, y = data.train$target, family = "gaussian",
                     lambda = model.cv$lambda.min, alpha = 1)
model.best$beta
(3) Remove unnecessary variables using stepAIC from the MASS library
library(MASS)
stepAIC(model, direction = “backward”)
(4) Predict the new values; perform this on both the training and testing sets
predictions <- predict(model, newdata = data.new, type = "response")
data.predict <- cbind(data.new, predictions)
(5) Calculate the sum of squared errors; choose the model with the lowest SSE
sse <- sum((data.predict$target - data.predict$predictions)^2)
(6) View the confusionMatrix; choose the model with the highest accuracy
library(caret)
confusionMatrix(as.factor(predictions), as.factor(data.test$target))
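If caret is unavailable, a plain confusion table and accuracy can be computed in base R (the predictions and actuals below are invented for illustration):

```r
predictions <- factor(c("yes", "no", "yes", "yes"))
actuals     <- factor(c("yes", "no", "no",  "yes"))

cm <- table(predicted = predictions, actual = actuals)
accuracy <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
accuracy   # 0.75
```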
(2) Decision tree with a minimum of 5 observations per leaf, a complexity parameter of 0.001, a maximum depth of 7,
and splits chosen to reduce the Gini impurity measure
library(rpart)
set.seed(1000)
d.tree <- rpart(target ~ ., data = data.train, control = rpart.control(minbucket = 5, cp = 0.001,
                maxdepth = 7), parms = list(split = "gini"))
(3) Regression decision tree that uses anova for assessment and a cp value of 0 for maximum complexity
library(rpart)
set.seed(1000)
d.tree <- rpart(target ~ ., data = data.train, method = "anova", control = rpart.control(minbucket = 10,
                cp = 0, maxdepth = 10))
View results and prune the tree
(1) Use rpart.plot to view the decision tree
library(rpart.plot)
rpart.plot(d.tree)
(2) Use printcp and plotcp to view results regarding various cp values used in the decision tree
printcp(d.tree)
plotcp(d.tree)
(3) Can also use the following method to identify the best cp value
cp.best <- d.tree$cptable[which.min(d.tree$cptable[ , “xerror”]), “CP”]
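That best cp value can then be fed to prune. A self-contained sketch using rpart's bundled kyphosis data (the dataset is illustrative; with your own tree, reuse the d.tree and cp.best from above):

```r
library(rpart)

set.seed(1000)   # the cross-validated xerror depends on the random folds
d.tree  <- rpart(Kyphosis ~ ., data = kyphosis, control = rpart.control(cp = 0))
cp.best <- d.tree$cptable[which.min(d.tree$cptable[ , "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error
d.tree.pruned <- prune(d.tree, cp = cp.best)
```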
(2) Random forest that uses under-sampling with caret; use sampling = "up" for over-sampling
library(caret)
set.seed(1000)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, sampling = "down")
tune_grid <- expand.grid(mtry = c(15:25))  # number of features
model.rf <- train(target ~ ., data = data.train, method = "rf", ntree = 50, importance = TRUE,
                  trControl = ctrl, tuneGrid = tune_grid)
(2) View the top 10 important features with varImp from caret
library(caret)
rf.importance <- varImp(model.rf)
plot(rf.importance, top = 10)
# Random forests generally reduce variance without affecting the bias. However, they are less interpretable.
(5) Findings
Compare and contrast the results of the models
Mention how decision trees and random forests may predict better but are less interpretable
List the most important features and which direction they move the target
Give ideas for future research and analysis
(6) Appendices
Display a summary of the dataset, using summary(dataset)
Show any relevant graphs