
Visual Enhancement of Clustering Analysis - Unsupervised Machine Learning



1 Required package
2 Data preparation
3 Enhanced distance matrix computation and visualization
4 Enhanced clustering analysis
4.1 eclust() function
4.2 Examples
5 Infos

Clustering analysis is used to find groups of similar objects in a dataset. There are two main categories of clustering:
Hierarchical clustering: agglomerative (hclust and agnes) and divisive (diana) methods, which construct a hierarchy of clusters.
Partitioning clustering: kmeans, pam, clara and fanny, which require the user to specify the number of clusters to be generated.
These clustering methods can be computed using the R packages stats (for kmeans) and cluster (for pam, clara and fanny), but the workflow requires multiple steps and multiple lines of R code (a sketch of the standard workflow is shown below).
In this chapter, we provide some easy-to-use functions for enhancing the workflow of clustering analyses, together with ggplot2-based methods for visualizing the results.
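For comparison, here is a minimal sketch of the standard multi-step workflow using only the stats and cluster packages (the number of clusters, k = 4, is assumed here for illustration; eclust(), introduced later, estimates it automatically):

# Standard workflow: scale the data, cluster, then assess the silhouette by hand
library(cluster)
df <- scale(USArrests)                          # standardize the variables
km.res <- kmeans(df, centers = 4, nstart = 25)  # partitioning step
d <- dist(df, method = "euclidean")             # distance matrix for the silhouette
sil <- silhouette(km.res$cluster, d)            # silhouette width of each observation
summary(sil)                                    # average silhouette width by cluster
plot(sil)                                       # base-graphics silhouette plot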

1 Required package
The following R packages are required in this chapter:
factoextra for enhanced clustering analyses and data visualization
cluster for computing the standard PAM, CLARA, FANNY, AGNES and DIANA clustering
1. factoextra can be installed from GitHub as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
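Alternatively, if a CRAN release of factoextra is available for your version of R, it can be installed the usual way (an assumption; the GitHub version above is the one used in this chapter):

install.packages("factoextra")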

2. Install cluster:

install.packages("cluster")

3. Load required packages:


library(factoextra)
library(cluster)

2 Data preparation

The built-in R dataset USArrests is used:

# Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
head(df)

##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
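As a quick sanity check, the standardized variables should each have a mean of (approximately) zero and a standard deviation of one; a minimal sketch:

# Verify the standardization: column means ~ 0 and standard deviations = 1
round(colMeans(df), 10)
apply(df, 2, sd)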

3 Enhanced distance matrix computation and visualization


This section describes two functions:
1. get_dist() [in factoextra]: for computing a distance matrix between the rows of a data matrix. Compared to the standard dist() function, it supports correlation-based distance measures, including the pearson, kendall and spearman methods.
2. fviz_dist(): for visualizing a distance matrix.

# Correlation-based distance method
res.dist <- get_dist(df, method = "pearson")
head(round(as.matrix(res.dist), 2))[, 1:6]

##            Alabama Alaska Arizona Arkansas California Colorado
## Alabama       0.00   0.71    1.45     0.09       1.87     1.69
## Alaska        0.71   0.00    0.83     0.37       0.81     0.52
## Arizona       1.45   0.83    0.00     1.18       0.29     0.60
## Arkansas      0.09   0.37    1.18     0.00       1.59     1.37
## California    1.87   0.81    0.29     1.59       0.00     0.11
## Colorado      1.69   0.52    0.60     1.37       0.11     0.00

# Visualize the dissimilarity matrix
fviz_dist(res.dist, lab_size = 8)

The ordered dissimilarity matrix image (ODI) displays the clustering tendency of the dataset: similar objects are displayed close to one another. Red corresponds to small distances and blue indicates large distances between observations.
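The same visualization works with any of the supported metrics. For example, a rank-based (Spearman) correlation distance can be computed and displayed in exactly the same way (a short sketch reusing get_dist() and fviz_dist()):

# Spearman correlation-based distance, visualized with the same function
res.dist2 <- get_dist(df, method = "spearman")
fviz_dist(res.dist2, lab_size = 8)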

4 Enhanced clustering analysis


For instance, the standard R code for computing hierarchical clustering is as follows:

# Load and scale the dataset
data("USArrests")
df <- scale(USArrests)

# Compute the dissimilarity matrix
res.dist <- dist(df, method = "euclidean")

# Compute hierarchical clustering
res.hc <- hclust(res.dist, method = "ward.D2")

# Visualize
plot(res.hc, cex = 0.5)

In this chapter, we provide the function eclust() [in factoextra], which offers several advantages:
It simplifies the workflow of clustering analysis.
It can be used to compute hierarchical clustering and partitioning clustering with a single function call.
Compared to the standard partitioning functions (kmeans, pam, clara and fanny), which require the user to specify the number of clusters, eclust() automatically computes the gap statistic to estimate the optimal number of clusters.
For hierarchical clustering, correlation-based metrics are allowed.
It provides silhouette information for all partitioning methods and for hierarchical clustering.
It draws beautiful graphs using ggplot2.

4.1 eclust() function

eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)

x: a numeric vector, data matrix or data frame.
FUNcluster: a clustering function, including kmeans, pam, clara, fanny, hclust, agnes and diana. Abbreviation is allowed.
hc_metric: a character string specifying the metric to be used for calculating dissimilarities between observations. Allowed values are those accepted by the function dist() [including euclidean, manhattan, maximum, canberra, binary, minkowski] and correlation-based distance measures [pearson, spearman or kendall]. Used only when FUNcluster is a hierarchical clustering function such as hclust, agnes or diana.
...: other arguments to be passed to FUNcluster.

The function eclust() returns an object of class eclust containing the result of the standard function used (e.g., kmeans, pam, hclust, agnes,
diana, etc.).
It also includes the following components (a short access sketch follows the list):
cluster: the cluster assignment of observations after cutting the tree
nbclust: the number of clusters
silinfo: the silhouette information of observations
size: the size of clusters
data: a matrix containing the original or the standardized data (if stand = TRUE)
gap_stat: containing gap statistics
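These components are accessed with the usual $ syntax; a minimal sketch (the silinfo sub-element names below, widths/clus.avg.widths/avg.width, follow the convention of cluster::pam() and are an assumption about the current package version):

# Run eclust() once and inspect the main components of the result
res <- eclust(df, "kmeans", nstart = 25)
res$nbclust            # estimated number of clusters
res$size               # cluster sizes
head(res$cluster)      # cluster assignment of the first observations
res$silinfo$avg.width  # overall average silhouette width (assumed naming)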

4.2 Examples

In this section we'll show some examples of enhanced k-means clustering and hierarchical clustering. Note that the same analysis can be performed for PAM, CLARA, FANNY, AGNES and DIANA.

library("factoextra")

# Enhanced k-means clustering
res.km <- eclust(df, "kmeans", nstart = 25)

# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)

# Silhouette plot
fviz_silhouette(res.km)
##   cluster size ave.sil.width
## 1       1    8          0.39
## 2       2   16          0.34
## 3       3   13          0.37
## 4       4   13          0.27
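Observations with a negative silhouette width are probably assigned to the wrong cluster. They can be extracted from the silhouette information; a sketch, assuming the per-observation widths are stored as a data frame res.km$silinfo$widths with a sil_width column (pam-like naming):

# Extract observations with a negative silhouette width, if any
sil <- res.km$silinfo$widths
neg.sil.index <- which(sil[, "sil_width"] < 0)
sil[neg.sil.index, , drop = FALSE]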

# Optimal number of clusters using the gap statistic
res.km$nbclust

## [1] 4

# Print the result
res.km

## K-means clustering with 4 clusters of sizes 8, 16, 13, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 3 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 4  0.6950701  1.0394414  0.7226370  1.27693964
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              4              4              1              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              2              2              4              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              3              4              2              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              3              1              3              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              4              3              1              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              4              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              1              3              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              1              4              2              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 16.212213 11.952463 19.922437
##  (between_SS / total_SS = 71.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "clust_plot"   "silinfo"      "nbclust"     
## [13] "data"         "gap_stat"
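The k-means clusters can also be displayed on a scatter plot of the first two principal components; fviz_cluster() accepts the eclust() result directly, as it does for the hierarchical example below:

# Scatter plot of the k-means clusters
fviz_cluster(res.km)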

# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust")   # compute hclust
fviz_dend(res.hc, rect = TRUE)   # dendrogram

fviz_silhouette(res.hc)          # silhouette plot

##   cluster size ave.sil.width
## 1       1    7          0.40
## 2       2   12          0.26
## 3       3   18          0.38
## 4       4   13          0.35
fviz_cluster(res.hc)             # scatter plot
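Since both results store the cluster assignments in the $cluster component, the k-means and hierarchical partitions can be cross-tabulated to see how closely they agree (a minimal sketch):

# Cross-tabulate the k-means and hierarchical cluster assignments
table(kmeans = res.km$cluster, hclust = res.hc$cluster)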

It's also possible to specify the number of clusters, as follows:

eclust(df, "kmeans", k = 4)
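The same argument can be combined with the hierarchical methods and with a correlation-based metric; the exact combination below is an assumption consistent with the argument descriptions in section 4.1:

# Hierarchical clustering cut into 4 groups, using a Pearson correlation-based metric
res.hc4 <- eclust(df, "hclust", k = 4, hc_metric = "pearson")
fviz_dend(res.hc4, rect = TRUE)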

5 Infos

This analysis has been performed using R software (ver. 3.2.1)

