
Visual Enhancement of Clustering Analysis - Unsupervised Machine Learning



1 Required package
2 Data preparation
3 Enhanced distance matrix computation and visualization
4 Enhanced clustering analysis
4.1 eclust() function
4.2 Examples
5 Infos

Clustering analysis is used to find groups of similar objects in a dataset. There are two main categories of clustering:
Hierarchical clustering: agglomerative (hclust and agnes) and divisive (diana) methods, which construct a hierarchy of clusters.
Partitioning clustering: kmeans, pam, clara and fanny, which require the user to specify the number of clusters to be generated.
These clustering methods can be computed using the R packages stats (for kmeans) and cluster (for pam, clara and fanny), but the workflow requires multiple steps and multiple lines of R code (a sketch of the standard workflow is shown below).
In this chapter, we provide some easy-to-use functions for enhancing the workflow of clustering analyses, together with ggplot2-based methods for visualizing the results.
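For comparison, here is a minimal sketch of the standard multi-step workflow using only the stats and cluster packages (the number of clusters, k = 4, is assumed here for illustration; eclust(), introduced later, estimates it automatically):

# Standard workflow: scale the data, cluster, then assess the silhouette by hand
library(cluster)
df <- scale(USArrests)                          # standardize the variables
km.res <- kmeans(df, centers = 4, nstart = 25)  # partitioning step
d <- dist(df, method = "euclidean")             # distance matrix for the silhouette
sil <- silhouette(km.res$cluster, d)            # silhouette width of each observation
summary(sil)                                    # average silhouette width by cluster
plot(sil)                                       # base-graphics silhouette plot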

1 Required package
The following R packages are required in this chapter:
factoextra for enhanced clustering analyses and data visualization
cluster for computing the standard PAM, CLARA, FANNY, AGNES and DIANA clustering
1. factoextra can be installed from GitHub as follows:

if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
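Alternatively, if a CRAN release of factoextra is available for your version of R, it can be installed the usual way (an assumption; the GitHub version above is the one used in this chapter):

install.packages("factoextra")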

2. Install cluster:

install.packages("cluster")

3. Load required packages:


library(factoextra)
library(cluster)

2 Data preparation

The built-in R dataset USArrests is used:

# Load and scale the dataset
data("USArrests")
df <- scale(USArrests)
head(df)

##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207
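As a quick sanity check, the standardized variables should each have a mean of (approximately) zero and a standard deviation of one; a minimal sketch:

# Verify the standardization: column means ~ 0 and standard deviations = 1
round(colMeans(df), 10)
apply(df, 2, sd)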

3 Enhanced distance matrix computation and visualization


This section describes two functions:
1. get_dist() [in factoextra]: for computing a distance matrix between the rows of a data matrix. Compared to the standard dist() function, it supports correlation-based distance measures, including the pearson, kendall and spearman methods.
2. fviz_dist(): for visualizing a distance matrix.

# Correlation-based distance method
res.dist <- get_dist(df, method = "pearson")
head(round(as.matrix(res.dist), 2))[, 1:6]

##            Alabama Alaska Arizona Arkansas California Colorado
## Alabama       0.00   0.71    1.45     0.09       1.87     1.69
## Alaska        0.71   0.00    0.83     0.37       0.81     0.52
## Arizona       1.45   0.83    0.00     1.18       0.29     0.60
## Arkansas      0.09   0.37    1.18     0.00       1.59     1.37
## California    1.87   0.81    0.29     1.59       0.00     0.11
## Colorado      1.69   0.52    0.60     1.37       0.11     0.00

# Visualize the dissimilarity matrix
fviz_dist(res.dist, lab_size = 8)

The ordered dissimilarity matrix image (ODI) displays the clustering tendency of the dataset: similar objects are displayed close to one another. Red corresponds to small distances and blue indicates large distances between observations.
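The same visualization works with any of the supported metrics. For example, a rank-based (Spearman) correlation distance can be computed and displayed in exactly the same way (a short sketch reusing get_dist() and fviz_dist()):

# Spearman correlation-based distance, visualized with the same function
res.dist2 <- get_dist(df, method = "spearman")
fviz_dist(res.dist2, lab_size = 8)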

4 Enhanced clustering analysis


For instance, the standard R code for computing hierarchical clustering is as follows:

# Load and scale the dataset
data("USArrests")
df <- scale(USArrests)

# Compute the dissimilarity matrix
res.dist <- dist(df, method = "euclidean")

# Compute hierarchical clustering
res.hc <- hclust(res.dist, method = "ward.D2")

# Visualize
plot(res.hc, cex = 0.5)

In this chapter, we provide the function eclust() [in factoextra], which offers several advantages:
It simplifies the workflow of clustering analysis.
It can be used to compute hierarchical clustering and partitioning clustering with a single function call.
Compared to the standard partitioning functions (kmeans, pam, clara and fanny), which require the user to specify the number of clusters, eclust() automatically computes the gap statistic to estimate the optimal number of clusters.
For hierarchical clustering, correlation-based metrics are allowed.
It provides silhouette information for all partitioning methods and for hierarchical clustering.
It draws beautiful graphs using ggplot2.

4.1 eclust() function

eclust(x, FUNcluster = "kmeans", hc_metric = "euclidean", ...)

x: a numeric vector, data matrix or data frame.
FUNcluster: a clustering function, including kmeans, pam, clara, fanny, hclust, agnes and diana. Abbreviation is allowed.
hc_metric: a character string specifying the metric to be used for calculating dissimilarities between observations. Allowed values are those accepted by the function dist() [including euclidean, manhattan, maximum, canberra, binary, minkowski] and correlation-based distance measures [pearson, spearman or kendall]. Used only when FUNcluster is a hierarchical clustering function such as hclust, agnes or diana.
...: other arguments to be passed to FUNcluster.

The function eclust() returns an object of class eclust containing the result of the standard function used (e.g., kmeans, pam, hclust, agnes,
diana, etc.).
It also includes the following components (a short access sketch follows the list):
cluster: the cluster assignment of observations after cutting the tree
nbclust: the number of clusters
silinfo: the silhouette information of observations
size: the size of clusters
data: a matrix containing the original or the standardized data (if stand = TRUE)
gap_stat: containing gap statistics
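These components are accessed with the usual $ syntax; a minimal sketch (the silinfo sub-element names below, widths/clus.avg.widths/avg.width, follow the convention of cluster::pam() and are an assumption about the current package version):

# Run eclust() once and inspect the main components of the result
res <- eclust(df, "kmeans", nstart = 25)
res$nbclust            # estimated number of clusters
res$size               # cluster sizes
head(res$cluster)      # cluster assignment of the first observations
res$silinfo$avg.width  # overall average silhouette width (assumed naming)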

4.2 Examples

In this section we'll show some examples of enhanced k-means clustering and hierarchical clustering. Note that the same analysis can be performed for PAM, CLARA, FANNY, AGNES and DIANA.

library("factoextra")

# Enhanced k-means clustering
res.km <- eclust(df, "kmeans", nstart = 25)

# Gap statistic plot
fviz_gap_stat(res.km$gap_stat)

# Silhouette plot
fviz_silhouette(res.km)
##   cluster size ave.sil.width
## 1       1    8          0.39
## 2       2   16          0.34
## 3       3   13          0.37
## 4       4   13          0.27
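Observations with a negative silhouette width are probably assigned to the wrong cluster. They can be extracted from the silhouette information; a sketch, assuming the per-observation widths are stored as a data frame res.km$silinfo$widths with a sil_width column (pam-like naming):

# Extract observations with a negative silhouette width, if any
sil <- res.km$silinfo$widths
neg.sil.index <- which(sil[, "sil_width"] < 0)
sil[neg.sil.index, , drop = FALSE]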

# Optimal number of clusters using the gap statistic
res.km$nbclust

## [1] 4

# Print the result
res.km

## K-means clustering with 4 clusters of sizes 8, 16, 13, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 3 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 4  0.6950701  1.0394414  0.7226370  1.27693964
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              4              4              1              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              2              2              4              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              3              4              2              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              3              1              3              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              4              3              1              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              4              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              1              3              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              1              4              2              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 16.212213 11.952463 19.922437
##  (between_SS / total_SS = 71.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"    
##  [5] "tot.withinss" "betweenss"    "size"         "iter"        
##  [9] "ifault"       "clust_plot"   "silinfo"      "nbclust"     
## [13] "data"         "gap_stat"
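The k-means clusters can also be displayed on a scatter plot of the first two principal components; fviz_cluster() accepts the eclust() result directly, as it does for the hierarchical example below:

# Scatter plot of the k-means clusters
fviz_cluster(res.km)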

# Enhanced hierarchical clustering
res.hc <- eclust(df, "hclust")   # compute hclust
fviz_dend(res.hc, rect = TRUE)   # dendrogram

fviz_silhouette(res.hc)          # silhouette plot

##   cluster size ave.sil.width
## 1       1    7          0.40
## 2       2   12          0.26
## 3       3   18          0.38
## 4       4   13          0.35
fviz_cluster(res.hc)             # scatter plot
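Since both results store the cluster assignments in the $cluster component, the k-means and hierarchical partitions can be cross-tabulated to see how closely they agree (a minimal sketch):

# Cross-tabulate the k-means and hierarchical cluster assignments
table(kmeans = res.km$cluster, hclust = res.hc$cluster)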

It's also possible to specify the number of clusters, as follows:

eclust(df, "kmeans", k = 4)
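The same argument can be combined with the hierarchical methods and with a correlation-based metric; the exact combination below is an assumption consistent with the argument descriptions in section 4.1:

# Hierarchical clustering cut into 4 groups, using a Pearson correlation-based metric
res.hc4 <- eclust(df, "hclust", k = 4, hc_metric = "pearson")
fviz_dend(res.hc4, rect = TRUE)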

5 Infos

This analysis has been performed using R software (ver. 3.2.1)

