2/3/13 8:06 PM K-means clustering

Page 1 of 14 file:///Users/jtleek/Dropbox/Jeff/teaching/2013/coursera/week3/005kmeansClustering/index.html#1
K-means clustering
Jeffrey Leek, Assistant Professor of Biostatistics
Johns Hopkins Bloomberg School of Public Health
Can we find things that are close together?
How do we define close?
How do we group things?
How do we visualize the grouping?
How do we interpret the grouping?

How do we define close?
Most important step
  - Garbage in -> garbage out
Distance or similarity
  - Continuous - Euclidean distance
  - Continuous - correlation similarity
  - Binary - Manhattan distance
Pick a distance/similarity that makes sense for your problem
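
As a rough illustration of the three choices listed above, base R's `dist()` covers Euclidean and Manhattan distance, and `cor()` gives a correlation similarity. The toy matrices and the `1 - correlation` conversion below are assumptions made for this sketch, not part of the original slides.

```r
# Toy data, made up for illustration: three observations (rows), four features
contData <- rbind(a = c(1.0, 2.0, 3.0, 4.0),
                  b = c(2.0, 4.1, 5.9, 8.0),
                  c = c(4.0, 3.0, 2.0, 1.0))

# Euclidean distance between rows
euclid <- dist(contData, method = "euclidean")

# Correlation similarity between rows: cor() works on columns, so transpose
corSim <- cor(t(contData))
# One common (assumed) way to turn a similarity into a distance
corDist <- as.dist(1 - corSim)

# Binary data: Manhattan distance counts the positions where two rows differ
binData <- rbind(u = c(1, 0, 1, 1),
                 v = c(1, 1, 0, 1),
                 w = c(0, 1, 0, 0))
manhat <- dist(binData, method = "manhattan")

print(euclid)
print(corDist)
print(manhat)
```

Rows `a` and `b` are far apart in Euclidean terms but almost perfectly correlated, which is why the choice of distance changes which things count as "close".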

K-means clustering
A partitioning approach
  - Fix a number of clusters
  - Get "centroids" of each cluster
  - Assign things to closest centroid
  - Recalculate centroids
Requires
  - A defined distance metric
  - A number of clusters
  - An initial guess as to cluster centroids
Produces
  - Final estimate of cluster centroids
  - An assignment of each point to clusters
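
The assign/recalculate loop described above can be sketched by hand in a few lines of R. This is a minimal sketch under stated assumptions: `simpleKmeans` is a hypothetical helper written for illustration (it is not a base R function and not how `kmeans()` is implemented), it uses squared Euclidean distance, and the starting centroids are arbitrary guesses.

```r
# Simulated data: three groups of four points each
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
pts <- cbind(x, y)

# simpleKmeans is a hypothetical helper for illustration, not part of base R
simpleKmeans <- function(pts, centroids, maxIter = 10) {
  k <- nrow(centroids)
  cluster <- rep(0L, nrow(pts))
  for (iter in seq_len(maxIter)) {
    # Squared Euclidean distance from every point (rows) to every centroid (columns)
    d <- sapply(seq_len(k), function(j) {
      centerRows <- matrix(centroids[j, ], nrow(pts), ncol(pts), byrow = TRUE)
      rowSums((pts - centerRows)^2)
    })
    # Assign each point to its closest centroid
    newCluster <- max.col(-d, ties.method = "first")
    # Stop once the assignments no longer change
    if (all(newCluster == cluster)) break
    cluster <- newCluster
    # Recalculate each centroid as the mean of its points
    # (a centroid with no points is left where it is)
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centroids[j, ] <- colMeans(pts[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centroids)
}

# Initial guess as to cluster centroids (arbitrary values for this sketch)
start <- rbind(c(1, 2), c(1.8, 1), c(2.5, 1.5))
fit <- simpleKmeans(pts, start)
fit$cluster   # assignment of each point to a cluster
fit$centers   # final estimate of the cluster centroids
```

The function takes exactly the "Requires" items as inputs (a distance, a number of clusters implied by the rows of `centroids`, and an initial guess) and returns the two "Produces" items. A poor initial guess can leave the loop in a grouping that is not the best one, which is one reason the starting centroids matter.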
K-means clustering - example
set.seed(1234); par(mar=c(0,0,0,0))
x <- rnorm(12,mean=rep(1:3,each=4),sd=0.2)
y <- rnorm(12,mean=rep(c(1,2,1),each=4),sd=0.2)
plot(x,y,col="blue",pch=19,cex=2)
text(x+0.05,y+0.05,labels=as.character(1:12))
K-means clustering - starting centroids
K-means clustering - assign to closest centroid
K-means clustering - recalculate centroids
K-means clustering - reassign values
K-means clustering - update centroids
kmeans()
Important parameters: x,centers,iter.max,nstart
dataFrame <- data.frame(x,y)
kmeansObj <- kmeans(dataFrame,centers=3)
names(kmeansObj)
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size"
kmeansObj$cluster
[1] 3 3 3 3 1 1 1 1 2 2 2 2
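
The call above leaves `iter.max` and `nstart` at their defaults. The sketch below sets them explicitly and also passes a matrix of starting centroids as `centers`. The specific values (10 iterations, 20 starts, and the starting matrix) are arbitrary choices for illustration.

```r
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x, y)

# nstart = 20 tries 20 random sets of starting centroids and keeps the fit with
# the lowest total within-cluster sum of squares; iter.max caps the number of
# iterations per start. Both values are arbitrary choices for this sketch.
kmeansMulti <- kmeans(dataFrame, centers = 3, iter.max = 10, nstart = 20)

kmeansMulti$centers        # one row per cluster, one column per variable
kmeansMulti$size           # number of points in each cluster
kmeansMulti$tot.withinss   # total within-cluster sum of squares

# centers can also be a matrix of starting centroids instead of a count
# (these starting values are made up for illustration)
startCenters <- rbind(c(1, 1), c(2, 2), c(3, 1))
kmeansFixed <- kmeans(dataFrame, centers = startCenters)
table(kmeansFixed$cluster)
```

When `centers` is a matrix, the cluster labels follow the row order of that matrix, which makes the labeling reproducible without relying on the random seed.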
kmeans()
par(mar=rep(0.2,4))
plot(x,y,col=kmeansObj$cluster,pch=19,cex=2)
points(kmeansObj$centers,col=1:3,pch=3,cex=3,lwd=3)
Heatmaps
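set.seed(1234)
# Shuffle the rows so the cluster structure is not visible in the raw ordering
dataMatrix <- as.matrix(dataFrame)[sample(1:12),]
kmeansObj2 <- kmeans(dataMatrix,centers=3)
par(mfrow=c(1,2),mar=rep(0.2,4))
# Left: rows in shuffled order; right: rows reordered by cluster
image(t(dataMatrix)[,nrow(dataMatrix):1],yaxt="n")
image(t(dataMatrix)[,order(kmeansObj2$cluster)],yaxt="n")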
Notes and further resources
K-means requires a number of clusters
  - Pick by eye/intuition
  - Pick by cross validation/information theory, etc.
  - Determining the number of clusters
K-means is not deterministic
  - Different # of clusters
  - Different number of iterations
Rafa's Distances and Clustering Video
Elements of statistical learning
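
Two of the notes above can be checked directly in R. The sketch below runs `kmeans()` twice to compare the resulting labelings, then plots the total within-cluster sum of squares against the number of clusters as an informal "by eye" aid. The elbow-style plot and the range `1:6` are assumptions of this sketch, one common heuristic rather than a method prescribed by these slides.

```r
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x, y)

# Non-determinism: each call draws its own random starting centroids, so the
# cluster labels (and occasionally the grouping itself) can differ between runs
runA <- kmeans(dataFrame, centers = 3)
runB <- kmeans(dataFrame, centers = 3)
table(runA$cluster, runB$cluster)   # cross-tabulate the two labelings

# Picking the number of clusters by eye: total within-cluster sum of squares
# for k = 1..6 (range chosen arbitrarily); nstart = 20 reduces the dependence
# on any single random start
ks <- 1:6
wss <- sapply(ks, function(k) {
  kmeans(dataFrame, centers = k, nstart = 20)$tot.withinss
})
plot(ks, wss, type = "b", xlab = "Number of clusters",
     ylab = "Total within-cluster sum of squares")
```

Calling `set.seed()` before `kmeans()` makes a given run reproducible; it does not remove the dependence on the starting centroids.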
