
Unit 8: Unsupervised Learning and Clustering

Topics

Unsupervised Learning
Clustering
K-means clustering algorithm
Hierarchical clustering
Dendrograms

Learning Objectives

Be able to describe how unsupervised learning differs from supervised learning
Understand and be able to implement a basic K-means clustering algorithm
Be able to describe a dendrogram and its role in hierarchical clustering
Understand and be able to describe the process of hierarchical clustering

Reading Assignments
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York, NY: Springer. Read Chapter 10, available at http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
Pham, D. T., Dimov, S. S., & Nguyen, C. D. (2005). Selection of K in K-means clustering. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 219(1), 103-119. Available from http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf

Overview
If you glance at the following picture, what do you notice? You will probably recognize a field full of flowers. What you might also have recognized is that there is a pattern to the flowers: different species of flowers are growing in the field, and these species are grouped together.

Although we don't know what the groups mean or why the flowers are growing in this particular pattern, we can identify the patterns and we can isolate which group each type of flower belongs to.

Once we have identified that there is in fact a pattern, and that within the pattern there are groupings of flowers, we can begin to explore the characteristics of the flowers in each group and speculate about or discover what the patterns mean.
Clustering is a machine learning and data mining technique designed to do the same thing: find patterns or groupings in data.
Clustering is a form of unsupervised learning. Unsupervised learning means that we use some technique to find patterns without knowing in advance what the patterns are or what they mean. The opposite of unsupervised learning is supervised learning, such as classification. In supervised learning we know what groups (or classes) exist, and the objective is to use the data to learn how to assign a particular instance to one of the known groups.i
When we know in advance which groups within the data we are interested in, this is referred to as bias: even if other groupings or patterns exist in the data, we ignore them because we are only interested in the groups that we have identified.
Unsupervised learning lacks this bias. When we begin to cluster data we don't know what groups exist, or even whether any groups exist at all. The objective is not to place a particular instance into a particular group, but rather to gain an understanding of the data and see whether any relevant groups exist in it at all.
In clustering we can clearly see the concept of data mining at work: we start with some collection of data and dig through it to find something of value, which in this case is a pattern in the data.ii
Why Clustering?
You might be wondering why we would want to cluster data: what is the value of identifying groups if we don't know what they mean? This is a fair question. Think of clustering as one step in a process of knowledge discovery. If we use the mining analogy, miners don't typically dig up the final product. If you have ever watched one of the reality TV shows that follow gold miners, such as Gold Rush or Bering Sea Gold, you will see that the miners are typically attempting to find gold-bearing ore. They don't just dig up the gold; they must dig up the ore that contains the gold and then attempt to filter the gold out from all of the other material they have extracted. To get to pure gold there are additional processes such as separating and smelting.
What we can take from these programs is that to find gold in our mountains of data, we must first identify the ore that contains the gold, and once we have found and extracted the ore, we can refine it to find something that is really valuable.
Clustering in this process is much like finding the gold-bearing rock and extracting it. We do this by looking for some kind of pattern. We don't know exactly what the pattern means, and in some cases the pattern might mean nothing; just like the miners in these reality TV shows, not all ore yields a jackpot of gold. However, to find the gold (or, in the case of data mining, the knowledge that will generate business value) we need to identify the gold-bearing ore and extract it from the mountain of other data, and that is a primary outcome of clustering.
A few examples of how clustering works to mine the ore and find gold might be of value.
Consider marketing. The role of marketing within an organization is to understand customer requirements or needs, engage with product development to produce products or services that meet those needs, and then communicate with consumers, targeting those for whom the product or service best meets their needs. Of course, targeting consumers can be difficult. Companies spend billions of dollars in an effort to get their product message to the correct consumers. The revenues of a number of industries, such as television, newspapers, magazines, billboards, radio, and a plethora of internet providers such as Google, Yahoo, and Facebook, are entirely based upon their ability to accomplish this goal.iii
Any information that helps a company get its message to the correct consumers is extremely valuable. Clustering provides the information necessary to effectively target the right group. We can see an example of this in use by Netflix or Amazon. If you log into your Amazon.com account right now you will see "Recommendations for you" prominently displayed on the front page of the Amazon.com web site. Amazon is using clustering technology to group you together with other customers who are like you. They may use any number of attributes to accomplish this, including your demographic information and purchase history. Once they have identified the cluster or group of individuals, they can take this ore they have extracted and refine it into gold. In the case of Amazon, they surmise that you are likely to be interested in the same kinds of products, books, videos, etc. that other people in your cluster are interested in. Amazon has demonstrated that the ability to target customers using these techniques results in significant sales increases.iv
The Amazon example demonstrates how a combination of techniques, including clustering to identify groups and other methods to refine the data, can discover significant value in the data. The approach used by Amazon to recommend products is part of what are called recommender systems. Recommender systems mine data to recommend movies for users of Netflix, and to recommend articles for users of news services such as Bloomberg, Fortune, CNN, and providers such as Zite.v vi vii
Although the applications of clustering in marketing are exciting and a substantial source of business
value, they do not represent the only areas where this technology can be applied.
In the insurance industry, policy holders for home, car, boat, personal goods, and other insured items can be clustered together based upon a wide variety of attributes to determine the level of risk for a particular group, based upon the assumption that a group of similar individuals will have similar lifestyles and similar levels of risk.
City planners, civil engineers and even real estate agents can identify neighborhoods and effectively plan
zoning based upon the characteristics of homes and occupants within their area of jurisdiction.
Psychiatrists and psychologists can identify and then categorize patients with mental disorders based upon their behaviors and characteristics, leading to better definition or refinement of diagnostic categories, better diagnosis of a patient's condition, and the potential to predict which patients are at risk for behaviors that may threaten themselves or others.
Meteorologists use clustering techniques to better understand the dynamics of weather and to aid in
providing better, more accurate, and longer term forecasts.
Archeologists leverage clustering techniques to determine what groups of people may have been
associated with certain discovered artifacts. This will aid in their research because they can better
understand the groups that existed and the events which may have prompted change within the groups
over time.
Sociology is the study of people in groups, so it should be obvious that clustering techniques are of significant value to sociologists.viii ix x xi

Now that we have established clustering as an important and valuable data mining and machine learning capability, it is important that we provide a high-level understanding of how clustering works. There are a number of different algorithms and techniques employed to generate clusters; however, we will focus on walking through and understanding two of them. The first makes use of the K-means clustering algorithm, and the second is known as hierarchical clustering.
The K-Means Clustering Algorithm
K-means clustering is a data mining/machine learning algorithm used to cluster observations into groups without any prior knowledge of those groups. K-means clustering is a form of unsupervised learning because it identifies clusters in the data without any prior knowledge of what those clusters are, or even whether clusters exist at all.xii xiii
The k-means algorithm is one of the simplest clustering techniques because it operates very much like the algorithm we used in linear regression. Recall that in linear regression we attempted to draw a line through data points plotted on a graph such that the distance of the points from the line was minimized. Every point that did not fall exactly on the line was considered an error, and the magnitude of the error was determined by the distance of the point from the line. The regression algorithm simply attempted to find the slope and position of the line that minimized the sum of the squared errors.
The k-means algorithm works in much the same way, but instead of minimizing the distance of points from a line, it attempts to minimize the distance of points from some number of points that are placed on the graph and represent the centers of the clusters.
To help this make sense, consider the following figure. We see a large number of points plotted on the chart. In this example the points are colored to highlight them, but the algorithm would simply see them as a collection of points plotted on a graph.

Now imagine that we were to randomly pick three points on this graph. The choice of the number of points to pick is based upon the number of clusters you are trying to discover: the K in K-means refers to the number of random points selected to place on the graph. In the next section we will learn a little about how to select the value of K, but for now we will assume that K is equal to 3.

The 3 points are simply placed randomly somewhere on the graph. Where they start is not really important; as we will see, the algorithm moves them into the appropriate locations as part of its processing.
In the previous figure we can see the three randomly picked points, represented by + signs and labeled K1, K2, and K3.
Once this has occurred, assign each point on the graph to the + that it is nearest to, where "nearest" is measured by Euclidean distance on the two-dimensional graph. When all of the points have been assigned to one of the K points, re-estimate the position of each K point by taking the average of all of the points assigned to it.
Assume, for example, that in the example represented in the next figure, 40 points were assigned to the first randomly selected + point, which we will refer to as K1. If we sum all of the X values of those points and divide by 40 we get an average value for X, and if we sum all of the Y values of those points and divide by 40 we get an average value for Y.
This same procedure is repeated for the second and third randomly selected points (K2 and K3). After this process is completed, each of the points (K1, K2, and K3) will have a new position, and the process is repeated by allocating each of the points in the graph to the closest of K1, K2, and K3. The new average X and Y values for each point are computed and the points are moved to their new locations.
This process is repeated until the mean locations of K1, K2, and K3 no longer change. It should be clear at this point why this algorithm is referred to as the K-means algorithm: it defines clusters by iteratively computing the mean positions of K points, to each of which the closest data points are allocated. Each of the K points, together with all of the data points assigned to it during the last iteration of the algorithm, represents one of the clusters identified by the process.
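To make the walkthrough concrete, here is a minimal sketch of the algorithm in Python using NumPy. The data set, the function name kmeans, and the initialization choice are illustrative assumptions rather than part of the reading.

import numpy as np

def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster points into k groups by iteratively re-estimating the k means."""
    rng = np.random.default_rng(seed)
    # Place the k center points (K1 ... Kk). The text describes placing them
    # randomly anywhere on the graph; here we pick k of the data points as
    # starting centers, a common variant that makes empty clusters unlikely.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iterations):
        # Assign every point to the nearest center (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Re-estimate each center as the average X and Y of its assigned points.
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the mean locations no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Illustrative data: 300 two-dimensional points drawn around three centers.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
                  for c in [(0, 0), (4, 4), (0, 4)]])
centers, labels = kmeans(data, k=3)
print(centers)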

Choosing the value of K


Choosing the value of K can be both an art and a science. There are many factors that can, and often should, be included in the decision to set K to any particular value. If the value of K is set too low there will be fewer groups, and some key information that might exist in the data will not be captured. On the other hand, if the value of K is set too high, it could result in groupings of instances that lack relevance, because there are not sufficient differences between the instances to classify them separately.
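One common heuristic, not named above but in the spirit of the Pham et al. reading, is the "elbow" method: run K-means for several values of K and watch how the total within-cluster distance falls as K grows. A minimal sketch, reusing the kmeans function and data from the previous example:

def within_cluster_sum(points, centers, labels):
    """Total squared distance of every point from its assigned center."""
    return sum(((points[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centers))

# Run K-means for K = 1..8; the K where this total stops dropping
# sharply (the "elbow" of the curve) is a reasonable choice.
for k in range(1, 9):
    centers, labels = kmeans(data, k)
    print(k, round(within_cluster_sum(data, centers, labels), 1))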
Hierarchical Clustering
Hierarchical clustering is a slightly different method of clustering that builds from the opposite direction to the K-means algorithm we have just discussed. The K-means algorithm essentially builds its clusters downward: you start with the number of groups, which of course is given by K, and then cluster the instances of the training data around the K groups using some form of normalized distance measure. There are a variety of distance measures that could be used, including the most common, Euclidean distance, as well as other measures such as the cosine of the angle between vectors and correlation coefficients.
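As a small illustration, the three distance measures just mentioned can each be written in a few lines of Python; the function names are our own:

import numpy as np

def euclidean(a, b):
    # Straight-line distance between two vectors.
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    # One minus the cosine of the angle between the two vectors.
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def correlation_distance(a, b):
    # One minus the Pearson correlation coefficient of the two vectors.
    return 1 - np.corrcoef(a, b)[0, 1]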

What is important to consider is that the clusters build downward as the algorithm begins with K cluster groups and then iteratively assigns training instances to those clusters. Hierarchical clustering takes the exact opposite approach. In hierarchical clustering an instance of the data is selected and its closest neighbor is found. This process of finding the closest pairs of data instances, and then the closest points to those pairs, and so on, eventually forms a tree structure known as a dendrogram, as we can see in the preceding figure.
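A minimal sketch of this bottom-up pairing process, assuming SciPy's scipy.cluster.hierarchy module and an illustrative data set; linkage builds the tree and dendrogram draws it:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative data: 20 two-dimensional points around two loose groups.
rng = np.random.default_rng(0)
samples = np.vstack([rng.normal(loc=c, scale=0.5, size=(10, 2))
                     for c in [(0, 0), (4, 4)]])

# linkage() repeatedly merges the closest pair of points (and then the
# closest pairs of merged groups), building the tree from the bottom up.
Z = linkage(samples, method="average", metric="euclidean")

dendrogram(Z)   # draw the resulting tree structure
plt.show()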

In hierarchical clustering one must determine where the cut point, outlined by the dashed line in the preceding figure, should lie. The cut line identifies the clusters by selecting the point in the tree structure (dendrogram) where a cut should be made. Each new tree that remains after the cut has been made represents one of the clusters.
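Continuing the SciPy sketch above, choosing a cut point corresponds to supplying a distance threshold to fcluster; the threshold of 2.0 here is an arbitrary illustration:

from scipy.cluster.hierarchy import fcluster

# Cutting the dendrogram at height 2.0: every subtree that hangs below
# the cut line becomes one cluster, and each point gets a cluster label.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)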

Discussion Question
Conduct research within the University of the People Library and on the internet on
hierarchical clustering, and in your posting describe how you could implement an
algorithm to cluster a data set using hierarchical clustering. Your description should
address the role of dendrograms in hierarchical clustering.

i Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster Analysis (5th ed.). West
Sussex, United Kingdom: John Wiley and Sons.
ii Mirkin, B. (2005). Clustering for Data Mining: A Data Recovery Approach. Boca Raton,
FL: Chapman & Hall.
iii Punj, G., & Stewart, D. W. (1983). Cluster analysis in marketing research: Review
and suggestions for application. Journal of Marketing Research, 134-148.
iv Linden, G., Smith, B., & York, J. (2003). Amazon.com recommendations: Item-to-item
collaborative filtering. Internet Computing, IEEE, 7(1), 76-80.
v Liu, J., Wang, Q., Fang, K., & Mi, Q. (2007, April). An optimized collaborative filtering
approach combining with item-based prediction. In Computer Supported Cooperative
Work in Design, 2007. CSCWD 2007. 11th International Conference on (pp. 157-161).
IEEE.
vi Breese, J. S., Heckerman, D., & Kadie, C. (1998, July). Empirical analysis of predictive
algorithms for collaborative filtering. In Proceedings of the Fourteenth conference on
Uncertainty in artificial intelligence (pp. 43-52). Morgan Kaufmann Publishers Inc.
vii Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2000, October). Analysis of
recommendation algorithms for e-commerce. In Proceedings of the 2nd ACM conference
on Electronic commerce (pp. 158-167). ACM.
viii Morey, L. C., Blashfield, R. K., & Skinner, H. A. (1983). A Comparison of Cluster
Analysis Techniques Within a Sequential Validation Framework. Multivariate Behavioral
Research, 18(3), 309-329.
ix McKennell, A. (1970). Attitude measurement: use of coefficient alpha with cluster or
factor analysis. Sociology, 4(2), 227-245.
x Ball, G. H., & Hall, D. J. (1967). A clustering technique for summarizing multivariate
data. Behavioral Science, 12(2), 153-155.
xi Bailey, K. D. (1983). Sociological classification and cluster analysis. Quality & Quantity,
17(4), 251-268.
xii Richert, W., & Coelho, L. (2013). Building Machine Learning Systems with Python.
Birmingham, UK: Packt Publishing Ltd.
xiii Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning
Tools and Techniques (3rd ed.). Burlington, MA: Morgan Kaufmann Publishers.
