
A Tutorial on Clustering Algorithms

Introduction | K-means | Fuzzy C-means | Hierarchical | Mixture of Gaussians | Links

Clustering: An Introduction
What is Clustering?
Cluster analysis or clustering is the task of grouping a set of objects in such a way that
objects in the same group (called a cluster) are more similar (in some sense or another) to
each other than to those in other groups (clusters). It is a main task of exploratory data
mining, and a common technique for statistical data analysis, used in many fields, including
machine learning, pattern recognition, image analysis, information retrieval, bioinformatics,
data compression, and computer graphics.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be
achieved by various algorithms that differ significantly in their notion of what constitutes a
cluster and how to efficiently find them. Popular notions of clusters include groups with
small distances among the cluster members, dense areas of the data space, intervals or
particular statistical distributions. Clustering can therefore be formulated as a multi-objective
optimization problem. The appropriate clustering algorithm and parameter settings (including
values such as the distance function to use, a density threshold or the number of expected
clusters) depend on the individual data set and intended use of the results. Cluster analysis as
such is not an automatic task, but an iterative process of knowledge discovery or interactive
multi-objective optimization that involves trial and failure. It is often necessary to modify
data preprocessing and model parameters until the result achieves the desired properties.
Besides the term clustering, there are a number of terms with similar meanings, including
automatic classification, numerical taxonomy, botryology (from Greek "grape") and
typological analysis. The subtle differences are often in the usage of the results: while in data
mining, the resulting groups are the matter of interest, in automatic classification the resulting
discriminative power is of interest. This often leads to misunderstandings between
researchers coming from the fields of data mining and machine learning, since they
use the same terms and often the same algorithms, but have different goals.
Clustering can be considered the most important unsupervised learning problem; like every
other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
A loose definition of clustering could be the process of organizing objects into groups whose
members are similar in some way.

A cluster is therefore a collection of objects which are similar to one another and are
dissimilar to the objects belonging to other clusters.
We can show this with a simple graphical example:

In this case we easily identify the 4 clusters into which the data can be divided; the similarity
criterion is distance: two or more objects belong to the same cluster if they are close
according to a given distance (in this case geometrical distance). This is called distance-based clustering.
Another kind of clustering is conceptual clustering: two or more objects belong to the same
cluster if it defines a concept common to all those objects. In other words, objects are
grouped according to their fit to descriptive concepts, not according to simple similarity
measures.

The Goals of Clustering


So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But
how do we decide what constitutes a good clustering? It can be shown that there is no absolute
best criterion which would be independent of the final aim of the clustering. Consequently,
it is the user who must supply this criterion, in such a way that the result of the clustering
will suit their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data
reduction), in finding natural clusters and describing their unknown properties (natural
data types), in finding useful and suitable groupings (useful data classes) or in finding
unusual data objects (outlier detection).

Possible Applications
Clustering algorithms can be applied in many fields, for instance:

Marketing: finding groups of customers with similar behavior given a large database
of customer data containing their properties and past buying records;

Biology: classification of plants and animals given their features;

Libraries: book ordering;

Insurance: identifying groups of motor insurance policy holders with a high average
claim cost; identifying frauds;

City-planning: identifying groups of houses according to their house type, value and
geographical location;

Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;

WWW: document classification; clustering weblog data to discover groups of similar access patterns.

Requirements
The main requirements that a clustering algorithm should satisfy are:

scalability;

dealing with different types of attributes;

discovering clusters with arbitrary shape;

minimal requirements for domain knowledge to determine input parameters;

ability to deal with noise and outliers;

insensitivity to order of input records;

high dimensionality;

interpretability and usability.

Problems
There are a number of problems with clustering. Among them:

current clustering techniques do not address all the requirements adequately (and
concurrently);

dealing with a large number of dimensions and a large number of data items can be
problematic because of time complexity;

the effectiveness of the method depends on the definition of distance (for distance-based clustering);

if an obvious distance measure doesn't exist we must define it, which is not always
easy, especially in multi-dimensional spaces;

the result of the clustering algorithm (that in many cases can be arbitrary itself) can be
interpreted in different ways.

Clustering Algorithms
Classification
Clustering algorithms may be classified as listed below:

Exclusive Clustering

Overlapping Clustering

Hierarchical Clustering

Probabilistic Clustering

In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a
definite cluster then it cannot be included in another cluster. A simple example of that is
shown in the figure below, where the separation of points is achieved by a straight line on a
bi-dimensional plane.
By contrast, the second type, overlapping clustering, uses fuzzy sets to cluster data, so
that each point may belong to two or more clusters with different degrees of membership. In
this case, each data point is associated with an appropriate membership value.

A hierarchical clustering algorithm, instead, is based on repeatedly merging the two nearest
clusters. The starting condition is realized by treating every datum as a cluster of its own;
after a number of iterations the desired final clusters are reached.
Finally, the last kind of clustering uses a completely probabilistic approach.
In this tutorial we propose four of the most used clustering algorithms:

K-means

Fuzzy C-means

Hierarchical clustering

Mixture of Gaussians

Each of these algorithms belongs to one of the clustering types listed above: K-means
is an exclusive clustering algorithm, Fuzzy C-means is an overlapping clustering algorithm,
Hierarchical clustering is, of course, hierarchical, and Mixture of Gaussians is a probabilistic
clustering algorithm. We will discuss each clustering method in the following sections.

Distance Measure
An important component of a clustering algorithm is the distance measure between data
points. If the components of the data instance vectors are all in the same physical units then it
is possible that the simple Euclidean distance metric is sufficient to successfully group similar
data instances. However, even in this case the Euclidean distance can sometimes be
misleading. The figure below illustrates this with an example of the width and height
measurements of an object. Despite both measurements being taken in the same physical
units, an informed decision has to be made as to the relative scaling. As the figure shows,
different scalings can lead to different clusterings.

Notice however that this is not only a graphical issue: the problem arises from the
mathematical formula used to combine the distances between the single components of the
data feature vectors into a unique distance measure that can be used for clustering purposes:
different formulas lead to different clusterings.
Again, domain knowledge must be used to guide the formulation of a suitable distance
measure for each particular application.
Minkowski Metric
For higher-dimensional data, a popular measure is the Minkowski metric,

d_p(x_i, x_j) = \left( \sum_{k=1}^{d} \left| x_{i,k} - x_{j,k} \right|^{p} \right)^{1/p},

where d is the dimensionality of the data. The Euclidean distance is the special case with p = 2,
while the Manhattan metric has p = 1. However, there are no general theoretical guidelines for
selecting a measure for any given application.
It is often the case that the components of the data feature vectors are not immediately
comparable. It can be that the components are not continuous variables, like length, but
nominal categories, such as the days of the week. In these cases again, domain knowledge
must be used to formulate an appropriate measure.
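As an illustration, here is a minimal Python sketch of the metric above; the helper name minkowski and the sample points are our own, not part of the original tutorial.

import numpy as np

def minkowski(x, y, p=2):
    """Minkowski distance between two feature vectors; p=2 is Euclidean, p=1 Manhattan."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]   # two illustrative 2-D points
print(minkowski(a, b, p=2))     # 5.0  (Euclidean)
print(minkowski(a, b, p=1))     # 7.0  (Manhattan)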


K-Means Clustering
The Algorithm
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that
solve the well-known clustering problem. The procedure follows a simple and easy way to
classify a given data set through a certain number of clusters (assume k clusters) fixed a
priori. The main idea is to define k centroids, one for each cluster. These centroids should be
placed in a cunning way, because different locations cause different results. So, the better
choice is to place them as far away from each other as possible. The next step is to take
each point belonging to the data set and associate it with the nearest centroid. When no
point is pending, the first step is completed and an initial grouping is done. At this point we
need to re-calculate k new centroids as barycenters of the clusters resulting from the previous
step. After we have these k new centroids, a new binding has to be done between the same
data set points and the nearest new centroid. A loop has been generated. As a result of this
loop we may notice that the k centroids change their location step by step until no more
changes are done. In other words, the centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error
function. The objective function is

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2,

where \left\| x_i^{(j)} - c_j \right\|^2 is a chosen distance measure between a data point x_i^{(j)} and the cluster
centre c_j; it is an indicator of the distance of the n data points from their respective cluster
centres.
The algorithm is composed of the following steps:

1. Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.
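The four steps above translate almost line by line into code. The following is a rough Python sketch; the function name kmeans, the initialisation by sampling k of the data points and the convergence test are our own illustrative choices, not prescribed by the tutorial.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Naive k-means: X is an (n, d) array of points, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the group with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the barycentre (mean) of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels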

Although it can be proved that the procedure will always terminate, the k-means algorithm
does not necessarily find the optimal configuration, corresponding to the global minimum of
the objective function. The algorithm is also significantly sensitive to the initial, randomly
selected cluster centres. The k-means algorithm can be run multiple times to reduce
this effect.
K-means is a simple algorithm that has been adapted to many problem domains. As we are
going to see, it is a good candidate for extension to work with fuzzy feature vectors.

An example
Suppose that we have n sample feature vectors x1, x2, ..., xn all from the same class, and we
know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster
i. If the clusters are well separated, we can use a minimum-distance classifier to separate
them. That is, we can say that x is in cluster i if || x - mi || is the minimum of all the k
distances. This suggests the following procedure for finding the k means:

Make initial guesses for the means m1, m2, ..., mk

Until there are no changes in any mean
    o Use the estimated means to classify the samples into clusters
    o For i from 1 to k
          Replace mi with the mean of all of the samples for cluster i
    o end_for

end_until

Here is an example showing how the means m1 and m2 move into the centers of two clusters.

Remarks
This is a simple version of the k-means procedure. It can be viewed as a greedy algorithm for
partitioning the n samples into k clusters so as to minimize the sum of the squared distances
to the cluster centers. It does have some weaknesses:

The way to initialize the means was not specified. One popular way to start is to
randomly choose k of the samples.

The results produced depend on the initial values for the means, and it frequently
happens that suboptimal partitions are found. The standard solution is to try a number
of different starting points.

It can happen that the set of samples closest to mi is empty, so that mi cannot be
updated. This is an annoyance that must be handled in an implementation, but that we
shall ignore.

The results depend on the metric used to measure || x - mi ||. A popular solution is to
normalize each variable by its standard deviation, though this is not always desirable.

The results depend on the value of k.

This last problem is particularly troublesome, since we often have no way of knowing how
many clusters exist. In the example shown above, the same algorithm applied to the same
data produces the following 3-means clustering. Is it better or worse than the 2-means
clustering?

Unfortunately there is no general theoretical solution to find the optimal number of clusters
for any given data set. A simple approach is to compare the results of multiple runs with
different values of k and choose the best one according to a given criterion (for instance the
Schwarz Criterion - see Moore's slides), but we need to be careful because increasing k
results in smaller error function values by definition, but also in an increasing risk of overfitting.
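As a rough illustration of this approach, the sketch below runs k-means for several values of k on some synthetic data and prints the resulting squared-error objective. The data, the range of k and the use of scikit-learn's KMeans (with its inertia_ attribute) are our own assumptions; the "best" k still has to be picked by a criterion such as the elbow of the curve or the Schwarz Criterion.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical 2-D data with three fairly well separated groups.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (3, 0), (0, 3)]])

# The within-cluster squared error (the objective J) always shrinks as k grows,
# so we look for the point where the improvement flattens out (the "elbow").
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))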


Fuzzy C-Means Clustering


The Algorithm
Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to
two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in
1981) is frequently used in pattern recognition. It is based on minimization of the following
objective function:

J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\| x_i - c_j \right\|^2, \qquad 1 \le m < \infty,

where m is any real number greater than 1, u_ij is the degree of membership of x_i in cluster
j, x_i is the i-th of the d-dimensional measured data, c_j is the d-dimensional centre of the cluster, and
|| * || is any norm expressing the similarity between any measured data and the centre.
Fuzzy partitioning is carried out through an iterative optimization of the objective function
shown above, with the update of the memberships u_ij and the cluster centres c_j given by:

u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\| x_i - c_j \|}{\| x_i - c_k \|} \right)^{\frac{2}{m-1}}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}.

This iteration will stop when \max_{ij} \left| u_{ij}^{(k+1)} - u_{ij}^{(k)} \right| < \varepsilon, where \varepsilon is a termination criterion
between 0 and 1 and k is the iteration step. This procedure converges to a local
minimum or a saddle point of J_m.
The algorithm is composed of the following steps:

1. Initialize the membership matrix U = [u_ij], U^(0).
2. At step k: calculate the centre vectors C^(k) = [c_j] with U^(k).
3. Update U^(k) to U^(k+1) using the membership update formula above.
4. If || U^(k+1) - U^(k) || < ε then STOP; otherwise return to step 2.
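A compact Python sketch of these four steps; the function name fuzzy_c_means, the random initialisation of U and the handling of zero distances are illustrative choices, not part of the original description.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-3, max_iter=100, seed=0):
    """Minimal FCM on an (n, d) array X: returns the (n, c) membership matrix U and the centres."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: initialise U(0) with random memberships whose rows sum to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: centres c_j as membership-weighted means of the data.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 3: update the memberships from the distances to the centres.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                       # avoid division by zero
        U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)), axis=2)
        # Step 4: stop when the memberships change by less than eps.
        if np.linalg.norm(U_new - U) < eps:
            return U_new, centers
        U = U_new
    return U, centers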

Remarks
As already said, data are bound to each cluster by means of a membership function, which
represents the fuzzy behaviour of this algorithm. To do that, we simply have to build an
appropriate matrix named U whose elements are numbers between 0 and 1 and represent the
degree of membership between data and centres of clusters.
For a better understanding, we may consider this simple mono-dimensional example. Given a
certain data set, suppose we represent it as distributed along an axis. The figure below shows this:

Looking at the picture, we may identify two clusters in proximity of the two data
concentrations. We will refer to them using A and B. In the first approach shown in this
tutorial - the k-means algorithm - we associated each datum to a specific centroid; therefore,
this membership function looked like this:

In the FCM approach, instead, the same given datum does not belong exclusively to a well
defined cluster, but it can be placed in a middle way. In this case, the membership function
follows a smoother line to indicate that every datum may belong to several clusters with
different values of the membership coefficient.

In the figure above, the datum shown as a red marked spot belongs more to the B cluster
than to the A cluster. The value 0.2 of the membership function indicates the degree of membership to A for
such a datum. Now, instead of using a graphical representation, we introduce a matrix U whose
elements are the values taken from the membership functions:

(a) the k-means membership matrix    (b) the FCM membership matrix

The number of rows and columns depends on how many data and clusters we are
considering. More precisely, we have C = 2 columns (C = 2 clusters) and N rows, where C is
the total number of clusters and N is the total number of data. The generic element is
indicated as u_ij.
In the examples above we have considered the k-means (a) and FCM (b) cases. We can notice
that in the first case (a) the coefficients are always unitary, indicating that each datum can
belong to only one cluster. In both cases the memberships also satisfy the following properties:
u_ij lies in [0, 1], the memberships of each datum sum to one (\sum_{j=1}^{C} u_{ij} = 1 for every i), and
no cluster is empty or contains all the data (0 < \sum_{i=1}^{N} u_{ij} < N for every j).

An Example
Here, we consider the simple case of a mono-dimensional application of the FCM. Twenty
data and three clusters are used to initialize the algorithm and to compute the U matrix.
Figures below (taken from our interactive demo) show the membership value for each datum
and for each cluster. The color of the data is that of the nearest cluster according to the
membership function.

In the simulation shown in the figure above we have used a fuzziness coefficient m = 2 and
we have also imposed that the algorithm terminate when \max_{ij} | u_{ij}^{(k+1)} - u_{ij}^{(k)} | < \varepsilon = 0.3. The
picture shows the initial condition, where the fuzzy distribution depends on the particular
position of the clusters. No step has been performed yet, so the clusters are not identified very well.
Now we can run the algorithm until the stop condition is verified. The figure below shows the
final condition reached at the 8th step with m = 2 and ε = 0.3:

Is it possible to do better? Certainly: we could use a higher accuracy, but we would also have
to pay for a bigger computational effort. In the next figure we can see a better result obtained
using the same initial conditions and ε = 0.01, but we needed 37 steps!

It is also important to notice that different initializations cause different evolutions of the
algorithm. In fact, it may converge to the same result, but probably with a different number of
iteration steps.


Hierarchical Clustering Algorithms


How They Work
Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic
process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now have N
clusters, each containing just one item. Let the distances (similarities) between the
clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so
that now you have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N. (*)
Step 3 can be done in different ways, which is what distinguishes single-linkage from
complete-linkage and average-linkage clustering.
In single-linkage clustering (also called the connectedness or minimum method), we consider
the distance between one cluster and another cluster to be equal to the shortest distance from
any member of one cluster to any member of the other cluster. If the data consist of

similarities, we consider the similarity between one cluster and another cluster to be equal to
the greatest similarity from any member of one cluster to any member of the other cluster.
In complete-linkage clustering (also called the diameter or maximum method), we consider
the distance between one cluster and another cluster to be equal to the greatest distance from
any member of one cluster to any member of the other cluster.
In average-linkage clustering, we consider the distance between one cluster and another
cluster to be equal to the average distance from any member of one cluster to any member of
the other cluster.
A variation on average-link clustering is the UCLUS method of R. D'Andrade (1978) which
uses the median distance, which is much more outlier-proof than the average distance.
This kind of hierarchical clustering is called agglomerative because it merges clusters
iteratively. There is also a divisive hierarchical clustering which does the reverse by starting
with all objects in one cluster and subdividing them into smaller pieces. Divisive methods are
not generally available, and rarely have been applied.
(*) Of course there is no point in having all the N items grouped in a single cluster but, once
you have got the complete hierarchical tree, if you want k clusters you just have to cut the k-1
longest links.

Single-Linkage Clustering: The Algorithm


Let's now take a deeper look at how Johnson's algorithm works in the case of single-linkage
clustering.
The algorithm is an agglomerative scheme that erases rows and columns in the proximity
matrix as old clusters are merged into new ones.
The N*N proximity matrix is D = [d(i,j)]. The clusterings are assigned sequence numbers
0,1,......, (n-1) and L(k) is the level of the kth clustering. A cluster with sequence number m is
denoted (m) and the proximity between clusters (r) and (s) is denoted d [(r),(s)].
The algorithm is composed of the following steps:

1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to L(m) = d[(r),(s)].
4. Update the proximity matrix, D, by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined in this way: d[(k),(r,s)] = min { d[(k),(r)], d[(k),(s)] }.
5. If all objects are in one cluster, stop. Else, go to step 2.
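The steps above can be turned into a short Python sketch; the function single_linkage and its return format are our own illustrative choices (it simply records the name and level of each merge).

import numpy as np

def single_linkage(D, labels):
    """Johnson's agglomerative scheme with the single-link (minimum) rule.
    D is a symmetric distance matrix, labels the item names."""
    D = np.array(D, dtype=float)
    clusters = list(labels)
    levels = []
    while len(clusters) > 1:
        # Step 2: find the least dissimilar pair (r, s), ignoring the diagonal.
        M = D + np.diag([np.inf] * len(clusters))
        r, s = np.unravel_index(np.argmin(M), M.shape)
        r, s = min(r, s), max(r, s)
        # Step 3: record the merge and its level L(m) = d[(r),(s)].
        levels.append((clusters[r] + "/" + clusters[s], M[r, s]))
        # Step 4: single-link update, d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
        merged = np.minimum(D[r], D[s])
        D[r] = merged
        D[:, r] = merged
        D[r, r] = 0.0
        D = np.delete(np.delete(D, s, axis=0), s, axis=1)
        clusters[r] = clusters[r] + "/" + clusters[s]
        del clusters[s]
    return levels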

An Example
Let's now see a simple example: a hierarchical clustering of distances in kilometers between
some Italian cities. The method used is single-linkage.
Input distance matrix (L = 0 for all the clusters):
       BA    FI    MI    NA    RM    TO
BA      0   662   877   255   412   996
FI    662     0   295   468   268   400
MI    877   295     0   754   564   138
NA    255   468   754     0   219   869
RM    412   268   564   219     0   669
TO    996   400   138   869   669     0

The nearest pair of cities is MI and TO, at distance 138. These are merged into a single
cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new
sequence number is m = 1.
Then we compute the distance from this new compound object to all other objects. In single-link
clustering the rule is that the distance from the compound object to another object is
equal to the shortest distance from any member of the cluster to the outside object. So the
distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and
so on.
After merging MI with TO we obtain the following matrix:
          BA    FI  MI/TO    NA    RM
BA         0   662    877   255   412
FI       662     0    295   468   268
MI/TO    877   295      0   754   564
NA       255   468    754     0   219
RM       412   268    564   219     0
min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m=2
          BA    FI  MI/TO  NA/RM
BA         0   662    877    255
FI       662     0    295    268
MI/TO    877   295      0    564
NA/RM    255   268    564      0

min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called
BA/NA/RM
L(BA/NA/RM) = 255
m=3
            BA/NA/RM    FI  MI/TO
BA/NA/RM           0   268    564
FI               268     0    295
MI/TO            564   295      0

min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called
BA/FI/NA/RM

L(BA/FI/NA/RM) = 268
m=4
               BA/FI/NA/RM  MI/TO
BA/FI/NA/RM              0    295
MI/TO                  295      0

Finally, we merge the last two clusters at level 295.


The process is summarized by the following hierarchical tree:
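If SciPy is available, the merge levels above (and the tree itself, via scipy.cluster.hierarchy.dendrogram) can be reproduced directly from the input distance matrix; a quick sketch:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

cities = ["BA", "FI", "MI", "NA", "RM", "TO"]
D = np.array([[  0, 662, 877, 255, 412, 996],
              [662,   0, 295, 468, 268, 400],
              [877, 295,   0, 754, 564, 138],
              [255, 468, 754,   0, 219, 869],
              [412, 268, 564, 219,   0, 669],
              [996, 400, 138, 869, 669,   0]], dtype=float)

# linkage() expects the condensed (upper-triangular) form of the distance matrix.
Z = linkage(squareform(D), method="single")
print(Z[:, 2])                                  # merge levels: 138, 219, 255, 268, 295
print(fcluster(Z, t=2, criterion="maxclust"))   # cut the tree into 2 clusters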


Clustering as a Mixture of Gaussians


Introduction to Model-Based Clustering
There's another way to deal with clustering problems: a model-based approach, which
consists of using certain models for clusters and attempting to optimize the fit between the
data and the model.
In practice, each cluster can be mathematically represented by a parametric distribution, like
a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modelled by a
mixture of these distributions. An individual distribution used to model a specific cluster is
often referred to as a component distribution.
A mixture model with high likelihood tends to have the following traits:

component distributions have high peaks (data in one cluster are tight);

the mixture model covers the data well (dominant patterns in the data are captured
by component distributions).

Main advantages of model-based clustering:

well-studied statistical inference techniques available;

flexibility in choosing the component distribution;

obtain a density estimation for each cluster;

a soft classification is available.

Mixture of Gaussians
The most widely used clustering method of this kind is the one based on learning a mixture of
Gaussians: we can actually consider clusters as Gaussian distributions centred on their
barycentres, as we can see in this picture, where the grey circle represents the first variance of
the distribution:

The algorithm works in this way:

it chooses a component (a Gaussian) ω_i at random, with probability P(ω_i);

it samples a point x from that Gaussian, N(μ_i, σ²I).

Let's suppose we have the sample

x1, x2,..., xN

and that we want to estimate the centres of the Gaussians. What we really care about is
P(x | μ_1, ..., μ_k), the probability of a datum given the centres of the Gaussians:

P(x | μ_1, ..., μ_k) = \sum_i P(x | ω_i, μ_1, ..., μ_k) \, P(ω_i).

This is the base used to write the likelihood of the sample:

L(μ_1, ..., μ_k) = \prod_{j=1}^{N} P(x_j | μ_1, ..., μ_k) = \prod_{j=1}^{N} \sum_i P(x_j | ω_i, μ_1, ..., μ_k) \, P(ω_i).

Now we should maximise the likelihood function by setting \partial L / \partial μ_i = 0, but this would be too
difficult. That's why we use a simplified iterative algorithm called EM (Expectation-Maximization).

The EM Algorithm
The algorithm which is used in practice to find the mixture of Gaussians that can model the
data set is called EM (Expectation-Maximization) (Dempster, Laird and Rubin, 1977). Let's
see how it works with an example.
Suppose x_k are the marks obtained by the students of a class, with these probabilities:

x1 = 30, with probability P(x1) = 1/2
x2 = 18, with probability P(x2) = μ
x3 = 0, with probability P(x3) = 2μ
x4 = 23, with probability P(x4) = 1/2 - 3μ

First case: we observe that the marks are distributed among the students as follows:
x1 : a students
x2 : b students
x3 : c students
x4 : d students

The probability of this observation is

P(a, b, c, d | μ) \propto (1/2)^a \, μ^b \, (2μ)^c \, (1/2 - 3μ)^d.

We should maximise this function by calculating \partial P / \partial μ = 0. Let's instead calculate the
logarithm of the function and maximise it:

\log P(a, b, c, d | μ) = \text{const} + b \log μ + c \log(2μ) + d \log(1/2 - 3μ),

whose maximum is reached at μ = (b + c) / (6 (b + c + d)).

Supposing a = 14, b = 6, c = 9 and d = 10 we can calculate that μ = 1/10.

Second case: we observe that the marks are distributed among the students as follows:

x1 + x2 : h students
x3 : c students
x4 : d students

Now the counts a and b are hidden: they can only be estimated if we know μ, while μ in turn
can only be estimated from b and c. We have thus obtained a circularity, which is divided into two steps:

expectation: given the current μ, compute the expected hidden counts a = h (1/2) / (1/2 + μ) and b = h μ / (1/2 + μ);

maximization: given the expected b, re-estimate μ = (b + c) / (6 (b + c + d)).

This circularity can be solved in an iterative way.


Let's now see how the EM algorithm works for a mixture of Gaussians (parameters estimated
at the p-th iteration are marked by a superscript (p)):

1. Initialize the parameters: λ^(0) = { μ_1^(0), ..., μ_k^(0), P(ω_1)^(0), ..., P(ω_k)^(0) }.

2. E-step: compute the responsibility of each component for each record x_r,

P(ω_i | x_r, λ^(p)) = \frac{ P(x_r | ω_i, λ^(p)) \, P(ω_i | λ^(p)) }{ \sum_j P(x_r | ω_j, λ^(p)) \, P(ω_j | λ^(p)) }.

3. M-step: re-estimate the parameters using these responsibilities,

μ_i^(p+1) = \frac{ \sum_r P(ω_i | x_r, λ^(p)) \, x_r }{ \sum_r P(ω_i | x_r, λ^(p)) }, \qquad P(ω_i)^(p+1) = \frac{1}{R} \sum_r P(ω_i | x_r, λ^(p)),

where R is the number of records.
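The following Python sketch implements this E/M loop for a simplified mixture of spherical Gaussians with fixed unit variance; the simplification, the function name em_gmm and the initialisation are our own assumptions (a full implementation would also re-estimate the covariances).

import numpy as np

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a mixture of k spherical Gaussians with fixed unit variance.
    Returns the mixing weights P(w_i) and the means mu_i."""
    rng = np.random.default_rng(seed)
    R, d = X.shape                                  # R = number of records
    mu = X[rng.choice(R, size=k, replace=False)]    # initial centres: k random records
    p = np.full(k, 1.0 / k)                         # initial mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities P(w_i | x_r) proportional to p_i * exp(-||x_r - mu_i||^2 / 2).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_resp = np.log(p) - 0.5 * sq
        log_resp -= log_resp.max(axis=1, keepdims=True)   # for numerical stability
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the means and mixing weights from the responsibilities.
        Nk = resp.sum(axis=0)
        mu = (resp.T @ X) / Nk[:, None]
        p = Nk / R
    return p, mu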


Links
Andrew Moore: K-means and Hierarchical Clustering - Tutorial Slides
http://www-2.cs.cmu.edu/~awm/tutorials/kmeans.html
Brian T. Luke: K-Means Clustering
http://fconyx.ncifcrf.gov/~lukeb/kmeans.html
Tariq Rashid: Clustering
http://www.cs.bris.ac.uk/home/tr1690/documentation/fuzzy_clustering_initial_report/node11.html
Hans-Joachim Mucha and Hizir Sofyan: Cluster Analysis
http://www.quantlet.com/mdstat/scripts/xag/html/xaghtmlframe142.html
Frigui Hichem: Similarity Measures and Criterion Functions for clustering
http://prlab.ee.memphis.edu/frigui/ELEC7901/UNSUP2/SimObj.html
Osmar R. Zaane: Principles of Knowledge Discovery in Databases - Chapter 8: Data
Clustering
http://www.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter8/index.html
Pier Luca Lanzi: Ingegneria della Conoscenza e Sistemi Esperti, Lezione 2: Apprendimento non supervisionato
http://www.elet.polimi.it/upload/lanzi/corsi/icse/2002/Lezione%202%20-%20Apprendimento%20non%20supervisionato.pdf
Stephen P. Borgatti: How to explain hierarchical clustering
http://www.analytictech.com/networks/hiclus.htm
Maria Irene Miranda: Clustering methods and algorithms
http://www.cse.iitb.ac.in/dbms/Data/Courses/CS632/1999/clustering/dbms.html
Jia Li: Data Mining - Clustering by Mixture Models
http://www.stat.psu.edu/~jiali/course/stat597e/notes/mix.pdf

Nearest neighbor search


From Wikipedia, the free encyclopedia
Nearest neighbor search (NNS), also known as proximity search, similarity search or
closest point search, is an optimization problem for finding closest (or most similar) points.
Closeness is typically expressed in terms of a dissimilarity function: the less similar the
objects, the larger the function values. Formally, the nearest-neighbor (NN) search problem is

defined as follows: given a set S of points in a space M and a query point q ∈ M, find the
closest point in S to q. Donald Knuth in vol. 3 of The Art of Computer Programming (1973)
called it the post-office problem, referring to an application of assigning to a residence the
nearest post office. A direct generalization of this problem is a k-NN search, where we need
to find the k closest points.
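A brute-force solution simply computes the distance from q to every point of S and keeps the k smallest; the sketch below (with illustrative data and a hypothetical helper name) does exactly that, while practical systems use spatial index structures such as k-d trees to avoid the linear scan.

import numpy as np

def nearest_neighbors(S, q, k=1):
    """Brute-force k-NN search: indices of the k points of S closest to q (Euclidean distance)."""
    S, q = np.asarray(S, dtype=float), np.asarray(q, dtype=float)
    d = np.linalg.norm(S - q, axis=1)      # distance from q to every point of S
    return np.argsort(d)[:k]               # linear scan; spatial indexes do better

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 2.0]])
print(nearest_neighbors(points, q=[0.9, 0.8], k=2))   # -> [1 0]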

Artificial neural network


From Wikipedia, the free encyclopedia
"Neural network" redirects here. For networks of living neurons, see Biological neural
network. For the journal, see Neural Networks (journal). For the evolutionary concept, see
Neutral network (evolution).
"Neural computation" redirects here. For the journal, see Neural Computation (journal).


An artificial neural network is an interconnected group of nodes, akin to the vast network of
neurons in a brain. Here, each circular node represents an artificial neuron and an arrow
represents a connection from the output of one neuron to the input of another.
In machine learning and cognitive science, artificial neural networks (ANNs) are a family
of models inspired by biological neural networks (the central nervous systems of animals, in
particular the brain) which are used to estimate or approximate functions that can depend on a
large number of inputs and are generally unknown. Artificial neural networks are generally
presented as systems of interconnected "neurons" which exchange messages between each
other. The connections have numeric weights that can be tuned based on experience, making
neural nets adaptive to inputs and capable of learning.
For example, a neural network for handwriting recognition is defined by a set of input
neurons which may be activated by the pixels of an input image. After being weighted and
transformed by a function (determined by the network's designer), the activations of these
neurons are then passed on to other neurons. This process is repeated until finally, the output
neuron that determines which character was read is activated.
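A minimal sketch of this forward flow of weighted, transformed activations; the layer sizes, the tanh nonlinearity and the fixed weights are purely illustrative, since a real network would learn its weights from data.

import numpy as np

def forward(x, weights, biases):
    """One forward pass through a tiny fully connected network:
    each layer weights its inputs, adds a bias, and applies a nonlinearity."""
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        a = np.tanh(W @ a + b)        # activation passed on to the next layer
    return a

# Hypothetical 2-3-1 network with fixed weights, just to show the data flow.
weights = [np.array([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]),
           np.array([[0.7, -0.5, 0.2]])]
biases = [np.zeros(3), np.zeros(1)]
print(forward([1.0, 2.0], weights, biases))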

Like other machine learning methods (systems that learn from data), neural networks have
been used to solve a wide variety of tasks, like computer vision and speech recognition, that
are hard to solve using ordinary rule-based programming.

Principal component analysis


From Wikipedia, the free encyclopedia

PCA of a multivariate Gaussian distribution centered at (1,3) with a standard deviation of 3 in


roughly the (0.866, 0.5) direction and of 1 in the orthogonal direction. The vectors shown are
the eigenvectors of the covariance matrix scaled by the square root of the corresponding
eigenvalue, and shifted so their tails are at the mean.
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly correlated variables into a set of
values of linearly uncorrelated variables called principal components. The number of
principal components is less than or equal to the number of original variables. This
transformation is defined in such a way that the first principal component has the largest
possible variance (that is, accounts for as much of the variability in the data as possible), and
each succeeding component in turn has the highest variance possible under the constraint that
it is orthogonal to the preceding components. The resulting vectors are an uncorrelated
orthogonal basis set. The principal components are orthogonal because they are the
eigenvectors of the covariance matrix, which is symmetric. PCA is sensitive to the relative
scaling of the original variables.
PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation
can be thought of as revealing the internal structure of the data in a way that best explains the
variance in the data. If a multivariate dataset is visualised as a set of coordinates in a
high-dimensional data space (1 axis per variable), PCA can supply the user with a
lower-dimensional picture, a projection or "shadow" of this object when viewed from its (in some
sense) most informative viewpoint. This is done by using only the first few
principal components so that the dimensionality of the transformed data is reduced.
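A small Python sketch of this procedure via the singular value decomposition of the centred data; the function name pca, the synthetic correlated data and the choice of two components are our own illustration.

import numpy as np

def pca(X, n_components=2):
    """PCA via SVD: returns the projected data, the principal directions and their variances."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # eigenvectors of the covariance matrix
    explained_var = (s ** 2) / (len(X) - 1)      # corresponding eigenvalues
    return Xc @ components.T, components, explained_var[:n_components]

# Correlated 2-D data: the first component should point along the direction of largest spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 1.0]])
scores, components, var = pca(X, n_components=2)
print(components[0], var)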

PCA is closely related to factor analysis. Factor analysis typically incorporates more
domain-specific assumptions about the underlying structure and solves for the eigenvectors of a
slightly different matrix.
PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems
that optimally describe the cross-covariance between two datasets while PCA defines a new
orthogonal coordinate system that optimally describes variance in a single dataset.

Relation between PCA and factor analysis[36]


Principal component analysis creates variables that are linear combinations of the original
variables. The new variables have the property that the variables are all orthogonal. The
principal components can be used to find clusters in a set of data. PCA is a variance-focused
approach seeking to reproduce the total variable variance, in which components reflect both
common and unique variance of the variable. PCA is generally preferred for purposes of data
reduction (i.e., translating variable space into optimal factor space) but not when the goal is to
detect the latent construct or factors.
Factor analysis is similar to principal component analysis, in that factor analysis also involves
linear combinations of variables. Different from PCA, factor analysis is a correlation-focused
approach seeking to reproduce the inter-correlations among variables, in which the factors
"represent the common variance of variables, excluding unique variance[37]" . In terms of the
correlation matrix, this corresponds with focusing on explaining the off-diagonal terms (i.e.
shared co-variance), while PCA focuses on explaining the terms that sit on the diagonal.
However, as a side result, when trying to reproduce the on-diagonal terms, PCA also tends to
fit relatively well the off-diagonal correlations.[38] Results given by PCA and factor analysis
are very similar in most situations, but this is not always the case, and there are some
problems where the results are significantly different. Factor analysis is generally used when
the research purpose is detecting data structure (i.e., latent constructs or factors) or causal
modeling.
