DATA MINING
History of WEKA
In 1993, the University of Waikato in New Zealand
started development of the original version of Weka,
which became a mixture of Tcl/Tk, C, and Makefiles.
In 1997, the decision was made to redevelop Weka from
scratch in Java, including implementations of the modelling
algorithms.
Overview Of WEKA
Weka is a workbench that contains a collection of visualization tools and
algorithms for data analysis and predictive modelling, together with
graphical user interfaces for easy access to this functionality.
The original non-Java version of Weka was a Tcl/Tk front-end to modelling
algorithms implemented in other programming languages, with data
pre-processing utilities written in C and the build held together by Makefiles.
This original version was primarily designed as a tool for analysing data
from agricultural domains, but the more recent, fully Java-based version
(Weka 3), whose development started in 1997, is now used in many
different application areas, in particular for educational purposes and
research.
Weka supports several standard data mining tasks, more specifically, data
pre-processing, clustering, classification, regression, visualization, and
feature selection.
Features and Interfaces in WEKA
By: Tanvi Redkar
Main Features
46 data pre-processing tools
76 classification/regression algorithms
8 clustering algorithms
15 attribute/subset evaluators plus 10 search algorithms
for feature selection
3 algorithms for finding association rules
Cont.
Options to customise Weka via its Java source code are
available.
Custom extensions and plug-ins can be developed.
An excellent mailing and discussion list is available.
Cont.
3 graphical user interfaces.
WEKA INTERFACES
Command-line interface
Explorer
pre-processing, attribute selection, learning,
visualisation
Knowledge Flow
visual design of data-processing flows;
capabilities similar to the Explorer
Experimenter
testing and evaluating machine learning algorithms
Limitations of WEKA
By: Priyanka Bhagat
The main disadvantage is that most of the functionality is only applicable if all
data is held in main memory.
A few algorithms are included that are able to process data incrementally or
in batches.
However, for most of the methods the amount of available memory imposes a
limit on the data size, which restricts application to small or medium-sized
datasets.
A second disadvantage is that a Java implementation is generally somewhat
slower than an equivalent program written in C/C++.
Types of clusters
Well-separated clusters
Distribution-based clustering
Centroid-based clustering
Hierarchical clustering
Density-based clustering
Conceptual clustering
Clustering algorithm
K-means clustering
K-means is one of the simplest unsupervised learning
algorithms that solve the well-known clustering problem. The
procedure follows a simple and easy way to group a given
data set into a certain number of clusters fixed a priori.
Algorithmic steps
Select K points as the initial centroids.
Repeat:
Form K clusters by assigning each point to the closest
centroid.
Recompute the centroid of each cluster.
Until the centroids no longer change.
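The steps above (Lloyd's algorithm) can be sketched in plain Java. This is a minimal one-dimensional illustration, not Weka's own SimpleKMeans implementation; the sample points, K = 2, and the choice of the first and last points as initial centroids are assumptions made for the example:

```java
import java.util.Arrays;

public class KMeansSketch {
    // Runs the k-means loop: assign each point to its nearest centroid,
    // then recompute centroids, until the centroids stop changing.
    static double[] cluster(double[] points, double[] initial) {
        int k = initial.length;
        double[] centroids = initial.clone();
        int[] assignment = new int[points.length];
        boolean changed = true;
        while (changed) {
            // Form K clusters by assigning each point to the closest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(points[i] - centroids[c])
                            < Math.abs(points[i] - centroids[best])) {
                        best = c;
                    }
                }
                assignment[i] = best;
            }
            // Recompute each centroid as the mean of its cluster's members.
            double[] sums = new double[k];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                sums[assignment[i]] += points[i];
                counts[assignment[i]]++;
            }
            changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // leave empty clusters alone
                double updated = sums[c] / counts[c];
                if (updated != centroids[c]) {
                    centroids[c] = updated;
                    changed = true;
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 2.0, 8.0, 8.5, 9.0}; // toy data (assumed)
        // Select K = 2 initial centroids (here: the first and last points).
        double[] result = cluster(points, new double[]{1.0, 9.0});
        System.out.println(Arrays.toString(result)); // [1.5, 8.5]
    }
}
```

Note that the result depends on the initial centroids, which is exactly the sensitivity to initialisation discussed under the disadvantages below.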
Advantages
1) Fast, robust, and easy to understand.
2) Relatively efficient: O(t*k*n*d), where n is the number of objects, k the
number of clusters, d the dimensionality of each object, and t the number of
iterations. Normally, k, t, d << n.
3) Gives the best results when the data sets are distinct or well separated
from each other.
Disadvantages
1) The learning algorithm requires a priori specification of
the number of cluster centres.
2) The use of exclusive assignment: if there are two highly
overlapping groups of data, then k-means will not be able to
resolve that there are two clusters.
3) The learning algorithm is not invariant to non-linear
transformations, i.e. with different representations of the data we
get different results.
4) Randomly choosing the initial cluster centres may not lead to
a fruitful result.
5) Applicable only when the mean is defined, i.e. it fails for
categorical data.
6) Unable to handle noisy data and outliers.
7) The algorithm fails for non-linearly separable data sets.
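Disadvantage 5 can be made concrete with a tiny worked example. Suppose a categorical colour attribute were (hypothetically) encoded as integers, red = 0, green = 1, blue = 2; the k-means centroid update, which averages cluster members, then produces a meaningless "average colour":

```java
public class CategoricalMeanDemo {
    // The k-means centroid update: the arithmetic mean of cluster members.
    static double centroid(int a, int b) {
        return (a + b) / 2.0;
    }

    public static void main(String[] args) {
        // Hypothetical integer encoding of a categorical attribute
        // (an assumption for illustration): red = 0, green = 1, blue = 2.
        int red = 0, blue = 2;
        // The "mean" of red and blue comes out as 1.0, i.e. green:
        // an artefact of the arbitrary encoding, not a real average colour.
        System.out.println(centroid(red, blue)); // prints 1.0
    }
}
```

A different encoding (say, blue = 5) would give a different "mean", which is why k-means is only applicable where the mean is genuinely defined.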