
DATA MINING

Basic Data Mining Tasks

Classification: maps data into predefined groups or classes (supervised learning). Examples: pattern recognition, prediction.

Regression: maps a data item to a real-valued prediction variable.

Clustering: groups similar data together into clusters (unsupervised learning). Also called segmentation or partitioning.

Summarization: maps data into subsets with associated simple descriptions. Also called characterization or generalization.

Link Analysis: uncovers relationships among data (affinity analysis, association rules). Sequential Analysis determines sequential patterns.

Example: Time Series Analysis (e.g., the stock market): predict future values, determine similar patterns over time, classify behavior.

Data Mining vs. KDD


Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.
Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process.

KDD Process

Selection: obtain data from various sources.
Preprocessing: cleanse the data.
Transformation: convert to a common format; transform to a new format.
Data Mining: obtain the desired results.
Interpretation/Evaluation: present results to the user in a meaningful manner.

KDD Process Example: Web Log

Selection: Select log data (dates and locations) to use

Preprocessing: Remove identifying URLs; remove error logs.

Transformation: Sessionize (sort and group)

Data Mining: Identify and count patterns; construct a data structure.

Interpretation/Evaluation: Identify and display frequently accessed sequences.

Potential user applications: cache prediction, personalization.

Data Mining Development

Data mining builds on developments in several areas:
  Databases: relational data model, SQL, association rule algorithms, data warehousing, scalability techniques.
  Information retrieval: similarity measures, hierarchical clustering, IR systems, imprecise queries, textual data, web search engines.
  Statistics.
  Machine learning: neural networks, decision tree algorithms.
  Algorithms: algorithm design techniques, algorithm analysis, data structures.

Data Mining Issues
Human interaction, overfitting, outliers, interpretation, visualization, large datasets, high dimensionality, multimedia data, missing data, irrelevant data, noisy data, changing data, integration, application.

Social Implications of DM
Privacy, profiling, unauthorized use.

Data Mining Metrics
Usefulness, return on investment (ROI), accuracy, space/time.

Database Perspective on Data Mining
Scalability, real-world data.

Data Mining Techniques Outline
Goal: provide an overview of basic data mining techniques.
  Statistical: point estimation, models based on summarization, Bayes theorem, hypothesis testing, regression and correlation
  Similarity measures
  Decision trees
  Neural networks: activation functions
  Genetic algorithms

Point Estimation
Point Estimate: an estimate of a population parameter. It may be made by calculating the parameter for a sample, and may be used to predict a value for missing data.
Example: a relation R contains 100 employees; 99 have salary information; the mean salary of these is $50,000. Use $50,000 as the value of the remaining employee's salary.
  Is this a good idea?
Estimation Error:

Bias: Difference between expected value and actual value.

Mean Squared Error (MSE): the expected value of the squared difference between the estimate and the actual value:
  MSE = E[(estimate - actual value)^2]
Why square? Squaring makes every error positive and penalizes large errors more heavily.
Root Mean Squared Error (RMSE): the square root of the MSE, expressed in the same units as the data.
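A minimal sketch of these error measures in code, using made-up actual values and the sample mean as the point estimate (the numbers are illustrative only):

    # Point estimation error measures: bias, MSE, RMSE (illustrative values).
    actual = [48000, 52000, 50000, 51000, 49000]
    estimate = 50000  # e.g., the sample mean used as a point estimate

    n = len(actual)
    bias = estimate - sum(actual) / n                   # expected value minus actual mean
    mse = sum((estimate - a) ** 2 for a in actual) / n  # mean squared error
    rmse = mse ** 0.5                                   # root mean squared error, same units as the data
    print(bias, mse, rmse)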

Models Based on Summarization

Visualization: frequency distribution, mean, variance, median, mode, etc.
Box plots and scatter diagrams.

Bayes Theorem

Posterior probability: P(h1 | xi), the probability of hypothesis h1 given the data value xi.
Prior probability: P(h1).
Bayes Theorem:
  P(h1 | xi) = P(xi | h1) P(h1) / P(xi)
It assigns probabilities to hypotheses given a data value.

Bayes Theorem Example

Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police.
From the training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.
Training data (partial):
  ID   Income   Credit
  1    4        Excellent
  2    3        Good
  3    2        Excellent

Hypothesis Testing

Find a model to explain behavior by creating and then testing a hypothesis about the data. This is the exact opposite of the usual data mining approach.
  H0: null hypothesis, the hypothesis to be tested.
  H1: alternative hypothesis.
  O: observed value.
  E: expected value based on the hypothesis.

Chi-Squared Statistic
  chi-squared = sum over all cells of (O - E)^2 / E

Example: O = {50, 93, 67, 78, 87} and E = 75 for each cell give chi-squared = 15.55, which is significant (it exceeds the 5% critical value of about 9.49 for 4 degrees of freedom).
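A short sketch that reproduces the chi-squared value for the example above (pure Python, no libraries assumed):

    # Chi-squared statistic: sum of (O - E)^2 / E over all cells.
    observed = [50, 93, 67, 78, 87]
    expected = 75  # same expected count for every cell

    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    print(round(chi2, 2))  # 15.55; compare against the critical value for 4 degrees of freedom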

Regression

Predict future values based on past values. Linear regression assumes a linear relationship exists; find the coefficient values that best fit the data:
  y = c0 + c1*x1 + ... + cn*xn

Correlation
Examines the degree to which the values of two variables behave similarly. Correlation coefficient r:
  r = 1: perfect correlation
  r = -1: perfect but opposite correlation
  r = 0: no correlation
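A minimal sketch of one-variable least-squares regression and the correlation coefficient r, using invented sample points:

    # Fit y = c0 + c1*x by least squares and compute Pearson's r (illustrative data).
    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 3.9, 6.2, 8.0, 9.8]

    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)

    c1 = sxy / sxx                 # slope
    c0 = my - c1 * mx              # intercept
    r = sxy / (sxx * syy) ** 0.5   # correlation coefficient in [-1, 1]
    print(c0, c1, r)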

Decision Trees

Decision Tree (DT): Tree where the root and each internal node is labeled with a question. The arcs represent each possible answer to the associated question. Each leaf node represents a prediction of a solution to the problem.

Popular technique for classification; Leaf node indicates class to which the corresponding tuple belongs.

A Decision Tree Model is a computational model consisting of three parts:
  1. A decision tree
  2. An algorithm to create the tree
  3. An algorithm that applies the tree to data

Creation of the tree is the most difficult part. Processing is basically a search similar to that in a binary search tree (although DT may not be binary).

Advantages/Disadvantages
Advantages: easy to understand; easy to generate rules.
Disadvantages: may suffer from overfitting; classifies by rectangular partitioning; does not easily handle nonnumeric data; can be quite large, so pruning is necessary.

Neural Networks
Based on the observed functioning of the human brain (Artificial Neural Networks, ANN). Our view of neural networks is very simplistic: we view a neural network (NN) from a graphical viewpoint. Alternatively, a NN may be viewed from the perspective of matrices. Used in pattern recognition, speech recognition, computer vision, and classification.

Neural Networks

A Neural Network (NN) is a directed graph F = <V, A> with vertices V = {1, 2, ..., n} and arcs A = {<i,j> | 1 <= i, j <= n}, with the following restrictions:
  - V is partitioned into a set of input nodes VI, hidden nodes VH, and output nodes VO.
  - The vertices are also partitioned into layers; any arc <i,j> must have node i in layer h-1 and node j in layer h.
  - Arc <i,j> is labeled with a numeric weight wij.
  - Node i is labeled with a function fi.

Neural Network Example

NN Node

NN Activation Functions: functions associated with the nodes of the graph. The output may be in the range [-1,1] or [0,1].
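A small sketch of common activation functions that map a node's weighted input sum into [0,1] or [-1,1]; the particular functions chosen here (step, sigmoid, tanh) are standard examples, not prescribed by these notes:

    import math

    def threshold(s):          # step function: output in {0, 1}
        return 1.0 if s >= 0 else 0.0

    def sigmoid(s):            # logistic function: output in (0, 1)
        return 1.0 / (1.0 + math.exp(-s))

    def tanh_act(s):           # hyperbolic tangent: output in (-1, 1)
        return math.tanh(s)

    # A node's output: apply the activation to the weighted sum of its inputs.
    def node_output(inputs, weights, f=sigmoid):
        return f(sum(w * x for w, x in zip(weights, inputs)))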

Neural Networks

A Neural Network Model is a computational model consisting of three parts:
  1. Neural network graph
  2. Learning algorithm that indicates how learning takes place
  3. Recall techniques that determine how information is obtained from the network
We will look at propagation as the recall technique.

NN Advantages: learning; can continue learning even after the training set has been applied; easy parallelization; solves many problems.

NN Disadvantages: difficult to understand; may suffer from overfitting; the structure of the graph must be determined a priori; input values must be numeric; verification is difficult.

Genetic Algorithms

Genetic algorithms are optimization search algorithms: they create an initial feasible solution and iteratively create new, better solutions. They are based on human evolution and survival of the fittest. A solution must be represented as an individual.
  Individual: a string I = I1, I2, ..., In where each Ij comes from a given alphabet A. Each character Ij is called a gene.
  Population: a set of individuals.

A Genetic Algorithm (GA) is a computational model consisting of five parts:
  1. A starting set of individuals, P.
  2. Crossover: a technique to combine two parents to create offspring.
  3. Mutation: randomly change an individual.
  4. Fitness: a function that determines the best individuals.
  5. An algorithm that applies the crossover and mutation techniques to P iteratively, using the fitness function to determine the best individuals in P to keep.

Crossover Examples

a) Single crossover (one crossover point):
     Parents:  000 | 000      111 | 111
     Children: 000 | 111      111 | 000

b) Multiple crossover (two crossover points):
     Parents:  000 | 000 | 00      111 | 111 | 11
     Children: 000 | 111 | 00      111 | 000 | 11
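A small sketch of single-point and two-point crossover plus mutation on bit-string individuals, matching the examples above (alphabet {0,1}; the crossover points and mutation rate are chosen only for illustration):

    import random

    def single_point_crossover(p1, p2, point):
        # Swap the tails of the two parents after the crossover point.
        return p1[:point] + p2[point:], p2[:point] + p1[point:]

    def two_point_crossover(p1, p2, a, b):
        # Swap the middle segment between positions a and b.
        return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

    def mutate(ind, rate=0.01):
        # Flip each gene independently with a small probability.
        return "".join(c if random.random() > rate else ("1" if c == "0" else "0") for c in ind)

    print(single_point_crossover("000000", "111111", 3))       # ('000111', '111000')
    print(two_point_crossover("00000000", "11111111", 3, 6))   # ('00011100', '11100011')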

GA Advantages/Disadvantages
Advantages: easily parallelized.
Disadvantages: difficult to understand and explain to end users; abstraction of the problem and the method to represent individuals is quite difficult; determining the fitness function is difficult; determining how to perform crossover and mutation is difficult.

Data Mining Concepts
Associations and Item-sets: an association is a rule of the form "if X then Y", denoted X -> Y.

Example: If India wins in cricket, sales of sweets go up.
For any rule, if X -> Y and Y -> X, then X and Y are called an interesting item-set.
Example: people buying school uniforms in June also buy school bags, and people buying school bags in June also buy school uniforms.

Support and Confidence:
  The support for a rule R is the ratio of the number of occurrences of R to the total number of transactions.
  The confidence of a rule X -> Y is the ratio of the number of occurrences of X and Y together to the number of occurrences of X.
  Example: support for {Bag, Uniform} = 5/10 = 0.5; confidence for Bag -> Uniform = 5/8 = 0.625.

Mining for Frequent Item-sets: The Apriori Algorithm
Given minimum required support s as the interestingness criterion:
  1. Search for all individual elements (1-element item-sets) that have a minimum support of s.
  2. Repeat:
     a. From the results of the previous search for i-element item-sets, search for all (i+1)-element item-sets that have a minimum support of s.
     b. This becomes the set of all frequent (i+1)-element item-sets that are interesting.
  3. Until the item-set size reaches the maximum.

Example (minimum support = 0.3):
  Interesting 1-element item-sets: {Bag}, {Uniform}, {Crayons}, {Pencil}, {Books}
  Interesting 2-element item-sets: {Bag,Uniform}, {Bag,Crayons}, {Bag,Pencil}, {Bag,Books}, {Uniform,Crayons}, {Uniform,Pencil}, {Pencil,Books}
  Interesting 3-element item-sets: {Bag,Uniform,Crayons}
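A compact sketch of the level-wise Apriori search described above; the transactions below are hypothetical stand-ins, since the original 10-transaction table is not reproduced in these notes:

    def apriori(transactions, minsup):
        # transactions: list of sets of items; minsup: minimum support as a fraction.
        n = len(transactions)

        def support(itemset):
            return sum(1 for t in transactions if itemset <= t) / n

        items = {i for t in transactions for i in t}
        levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]
        k = 1
        while levels[-1]:
            # Join step: build (k+1)-item candidates from the frequent k-item sets.
            candidates = {a | b for a in levels[-1] for b in levels[-1] if len(a | b) == k + 1}
            levels.append({c for c in candidates if support(c) >= minsup})
            k += 1
        return [s for level in levels for s in level]

    # Hypothetical transactions, just to illustrate the call:
    txns = [{"Bag", "Uniform", "Crayons"}, {"Bag", "Uniform"},
            {"Pencil", "Books"}, {"Bag", "Pencil", "Crayons"}]
    print(apriori(txns, minsup=0.5))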

Mining for Association Rules


Association rules are of the form A -> B, which is directional.
Association rule mining requires two thresholds: minsup and minconf.
General procedure:
  1. Use Apriori to generate frequent itemsets of different sizes.
  2. At each iteration, divide each frequent itemset X into two parts, LHS and RHS; this represents a rule of the form LHS -> RHS.
  3. The confidence of such a rule is support(X) / support(LHS).
  4. Discard all rules whose confidence is less than minconf.
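A sketch of this rule-generation step. The support dictionary is assumed to have been computed alongside Apriori (a hypothetical helper structure mapping each frozenset to its support):

    from itertools import combinations

    def generate_rules(frequent_itemsets, support, minconf):
        # support: dict mapping frozenset -> support value (assumed precomputed).
        rules = []
        for X in frequent_itemsets:
            if len(X) < 2:
                continue
            for r in range(1, len(X)):
                for lhs in combinations(X, r):
                    lhs = frozenset(lhs)
                    rhs = X - lhs
                    conf = support[X] / support[lhs]   # confidence = support(X) / support(LHS)
                    if conf >= minconf:
                        rules.append((lhs, rhs, conf))
        return rules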

Example: the frequent itemset {Bag, Uniform, Crayons} has a support of 0.3. It can be divided into the following rules, with the confidences shown:
  {Bag} -> {Uniform, Crayons}        0.375
  {Bag, Uniform} -> {Crayons}        0.6
  {Bag, Crayons} -> {Uniform}        0.75
  {Uniform} -> {Bag, Crayons}        0.428
  {Uniform, Crayons} -> {Bag}        0.75
  {Crayons} -> {Bag, Uniform}        0.75
If minconf is 0.7, then we have discovered the following rules:
  - People who buy a school bag and a set of crayons are likely to buy a school uniform.
  - People who buy a school uniform and a set of crayons are likely to buy a school bag.
  - People who buy just a set of crayons are likely to buy a school bag and a school uniform as well.

Generalized Association Rules
Since customers can buy any number of items in one transaction, the transaction relation would be in the form of a list of individual purchases:
  Bill No.   Date         Item
  15563      23.10.2003   Books
  15563      23.10.2003   Crayons
  15564      23.10.2003   Uniform
  15564      23.10.2003   Crayons

A transaction for the purposes of data mining is obtained by performing a GROUP BY of the table over various fields. A GROUP BY over Bill No. would show frequent buying patterns across different customers. A GROUP BY over Date would show frequent buying patterns across different days.

Classification and Clustering


Given a set of data elements:
  Classification maps each data element to one of a set of pre-determined classes, based on the differences among data elements belonging to different classes.
  Clustering groups data elements into different groups based on the similarity between elements within a single group.

Example classification problem: Weather -> Play(Yes, No)

  Outlook    Temp   Play?
  Sunny      30     Yes
  Overcast   15     No
  Sunny      16     Yes
  Cloudy     27     Yes
  Overcast   25     Yes
  Overcast   17     No
  Cloudy     17     No
  Cloudy     35     Yes

Classification Techniques

Hunt's method for decision tree identification:
Given N element types and m decision classes:
  1. For i = 1 to N do:
     a. Add element i to the (i-1)-element item-sets from the previous iteration.
     b. Identify the set of decision classes for each item-set.
     c. If an item-set has only one decision class, then that item-set is done; remove it from subsequent iterations.
  2. done

Decision Tree Identification Example

  Outlook    Temp       Play?
  Sunny      Warm       Yes
  Overcast   Chilly     No
  Sunny      Chilly     Yes
  Cloudy     Pleasant   Yes
  Overcast   Pleasant   Yes
  Overcast   Chilly     No
  Cloudy     Chilly     No
  Cloudy     Warm       Yes

This is a top-down technique for decision tree identification. The decision tree created is sensitive to the order in which items are considered. If an N-item-set does not result in a clear decision, classification classes have to be modeled by rough sets.

Other Classification Algorithms
Quinlan's depth-first strategy builds the decision tree in a depth-first fashion by considering all possible tests that give a decision and selecting the test that gives the best information gain. It hence eliminates tests that are inconclusive.
SLIQ (Supervised Learning in Quest), developed in the QUEST project of IBM, uses a top-down breadth-first strategy to build a decision tree. At each level in the tree, an entropy value for each node is calculated, and the nodes having the lowest entropy values are selected and expanded.

Clustering Techniques
Clustering partitions the data set into clusters or equivalence classes: similarity among members of a class is greater than similarity among members across classes. Similarity measures: Euclidean distance or other application-specific measures.

Euclidean Distance for Tables

Clustering Techniques
General strategy:
  - Draw a graph connecting items which are close to one another with edges, and partition the graph into maximally connected subcomponents; or
  - Construct a minimum spanning tree (MST) for the graph and merge items that are connected by the minimum-weight edges of the MST into a cluster.
Clustering types:
  - Hierarchical clustering: clusters are formed at different levels by merging clusters at a lower level.
  - Partitional clustering: clusters are formed at only one level.

Nearest Neighbour Clustering Algorithm: given n elements x1, x2, ..., xn, and a threshold t:
  j = 1, k = 1, Clusters = {}
  Repeat
    Find the nearest neighbour of xj among the elements already assigned to clusters; let it be in cluster m.
    If the distance to the nearest neighbour > t, then create a new cluster and k = k + 1; else assign xj to cluster m.
    j = j + 1
  until j > n

Iterative partitional clustering: given n elements x1, x2, ..., xn, and k clusters, each with a center:
  1. Assign each element to its closest cluster center.
  2. After all assignments have been made, compute the cluster centroids for each cluster.
  3. Repeat the above two steps with the new centroids until the algorithm converges.
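A minimal sketch of the nearest-neighbour clustering algorithm above, using Euclidean distance and a threshold t (the sample points are invented):

    def euclidean(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    def nearest_neighbour_clustering(points, t):
        clusters = [[points[0]]]                 # the first element starts cluster 1
        for x in points[1:]:
            # Find the already-clustered element nearest to x.
            dist, ci = min((euclidean(x, y), ci) for ci, c in enumerate(clusters) for y in c)
            if dist > t:
                clusters.append([x])             # too far: start a new cluster
            else:
                clusters[ci].append(x)           # otherwise join the neighbour's cluster
        return clusters

    print(nearest_neighbour_clustering([(0, 0), (0, 1), (5, 5), (5, 6)], t=2.0))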

Mining Sequence Data


Characteristics of sequence data:
  - A collection of data elements which are ordered sequences.
  - In a sequence, each item has an index associated with it.
  - A k-sequence is a sequence of length k.
  - Support for a sequence j is the number of m-sequences (m >= j) which contain j as a subsequence.
  - Examples of sequence data: transaction logs, DNA sequences, patient ailment history, ...
A sequence is a list of itemsets of finite length. Example:

Some Definitions:

{pen,pencil,ink} {pencil,ink} {ink,eraser} {ruler,pencil}  (the purchases of a single customer over time)

The order of items within an itemset does not matter, but the order of itemsets does matter.
A subsequence is a sequence with some itemsets deleted.
A sequence S = {a1, a2, ..., am} is said to be contained within another sequence S' if S' contains a subsequence {b1, b2, ..., bm} such that a1 is a subset of b1, a2 is a subset of b2, ..., and am is a subset of bm.

Some Definitions:

Hence, {pen}{pencil}{ruler,pencil} is contained in {pen,pencil,ink}{pencil,ink}{ink,eraser}{ruler,pencil}
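A small sketch of the containment test defined above: sequence "small" is contained in "big" if its itemsets can be matched, in order, to supersets among the itemsets of "big":

    def contains(big, small):
        # big, small: lists of sets of items; True if small is contained in big.
        i = 0
        for itemset in big:
            if i < len(small) and small[i] <= itemset:   # small[i] is a subset of this itemset
                i += 1
        return i == len(small)

    big = [{"pen", "pencil", "ink"}, {"pencil", "ink"}, {"ink", "eraser"}, {"ruler", "pencil"}]
    small = [{"pen"}, {"pencil"}, {"ruler", "pencil"}]
    print(contains(big, small))   # True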

Apriori Algorithm for Sequences:
  L1 = set of all interesting 1-sequences
  k = 1
  while Lk is not empty do
    Generate all candidate (k+1)-sequences
    Lk+1 = set of all interesting (k+1)-sequences
    k = k + 1
  done
Generating candidate sequences: given L1, L2, ..., Lk, candidate sequences for Lk+1 are generated as follows: for each sequence s in Lk, concatenate s with all new 1-sequences found while generating Lk-1.

Example (minsup = 0.5) over the 10 sequences:
  abcde, bdae, aebd, be, eabda, aaaa, baaa, cbdb, abbab, abde

Interesting 1-sequences: a, b, d, e
Candidate 2-sequences: aa, ab, ad, ae, ba, bb, bd, be, da, db, dd, de, ea, eb, ed, ee
Interesting 2-sequences: ab, bd
Candidate 3-sequences: aba, abb, abd, abe, aab, bab, dab, eab, bda, bdb, bdd, bde, bbd, dbd, ebd
Interesting 3-sequences: {}

Language Inference: given a set of sequences, consider each sequence as the behavioural trace of a machine, and infer the machine that can display the given sequences as behaviour.
  Input: a set of sequences (inferring the syntax of a language given its sentences).
  Output: a state machine.
Applications: discerning behavioural patterns, discovery of emergent properties, collaboration modeling, ...
State machine discovery is the reverse of state machine construction; discovery is maximalist in nature (Srinivasa and Spiliopoulou, 2000).

Shortest-run Generalization: given a set of n sequences:
  1. Create a state machine for the first sequence.
  2. For j = 2 to n do:
     a. Create a state machine for the jth sequence.
     b. Merge this state machine into the earlier one as follows:
        - Merge all halt states in the new state machine into the halt state of the existing state machine.
        - If two or more paths to the halt state share the same suffix, merge the suffixes together into a single path.
  3. Done
Example input sequences: aabcb, aac, aabc

Characteristics of streaming data: a large data sequence; no storage; often an infinite sequence. Examples: stock market quotes, streaming audio/video, network traffic.

Running mean: let n = number of items read so far and avg = running average calculated so far. On reading the next number num:
  avg = (n*avg + num) / (n + 1)
  n = n + 1

Running variance: var = sum of (num - avg)^2 = sum of (num^2 - 2*num*avg + avg^2). Let
  A = sum of num^2 over all numbers read so far
  B = sum of -2*num*avg over all numbers read so far
  C = sum of avg^2 over all numbers read so far
  avg = average of the numbers read so far, n = count of numbers read so far
On reading the next number num:
  avg = (avg*n + num) / (n + 1)
  n = n + 1
  A = A + num^2
  B = B - 2*avg*num
  C = C + avg^2
  var = A + B + C
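A sketch of maintaining a running mean and variance over a stream. It uses the standard count/sum/sum-of-squares bookkeeping rather than the A/B/C decomposition above, but computes the same running quantities in O(1) memory:

    class RunningStats:
        # Maintain count, sum and sum of squares for a stream of numbers.
        def __init__(self):
            self.n = 0
            self.total = 0.0
            self.total_sq = 0.0

        def update(self, num):
            self.n += 1
            self.total += num
            self.total_sq += num * num

        @property
        def mean(self):
            return self.total / self.n

        @property
        def variance(self):
            # E[X^2] - (E[X])^2: population variance of the numbers seen so far.
            return self.total_sq / self.n - self.mean ** 2

    stats = RunningStats()
    for num in [50, 93, 67, 78, 87]:
        stats.update(num)
    print(stats.mean, stats.variance)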

Support for data element k within a frame is defined as (# occurrences of k) / (# elements in the frame).
alpha-Consistency for data element k (Srinivasa and Spiliopoulou, CoopIS 1999) is the sustained support for k over all frames read so far, with a leakage of (1 - alpha):

  level_t(k) = (1 - alpha) * level_{t-1}(k) + alpha * sup(k)

Nearest Neighbor
Also called instance-based learning.
Algorithm: given a new instance x, find its nearest neighbor <x', y'> in the training data and return y' as the class of x.
Some interesting questions: Should the attributes be normalized? What is the time complexity? Does it really learn? Which distance measures should be used?

Nearest Neighbor (2)
Dealing with noise: k-nearest neighbor, i.e., use more than one neighbor. How many neighbors? Weighted nearest neighbors.
Huge storage: use representatives (a problem of instance selection).
How to speed up? Sampling, grids, clustering.
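A minimal k-nearest-neighbour classifier sketch with Euclidean distance and majority vote; the toy training data is invented:

    from collections import Counter

    def knn_classify(train, x, k=3):
        # train: list of (point, label) pairs; x: the point to classify.
        dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
        neighbours = sorted(train, key=lambda pair: dist(pair[0], x))[:k]
        votes = Counter(label for _, label in neighbours)
        return votes.most_common(1)[0][0]

    train = [((1, 1), "yes"), ((1, 2), "yes"), ((6, 6), "no"), ((7, 6), "no"), ((6, 7), "no")]
    print(knn_classify(train, (2, 1), k=3))   # "yes"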

Naive Bayes Classification


This is a direct application of Bayes rule:
  P(C | x) = P(x | C) P(C) / P(x), where x is a vector (x1, x2, ..., xn).
That would be the best classifier you could ever build, and you would not even need to select features: it takes care of that automatically.

But there are problems: there are only a limited number of instances, so how do we estimate P(x | C)?

Assume conditional independence between the xi's. We then have
  P(C | x) proportional to P(x1 | C) ... P(xi | C) ... P(xn | C) P(C)
How good is it in reality? Let's build one NBC for a very simple data set.

Estimate the priors and conditional probabilities from the training data below: P(C=1) = ?, P(C=2) = ?, P(x1=1 | C=1) = ?, P(x1=2 | C=1) = ?, and so on. Then: what is the class for x = (1,2,1)? What is the class for (1,2,2)?

Training data (attributes A1, A2, A3 and class C):
  A1   A2   A3   C
  1    2    1    1
  0    0    1    1
  2    1    2    2
  1    2    1    2
  0    1    2    1
  2    2    2    2
  1    0    1    1

To classify x = (1,2,1), compare
  P(1 | x) proportional to P(x1=1 | 1) P(x2=2 | 1) P(x3=1 | 1) P(1)
against the corresponding product for P(2 | x).
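A short sketch of this naive Bayes computation for the table above (probabilities estimated by simple counting, no smoothing):

    def naive_bayes(data, x):
        # data: list of (attributes, class); x: attribute tuple to classify.
        classes = {c for _, c in data}
        best = None
        for c in classes:
            rows = [a for a, cc in data if cc == c]
            p = len(rows) / len(data)                      # prior P(C=c)
            for i, xi in enumerate(x):                     # product of P(xi | c)
                p *= sum(1 for a in rows if a[i] == xi) / len(rows)
            if best is None or p > best[1]:
                best = (c, p)
        return best

    data = [((1, 2, 1), 1), ((0, 0, 1), 1), ((2, 1, 2), 2), ((1, 2, 1), 2),
            ((0, 1, 2), 1), ((2, 2, 2), 2), ((1, 0, 1), 1)]
    print(naive_bayes(data, (1, 2, 1)))
    print(naive_bayes(data, (1, 2, 2)))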

Classification Methods: Decision Trees
A decision tree is a flow-chart-like tree structure:
  - An internal node denotes a test on an attribute (feature).
  - A branch represents an outcome of the test; all records in a branch have the same value for the tested attribute.
  - A leaf node represents a class label or class label distribution.

Example: is it a good day to play golf? The attributes and their possible values:
  outlook       sunny, overcast, rain
  temperature   cool, mild, hot
  humidity      high, normal
  windy         true, false

Using Decision Trees for Classification
Examples can be classified as follows:
  1. Look at the example's value for the feature specified at the current node.
  2. Move along the edge labeled with this value.
  3. If you reach a leaf, return the label of the leaf; otherwise, repeat from step 1.
Example: a decision tree to decide whether to go on a picnic.

Decision Trees and Decision Rules

Each path in the tree represents a decision rule:
  Rule 1: If (outlook = sunny) AND (humidity <= 0.75) Then (play = yes)
  Rule 2: If (outlook = rainy) AND (wind > 20) Then (play = no)
  Rule 3: If (outlook = overcast) Then (play = yes)

Top-Down Decision Tree Generation
The basic approach usually consists of two phases:
  - Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes.
  - Tree pruning: remove tree branches that may reflect noise in the training data and lead to errors when classifying test data; this improves classification accuracy.

Basic Steps in Decision Tree Construction
The tree starts as a single node representing all the data.

If the samples are all of the same class, then the node becomes a leaf labeled with that class label.
Otherwise, select the feature that best separates the samples into individual classes.

Recursion stops when:
  - the samples in a node belong to the same class (majority), or
  - there are no remaining attributes on which to split.

Tree Construction Algorithm (ID3)
Decision Tree Learning Method (ID3)
Input: a set of examples S, a set of features F, and a target class T (the target class T represents the type of instance we want to classify, e.g., whether to play golf).
  1. If every element of S is already in T, return "yes"; if no element of S is in T, return "no".
  2. Otherwise, choose the best feature f from F (if there are no features remaining, then return failure).
  3. Extend the tree from f by adding a new branch for each value of f.
  4. Distribute the training examples to the leaf nodes (so each leaf node S is now the set of examples at that node, and F is the remaining set of features not yet selected).
  5. Repeat steps 1-5 for each leaf node.
Main question: how do we choose the best feature at each step?
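A sketch of the usual "best feature" criterion: information gain based on entropy (the criterion ID3 is generally described as using). The data format assumed here, a list of (feature-dict, label) pairs, is an illustrative choice:

    import math
    from collections import Counter

    def entropy(labels):
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def information_gain(examples, feature):
        # examples: list of (dict of feature -> value, label) pairs.
        labels = [y for _, y in examples]
        base = entropy(labels)
        remainder = 0.0
        for v in {x[feature] for x, _ in examples}:
            subset = [y for x, y in examples if x[feature] == v]
            remainder += len(subset) / len(examples) * entropy(subset)
        return base - remainder   # choose the feature with the largest gain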
Dealing With Continuous Variables
Partition a continuous attribute into a discrete set of intervals:
  - sort the examples according to the continuous attribute A;
  - identify adjacent examples that differ in their target classification;
  - generate a set of candidate thresholds midway between them.
  Problem: this may generate too many intervals.
Another solution: take a minimum threshold M of the examples of the majority class in each adjacent partition, then merge adjacent partitions with the same majority class.

Over-fitting in Classification
A generated tree may over-fit the training examples due to noise or too small a set of training data.
Two approaches to avoid over-fitting:
  - Stop earlier: stop growing the tree earlier.
  - Post-prune: allow over-fitting and then post-prune the tree.
Approaches to determine the correct final tree size:
  - Use separate training and testing sets, or use cross-validation.
  - Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve performance over the entire distribution.
  - Use the Minimum Description Length (MDL) principle: halt growth of the tree when the encoding is minimized.
  - Rule post-pruning (C4.5): convert to rules before pruning.

Pruning the Decision Tree
A decision tree constructed using the training data may need to be pruned: over-fitting may result in branches or leaves based on too few examples. Pruning is the process of removing branches and subtrees that are generated due to noise; this improves classification accuracy.
Subtree replacement: merge a subtree into a leaf node. Using a set of data different from the training data, at a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node and label it using the majority class.

Bayesian Classification
A statistical classifier based on Bayes theorem. It uses probabilistic learning by calculating explicit probabilities for hypotheses. A naive Bayesian classifier, which assumes total independence between attributes, is commonly used and performs well with large data sets. The model is incremental in the sense that each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
Given a data sample X with an unknown class label, let H be the hypothesis that X belongs to a specific class C. The conditional probability of hypothesis H given X, Pr(H | X), follows Bayes theorem:
  Pr(H | X) = Pr(X | H) Pr(H) / Pr(X)
Practical difficulty: this requires initial knowledge of many probabilities and has significant computational cost.

Naive Bayesian Classifier
Suppose we have m classes C1, C2, ..., Cm. Given an unknown sample X = (x1, x2, ..., xn), the classifier predicts that X belongs to the class with the highest conditional probability:
  X is assigned to Ci if Pr(Ci | X) > Pr(Cj | X) for all 1 <= j <= m, j != i.
Since Pr(Ci | X) = Pr(X | Ci) Pr(Ci) / Pr(X) and Pr(X) is the same for every class, maximizing Pr(X | Ci) Pr(Ci) / Pr(X) is equivalent to maximizing Pr(X | Ci) Pr(Ci).

Note that Pr(Ci) = si / s, where si is the number of training samples of class Ci and s is the total number of training samples. Assuming class-conditional independence,
  Pr(X | Ci) = product for k = 1 to n of Pr(xk | Ci), where Pr(xk | Ci) = sik / si
and sik is the number of training samples of class Ci having value xk for attribute Ak. This greatly reduces the computation cost: only the class distributions need to be counted. "Naive" refers to the assumption of class-conditional independence.

Naive Bayesian Classifier Example
Given a training set, we can compute the probabilities.

X = <sunny, mild, high, true> Pr(X | no).Pr(no) = (3/5 . 2/5 . 4/5 . 3/5) . 5/14 = 0.04 Pr(X | yes).Pr(yes) = (2/9 . 4/9 . 3/9 . 3/9) . 9/14 = 0.007 K-Means Algorithm The basic algorithm (based on reallocation method):

1. select K data points as the initial representatives 2. for i = 1 to N, assign item xi to the most similar centroid (this gives K clusters) 3. for j = 1 to K, recalculate the cluster centroid Cj 4. repeat steps 2 and 3 until these is (little or) no change in clusters Example: Clustering Terms Initial (arbitrary) assignment: C1 = {T1,T2}, C2 = {T3,T4}, C3 = {T5,T6}

Now using simple similarity measure, compute the new cluster-term similarity matrix
T1 T2 T3 T4 T5 T6 T7 T8 Class1 29/2 29/2 24/2 27/2 17/2 32/2 15/2 24/2 Class2 31/2 20/2 38/2 45/2 12/2 34/2 6/2 17/2 Class3 28/2 21/2 22/2 24/2 17/2 30/2 11/2 19/2 Assign to Class2 Class1 Class2 Class2 Class3 Class2 Class1 Class1

Now compute new cluster centroids using the original document-term matrix
T1 T2 T3 T4 T5 T6 T7 T8 C1 C2 C3 Doc1 0 4 0 0 0 2 1 3 8/3 2/4 0/1 Doc2 3 1 4 3 1 2 0 1 2/3 12/4 1/1 Doc3 3 0 0 0 3 0 3 0 3/3 3/4 3/1 Doc4 0 1 0 3 0 0 2 0 3/3 3/4 0/1 The process is repeated until no1further changes are4/3 11/4the1/1 made to clusters Doc5 2 2 2 3 4 0 2
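A compact k-means sketch following the reallocation steps above, using Euclidean distance on numeric vectors; the initial centroids are simply the first k points and the data is invented:

    def kmeans(points, k, iters=100):
        centroids = points[:k]                      # step 1: initial representatives
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:                        # step 2: assign to the closest centroid
                j = min(range(k), key=lambda c: sum((pi - ci) ** 2 for pi, ci in zip(p, centroids[c])))
                clusters[j].append(p)
            new_centroids = [
                tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[j]
                for j, c in enumerate(clusters)     # step 3: recompute centroids
            ]
            if new_centroids == centroids:          # step 4: stop when nothing changes
                break
            centroids = new_centroids
        return clusters, centroids

    pts = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)]
    print(kmeans(pts, k=2))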

K-Means Algorithm
Strengths of k-means:
  - Relatively efficient: O(tkn), where n is the number of objects, k is the number of clusters, and t is the number of iterations; normally k, t << n.
  - Often terminates at a local optimum.
Weaknesses of k-means:
  - Applicable only when a mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance.
  - Unable to handle noisy data and outliers.

Variations of k-means usually differ in:
  - selection of the initial k means;
  - dissimilarity calculations;
  - strategies to calculate cluster means.

Hierarchical Algorithms
Use a distance matrix as the clustering criterion. They do not require the number of clusters k as an input, but need a termination condition.

Hierarchical Agglomerative Clustering (HAC)
HAC starts with unclustered data and performs successive pairwise joins among items (or previous clusters) to form larger ones. This results in a hierarchy of clusters which can be viewed as a dendrogram, useful in pruning search in a clustered item set or in browsing clustering results.
Some commonly used HAC methods:
  - Single link: at each step join the most similar pair of objects that are not yet in the same cluster.
  - Complete link: use the least similar pair between each cluster pair to determine inter-cluster similarity; all items within one cluster are linked to each other within a similarity threshold.
  - Ward's method: at each step join the cluster pair whose merger minimizes the increase in the total within-group error sum of squares (based on the distance between centroids); also called the minimum variance method.
  - Group average (mean): use the average value of pairwise links within a cluster to determine inter-cluster similarity (i.e., all objects contribute to inter-cluster similarity).
A dendrogram depicts the resulting hierarchy of clusters.

UNIT 4

1.1 CHARACTERISTICS OF A DATA WAREHOUSE


A data warehouse is a repository of subjectively selected and adapted operational data, which can successfully answer any ad hoc, complex, statistical or analytical queries. It is situated at the centre of a decision support system (DSS) of an organization (or corporation) and contains integrated historical data, summarized and detailed information common to the entire organization (Hammergren, 1997). Data warehousing technology is becoming essential for effective business intelligence and business strategy formulation and implementation in a globally competitive environment, wherein larger and larger amounts of data (doubling every 18 months) are required to be processed faster and faster (in seconds) for comprehension of their real meaning and impact. Data warehousing enables easy organization and maintenance of large data in addition to fast retrieval and analysis in the manner and depth required from time to time.

Data can be classified into three categories: reference and transactional data, derived data, and denormalized data. Reference and transaction data originate in an operational (source) system and are normally kept in a conventional database. At a predetermined periodicity, they will be loaded for refreshing the warehouse. If such data is purged from the source, then it is archived into the warehouse. But once put in the data warehouse, it cannot be modified. Derived data, on the other hand, is only derived from reference and transaction data on the basis of certain rules or computations; it can always be derived again by following the same derivation. Denormalized data, which is the basis for online analytical processing (OLAP) tools, is prepared periodically and is built directly from detailed reference (transaction) data.

1.2 DATA MARTS


From a data warehouse, data flows to various departments for their customized DSS usage. These individual departmental components are called data marts. In other words, a data mart is a body of DSS data for a department that has an architectural foundation of a data warehouse. A data mart is a subset of a data warehouse and is much more popular than the data warehouse. What is the reason: why does every department or part of an organization prefer to have a data mart instead of becoming part of the central data warehouse?

As a central data warehouse keeps growing, it becomes more and more complex or even possibly unwieldy. Gradually, the data becomes harder to customize. When the data warehouse is small, the analyst (of DSS) can easily customize, summarize and analyse the data to draw analytical results relatively easily. But he can do that no more once the data becomes very large, as it calls for a large amount of time, which he may not be able to afford. The cost of processing the data also increases as the volume of data increases. Further, the software that is available to access or analyse a large quantity of data may not be as easy or elegant as the software that can process small amounts of data, as in the data mart. As a result, the analyst finds the data mart extremely useful for fast and easy analysis, as do the users of the results of such analysis. Since the data flows into the data mart (from the data warehouse), the department which owns it can easily customize the data. It can easily summarize, sort, select or structure the data without any global considerations or any necessity to meet the requirements of another department. Similarly, in the case of historical data, a department can choose to limit the historical data to its own department. The processing load or overhead is also very limited in this case, whereas the resource requirements are very restricted. Therefore, data marts are very much preferred by both users and analysts. There are other organizational, technological and economic reasons as well why the data mart is so attractive. It is only because a data mart is a natural outgrowth of a data warehouse.

The source of data that flows into the data mart is of current level detail. The detailed data is customized, sorted or summarized as it is placed in the data mart. In addition, the data mart can be fed by data from external sources. It is important to note that all the issues related to a data mart (as discussed below) are equally relevant for a data warehouse, since basically a data warehouse is only a collection of data marts.

1.2.1 Types of Data Marts


The data marts can be classified into two groups: multidimensional OLAP (MDDB OLAP or MOLAP) and relational OLAP (ROLAP). In a multidimensional (MDDB) data mart, the numeric data, which is basically multidimensional in its original nature, can be sliced and diced in a free fashion, i.e. free from the data modelling constraints of the DBMS (as RDBMS). As an example, let us consider the data of prices of commodities. This data may be stored just as basic tables in an RDBMS, even though the data inherently may be multidimensional, i.e. commodity-wise, year-wise, region-wise, etc. The multidimensionality of the data is lost or not considered at all in an RDBMS, which handles it in terms of tabular form of data. On the other hand, in an MDDB situation, the data can be so viewed as to fully harness the natural multidimensionality, and queries can be asked based on this multidimensionality of the data. Thus, the query processing and analysis will be much more powerful than what could be achieved in a conventional RDBMS environment. Only specialized DBMSs or engines can therefore support the MDDB model.

On the other hand, ROLAP or relational OLAP data marts may contain both text and numeric data and are supported by an RDBMS. These data marts are used for general-purpose DSS analysis with many indices which support a star schema. ROLAP data marts can be analysed either on an ad hoc basis (ad hoc queries) or on preconceived queries. The processing in ROLAP is predictable in terms of summary and detailed data which will be available in the ROLAP data marts (Inmon, 1996). A hybrid approach is also possible.

1.2.2 Loading a Data Mart


The data mart is loaded with data from a data warehouse by means of a load program. The chief considerations for a load program are: frequency and schedule; total or partial refreshment (or addition); customization of data from the warehouse (selection); re-sequencing and merging of data; aggregation of data; summarization; efficiency of loading (loading time optimization); integrity of data; data relationships and integrity of data domains; and producing metadata describing the loading process.

1.2.3 Metadata for a Data Mart


Metadata (or data about data) describes the details about the data in a data warehouse or in a data mart. Such description may be in terms of the source of data that flows into the data warehouse or data mart. Following are the contents and components of metadata for a given data warehouse or data mart:
  1. Description of the sources of the data.
  2. Description of the customization that may have taken place as the data passes from the data warehouse into the data mart.
  3. Descriptive information about the data mart, its tables, attributes and relationships, etc.
  4. Definitions of all types.
The metadata of a data mart is created and updated from the load programs that move data from the data warehouse to the data mart. The linkages and relationships between the metadata of a data warehouse and the metadata of a data mart have to be well established or well understood by the manager or analyst using the metadata. This description is essential to establish drill-down capability between the two environments. With this linkage, the manager using data mart metadata can easily find the heritage of the data in the data warehouse. Further, the DSS analyst also needs to understand how calculations are made and how data is selected for the data mart environment. The metadata pertaining to the individual data mart should also be available to the end-user for effective usage.

1.2.4 Data Model for a Data Mart


A formal data model is required to be built for a large data mart which may have some processing involved; this processing can be repetitive or predictable. No data model is necessary for ordinary or simple small data marts. Such a data model should be compatible with the DBMS which hosts the data mart. For example, in the case of a multidimensional DBMS (MDDB), no separate formal data model need be built, as the multidimensional data model is a necessity for that DBMS. Data marts which do not have to be compliant with a particular DBMS can be modelled in terms of a formal data model which will take care of both summary and detailed levels of data in the particular context of application for a given department.

1.2.5 Maintenance of a Data Mart


Periodic maintenance of a data mart means loading, refreshing and purging of the data in it. Refreshing the data is performed in regular cycles as per the nature and frequency of data update. For example, rainfall data in a data mart may be refreshed (added) on a daily basis, whereas daily prices of commodities may be refreshed on a weekly basis; similarly, census data may be refreshed on a decennial basis. Thus, refreshing may be daily, weekly, monthly, quarterly, yearly or decennial, depending on the particular nature and frequency of the data availability.

As regards purging of data in a data mart, the data mart is read periodically and some (old) data is selected for purging or removal. The data to be removed may be purged, archived or condensed. The criteria for purging depend on date, time and periodicity, or on any other criterion decided by the application requirements.

1.2.6 Nature of Data in a Data Mart


The data in a data mart can be of detailed level, summary level, ad hoc data, or preprocessed or prepared data. As a rule, the bulk of a largely populated data mart contains a lot of ad hoc summary (aggregation) data and also a lot of prepared detailed data.

1.2.7 Software Components for a Data Mart


The software that can be found with a data mart includes: the DBMS, access and analysis software, software for automatic creation of the data mart, purging software, archival software, and metadata management software, etc. The access and analysis software, which is the most unique for a data warehousing application, allows the end-user to search through the data mart to find and compute or derive the data required by the user. This software also performs an elegant or impressive presentation of the derived data to the user. The DBMS for a data mart can be an RDBMS or a multidimensional DBMS (MDDB).

1.2.8 Tables in the Data Mart


A data mart may include: summary tables, detailed tables, reference tables, historical tables, analytical (spreadsheet) tables, etc. As the data grows, the backlog back-up may be kept as a 'library', which can also serve the purpose of a quick reference for a survey of available data, as and when required. The data is kept in the tables in the form of star joins or normalized tables. Star joins are required to be created when a predictable pattern of usage is expected over a significant amount of data. Otherwise, when the data is not large, relational tables are adequate.

1.3 OTHER ASPECTS OF DATA MART

1.3.1 External Data


Let us examine some of the issues relating to the usage of external data in a data mart. First, the external data, if required to be used by more than one data mart, is best placed in the data warehouse itself and then subsequently moved to the data marts as required. This will avoid redundancy and duplication in the procurement of external data (otherwise, several data marts may procure the same data). Secondly, when a particular external data set is acquired, its 'pedigree' or descriptive details, in addition to the data itself, are also required to be stored. These details include: the source of the external data; the size and date of acquisition of the external data; and the edits and filtering criteria applied to the external data; etc.

1.3.2 Reference Data


The reference data tables, stored in addition to the basic data in the data mart, help and enable the end-users of the data mart to relate the data to its expanded version. In doing so, the end user can operate on the data in 'short hand', if desired. The reference data is typically copied over from the data warehouse, although it is rarely possible that the data mart itself will store and manage its own reference tables. When reference tables are stored and managed in a DSS environment, there is a need to manage their contents over time. This time management is a complex task which may be best done by the data warehouse itself (instead of the data mart).

1.3.3 Performance Issues


The performance considerations in a DSS environment are very much different from those in an OLAP environment. The response time issues are totally different: there is no need for real-time or online response as is required in the case of OLTP systems. Especially in the context of data warehousing, the performance issues are relaxed, as there is an abundance of data to be dealt with along with a lot of exploration of data, in which the response time requirements are relaxable, ranging from 1 minute to 24 hours. It is not the same situation as in a data mart environment, where there is very limited data with a clear definition of requirements; there, reasonable performance objectives may be set and attained. The expectations of performance in a multidimensional (MDDB) environment are again different: there can be good performance from an MDDB if it is not overloaded. To conclude, good performance can be achieved in a data mart environment by: extensive use of indexes, using star joins, limiting the volume of the data, creating arrays of data, creating profile records (i.e. aggregate records) and creating prejoined tables, etc. The star join is one way performance requirements may be achieved.

1.3.4 Monitoring Requirements for a Data Mart


Periodic monitoring of data mart behaviour is required. Monitoring mainly relates to data usage tracking and data content tracking. Data usage tracking asks queries about the data mart such as: What data is being accessed? Which users are active? What is the quantity of data accessed? What are the usage timings? What is the best way of accessing? Data content tracking includes queries such as: What are the actual contents of the data mart? Is there any bad data in the data mart? How, and by how much, is the data mart growing? Monitoring becomes essential and important only when the size of the data in the data mart grows significantly.

1.3.5 Security in a Data Mart


When there is secretive information in the data mart, it needs to be secured well. Typically, secretive information includes financial information, medical records, human resources information, etc. The data mart administrator should make the necessary security arrangements, such as: firewalls; log on/off security; application-based security; DBMS security ('view'-based security); encryption and decryption. The cost of securing the information depends upon its exclusiveness.

CONCLUSION
A data mart is a powerful and natural extension of a data warehouse to a specific functional or departmental usage. The data warehouse provides granular data, and the various data marts interpret and structure the granular data to suit their needs. Data marts can be of two types, MDDB and ROLAP, each with different characteristics. The data model for a data mart can be the same as that of the DBMS or can be different, as required. Metadata is an integral part of a data mart environment: it enables different data marts to achieve a degree of cohesiveness, and it also allows the end-user to effectively access the data in a data mart. The software for a data mart includes: a DBMS, access and analysis software, metadata management software, system management software, purge and archival software, etc. The data structures formed in a data mart include star joins and normalized data as per the relevant data model.

Online Analytical Processing

2.1 INTRODUCTION


Online Analytical Processing (OLAP) systems, contrary to the regular, conventional online transaction processing (OLTP) systems, are capable of analyzing online a large number of past transactions or a large number of data records (ranging from megabytes to gigabytes and terabytes) and summarizing them on the fly. This type of data is usually multidimensional in nature, and this multidimensionality is the key driver for OLAP technology, which happens to be central to data warehousing. Multidimensional data, such as a spreadsheet, cannot be processed by a conventional SQL-type DBMS. For a complex real-world problem, the data is usually multidimensional in nature. Even though one can manage to put such data in a conventional relational database in normalized tables, the semantics of multidimensionality will be lost, and any processing of such data in conventional SQL will not be able to handle it effectively. As such, a multidimensional query on such a database will explode into a large number of complex SQL statements, each of which may involve full table scans, multiple joins, aggregation, sorting and also large temporary table space for storing temporary results. Finally, the end result may consume large computing resources in terms of disk space, memory and CPU time, which may not be available, and even if they are available the query may take a very long time. For example, a conventional DBMS may not be able to handle three months' moving average or net present value calculations; these situations call for extensions to ANSI SQL, a near non-feasible requirement.

In addition to response time and other resources, OLAP is a continuously iterative process and preferably an interactive one. Drilling down from summary aggregate levels to lower level details may be required on an ad hoc basis by the user. Such drilling down may lead the user to detect certain patterns in the data, and the user may then put forward yet another OLAP query based on these patterns. This process makes it impossible to handle or tackle for a conventional database system.

2.2 OLTP AND OLAP SYSTEMS


Conventional OLTP database applications are developed to meet the day-to-day operational data retrieval needs of the entire user community. They provide concurrent and online update, insert and delete transactions in addition to retrieval queries and other procedures. On the other hand, data warehousing systems based on OLAP are developed to meet the information exploration, historical trend analysis and reporting requirements of the management or executive user communities. The conventional regular database transactions, or OLTP transactions, are short, high-volume transactions, and the source systems (which are conventional OLTP database systems) are more efficient in processing them. OLAP transactions are long database transactions (infrequent or occasional updates or refreshing of the data, usually in batch mode) and queries may be ad hoc, online or pre-planned; an OLAP system (or warehouse) is more efficient in processing a number of ad hoc queries. Information in a data warehouse frequently comes from different operational source systems and is interpreted, filtered, wrapped, summarized and organized in an integrated manner, making it more suitable for trend analysis and decision support data retrieval. Table 2.1 shows a comparison between OLTP and OLAP systems.

Table 2.1 Comparison between OLTP and OLAP


OLTP:
  1. Only current data available (old data is replaced by current data by updating)
  2. Short transactions (single granularity or more)
  3. Online update/insert/delete transactions
  4. High volume of transactions in a given period
  5. Concurrency control and transaction recovery required
  6. Largely online ad hoc queries, requiring a low level of indexing

OLAP:
  1. Both current and historic data available (current data is appended to historic data)
  2. Long database transactions
  3. Batch update/insert/delete transactions
  4. Low volume of transactions, periodic refreshing
  5. No concurrent transactions and therefore no recovery upon failure
  6. Largely pre-determined queries, requiring a high level of indexing

A very popular and early approach for achieving analytical processing is the 'star schema' or 'collection model'. This approach is based on the common denomination of the user requirements. In this model, a small number of user-referenced or joint data sets are generated from the detailed data sets. This involves denormalization of data followed by integration, as shown in the three-tier architecture.

As shown in this architecture, the data extracted from individual data sets is accumulated into a data warehouse after transformation. For example, let us consider the conventional relational database of Parts, Customers, Orders and Regions. All these tables are required to be denormalized and stored in a star or collection table that is more in line with the user's understanding of the data.

2.3 DATA MODELLING - STAR SCHEMA


Conventional database modelling (as in semantic and object-oriented database systems) deals with modelling database entities as objects, object hierarchies, etc., in addition to behavioural modelling in terms of methods (or procedures) of objects. In the data warehousing environment, the requirements of data modelling are quite different, as the users' expectations from the integrated data warehouse have to be considered. Since the data warehouse comprises a central repository of data collected from divergent sources, the integrated user's view of the data warehouse will form the basis for any data modelling of the data warehouse. In other words, the data collected from different source systems and integrated together in the data warehouse cannot be used unless it is modelled and organized based on the end-users' perspective of the data. The better the understanding of the user's perspective, the greater will be the effectiveness of the data warehousing application. Therefore, the outcome of the understanding of the users' requirements and perspectives should be thoroughly captured in the data model for the data warehouse. In this context the intuitive perceptions of the users of the relationships of individual data entities in the data warehouse have to be understood by the data model designer for the data warehouse. Such a modelling exercise may lead to a 'star schema', wherein the individual tables or data sets (beaming in from different sources and converging into a data warehouse) will be denormalized and integrated so as to be presented as per the end-users' requirements. The star schema so developed will effectively handle the data navigation difficulties and performance issues of highly normalized data models.

In this context the users' perception of data in terms of a 'multidimensional' view has to be well understood. 'Dimension' refers to those categories by which the analyst would like to organize, aggregate or view the data. For example, if we take data on the prices of commodities (see Fig. 2.3), the users may like to view it from the angle of grouping by the time dimension (day, week, month, year) or by the region dimension (town, district, state), each of which is a different dimension. Similarly, in the sales analysis application (see Fig. 2.4), the multiple dimensions of sales information will be in terms of market requirements, periods, products and regions. In an RDBMS environment, the 'fact' data and 'dimension' data can be described as separate tables. The star schema provides a multidimensional view. Thus, the chief model for a multidimensional data warehouse will be the star schema. Business data analysis or, for that matter, data analysis of any application will require analysis in multiple dimensions; this is not possible with conventional relational RDBMS or SQL queries (it may be possible with great difficulty, inefficiency and redundancy in certain limited cases).

From the data organization angle or the DBA's perspective, a star schema is a schema organized around a central table called the fact table, which contains raw numeric items that represent relevant business facts. These facts are additive and are accessible via dimensions. Since the fact tables are presummarized and aggregated along business dimensions, they tend to be very large. The smaller tables are the dimensional tables for the dimensions of data, as are already familiar to the users of the fact data. These dimensional tables contain a noncompound primary key and are heavily indexed. Dimensional tables represent the majority of data elements. These tables are joined to the fact tables using foreign key references. After the star schema is created and loaded, the dimensional queries (described earlier) can easily be answered. Such multidimensional processing can be made for a variety of dimensions: time periods, regions (geographic or demographic), product ranges, etc. Such multidimensional data as represented by a star schema will be very common in all possible industrial sectors such as banking, sales, finance, insurance, manufacturing and also in government. We shall be surveying later the case studies for some sectors.

Let us examine the performance issues of the star schema either in a conventional RDBMS environment or in other environments. It may be noted that conventional RDBMS design methodology is oriented for OLTP applications and not for OLAP requirements. Therefore, using multidimensional OLAP, a star schema will not be efficient in an RDBMS. In other words, if the star schema is hosted on a conventional RDBMS, the performance will be poor. A star schema is defined only for denormalized data, i.e. independent of the relational model. For object-oriented data models, the star schema definition may be required to be made on a novel basis, and this is an open area for research. Such research can attempt to effectively capture the semantics of object-oriented and semantic database models into the data warehousing domain.

2.4 DATA MODELLING - MULTIFACT STAR SCHEMA OR SNOWFLAKE SCHEMA


As a data warehouse grows in complexity, the diversity of subjects covered grows, which adds more perspectives to access the data. In such situations, the star schema will be inadequate. It can be enhanced by adding additional dimensions, which increases the scope of attributes in the star schema fact table. But even larger growth leads to further breakdown of the star schema due to its over-complexity and size. Thus, a better technique called the multifact star schema or snowflake schema can be utilized. The goal of this schema is to provide aggregation at different levels of hierarchies in a given dimension. This goal is achieved by normalizing those hierarchical dimensions into more detailed data sets to facilitate the aggregation of fact data. It is possible to model the data warehouse into separate groups, where each group addresses the specific performance and trend analysis objective of a specific user. Each group of fact data can be modelled using a separate star schema.

2.5 OLAP TOOLS

2.5.1 Categories of OLAP Tools


As discussed in Chapter 1, OLAP tools can be broadly classified into two categories: MOLAP tools and ROLAP tools. MOLAP tools presuppose the data to be present in a multidimensional database (MDDB). In other words, data which has a basically multidimensional nature, if loaded into a multidimensional database, can be utilized by MOLAP tools for analysis. On the other hand, a typical relational database application (without MDDB) can be processed by ROLAP (or relational OLAP) tools. However, there also exist hybrid approaches which integrate both MOLAP and ROLAP techniques; they are usually called multirelational database systems. All these tools basically implement the 'star schema' or 'snowflake schema', already discussed (Mattison, 1996). Applications, as already discussed, are wide in range, both in business and government. In business, typical applications range from sales analysis and marketing campaigning to sales forecasting and capacity planning. Similarly, in government applications, we can cite examples such as commodity price monitoring, analysis and forecasting, plan formulation, analysis and forecasting, and agricultural production analysis and forecasting based on rainfall analysis, etc. The scope of application of these tools both in government and business is almost unlimited, and the case studies presented at the end illustrate this range of applications. The spread of the market between the MOLAP and ROLAP sectors is as shown in Fig. 2.5.

MOLAP
MOLAP-based products organize, navigate and analyse data typically in an aggregated form. They require tight coupling with the applications and they depend upon a multidimensional database (MDDB) system. Efficient implementations store the data in a way similar to the form in which it is utilized, by using improved storage techniques so as to minimize storage. Many efficient techniques are used for sparse data storage management on disk so as to improve the response time. Some OLAP tools, such as Pilot Software's products (Analysis Server), introduce 'time' also as an additional dimension for analysis, thereby enabling strong time series analysis. Some products, such as Oracle Express Server, introduce analytical capabilities into the database itself. Applications requiring iterative and comprehensive time series analysis of trends are well suited for MOLAP technology (e.g. financial analysis and budgeting). Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship Server, Sinper's TM/1, Planning Science's Gentium and Kenan Technology's Multiway.

Some of the problems faced by users are related to maintaining support to multiple subject areas in an RDBMS. As shown in Fig. 2.6, these problems can be solved by some vendors by maintaining access from MOLAP tools to detailed data in an RDBMS. This can be very useful for organizations with performance-sensitive multidimensional analysis requirements that have built, or are in the process of building, a data warehouse architecture that contains multiple subject areas. An example would be the creation of sales data measured by several dimensions (e.g. product and sales region) to be stored and maintained in a persistent structure. This structure would be provided to reduce the application overhead of performing calculations and building aggregations during application initialization. These structures can be automatically refreshed at predetermined intervals established by an administrator.
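A minimal sketch of such a persistent, periodically refreshed aggregate structure is given below; the file name, the refresh routine and the column names are purely illustrative assumptions, and a real MOLAP server would of course use its own storage format rather than a flat file.

import pandas as pd

def refresh_sales_cube(detail: pd.DataFrame, path: str = "sales_cube.csv") -> pd.DataFrame:
    """Rebuild the product x region aggregate and persist it to disk.
    In a real deployment this would be scheduled by an administrator
    (e.g. nightly) rather than run by the application itself."""
    cube = detail.groupby(["product", "region"], as_index=False)["amount"].sum()
    cube.to_csv(path, index=False)      # the persistent, pre-aggregated structure
    return cube

def load_sales_cube(path: str = "sales_cube.csv") -> pd.DataFrame:
    """Applications read the prebuilt aggregate instead of recomputing it."""
    return pd.read_csv(path)

detail = pd.DataFrame({"product": ["pen", "pen", "pad"],
                       "region":  ["north", "south", "north"],
                       "amount":  [100, 200, 50]})
refresh_sales_cube(detail)              # run at the predetermined refresh interval
print(load_sales_cube())                # fast access at application start-up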

ROLAP

Relational OLAP (or ROLAP) is the latest and fastest growing OLAP technology segment in the market. Many vendors have entered the fray in this direction (e.g. Sagent Technology and Microstrategy). By supporting a dictionary layer of metadata, the RDBMS products have bypassed any requirement for creating a static, multidimensional data structure (as was required in the case of MOLAP). This approach enables multiple multidimensional views of two-dimensional relational tables to be created, avoiding structuring the data around the desired view. Some products in this segment have supported strong SQL engines to support the complexity of multidimensional analysis. This includes creating multiple SQL statements to handle user requests, being 'RDBMS aware' and also being capable of generating the SQL statements based on the optimizer of the DBMS engine. While flexibility is the attractive feature of ROLAP, there exist products which require the use of denormalized database designs (such as the star schema). However, of late there is a noticeable change or realignment in ROLAP technology. Firstly, there is a shift towards pure middleware technology so as to simplify the development of multidimensional applications. Secondly, the sharp delineation between ROLAP and other approaches such as hybrid-OLAP is fast disappearing. Thus vendors of ROLAP tools and RDBMS products are now eager to provide multidimensional persistent structures, with facilities to assist in the administration of these structures. Notable among vendors of such products are Microstrategy (DSSAgent/DSS Server), Platinum/Prodea Software (Beacon), Information Advantage (AxSys), Stanford Technology Group (Metacube), Informix and SyBASE (HighGate Project). (Of late, Informix has been acquired by IBM.)
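Returning to the core ROLAP idea described above, a toy illustration of turning a requested multidimensional view into SQL against relational tables is sketched below. The schema (a sales fact with region and quarter dimensions) and the generated statement are hypothetical; commercial ROLAP engines generate far more sophisticated, optimizer-aware SQL.

def rolap_query(measures, dimensions, fact="sales_fact", filters=None):
    """Build a GROUP BY statement for the requested multidimensional view."""
    select_cols = ", ".join(dimensions + [f"SUM({m}) AS {m}" for m in measures])
    sql = f"SELECT {select_cols} FROM {fact}"
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    sql += " GROUP BY " + ", ".join(dimensions)
    return sql

# Example: a 'sales by region and quarter' view for one year.
print(rolap_query(measures=["amount"],
                  dimensions=["region", "quarter"],
                  filters=["year = 1998"]))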

2.5.2 Managed Query Environment (MQE)


A recent trend in OLAP is to enable the capability for users to perform limited analysis directly against RDBMS products, or by bringing in an intermediate, limited MOLAP server. Some vendors' products (e.g. Andyne's Pablo) have been able to provide ad hoc query, 'data cube' and 'slice and dice' analysis capabilities. This is achieved by first developing a query to select data from the DBMS, which then delivers the requested data to the desktop system where it is placed into a data cube. This data cube can be locally stored on the desktop and also manipulated there, so as to reduce the overhead required to create the structure each time the query is executed. Once the data is placed in the data cube, the user can locally perform multidimensional analysis as well as slice, dice and pivot operations on it. In another approach, these tools can work with MOLAP servers; the data from the RDBMS can first go to the MOLAP server and then to the desktop. The ease of operation, administration and installation of such products makes them particularly attractive to users who are familiar with a simple RDBMS usage and environment. This approach provides sophisticated analysis capabilities to such users without the significant costs involved in other, more complex products in the market. With all the ease of installation and administration that accompanies desktop OLAP products, most of these tools require the data cube to be built and maintained on the desktop or a separate server, along with metadata definitions that assist users in retrieving the correct set of data that makes up the data cube. This method causes data redundancy and strain to most network infrastructures that support many users. Although this mechanism allows each user the flexibility to build a customized data cube, the lack of data consistency among users and the relatively small amount of data that can be efficiently maintained are significant challenges facing all administrators of these tools.
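A minimal sketch of the desktop data-cube idea, here approximated with a pandas pivot table, follows. The query result and column names are hypothetical; the point is only that, once the selected data is held locally, slice, dice and pivot operations become cheap, repeated local operations.

import pandas as pd

# Pretend this is the result set delivered by the initial DBMS query.
result = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "product": ["pen", "pad", "pen", "pad"],
    "amount":  [100, 80, 120, 60],
})

# Build the local 'data cube' once on the desktop.
cube = result.pivot_table(index="region", columns="product",
                          values="amount", aggfunc="sum", fill_value=0)

print(cube)               # full cube
print(cube.loc["north"])  # slice: one region
print(cube.T)             # pivot: swap rows and columns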

2.6 STATE OF THE MARKET

2.6.1 Overview of the State of the International Market
Basically, OLAP tools provide a more intuitive and analytical way to view corporate organizational data. These tools aggregate data along common business subjects or dimensions and then let users navigate through the hierarchies and dimensions with the click of a mouse. Users can drill down, across, or up levels in each dimension, or pivot and swap out dimensions to change their view of the data. All this can be achieved by various OLAP tools in the international market. Such free manipulation of data provides insight into the data, usually not possible by any other means (Kimball, 1996).

Some tools, such as Arbor Software Corp.'s Essbase and Oracle's Express, preaggregate data into special multidimensional databases. Other tools work directly against relational data and aggregate data on the fly, such as Microstrategy Inc.'s DSS Agent or Information Advantage Inc.'s DecisionSuite. Some tools process OLAP data on the desktop instead of a server. Desktop OLAP tools include Cognos' PowerPlay, Brio Technology Inc.'s Brioquery, Planning Sciences Inc.'s Gentium and Andyne's Pablo. Many of the differences between OLAP tools are fading. Vendors are re-architecting their products to give users greater control over the trade-off between flexibility and performance that is inherent in OLAP tools. Many vendors have rewritten their products in Java.

Eventually, conventional database vendors may become the largest OLAP providers. The leading database vendors have incorporated OLAP functionality in their database kernels. Oracle, Informix and Microsoft have taken steps toward this end by acquiring OLAP vendors (IRI Software, Stanford Technology Group and Panorama, respectively). As a result, the OLAP capabilities of their respective DBMS products vary in their scope. (Subsequently, IBM acquired Informix along with all its products.) Red Brick System's Red Brick Ware also provides tools for multidimensional analysis of corporate data. Cognos' PowerPlay can be characterized as an MQE tool that can leverage corporate investment in relational database technology to provide multidimensional access to enterprise data. PowerPlay also provides robustness, scalability and administrative control.

2.6.2 Cognos PowerPlay


Cognos PowerPlay is an open OLAP solution that can interface and interoperate with a wide variety of third-party software tools, databases and applications. The data used by PowerPlay is stored in multidimensional data sets called analytical PowerCubes.

Cognos' client/server architecture allows for the PowerCubes to be stored on the Cognos universal client or on a server. PowerPlay offers a single universal client for OLAP servers that supports PowerCubes located locally, on the LAN, or (optionally) inside popular relational databases. In addition to the fast installation and deployment capabilities, PowerPlay provides a high level of usability with a familiar Windows interface, high performance, scalability, and relatively low cost of ownership. Specifically, starting with version 5, the Cognos PowerPlay client offers:

Support for enterprise-size data sets (PowerCubes) of 20+ million records, 100,000 categories and 100 measures
A drill-through capability for queries from Cognos Impromptu
Powerful 3-D charting capabilities with background and rotation control for advanced users
Scatter charts that let users show data across two measures, allowing easy comparison of budget to actual values
Linked displays that give users multiple views of the same data in a report
Full support for OLE2 Automation, as both a client and a server
Formatting features for financial reports: brackets for negative numbers, single and double underlining, and reverse sign for expenses
Faster and easier ranking of data
A 'home' button that automatically resets the dimension line to the top level
Unlimited undo levels and customizable toolbars
An enhanced PowerPlay portfolio that lets users build graphical, interactive, EIS-type briefing books from PowerPlay reports, Impromptu reports, word processing, spreadsheet, or presentation documents, or any other documents or reports
A 32-bit architecture for Windows NT, Windows 95/98

Access to third-party OLAP tools, including direct native access to Arbor's Essbase and Oracle's Express multidimensional databases
PowerCube creation and access within existing relational databases such as Oracle, SyBASE, or Microsoft SQL Server, right inside the data warehouse
PowerCube creation scheduled for off-peak processing, or sequential to other processes
Advanced security control by dimension, category and measure, on the client, the server or both
Remote analysis where users pull subsets of information from the server down to the client
Complete integration with relational database security and data management features
An open API through OLE automation, allowing both server- and client-based PowerCubes to be accessed by Visual Basic applications, spreadsheets, and other third-party tools and applications

As mentioned earlier, PowerPlay's capabilities include drill-to-detail using Impromptu. Also, cubes can be built using data from multiple sources. For the administrators who are responsible for creating multidimensional cubes, new capabilities allow them to populate these PowerCubes inside popular relational databases, and to do the processing off the desktop and on UNIX servers. To provide robust administration capabilities, Cognos offers a companion tool, PowerPlay Administrator, which is available in database and server editions. In the PowerPlay Administrator database edition, the administrator would continue to model the cube and run the population-of-the-cube process (called transform) on the client platform. The advantage is that data from multiple sources can now be used to generate a PointerCube for the client, and the actual PowerCube can be placed inside a relational database. This means that existing database management tools and the database administrator can be used to manage the business data, and a single delivery mechanism can be employed for both application and OLAP processing. A sophisticated security model is provided which in effect creates a 'master' cube to service a variety of users. This is defined and controlled through an authenticator, also included with PowerPlay.

The Administrator server edition of PowerPlay lets users process the population of the cube on a UNIX platform. An administrator uses the client transformer to create a model, and moves it to the UNIX server using the supplied software component called Powergrid. The server transformer, once triggered, will create the PowerCube, and the only prerequisite is that all data sources be accessible. Once completed, the resulting PowerCube (or PointerCube, if the multidimensional database is placed inside an RDBMS) is copied or transferred to the client platform for subsequent PowerPlay analysis by the user. The authenticator can be used to establish user classes and access security, and can also be used to redirect cube access, since all database passwords and locations can be known to the authenticator.

PowerPlay can be used effectively for generation of reports on any multidimensional cube generated by other tools such as Oracle Express, Plato and Visual DB2 DW. PowerPlay supports clients on Windows 3.1, 95, 98 and NT. The Administrator database and server editions execute on HP/UX, IBM AIX, and Sun Solaris, and support PowerCubes in Oracle, Sybase SQL Server, and Microsoft SQL Server. The latest Cognos PowerPlay release, version 6, has Web-enabled features so as to present the reports on the Web for 3-tier, undirected use.

2.6.3 IBI Focus Fusion


Focus Fusion from Information Builders Inc. (IBI) is a multidimensional database technology for OLAP and data warehousing. It is designed to address business applications that require multidimensional analysis of detail product data. Focus Fusion complements Cactus and EDA/SQL middleware software to provide a multifaceted data warehouse solution.

Focus Fusion combines a parallel-enabled, high-performance, multidimensional database engine with the administrative, copy management and access tools necessary for a data warehouse solution. Designed specifically for deployment of business intelligence applications in data warehouse environments, Fusion provides:

Fast query and reporting: Fusion's advanced indexing, parallel query, and roll-up facilities provide high performance for reports, queries and analyses, with the scalability users need to complete data warehouse solutions
Comprehensive, graphics-based administration facilities that make Fusion database applications easy to build and deploy
Integrated copy management facilities, which schedule automatic data refresh from any source into Fusion
A complete portfolio of tightly integrated business intelligence applications that span reporting, query, decision support and EIS needs
Open access via industry-standard protocols like ANSI SQL, ODBC and HTTP via EDA/SQL, so that Fusion works with a wide variety of desktop tools, including World Wide Web browsers
Online Analytical Processing with a three-tiered reporting architecture for high performance
Scalability of OLAP applications from the department to the enterprise
Access to precalculated summaries (roll-up) combined with dynamic detail data manipulation capabilities
Capability to perform intelligent application partitioning without disrupting users
Interoperability with the leading EIS, DSS and OLAP tools
Support for parallel computing environments
Seamless integration with more than 60 different database engines on more than 35 platforms, including Oracle, Sybase, SAP, Hogon, Microsoft SQL Server

Focus Fusion's proprietary OverLAP technology allows Fusion to serve as an OLAP front-end or shared cache for relational and legacy databases, effectively providing a virtual warehousing environment for the analysis of corporate data. This can simplify warehouse management and lower overall costs by potentially reducing the need to copy infrequently accessed detail data to the warehouse for a possible drill-down.

Focus Fusion is a modular tool that supports flexible configurations for diverse needs, and includes the following components:

Fusion/DBserver: A high-performance, client/server, parallel-enabled, scalable multidimensional DBMS. Fusion/DBserver runs on both UNIX and NT and connects transparently to all enterprise data that EDA/SQL can access (more than 60 different database engines on over 35 platforms). Fusion/DBserver also provides stored procedure and remote procedure call (RPC) facilities.

Fusion/Administrator: A comprehensive GUI-based (Windows) administration utility that provides visual schema definition and bulk load of Fusion databases, multidimensional index definition and build, and roll-up definition and creation. Additionally, Fusion/Administrator automates migration of FOCUS databases to Fusion.

Fusion/PDQ: Parallel data query for Fusion/DBserver, which exploits symmetric multiprocessor (SMP) hardware for fast query execution and parallel loads.

EDA/Link: Fusion's client component, which supports standard APIs, including ODBC and ANSI SQL. EDA/Link provides access to Fusion from any desktop (Windows 95/NT, UNIX, Macintosh, OS/2) or host system (UNIX, MVS, AS400, VMS, etc.) over TCP/IP and many other network topologies, and (via EDA Hub servers) integrates Fusion's products into the enterprise.

EDA/WebLink: Fusion's open browser client that supports Netscape, Mosaic, Internet Explorer, and all other standard HTML browsers. It works with Information Builders' HTML generator to facilitate Web-based warehouse publishing applications.

Enterprise Copy Manager for Fusion: Fully automated assembly, transformation, summarization, and load of enterprise data from any source(s) into Fusion on a scheduled basis. It consists of Enterprise Copy Server, Enterprise Copy Client (the graphical Windows-based interface), and Enterprise Source Server (the remote gateway providing access to source data).

EDA Gateways: Remote data access gateways that provide transparent, live drill-through capabilities from Fusion/DBserver to production databases.

2.6.4 Pilot Software

Pilot Software offers the Pilot Decision Support Suite of tools, ranging from a high-speed multidimensional database (MOLAP), data warehouse integration (ROLAP) and data mining to a diverse set of customizable business applications targeted at sales and marketing professionals. The following products are at the core of Pilot Software's offering:

Pilot Analysis Server: A full-function multidimensional database with high-speed consolidation, a graphical user interface (Pilot Model Builder), and an expert-level interface. The latest version includes relational integration of the multidimensional model with relational data stores, thus allowing the user the choice between high-speed access of a multidimensional database or on-the-fly (ROLAP) access of detail data stored directly in the data warehouse or data mart.

Pilot Link: A database connectivity tool that includes ODBC connectivity and high-speed connectivity via specialized drivers to the most popular relational database platforms. A graphical user interface allows the user seamless and easy access to a wide variety of distributed databases.

Pilot Designer: An application design environment specifically created to enable rapid development of OLAP applications.

Pilot Desktop: A collection of applications that allows the end-user easy navigation and visualization of the multidimensional database.

Pilot Sales and Marketing Analysis Library: A collection of sophisticated applications designed for the sales and marketing business end-user (including 80/20 Pareto analysis, time-based ranking, BCG quadrant analysis, trendline and statistical forecasting). The applications can be modified and tailored to meet individual needs for particular customers.

Pilot Discovery Server: A predictive data mining tool that embeds directly into the relational database and does not require the user to copy or transform the data. The data mining results are stored with metadata into the data warehouse as a predictive segmentation and are embedded into a graphical user interface called Pilot Discovery Server Launch, which helps in building data mining models.

Pilot Internet Publisher: A tool that easily allows users to access their Pilot multidimensional database via browsers on the Internet or Intranets.

Some of the distinguishing features of Pilot's product offering include the overall complete solution, from a powerful OLAP engine and data mining engine to their customizable business applications. Within their OLAP offering, some of the key features of Pilot's products are as under:

Time intelligence: The Pilot Analysis Server has a number of features to support time as a special dimension. Among these is the ability to process data on the fly to convert the native periodicity (e.g. the data collected weekly) to the periodicity preferred by the customer viewing the data (e.g. view the data monthly). This feature is accomplished via special optimized structures within the multidimensional database. (A sketch of this kind of periodicity conversion follows this list.)

Embedded data mining: The Pilot Analysis Server is the first product to integrate predictive data mining (as described in Chapter 3) with the multidimensional database model. This allows the user to benefit not only from the predictive power of data mining but also from the descriptive and analytical power of multidimensional navigation.

Multidimensional database compression: In addition to the compression of sparsity, Pilot Analysis Server also has special code for compressing data values over time. A new feature called a 'dynamic dimension' allows some dimensions to be calculated on the fly when they are attributes of an existing dimension. This allows the database to be much smaller and still provide fast access. Dynamic variables, which are also calculated on the fly, are available to further decrease the total size of the database and thus decrease the time for consolidation of the database.

Relational integration: Pilot allows for a seamless integration of both MOLAP and ROLAP, to provide the user with either the speed of MOLAP or the more space-efficient ROLAP. The users interface with the system by defining the multidimensional model or view that they prefer, and the system self-optimizes the queries.
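As a rough sketch of the periodicity conversion described under 'time intelligence' above (weekly data viewed monthly), the following uses pandas resampling; the data, and the choice of summing as the consolidation rule, are illustrative assumptions rather than a description of Pilot's internal structures.

import pandas as pd

# Hypothetical measure collected weekly.
weekly = pd.Series(
    [10, 12, 9, 14, 11, 13, 15, 12],
    index=pd.date_range("1998-01-05", periods=8, freq="W-MON"),
    name="units_sold",
)

# Convert to the periodicity preferred by the viewer (monthly totals).
monthly = weekly.resample("MS").sum()
print(monthly)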

2.6.5 Arbor Essbase Web


Essbase is one of the most ambitious of the early Web products. It includes not only OLAP manipulations, such as drill up, down and across, pivot, slice and dice, and dynamic reporting, but also data entry, including full multi-user concurrent write capabilities, a feature that differentiates it from the others.

After identifying the sources, the data has to be extracted and prepared for mining. Clearly, the data selection will depend upon the business objectives. Along with the data selected, the metadata also has to be acquired for understanding the meaning of the data. The metadata, as in the case of a data warehouse or data mart, should contain all the details, such as descriptions of data types, initial values, range of values, sources, etc. For data mining, the 'active' variables have to be selected. The active variables are those which will be participating in the mining process. In addition, some supplementary data variables need to be identified: they are helpful in visualization and explanation of results. The shelf life of data also becomes an important consideration. The shelf life is a period after which the data will lose its attraction or usefulness. The data mining algorithms to be adopted may also be decided by the analyst at this stage, as they determine the required formats and other details of the data being prepared.

Data pre-processing. Data pre-processing is the most crucial step, as the operational data is normally never captured and prepared for data mining purposes. Mostly, the data is captured from several inconsistent, poorly documented operational systems. (Capturing data from point of sale (PoS) systems is an exception.) Thus, data pre-processing requires substantial effort in purifying and organizing the data. This step ensures that the selected data available for mining is of good quality.

The data pre-processing step begins with a general review of the structure of the data and quality assurance. This is performed using sampling and visualization techniques. The characteristics of the data to be viewed depend on the nature of the data, i.e. whether the data is categorical or quantitative. For categorical variables, the visualization can be in terms of histograms, pie charts, etc. For quantitative variables, the visualization will be in terms of maxima, minima, mean, median, mode, standard deviation, etc. By utilising all these methods it is possible to determine the presence of invalid and skewed data which may be incorrect. Spurious data or noise can be identified by quantitative techniques such as minima and maxima analysis or by various scatter distribution parameters. Scatter plots and base plots can help in identifying and clearing the data of spurious and noisy elements by finding outlandish or exceptional data (which may be incorrect). Missing values of data pose a real problem. This problem can be encountered by various methods: dropping the observations with missing data values, or replacing a missing value with a likely value, predictable by reliable techniques, etc.
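A small illustration of these quality checks, summary statistics, a crude min/max screen for suspicious values, and simple treatment of missing values, is sketched below with pandas; the column names and the chosen valid range are hypothetical.

import pandas as pd

df = pd.DataFrame({"age": [34, 41, None, 29, 260, 38],      # 260 is clearly spurious
                   "segment": ["A", "B", "A", None, "C", "A"]})

# Quantitative variable: review min, max, mean, median, etc.
print(df["age"].describe())

# Flag values outside a plausible range (minima/maxima analysis).
valid = df["age"].between(0, 120)
df.loc[~valid, "age"] = None        # treat out-of-range values as missing

# Handle missing values: replace with a likely value (median) or drop the rows.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["segment"])  # here we simply drop rows missing the category
print(df)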

A segmentation algorithm will segregate the data as indicated above. In this process, there are two different methods possible: demographic clustering and neural clustering. These two methods are distinguished by the data types accepted, the methodology of calculation of inter-record distances, and the way the segments are organized.

Demographic clustering operates on records with categorical values, and the distances are measured using a voting principle called Condorcet. No hierarchy is followed for organizing the resulting segments (or clusters). On the other hand, neural clustering methods are based on neural networks. They accept only numeric quantitative inputs (categorical inputs can be converted into numerical input form). The distance measurement technique is based on Euclidean distances, and the resulting segments are arranged in a hierarchy, where the most similar segments are placed closest together and the least similar segments are placed at the largest distance. Segmentation (or clustering) is used in business applications such as customer profiling, target marketing, etc. It has interdisciplinary applications, cutting across all sectors of business.
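A minimal sketch of the Euclidean, hierarchy-producing style of clustering described above is given below using SciPy's agglomerative clustering; the data is made up, and this stands in for, rather than reproduces, any particular neural clustering product.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Purely numeric records (categorical fields would first be numerically encoded).
records = np.array([[25, 30_000],
                    [27, 32_000],
                    [45, 80_000],
                    [47, 82_000],
                    [33, 50_000]], dtype=float)

# Build the hierarchy using Euclidean distances between records.
tree = linkage(records, method="average", metric="euclidean")

# Cut the hierarchy into a chosen number of segments (clusters).
segments = fcluster(tree, t=2, criterion="maxclust")
print(segments)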

3.4.2 Predictive Modelling


Just as human beings learn from observation and experience and are then able to predict, it is possible for an algorithm to draw certain general rules on the behaviour of the data and predict future behaviour. In data mining operations, predictive modelling analyzes the existing database to determine some essential characteristics about the data. It is, however, essential that the data include complete, valid observations from which the model can reach a conclusion on how to make accurate predictions. The model needs to be 'trained' with already known observations for prediction purposes. This approach is called 'supervised learning'. The model pattern itself can be very simple. It can be just a few 'if-then-else' clauses, for example. Figure 3.3 shows the behaviour of employees of an Indian government organization. From this figure we can see that the resignation data has been analyzed and found to follow a particular pattern, i.e. employees who are not in very senior positions and have worked less than 10 years in the organization display a tendency to leave the organization. This is a predictive model developed by mining the given training data. Once the model is defined clearly, it can be used for prediction purposes. Both training and testing of the model need to be performed. Training requires large data, whereas testing is done on small data. Predictive modelling has extensive applications across the industry sectors. Customer or employee selection management, credit rating, cross-selling and target marketing are some of the applications of this technique.
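The kind of 'if-then-else' model described above can be written down directly. The sketch below encodes the attrition pattern from Figure 3.3 (not very senior, less than 10 years of service) as a rule and checks it against a few hypothetical labelled records, in the spirit of training versus testing; the records themselves are invented for illustration.

def likely_to_resign(is_senior: bool, years_of_service: float) -> bool:
    """Rule derived from the historical (training) data: employees who are
    not very senior and have served less than 10 years tend to leave."""
    return (not is_senior) and years_of_service < 10

# Hypothetical held-out test records: (is_senior, years_of_service, actually_resigned)
test = [(False, 3, True), (False, 12, False), (True, 4, False), (False, 8, True)]

correct = sum(likely_to_resign(s, y) == resigned for s, y, resigned in test)
print(f"accuracy on test data: {correct}/{len(test)}")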

Developing a Data Warehouse

4.1 WHY AND HOW TO BUILD A DATA WAREHOUSE?
Changes are taking place in the world continuously in all forms of activity. From a business perspective, the quest for survival in a globally competitive environment has become extremely demanding for all. As discussed in Chapter 1, in this scenario, business strategy requires answers to questions in business policy and future strategy. This means that the decisions are required to be taken quickly and correctly, using all the available data. As the data size increases continuously, doubling every few months, the speed of, and the requirements for, processing this data so as to comprehend its meaning also need to be increased significantly. Competition adds further pressure on this situation, and herein business intelligence becomes the foundation for a successful business strategy (Bracket, 1996). Here the need for data warehousing technology arises, in terms of the ability to organize and maintain large data and also to be able to analyze it in a few seconds in the manner and depth required (Inmon and Hackathorn, 1994).

Why did the conventional information systems not succeed in meeting these requirements? Actually, the conventional information systems and data warehousing tackle two different activity domains: OLTP and OLAP (as discussed in Chapter 1). These domains are not at all competitive with each other; they deal with two different problem domain requirements. Further, as the technology upgrades, CPU power and disk space are growing larger and becoming cheaper with time. Similarly, network bandwidths are increasing and becoming cheaper day by day. The need for more and more interoperability and heterogeneity is increasingly being felt. In this scenario, data warehousing emerges as a promising technology (Anahory and Murray, 1997).

In the following sections, we shall survey all the major issues involved in building a data warehouse, such as approach, architectural strategy, design considerations, data content related issues, metadata, distribution of data, tools and performance considerations.

This will happen if the tool is weak. Therefore, it is better to go for a very comprehensive tool. It is also very important to ensure that all the related tools are compatible with one another and also with the overall design. All the tools should also be compatible with the given data warehousing environment and with one another. This means that all the selected tools can be compatible with each other and there can be a common metadata repository. Alternatively, the tools should be able to source the metadata from the warehouse data dictionary (if it is available) or from a CASE tool used to design the warehouse database. Another option is to use metadata gateways that translate one tool's metadata into another tool's format (Kimball, 1996). If the above guidelines are not meticulously followed, then the resulting data warehouse environment will rapidly become unmanageable, since every modification to the warehouse data model may involve some significant and labour-intensive changes to the metadata definitions for every tool in the environment. These changes will also be required to be verified for consistency and integrity (Mattison, 1996).

4.8 PERFORMANCE CONSIDERATIONS


Even though OLAP applications on a data warehouse do not call for very stringent, real-time responses, as in the case of OLTP systems on a database, the interactions of the user with the data warehouse should be online and interactive with good enough speed. Rapid query processing is desirable. The actual performance levels may vary from application to application with different requirements, as the case may be. The requirements of query response time, as defined by a particular business application, are required to be met fully by the data warehouse and its tools. However, it is not possible to predict in advance the performance levels of a data warehouse. As the usage patterns of the data warehouse vary from application to application and are unpredictable, the traditional database design and tuning techniques do not always work in a data warehouse environment. It is essential, therefore, to understand and know the specific user requirements in terms of the information and querying before the data warehouse is designed and implemented. Query optimization also can be done if frequently asked queries are well understood. For example, to answer aggregate level queries, the warehouse can be (additionally) populated with specified denormalized views containing specific, summarized, derived and aggregated data. If made correctly available, many end-user requirements and many frequently asked queries can be answered directly and efficiently, so that the overall performance levels can be maintained high (Hackney, 1997).
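The pre-aggregated view idea can be sketched as below: a frequently asked aggregate (say, monthly sales by region) is materialized once and then answered from the summary rather than from the detail fact table. The table and column names are illustrative assumptions.

import pandas as pd

fact = pd.DataFrame({                       # detail-level fact table
    "region": ["north", "north", "south", "south", "south"],
    "month":  ["Jan", "Feb", "Jan", "Jan", "Feb"],
    "amount": [100, 120, 90, 60, 110],
})

# Materialize the summary once (e.g. during the nightly load window).
monthly_sales = fact.groupby(["region", "month"], as_index=False)["amount"].sum()

# A frequently asked query is then answered from the small summary table
# instead of scanning the detail data.
answer = monthly_sales.query("region == 'north' and month == 'Jan'")
print(answer)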

The balancing between processors and I/O in a multiprocessor environment is more important in data warehouse applications. As the disk space requirements are about three times the raw data size, the required number of online disks will be quite large. The throughput or performance efficiency comes from the number of disks and parallelism. To balance this situation, it is important to allocate the correct number of processors to efficiently handle all the disk I/O operations. Otherwise, the hardware will end up becoming CPU bound in its execution, thus leading to inefficiency. Various processors have different performance ratings and thus can support a different number of disks per CPU. The hardware selection should be based on efficient calculation and careful analysis of disk I/O rates and processor capabilities, to derive an efficient system configuration so that an adequate number of processors is available for driving all the disks required (Creecy et al., 1992).

Another consideration is related to the disk controller. A disk controller supports a certain amount of data throughput (e.g. 20 MB/sec). Thus, given the number of disks required for storing the data warehouse (three times the raw data size), the required number of controllers can be calculated and provided for. Balanced design considerations should be ensured for all system components for the sake of better efficiency. The resulting configuration will be able to easily handle the known workloads and provide a balanced and scalable computing platform for future growth of the data warehouse.
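The sizing arithmetic implied above can be made concrete with a small calculation. The controller throughput of 20 MB/sec comes from the text; the raw data size, per-disk capacity and per-disk transfer rate below are assumptions chosen only to show the shape of the calculation.

import math

raw_data_gb      = 500                  # assumed raw data size
warehouse_gb     = 3 * raw_data_gb      # rule of thumb from the text: about 3x raw data
disk_capacity_gb = 18                   # assumed capacity of one online disk
disk_rate_mb_s   = 5                    # assumed sustained transfer rate per disk
controller_mb_s  = 20                   # controller throughput quoted in the text

disks = math.ceil(warehouse_gb / disk_capacity_gb)
controllers = math.ceil(disks * disk_rate_mb_s / controller_mb_s)

print(f"online disks needed: {disks}")
print(f"disk controllers needed: {controllers}")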

It is very crucial to fine tune a parallel DBMS with a parallel processor hardware architecture for optimal results in query processing (Winter and Brobst, 1994).

DBMS selection

Next to hardware selection, the most critical factor is the DBMS selection. This determines the speed performance of the data warehousing environment. The requirements of a DBMS for a data warehousing environment are scalability, performance in high volume storage and processing, and throughput in traffic. The majority of established RDBMS vendors have implemented various degrees of parallelism in their products. Even though all the well-known vendors, IBM, Oracle and Sybase, support parallel database processing, some of them have improved their architectures so as to better suit the specialized requirements of a data warehouse. The RDBMS products provide additional modules for OLAP cubes (Yazdani and Shirley, 1997; O'Neil, 1997; Poolet and Reilly, 1997; Hammergren, 1997). One can use the OLAP features of the same DBMS on which the data resides. However, this may not be adequate for meeting certain application requirements. In such cases, OLAP servers from other reputed vendors may be used, independently of the DBMS.

Applications of Data Warehousing and Data Mining in Government

5.1 INTRODUCTION

Data warehousing and data mining are important means of preparing the government to face the challenges of the new millennium. The data warehousing and data mining technologies have extensive potential applications in the government: in various Central Government sectors such as Agriculture, Rural Development, Health and Energy. These technologies can and should therefore be implemented. Similarly, in State Government activities also, large opportunities exist for applying these techniques. Almost all these opportunities have not yet been exploited. In this chapter, we shall examine both the Central and State Government applications in terms of the potential applications, and actual case studies of already implemented applications are given in the Appendices.

5.2 NATIONAL DATA WAREHOUSES

A large number of national data warehouses can be identified from the existing data resources within the Central Government Ministries. Let us examine these subject areas on which data warehouses may be developed at present and also potentially in future.

5.2.1 Census Data


The Registrar General and Census Commissioner of India decennially compiles information on all individuals, villages, population groups, etc. This information is wide ranging, such as the individual slip, a compilation of information on individual households, of which a database of a 5% sample is maintained for analysis. A data warehouse can be built from this database, upon which OLAP techniques can be applied. Data mining also can be performed for analysis and knowledge discovery. It can be integrated into one data warehouse for the government. This is true for both State Government and Central Government. Thus data warehouses can be built at Central level, State level and also at District level.

CONCLUSION

In the government, the individual data marts are required to be maintained by the individual departments (or public sector organizations), and a central data warehouse is required to be maintained by the ministry concerned for the concerned sector. A generic inter-sectoral data warehouse is required to be maintained by a central body (such as the Planning Commission). Similarly, at the State level, a generic inter-departmental data warehouse can be built and maintained by a nodal agency, and detailed data warehouses can also be built and maintained at the district level by an appropriate agency. The National Informatics Centre may possibly play the role of the nodal agency at Central, State and District levels for developing and maintaining data warehouses in various sectors.
