DataMining AdrianTuhtan

Data Mining
Overview

- Introduction
- Explanation of Data Mining
- Techniques
- Advantages
- Applications
- Privacy
Data Mining
What is Data Mining?
- The process of semi-automatically analyzing large databases to find useful patterns (Silberschatz)
- Also known as KDD: Knowledge Discovery in Databases (3)
- Attempts to discover rules and patterns from data
  - Discover rules
  - Make predictions
Areas of Use

- Internet: discover the needs of customers
- Economics: predict stock prices
- Science: predict environmental change
- Medicine: match patients with similar problems to find a cure
Example of Data Mining
A credit card company wants to discover information about its clients from its databases. It wants to find:

- Clients who respond to promotions in junk mail
- Clients who are likely to switch to a competitor
- Clients who are likely not to pay
- Services that clients use, so that services affiliated with the credit card company can be promoted
- Anything else that may help the company provide or promote services that help its clients and ultimately make more money
Data Mining & Data Warehousing
Data Warehouse: a repository (or archive) of information gathered from multiple sources, stored under a unified schema, at a single site. (Silberschatz)

- Collect data
- Store it in a single repository
- Allows easier query development, because a single repository can be queried
Data Mining:
Analyzing databases or data warehouses to discover patterns in the data and gain knowledge. Knowledge is power.
Discovery of Knowledge
Data Mining Techniques

- Classification
- Clustering
- Regression
- Association Rules
Classification
Classification: given a set of items that fall into several classes, and past instances (training instances) with their known classes, classification is the process of predicting the class of a new item, that is, identifying which class the new item belongs to.
Example: A bank wants to classify its home loan customers into groups according to their response to bank advertisements. The bank might use the classes Responds Rarely, Responds Sometimes, and Responds Frequently. The bank will then attempt to find rules about the customers who respond Frequently and Sometimes. The rules could be used to predict the needs of potential customers.
Technique for Classification
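Decision-Tree Classifiers
[Figure: decision tree that splits first on Job, then on Income, with leaves Bad or Good]
- Carpenter: Income <30K is Bad; Income >50K is Good
- Engineer: Income <40K is Bad; Income >90K is Good
- Doctor: Income <50K is Bad; Income >100K is Good
Predicting credit risk of a person with the jobs specified.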
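The tree on this slide can be read as nested rules: split on job first, then on income. The sketch below is one way to write that down. It assumes each job's lower income threshold leads to Bad and its higher threshold to Good, which is how the leaf order on the slide reads. The names `THRESHOLDS` and `credit_risk` are illustrative, and incomes between the two thresholds are left undecided because the slide shows no branch for them.

```python
from typing import Optional

# Income thresholds in thousands, per job, as read off the slide's tree.
# Each entry is (income below this is "Bad", income above this is "Good").
THRESHOLDS = {
    "Carpenter": (30, 50),
    "Engineer": (40, 90),
    "Doctor": (50, 100),
}


def credit_risk(job: str, income_k: float) -> Optional[str]:
    """Walk the tree: first split on job, then on income."""
    if job not in THRESHOLDS:
        return None  # the tree has no branch for this job
    bad_below, good_above = THRESHOLDS[job]
    if income_k < bad_below:
        return "Bad"
    if income_k > good_above:
        return "Good"
    return None  # income falls between the two branches shown on the slide


print(credit_risk("Engineer", 95))
print(credit_risk("Carpenter", 25))
```

In practice a decision-tree classifier learns such splits from training instances rather than having them written by hand.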
Clustering
Clustering algorithms find groups of items that are similar. They divide a data set so that records with similar content are in the same group, and groups are as different as possible from each other. (2)
Example: An insurance company could use clustering to group clients by their age, location, and types of insurance purchased.
The categories are not specified in advance; this is referred to as unsupervised learning.
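As an illustration of unsupervised grouping, here is a minimal sketch of k-means, one common clustering algorithm. The slides do not name k-means; it is used here only as an example. The function name `k_means`, the naive choice of the first k points as starting centroids, and the sample points are all assumptions.

```python
import math


def k_means(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point
    to its nearest centroid and then recomputing the centroids."""
    centroids = list(points[:k])  # naive start: the first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = mean of the cluster; an empty cluster keeps its old one.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters


# Made-up 2-D records, e.g. two numeric attributes per client.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
for cluster in k_means(points, k=2):
    print(sorted(cluster))
```

No class labels are supplied: the groups emerge only from the distances between records.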
Clustering
Group Data into Clusters

- Similar data is grouped in the same cluster
- Dissimilar data is grouped in different clusters
How is this achieved?
K-Nearest Neighbor: a classification method that classifies a point by calculating the distances between the point and the points in the training data set. It then assigns the point to the class that is most common among its k nearest neighbors (where k is an integer). (2)
Hierarchical
Group data into t-trees
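The k-nearest-neighbor description above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_classify` and the training points are made up for the example.

```python
import math
from collections import Counter


def knn_classify(point, training, k=3):
    """Classify `point` by majority vote among its k nearest training points.

    `training` is a list of (coordinates, label) pairs.
    """
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    # On a tie, the label seen first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]


# Made-up training data: two labeled groups of 2-D points.
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"),
]
print(knn_classify((2, 2), training))
print(knn_classify((7, 7), training))
```

Note that the training points carry known labels, so this is a supervised method, unlike clustering.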
Regression

Regression deals with the prediction of a value, rather than a class. (1, p747)
Example: Find out whether there is a relationship between smoking and cancer-related illness in patients.
Given values: $X_1, X_2, \dots, X_n$
Objective: predict the variable $Y$
One way is to estimate coefficients $a_0, a_1, a_2, \dots, a_n$ such that

$Y = a_0 + a_1 X_1 + a_2 X_2 + \dots + a_n X_n$ (linear regression)
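For the simplest case of a single predictor (Y = a0 + a1 X1), the coefficients can be estimated with the standard least-squares formulas. The sketch below is illustrative only; the function name `fit_line` and the sample data are assumptions, not taken from the slides.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a0 and a1 in Y = a0 + a1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # a1 = covariance(X, Y) / variance(X); a0 makes the line pass
    # through the point (mean_x, mean_y).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Made-up data lying exactly on Y = 1 + 2X.
a0, a1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a0, a1)
print("prediction for X = 5:", a0 + a1 * 5)
```

With several predictors the same idea applies, but the coefficients are usually found with a linear algebra routine rather than by hand.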
Regression
[Figure: example graph showing a line of best fit (curve fitting)]
Association Rules
An association algorithm creates rules that describe how often events have occurred together. (2)
Example: When a customer buys a hammer, 90% of the time they will also buy nails.
Association Rules
Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. (1, p748)
Example:
- A large fraction of all purchases include both hotdog buns and hotdog sausages: high support
- Only 0.005% of all purchases include both hotdog buns and hangers: low support
Rules with high support are worth careful attention.
E.g. hotdog sausages should be placed near hotdog buns in supermarkets if the rule also has high confidence.
Association Rules

Confidence: a measure of how often the consequent is true when the antecedent is true. (1, p748)
Example:
90% of hotdog bun purchases are accompanied by hotdog sausages. High confidence is meaningful because we can derive rules from it, such as:
Hotdog bun ⇒ Hotdog sausage
Two rules may have the same support but different confidence levels. E.g. Hotdog sausage ⇒ Hotdog bun may have a much lower confidence than Hotdog bun ⇒ Hotdog sausage, yet both have the same support.
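The two measures can be computed directly from a list of purchases. The sketch below follows the definitions quoted above: support is the fraction of all purchases containing both sides of the rule, and confidence is the fraction of purchases containing the antecedent that also contain the consequent. The function names and the five baskets are made up for illustration.

```python
def support(transactions, antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    both = antecedent | consequent
    return sum(1 for t in transactions if both <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with the antecedent that also have the consequent."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    if not with_antecedent:
        return 0.0  # the rule never applies in this data
    both = antecedent | consequent
    return sum(1 for t in with_antecedent if both <= t) / len(with_antecedent)


# Made-up baskets; each set is one purchase.
baskets = [
    {"buns", "sausages"},
    {"buns", "sausages", "mustard"},
    {"sausages"},
    {"sausages", "mustard"},
    {"hangers"},
]
buns, sausages = {"buns"}, {"sausages"}
print(support(baskets, buns, sausages), support(baskets, sausages, buns))
print(confidence(baskets, buns, sausages), confidence(baskets, sausages, buns))
```

In this toy data both rules have the same support, because support counts baskets containing both items regardless of direction, while the confidence differs because sausages are often bought without buns.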
Advantages of Data Mining
Provides new knowledge from existing data:

- Public databases
- Government sources
- Company databases
Old data can be used to develop new knowledge.
New knowledge can be used to improve services or products.
Improvements lead to:

- Bigger profits
- More efficient service
Uses of Data Mining
Sales/Marketing
- Diversify target market
- Identify clients' needs to increase response rates
Risk Assessment
- Identify customers that pose high credit risk
Fraud Detection
- Identify people misusing the system, e.g. people who have two Social Security numbers
Customer Care
- Identify customers likely to change providers
- Identify customer needs

Data mining involves six common classes of tasks:
- Anomaly detection (outlier/change/deviation detection): the identification of unusual data records that might be interesting, or that might be data errors requiring further investigation.
- Association rule learning (dependency modeling): searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
- Clustering: the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
- Classification: the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam.
- Regression: attempts to find a function that models the data with the least error.
- Summarization: providing a more compact representation of the data set, including visualization and report generation.
Applications of Data Mining

[Figure: applications of data mining (4). Source: IDC 1998]
