Você está na página 1de 17

Data Mining

Copyright © 2015 Tech Mahindra. All rights reserved. 1


What is Data Mining
Data mining is a Process of semi-automatically analyzing large databases to find patterns that are:
 valid: hold on new data with some certainty
 novel: non-obvious to the system
 useful: should be possible to act on the item
 understandable: humans should be able to interpret the pattern
known as Knowledge Discovery in Data (KDD)

The key properties of data mining are:


 Automatic discovery of patterns
 Prediction of likely outcomes
 Creation of actionable information
 Focus on large data sets and databases
 Data mining can answer questions that cannot be addressed through simple query and reporting techniques.

Copyright © 2015 Tech Mahindra. All rights reserved. 2


Why Data Mining
Credit ratings/targeted marketing:
 Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
 Identify likely responders to sales promotions

Fraud detection
 Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a
particular customer?

Customer relationship management:


 Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?
Data Mining helps extract such information

Copyright © 2015 Tech Mahindra. All rights reserved. 3


Applications
 Banking: loan/credit card approval
– predict good customers based on old customers
 Customer relationship management:
– identify those who are likely to leave for a competitor.
 Targeted marketing:
– identify likely responders to promotions
 Fraud detection: telecommunications, financial transactions
– from an online stream of event identify fraudulent events
 Manufacturing and production:
– automatically adjust knobs when process parameter changes
 Medicine: disease outcome, effectiveness of treatments
– analyze patient disease history: find relationship between diseases
 Molecular/Pharmaceutical: identify new drugs
 Scientific data analysis:
– identify new galaxies by searching for sub clusters
 Web site/store design and promotion:
– find affinity of visitor to pages and modify layout
Any area where large amounts of historic data that if understood better can help shape future decisions.

Copyright © 2015 Tech Mahindra. All rights reserved. 4


Properties
Automatic Discovery
Data mining is accomplished by building models which uses an algorithm to act on a set of data. The notion of automatic discovery
refers to the execution of data mining models.
The process of applying a model to new data is known as scoring.

Prediction
Predictions have an associated probability (How likely is this prediction to be true?).
Prediction probabilities are also known as confidence (How confident can I be of this prediction?).
Some forms of predictive data mining generate rules, which are conditions that imply a given outcome. For example, a rule might
specify that a person who has a bachelor's degree and lives in a certain neighbourhood is likely to have an income greater than the
regional average.
Rules have an associated support (What percentage of the population satisfies the rule?).

Grouping
Other forms of data mining identify natural groupings in the data. For example, a model might identify the segment of the population
that has an income within a specified range, that has a good driving record, and that leases a new car on a yearly basis.

Actionable Information
Data mining can derive actionable information from large volumes of data. For example, a town planner might use a model that
predicts income based on demographics to develop a plan for low-income housing.

Copyright © 2015 Tech Mahindra. All rights reserved. 5


Data Mining Process

Copyright © 2015 Tech Mahindra. All rights reserved. 6


Process
Problem Definition:
 This initial phase of a data mining project focuses on understanding the project objectives and requirements. Once you have
specified the project from a business perspective, you can formulate it as a data mining problem and develop a preliminary
implementation plan.

Data Gathering and Preparation


 The data understanding phase involves data collection and exploration. As you take a closer look at the data, you can determine
how well it addresses the business problem. You might decide to remove some of the data or add additional data. This is also
the time to identify data quality problems and to scan for patterns in the data.
 The data preparation phase covers all the tasks involved in creating the case table you will use to build the model. Data
preparation tasks are likely to be performed multiple times, and not in any prescribed order.
 Thoughtful data preparation can significantly improve the information that can be discovered through data mining.

Copyright © 2015 Tech Mahindra. All rights reserved. 7


Process(continued):
Model Building and Evaluation
In this phase, you select and apply various modeling techniques and calibrate the parameters to optimal values.
At this stage of the project, it is time to evaluate how well the model satisfies the originally-stated business goal (phase 1). If the
model is supposed to predict customers who are likely to purchase a product, does it sufficiently differentiate between the two
classes? Is there sufficient lift? Are the trade-offs shown in the confusion matrix acceptable?

Knowledge Deployment
Knowledge deployment is the use of data mining within a target environment. In the deployment phase, insight and actionable
information can be derived from data.
Deployment can involve scoring (the application of models to new data), the extraction of model details (for example the rules of a
decision tree), or the integration of data mining models within applications, data warehouse infrastructure, or query and reporting
tools. :

Copyright © 2015 Tech Mahindra. All rights reserved. 8


Relationship with other fields
Overlaps with machine learning, statistics, artificial intelligence, databases, visualization but more stress
on
 scalability of number of features and instances
 stress on algorithms and architectures whereas foundations of methods and formulations provided
by statistics and machine learning.
 automation for handling large, heterogeneous data

Copyright © 2015 Tech Mahindra. All rights reserved. 9


Some Basic Operations
 Predictive:
o Regression
o Classification
o Collaborative Filtering
 Descriptive:
o Clustering / similarity matching
o Association rules and variants
o Deviation detection

Copyright © 2015 Tech Mahindra. All rights reserved. 10


Classification(Supervised Learning)
 Given old data about customers and payments, predict new applicant’s loan eligibility

Salary > 5 L
Age Good/
Salary Prof. = Exec
bad
Profession
Location
Customer type
New applicant’s data

Copyright © 2015 Tech Mahindra. All rights reserved. 11


Classification methods
 Goal: Predict class Ci = f(x1, x2, .. Xn)
 Regression: (linear or any other polynomial)
 a*x1 + b*x2 + c = Ci.
 Nearest neighour
 Decision tree classifier: divide decision space into piecewise constant regions.
 Probabilistic/generative models
 Neural networks: partition by non-linear boundaries

Copyright © 2015 Tech Mahindra. All rights reserved. 12


Clustering or Unsupervised Learning
 Unsupervised learning when old data with class labels not available e.g. when introducing a new
product.
 Group/cluster existing customers based on time series of payment history such that similar
customers in same cluster.
 Key requirement: Need a good measure of similarity between instances.
 Identify micro-markets and develop policies for each

Copyright © 2015 Tech Mahindra. All rights reserved. 13


Clustering Methods
 Hierarchical clustering
– agglomerative Vs divisive
– single link Vs complete link
 Partitional clustering
– distance-based: K-means
– model-based: EM
– density-based:

Applications:
 Customer segmentation e.g. for targeted marketing
– Group/cluster existing customers based on time series of payment history such that similar
customers in same cluster.
– Identify micro-markets and develop policies for each
 Collaborative filtering:
– group based on common items purchased
 Text clustering
 Compression
Copyright © 2015 Tech Mahindra. All rights reserved. 14
Why Now?
 Data is being produced
 Data is being warehoused
 The computing power is available
 The computing power is affordable
 The competitive pressures are strong
 Commercial products are available

Copyright © 2015 Tech Mahindra. All rights reserved. 15


Trends in Data Mining
 Application Exploration.
 Scalable and interactive data mining methods.
 Integration of data mining with database systems, data warehouse systems and web database
systems.
 Standardization of data mining query language.
 Visual data mining.
 New methods for mining complex types of data.
 Biological data mining.
 Data mining and software engineering.
 Web mining.
 Distributed data mining.
 Real time data mining.
 Multi database data mining.
 Privacy protection and information security in data mining.

Copyright © 2015 Tech Mahindra. All rights reserved. 16


Thank you

Copyright © 2015 Tech Mahindra. All rights reserved. 17

Você também pode gostar