Applications:
Database Indexing and Data
Mining
George Kollios
Boston University
Prof. George Kollios
Office: MCS 288
Office Hours: Monday 2:30pm-4:00pm
Thursday 11:00am-12:30pm
Mailing List: cs562
Web: http://www.cs.bu.edu/faculty/gkollios/ada05
Book1:
http://www.cs.bu.edu/faculty/gkollios/ada05/Book/
History of Database
Technology
1960s: Data collection, database creation, IMS and
network DBMS
1970s: Relational data model, relational DBMS
implementation
1980s: RDBMS, advanced data models (extended-relational,
OO, deductive, etc.) and application-oriented DBMS
(spatial, scientific, engineering, etc.)
1990s-2000s: Data mining and data warehousing,
multimedia databases, and Web databases
Structure of a RDBMS
A DBMS is an OS for data!
[Figure: RDBMS layers, including Query Optimization and Execution, Relational Operators, and the DB at the bottom]
Modern Database Systems
Extend these layers
Index Methods for RDBMS
Hashing Methods:
Linear Hashing, Extensible Hashing
B-tree family:
B+-trees and variations
[Figure: Data Mining shown alongside related fields: Machine Learning, Visualization, Information Science, Other Disciplines]
Overview of terms
Data: a set of facts (items) D, stored in
a database
Pattern: an expression E in a language
L, that describes a subset of facts
Attribute: a field in an item i in D.
Interestingness: a function I_{D,L} that
maps an expression to a measure
space M
The Data Mining Task
For a given dataset D, language of
facts L, interestingness function ID,L
and threshold c, find the
expressions E that:
ID,L(E) > c efficiently.
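The task definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the language L is taken to be itemsets of at most `max_size` items, the interestingness function is plain support, and the names `support`, `mine`, and the toy dataset `D` are hypothetical. The brute-force enumeration is deliberately not efficient; it only shows what "find E with I_{D,L}(E) > c" means.

```python
from itertools import combinations

def support(itemset, dataset):
    """Assumed interestingness: fraction of facts in D that contain the itemset."""
    return sum(1 for fact in dataset if itemset <= fact) / len(dataset)

def mine(dataset, interestingness, threshold, max_size=2):
    """Enumerate every expression E (itemsets up to max_size) and keep those with I(E) > c."""
    items = sorted(set().union(*dataset))
    results = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            e = frozenset(combo)
            score = interestingness(e, dataset)
            if score > threshold:
                results[e] = score
    return results

# Hypothetical dataset D: each fact is a set of items.
D = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
patterns = mine(D, support, threshold=0.5)
for pattern, score in patterns.items():
    print(sorted(pattern), score)
```

Swapping in a different `interestingness` function or a different enumeration of L changes the task without changing the overall shape of the search.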
How Data Mining is used
Identify the problem
Use data mining techniques to
transform the data into information
Act on the information
Measure the results
DM Functionalities
Concept description:
Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
Association (correlation and causality):
Multi-dimensional vs. single-dimensional association
age(X, 20..29) ^ income(X, 20..29K) → buys(X,
PC) [support = 2%, confidence = 60%]
contains(T, computer) → contains(T, software)
[support = 1%, confidence = 75%]
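The bracketed numbers are the support and confidence of each rule. As a minimal sketch, assuming transactions are represented as sets of items, the two measures can be computed as below. The function name `rule_stats` and the toy transactions are hypothetical and are not the data behind the percentages on the slide.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n = len(transactions)
    # Transactions containing the antecedent, and those containing both sides.
    with_antecedent = sum(1 for t in transactions if antecedent <= t)
    with_both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    support = with_both / n
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence

# Hypothetical transactions T.
transactions = [
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
    {"computer", "software"},
]
s, c = rule_stats(transactions, {"computer"}, {"software"})
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

Support measures how often the whole rule applies in the dataset, while confidence measures how often the consequent holds when the antecedent does.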
DM Functionalities
Cluster analysis
Class label is unknown: Group data to
form new classes, e.g., cluster houses
to find distribution patterns
Clustering based on the principle:
maximizing the intra-class similarity
and minimizing the interclass
similarity
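The clustering principle can be checked numerically. The sketch below assumes Euclidean distance as an inverse proxy for similarity (a smaller distance means higher similarity) and compares two labelings of the same points. The helper `mean_pairwise_distances` and the sample points are hypothetical, and the helper assumes at least one same-cluster pair and one cross-cluster pair exist.

```python
from itertools import combinations
from math import dist

def mean_pairwise_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Same label: the pair is inside one cluster; otherwise it spans two.
        (intra if lp == lq else inter).append(dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_labels = [0, 0, 1, 1]   # groups nearby points together
bad_labels = [0, 1, 0, 1]    # mixes the two groups

good_intra, good_inter = mean_pairwise_distances(points, good_labels)
bad_intra, bad_inter = mean_pairwise_distances(points, bad_labels)
print(good_intra, good_inter)
print(bad_intra, bad_inter)
```

A good clustering shows a small intra-cluster distance relative to the inter-cluster distance; the mixed labeling reverses that relationship.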
DM Functionalities
Classification and Prediction
Finding models (functions) that describe and
distinguish classes or concepts for future
prediction
E.g., classify countries based on climate, or
classify cars based on gas mileage
Presentation: decision-tree, classification rule,
neural network
Prediction: Predict some unknown or missing
numerical values
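As an illustration of a classification rule, the sketch below fits a one-level decision tree (a "decision stump") on a single numeric attribute, in the spirit of the gas-mileage example. It is an assumption-laden toy: the names `fit_stump` and `predict`, the mileage values, and the class labels are all hypothetical, and the function returns `None` if the attribute has fewer than two distinct values.

```python
def fit_stump(values, labels):
    """Pick the threshold and class assignment with the fewest training errors."""
    classes = sorted(set(labels))
    xs = sorted(set(values))
    best = None  # (errors, threshold, label_below, label_above)
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        for below in classes:
            for above in classes:
                errors = sum(
                    (below if v <= threshold else above) != y
                    for v, y in zip(values, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    return best

def predict(stump, value):
    """Apply the learned rule: 'if value <= threshold then below else above'."""
    _, threshold, below, above = stump
    return below if value <= threshold else above

# Hypothetical training data: miles per gallon and a mileage class.
mpg = [12, 15, 18, 30, 35, 40]
label = ["low", "low", "low", "high", "high", "high"]

stump = fit_stump(mpg, label)
print(stump)
print(predict(stump, 20), predict(stump, 33))
```

The learned stump corresponds to a single classification rule; a full decision tree applies the same idea recursively to each side of the split.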