E059
Assignment: PART B
Name: Shubham Gupta
Class: B.Tech CS
Batch: E3
Date of Experiment:
Date of Submission:
Grade:
Mining Methodology and User Interaction Issues
Data mining query languages and ad hoc data mining: A data mining query language
that allows the user to describe ad hoc mining tasks should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results: Once patterns are
discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
Handling noisy or incomplete data: Data cleaning methods are required to handle
noise and incomplete objects while mining data regularities. Without such cleaning
methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation: The patterns discovered should be interesting; patterns may be
uninteresting because they represent common knowledge or lack novelty.
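The noise-handling point above can be sketched in a few lines: impute missing values with the column mean, then drop extreme outliers before mining. The sample readings and the 2-standard-deviation cutoff are illustrative assumptions, not part of the original text.

```python
# A minimal sketch of the cleaning step described above: fill missing
# values with the mean, then drop obvious outliers before mining.
# Data and the 2-sigma threshold are illustrative assumptions.
from statistics import mean, stdev

readings = [12.1, 11.8, None, 12.4, 98.0, 12.0, None, 11.9]

# 1. Impute missing values with the mean of the observed values.
observed = [x for x in readings if x is not None]
fill = mean(observed)
imputed = [x if x is not None else fill for x in readings]

# 2. Drop outliers more than 2 standard deviations from the mean.
m, s = mean(imputed), stdev(imputed)
cleaned = [x for x in imputed if abs(x - m) <= 2 * s]
print(cleaned)
```

Without step 2, the spurious 98.0 reading would distort any regularity mined from these values, which is the accuracy loss the text warns about.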
Performance Issues
There can be performance-related issues such as the following:
Parallel, distributed, and incremental mining algorithms: Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining
methods motivate the development of parallel and distributed data mining algorithms.
These algorithms divide the data into partitions, which are then processed in parallel,
and the results from the partitions are merged. Incremental algorithms update the
mined patterns as the database changes, without mining the data again from scratch.
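The partition-and-merge idea above can be sketched with a toy "mining" step (item counting): each partition is mined independently by a worker, the partial results are merged, and an incremental update folds in new data without re-mining the old partitions. The transactions are illustrative assumptions.

```python
# A sketch of partitioned mining and merging: each partition is mined
# independently (here, simple item counting), partial results are
# merged, and new data is folded in incrementally. Data is illustrative.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

partitions = [
    ["milk", "bread", "milk"],
    ["bread", "eggs"],
    ["milk", "eggs", "eggs"],
]

def mine(partition):
    # Per-partition "mining": count item occurrences.
    return Counter(partition)

# Mine the partitions in parallel and merge the partial counts.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(mine, partitions))
merged = sum(partials, Counter())

# Incremental update: fold new transactions into the existing result
# instead of re-mining everything from scratch.
merged.update(mine(["milk", "bread"]))
print(merged["milk"])  # 4
```

The merge step works because the per-partition results (counts) combine associatively; real parallel mining algorithms rely on the same property for their candidate sets.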
Handling of relational and complex types of data: A database may contain
complex data objects, multimedia data objects, spatial data, temporal data, etc. It is
not possible for one system to mine all these kinds of data.
- OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT,
UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing, maintaining data integrity
in multi-access environments, and effectiveness measured by the number of transactions per second. An OLTP database
holds detailed, current data, and the schema used to store transactional data is the entity model (usually 3NF).
- OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often
very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP
applications are widely used by data mining techniques. An OLAP database holds aggregated, historical data, stored in
multi-dimensional schemas (usually a star schema).
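The two workloads can be contrasted concretely with an in-memory SQLite database: short single-row transactions on one side, a read-only aggregation over many rows on the other. The table and data are illustrative assumptions.

```python
# A small sqlite3 sketch contrasting the two workloads described above:
# OLTP-style single-row transactions versus an OLAP-style aggregation
# over the same table. Table and data are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# OLTP-style: many short INSERT/UPDATE transactions on individual rows.
with con:
    con.execute("INSERT INTO sales VALUES ('north', 100.0)")
    con.execute("INSERT INTO sales VALUES ('north', 250.0)")
    con.execute("INSERT INTO sales VALUES ('south', 300.0)")

# OLAP-style: a complex read-only query aggregating many rows.
rows = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('north', 350.0), ('south', 300.0)]
```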
The following table summarizes the major differences between OLTP and OLAP system design.
OLTP System (Online Transaction Processing; operational system) vs. OLAP System (Online Analytical Processing; data warehouse):

Source of data
  OLTP: Operational data; OLTP systems are the original source of the data.
  OLAP: Consolidation data; OLAP data comes from the various OLTP databases.
Purpose of data
  OLTP: To control and run fundamental business tasks.
  OLAP: To help with planning, problem solving, and decision support.
Inserts and Updates
  OLTP: Short and fast inserts and updates initiated by end users.
  OLAP: Periodic long-running batch jobs refresh the data.
Queries
  OLTP: Relatively standardized, simple queries returning few records.
  OLAP: Often complex queries involving aggregations.
Processing Speed
  OLTP: Typically very fast.
  OLAP: Depends on the amount of data involved; batch refreshes and complex queries may take hours.
Space Requirements
  OLTP: Can be relatively small if historical data is archived.
  OLAP: Larger, due to aggregation structures and historical data.
Database Design
  OLTP: Highly normalized, with many tables.
  OLAP: Typically de-normalized, with fewer tables; uses star and/or snowflake schemas.
Backup and Recovery
  OLTP: Backup religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
  OLAP: Instead of regular backups, some environments may simply reload the OLTP data as a recovery method.
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the "high-level" application of
particular data mining methods. It is of interest to researchers in machine learning,
pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition
for expert systems, and data visualization.
The unifying goal of the KDD process is to extract knowledge from data in the
context of large databases.
It does this by using data mining methods (algorithms) to extract (identify) what is
deemed knowledge, according to the specifications of measures and thresholds, using
a database along with any required preprocessing, subsampling, and transformations
of that database.
An Outline of the Steps of the KDD Process
The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
1. Developing an understanding of
o the application domain
o the relevant prior knowledge
o the goals of the end-user
2. Creating a target data set: selecting a data set, or focusing on a subset of
variables, or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing.
o Removal of noise or outliers.
o Collecting necessary information to model or account for noise.
o Strategies for handling missing data fields.
o Accounting for time sequence information and known changes.
4. Data reduction and projection.
o Finding useful features to represent the data depending on the goal of the
task.
o Using dimensionality reduction or transformation methods to reduce the
effective number of variables under consideration or to find invariant
representations for the data.
5. Choosing the data mining task.
o Deciding whether the goal of the KDD process is classification,
regression, clustering, etc.
6. Choosing the data mining algorithm(s).
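The steps above can be walked through on a toy data set: select a target set, clean incomplete records, project down to one variable, then apply a simple mining step. The records and the income threshold are illustrative assumptions, and the "algorithm" here is deliberately trivial.

```python
# A toy walk-through of the KDD steps above on (age, income) records:
# target selection, cleaning, reduction, then a simple mining step.
# The data and the 50000 threshold are illustrative assumptions.
records = [
    {"age": 25, "income": 30000},
    {"age": 40, "income": None},   # incomplete record
    {"age": 35, "income": 52000},
    {"age": 50, "income": 81000},
]

# Step 2: create a target data set (here, adults only).
target = [r for r in records if r["age"] >= 18]

# Step 3: data cleaning -- drop records with missing fields.
clean = [r for r in target if r["income"] is not None]

# Step 4: data reduction/projection -- keep only the income variable.
incomes = [r["income"] for r in clean]

# Steps 5-6: choose a mining task and algorithm -- here, a trivial
# classification against an assumed threshold.
labels = ["high" if x > 50000 else "low" for x in incomes]
print(labels)  # ['low', 'high', 'high']
```

In a real KDD run, each of these lines would be a substantial stage (imputation rather than deletion, dimensionality reduction rather than a single projection, a learned model rather than a fixed threshold), but the order of the stages is the same.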
What is a star schema? The star schema architecture is the simplest data warehouse
schema. It is called a star schema because the diagram resembles a star, with
points radiating from a center. The center of the star consists of a fact table, and the
points of the star are the dimension tables. Usually the fact tables in a star schema
are in third normal form (3NF), whereas the dimension tables are de-normalized.
Despite being the simplest architecture, the star schema is the most
commonly used nowadays and is recommended by Oracle.
Fact Tables
A fact table typically has two types of columns: foreign keys to dimension tables,
and measures, i.e. columns that contain numeric facts. A fact table can contain fact
data at a detail or aggregated level.
Dimension Tables
A dimension is a structure, usually composed of one or more hierarchies, that
categorizes data. If a dimension has no hierarchies and levels, it is called a flat
dimension or list. The primary keys of the dimension tables are part of the
composite primary key of the fact table. Dimensional attributes help to describe the
dimensional value; they are normally descriptive, textual values. Dimension tables
are generally smaller in size than fact tables.
Typical fact tables store data about sales, while dimension tables store data about
geographic regions (markets, cities), clients, products, times, and channels.
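The structure described above can be sketched with an in-memory SQLite database: a fact table holding foreign keys and a numeric measure, joined to two small dimension tables. The table names and data are illustrative assumptions.

```python
# A minimal star schema: a central fact table with foreign keys and a
# numeric measure, surrounded by de-normalized dimension tables.
# Names and data are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    region_id  INTEGER REFERENCES dim_region(region_id),
    amount     REAL    -- the numeric measure
);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_region  VALUES (1, 'north'),  (2, 'south');
INSERT INTO fact_sales  VALUES (1, 1, 10.0), (1, 2, 20.0), (2, 1, 5.0);
""")

# The typical star-schema query: join the fact table to a dimension
# and aggregate the measure.
rows = con.execute("""
    SELECT p.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)  # [('gadget', 5.0), ('widget', 30.0)]
```

Note how every query touches only the fact table plus the one or two dimensions it needs, which is the "small number of tables to join" advantage listed below.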
The main characteristics of a star schema:
Simple structure -> easy-to-understand schema
Good query effectiveness -> small number of tables to join
Relatively long load time for dimension tables -> de-normalization and data
redundancy mean the tables can be large
The most commonly used schema in data warehouse implementations -> widely
supported by a large number of business intelligence tools
Snowflake schema
The snowflake schema architecture is a more complex variation of the star schema
used in a data warehouse, because the tables which describe the dimensions are
normalized.
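The normalization that distinguishes a snowflake schema can be shown by splitting the product dimension of a star schema into a product table and a separate category table, so queries need one extra join. The names and data are illustrative assumptions.

```python
# A snowflake-schema sketch: the product dimension is normalized into
# product and category tables, adding one join hop to every query.
# Names and data are illustrative assumptions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT,
                           category_id INTEGER REFERENCES dim_category(category_id));
CREATE TABLE fact_sales   (product_id INTEGER, amount REAL);
INSERT INTO dim_category VALUES (1, 'tools');
INSERT INTO dim_product  VALUES (1, 'widget', 1), (2, 'gadget', 1);
INSERT INTO fact_sales   VALUES (1, 10.0), (2, 5.0);
""")

# The extra hop a snowflake query pays: fact -> product -> category.
total = con.execute("""
    SELECT c.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.name
""").fetchone()
print(total)  # ('tools', 15.0)
```

The trade-off is the usual one: less redundancy in the dimension tables, at the cost of more joins per query.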
Web-enabling the data warehouse means using the web for information
delivery and integrating the clickstream data from the corporate web site
for analysis.
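The clickstream integration mentioned above amounts to parsing web-server log lines into records that can be loaded into the warehouse and aggregated. The log format and field names below are illustrative assumptions.

```python
# A small sketch of clickstream integration: parse log lines into
# records, then aggregate page views per user, warehouse-style.
# The log format and fields are illustrative assumptions.
clickstream = [
    "2024-01-05T10:00:01 user42 /products/widget",
    "2024-01-05T10:00:09 user42 /cart",
    "2024-01-05T10:01:30 user7  /products/gadget",
]

records = []
for line in clickstream:
    timestamp, user, page = line.split()
    records.append({"ts": timestamp, "user": user, "page": page})

# Aggregate page views per user.
views = {}
for r in records:
    views[r["user"]] = views.get(r["user"], 0) + 1
print(views)  # {'user42': 2, 'user7': 1}
```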