Copyright: Attribution Non-Commercial (BY-NC)
1. Define the term data warehouse? Give the 3 attributes is one example; this is helpful or even major activities of a data warehouse. necessary for logic based methods. Ans: Collection of key pieces of information to arrive 3. Treatment of missing values: - There is not simple at suitable managerial decisions is known as and safe solution for the cases where some Data ware House. Three major activities of data of the attributes have significant number of missing warehouse are following values. a) populating the data 4. Data reduction: - Reasons for data reduction are in b) day to day management most cases two fold: either the data maybe c) accommodating changes. too big for the program, or expected time for obtaining the solution might be too long. 2. What is the star flake schema? How it is different with respect to star schema? 6. Explain the concept of data warehousing and Ans: Star flake schema is schema that uses a data mining. combination of demoralized star and normalized snow Ans. A data warehouse is a collection of a large flake schemas. There are most appropriate in decision amount of data and these data is the pieces of support data warehouses. information Which is use to suitable managerial The star schema looks good section to the problem of decisions. (a storehouse of data) eg:- student data warehousing but it simply states that one should to the details of the citizens of a city or the sales of identify the fact and store it in read only area. previous years or the number of patients that came to a hospital with different ailments. Such data 3. What are the functions of schedule manager? becomes a storehouse of information. Ans: A scheduling is a key for successful warehouse Data mining is the process of exploration and management and their function are:- analysis, by automatic or semiautomatic means, of a) Handle multiple queues large quantities of data in order to discover meaningful b) Maintain job schedules across outages. patterns and rules. 
The main concept of data c) Support starting and stopping of queries etc. mining using a variety of techniques to identify nuggets of information or decision making knowledge 4. How to categorize data mining systems? in bodies of data, and extracting these in such a way Ans: a) Classification according to the type of data that they can be put to use in the areas such as source mined:- this classification categorizes decision support, prediction, forecasting and data mining systems according to the type of data estimation. handled such as spatial data, multimedia data, 7. What is Meta data? Give example. time-series data, text data, World Wide Web, etc. Ans. Meta data is simply data about data which b) Classification according to the data model drawn normally describe the objects & their quantity, their on:- this classification categorizes data mining size systems based on the data model involved such as and how data are stored. It is helpful in query relational database, object-oriented database, management. Eg:- data warehouse, transactional, etc. 8. What are the requirements for clustering? c) Classification according to the king of knowledge Explain. discovered:- this classification categorizes Ans: Clustering is a division of a data into groups of data mining systems based on the kind of knowledge similar objects. Each group called cluster, consists discovered. of objects that are similar between themselves and d) Classification according to mining techniques dissimilar to objects of other groups. Or we can say used:- This classification categorizes data mining Clustering is a challenging and interesting field systems according to the data analysis approach used potential application poses their own special such as machine learning, neural networks, requirement. genetic algorithms, statistics, visualization, database The following are typical requirements of clustering. oriented or data warehouse-oriented, etc. 
Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data 5. Explain the possible techniques for data cleaning. objects However, a large database may contain Ans: 1. Data normalization: - For example decimal millions of objects. scaling into the range (0,1), or standard Ability to deal with different types of attributed: deviation normalization. Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require Education: - It plays two roles: a)To make people clustering other types of data, such as binary, comfortable with DWH concept. b) categorical (nominal), and ordinal data, or mixtures of To aid the prototyping activity. these data types. Business Requirements: - It is essential that the Discovery of clusters with arbitrary shape: Many business requirements are fully understood by the data clustering algorithms determined clusters based on warehouse Euclidean or Manhattan distance measures. planner. This is Algorithms based on such distance measures tend to more easily said find than done spherical clusters with similar size and density. because future However, a cluster could be of any shape. modifications Minimal requirements for domain knowledge of are hardly clear determine input parameters: Many clustering even algorithms require users to input certain parameters in to top level cluster analysis (such as the number of desired planners, let clusters). alone the IT professionals. The clustering results can be quite sensitive to input Technical blue prints: - This is the stage where the parameters. overall architecture that satisfies the requirements is Ability to deal with noisy data: Most real-world delivered. databases contain outliers or missing, unknown, Building the vision: - Here the first physical erroneous infrastructure becomes available. The major data. Some clustering algorithms are sensitive to such infrastructure data and may lead to clusters of poor quality. 
components are set up; first stages of loading and Insensitivity to the order of input records: Some generation of data start up. clustering algorithms are sensitive to the order of input History load: - Here the system made fully data; for example, the same set of data, when operational by loading the required history into the presented with different orderings to such an warehouse. algorithm, may Now the warehouse becomes fully “loaded” and is generated dramatically different clusters. ready to take on live “queries”. High dimensionality: A database or a data warehouse Adhoc Query: -Now we configure a query tool to can contain several dimensions or attributes. operate against the data warehouse. Many clustering algorithms are good at handling low- The users can ask questions in a typical format. dimensional data, involving only two to three Automation: - Extracting and loading of data from the dimensions. sources, transforming the data, backing up, Constraint-based clustering: Real-world applications restoration, achieving, aggregations, and monitoring may need to perform clustering under various query profiles are the operational process of this phase. kinds of constraints. Suppose that our job is to choose Extending Scope: - There is not single mechanism by the locations for a given number of new automatic which this can be achieved. As and when needed, cash a new set of data may be added; new formats may be -dispensing machines (i.e., ATMs) in a city. included or may be even involve major changes. Interpretability and usability: Users expect Requirement Evolution: - Business requirements will clustering results to be interpretable, Comprehensible, constantly change during the life of the warehouse. and usable. Hence, the process that supports the warehouse also 9. Explain in brief the data warehouse delivery needs to be constantly monitored and modified. process. Ans: Data warehouse delivery process follows the Q10. What is the need for partitioning of data? 
following steps: Explain the usefulness of partitioning of data. IT Strategy: - Data warehouse cannot work in Ans: In most warehouses the size of the fact tables ISOLATION whole IT Strategy of company needed. tends to become very large. This leads to several Business case analysis: - Important of various problem of management backup, processing etc. which component of business/over all understand of business partitioning and each fact table part into is must. separate partition. This technique allows data to be Designer must have sound knowledge’s of business scanned to be minimized, without the overhead activity. of using an index. This improves the overall efficiency It may be very difficult to change the strategy later, of the system partitioning data help in – because data mart format also have to be changed a). Assist in better management of data. b). Ease of backup/recovery since the volume is less. Q14. What are the issues in data mining. c). The star schemas with partitions produce better Ans: There are may issues in data mining such as- performance. i). Security and Social issue: security is an important issue with any data collection that is shared or Q11. Explain the steps needed for designing the in tended to be used for strategic decision making. summary table. ii). User Interface Issues: The knowledge discovered Ans: Summary table are designed by following steps – by data mining tools is useful. It is interesting i. Decide the dimensions along which and understanding correlating personal data with other aggregation is to be done. information. ii. Determine the aggregation of multiple iii). Mining Methodology issues: these issues pertain facts. to the data mining approaches applied and their iii. Aggregate multiple facts into the summary limitations. table. iv). Performance issues: many artificial intelligence iv. Determine the level of aggregation and the & statistical methods exist for data analysis & context at embedding. interpretation v. 
Design time into the table. v). Data Source Issues: There are many issues related vi. Index the Summary table. to the data sources, some are partial such as the diversity Q12. What is an event? How does an event of data types, while others are philosophical like the manager manages the events? Name any 4 events. data glut problem. Ans: An event is a measurable, observable occurrence at defined action. The event manager is software that continuously monitors the system for the occurrence at the event and then takes any action that is Q15. Define data mining query in term of suitable. List of common events- i.Running out of primitives. memory space ii. A process dying. iii. A process using Ans: a) Growing Data Volume: The main reason for accessing resource iv. I/O errors. necessity of automated computer systems for intelligent data analysis is the enormous volume of Q13. What are the reasons for data marting? existing and newly appearing data that require Mention their advantages and disadvantages. processing. The amount of data accumulated each day Ans: There are so many reasons for data marting, such by various business, scientific, and governmental as - Since the volume of data scanned is organizations around the world is daunting. small, they speed up the query processing, Data can be b) Limitations of Human Analysis: Two other structured in a form suitable for a user problems that surface when human analysts’ process access too, Data can be segmented or partitioned so data are the inadequacy of the human brain when that they can be used on different platforms searching for complex multifactor dependencies in and also different control strategies become applicable. data, and the lack of objectiveness in such an analysis. Advantages:- c) Low Cost of Machine Learning: While data i).Since the volume of data scanned is small, they mining does not eliminate human participation in speed up the query processing. 
solving the task completely, it significantly simplifies ii).Data can be structured in a form suitable for a user the job and allows an analyst who is not a access too. professional in statistics and programming to manage iii).Data can be segmented or partitioned so that they the process of extracting knowledge from data. can be used on different platform Disadvantages:- i).The lot of setting up and operating data mart is quite Q16. Explain in brief the data mining applications. high. Ans: Data mining has many varied field of application ii).Once a data marting strategy is put in place, the data which is listed below: mart format became fined. Retail/Marketing: •Identify buying patterns from customers.•Find ii). Initially every item is member of a set of 1- associations among customer demographic candidate item sets. The support count of each characteristics candidate •Predict response to mailing campaigns item sets is calculated and items with a support count •Market basket analysis less the minimum required support count are removed Banking: as candidate, the remaining item is joined to create 2 •Detect patterns of fraudulent credit card use•Identity candidate item sets that each comprise of two items or ‘loyal’ customers members. •Predict customers, determine credit card spending iii). The support count of each two member item set is •Identify stock trading calculated from the database of transactions Insurance and Health Care: and 2 member item set that occur with a support count •Claims analysis greater than or equal to minimum support count are •Identify behavior pattern of risky customers.•Identify used to create 3 candidate item sets. The process in fraudulent behavior steps 1 and 2 is repeated generating 4 and 5. Transportation: iv). All candidate item sets are generated with a •Determine the distribution schedules among support count greater them the minimize support count outlets.•Analyze loading patterns from a set of request item sets. Medicine: v). 
Apriori recursively generates all the subsets of each •Characterize patient behavior to predict office visits. frequent item set and creates association rules based •Identify successful medical therapies for different on subsets with a illnesses. confidence greater than the minimum confidence. Q17. With example explain the decision tree There are many working concept. variations of Apriori Ans: Decision tree is a classifier in the form of a tree Algorithm have been structure where each node is either – proposed that focus on a. A least node, indicating a class at instances. improving the b. A decision node that specifies some test to be efficiency of the original algorithm – carried out on a single attribute value, with one A). Hash based Technique:- It is used to reduce the branch and sub tree for each possible outcome of the size of the candidate K – item sets. test. A decision tree can be used to B). Transaction Reduction:- A transaction that does classify an instance by starting at the root and moving not contain any frequent K- item through it until a leaf node, which sets cannot contain any frequent (K+1) item sets. provides the classification of the instance. Example :- C). Sampling: - In this way, we trade off some degree Decision making in the Bombay stock of accuracy against efficiency. market - Assume that the major factors affecting the D). Dynamic Item set Counting: - It was proposed in Bombay stock markets are – which the database is partitioned What it did yesterday, What the New Delhi market is into blocks masked by start points. doing today, Bank Interest Rate, Unemployment rate.
Q19. In brief explain the process of data
preparation. Q18. What are the implementation steps of data mining with apriori analysis and how the efficiency of this algorithm can be improved? Ans: - Implémentation Steps – i). The Apriori algorithm would analyze all the transactions in a dataset for each items support count. Any items that support count less than the minimum support count is removed from the pool of candidate application. Ans: Data preparation is divided into different The environment should enable the user to experiment selection data cleaning, formation of new data with different coding schemes, store partial results and data formatting. make i). Select data- Data quality properties: completeness attributes discrete, create time series out of historic and correctness. Technical constraints such as data, select random sub-samples, separate test sets and limits on data so on. volume or data type. 6. Integrate with decision support system: Data ii). Data cleaning: - data normalization, data mining looks for hidden data that cannot easily be smoothing, treatment of missing values, data found reduction. using normal query techniques. A knowledge iii). New data construction: - this step represents discovery process always starts with traditional constructive operations on selected on selected decision support data, system activities and from there we magnify in on which includes. Derivation of new attributes from two interesting parts of the data set. or more existing attributes. Generation of new 7. Choose extendible architecture: New techniques records, for pattern recognition and machine learning are under Data transformation. development and we also see many developments in iv). Data formatting: - reordering of attributes or the database area. It is advisable to choose an records, changes related to the constraints of architecture that enables us to integrate new tools at modeling tools. later stages. 8. 
Support heterogeneous databases: Not all the necessary data is necessarily to be found in the data Q20.What are the guidelines for KDD environment. warehouse. Ans: The following are the guidelines for KDD Sometimes we will need to enrich the data warehouse environment are:- with information from unexpected sources, such as 1. Support extremely large data sets: Data mining information deals with extremely large data sets consisting of brokers or with operational data that is not stored in billions of records and without proper platforms to our regular data warehouse. store and handle these volumes of data, no reliable 9. Introduce client/server architecture: A data data mining is possible. Parallel servers with databases mining environment needs extensive reporting optimized for decision support system oriented queries facilities. are useful. Fast and flexible access to large data sets is Client/server is a much more flexible system which of very important. moves the burden of visualization and graphical 2. Support hybrid learning: Learning tasks can be techniques divided into three areas: a. classification tasks b. from the servers to the local machine. We can then knowledge engineering tasks c. problem-solving optimize our database server completely for data tasks. All algorithms can not perform well in all the mining. above areas as discussed in previous chapters. 10. Introduce cache optimization: The learning Depending on our requirement one has to choose the algorithm in a data mining environments should be appropriate one. optimized 3. Establish a data warehouse: A data warehouse for store the data in separate tables on to cache large contains historic data and is subject oriented and static, portions in internal memory type of database access. that is, users do not update the data but it is created on a regular time-frame on the basis of the operational Q21. Explain data mining for financial data data of an organization. analysis. 4. 
Introduce data cleaning facilities: Even when a Ans: Financial data collected in the banking and data warehouse is in operation, the data is certain to financial industries are often relatively complete, contain all sorts of heterogeneous mixture. Special reliable tools for cleaning data are necessary and some and of high quality, which facilitates systematic data advanced tools are available, especially in the field of analysis and data mining. The various issues are – de-duplication of client files. a) Design and construction of data warehouses for 5. Facilitate working with dynamic coding: Creative multidimensional data analysis and data mining: coding is the heart of the knowledge discovery Data warehouses need to be constructed for banking process. and financial data. Multidimensional data analysis methods should be used to analyze the general properties of such data. Data warehouses, data cubes, multifeature and discovery-driven data cubes, characteristic and comparative analyses and outlier a Q23. What is the importance of period of retention nalyses all play important roles in financial data of data? analysis and mining. Ans: A businessman says he wants to the data to be b) Loan payment prediction and customer credit retained for as long as possible 5, 10, 15 years policy analysis: Loan payment prediction and the longer the better. The more data we have, the better customer credit analysis are critical to the business of a the information generated. But such a view bank. Many factors can strongly or weakly thing is unnecessarily simplistic. If a company wants influence loan payment performance and customer to have an idea of the recorder levels, details credit rating. Data mining methods, such as of sales for last 6 months to one year may be enough. feature selection and attribute relevance ranking may Sales pattern of 5 years is unlikely to be relevant help identify important factors and eliminate today. 
So, It is important to determine the retention irrelevant ones. period for each function but once it is drawn, it c) Classification and clustering of customers for becomes easy to decide on the optimum value of data targeted marketing: Classification and clustering to be stored. methods can be used for customer group identification and targeted marketing. Effective clustering and Q25. Give the advantages and disadvantages of collaborative filtering methods can help identify equal segment partitioning. customer groups, associate a new customer with an Ans: The advantage is that the slots are reusable. appropriate customer group and facilitate targeted Suppose we are sure that we will no more need marketing. the data of 10 years back, then we can simply delete d) Detection of money laundering and other the data of that slot and use it again. Of financial crimes: To detect money laundering and course there is a serious draw back in the scheme – if other the partitions tend to differ too much in size. financial crimes, it is important to integrate The number of visitors visiting a till station, say in information from multiple databases, as long as they summer months, will be much larger than in are winter months and hence the size of the segment potentially related to the study. Multiple data analysis should be big enough to take case of the summer rush. tools can then be used to detect unusual patterns, Q24. What are the facts to optimize the cost-benefit such as large amounts of cash flow at certain periods, ratio. by certain group of people and so on. Ans: The facts to optimize the cost-benefit ratio are: Q22. What is data i). Understand the significance of the data stored with warehouse? Explain respect to time. Only those data that are the architecture of still needed for processing need to be stored. data warehouse. ii). 
Find out whether maintaining of statistical samples Ans: It is a large of each of the subsets could be resorted collection of Data to instead of storing the entire data. and set of Process iii). Remove certain columns of the data, if you feel it managers that use is no more essential. this data to make the iv). Determine the use of intelligent and non intelligent information available. The architecture for data keys. warehouse indicated below: It only gives the major v). Incorporate time as one of the factors into the data items that make up a data ware house. table. This can help in indicating the usefulness The size & complexity of each items depends on the of the data over a period of time and removal of actual size of warehouse. The extracting & loading absolute data. process are taken care of by the load manager. The vi). Partition the fact table. A record may contain a processes of cleanup &transformation of data as also large number of fields, only a few of which are of back up & archiving are duties of the warehouse actually needed in each case. It is desirable to group manager, while the query manager, as the name those fields which will be useful into smaller implies is to take case of query management. tables and store separately. Q26. Explain the Query generation. using advertisements, coupons and various kinds of Ans: Meta data is also required to generate queries. discounts and bonuses to promote products The query manger uses the metadata to build a history and attract customers. Careful analysis of the of all queries run and generator a query profile for effectiveness of sales campaigns can help improve each user, or group of uses. We simply list a few of company profits. Multi-dimensional analysis can be the commonly used meta data for the query. The used for these purposes by comparing the names are self explanatory. o Query- Table accessed- amount of sales and the number of transactions Column accessed, Name, Reference identifier. 
o containing the sales items during the sales period Restrictions applied- Column name, Table name, versus those containing the same items before or after Reference identifier ,Restrictions. o Join criteria the sales campaign. applied-Column name, Table name, Reference d) Customer retention – analysis of customer Identifier, Column name, Table name, Reference loyalty: With customer loyalty card information, identifier. o Aggregate function used-Column name, one can register sequences of purchases of particular Reference identifier, Aggregate function. o Syntax o customers. Customer loyalty and purchase Resources o Disk trends can be analyzed in a systematic way, Goods purchased at different periods by the same Q27. Explain data mining for retail industry customer can be grouped into sequences. Sequential application. patterns mining can then be used to investigate Ans: The retail industry is a major application area for changes in customer consumption or loyalty and data mining since it collects huge amount suggest adjustments on the pricing an variety of of data on sales, customer shopping history, goods goods in order to help retain customers and attract new transportation, and consumption and service customers. records and so on. The quantity of data collected e) Purchase recommendations and cross-reference continues to expand rapidly, due to web or of items: Using association mining for sales e-commerce. Today, many stores also have web sites records, one may discover that a customer who buys a where customers can make purchases on-line. particular brand of bread is likely to buy Retail data mining can help identify customer buying another set of items. Such information can used to behaviors, discover customer shopping form purchase recommendations. 
Purchase patterns and trends, improve the quality of customer recommendations can e advertised on the web, in service, achieve better customer retention weekly flyers or on the sales receipts to help and satisfaction, enhance goods consumption ratios, improve customer service, aid customers in selecting design more effective goods transportation items and increase sales. and distribution policies and reduce the cost of the business. The following are few activities of 37. Define aggregation. Explain steps require data mining are carried out in the retail industry. designing summary table. a) Design and construction of data warehouses on Ans: Association: - A collection of items and a set of the benefits of data mining: The first aspect records, which contain is to design a warehouse. Here it involves deciding some number of items from the given collection, an which dimensions and levels to include and association function is an what preprocessing to perform in order to facilitate operation against this set of records which return quality and efficient data mining. affinities or patterns that exist b) Multidimensional analysis of sales, customers, among the collection of items. Summary table are products, time and region: The retail industry designed by following the steps requires timely information regarding customer needs, given as follows: a) decide the dimensions along product sales, trends and fashions as well which aggregation is to be done. as the quality, cost, profit and service of commodities. b) Determine the aggregation of multiple facts. c) It is therefore important to provide Aggregate multiple facts into powerful multidimensional analysis and visualization the summary table. d) Determine the level of tools, including the construction of aggregation and the extent of sophisticated data cubes according to the needs of data embedding. e) Design time into the table. f) Index the analysis. summary table. 
c) Analysis of the effectiveness of sales campaigns: The retail industry conducts sales campaigns, and analysis of their effectiveness is an important data mining activity.

Q29. Explain the h/w partitioning.
Ans: The data warehouse design process should try to maximize the performance of the system. One way to ensure this is to optimize the database design with respect to the specific hardware architecture. Obviously, the exact details of the optimization depend on the hardware platform. Normally the following guidelines are useful:
i) Maximize the processing, disk and I/O operations.
ii) Reduce bottlenecks at the CPU and I/O.

31. With a diagram, explain the architecture of the warehouse manager.
Ans: The warehouse manager is a component that performs all operations necessary to support the warehouse management process. Unlike the load manager, the warehouse management process is driven by the extent to which the operational management of the data warehouse has been automated.

Q28. Explain the system management tools.
Ans: 1. Configuration Managers: This tool is responsible for setting up and configuring the hardware. Since several types of machines are being addressed, several concepts like machine configuration, compatibility etc. are to be taken care of, as is the platform on which the system operates. Most configuration managers have a single interface to allow the control of all types of issues.
2. Schedule Managers: Scheduling is the key to successful warehouse management. Almost all operations in the warehouse need some type of scheduling. Every operating system has its own scheduler and batch control mechanism, but these schedulers may not be capable of fully meeting the requirements of a data warehouse; hence it is desirable to have specially designed schedulers to manage the operations. Some of the capabilities that such a manager should have include the following:
• Handling multiple queues
• Interqueue processing capabilities
• Maintaining job schedules across system outages
• Dealing with time zone differences.
3. Event Managers: An event is defined as a measurable, observable occurrence of a defined action. If this definition seems vague, it is because it encompasses a very large set of operations. A partial list of the common events that need to be monitored is as follows:
• Running out of memory space
• A process dying
• A process using excessive resources
• I/O errors
4. Database Managers: The database manager will normally also have a separate (and often independent) system manager module. The purpose of these managers is to automate certain processes and simplify the execution of others. Some of the operations are listed as follows:
• Ability to add/remove users
 o User management
 o Manipulating user quotas
 o Assigning and deassigning user profiles
• Ability to perform database space management
5. Backup Recovery Managers: Since the data stored in a warehouse is invaluable, the need to back up and recover lost data cannot be overemphasized. There are three main features for the management of backups:
• Scheduling
• Backup data tracking
• Database awareness
6. Back propagation: Back propagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the "backwards" direction, that is, from the output layer, through each hidden layer, down to the first hidden layer.

Q30. Explain horizontal and vertical partitioning and differentiate them.
Ans: HORIZONTAL PARTITIONING - This essentially means that the table is partitioned after the first few thousand entries, the next few thousand entries, and so on. In most cases not all the information in the fact table is needed all the time; thus horizontal partitioning helps to reduce the query access time by directly cutting down the amount of data to be scanned by the queries.
a) Partition by time into equal segments: This is the most straightforward method of partitioning, by months or years etc. It helps if queries often concern fortnightly or monthly performance, sales etc.
b) Partition by time into different-sized segments: This is a very useful technique to keep the physical table small and the operating cost low.
c) Partition on another dimension: Data collection and storage need not always be partitioned based on time, though that is a very safe and relatively straightforward method.
d) Partition by the size of the table: We may not be sure of any dimension on which partitions can be made; in this case it is ideal to partition by size.
e) Using round-robin partitions: Once the warehouse holds its full amount of data, a new partition can be created only by reusing the oldest partition. Metadata is then needed to note the beginning and ending of the historical data.
VERTICAL PARTITIONING - A vertical partitioning schema divides the table vertically: each row is divided into two or more partitions. It is useful in two situations:
i) We may not need to access all the data pertaining to a student all the time. For example, we may need only his personal details such as age and address, or only the examination details such as marks scored. Then we may choose to split them into separate tables, each containing data only about the relevant fields. This speeds up access.
ii) The number of fields in a row may become inconveniently large, with each field itself made up of several subfields. In such a scenario it is always desirable to split the table into two or more smaller tables.
Vertical partitioning itself can be achieved in two different ways: (i) normalization and (ii) row splitting.

32. Explain the steps needed for designing the summary table.
Ans: Summary tables are designed by following these steps:
i. Decide the dimensions along which aggregation is to be done.
ii. Determine the aggregation of multiple facts.
iii. Aggregate multiple facts into the summary table.
iv. Determine the level of aggregation and the extent of embedding.
v. Design time into the table.
vi. Index the summary table.

33. What are the reasons for data marting? Mention their advantages and disadvantages.
Ans: There are many reasons for data marting: since the volume of data scanned is small, data marts speed up query processing; data can be structured in a form suitable for a user access tool; and data can be segmented or partitioned so that it can be used on different platforms, with different control strategies becoming applicable.

34. Define a data mining query in terms of primitives.
Ans: A data mining query is defined in terms of the following primitives:
i. Task-relevant data - This is the database portion to be investigated.
ii. The kinds of knowledge to be mined - This specifies the data mining functions to be performed, such as characterization, discrimination, association, classification, clustering and evolution analysis.
iii. Background knowledge - The user can specify background knowledge, or knowledge about the domain to be mined, which is useful for guiding the KDD process.
iv. Interestingness measures - These functions are used to separate uninteresting patterns from knowledge. Different kinds of knowledge may have different interestingness measures.
v. Presentation and visualization of discovered patterns - Users can choose from different forms for knowledge presentation, such as rules, tables, charts, graphs, decision trees and cubes.

35. What are the implementation steps of data mining with Apriori analysis, and how can the efficiency of this algorithm be improved?
Ans: Implementation steps -
i) The Apriori algorithm analyses all the transactions in a dataset for each item's support count. Any item whose support count is less than the minimum support count is removed from the pool of candidate items.
ii) Initially every item is a member of a set of 1-candidate itemsets. The support count of each candidate itemset is calculated, and items with a support count less than the minimum required support count are removed as candidates; each remaining item is joined with the others to create 2-candidate itemsets that each comprise two items or members.
iii) The support count of each two-member itemset is calculated from the database of transactions, and the 2-member itemsets that occur with a support count greater than or equal to the minimum support count are used to create 3-candidate itemsets. The process in steps i and ii is repeated, generating 4- and 5-candidate itemsets, and so on.
iv) All candidate itemsets generated with a support count greater than the minimum support count form the set of frequent itemsets.
v) Apriori recursively generates all the subsets of each frequent itemset and creates association rules based on subsets with a confidence greater than the minimum confidence.
Many variations of the Apriori algorithm have been proposed that focus on improving the efficiency of the original algorithm:
a) Hash-based technique: used to reduce the size of the candidate k-itemsets.
b) Transaction reduction: a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
c) Sampling: here we trade off some degree of accuracy against efficiency.
d) Dynamic itemset counting: the database is partitioned into blocks marked by start points.
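The level-wise Apriori steps described above can be sketched as follows. This is a toy illustration only: the transaction list and the minimum support count of 2 are invented, and the rule-generation step (v) is omitted.

```python
# Toy transactions and a minimum support count, both invented for illustration.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
min_support = 2

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# Steps i/ii: frequent 1-itemsets are those meeting the minimum support count.
frequent = [frozenset([i]) for i in sorted({i for t in transactions for i in t})
            if support({i}) >= min_support]

# Steps iii/iv: join (k-1)-survivors into k-candidates and prune by support,
# repeating until no candidate survives.
all_frequent = list(frequent)
k = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent += frequent
    k += 1

print(sorted(tuple(sorted(s)) for s in all_frequent))
# -> [('a',), ('a', 'b'), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```

Note how {a, b, c} is pruned: its support count is 1, below the minimum, which is exactly the pruning that the transaction-reduction and hash-based variations try to do more cheaply.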
36. Explain multidimensional schemas.
Ans: This is a very convenient method of analyzing data when it goes beyond normal tabular relations. For example, a store maintains a table of each item it sells over a month, in each of its 10 outlets. This is a 2-dimensional table. On the other hand, if the company wants the data for all items sold across its outlets, it can obtain it simply by superimposing the 2-dimensional tables for each of these items, one behind the other; it then becomes a 3-dimensional view. A query, instead of looking for a 2-dimensional rectangle of data, will then look for a 3-dimensional cuboid of data. There is no reason why the dimensioning should stop at 3 dimensions: in fact almost all queries can be thought of as extracting a multi-dimensional unit of data from a multidimensional volume of the schema. A lot of design effort goes into optimizing such searches.
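The 3-dimensional view described above can be sketched with a plain dictionary standing in for the cube. All item, outlet and month names and the figures are invented for illustration.

```python
# A tiny 3-dimensional "cube": one cell per (item, outlet, month) coordinate.
cube = {
    ("soap",  "outlet1", "jan"): 100,
    ("soap",  "outlet2", "jan"): 80,
    ("soap",  "outlet1", "feb"): 90,
    ("bread", "outlet1", "jan"): 200,
}

# A 2-dimensional "rectangle" of data: fix one dimension (item = soap).
soap_slice = {k: v for k, v in cube.items() if k[0] == "soap"}

# Aggregating over the remaining dimensions collapses the cube along them.
soap_total = sum(soap_slice.values())
print(soap_total)   # -> 270
```

Fixing one coordinate ("slicing") is exactly the 2-D rectangle the answer describes, and leaving it free gives the 3-D cuboid; real OLAP engines optimize these lookups with indexes and precomputed aggregates rather than a linear scan.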