Copyright: Attribution Non-Commercial (BY-NC)
1. Define the term data warehouse? Give the 3 attributes is one example; this is helpful or even major activities of a data warehouse. necessary for logic based methods. Ans: Collection of key pieces of information to arrive 3. Treatment of missing values: - There is not simple at suitable managerial decisions is known as and safe solution for the cases where some Data ware House. Three major activities of data of the attributes have significant number of missing warehouse are following values. a) populating the data 4. Data reduction: - Reasons for data reduction are in b) day to day management most cases two fold: either the data maybe c) accommodating changes. too big for the program, or expected time for obtaining the solution might be too long. 2. What is the star flake schema? How it is different with respect to star schema? 6. Explain the concept of data warehousing and Ans: Star flake schema is schema that uses a data mining. combination of demoralized star and normalized snow Ans. A data warehouse is a collection of a large flake schemas. There are most appropriate in decision amount of data and these data is the pieces of support data warehouses. information Which is use to suitable managerial The star schema looks good section to the problem of decisions. (a storehouse of data) eg:- student data warehousing but it simply states that one should to the details of the citizens of a city or the sales of identify the fact and store it in read only area. previous years or the number of patients that came to a hospital with different ailments. Such data 3. What are the functions of schedule manager? becomes a storehouse of information. Ans: A scheduling is a key for successful warehouse Data mining is the process of exploration and management and their function are:- analysis, by automatic or semiautomatic means, of a) Handle multiple queues large quantities of data in order to discover meaningful b) Maintain job schedules across outages. patterns and rules. 
The main concept of data c) Support starting and stopping of queries etc. mining using a variety of techniques to identify nuggets of information or decision making knowledge 4. How to categorize data mining systems? in bodies of data, and extracting these in such a way Ans: a) Classification according to the type of data that they can be put to use in the areas such as source mined:- this classification categorizes decision support, prediction, forecasting and data mining systems according to the type of data estimation. handled such as spatial data, multimedia data, 7. What is Meta data? Give example. time-series data, text data, World Wide Web, etc. Ans. Meta data is simply data about data which b) Classification according to the data model drawn normally describe the objects & their quantity, their on:- this classification categorizes data mining size systems based on the data model involved such as and how data are stored. It is helpful in query relational database, object-oriented database, management. Eg:- data warehouse, transactional, etc. 8. What are the requirements for clustering? c) Classification according to the king of knowledge Explain. discovered:- this classification categorizes Ans: Clustering is a division of a data into groups of data mining systems based on the kind of knowledge similar objects. Each group called cluster, consists discovered. of objects that are similar between themselves and d) Classification according to mining techniques dissimilar to objects of other groups. Or we can say used:- This classification categorizes data mining Clustering is a challenging and interesting field systems according to the data analysis approach used potential application poses their own special such as machine learning, neural networks, requirement. genetic algorithms, statistics, visualization, database The following are typical requirements of clustering. oriented or data warehouse-oriented, etc. 
Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data 5. Explain the possible techniques for data cleaning. objects However, a large database may contain Ans: 1. Data normalization: - For example decimal millions of objects. scaling into the range (0,1), or standard Ability to deal with different types of attributed: deviation normalization. Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require Education: - It plays two roles: a)To make people clustering other types of data, such as binary, comfortable with DWH concept. b) categorical (nominal), and ordinal data, or mixtures of To aid the prototyping activity. these data types. Business Requirements: - It is essential that the Discovery of clusters with arbitrary shape: Many business requirements are fully understood by the data clustering algorithms determined clusters based on warehouse Euclidean or Manhattan distance measures. planner. This is Algorithms based on such distance measures tend to more easily said find than done spherical clusters with similar size and density. because future However, a cluster could be of any shape. modifications Minimal requirements for domain knowledge of are hardly clear determine input parameters: Many clustering even algorithms require users to input certain parameters in to top level cluster analysis (such as the number of desired planners, let clusters). alone the IT professionals. The clustering results can be quite sensitive to input Technical blue prints: - This is the stage where the parameters. overall architecture that satisfies the requirements is Ability to deal with noisy data: Most real-world delivered. databases contain outliers or missing, unknown, Building the vision: - Here the first physical erroneous infrastructure becomes available. The major data. Some clustering algorithms are sensitive to such infrastructure data and may lead to clusters of poor quality. 
components are set up; first stages of loading and Insensitivity to the order of input records: Some generation of data start up. clustering algorithms are sensitive to the order of input History load: - Here the system made fully data; for example, the same set of data, when operational by loading the required history into the presented with different orderings to such an warehouse. algorithm, may Now the warehouse becomes fully “loaded” and is generated dramatically different clusters. ready to take on live “queries”. High dimensionality: A database or a data warehouse Adhoc Query: -Now we configure a query tool to can contain several dimensions or attributes. operate against the data warehouse. Many clustering algorithms are good at handling low- The users can ask questions in a typical format. dimensional data, involving only two to three Automation: - Extracting and loading of data from the dimensions. sources, transforming the data, backing up, Constraint-based clustering: Real-world applications restoration, achieving, aggregations, and monitoring may need to perform clustering under various query profiles are the operational process of this phase. kinds of constraints. Suppose that our job is to choose Extending Scope: - There is not single mechanism by the locations for a given number of new automatic which this can be achieved. As and when needed, cash a new set of data may be added; new formats may be -dispensing machines (i.e., ATMs) in a city. included or may be even involve major changes. Interpretability and usability: Users expect Requirement Evolution: - Business requirements will clustering results to be interpretable, Comprehensible, constantly change during the life of the warehouse. and usable. Hence, the process that supports the warehouse also 9. Explain in brief the data warehouse delivery needs to be constantly monitored and modified. process. Ans: Data warehouse delivery process follows the Q10. What is the need for partitioning of data? 
following steps: Explain the usefulness of partitioning of data. IT Strategy: - Data warehouse cannot work in Ans: In most warehouses the size of the fact tables ISOLATION whole IT Strategy of company needed. tends to become very large. This leads to several Business case analysis: - Important of various problem of management backup, processing etc. which component of business/over all understand of business partitioning and each fact table part into is must. separate partition. This technique allows data to be Designer must have sound knowledge’s of business scanned to be minimized, without the overhead activity. of using an index. This improves the overall efficiency It may be very difficult to change the strategy later, of the system partitioning data help in – because data mart format also have to be changed a). Assist in better management of data. b). Ease of backup/recovery since the volume is less. Q14. What are the issues in data mining. c). The star schemas with partitions produce better Ans: There are may issues in data mining such as- performance. i). Security and Social issue: security is an important issue with any data collection that is shared or Q11. Explain the steps needed for designing the in tended to be used for strategic decision making. summary table. ii). User Interface Issues: The knowledge discovered Ans: Summary table are designed by following steps – by data mining tools is useful. It is interesting i. Decide the dimensions along which and understanding correlating personal data with other aggregation is to be done. information. ii. Determine the aggregation of multiple iii). Mining Methodology issues: these issues pertain facts. to the data mining approaches applied and their iii. Aggregate multiple facts into the summary limitations. table. iv). Performance issues: many artificial intelligence iv. Determine the level of aggregation and the & statistical methods exist for data analysis & context at embedding. interpretation v. 
Design time into the table. v). Data Source Issues: There are many issues related vi. Index the Summary table. to the data sources, some are partial such as the diversity Q12. What is an event? How does an event of data types, while others are philosophical like the manager manages the events? Name any 4 events. data glut problem. Ans: An event is a measurable, observable occurrence at defined action. The event manager is software that continuously monitors the system for the occurrence at the event and then takes any action that is Q15. Define data mining query in term of suitable. List of common events- i.Running out of primitives. memory space ii. A process dying. iii. A process using Ans: a) Growing Data Volume: The main reason for accessing resource iv. I/O errors. necessity of automated computer systems for intelligent data analysis is the enormous volume of Q13. What are the reasons for data marting? existing and newly appearing data that require Mention their advantages and disadvantages. processing. The amount of data accumulated each day Ans: There are so many reasons for data marting, such by various business, scientific, and governmental as - Since the volume of data scanned is organizations around the world is daunting. small, they speed up the query processing, Data can be b) Limitations of Human Analysis: Two other structured in a form suitable for a user problems that surface when human analysts’ process access too, Data can be segmented or partitioned so data are the inadequacy of the human brain when that they can be used on different platforms searching for complex multifactor dependencies in and also different control strategies become applicable. data, and the lack of objectiveness in such an analysis. Advantages:- c) Low Cost of Machine Learning: While data i).Since the volume of data scanned is small, they mining does not eliminate human participation in speed up the query processing. 
solving the task completely, it significantly simplifies ii).Data can be structured in a form suitable for a user the job and allows an analyst who is not a access too. professional in statistics and programming to manage iii).Data can be segmented or partitioned so that they the process of extracting knowledge from data. can be used on different platform Disadvantages:- i).The lot of setting up and operating data mart is quite Q16. Explain in brief the data mining applications. high. Ans: Data mining has many varied field of application ii).Once a data marting strategy is put in place, the data which is listed below: mart format became fined. Retail/Marketing: •Identify buying patterns from customers.•Find ii). Initially every item is member of a set of 1- associations among customer demographic candidate item sets. The support count of each characteristics candidate •Predict response to mailing campaigns item sets is calculated and items with a support count •Market basket analysis less the minimum required support count are removed Banking: as candidate, the remaining item is joined to create 2 •Detect patterns of fraudulent credit card use•Identity candidate item sets that each comprise of two items or ‘loyal’ customers members. •Predict customers, determine credit card spending iii). The support count of each two member item set is •Identify stock trading calculated from the database of transactions Insurance and Health Care: and 2 member item set that occur with a support count •Claims analysis greater than or equal to minimum support count are •Identify behavior pattern of risky customers.•Identify used to create 3 candidate item sets. The process in fraudulent behavior steps 1 and 2 is repeated generating 4 and 5. Transportation: iv). All candidate item sets are generated with a •Determine the distribution schedules among support count greater them the minimize support count outlets.•Analyze loading patterns from a set of request item sets. Medicine: v). 
Apriori recursively generates all the subsets of each •Characterize patient behavior to predict office visits. frequent item set and creates association rules based •Identify successful medical therapies for different on subsets with a illnesses. confidence greater than the minimum confidence. Q17. With example explain the decision tree There are many working concept. variations of Apriori Ans: Decision tree is a classifier in the form of a tree Algorithm have been structure where each node is either – proposed that focus on a. A least node, indicating a class at instances. improving the b. A decision node that specifies some test to be efficiency of the original algorithm – carried out on a single attribute value, with one A). Hash based Technique:- It is used to reduce the branch and sub tree for each possible outcome of the size of the candidate K – item sets. test. A decision tree can be used to B). Transaction Reduction:- A transaction that does classify an instance by starting at the root and moving not contain any frequent K- item through it until a leaf node, which sets cannot contain any frequent (K+1) item sets. provides the classification of the instance. Example :- C). Sampling: - In this way, we trade off some degree Decision making in the Bombay stock of accuracy against efficiency. market - Assume that the major factors affecting the D). Dynamic Item set Counting: - It was proposed in Bombay stock markets are – which the database is partitioned What it did yesterday, What the New Delhi market is into blocks masked by start points. doing today, Bank Interest Rate, Unemployment rate.
Q19. In brief explain the process of data
preparation. Q18. What are the implementation steps of data mining with apriori analysis and how the efficiency of this algorithm can be improved? Ans: - Implémentation Steps – i). The Apriori algorithm would analyze all the transactions in a dataset for each items support count. Any items that support count less than the minimum support count is removed from the pool of candidate application. Ans: Data preparation is divided into different The environment should enable the user to experiment selection data cleaning, formation of new data with different coding schemes, store partial results and data formatting. make i). Select data- Data quality properties: completeness attributes discrete, create time series out of historic and correctness. Technical constraints such as data, select random sub-samples, separate test sets and limits on data so on. volume or data type. 6. Integrate with decision support system: Data ii). Data cleaning: - data normalization, data mining looks for hidden data that cannot easily be smoothing, treatment of missing values, data found reduction. using normal query techniques. A knowledge iii). New data construction: - this step represents discovery process always starts with traditional constructive operations on selected on selected decision support data, system activities and from there we magnify in on which includes. Derivation of new attributes from two interesting parts of the data set. or more existing attributes. Generation of new 7. Choose extendible architecture: New techniques records, for pattern recognition and machine learning are under Data transformation. development and we also see many developments in iv). Data formatting: - reordering of attributes or the database area. It is advisable to choose an records, changes related to the constraints of architecture that enables us to integrate new tools at modeling tools. later stages. 8. 
Support heterogeneous databases: Not all the necessary data is necessarily to be found in the data Q20.What are the guidelines for KDD environment. warehouse. Ans: The following are the guidelines for KDD Sometimes we will need to enrich the data warehouse environment are:- with information from unexpected sources, such as 1. Support extremely large data sets: Data mining information deals with extremely large data sets consisting of brokers or with operational data that is not stored in billions of records and without proper platforms to our regular data warehouse. store and handle these volumes of data, no reliable 9. Introduce client/server architecture: A data data mining is possible. Parallel servers with databases mining environment needs extensive reporting optimized for decision support system oriented queries facilities. are useful. Fast and flexible access to large data sets is Client/server is a much more flexible system which of very important. moves the burden of visualization and graphical 2. Support hybrid learning: Learning tasks can be techniques divided into three areas: a. classification tasks b. from the servers to the local machine. We can then knowledge engineering tasks c. problem-solving optimize our database server completely for data tasks. All algorithms can not perform well in all the mining. above areas as discussed in previous chapters. 10. Introduce cache optimization: The learning Depending on our requirement one has to choose the algorithm in a data mining environments should be appropriate one. optimized 3. Establish a data warehouse: A data warehouse for store the data in separate tables on to cache large contains historic data and is subject oriented and static, portions in internal memory type of database access. that is, users do not update the data but it is created on a regular time-frame on the basis of the operational Q21. Explain data mining for financial data data of an organization. analysis. 4. 
Introduce data cleaning facilities: Even when a Ans: Financial data collected in the banking and data warehouse is in operation, the data is certain to financial industries are often relatively complete, contain all sorts of heterogeneous mixture. Special reliable tools for cleaning data are necessary and some and of high quality, which facilitates systematic data advanced tools are available, especially in the field of analysis and data mining. The various issues are – de-duplication of client files. a) Design and construction of data warehouses for 5. Facilitate working with dynamic coding: Creative multidimensional data analysis and data mining: coding is the heart of the knowledge discovery Data warehouses need to be constructed for banking process. and financial data. Multidimensional data analysis methods should be used to analyze the general properties of such data. Data warehouses, data cubes, multifeature and discovery-driven data cubes, characteristic and comparative analyses and outlier a Q23. What is the importance of period of retention nalyses all play important roles in financial data of data? analysis and mining. Ans: A businessman says he wants to the data to be b) Loan payment prediction and customer credit retained for as long as possible 5, 10, 15 years policy analysis: Loan payment prediction and the longer the better. The more data we have, the better customer credit analysis are critical to the business of a the information generated. But such a view bank. Many factors can strongly or weakly thing is unnecessarily simplistic. If a company wants influence loan payment performance and customer to have an idea of the recorder levels, details credit rating. Data mining methods, such as of sales for last 6 months to one year may be enough. feature selection and attribute relevance ranking may Sales pattern of 5 years is unlikely to be relevant help identify important factors and eliminate today. 
So, It is important to determine the retention irrelevant ones. period for each function but once it is drawn, it c) Classification and clustering of customers for becomes easy to decide on the optimum value of data targeted marketing: Classification and clustering to be stored. methods can be used for customer group identification and targeted marketing. Effective clustering and Q25. Give the advantages and disadvantages of collaborative filtering methods can help identify equal segment partitioning. customer groups, associate a new customer with an Ans: The advantage is that the slots are reusable. appropriate customer group and facilitate targeted Suppose we are sure that we will no more need marketing. the data of 10 years back, then we can simply delete d) Detection of money laundering and other the data of that slot and use it again. Of financial crimes: To detect money laundering and course there is a serious draw back in the scheme – if other the partitions tend to differ too much in size. financial crimes, it is important to integrate The number of visitors visiting a till station, say in information from multiple databases, as long as they summer months, will be much larger than in are winter months and hence the size of the segment potentially related to the study. Multiple data analysis should be big enough to take case of the summer rush. tools can then be used to detect unusual patterns, Q24. What are the facts to optimize the cost-benefit such as large amounts of cash flow at certain periods, ratio. by certain group of people and so on. Ans: The facts to optimize the cost-benefit ratio are: Q22. What is data i). Understand the significance of the data stored with warehouse? Explain respect to time. Only those data that are the architecture of still needed for processing need to be stored. data warehouse. ii). 
Find out whether maintaining of statistical samples Ans: It is a large of each of the subsets could be resorted collection of Data to instead of storing the entire data. and set of Process iii). Remove certain columns of the data, if you feel it managers that use is no more essential. this data to make the iv). Determine the use of intelligent and non intelligent information available. The architecture for data keys. warehouse indicated below: It only gives the major v). Incorporate time as one of the factors into the data items that make up a data ware house. table. This can help in indicating the usefulness The size & complexity of each items depends on the of the data over a period of time and removal of actual size of warehouse. The extracting & loading absolute data. process are taken care of by the load manager. The vi). Partition the fact table. A record may contain a processes of cleanup &transformation of data as also large number of fields, only a few of which are of back up & archiving are duties of the warehouse actually needed in each case. It is desirable to group manager, while the query manager, as the name those fields which will be useful into smaller implies is to take case of query management. tables and store separately. Q26. Explain the Query generation. using advertisements, coupons and various kinds of Ans: Meta data is also required to generate queries. discounts and bonuses to promote products The query manger uses the metadata to build a history and attract customers. Careful analysis of the of all queries run and generator a query profile for effectiveness of sales campaigns can help improve each user, or group of uses. We simply list a few of company profits. Multi-dimensional analysis can be the commonly used meta data for the query. The used for these purposes by comparing the names are self explanatory. o Query- Table accessed- amount of sales and the number of transactions Column accessed, Name, Reference identifier. 
o containing the sales items during the sales period Restrictions applied- Column name, Table name, versus those containing the same items before or after Reference identifier ,Restrictions. o Join criteria the sales campaign. applied-Column name, Table name, Reference d) Customer retention – analysis of customer Identifier, Column name, Table name, Reference loyalty: With customer loyalty card information, identifier. o Aggregate function used-Column name, one can register sequences of purchases of particular Reference identifier, Aggregate function. o Syntax o customers. Customer loyalty and purchase Resources o Disk trends can be analyzed in a systematic way, Goods purchased at different periods by the same Q27. Explain data mining for retail industry customer can be grouped into sequences. Sequential application. patterns mining can then be used to investigate Ans: The retail industry is a major application area for changes in customer consumption or loyalty and data mining since it collects huge amount suggest adjustments on the pricing an variety of of data on sales, customer shopping history, goods goods in order to help retain customers and attract new transportation, and consumption and service customers. records and so on. The quantity of data collected e) Purchase recommendations and cross-reference continues to expand rapidly, due to web or of items: Using association mining for sales e-commerce. Today, many stores also have web sites records, one may discover that a customer who buys a where customers can make purchases on-line. particular brand of bread is likely to buy Retail data mining can help identify customer buying another set of items. Such information can used to behaviors, discover customer shopping form purchase recommendations. 
Purchase patterns and trends, improve the quality of customer recommendations can e advertised on the web, in service, achieve better customer retention weekly flyers or on the sales receipts to help and satisfaction, enhance goods consumption ratios, improve customer service, aid customers in selecting design more effective goods transportation items and increase sales. and distribution policies and reduce the cost of the business. The following are few activities of 37. Define aggregation. Explain steps require data mining are carried out in the retail industry. designing summary table. a) Design and construction of data warehouses on Ans: Association: - A collection of items and a set of the benefits of data mining: The first aspect records, which contain is to design a warehouse. Here it involves deciding some number of items from the given collection, an which dimensions and levels to include and association function is an what preprocessing to perform in order to facilitate operation against this set of records which return quality and efficient data mining. affinities or patterns that exist b) Multidimensional analysis of sales, customers, among the collection of items. Summary table are products, time and region: The retail industry designed by following the steps requires timely information regarding customer needs, given as follows: a) decide the dimensions along product sales, trends and fashions as well which aggregation is to be done. as the quality, cost, profit and service of commodities. b) Determine the aggregation of multiple facts. c) It is therefore important to provide Aggregate multiple facts into powerful multidimensional analysis and visualization the summary table. d) Determine the level of tools, including the construction of aggregation and the extent of sophisticated data cubes according to the needs of data embedding. e) Design time into the table. f) Index the analysis. summary table. 
c) Analysis of the effectiveness of sales campaigns: The retail industry conducts sales campaigns, and analysis of their effectiveness is an important data mining activity.

Q29. Explain the h/w partitioning.
Ans: The data warehouse design process should try to maximize the performance of the system. One way to ensure this is to optimize the database design with respect to the specific hardware architecture. Obviously, the exact details of the optimization depend on the hardware platform. Normally the following guidelines are useful:
i) Maximize the processing, disk and I/O operations.
ii) Reduce bottlenecks at the CPU and I/O.

31. With a diagram, explain the architecture of the warehouse manager.
Ans: The warehouse manager is a component that performs all operations necessary to support the warehouse management process. Unlike the load manager, the warehouse management process is driven by the extent to which the operational management of the data warehouse has been automated.

Q28. Explain the system management tools.
Ans: 1. Configuration Managers: This tool is responsible for setting up and configuring the hardware. Since several types of machines are being addressed, several concepts like machine configuration, compatibility etc. are to be taken care of, as is the platform on which the system operates. Most configuration managers have a single interface to allow the control of all types of issues.
2. Schedule Managers: Scheduling is the key to successful warehouse management. Almost all operations in the warehouse need some type of scheduling. Every operating system has its own scheduler and batch control mechanism, but these schedulers may not be capable of fully meeting the requirements of a data warehouse; hence it is desirable to have specially designed schedulers to manage the operations. Some of the capabilities that such a manager should have include the following:
• Handling multiple queues
• Interqueue processing capabilities
• Maintaining job schedules across system outages
• Dealing with time zone differences.
3. Event Managers: An event is defined as a measurable, observable occurrence of a defined action. If this definition seems vague, it is because it encompasses a very large set of operations. A partial list of the common events that need to be monitored is as follows:
• Running out of memory space
• A process dying
• A process using excessive resources
• I/O errors
4. Database Managers: The database manager will normally also have a separate (and often independent) system manager module. The purpose of these managers is to automate certain processes and simplify the execution of others. Some of the operations are listed as follows:
• Ability to add/remove users
 o User management
 o Manipulating user quotas
 o Assigning and deassigning user profiles
• Ability to perform database space management
5. Backup Recovery Managers: Since the data stored in a warehouse is invaluable, the need to back up and recover lost data cannot be overemphasized. There are three main features for the management of backups:
• Scheduling
• Backup data tracking
• Database awareness
6. Back propagation: Back propagation learns by iteratively processing a set of training samples, comparing the network's prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network's prediction and the actual class. These modifications are made in the "backwards" direction, that is, from the output layer, through each hidden layer, down to the first hidden layer.

Q30. Explain horizontal and vertical partitioning and differentiate them.
Ans: HORIZONTAL PARTITIONING - This essentially means that the table is partitioned after the first few thousand entries, the next few thousand entries, and so on. In most cases not all the information in the fact table is needed all the time; thus horizontal partitioning helps to reduce the query access time by directly cutting down the amount of data to be scanned by the queries.
a) Partition by time into equal segments: This is the most straightforward method of partitioning, by months or years etc. It helps if queries often concern fortnightly or monthly performance, sales etc.
b) Partition by time into different-sized segments: This is a very useful technique to keep the physical table small and the operating cost low.
c) Partition on another dimension: Data collection and storage need not always be partitioned based on time, though that is a very safe and relatively straightforward method.
d) Partition by the size of the table: We may not be sure of any dimension on which partitions can be made; in this case it is ideal to partition by size.
e) Using round-robin partitions: Once the warehouse holds its full amount of data, a new partition can be created only by reusing the oldest partition. Metadata is then needed to note the beginning and ending of the historical data.
VERTICAL PARTITIONING - A vertical partitioning schema divides the table vertically: each row is divided into two or more partitions. It is useful in two situations:
i) We may not need to access all the data pertaining to a student all the time. For example, we may need only his personal details such as age and address, or only the examination details such as marks scored. Then we may choose to split them into separate tables, each containing data only about the relevant fields. This speeds up access.
ii) The number of fields in a row may become inconveniently large, with each field itself made up of several subfields. In such a scenario it is always desirable to split the table into two or more smaller tables.
Vertical partitioning itself can be achieved in two different ways: (i) normalization and (ii) row splitting.

32. Explain the steps needed for designing the summary table.
Ans: Summary tables are designed by following these steps:
i. Decide the dimensions along which aggregation is to be done.
ii. Determine the aggregation of multiple facts.
iii. Aggregate multiple facts into the summary table.
iv. Determine the level of aggregation and the extent of embedding.
v. Design time into the table.
vi. Index the summary table.

33. What are the reasons for data marting? Mention their advantages and disadvantages.
Ans: There are many reasons for data marting: since the volume of data scanned is small, data marts speed up query processing; data can be structured in a form suitable for a user access tool; and data can be segmented or partitioned so that it can be used on different platforms, with different control strategies becoming applicable.

34. Define a data mining query in terms of primitives.
Ans: A data mining query is defined in terms of the following primitives:
i. Task-relevant data - This is the database portion to be investigated.
ii. The kinds of knowledge to be mined - This specifies the data mining functions to be performed, such as characterization, discrimination, association, classification, clustering and evolution analysis.
iii. Background knowledge - The user can specify background knowledge, or knowledge about the domain to be mined, which is useful for guiding the KDD process.
iv. Interestingness measures - These functions are used to separate uninteresting patterns from knowledge. Different kinds of knowledge may have different interestingness measures.
v. Presentation and visualization of discovered patterns - Users can choose from different forms for knowledge presentation, such as rules, tables, charts, graphs, decision trees and cubes.

35. What are the implementation steps of data mining with Apriori analysis, and how can the efficiency of this algorithm be improved?
Ans: Implementation steps -
i) The Apriori algorithm analyses all the transactions in a dataset for each item's support count. Any item whose support count is less than the minimum support count is removed from the pool of candidate items.
ii) Initially every item is a member of a set of 1-candidate itemsets. The support count of each candidate itemset is calculated, and items with a support count less than the minimum required support count are removed as candidates; each remaining item is joined with the others to create 2-candidate itemsets that each comprise two items or members.
iii) The support count of each two-member itemset is calculated from the database of transactions, and the 2-member itemsets that occur with a support count greater than or equal to the minimum support count are used to create 3-candidate itemsets. The process in steps i and ii is repeated, generating 4- and 5-candidate itemsets, and so on.
iv) All candidate itemsets generated with a support count greater than the minimum support count form the set of frequent itemsets.
v) Apriori recursively generates all the subsets of each frequent itemset and creates association rules based on subsets with a confidence greater than the minimum confidence.
Many variations of the Apriori algorithm have been proposed that focus on improving the efficiency of the original algorithm:
a) Hash-based technique: used to reduce the size of the candidate k-itemsets.
b) Transaction reduction: a transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets.
c) Sampling: here we trade off some degree of accuracy against efficiency.
d) Dynamic itemset counting: the database is partitioned into blocks marked by start points.
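The level-wise Apriori steps described above can be sketched as follows. This is a toy illustration only: the transaction list and the minimum support count of 2 are invented, and the rule-generation step (v) is omitted.

```python
# Toy transactions and a minimum support count, both invented for illustration.
transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}]
min_support = 2

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# Steps i/ii: frequent 1-itemsets are those meeting the minimum support count.
frequent = [frozenset([i]) for i in sorted({i for t in transactions for i in t})
            if support({i}) >= min_support]

# Steps iii/iv: join (k-1)-survivors into k-candidates and prune by support,
# repeating until no candidate survives.
all_frequent = list(frequent)
k = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent += frequent
    k += 1

print(sorted(tuple(sorted(s)) for s in all_frequent))
# -> [('a',), ('a', 'b'), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```

Note how {a, b, c} is pruned: its support count is 1, below the minimum, which is exactly the pruning that the transaction-reduction and hash-based variations try to do more cheaply.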
36. Explain multidimensional schemas.
Ans: This is a very convenient method of analyzing data when it goes beyond normal tabular relations. For example, a store maintains a table of each item it sells over a month, in each of its 10 outlets. This is a 2-dimensional table. On the other hand, if the company wants the data for all items sold across its outlets, it can obtain it simply by superimposing the 2-dimensional tables for each of these items, one behind the other; it then becomes a 3-dimensional view. A query, instead of looking for a 2-dimensional rectangle of data, will then look for a 3-dimensional cuboid of data. There is no reason why the dimensioning should stop at 3 dimensions: in fact almost all queries can be thought of as extracting a multi-dimensional unit of data from a multidimensional volume of the schema. A lot of design effort goes into optimizing such searches.
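The 3-dimensional view described above can be sketched with a plain dictionary standing in for the cube. All item, outlet and month names and the figures are invented for illustration.

```python
# A tiny 3-dimensional "cube": one cell per (item, outlet, month) coordinate.
cube = {
    ("soap",  "outlet1", "jan"): 100,
    ("soap",  "outlet2", "jan"): 80,
    ("soap",  "outlet1", "feb"): 90,
    ("bread", "outlet1", "jan"): 200,
}

# A 2-dimensional "rectangle" of data: fix one dimension (item = soap).
soap_slice = {k: v for k, v in cube.items() if k[0] == "soap"}

# Aggregating over the remaining dimensions collapses the cube along them.
soap_total = sum(soap_slice.values())
print(soap_total)   # -> 270
```

Fixing one coordinate ("slicing") is exactly the 2-D rectangle the answer describes, and leaving it free gives the 3-D cuboid; real OLAP engines optimize these lookups with indexes and precomputed aggregates rather than a linear scan.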