
DATA WAREHOUSING AND MINING

II UNIT: DATA PREPROCESSING, LANGUAGE ARCHITECTURE CONCEPT DESCRIPTION


2-MARK QUESTIONS WITH ANSWERS:

Q1. What data preprocessing techniques are used in data mining?
ANS: There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data. Data integration merges data from multiple sources into a coherent data store, such as a data warehouse or a data cube. Data transformation, such as normalization, may be applied; for example, normalization may improve the accuracy and efficiency of mining algorithms involving distance measurements. Data reduction can reduce the data size by, for instance, aggregating, eliminating redundant features, or clustering.

Q2. What are missing values, and how can they be handled?
ANS: Imagine that you need to analyze All Electronics sales and customer data. You note that many tuples have no recorded value for several attributes, such as customer income. The missing values for such an attribute can be handled by the following methods:
1. Ignore the tuple.
2. Fill in the missing value manually.
3. Use a global constant to fill in the missing value.
4. Use the attribute mean to fill in the missing value.
5. Use the attribute mean for all samples belonging to the same class as the given tuple.
6. Use the most probable value to fill in the missing value.

Q3. What is noise?
ANS: Noise is a random error or variance in a measured variable. It can be reduced with the following data smoothing techniques:
1. Binning.
2. Clustering.

3. Combined computer and human inspection.
4. Regression.

Q4. Smooth the following data by binning: 8, 2, 4, 1, 6, 10, 14, 12, 18. (i) Smooth by bin means. (ii) Smooth by bin boundaries.
ANS: Sort the data and partition it into equal-frequency bins of three values each:
Bin 1: 1, 2, 4
Bin 2: 6, 8, 10
Bin 3: 12, 14, 18
Smoothing by bin means (each value is replaced by the mean of its bin; 7/3 is approximately 2.33 and 44/3 is approximately 14.67):
Bin 1: 2.33, 2.33, 2.33
Bin 2: 8, 8, 8
Bin 3: 14.67, 14.67, 14.67
Smoothing by bin boundaries (each value is replaced by the closer of the minimum and maximum of its bin; 8 is equidistant from 6 and 10 and is assigned to the lower boundary here):
Bin 1: 1, 1, 4
Bin 2: 6, 6, 10
Bin 3: 12, 12, 18

Q5. Define regression and give its types.
ANS: Data can be smoothed by fitting the data to a function, such as with regression. Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two variables are involved and the data are fit to a multidimensional surface.

Q6. Which method is used to detect redundancies in the data? Give its equation.
ANS: Some redundancies can be detected by correlation analysis. For example, given two attributes, such analysis can measure how strongly one attribute implies the other, based on the available data. The correlation between attributes A and B can be measured by
\[ r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} \]
where n is the number of tuples, \(\bar{A}\) and \(\bar{B}\) are the respective mean values of A and B, and \(\sigma_A\) and \(\sigma_B\) are the respective standard deviations of A and B.
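The binning exercise in Q4 can be checked with a short Python sketch. It assumes equal-frequency bins of three values and, as a tie-breaking choice that the question itself does not specify, sends a value that is equidistant from both boundaries to the lower one. The function names are illustrative only.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum.

    Assumption: ties go to the lower boundary.
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # each bin is sorted, so these are its boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


data = [8, 2, 4, 1, 6, 10, 14, 12, 18]  # the values given in Q4
bins = equal_frequency_bins(data, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

print(bins)
print([[round(v, 2) for v in b] for b in by_means])
print(by_boundaries)
```

The exact bin means are 7/3, 8 and 44/3, so the first and third bins are not whole numbers.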
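As a rough illustration of the correlation measure in Q6, the sketch below computes the coefficient for two lists of numbers. It assumes the formula is the usual Pearson coefficient with sample standard deviations (divisor n - 1); the function name and the sample values are made up for the example.

```python
from math import sqrt


def correlation(a, b):
    """Correlation coefficient r_{A,B} using sample standard deviations (n - 1)."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("both attributes need the same number (>= 2) of values")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample standard deviations of A and B.
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b) / (n - 1))
    if sd_a == 0 or sd_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    # Sum of products of deviations from the means.
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / ((n - 1) * sd_a * sd_b)


# Hypothetical attribute values, chosen only to show the sign of r.
r_pos = correlation([1, 2, 3, 4], [2, 4, 6, 8])   # B rises with A
r_neg = correlation([1, 2, 3, 4], [8, 6, 4, 2])   # B falls as A rises
print(round(r_pos, 6), round(r_neg, 6))
```

A result close to +1 or -1 suggests that one of the two attributes may be redundant; a result near 0 suggests no linear relationship.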

Q7. What are the different types of correlation?
ANS: If r_{A,B} is greater than 0, then A and B are positively correlated, meaning that the values of A increase as the values of B increase. The higher the value, the more each attribute implies the other. Hence, a high value may indicate that A (or B) may be removed as a redundancy. If the value is equal to 0, then A and B are uncorrelated: there is no linear relationship between them. If the value is less than 0, then A and B are negatively correlated, meaning that the values of one attribute increase as the values of the other attribute decrease. In that case each attribute discourages the other.

Q8. What is meant by normalization?
ANS: Normalization scales the attribute data so that they fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0. Normalization is particularly useful for classification algorithms involving neural networks, or distance measurements such as nearest-neighbor classification and clustering.

Q9. Give the types of normalization.
ANS: There are many methods for data normalization. We study three:
1. Min-max normalization.
2. Z-score normalization.
3. Normalization by decimal scaling.

Q10. Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. We would like to map income to the range [0.0, 1.0] by min-max normalization.
ANS: Min-max normalization maps a value v of attribute A to
\[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
so it needs the minimum and maximum of income, which the question does not state. With the mean and standard deviation that are given, z-score normalization applies instead:
\[ v' = \frac{v - 54{,}000}{16{,}000} \]

Q11. Define discretization.
ANS: Discretization replaces raw data values of an attribute with ranges or higher conceptual levels. Discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. The resulting concept hierarchies allow the mining of data at multiple levels of abstraction and are a powerful tool for data mining.

Q12. What is the goal of attribute subset selection?
ANS: The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Mining on a reduced set of attributes has an additional benefit: it reduces the number of attributes appearing in the discovered patterns, which makes the patterns easier to understand.
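The three normalization methods listed in Q9 can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are invented here, and the sample income value of 73,600 and the minimum and maximum of 12,000 and 98,000 are assumptions added for the example, since the notes give only the mean (54,000) and standard deviation (16,000).

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


def z_score(v, mean, std_dev):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean) / std_dev


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making every |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


income = 73_600  # assumed sample value, not from the notes

z = z_score(income, mean=54_000, std_dev=16_000)       # mean and sd from Q10
mm = min_max(income, old_min=12_000, old_max=98_000)   # assumed min and max
scaled = decimal_scaling([-986, 917])                  # assumed sample values

print(round(z, 3), round(mm, 3), scaled)
```

Min-max normalization preserves the relationships among the original values but depends on knowing the minimum and maximum, whereas z-score normalization needs only the mean and standard deviation.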

Q13. Give the different methods of attribute subset selection.
ANS: Basic heuristic methods of attribute subset selection include the following techniques:
1. Stepwise forward selection.
2. Stepwise backward elimination.
3. Combination of forward selection and backward elimination.

Q14. Define lossy and lossless compression.
ANS: If the original data can be reconstructed from the compressed data without any loss of information, the data compression technique is called lossless. If, instead, we can reconstruct only an approximation of the original data, the data compression technique is called lossy.

Q15. Define histograms.
ANS: A histogram for an attribute A partitions the data distribution of A into disjoint subsets, or buckets. The buckets are displayed on a horizontal axis, while the height of a bucket typically reflects the average frequency of the values represented by the bucket.

Q16. Define singleton buckets.
ANS: If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.

Q17. Define sampling.
ANS: Sampling can be used as a data reduction technique since it allows a large data set to be represented by a much smaller random sample of the data. An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample, n, as opposed to N, the data set size. Hence, sampling complexity is potentially sublinear in the size of the data. Sampling is most commonly used to estimate the answer to an aggregate query.

Q18. What defines a data mining task?
ANS: Each user will have a data mining task in mind, that is, some form of data analysis that she would like to have performed. A data mining task can be specified in the form of a data mining query, which is input to the data mining system. A data mining query is defined in terms of the following primitives:
1. Task-relevant data
2. The kinds of knowledge to be mined

3. Interestingness measures
4. Presentation and visualization of discovered patterns

Q19. Explain interestingness measures.
ANS: These functions are used to separate uninteresting patterns from knowledge. They may be used to guide the mining process or, after discovery, to evaluate the discovered patterns. Different kinds of knowledge may have different interestingness measures. For example, interestingness measures for association rules include support and confidence.

Q20. Define schema hierarchies and set-grouping hierarchies.
ANS: Schema hierarchies: A schema hierarchy is a total or partial order among attributes in the database schema. Schema hierarchies may formally express existing semantic relationships between attributes. Typically, a schema hierarchy specifies a data warehouse dimension.
Set-grouping hierarchies: A set-grouping hierarchy organizes values for a given attribute or dimension into groups of constants or range values. A total or partial order can be defined among groups. Set-grouping hierarchies can be used to refine or enrich schema-defined hierarchies when the two types of hierarchies are combined.

Q21. Define operation-derived hierarchies and rule-based hierarchies.
ANS: Operation-derived hierarchies: An operation-derived hierarchy is based on operations specified by users, experts, or the data mining system. Operations can include the decoding of information-encoded strings, information extraction from complex data objects, and data clustering.
Rule-based hierarchies: A rule-based hierarchy occurs when either a whole concept hierarchy or a portion of it is defined by a set of rules and is evaluated dynamically based on the current database data and the rule definition.

Q22. Define the types of interestingness measures.
ANS: Interestingness measures estimate the simplicity, certainty, utility, and novelty of patterns.

Q23. Why is it important to have a data mining query language?
ANS: A desired feature of data mining systems is the ability to support ad hoc and interactive data mining in order to facilitate flexible and effective knowledge discovery. A data mining query language can be designed to support this feature.

Q24. Define loose coupling and tight coupling.
ANS: Loose coupling means that a DM system will use some facilities of a DB or DW system, fetching data from a data repository managed by these systems, performing data mining, and then storing the mining results either in a file or in a designated place in a database or data warehouse.
Tight coupling means that a DM system is smoothly integrated into the DB/DW system. The data mining subsystem is treated as one functional component of an information system. Data mining queries and functions are optimized based on mining query analysis, data structures, indexing schemes, and query processing methods of the DB or DW system.

Q25. Define concept description.
ANS: Concept description generates descriptions for characterization and comparison of the data. It is sometimes called class description, when the concept to be described refers to a class of objects. Characterization provides a concise and succinct summarization of the given collection of data, while concept or class comparison provides descriptions comparing two or more collections of data.

Q26. Define data generalization.
ANS: Data generalization is a process that abstracts a large set of task-relevant data in a database from a relatively low conceptual level to higher conceptual levels. Methods for the efficient and flexible generalization of large data sets can be categorized according to two approaches:
1. Data cube approach (or OLAP approach)
2. Attribute-oriented induction approach

Q27. Define AOI.
ANS: The attribute-oriented induction (AOI) approach consists of the following techniques: data focusing, data generalization by attribute removal or attribute generalization, count and aggregate value accumulation, attribute generalization control, and generalization data visualization.

Q28. Give the mean, median and mode equations.
ANS: Mean: Let x_1, x_2, x_3, ..., x_n be a set of n values or observations. The mean of this set of values is
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Median: The median is the middle value of the ordered set if the number of values n is odd; otherwise, it is the average of the middle two values.

For grouped data, the median can be approximated by interpolation:
\[ \text{median} = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{median}} \right) c \]
where L_1 is the lower class boundary of the class containing the median, n is the number of values in the data, \((\sum f)_l\) is the sum of the frequencies of all of the classes that are lower than the median class, f_{median} is the frequency of the median class, and c is the size of the median class interval.
Mode: The mode for a set of data is the value that occurs most frequently in the set.

Q29. Define box plots.
ANS: In a box plot:
1. Typically, the ends of the box are at the quartiles, so that the box length is the interquartile range, IQR.
2. The median is marked by a line within the box.
3. Two lines (called whiskers) extend from the box to the smallest (minimum) and largest (maximum) observations.

Q30. Define quartiles.
ANS: The most commonly used percentiles other than the median are quartiles. The first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted by Q3, is the 75th percentile. The quartiles, including the median, give some indication of the center, spread and shape of a distribution.

Q31. Define the five-number summary.
ANS: The five-number summary of a distribution consists of the median M, the quartiles Q1 and Q3, and the smallest and largest individual observations, written in the order Minimum, Q1, M, Q3, Maximum.
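The five-number summary and the interquartile range used by the box plot can be computed with Python's standard `statistics` module, as in the sketch below. Two assumptions are made here: the data are the nine values from Q4, reused purely as an illustration, and the quartiles follow the "inclusive" interpolation convention of `statistics.quantiles`. Other quartile conventions can give slightly different Q1 and Q3.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (Minimum, Q1, M, Q3, Maximum) for a list of numbers.

    Assumption: quartiles use the 'inclusive' method of statistics.quantiles.
    """
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


data = [8, 2, 4, 1, 6, 10, 14, 12, 18]  # values reused from Q4 as sample data
summary = five_number_summary(data)
minimum, q1, m, q3, maximum = summary
iqr = q3 - q1  # box length in a box plot

print(summary, iqr)
```

In a box plot of this data the box would span Q1 to Q3, the line inside the box would sit at M, and the whiskers would reach the minimum and maximum.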
