

Data Mining Techniques to Fill the Missing Data and Detecting Patterns

Umamaheswari. D

Assistant Professor

NGM College, Pollachi

N. Shyamala

Assistant Professor

NGM College, Pollachi

Abstract

Real-world datasets suffer from serious data quality problems: they are often incomplete, redundant, inconsistent, and noisy. Missing data is a common issue in data mining and knowledge discovery, and it is well accepted that many real-life datasets contain a great deal of it. Yet data mining algorithms often handle missing data in very simple ways, and inappropriate treatment degrades their performance. For a classification algorithm, accuracy depends vitally on the quality of the training data, and a high proportion of missing values in the training set is likely to render the resulting model unreliable. Dealing with missing data has therefore become an important issue in data mining research and applications. This paper discusses the main imputation techniques and presents a new method, Naive Bayesian Imputation, which uses a naive Bayesian classifier to estimate and replace missing data.

Keywords: Missing Data, Knowledge Discovery, Imputation, Patterns

________________________________________________________________________________________________________

I. INTRODUCTION

Most real-world databases are characterized by an unavoidable problem of incompleteness, in the form of missing or erroneous values. Incompleteness is introduced into data for a variety of reasons, including manual data entry procedures, incorrect measurements, and equipment errors. The existence of errors, and in particular of missing values, often makes it difficult to generate useful knowledge from the data, since many data analysis algorithms work only with complete data. Different strategies have therefore been developed to work with data that contains missing values and to impute (in other words, fill in) the missing values.

There are three types of missing data patterns. If the missingness does not depend on the observed data, it is regarded as Missing Completely at Random (MCAR). This includes the situation where a value is missing with an i.i.d. (independent and identically distributed) probability that is independent of the actual value of the feature, or of any other feature. Simple imputation techniques, such as mean imputation and linear regression imputation, are often used to deal with MCAR missing data.

Missing data that depends on the observed data, but not on the unobserved data, is considered Missing at Random (MAR). Full maximum likelihood and Bayesian imputation techniques can be used to deal with missing data that are MAR as well as MCAR. Both MCAR and MAR patterns are considered ignorable, as valid inferences and estimates can be made even if the missing data pattern is not explicitly modeled in the analysis.

If the missingness is not MAR, the data are said to be Not Missing at Random (NMAR, or non-ignorable). Here, the unobserved values can determine their own missingness. Incorrectly assuming ignorability can result in biased estimates. When the missing data are non-ignorable, a model of the missing value mechanism must be incorporated into the estimation process to produce meaningful estimates. Such a model is generally specific to the problem itself and is difficult to build in practice, especially as the missing data mechanism is rarely known with certainty.

In general, treatment methods for missing data fall into three common approaches: ignoring or discarding the instances which contain missing data, parameter estimation, and imputation. If we do not want to lose any information hidden in the data, we can predict the missing values; this prediction process is called imputation. Imputation techniques use the information present in the dataset to estimate and replace missing data: they aim to recognize the relationships among the values in the dataset and to estimate missing values based on those relationships. Imputation is a flexible approach to missing data handling and is independent of the data mining algorithm. According to the information used, imputation techniques fall into three categories: global imputation based on the missing attribute, global imputation based on non-missing attributes, and local imputation.


Data Mining Techniques to Fill the Missing Data and Detecting Patterns

(IJSTE/ Volume 2 / Issue 01 / 007)

A. Mean imputation

In this method, the mean of the observed values of an attribute that contains missing data is used to fill in the missing values. For a categorical attribute, the mode (the most frequent value) is used instead of the mean. The algorithm imputes missing values for each attribute separately.
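As a concrete illustration, here is a minimal sketch of mean and mode imputation in plain Python; the function names and the convention of marking a missing value with `None` are ours, not the paper's:

```python
# Mean/mode imputation sketch: each attribute is treated separately,
# with None marking a missing value.
from collections import Counter

def impute_mean(values):
    """Numeric attribute: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def impute_mode(values):
    """Categorical attribute: replace None with the most frequent value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

print(impute_mean([1.0, None, 3.0]))       # the gap becomes the mean, 2.0
print(impute_mode(["a", "b", None, "a"]))  # the gap becomes the mode, "a"
```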

B. Hot-Deck imputation

In this method, for each example that contains missing values, the most similar example is found and the missing values are imputed from it. If the most similar example also has missing values for the same attributes as the original example, it is discarded and the next closest example is found. The procedure is repeated until all missing values are successfully imputed or the entire database has been searched.
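The search-for-a-donor procedure can be sketched as follows. The overlap-count similarity used here is an assumption of ours; the paper does not fix a similarity measure:

```python
# Hot-deck imputation sketch: copy each missing field from the most
# similar record that has that field observed.
def similarity(a, b):
    """Number of attributes on which both records agree (None never matches)."""
    return sum(1 for x, y in zip(a, b) if x is not None and x == y)

def hot_deck(records):
    result = [list(r) for r in records]
    for row in result:
        missing = [i for i, v in enumerate(row) if v is None]
        if not missing:
            continue
        # donors must have observed values for every field this row lacks
        donors = [r for r in records
                  if all(r[i] is not None for i in missing)]
        if not donors:
            continue  # entire dataset searched, no donor found: leave as-is
        best = max(donors, key=lambda d: similarity(row, d))
        for i in missing:
            row[i] = best[i]
    return result

data = [["red", "s", 1], ["red", "s", None], ["blue", "l", 7]]
print(hot_deck(data))  # the None is copied from the most similar donor
```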

C. Maximum Likelihood Methods

This method uses all the data observed in a database to construct the best possible first and second order moment estimates. It does not impute any data; rather, it estimates a vector of means and a covariance matrix among the variables in the database. The method is a development of the expectation maximization (EM) approach. One advantage is its well-known statistical foundation. Disadvantages include the assumptions made about the original data distribution and the assumption that the incomplete data are missing at random.

D. All Possible Values Imputation (APV)

This method consists of replacing the missing data for a given attribute by all possible values of that attribute, so an instance with missing data is replaced by a set of new instances. If more than one attribute has missing data, the substitution is done for one attribute first, then for the next, and so on, until all attributes with missing data are replaced. A variation of the method restricts the substitution to the class: all possible values of the attribute observed within the instance's class are used to replace its missing data.
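A minimal sketch of the expansion step, assuming discrete attributes and using `None` for missing values:

```python
# All Possible Values imputation sketch: an instance with a missing
# attribute expands into one new instance per observed value of that
# attribute, repeated until no missing data remain.
def apv_impute(records):
    n_attrs = len(records[0])
    # distinct observed values per attribute position
    domains = [sorted({r[i] for r in records if r[i] is not None})
               for i in range(n_attrs)]
    out = []
    queue = [list(r) for r in records]
    while queue:
        row = queue.pop()
        try:
            i = row.index(None)      # first still-missing attribute
        except ValueError:
            out.append(row)          # complete: keep it
            continue
        for v in domains[i]:         # one substitute instance per value
            new = list(row)
            new[i] = v
            queue.append(new)
    return out

data = [["a", 1], ["b", None]]
print(apv_impute(data))  # ["b", None] expands into ["b", 1]
```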

E. Regression Methods

This method assumes that the value of one variable changes in a linear way with the other variables. The missing data are replaced by the output of a linear regression function rather than by a single statistic. The method depends on the assumption of a linear relationship between attributes; in most cases, however, the relationship is not linear, and predicting the missing data in a linear way will bias the model.
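A minimal sketch with a single predictor, fitting y ≈ a·x + b by ordinary least squares on the complete pairs (the paper does not specify the regression form, so this is illustrative):

```python
# Regression imputation sketch: fit a line on complete (x, y) pairs,
# then fill missing y values from the fitted line.
def regression_impute(xs, ys):
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    a = sxy / sxx                    # slope
    b = my - a * mx                  # intercept
    return [a * x + b if y is None else y for x, y in zip(xs, ys)]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, None, 8.0]
print(regression_impute(xs, ys))     # the gap at x = 3.0 becomes 6.0
```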

F. K-Nearest Neighbor Imputation

This method uses the k-nearest neighbor algorithm to estimate and replace missing data. Its main advantages are that: a) it can estimate both qualitative and quantitative attributes; b) it is not necessary to build a predictive model for each attribute with missing data (indeed, no explicit model is built at all). Efficiency is the biggest problem for this method: while the k-nearest neighbor algorithm looks for the most similar instances, the whole dataset must be searched, and real datasets are usually very large. In addition, the choice of the value k and of the similarity measure greatly affects the result.
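A minimal sketch for numeric attributes; Euclidean distance on the observed attributes and the mean over the k neighbors are choices of ours, since, as noted above, k and the similarity measure are left open:

```python
# k-NN imputation sketch: fill a numeric gap with the mean of that
# attribute over the k complete records closest in Euclidean distance
# on the attributes the incomplete record has observed.
import math

def knn_impute(records, k=2):
    result = [list(r) for r in records]
    complete = [r for r in records if None not in r]
    for row in result:
        missing = [i for i, v in enumerate(row) if v is None]
        if not missing:
            continue
        def dist(d):
            return math.sqrt(sum((row[i] - d[i]) ** 2
                                 for i in range(len(row))
                                 if row[i] is not None))
        neighbours = sorted(complete, key=dist)[:k]
        for i in missing:
            row[i] = sum(d[i] for d in neighbours) / len(neighbours)
    return result

data = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
print(knn_impute(data, k=2))  # gap filled from the two nearest rows (15.0)
```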

G. Multiple Imputation

The basic idea of multiple imputation (MI) is that: a) a model which incorporates random variation is used to impute the missing data; b) this is done M times, producing M complete datasets; c) the analysis is run on each complete dataset and the M results are averaged to produce a single one. For MI, the data must be missing at random. In general, multiple imputation methods outperform deterministic ones: by introducing random error into the imputation process, they obtain approximately unbiased estimates of all parameters. However, the computational cost is too high for the method to be practical in many settings.
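The three steps can be sketched as follows; random draws from the observed values stand in for a full imputation model, so this is a toy illustration only:

```python
# Multiple imputation sketch: impute M times with random variation,
# run the analysis on each completed dataset, average the M results.
import random

def multiple_imputation(values, analysis, M=5, seed=0):
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    results = []
    for _ in range(M):
        # a) impute with random variation (draw from observed values)
        completed = [rng.choice(observed) if v is None else v
                     for v in values]
        # b)/c) analyze each of the M completed datasets
        results.append(analysis(completed))
    return sum(results) / M          # average of the M analyses

vals = [1.0, 2.0, None, 4.0]
est = multiple_imputation(vals, analysis=lambda xs: sum(xs) / len(xs), M=100)
print(round(est, 2))  # close to the mean, with the gap drawn from {1, 2, 4}
```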

H. Naive Bayesian Imputation (NBI)

The Naive Bayesian Classifier (NBC) uses probabilities to represent each class and finds the most probable class for each sample. Although the independence assumption is rarely satisfied exactly, NBC performs well in practice. It is a popular classifier, not only for its good performance, simple form, and fast computation, but also for its insensitivity to missing data: it can build models on a dataset with any amount of missing data, making full use of all the data present. This paper therefore proposes a new method based on the Naive Bayesian Classifier to handle missing data, named the Naive Bayesian Imputation (NBI) model. Imputation is one of the most widely used missing data treatment methods. The basic idea of NBI is first to define the attribute to be imputed, called the imputation attribute, and then to construct an NBC using the imputation attribute as the class attribute, with the other attributes in the dataset serving as the training subset. Hence the imputation problem becomes a


classification problem. Finally, the NBC is used to estimate and replace the missing data in the imputation attribute. Usually more than one attribute in the dataset contains missing data. In that case, (1) the attributes to be imputed must be identified first, and (2) for those imputation attributes, the imputation sequence must also be considered. Problem (1) is usually decided by the data mining task: if the subsequent analysis does not require a complete dataset, or if the models have internal missing data treatment methods (as C4.5 does), we can impute only some of the attributes which contain missing data; otherwise, all the missing data should be imputed. For problem (2), NBI models have three strategies. (1) Order Irrelevant Strategy (NBI-OI): once the imputation attributes are defined, the dataset produced by the imputation is independent of the imputation order. Attributes imputed later do not use the values of attributes that have already been imputed, so the completed datasets are identical for every imputation order.

I. Algorithm for NBI Model with Order Irrelevant Strategy:

1) Input:
   D = {d1, d2, d3, ..., dN}        // dataset with missing data; dn is the n-th sample
   A = {A1, A2, A3, ..., AK}        // attribute set of dataset D
   B = {B1, B2, B3, ..., BM} ⊆ A    // imputation attribute set of D
2) Output:
   D' = {d'1, d'2, d'3, ..., d'N}   // dataset after imputation
3) Notation:
   dn[Bm]                           // the value of attribute Bm in sample dn
   NBCm(dn[Bm])                     // the value of Bm imputed for sample dn by NBCm
4) Method:
   1) D' := D
   2) for m = 1 to M
   3)   build NBCm, using D as the training set, Bm as the class attribute, and B − Bm as the attribute set for training
   4)   for n = 1 to N
   5)     if dn[Bm] = null then d'n[Bm] := NBCm(dn[Bm])
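The order irrelevant algorithm above can be sketched in Python for categorical data as follows. The add-one (Laplace) smoothing is our addition, not part of the paper's specification, and the helper names are illustrative:

```python
# NBI-OI sketch: for each imputation attribute Bm, train a naive Bayes
# classifier on the rows of the ORIGINAL dataset D where Bm is observed
# (the other imputation attributes as features), then fill the gaps in D'.
def nb_predict(rows, target, features, query):
    """Argmax over classes of P(c) * prod P(feature value | c), add-one smoothed."""
    classes = {r[target] for r in rows}
    best_c, best_p = None, -1.0
    for c in classes:
        in_c = [r for r in rows if r[target] == c]
        p = len(in_c) / len(rows)                 # class prior
        for f in features:
            if query[f] is None:
                continue                          # skip features missing in the query
            match = sum(1 for r in in_c if r[f] == query[f])
            vals = len({r[f] for r in rows})
            p *= (match + 1) / (len(in_c) + vals)
        if p > best_p:
            best_c, best_p = c, p
    return best_c

def nbi_order_irrelevant(D, B):
    Dp = [list(r) for r in D]                     # 1) D' := D
    for m in B:                                   # 2) for each imputation attribute
        features = [a for a in B if a != m]       #    B - Bm
        train = [r for r in D if r[m] is not None]  # always the original D
        for row in Dp:
            if row[m] is None:                    # 5) fill the gap
                row[m] = nb_predict(train, m, features, row)
    return Dp

D = [["sun", "hot", "no"], ["sun", "hot", "no"],
     ["rain", "cool", "yes"], ["rain", "cool", None]]
print(nbi_order_irrelevant(D, B=[0, 1, 2]))  # the None becomes "yes"
```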

(2) Order Relevant Strategy (NBI-OR): the dataset produced by the imputation depends on the imputation order. Attributes imputed later do use the values of attributes that have already been imputed, so the completed datasets differ for different imputation orders.

J. Algorithm for NBI Model with Order Relevant Strategy:

1) Input:
   D = {d1, d2, d3, ..., dN}        // dataset with missing data; dn is the n-th sample
   A = {A1, A2, A3, ..., AK}        // attribute set of dataset D
   B = {B1, B2, B3, ..., BM} ⊆ A    // imputation attribute set of D; subscripts assigned according to the chosen imputation order
2) Output:
   D' = {d'1, d'2, d'3, ..., d'N}   // dataset after imputation
3) Notation:
   dn[Bm]                           // the value of attribute Bm in sample dn
   NBCm(dn[Bm])                     // the value of Bm imputed for sample dn by NBCm
4) Method:
   1) D' := D
   2) for m = 1 to M
   3)   build NBCm, using D' as the training set, Bm as the class attribute, and B − Bm as the attribute set for training
   4)   for n = 1 to N
   5)     if d'n[Bm] = null then d'n[Bm] := NBCm(d'n[Bm])

(3) Hybrid Strategy (NBI-Hm): this strategy combines the order relevant and order irrelevant strategies. The first M1 imputation attributes use the order relevant strategy and the rest use the order irrelevant strategy.

K. Algorithms for NBI model with hybrid strategy:

1) Input:
   D = {d1, d2, d3, ..., dN}        // dataset with missing data; dn is the n-th sample
   A = {A1, A2, A3, ..., AK}        // attribute set of dataset D
   B = {B1, B2, B3, ..., BM1, BM1+1, ..., BM} ⊆ A   // imputation attribute set; the first M1 attributes use the order relevant strategy


2) Output:
   D' = {d'1, d'2, d'3, ..., d'N}   // dataset after imputation
3) Notation:
   dn[Bm]                           // the value of attribute Bm in sample dn
   NBCm(dn[Bm])                     // the value of Bm imputed for sample dn by NBCm
4) Method:
   1) D'' := D
   2) for m = 1 to M1
   3)   build NBCm, using D'' as the training set, Bm as the class attribute, and B − Bm as the attribute set for training
   4)   for n = 1 to N
   5)     if d''n[Bm] = null then d''n[Bm] := NBCm(d''n[Bm])
   6) D' := D''
   7) for m = M1 + 1 to M
   8)   build NBCm, using D'' as the training set, Bm as the class attribute, and B − Bm as the attribute set for training
   9)   for n = 1 to N
   10)    if d''n[Bm] = null then d'n[Bm] := NBCm(d''n[Bm])

According to the discussion above, NBI consists of two steps. Step 1: define the imputation attributes and the imputation order. Step 2: use the NBI model to impute the missing data. Under the order relevant strategy, NBI imputes the missing data in the first imputation attribute of the defined order, then imputes the next attribute on the modified dataset, and so on. In real-world datasets more than one attribute contains missing data, but for a data mining task such as classification only some of the attributes matter, and high dimensionality likewise makes the miner focus on a subset of the attributes. We can therefore handle only part of the missing data when the task does not require a complete dataset. An attribute that contains missing data, is important for the data mining task, and needs to be imputed is called a main imputation attribute in this paper. The importance of a main imputation attribute for the data mining task is reflected by the attribute's importance factor.

Initially, all the attributes' importance factors are arranged in descending order; for the order relevant strategy, this order is also the imputation order. The miner can then select the main imputation attributes according to the task, for example by taking the leading attributes up to a certain percentage, or by choosing a threshold range nmin to nmax and selecting the first nmin to nmax attributes as the main imputation attributes. Two aspects should be taken into account when defining the main imputation attributes: the missing proportion of the attribute and its importance for the data mining task. In this paper a weighted index of the two is used to evaluate the importance of an attribute. For classification, the attribute importance factor can be weighed by information gain or by the decision tree structure.

VI. CONCLUSION

The problem of missing data has long received considerable attention, and more and more treatment methods have been proposed. This paper has focused mainly on imputation methods for dealing with missing data: the missing data mechanisms were presented first, followed by a discussion of the popular imputation techniques. Many methods exist for dealing with missing data, but none is absolutely better than the others; different situations require different solutions. In this paper we have concentrated on Naive Bayesian Imputation.

