Você está na página 1de 18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

AComprehensiveGuidetoDataExploration
BUSINESSANALYTICS

SUNILRAY,J ANUARY10,2016 /

27

SHARE

Introduction
Therearenoshortcutsfordataexploration.Ifyouareinastateofmind,thatmachinelearningcan
sailyouawayfromeverydatastorm,trustme,itwont.Aftersomepointoftime,youllrealizethat
youarestrugglingatimprovingmodelsaccuracy.Insuchsituation,dataexplorationtechniqueswill
cometoyourrescue.
Icanconfidentlysaythis,becauseIvebeenthroughsuchsituations,alot.
IhavebeenaBusinessAnalyticsprofessionalforclosetothreeyearsnow.Inmyinitialdays,oneof
mymentorsuggestedmetospendsignificanttimeonexplorationandanalyzingdata.Followinghis
advicehasservedmewell.
Ive created this tutorial to help you understand the underlying techniques of data exploration.As
always, Ive tried my best to explain these concepts in the simplest manner. For better
understanding,Ivetakenupfewexamplestodemonstratethecomplicatedconcepts.

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

1/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

TableofContents
1.StepsofDataExplorationandPreparation
2.MissingValueTreatment
Whymissingvaluetreatmentisrequired?
Whydatahasmissingvalues?
Whicharethemethodstotreatmissingvalue?
3.TechniquesofOutlierDetectionandTreatment
Whatisanoutlier?
Whatarethetypesofoutliers?
Whatarethecausesofoutliers?
Whatistheimpactofoutliersondataset?
Howtodetectoutlier?
Howtoremoveoutlier?
4.TheArtofFeatureEngineering
WhatisFeatureEngineering?
WhatistheprocessofFeatureEngineering?
WhatisVariableTransformation?
Whenshouldweusevariabletransformation?
Whatarethecommonmethodsofvariabletransformation?
Whatisfeaturevariablecreationanditsbenefits?

Letsgetstarted.
http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

2/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

Letsgetstarted.

1.StepsofDataExplorationandPreparation
Rememberthequalityofyourinputsdecidethequalityofyouroutput.So,onceyouhavegotyour
businesshypothesisready,itmakessensetospendlotoftimeandeffortshere.Withmypersonal
estimate,dataexploration,cleaningandpreparationcantakeupto70%ofyourtotalprojecttime.
Belowarethestepsinvolvedtounderstand,cleanandprepareyourdataforbuildingyourpredictive
model:
1.VariableIdentification
2.UnivariateAnalysis
3.BivariateAnalysis
4.Missingvaluestreatment
5.Outliertreatment
6.Variabletransformation
7.Variablecreation

Finally,wewillneedtoiterateoversteps47multipletimesbeforewecomeupwithourrefined
model.
Letsnowstudyeachstageindetail:

VariableIdentification
First, identify Predictor (Input) and Target (output) variables. Next, identify the data type and
categoryofthevariables.
Letsunderstandthisstepmoreclearlybytakinganexample.
Example: Suppose, we want to predict, whether the students will play cricket or not (refer below
data set). Here you need to identify predictor variables, target variable, data type of variables and

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

3/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

categoryofvariables.

Below,thevariableshavebeendefinedindifferentcategory:

UnivariateAnalysis
Atthisstage,weexplorevariablesonebyone.Methodtoperformunivariateanalysiswilldependon
whether the variable type is categorical or continuous. Lets look at these methods and statistical
measuresforcategoricalandcontinuousvariablesindividually:
Continuous Variables: In case of continuous variables, we need to understand the central
tendency and spread of the variable. These are measured using various statistical metrics
visualizationmethodsasshownbelow:

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

4/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

Note:Univariateanalysisisalsousedtohighlightmissingandoutliervalues.Intheupcomingpartof
thisseries,wewilllookatmethodstohandlemissingandoutliervalues.Toknowmoreaboutthese
methods,youcanrefercoursedescriptivestatisticsfromUdacity.
Categorical Variables: For categorical variables, well use frequency table to understand
distributionofeachcategory.Wecanalsoreadaspercentageofvaluesundereachcategory.Itcan
be be measured using two metrics, Count and Count% against each category. Bar chart can be
usedasvisualization.

BivariateAnalysis
BivariateAnalysis finds out the relationship between two variables. Here, we look for association
anddisassociationbetweenvariablesatapredefinedsignificancelevel.Wecanperformbivariate
analysis for any combination of categorical and continuous variables. The combination can be:
Categorical & Categorical, Categorical & Continuous and Continuous & Continuous. Different
methodsareusedtotacklethesecombinationsduringanalysisprocess.
Letsunderstandthepossiblecombinationsindetail:
Continuous&Continuous:While doing bivariate analysis between two continuous variables, we
should look at scatter plot. It is a nifty way to find out the relationship between two variables.The
patternofscatterplotindicatestherelationshipbetweenvariables.Therelationshipcanbelinearor
nonlinear.

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

5/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

Scatter plot shows the relationship between two variable but does not indicates the strength of
relationship amongst them.To find the strength of the relationship, we use Correlation. Correlation
variesbetween1and+1.
1:perfectnegativelinearcorrelation
+1:perfectpositivelinearcorrelationand
0:Nocorrelation

Correlationcanbederivedusingfollowingformula:
Correlation=Covariance(X,Y)/SQRT(Var(X)*Var(Y))
Varioustoolshavefunctionorfunctionalitytoidentifycorrelationbetweenvariables.InExcel,function
CORREL()isusedtoreturnthecorrelationbetweentwovariablesandSASusesprocedurePROC
CORR to identify the correlation. These function returns Pearson Correlation value to identify the
relationshipbetweentwovariables:

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

6/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

Inaboveexample,wehavegoodpositiverelationship(0.65)betweentwovariablesXandY.

Categorical&Categorical:Tofindtherelationshipbetweentwocategoricalvariables,wecanuse
followingmethods:
Twoway table: We can start analyzing the relationship by creating a twoway table of count and
count%.Therowsrepresentsthecategoryofonevariableandthecolumnsrepresentthecategoriesof
theothervariable.Weshowcountorcount%ofobservationsavailableineachcombinationofrowand
columncategories.
StackedColumnChart:ThismethodismoreofavisualformofTwowaytable.

ChiSquare Test: This test is used to derive the statistical significance of relationship between the
variables. Also, it tests whether the evidence in the sample is strong enough to generalize that the
relationshipforalargerpopulationaswell.Chisquareisbasedonthedifferencebetweentheexpected
andobservedfrequenciesinoneormorecategoriesinthetwowaytable.Itreturnsprobabilityforthe
computedchisquaredistributionwiththedegreeoffreedom.

Probabilityof0:Itindicatesthatbothcategoricalvariablearedependent
Probabilityof1:Itshowsthatbothvariablesareindependent.
Probabilitylessthan0.05:Itindicatesthattherelationshipbetweenthevariablesissignificantat
95% confidence. The chisquare test statistic for a test of independence of two categorical
variablesisfoundby:

where O represents the observed frequency. E is the expected frequency under the null
hypothesisandcomputedby:

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

7/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

From previous twoway table, the expected count for product category 1 to be of small size
is 0.22. It is derived by taking the row total for Size (9) times the column total for Product
category(2)thendividingbythesamplesize(81).Thisisprocedureisconductedforeachcell.
StatisticalMeasuresusedtoanalyzethepowerofrelationshipare:
CramersVforNominalCategoricalVariable
MantelHaenszedChiSquareforordinalcategoricalvariable.

Differentdatasciencelanguageandtoolshavespecificmethodstoperformchisquaretest.In
SAS,wecanuseChisqasanoptionwithProcfreqtoperformthistest.

Categorical&Continuous:Whileexploringrelationbetweencategoricalandcontinuousvariables,
wecandrawboxplotsforeachlevelofcategoricalvariables.Iflevelsaresmallinnumber,itwillnot
showthestatisticalsignificance.TolookatthestatisticalsignificancewecanperformZtest,Ttestor
ANOVA.
ZTest/TTest:Eithertestassesswhethermeanoftwogroupsarestatisticallydifferentfromeach
otherornot.

IftheprobabilityofZissmallthenthedifferenceoftwoaveragesismoresignificant.TheTtestisvery
similartoZtestbutitisusedwhennumberofobservationforbothcategoriesislessthan30.

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

8/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

ANOVA:Itassesseswhethertheaverageofmorethantwogroupsisstatisticallydifferent.

Example:Suppose,wewanttotesttheeffectoffivedifferentexercises.Forthis,werecruit20men
andassignonetypeofexerciseto4men(5groups).Theirweightsarerecordedafterafew
weeks.Weneedtofindoutwhethertheeffectoftheseexercisesonthemissignificantlydifferentor
not.Thiscanbedonebycomparingtheweightsofthe5groupsof4meneach.
Tillhere,wehaveunderstoodthefirstthreestagesofDataExploration,VariableIdentification,Uni
VariateandBiVariateanalysis.Wealsolookedatvariousstatisticalandvisualmethodstoidentify
therelationshipbetweenvariables.
Now,wewilllookatthemethodsofMissingvaluesTreatment.Moreimportantly,wewillalsolookat
whymissingvaluesoccurinourdataandwhytreatingthemisnecessary.

2.MissingValueTreatment

Whymissingvaluestreatmentisrequired?
Missingdatainthetrainingdatasetcanreducethepower/fitofamodelorcanleadtoabiased
modelbecausewehavenotanalysedthebehaviorandrelationshipwithothervariablescorrectly.It
canleadtowrongpredictionorclassification.

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

9/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

Notice the missing values in the image shown above: In the left scenario, we have not treated
missing values.The inference from this data set is that the chances of playing cricket by males is
higher than females. On the other hand, if you look at the second table, which shows data after
treatment of missing values (based on gender), we can see that females have higher chances of
playingcricketcomparedtomales.

Whymydatahasmissingvalues?
We looked at the importance of treatment of missing values in a dataset. Now, lets identify the
reasonsforoccurrenceofthesemissingvalues.Theymayoccurattwostages:
1.Data Extraction: It is possible that there are problems with extraction process. In such cases, we
shoulddoublecheckforcorrectdatawithdataguardians.Somehashingprocedurescanalsobeused
tomakesuredataextractioniscorrect.Errorsatdataextractionstagearetypicallyeasytofindandcan
becorrectedeasilyaswell.
2.Datacollection:Theseerrorsoccurattimeofdatacollectionandarehardertocorrect.Theycanbe
categorizedinfourtypes:
Missingcompletelyatrandom:Thisisacasewhentheprobabilityofmissingvariableissame
for all observations. For example: respondents of data collection process decide that they will
declare their earning after tossing a fair coin. If an head occurs, respondent declares his / her
earnings&viceversa.Hereeachobservationhasequalchanceofmissingvalue.
Missingatrandom:Thisisacasewhenvariableismissingatrandomandmissingratiovaries
fordifferentvalues/levelofotherinputvariables.Forexample:Wearecollectingdataforageand
femalehashighermissingvaluecomparetomale.

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

10/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

Missingthatdependsonunobservedpredictors:Thisisacasewhenthemissingvaluesare
notrandomandarerelatedtotheunobservedinputvariable.Forexample:Inamedicalstudy,ifa
particular diagnostic causes discomfort, then there is higher chance of drop out from the study.
Thismissingvalueisnotatrandomunlesswehaveincludeddiscomfortasaninputvariablefor
allpatients.
Missing that depends on the missing value itself: This is a case when the probability of
missingvalueisdirectlycorrelatedwithmissingvalueitself.Forexample:Peoplewithhigheror
lowerincomearelikelytoprovidenonresponsetotheirearning.

Whicharethemethodstotreatmissingvalues?
1.Deletion:Itisoftwotypes:ListWiseDeletionandPairWiseDeletion.
Inlistwisedeletion,wedeleteobservationswhereanyofthevariableismissing.Simplicityisone
of the major advantage of this method, but this method reduces the power of model because it
reducesthesamplesize.
In pair wise deletion, we perform analysis with all cases in which the variables of interest are
present.Advantageofthismethodis,itkeepsasmanycasesavailableforanalysis.Oneofthe
disadvantageofthismethod,itusesdifferentsamplesizefordifferentvariables.

DeletionmethodsareusedwhenthenatureofmissingdataisMissingcompletelyatrandom
elsenonrandommissingvaluescanbiasthemodeloutput.
2.Mean/ Mode/ Median Imputation: Imputation is a method to fill in the missing values with estimated
ones.Theobjectiveistoemployknownrelationshipsthatcanbeidentifiedinthevalidvaluesofthedata
set to assist in estimating the missing values. Mean / Mode / Median imputation is one of the most
frequentlyusedmethods.Itconsistsofreplacingthemissingdataforagivenattributebythemeanor
median(quantitativeattribute)ormode(qualitativeattribute)ofallknownvaluesofthatvariable.Itcanbe

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

11/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

oftwotypes:
Generalized Imputation: In this case, we calculate the mean or median for all non missing
values of that variable then replace missing value with mean or median. Like in above table,
variableManpowerismissingsowetakeaverageofallnonmissingvaluesofManpower
(28.33)andthenreplacemissingvaluewithit.
Similar case Imputation: In this case, we calculate average for gender Male (29.75) and
Female(25)individuallyofnonmissingvaluesthenreplacethemissingvaluebasedongender.
ForMale,wewillreplacemissingvaluesofmanpowerwith29.75andforFemalewith25.
3.PredictionModel:Predictionmodelisoneofthesophisticatedmethodforhandlingmissingdata.Here,
wecreateapredictivemodeltoestimatevaluesthatwillsubstitutethemissingdata.Inthiscase,we
divideourdatasetintotwosets:Onesetwithnomissingvaluesforthevariableandanotheronewith
missingvalues.Firstdatasetbecometrainingdatasetofthemodelwhileseconddatasetwithmissing
valuesistestdatasetandvariablewithmissingvaluesistreatedastargetvariable.Next,wecreatea
modeltopredicttargetvariablebasedonotherattributesofthetrainingdatasetandpopulatemissing
values of test data set.We can use regression, ANOVA, Logistic regression and various modeling
techniquetoperformthis.Thereare2drawbacksforthisapproach:
1.Themodelestimatedvaluesareusuallymorewellbehavedthanthetruevalues
2.If there are no relationships with attributes in the data set and the attribute with missing values,
thenthemodelwillnotbepreciseforestimatingmissingvalues.
4.KNNImputation:Inthismethodofimputation,themissingvaluesofanattributeareimputedusingthe
givennumberofattributesthataremostsimilartotheattributewhosevaluesaremissing.Thesimilarity
oftwoattributesisdeterminedusingadistancefunction.Itisalsoknowntohavecertainadvantage&
disadvantages.
Advantages:
knearestneighbourcanpredictbothqualitative&quantitativeattributes
Creationofpredictivemodelforeachattributewithmissingdataisnotrequired
Attributeswithmultiplemissingvaluescanbeeasilytreated
Correlationstructureofthedataistakenintoconsideration
Disadvantage:
KNNalgorithmisverytimeconsuminginanalyzinglargedatabase.Itsearchesthroughall
thedatasetlookingforthemostsimilarinstances.
Choice of kvalue is very critical. Higher value of k would include attributes which are
significantly different from what we need whereas lower value of k implies missing out of
significantattributes.

Afterdealingwithmissingvalues,thenexttaskistodealwithoutliers.Often,wetendtoneglect
outlierswhilebuildingmodels.Thisisadiscouragingpractice.Outlierstendtomakeyourdata

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

12/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

skewedandreducesaccuracy.Letslearnmoreaboutoutliertreatment.

3.TechniquesofOutlierDetectionandTreatment

WhatisanOutlier?
Outlierisacommonlyusedterminologybyanalystsanddatascientistsasitneedscloseattention
elseitcanresultinwildlywrongestimations.Simplyspeaking,Outlierisanobservationthatappears
farawayanddivergesfromanoverallpatterninasample.
Lets take an example, we do customer profiling and find out that the average annual income of
customersis$0.8million.But,therearetwocustomershavingannualincomeof$4and$4.2million.
These two customers annual income is much higher than rest of the population. These two
observationswillbeseenasOutliers.

WhatarethetypesofOutliers?
Outliercanbeoftwotypes:UnivariateandMultivariate.Above,wehavediscussedtheexampleof
univariateoutlier.Theseoutlierscanbefoundwhenwelookatdistributionofasinglevariable.Multi

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

13/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

variate outliers are outliers in an ndimensional space. In order to find them, you have to look at
distributionsinmultidimensions.
Letusunderstandthiswithanexample.Letussayweareunderstandingtherelationshipbetween
height and weight. Below, we have univariate and bivariate distribution for Height, Weight. Take a
lookattheboxplot.Wedonothaveanyoutlier(aboveandbelow1.5*IQR,mostcommonmethod).
Now look at the scatter plot. Here, we have two values below and one above the average in a
specificsegmentofweightandheight.

WhatcausesOutliers?
Wheneverwecomeacrossoutliers,theidealwaytotacklethemistofindoutthereasonofhaving
theseoutliers.Themethodtodealwiththemwouldthendependonthereasonoftheiroccurrence.
Causesofoutlierscanbeclassifiedintwobroadcategories:
1.Artificial(Error)/Nonnatural
2.Natural.

Letsunderstandvarioustypesofoutliersinmoredetail:
DataEntryErrors:Humanerrorssuchaserrorscausedduringdatacollection,recording,orentrycan
cause outliers in data. For example:Annual income of a customer is $100,000.Accidentally, the data
entryoperatorputsanadditionalzerointhefigure.Nowtheincomebecomes$1,000,000whichis10
timeshigher.Evidently,thiswillbetheoutliervaluewhencomparedwithrestofthepopulation.
MeasurementError:Itisthemostcommonsourceofoutliers.Thisiscausedwhenthemeasurement
instrument used turns out to be faulty. For example: There are 10 weighing machines. 9 of them are
correct,1isfaulty.Weightmeasuredbypeopleonthefaultymachinewillbehigher/lowerthantherest
ofpeopleinthegroup.Theweightsmeasuredonfaultymachinecanleadtooutliers.

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

14/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

ExperimentalError:Anothercauseofoutliersisexperimentalerror.Forexample:Ina100msprintof7
runners,onerunnermissedoutonconcentratingontheGocallwhichcausedhimtostartlate.Hence,
thiscausedtherunnersruntimetobemorethanotherrunners.Histotalruntimecanbeanoutlier.
IntentionalOutlier:Thisiscommonlyfoundinselfreportedmeasuresthatinvolvessensitivedata.For
example:Teenswouldtypicallyunderreporttheamountofalcoholthattheyconsume.Onlyafractionof
themwouldreportactualvalue.Hereactualvaluesmightlooklikeoutliersbecauserestoftheteensare
underreportingtheconsumption.
DataProcessingError:Wheneverweperformdatamining,weextractdatafrommultiplesources.Itis
possiblethatsomemanipulationorextractionerrorsmayleadtooutliersinthedataset.
Samplingerror:Forinstance,wehavetomeasuretheheightofathletes.Bymistake,weincludeafew
basketballplayersinthesample.Thisinclusionislikelytocauseoutliersinthedataset.
NaturalOutlier:Whenanoutlierisnotartificial(duetoerror),itisanaturaloutlier.Forinstance:Inmy
lastassignmentwithoneoftherenownedinsurancecompany,Inoticedthattheperformanceoftop50
financial advisors was far higher than rest of the population. Surprisingly, it was not due to any error.
Hence, whenever we perform any data mining activity with advisors, we used to treat this segment
separately.

WhatistheimpactofOutliersonadataset?
Outliers can drastically change the results of the data analysis and statistical modeling. There are
numerousunfavourableimpactsofoutliersinthedataset:
Itincreasestheerrorvarianceandreducesthepowerofstatisticaltests
Iftheoutliersarenonrandomlydistributed,theycandecreasenormality
Theycanbiasorinfluenceestimatesthatmaybeofsubstantiveinterest
They can also impact the basic assumption of Regression, ANOVA and other statistical model
assumptions.

Tounderstandtheimpactdeeply,letstakeanexampletocheckwhathappenstoadatasetwithand
withoutoutliersinthedataset.
Example:

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

15/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

Asyoucansee,datasetwithoutliershassignificantlydifferentmeanandstandarddeviation.Inthe
firstscenario,wewillsaythataverageis5.45.Butwiththeoutlier,averagesoarsto30.Thiswould
changetheestimatecompletely.

HowtodetectOutliers?
Most commonly used method to detect outliers is visualization. We use various visualization
methods,likeBoxplot,Histogram,ScatterPlot(above,wehaveusedboxplotandscatterplotfor
visualization).Someanalystsalsovariousthumbrulestodetectoutliers.Someofthemare:
Anyvalue,whichisbeyondtherangeof1.5xIQRto1.5xIQR
Usecappingmethods.Anyvaluewhichoutofrangeof5thand95thpercentilecanbeconsideredas
outlier
Datapoints,threeormorestandarddeviationawayfrommeanareconsideredoutlier
Outlierdetectionismerelyaspecialcaseoftheexaminationofdataforinfluentialdatapointsanditalso
dependsonthebusinessunderstanding
Bivariateandmultivariateoutliersaretypicallymeasuredusingeitheranindexofinfluenceorleverage,
ordistance.PopularindicessuchasMahalanobisdistanceandCooksDarefrequentlyusedtodetect
outliers.
InSAS,wecanusePROCUnivariate,PROCSGPLOT.Toidentifyoutliersandinfluentialobservation,
wealsolookatstatisticalmeasurelikeSTUDENT,COOKD,RSTUDENTandothers.

HowtoremoveOutliers?
Most of the ways to deal with outliers are similar to the methods of missing values like deleting

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

16/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

observations,transformingthem,binningthem,treatthemasaseparategroup,imputingvaluesand
otherstatisticalmethods.Here,wewilldiscussthecommontechniquesusedtodealwithoutliers:
Deletingobservations:Wedeleteoutliervaluesifitisduetodataentryerror,dataprocessingerror
oroutlierobservationsareverysmallinnumbers.Wecanalsousetrimmingatbothendstoremove
outliers.
Transformingandbinningvalues:Transformingvariablescanalsoeliminateoutliers.Naturallog
of a value reduces the variation caused by extreme values. Binning is also a form of variable
transformation. DecisionTree algorithm allows to deal with outliers well due to binning of variable.
Wecanalsousetheprocessofassigningweightstodifferentobservations.

Imputing: Like imputation of missing values, we can also impute outliers. We can use mean,
median,modeimputationmethods.Beforeimputingvalues,weshouldanalyseifitisnaturaloutlier
or artificial. If it is artificial, we can go with imputing values. We can also use statistical model to
predictvaluesofoutlierobservationandafterthatwecanimputeitwithpredictedvalues.
Treatseparately:Iftherearesignificantnumberofoutliers,weshouldtreatthemseparatelyinthe
statistical model. One of the approach is to treat both groups as two different groups and build
individualmodelforbothgroupsandthencombinetheoutput.

Tillhere,wehavelearntaboutstepsofdataexploration,missingvaluetreatmentandtechniquesof
outlierdetectionandtreatment.These3stageswillmakeyourrawdatabetterintermsofinformation
availabilityandaccuracy.Letsnowproceedtothefinalstageofdataexploration.ItisFeature

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

17/18

29/5/2016

ACompleteTutorialwhichteachesDataExplorationindetail

Engineering.

4.TheArtofFeatureEngineering

WhatisFeatureEngineering?
Featureengineeringisthescience(andart)ofextractingmoreinformationfromexistingdata.You
are not adding any new data here, but you are actually making the data you already have more
useful.
Forexample,letssayyouaretryingtopredictfootfallinashoppingmallbasedondates.Ifyoutry
andusethedatesdirectly,youmaynotbeabletoextractmeaningfulinsightsfromthedata.Thisis
becausethefootfallislessaffectedbythedayofthemonththanitisbythedayoftheweek.Now
this information about day of week is implicit in your data. You need to bring it out to make your
modelbetter.
Thisexercisingofbringingoutinformationfromdatainknownasfeatureengineering.

WhatistheprocessofFeatureEngineering?
You perform feature engineering once you have completed the first 5 steps in data exploration
Variable Identification, Univariate, Bivariate Analysis , Missing Values Imputation and Outliers
Treatment.Featureengineeringitselfcanbedividedin2steps:
Variabletransformation.
Variable/Featurecreation.

These two techniques are vital in data exploration and have a remarkable impact on the power of
prediction.Letsunderstandeachofthisstepinmoredetails.

http://www.analyticsvidhya.com/blog/2016/01/guidedataexploration/#three

18/18

Você também pode gostar