
Final Project MIS 6324 Data Mining Techniques

PROJECT REPORT
MIS 6324: Business Intelligence Software & Techniques
Topic: Data Mining from Bird Strikes Data
Presented by: Group 2
Adt Sau|a
Dvya Vanachara
Ishan Dndorkar
Rohan Pat
Vaay Rava
Under the guidance of Prof. Kelly Slaughter
Contents
1. Introduction
2. Understanding the Dataset
3. Data Cleaning
4. Apriori Algorithm - Extracting Interesting Association Rules from Dataset
5. Logistic Regression
6. Classification: Decision Tree
7. Partition Clustering
8. References
Executive Summary
1. Introduction
Our aim is to improve the safety of airplane flights by examining aircraft safety in the context of wildlife strikes. We do this by applying data mining techniques to a large dataset (around 20 attributes and 20,000 instances) and looking for interesting patterns that can aid decision making. [Reviewer comment: as we are a business school, start with the value to be delivered, not the process to be conducted.]
We carried out the operations on the dataset "Bird Strike.xlsx". It purports to represent all the Bird Strikes [Reviewer comment: define at first use] against airplanes reported from 2000-2011 in the United States of America. This dataset is public and listed on the Tableau Software Community; it was extracted from the Federal Aviation Administration (FAA) [Reviewer comment: link]. As the FAA puts it: "The FAA Wildlife Strike Database contains records of reported wildlife strikes since 1990. Strike reporting is voluntary. Therefore, this database only represents the information we have received from airlines, airports, pilots, and other sources."
By definition, a Bird Strike is a collision between an aircraft and airborne animals (generally birds, but also including others) [Reviewer comment: Bats?]. Our interest in this particular database was ignited by the fact that Bird Strikes are a major cause of concern for the airline industry and Air Traffic Control around the world. Some major casualties caused by Bird Strikes are listed below (source: Wikipedia):
- The Federal Aviation Administration (FAA) estimates the problem costs US aviation 400 million dollars annually and has resulted in over 200 worldwide deaths since 1988.
- NASA astronaut Theodore Freeman was killed when a goose shattered the Plexiglas cockpit canopy of his Northrop T-38 Talon, resulting in shards being ingested by the engines, leading to a fatal crash.
- In 1988 Ethiopian Airlines Flight 604 sucked pigeons into both engines during take-off and then crashed, killing 35 passengers.
- On September 22, 1995, a U.S. Air Force Boeing E-3 Sentry AWACS aircraft (callsign Yukla 27, serial number 77-0354) crashed shortly after takeoff from Elmendorf AFB. The aircraft lost power in both port side engines after these engines ingested several Canada Geese during takeoff. It crashed about two miles (3 km) from the runway, killing all 24 crew members on board.
2. Understanding the Dataset
Before carrying out any kind of data mining activity, it is very important to know the data and understand exactly what it purports to represent. Each instance of our dataset represents a reported Bird Strike and all the details related to it. The table below explains what each attribute represents about the Bird Strike instance:
[Reviewer comment: I would suggest including this type of table in an appendix and including here a more easy, less technical presentation, leaving out the Type and using natural descriptions such as aircraft type rather than Aircraft_Type.]
BirdStrikes
Attribute | Explanation | Type
Aircraft_Type | What kind of aircraft was involved in the Bird Strike | Nominal
Airport_Name | At which airport the strike was detected | Nominal
Altitude_bin | Altitude band in which the strike occurred (below or above 1000 ft) | Binomial
Aircraft_Model | Model of the aircraft struck | Nominal
Wildlife_Number_struck | Number of wildlife struck in the instance | Range
Effect_Impact_to_flight | Impact of the strike on the flight, if any | Nominal
Record_ID | A unique record ID for each incident | Numerical
Effect_Indicated_Damage | Damage caused to the aircraft | Binomial
Aircraft_Number_of_engines. | Number of engines in the aircraft struck | Numerical
Aircraft_Airline.Operator | Airline operator of the aircraft struck | Nominal
Origin_State | US state in which the strike occurred | Nominal
When_Phase_of_flight | Phase of flight during which the strike occurred | Nominal
Conditions_Precipitation | Precipitation conditions during the strike | Nominal
Remains_of_wildlife_collected | Whether the wildlife remains were collected | Binomial
Wildlife_Size | Size of the wildlife struck (Small, Medium, Large) | Range
Conditions_Sky | Condition of the sky during the strike | Nominal
Wildlife_Species | Species of the wildlife struck | Nominal
Pilot_warned | Whether the pilot was warned of the possible strike | Binomial
Feet_above_ground | Height of the aircraft above the ground, in feet | Numerical
Speed (in Knots) | Speed of the aircraft during the strike | Numerical
By graphing the attribute values of the database, we can answer some basic questions about the dataset:
a) Frequency of wildlife species where damage was caused (most common)
[Reviewer comment: Given your definition of bird strikes as airborne, the inclusion of deer seems at odds.]
b) Frequency of strikes for each state
[Reviewer comment: Frequency per airplane? Per year? Total count?]
c) Plotting of Species & States for each strike
The graphs give us the following results about the dataset:
1) The species that strikes airplanes most frequently is the Canada Goose (after the Unknown category).
2) Florida has the highest frequency of bird strikes, followed by Colorado and Texas.
3) Even though the Canada Goose has the highest frequency of Bird Strikes (distributed across all the states), when we plot the State vs Species graph we see that Turkey Vulture strikes are highly prominent in Florida. [Reviewer comment: Nice finding.]
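The counts behind frequency charts like these can be produced with base R's table(). The following is a minimal sketch on a made-up data frame: the column names follow the attribute table, but every value is invented for illustration and the object names (strikes, state_counts, and so on) are assumptions, not part of the original analysis.

```r
# Toy data: values are invented for illustration, not taken from the FAA dataset.
strikes <- data.frame(
  Origin_State     = c("Florida", "Florida", "Texas", "Colorado", "Florida", "Texas"),
  Wildlife_Species = c("Turkey vulture", "Canada goose", "Canada goose",
                       "Canada goose", "Turkey vulture", "Unknown bird"),
  stringsAsFactors = FALSE
)

# Frequency of strikes per state, most frequent first.
state_counts <- sort(table(strikes$Origin_State), decreasing = TRUE)

# Frequency of strikes per species, most frequent first.
species_counts <- sort(table(strikes$Wildlife_Species), decreasing = TRUE)

# Cross-tabulation of state against species (the basis of a State vs Species plot).
state_by_species <- table(strikes$Origin_State, strikes$Wildlife_Species)

print(state_counts)
print(species_counts)
print(state_by_species)
```

A cross-tabulation of this kind makes it easy to see when a species that is not the most frequent overall still dominates in one state.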
3. Data Cleaning
The next important step, after knowing the data and before carrying out data mining activities, is to "clean" the data. As the name suggests, cleaning the data means getting rid of non-required or inaccurate instances. [Reviewer comment: Why not clean the data before visualization - could your Florida finding be a result of an outlier?] It improves the data quality and reduces the negative impact of errors. We will be cleaning the dataset by:
a) Removing instances having NULL/Blank values
By manual scanning, we can detect that a number of instances have blank values. We will address this issue by first replacing the blank values with "NA" and then removing all the instances which have one or more "NA" values in them.
Note: We will not be replacing the NULL values with mean values because the attributes having NULL values are all Nominal. [Reviewer comment: Is there any pattern to the null values (by row or column)? Can you justify removing these values apart from the fact that their inclusion makes analyzing the data more difficult?]
R commands: [Reviewer comment: per the assignment description, wanted to see the technical commands in their own section.]
> Bird_Strikes <- read.csv("Birds.csv") #read the CSV dataset
> Bird_Strikes[Bird_Strikes == ""] <- NA #convert blank values into NA
> Bird_Strikes$Origin_State[Bird_Strikes$Origin_State == "N/A"] <- NA #treat the literal "N/A" as missing
> Bird_Strikes <- na.omit(Bird_Strikes) #delete instances with NA values
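The cleaning steps can be tried on a small example before they are run on the full file. The sketch below uses an invented four-row data frame (the object names raw and clean are assumptions) and adds one step that is not in the original commands: a per-column count of missing values, which is one simple way to check whether the gaps cluster in particular attributes.

```r
# Toy data: invented values; column names follow the attribute table.
raw <- data.frame(
  Origin_State  = c("Florida", "N/A", "Texas", ""),
  Wildlife_Size = c("Large", "Small", "", "Medium"),
  Pilot_warned  = c("N", "Y", "N", "Y"),
  stringsAsFactors = FALSE
)

# Treat empty strings and the literal "N/A" as missing.
raw[raw == ""] <- NA
raw$Origin_State[raw$Origin_State %in% "N/A"] <- NA

# Missing values per column: shows whether the gaps cluster in a few attributes.
missing_per_column <- colSums(is.na(raw))

# Keep only rows with no missing value in any column.
clean <- na.omit(raw)

print(missing_per_column)
cat("rows kept:", nrow(clean), "of", nrow(raw), "\n")
```

Because na.omit() drops a row if any single column is missing, a column with many gaps can remove most of the data; the per-column count makes that visible before rows are deleted.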
After removing the NULL values, our dataset still has 19,375 "clean" instances out of 90,000+.
b) Removing Outliers, if required
By plotting the box-plot of all attributes, we can observe that the dataset does not contain any outliers which have to be deleted. There are some extreme values present, but they are important for consideration. [Reviewer comment: should include the box plot here or in the appendix.]
4. Apriori Algorithm - Extracting Interesting Association Rules from Dataset
Using the Apriori algorithm, we will find some interesting association rules which will help us detect the factors associated with the outcome "Effect_Indicated_Damage = Caused damage". These rules showcase the conditions under which damage is done to aircraft in a Bird Strike.
R commands: [Reviewer comment: again, appendix is better here for R commands, this really breaks the flow of the reader; what would be helpful is to discuss how you selected frequency and confidence thresholds.]
> library(arules) #provides the "transactions" class and apriori()
> Bird_Strikes$Feet_above_ground <- as.factor(gsub(",", "", Bird_Strikes$Feet_above_ground)) #removes "," from attribute Feet_above_ground
> Bird_Strikes$Record_ID <- as.factor(Bird_Strikes$Record_ID)
> Bird_Strikes$Speed <- as.factor(Bird_Strikes$Speed)
> Bird_Strikes_trans <- as(Bird_Strikes, "transactions")
> BirdtypeRules <- apriori(Bird_Strikes_trans, parameter=list(support=.01, confidence=.6))
parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.6    0.1    1 none FALSE            TRUE    0.01      1     10  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

[Reviewer comment: the text below is unnecessary even in the appendix.]
apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[43255 item(s), 20375 transaction(s)] done [0.08s].
sorting and recoding items ... [192 item(s)] done [0.01s].
creating transaction tree ... done [0.01s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10 done [9.01s].
writing ... [2605356 rule(s)] done [0.95s].
creating S4 object  ... done [3.72s]

> BirdRules_caused <- subset(BirdtypeRules, subset = rhs %in% "Effect_Indicated_Damage=Caused damage" & lift > 1.0)
> inspect(sort(BirdRules_caused, by = "confidence")[1:10])
   lhs => rhs   support   confidence   lift
1  {Aircraft_Type=Airplane, Effect_Impact_to_flight=Precautionary Landing, Wildlife_Size=Large} => {Effect_Indicated_Damage=Caused damage}   0.01099387   0.8265683   7.218743
2  {Aircraft_Type=Airplane, Effect_Impact_to_flight=Precautionary Landing, Conditions_Precipitation=None, Wildlife_Size=Large} => {Effect_Indicated_Damage=Caused damage}   0.01035583   0.8210117   7.170216
3  {Effect_Impact_to_flight=Precautionary Landing, Wildlife_Size=Large} => {Effect_Indicated_Damage=Caused damage}   0.01207362   0.8200000   7.161380
4  {Effect_Impact_to_flight=Precautionary Landing, Conditions_Precipitation=None, Wildlife_Size=Large} => {Effect_Indicated_Damage=Caused damage}   0.01138650   0.8140351   7.109286
5  {Aircraft_Airline.Operator=BUSINESS, Remains_of_wildlife_collected=FALSE, Wildlife_Size=Large, Pilot_warned=N} => {Effect_Indicated_Damage=Caused damage}   0.01006135   0.7620818   6.655558
6  {Aircraft_Type=Airplane, Aircraft_Airline.Operator=BUSINESS, Conditions_Precipitation=None, Wildlife_Size=Large, Pilot_warned=N} => {Effect_Indicated_Damage=Caused damage}   0.01060123   0.7578947   6.618991
7  {Aircraft_Type=Airplane, Aircraft_Airline.Operator=BUSINESS, Wildlife_Size=Large, Pilot_warned=N} => {Effect_Indicated_Damage=Caused damage}   0.01168098   0.7555556   6.598562
8  {Aircraft_Airline.Operator=BUSINESS, Conditions_Precipitation=None, Wildlife_Size=Large, Pilot_warned=N} => {Effect_Indicated_Damage=Caused damage}   0.01128834   0.7516340   6.564313
9  {Aircraft_Airline.Operator=BUSINESS, Wildlife_Size=Large, Pilot_warned=N} => {Effect_Indicated_Damage=Caused damage}   0.01241718   0.7507418   6.556522
10 {Aircraft_Type=Airplane, Aircraft_Number_of_engines.=1, Wildlife_Size=Large} => {Effect_Indicated_Damage=Caused damage}   0.01011043   0.7304965   6.379711
From some of the interesting rules above, we can conclude that damage is caused to the aircraft most often during Bird Strikes when:
a) Wildlife_Size = Large
b) Pilot_warned = N
c) Effect_Impact_to_flight = Precautionary Landing
These can be useful results for airline operators and ATC when taking precautionary measures. For example, we can see that damage is caused most often when pilots are not warned by ATC about the possibility of a Bird Strike. [Reviewer comment: These are the findings that should not be buried in with the code but included in their own section and highlighted.]
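The three measures reported for each rule can be computed by hand, which helps when judging the thresholds. The sketch below uses the standard definitions (support = share of records matching both sides, confidence = support divided by the share matching the left-hand side, lift = confidence divided by the base rate of the right-hand side) on an invented eight-row data frame; the object names are assumptions. The last line divides the confidence of rule 1 by its lift, both copied from the output above, to recover the base rate of damage those figures imply.

```r
# Toy transactions: invented values for illustration only.
strikes <- data.frame(
  Wildlife_Size = c("Large", "Large", "Small", "Small", "Large", "Small", "Small", "Small"),
  Damage        = c(TRUE,    TRUE,    FALSE,   FALSE,   FALSE,   FALSE,   TRUE,    FALSE)
)

lhs <- strikes$Wildlife_Size == "Large"  # rule antecedent
rhs <- strikes$Damage                    # rule consequent

support    <- mean(lhs & rhs)        # share of all records matching both sides
confidence <- support / mean(lhs)    # share of antecedent records that also match rhs
lift       <- confidence / mean(rhs) # confidence relative to the base rate of rhs

# Base rate of damage implied by rule 1 in the report (confidence / lift).
implied_base_rate <- 0.8265683 / 7.218743

print(c(support = support, confidence = confidence, lift = lift))
print(implied_base_rate)
```

The implied base rate is roughly 0.11, so a confidence of about 0.83 for rule 1 means damage is about seven times more likely under that rule's conditions than across all reported strikes.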
To see which species cause damage most often, we will run another small Apriori analysis on Wildlife_Species and Effect_Indicated_Damage:
> Bird_Type <- data.frame(Bird_Strikes$Effect_Indicated_Damage, Bird_Strikes$Wildlife_Species)
> Bird_Type_trans <- as(Bird_Type, "transactions")
> BirdtypeRules <- apriori(Bird_Type_trans, parameter=list(support=.005, confidence=.4))
parameter specification:
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.4    0.1    1 none FALSE            TRUE   0.005      1     10  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[312 item(s), 19705 transaction(s)] done [0.00s].
sorting and recoding items ... [19 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [18 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].

> BirdRules_caused <- subset(BirdtypeRules, subset = rhs %in% "Bird_Strikes.Effect_Indicated_Damage=Caused damage" & lift > 1.0)
> inspect(BirdRules_caused)
  lhs => rhs   support   confidence   lift
1 {Bird_Strikes.Wildlife_Species=Turkey vulture} => {Bird_Strikes.Effect_Indicated_Damage=Caused damage}   0.005531591   0.6158192   5.466089
2 {Bird_Strikes.Wildlife_Species=Canada goose} => {Bird_Strikes.Effect_Indicated_Damage=Caused damage}   0.009489977   0.6032258   5.354308
The above results show that a Bird Strike most often causes damage to the aircraft when Wildlife_Species = Turkey vulture OR Canada goose: roughly 60% of the strikes involving either species caused damage. The ATC can run special operations to relocate these species from the surrounding areas of the airport.
5. Logistic Regression
To predict whether a Bird Strike will cause damage or not, we will use Logistic Regression on the dataset. Logistic Regression is generally used in cases where the predicted variable is binomial, like True or False, Yes or No, etc.
In our case, the dependent variable will be "Effect_Indicated_Damage" and the independent variables will be the rest of the attributes.
R Commands
> Bird_DataSet <- read.csv('Bird.csv')
> Bird_DataSet$Aircraft_Airline_Operator <- as.numeric(Bird_DataSet$Aircraft_Airline_Operator)
> Bird_DataSet$Aircraft_Type <- as.numeric(Bird_DataSet$Aircraft_Type)
> Bird_DataSet$Airport_Name <- as.numeric(Bird_DataSet$Airport_Name)
> Bird_DataSet$Altitude_bin <- as.numeric(Bird_DataSet$Altitude_bin)
> Bird_DataSet$Aircraft_Model <- as.numeric(Bird_DataSet$Aircraft_Model)
> Bird_DataSet$Wildlife_Number_struck <- as.numeric(Bird_DataSet$Wildlife_Number_struck)
> Bird_DataSet$Effect_Impact_to_flight <- as.numeric(Bird_DataSet$Effect_Impact_to_flight)
> Bird_DataSet$Effect_Indicated_Damage <- as.numeric(Bird_DataSet$Effect_Indicated_Damage)
> Bird_DataSet$Aircraft_Number_of_engines. <- as.numeric(Bird_DataSet$Aircraft_Number_of_engines.)
> Bird_DataSet$Origin_State <- as.numeric(Bird_DataSet$Origin_State)
> Bird_DataSet$When_Phase_of_flight <- as.numeric(Bird_DataSet$When_Phase_of_flight)
> Bird_DataSet$Conditions_Precipitation <- as.numeric(Bird_DataSet$Conditions_Precipitation)
> Bird_DataSet$Wildlife_Size <- as.numeric(Bird_DataSet$Wildlife_Size)
> Bird_DataSet$Conditions <- as.numeric(Bird_DataSet$Conditions)
> Bird_DataSet$Wildlife_Species <- as.numeric(Bird_DataSet$Wildlife_Species)
> Bird_DataSet$Pilot_warned <- as.numeric(Bird_DataSet$Pilot_warned)
> Bird_DataSet$Feet_above_ground <- as.numeric(Bird_DataSet$Feet_above_ground)
> Bird_DataSet$Speed <- as.numeric(Bird_DataSet$Speed)

> Bird_DataSet$Effect_Indicated_Damage <- sub("1","0",Bird_DataSet$Effect_Indicated_Damage, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
> Bird_DataSet$Effect_Indicated_Damage <- sub("2","1",Bird_DataSet$Effect_Indicated_Damage, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
> Bird_DataSet$Effect_Indicated_Damage <- as.numeric(Bird_DataSet$Effect_Indicated_Damage)
> Bird_LRModel <- glm(Effect_Indicated_Damage~.-X,family=binomial,data=Bird_DataSet)
> summary(Bird_LRModel)
Call:
glm(formula = Effect_Indicated_Damage ~ . - X, family = binomial,
    data = Bird_DataSet)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.1351   0.1939   0.2671   0.4115   2.4117

Coefficients:
                                    Estimate Std. Error z value Pr(>|z|)
(Intercept)                       -7.881e-01  4.259e-01  -1.851 0.064239 .
Aircraft_Type                     -2.570e-01  1.844e-01  -1.394 0.163374
Airport_Name                      -4.395e-04  9.675e-05  -4.542 5.56e-06 ***
Altitude_bin                      -4.713e-01  6.981e-02  -6.751 1.47e-11 ***
Aircraft_Model                    -1.109e-03  3.459e-04  -3.205 0.001351 **
Wildlife_Number_struck            -3.706e-01  3.346e-02 -11.074  < 2e-16 ***
Effect_Impact_to_flight           -5.529e-01  3.780e-02 -14.624  < 2e-16 ***
Record_ID                          4.598e-06  8.014e-07   5.738 9.60e-09 ***
Aircraft_Number_of_engines.        5.304e-01  7.266e-02   7.300 2.87e-13 ***
Aircraft_Airline_Operator          2.724e-03  3.567e-04   7.637 2.23e-14 ***
Origin_State                      -1.204e-03  1.556e-03  -0.774 0.439075
When_Phase_of_flight              -5.628e-02  1.413e-02  -3.983 6.82e-05 ***
Conditions_Precipitation           1.163e-02  3.862e-02   0.301 0.763383
Remains_of_wildlife_collectedTRUE -6.962e-01  7.182e-02  -9.693  < 2e-16 ***
Wildlife_Size                      1.526e+00  3.517e-02  43.399  < 2e-16 ***
Conditions                         2.305e-02  2.987e-02   0.772 0.440326
Wildlife_Species                   1.343e-03  3.718e-04   3.612 0.000304 ***
Pilot_warned                       3.006e-01  5.831e-02   5.154 2.54e-07 ***
Feet_above_ground                 -5.510e-02  1.324e-02  -4.162 3.16e-05 ***
Speed                             -1.020e-02  5.948e-02  -0.171 0.863840
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13874  on 19704  degrees of freedom
Residual deviance: 10176  on 19685  degrees of freedom
AIC: 10216

Number of Fisher Scoring iterations: 6
Post Deletion of Insignificant Variables

We now remove the following insignificant variables [Reviewer comment: the selection of model variables is more complicated than simply removing the insignificant variables, but beyond the scope of the class so you are fine doing this]:

1. Aircraft_Type
2. Origin_State
3. Conditions_Precipitation
4. Conditions
5. Speed

> Bird_LRModel <- glm(Effect_Indicated_Damage~.-Origin_State-Conditions_Precipitation-Aircraft_Type-Conditions-Speed-X,family=binomial,data=Bird_DataSet)
> summary(Bird_LRModel)

Call:
glm(formula = Effect_Indicated_Damage ~ . - Origin_State - Conditions_Precipitation -
    Aircraft_Type - Conditions - Speed - X, family = binomial,
    data = Bird_DataSet)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.1541   0.1944   0.2679   0.4118   2.4130

Coefficients:
                                    Estimate Std. Error z value Pr(>|z|)
(Intercept)                       -9.615e-01  3.334e-01  -2.884 0.003925 **
Airport_Name                      -4.530e-04  9.574e-05  -4.731 2.23e-06 ***
Altitude_bin                      -4.801e-01  6.112e-02  -7.856 3.97e-15 ***
Aircraft_Model                    -1.112e-03  3.427e-04  -3.243 0.001182 **
Wildlife_Number_struck            -3.687e-01  3.340e-02 -11.038  < 2e-16 ***
Effect_Impact_to_flight           -5.585e-01  3.750e-02 -14.892  < 2e-16 ***
Record_ID                          4.496e-06  7.979e-07   5.635 1.75e-08 ***
Aircraft_Number_of_engines.        5.434e-01  6.637e-02   8.187 2.68e-16 ***
Aircraft_Airline_Operator          2.729e-03  3.559e-04   7.669 1.73e-14 ***
When_Phase_of_flight              -5.756e-02  1.378e-02  -4.176 2.97e-05 ***
Remains_of_wildlife_collectedTRUE -6.899e-01  7.169e-02  -9.624  < 2e-16 ***
Wildlife_Size                      1.523e+00  3.507e-02  43.413  < 2e-16 ***
Wildlife_Species                   1.329e-03  3.706e-04   3.586 0.000336 ***
Pilot_warned                       3.101e-01  5.792e-02   5.354 8.60e-08 ***
Feet_above_ground                 -5.758e-02  1.307e-02  -4.405 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 13874  on 19704  degrees of freedom
Residual deviance: 10179  on 19690  degrees of freedom
AIC: 10209

Number of Fisher Scoring iterations: 6
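The two AIC figures can be reproduced from numbers already printed in the summaries. For a logistic regression on a 0/1 response, the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The sketch below is a check of that arithmetic only; the helper names n_coef and aic are assumptions, and the inputs are the rounded values shown in the output.

```r
# Values copied from the two glm summaries in this section.
null_df <- 19704
full    <- list(resid_dev = 10176, resid_df = 19685)
reduced <- list(resid_dev = 10179, resid_df = 19690)

# Number of estimated coefficients (including the intercept).
n_coef <- function(m) null_df - m$resid_df + 1

# For a binary (0/1) response the residual deviance equals -2 * log-likelihood,
# so AIC = residual deviance + 2 * number of coefficients.
aic <- function(m) m$resid_dev + 2 * n_coef(m)

aic_full    <- aic(full)
aic_reduced <- aic(reduced)

print(c(full = aic_full, reduced = aic_reduced))
```

Seen this way, dropping five variables raised the deviance by only about 3 while saving a penalty of 10, which is where the 7-point difference in AIC comes from.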
We can observe that in the first model, the AIC value was 10216. In the second model, we removed the insignificant variables and the new AIC value came out to be 10209, which shows it is a better model. [Reviewer comment: It is probably more accurate to say that the models are similar with respect to AIC (a 7 pt change on a 10k scale is not much to claim); however, given the similarity, the simpler model (fewer independent variables) is the better model.] Hence, our final equation is:
log-odds(Effect_Indicated_Damage = 1) =
  -9.615e-01
  - 4.530e-04 (Airport_Name)
  - 4.801e-01 (Altitude_bin)
  - 1.112e-03 (Aircraft_Model)
  - 3.687e-01 (Wildlife_Number_struck)
  - 5.585e-01 (Effect_Impact_to_flight)
  + 4.496e-06 (Record_ID)
  + 5.434e-01 (Aircraft_Number_of_engines.)
  + 2.729e-03 (Aircraft_Airline_Operator)
  - 5.756e-02 (When_Phase_of_flight)
  - 6.899e-01 (Remains_of_wildlife_collected)
  + 1.523e+00 (Wildlife_Size)
  + 1.329e-03 (Wildlife_Species)
  + 3.101e-01 (Pilot_warned)
  - 5.758e-02 (Feet_above_ground)
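The fitted expression is on the log-odds scale, so it needs the inverse logit to become a probability, and each coefficient is easiest to read as an odds ratio. The sketch below shows both steps; inv_logit is a hypothetical helper, and the three log-odds values are arbitrary inputs chosen only to show the shape of the transformation. One caution, stated as an assumption: if the factor levels were coded alphabetically before the 1/2 to 0/1 recoding, then the value 1 would stand for "No damage", and the signs should be read accordingly.

```r
# Convert log-odds to probability with the inverse logit.
inv_logit <- function(log_odds) 1 / (1 + exp(-log_odds))

# Odds ratio for a one-unit increase in the numeric Wildlife_Size code,
# other inputs held fixed (coefficient taken from the reduced model).
odds_ratio_size <- exp(1.523)

# Hypothetical log-odds values, chosen only to show the shape of the transformation.
p <- inv_logit(c(-2, 0, 2))

print(odds_ratio_size)
print(p)
```

Base R also provides plogis(), which computes the same inverse logit.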
[Reviewer comment: In the paper I would describe the final model and include the full model as above in the appendix. The question remains - what does this tell me about avoiding bird strikes?]
6. Classification: Decision Tree
Through this technique, we tried predicting the values ("Caused Damage", "No Damage") of the dependent variable (class label), Effect_Indicated_Damage. We divided our dataset into two parts: training data and test data. With the help of the training data, we created a predictor model by executing the tree function. Then, we tried predicting the values of Effect_Indicated_Damage in the test dataset by applying the predictor model using the command predict. [Reviewer comment: good - I did not cover this in class but it is useful.] This exercise helped us in classifying casualty cases on the basis of occurrence of damage and in gauging the efficiency of the predictor model if it is applied to additional bird strike data.
R commands:
> install.packages("tree")
> library(tree)
> Bird_Training_DataSet$Aircraft_Type <- as.numeric(Bird_Training_DataSet$Aircraft_Type)
> Bird_Training_DataSet$Altitude_bin <- as.numeric(Bird_Training_DataSet$Altitude_bin)
> Bird_Training_DataSet$Airport_Name <- as.numeric(Bird_Training_DataSet$Airport_Name)
> Bird_Training_DataSet$Aircraft_Model <- as.numeric(Bird_Training_DataSet$Aircraft_Model)
> Bird_Training_DataSet$Wildlife_Number_struck <- as.numeric(Bird_Training_DataSet$Wildlife_Number_struck)
> Bird_Training_DataSet$Effect_Impact_to_flight <- as.numeric(Bird_Training_DataSet$Effect_Impact_to_flight)
> Bird_Training_DataSet$Effect_Indicated_Damage <- as.numeric(Bird_Training_DataSet$Effect_Indicated_Damage)
> Bird_Training_DataSet$Aircraft_Number_of_engines. <- as.numeric(Bird_Training_DataSet$Aircraft_Number_of_engines.)
> Bird_Training_DataSet$Origin_State <- as.numeric(Bird_Training_DataSet$Origin_State)
> Bird_Training_DataSet$When_Phase_of_flight <- as.numeric(Bird_Training_DataSet$When_Phase_of_flight)
> Bird_Training_DataSet$Conditions_Precipitation <- as.numeric(Bird_Training_DataSet$Conditions_Precipitation)
> Bird_Training_DataSet$Wildlife_Size <- as.numeric(Bird_Training_DataSet$Wildlife_Size)
> Bird_Training_DataSet$Conditions <- as.numeric(Bird_Training_DataSet$Conditions)
> Bird_Training_DataSet$Wildlife_Species <- as.numeric(Bird_Training_DataSet$Wildlife_Species)
> Bird_Training_DataSet$Pilot_warned <- as.numeric(Bird_Training_DataSet$Pilot_warned)
> Bird_Training_DataSet$Feet_above_ground <- as.numeric(Bird_Training_DataSet$Feet_above_ground)
> Bird_Training_DataSet$Speed <- as.numeric(Bird_Training_DataSet$Speed)
[Reviewer comment: Just a note that you converted the data to numeric is sufficient.]
> Bird_DTModel <- tree(Effect_Indicated_Damage~.-Aircraft_Airline_Operator-Record_ID-Origin_State-Conditions_Precipitation-Aircraft_Type-X.1-Conditions-Speed-X,Bird_Training_DataSet)
> summary(Bird_DTModel)
Variables actually used in tree construction:
[1] "Wildlife_Size"           "Effect_Impact_to_flight" "Wildlife_Species"
Number of terminal nodes:  6
Residual mean deviance:  0.08454 = 844.2 / 9985
Distribution of residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.96440  0.03564  0.03564  0.00000  0.14980  0.80840
After creating the model with the training data, we now apply the model to the test data in order to determine the efficiency of the model:

> Bird_Test_DataSet <- read.csv('Bird_Test_Data.csv')
> Bird_Test_DataSet$Aircraft_Type <- as.numeric(Bird_Test_DataSet$Aircraft_Type)
> Bird_Test_DataSet$Altitude_bin <- as.numeric(Bird_Test_DataSet$Altitude_bin)
> Bird_Test_DataSet$Airport_Name <- as.numeric(Bird_Test_DataSet$Airport_Name)
> Bird_Test_DataSet$Aircraft_Model <- as.numeric(Bird_Test_DataSet$Aircraft_Model)
> Bird_Test_DataSet$Wildlife_Number_struck <- as.numeric(Bird_Test_DataSet$Wildlife_Number_struck)
> Bird_Test_DataSet$Effect_Impact_to_flight <- as.numeric(Bird_Test_DataSet$Effect_Impact_to_flight)
> Bird_Test_DataSet$Aircraft_Number_of_engines. <- as.numeric(Bird_Test_DataSet$Aircraft_Number_of_engines.)
> Bird_Test_DataSet$Origin_State <- as.numeric(Bird_Test_DataSet$Origin_State)
> Bird_Test_DataSet$When_Phase_of_flight <- as.numeric(Bird_Test_DataSet$When_Phase_of_flight)
> Bird_Test_DataSet$Conditions_Precipitation <- as.numeric(Bird_Test_DataSet$Conditions_Precipitation)
> Bird_Test_DataSet$Wildlife_Size <- as.numeric(Bird_Test_DataSet$Wildlife_Size)
> Bird_Test_DataSet$Conditions <- as.numeric(Bird_Test_DataSet$Conditions)
> Bird_Test_DataSet$Wildlife_Species <- as.numeric(Bird_Test_DataSet$Wildlife_Species)
> Bird_Test_DataSet$Pilot_warned <- as.numeric(Bird_Test_DataSet$Pilot_warned)
> Bird_Test_DataSet$Feet_above_ground <- as.numeric(Bird_Test_DataSet$Feet_above_ground)
> Bird_Test_DataSet$Speed <- as.numeric(Bird_Test_DataSet$Speed)

> Bird_PredModel <- predict(Bird_DTModel,Bird_Test_DataSet)
> summary(Bird_PredModel)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.192   1.850   1.964   1.895   1.964   1.964
> Bird_PredModel <- cut(Bird_PredModel, br=c(1,1.895,2), labels=c("Caused Damage","No Damage"))

To calculate efficiency, we assigned the results of Bird_PredModel to a new variable, Damage:
> Damage <- Bird_PredModel
> Damage <- as.numeric(Damage)
[Reviewer comment: Shouldn't this command result in NAs?]
We also assigned the values of the column 'Effect_Indicated_Damage' to another variable, Results:
> Results <- Bird_Test_DataSet$Effect_Indicated_Damage
Now, in order to compute the efficiency of the predictor model, the following command is executed:
> mean(Result == Damage)
[1] 0.7425304
[Reviewer comment: Isn't the vector named Results, not Result, per above? And what are you comparing, NAs in Results to Effect_Indicated_Damage?]
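The accuracy calculation can be checked end to end on a handful of values. The sketch below uses six invented prediction scores and matching true classes (pred_score, actual and the other names are assumptions), bins the scores at the same cut point as in the report, and compares predicted and actual classes directly. It also shows that as.numeric() applied to a factor returns the integer level codes rather than NA.

```r
# Hypothetical numeric predictions on the 1-2 scale used by the tree, with matching true classes.
pred_score <- c(1.19, 1.85, 1.96, 1.96, 1.40, 1.96)
actual     <- factor(c("Caused Damage", "No Damage", "No Damage",
                       "Caused Damage", "Caused Damage", "No Damage"),
                     levels = c("Caused Damage", "No Damage"))

# Bin the scores into class labels at the same cut point as in the report.
pred_class <- cut(pred_score, breaks = c(1, 1.895, 2),
                  labels = c("Caused Damage", "No Damage"))

# as.numeric() on a factor returns the integer level codes (1, 2), not NA.
pred_code <- as.numeric(pred_class)

# Rows are predicted classes, columns are actual classes.
confusion <- table(Predicted = pred_class, Actual = actual)

# Share of test records where the predicted class equals the actual class.
accuracy <- mean(pred_class == actual)

print(confusion)
print(accuracy)
```

A confusion table is more informative than a single accuracy figure when one class is rare, because it shows separately how many damage cases were missed and how many no-damage cases were flagged.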

Explanation for converting columns into numeric types
On executing the tree command without converting the columns to numeric type, we got the following error:

> Bird_DTModel_Tree <- tree(Effect_Indicated_Damage~.-Aircraft_Airline_Operator-Record_ID-Origin_State-Conditions_Precipitation-Aircraft_Type-X.1-Conditions-Speed-X,Bird_Training_DataSet)
Error in tree(Effect_Indicated_Damage ~ . - Aircraft_Airline_Operator - :
  factor predictors must have at most 32 levels
Therefore, we converted all the columns having nominal data into numeric before executing the tree command or the predict command.
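The effect of this conversion can be seen on a small factor. The sketch below builds a made-up nominal column with 40 categories (more than the 32 levels named in the error message) and converts it with as.numeric(); all names and values are invented for illustration.

```r
# Toy nominal column with more distinct values than tree() accepts for a factor predictor.
species <- factor(paste("species", 1:40))

n_levels <- nlevels(species)   # number of distinct categories
too_many <- n_levels > 32      # TRUE means the factor exceeds the 32-level limit

# as.numeric() replaces each category with its position in levels(species).
# The codes follow the sort order of the labels, so their order carries no real meaning.
codes <- as.numeric(species)

print(n_levels)
print(head(data.frame(label = species, code = codes)))
```

The conversion removes the error, but it also makes the model treat an arbitrary ordering of categories as a numeric scale, which is why a split such as Wildlife_Species < 266.5 is hard to interpret.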
An efficiency of 74% suggests that the generated predictive model is suitable for predicting values of the dependent variable Effect_Indicated_Damage for an unknown data set.
7. Partition Clustering
By the method of partition clustering, we will group all the instances of Bird Strikes into different clusters. All the instances in one cluster will be similar to each other by some pre-defined parameters and different from instances in other clusters. [Reviewer comment: Like to see more in advance as to why these techniques may create insight, as opposed to just using them as a result of our class coverage.]
We will use the K-means algorithm on our dataset and create 5 clusters from the complete dataset. We require numeric attributes to compare the distance between two instances, which helps us cluster them under one group. Hence we will use the attributes "Feet above ground" and "Wildlife Species". We will carry out the clustering ONLY on instances where damage was reported due to a bird strike.
> Bird_Strike_Caused$Feet_above_ground <- gsub(",","",Bird_Strike_Caused$Feet_above_ground)
> Bird_Strike_Caused$Feet_above_ground <- as.numeric(Bird_Strike_Caused$Feet_above_ground)
> Bird_Strike_Caused$Wildlife_Species <- as.numeric(Bird_Strike_Caused$Wildlife_Species)
> Bird_cluster <- kmeans(Bird_Strike_Caused[,c("Wildlife_Species","Feet_above_ground")],centers=4,nstart=10)
> Bird_cluster
K-means clustering with 4 clusters of sizes 505, 182, 1462, 71

Cluster means:
  Wildlife_Species Feet_above_ground
1        100.13069         2128.2376
2        106.60440         5151.0989
3         94.89877          248.6628
4        107.92958        10664.7887

Within cluster sum of squares by cluster:
[1] 203377571 229103570 163966336 369537922
 (between_SS / total_SS =  91.8 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"
[7] "size"         "iter"         "ifault"
> plot(Bird_Strike_Caused$Wildlife_Species,Bird_Strike_Caused$Feet_above_ground,col=Bird_cluster$cluster,pch=3,xlim=c(0,180),ylim=c(0,15000))
> points(Bird_cluster$centers,col=1:4,pch=7,lwd=3)
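In the cluster means above, the four centers differ widely in Feet_above_ground but only slightly in Wildlife_Species, which is what one would expect when one attribute has a much larger numeric range than the other. The sketch below illustrates that effect on invented data and compares clustering on raw values with clustering on standardized values via scale(). The data, the object names, and the choice to standardize are all assumptions for illustration, not part of the original analysis.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Toy data: invented values on roughly the same scales as the two attributes.
toy <- data.frame(
  Wildlife_Species  = sample(1:180, 200, replace = TRUE),
  Feet_above_ground = c(runif(100, 0, 500), runif(60, 1500, 3000), runif(40, 8000, 12000))
)

# Clustering on raw values: the large Feet_above_ground range dominates the distance.
fit_raw <- kmeans(toy, centers = 4, nstart = 10)

# Clustering on standardized values: both attributes contribute comparably.
fit_scaled <- kmeans(scale(toy), centers = 4, nstart = 10)

# Share of total variance explained by the clustering (the "between_SS / total_SS" figure).
explained_raw    <- fit_raw$betweenss / fit_raw$totss
explained_scaled <- fit_scaled$betweenss / fit_scaled$totss

print(fit_raw$centers)
print(c(raw = explained_raw, scaled = explained_scaled))
```

Note also that the number of clusters returned always equals the centers argument, so a call with centers = 4 yields four clusters.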
From the plotted graph, we can clearly see that all the instances have been divided into 5 distinct clusters [Reviewer comment: how did you find five clusters when you set the kmeans parameter to 4?], each instance in one cluster being similar to the other instances in the same cluster and different from instances in other clusters on the basis of "Feet above ground" and "Wildlife Species".
This information will be useful while implementing measures to curb damage to airplanes: it will be convenient for the decision maker to understand the similarity between different clusters, and any action on one cluster will have almost the same results on other instances in the same cluster. [Reviewer comment: This is where you need to identify actions and make recommendations; to this point, you just have some numbers.]
[Reviewer comment: A final section here with a summary of recommended actions would be helpful.]
8. References
https://www.tableausoftware.com/public/community/sample-data-sets#airplane
http://www.statmethods.net/advstats/cluster.html
http://users.jyu.fi/~samiayr/pdf/introtoclustering_report.pdf
http://luna.cas.usf.edu/~mbrannic/files/regression/Logistic.html
http://en.wikipedia.org/wiki/Bird_strike
Nice kick off picture; careful not to skip through slides; excellent set up of problem; look at the audience more than the slides and professor; use real world names for variables on slide 6, not e.g., Airport_Name; less on technical approach and more on what value was revealed through apriori; why include rule findings if you only spend 5 seconds on it?; do not talk among each other while another is presenting; do not use terms IV and DV but speak in terms of context; for output, give me the summary - which few seem important from a statistical AND effect size that seem relevant for context; too many slides for a 15 minute presentation; as audience, not sure how to interpret Wildlife_Species < 266.5;
Organization Mannerisms Rapport Content Organization Audience Style Grammar
80.00% 80.00% 85.00% 90.00% 90.00% 90.00% 90.00% 85.00% 88%