
Intelligent Knowledge Discovery

Ján Paralič
Department of Cybernetics and Artificial Intelligence, Technical University of Košice, Letná 9, 042 00 Košice, Slovak Republic, paralic@tuke.sk

Eva Andrássyová
Department of Cybernetics and Artificial Intelligence, Technical University of Košice, Letná 9, 042 00 Košice, Slovak Republic, andrassy@tuke.sk

Abstract: The main aim of this paper is to describe the basic ideas behind the international Copernicus research project GOAL - Geographic Information On-Line Analysis (GIS - Data Warehouse Integration), focusing mainly on its knowledge discovery part. For the knowledge extraction phase, a KDD (Knowledge Discovery in Databases) package should be developed within this project. The basic idea is to provide, as a core, a number of different algorithms and their combinations. Algorithms able to discover association rules, decision trees and clusters are planned to be integrated. We plan to provide this package as a stand-alone application as well.

I. INTRODUCTION

The integration and combination of GIS (Geographic Information System) data with and into OLAP systems poses a number of problems that have not yet been solved satisfactorily: getting the data into the OLAP system, representing the data for analysis, and extracting knowledge while respecting security restrictions. Current approaches do not address these special problems, which result from the targeted application arena of GIS and OLAP systems.

The goal of the GOAL project is to develop a generic framework, both recognized by the research community and applicable in real world applications, which solves the general issues of GIS and DWH interoperability, including DWH feeding, knowledge extraction, interpretation, and security concepts. The feasibility of this framework will be tested on 2 very different real world applications from the GIS domain, using environmental sensor data and cultural data, allowing a real world evaluation of the framework.

For the knowledge extraction phase, a KDD (Knowledge Discovery in Databases) package should be developed within this project. The basic idea is to provide, as a core, a number of different algorithms and their combinations. Algorithms able to discover association rules, decision trees and clusters are planned to be integrated. We plan to provide this package as a stand-alone application as well.

II. KDD PROCESS

When studying the literature on data mining we encountered terms such as data mining, knowledge discovery in databases, and its abbreviation KDD. Various sources explain these terms in different ways. In our opinion, the most sophisticated definition is the one according to [4] (Fayyad et al.), whose authors state that knowledge discovery in databases is an interactive and iterative process with several steps. This means that at any stage the user should be able to make changes (for instance, to choose a different task or technique) and repeat the following steps to achieve better results. Data mining is a part of this process.

In most sources, the term Data Mining (DM) is often used to name the whole field of knowledge discovery. This confusing use of the terms KDD and DM has historical reasons and is also due to the fact that most of the work is focused on refinement and applicability experiments of ML and AI algorithms for the data mining step. Pre-processing is often included in this step as a part of the mining algorithm.

Within the KDD process the following steps can be recognised (according to [1]). The first two steps of the KDD process are related to goal identification, namely task discovery and data discovery. The following step includes all data pre-processing; let us call it data cleaning. The core of the KDD process is the data mining phase, which includes model development and data analysis. Finally, suitable output generation is necessary for the user. In the following, the steps of the KDD process are described in more detail.

Within task discovery one has to state the problem or goal, which often seems to be clear. Further investigation is nevertheless recommended, such as getting acquainted with the customer's organisation by spending some time on site and sifting through the raw data (to understand its form, content, organisational role and sources). Then the real goal of the discovery can be found.

Data discovery is complementary to the step of task discovery. In this step, one has to decide whether the quality of the data is satisfactory for the goal (what the data does or does not cover).

Data cleaning is often necessary, though it may happen that something removed by cleaning is an indicator of some interesting domain phenomenon (outlier or key data point?). The analyst's background knowledge is crucial when data cleaning is done by comparing multiple sources. Another way is to clean the data by editing procedures before they are loaded into a database. Recently, the data for KDD have been coming from warehouses that contain data already cleaned.
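The caution about cleaning can be illustrated with a small sketch: rather than deleting suspicious values, it only flags them, so that an analyst can decide whether each one is an error or an interesting domain phenomenon. The modified z-score rule (median and median absolute deviation, with the commonly used constant 0.6745 and threshold 3.5), the function name `flag_outliers`, and the sample readings are assumptions made for illustration; the paper does not prescribe any particular outlier test.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return (value, is_outlier) pairs without removing anything.

    Assumption: outliers are detected with the modified z-score,
    0.6745 * |v - median| / MAD, compared against `threshold`.
    """
    med = median(values)
    # Median absolute deviation: a robust estimate of spread.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread at all: nothing can be called an outlier by this rule.
        return [(v, False) for v in values]
    return [(v, abs(0.6745 * (v - med) / mad) > threshold) for v in values]


# Hypothetical sensor readings; one value is far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 55.0, 9.9]
flagged = flag_outliers(readings)

# The suspicious values are kept for inspection instead of being dropped.
suspects = [v for v, is_out in flagged if is_out]
print(suspects)
```

Keeping the flag next to the value leaves the decision with the analyst, which matches the interactive character of the KDD process described above.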
Model development is an important phase of KDD that must precede the actual analysis of the data. Interaction with the data leads analysts to the formation of a hypothesis (often based on experience and background knowledge). The sub-processes of model development are: data segmentation (unsupervised learning techniques, for example clustering); model selection (choosing the best type of model after exploring several different types); and parameter selection (choosing the parameters of the selected model).

Data analysis is, in general, the ambition to understand why certain groups of entities behave the way they do; it is a search for the laws or rules of such behaviour. Those parts where such groups are already identified should be analysed first. The sub-processes of data analysis are: model specification - some formalism is used to denote a specific model; model fitting - the specific parameters are determined when necessary; evaluation - the model is evaluated against the data; model refinement - the model is refined in iterations according to the evaluation results. Model development and data analysis are complementary, which often leads to oscillation between those two steps.

Output generation - output can take various forms. The simplest form is a report with the analysis results. Other, more complicated forms are graphs, or in some cases it is desirable to obtain action descriptions which might be taken directly as outputs. The output may also be a monitor, which should trigger an alarm or an action under some given condition. Output requirements might determine the task of the designed KDD application.

III. KDD TASKS

The first step in the KDD process is task discovery. There are many possible tasks. In this section some of the most important discovery tasks are listed and briefly described. (A more detailed description can be found in [5].)

Fig. 1. Knowledge discovery tasks partitioning. [Diagram; region labels: prediction, highlighting, description. Tasks shown: classification, regression, association rules, clustering, dependency modeling, causation modeling, database dependencies, SQO rules, deviation detection, summarisation.]

Discovery of SQO rules. Semantic Query Optimisation rules perform a syntactical transformation of the incoming query to produce a more efficient query by adding or removing conjuncts. A characteristic of SQO rules is that the query processing time (derived from the access method and indexing scheme of the database management system) is taken into account as the cost of an attribute.

Discovery of database dependencies. In this case the term refers to relationships among attributes or relations. Database dependencies are used in the design and maintenance of a DBMS.

Discovery of association rules. An association rule is a relationship of the form X => Y, where X and Y are sets of items (conjuncts of attribute values) and X ∩ Y = ∅. Each rule is assigned a support and a confidence factor.

Dependence modelling. Discovery of dependencies among attributes in the form of if-then rules: "if (Antecedent is true) - then (Consequent is true)". The antecedent is usually a conjunction of attribute values and the consequent is a single value. The main difference between dependence modelling and database dependencies is that rules for dependence modelling do not have to be exact.

Deviation detection. This task focuses on the discovery of significant deviations between the actual contents of a data subset and its expected contents. In general we can distinguish two types of deviations: temporal - significant changes along the time dimension; group - unexpected differences between two subsets of data. The significance of a deviation is a subjective measure, strongly dependent on the user.

Clustering. This is a classification scheme where the classes are unknown. Tuples with similar attribute values are clustered into the same class. The problem in this task is to determine how to measure the quality of the produced clusters. After clustering is done, it is possible to apply a classification or summarisation algorithm to the "invented" classes.

Causation modelling. Discovery of relationships of cause and effect among attributes. The rules are similar to those of dependence modelling, but causal rules indicate that the antecedent causes the consequent and that this relationship is not due to another observed variable.

Classification. A task where each tuple belongs to a class, which is one of a pre-defined set of classes. The class of a tuple is indicated by the value of a user-defined class attribute. A classification algorithm aims to find some relationship between the predicting attributes and each class (each value of the class attribute).

Regression. A task similar to classification, except that the predicted value is continuous. Traditional methods are statistical (such as linear regression); however, there are a number of symbolic methods that involve modified classification methods (for instance a decision tree with a linear model as a leaf node).

Summarisation. This is a kind of summary describing some properties shared by most of the tuples belonging to the same class. Discovered summaries can be expressed as a characteristic rule, which may be interpreted as: "if (tuple belongs to the class indicated in antecedent) then (the tuple has all properties mentioned in consequent)". Unlike classification rules, such rules do not discriminate between classes.

IV. KDD PACKAGE

As a result of the analysis of the KDD process and KDD tasks described in the previous two sections, we decided to propose the following structure of the KDD package that we have been developing (see Fig. 2). The KDD package will have a modular structure, where the common parts of the system can be used by each of the specialized DM modules covering one or more possible KDD tasks. There are just two common parts.

1) DB access. This is a module for accessing database sources (which can be a DBF file, an SQL database or possibly a data warehouse).

2) Visual component for data manipulation. This module enables a user to visualize, browse, modify and transform data from a database (a data warehouse etc.). All possible operations on data are defined by plug-ins as well. This makes it possible to add a new transformation operation on data (sampling, checking, etc.) very easily at any time.

Fig. 2. Proposed structure of the KDD package. [Diagram; blocks: DB, DB access, visual component for data manipulation, data preprocessing modules, data mining modules (association rules, classification, clustering, others), output generation modules.]

For each KDD task a different data mining algorithm, as well as a different type of output generation, is suitable. Therefore each new data mining algorithm, as well as each output generation module, can be implemented separately and added to our KDD package in the form of a plug-in. This does not necessarily mean that each data mining algorithm must have its own output generation module.

Adding the processing functionality for a new KDD task to the KDD package will usually mean the following:
- To implement (or just re-use an existing implementation of) a data mining algorithm in the form of a plug-in.
- If necessary, to implement new transformation or other pre-processing functions in the form of plug-ins.
- If necessary, to implement a new output generation module in the form of a plug-in.

Based on some very first experiments with real data from the GOAL project pilot applications, we plan to implement the functionality of at least the following four KDD tasks. The KDD tasks are referred to with respect to our partitioning depicted in Fig. 1.

A. Prediction classification

Here we are going to implement two different data mining algorithms with two possible output generation forms.

The first one, the CN2 system by Clark and Niblett (see [3]), is a symbolic data mining tool designed to efficiently induce simple and comprehensible rules in domains where noisy data may be present. The input of CN2 usually consists of a file describing the attributes and their types and a file containing the examples. The attributes can be of two types: discrete (a finite set of values) and ordered (integers or floats). CN2 outputs ordered or unordered decision lists, i.e. rules of the form 'IF <complex> THEN PREDICT <class>', where <complex> is a conjunction of attribute tests. These decision lists are probabilistic rules, i.e. the condition covers examples of a single class, but possibly a few examples of other classes as well.

The other one produces the well known decision trees. It is C4.5, chosen because it is able to handle numerical attributes (which its predecessor ID3 cannot).

B. Highlighting/Prediction association rules
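As a minimal sketch of the support and confidence factors attached to an association rule X => Y (Section III), the following Python fragment computes both over a toy set of transactions. The function names and the sample data are hypothetical and are not part of the GOAL package. Support is taken here as the fraction of transactions containing all items of an itemset, and confidence as support(X ∪ Y) / support(X); these are the usual definitions, assumed here because the paper does not spell them out.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate how often the consequent holds when the antecedent holds."""
    sup_x = support(transactions, antecedent)
    if sup_x == 0:
        # The antecedent never occurs, so the rule has no evidence.
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / sup_x


# Hypothetical market-basket style data.
transactions = [
    frozenset(t)
    for t in (
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    )
]

sup = support(transactions, {"bread", "milk"})
conf = confidence(transactions, {"bread"}, {"milk"})
print(f"support={sup:.2f} confidence={conf:.2f}")
```

A rule would then be reported only when both values exceed user-chosen thresholds, which is one simple way to decide that a rule is frequent enough and interesting.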

Here we are going to implement a more general approach, which is in fact usable in other kinds of KDD tasks as well. It was described by Mannila in [6]. A fairly large class of data mining tasks can be described as the search for interesting and frequently occurring patterns in the data. That is, we are given a class P of patterns or sentences that describe properties of the data, and we can specify whether a pattern p ∈ P occurs frequently enough and is otherwise interesting. The generic data mining task is then to find the set

PI(d, P) = {p ∈ P | p occurs sufficiently often in d and p is interesting}.

For association rules, the pattern class is the set of all rules of the form X => B, and a rule is interesting if its confidence is sufficiently high. For finding episodes, the patterns are the episodes and there need not be any interestingness criterion.

C. Description/Prediction clustering

For clustering we would like to offer the AutoClass system. Other interesting approaches (e.g. based on neural networks) may be added in the future.

AutoClass is an automatic classification program for extracting useful information from databases [2]. It is an approach to unsupervised classification based upon the classical mixture model, supplemented by a Bayesian method for determining the optimal classes. We emphasize that no current unsupervised classification system can produce optimal results on its own. It is the interaction between domain experts and the machine searching over the model space that generates new knowledge. Both bring unique information and abilities to the database analysis task, and each enhances the other's effectiveness.

V. SUMMARY

This paper presents a strategy for the implementation of a KDD package which will serve as a special module within the GOAL project but can also be used as an open stand-alone application for knowledge discovery. The description of the KDD process and its particular tasks has shown that data mining is versatile and, though very important, is just one part of the whole KDD process. Therefore, in order to adapt existing data mining algorithms for use with real data sources (no matter whether they are in a database or a data warehouse), three objectives must be met.

1. Fast connection to existing data sources (in the case of the GOAL project preferably data warehouses).
2. Flexible and rich data selection and transformation methods must be provided in a form which is easy for the user to use and to understand.
3. The system must be open for easy integration of new data mining algorithms and, if necessary, even new output generation forms.

Based on these objectives the design of the KDD package has been sketched. Its modular structure makes it possible to achieve all three objectives. Based on some very first experiments with real data from the GOAL project pilot applications, we plan to implement the functionality of at least the following three KDD tasks: 1) prediction classification, 2) highlighting/prediction association rules and 3) description/prediction clustering.

VI. ACKNOWLEDGMENT

This work has been supported by the European Commission within the INCO Copernicus Programme under contract No. 977091 and by the Ministry of Education, Slovak Republic, VEGA grant No. 1/5032/98 - Integration of Tools for Intelligent Technologies.

VII. REFERENCES

[1] R. J. Brachman and T. Anand, "The Process of Knowledge Discovery in Databases," Advances in Knowledge Discovery & Data Mining, AAAI/MIT Press, Cambridge, Massachusetts, 1996, pp. 37-57.
[2] P. Cheeseman and J. Stutz, "Bayesian Classification (AutoClass): Theory and Results," in Advances in Knowledge Discovery and Data Mining, Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, & Ramasamy Uthurusamy, Eds., AAAI Press, 1996.
[3] P. Clark and T. Niblett, "The CN2 Induction Algorithm," Machine Learning Journal, vol. 3, no. 4, pp. 261-283, Kluwer, Netherlands, 1989.
[4] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "The KDD Process for Extracting Useful Knowledge from Volumes of Data," Communications of the ACM, vol. 39, no. 11, Nov. 1996, pp. 27-34.
[5] A. A. Freitas, Generic, Set-Oriented Primitives to Support Data-Parallel Knowledge Discovery in Relational Database Systems. Ph.D. Thesis, University of Essex, UK, July 1997.
[6] H. Mannila, "Methods and Problems in Data Mining," in the Proceedings of International Conference on Database Theory, Delphi, Jan. 1997, Springer-Verlag.
