
Search Engines

Information Retrieval in Practice

All slides Addison Wesley, 2008

Search Engine Architecture
A software architecture consists of software components, the interfaces provided by those components, and the relationships between them
describes a system at a particular level of abstraction

Architecture of a search engine determined by 2 requirements
effectiveness (quality of results) and efficiency (response time and throughput)

Indexing Process
Text acquisition
identifies and stores documents for indexing

Text transformation
transforms documents into index terms or features

Index creation
takes index terms and creates data structures (indexes) to support fast searching

Query Process
User interaction
supports creation and refinement of query, display of results

Ranking
uses query and indexes to generate ranked list of documents

Evaluation
monitors and measures effectiveness and efficiency (primarily offline)

Details: Text Acquisition
Crawler
Identifies and acquires documents for search engine
Many types: web, enterprise, desktop
Web crawlers follow links to find documents
Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
Single site crawlers for site search
Topical or focused crawlers for vertical search

Document crawlers for enterprise and desktop search
Follow links and scan directories
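
A minimal sketch of how a web crawler follows links to find documents. It assumes a tiny in-memory "web" (the `PAGES` dictionary and its URLs are hypothetical) in place of real HTTP requests, and it ignores politeness, freshness, and URL normalization:

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real crawler would fetch over HTTP.
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
}

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed):
    """Breadth-first crawl from seed; returns URLs in the order they were fetched."""
    frontier = deque([seed])
    seen = {seed}
    fetched = []
    while frontier:
        url = frontier.popleft()
        html = PAGES.get(url)
        if html is None:
            continue  # page not available: skip it
        fetched.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return fetched
```

The `seen` set is what keeps the crawl from looping when pages link back to each other.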

Text Acquisition
Feeds
Real-time streams of documents
e.g., web feeds for news, blogs, video, radio, tv

RSS is common standard
RSS reader can provide new XML documents to search engine
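
As an illustration of what an RSS reader hands to the search engine, the sketch below parses a small RSS 2.0 document with Python's standard `xml.etree.ElementTree`. The feed text and the `read_items` helper are hypothetical; a real reader would download the feed from a URL and poll it for new items:

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; a real reader would download it from a feed URL.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

def read_items(rss_text):
    """Return a (title, link) pair for each <item> element in the feed."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]
```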

Conversion
Convert variety of documents into a consistent text plus metadata format
e.g., HTML, XML, Word, PDF, etc. → XML

Convert text encoding for different languages
Using a Unicode standard like UTF-8
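
Encoding conversion can be sketched in a few lines, assuming the source encoding of each document is already known (detecting it is a separate problem). The `to_utf8` helper is a hypothetical name:

```python
def to_utf8(raw, source_encoding):
    """Decode bytes from a known source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

latin1_bytes = "café".encode("latin-1")        # one byte for the accented letter
utf8_bytes = to_utf8(latin1_bytes, "latin-1")  # two bytes for the accented letter
```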

Text Acquisition
Document data store
Stores text, metadata, and other related content for documents
Metadata is information about document such as type and creation date
Other content includes links, anchor text

Provides fast access to document contents for search engine components
e.g., result list generation

Could use relational database system
More typically, a simpler, more efficient storage system is used due to huge numbers of documents

Text Transformation
Parser
Processing the sequence of text tokens in the document to recognize structural elements
e.g., titles, links, headings, etc.

Tokenizer recognizes words in the text
must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators

Markup languages such as HTML, XML often used to specify structure
Tags used to specify document elements
E.g., <h2>Overview</h2>

Document parser uses syntax of markup language (or other formatting) to identify structure
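
The tokenizer and parser roles can be sketched together. The tokenizing rule below (lowercase, split on non-alphanumerics, keep internal apostrophes) is just one assumed policy among many, and `HeadingParser` is a hypothetical class that only distinguishes `<h2>` text from everything else:

```python
import re
from html.parser import HTMLParser

def tokenize(text):
    """Lowercase, split on non-alphanumerics, keep internal apostrophes."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

class HeadingParser(HTMLParser):
    """Separates tokens inside <h2> elements from the rest of the text."""
    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []
        self.body = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        target = self.headings if self.in_heading else self.body
        target.extend(tokenize(data))

parser = HeadingParser()
parser.feed("<h2>Overview</h2><p>Tokenizing isn't trivial: e-mail, U.S.A.</p>")
```

Under this policy "e-mail" becomes two tokens and "U.S.A." becomes three single-letter tokens, which shows why hyphens and separators need deliberate handling.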

Text Transformation
Stopping
Remove common words
e.g., and, or, the, in

Some impact on efficiency and effectiveness
Can be a problem for some queries

Stemming
Group words derived from a common stem
e.g., computer, computers, computing, compute

Usually effective, but not for all queries
Benefits vary for different languages
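
A toy sketch of stopping and stemming. The stopword list contains only the four examples above, and `crude_stem` is a hypothetical suffix stripper written for this illustration; it is not the Porter stemmer or any other published algorithm:

```python
STOPWORDS = {"and", "or", "the", "in"}  # only the examples above; real lists are longer

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in ("ing", "ers", "er", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {crude_stem(w) for w in ["computer", "computers", "computing", "compute"]}
```

All four example words reduce to the same stem here, but rules this simple over-stem or under-stem many other words.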

Text Transformation
Link Analysis
Makes use of links and anchor text in web pages
Link analysis identifies popularity and community information
e.g., PageRank

Anchor text can significantly enhance the representation of pages pointed to by links
Significant impact on web search
Less importance in other applications
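
A minimal power-iteration sketch of PageRank on a three-page graph. The damping factor of 0.85, the fixed iteration count, and the way dangling pages are handled are assumptions of this sketch, not details given in the slides:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small share from the random jump.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical link graph
ranks = pagerank(graph)
```

Page C is linked from both A and B, so it ends up with a higher rank than B, which is linked only from A.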

Text Transformation
Information Extraction
Identify classes of index terms that are important for some applications
e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.

Classifier
Identifies class-related metadata for documents
i.e., assigns labels to documents
e.g., topics, reading levels, sentiment, genre

Use depends on application

Index Creation
Document Statistics
Gathers counts and positions of words and other features
Used in ranking algorithm

Weighting
Computes weights for index terms
Used in ranking algorithm
e.g., tf.idf weight
Combination of term frequency in document and inverse document frequency in the collection
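
One common tf.idf variant is tf × log(N / df), where N is the number of documents and df is the number of documents containing the term. The sketch below assumes that variant; many others exist, and the slides do not specify which one is meant:

```python
import math

def tf_idf(term, doc_tokens, collection):
    """tf * log(N / df): one common tf.idf variant among many."""
    tf = doc_tokens.count(term)                       # term frequency in this document
    df = sum(1 for d in collection if term in d)      # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(collection) / df)

# Hypothetical tokenized collection of three documents.
docs = [["search", "engine", "search"], ["engine", "index"], ["query", "engine"]]
```

With this variant, "engine" gets weight zero in every document because it occurs in all three, while "search" scores highly in the first document.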

Index Creation
Inversion
Core of indexing process
Converts document-term information to term-document for indexing
Difficult for very large numbers of documents

Format of inverted file is designed for fast query processing
Must also handle updates
Compression used for efficiency
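
Inversion itself is simple when everything fits in memory. The sketch below is an in-memory illustration only: it assumes documents arrive as token lists and skips the merging, update handling, and compression that large collections require:

```python
from collections import defaultdict

def invert(docs):
    """docs: doc_id -> list of tokens. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id]):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

docs = {1: ["fast", "search"], 2: ["search", "engine", "search"]}  # hypothetical
index = invert(docs)
```

Looking up `index["search"]` now gives every document containing the term along with its positions, without scanning any document text.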

Index Creation
Index Distribution
Distributes indexes across multiple computers and/or multiple sites
Essential for fast query processing with large numbers of documents
Many variations
Document distribution, term distribution, replication

P2P and distributed IR involve search across multiple sites
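
A sketch of document distribution, the first variation listed above. Assigning documents by `doc_id % num_shards` is an assumption made for brevity; real systems may partition by hash, by source, or by some other policy:

```python
def distribute(docs, num_shards):
    """Document distribution: each document is stored on exactly one shard."""
    shards = [{} for _ in range(num_shards)]
    for doc_id, tokens in docs.items():
        shards[doc_id % num_shards][doc_id] = tokens
    return shards

docs = {1: ["search"], 2: ["engine"], 3: ["index"], 4: ["query"]}  # hypothetical
shards = distribute(docs, 2)
```

Each shard can then build its own inverted index over its subset. Under term distribution, by contrast, each machine would hold the complete postings lists for a subset of the terms.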

User Interaction
Query input
Provides interface and parser for query language
Most web queries are very simple, other applications may use forms
Query language used to describe more complex queries and results of query transformation
e.g., Boolean queries, Indri and Galago query languages
similar to SQL language used in database applications
IR query languages also allow content and structure specifications, but focus on content
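
A Boolean query can be evaluated directly against postings as set operations. The nested-tuple query form and the `POSTINGS` data below are hypothetical, chosen to avoid writing a parser; they are not the Indri or Galago syntax:

```python
# Hypothetical postings: term -> set of IDs of the documents containing it.
POSTINGS = {
    "search": {1, 2, 4},
    "engine": {2, 3, 4},
    "index": {3, 4},
}

def evaluate(query):
    """Evaluate a nested tuple such as ("AND", "search", ("OR", "engine", "index"))."""
    if isinstance(query, str):
        return POSTINGS.get(query, set())
    op, *operands = query
    results = [evaluate(q) for q in operands]
    if op == "AND":
        return set.intersection(*results)
    if op == "OR":
        return set.union(*results)
    raise ValueError(f"unknown operator: {op}")
```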

User Interaction
Query transformation
Improves initial query, both before and after initial search
Includes text transformation techniques used for documents
Spell checking and query suggestion provide alternatives to original query
Query expansion and relevance feedback modify the original query with additional terms
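
As a rough illustration of spell checking, the sketch below suggests the closest vocabulary term using `difflib.get_close_matches` from the Python standard library. The vocabulary is hypothetical, and production spell checkers typically also use query logs and word frequencies rather than string similarity alone:

```python
import difflib

VOCABULARY = ["search", "engine", "retrieval", "ranking", "index"]  # hypothetical

def suggest(term, n=1):
    """Return up to n vocabulary terms that are most similar to the given term."""
    return difflib.get_close_matches(term, VOCABULARY, n=n)
```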

User Interaction
Results output
Constructs the display of ranked documents for a query
Generates snippets to show how queries match documents
Highlights important words and passages
Retrieves appropriate advertising in many applications
May provide clustering and other visualization tools
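
Snippet generation and highlighting can be sketched as picking a window of words around the first query-term match. The bracket highlighting, the window size, and the punctuation handling are all assumptions of this sketch:

```python
def snippet(text, query_terms, window=5):
    """Return the first window of words containing a query term, matches in [brackets]."""
    words = text.split()
    terms = {t.lower() for t in query_terms}

    def is_match(word):
        return word.lower().strip(".,;:!?") in terms

    for i, word in enumerate(words):
        if is_match(word):
            start = max(0, i - window // 2)
            chunk = words[start:start + window]
            return " ".join(f"[{w}]" if is_match(w) else w for w in chunk)
    return " ".join(words[:window])  # no match: fall back to the opening words
```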

Ranking
Scoring
Calculates scores for documents using a ranking algorithm
Core component of search engine
Basic form of score is $\sum_i q_i d_i$
$q_i$ and $d_i$ are query and document term weights for term $i$

Many variations of ranking algorithms and retrieval models
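
The basic score is a dot product between query and document term weights. The sketch below assumes the weights are already computed and stored as dictionaries; the weights and document IDs are hypothetical:

```python
def score(query_weights, doc_weights):
    """Sum of q_i * d_i over the terms present in both the query and the document."""
    return sum(q * doc_weights[t] for t, q in query_weights.items() if t in doc_weights)

def rank(query_weights, docs):
    """docs: doc_id -> {term: weight}. Returns (doc_id, score) pairs, best first."""
    scored = [(doc_id, score(query_weights, w)) for doc_id, w in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

query = {"search": 1.0, "engine": 1.0}
docs = {
    "d1": {"search": 2.0, "index": 1.0},
    "d2": {"search": 1.0, "engine": 1.5},
}
```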

Ranking
Performance optimization
Designing ranking algorithms for efficient processing
Term-at-a-time vs. document-at-a-time processing
Safe vs. unsafe optimizations
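
The two processing orders can be contrasted on the same postings. This is a simplified sketch: the postings are hypothetical, and the document-at-a-time version looks weights up by document ID instead of advancing cursors through sorted lists as a real implementation would:

```python
from collections import defaultdict

# Hypothetical postings: term -> {doc_id: weight}.
POSTINGS = {
    "search": {1: 2.0, 2: 1.0},
    "engine": {2: 1.5, 3: 1.0},
}

def term_at_a_time(query_terms):
    """Process one whole postings list at a time, updating per-document accumulators."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in POSTINGS.get(term, {}).items():
            accumulators[doc_id] += weight
    return dict(accumulators)

def document_at_a_time(query_terms):
    """Finish scoring one document across all lists before moving to the next."""
    lists = [POSTINGS.get(term, {}) for term in query_terms]
    doc_ids = sorted(set().union(*lists))
    return {d: sum(plist.get(d, 0.0) for plist in lists) for d in doc_ids}
```

Both orders produce the same scores; they differ in memory use (accumulators for many documents versus one document at a time) and in which optimizations apply.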

Distribution
Processing queries in a distributed environment
Query broker distributes queries and assembles results
Caching is a form of distributed searching
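
A query broker with a result cache might look roughly like this. The shards are simulated as dictionaries of precomputed results, and `SHARDS`, `CACHE`, and `broker` are hypothetical names; a real broker would send the query over the network and handle timeouts and cache eviction:

```python
import heapq

# Hypothetical shards: each maps a query to its local (score, doc_id) results.
SHARDS = [
    {"search engine": [(2.5, "d2"), (2.0, "d1")]},
    {"search engine": [(3.0, "d7"), (0.5, "d9")]},
]

CACHE = {}

def broker(query, k=3):
    """Send the query to every shard, merge the partial results, keep the top k."""
    if query in CACHE:
        return CACHE[query]  # cached result: no shard is contacted
    partial = [hit for shard in SHARDS for hit in shard.get(query, [])]
    top = heapq.nlargest(k, partial)
    CACHE[query] = top
    return top
```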

Evaluation
Logging
Logging user queries and interaction is crucial for improving search effectiveness and efficiency
Query logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other components

Ranking analysis
Measuring and tuning ranking effectiveness

Performance analysis
Measuring and tuning system efficiency
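
As one example of how a query log feeds another component, the sketch below suggests completions for a prefix by counting past queries. The log contents and the `suggest_from_log` helper are hypothetical, and real systems also weigh clickthrough data and recency:

```python
from collections import Counter

# Hypothetical query log, one entry per submitted query.
QUERY_LOG = [
    "search engine",
    "search engines",
    "search engine",
    "seattle weather",
    "search engine architecture",
]

def suggest_from_log(prefix, log, n=2):
    """Return the n most frequent logged queries that start with the prefix."""
    counts = Counter(q for q in log if q.startswith(prefix))
    return [q for q, _ in counts.most_common(n)]
```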

How Does It Really Work?
This course explains these components of a search engine in more detail
Often many possible approaches and techniques for a given component
Focus is on the most important alternatives
i.e., explain a small number of approaches in detail rather than many approaches
Importance based on research results and use in actual search engines
Alternatives described in references
