
OCRopus Addons

Internship Report

Submitted to:

Image Understanding and Pattern Recognition Lab
German Research Center for Artificial Intelligence
Kaiserslautern, Germany

Submitted by:

Ambrish Dantrey, B.Tech. III year, E&CE
Indian Institute of Technology, Roorkee
Roorkee, India

Supervisors: Faisal Shafait, Ilya Mezhirov

Reviewer: Prof. Dr. Thomas Breuel
Start Date for Internship: 15th May, 2007
End Date for Internship: 27th July, 2007

Report Date: 27th July, 2007

Preface
This report documents the work done during the summer internship at the Image Understanding and Pattern Recognition (IUPR) Lab, Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Germany, under the supervision of Prof. Dr. Thomas Breuel. The report first gives an overview of the tasks completed during the internship, with technical details. The results obtained are then discussed and analyzed. The report also elaborates on future work that can be pursued as an advancement of the current work.

I have tried my best to keep the report simple yet technically correct. I hope I have succeeded in my attempt.

Ambrish Dantrey

Acknowledgments
Simply put, I could not have done this work without the great deal of help I cheerfully received from the whole of IUPR. The work culture at IUPR is really motivating. Everybody here is such a friendly and cheerful companion that work stress never gets in the way.

I would especially like to thank Dr. Thomas Breuel and Dr. Daniel Keysers for providing the nice ideas to work upon. Not only did they advise me on my project, but listening to their discussions in the IPeT meetings also evoked a strong interest in image analysis. I am also highly indebted to my supervisors, Faisal Shafait and Ilya Mezhirov, who seemed to have solutions to all my problems.

Author

Abstract

This report presents the three tasks completed during the summer internship at IUPR, which are listed below:
1. Detection of headlines in document images using black run-lengths, and evaluation of OCRopus performance in detecting headlines
2. Re-engineering the zone classification module
3. Evaluation of the performance of different segmentation algorithms

All these tasks were completed successfully, and the results were in line with expectations. The headline detection achieved a low error rate of 2.85%, as against 6.52% for the previously used method. During the evaluation of segmentation algorithms, X-Y cut was found to gain a lot from noise cleanup, which is an interesting result, as it strengthens the case for the X-Y cut segmentation algorithm as a suitable method for OCRopus. The re-engineering and porting of the zone classification module to OCRopus makes it possible for OCRopus to have text/image segmentation if it is required in the future.

Author

OCRopus: Introduction

Though the field of optical character recognition (OCR) is considered to be widely explored, the development of an efficient system for use in real-world situations still remains a challenge for developers. OCRopus is a state-of-the-art document analysis and OCR system being developed at IUPR, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multilingual capabilities. As this is a very big project, I was assigned the tasks of developing tools for layout analysis and evaluation.

The Goals:

The following goals were set as I proceeded in my work:
1. Conversion of ground truth data in the MARG database from XML format to hOCR microformat [1].
2. Development of a rule-based headline detection method using the median black run-length of the lines.
3. Development of a segmentation-classification module and evaluation of the performance of different segmentation algorithms with respect to noise.

1. XML to hOCR:

hOCR is a format for representing OCR output, including layout information, character confidences, bounding boxes, and style information. It embeds this information invisibly in standard HTML. By building on standard HTML, it automatically inherits well-defined support for most scripts, languages, and common layout options. Furthermore, unlike previous OCR formats, the recognized text and OCR-related information co-exist in the same file and survive editing and manipulation. hOCR markup is independent of the presentation.

Due to all the above qualities of the hOCR format, it is highly desirable to have ground truth in this format. I was assigned the task of converting the MARG database ground truth into hOCR format. For this purpose I wrote the following script.

Script Name: xmltohocr
Language Used: Python
Command line argument form: xmltohocr FILE.XML
FILE.XML: The file in XML format to be converted into hOCR microformat.

Note: The script does not handle LaTeX characters yet. Incorporating this feature would be an improvement.
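To illustrate the kind of conversion described above, here is a minimal sketch, not the actual xmltohocr script. The input element and attribute names (Zone, Text, x0, y0, x1, y1) are hypothetical, since the MARG XML schema is not reproduced in this report; the output uses the hOCR convention of storing a bounding box in the title attribute of an HTML element.

```python
import xml.etree.ElementTree as ET
from html import escape


def zones_to_hocr(xml_text):
    """Convert a hypothetical zone-level XML ground truth into hOCR markup.

    Assumed input (NOT the real MARG schema): <Zone x0 y0 x1 y1> elements,
    each with a <Text> child holding the zone's text.
    """
    root = ET.fromstring(xml_text)
    lines = ["<div class='ocr_page'>"]
    for zone in root.iter("Zone"):
        # hOCR keeps layout data in the title attribute: "bbox x0 y0 x1 y1"
        bbox = " ".join(zone.get(key) for key in ("x0", "y0", "x1", "y1"))
        # Escape the text so that it stays valid HTML
        text = escape(zone.findtext("Text", default=""))
        lines.append(f"  <div class='ocr_carea' title='bbox {bbox}'>{text}</div>")
    lines.append("</div>")
    return "\n".join(lines)


sample = (
    '<Page><Zone x0="10" y0="20" x1="300" y1="60">'
    "<Text>A &amp; B</Text></Zone></Page>"
)
hocr = zones_to_hocr(sample)
print(hocr)
```

Because the layout data sits in ordinary HTML attributes, the result can be opened in a browser or edited like any other HTML fragment.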

2. Headline detection based on black run-length and its integration into OCRopus:
Detection of headlines in document images is an issue that is mostly overlooked, yet it is highly desirable for properly formatting the output of OCR. Until now, OCRopus had used a rule-based method that used the space between lines as the criterion for detecting headlines. Though this method worked for many images, it also failed many times. It was an obvious observation that the black run-lengths of headlines are longer than those of normal lines, and we tried to build upon this concept. We used the median black run-length of a line as the deciding criterion. The median was used instead of the mean because the mean run-length could easily have been affected by noise merging with the text, which would have produced errors.

The whole approach is simple, as discussed below:
1. Calculate the median black run-length for each line on the page.
2. Compare this run-length for each line with that of the lines below and above it.
3. If the median black run-length of a line is found to be at least K1 (a parameter) times the median run-length of the line below it, and at least K2 (another parameter) times the median run-length of the line above it, mark it as a headline.

The values of the parameters K1 and K2 had to be found experimentally. After evaluating the performance of the program many times, the values of K1 and K2 were set to 1.5 and 1.1 respectively.

We used a histogram-based method to find the median run-length. A histogram of the number of occurrences versus run-length was calculated; once we have such a histogram, we normalize it with the largest value of occurrence. Then we calculated the cumulative distribution function for this normalized histogram. The point at which the cumulative distribution function reaches a value of 0.5 corresponds to the median run-length.

The program for detection of headlines was written in C++ and used standard OCRopus classes. The program has been successfully integrated into OCRopus.
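The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the C++ code integrated into OCRopus: a text line is assumed to be a list of pixel rows with 1 for black, the histogram is normalized by the total count so that the cumulative distribution ends at 1, and a missing neighbour (first or last line of the page) is simply not compared.

```python
from collections import Counter


def black_run_lengths(row):
    """Lengths of the consecutive runs of black (1) pixels in one pixel row."""
    runs, current = [], 0
    for pixel in row:
        if pixel:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def median_run_length(line_image):
    """Median black run-length of a text line via histogram and CDF."""
    histogram = Counter()
    for row in line_image:
        histogram.update(black_run_lengths(row))
    total = sum(histogram.values())
    if total == 0:
        return 0  # no black pixels in this line
    cumulative = 0
    for length in sorted(histogram):
        cumulative += histogram[length]
        # Assumption: normalize by the total count so the CDF ends at 1
        if cumulative / total >= 0.5:
            return length


def detect_headlines(medians, k1=1.5, k2=1.1):
    """Flag line i when its median is >= k1 * median of the line below
    and >= k2 * median of the line above (missing neighbours are skipped)."""
    flags = []
    for i, m in enumerate(medians):
        above_ok = i == 0 or m >= k2 * medians[i - 1]
        below_ok = i == len(medians) - 1 or m >= k1 * medians[i + 1]
        flags.append(m > 0 and above_ok and below_ok)
    return flags


line = [[1, 1, 0, 1, 1, 1, 0, 1],
        [0, 1, 1, 0, 0, 1, 1, 1]]
print(median_run_length(line))
print(detect_headlines([2, 2, 5, 2, 2]))
```

In the second call, only the middle line has a median run-length large enough relative to both neighbours, so it is the only one flagged.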

Evaluation:

We also designed a tool that evaluates the performance of OCRopus in detecting headlines. In accordance with OCRopus standards, this tool has been developed to work with files in hOCR microformat. The tool consists of two programs:
1. The first program takes the OCRopus output and the corresponding ground truth file in hOCR format and outputs the total number of false positives and false negatives that occurred in detection. It also outputs the total number of true headlines present in the ground truth. The command line form of this program is:
headlineeval hOCRtrue hOCRactual
2. The second program parses the file produced by running the above program on a large number of files (or on a database), counts the total number of false positives and false negatives that occurred in the whole database, and reports the error rate of OCRopus on the whole database. The command line form of this program is:
count_errors FILE.TXT

Both of the above programs were written in Python.

Criteria for evaluation: For evaluating the performance of OCRopus in the detection of headlines, we define the error rate as:
e = (fp + fn) / T
e = error rate (reported as a percentage)
fp = total no. of false positives
fn = total no. of false negatives
T = total no. of text lines

We evaluated the performance on the standard University of Washington III (UW-III) database [2]. The results for the headline detection program showed clearly that the median black run-length criterion is better than the space-between-lines criterion, yet errors were still present. While visually analyzing the output, we observed that the run-length-based and space-based criteria produced different false negatives and false positives. Hence it was clear that one of the methods could be used to remove the errors produced by the other. So we tried to combine both approaches in such a way that the space-based criterion is used as a filter to detect false positives produced by the run-length-based criterion. The rule used to combine them was as follows:
1. Use the run-length-based criterion to find the headlines.
2. Calculate the median black run-length for the whole page.
3. Compare the median black run-length of all lines found to be headlines in step 1 with the median black run-length of the page. Since the median black run-length of the page represents an ordinary line rather than a headline, any headline found in step 1 with a run-length less than or equal to the run-length for the whole page is a suspicious case. Recheck this line with the space-based criterion.
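The error-rate definition and the combination rule above can be sketched as follows. This is only an illustration, not the headlineeval or count_errors programs: headlines are assumed to be identified by line index, and space_check is a hypothetical callable standing in for the space-based criterion, whose details are not given in this report.

```python
def count_errors(true_headlines, detected_headlines):
    """Return (false positives, false negatives) for sets of line indices."""
    truth, detected = set(true_headlines), set(detected_headlines)
    fp = len(detected - truth)  # detected, but not a headline in the ground truth
    fn = len(truth - detected)  # headline in the ground truth, but missed
    return fp, fn


def error_rate(fp, fn, total_lines):
    """Percentage error: (fp + fn) / T expressed in percent."""
    return 100.0 * (fp + fn) / total_lines


def combine(run_flags, line_medians, page_median, space_check):
    """Use the space-based criterion as a filter on suspicious headlines.

    A line flagged by the run-length criterion is suspicious when its median
    run-length is <= the page median; only those lines are rechecked with
    space_check(i), a hypothetical stand-in for the space-based criterion.
    """
    result = []
    for i, flagged in enumerate(run_flags):
        if flagged and line_medians[i] <= page_median:
            flagged = bool(space_check(i))
        result.append(flagged)
    return result


fp, fn = count_errors({0, 4}, {0, 2})
print(fp, fn, error_rate(fp, fn, 10))

# Line 1 is flagged but its median (2) does not exceed the page median (2),
# so it is rechecked; here the stand-in space check rejects every line.
print(combine([True, True, False], [5, 2, 2], 2, lambda i: False))
```

Lines whose median run-length clearly exceeds the page median are never passed to the space-based check, so the filter can only remove detections, never add them.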

Results:

The results were as expected. The run-length-based criterion alone performed better than the space-based criterion alone, and the combination of both criteria described above outperformed both. The error rates on the standard UW-III database for the different approaches are as follows:

Space-based headline detection:
total no. of text lines: 138018
total no. of false positives: 7356.0
total no. of false negatives: 1713.0
% error = 6.52%

Black run-length-based headline detection:
total no. of text lines: 138018
total no. of false positives: 4341.0
total no. of false negatives: 1386.0
% error = 4.14%

Both approaches combined (using the space-based approach as a filter to remove false positives):
total no. of text lines: 138018
total no. of false positives: 2452.0
total no. of false negatives: 1476.0
% error = 2.85%

Next we show some of the examples:

3. Text/Image Segmentation and Classification

Document image layout analysis is a crucial step in many applications related to document images, like text extraction using optical character recognition (OCR), reflowing documents, and layout-based document retrieval. Layout analysis is the process of identifying layout structures by analyzing page images. Layout structures can be physical (text, graphics, pictures, ...) or logical (titles, paragraphs, captions, headings, ...). The identification of physical layout structures is called physical or geometric layout analysis, while assigning different logical roles to the detected regions is termed logical layout analysis [3]. The task of a geometric layout analysis system is to segment the document image into homogeneous zones, each consisting of only one physical layout structure, and to identify their spatial relationship (e.g. reading order). Therefore, the performance of layout analysis methods depends heavily on the page segmentation algorithm used. A detailed explanation of different segmentation algorithms and a comparison of their performance can be found in [4, 5].

Another important subtask of document image analysis is the classification of physically segmented blocks into one of the predefined classes. In most cases the classification step follows the segmentation, and it is highly desirable to evaluate the system performance on the whole segmentation/classification task. With the help of such an evaluation, it is easy to decide whether incorporating these steps in OCRopus would result in improved performance. It would also be easy to decide which segmentation algorithm to use.

For the classification step we used the method described in [6], this being the best classification method. We used only two classes, text and non-text, which were relevant to OCRopus, instead of the eight classes described in that paper.

We already had an implementation of various segmentation algorithms and of the classification step. The task included re-engineering the classification step's code and porting the whole segmentation-classification module into OCRopus, making it use standard OCRopus classes and functions. The task has been completed successfully, and we now have a version of the whole segmentation-classification module in the OCR repository; it can be integrated with OCRopus if the results and experiments turn out positive. The command line form of the program is:
ocrclassifyanddisplay -i IMAGE -b BOUNDINGBOXFILE -o OUTPUTIMAGE
IMAGE: The image to be classified
BOUNDINGBOXFILE: The bounding box file produced by the segmentation algorithms
OUTPUTIMAGE: The name of the output image to be written
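As a rough illustration of what a text/non-text output image contains, the sketch below paints classified zones into a binary mask. It is not the OCRopus module: the zone tuples (x0, y0, x1, y1, label) and the half-open box convention are assumptions made for this example, and the actual bounding box file format is not described in this report.

```python
def zone_mask(width, height, zones):
    """Build a text/non-text image: 1 for pixels in text zones, 0 elsewhere.

    zones: iterable of (x0, y0, x1, y1, label) with half-open boxes
    (an assumed convention) and label either "text" or "nontext".
    """
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1, label in zones:
        value = 1 if label == "text" else 0
        # Clip each box to the image before painting it
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                mask[y][x] = value
    return mask


mask = zone_mask(4, 2, [(0, 0, 2, 2, "text"), (2, 0, 4, 2, "nontext")])
for row in mask:
    print(row)
```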

Evaluation

As discussed earlier, evaluating the segmentation and classification steps combined is highly desirable. The purpose of developing an evaluation module was to decide which segmentation algorithm would best suit the needs of OCRopus. We developed an evaluation program that evaluates the performance of the two steps against the ground truth. Our criterion for the evaluation is the Hamming distance between the text/non-text zone image produced from the ground truth and that produced by the zone classification module. The error rate is defined as follows:
e = HD * 100 / T
e = error rate
HD = Hamming distance between the ground truth text/non-text image and the actual text/non-text image
T = total no. of pixels present in the image
% efficiency = 100 - e

This program was developed in C++. The command line argument form of the program is:
ocrevaluate -gt GROUNDTRUTHIMAGE -ai ACTUALIMAGE
GROUNDTRUTHIMAGE: Text/non-text image produced from the ground truth
ACTUALIMAGE: Text/non-text image produced by the actual program

Issue of noise cleanup: Document image noise greatly affects the performance of segmentation algorithms. It was our view that the performance of all the algorithms should improve after noise cleanup. A better explanation can be found in [5]. We used the noise cleanup system explained in [7]. We also expected the improvement in performance of simple segmentation algorithms like X-Y cut to be greater than that of complex algorithms like Voronoi, the reason being that X-Y cut is more affected by noise than Voronoi. When we evaluated the performance of these algorithms with and without noise cleanup, we were proved correct.
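The Hamming-distance error rate defined above can be sketched as follows. This is a small Python illustration rather than the C++ evaluation program; the images are assumed to be equally sized lists of rows holding 0 (non-text) or 1 (text).

```python
def hamming_error_rate(ground_truth, actual):
    """e = HD * 100 / T for two binary images of identical size."""
    same_shape = len(ground_truth) == len(actual) and all(
        len(g_row) == len(a_row) for g_row, a_row in zip(ground_truth, actual)
    )
    if not same_shape:
        raise ValueError("images must have the same size")
    total = sum(len(row) for row in ground_truth)  # T: total number of pixels
    if total == 0:
        raise ValueError("images must not be empty")
    # HD: number of pixel positions where the two images disagree
    hd = sum(
        g != a
        for g_row, a_row in zip(ground_truth, actual)
        for g, a in zip(g_row, a_row)
    )
    return 100.0 * hd / total


gt = [[1, 1, 0, 0],
      [1, 1, 0, 0]]
act = [[1, 1, 1, 0],
       [1, 0, 0, 0]]
e = hamming_error_rate(gt, act)
print(e, 100 - e)  # error rate and % efficiency
```

Two of the eight pixels differ in this example, so the error rate is 25 and the efficiency is 75.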

Results:

The performance of three segmentation algorithms (Voronoi, Docstrum and X-Y cut) was evaluated by our program. The results were as we had expected and hence were quite encouraging. Below are the percentage efficiencies for all these algorithms with and without noise cleanup.


Algorithm    Percentage efficiency            Percentage efficiency
             without noise cleanup            with noise cleanup
Voronoi      87.03                            87.69
Docstrum     86.88                            86.92
X-Y cut      80.16                            85.70

As is evident, the performance of all the algorithms increases with noise cleanup, but the improvement was much greater for X-Y cut than for the other algorithms. After noise cleanup, X-Y cut has an efficiency close to that of Voronoi, and being a simple algorithm, X-Y cut can be an optimum choice for OCRopus.
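The size of the improvement can be checked directly from the reported efficiencies. The short sketch below only restates the figures from the table in this section; the variable names are chosen for illustration.

```python
# (efficiency without noise cleanup, efficiency with noise cleanup), in percent
efficiency = {
    "Voronoi": (87.03, 87.69),
    "Docstrum": (86.88, 86.92),
    "X-Y cut": (80.16, 85.70),
}

# Gain in percentage points obtained from noise cleanup
gains = {
    name: round(after - before, 2)
    for name, (before, after) in efficiency.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```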

Conclusion:

The whole experience of working at IUPR was great. This organization has a superb work culture, great minds and a very high quality of work. I learned a lot about image processing and analysis. The work I could complete here was very satisfactory. I have tried to develop as many addons as possible for OCRopus and even got very encouraging results with some of them. I hope my work on OCRopus helps it meet its goals.

References

1. T. M. Breuel: The hOCR Microformat for OCR Workflow and Results. ICDAR, 2007, accepted for publication
2. I. Guyon, R. M. Haralick, J. J. Hull and I. T. Phillips: Datasets for OCR and document image understanding research. In: Handbook of character recognition and document image analysis, World Scientific (1997) 779-799
3. R. Cattoni, T. Coianiz, S. Messelodi, C. M. Modena: Geometric layout analysis techniques for document image understanding: a review. Technical report, IRST, Trento, Italy (1998)*
4. F. Shafait, D. Keysers and T. M. Breuel: Performance Comparison of Six Algorithms for Page Segmentation. 7th IAPR Workshop on Document Analysis Systems (DAS), pages 368-379
5. F. Shafait, D. Keysers, T. M. Breuel: Pixel-Accurate Representation and Evaluation of Page Segmentation in Document Images. ICPR 2006, International Conference on Pattern Recognition, pages 872-875*
6. T. M. Breuel, D. Keysers, F. Shafait: Document Image Zone Classification: A Simple High-Performance Approach. VISAPP 2007, pages 44-51
7. T. Gupta: OCRopus addons. Tech report, IUPR, 2007