www.devbg.org
Hot News!
Microsoft Corporation just announced its strategic partnership with OpenFest
OpenFest is upgrading to Windows 7 and MS SQL Server 2008
What is OCR?
Stands for Optical Character Recognition
Extracts the text from a given image
Tauschek's machine
Was a mechanical device
Used templates, a light source and a photodetector
When light was directed towards the templates and a template matched the character, no light reached the photodetector
Project Tesseract
History of Tesseract
Open source OCR engine
Developed by HP between 1985 and 1995
Never used in an HP product
Rated highly at the Fourth Annual Test of OCR Accuracy in 1995
In 2005 HP transferred Tesseract to the ISRI and released it as open source
ISRI == Information Science Research Institute
Tesseract Versions
Stable build version 2.04
Has some documentation
Can be easily trained on a new language
Has memory leaks
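Calling the engine from a script can be sketched as follows. This is a minimal sketch, assuming the `tesseract` command-line tool is installed and follows the usual `tesseract imagename outputbase [-l lang]` usage. The file names `page.tif` and `page` and the helper names are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_command(image: str, output_base: str, lang: str = "eng") -> List[str]:
    """Build the argument list: tesseract imagename outputbase -l lang."""
    return ["tesseract", image, output_base, "-l", lang]


def run_ocr(image: str, output_base: str, lang: str = "eng") -> Optional[str]:
    """Run Tesseract and return the recognized text, or None if it is not installed."""
    if shutil.which("tesseract") is None:
        return None  # the executable is not on PATH
    subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
    # Tesseract writes its result to <output_base>.txt
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read()


# Hypothetical file names; this only shows the command, it does not run it.
print(build_tesseract_command("page.tif", "page"))
```

The command is built as a list rather than a single string, so file names containing spaces need no shell quoting.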
Demo
Training Tesseract
1. Prepare training images and .box files
Files: lang.tif and lang.box
2.04 supports only uncompressed TIFFs
.box files contain characters with coordinates
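To make the .box format more concrete, here is a small sketch of a reader. It assumes each line has the form `char left bottom right top` (the layout commonly described for Tesseract 2.x box files). The sample coordinates and the names `Box` and `parse_box_lines` are invented for illustration.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_lines(text: str) -> List[Box]:
    """Parse box-file text, assuming 'char left bottom right top' per line."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        left, bottom, right, top = (int(p) for p in parts[1:5])
        boxes.append(Box(parts[0], left, bottom, right, top))
    return boxes


# Invented sample data, not taken from a real training set.
SAMPLE = """\
T 10 20 30 60
e 34 20 50 45
"""

boxes = parse_box_lines(SAMPLE)
for b in boxes:
    # Print each character with the width and height of its bounding box.
    print(b.char, b.right - b.left, b.top - b.bottom)
```

A reader like this is useful for checking a hand-edited .box file before training, for example to spot boxes with zero or negative width.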
Demo
http://code.google.com/p/ocropus/
Windows mostly
Costs $130-$500
Tesseract Future
Page layout analysis
More languages
Improve accuracy
Add a UI
Support for connected scripts (like Arabic)
Links
For more information see:
http://code.google.com/p/tesseract-ocr/
http://en.wikipedia.org/wiki/Optical_character_recognition
http://tesseract-ocr.repairfaq.org/downloads/tesseract_overview.pdf
Speakers
http://nakov.com/blog
http://veskokolev.blogspot.com
Tesseract OCR
Questions?