Bem-vindo(a) ao Scribd!

OCR in Alfresco

Enviado por

0% acharam este documento útil (0 voto)

198 visualizações22 páginas

This document discusses integrating optical character recognition (OCR) into Alfresco. It describes building a simple OCR action using a REPO AMP that includes an action class and transformer. The action allows adding OCR to PDFs by dropping documents into a folder. Configuration is provided for Linux using pdfsandwich and Windows using the Windows Media API. Several open source OCR software options are reviewed for different operating systems. Real-world use cases demonstrating OCR with Alfresco on Ubuntu, Windows Server, and different OCR tools are also outlined.

Descrição original:

OCR in Alfresco

Direitos autorais

Formatos disponíveis

PDF, TXT ou leia online no Scribd

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Denunciar este documento

Direitos autorais:

Formatos disponíveis

Baixe no formato PDF, TXT ou leia online no Scribd

Sinalizar o conteúdo como inadequado

0% acharam este documento útil (0 voto)

198 visualizações22 páginas

OCR in Alfresco

Enviado por

DJ JAM

Direitos autorais:

Formatos disponíveis

Baixe no formato PDF, TXT ou leia online no Scribd

Sinalizar o conteúdo como inadequado

Pular para a página

Você está na página 1de 22

Pesquisar no documento

Integrating

a simple OCR
in Alfresco

Angel Borroy
developer @ keensoft
OCR for the Enterprise
• Minimum license starting in
100,000 documents/year
• Dedicated server required
• Hard learning curve
– Regular expressions
– Templates and workflows
– Proprietary integration
OCR for the Community
• Open Source
• No other server than Alfresco
• No learning curve, just drop off
your documents on a folder
and get Searchable PDFs
• Every hosting OS is supported
Building a simple OCR Action

1 REPO AMP
• Content model (simple)
• Action
• Transformer
OCR Action: Key classes
<bean id="ocr-extract"
class="es.keensoft.alfresco.ocr.OCRExtractAction"
parent="action-executer" init-method="init">
<property name="ocrTransformWorker" ref="transformer.worker.OCR" />
</bean>

<bean id="transformer.worker.OCR"
class="es.keensoft.alfresco.ocr.OCRTransformWorker">
<property name="serverOS" value="${ocr.server.os}" />
<property name="executerWindows">
<property name="executerLinux">
</bean>
OCR Action: Configuration Linux
alfresco-global.properties

# local ocr program

ocr.command=/usr/local/bin/pdfsandwich
ocr.output.verbose=true
ocr.output.file.prefix.command=-o
# rotating, cleaning, languages…
ocr.extra.commands=-lang spa
ocr.server.os=linux
OCR Action: Configuration Windows
alfresco-global.properties

# local ocr service

ocr.url=http://localhost:60064/api/OCR
ocr.output.verbose=true
# rotating, cleaning, languages…
ocr.extra.commands=Spanish
ocr.server.os=windows
OCR Action: Rule configuration

Only apply for foreground

OCR Action: Rule configuration
SYNCHRONOUS

ASYNCHRONOUS
OCR Action: Results
What else?
• Study different original documents
– Existing (incorrect) layer text
– Image resolution below 200 dpi
– Landscape / portrait orientation
– Paper size may change
• Plain OCR soft is not enough

* Image coming from http://www.tobias-elze.de/pdfsandwich/

OCR Software: Mac OS X
https://github.com/jbarlow83/OCRmyPDF
• Generates a searchable PDF/A file from a regular PDF
• Keeps the exact resolution of the original images
• Keeps file size about the same
• Deskews and/or cleans the image before performing
OCR
• Uses Tesseract OCR engine
• Open Source and developed with Python 3
OCR Software: Linux
http://www.tobias-elze.de/pdfsandwich/
• Generates "sandwich" OCR pdf files
• Recognizes page layout (even for
multicolumn)
• Uses unpaper, convert, gs and
tesseract
• Open Source and developed using OCAML
OCR Software: Windows
https://github.com/Xandroid4Net/CommandLin
eOcr (non final)
• Windows.Media.Ocr
– Microsoft API runnable in Windows 8 and
Windows 2012
– Native in Windows 10 and Windows 2016
OCR Software: Hosted services
https://ocr.space/OCRAPI
http://www.ocrwebservice.com/api/restguide
http://www.bitocr.com/documentation.html
…

https://cloud.google.com/vision/
Real world use case (1)
OS Ubuntu 14.04 LTS
Version Alfresco 5.0.d
OCR soft pdfsandwich
Languages eng+spa+cat+fra

OCR
Real world use case (2)
OS Ubuntu 15.10
Version Alfresco 5.0.d
OCR soft OCRmyPDF
Language eng

OCR
Real world use case (3)
OS Windows Server 2012 R2
Version Alfresco 5.1.e
OCR soft Windows.Media.Ocr
Language Spanish

OCR
Open Source OCR addon
https://github.com/keensoft/alfresco-simple-ocr
License LGPL v3.0
State Production
Languages (interface) English, Portuguese Brazilian,
German and Spanish
Languages (OCR) 39 / 25

“No original Alfresco resources have been overwritten”

https://github.com/OrderOfTheBee/addons/wiki/Inclusion-criteria-overview
OCR: Recap
• Generates automatically PDF searchable from
PDF Image
• Open Source addon for Alfresco available
• Minimal configuration required
• Different Open Source Linux programs
available
• Also Microsoft is providing the library
Windows.Media.Ocr
Resources
GitHub
http://github.com/keensoft/alfresco-simple-ocr
Twitter
@AngelBorroy
Blog
http://www.keensoft.es/en/category/blog-en/
http://angelborroy.wordpress.com
Integrating a simple OCR
in Alfresco

Angel Borroy
developer @ keensoft

Você também pode gostar

Eastern Europe Sourcebook
Documento110 páginas
Eastern Europe Sourcebook
Daniel Alan
93% (15)
Quick Installation FreeSwitch ASTPP
Documento6 páginas
Quick Installation FreeSwitch ASTPP
yimbi2001
Ainda não há avaliações
Install Ntopng Network Traffic Monitoring Tool On CentOS 7
Documento6 páginas
Install Ntopng Network Traffic Monitoring Tool On CentOS 7
anirudhas
100% (1)
Ecler Nuo4 Controller Service Manual PDF
Documento50 páginas
Ecler Nuo4 Controller Service Manual PDF
DJ JAM
0% (1)
Banana Stem Patty Pre Finale 1
Documento16 páginas
Banana Stem Patty Pre Finale 1
Armel Barayuga
86% (7)
Ecler Nuo3 SM
Documento32 páginas
Ecler Nuo3 SM
DJ JAM
Ainda não há avaliações
Opennebula Installation
Documento7 páginas
Opennebula Installation
Deepan Chakkaravarthy R
Ainda não há avaliações
The Definitive Guide On How To Build A High Status Social Circle
Documento46 páginas
The Definitive Guide On How To Build A High Status Social Circle
Cecilia Teresa Grayeb Sánchez
Ainda não há avaliações
Firewall and Proxy Servers
Documento26 páginas
Firewall and Proxy Servers
rajeevadeevi
Ainda não há avaliações
Docker
Documento72 páginas
Docker
Rahul Vashishtha
Ainda não há avaliações
Create Reports Using JasperSoft Studio - IDempiere en
Documento3 páginas
Create Reports Using JasperSoft Studio - IDempiere en
Wilson Gayo
0% (1)
Alfresco One 5.x Developer’s Guide - Second Edition
No Everand
Alfresco One 5.x Developer’s Guide - Second Edition
Benjamin Chevallereau
Nota: 3 de 5 estrelas
3/5 (1)
RHCSA
Documento3 páginas
RHCSA
BigaBabu
0% (1)
Critical Capabilities For High-Security Mobility Management - Gartner - 24aug2017 PDF
Documento59 páginas
Critical Capabilities For High-Security Mobility Management - Gartner - 24aug2017 PDF
DJ JAM
Ainda não há avaliações
OFBiz Framework Training
Documento2 páginas
OFBiz Framework Training
Thanakrit Wongyued
Ainda não há avaliações
Odoo 10 Install
Documento6 páginas
Odoo 10 Install
Manuel Vega
100% (1)
FortiSwitch and Security Fabric v2 - Public PDF
Documento37 páginas
FortiSwitch and Security Fabric v2 - Public PDF
DJ JAM
Ainda não há avaliações
PURL Questions and Answers
Documento3 páginas
PURL Questions and Answers
SHAHAN VS
100% (5)
Food Product Innovation PDF
Documento35 páginas
Food Product Innovation PDF
Didik Hariadi
Ainda não há avaliações
Docker For Local Web Development, Part 2 - Put Your Images On A Diet
Documento17 páginas
Docker For Local Web Development, Part 2 - Put Your Images On A Diet
Shirouit
Ainda não há avaliações
Haproxy Load Balancer
Documento60 páginas
Haproxy Load Balancer
Hady Elnabriss
100% (1)
Alfresco One Developer Guide
Documento159 páginas
Alfresco One Developer Guide
jorgetbastos
Ainda não há avaliações
APACHE Web Server and SSL Authentication
Documento8 páginas
APACHE Web Server and SSL Authentication
DJ JAM
Ainda não há avaliações
CapsMan Setup
Documento50 páginas
CapsMan Setup
DJ JAM
100% (1)
WildFly Cookbook
No Everand
WildFly Cookbook
Luigi Fugaro
Ainda não há avaliações
Odoo 11.0 Crud Development
Documento13 páginas
Odoo 11.0 Crud Development
globalknowledge
Ainda não há avaliações
Prelims Reviewer Biochem Lab
Documento4 páginas
Prelims Reviewer Biochem Lab
Riah Mae Merto
Ainda não há avaliações
Buildung OpenStack VDI
Documento17 páginas
Buildung OpenStack VDI
DJ JAM
Ainda não há avaliações
Graphite Grafana Quick Start v1.4
Documento25 páginas
Graphite Grafana Quick Start v1.4
tatreromanos
Ainda não há avaliações
Getting Started With Document Management
Documento41 páginas
Getting Started With Document Management
ntis.vn
100% (2)
Go Handbook
Documento44 páginas
Go Handbook
guillermo ocegueda
Ainda não há avaliações
Real Time Chat Using Ionic and Laravel at API
Documento8 páginas
Real Time Chat Using Ionic and Laravel at API
Jose Anguirai Moises Niquisse
Ainda não há avaliações
Air Defence Systems: Export Catalogue
Documento105 páginas
Air Defence Systems: Export Catalogue
serrorys
Ainda não há avaliações
Managing Alfresco Content From Within MS Office Community Edition 3 3
Documento27 páginas
Managing Alfresco Content From Within MS Office Community Edition 3 3
barbusba
Ainda não há avaliações
Foundation For Developers
Documento122 páginas
Foundation For Developers
bodielee79
Ainda não há avaliações
Enterprise VoIP Solutions With Alpine Linux - Slashroots 2011
Documento47 páginas
Enterprise VoIP Solutions With Alpine Linux - Slashroots 2011
SlashRoots
Ainda não há avaliações
Ephesoft Alfresco Based Enterprise Integrated Document Capture
Documento18 páginas
Ephesoft Alfresco Based Enterprise Integrated Document Capture
CIGNEX
Ainda não há avaliações
Odoo in High Availability and Ready For Mass Scalability With Zevenet - ZEVENET
Documento10 páginas
Odoo in High Availability and Ready For Mass Scalability With Zevenet - ZEVENET
Idris Idris
Ainda não há avaliações
Developer Guide: Alfresco Content Services 6.2
Documento69 páginas
Developer Guide: Alfresco Content Services 6.2
Habu Usman
Ainda não há avaliações
FastTrack To Red Hat Linux 7 System Administrator
Documento3 páginas
FastTrack To Red Hat Linux 7 System Administrator
Mangesh Abnave
Ainda não há avaliações
Quick Installation FreeSwitch ASTPP
Documento6 páginas
Quick Installation FreeSwitch ASTPP
Douglas Braga Gomes
Ainda não há avaliações
Freeswitch
Documento6 páginas
Freeswitch
Zaid Asad
Ainda não há avaliações
Getting Started With Alfresco Records Management For Community Edition 3 3
Documento35 páginas
Getting Started With Alfresco Records Management For Community Edition 3 3
jibe9191
Ainda não há avaliações
OpenNebula Workshop
Documento5 páginas
OpenNebula Workshop
Thangavel Murugan
Ainda não há avaliações
Alfresco Developer Series
Documento3 páginas
Alfresco Developer Series
Manoj Tewari
Ainda não há avaliações
Apache Karaf in Real Life - 1
Documento51 páginas
Apache Karaf in Real Life - 1
aftab_get2
Ainda não há avaliações
Elk
Documento5 páginas
Elk
rkarthik403
Ainda não há avaliações
DDD
Documento27 páginas
DDD
Ku Lee
Ainda não há avaliações
Zimbra Mail Server Maintenance
Documento6 páginas
Zimbra Mail Server Maintenance
Stefanus E Prasst
Ainda não há avaliações
Logstash V1.4.1getting Started
Documento9 páginas
Logstash V1.4.1getting Started
Anonymous taUuBik
Ainda não há avaliações
RHCE Fast Track Course
Documento3 páginas
RHCE Fast Track Course
Angel Fulton
Ainda não há avaliações
Setting Up Zimbra and Backup
Documento13 páginas
Setting Up Zimbra and Backup
Collins Emadau
Ainda não há avaliações
Oracle 11g DB Settings
Documento74 páginas
Oracle 11g DB Settings
prakash9565
Ainda não há avaliações
SCO Open Server 5.0.7 HandBook
Documento446 páginas
SCO Open Server 5.0.7 HandBook
escaramuju
Ainda não há avaliações
Alfresco One On Aws Reference Deployment 1.0
Documento33 páginas
Alfresco One On Aws Reference Deployment 1.0
Pavan Kalyan
100% (1)
Create Web Service in Java Using Apache Axis2 and Eclipse
Documento14 páginas
Create Web Service in Java Using Apache Axis2 and Eclipse
gdskumar
Ainda não há avaliações
GLPI
Documento6 páginas
GLPI
Shubhra Mohan Mukherjee
Ainda não há avaliações
The Top 10 Best Free Open Source PBX Software: Asterisk
Documento9 páginas
The Top 10 Best Free Open Source PBX Software: Asterisk
ravikkota
Ainda não há avaliações
Panasonic PBX SIP Trunking Setup Guide
Documento14 páginas
Panasonic PBX SIP Trunking Setup Guide
Kalpesh Jain
Ainda não há avaliações
Official Documentation (Ws Apache Org) - Axis2 - Part 2
Documento80 páginas
Official Documentation (Ws Apache Org) - Axis2 - Part 2
Fede
100% (9)
Documentum Audit Facility: Documentum Auditing Built-In Mechanism
Documento13 páginas
Documentum Audit Facility: Documentum Auditing Built-In Mechanism
mmarkovic
Ainda não há avaliações
FreeSWITCH and FusionPBX
Documento4 páginas
FreeSWITCH and FusionPBX
fone4vn
Ainda não há avaliações
Changing The Current Selinux Mode: (User@Host ) # Getenforce
Documento10 páginas
Changing The Current Selinux Mode: (User@Host ) # Getenforce
pmmanick
Ainda não há avaliações
How To Install OpenLDAP With MySQL As Backend Data On Debian 6 64-Bit - WingFOSS PDF
Documento9 páginas
How To Install OpenLDAP With MySQL As Backend Data On Debian 6 64-Bit - WingFOSS PDF
Ø Yasinø Rizqi
Ainda não há avaliações
Configuring A Solaris 11 DNS Client
Documento3 páginas
Configuring A Solaris 11 DNS Client
Евгений Паниот
Ainda não há avaliações
LXC Docker
Documento24 páginas
LXC Docker
nshivegowda
Ainda não há avaliações
Linux Foundation's LFCS and LFCE Certification Preparation Guide ($39)
Documento29 páginas
Linux Foundation's LFCS and LFCE Certification Preparation Guide ($39)
Alok Sharma
Ainda não há avaliações
Sso - Oam - Idm
Documento20 páginas
Sso - Oam - Idm
laiq
Ainda não há avaliações
Implementing IBM Storewise v7000 Unified
Documento372 páginas
Implementing IBM Storewise v7000 Unified
Annette Sahores
Ainda não há avaliações
0 2 Contiki Tutorial
Documento33 páginas
0 2 Contiki Tutorial
Moazh Tawab
Ainda não há avaliações
Sistema de Ficheros FSRM
Documento48 páginas
Sistema de Ficheros FSRM
ff
Ainda não há avaliações
Kubernetes
Documento27 páginas
Kubernetes
Jason Gomez
Ainda não há avaliações
Groovy in Action, Second Edition
Documento2 páginas
Groovy in Action, Second Edition
Dreamtech Press
Ainda não há avaliações
OpenShift Cookbook Sample Chapter
Documento43 páginas
OpenShift Cookbook Sample Chapter
Packt Publishing
Ainda não há avaliações
Technics SATX 30 Service Manual
Documento48 páginas
Technics SATX 30 Service Manual
DJ JAM
Ainda não há avaliações
Microsoft Windows Server 2012 RDS Vs Citrix XenApp-XenDesktop, A Comparison
Documento31 páginas
Microsoft Windows Server 2012 RDS Vs Citrix XenApp-XenDesktop, A Comparison
DJ JAM
Ainda não há avaliações
Samsung MagicRMS 2.0-0
Documento4 páginas
Samsung MagicRMS 2.0-0
DJ JAM
Ainda não há avaliações
Fusionpbx Docs
Documento141 páginas
Fusionpbx Docs
DJ JAM
Ainda não há avaliações
iDempiereDocBook PDF
Documento2.572 páginas
iDempiereDocBook PDF
DJ JAM
Ainda não há avaliações
Oracle St2540m2 Ovm WP v19
Documento39 páginas
Oracle St2540m2 Ovm WP v19
DJ JAM
Ainda não há avaliações
Acronis Backup 12 Vs Backupexec 15
Documento7 páginas
Acronis Backup 12 Vs Backupexec 15
DJ JAM
Ainda não há avaliações
Lecture 3: Bode Plots: Prof. Niknejad
Documento26 páginas
Lecture 3: Bode Plots: Prof. Niknejad
selaroth168
Ainda não há avaliações
4148-Article Text-14752-1-10-20211029
Documento7 páginas
4148-Article Text-14752-1-10-20211029
Daffa Azka
Ainda não há avaliações
Specification For Neoprene Coating On The Riser Casing
Documento17 páginas
Specification For Neoprene Coating On The Riser Casing
Lambang Asmara
Ainda não há avaliações
Internal Audit, Compliance& Ethics and Risk Management: Section 1) 1.1)
Documento6 páginas
Internal Audit, Compliance& Ethics and Risk Management: Section 1) 1.1)
Noora Al Shehhi
Ainda não há avaliações
Small Scale Industries
Documento6 páginas
Small Scale Industries
Mangesh Kadam
Ainda não há avaliações
Prepared by M Suresh Kumar, Chief Manager Faculty, SBILD HYDERABAD
Documento29 páginas
Prepared by M Suresh Kumar, Chief Manager Faculty, SBILD HYDERABAD
Bino Joseph
Ainda não há avaliações
Forms of Organizing Activity Games, Methodology of Conducting Activity Games in Physical Education Lessons
Documento4 páginas
Forms of Organizing Activity Games, Methodology of Conducting Activity Games in Physical Education Lessons
Academic Journal
Ainda não há avaliações
Concept Paper
Documento4 páginas
Concept Paper
janet a. silos
Ainda não há avaliações
TTC 1000
Documento2 páginas
TTC 1000
svismael
Ainda não há avaliações
Partnership For Sustainable Textiles - Factsheet
Documento2 páginas
Partnership For Sustainable Textiles - Factsheet
Masum Sharif
Ainda não há avaliações
AWP 4A Syllabus Fall 2021 (Misinformation)
Documento11 páginas
AWP 4A Syllabus Fall 2021 (Misinformation)
cam
Ainda não há avaliações
Likert Scale Video Presentation Rubrics
Documento1 página
Likert Scale Video Presentation Rubrics
ALDWIN B. BAYLON
Ainda não há avaliações
Cultures of The West A History, Volume 1 To 1750 3rd PDF
Documento720 páginas
Cultures of The West A History, Volume 1 To 1750 3rd PDF
tonny
Ainda não há avaliações
EEE Sofware Lab Experiment 1, PDF
Documento11 páginas
EEE Sofware Lab Experiment 1, PDF
240 Sadman Shafi
Ainda não há avaliações
Modal Verbs Ejercicio
Documento2 páginas
Modal Verbs Ejercicio
Angel sosa
Ainda não há avaliações
GFN Cired Paper
Documento8 páginas
GFN Cired Paper
Sukant Bhattacharya
Ainda não há avaliações
Iecex Bas 13.0069X
Documento4 páginas
Iecex Bas 13.0069X
Francesco_C
Ainda não há avaliações
Social and Professional Issues Pf2
Documento4 páginas
Social and Professional Issues Pf2
DominicOrtega
Ainda não há avaliações
EvoFox Katana-X Mechanical Gaming Keyboard With Outemu Blue Switches Vivid Rainbow Lighting With 13 Preset Effects Dedicated
Documento1 página
EvoFox Katana-X Mechanical Gaming Keyboard With Outemu Blue Switches Vivid Rainbow Lighting With 13 Preset Effects Dedicated
saqibdar7051236186
Ainda não há avaliações
Pte Lastest Questions
Documento202 páginas
Pte Lastest Questions
Ielts Guru Review
Ainda não há avaliações
R820T Datasheet-Non R-20111130 Unlocked
Documento26 páginas
R820T Datasheet-Non R-20111130 Unlocked
Konstantinos Goniadis
Ainda não há avaliações
EASA CS-22 Certification of Sailplanes
Documento120 páginas
EASA CS-22 Certification of Sailplanes
snorrig
Ainda não há avaliações
Jesoc5 1 PDF
Documento15 páginas
Jesoc5 1 PDF
faisal3096
Ainda não há avaliações