Sadagopan Srinivasan, Li Zhao, Lin Sun, Zhen Fang, Peng Li, Tao Wang,
Ravishankar Iyer, Ramesh Illikkal, Dong Liu
Intel Corporation
[Figure residue: Distance = [(x1-x2)^2 + (y1-y2)^2 + (LJ1-LJ2)^2]; statistical language modeling]

Compute whitespace()    Identifies the columns and other whitespaces in the text
Extract()               Extracts text lines from columns and assigns the reading order

[Plot axis residue: % of character to total pixel ratio (13.50%, 41.80%, 60.60%)]
Figure 6 shows the impact of resolution on segmentation. The execution time of segmentation increases with resolution for two functions: label_component and bounding_box. Both of these functions operate on every individual pixel. The execution time of the other two functions remains almost constant with resolution.

[Figure 6 plot: execution time (s) versus image resolution (1M to 8M) for Extract, Compute, Bounding_boxes and Label_components]
Figure 6. Impact of image resolution on Segmentation

To understand the scaling behavior of the remaining functions in segmentation, we used a metric called the character to total pixels ratio, which represents the text content in an image. Character pixels are the pixels that are classified as part of a character (data) during the segmentation phase; the rest are classified as non-data pixels. These non-data pixels are separate from the image area that is classified as column boundaries. We used this metric because we observed that different font sizes can lead to different amounts of text content in an image, and OCR processing depends on pixels rather than on actual content.

We fixed the image resolution at 8 MP for this study and varied the ratio of character to total pixels by increasing the data content in the image. The results are shown in Figure 7. As expected, the first two functions were not affected by the image content, as the resolution was fixed. As we increased the character-total pixel ratio, the execution time of the compute() function decreased while that of extract() increased. As described in Table 2, the compute() function examines non-data pixels alone and searches for the maximal whitespace rectangle to identify columns. Therefore the increased whitespace leads to increased execution time. On the other

Figure 7. Impact of character-total pixels ratio on Segmentation

2) Image smoothing
Image smoothing removes noise and smooths the boundaries of the characters. Binary morphology [13][14] is widely used in this process, and the main function used in this implementation is binary dilation. This function calculates the maximal value for each pixel within a given radius. As shown in Figure 8, each box represents a pixel. The central red box is the current pixel that is being smoothed. When the radius is set to three, all the yellow pixels fall within a circle with the red box as its center. The maximum of these 29 values is calculated and used as the value for the new image.

[Figure 8 image: input and output of the smoothing step. Source: codesource.net]
Figure 8. Image smoothing process

Figure 9 shows the impact of image resolution on the image smoothing process. We scaled a 15MP image down to 1MP for this study. We can observe that the execution time scales linearly with the resolution. Although we show resolutions from 1 to 15 mega-pixels for our scaling study, we observed that this trend also holds at intermediate resolutions, because smoothing is applied to each pixel of the segmented image. Therefore this step takes a significant share of the total processing time.
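The binary dilation described in the image smoothing subsection can be illustrated with a minimal sketch. It is not the OCRopus implementation: the function names `disk_offsets` and `dilate` are hypothetical, the image is assumed to be a list of equal-length rows, and neighbours that fall outside the image are simply skipped (an assumption, since the text does not say how borders are handled).

```python
def disk_offsets(radius):
    """Return the (di, dj) offsets that lie inside a circle of the given radius."""
    return [(di, dj)
            for di in range(-radius, radius + 1)
            for dj in range(-radius, radius + 1)
            if di * di + dj * dj <= radius * radius]


def dilate(image, radius=3):
    """Replace every pixel by the maximum value found in the disk centred on it.

    Assumption: neighbours outside the image are ignored.
    """
    height, width = len(image), len(image[0])
    offsets = disk_offsets(radius)
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            best = image[r][c]
            for di, dj in offsets:
                rr, cc = r + di, c + dj
                if 0 <= rr < height and 0 <= cc < width:
                    best = max(best, image[rr][cc])
            out[r][c] = best
    return out
```

With a radius of three, `disk_offsets(3)` yields 29 offsets, which matches the 29 values mentioned in the text. Every output pixel costs a fixed number of comparisons, so the total work grows in proportion to the pixel count, consistent with the linear scaling reported for this step.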
[Figure 9 plot: execution time (seconds) versus image resolution (1M to 15M)]
Figure 9. Execution Time Scaling of Image Smoothing

3) Recognition
Recognition time in OCR depends on various parameters, such as the text content (reflected by the character-to-total pixel ratio in our experiments), the number of text lines in the input image, and the clarity of the image. In our studies, we do not consider clarity, as there is no good metric to define it. Figure 10 shows the execution time for sample images with the number of text lines varied from 1 to 5. We observe that, with other parameters such as image resolution held constant, the recognition time is directly proportional to the number of lines.

[Figure 10 plot: recognition time versus number of lines (1 to 5)]
Figure 10. Impact of line count on recognition

We also found that, for a given resolution, the recognition time depends primarily on the actual text content in an image. This is a more general case (superset) of the above example, in which we varied the line count. Figure 11 shows the recognition time for various character-to-total pixel ratios of a 5MP image.

[Figure 11 plot: recognition time (seconds) versus % character to total pixel ratio (30.91%, 58.97%, 60.65%)]
Figure 11. Impact of text content on recognition

[Figure 12 plot; x-axis: % character to total pixel ratio (62.13%, 63.19%, 76.22%, 82.41%)]
Figure 12. Recognition time for different text content ratio at various resolutions

To corroborate our earlier findings, we took the same input image at different resolutions and measured the recognition time. As shown in Figure 12, the recognition time depends greatly on the character-to-total pixel ratio but does not increase linearly. Besides the line count and text content ratio, we observed that recognition time can be affected by other factors, such as the clarity and sharpness of the image. The investigation of these factors will be part of our future work.

3.2.2 Architectural Characteristics of OCRopus
Figure 13 shows the various architectural characteristics of OCR measured using the Vtune [15] utility for different image resolutions. Figure 13(a) shows that the Cycles per Instruction (CPI) is high for all phases of OCR. This is due to the low

[Figure 13 (a) plot: CPI versus image resolution (1M to 4M) for Recognition, Binary Dilation and Segmentation]
Figure 13 (a). CPI for OCR components
[Figure 13 (b) plot: L2 MPI versus image resolution (1M to 4M) for Recognition, Binary Dilation and Segmentation]
Figure 13 (b). L2 MPI for OCR Components
[Figure 13 (c) plot: memory bandwidth (MB/Sec) versus image resolution (1M to 4M) for Text Recognition, Image Smoothening and Layout Analysis]
Figure 13 (c). DRAM Bandwidth
Figure 13. Architectural Characteristics of OCR Components

CPI remains constant for the image smoothing and recognition phases across all image resolutions, but increases significantly (by almost 50%) at 1MP for the segmentation phase. This is due to the increased L1 data cache Misses per Instruction (MPI) at 1MP, as shown in Figure 14(a). Figure 13(b) shows the L2 MPI and Figure 13(c) shows the memory bandwidth for the various phases. These graphs highlight the low L2 misses and show the relatively low memory bandwidth utilization (~200MB/Sec, which is less than 8% of the maximum throughput available in the platform).

[Figure 14 (a) plot: segmentation MPI versus image resolution (1MP, 5MP, 8MP, 15MP)]
Figure 14 (a). L1 Data cache MPI for segmentation phase
[Figure 14 (b) plot: MPI versus L2 cache size in bytes (128K, 256K, 512K, 1M, 2M)]
Figure 14 (b). OCR L2 MPI for various cache sizes

Figure 14 shows the various cache statistics obtained using CMPsim, a Pin-based cache simulator [16]. Figure 14(a) highlights the L1 MPI for the segmentation phase. L1 MPI increases by almost 50% at 1MP, a significant degradation compared to 5MP, for the segmentation phase. This contributes to the higher CPI in the segmentation phase at 1MP, as observed in Figure 13(a). The segmentation phase has significant spatial locality. Hence, at higher resolutions, the prefetchers' effectiveness becomes pronounced and the L1 data cache MPI is reduced.

Figure 14(b) shows the L2 MPI of the workload using CMPsim. We used a 5MP input image for this study and varied the L2 cache size from 128KB to 2MB with 8-way associativity. The results show that the working set size for a 5MP image is around 1MB. L2 MPI varies from 0.0025 for a 128KB cache to around 0.001 for a 1MB cache. This shows that this workload is not memory bound and can fit in a small cache (512KB) as found in the Atom platform. These results, along with the Vtune results, corroborate that the application is computationally bound.

4. Software optimizations
As shown in previous sections, the execution time for text recognition is on the order of several seconds, which is not appealing for real-time interactive usage. Further, the execution time increases with the image resolution, and our empirical analysis shows that we need a 5 or 8 mega-pixel image resolution to achieve accurate recognition in a reasonable time. Moreover, these resolutions are supported by handheld devices as well [17][20]. In this section, we analyze software optimizations for the various OCR phases to speed up execution on the Atom processor.

4.1 Image smoothing optimizations
As shown in previous sections, image smoothing takes a significant amount of time in the OCRopus implementation, especially at higher image resolutions. Hence image smoothing, which depends on the image resolution, is one of the main hotspots of OCR.

Image smoothing is performed on every pixel of the binarized image. Each new pixel value is computed independently of neighboring new pixel values. As there is no data dependency in this process, we start with multithreading. Figure 16 shows the results of the various software optimizations at different resolutions for the image smoothing phase.

1) Multi-threading (MT): We make use of the multithreading capability (2 hardware threads) available in Atom. We threaded this function using the pthread library in Linux. We threaded the binary dilation across rows and across columns and found the results to be similar. As shown, multithreading improves the performance by about 24%.

2) Computation Optimization (CO): Figure 15 shows the pseudo code for the base implementation of this function along with comments. The pixel for which binary dilation is computed is loaded twice for each comparison, as shown in lines 6 and 7, which leads to 29 additional loads. Furthermore,
boundary conditions were checked for each pixel before the actual computation, as shown in line 7. Therefore, we rewrote the code to remove the unnecessary computations. These miscellaneous compute optimizations (CO) yield a significant reduction in runtime. Our results show that CO alone improves the performance by over 3X.

1. for (i = -3; i <= 3; i++){ //radius along rows
2.   for (j = -3; j <= 3; j++){ //radius along columns
3.     if (i*i+j*j <= 9){
4.       for(p=0; p<image.width;p++){ //stride through rows
5.         for(t=0; t<image.length;t++) //stride through columns
6.           new_image(p,t) = //load pixel and compare

[Plot legend: SP, MT+CO, MT+CO+SP]

simultaneously, hyper-threading increases instruction issue width and helps to remove some of the pipeline bubbles. Figure 17 shows that multi-threading improves execution time by about 27% for various image resolutions. Every function in segmentation benefits equally from multi-threading, as they all exhibit similar behavior.

[Plot: Base versus Multi-threaded; x-axis: image resolution (1M to 5M)]
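The computation optimization (CO) for binary dilation, described earlier in Section 4.1, targets two costs: reloading the pixel being computed for every comparison, and checking boundary conditions for every comparison. The sketch below shows one way to avoid both. It is an illustration under stated assumptions, not the authors' rewritten code: the names `RADIUS`, `OFFSETS` and `dilate_co` are hypothetical, and splitting the image into interior and border pixels is just one possible way to hoist the boundary test.

```python
RADIUS = 3
# Offsets inside the radius-3 disk, computed once rather than re-tested per pixel.
OFFSETS = [(di, dj)
           for di in range(-RADIUS, RADIUS + 1)
           for dj in range(-RADIUS, RADIUS + 1)
           if di * di + dj * dj <= RADIUS * RADIUS]


def dilate_co(image):
    """Binary dilation with a running maximum held in a local variable.

    Assumption: neighbours outside the image are ignored.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        interior_row = RADIUS <= r < height - RADIUS
        for c in range(width):
            best = image[r][c]  # running maximum; the output pixel is written once
            if interior_row and RADIUS <= c < width - RADIUS:
                # The whole disk lies inside the image: no per-neighbour bounds test.
                for di, dj in OFFSETS:
                    value = image[r + di][c + dj]
                    if value > best:
                        best = value
            else:
                # Border pixel: keep the bounds test, but only here.
                for di, dj in OFFSETS:
                    rr, cc = r + di, c + dj
                    if 0 <= rr < height and 0 <= cc < width and image[rr][cc] > best:
                        best = image[rr][cc]
            out[r][c] = best
    return out
```

Because each output pixel depends only on the input image, the outer loop over rows could also be divided between two threads, which is the property the multi-threading (MT) optimization relies on.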
[Figure 18(b) plot; x-axis: image resolution (1M to 15M)]
Figure 18(b). Multi-threaded Recognition for various resolutions

4.4 Software optimizations summary
Figure 19 summarizes all the software optimizations for the various OCRopus phases. We reduced the overall execution time by at least 2X for a 5MP image, and the image smoothing time (a hotspot) by almost 9X, using various software and algorithm optimization techniques. The performance improvements are significant across all resolutions and increase with resolution.

[Plot: execution time (seconds) of Base and Optimized runs versus image resolution (1 to 5 mega-pixels), broken down into Recognition, Smoothening, Segmentation and Binarization]

stored in the next row register, and so on. After reading and calculating seven rows of pixels, the seven row registers are filled with the maximum values for each row. Then the control unit selects one byte from each of the row registers and feeds them into the second comparator. This comparator calculates the maximum value among the seven pixels, which is the final value for the new pixel. The new value can then be stored back to the SRAM.

The image smoothing process is performed on a binarized image, where the pixel values are 0 or 1 (indicating a black or white pixel) and each pixel is stored in one byte. Because the values are only 0 or 1, we replaced the comparators with OR logic to obtain the maximum value. This reduced the execution time and die area.

[Figure 20 diagram labels: Control Unit, Computing Unit, Input (7B), Comp1 (M1, M5, M7), row registers RR0 to RR6 (each M1, M5, M7), Comp2 (MAX), Output (1B)]
Figure 20. Overview of a Computing Unit

Note that the control unit controls which row register to fill for the next iteration and ensures that the appropriate byte from each row is enabled. Essentially, each row register is read seven times before it is overwritten. It provides one byte in each cycle
[Figure 24 plot: energy-delay product (log scale, 0.1 down to 0.00000001) versus image resolution (1M to 5M)]
Figure 24. Energy delay for Image smoothing phase

11. T. M. Breuel, "High performance document layout analysis," in Symposium on Document Image Understanding Technology, Greenbelt, MD, 2003.
12. "Openfst library," http://www.openfst.org/, 2007.
13. Ergina Kavallieratou, "A Binarization Algorithm specialized on Document Images and Photos," Proceedings of the Eighth International Conference on Document Analysis and Recognition, 2005.
14. You Yang, "OCR Oriented Binarization Method of Document Image," Congress on Image and Signal Processing, 2008.
15. "Vtune Performance Analyzer," Intel Corporation, http://www.intel.com//software/products/vtune
16. Aamer Jaleel, Robert S. Cohn, Chi-Keung Luk, and Bruce Jacob, "CMP$im: A Pin-Based On-The-Fly Multi-Core Cache Simulator," Fourth Annual Workshop on Modeling, Benchmarking and Simulation (MoBS), 2008.
17. Intel Reader, http://www.intel.com/healthcare/reader/index.htm
18. iPhone, http://www.apple.com/iphone/
19. iPad, http://www.apple.com/ipad/
20. Nokia N95, http://www.nokiausa.com/find-products/phones/nokia-n95
21. D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, 60(2):91-110, 2004.
22. H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in ECCV, 2006.
23. Haidar Almohri, John S. Gray, Hisham Alnajjar, "Real-time DSP-Based Optical Character Recognition System for Isolated Arabic characters using the TI TMS320C6416T," IAJC-IJME, 2008.

6. Conclusions
In this paper, we analyzed the execution time of OCRopus processing on the Intel Atom CPU for handheld devices. We showed that the base software implementation requires more than 10 seconds for OCR processing even on a 1.6GHz core. We presented several software optimizations (multithreading, image sampling) as well as hardware acceleration for the OCRopus hotspots, implemented them, and showed that they can improve the overall processing time by as much as 2X for a 5MP image and by almost an order of magnitude for a hotspot. We also described our hardware accelerator for image smoothing, a hotspot in OCR, which reduces the power consumption significantly.

As part of future work, we are looking into accelerator designs for other phases, segmentation and recognition, to further