Figure 1: A WWW image containing text.(1)

In this paper, we present an algorithm for extracting text from Web images. The algorithm first quantizes the color space of the input image into a number of color classes using a parameter-free clustering [...]

(1) The images in the figures throughout this paper are originally in color.

Authorized licensed use limited to: Madhira Institute of Technology and Sciences. Downloaded on January 8, 2010 at 02:28 from IEEE Xplore. Restrictions apply.

3 Text Extraction

Conceptually, our text extraction algorithm follows a paradigm similar to Zhong et al.'s first method. Our approach assumes that the foreground color is roughly uniform for a given character. In other words, the color of a single character varies slowly, and its edges are distinct. Like Zhong et al.'s method, we first quantize the color space of the input image into color classes by a clustering procedure. We adopt a clustering method which is intuitive and parameter-free. Pixels are then assigned to color classes closest to their original colors. For each color class, we analyze the shape of its connected components and classify them as character-like or non-character-like. Here we use a set of features which are different from Zhong et al.’s for classification. Finally, a post-processing procedure performs a “clean-up” operation by analyzing the lay- [...]

[...] number of holes. A connected component from a non-text region, on the other hand, is irregular, with varying-size segments and many holes (see Figure 2).
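
The assignment and connected-component steps described in this section can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation used in the paper: the color-class centers are taken as given (the parameter-free clustering procedure is not reproduced here), "closest" is assumed to mean squared Euclidean distance in RGB, components are assumed to be 4-connected, and the function names assign_to_classes and connected_components are hypothetical.

```python
from collections import deque


def assign_to_classes(pixels, centers):
    """Map every RGB pixel to the index of its nearest color-class center.

    Assumption: nearness is squared Euclidean distance in RGB space.
    `pixels` is a list of rows, each row a list of (r, g, b) tuples.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        [min(range(len(centers)), key=lambda k: dist2(p, centers[k])) for p in row]
        for row in pixels
    ]


def connected_components(labels, cls):
    """Collect the 4-connected components of pixels labeled `cls`.

    Returns a list of components, each a list of (row, col) positions.
    Assumption: 4-connectivity; the source does not state which is used.
    """
    height, width = len(labels), len(labels[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if labels[r][c] != cls or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited pixel.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and labels[ny][nx] == cls):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(component)
    return components


if __name__ == "__main__":
    # Toy 2x3 image with two assumed color classes: black and red.
    image = [[(250, 0, 0), (0, 0, 0), (255, 10, 5)],
             [(245, 5, 5), (0, 0, 0), (0, 0, 0)]]
    class_map = assign_to_classes(image, [(0, 0, 0), (255, 0, 0)])
    for cls in (0, 1):
        print(cls, connected_components(class_map, cls))
```

Each component returned this way would then be passed to a shape classifier (character-like versus non-character-like); the features used for that decision are not specified in this excerpt, so that step is left out of the sketch.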
0% - 10% 58
10% - 20% 18
20% - 30% 11
30% - 40% 13
40% - 50% 19
50% - 60% 17
60% - 70% 22
70% - 80% 25
80% - 90% 20
90% - 100% 59
Ave. detect rate = 47%
Figure 5: Test example 1.
Figure 10: Test example 6.
Figure 13: Difficult example 3 - outline font.
Figure 16: Angled and touching text.