
Neurocomputing 122 (2013) 310–323


Text extraction from natural scene image: A survey

Honggang Zhang a,*, Kaili Zhao a, Yi-Zhe Song b, Jun Guo a

a School of Communication and Information Engineering, Beijing University of Posts and Telecommunications, Beijing, China
b School of Electronic Engineering and Computer Science, Queen Mary, University of London, London E1 4NS, UK

Article info

Article history:
Received 10 January 2013
Received in revised form 29 March 2013
Accepted 30 May 2013
Communicated by Liang Wang
Available online 27 July 2013

Keywords:
Text extraction
Text detection and localization
Text enhancement and segmentation
Scene understanding

Abstract

With the increasing popularity of portable camera devices and embedded visual processing, text extraction from natural scene images has become a key problem that is expected to change our everyday lives via novel applications such as augmented reality. Algorithms for text extraction from natural scene images are generally composed of the following three stages: (i) detection and localization, (ii) text enhancement and segmentation and (iii) optical character recognition (OCR). The problem is inherently challenging due to variations in font size and color, text alignment, illumination change and reflections. This paper aims to classify and assess the latest algorithms. More specifically, we draw attention to studies on the first two steps in the extraction process, since OCR is a well-studied area where powerful algorithms already exist. This paper also offers researchers links to public image databases for assessing algorithms for text extraction from natural scene images. © 2013 Elsevier B.V. All rights reserved.

1. Introduction

Cameras mounted on hand-held devices have become very popular, and natural scene images taken with digital cameras receive increasing attention in computer vision. These photos are varied and are used in different situations. For example, scenery and portrait photos are common when traveling and in daily life. For many people, however, text is the most intuitive way to obtain information about their surroundings. Consequently, extracting text from natural scenes is an important technology in computer vision. Text information embedded in digital images is considered an important aspect of overall image understanding. Nonetheless, extracting text from natural scene images poses many challenges, which arise from factors such as variation in light intensity, text alignment, color, font size, and camera angle. Examples of natural scene images with textual information can be found in Fig. 1(a)–(d).

Before we proceed any further, it is important to define commonly used terms and identify common text characteristics. Text in natural scene images can exhibit many variations with respect to the following properties:

1. Size: the range of font sizes can be wide [6].
2. Alignment: scene texts are often aligned in many directions and have geometric distortions [7].
* Corresponding author. Tel.: +86 1062283059.
E-mail addresses: zhhg@bupt.edu.cn (H. Zhang), zhaokailibupt@gmail.com (K. Zhao), yizhe.song@eecs.qmul.ac.uk (Y.-Z. Song), guojun@bupt.edu.cn (J. Guo).
0925-2312/$ - see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.neucom.2013.05.037

3. Color: characters tend to have the same or similar colors. This property makes it possible to use a connected component-based approach for text detection [8,9].
4. Edge: most scene texts are designed to be easily read, which results in strong edges at the boundaries between text and background [10,11].
5. Distortion: due to changes in camera angles, some text may carry perspective distortions that dramatically affect extraction performance [12,13].

With the aforementioned problems in mind, text extraction is often divided into the following sub-problems: (1) detection and localization, (2) text extraction and enhancement and (3) text recognition (OCR). A flowchart can be found in Fig. 2. Text detection and localization is the process of determining text locations in the image and generating bounding boxes around them. Although locations can be indicated by their bounding boxes, a background removal procedure is often performed to facilitate recognition. This means extracted text has to be converted to a binary image and enhanced before it is fed into an OCR engine. Text extraction is the stage where text components are segmented from the background. Enhancement of extracted text components is required because text regions are usually of low resolution and prone to noise.

In this paper, we examine the first two stages (text localization and extraction) in particular, which extract the texts before they are fed into an OCR engine. Papers that follow this two-stage framework are surveyed and categorized based on their core methodology. Where available, issues such as performance, execution time, and platform used for each method are reported.

To the best of our knowledge, no comprehensive survey has yet addressed text extraction from natural scene images specifically. Many surveys place emphasis on specific applications such as licence plate recognition [14]

H. Zhang et al. / Neurocomputing 122 (2013) 310 323


Fig. 1. Natural scene text images: images with variations in size, alignment, color, blur, illumination and distortion [14]. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

and text extraction in images and videos [15]. The main contributions of this paper are summarized as follows:

• We perform a thorough categorization of the literature on text extraction from natural scene images from 2000 onwards.
• We summarize state-of-the-art results obtained on common public data sets with common performance measurements.
• We collect commonly used public data sets in the field, and offer in-depth analysis of the benefits and shortcomings of these databases.

In light of recent hot topics in text extraction, several typical applications are also surveyed. This paper is organized as follows. In Section 2.1, we provide a detailed review of methods to detect and localize natural scene text. Text enhancement and segmentation methods are discussed in Section 2.2. The important issue of performance evaluation is discussed in Section 3, along with sample public test data sets and a review of evaluation methods. Section 4 introduces the main applications of scene text extraction. Finally, this paper concludes with a discussion of the current state of the art and anticipates further research in natural scene text extraction.

2. Text extraction technologies

2.1. Text detection and localization

The existing methods of text detection and localization can be roughly categorized into four groups: edge based, texture based,



Fig. 2. An overview of the word detection and recognition pipeline. The images are taken from [5].

Table 1. Results reported on the ICDAR 2003/2005 data sets for text extraction from natural images.

Author | Year | Precision | Recall | F-measure | Feature
Ou et al. [16] | 2004 | 0.53 | 0.53 | 0.53 | Edge intensity, text localization
Kim et al. [17] | 2004 | 0.563 | 0.643 | - | Text extraction, hierarchical features
Kim et al. [18] | 2005 | 0.558 | 0.689 | - | Image intensities, text localization
Hu and He [19] | 2008 | 0.58 | 0.74 | 0.65 | Edge detection, text extraction
Bui et al. [20] | 2009 | 0.787 | 0.734 | - | Topographic map, convolutional neural network, text detection
Pan et al. [21] | 2010 | 0.66 | 0.70 | 0.68 | Text detection, feature extraction, coarse-to-fine, classification
Lee et al. [22] | 2010 | 0.69 | 0.60 | 0.64 | Edge constraint, text detection
Minetto et al. [23] | 2010 | 0.63 | 0.61 | 0.61 | Text detection, multi-resolution, image segmentation
Neumann and Matas [24] | 2010 | 0.59 | 0.55 | 0.57 | Text localization, maximally stable extremal regions
Pan et al. [25] | 2011 | 0.674 | 0.697 | 0.685 | Conditional random field, connected component analysis
Chen et al. [26] | 2011 | 0.73 | 0.60 | 0.66 | Text detection, maximally stable extremal regions
Le et al. [27] | 2011 | 0.30 | 0.43 | 0.35 | Text localization, parallel edge feature, mean shift clustering
Pan et al. [28] | 2011 | 0.68 | 0.67 | 0.67 | Text detection, stroke segmentation, conditional random field
Epshtein et al. [29] | 2011 | 0.73 | 0.60 | 0.66 | Text detection, stroke width transform
Zhou et al. [30] | 2011 | 0.37 | 0.88 | 0.53 | Multilingual, scene text detection, HOG, MG, LBP
Neumann and Matas [24] | 2012 | 0.647 | 0.731 | 0.687 | Extremal regions, text localization

Higher values are better. A dash marks an F-measure that is not reported.

connected component (CC)-based and the others. For reference, a chronological listing of some of the published work evaluated on the ICDAR 2003/2005 data sets can be found in Table 1.
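The F-measure column of Table 1 combines precision and recall into a single score. As a minimal sketch, assuming the usual equally weighted harmonic mean (the ICDAR evaluation protocol may weight or aggregate the two differently, so small deviations from the table are possible), the value listed for Pan et al. [25] can be reproduced as follows. The function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (equal weighting is an assumption)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall listed for Pan et al. [25] in Table 1
print(round(f_measure(0.674, 0.697), 3))  # 0.685
```

Because the harmonic mean is dominated by the smaller of the two inputs, a method with very high recall but low precision still receives a modest score.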

2.1.1. Edge based methods

Edges are reliable features for text detection. Usually, an edge detector (e.g., Canny) is applied first, followed by morphological operations to extract text from the background and to eliminate non-text regions. Edge-based methods are usually efficient and simple for natural scene text extraction, and good performance is often found on scene images exhibiting strong edges. For the same reason, a major problem of edge-based methods is that good edge profiles are hard to obtain under the influence of shadows or highlights.

Sun et al. [9] proposed a method to extract board text from natural scenes. The approach builds on color image filtering techniques, where rims are first obtained, followed by an analysis of inherent features and relationships among characters. The method was shown to work efficiently on board text in natural scenes.

Liu and Samarabandu [7] and Ou et al. [16] proposed a multiscale edge-based text extraction algorithm, which can automatically detect and extract text in complex images. Edge strength, density and orientation variance were used as the three distinguishing characteristics of text embedded in images, which



can be used as main features for detection. The algorithm is robust with respect to the font size, style, color, orientation, and alignment of text and can be used in a large variety of application fields, such as mobile robot navigation, vehicle license detection and recognition, object identification, document retrieval, and page segmentation.

Ren et al. [31] demonstrated a framework that first uses the Roberts operator to compute edges, binarizes the image with a self-adapting threshold, and applies the erosion operator of mathematical morphology to suppress nonlinear influences and highlight linear features. The method then introduces a focusing-function-based projection to find the required text region and complete the text extraction.

Bai et al. [32,33] investigated text location in complex background images. The authors proposed an effective edge-based algorithm for text extraction in natural scenes. Pyramid decomposition is first performed, followed by color-based edge detection and binarization. Finally, the text of the color image is extracted by mathematical morphology under the restriction of text regions. In experiments on a large number of images selected from the ICDAR 2003 database [1], this algorithm showed robustness and accuracy against variations in text color and font size.
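The generic edge-based recipe described in this section (edge detection, then morphology, then a region estimate) can be sketched on a toy image. This is a simplified illustration under stated assumptions, not any surveyed author's method: a plain intensity-difference test stands in for Canny or Roberts, the dilation uses a fixed horizontal structuring element, and all function names and parameter values are hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]


def edge_map(gray: Image, threshold: int) -> Image:
    """Mark pixels whose forward horizontal or vertical intensity step exceeds threshold."""
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = abs(gray[y][min(x + 1, w - 1)] - gray[y][x])
            gy = abs(gray[min(y + 1, h - 1)][x] - gray[y][x])
            edges[y][x] = 1 if max(gx, gy) > threshold else 0
    return edges


def dilate_horizontal(binary: Image, radius: int) -> Image:
    """Morphological dilation with a 1 x (2*radius + 1) structuring element."""
    h, w = len(binary), len(binary[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            lo, hi = max(0, x - radius), min(w, x + radius + 1)
            out[y][x] = 1 if any(binary[y][lo:hi]) else 0
    return out


def bounding_box(binary: Image) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) of all foreground pixels, or None."""
    pts = [(x, y) for y, row in enumerate(binary) for x, v in enumerate(row) if v]
    if not pts:
        return None
    xs, ys = zip(*pts)
    return min(xs), min(ys), max(xs), max(ys)


# Toy 6 x 10 image: two bright vertical strokes on a dark background
gray = [[0] * 10 for _ in range(6)]
for y in range(1, 5):
    gray[y][3] = 255
    gray[y][6] = 255

box = bounding_box(dilate_horizontal(edge_map(gray, threshold=100), radius=1))
print(box)  # (1, 0, 7, 4)
```

The horizontal dilation merges the edges of neighboring strokes into one region, which is why the resulting box is slightly larger than the strokes themselves.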

2.1.2. Texture based methods

Texture-based methods rely on the observation that texts in images have distinct textural properties that distinguish them from the background. They mostly use texture analysis approaches such as Gaussian filtering, wavelet decomposition, the Fourier transform, the Discrete Cosine Transform (DCT), and Local Binary Patterns (LBP). Typically, features are extracted over a certain region and a classifier (trained using machine learning techniques or built from heuristics) is employed to identify the existence of text. Because text regions have textural properties distinct from non-text ones, these methods can detect and localize texts accurately even when images are noisy. However, they are relatively slow and their performance is sensitive to text alignment and orientation.

Zhou et al. [30] proposed a multilingual text detection method, which focuses on finding all of the text regions in a natural scene regardless of their language type. According to the rules of the writing systems, three different texture features are selected to describe multilingual text: histogram of oriented gradients (HOG), mean of gradients (MG) and local binary patterns (LBP). Finally, a cascade AdaBoost classifier is adopted to combine the influence of the different features and decide the text regions. This approach is similar to the methods described in [34,35].

Pan et al. [21] proposed a new method for fast text localization in natural scene images that combines learning-based region filtering and verification using a coarse-to-fine strategy. Unlike methods that use learning-based classification for only filtering or verification, a boosted classifier and a polynomial classifier are used for coarse region filtering and fine verification respectively, with selection of discriminative features. For the verification stage, the authors evaluate five widely used features: HOG (histogram of oriented gradients), LBP (local binary pattern), DCT (discrete cosine transform), Gabor, and wavelets.
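Of the texture features named above, the local binary pattern is the simplest to show concretely. The following is a minimal sketch of the basic 8-neighbor LBP on a 3 x 3 patch and its histogram over an image. The neighbor ordering and the ">= center" convention are assumptions (several variants exist), and the function names are illustrative.

```python
from typing import List


def lbp_code(patch: List[List[int]]) -> int:
    """8-bit local binary pattern of the centre pixel of a 3 x 3 patch."""
    center = patch[1][1]
    # Neighbours visited clockwise, starting at the top-left corner
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(gray: List[List[int]]) -> List[int]:
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            patch = [row[x - 1:x + 2] for row in gray[y - 1:y + 2]]
            hist[lbp_code(patch)] += 1
    return hist


patch = [[9, 1, 9],
         [1, 5, 1],
         [9, 1, 9]]
print(lbp_code(patch))  # 85: the four corner bits (1 + 4 + 16 + 64) are set
```

A histogram like the one returned by `lbp_histogram`, computed over a candidate window, is the kind of feature vector that would then be passed to a classifier.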
A boosting framework integrating feature and weak classifier selection based on computational complexity was proposed by Shehzad Muhammad et al. [35] to construct efficient text detectors. The proposed scheme uses a small set of heterogeneous features which are spatially combined to build a larger set. A neural network based localizer learns the necessary rules for localization. Three different types of features, Mean Difference Feature (MDF), Standard Deviation (SD) and Histogram of Oriented Gradients (HOG), are extracted from a text segment at block level.

Ji et al. [36] presented a robust text characterization approach based on the local Haar Binary Pattern (LHBP). The authors specifically addressed the issues of variant illumination and text-background contrasts. More specifically, a threshold-restricted local binary pattern is extracted from the high-frequency coefficients of a pyramid Haar wavelet, calculated at different resolutions to represent multiscale texture information. LHBP can preserve and normalize inconsistent text-background contrasts while filtering out gradual illumination variations. Under the assumption that the occurrence between certain directions is notable, directional correlation analysis (DCA) was used to locate candidate text regions.

Saoi et al. [37] developed a new unsupervised clustering technique for the classification of multi-channel wavelet features to deal with color images, as the algorithm in [38] lacked the ability to discriminate color differences. The method consists of the following stages: decomposing the color image into R, G and B channel images and applying a 2D wavelet transform to each; performing unsupervised pixel block classification with the k-means algorithm in the combined feature vector space; and integrating the results of the three channels by logical OR.

Angadi and Kodabagi [14] proposed a new texture based text detection method. It uses high pass filtering in the DCT domain to suppress most of the background. Feature vectors based on homogeneity and contrast are then computed to identify text regions. Although the algorithm is robust and achieves a detection rate of 0.966 on a variety of 100 low resolution natural scene images, the paper mainly focuses on the localization of roughly estimated text blocks.

Gllavata et al. [38] proposed a method based on unsupervised classification of high frequency wavelet coefficients for text detection in video frames. The authors slide a window over the transformed wavelet images and characterize the areas by the distribution of high-frequency wavelet coefficients. The predefined regions are then classified with the k-means algorithm into three classes: text, simple background and complex background.
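The unsupervised step in these wavelet-based methods, grouping windows into three classes with k-means, can be sketched as follows. This is a hedged illustration only: each window is reduced here to a single hypothetical scalar ("high-frequency energy"), whereas the surveyed papers use richer feature vectors, and the quantile-based initialization is an assumption chosen to keep the example deterministic.

```python
from typing import List


def kmeans_1d(values: List[float], k: int, iterations: int = 20) -> List[int]:
    """Cluster scalars with Lloyd's algorithm; returns one cluster index per value.

    Centres are initialised at evenly spaced quantiles of the sorted data.
    """
    ordered = sorted(values)
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centre for every value
        labels = [min(range(k), key=lambda j: abs(v - centres[j])) for v in values]
        # Update step: each centre becomes the mean of its members
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centres[j] = sum(members) / len(members)
    return labels


# Hypothetical high-frequency energies of nine sliding windows
energies = [0.1, 0.2, 0.15, 5.0, 5.5, 4.8, 12.0, 11.5, 12.5]
print(kmeans_1d(energies, k=3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

Which of the three clusters corresponds to text is decided in a later step; the clustering itself only separates windows with similar statistics.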

2.1.3. Connected component (CC)-based methods

Connected component-based methods use a bottom-up approach, grouping small components into successively larger components until all regions in the image are identified [39–41]. A geometrical analysis is often needed in later stages to identify text components and group them to localize text regions. CC-based methods directly segment candidate text components by edge detection or color clustering. The non-text components are then pruned with heuristic rules or classifiers. Since the number of segmented candidate components is relatively small, CC-based methods have a lower computation cost, and the located text components can be directly used for recognition. However, CC-based methods cannot segment text components accurately without prior knowledge of text position and scale. Moreover, designing fast and reliable connected component analyzers is difficult since there are many non-text components which are easily confused with text when analyzed individually.

Zhang et al. [42] used a conditional random field (CRF) to give connected components text or non-text labels. This work is derived from the algorithm proposed in [25], which also used a CRF model. Because text-like background regions are often recognized as text characters with low confidence, the authors proposed a two-step iterative CRF algorithm with a belief propagation inference stage and an OCR filtering stage. Two kinds of neighborhood relationship graph are used in the respective iterations for extracting multiple text lines. The first CRF iteration aims at finding certain text CCs, especially in multiple text lines, and sends uncertain CCs to the second iteration. The second iteration gives



a second chance to the uncertain CCs and filters false-alarm CCs with the help of OCR. The proposed method aims at extracting text lines rather than the separated words used as ground truth in the ICDAR 2005 competition, which reduces the measured precision and recall. The accuracy improves noticeably when this factor is not taken into account, as shown in Table 1.

Similar to the two methods above, Wang et al. [43] proposed a coarse-to-fine CC-based method to locate characters in scene images. The authors also separate color images into homogeneous color layers and then analyze each connected component in the color layers using a block adjacency graph (BAG). For the coarse detection of characters, an aligning-and-analysis scheme is proposed to locate all potential characters in all color layers.

Wang and Kangas [44] proposed a basic localization method based on connected component analysis. Connected components are extracted from each decomposed layer (hue space, weak color space and gray-scale space). Alignment analysis helps to check the block candidates. Character segmentation using the PAS (priority adaptive segmentation) algorithm extracts characters in the final composed image. This algorithm is not robust to varying illumination, multi-scale sizes, low contrast and over-lighting.

Wang and Kangas described a CC-based approach to automatic text location and segmentation in natural scene images in [45]. A multi-group decomposition scheme is used to deal with the complexity of the color background. Connected component extraction is implemented using the block adjacency graph (BAG) algorithm after noise filtering and a run length smearing (RLS) operation. Some heuristic features and priority adaptive segmentation (PAS) of characters are proposed for block candidate verification and grayscale-based recognition.
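The bottom-up idea shared by the CC-based methods above, extracting connected components and then pruning them with heuristic rules, can be sketched as below. This is not the BAG algorithm or any surveyed author's rule set: the labeling uses a plain breadth-first search with 4-connectivity, and the pruning rule with its `min_height` and `max_aspect` thresholds is purely hypothetical.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def connected_components(binary: List[List[int]]) -> List[Box]:
    """Bounding boxes of 4-connected foreground regions, found by BFS."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not binary[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


def looks_like_character(box: Box, min_height: int = 3, max_aspect: float = 2.0) -> bool:
    """Hypothetical pruning rule: tall enough and not too elongated."""
    width = box[2] - box[0] + 1
    height = box[3] - box[1] + 1
    return height >= min_height and width / height <= max_aspect


binary = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
]
candidates = [b for b in connected_components(binary) if looks_like_character(b)]
print(candidates)  # [(1, 1, 1, 3), (4, 1, 5, 3)]
```

The long horizontal bar in the last row is rejected by the height rule, which mirrors how a line or frame in the scene would be pruned before character grouping.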

2.1.4. Stroke based methods

As a basic element of text strings, strokes provide robust features for text detection in natural scene images. Text can be modeled as a combination of stroke components with a variety of orientations, and features of text can be extracted from combinations and distributions of the stroke components. One feature that separates text from other elements of a scene is its nearly constant stroke width. This can be utilized to recover regions that are likely to contain text. In stroke-based methods, text stroke candidates are extracted by segmentation, verified by feature extraction and classification, and grouped together by clustering. These methods are easy to implement for specific applications because of their intuitiveness and simplicity. However, complex backgrounds often make text strokes hard to segment and verify.

Epshtein et al. [29] presented a novel image operator, called the Stroke Width Transform (SWT), that seeks the value of the stroke width for each image pixel, combined with geometric reasoning. Without scanning a window over a multi-scale pyramid, the method merges pixels of similar stroke width into connected components, which allows letters to be detected across a wide range of scales in the same image. Another notable difference is that the method does not use any language-specific filtering mechanisms, which results in a truly multilingual text detection algorithm, compared to the methods available before 2010.

Pan et al. [28] presented a hybrid method for detecting and localizing texts in natural scene images, using a scale-adaptive segmentation algorithm designed for stroke candidate extraction and a CRF model with pair-wise weights from local line fitting designed for stroke verification. This algorithm achieved competitive results in the ICDAR 2005 competition.

Pan et al. [46] proposed a novel algorithm, based on stroke components and descriptive Gabor filters, to detect text regions in natural scene images. Text characters and strings are constructed from stroke components as basic units. Gabor filters are used to describe and analyze the stroke components in text characters or strings. A K-means algorithm is then applied to cluster the descriptive Gabor filters. The experimental results demonstrated that the algorithm performed well across backgrounds and variant text patterns, and outperformed algorithms published before 2010 for text extraction from natural scene images.

Srivastav and Kumar [47] proposed a method to detect text in scene images using stroke width and nearest-neighbor constraints derived from empirical knowledge.

Saurav Kumar et al. [16] implemented the stroke width transform text detection algorithm of [29] on a Nokia N900. The most important improvement to the algorithm is improved robustness to stray strokes.

Sezer Karaoglu et al. [48] proposed a merging step to resolve both the issue of background and noise reduction and the issue of many letters being disconnected. The merging step computes the output BI(x, y) pixel-wise from DG(x, y) and LB(x, y), where (x, y) refers to the spatial coordinates of the current pixel, BI to the merging output, DG to the output of the difference of gamma corrections, and LB to the local binarization output. Once BI is obtained, the authors treat its connected components as text candidates and use a random forest classifier to classify these candidates as text or non-text regions.

2.1.5. Other approaches

Due to the large number of possible variations in text, the single-track approaches listed previously often fail under certain conditions. To deal with such variations, some researchers have developed new approaches [20,22,23,46,49] that combine several of the above.

Unlike most of the existing algorithms above, which focus on detecting horizontal texts, Yao and Bai [50] constructed an effective and practical detection system for texts of arbitrary orientations in natural scenes (samples of texts of arbitrary orientations in natural images are shown in Fig. 3). Conventional features, such as the Stroke Width Transform (SWT) [29], that are primarily designed for horizontal texts could lead to a significant number of false positives, as most non-horizontal texts are not detected. Yao and Bai adopted two sets of rotation invariant features based on SWT and a two-level classification scheme that can discriminate texts from non-texts. Hence, the system is able to effectively detect texts of arbitrary orientations while producing fewer false positives. The authors evaluated their text detection method on two data sets, ICDAR [1,51] and a data set [46] with texts of various directions, and achieved good results. The text detection process is depicted in Fig. 4.

Neumann and Matas [24] exploited Maximally Stable Extremal Regions (MSERs), which provide robustness to geometric and illumination conditions, similar to [26]. The performance of this method was evaluated on two standard data sets. On the Char74k data set, a recognition rate of 72% is achieved. On the ICDAR 2003 text localization data set, the precision is 0.59 and the f-measure is 0.57. On the robust reading data set, the precision is 0.42.

Chen et al. [26] combined the complementary properties of Canny edges and Maximally Stable Extremal Regions (MSER). The authors remove the MSER pixels outside the boundary formed by the Canny edges to allow small letters to be detected in images of limited resolution. Then the SWT and geometric image of these MSER regions are generated to obtain more reliable results. Finally, letters are clustered into lines and additional checks are performed to eliminate false positives. The detected texts are binarized letter patches, which can be directly used by a text recognition system.
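Several methods above rely on the observation that text has a nearly constant stroke width. The sketch below illustrates only that observation and is deliberately not the Stroke Width Transform of [29], which traces rays along gradient directions between opposing edge pixels. Here, as a stated simplification, horizontal run lengths of foreground pixels serve as a crude width estimate, and their coefficient of variation serves as a hypothetical text-likeness cue.

```python
from statistics import mean, pstdev
from typing import List


def horizontal_run_lengths(binary: List[List[int]]) -> List[int]:
    """Lengths of all horizontal runs of foreground pixels, row by row."""
    runs = []
    for row in binary:
        length = 0
        for value in row:
            if value:
                length += 1
            elif length:
                runs.append(length)
                length = 0
        if length:  # run that reaches the end of the row
            runs.append(length)
    return runs


def stroke_width_variation(binary: List[List[int]]) -> float:
    """Coefficient of variation of run lengths; low values suggest uniform strokes."""
    runs = horizontal_run_lengths(binary)
    if not runs:
        return float("inf")
    return pstdev(runs) / mean(runs)


# Two vertical strokes of constant width 2 (letter-like)
letter = [[0, 1, 1, 0, 0, 1, 1, 0] for _ in range(4)]
# A triangular blob whose width grows from 1 to 7 (not letter-like)
blob = [
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1],
]
print(stroke_width_variation(letter))  # 0.0
print(stroke_width_variation(blob) > 0.5)  # True
```

A component with low variation would be kept as a letter candidate, while one with high variation would be discarded or passed to a further check.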



Fig. 3. Examples of detection results. (a) Detected texts in various orientations on the proposed data set; yellow rectangles: true positives, pink rectangles: false positives. (b) Detected texts in images of various languages collected from the internet [50]. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

Fig. 4. Text detection process [50]: (a) original; (b) edge detection; (c) SWT; (d) association; (e) component filtering; (f) component verification; (g) aggregation; and (h) chain verification.

Neumann [52] differed from the MSER-based methods [26] and [24] in that it tested all Extremal Regions (ERs), not only the subset of MSERs, while reducing the memory footprint and maintaining the same computational complexity and real-time performance. The selection of suitable ERs is carried out in real time by a sequential classifier on the basis of novel features specific to character detection. The method was trained using ERs manually extracted from the ICDAR 2003 training data set and then evaluated on the ICDAR 2011 data set and the Street View Text (SVT) data set [4] (see Fig. 5).

Pan et al. [25] treated region-based methods and connected component (CC)-based methods as complementary. First, a region detector is designed to estimate the text regions in each layer of the image pyramid, and scale-adaptive binarization is used to generate text components. Then, at the component analysis stage, a CRF model is used to filter out the non-text components. Finally, text components are grouped using minimum spanning tree clustering.

Le et al. [27] proposed a new text location method using the parallel edge feature of text strokes, based on the observation that a text stroke consists of two parallel edges. First, mean-shift clustering is employed to group similar pixels into clusters. Then, parallel edges are detected to verify whether the connected

components are text strokes. The contribution of this paper is a new feature of parallel edges along the stroke, which provides structural information for text localization.

Ma et al. [54] presented a robust method for text detection in color scene images. The algorithm is based on edge detection and connected components. First, multi-scale edge detection is achieved with the Canny operator and an adaptive thresholding binarization method. Second, the filtered edges are classified by an SVM trained on a combination of HOG, LBP and several statistical features, including mean, standard deviation, energy, entropy, inertia, local homogeneity and correlation. Third, the k-means clustering algorithm and the binary gradient image are used to filter the candidate regions and re-detect the regions around the text candidates. Finally, the texts are relocated accurately by projection analysis. The experimental results on the ICDAR 2003 text location test database show the effectiveness of the algorithm.

Kumar et al. [55] proposed a layer-based morphological operation for detecting text in natural scene images and obtained a relatively high performance rate. A color reduction method was applied to the RGB color image to reduce the total number of colors in each RGB component. Then, the color-reduced image was converted into a gray scale image and divided into three layers of connected



Fig. 5. Experimental results [52]. Incorrectly recognized letters are marked red: (a) text localization and recognition examples on the ICDAR 2011 dataset [53] and (b) text localization and recognition examples from the Street View Text Detection dataset [5]. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

Fig. 6. A few experimental samples from the ICDAR 2003 data set using the method of [55].

components. Each layer was processed individually based on geometrical features to find the text region, and the results were combined. Finally, the edge projection profile was used to verify the text region. Fig. 6 shows some sample results. However, [55] is limited to some standard font sizes; images with a very large font size or a small font size of less than 10 pixels are not taken into consideration.

Toan Dinh Nguyen et al. [56] proposed a new method in which 2-D tensor voting is used to separate text regions from background objects or non-text regions. The text line information extracted by tensor voting is useful for reducing the false positive rate in region-based text localization methods.

Navid Mavaddat et al. [57] explored features for text detection within images of scenes containing other background objects using a Support Vector Machine (SVM). In the approach, Haar-like features are designed and applied to banks of

bandpass filters and phase congruency edge maps. The paper also evaluates the contributions of the features to text detection by the SVM coefficients, which leads to time-efficient detection using an optimal subset of features.

Kim et al. [58] proposed a text extraction algorithm that utilizes the focus information of scene images. First, pixel sampling and a mean-shift algorithm are used to choose the text color candidates within the focused section. Second, all pixels in the image are compared to the target seed color using the HCL (hue, chroma, and luminance) distance measure. An adaptive binarization method then classifies them into two regions to form connected components. The text region is expanded iteratively by searching neighboring regions with the updated text color. By indicating the location of the target text with the focus interface, the method achieved a high precision rate on the test images. The authors confirmed the feasibility of this method for hand-held camera applications.
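The seed-color comparison and iterative expansion described for [58] can be illustrated with a small region-growing sketch. This is not the authors' algorithm: Euclidean distance in RGB is used here as a stand-in for the HCL distance, the mean-shift seed selection and adaptive binarization steps are omitted, and the function names and the `max_distance` parameter are hypothetical.

```python
from collections import deque
from typing import List, Sequence, Set, Tuple

Color = Tuple[int, int, int]


def color_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance in RGB (a stand-in for the HCL distance of [58])."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def grow_text_region(image: List[List[Color]], seed: Tuple[int, int],
                     max_distance: float) -> Set[Tuple[int, int]]:
    """Grow a 4-connected region from seed, updating the mean colour as it grows."""
    h, w = len(image), len(image[0])
    sx, sy = seed
    region = {(sx, sy)}
    total = list(image[sy][sx])  # running sum of member colours
    queue = deque([(sx, sy)])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nx < w and 0 <= ny < h) or (nx, ny) in region:
                continue
            mean_color = [c / len(region) for c in total]
            if color_distance(image[ny][nx], mean_color) <= max_distance:
                region.add((nx, ny))
                total = [t + c for t, c in zip(total, image[ny][nx])]
                queue.append((nx, ny))
    return region


RED, RED2, BLUE = (200, 10, 10), (190, 20, 15), (10, 10, 200)
image = [
    [BLUE, RED, RED2, BLUE],
    [BLUE, RED, BLUE, BLUE],
    [BLUE, BLUE, BLUE, BLUE],
]
region = grow_text_region(image, seed=(1, 0), max_distance=30)
print(sorted(region))  # [(1, 0), (1, 1), (2, 0)]
```

Updating the mean color after every accepted pixel lets the region follow gradual color changes across the text, at the cost of possible drift if `max_distance` is set too loosely.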



Minori Maruyama et al. [59] presented a method to detect characters on signboards in natural scene images. In the method, the authors used Harr wavelet (full and sparse), HOG, and moment statistics (skewness and kurtosis), separately, the method trained SVMs with RBF kernel. Finally, ensemble of stump classiers. Ye et al. [9] proposed a method for text detection in natural scene images by feature combination under a coarse-to-ne framework similar with the algorithm proved in [21]. First, color feature is used to segment images into color-uniform regions by a clustering algorithm. Then edge features are extracted to construct a weak classier to classify the regions into candidates or background. After a layout analysis procedure which is based on the orientations of Chinese, Japanese and Korean characters, candidate regions are connected into text lines. The author used GLVQ (generalized learning vector quantization) to group pixels of similar color into clusters. And the part of feature extraction is using the method of wavelet transform. In the future, text detection on other languages (English, Arabic, etc.) should be investigated. The experiments on the data set collected by authors show a higher recognition rate of coarse and ne detection and a lower false alarm rate, which are 93.9% and 4.2% respectively. The image sizes in the dataset range from 640 480 to 1024 768 pixels. The test set consists of a variety of situations such as text in different font-sizes, colors, light texts on dark background. Gao and Yang [60] proposed an adaptive algorithm for text detection from a natural scene by utilizing a hierarchical algorithm structure with different emphasis at each layer. First, a multi-resolution edge detector is utilized to obtain initial candidates of text regions. Then adaptive searching based on layout syntax and Gaussian mixtures based color modeling detects the text cues to discriminate text/ non-text regions. 
After the layout analysis, the detected text regions are fed into the EBMT [61] software for translation. The rate of detected text regions without missing characters is 93.3%.

2.2. Text enhancement and segmentation

Camera-captured images can suffer from low resolution, blur, and perspective distortion, as well as complex layout and interaction of the content and background [62–65]. OCR systems have been available for a number of years, and current commercial systems achieve an extremely high recognition rate for machine-printed documents on a simple background. However, it is not easy to use commercial OCR software to recognize text extracted from natural scene images. We now address the issue of text extraction and enhancement for generating the input to an OCR algorithm. Although most text with a simple background and high contrast can be correctly localized and extracted, poor-quality text can be difficult to extract. In the following part we therefore focus on how to enhance and segment text or documents with both complex and simple backgrounds. Zhu et al. [66] proposed a natural scene character recognition method using a convolutional neural network (CNN) and bimodal image enhancement. The character recognition

rate on ICDAR'03 [1] is 86.96%. As future work, the authors plan to seek a way to enhance images and suppress noise at the same time. Kita et al. [67] described a new method for binarizing multicolored characters subject to heavy degradation. K-means clustering is applied to the pixels of a given image in the HSI (hue, saturation, and intensity) color space, and binarized images are generated from every dichotomization of the K clusters (sub-images). The average aspect ratio of a character is then calculated to estimate the number of characters, so that the degree of character-likeness can be computed with mesh features fed into an SVM. Experiments using the ICDAR 2003 robust word recognition data set showed that this method [67] achieved a correct binarization rate of 80.4%. Zhou et al. [68] proposed an improved adaptive document image binarization method. First, the authors use a low-pass Wiener filter based on local statistics to denoise the given image, and a first rough estimate of the foreground regions is made. Then the background surface is estimated by neighboring-pixel interpolation. Finally, the threshold is obtained by combining the calculated background surface with the preprocessed image. The method is robust to uneven illumination; the extracted foreground regions lose fewer strokes and retain edge information well. An experiment on a distorted printed document image is shown in Fig. 7. Mishra et al. [69] detected potential locations of characters in a word image using a sliding window based approach. An energy minimization framework is then used to formulate the problem of finding the most likely word from the set of characters produced by the detection framework.
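The dichotomization step described for the K-means approach of [67] can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that a cluster label map (one label in 0..K-1 per pixel) has already been produced by K-means in HSI space, it treats the two polarities of each split as distinct candidates (giving 2^K − 2 binarized images), and the names `dichotomies`, `binarize` and `candidate_binarizations` are hypothetical. The SVM scoring of character-likeness is not shown.

```python
from itertools import combinations
from typing import FrozenSet, Iterator, List, Sequence


def dichotomies(k: int) -> Iterator[FrozenSet[int]]:
    """Yield every non-empty proper subset of labels 0..k-1 as a foreground set."""
    labels = range(k)
    for size in range(1, k):
        for foreground in combinations(labels, size):
            yield frozenset(foreground)


def binarize(label_map: Sequence[Sequence[int]],
             foreground: FrozenSet[int]) -> List[List[int]]:
    """Map each pixel to 1 if its cluster label is in the foreground set, else 0."""
    return [[1 if label in foreground else 0 for label in row]
            for row in label_map]


def candidate_binarizations(label_map: Sequence[Sequence[int]],
                            k: int) -> List[List[List[int]]]:
    """Build one tentative binary image per dichotomization of the k clusters."""
    return [binarize(label_map, fg) for fg in dichotomies(k)]
```

Each candidate image would then be scored for character-likeness, and the best-scoring one kept as the binarization result.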
In experiments on cropped word recognition, the authors improved the accuracy by over 15% and 10% on the Street View Text-WORD [4] and ICDAR data sets, respectively. Liu and Wang [70] proposed a new approach for image binarization based on an adaptive threshold. Image binarization is treated as an optimization problem in which the best threshold is found by partitioned binarization with the Otsu threshold, and an algorithm for eliminating boundary effects is proposed to further improve the result. The new approach is compared with a classic binarization method. The experimental results show that the new method better preserves the original edge features and suits more applications, especially images with rich edge information. Jiang et al. [71] described an enhancement approach for degraded-document binarization. The authors apply gray-scale dilation and erosion, and then combine the binarization technique proposed in [72] with thresholding over a small rectangular neighbourhood. The binarization result is shown in Fig. 8. Le and Lee [73] presented two methods for the correction of distorted text, one by using a mapping function and another by

Fig. 7. Binarization of a distorted printed document image [68].



Fig. 8. The binarization result [71].

Fig. 9. Experimental result [74]: (a) badly illuminated document images and (b) binarization.

the biquadratic transformation function. First, boundary lines of the label are detected by the Hough transform and Bezier curve approximation. Experimental results show that the proposed methods correctly restore the original rectangular shape of the label area. Lu et al. [40] proposed a text enhancement method capable of extracting camera text lying on a planar or smoothly curved surface in perspective views. In this method, a few perspective invariants, including character ascenders and descenders, centroid intersection numbers, and water reservoirs, are first detected. Kim et al. [18] used a well-known image segmentation algorithm, the split-and-merge approach. Probable text blocks are then regrouped and non-text blocks are removed, which is useful for robust extraction of text from camera images. Valizadeh et al. [74] proposed a novel hybrid algorithm for the binarization of badly illuminated document images. To enhance a badly illuminated document image, a new transformation function is designed to map the gray level of each pixel to a new domain. The proposed binarization algorithm was tested on a set of degraded document images (Fig. 9). Huang et al. [75] described an extraction method for scanned documents based on a hidden Markov model (HMM). The paper [48] proposes a novel approach for foreground/background detection and skew estimation using morphological edge analysis, which shows an immediate improvement in OCR accuracy. In [76], a binarization algorithm based on the difference of gamma correction and morphological reconstruction is used to extract the connected components of an image. Huading et al. [77] described a new binarization algorithm based on maximum

gradient of histogram. The paper [78] proposes a new binarization method that obtains a threshold value based on the fractal dimension by evaluating both the density of a region and its stability with respect to the threshold value.

3. Performance evaluations

3.1. Metrics

The performance measure used for natural scene text extraction is usually the detection rate, defined as the ratio between the number of detected texts and the number of all texts contained in the images. As with other computer vision tasks, the extraction of text from natural scene images is evaluated on public data sets; the ICDAR 2003 and 2005 data sets [1,51] are the most widely used. The formulas below for precision, recall and F-measure are the standard performance measures for scene text localization algorithms. As the ICDAR 2003 robust reading competitions [79] show, precision and recall are generally used to measure a retrieval system. Precision $p$ is defined as the number of correct estimates divided by the total number of estimates, $p = C/E$. A system that over-estimates the number of text rectangles will get a low precision score. Recall $r$ is defined as the number of correct estimates divided by the total number of targets, $r = C/T$. A system that under-estimates the number of text rectangles will have a low recall score. Here $E$ is the number of estimates (detected texts), $T$ is the number of ground-truth targets, and $C$ is the number of estimates that are correct.



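Tomaz [80] proposed a metric extended from the one used for retrieving text in documents. It defines an overall measure as

$$f = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}.$$

However, the metric above cannot exactly evaluate the performance of a scene text extraction system, since a system can hardly be expected to agree exactly with the annotated bounding rectangles. Hence, the ICDAR 2003 robust reading competition [79] defines the match $M$ between two rectangles as the area of intersection divided by the area of the minimum bounding box containing both rectangles. Writing $M(r, R)$ for the best match of a rectangle $r$ within a set of rectangles $R$, and letting $E$ and $T$ denote the sets of estimated and ground-truth rectangles, the new definitions of precision and recall are as follows:

$$p = \frac{\sum_{r_e \in E} M(r_e, T)}{|E|} \tag{1}$$

$$r = \frac{\sum_{r_t \in T} M(r_t, E)}{|T|} \tag{2}$$

The standard f-measure is then adopted to combine the precision and recall figures into a single measure of quality. The relative weights of these are controlled by $\alpha$. The overall performance is the f-measure:

$$f = \frac{1}{\alpha / p + (1 - \alpha) / r} \tag{3}$$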
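The rectangle-matching metric of the ICDAR 2003 competition [79] can be sketched in a few lines. This is a minimal sketch, not the official evaluation code: it assumes axis-aligned rectangles given as (x_min, y_min, x_max, y_max) tuples, takes the match between two rectangles as intersection area over the area of their minimum bounding box, scores each rectangle by its best match in the other set, and returns zero scores for empty inputs as a convention of this sketch. The names `match`, `best_match` and `evaluate` are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def area(r: Rect) -> float:
    """Area of a rectangle; degenerate or inverted rectangles count as 0."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def match(a: Rect, b: Rect) -> float:
    """Intersection area divided by the area of the box containing both."""
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    hull = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    hull_area = area(hull)
    return area(inter) / hull_area if hull_area > 0 else 0.0


def best_match(r: Rect, others: List[Rect]) -> float:
    """Best match of r against a set of rectangles (0 if the set is empty)."""
    return max((match(r, o) for o in others), default=0.0)


def evaluate(estimates: List[Rect], targets: List[Rect],
             alpha: float = 0.5) -> Tuple[float, float, float]:
    """Return (precision, recall, f) with f = 1 / (alpha/p + (1 - alpha)/r)."""
    p = (sum(best_match(e, targets) for e in estimates) / len(estimates)
         if estimates else 0.0)
    r = (sum(best_match(t, estimates) for t in targets) / len(targets)
         if targets else 0.0)
    f = 0.0 if p == 0 or r == 0 else 1.0 / (alpha / p + (1 - alpha) / r)
    return p, r, f
```

With `alpha = 0.5` the combined score reduces to the familiar harmonic mean 2pr/(p + r); other values shift the weight between precision and recall.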

All algorithms in Table 1 used these evaluation metrics. The following part explains in detail how the above metrics are used in different studies.

3.2. Public databases

We summarize sources of natural scene text images that can be downloaded from the Internet. Readers can find detailed information and test data on the websites listed in Table 2. Fig. 10 shows a collection of images gathered from several research institutes.

Table 2. Public data sets.

| Data set | Scale | Website | Attribute |
|---|---|---|---|
| ICDAR'03 robust reading and text location | 120.2 MB | [1] | Scene images |
| ICDAR'03 robust word recognition | 39.8 MB | [1] | Ordinary images |
| ICDAR'03 optical character recognition (OCR) | 49.2 MB | [1] | Ordinary images |
| ICDAR'05 | 13.8 MB | [51] | Character images |
| NEOCR (Natural Environment OCR data set) | 1.3 GB | [81] | Scene images |
| KAIST Scene Text Database | 347.41 MB | [3] | Scene images |
| The Chars74K data set | 423.6 MB | [82] | Scene images |
| SVT (The Street View Text data set) | 118 MB | [4] | Scene images |
| MSRA Text Detection 500 Database | 96 MB | [2] | Scene images |
| SVHN (The Street View House Numbers data set) | 4.1 GB | [83] | Scene images |

3.3. Review of public databases and metrics

The method in [20] detects text using a technique called sparsity testing, which is a different way of exploiting the overcomplete and sparse structure in the data. Compared with the methods proposed in [17,18], the algorithm achieves a slightly lower recall rate, but it improves significantly

Fig. 10. Sample images in public datasets: (a) ICDAR [1,51]; (b) KAIST [3]; (c) SVHN [83]; (d) SVT [4]; (e) MSRA [2]; and (f) NEOCR [81].



the precision rate of text detection. The benefits of this method include improved text detection accuracy, better handling of texts of different sizes, and greater robustness to varying imaging conditions. Experiments on the ICDAR 2003 Text Location Contest trial test database give a precision rate of 0.787 and a recall rate of 0.734. Pan et al. [20] used topographic maps and sparse representations to detect text in natural scene images. The proposed method was trained on a database created by the authors, which contains 350 scene images captured with digital cameras. The camera was set to the default automatic settings for focus, flash and contrast gain control. For testing, the authors chose the ICDAR 2003 [1] Text Localization Contest trial test database, which includes 251 images and the ground-truth word bounding boxes of all the target texts in these images. The results of the proposed method on the testing database are shown in Table 1.

Although widely used for text detection algorithms, the ICDAR data set has two major drawbacks. First, most of the text in the ICDAR data set is horizontal. Second, all the text in this data set is in English. These two shortcomings are also pointed out in [46]. In contrast, the KAIST scene text data set comprises 3000 images captured in different environments, including outdoor and indoor scenes under different lighting conditions (clear day, strong artificial lights, etc.). Images were captured with either a high-resolution digital camera or a low-resolution mobile phone camera. All images have been resized to 640 × 480. The KAIST scene text database is categorized according to the language of the scene text captured: Korean, English, and Mixed. Lee et al. [22] presented a novel approach for combining features and relationships within the Conditional Random Field (CRF) framework (Full-CRF).
The authors evaluated the proposed approach on 540 images from the KAIST scene text database and the ICDAR 2003 Robust Reading competition database. These images include normal-environment images and special-case images affected by strong illumination and complex backgrounds. The text regions in these images were manually cropped around their bounding boxes. The evaluation was based on pixel-wise precision, recall and f-measure. Let $T$ be the set of foreground pixels in the ground-truth image and $P$ be the set of foreground pixels in the predicted image. Precision is $p = |P \cap T| / |P|$, recall is $r = |P \cap T| / |T|$, and the f-measure is $f = 2pr/(p + r)$. These measures are computed for each test image, and their averages represent the performance of the method. The precision is 0.841 and the recall is 0.883.

Yao and Bai [50] introduced a new multilingual image data set, the MSRA Text Detection 500 Database [2], with horizontal as well as skewed and slanted texts. Similar to the evaluation method for the PASCAL object detection task [84], true and false positives are determined from the overlap ratio between the estimated minimum-area rectangles and the ground-truth rectangles. Precision and recall are defined as $\text{precision} = |TP| / |E|$ and $\text{recall} = |TP| / |T|$, where $TP$ is the set of true positive detections, and $E$ and $T$ are the sets of estimated rectangles and ground-truth rectangles. The F-measure is defined as $f = 2 \times \text{precision} \times \text{recall} / (\text{precision} + \text{recall})$ [80].

The Street View Text (SVT) data set [4] was harvested from Google Street View. Image text in this data set exhibits high variability and often has low resolution. The NEOCR data set [81] contains 659 real-world images with 5238 annotated bounding boxes (text fields). The data set covers a broad range of characteristics that distinguish real-world images from scanned documents. The images in the SVHN data set [83] are small cropped digits whose backgrounds are less complex than those of the previous data sets.
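The pixel-wise measures used in the evaluation of [22] can be written down directly from their set definitions. The sketch below is an illustration only: representing foreground pixels as (row, column) tuples, returning zero when a set is empty, and the names `pixel_scores` and `mean_scores` are assumptions made here, not details from the cited work.

```python
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int]  # (row, column) of a foreground pixel
Scores = Tuple[float, float, float]  # (precision, recall, f-measure)


def pixel_scores(predicted: Iterable[Pixel], truth: Iterable[Pixel]) -> Scores:
    """Pixel-wise precision, recall and f-measure for a single image."""
    p_set, t_set = set(predicted), set(truth)
    overlap = len(p_set & t_set)  # |P ∩ T|
    precision = overlap / len(p_set) if p_set else 0.0
    recall = overlap / len(t_set) if t_set else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


def mean_scores(per_image: List[Scores]) -> Scores:
    """Average the per-image scores, as done to report overall performance."""
    n = len(per_image)
    if n == 0:
        raise ValueError("per_image must not be empty")
    return (sum(s[0] for s in per_image) / n,
            sum(s[1] for s in per_image) / n,
            sum(s[2] for s in per_image) / n)
```

Note that averaging per-image scores, as described above, generally differs from pooling all pixels of all images into a single precision and recall.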

4. Application

Applications based on scene text extraction have remained popular in recent years. Those that have aroused general interest are briefly listed below:
• Wearable applications: Google Goggles [85] is an image recognition application created by Google Inc. that translates the world around us into text information. The vOICe [86] for Android mobile phones allows blind users to use their autofocus camera phone in combination with a phone screen reader as a portable reader to access print. The Sypole device [87] aims at building a portable, light, convenient and easy-to-use tool for reading textual information to visually impaired people.
• Text extraction in WWW images: Bo Luo [88] first uses a text-based image meta-search engine to retrieve images from the Web, based on the text information on the image host pages, to provide an initial image set.
• Signboard/plate recognition: Automatic recognition of signboards and license plates is essential for automated driving and daily life. Arth and Limberger [89] detected license plates based on the AdaBoost approach presented by Viola and Jones, and segmented the detected plates into individual characters using a region-based approach. Zin and co-workers [90] detected signboards using the main peak finding algorithm based on the HSV color space to remove regions of occlusion and reflection.
• Online electric goods search: Like some shopping applications on Android phones (e.g. Google Shopper), more applications allow customers to first capture the name of a product and then receive related information as feedback.
• Other applications: Extracting text regions from video and outputting characters for video search [91].

As Android phones have become popular at home and abroad and the Internet continues to expand, we believe that more intelligent applications for processing text from the natural scene images around us will emerge.

5. Discussion

We have made a comprehensive survey of text extraction from natural scene images. Many algorithms for localizing text regions, and many methods for enhancing and segmenting them, have been listed. However, no single method satisfies all cases, as scene text varies widely in font, size, illumination, blur, distortion, etc. According to the text information used, we classify the localization and detection methods into five types: edge-based, texture-based, connected component-based (CC-based), stroke-based and other methods. In our assessment, edge-based methods are less efficient, as they usually rely on morphological algorithms or heuristics to extract the text boundaries. Consequently, edge-based methods are sensitive to complex backgrounds and are usually regarded as auxiliary to other algorithms. Texture-based methods and CC-based methods are more efficient at localizing text regions, as they focus on local pixel detail. However, CC-based methods depend heavily on the accuracy of the classification between text and non-text, so text and non-text are easily confused. On the other hand, since the number of segmented candidate components is relatively small, CC-based methods have a lower computation cost, and the located text components can be directly used for



recognition. In contrast, texture-based methods distinguish text from non-text using texture features fed into classifiers. As a result, the precision of texture-based localization depends largely on selecting image features that differentiate text from non-text. However, the speed is relatively slow and the performance is sensitive to the text alignment orientation. The paper [50] appears to solve the problem of sensitivity to text alignment orientation by adding the Stroke Width Transform (SWT) and some carefully designed features that are robust to variations in text. Combined with low-level classification schemes, the algorithm can detect text of arbitrary orientation in complex natural scene images. A single method may therefore fail in some cases, but this can be addressed by combining these methods or by using other, more mature localization methods. For example, SWT (a stroke-based method) is usually combined with other methods, as it is robust to variations in text but is usually costly when used alone.

Camera-captured images suffer from low resolution, blur, and perspective distortion, as well as complex layout and interaction of the content and background. If the text cannot be localized accurately, enhancement and segmentation methods can be used to remedy this. The enhancement and segmentation methods are usually conventional and mature algorithms.

Although many of the algorithms discussed above offer strong solutions to the problems of text detection and localization and of segmentation and enhancement, there are still no complete and efficient systems for users. Given the urgent need for real applications and the difficulties of automatic text extraction from natural scene images, this topic deserves continued attention.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China under Grant nos. 61273217, 61175011 and 61171193, the 111 project under Grant no. B08004, and the Fundamental Research Funds for the Central Universities.

[1] Icdar2003, http://algoval.essex.ac.uk/icdar/Datasets.html, 2003. [2] Msra text detection 500 database (msra-td500), http://www.iapr-tc11.org/ mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500), 2012. [3] Kaist scene text database, http://www.iapr-tc11.org/mediawiki/index.php/ KAIST_Scene_Text_Database, 2011. [4] The street view text dataset, http://vision.ucsd.edu/  kai/svt/, 2011. [5] K. Wang, B. Babenko, S. Belongie, End-to-end scene text recognition, in: 2011 IEEE International Conference on Computer Vision (ICCV), IEEE, 2011, pp. 14571464. [6] F. Einsele-Aasami, Recognition of Ultra Low Resolution, Anti-aliased Text with Small Font Sizes, Ph.D. Thesis, University of Fribourg, Switzerland, 2008. [7] X. Liu, J. Samarabandu, Multiscale edge-based text extraction from complex images, in: 2006 IEEE International Conference on Multimedia and Expo, IEEE, 2006, pp. 17211724. [8] P. Shivakumara, T.Q. Phan, C.L. Tan, A gradient difference based technique for video text detection, in: 2009 10th International Conference on Document Analysis and Recognition, ICDAR'09, IEEE, 2009, pp. 156160. [9] Q. Ye, J. Jiao, J. Huang, H. Yu, Text detection and restoration in natural scene images, Journal of Visual Communication and Image Representation 18 (6) (2007) 504513. [10] P. Shivakumara, W. Huang, C.L. Tan, An efcient edge based technique for text detection in video frames, in: 2008 Eighth IAPR International Workshop on Document Analysis Systems, DAS'08, IEEE, 2008, pp. 307314. [11] X. Liu, J. Samarabandu, An edge-based text region extraction algorithm for indoor mobile robot navigation, in: 2005 IEEE International Conference on Mechatronics and Automation, vol. 2, IEEE, 2005, pp. 701706. [12] S. Lu, C. Tan, The restoration of camera documents through image segmentation, in: Document Analysis Systems VII, 2006, pp. 484495. [13] J. Liang, D. Doermann, H. 
Li, Camera-based analysis of text and documents: a survey, International Journal on Document Analysis and Recognition 7 (2) (2005) 84104. [14] S.A. Angadi, M.M. Kodabagi, A texture based methodology for text region extraction from low resolution natural scene images, in: Advance Computing Conference, 2010, pp. 121128. [15] Anil K. Jainc, Keechul Junga, Kwang In Kimb, Text information extraction in images and video: a survey, Pattern Recognition 37 (5) (2004) 977997. [16] W. Ou, J. Zhu, C. Liu, Text location in natural scene, Journal of Chinese Information Processing (2004). [17] K.C. Kim, H.R. Byun, Y.J. Song, Y.W. Choi, S.Y. Chi, K.K. Kim, Y.K. Chung, Scene text extraction in natural scene images using hierarchical feature combining and verication, in: 2004 Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, IEEE, vol. 2, 2004, pp. 679682. [18] J.S. Kim, S.C. Park, S.H. Kim, Text locating from natural scene images using image intensities, in: 2005 Proceedings of the Eighth International Conference on Document Analysis and Recognition, IEEE, 2005, pp. 655659. [19] D.T. Hu, X.H. He, Effective Algorithm of Text Extraction in Natural Scenes, Ph.D. Thesis, 2008. [20] T.D. Bui, W. Pan, C.Y. Suen, Text detection from natural scene images using topographic maps and sparse representations, in: 2009 IEEE International Conference on Image Processing, 2009. [21] Y.F. Pan, C.L. Liu, X. Hou, Fast scene text localization by learning-based ltering and verication, in: 2010 17th IEEE International Conference on Image Processing (ICIP), IEEE, 2010, pp. 22692272. [22] S.H. Lee, M.S. Cho, K. Jung, J.H. Kim, Scene text extraction with edge constraint and text collinearity, in: 2010 Proceedings of 20th ICPR, pp. 39833986. [23] R. Minetto, N. Thome, M. Cord, J. Fabrizio, B. 
Marcotegui, in: Proceedings of 2010 IEEE 17th International Conference on Image Processing Snoopertext: A Multiresolution System for Text Detection in Complex Detection in Complex Visual Scenes, vol. 1, 2010, pp. 38613864. [24] L. Neumann, J. Matas, A method for text localization and recognition in realworld images, in: Computer VisionACCV, 2010, 2011, pp. 770783. [25] Y. Pan, X. Hou, C. Liu, A hybrid approach to detect and localize texts in natural scene images, IEEE Transactions on Image Processing 20 (2011) 800813. [26] Huizhong Chen, Sam S. Tsai, Georg Schroth, David M. Chen, Radek Grzeszczuk, Bernd Girod, Robust text detection in natural images with edge-enhanced maximally stable extremal regions (2011) 26092612. [27] H.P. Le, N.D. Toan, S.C. Park, G.S. Lee, Text localization in natural scene images by mean-shift clustering and parallel edge feature, in: Proceedings of the 5th International Conference on Ubiquitous Information Management and Communication, ACM, 2011, p. 116. [28] Y.F. Pan, Y. Zhu, J. Sun, S. Naoi, Improving scene text detection by scaleadaptive segmentation and weighted CRF verication, in: 2011 International

6. Conclusion

The problem of text extraction from natural scene images can be divided into the following sub-problems: (i) detection and localization and (ii) segmentation and enhancement. According to the features utilized, text detection and localization methods can be categorized into five types: edge-based, texture-based, CC-based, stroke-based and others. Although the precise location of text in an image can be indicated by bounding boxes, the text still needs to be segmented from the background to facilitate its recognition. This means that the extracted text image has to be converted to a binary image and enhanced before it is fed into an OCR engine. Text extraction is the stage where the text components are segmented from the background. Enhancement of the extracted text components is required because the text region usually has low resolution and is prone to noise. Thereafter, the extracted text images can be transformed into plain text using OCR technology.

This paper classifies and assesses these algorithms. It also offers researchers links to public image databases for assessing algorithms for text extraction from natural scene images. Finally, we list several popular applications of text extraction from natural scene images in recent years. Although there are surveys on licence plate recognition [14] and on text extraction in images and videos [15], no comprehensive survey has been devoted specifically to text extraction from natural scene images. The purpose of this paper is to classify, review and analyze these algorithms, discuss the databases and performance measurement, and point out future work.


H. Zhang et al. / Neurocomputing 122 (2013) 310 323 Conference on Document Analysis and Recognition (ICDAR), IEEE, 2011, pp. 759763. B. Epshtein, E. Ofek, Y. Wexler, Detecting text in natural scenes with stroke width transform, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 29632970. G. Zhou, Y. Liu, Q. Meng, Y. Zhang, Detecting multilingual text in natural scene, in: 2011 1st International Symposium on Access Spaces (ISAS), IEEE, 2011, pp. 116120. Y. Cui, J. Yang, D. Liang, An edge-based approach for sign text extraction, Image Technology 1 (2006). A.K. Jain, B. Yu, Automatic text location in images and video frames, Pattern Recognition 31 (12) (1998) 20552076. N. Ezaki, M. Bulacu, L. Schomaker, Text detection from natural scene images: towards a system for visually impaired persons, in: Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, vol. 2, IEEE, 2004, pp. 683686. S.M. Hanif, L. Prevost, P.A. Negri, A cascade detector for text detection in natural scene images, in: 19th International Conference on Pattern Recognition, ICPR 2008, IEEE, 2008, pp. 14. S.M. Hanif, L. Prevost, Text detection and localization in complex scene images using constrained adaboost algorithm, in: 10th International Conference on Document Analysis and Recognition, 2009, ICDAR'09, IEEE, 2009, pp. 15. R. Ji, P. Xu, H. Yao, Z. Zhang, X. Sun, T. Liu, Directional correlation analysis of local haar binary pattern for text detection, in: 2008 IEEE International Conference on Multimedia and Expo, IEEE, 2008, pp. 885888. T. Saoi, H. Goto, H. Kobayashi, Text detection in color scene images based on unsupervised clustering of multi-channel wavelet features, in: Eighth International Conference on Document Analysis and Recognition (ICDAR'05), vol. 2, 2005, pp. 690694. J. Gllavata, R. Ewerth, B. 


Honggang Zhang received the B.S. degree from the Department of Electrical Engineering, Shandong University, in 1996, and the Master's and Ph.D. degrees from the School of Information Engineering, Beijing University of Posts and Telecommunications (BUPT), in 1999 and 2003 respectively. He was a visiting scholar in the School of Computer Science, Carnegie Mellon University (CMU), from 2007 to 2008. He is currently an associate professor and director of the web search center at BUPT. His research interests include image retrieval, computer vision and pattern recognition. He has published more than 30 papers in venues including TPAMI, SCIENCE, Machine Vision and Applications, AAAI, ICPR and ICIP. He is a senior member of IEEE.

Kaili Zhao is a Ph.D. student in the School of Information and Telecommunication Engineering at Beijing University of Posts and Telecommunications (BUPT). She received her bachelor's degree in Automation from Hefei University of Technology (HFUT). During her undergraduate years, she was awarded the National Encouragement Scholarship and the second level Scholarship each year at HFUT. She was admitted to the Pattern Recognition and Intelligent System lab at BUPT in 2012. She is currently a participant in a National Science Foundation project.

Jun Guo received his Ph.D. from Tohoku-Gakuin University in 1993. He is currently the Vice-President of BUPT, a distinguished professor at Beijing University of Posts and Telecommunications, and the dean of the School of Information and Communication Engineering. His research focuses on pattern recognition, Web searching and network management. He has more than 200 publications in top international journals and conferences, including SCIENCE, IEEE Trans. on PAMI, IEICE Trans., ICPR, ICCV and SIGIR. He has received numerous international and national awards, including three IEEE International Awards, the second prize of Beijing scientific and technological progress, and the second prize of the Ministry of Posts and Telecommunications scientific and technological progress.

Yi-Zhe Song is a Lecturer (Assistant Professor) at the School of Electronic Engineering and Computer Science, Queen Mary, University of London. He received both the B.Sc. and Ph.D. degrees in Computer Science from the Department of Computer Science, University of Bath, UK, in 2003 and 2009 respectively; prior to his doctoral studies, he obtained a Diploma (M.Sc.) degree in Computer Science from the Computer Laboratory, University of Cambridge, UK, in 2004. After his Ph.D., he continued in the same department as a research and teaching fellow. His major research interests include computer vision, computer graphics, pattern recognition, machine learning and multimedia sensor networks. He has published over 20 papers in highly ranked international journals such as ACM TOG, IEEE TVCG, IEEE TIP and CVIU, and at leading international conferences such as ECCV, SIGGRAPH ASIA, BMVC and ISMAR. He co-founded the UK Student Vision Workshop in 2009 and has previously obtained several honours and awards, such as the Best Dissertation Award from the University of Cambridge. He is currently a member of IEEE (Institute of Electrical and Electronics Engineers), SPIE (International Society for Optics and Photonics) and BMVA (British Machine Vision Association).