An Approach For Segmentation Handwritten Kannada Document

IPASJ International Journal of Electronics & Communication (IIJEC)
Web Site: http://www.ipasj.org/IIJEC/IIJEC.htm

Email: editoriijec@ipasj.org
ISSN 2321-5984
A Publisher for Research Motivation........
Volume 3, Issue 5, May 2015
An Approach for Segmentation Handwritten Kannada Document
Nagaveni M N1, Pooja A P2
1 M.Tech student, Signal Processing, Vidyavardhaka College of Engineering, Mysore
2 Assistant Professor, Dept. of ECE, Vidyavardhaka College of Engineering, Mysore
ABSTRACT
Handwritten text line segmentation is considered a major challenge in document image analysis. It must overcome
obstacles that are uncommon in modern printed text. Among the most significant are skewed lines, curvilinear lines,
touching and overlapping components (usually words or letters) between lines, and irregularity in the geometrical
properties of the lines. Text line extraction from unconstrained handwritten documents is a challenge because the text
lines are often skewed and the space between lines is not obvious. The same complexity is involved in segmenting
handwritten documents in Indian languages such as Telugu, Tamil and Malayalam. Curved and non-parallel text lines in
handwritten documents also make segmentation and recognition challenging. This paper considers text line segmentation
of handwritten documents written in Kannada script using Independent Component Analysis (ICA).
Keywords: Projection profile, ICA, Handwritten documents, text line and word segmentation, Matlab 2008.
1. INTRODUCTION
The segmentation of text lines in both printed and handwritten documents is sometimes supported by numerous heuristics
and assumptions. While form-based documents with multiple text lines can usually be separated into text lines easily,
poorly written documents with varying text line slope and overlapping ascenders or descenders are difficult to segment
into text lines. Handwritten document analysis may follow many paths. However, separation into text lines, then words
and, in certain cases, characters is required in order to extract recognition features such as word length or word
moments. Recognition can be word based or character based. Once the block of text is separated into text lines, word
segmentation is considerably easier.
Text line segmentation is usually combined with skew detection, so the literature on skew detection techniques was
also surveyed. The most popular methods in the literature for skew detection, and hence text line segmentation, are
based on the projection profile technique, connected component analysis, Hough transforms and Fourier transforms.
The method presented in this paper uses the projection profile of the text lines for initialization.
In this paper, we propose an ICA segmentation algorithm, which can be used effectively on Kannada handwritten
document images containing many different kinds of overlapping and touching words in adjacent lines.
2. LITERATURE SURVEY
This section gives a brief review of recent work on text line and word segmentation in handwritten documents. To the
best of our knowledge, the following methods either achieved the best results on their respective test datasets or are
components of integrated systems for particular tasks.
C. Weliwitage, A.L. Harvey and A.B. Jennings [1] used the Cut Text Minimization (CTM) method. It is based on a cost
function that detects segmentation lines between text rows, and it works efficiently on poorly written documents with
varying text line slope and overlapping ascenders or descenders. Its drawback is that the number of text lines can be
detected incorrectly in the initial run of the vertical projection histogram of the image.
S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri and D.K. Basu [2] used a connected component labeling method. The method
labels all line spacings in the document irrespective of their degree of uniformity and then separately identifies all
unlabeled stripes left after the line spacings are labeled. It fails in cases of larger word spacing and misalignment
of consecutive words, and where a single text line is divided into two or more parts.
Florence Luthy, Tamas Varga and Horst Bunke [3] used a Hidden Markov Model based offline handwritten text line
segmentation approach. Segmentation was treated as a text line recognition task, adapted to the characteristics of
segmentation. Over-segmentation and under-segmentation errors are the two drawbacks observed here.
Fei Yin and Cheng-Lin Liu [4] used a Variational Bayes method. Here the document image is viewed as a distribution of
pixels: each text line is modeled as a bivariate Gaussian distribution and the document is a mixture of Gaussians. The
method can split components, eliminate redundant components and selectively control the orientation of the text
lines. It fails to detect segmentation errors and merge errors.
A. Nicolaou and B. Gatos [5] used the shredding method. This technique segments handwritten document images into text
lines by shredding their surface with local minima tracers. Its drawbacks arise from the occurrence of overlapping and
touching components and from variations in letter size.
Mamatha H R and Srikantamurthy K [6] proposed morphology-based handwritten line segmentation using projection
profiles. Morphological operations are used to remove disconnected components and to construct bridges between
components. Failures in line segmentation were due to the consonant conjuncts that appear below the base consonant,
which result in a false white space in the horizontal projection, and to the overlapping of the consonant conjuncts
of one line with the vowel modifiers.
3. METHODOLOGY
In this section, the proposed methodology is described. A typical handwriting recognition system consists of preprocessing and segmentation stages. The schematic diagram of the proposed method is shown in Figure 1.
Figure 1 Block Diagram

3.1 Input
The aim of the Document Recognition System is to process the image of a scanned document page containing
characters and render the information in a suitable form for modification and manipulation. It is a process of
converting scanned images into the original text. Therefore, unconstrained input is considered.
3.2 Pre-processing
Pre-processing of the scanned image prepares it for the next stage. It increases the accuracy of the recognition
algorithms by enhancing some features and eliminating some inconsistencies. The raw input of the digitizer typically
contains noise due to erratic hand movements and inaccuracies in digitization of the actual input. Original documents
are often dirty due to smearing and smudging of text and aging. In some cases, documents are of very poor quality due
to seeping of ink from the other side of the page and general degradation of the paper and ink. Pre-processing is
concerned mainly with the reduction of these kinds of noise and variability in the input. The number and type of
pre-processing algorithms employed on the scanned image depend on many factors, such as paper quality, resolution of
the scanned image, the amount of skew in the image and the layout of the text. Pre-processing operates at the
image-to-image transformation level. It is the process of compensating for a poor-quality original and/or
poor-quality scanning. Pre-processing operations performed prior to recognition are: Thresholding, Skeletonization,
Line Segmentation, Character Segmentation, Slant removal, and Normalization. The image is then ready for segmentation.
In this work, RGB to gray conversion and skew removal using the projection profile technique are performed at this stage.
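The two pre-processing steps named above can be illustrated with a minimal sketch. It is written in Python with NumPy purely for illustration and is not the implementation used in this work. The function names, the luminance weights, the candidate angle range and the use of profile variance as the score are all assumptions; the text does not specify these details.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Luminance-weighted gray conversion (weights are an assumed, common choice)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb[..., :3].astype(float) @ weights

def estimate_skew(binary, angles=np.arange(-10.0, 10.5, 0.5)):
    """Projection-profile skew estimate for an image whose text pixels equal 1.

    For each candidate angle, text pixels are rotated and projected onto the
    vertical axis. The angle giving the highest-variance profile (sharpest
    peaks and valleys) is returned, in degrees.
    """
    ys, xs = np.nonzero(binary)
    best_angle, best_score = 0.0, -1.0
    for angle in angles:
        theta = np.deg2rad(angle)
        # Row coordinate of every text pixel after rotating by `angle`.
        rows = np.round(xs * np.sin(theta) + ys * np.cos(theta)).astype(int)
        profile = np.bincount(rows - rows.min())  # horizontal projection profile
        score = profile.var()
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle
```

The returned value is the rotation that best aligns the text rows with the horizontal axis under this coordinate convention, so a page whose lines slope by +3 degrees yields an estimate near -3 degrees.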
3.3 Segmentation
Text line segmentation is a very important task for Kannada handwritten documents and a crucial step in the overall
segmentation process. The main characteristics of the Kannada script point to the main difficulties in segmenting it.
Kannada is a popular script and the official language of the southern Indian state of Karnataka. It is a Dravidian
language used mainly by the people of Karnataka, Andhra Pradesh, Tamil Nadu and Maharashtra. Kannada is spoken by about
44 million people. The language has 47 characters in its alphabet set.
Kannada handwritten text line segmentation is considered a major challenge in document image analysis. It must
overcome obstacles that are uncommon in modern printed text. Among the most significant are skewed lines, curvilinear
lines, touching and overlapping components (usually words or letters) between lines, and irregularity in the
geometrical properties of the lines.
Text line segmentation is done by using Independent Component Analysis (ICA) [7].
The ICA model for the observed random vector y is estimated as
$y = B s$
where y is the vector of N observed linear mixtures, s is the vector of M independent components si, and B is the
N-by-M mixing matrix. This definition of ICA is the simplest and the most widely used in research on ICA.
Many other definitions of ICA can be found in the literature [8, 9].
Three criteria must be satisfied to use ICA:
a. All the independent components si, with the possible exception of one component, must be non-Gaussian.
b. The number of observed linear mixtures N must be at least as large as the number of independent components M,
i.e., N >= M.
c. The matrix B must be of full column rank.
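As a rough illustration of these criteria, the sketch below checks conditions (b) and (c) for a given mixing matrix and computes a simple non-Gaussianity indicator for condition (a). It assumes the usual linear mixing model, in which the N observed mixtures are obtained by applying an N-by-M matrix B to the M independent components. The function names and the choice of excess kurtosis as the indicator are illustrative assumptions, not part of the method in [7].

```python
import numpy as np

def check_mixing_matrix(B):
    """Return (N >= M, B has full column rank) for an N-by-M mixing matrix B."""
    B = np.asarray(B, dtype=float)
    n_mixtures, n_components = B.shape
    enough_mixtures = n_mixtures >= n_components                      # criterion (b)
    full_column_rank = np.linalg.matrix_rank(B) == n_components       # criterion (c)
    return bool(enough_mixtures), bool(full_column_rank)

def excess_kurtosis(samples):
    """Sample excess kurtosis; values far from 0 suggest a non-Gaussian source."""
    samples = np.asarray(samples, dtype=float)
    centered = samples - samples.mean()
    return (centered ** 4).mean() / (centered ** 2).mean() ** 2 - 3.0
```

For example, a 3-by-2 matrix with linearly independent columns passes both checks, whereas a 2-by-2 matrix whose second column is a multiple of the first fails the rank check.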
Vector matrix (VM) is a source signal for ICA to isolate the words. VM is a 3-bymatrix, where r is the row
and c is the column of the overlapping words. The elements of each row are summed and the sums are saved in a column
matrix. Each element of this sum matrix is then subtracted from the number of columns, and the resulting segmented
matrix is stored.
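The row-sum and subtraction steps described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work. It assumes a binary image in which background pixels are 1 and text pixels are 0, so that the number of columns minus a row sum equals the number of text pixels in that row. The helper `text_bands`, which groups consecutive rows containing text, is a hypothetical addition for illustration only.

```python
import numpy as np

def row_text_counts(binary):
    """Sum each row into a column matrix, then subtract it from the column count."""
    binary = np.asarray(binary)
    n_cols = binary.shape[1]
    row_sums = binary.sum(axis=1, keepdims=True)  # column matrix of row sums
    return n_cols - row_sums                      # text pixels per row (assumed polarity)

def text_bands(counts):
    """Hypothetical helper: (start, end) row ranges, end exclusive, that contain text."""
    has_text = np.asarray(counts).ravel() > 0
    padded = np.concatenate(([False], has_text, [False]))
    # Indices where the padded mask switches between text and no-text.
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(changes[::2].tolist(), changes[1::2].tolist()))
```

Rows whose count is zero contain no text under the assumed polarity, so they mark candidate gaps between text regions.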
4. EXPERIMENTAL RESULTS
4.1 Input
The input contains some overlapping and touching components in contiguous lines, which are to be separated, as shown
in Figure 2.
Figure 2 Input image

4.2 Deskewed image
The input image is a skewed image. In this step the image is deskewed by removing the skew angle.
The deskewed image is shown in Figure 3.
Figure 3 Deskewed image

4.3 Segmented image
After the skew removal step, the image is ready for segmentation. The segmentation is done by using the ICA
algorithm. The segmented image is shown in Figure 4.
Figure 4 Segmented images
5. CONCLUSION
Handwritten documents present several difficulties, among them overlapping and touching components and ascenders and
descenders in words. The proposed ICA algorithm overcomes these problems. It performs well in separating overlapping
words and touching components, including loops in ascenders, descenders and upper case letters in adjacent lines.
REFERENCES
[1]. C. Weliwitage, A.L. Harvey, A.B. Jennings, Handwritten Document Offline Text Line Segmentation, Proceedings of
the Digital Imaging Computing: Techniques and Applications (DICTA), IEEE, 2005.
[2]. S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri, D.K. Basu, Text line extraction from multi-skewed handwritten
documents, Elsevier, 2006.
[3]. Florence Luthy, Tamas Varga and Horst Bunke, Using Hidden Markov Models as a Tool for Handwritten Text
Line Segmentation, 9th International Conference on Document Analysis and Recognition (ICDAR), 2007.
[4]. Fei Yin, Cheng-Lin Liu, A Variational Bayes Method for Handwritten Text Line Segmentation, 10th
International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2009.
[5]. A. Nicolaou and B. Gatos, Handwritten Text Line Segmentation by Shredding Text into its Lines, 10th
International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2009.
[6]. Mamatha H R, Srikantamurthy K, Morphological Operations and Projection Profiles based Segmentation of
Handwritten Kannada Document, International Journal of Applied Information Systems (IJAIS), 2012.
[7]. Yan Chen and Graham Leedham, Independent Component Analysis Segmentation Algorithm, Eighth
International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2009.
[8]. P. Comon, Independent component analysis - a new concept, Signal Processing, 36, 1994, pp. 287-314.
[9]. C. Jutten, J. Herault, Blind separation of sources, Part I: An adaptive algorithm based on
neuromimetic architecture, Signal Processing, 24, 1991, pp. 1-10.