Fig. 1. Structure of word recognition system.

Fig. 2. Preprocessing example: (a) a raw image of word `recognition', and (b) the preprocessed version of it.
matchable words. This set consists of words from the system's lexicon that are similar in shape or structure to the input word (e.g., the words `imaginative', `immigration', and `imagination' are similar based on coarse shape). The set of matchable words forms the reduced lexicon; it is determined by generating all possible letter strings that can be derived from the string of primitives using a set of rules mapping the composition of stroke primitives into English characters. An important consequence of having a reduced lexicon is in limiting the amount of computation required during the string matching (postprocessing) stage (see below).

The Recognition module (section IV) uses a representation of the input that preserves the sequential nature of the cursive data and justifies the use of a network architecture similar to the Time-Delay Neural Network (TDNN). A further advantage of such a representation scheme is that stroke absences (from unintentional pen lifts) and accidental intersections (i.e., overlapping or touching characters), which significantly alter the topological (static) pattern of the word, have little or no influence on the "dynamic" pattern of it. The neural network-based recognizer is trained to classify the signal within its fixed-size input window as this window sequentially scans the input word representation, thus bypassing a potentially erroneous segmentation procedure. By training and recognizing char[...]

II. PREPROCESSING MODULE

[...] (ii) enforces even spacing between points, and (iii) per[...]ations). Normalization of different writing orientations, writing slant, and writing sizes is also essential in order to reduce writer-dependent variability. A normalization algorithm is employed for this purpose, based on the work of [3], [19]. In Fig. 2 a raw image of the word `recognition' is shown with the output produced by the preprocessing module on it.

III. FILTERING MODULE

In this section we briefly describe how the task of filtering/reducing the lexicon is achieved. More details can be found in [18]; an extended paper is under preparation.

The principal idea underlying the filtering module is that the visual configuration of a word written in cursive script can be captured by a stroke description string. The vocabulary of the description string corresponds to the different types of downward strokes (i.e., portions of input between pairs of consecutive y-maxima and y-minima in the pen trace) made by a writer in writing the word. Downward strokes constitute a simple but robust cue that contributes significantly to letter identification; furthermore, they allow for a compact description of the overall shape of a word without having to consider its internal details. This idea has provided formal grounding for the notion of visually similar neighborhood (VSN); in Fig. 3 our preprocessed image of word `recognition' is shown with its extracted downward strokes and a description of its shape as captured by the string that results from concatenating them. The visually similar words described by this string are also shown.

Fig. 3. Filtering example: (a) a preprocessed image of word `recognition' with base-line, half-line and extracted downward strokes; (b) the coarse representation of the word-shape provided by the string of concatenated stroke primitives; and (c) the set of matchable words derived from this string with a 21K lexicon (e.g., `emigration', `migration', `inauguration', `reunification', `imaginative', `immigration', `originators', `composition').

The stroke description scheme identifies 9 different types of strokes, some of which capture spatio-temporal information such as retrograde motion of the pen, and others which simply capture length and position relative to the reference lines. In the example of Fig. 3 there are three different strokes, namely Ascender (A), Median (M), and Descender (D). The stroke descriptions for the letters `r', `e', `c', `o', `g', `n', `i', and `t' would then be "M", "M", "M", "M", "MD", "MM", "M", and "A" respectively. The stroke description for the word `recognition' would simply be the concatenation "MMMMMDMMMAMMMM".
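The description-string idea can be sketched in code. The following is an illustrative simplification, not the authors' implementation: the paper's scheme distinguishes 9 stroke types, while this sketch keeps only the Ascender/Median/Descender labels of the example, assumes y grows upward, and uses function names of our own choosing.

```python
from typing import Dict, List, Tuple

def downward_strokes(ys: List[float]) -> List[Tuple[int, int]]:
    """Index pairs (start, end) of trace portions between consecutive
    local y-maxima and y-minima, i.e., where the pen moves downward."""
    strokes, start = [], None
    for i in range(1, len(ys)):
        if ys[i] < ys[i - 1]:            # still moving down
            if start is None:
                start = i - 1            # local y-maximum found
        elif start is not None:          # local y-minimum reached
            strokes.append((start, i - 1))
            start = None
    if start is not None:                # trace ended mid-stroke
        strokes.append((start, len(ys) - 1))
    return strokes

def describe(ys: List[float], base_line: float, half_line: float) -> str:
    """Label each downward stroke by its extent relative to the
    reference lines and concatenate the labels."""
    labels = []
    for a, b in downward_strokes(ys):
        top, bottom = ys[a], ys[b]
        if top > half_line:
            labels.append('A')           # Ascender: starts above half-line
        elif bottom < base_line:
            labels.append('D')           # Descender: ends below base-line
        else:
            labels.append('M')           # Median: within the middle zone
    return ''.join(labels)

def reduced_lexicon(desc: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Matchable words: lexicon entries indexed by the same string."""
    return lexicon.get(desc, [])
```

With the lexicon pre-indexed by description string, the set of matchable words is obtained in a single lookup, which is what limits the later string-matching cost.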
IV. RECOGNITION MODULE

The task of the recognition module is accomplished in four steps. The first is the encoding of the pen trajectory as a sequence of frames F(t) (a frame denotes one discrete time step's worth of data, i.e., features). In the second step, a TDNN-style network operates on a window of frames (comprising a character and parts of its neighbors) and produces an output at every time interval. In the third step, a postprocessor interprets this output sequence to generate a letter sequence (interpretation string). Finally, in the fourth step, a string-to-string similarity function is used to match the interpretation string(s) with the reduced lexicon produced by the filtering module. Each of these steps is now described.

A. Trajectory Representation

On-line data represents text as a sequence of points {P(t) = (X(t), Y(t), Z(t))}, where X, Y are the coordinates of the pen tip and Z indicates pen-up/pen-down information. All relevant dynamic information about handwriting can theoretically be inferred from this sequence, but this data is too unconstrained; more efficient methods of encoding it must be employed. We choose mainly to encode information pertaining to local direction and curvature in the pen trajectory, and rely on the neural network-based recognizer for the selection of more complex features relevant to performing the classification task.

Two parameters are used to encode direction: (i) sin θy(t), the sine of the angle between each segment P(t−1)P(t) of the trajectory and the Y-axis, and (ii) sin θx(t), the sine of the angle between P(t−1)P(t) and the X-axis. By restricting θx(t) and θy(t) to vary between −π/2 and +π/2 we make the parameters unambiguous: a negative value of sin θy(t) indicates that point P(t) is before point P(t−1) (i.e., a backward pen movement was made in going from P(t−1) to P(t)), and a positive value indicates that point P(t) is after point P(t−1) (i.e., a forward pen movement was made). Similarly, the sign of sin θx(t) indicates whether point P(t) is above or below point P(t−1) (i.e., whether an upward or a downward pen movement was made). Although the values of θy(t) and θx(t) could have been used directly, the sine function makes them easier to compute, conveniently bounds them between −1 and +1, and provides us with some quantization effect.

We also find the location of the points in the trajectory at which sharp changes in the direction of movement (i.e., cusps) take place. A very simple measurement of "local" curvature can be obtained by calculating the change between two consecutive directional angles. Guyon et al. [8] suggest that the angle θ(t) = θx(t+1) − θx(t−1) be represented by its sine and cosine values. However, we found that the values of cos θ(t) behave more smoothly than those of sin θ(t): for small values of θ(t) (i.e., little change in direction) cos θ(t) remains flat at the high value of +1, whereas sin θ(t) oscillates around zero. We choose cos θ(t) as our only curvature descriptor; it goes down to −1 for sharp cusps (independent of their orientation) and down to around 0 for smoother turns.

A.1. Time Frames. Given a sequence {(X(t), Y(t), Z(t))} of on-line data, we define a time frame F(t) to be the 4-dimensional feature vector (sin θx(t), sin θy(t), cos θ(t), zone(t)), where the first three elements have already been described above. The fourth element, zone(t), is introduced to help distinguish pairs such as `e'-`l', which have similar temporal representations in terms of direction and curvature alone. These pairs can be more easily differentiated by encoding their corresponding Y(t) values into the previously determined zones: the middle zone (between the base-line and the half-line), the ascender zone (above the half-line) and the descender zone (below the base-line).

For a point P(t) = (X(t), Y(t)) falling within the middle zone we make zone(t) = 0; otherwise, we have 0 < zone(t) ≤ 1.0 if the point falls within the ascender zone, and −1.0 ≤ zone(t) < 0 if the point falls within the descender zone. Specifically, the zone(t) parameter is computed by passing the vertical distance (dist) between point P(t) and the half-line (or base-line) through a thresholding function:

    zone(t) = f(10.0 · dist / body_hght − 5.0)

where f(x) is the sigmoid function and body_hght corresponds to the distance between the base-line and the half-line, so that when point P(t) is further away than body_hght from the half-line (or base-line), zone(t) is 1.0 (or −1.0). This coding scheme appears robust against writing distortions where ascenders/descenders are made atypically large or where medium-size letters do not fully fall within the reference lines.

A.2. Varying duration and scaling. Since we are dealing with unsegmented words, a constant number of frames per letter across a word or across a set of samples cannot be guaranteed (i.e., duration varies). To reduce such variability in letter length, the size normalization step of the preprocessing module uses the ratio H/MLH as scale factor; MLH (median letter height) is an estimate of the height of small letters (i.e., those that fall between the base-line and the half-line), and H is the normalization height (currently set at about 3 mm). Because the distance between points is kept constant, this procedure effectively minimizes time distortions of letters.
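A minimal sketch of this frame encoding follows. The helper names are ours; y is assumed to grow upward, f is taken to be the logistic sigmoid, and the curvature entry at the two trace endpoints is left at the neutral value +1, since θ(t) needs both a preceding and a following segment.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def zone(y: float, base_line: float, half_line: float) -> float:
    """Soft zone membership: 0 in the middle zone, approaching +1 deep
    in the ascender zone and -1 deep in the descender zone."""
    body_hght = half_line - base_line
    if base_line <= y <= half_line:
        return 0.0
    if y > half_line:                      # ascender zone
        dist = y - half_line
        return sigmoid(10.0 * dist / body_hght - 5.0)
    dist = base_line - y                   # descender zone
    return -sigmoid(10.0 * dist / body_hght - 5.0)

def frames(points, base_line, half_line):
    """Encode each pen segment P(t-1)->P(t) as the feature vector
    [sin_theta_x, sin_theta_y, cos_dtheta, zone]."""
    out, angles = [], []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy) or 1.0  # guard repeated points
        sin_x = dy / length                 # sign: upward vs. downward
        sin_y = dx / length                 # sign: forward vs. backward
        angles.append(math.atan2(dy, dx))
        out.append([sin_x, sin_y, 1.0, zone(y1, base_line, half_line)])
    # curvature: cosine of the change between surrounding direction angles
    for t in range(1, len(angles) - 1):
        out[t][2] = math.cos(angles[t + 1] - angles[t - 1])
    return out
```

On a straight horizontal trace every frame comes out as (0, 1, 1, 0): no vertical motion, steady forward motion, no turning, middle zone.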
B. Neural Network Recognizer

The Time-Delay Neural Network (TDNN) is a multilayer feedforward architecture, first developed for speech recognition, whose hidden units are replicated across time; it has been successful in learning to recognize time sequences [20]. TDNNs are trained with a modified back-propagation (BP) algorithm [14] and are usually less difficult to train than (although sometimes outperformed by) recurrent networks [2]. BP networks have proven to be very competitive with classical pattern recognition methods, especially for problems requiring complex decision boundaries [10]. The ability of BP networks to deal directly with large amounts of low-level information rather than higher-order (more elaborate) feature vectors has also been demonstrated in different applications.

The architecture of our three-layer TDNN-style network is inspired by that of Waibel et al. [20] for phoneme recognition and that of Guyon et al. [8] for uppercase handprinted letter recognition. Its overall structure is shown in Fig. 4. The choice of L = 96 frames as the length of the input window to the network (the network's receptive field) is related to H, the normalization height. H is selected as small as possible so as to minimize the convolution time needed to do full word recognition. Having H available, L is selected so that L frames are enough to represent a character and, in most cases, include part of the characters on each side of it for contextual information. The length of the two hidden layers is then determined using an undersampling factor of 3, a technique that allows one to reduce the size of the network [11]. This leads to the notion of a pyramidal structure with decreasing levels of detail. To compensate for the loss of resolution associated with undersampling, a commonly used approach is to increase the number of hidden units as one moves up the network pyramid.

Connections are arranged so that each hidden unit has a receptive field that is limited along the time domain. In the first hidden layer there are 15 units replicated 30 times (i.e., weights are shared), each receiving input from 9 consecutive frames in the input layer. The choice of 9 as the width of the receptive field of these units reflects the goal of detecting features with short duration at this level, but also long enough for it to be meaningful (e.g., a cusp). In the second hidden layer, there are 20 units replicated 9 times, each looking at a 15×6 window of activity levels in the first hidden layer. These units receive information spanning a larger time interval from the input, and hence are expected to detect more complex and global features (i.e., longer in duration). Finally, the output layer has 26 units (one for each of the English letters) fully connected to the second hidden layer.

[Fig. 4, showing the overall network structure, appears here.]

Weight-sharing is a general paradigm that allows us to build reduced-size networks; this in turn is an effective way of increasing the likelihood of correct generalization [11]. Weight sharing also enables the development of shift-invariant feature detectors [14] by constraining units to learn the same pattern of weights as their neighboring ones do. This corresponds to the intuition that if a particular feature detector is useful on one part of the sequence, it is likely to be useful on other parts of the sequence as well.

C. Neural Network Simulation

We choose the activation range of our neurons to be between −1 and +1, with the following computationally efficient activation function [4]:

    f(u) = u / (1 + |u|),  with derivative  f'(u) = 1 / (1 + |u|)² + offset

where |u| stands for the absolute value of the weighted sum and offset is a constant suggested by Fahlman [5] to kill flat spots. Weights are initialized with random numbers uniformly distributed between −0.1 and +0.1. A single bias unit is used by all weight-shared units that are controlled by the same weight kernel, as opposed to an independent bias for each replicated unit, in order to develop truly invariant feature detectors.
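The activation function above, together with the layer-size arithmetic implied by the architecture description, can be checked with a short sketch. The offset default below is a placeholder of ours; the text only says it is a constant.

```python
def elliott(u: float) -> float:
    """f(u) = u / (1 + |u|): fast sigmoid-like activation in (-1, +1)."""
    return u / (1.0 + abs(u))

def elliott_deriv(u: float, offset: float = 0.1) -> float:
    """f'(u) = 1 / (1 + |u|)^2 plus a small constant offset
    (Fahlman's trick) so learning does not stall on flat spots."""
    return 1.0 / (1.0 + abs(u)) ** 2 + offset

def conv_positions(n_in: int, width: int, stride: int) -> int:
    """Number of receptive-field positions of a 1-D shared-weight layer."""
    return (n_in - width) // stride + 1

# 96 input frames, width-9 fields, undersampling 3 -> 30 positions;
# width-6 fields over those 30 positions -> 9, matching the replication
# counts (30 and 9) quoted in the text.
```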
[...] data collection effort where paid donors were asked to write full sentences presented acoustically to them. [...] (roughly where one character ends and the next one begins). This information was then used to pair each frame F(t), in the dynamic representation of the word, with an output vector. The goal was to generate a target signal that ramps up about halfway through the character and then quickly backs down afterwards, in such a way that the network learns to recognize a character whenever the center of the character is in the center of the network's receptive field. For each word in the training data set, a target signal that ramps up at 30% of each character's length, reaches its maximum [...]

[...]ing the input window of the network across the frame sequence {F(t)}, thus generating activation traces O_l(t) at the output of the network, where O_l(t) corresponds to the network's confidence in recognizing a letter l at time t. These output traces are subsequently examined to determine the ASCII string(s) best representing the word image. In order to convert this output trace signal into letters, the sizes and widths of all activation "peaks" for every output unit are determined. Fig. 5 shows the output activation traces O_l(t), for all 26 output nodes, generated by the network when presented with our preprocessed image of word `recognition'. Eleven significant activation peaks are visible, each one corresponding to a letter detected in the input image.

Fig. 5. Recognition example: (a) plot of network output traces when presented with the preprocessed image of word `recognition', (b) corresponding detected peaks and generated interpretation string (`ne ognit n'), and (c) final recognition result after matching the interpretation string with the reduced lexicon provided by the filtering module.

The sizes of all activation peaks are computed by scanning the output activation traces, from left to right, looking for activation levels that exceed a given threshold. When the activation value of an output unit exceeds the threshold (currently set at −0.8), a summing process begins for that unit, which ends when its activation value falls below the threshold. Activation peaks with a maximum value below −0.2 are ignored (i.e., they are not considered sufficiently confident). In order to compensate for smaller letters, which are shorter in the temporal domain, we normalize the size of a peak by its expected size [9]. The expected peak size for a given letter is given by the average size of all the peaks in the training signal for that letter. The set {P_i} of all selected activation peaks is then ordered based on the beginning time of each peak P_i. A directed interpretation graph is subsequently constructed from this set as follows: there is a node N_i in the graph for every activation peak P_i, and there is an edge between nodes N_i and N_j (i < j) if peaks P_i and P_j are adjacent and their widths do not overlap; otherwise, nodes N_i and N_j will lie on parallel paths of the graph. Word hypotheses are generated by traversing all possible paths in the graph from the root to all the "leaves". The confidence of a word hypothesis is set using the average of the nodes' normalized sizes in the corresponding path.

D.1. Delayed Strokes and Missing Peaks. Diacritical marks such as dots on letters `i' and `j', and horizontal bars on letter `t' (and sometimes also the `x' slash) are often written after the whole word was written. These delayed strokes constitute an exception to our "dynamic" representation scheme of cursive handwriting because they violate the (strict) time-order of the letter patterns. Because diacritical marks are many times missing or badly positioned in an image, we decided that they should be used as "confidence boosters" and not as required features for letter identification. That is, the recognizer should be able to hypothesize the presence of a letter `i', `j' or `t' in the input script even if the diacritical mark is missing. The existence of a diacritical mark is then simply used to confirm the hypothesis or resolve any ambiguity (say between `i' and `e' or between `t' and `l'). Diacritical marks are thus detected and removed from the image prior to recognition; a peak for the letter `i', `j', `t' or `x' in the output activation traces is then said to be "influenced" by a diacritical mark if some of the corresponding frames in the input trajectory are covered (in a horizontal sense) by the diacritical mark. The confidence of influenced peaks is then boosted by an amount proportional to its current value. In Fig. 5 influenced peaks are indicated with a `[...]'.

Sometimes it is also possible for the peak parsing routine to "hint" that a character is missing in the output interpretation string. A missing character in the output interpretation string is usually the result of poor handwriting and a corresponding low-activation peak which is discarded because of low confidence during the peak identification process. This situation often results in a large "no-response" time interval in the output activation traces; that is, a period of time for which no O_l(t) is active. To detect these cases we have computed the expected inter-peak gap, from our training data set, for every pair of characters. Then, during the traversal of the interpretation graph, if the time-gap between two adjacent activation peaks is larger than its expected value, a special symbol (`[...]') is output to indicate that a character is probably missing. When matching an interpretation string containing this symbol with a lexicon entry, any character is allowed to match it with a small penalty.
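The peak-scanning step described above can be sketched as follows. The two thresholds are the ones quoted in the text; treating a peak's size as the activation accumulated above the scanning threshold is our assumption, since the text only speaks of a "summing process".

```python
def find_peaks(trace, start_thresh=-0.8, keep_thresh=-0.2):
    """Scan one output unit's activation trace from left to right.
    A peak runs while activation stays above start_thresh; peaks whose
    maximum never reaches keep_thresh are discarded as unconfident."""
    peaks, t = [], 0
    while t < len(trace):
        if trace[t] > start_thresh:
            begin, size, peak_max = t, 0.0, -1.0
            while t < len(trace) and trace[t] > start_thresh:
                size += trace[t] - start_thresh   # "summing process"
                peak_max = max(peak_max, trace[t])
                t += 1
            if peak_max >= keep_thresh:
                peaks.append({"begin": begin, "end": t - 1, "size": size})
        else:
            t += 1
    return peaks
```

The resulting peaks, ordered by their `begin` times, are the nodes of the interpretation graph; each size would then be divided by the letter's expected peak size before scoring word hypotheses.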
E. String Matching

In order to validate the output interpretation strings produced by the recognizer, we need to look them up in the list of words (reduced lexicon) which is provided by the filtering module. Since there may be errors in the interpretation string(s) s, a similarity metric is needed to determine the likelihood that a word w in the reduced lexicon is the "true" value of s. For this purpose we extended the Damerau-Levenshtein metric (see [17] for details) to more accurately compensate for the types of errors that are present in the script recognition domain. The extended metric quantifies the minimum cost of transforming each word hypothesis into every word in the reduced lexicon using edit operations, namely letter substitution, insertion, and deletion, and three newly created operations: merge, split, and pair-substitution. Different cost weights were assigned to the individual operations, reflecting the fact that some errors are more likely to occur than others.
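The extended metric can be sketched as a forward dynamic program over the two strings. The cost weights below are placeholders of ours; the text says only that weights reflect error likelihoods, without quoting values.

```python
def extended_edit_distance(s: str, w: str, sub=1.0, ins=1.0, dele=1.0,
                           merge=1.2, split=1.2, pair_sub=1.5):
    """Minimum cost of transforming interpretation string s into lexicon
    word w.  Besides substitution/insertion/deletion, allow merge (two
    letters of s read as one of w), split (one letter of s read as two
    of w) and pair-substitution (two-for-two)."""
    n, m = len(s), len(w)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if d[i][j] == INF:
                continue
            if i < n:                                   # delete s[i]
                d[i + 1][j] = min(d[i + 1][j], d[i][j] + dele)
            if j < m:                                   # insert w[j]
                d[i][j + 1] = min(d[i][j + 1], d[i][j] + ins)
            if i < n and j < m:                         # match / substitute
                cost = 0.0 if s[i] == w[j] else sub
                d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] + cost)
            if i + 1 < n and j < m:                     # merge: s[i]s[i+1] -> w[j]
                d[i + 2][j + 1] = min(d[i + 2][j + 1], d[i][j] + merge)
            if i < n and j + 1 < m:                     # split: s[i] -> w[j]w[j+1]
                d[i + 1][j + 2] = min(d[i + 1][j + 2], d[i][j] + split)
            if i + 1 < n and j + 1 < m:                 # pair-substitution
                d[i + 2][j + 2] = min(d[i + 2][j + 2], d[i][j] + pair_sub)
    return d[n][m]
```

With such weights, reading `rn' as `m' costs one merge (1.2) rather than a substitution plus a deletion (2.0), which is the kind of cursive confusion the new operations are meant to capture.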
F. Testing of Recognition Module

Table I describes the data used for testing of the recognition module and summarizes (word-level) performance results. Images used for the writer-dependent test were written by the same donors who participated in the isolated-word training data collection effort (see description of training data set above).

TABLE I
Experimental Test Data.

Writer-dependent Test
Images   Words   Writers   Top 1    Top 5    Top 10
443      50      20        91.6%    97.9%    99.3%

Writer-independent Test
Images   Words   Writers   Top 1    Top 5    Top 10
466      300     9         62.4%    82.4%    88.1%

Images used for the writer-independent test come, on the other hand, from the sentence-image database. Because of this, their quality is generally poorer. Furthermore, word frequency reflects natural language, where short words are very common (the most common words in the set are `the', `to', `and', and `of'); it is in shorter words that the system is more prone to errors. These two factors contribute significantly to the difference in performance between the two tests.

V. CONCLUSIONS

A model for the recognition of on-line handwritten cursive words, motivated by several psychological research findings about the human perception of handwriting, has been presented. In particular, we emphasized how to efficiently deal with large lexicon sizes, the role of dynamic information over traditional feature-analysis models in the recognition process, and the incorporation of letter context and avoidance of error-prone segmentation of the script by means of a scanning window technique. Experimental results clearly show that a recognition system designed according to these ideas can be successful.

ACKNOWLEDGEMENTS

This work was supported in part by a grant from the National Science Foundation (IRI-9315006) under the Human Language Technology program.

References

[1] M. Babcock and J. Freyd. Perception of dynamic information in static handwritten forms. American Journal of Psychology, 101(1):111-130, 1988.
[2] Y. Bengio. A connectionist approach to speech recognition. Intl. Jour. Pattern Recog. Artif. Intell., 7(4):3-22, 1993.
[3] E. Brocklehurst and P. Kenward. Preprocessing for cursive script recognition. NPL Report DITC 132/88, 1988.
[4] D. Elliott. A better activation function for artificial neural networks. Technical Report TR93-8, Institute for Systems Research, University of Maryland, 1993.
[5] S. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-DD-88-162, Computer Science Department, Carnegie Mellon University, 1988.
[6] T. Fujisaki, H. Beigi, C. Tappert, M. Ukelson, and C. Wolf. On-line recognition of unconstrained handprinting: a stroke-based system and its evaluation. In S. Impevodo and J. Simon, editors, From Pixels to Features III. Elsevier Science Publishers, 1992.
[7] W. Guerfali and R. Plamondon. Normalizing and restoring on-line handwriting. Pattern Recognition, 26(3):419-431, 1993.
[8] I. Guyon, P. Albrecht, Y. LeCun, J. Denker, and W. Hubbard. Design of a neural network character recognizer for a touch terminal. Pattern Recognition, 24(2):105-119, 1991.
[9] N. Hakim, J. Kaufman, G. Cerf, and H. Meadows. Cursive script online character recognition with a recurrent neural network model. In IJCNN. IEEE, 1992.
[10] W. Huang and R. Lippman. Comparisons between neural net and conventional classifiers. In First IEEE Conference on Neural Networks, San Diego, 1987.
[11] Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman-Soulie, and L. Steels, editors, Connectionism in Perspective. Elsevier Science Publishers, 1989.
[12] P. Morasso, L. Barberis, S. Pagliano, and D. Vergano. Recognition experiments of cursive dynamic handwriting with self-organizing networks. Pattern Recognition, 26(3):451-460, 1993.
[13] K. Ohmori. On-line handwritten kanji character recognition using hypothesis generation in the space of hierarchical knowledge. In IWFHR III, 1993.
[14] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by error propagation, volume 1, pages 318-362. Bradford Books, 1986.
[15] M. Schenkel, H. Weissman, I. Guyon, C. Nohl, and D. Henderson. Recognition-based segmentation of on-line hand-printed words. In Advances in Neural Information Processing Systems V. Morgan Kaufmann, 1993.
[16] L. Schomaker. Using stroke- or character-based self-organizing maps in the recognition of on-line, connected cursive script. Pattern Recognition, 26(3):443-450, 1993.
[17] G. Seni, V. Kripasundar, and R. Srihari. Generalizing edit distance for handwritten text recognition. In SPIE/IS&T Conference on Document Recognition, 1995.
[18] G. Seni and R. Srihari. A hierarchical approach to on-line script recognition using a large vocabulary. In IWFHR IV, 1994.
[19] Y. Singer and N. Tishby. A discrete dynamical approach to cursive handwriting analysis. Technical Report CS93-4, Institute of Computer Science, The Hebrew University of Jerusalem, 1993.
[20] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Trans. on Acoustics, Speech and Signal Processing, 37:328-339, 1989.