Fig. 1. Structure of word recognition system.

Fig. 2. Preprocessing example: (a) a raw image of word `recognition', and (b) the preprocessed version of it.
matchable words. This set consists of words from the system's lexicon that are similar in shape or structure to the input word (e.g., the words `imaginative', `immigration', and `imagination' are similar based on coarse shape). The set of matchable words forms the reduced lexicon; it is determined by generating all possible letter strings that can be derived from the string of primitives using a set of rules mapping the composition of stroke primitives into English characters. An important consequence of having a reduced lexicon is in limiting the amount of computation required during the string matching (postprocessing) stage (see below).

The Recognition module (section IV) uses a representation of the input that preserves the sequential nature of the cursive data and justifies the use of a network architecture similar to the Time-Delay Neural Network (TDNN). A further advantage of such a representation scheme is that stroke absences (from unintentional pen lifts) and accidental intersections (i.e., overlapping or touching characters), which significantly alter the topological (static) pattern of the word, have little or no influence on the "dynamic" pattern of it. The neural network-based recognizer is trained to classify the signal within its fixed-size input window as this window sequentially scans the input word representation, thus bypassing a potentially erroneous segmentation procedure. By training and recognizing char[...]

II. PREPROCESSING MODULE

[...] (ii) enforces even spacing between points, and (iii) per[...]ations). Normalization of different writing orientations, writing slant, and writing sizes is also essential in order to reduce writer-dependent variability. A normalization algorithm is employed for this purpose, based on the work of [3], [19]. In Fig. 2 a raw image of the word `recognition' is shown with the output produced by the preprocessing module on it.

III. FILTERING MODULE

In this section we briefly describe how the task of filtering/reducing the lexicon is achieved. More details can be found in [18]; an extended paper is under preparation.

The principal idea underlying the filtering module is that the visual configuration of a word written in cursive script can be captured by a stroke description string. The vocabulary of the description string corresponds to the different types of downward strokes (i.e., portions of input between pairs of consecutive y-maxima and y-minima in the pen trace) made by a writer in writing the word. Downward strokes constitute a simple but robust cue that contributes significantly to letter identification; furthermore, they allow for a compact description of the overall shape of a word without having to consider its internal details. This idea has provided formal grounding for the notion of visually similar neighborhood (VSN); in Fig. 3 our preprocessed image of word `recognition' is shown with its extracted downward strokes and a description of its shape as captured by the string that results from concatenating them. The visually similar words described by this string are also shown.

Fig. 3. Filtering example: (a) a preprocessed image of word `recognition' with base-line, half-line and extracted downward strokes; (b) the coarse representation of the word-shape provided by the string of concatenated stroke primitives; and (c) the set of matchable words derived from this string with a 21K lexicon (e.g., `emigration', `migration', `inauguration', `reunification', `imaginative', `immigration', `originators', `composition').

The stroke description scheme identifies 9 different types of strokes, some of which capture spatio-temporal information such as retrograde motion of the pen, and others which simply capture length and position relative to the reference lines. In the example of Fig. 3 there are three different strokes, namely Ascender (A), Median (M), and Descender (D). The stroke descriptions for the letters `r', `e', `c', `o', `g', `n', `i', and `t' would then be "M", "M", "M", "M", "MD", "MM", "M", and "A" respectively. The stroke description for the word `recognition' would simply be the concatenation "MMMMMDMMMAMMMM".
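The description-string idea can be sketched in code. The following is an illustrative simplification, not the authors' implementation: the paper's scheme distinguishes 9 stroke types, while this sketch keeps only the Ascender/Median/Descender labels of the example, assumes y grows upward, and uses function names of our own choosing.

```python
from typing import Dict, List, Tuple

def downward_strokes(ys: List[float]) -> List[Tuple[int, int]]:
    """Index pairs (start, end) of trace portions between consecutive
    local y-maxima and y-minima, i.e., where the pen moves downward."""
    strokes, start = [], None
    for i in range(1, len(ys)):
        if ys[i] < ys[i - 1]:            # still moving down
            if start is None:
                start = i - 1            # local y-maximum found
        elif start is not None:          # local y-minimum reached
            strokes.append((start, i - 1))
            start = None
    if start is not None:                # trace ended mid-stroke
        strokes.append((start, len(ys) - 1))
    return strokes

def describe(ys: List[float], base_line: float, half_line: float) -> str:
    """Label each downward stroke by its extent relative to the
    reference lines and concatenate the labels."""
    labels = []
    for a, b in downward_strokes(ys):
        top, bottom = ys[a], ys[b]
        if top > half_line:
            labels.append('A')           # Ascender: starts above half-line
        elif bottom < base_line:
            labels.append('D')           # Descender: ends below base-line
        else:
            labels.append('M')           # Median: within the middle zone
    return ''.join(labels)

def reduced_lexicon(desc: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Matchable words: lexicon entries indexed by the same string."""
    return lexicon.get(desc, [])
```

With the lexicon pre-indexed by description string, the set of matchable words is obtained in a single lookup, which is what limits the later string-matching cost.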
IV. RECOGNITION MODULE

The task of the recognition module is accomplished in four steps. The first is the encoding of the pen trajectory as a sequence of frames F(t) (a frame denotes one discrete time step's worth of data, i.e., features). In the second step, a TDNN-style network operates on a window of frames (comprising a character and parts of its neighbors) and produces an output at every time interval. In the third step, a postprocessor interprets this output sequence to generate a letter sequence (interpretation string). Finally, in the fourth step, a string-to-string similarity function is used to match the interpretation string(s) with the reduced lexicon produced by the filtering module. Each of these steps is now described.

A. Trajectory Representation

On-line data represents text as a sequence of points {P(t) = (X(t), Y(t), Z(t))}, where X, Y are the coordinates of the pen tip and Z indicates pen-up/pen-down information. All relevant dynamic information about handwriting can theoretically be inferred from this sequence, but this data is too unconstrained; more efficient methods of encoding it must be employed. We choose mainly to encode information pertaining to local direction and curvature in the pen trajectory, and rely on the neural network-based recognizer for the selection of more complex features relevant to performing the classification task.

Two parameters are used to encode direction: (i) sin θy(t), the sine of the angle between each segment P(t−1)P(t) of the trajectory and the Y-axis, and (ii) sin θx(t), the sine of the angle between P(t−1)P(t) and the X-axis. By restricting θx(t) and θy(t) to vary between −π/2 and +π/2 we make the parameters unambiguous: a negative value of sin θy(t) indicates that point P(t) is before point P(t−1) (i.e., a backward pen movement was made in going from P(t−1) to P(t)), and a positive value indicates that point P(t) is after point P(t−1) (i.e., a forward pen movement was made). Similarly, the sign of sin θx(t) indicates whether point P(t) is above or below point P(t−1) (i.e., whether an upward or a downward pen movement was made). Although the values of θy(t) and θx(t) could have been used directly, the sine function makes them easier to compute, conveniently bounds them between −1 and +1, and provides us with some quantization effect.

We also find the location of the points in the trajectory at which sharp changes in the direction of movement (i.e., cusps) take place. A very simple measurement of "local" curvature can be obtained by calculating the change between two consecutive directional angles. Guyon et al. [8] suggest that the angle θ(t) = θx(t+1) − θx(t−1) be represented by its sine and cosine values. However, we found that the values of cos θ(t) behave more smoothly than those of sin θ(t): for small values of θ(t) (i.e., little change in direction) cos θ(t) remains flat at the high value of +1, whereas sin θ(t) oscillates around zero. We choose cos θ(t) as our only curvature descriptor; it goes down to −1 for sharp cusps (independent of their orientation) and down to around 0 for smoother turns.

A.1. Time Frames. Given a sequence {(X(t), Y(t), Z(t))} of on-line data, we define a time frame F(t) to be the 4-dimensional feature vector (sin θx(t), sin θy(t), cos θ(t), zone(t)), where the first three elements have already been described above. The fourth element, zone(t), is introduced to help distinguish pairs such as `e'-`l', which have similar temporal representations in terms of direction and curvature alone. These pairs can be more easily differentiated by encoding their corresponding Y(t) values into the previously determined zones: the middle zone (between the base-line and the half-line), the ascender zone (above the half-line) and the descender zone (below the base-line).

For a point P(t) = (X(t), Y(t)) falling within the middle zone we make zone(t) = 0; otherwise, we have 0 < zone(t) ≤ 1.0 if the point falls within the ascender zone, and −1.0 ≤ zone(t) < 0 if the point falls within the descender zone. Specifically, the zone(t) parameter is computed by passing the vertical distance (dist) between point P(t) and the half-line (or base-line) through a thresholding function:

    zone(t) = f(10.0 · dist / body_hght − 5.0)

where f(x) is the sigmoid function and body_hght corresponds to the distance between the base-line and the half-line, so that when point P(t) is further away than body_hght from the half-line (or base-line), zone(t) is 1.0 (or −1.0). This coding scheme appears robust against writing distortions where ascenders/descenders are made atypically large or where medium-size letters do not fully fall within the reference lines.

A.2. Varying duration and scaling. Since we are dealing with unsegmented words, a constant number of frames per letter across a word or across a set of samples cannot be guaranteed (i.e., duration varies). To reduce such variability in letter length, the size normalization step of the preprocessing module uses the ratio H/MLH as scale factor; MLH (median letter height) is an estimate of the height of small letters (i.e., those that fall between the base-line and the half-line), and H is the normalization height (currently set at about 3 mm). Because the distance between points is kept constant, this procedure effectively minimizes time distortions of letters.
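A minimal sketch of this frame encoding follows. The helper names are ours; y is assumed to grow upward, f is taken to be the logistic sigmoid, and the curvature entry at the two trace endpoints is left at the neutral value +1, since θ(t) needs both a preceding and a following segment.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def zone(y: float, base_line: float, half_line: float) -> float:
    """Soft zone membership: 0 in the middle zone, approaching +1 deep
    in the ascender zone and -1 deep in the descender zone."""
    body_hght = half_line - base_line
    if base_line <= y <= half_line:
        return 0.0
    if y > half_line:                      # ascender zone
        dist = y - half_line
        return sigmoid(10.0 * dist / body_hght - 5.0)
    dist = base_line - y                   # descender zone
    return -sigmoid(10.0 * dist / body_hght - 5.0)

def frames(points, base_line, half_line):
    """Encode each pen segment P(t-1)->P(t) as the feature vector
    [sin_theta_x, sin_theta_y, cos_dtheta, zone]."""
    out, angles = [], []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy) or 1.0  # guard repeated points
        sin_x = dy / length                 # sign: upward vs. downward
        sin_y = dx / length                 # sign: forward vs. backward
        angles.append(math.atan2(dy, dx))
        out.append([sin_x, sin_y, 1.0, zone(y1, base_line, half_line)])
    # curvature: cosine of the change between surrounding direction angles
    for t in range(1, len(angles) - 1):
        out[t][2] = math.cos(angles[t + 1] - angles[t - 1])
    return out
```

On a straight horizontal trace every frame comes out as (0, 1, 1, 0): no vertical motion, steady forward motion, no turning, middle zone.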
B. Neural Network Recognizer

The Time-Delay Neural Network (TDNN) is a multilayer feedforward architecture, first developed for speech recognition, whose hidden units are replicated across time; it has been successful in learning to recognize time sequences [20]. TDNNs are trained with a modified back-propagation (BP) algorithm [14] and are usually less difficult to train than (although sometimes outperformed by) recurrent networks [2]. BP networks have proven to be very competitive with classical pattern recognition methods, especially for problems requiring complex decision boundaries [10]. The ability of BP networks to deal directly with large amounts of low-level information rather than higher-order (more elaborate) feature vectors has also been demonstrated in different applications.

The architecture of our three-layer TDNN-style network is inspired by that of Waibel et al. [20] for phoneme recognition and that of Guyon et al. [8] for uppercase handprinted letter recognition. Its overall structure is shown in Fig. 4. The choice of L = 96 frames as the length of the input window to the network (the network's receptive field) is related to H, the normalization height. H is selected as small as possible so as to minimize the convolution time needed to do full word recognition. Having H available, L is selected so that L frames are enough to represent a character and, in most cases, include part of the characters on each side of it for contextual information. The length of the two hidden layers is then determined using an undersampling factor of 3, a technique that allows one to reduce the size of the network [11]. This leads to the notion of a pyramidal structure with decreasing levels of detail. To compensate for the loss of resolution associated with undersampling, a commonly used approach is to increase the number of hidden units as one moves up the network pyramid.

Connections are arranged so that each hidden unit has a receptive field that is limited along the time domain. In the first hidden layer there are 15 units replicated 30 times (i.e., weights are shared), each receiving input from 9 consecutive frames in the input layer. The choice of 9 as the width of the receptive field of these units reflects the goal of detecting features with short duration at this level, but also long enough for it to be meaningful (e.g., a cusp). In the second hidden layer, there are 20 units replicated 9 times, each looking at a 15×6 window of activity levels in the first hidden layer. These units receive information spanning a larger time interval from the input, and hence are expected to detect more complex and global features (i.e., longer in duration). Finally, the output layer has 26 units (one for each of the English letters) fully connected to the second hidden layer.

[Fig. 4, showing the overall network structure, appears here.]

Weight-sharing is a general paradigm that allows us to build reduced-size networks; this in turn is an effective way of increasing the likelihood of correct generalization [11]. Weight sharing also enables the development of shift-invariant feature detectors [14] by constraining units to learn the same pattern of weights as their neighboring ones do. This corresponds to the intuition that if a particular feature detector is useful on one part of the sequence, it is likely to be useful on other parts of the sequence as well.

C. Neural Network Simulation

We choose the activation range of our neurons to be between −1 and +1, with the following computationally efficient activation function [4]:

    f(u) = u / (1 + |u|),  with derivative  f'(u) = 1 / (1 + |u|)² + offset

where |u| stands for the absolute value of the weighted sum and offset is a constant suggested by Fahlman [5] to kill flat spots. Weights are initialized with random numbers uniformly distributed between −0.1 and +0.1. A single bias unit is used by all weight-shared units that are controlled by the same weight kernel, as opposed to an independent bias for each replicated unit, in order to develop truly invariant feature detectors.
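The activation function above, together with the layer-size arithmetic implied by the architecture description, can be checked with a short sketch. The offset default below is a placeholder of ours; the text only says it is a constant.

```python
def elliott(u: float) -> float:
    """f(u) = u / (1 + |u|): fast sigmoid-like activation in (-1, +1)."""
    return u / (1.0 + abs(u))

def elliott_deriv(u: float, offset: float = 0.1) -> float:
    """f'(u) = 1 / (1 + |u|)^2 plus a small constant offset
    (Fahlman's trick) so learning does not stall on flat spots."""
    return 1.0 / (1.0 + abs(u)) ** 2 + offset

def conv_positions(n_in: int, width: int, stride: int) -> int:
    """Number of receptive-field positions of a 1-D shared-weight layer."""
    return (n_in - width) // stride + 1

# 96 input frames, width-9 fields, undersampling 3 -> 30 positions;
# width-6 fields over those 30 positions -> 9, matching the replication
# counts (30 and 9) quoted in the text.
```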
[...] data collection effort where paid donors were asked to write full sentences presented acoustically to them. [...] (roughly where one character ends and the next one begins). This information was then used to pair each frame F(t), in the dynamic representation of the word, with an output vector. The goal was to generate a target signal that ramps up about halfway through the character and then quickly backs down afterwards, in such a way that the network learns to recognize a character whenever the center of the character is in the center of the network's receptive field. For each word in the training data set, a target signal that ramps up at 30% of each character's length, reaches its maximum [...]

[...]ing the input window of the network across the frame sequence {F(t)}, thus generating activation traces O_l(t) at the output of the network, where O_l(t) corresponds to the network's confidence in recognizing a letter l at time t. These output traces are subsequently examined to determine the ASCII string(s) best representing the word image. In order to convert this output trace signal into letters, the sizes and widths of all activation "peaks" for every output unit are determined. Fig. 5 shows the output activation traces O_l(t), for all 26 output nodes, generated by the network when presented with our preprocessed image of word `recognition'. Eleven significant activation peaks are visible, each one corresponding to a letter detected in the input image.

Fig. 5. Recognition example: (a) plot of network output traces when presented with the preprocessed image of word `recognition', (b) corresponding detected peaks and generated interpretation string (`ne ognit n'), and (c) final recognition result after matching the interpretation string with the reduced lexicon provided by the filtering module.

The sizes of all activation peaks are computed by scanning the output activation traces, from left to right, looking for activation levels that exceed a given threshold. When the activation value of an output unit exceeds the threshold (currently set at −0.8), a summing process begins for that unit, which ends when its activation value falls below the threshold. Activation peaks with a maximum value below −0.2 are ignored (i.e., they are not considered sufficiently confident). In order to compensate for smaller letters, which are shorter in the temporal domain, we normalize the size of a peak by its expected size [9]. The expected peak size for a given letter is given by the average size of all the peaks in the training signal for that letter. The set {P_i} of all selected activation peaks is then ordered based on the beginning time of each peak P_i. A directed interpretation graph is subsequently constructed from this set as follows: there is a node N_i in the graph for every activation peak P_i, and there is an edge between nodes N_i and N_j (i < j) if peaks P_i and P_j are adjacent and their widths do not overlap; otherwise, nodes N_i and N_j will lie on parallel paths of the graph. Word hypotheses are generated by traversing all possible paths in the graph from the root to all the "leaves". The confidence of a word hypothesis is set using the average of the nodes' normalized sizes in the corresponding path.

D.1. Delayed Strokes and Missing Peaks. Diacritical marks such as dots on letters `i' and `j', and horizontal bars on letter `t' (and sometimes also the `x' slash) are often written after the whole word was written. These delayed strokes constitute an exception to our "dynamic" representation scheme of cursive handwriting because they violate the (strict) time-order of the letter patterns. Because diacritical marks are many times missing or badly positioned in an image, we decided that they should be used as "confidence boosters" and not as required features for letter identification. That is, the recognizer should be able to hypothesize the presence of a letter `i', `j' or `t' in the input script even if the diacritical mark is missing. The existence of a diacritical mark is then simply used to confirm the hypothesis or resolve any ambiguity (say between `i' and `e' or between `t' and `l'). Diacritical marks are thus detected and removed from the image prior to recognition; a peak for the letter `i', `j', `t' or `x' in the output activation traces is then said to be "influenced" by a diacritical mark if some of the corresponding frames in the input trajectory are covered (in a horizontal sense) by the diacritical mark. The confidence of influenced peaks is then boosted by an amount proportional to its current value. In Fig. 5 influenced peaks are indicated with a `[...]'.

Sometimes it is also possible for the peak parsing routine to "hint" that a character is missing in the output interpretation string. A missing character in the output interpretation string is usually the result of poor handwriting and a corresponding low-activation peak which is discarded because of low confidence during the peak identification process. This situation often results in a large "no-response" time interval in the output activation traces; that is, a period of time for which no O_l(t) is active. To detect these cases we have computed the expected inter-peak gap, from our training data set, for every pair of characters. Then, during the traversal of the interpretation graph, if the time-gap between two adjacent activation peaks is larger than its expected value, a special symbol (`[...]') is output to indicate that a character is probably missing. When matching an interpretation string containing this symbol with a lexicon entry, any character is allowed to match it with a small penalty.
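The peak-scanning step described above can be sketched as follows. The two thresholds are the ones quoted in the text; treating a peak's size as the activation accumulated above the scanning threshold is our assumption, since the text only speaks of a "summing process".

```python
def find_peaks(trace, start_thresh=-0.8, keep_thresh=-0.2):
    """Scan one output unit's activation trace from left to right.
    A peak runs while activation stays above start_thresh; peaks whose
    maximum never reaches keep_thresh are discarded as unconfident."""
    peaks, t = [], 0
    while t < len(trace):
        if trace[t] > start_thresh:
            begin, size, peak_max = t, 0.0, -1.0
            while t < len(trace) and trace[t] > start_thresh:
                size += trace[t] - start_thresh   # "summing process"
                peak_max = max(peak_max, trace[t])
                t += 1
            if peak_max >= keep_thresh:
                peaks.append({"begin": begin, "end": t - 1, "size": size})
        else:
            t += 1
    return peaks
```

The resulting peaks, ordered by their `begin` times, are the nodes of the interpretation graph; each size would then be divided by the letter's expected peak size before scoring word hypotheses.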
E. String Matching

In order to validate the output interpretation strings produced by the recognizer, we need to look them up in the list of words (reduced lexicon) which is provided by the filtering module. Since there may be errors in the interpretation string(s) s, a similarity metric is needed to determine the likelihood that a word w in the reduced lexicon is the "true" value of s. For this purpose we extended the Damerau-Levenshtein metric (see [17] for details) to more accurately compensate for the types of errors that are present in the script recognition domain. The extended metric quantifies the minimum cost of transforming each word hypothesis into every word in the reduced lexicon using edit operations, namely letter substitution, insertion, and deletion, and three newly created operations: merge, split, and pair-substitution. Different cost weights were assigned to the individual operations, reflecting the fact that some errors are more likely to occur than others.
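The extended metric can be sketched as a forward dynamic program over the two strings. The cost weights below are placeholders of ours; the text says only that weights reflect error likelihoods, without quoting values.

```python
def extended_edit_distance(s: str, w: str, sub=1.0, ins=1.0, dele=1.0,
                           merge=1.2, split=1.2, pair_sub=1.5):
    """Minimum cost of transforming interpretation string s into lexicon
    word w.  Besides substitution/insertion/deletion, allow merge (two
    letters of s read as one of w), split (one letter of s read as two
    of w) and pair-substitution (two-for-two)."""
    n, m = len(s), len(w)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if d[i][j] == INF:
                continue
            if i < n:                                   # delete s[i]
                d[i + 1][j] = min(d[i + 1][j], d[i][j] + dele)
            if j < m:                                   # insert w[j]
                d[i][j + 1] = min(d[i][j + 1], d[i][j] + ins)
            if i < n and j < m:                         # match / substitute
                cost = 0.0 if s[i] == w[j] else sub
                d[i + 1][j + 1] = min(d[i + 1][j + 1], d[i][j] + cost)
            if i + 1 < n and j < m:                     # merge: s[i]s[i+1] -> w[j]
                d[i + 2][j + 1] = min(d[i + 2][j + 1], d[i][j] + merge)
            if i < n and j + 1 < m:                     # split: s[i] -> w[j]w[j+1]
                d[i + 1][j + 2] = min(d[i + 1][j + 2], d[i][j] + split)
            if i + 1 < n and j + 1 < m:                 # pair-substitution
                d[i + 2][j + 2] = min(d[i + 2][j + 2], d[i][j] + pair_sub)
    return d[n][m]
```

With such weights, reading `rn' as `m' costs one merge (1.2) rather than a substitution plus a deletion (2.0), which is the kind of cursive confusion the new operations are meant to capture.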
F. Testing of Recognition Module

Table I describes the data used for testing of the recognition module and summarizes (word-level) performance results. Images used for the writer-dependent test were written by the same donors who participated in the isolated-word training data collection effort (see description of training data set above).

TABLE I
Experimental Test Data.

Writer-dependent Test
Images   Words   Writers   Top 1    Top 5    Top 10
443      50      20        91.6%    97.9%    99.3%

Writer-independent Test
Images   Words   Writers   Top 1    Top 5    Top 10
466      300     9         62.4%    82.4%    88.1%

Images used for the writer-independent test come, on the other hand, from the sentence-image database. Because of this, their quality is generally poorer. Furthermore, word frequency reflects natural language, where short words are very common (the most common words in the set are `the', `to', `and', and `of'); it is in shorter words that the system is more prone to errors. These two factors contribute significantly to the difference in performance between the two tests.

V. CONCLUSIONS

A model for the recognition of on-line handwritten cursive words, motivated by several psychological research findings about the human perception of handwriting, has been presented. In particular, we emphasized how to efficiently deal with large lexicon sizes, the role of dynamic information over traditional feature-analysis models in the recognition process, and the incorporation of letter context and avoidance of error-prone segmentation of the script by means of a scanning window technique. Experimental results clearly show that a recognition system designed according to these ideas can be successful.

ACKNOWLEDGEMENTS

This work was supported in part by a grant from the National Science Foundation (IRI-9315006) under the Human Language Technology program.

References

[1] M. Babcock and J. Freyd. Perception of dynamic information in static handwritten forms. American Journal of Psychology, 101(1):111-130, 1988.
[2] Y. Bengio. A connectionist approach to speech recognition. Intl. Jour. Pattern Recog. Artif. Intell., 7(4):3-22, 1993.
[3] E. Brocklehurst and P. Kenward. Preprocessing for cursive script recognition. NPL Report DITC 132/88, 1988.
[4] D. Elliott. A better activation function for artificial neural networks. Technical Report TR93-8, Institute for Systems Research, University of Maryland, 1993.
[5] S. Fahlman. An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-DD-88-162, Computer Science Department, Carnegie Mellon University, 1988.
[6] T. Fujisaki, H. Beigi, C. Tappert, M. Ukelson, and C. Wolf. On-line recognition of unconstrained handprinting: a stroke-based system and its evaluation. In S. Impevodo and J. Simon, editors, From Pixels to Features III. Elsevier Science Publishers, 1992.
[7] W. Guerfali and R. Plamondon. Normalizing and restoring on-line handwriting. Pattern Recognition, 26(3):419-431, 1993.
[8] I. Guyon, P. Albrecht, Y. LeCun, J. Denker, and W. Hubbard. Design of a neural network character recognizer for a touch terminal. Pattern Recognition, 24(2):105-119, 1991.
[9] N. Hakim, J. Kaufman, G. Cerf, and H. Meadows. Cursive script online character recognition with a recurrent neural network model. In IJCNN. IEEE, 1992.
[10] W. Huang and R. Lippman. Comparisons between neural net and conventional classifiers. In First IEEE Conference on Neural Networks, San Diego, 1987.
[11] Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman-Soulie, and L. Steels, editors, Connectionism in Perspective. Elsevier Science Publishers, 1989.
[12] P. Morasso, L. Barberis, S. Pagliano, and D. Vergano. Recognition experiments of cursive dynamic handwriting with self-organizing networks. Pattern Recognition, 26(3):451-460, 1993.
[13] K. Ohmori. On-line handwritten kanji character recognition using hypothesis generation in the space of hierarchical knowledge. In IWFHR III, 1993.
[14] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by error propagation, volume 1, pages 318-362. Bradford Books, 1986.
[15] M. Schenkel, H. Weissman, I. Guyon, C. Nohl, and D. Henderson. Recognition-based segmentation of on-line hand-printed words. In Advances in Neural Information Processing Systems V. Morgan Kaufmann, 1993.
[16] L. Schomaker. Using stroke- or character-based self-organizing maps in the recognition of on-line, connected cursive script. Pattern Recognition, 26(3):443-450, 1993.
[17] G. Seni, V. Kripasundar, and R. Srihari. Generalizing edit distance for handwritten text recognition. In SPIE/IS&T Conference on Document Recognition, 1995.
[18] G. Seni and R. Srihari. A hierarchical approach to on-line script recognition using a large vocabulary. In IWFHR IV, 1994.
[19] Y. Singer and N. Tishby. A discrete dynamical approach to cursive handwriting analysis. Technical Report CS93-4, Institute of Computer Science, The Hebrew University of Jerusalem, 1993.
[20] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Trans. on Acoustics, Speech and Signal Processing, 37:328-339, 1989.