
Aaron L.-F. Han
Natural Language Processing & Portuguese-Chinese Machine Translation Laboratory
University of Macau, Macau S.A.R., China

2013.08 @ CUHK, Hong Kong

Email: hanlifengaaron AT gmail DOT com
Homepage: http://www.linkedin.com/in/aaronhan

The importance of machine translation (MT) evaluation

Automatic MT evaluation metrics introduction
1. Lexical similarity
2. Linguistic features
3. Metrics combination

Designed metric: LEPOR Series
1. Motivation
2. LEPOR Metrics Description
3. Performances on international ACL-WMT corpora
4. Publications and Open source tools

Other research interests and publications

People of different nationalities are eager to communicate
with each other
Promote the development of translation technology
Rapid development of machine translation:

Machine translation (MT) began as early as the 1950s
(Weaver, 1955)
Big progress since the 1990s, due to the development of
computers (storage capacity and computational power)
and enlarged bilingual corpora (Marino et al., 2006)

Some recent works in MT research:

Och (2003) presented MERT (Minimum Error Rate Training)
for log-linear SMT
Su et al. (2009) used the Thematic Role Templates model to
improve translation
Xiong et al. (2011) employed a maximum-entropy model, etc.
Data-driven methods, including example-based MT
(Carl and Way, 2003) and statistical MT (Koehn, 2010),
became the main approaches in the MT literature.

How well do MT systems perform, and are they making progress?

Difficulties of MT evaluation:
Language variability means there is no single correct translation
Natural languages are highly ambiguous, and different
languages do not always express the same content in the
same way (Arnold, 2003)

Traditional manual evaluation criteria:

intelligibility (measuring how understandable the
sentence is) and fidelity (measuring how much information the
translated sentence retains compared to the original), by the
Automatic Language Processing Advisory Committee
(ALPAC) around 1966 (Carroll, 1966)
adequacy (similar to fidelity), fluency (whether the
sentence is well-formed and fluent), and comprehension
(improved intelligibility), by the Defense Advanced Research
Projects Agency (DARPA) of the US (White et al., 1994)

Problems of manual evaluation:

Time-consuming
Expensive
Unrepeatable
Low agreement (Callison-Burch, et al., 2011)

2.1 Lexical similarity


2.2 Linguistic features
2.3 Metrics combination

Precision-based:
BLEU (Papineni et al., 2002 ACL)
Recall-based:
ROUGE (Lin, 2004 WAS)
Precision and recall:
Meteor (Banerjee and Lavie, 2005 ACL)

Word-order based:
NKT_NSR (Isozaki et al., 2010 EMNLP), Port (Chen
et al., 2012 ACL), ATEC (Wong et al., 2008 AMTA)
Word-alignment based:
AER (Och and Ney, 2003 J.CL)
Edit-distance based (see the sketch after this list):
WER (Su et al., 1992 Coling), PER (Tillmann et al.,
1997 EUROSPEECH), TER (Snover et al., 2006 AMTA)
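As one concrete example of the edit-distance family, here is a minimal Python sketch (not any official implementation) of word-level Levenshtein distance, the core of WER, which normalizes the distance by the reference length:

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    h, r = hyp.split(), ref.split()
    # dp[i][j]: edits to turn the first i hypothesis words into the first j reference words
    dp = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[-1][-1]

# WER = edit distance normalized by the reference length
hyp, ref = "the cat sat on mat", "the cat sat on the mat"
print(word_edit_distance(hyp, ref) / len(ref.split()))  # 1/6, approx. 0.167
```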

Language model:
LM-SVM (Gamon et al., 2005 EAMT)
Shallow parsing:
GLEU (Mutton et al., 2007 ACL), TerrorCat (Fishel
et al., 2012 WMT)
Semantic roles, named entities, morphology, synonymy,
paraphrasing, discourse representation, etc.


Combine BLEU, TERp (Snover et al., 2009) and Meteor
(Banerjee and Lavie, 2005; Lavie and Denkowski, 2009)

MPF & WMPBleu (Popovic, 2011WMT)


Arithmetic mean of F score and BLEU score

SIA (Liu and Gildea, 2006ACL)


Combine the advantages of n-gram-based metrics and
loose-sequence-based metrics

LEPOR: an automatic machine translation evaluation
metric considering: Length Penalty, Precision, n-gram
Position difference Penalty, and Recall.

Weaknesses of existing metrics:

They perform well on certain language pairs but weakly on
others, which we call the language-bias problem;
They consider no linguistic information (leading to low
correlation with human judgments) or too many linguistic
features (making them difficult to replicate), which we call
the extremism problem;
They use an incomplete set of factors (e.g., BLEU focuses on
precision only).
What to do?

To address some of the existing problems:

Design tunable parameters to address the language-bias
problem;
Use concise or optimized linguistic features to address the
linguistic extremism problem;
Design augmented factors.

Sub-factors:

Length penalty:

$$LP = \begin{cases} \exp(1 - \frac{r}{c}), & c < r \\ 1, & c = r \\ \exp(1 - \frac{c}{r}), & c > r \end{cases} \quad (1)$$

where $r$ is the length of the reference sentence and $c$ is the length of the candidate (system-output) sentence.

N-gram position difference penalty:

$$NPosPenal = \exp(-NPD) \quad (2)$$

$$NPD = \frac{1}{Length_{output}} \sum_{i=1}^{Length_{output}} |PD_i| \quad (3)$$

$$PD_i = |MatchN_{output} - MatchN_{ref}| \quad (4)$$

where $MatchN_{output}$ is the position of the matched token in the output sentence and $MatchN_{ref}$ is the position of the matched token in the reference sentence, each normalized by its sentence length. (A Python sketch of these sub-factors follows Fig. 3.)

Fig. 1. N-gram word alignment algorithm

Fig. 2. Example of n-gram word alignment

Fig. 3. Example of NPD calculation
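To make equations (1)-(4) concrete, below is a minimal Python sketch, assuming a naive first-match word alignment; the actual LEPOR algorithm uses the surrounding n-grams (Fig. 1) to choose among multiple candidate matches:

```python
import math

def length_penalty(ref_len, out_len):
    # Eq. (1): penalize candidates shorter or longer than the reference
    if out_len < ref_len:
        return math.exp(1 - ref_len / out_len)
    if out_len == ref_len:
        return 1.0
    return math.exp(1 - out_len / ref_len)

def npos_penal(output, reference):
    # Eqs. (2)-(4): n-gram position difference penalty, with positions
    # normalized by sentence length; unmatched tokens contribute nothing here.
    out, ref = output.split(), reference.split()
    pd_sum = 0.0
    for i, token in enumerate(out):
        if token in ref:
            j = ref.index(token)  # naive alignment: first occurrence wins
            pd_sum += abs((i + 1) / len(out) - (j + 1) / len(ref))  # Eq. (4)
    npd = pd_sum / len(out)   # Eq. (3)
    return math.exp(-npd)     # Eq. (2)

print(length_penalty(6, 5))                      # candidate shorter than reference
print(npos_penal("the cat sat", "the cat sat"))  # identical order -> 1.0
```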

N-gram precision and recall:

$$P = \frac{\#ngrams\_matched}{\#ngrams\_in\_output} \quad (5)$$

$$R = \frac{\#ngrams\_matched}{\#ngrams\_in\_reference} \quad (6)$$

$$Harmonic(\alpha R, \beta P) = \frac{\alpha + \beta}{\frac{\alpha}{R} + \frac{\beta}{P}} \quad (7)$$

Fig. 4. Example of bigram matching

LEPOR Metrics:

$$LEPOR = LP \cdot NPosPenal \cdot Harmonic(\alpha R, \beta P) \quad (8)$$

$$hLEPOR = Harmonic(w_{LP} LP,\ w_{NPosPenal} NPosPenal,\ w_{HPR} HPR) = \frac{\sum_{i=1}^{3} w_i}{\sum_{i=1}^{3} \frac{w_i}{Factor_i}} \quad (9)$$

$$nLEPOR = LP \cdot NPosPenal \cdot \exp\Big(\sum_{n=1}^{N} w_n \log HPR_n\Big) \quad (10)$$
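Continuing the sketch above (reusing length_penalty and npos_penal), equations (5)-(8) can be illustrated as follows for the unigram case; the alpha and beta defaults are placeholders, not the tuned values from the papers, and at least one n-gram match is assumed:

```python
from collections import Counter

def ngram_precision_recall(output, reference, n=1):
    # Eqs. (5)-(6): proportion of shared n-grams in output and reference
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    out, ref = ngrams(output.split()), ngrams(reference.split())
    matched = sum((out & ref).values())  # clipped n-gram matches
    return matched / sum(out.values()), matched / sum(ref.values())

def harmonic(alpha, beta, recall, precision):
    # Eq. (7): weighted harmonic mean of recall and precision
    return (alpha + beta) / (alpha / recall + beta / precision)

def lepor(output, reference, alpha=1.0, beta=1.0):
    # Eq. (8): product of length penalty, position penalty, and harmonic mean
    p, r = ngram_precision_recall(output, reference)
    return (length_penalty(len(reference.split()), len(output.split()))
            * npos_penal(output, reference)
            * harmonic(alpha, beta, r, p))

print(lepor("the cat sat on mat", "the cat sat on the mat"))
```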

Example: employment of linguistic features:

Fig. 5. Example of n-gram POS alignment

Fig. 6. Example of NPD calculation

Combination with linguistic features:

$$hLEPOR = \frac{1}{w_{hw} + w_{hp}} \big( w_{hw} \cdot hLEPOR_{word} + w_{hp} \cdot hLEPOR_{POS} \big) \quad (11)$$

$hLEPOR_{word}$ and $hLEPOR_{POS}$ apply the same
algorithm to the word sequence and the POS sequence,
respectively.

With multiple references:
Select the alignment that results in the minimum NPD
score (see the sketch after Fig. 7).

Fig. 7. N-gram alignment with multiple references
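A hedged one-liner for this selection rule, reusing npos_penal from the sketch above; taking the minimum NPD is equivalent to taking the maximum NPosPenal = exp(-NPD):

```python
def npos_penal_multi(output, references):
    # Among the per-reference alignments, keep the one with minimum NPD,
    # i.e. the maximum position-penalty score exp(-NPD).
    return max(npos_penal(output, ref) for ref in references)
```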

How reliable is an automatic metric?

Evaluation criteria for evaluation metrics:
Human judgment is currently the gold standard to approximate.

Correlation with human judgments:

System-level correlation
Segment-level correlation

System-level correlation:

Spearman rank correlation coefficient:

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \quad (12)$$

where $d_i$ is the difference between the ranks of system $i$ in the two rank vectors $\vec{x} = \{x_1, \dots, x_n\}$ and $\vec{y} = \{y_1, \dots, y_n\}$.

Pearson correlation coefficient:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \quad (13)$$

Segment-level Kendall's tau correlation:

$$\tau = \frac{\#concordant\ pairs - \#discordant\ pairs}{\#concordant\ pairs + \#discordant\ pairs} \quad (14)$$

The segment unit can be a single sentence or a fragment
that contains several sentences.
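Equations (12)-(14) are the standard Spearman, Pearson, and Kendall statistics; with SciPy, for example (note that the official WMT segment-level tau is computed over pairwise human ranking judgments, while this sketch applies the statistic to paired score lists):

```python
from scipy import stats

metric_scores = [0.31, 0.42, 0.38, 0.25]  # one automatic score per MT system
human_scores = [2.9, 3.6, 3.4, 2.1]       # corresponding human assessments

rho, _ = stats.spearmanr(metric_scores, human_scores)   # Eq. (12)
r, _ = stats.pearsonr(metric_scores, human_scores)      # Eq. (13)
tau, _ = stats.kendalltau(metric_scores, human_scores)  # Eq. (14)
print(rho, r, tau)
```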

Performances on the ACL-WMT 2011 corpora

Two translation directions:
English-to-other (Spanish, German, French, Czech)
Other-to-English

System-level metrics:

$$Score_{system} = \frac{1}{|SentNum|} \sum_{i=1}^{|SentNum|} Score_{sent_i} \quad (15)$$

$$Score_{system} = Score\big(\overline{Factor_1}, \dots, \overline{Factor_k}\big) \quad (16)$$

i.e., either the mean of the sentence-level scores, or the score recomputed from the system-level (averaged) factor values.

Table 1. System-level Spearman correlation with human judgments on the WMT11 corpora

Performances on the ACL-WMT 2013 corpora

Two translation directions:
English-to-other (Spanish, German, French, Czech, and
Russian)
Other-to-English

System-level & sentence-level

LEPOR_v3.1: hLEPOR, nLEPOR_baseline

$$Score_{system} = \frac{1}{|SentNum|} \sum_{i=1}^{|SentNum|} Score_{sent_i} \quad (17)$$

$$hLEPOR = Harmonic(w_{LP} LP,\ w_{NPosPenal} NPosPenal,\ w_{HPR} HPR) = \frac{w_{LP} + w_{NPosPenal} + w_{HPR}}{\frac{w_{LP}}{LP} + \frac{w_{NPosPenal}}{NPosPenal} + \frac{w_{HPR}}{HPR}} \quad (18)$$

$$nLEPOR = LP \cdot NPosPenal \cdot \exp\Big(\sum_{n=1}^{N} w_n \log HPR_n\Big) \quad (19)$$
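A small sketch of equations (17)-(19), assuming the per-sentence factor scores have already been computed; the function names and default weights are illustrative, not the tuned WMT13 settings:

```python
import math

def system_score(segment_scores):
    # Eq. (17): system-level score as the mean of segment-level scores
    return sum(segment_scores) / len(segment_scores)

def hlepor(lp, npp, hpr, w_lp=1.0, w_npp=1.0, w_hpr=1.0):
    # Eq. (18): weighted harmonic mean of the three factors
    return (w_lp + w_npp + w_hpr) / (w_lp / lp + w_npp / npp + w_hpr / hpr)

def nlepor(lp, npp, hpr_scores, weights):
    # Eq. (19): BLEU-style weighted log-average over the n-gram
    # harmonic precision/recall scores HPR_n, scaled by the two penalties
    return lp * npp * math.exp(
        sum(w * math.log(h) for w, h in zip(weights, hpr_scores)))

print(system_score([0.6, 0.7, 0.8]))
print(hlepor(0.95, 0.9, 0.85))
print(nlepor(0.95, 0.9, [0.7, 0.5], [0.5, 0.5]))
```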

Table 2 & 3. System-level Pearson correlation with human judgments

Table 4 & 5. Segment-level Kendall's tau correlation with human judgments

LEPOR: A Robust Evaluation Metric for Machine
Translation with Augmented Factors
Aaron L.-F. Han, Derek F. Wong and Lidia S. Chao.
Proceedings of COLING 2012: Posters, pages 441-450,
Mumbai, India.

Language-independent Model for Machine


Translation Evaluation with Reinforced Factors
Aaron L.-F. Han, Derek Wong, Lidia S. Chao, Liangye He, Yi
Lu, Junwen Xing, Xiaodong Zeng. Proceedings of MT
Summit 2013. Nice, France.

Language independent MT evaluation-LEPOR:


https://github.com/aaronlifenghan/aaron-project-lepor
MT evaluation with linguistic features-hLEPOR:
https://github.com/aaronlifenghan/aaron-project-hlepor
English-French Phrase tagset mapping and application in
unsupervised MT evaluation-HPPR:
https://github.com/aaronlifenghan/aaron-project-hppr
Unsupervised English-Spanish MT evaluation-EBLEU:
https://github.com/aaronlifenghan/aaron-project-ebleu
Projects Homepage: https://github.com/aaronlifenghan

My research interests:

Natural Language Processing


Signal Processing
Machine Learning
Artificial Intelligence
Pattern Recognition

My past research work:


Machine Translation Evaluation, Word Segmentation,
Entity Recognition, Multilingual Treebanks

Other publications:
A Description of Tunable Machine Translation Evaluation Systems in
WMT13 Metrics Task
Aaron Li-Feng Han, Derek Wong, Lidia S. Chao, Yi Lu, Yervant
Ho, Yiming Wang, Jiaji Zhou. Proceedings of the ACL 2013 EIGHTH
WORKSHOP ON STATISTICAL MACHINE TRANSLATION (ACL-WMT
2013), 8-9 August 2013. Sofia, Bulgaria.

ACL-WMT13 METRICS TASK:

Our metrics are language independent
English-vs-other (French, Spanish, Czech, German, Russian)
They can perform at both the system level and the segment level
The official results show our metrics have advantages compared to others.

Quality Estimation for Machine Translation Using the Joint Method of
Evaluation Criteria and Statistical Modeling
Aaron Li-Feng Han, Yi Lu, Derek F. Wong, Lidia S. Chao, Yervant
Ho, Anson Xing. Proceedings of the ACL 2013 EIGHTH WORKSHOP ON
STATISTICAL MACHINE TRANSLATION (ACL-WMT 2013), 8-9 August
2013. Sofia, Bulgaria.
ACL-WMT13 QUALITY ESTIMATION TASK (no reference translations):
Task 1.1: sentence-level EN-ES quality estimation
Task 1.2: system selection, EN-ES, EN-DE (new)
Task 2: word-level QE, EN-ES, binary and multi-class classification (new)
We design a novel EN-ES POS tagset mapping and the metric EBLEU for task 1.1.
We explore Naïve Bayes and Support Vector Machines for task 1.2.
We achieve the highest F1 score in task 2 using Conditional Random Fields.

Designed POS tagset mapping of Spanish (TreeTagger) to the universal tagset
(Petrov et al., 2012)

$$EBLEU = LP \cdot \exp\Big(\sum_{n=1}^{N} w_n \log\big(Hm(P_n, R_n)\big)\Big)$$

$$LP = \begin{cases} \exp(1 - \frac{r}{c}), & c < r \\ 1, & c \geq r \end{cases}, \qquad P_n = \frac{\#ngrams\_matched}{\#ngrams\_in\_output}, \quad R_n = \frac{\#ngrams\_matched}{\#ngrams\_in\_reference}$$

Bayes rule:

$$P(c \mid x_1, x_2, \dots, x_n) = \frac{P(x_1, x_2, \dots, x_n \mid c)\, P(c)}{P(x_1, x_2, \dots, x_n)}$$

SVM: find the point with the smallest margin to the
hyperplane, then maximize this margin:

$$\arg\max_{w, b} \left\{ \frac{1}{\|w\|} \min_n \big[ t_n (w^{T} \phi(x_n) + b) \big] \right\}$$
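A minimal scikit-learn sketch of the two classifiers above on toy data; the real WMT13 systems use the feature sets designed in the paper, so these feature values and labels are only placeholders:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Toy stand-in features per candidate translation, e.g. a length ratio
# and a language-model score; labels mark the preferred system output.
X = np.array([[0.9, -12.3], [1.1, -8.7], [0.5, -20.1], [1.0, -9.9]])
y = np.array([1, 1, 0, 1])

nb = GaussianNB().fit(X, y)           # applies the Bayes rule above under a
                                      # conditional-independence assumption
svm = SVC(kernel="linear").fit(X, y)  # maximum-margin separating hyperplane

print(nb.predict([[0.8, -11.0]]), svm.predict([[0.8, -11.0]]))
```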

Conditional random fields:

$$p(y \mid x) \propto \exp\Big( \sum_{j} \lambda_j\, t_j(y_{i-1}, y_i, x, i) + \sum_{k} \mu_k\, s_k(y_i, x, i) \Big)$$
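For the word-level QE task, a hedged sketch of a linear-chain CRF using the sklearn-crfsuite package; the toolkit, labels, and toy features here are assumptions for illustration, while the actually designed features are the ones listed below:

```python
import sklearn_crfsuite

def word_features(sent, i):
    # Toy per-token features; the paper's designed feature set differs.
    return {
        "word": sent[i],
        "prev": sent[i - 1] if i > 0 else "<s>",
        "next": sent[i + 1] if i < len(sent) - 1 else "</s>",
    }

sents = [["the", "cat", "sat"], ["a", "dog", "barked", "loudly"]]
labels = [["OK", "OK", "BAD"], ["OK", "OK", "OK", "BAD"]]  # binary word-level QE

X = [[word_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```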

Designed features for CRF & NB algorithms

ACL-WMT13 word-level quality estimation task results

Phrase Tagset Mapping for French and English Treebanks and Its
Application in Machine Translation Evaluation
Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho, Shuo
Li, Lynn Ling Zhu. In GSCL 2013. LNCS Vol. 8105, Volume Editors: Iryna
Gurevych, Chris Biemann and Torsten Zesch.
German Society for Computational Linguistics (oral presentation):
To facilitate future research in unsupervised induction of syntactic structures
We design French-English phrase tagset mapping
We propose a universal phrase tagset
Phrase tags extracted from the French Treebank and the English Penn Treebank
Explore the employment of the proposed mapping in unsupervised MT evaluation

Designed phrase tagset mapping for English and French

Evaluation based on parsing information from syntactic treebanks

Convert the word sequence into a universal phrase-tagset sequence

$$Score = Harmonic(w_1 F_1, w_2 F_2, w_3 F_3)$$

where the three factors $F_1$, $F_2$, $F_3$ are computed on the converted phrase-tag sequences, analogously to the LEPOR sub-factors above.

A Study of Chinese Word Segmentation Based on the Characteristics of


Chinese
Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho, Lynn Ling
Zhu, Shuo Li. Accepted. In GSCL 2013. LNCS Vol. 8105, Volume Editors:
Iryna Gurevych, Chris Biemann and Torsten Zesch.
German Society for Computational Linguistics (poster paper):
There are no word boundaries in Chinese text
Chinese word segmentation is a difficult problem
Word segmentation is crucial to the word alignment in machine translation
We discuss the characteristics of Chinese and design optimized features
We formalize some problems and issues in Chinese word segmentation

Automatic Machine Translation Evaluation with Part-of-Speech
Information
Aaron Li-Feng Han, Derek F. Wong, Lidia S. Chao, Yervant Ho. In TSD
2013. Plzen, Czech Republic. LNAI Vol. 8082, pp. 121-128. Volume
Editors: I. Habernal and V. Matousek. Springer-Verlag Berlin
Heidelberg.
Text, Speech and Dialogue 2013 (oral presentation):
We explore an unsupervised machine translation evaluation method
We design the hLEPOR algorithm for the first time
We explore the use of POS in unsupervised MT evaluation
Experiments are performed on English vs. French and German

Chinese Named Entity Recognition with Conditional Random Fields in the
Light of Chinese Characteristics
Aaron Li-Feng Han, Derek Fai Wong and Lidia Sam Chao. In Proceedings
of LP&IIS. M.A. Klopotek et al. (Eds.): IIS 2013, LNCS Vol. 7912, pp. 57-68,
Warsaw, Poland. Springer-Verlag Berlin Heidelberg.
Intelligent Information Systems 2013 (oral presentation):
Named entity recognition is important in IR, MT, text analysis, etc.
Chinese named entity recognition is more difficult because there are no word boundaries
We compare the performance of different algorithms: NB, CRF, SVM, ME
We analyze the characteristics of personal, location, and organization names respectively
We show the performance of different features and select the optimal set.

Ongoing and further work:

Combining translation and evaluation: tuning the
translation model with evaluation metrics
Evaluation models from the perspective of semantics
Exploring unsupervised evaluation models that
extract features from both source and target languages

In fact, evaluation work is closely related to similarity
measurement. So far I have employed these methods only
in MT evaluation, but they can be further developed for
other areas:

information retrieval
question answering
search
text analysis
etc.

Aaron L.-F. Han, 2013.08

Q and A
Thanks for your attention!

1. Weaver, Warren.: Translation. In William Locke and A. Donald Booth, editors,


Machine Translation of Languages: Fourteen Essays. John Wiley and Sons, New
York, pages 15-23 (1955)
2. Marino B. Jose, Rafael E. Banchs, Josep M. Crego, Adria de Gispert, Patrik Lambert,
Jose A. Fonollosa, Marta R. Costa-jussa: N-gram based machine translation,
Computational Linguistics, Vol. 32, No. 4. pp. 527-549, MIT Press (2006)
3. Och, F. J.: Minimum Error Rate Training for Statistical Machine Translation. In
Proceedings of (ACL-2003). pp. 160-167 (2003)
4. Su Hung-Yu and Chung-Hsien Wu: Improving Structural Statistical Machine Translation
for Sign Language With Small Corpus Using Thematic Role Templates as
Translation Memory, IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE
PROCESSING, VOL. 17, NO. 7, SEPTEMBER (2009)
5. Xiong D., M. Zhang, H. Li: A Maximum-Entropy Segmentation Model for Statistical
Machine Translation, Audio, Speech, and Language Processing, IEEE Transactions
on, Volume: 19, Issue: 8, 2011 , pp. 2494- 2505 (2011)
6. Carl, M. and A. Way (eds): Recent Advances in Example-Based Machine Translation.
Kluwer Academic Publishers, Dordrecht, The Netherlands (2003)

7. Koehn P.: Statistical Machine Translation, (University of Edinburgh), Cambridge


University Press (2010)
8. Arnold, D.: Why translation is difficult for computers. In Computers and Translation:
A translator's guide. Benjamins Translation Library (2003)
9. Carroll, J. B.: An experiment in evaluating the quality of translation, Pierce, J.
(Chair), Languages and machines: computers in translation and linguistics. A report
by the Automatic Language Processing Advisory Committee (ALPAC), Publication
1416, Division of Behavioral Sciences, National Academy of Sciences, National Research
Council, pages 67-75 (1966)
10. White, J. S., O'Connell, T. A., and O'Mara, F. E.: The ARPA MT evaluation
methodologies: Evolution, lessons, and future approaches. In Proceedings of the
Conference of the Association for Machine Translation in the Americas (AMTA
1994). pp 193-205 (1994)
11. Su Keh-Yih, Wu Ming-Wen and Chang Jing-Shin: A New Quantitative Quality
Measure for Machine Translation Systems. In Proceedings of the 14th International
Conference on Computational Linguistics, pages 433-439, Nantes, France, July
(1992)

12. Tillmann C., Stephan Vogel, Hermann Ney, Arkaitz Zubiaga, and Hassan Sawaf:
Accelerated DP Based Search For Statistical Translation. In Proceedings of the 5th
European Conference on Speech Communication and Technology (EUROSPEECH97)
(1997)
13. Papineni, K., Roukos, S., Ward, T. and Zhu, W. J.: BLEU: a method for automatic
evaluation of machine translation. In Proceedings of the (ACL 2002), pages 311-318,
Philadelphia, PA, USA (2002)
14. Doddington, G.: Automatic evaluation of machine translation quality using ngram
co-occurrence statistics. In Proceedings of the second international conference
on Human Language Technology Research(HLT 2002), pages 138-145, San Diego,
California, USA (2002)
15. Turian, J. P., Shen, L. and Melanmed, I. D.: Evaluation of machine translation
and its evaluation. In Proceedings of MT Summit IX, pages 386-393, New Orleans,
LA, USA (2003)
16. Banerjee, S. and Lavie, A.: Meteor: an automatic metric for MT evaluation with
high levels of correlation with human judgments. In Proceedings of ACL-WMT,
pages 65-72, Prague, Czech Republic (2005)

17. Denkowski, M. and Lavie, A.: Meteor 1.3: Automatic metric for reliable optimization
and evaluation of machine translation systems. In Proceedings of (ACL-WMT),
pages 85-91, Edinburgh, Scotland, UK (2011)
18. Snover, M., Dorr, B., Schwartz, R., Micciulla, L. and Makhoul, J.: A study of
translation edit rate with targeted human annotation. In Proceedings of the Conference
of the Association for Machine Translation in the Americas (AMTA), pages
223-231, Boston, USA (2006)
19. Chen, B. and Kuhn, R.: Amber: A modified BLEU, enhanced ranking metric. In
Proceedings of (ACL-WMT), pages 71-77, Edinburgh, Scotland, UK (2011)
20. Bicici, E. and Yuret, D.: RegMT system for machine translation, system combination,
and evaluation. In Proceedings ACL-WMT, pages 323-329, Edinburgh,
Scotland, UK (2011)
21. Shawe-Taylor, J. and Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge
University Press (2004)
22. Wong, B. T-M and Kit, C.: Word choice and word position for automatic MT
evaluation. In Workshop: MetricsMATR of the Association for Machine Translation
in the Americas (AMTA), short paper, 3 pages, Waikiki, Hawai'i, USA (2008)

23. Isozaki, H., Hirao, T., Duh, K., Sudoh, K., and Tsukada, H.: Automatic evaluation
of translation quality for distant language pairs. In Proceedings of the 2010
Conference on (EMNLP), pages 944-952, Cambridge, MA (2010)
24. Talbot, D., Kazawa, H., Ichikawa, H., Katz-Brown, J., Seno, M. and Och, F.: A
Lightweight Evaluation Framework for Machine Translation Reordering. In Proceedings
of the Sixth (ACL-WMT), pages 12-21, Edinburgh, Scotland, UK (2011)
25. Song, X. and Cohn, T.: Regression and ranking based optimisation for sentence
level MT evaluation. In Proceedings of the (ACL-WMT), pages 123-129, Edinburgh,
Scotland, UK (2011)
26. Popovic, M.: Morphemes and POS tags for n-gram based evaluation metrics. In
Proceedings of (ACL-WMT), pages 104-107, Edinburgh, Scotland, UK (2011)
27. Popovic, M., Vilar, D., Avramidis, E. and Burchardt, A.: Evaluation without references:
IBM1 scores as evaluation metrics. In Proceedings of the (ACL-WMT),
pages 99-103, Edinburgh, Scotland, UK (2011)
28. Petrov S., Leon Barrett, Romain Thibaux, and Dan Klein: Learning accurate,
compact, and interpretable tree annotation. Proceedings of the 21st ACL, pages
433-440, Sydney, July (2006)

29. Callison-Burch, C., Koehn, P., Monz, C. and Zaidan, O. F.: Findings of the 2011
Workshop on Statistical Machine Translation. In Proceedings of (ACL-WMT), pages
22-64, Edinburgh, Scotland, UK (2011)
30. Callison-Burch, C., Koehn, P., Monz, C., Peterson, K., Przybocki, M. and Zaidan,
O. F.: Findings of the 2010 Joint Workshop on Statistical Machine Translation and
Metrics for Machine Translation. In Proceedings of (ACL-WMT), pages 17-53, PA,
USA (2010)
31. Callison-Burch, C., Koehn, P., Monz,C. and Schroeder, J.: Findings of the 2009
Workshop on Statistical Machine Translation. In Proceedings of ACL-WMT, pages
1-28, Athens, Greece (2009)
32. Callison-Burch, C., Koehn, P., Monz,C. and Schroeder, J.: Further meta-evaluation
of machine translation. In Proceedings of (ACL-WMT), pages 70-106, Columbus,
Ohio, USA (2008)
33. Avramidis E., Popovic, M., Vilar, D., Burchardt, A.: Evaluate with Confidence
Estimation: Machine ranking of translation outputs using grammatical features. In
Proceedings of the Sixth Workshop on Statistical Machine Translation, Association
for Computational Linguistics (ACL-WMT), pages 65-70, Edinburgh, Scotland, UK
(2011)
