
IWLeL 2004: An Interactive Workshop on Language e-Learning 15 - 23 15

Computer Assisted Language Learning Based on Corpora and
Natural Language Processing: The Experience of Project CANDLE
Jason S. CHANG (1) and Yu-Chia CHANG (2)
(1) Department of Computer Science, National Tsing Hua University, Taiwan
jschang@cs.nthu.edu.tw
(2) Department of Computer Science, National Tsing Hua University, Taiwan
u881222@alumni.nthu.edu.tw

Abstract
This paper describes Project CANDLE, an ongoing 3-year project which uses various corpora
and NLP technologies to construct an online English learning environment for learners in
Taiwan. This report focuses on the interim results obtained in the first eighteen months. First, an
English-Chinese parallel corpus, Sinorama, was used as the main course material for reading,
writing, and culture-based learning courses. Second, an online bilingual concordancer,
TotalRecall, and a collocation reference tool, TANGO, were developed based on Sinorama and
other corpora. Third, many online lessons, including extensive reading, verb-noun collocations,
and vocabulary, were designed to be used alone or together with TotalRecall and TANGO.
Fourth, an online collocation check program, MUST, was developed for detecting V-N
miscollocation and suggesting adequate collocates in student’s writings based on the hypothesis
of L1 interference and the database of BNC and the bilingual Sinorama Corpus. Other
computational scaffoldings are under development. It is hoped that this project will help
intermediate learners in Taiwan enhance their English proficiency with effective pedagogical
approaches and versatile language reference tools.

Keywords: computer assisted language learning, corpus, concordance, natural language
processing, collocation

1 Introduction
Researchers and teachers in the field of Computer Assisted Language Learning (CALL) have been
working on harnessing speech and natural language processing technology and Internet resources to
revitalize traditional language learning. They have also explored new pedagogy made possible by
computers and the Internet. The first goal can be met by an adaptive CALL system which provides a
learning environment that makes systematic and ongoing adjustment based on individual differences of
learners. Drill sessions for different language proficiency levels with feedback and guidance are
accessible online to facilitate learning of structural knowledge. As a new pedagogy, digitalized corpora
are used to facilitate inductive data-driven language learning in ways that have not been possible in the
past. Language data is important for learning because it activates learners’ mental mechanisms and
becomes essential input for second or foreign language acquisition. Various language learning activities
or tasks that include listening, speaking, reading, writing, translation or a combination of two or more
skills can be constructed based on various corpora and adaptive/automatic tools to achieve the goal of
computational scaffolding.
Computers, corpora, and computational linguistics have an increasing role to play in ELT and
e-Learning. This new combination of texts and technology has forever changed the way we study, teach,
and learn a language. We can take full advantage of the “New Way” only if we embrace new pedagogy
and Natural Language Processing technology. NLP technology enables us to compile better dictionaries,
concordancers, and collocation aids from corpora of authentic text. NLP technology offers a
microscopic look at trouble spots of L2 learning in a learner corpus and suggests remedies. NLP
technology even helps teachers prepare better reading comprehension tests in less time.
And most importantly, NLP technology makes possible a new pedagogy that encourages
student-centered, inductive, and culture-based learning. This paper will describe the first-year
experience we have with Project CANDLE under the National Science and Technology Program for
Digital Learning. This three-year project for advancing Computer Assisted Language Learning (CALL)
draws on corpora and NLP tools to provide a digital English learning environment for intermediate
learners in Taiwan. It is thus named Corpora And NLP for Digital Learning of English (CANDLE). The
project integrates expertise in four research areas: (1) NLP technologies and applications, (2) An
intelligent self-access English reading environment, (3) English learning through writing and translation,
and (4) Bilingual corpus and culture-based English learning. Visit http://candle.cs.nthu.edu.tw/candle/
to try out an array of language resources, computational tools, and web-based learning units.
The project is unique, in Taiwan and internationally, in that it draws its language resources mainly from a
bilingual parallel corpus, the Sinorama corpus, and builds on learners’ first language background
knowledge to empower learners with culture-based materials. Both the first language and its culture provide
a scaffold for learners to use while learning a new language. The project aims to provide various
types of English learning activities that emphasize structural knowledge and complex problem-solving
learning (Chan, et al., 2001): reading, writing, listening, speaking, and translation. Its major
features include online practice that adapts to learners’ levels, automatic assessment and monitoring of
learners’ progress, and profiling of learners’ preferences and levels of proficiency. The rest of the paper
gives a detailed description of the CANDLE project and its four sub-projects, then describes the
corpora processing and NLP tools developed so far, and ends with a conclusion.
Successful research effort on digital language learning requires close collaboration between
computer engineers, teachers, and content experts. In this three-year CANDLE project, ten researchers
from the research areas of computer science (specifically NLP and speech recognition) and English
teaching (CALL), have been working together with the aim of leveraging cutting edge NLP tools and
corpora processing tools to advance English learning for students in Taiwan. The CANDLE website
(http://candle.cs.nthu.edu.tw/) will be used by students from six participating institutions associated
with the research team. Attempts are being made to build and assess an English learning environment
meeting the needs of local students.

2 Description of the CANDLE project and its initial achievements


The main goal of the CANDLE project is achieved through the collaboration of four subprojects:

(1) Natural Language Processing and Assessment Tools,
(2) An Intelligent Self-Access Reading Environment,
(3) Learning a Foreign Language Through Written Exercises and Translation,
(4) Bilingual Corpus and Culture-based Language Learning.

The first subproject provides essential and advanced NLP tools and activity tracking mechanisms that
the other three subprojects use to facilitate and monitor online learning. The second subproject focuses on
the construction and assessment of an intelligent self-access reading environment that adapts to learners’
English levels. The third subproject explores the potential of learning English through
writing and translation exercises. The fourth subproject uses the bilingual corpus to enhance
culture-based English learning, an area that has not yet been fully explored. All four subprojects are
innovative for natural language processing and English learning. The first subproject also produces the
much-needed content-related digital technology for the other three subprojects, which
carry out fundamental research on digital learning strategies and behavior and assess the
usefulness of the proposed approach to advanced digital learning of English (see Figure 1 for
an illustration of the role of the first subproject).

Our objectives and anticipated progress in the three-year project are planned in three stages: in the
first year, we will work on web-based learning material development; in the second year, formative
evaluation; in the third year, summative evaluation.

[Diagram: the CANDLE interface and a computer assisted management system (1st year) connect four modules.
NLP: TotalRecall, Tango, collocation checker, speech recognizer.
Reading: self-access module, text grader, speedy reading, strategy trainer; supported by TotalRecall.
Writing: writing with TotalRecall, collocation practice; supported by TotalRecall, Tango, and the collocation checker.
Culture: culture courses, Candle talk; supported by TotalRecall.]

Figure 1 Integration of the Four Sub-projects with its Respective Modules

3 Corpora Processing and NLP Tools


Corpus linguists study real texts, using explicit algorithms to extract linguistic knowledge from corpora.
An important function of corpora in the language classroom is to provide the learners with concentrated
exposure to particular patterns of repetition. With the use of corpus tools, language learners can avoid
unhelpful reliance on oversimplified 'rules' prepackaged by the teacher; instead they develop
proficiency through focused, purposeful exposure to, and use of, language in specific contexts (Teubert,
1996).

3.1 Sinorama, Main Material for Reading, Writing, and Culture-based Learning Courses
A parallel corpus is a collection of "parallel" texts in different languages or in different varieties of a
language. One bilingual Chinese-English electronic corpus in Taiwan is Sinorama, a
monthly magazine published for over three decades by the Government Information Office (GIO) in Taiwan. Among
several of the magazines published by the GIO, Sinorama remains the most popular because it includes
insightful reports on the life styles, society, economy and cultures related to the people in Taiwan. The
topics of the articles in Sinorama include the following: art achievement, literature, painting and
calligraphy, film, dance, music, architecture, museums, traditional opera, handicrafts, clothing and
accessories, stories told in Chinese paper cuts, drama, Taiwanese culture, and influx of Western culture.
The reasons for reading articles in Sinorama are many; the most important is that students are allowed to
interact with the authentic texts that they are reading. Using Sinorama also gives the students
opportunities to read articles about their own culture through the texts that are specific to the Taiwanese
context. In addition, this can familiarize the students with the writing style of Sinorama and prepare
them to utilize a concordancer that is based on this corpus.

A: Database selection B: English query C: Mandarin query D: Number of entries per page
E: Normal F: Clustered summary according to translation
H: Order by I: Submit button J: Page index K: English citation L: Mandarin citation
N: All citations in the cluster O: Full text context P: Paragraph context

Figure 2 The results of searching for “example”

The current project, Corpora and Natural language processing for Digital Learning of English
(CANDLE) aims to use a range of corpora and advanced natural language processing tools to revitalize
traditional English learning for intermediate learners in Taiwan via activities of reading, translation and
culture.

3.2 Online Bilingual Concordancer and Collocation Reference Tool

3.2.1 TotalRecall
TotalRecall is a Chinese-English bilingual concordancer built on the Sinorama parallel corpus (Wu, et al.,
2003). It involves bilingual sentence, word, and phrase alignment, sophisticated queries that
meet learners’ various search needs, and ranking of results before display. It can display (see Figures 2
and 3):

a) Collocation information in a concordance
b) Four levels of context for citation: sub-sentential, sentential, paragraph, and text
c) Highlighting of accurate word and phrase level alignment of translation equivalents.
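
The core idea of a bilingual concordance can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TotalRecall's implementation: the three sentence pairs are invented stand-ins for sentence-aligned text, the names `AlignedPair` and `concordance` are hypothetical, and the sketch does plain substring matching with no word-level alignment, highlighting, or ranking.

```python
from dataclasses import dataclass

@dataclass
class AlignedPair:
    english: str
    chinese: str

# Invented sentence pairs standing in for sentence-aligned text (not Sinorama data).
CORPUS = [
    AlignedPair("This is a good example of local craft.", "這是本地工藝的好例子。"),
    AlignedPair("The museum opened last year.", "這座博物館去年開幕。"),
    AlignedPair("For example, tea is grown in the hills.", "例如，茶種植在山上。"),
]

def concordance(query, corpus, width=20):
    """Return (left context, keyword, right context, translation) for each English hit.

    Matching is a case-insensitive substring search, so a query such as
    "example" would also match "examples".
    """
    hits = []
    q = query.lower()
    for pair in corpus:
        text = pair.english
        start = text.lower().find(q)
        if start == -1:
            continue
        end = start + len(q)
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        hits.append((left, text[start:end], right, pair.chinese))
    return hits

# Print keyword-in-context lines with the aligned Chinese sentence alongside.
for left, kw, right, zh in concordance("example", CORPUS):
    print(f"{left:>20}[{kw}]{right:<20} | {zh}")
```

Showing each hit next to its aligned translation is what distinguishes a bilingual concordance from a monolingual one; the sub-sentential, paragraph, and text levels of context listed above would require alignment data that this sketch does not model.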

3.2.2 TANGO
With the collocation types and instances extracted from the corpus, we built an online collocational
concordancer called TANGO for looking up collocation instances and translations (Jian, 2004). A user
can type in any English word as a query and select the expected part of speech of the accompanying
words. For example, in Figure 4, after the query “influence” is submitted, the possible collocates
are displayed on the results page. The user can even select different adjacent collocates for further
investigation. Moreover, using the technique of bilingual collocation alignment and sentence alignment,
the system will display the target collocation and its translation equivalents highlighted in different
sentential contexts. Through this browser-based interface, translators or learners can easily gain access
to the usage of each collocation with relevant instances. This may help learners speed up their
internalization process. This bilingual collocational concordancer could be a very useful tool for
self-directed inductive learning by intermediate or advanced English learners.
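
The query step described above, a headword plus the part of speech of its collocates, can be illustrated with a small frequency index. This is a minimal sketch, not TANGO's actual data model: the instance rows, the part-of-speech tags `V` and `ADJ`, and the functions `build_index` and `lookup` are all assumptions made for the example, and the bilingual alignment step is omitted.

```python
from collections import Counter, defaultdict

# Hypothetical extractor output: (headword, collocate, collocate part of speech).
# These rows are invented for illustration; they are not corpus counts.
INSTANCES = [
    ("influence", "have", "V"), ("influence", "exert", "V"),
    ("influence", "have", "V"), ("influence", "wield", "V"),
    ("influence", "have", "V"), ("influence", "exert", "V"),
    ("influence", "strong", "ADJ"), ("influence", "profound", "ADJ"),
    ("influence", "strong", "ADJ"),
]

def build_index(instances):
    """Count collocates per (headword, collocate part of speech) key."""
    index = defaultdict(Counter)
    for headword, collocate, pos in instances:
        index[(headword, pos)][collocate] += 1
    return index

def lookup(index, headword, pos, top_n=5):
    """Return up to top_n collocates of the requested part of speech, most frequent first."""
    return index.get((headword, pos), Counter()).most_common(top_n)

INDEX = build_index(INSTANCES)
print(lookup(INDEX, "influence", "V"))
# [('have', 3), ('exert', 2), ('wield', 1)]
```

A real collocation tool would typically rank by an association measure rather than raw frequency; raw counts are used here only to keep the sketch short.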

Figure 3. Chinese-English Bilingual Concordance, TotalRecall

3.3 Online Lessons that work with TotalRecall and TANGO


In addition to the development of NLP tools and the planning of a learner corpus compilation, four English
teaching master’s theses have been completed, and more are under way. These theses shed light on the development
of NLP tools or online instructional units and provide a curriculum model that other
English teachers can use to infuse CANDLE into their own instructional contexts. The thesis
topics include:

(1) Subsentential alignment of bilingual corpus by interleaving text and punctuation matches.
(2) Automatic acquisition of VN and other types of collocation in free texts.
(3) Effects of automatic essay grading and bilingual concordancing on college students’ EFL
writing: This thesis investigates how a commercial online essay grader, My Access, and
TotalRecall can help college English students’ writing, revision, and error-correction. It will
provide a curriculum model. Post-tests have shown that the students improved their score after
using these tools.
(4) Effects of CALL approaches on learning of college students’ English verb-noun collocation:
This thesis develops eight online instructional units for teaching English verb-noun collocation and
uses TotalRecall to investigate whether the two CALL approaches can help college students
learn collocations. It has both development and pedagogical implications.
(5) Effects of online extensive reading of English texts with controlled vocabulary on college
students’ incidental learning: This thesis uses some text processing programs to control
vocabulary difficulty level and number of new words appearing in a group of texts and
investigates whether such arrangement of text selection can help college English student
readers to acquire more new words. It has both development and pedagogical implications.
(6) The feasibility of using the Sinorama bilingual concordance in a culture awareness language
course for non-English-major students: This thesis integrates the use of TotalRecall into a
college English course and explores the learning process and product. It will provide a
curriculum model.
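
The vocabulary control described in topic (5) can be illustrated with a short sketch. The paper does not say how the thesis programs work, so everything here is an assumption: the greedy ordering strategy, the `max_new` threshold, and the function names `tokens`, `new_words`, and `order_texts` are hypothetical, and the sample texts are invented.

```python
import re

def tokens(text):
    """Lower-case word tokens; a deliberately simple tokenizer for illustration."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def new_words(text, known):
    """Words in `text` that are not yet in the learner's known vocabulary."""
    return {w for w in tokens(text) if w not in known}

def order_texts(texts, known, max_new=3):
    """Greedily order texts so that each introduces at most `max_new` unseen words.

    After a text is chosen, its new words count as known for later texts.
    Texts that would exceed the limit are left out.
    """
    known = set(known)
    remaining = list(texts)
    ordered = []
    while remaining:
        best = min(remaining, key=lambda t: len(new_words(t, known)))
        if len(new_words(best, known)) > max_new:
            break
        ordered.append(best)
        known |= new_words(best, known)
        remaining.remove(best)
    return ordered

TEXTS = [
    "The cat sat on the mat",
    "The dog sat on the mat near the door",
    "The cat sat",
]
print(order_texts(TEXTS, known={"the", "cat", "sat"}))
```

With these sample texts, the shortest text comes first because it adds no new words, and the longest comes last because two of its five unfamiliar words have been introduced by then.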

Figure 4 Web-based Collocational Concordance, TANGO

Preparation of online cultural materials for English learning and pilot testing of
Sinorama and TotalRecall in English teaching contexts at participating colleges are under way.

3.4 Online Collocation Check Program


We have also developed a web-based automatic collocation-detection system as an online aid for EFL
writers. It specifically tackles learners’ miscollocations attributable to L1 translation interference on the
verb collocate (Chang, 2004). The system provides appropriate collocations as feedback messages
according to the mutual translations between the learner’s L1 and L2. When a user inputs a V-N collocation,
the system derives a list of candidate English verbs that share the same Chinese translations
by processing bilingual corpora. After combining the noun with those candidate verbs to form V-N pairs, the
system uses a reference English corpus to exclude the inappropriate V-N pairs and so single
out the proper collocations. An example of correcting the misused collocation “publish album” is shown
in Figure 5. The system can promptly suggest the collocation that the learner intended
to write but misused. It is hoped that this online assistant can facilitate EFL learner-writers’ use of
collocations and help them transfer this knowledge to their future writing.
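
The three steps above (find verbs that share a Chinese translation, pair them with the noun, filter against a reference corpus) can be sketched as follows. This is an illustration under loud assumptions rather than the MUST implementation: the translation table, the verb-noun counts, the `min_count` threshold, and the function names are all invented for the example, and the numbers are not real corpus frequencies.

```python
# Hypothetical verb translation table, standing in for one derived from a bilingual corpus.
EN_TO_ZH = {
    "publish": {"出版", "發行"},
    "release": {"發行", "釋放"},
    "issue": {"發行", "發出"},
    "free": {"釋放"},
}

# Made-up verb-noun counts, standing in for a reference corpus such as the BNC.
REFERENCE_VN_COUNTS = {
    ("release", "album"): 42,
    ("issue", "album"): 1,
    ("publish", "book"): 120,
}

def candidate_verbs(verb):
    """English verbs that share at least one Chinese translation with `verb`."""
    zh = EN_TO_ZH.get(verb, set())
    return {v for v, trans in EN_TO_ZH.items() if v != verb and trans & zh}

def suggest(verb, noun, min_count=5):
    """Suggest replacement verbs for `noun`, or [] if the input pair is already attested."""
    if REFERENCE_VN_COUNTS.get((verb, noun), 0) >= min_count:
        return []
    scored = [(REFERENCE_VN_COUNTS.get((v, noun), 0), v) for v in candidate_verbs(verb)]
    # Keep only pairs the reference corpus supports, most frequent first.
    return [v for count, v in sorted(scored, reverse=True) if count >= min_count]

print(suggest("publish", "album"))
# ['release']
```

In this toy data, "publish" and "release" share the translation 發行, which is how the sketch arrives at "release album" for the misused "publish album". An attested pair such as "publish book" returns an empty list, meaning no correction is proposed.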

Figure 5 The Interface of Online Collocation Check Program, MUST

4 Conclusion
In the first year of Project CANDLE, we have built several NLP tools for CALL. These tools have been
used in various ELT research and teaching activities with promising results. We plan to develop more
tools based on NLP technologies to explore areas such as semi-automatic test generation and grading.
By emphasizing advanced NLP technologies, sound English pedagogical theories, and empirical
assessment of usability based on real learners, we are confident that we will reach our ultimate goal
of creating a digital learning environment that meets the real needs of English learners in Taiwan.

In the coming years, we will achieve the following goals via the CANDLE website:

(1) Providing access to the CANDLE learning center to as many students as we can reach.
(2) Providing empirical evidence or usability testing data to demonstrate CANDLE’s usefulness or
effectiveness.
(3) Exploring the possibilities of curriculum infusion in various universities or colleges for different
learners.

By putting natural language processing technologies to work with sound English pedagogical
theories and usability studies on real learners, we hope to advance the state of the art of English Language
Teaching. A bilingual corpus, bilingual concordancers, and browser-based interactive tests and exercises are
among the technologies we have developed. Additionally, we will make the CANDLE
environment meet the English learning needs of local learners by attending to their specific difficulties.
Evaluation methods such as psychometric measures in a comparison design, discourse analysis, and
portfolio assessment will be applied in the third year to advance the understanding of learners’ behavior when
they work online. We envision that learners will be capable of the complex problem solving needed to
network with foreign language users in other countries. We hope to achieve the goal of promoting
learner autonomy and life-long learning, so as to enable learners to fully participate in the English
speaking discourse community.

Acknowledgements
This work was supported by research grants from the National Science Council under projects
NSC92-2524-S007-002 and NSC 93-2524-S-007-002.

References
Chan, T. W., Hue, C. W., Chou, C. Y., & Tzeng, O. J. L. 2001. Four spaces of network learning models.
Computers & Education, 37, 141-161.
Chang, J.S., David Yu, Chun-Jun Lee. Statistical Translation Model for Phrases, Vol. 6, No. 2, pp. 43-64, 2001.
Chang, Richard, T-P Chen, Jason S. Chang. 2004. An Automatic Collocation Writing Assistant for Taiwanese
EFL Learners: Using Corpora for language teaching and learning based on NLP Technology, EUROCALL.
Chuang, Thomas C., and Jason S Chang, 2002. Adaptive Sentence Alignment based on Length and Lexical
Information, In Proceedings of the 40th Annual Meeting of Association for Computational Linguistics, Comp.
Volume, 91-92.
Chuang, Thomas C., Jian-Cheng Wu, Tracy Lin, Wen-Chi Shei and Jason S. Chang, “Bilingual Sentence
Alignment Based on Punctuation Statistics and Lexicon,” Proceedings of the First International Joint
Conference on Natural Language Processing, IJCNLP-04, pp. 644-651, Hainan Island, China, Jan 2004.
Chuang, Thomas C., NG You, and Jason S Chang, 2002. Adaptive Sentence Alignment, Proceedings of the Fifth
Conference of the Association for Machine Translation in the Americas, AMTA'2002, Tiburon, California.
Conzett, J. (2000). Integrating collocation into a reading and writing course. In Lewis, M. (Ed.), Teaching
collocation: Further developments in the lexical approach (pp. 70-86). London: Language Teaching
Publications.
Farghal, M. & Obiedat, H. (1995). Collocations: A neglected variable in EFL. International Review of Applied
Linguistics, 33, 313-331.
Gale, W. & K. W. Church, "A Program for Aligning Sentences in Bilingual Corpora" Proceedings of the 29th
Annual Meeting of the Association for Computational Linguistics, Berkeley, CA, 1991.
Granger, S. 2003. Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal, 20(3),
465-480.
Jian, J. Y., Chang, Y. C., & Chang, J. S. 2004. Collocational Translation Memory Extraction Based on Statistical
Linguistic Information, Paper presented in ROCLING 2004, Conference on Computational Linguistics and
Speech Processing, Taipei.
Jian, Jia-Yan, Yu-Chia Chang, Jason S. Chang. “TANGO: Bilingual Collocational Concordancer,” Proceedings of
the 42nd Annual Meeting of Association for Computational Linguistics, Comp. Vol., 2004.
Lee, S. H. (2003). ESL learners’ vocabulary use in writing and the effects of explicit vocabulary instruction,
System, 31, 537-561.
Lewis, M. (2000). Teaching collocation: Further development in lexical approach. London: Language Teaching
Publications.
Lin, T., C.J. Wu, and J.S. Chang. 2003 Word Transliteration Alignment, Proceedings of the fifteenth Research on
Computational Linguistics Conference, ROCLING XV, Hsinchu.
Liou, H. C., et al. 2003. Using corpora and computational scaffolding to construct an advanced digital English
learning environment: The CANDLE project. The 7th Int’l Conference on Multimedia Language Education,
Chia-Yi, December 19-21.
Liu, C. P. (1999). An analysis of collocation errors in EFL writings. The proceedings of the Eighth International
Symposium on English Teaching (pp. 483-494). Taipei: Crane.
Liu, C. P. (2000). A study of strategy use in producing lexical collocations. Selected Papers from the Ninth
International Symposium on English Teaching (pp. 481-492). Taipei: Crane.
Liu, L. E. (2002). A corpus-based lexical semantic investigation of verb-noun miscollocations in Taiwan learners’
English. Unpublished master’s thesis, Tamkang University, Taipei, January.
Macklovitch, E., Simard, M., Langlais, P.: TransSearch: A Free Translation Memory on the World Wide Web.
Proc. LREC 2000 III, 1201--1208 (2000).
Melamed, I. D. 1997. A Word-to-Word Model of Translational Equivalence. Proc. of the ACL97. pp 490-497.
Madrid Spain, 1997.
Mitkov, Ruslan, and Le An Ha 2003, Computer-Aided Generation of Multiple-Choice Tests, In Proceedings of
HLT-NAACL 2003.
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.
Nattinger, J. R., & DeCarrico, J. D. (1992). Lexical phrase and language teaching. Oxford: Oxford University
Press.
Nesselhauf, N (2003). The use of collocations by advanced learners of English and some implications for teaching.
Applied Linguistics, 24, 223-242.
Shei, C. C., & Pain, H. (2000). An ESL writer’s collocation aid. Computer Assisted Language Learning, 13,
167-182.
Simard, M., G. Foster & P. Isabelle (1992), Using cognates to align sentences in bilingual corpora. In Proceedings
of TMI92, Montreal, Canada, pp. 67-81.
Teubert, W. 1996. Comparable or Parallel Corpora? International Journal of Lexicography Vol. 9, Number 3, pp.
238-265.
Teubert, W. 1996. Why corpus linguistics? International Journal of Corpus Linguistics, 1(1).
Teubert, W. 2003. Parallel Corpora and Language Learning, 12th International Symposium on English Teaching,
Taipei.
Wang, Yi-Chia, Jian-Cheng Wu, Tyne Liang, Jason S. Chang. 2004. Using the Web as Corpus for Un-supervised
Learning in Question Answering, to appear in ROCLING XVI: Conference on Computational Linguistics and
Speech Processing, September 2-3, 2004, Howard Pacific Green Bay, Taipei, Taiwan, ROC.
Whitelock, P., & Edmonds, P. 2000. The Sharp intelligent dictionary. Proceedings of the 9th EURALEX, pp.
871-876.
Wu, Chien-Cheng, and Jason S. Chang. Bilingual Collocation Extraction Based on Syntactic and Statistical
Analyses, Computational Linguistics and Chinese Language Processing, Vol. 9, No. 1, 2004, pp. 1-20.
Wu, CJ and J.S. Chang. Alignment of Collocation via Syntactic and Statistical Analyses. Proceedings of the
fifteenth Research on Computational Linguistics Conference, ROCLING XV, Hsinchu.
Wu, CJ, K. Yeh, T.C. Chuang, W.C. Shei and Jason S. Chang. 2003. ‘TotalRecall: A Bilingual Concordance for
Computer Assisted Translation and Language Learning,’ In Proceedings of the 41st Annual Meeting of
Association for Computational Linguistics, Comp. Volume, 201-204.
Wu, Dekai (1994), Aligning a parallel English-Chinese corpus statistically with lexical criteria. In The
Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico, USA,
pp. 80-87.
Wu, J.C., Thomas C. Chuang, Wen-Chi Shei and Jason S. Chang. “Subsentential Translation Memory for
Computer Assisted Writing and Translation,” Proceedings of the 42nd Annual Meeting of Association for
Computational Linguistics, Comp. Vol. 2004.
Zhang, X. (1993). English collocations and their effect on the writing of native and non-native college freshmen.
Ph.D. thesis, Indiana University of Pennsylvania.
