Você está na página 1de 6

Nine Terminology Extraction Tools: Are they useful for translators?

Hernani Costa, Anna Zaretskaya, Gloria Corpas Pastor and Miriam Seghiri
University of Malaga
Malaga, Spain
{hercos,annazar,g.corpas,seghiri}@uma.es

Abstract such broad range of applications, these tools are


often designed for one specific purpose, which
Terminology extraction tools have become consequently makes their usage challenging
an indispensable resource in education, when employed in a different setting.
research and business. Today, users
can find a great variety of terminology One of the most important areas where
extraction tools of all kinds, and they terminology extraction is extremely helpful is
all offer different features. Apart from in the translation industry. Today, more and
many other areas, these tools are especially more language service providers (LSP) as well as
helpful in the professional translation freelance translators and interpreters understand
setting. We do not know, however, if the benefits of automatizing terminology tasks.
the existing tools have all the necessary
It not only allows them to quickly identify the
features for this kind of work. In search for
the answer, we make an overview of nine domain of the documents they are dealing with,
selected tools available on the market and but also to easily find words and phrases that need
find out if they provide the translators’ most to be paid special attention to. While translating
favourite features. terminological units, in many cases it is necessary
to consider the domain and look up the term
equivalents in special resources like terminology
1 Terminology extraction tools and their databases. And in addition, it helps maintain
areas of application terminological consistency throughout the project
The purpose of terminology extraction tools between all the parts involved: the translator, the
(TET) is to help users build terminological LSP and the client.
resources in a (semi-)automatic way. The Apart from saving time, another significant
need for such resources comes mostly from advantage of using TET instead of manual
the growing needs in information management terminology search consists in the possibility to
and translation, which make it more and more specify different search criteria, which allows to
necessary to have some automated assistance adapt the search query to a particular task. This
when performing terminology-related tasks. allows users to see all kinds of information they
Companies, freelancers and professionals in need about the term, and also to narrow the search
various linguistic fields can resort to these tools and filter the results depending on what they are
to, for example build glossaries, thesauri and looking for. As an example, many state-of-the-
terminological dictionaries that they use directly art TET offer a possibility to see linguistic and
in their work. Moreover, TE is embedded in statistic information about the term, the context
a number of natural language processing and where it appears, specify the number of words
linguistic research tasks, such as automatic in the term, and many other useful features.
indexing, machine translation, information Unfortunately, not every TET offers a full set of
extraction, creation of ontologies and knowledge desirable features and settings, which makes it
bases, and corpus analysis. Although they have sometimes challenging to find the perfect tool for
the task in hand. Apart from the functionalities export glossaries from and to different technology
they offer, TET also differ as to the environment environments. In addition, its integration with
they work in. For instance, standalone installable SDL Multiterm gives access to many convenient
tools require an installation process and work as term-management functions, such as manually
independent computer programs. There also exist adding a variety of meta-data information to the
web-based tools, which work within a browser. terms, such as synonyms, context, definitions,
And finally, there are reusable software that illustrations, part-of-speech tags, URLs, etc., and
facilitates the development of larger applications, searching not only the indexed terms but also
called frameworks. their descriptive fields.
Considering the existing variety, it is not
Simple Extractor as its name implies, offers
clear how a professional translator is to proceed
significantly less functionalities compared to the
when choosing a TET suitable for the job. As
previous tool. It is a commercial TET developed
we will see further, there are some TET that
by DAIL Software S.L. for Mac OS, Linux and
are specifically created for translators. But do
Windows platforms. This clean and easy-to-
they have all the necessary characteristics for
use standalone Java application was designed to
translators? And, furthermore, what exactly are
automatically extract the most frequent words
these characteristics?
and multi-word terms from English, Portuguese,
2 Standalone Terminology Extraction Spanish, French and Russian documents. Simple
Tools Extractor not only permits to extract a list of
terms (from unigrams up to seven-grams), but
Standalone software is probably the most popular also specify the minimum and maximum number
type of software today, and TET are no exception. of occurrences of a term. Moreover, Simple
Standalone TET are tools that can be installed on Extractor offers an option to load stopword lists,
the computer and operate independently of any an advanced search functionality that permits
other device or system. to search through the extracted list of terms,
to explore all the contexts that a specific term
SDL MultiTerm Extract is one of such appears, to edit the term text, to filter the extracted
applications. It is a component of SDL terms according to the number of words that form
MultiTerm, a commercial terminology them, and to sort the displayed output by any of its
management tool that provides one solution fields (frequency, term and context in alphabetical
to store, extract and manage multilingual order). Finally, Simple Extractor permits to print
terminology. Multiterm exists as a standalone out or export to a file (.pdf, .doc, .csv or .txt) all
application, and can also be integrated in SDL the extract terms, as well as their frequencies and
Trados Studio. It is one of the few tools that were corresponding contexts.
designed specifically to be used by translators
and is probably the most well-known TET in TermSuite is an open-source and platform-
the translation industry. This TE system locates independent TET written in Java and distributed
potential monolingual and bilingual terminology under the Apache License 2.0. It was developed
in documents and translation memories using a within the scope of the TTC (Terminology
statistic-based method. The user can validate Extraction, Translation Tools and Comparable
the extracted candidate terms by looking at a Corpora) project, whose purpose was to design a
monolingual or bilingual concordance. A big tool capable of extracting bilingual terminology
advantage of this tool is its support for any from comparable corpora in seven languages:
language, including Unicode languages. In English, French, German, Spanish, Chinese and
addition, it offers a number of functionalities Russian. TermSuite’s architecture is composed by
that are useful in different translation scenarios, 3-step modules: the Spotter, the Indexer and the
such as ability to compile a dictionary from Aligner. The Spotter module is responsible for
parallel texts; flexible filtering that ensures preprocessing the input monolingual corpus, i.e.,
that only the most frequent candidate terms it performs tokenization, part-of-speech tagging,
are extracted; possibility to store an unlimited stemming and lemmatization. Then, the Indexer
number of terms in any language; import and module uses both a statistic and a linguistic-based
approach to extract monolingual terminology whether to extract only single words (keywords)
from a monolingual corpus processed by the or multi-word terminological units (terms). In the
Spotter. Finally, the Aligner computes the output, the user can see the keywords or terms,
translation of a source terminology into a target links to the five most relevant Wikipedia articles
language. The source and target terms required for each of them, the term’s score, its frequency
are these already computed by the Indexer in the searched corpus, and its frequency in the
module, which means that the previous two reference corpus. There are a variety of search
steps should be repeated for the target language. options that can be tuned. For instance, the
The user can choose from several alignment user can choose a different reference corpus,
options, such as the selection of the maximum decide whether search for words or lemmas,
number of translation candidates for a given and accentuate low or high-frequency keywords
source term, the use of similarity measures to according to the preferences. The output can
compare the contexts of the term in the source be downloaded as a TBX or CSV file. In
and the target languages, amongst other advanced order to perform multilingual term extraction the
settings. Once all the parameters are set, it is user needs to upload a TMX file with a parallel
possible to view and explore all the translation corpus aligned on the sentence or paragraph level.
candidates ranked according to their similarity The terminology is first extracted within each
score within the tool or use the output XML file language resulting in lists of candidate terms. In
for other purposes. the second step, the system searches for such
pairs of candidates which co-locate in the parallel
3 Web-Based Terminology Extraction documents most often. The resulting list of
Tools candidate pairs (terms in two languages) is then
presented to the user. Results can be saved in a
Although standalone TET still are predominant
TBX or TXT file, which is especially convenient
on todays TE applications market, the future
for computer-assisted translation tool users.
web-based TE technologies will certainly evolve
by migrating all standalone features to a web- Translated s.r.l. a leading LSP developed a
based environment, which will allow them to web-based tool that can be accessed directly
consequently take over the leadership in the near on the company’s website. It was created in
future. As we will see, there are already some order to help translators with their translation
examples of this trend. The advantages are that jobs by identifying the difficulties in the text and
web-based TET, compared to standalone tools, simplifying the process of creating glossaries. Up
do not require any prior installation as they can to the current date it supports only English, Italian
be accessed within a web browser and that they and French. The system output includes the top
make use of web technologies. Although most of 20 terms ranked by their score. In addition, the
web-based TET are often integrated as features terms are given as hyperlinks to the corresponding
in cutting-edge web-based applications with a Google search results. Below the list of terms the
wider purpose, such as managing corpora or tool also shows all the terms in their full-sentence
terminology (e.g., Sketch Engine and Terminus, context. In order to easily differentiate the terms,
respectively), there also exist tools like the TET each term is highlighted by a different color. In
by Translated, which were developed with the general, this tool is quite simple compared to the
proper purpose of terminology extraction. others, but can provide a fast and free solution any
time it is needed.
Sketch Engine is an online tool created
by Lexical Computing Ltd for building and Terminus is a web-based application for corpus
managing corpora, which along with a number of and terminology management developed at the
corpus-processing features includes terminology University Pompeu Fabra, Spain and it can
extraction. It can be accessed under a paid be accessed by software licensing. The
commercial or academic license and supports 82 purpose of this tool is to integrate the complete
languages. This tool offers both monolingual process of terminographic work: textual corpus
and multilingual extraction. When extracting search, compilation and analysis, term extraction,
monolingual terminology, the user can choose glossary and project management, database
creation and maintenance, and dictionary edition. Rainbow is a simple, yet powerful open-source
This is done with the help of a number platform-independent terminology extraction tool
of articulated modules, including the Analysis written in Java that uses statistic-based methods
module, which has a semi-automatic term to automatically extract terms from multiple
extraction feature. The extraction process has two files and formats in any language. It is based
options: the user can train a term extractor in on the Okapi Framework, a free, open-source
a specific domain by incorporating an electronic and cross-platform framework that has a set of
dictionary containing terms of the same field, components and applications designed to help
or simply apply a generic ready-to-use term engineers, developers, translators and project
extractor to any textual corpus. In addition, one managers involved in localization and translation-
can use other features to extract term candidates, related tasks.
such as the n-gram extractor, bi-gram extraction
with association measures, keywords, and later Java Automatic Term Extraction (JATE) is a
manually validate relevant terms. JAVA toolkit that comprises several state-of-the-
art term extractions algorithms. The motivation
of this TET is three-fold: make available several
4 Frameworks automatic term extraction algorithms for the
research community; encourage developers to
Frameworks are different from the other two
built their methods under a uniform framework;
types of tools because they are not complete
and, enable comparative studies between different
software products but reusable software
term extraction algorithms. JATE’s workflow
environments or libraries that can be used or
follows the typical TET steps: extract candidate
even completely integrated in larger translation
terms from a corpus using linguistic tools;
software applications, products or solutions. In
extract the candidates statistical features from
particular, systems of this type are often used
the corpus; and, apply automatic terminology
in information retrieval, where identification
extraction algorithms to score the candidate
and indexing of terminology serves as an aid
terms domain representativeness based on their
to information retrieval queries. In detail, the
statistical features. So far, JATE’s current
purpose of terminology extraction for both
version includes twelve state-of-the-art statistical
information retrieval and document retrieval is to
algorithms.
isolate terms that contain enough informational
content to support retrieval based on the queries 5 Translators’ preferences and opinions
supplied when querying a set of documents. on the features of TET
Keyphrase Extraction Algorithm (Kea) is a As we mentioned above, translation is one of
framework specially designed for automatically the most important applications of terminology
assigning terms to a document (aka keyphrase extraction. However, it has not yet become
indexing). Kea is a platform-independent toolkit a common part of the professional translation
implemented in Java and distributed under the workflow. This was demonstrated by a
GNU General Public License. In detail, this user survey replied by over 600 translation
framework can either be used for free indexing professionals Zaretskaya et al. (2015), which
or for indexing with a controlled vocabulary. showed that only 25% of the respondents
When used as free indexing, Kea looks for regularly resorted to TE in their work. It
significant terms in a document. If on one could be due to unsatisfying performance of the
hand, the free indexing option can be applied existing tools, their interface design, or simply to
to any document and working language (as long translators’ lack of awareness of these tools and
as the corresponding stopword file and stemmer of the benefits they can yield.
are provided). The controlled indexing, on the We have already seen that TET can differ as to
other hand, has the advantage that all documents various characteristics, such as their interface type
are indexed in a consistent way disregarding their (standalone, web-based or reusable libraries),
wording as the algorithm only collects those n- document formats they support, languages they
grams that match thesaurus terms. work with, as well as different search options.
r
Extracto
ultiterm

d s.r.l.
Engine

s
it

te

Rainbow
Terminu
TermSu
SDL M

Transla
Simple

Sketch

JATE
Kea
Bilingual extraction 3 3 3
Source and target context comparison 3
Terms validation 3 3 3 3 3 3 3
Bilingual dictionaries compilation 3 3
Context extraction 3 3 3 3 3 3 3 3
Support various file formats 3 3 3 3 3 3 3 3
Rank terms by frequency 3 3 3 3 3 3 3 3
Support for many languages 3 3 3 3 3 3 3
Specify the minimal number of 3 3 3 3 3 3 3
occurrences
Show linguistic information 3 3 3
Specify the maximum number of 3
translations
Stopword list option 3 3 3 3 3 3
Choose the minimum and maximum 3 3 3 3 3
number of words per term
Term statistics 3 3 3 3 3 3 3 3 3

Table 1: Comparison chart of features for the selected tools.

According to the survey findings, 27% of the harder to perform than only monolingual as
respondents preferred to have a TE feature it requires a good word alignment system, so
within their computer-assisted translation (CAT) not many existing tools offer this feature. In
tool instead of a separate TE software. Some particular, among the tools we considered in the
translators, however, preferred a web-based previous section only SDL Multiterm Extract
application (9%) or installing a standalone tool and Sketch Engine have bilingual extraction.
on their computer (8%). Nevertheless, the Similarly, TermSuite also offers translation
majority (56%) reported that they did not have candidates for the extracted monolingual terms,
any preference regarding the tool’s interface. The which is a different procedure, but still leads to
fact that translators prefer to have a TE system the same results: terms in two languages. The
integrated in their CAT tool is related to the second ranked feature was the possibility to
general tendency of CAT tools to include more compare the context of the term in the source
and more different features. Indeed, translators and the target language, which is another type
have to deal with a great number of tools that help of bilingual analysis suitable for the translation
them automatize different stages of the translation task. This feature is also quite rare, and of all the
process, so they prefer having one tool with considered tools, only SDL Multiterm Extract
multiple functions rather than having to look for allows such analysis. The possibility to validate
and in many cases pay for several tools. terms or, in other words, choose the terms that
should be extracted instead of extracting all
Regarding the importance of the TE’s
terms was ranked third and is also considered
features, the most useful feature according
useful for translators. This feature is offered by
to survey’s participants was bilingual term
almost all systems, except for TermSuite and
extraction. In fact, considering that within a
Translated. Compiling a bilingual dictionary
translation workflow, terminology extraction is
from parallel texts is another useful feature,
performed with the final objective to translate
which is offered only by SDL Multiterm Extract
the extracted terms, it is more convenient to
and by TermSuite. Finally, the respondents
have the terms extracted in the two languages
considered it useful to extract context together
simultaneously. Bilingual extraction is much
with terms or to see examples from the corpus. extraction, compiling dictionaries and comparing
This is a common feature for many of the studied context in different languages as essential features
tools, including SDL Multiterm Extract, Simple for translators’ work. In addition, it is also very
Extractor or Translated. useful for translators to see the terms in their
Other features that were considered included context in order to understand their meaning and
support for different file formats, possibility be able to find an adequate translation equivalent.
to sort terms by frequency, support for many Not all existing tools, however, provide these
languages, possibility to specify the minimal functionalities. We suggest that developing TET
number of occurrences of the words, show more suitable for the purpose of translation
linguistic information about the term, and select could help professionals in the industry take
the maximum number of translations for one term. better advantage of TE technology. This has
All of them were considered useful, but were not to be done, first of all, by taking into account
among the most useful features. the user requirements. As a step further in this
And finally, some features were not considered direction, it would be necessary to investigate
so important by the respondents. One of them in more detail translators’ attitudes towards TE
was the stopword list option: some of the tools. Especially, the reasons that prevent the
tools, like Simple Extractor, allow to choose vast majority of professional translators to adopt
whether to use a stopword list, and others use them. For instance, many translators might not
it by default. Choosing the minimum and the be aware of their existence or understand their
maximum number of words per term, which was purpose, do not have time to learn how to use
also among the least useful features, can be tuned another complicated interface, or simply have
by all the mentioned TE frameworks, for example. other established procedures for dealing with
And finally, term statistics, which to some extent terminology.
are provided by all tools, were not very important
for most translators either. Table 1 shows which Acknowledgements
of the aforementioned features are presented in Hernani Costa and Anna Zaretskaya contributed
the 9 selected tools. equally to this work and are both supported by
Availability Notes the People Programme (Marie Curie Actions)
SDL Multiterm $500 Free demo of the European Union’s Framework Programme
available (FP7/2007-2013) under REA grant agreement
Simple Extractor $140 60-days demo n.317471.
TermSuit Open Source
Sketch Engine $65/ year 30-days demo References
Translated s.r.l Free
Terminus $440/ year 15-days demo Zaretskaya, A., Corpas Pastor, G., and Seghiri,
Kea Open Source M. (2015). Translators’ requirements for
Raibow Open Source translation technologies: a user survey. In
JATE Open Source New Horizons in Translation and Interpreting
Studies (Full papers), pages 247–254,
Table 2: Depending on the purpose the quotes may
Genebra, Switzerland. Tradulex.
vary. This table only shows the prices for licenses paid
by individuals.

6 Conclusion
Although terminology extraction plays an
important role in several disciplines such as
linguistic research or language teaching, it is
in the field of translation, particularly in the
translation industry where its advantages are
fully exploited and integrated in the workflow.
An example of that is the use of bilingual term

Você também pode gostar