Table 2: Example of a document and corresponding action sequence. We mark in red the MT words that have been corrected
and in blue their replacement. The actions used here were W (wait), JSF (jump sentence forward), JF (jump forward), D
(delete), MC (mouse clicks), MS (mouse selections), JB (jump back), R (replace) and S (stop).
cause simultaneous changes to several words (e.g. when pasting text or deleting a selected block), and we reserve separate actions for these. Overall, five text-editing actions are considered: inserting (I), deleting (D), and replacing (R) a single word, and inserting (BI) and deleting (BD) a block of words. Each action is appended with the corresponding word or block of words, as shown in Table 1.

Other actions, dubbed non-editing actions, do not change the text directly. Jump-forward (JF) and jump-backward (JB) operations count the distance in words between two consecutive edits. Another pair of actions signals when a new sentence is edited: a sentence jump (JSF/JSB) indicates that we moved a certain number of sentences forth/back since the previous edit. Mouse clicks (MC) and mouse selections (MS) count their occurrences between two consecutive edits. Wait (W) counts the seconds between the beginnings of two consecutive edits. Finally, stop (S) marks the end of the post-editing session.

Since we do not want to rely on lexical information to identify the human post-editors, only the 50 most frequent words were kept (most consisting of punctuation symbols and stop-words), with the remaining ones converted to a special unknown symbol (UNK). Moreover, the first waiting time is split in two: the time until the first keystroke occurs and, in case the first keystroke is not part of the first action (e.g. a mouse click), a second waiting time until the first action begins.

Table 2 shows an example of a small document, along with the editor's action sequence. The editor began on sentence 2 (“Certains agents...”) at the word in position 9, since there was a jump forward of 1 sentence and 8 words. After deleting “se”, position 9 became “traduira”. Since the editor opted to delete “par” (using a mouse selection) before changing the verb, there is a jump forward of 1 word to position 10. Then we have a jump back of 1 before changing the verb to “traduit”.

Datasets. We introduce two datasets for this task, one for English-French (En-Fr) and another for English-German (En-De). For each dataset, we provide the action sequences for full documents, along with an editor identifier. To ensure reproducibility of our results, we release both datasets as part of this paper, available at https://github.com/Unbabel/translator2vec/releases/download/v1.0/keystrokes_dataset.zip. For anonymization purposes, we convert all editor names and the 50 tokens in the word vocabulary to numeric identifiers. Statistics of the dataset are shown in Table 3: it is the largest dataset with post-editing action sequences ever released, and the only one we are aware of with document-level information.² Each document corresponds to a customer service email, with an average of 116.6 tokens per document and an average sentence length of 9.4 tokens.

² The closest comparable dataset was released by Specia et al. (2017) in the scope of the QT21 project, containing 176,476 sentences spanning multiple language pairs (about 4 times fewer than ours), with raw keystroke sequences available by request. In contrast to ours, their units are sentences and not full documents, which precludes studying how human post-editors jump between sentences when translating a document.
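To make the action vocabulary concrete, below is a minimal sketch of how a session could be represented as a sequence of (action, value) pairs and flattened into tokens, mirroring the example in Table 2. The pair format, the token convention, and the numeric values are illustrative assumptions, not the released schema.

```python
# A sketch of one plausible representation of a post-editing session.
# The (action, value) format here is an assumption for illustration;
# consult the released dataset for the actual schema.

EDIT_ACTIONS = {"I", "D", "R", "BI", "BD"}       # insert/delete/replace a word or block
NON_EDIT_ACTIONS = {"JF", "JB", "JSF", "JSB",    # word / sentence jumps
                    "MC", "MS",                  # mouse clicks / selections
                    "W", "S"}                    # wait (seconds) and stop

def to_tokens(session):
    """Flatten a list of (action, value) pairs into tokens.

    Editing actions carry a word (an anonymized numeric id in the
    released data); non-editing actions carry a count or distance.
    """
    tokens = []
    for action, value in session:
        assert action in EDIT_ACTIONS | NON_EDIT_ACTIONS, action
        tokens.append(f"{action}:{value}")
    return tokens

# Roughly the session of Table 2: wait, jump 1 sentence and 8 words
# forward, delete "se", mouse-select and delete "par", jump forward
# then back, replace the verb, stop. Word ids are made up.
example = [("W", 12), ("JSF", 1), ("JF", 8), ("D", 17), ("MS", 1),
           ("D", 23), ("JF", 1), ("JB", 1), ("R", 42), ("S", 0)]
print(to_tokens(example))
```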
Figure 1: Left: our Action Seq model for editor identification. Right: our model for post-editing time prediction.
              # docs    # sents      # words
En-Fr  train  17,464    154,026    1,895,389
       dev     5,514     52,366      659,675
       test    9,441     86,111    1,072,807
En-De  train  17,403    169,478    2,053,407
       dev     6,722     66,521      826,791
       test    9,724     98,920    1,221,319
Total         66,268    627,422    7,729,388

Table 3: Number of documents, sentences, and words in English source text per dataset. There are 149 unique editors across all En-Fr datasets, and 183 in En-De.

3 Editor Identification

We now make use of the dataset just described to answer the three research questions stated at the end of §1, starting with editor identification.

3.1 Data Preparation

For this experiment, we took the action sequence dataset described in §2 and selected, for both language pairs, a small number of human translators who post-edited a number of documents above a threshold: this yielded 6 editors for En-Fr and 7 editors for En-De. To ensure balanced datasets, we filtered them to contain the same number of samples per selected editor. This filtering yielded a total of 998/58/58 training/dev/test documents per editor for En-Fr, and 641/128/72 for En-De.

A random baseline for this dataset would obtain an editor identification accuracy of 1/6 = 16.7% for En-Fr and 1/7 = 14.3% for En-De.

3.2 A Model for Editor Identification

Let $\langle x_1, \ldots, x_L \rangle$ be an action sequence produced by a post-editor $y$. To identify the editor of a task, we build a model $P(y \mid x_1, \ldots, x_L)$ using a neural network, described next and shown in Figure 1. Each action $x_i$ is first associated with a one-hot vector. All numeric actions are grouped into bins; e.g., waiting times of 200 seconds and higher all correspond to the same one-hot representation. Bins were defined manually, providing higher granularity to small values than to larger ones.³ Each one-hot vector is then mapped to a learnable embedding, and the sequence of embeddings is fed to a 2-layer bidirectional LSTM (biLSTM; Hochreiter and Schmidhuber (1997); Graves and Schmidhuber (2005)), resulting in two final states $\overrightarrow{h}$ and $\overleftarrow{h}$. We concatenate both, apply dropout (Srivastava et al., 2014), and feed the result to a feed-forward layer with a ReLU activation (Glorot et al., 2011) to form a vector $h$, which is taken as the representation of the action sequence. Finally, we define $P(y \mid x_1, \ldots, x_L) = \mathrm{softmax}(Wh + b)$. We call this model Action Seq, since it exploits information from the action sequences.

³ We used {0, . . . , 5, 7, 10, 15, 20, 30, 50, 75, 100, 150, 200+} for wait and jump events (in seconds and word positions, respectively); and {0, . . . , 5, 7, 10+} for sentence jumps and mouse events (in sentence positions and clicks).
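A minimal PyTorch sketch of the Action Seq architecture as just described: binned actions mapped to learnable embeddings, a 2-layer biLSTM, concatenation of the two final states, dropout, a ReLU feed-forward layer producing h, and a softmax output. Embedding size, hidden size, and dropout rate are illustrative assumptions, not the paper's tuned values.

```python
import torch
import torch.nn as nn

# Manual bins from footnote 3 for wait/jump events; 200+ shares the last bin.
WAIT_BINS = [0, 1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 50, 75, 100, 150, 200]

def bin_index(value, bins=WAIT_BINS):
    """Map a numeric action value to its bin index."""
    for i, upper in enumerate(bins):
        if value <= upper:
            return i
    return len(bins) - 1  # the 200+ bucket

class ActionSeq(nn.Module):
    """Sketch of the Action Seq model (Figure 1, left)."""

    def __init__(self, vocab_size, num_editors, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # one-hot -> learnable embedding
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.ff = nn.Linear(2 * hidden_dim, hidden_dim)  # forms the session vector h
        self.out = nn.Linear(hidden_dim, num_editors)    # softmax(W h + b)

    def forward(self, actions):
        # actions: (batch, L) integer ids of binned actions
        _, (h_n, _) = self.bilstm(self.embed(actions))
        # h_n: (num_layers * 2, batch, hidden); take the last layer's
        # forward and backward final states and concatenate them
        h = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        h = torch.relu(self.ff(self.dropout(h)))         # session representation h
        return torch.softmax(self.out(h), dim=-1), h
```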
3.3 Baselines

To assess how much information action sequences provide about human editors beyond the initial (machine translation) and final (post-edited) text, we implemented several baselines which do not use fine-grained information from the action sequences. All use pre-trained text embeddings from FastText (Joulin et al., 2017), and all are tuned for dropout and learning rate:

• One using the machine-translated text only (MT). Since this text has not been touched by the human post-editor, we expect this system to perform similarly to the random baseline. The goal of this baseline is to control for whether there is a bias in the content each editor receives that could discriminate her identity. It uses word embeddings as input to a biLSTM, followed by feed-forward and softmax layers.

• Another using the post-edited text only (PE). This is used to control for the linguistic style of the post-editor. We expect this to be a weak baseline: although there are positive results on translator stylometry (El-Fiqi et al., 2019), the task of post-editing provides less opportunity to leave a fingerprint than writing a translation from scratch. The architecture is the same as in the MT baseline.

• A baseline combining both MT and PE using a dual encoder architecture (MT + PE), inspired by models from dialogue response (Lowe et al., 2015; Lu et al., 2017). This baseline is stronger than the previous two, since it is able to look at the differences between the initial and final text produced by the post-editor, although it ignores the process by which these differences were generated. Two separate biLSTMs encode the two sequences of word embeddings; the final encoded states are concatenated and fed to a feed-forward and a softmax layer to provide the editors' probabilities.

• Finally, a stronger baseline (MT + PE + Att) that is able to “align” the MT and PE, by augmenting the dual encoder above with an attention mechanism, inspired by work in natural language inference (Rocktäschel et al., 2016). The model resembles the one in Figure 1 (right), with a softmax output layer and without the editor representation layer. Two separate biLSTMs encode the machine-translated and the post-edited text. The final state of the MT is used to compute attention over the PE; this attention-weighted PE is then concatenated with MT's final state and passed through a feed-forward layer. Symmetrically, we obtain a representation from PE's final state and an attention-weighted MT. Finally, both vectors are concatenated and turned into editors' probabilities through another feed-forward layer. (A code sketch of this baseline follows §3.4 below.)

Additionally, we prepare another baseline (Delta) as a tuple of meta-information containing statistics about the difference between the initial and final text (still not depending on the action sequences). This tuple contains the following 5 elements: the count of sentences in the document; the minimum edit distance between MT and PE; and the counts of words in the original document, in the MT, and in the PE. Each of these elements is binned and mapped to a learnable embedding. The 5 embeddings are concatenated into a vector e, followed by a feed-forward layer and a softmax activation.

                En-De (%)   En-Fr (%)
Delta             16.15       26.09
MT                18.21       16.44
PE                27.38       30.00
MT + PE           26.63       31.78
MT + PE + Att     30.12       35.06
Action Seq        84.37       67.07

Table 4: Results of our model and the baselines for editor identification. Reported are average test set accuracies over 5 runs, with 7 editors for En-De and 6 editors for En-Fr.

                         En-De (%)   En-Fr (%)
Action Seq                 83.31       73.16
w/out editing actions      80.60       69.37
w/out mouse info           75.49       66.38
w/out waiting time         80.42       70.92
w/out 1st waiting time     78.60       71.15
only editing actions       60.20       59.08
only mouse info            56.43       55.06
only waiting time          53.53       44.02
only 1st waiting time      24.22       23.11

Table 5: Ablation studies for editor identification. Reported are average development set accuracies over 5 runs, with 7 editors for En-De and 6 editors for En-Fr.

3.4 Editor Identification Accuracy

Table 4 compares our system with the baselines above. Among the baselines, we observe a gradual improvement as models gain access to more information. The fact that the MT baseline performs close to the random baseline is reassuring, showing that there is no bias in the type of text that each editor receives. As expected, the dual encoder model with attention, being able to attend to each word of the MT and post-edited text, performs best, surpassing the random baseline by a large margin. However, none of these baselines achieves satisfactory performance on the editor identification task.

By contrast, the accuracies achieved by our proposed model (Action Seq) are striking: 84.37% in En-De and 67.07% in En-Fr, well above the closest baselines. This large gap confirms our hypothesis that the editing process itself contains information which is much richer than the initial and final text alone.
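For concreteness, here is a minimal PyTorch sketch of the strongest baseline above, the attention-augmented dual encoder (MT + PE + Att). The dot-product attention scoring, the dimensions, and all names are assumptions for illustration; the paper specifies only the overall architecture, with attention in the style of Rocktäschel et al. (2016).

```python
import torch
import torch.nn as nn

class DualEncoderAtt(nn.Module):
    """Sketch of the MT + PE + Att baseline for editor identification."""

    def __init__(self, emb_dim, hidden_dim, num_editors):
        super().__init__()
        self.enc_mt = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.enc_pe = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.ff_mt = nn.Linear(4 * hidden_dim, hidden_dim)  # [MT final; attended PE]
        self.ff_pe = nn.Linear(4 * hidden_dim, hidden_dim)  # [PE final; attended MT]
        self.out = nn.Linear(2 * hidden_dim, num_editors)

    @staticmethod
    def attend(query, keys):
        # query: (batch, 2*hidden); keys: (batch, len, 2*hidden)
        scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)  # dot-product scores
        weights = torch.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)    # weighted sum of keys

    def forward(self, mt_emb, pe_emb):
        # mt_emb, pe_emb: (batch, len, emb_dim) pre-trained word embeddings
        mt_states, (h_mt, _) = self.enc_mt(mt_emb)
        pe_states, (h_pe, _) = self.enc_pe(pe_emb)
        mt_final = torch.cat([h_mt[-2], h_mt[-1]], dim=-1)  # fwd/bwd final states
        pe_final = torch.cat([h_pe[-2], h_pe[-1]], dim=-1)
        pe_att = self.attend(mt_final, pe_states)           # MT final attends over PE
        mt_att = self.attend(pe_final, mt_states)           # PE final attends over MT
        v1 = torch.relu(self.ff_mt(torch.cat([mt_final, pe_att], dim=-1)))
        v2 = torch.relu(self.ff_pe(torch.cat([pe_final, mt_att], dim=-1)))
        return torch.softmax(self.out(torch.cat([v1, v2], dim=-1)), dim=-1)
```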
Ablation studies. To understand the importance of each action type in predicting the editor's identity, we conduct a series of ablation studies and report development set accuracies in Table 5. These experiments involve removing mouse information, time information, the initial waiting time, or the editing actions; we also try keeping only each of these four. We find that all action types contribute to the overall accuracy, although to different extents, and some action types achieve high performance on their own. Somewhat surprisingly, mouse information alone achieves remarkably high accuracy. Although waiting times also perform well on their own, removing them has little impact on the final score.
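In practice, these ablations amount to filtering the action vocabulary before training. A minimal sketch, assuming the hypothetical "action:value" token convention from the earlier example:

```python
# Hypothetical grouping of action types, matching the rows of Table 5.
MOUSE = {"MC", "MS"}
WAITS = {"W"}
EDITS = {"I", "D", "R", "BI", "BD"}

def ablate(tokens, remove=frozenset()):
    """Drop tokens whose action type is in `remove` (e.g. MOUSE | WAITS)."""
    return [t for t in tokens if t.split(":")[0] not in remove]

def keep_only(tokens, keep):
    """Keep only tokens whose action type is in `keep`."""
    return [t for t in tokens if t.split(":")[0] in keep]
```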
4 Editor Representation
The previous section has shown that action sequences are very effective for identifying editors. As a by-product, the Action Seq model used for that task produces an internal vector h that represents the full post-editing session. This suggests a strategy for obtaining editor representations: simply average all such vectors from each editor. One way of looking at this is to regard editor identification as an auxiliary task that helps us find good editor representations. This draws inspiration from previous work, such as Mikolov et al. (2013), as well as its applications to recommendation systems (Grbovic et al., 2015, 2016). In the latter two works, an auxiliary task also helps to provide a latent representation of an object of interest.

Visualization of translation sessions. To visualize the vectors h produced during our auxiliary task, we use parametric t-SNE (Maaten, 2009) for dimensionality reduction. Unlike the original t-SNE (Maaten and Hinton, 2008), the parametric version allows a learned dimensionality reduction to be reapplied to new data. This makes it possible to infer a 2D structure from the training data and check how well it fits the test data.

In Figure 2 we show a projection of the vectors h for both language pairs, using a t-SNE model learned on the training set vectors; each color corresponds to a different editor. In the training set (used to train both the editor identification model and the parametric t-SNE) there is one clear cluster per editor, in both languages. On test set data, new tasks also form clusters which are closely related to the editors' identities. Some clusters are isolated while others get mixed near their borders, possibly meaning that some editors behave in a more distinctive manner than others.

Figure 2: Embeddings of each translation session in the editor identification train and test sets, with editors identified by different colors: (a) En-De training set, (b) En-Fr training set, (c) En-De test set, (d) En-Fr test set. For each language, the dimensionality reduction was learned by training parametric t-SNE (Maaten, 2009) on the train data and then applying it to both train and test data. En-De contains 7 editors, each with 641 train and 72 test samples; En-Fr contains 6 editors, each with 998 train and 58 test samples.

Visualization of editor representations. To represent an editor with a single vector, we average the h's of all tasks of a given editor. Figure 3 contains representations for En-Fr editors (similar results were achieved for En-De editors), using the exact same model as in Figure 2b to produce session embeddings, and the same t-SNE model for visualization. To reduce noise, we discard editors with fewer than 10 samples, keeping 117 out of 149 editors. In Figure 3 we show percentiles for 3 editor features, using one point per editor and setting color to represent a different feature in each panel. In Figure 3a, color represents percentiles of the average initial waiting time, and in Figure 3b, percentiles of the count of jump-backs per MT token. We observe that the model learned to map high waiting times to the left and high counts of jump-backs to the right. In Figure 3c we show mouse activity per user (percentiles of the counts of mouse clicks and selections); here we see a distribution very similar to that of the count of jump-backs.
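A sketch of the editor-representation step described above: collect the session vectors h (e.g. the second output of the Action Seq sketch in §3.2) and average them per editor, discarding editors with fewer than 10 sessions. Function and variable names are illustrative.

```python
from collections import defaultdict
import numpy as np

def editor_embeddings(sessions, min_samples=10):
    """Average session vectors h per editor.

    sessions: iterable of (editor_id, h) pairs, with h a 1-D numpy array.
    Editors with fewer than `min_samples` sessions are discarded, as done
    for Figure 3 (keeping 117 of 149 En-Fr editors).
    """
    by_editor = defaultdict(list)
    for editor_id, h in sessions:
        by_editor[editor_id].append(h)
    return {e: np.mean(hs, axis=0)
            for e, hs in by_editor.items() if len(hs) >= min_samples}
```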
        Mouse and JB (%)   1st WT and JB (%)
En-Fr        80.75              −39.65
En-De        59.62              −31.11
Lacruz, Isabel, Gregory M. Shreve, and Erik Angelone. 2012. Average pause ratio as an indicator of cognitive effort in post-editing: A case study. In AMTA 2012 Workshop on Post-Editing Technology and Practice (WPTP 2012), pages 21–30. AMTA.

Lowe, Ryan, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. CoRR, abs/1506.08909.

Lu, Yichao, Phillip Keung, Shaonan Zhang, Jason Sun, and Vikas Bhardwaj. 2017. A practical approach to dialogue response generation in closed domains. arXiv preprint arXiv:1703.09439.

Maaten, Laurens van der. 2009. Learning a parametric embedding by preserving local structure. In Artificial Intelligence and Statistics, pages 384–391.

Maaten, Laurens van der and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Sanchez-Torron, Marina and Philipp Koehn. 2016. Machine translation quality and post-editor productivity. In AMTA 2016, page 16.

Sin-wai, Chan. 2014. The development of translation technology. Routledge Encyclopedia of Translation Technology, page 3.

Snover, Matthew, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, volume 200.

Specia, Lucia. 2011. Exploiting objective annotations for measuring translation post-editing effort. In Proceedings of the 15th Conference of the European Association for Machine Translation, pages 73–80.

Specia, Lucia, Kim Harris, Frédéric Blain, Aljoscha Burchardt, Vivien Macketanz, Inguna Skadina, Matteo Negri, and Marco Turchi. 2017. Translation quality and productivity: A study on rich morphology languages. In Machine Translation Summit XVI, pages 55–71, Nagoya, Japan.

Specia, Lucia, Carolina Scarton, and Gustavo Henrique Paetzold. 2018. Quality estimation for machine translation. Synthesis Lectures on Human Language Technologies, 11(1):1–162.

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Tatsumi, Midori, Takako Aikawa, Kentaro Yamamoto, and Hitoshi Isahara. 2012. How good is crowd post-editing? Its potential and limitations. In AMTA 2012 Workshop on Post-Editing Technology and Practice (WPTP 2012), pages 69–77.

Toral, Antonio, Martijn Wieling, and Andy Way. 2018. Post-editing effort of a novel with statistical and neural machine translation. Frontiers in Digital Humanities, 5:9.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010.

Yang, Diyi, Aaron Halfaker, Robert Kraut, and Eduard Hovy. 2017. Identifying semantic edit intentions from revisions in Wikipedia. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2000–2010.

Yin, Pengcheng, Graham Neubig, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L. Gaunt. 2019. Learning to represent edits. In International Conference on Learning Representations (ICLR), New Orleans, LA, USA.