Abstract: This is a reproduction of the paper "compressing encoding of words for use in character-level convolutional networks for text classification", submitted to the ICLR2018 conference. We reproduce its proposed encoding scheme and convolutional network. We try to replicate the authors' computation for three of the analyzed datasets. We were able to achieve similar results in both test error and running time for all three datasets. While most of the main ideas in the paper are explained well and could therefore be replicated, some implementation details are missing.

I. INTRODUCTION

This project is part of the ICLR2018 Reproducibility Challenge. The goal of the challenge is to improve the quality of submissions by highlighting the importance of reproducibility. In this paper we present our review of reproducibility by reproducing one of the papers submitted to the ICLR2018 conference.

Zhang, Zhao, and LeCun [1] show that convolutional neural networks (CNNs) can be used effectively on text classification tasks given enough data. For this, they encode their training data at the character level and let a multi-layered CNN classify it. They use a one-hot encoding scheme to represent the different characters.

For this project, we try to replicate the results of the paper "compressing encoding of words for use in character-level convolutional networks for text classification", submitted on the OpenReview platform for ICLR2018. The authors of this paper try to improve the model of Zhang, Zhao, and LeCun [1] by using a more efficient character encoding scheme, leading to faster computation times for the network.

Instead of using a one-hot encoding, they propose an encoding that depends on the frequency of occurrence of the characters. It is claimed that this encoding scheme is competitive in accuracy while allowing for faster training times. With the new encoding, characters that appear often in the document can be represented in a compressed way, resulting in smaller input matrices for the network and thus potentially shorter training time.

We chose to replicate this paper because we found the approach of using convolutional networks in a natural language setting to be very interesting. This approach might change the way we see natural language problems, since it shows that classification can be done even when looking solely at the smallest pieces of language.

We will verify the paper's claims by replicating the authors' investigation process. We implement their proposed encoding scheme from the description in the paper, try to replicate their chosen CNN architecture, and compare the results for three of the datasets used in the original paper. We analyze both accuracy and running times. We further discuss the general reproducibility of the paper, and focus on aspects that were difficult to reproduce without further information.

II. TECHNICAL SUMMARY OF METHOD EMPLOYED

In this section we briefly present the approach of the authors. We first describe the encoding of documents, then present the network architecture used to train the model, and finally give an overview of how this model was tested.

A. Encoding Procedure

Given a number of documents, each belonging to a distinct category, each document is first encoded to be interpretable by a CNN. The documents are encoded at the character level. First, the characters are ranked by frequency over the entire training set. Each character is then encoded depending on its rank: if a character has rank i (counting from 0 for the most frequent character), its encoding starts and ends with a 1 and has i times the digit 0 in between. For example, if the three most frequent characters in a corpus are e, a and t, then their encodings are 11, 101 and 1001, respectively. The same scheme is applied until the character with the lowest rank is reached. This results in a distinct encoding for each character in the documents.

Once this is done, each document is encoded as a matrix, with each row representing a word of the document. Words are encoded by concatenating the codes of their characters, up to a maximum size of 256 digits. The maximum number of words represented per document is set to 128. Hence each document is represented by a matrix of size 128x256. If a word's representation is shorter than 256 digits, or if a document consists of fewer than 128 words, the remaining columns or rows are simply padded with 0s.

Compared to the model by Zhang, Zhao, and LeCun [1], which used the first 1012 characters of each document as input and a one-hot encoding for each character (resulting in a 1012 by 70 matrix for each document), this approach is much more efficient, since it can fit more information into a smaller matrix.

B. Network Architecture

For classification, the authors use a CNN with multiple layers. From the figure posted in the original paper we reconstructed the following network architecture: The input is fed into 4 parallel convolutional layers (256 filters, kernel
size = 3). The output of each of these is max-pooled (pool size = 3). The outputs of these layers are then concatenated in a merge layer. Two more convolutional layers (256 filters, kernel size = 5 for both) and two max-pooling layers (pool sizes 3 and 4) follow this step, in the order Conv->MaxPool->Conv->MaxPool. The result of this is flattened and passed to a fully connected layer consisting of 128 neurons. The final fully connected layer consists of a number of neurons equal to the number of distinct classes of the current problem. The activation functions of the network are not specified in the paper.

C. Testing

The performance of the model is assessed in two ways: the training time and the error rate. The authors compare the running time on 2 GPUs, an Nvidia GeForce 1080ti and an Nvidia GeForce 930M, and they compare the test error rate between their model and the results provided in the paper by Zhang, Zhao, and LeCun [1]. They claim that the training time is faster, even on a less powerful GPU, and that the error rate is comparable to the model proposed by Zhang, Zhao, and LeCun [1].

III. REPRODUCIBILITY METHODOLOGY

In this section we discuss which aspects of the original paper we tried to reproduce and how we approached the reproduction.

A. Data Collection

The original paper contains little description of the datasets used, except mentioning that they are the same as the ones in Zhang, Zhao, and LeCun [1], and that a detailed description can be found in that paper. In the paper by Zhang, Zhao, and LeCun [1], the way in which the data was procured is not entirely clear. Some of the datasets are publicly available, well-maintained with proper documentation, and accessible, such as the Yahoo! Answers Topic Classification dataset. Other datasets were extracted from public websites, such as the DBpedia Ontology Classification dataset. However, it was not clear how the information was scraped from the websites, and no corresponding code was provided. Moreover, there were other datasets for which a link is provided, but the information behind the link is outdated and no longer useful, as was the case for the AG news dataset. Specifically, a link was provided to the raw data from which the news articles were extracted, but almost all the links to the news articles were no longer functional.

The lack of clarity in terms of data access caused confusion when trying to reproduce the results. A significant amount of time had to be spent tracking down the datasets used. However, upon examining the website of the lab at which the paper by Zhang, Zhao, and LeCun [1] was written, a link to the actual data was discovered; the paper itself does not include a direct reference to it. Within the content provided, train and test sets are given separately, making the split easily reproducible.

To make the paper under review more reproducible in the future, it should include a direct link to the datasets it used, as well as an original description of the data and the corresponding code. Overall, if the link to the dataset is directly provided, the data collection process is highly reproducible.

To account for the scope of this project, we decided to focus on only three of the eight provided datasets. We chose the following 3 datasets:

- AG's news dataset: a corpus made up of news articles classified into 4 classes, with 120,000 training examples and 7,600 testing examples [1].
- DBpedia ontology dataset: a corpus of Wikipedia articles classified into 14 categories, with 560,000 training examples and 70,000 testing examples [2].
- Yelp polarity dataset: a corpus of Yelp reviews classified as either positive or negative, with 560,000 training examples and 38,000 testing examples [3].

B. Implementation of the Model

In order to reproduce the results from the paper, it was first necessary to implement the model given in the article, namely both the encoding and the CNN architecture, since the code was not provided to us. We chose to use an environment as close as possible to the one used by the authors of the article. The coding was hence done in Python 3.6 and the CNN was implemented using Keras 2.0 [4], with Theano as backend [5].

a) Encoding and Preprocessing: Using the description in the original paper, we wrote an encoding function. The compressed code of each character in the training set was calculated in the way described above. We then created a dictionary of the characters present in the training set and mapped them to their respective encodings (based on frequency in the training set). Each batch of data sent to the CNN is encoded using this function.

Several difficulties emerged while preprocessing the data. The authors wrote that they preprocessed the data by using only lowercase characters. However, no further instructions on data preprocessing were given. This led to some confusion when we took a closer look at the datasets: for example, some datasets contained many occurrences of the character \ between 2 words instead of a space (the readme files informed us that these represent newline characters). However, no explanation was given in the original paper on how to encode these characters, or whether to ignore them. We chose the simplest approach and treated all characters as regular characters.

The datasets include a title and a description. It is not stated in the paper which of these should be used for classification. We decided, consistent with the procedure used by Zhang and LeCun [6], to use the concatenation of title and description as data samples. For easier reproduction, clear explanations of all the preprocessing decisions should be provided.

b) Convolutional Neural Network: More challenges for reproducibility came up during the implementation of the CNN. The authors described their network architecture solely using a table with the columns Layer type, shape, # of parameters,
connected to. We deduced our final implemented network structure from this table. Implementing the CNN was not a straightforward task, as many parameters of the network had to be inferred by looking at the values in the table and making deductions about the network structure. For instance, we deduced that the convolutional layers use 256 filters and valid padding with a stride of 1 by looking at the described output shape (None, 126, 256). For reproducibility purposes, it would have been helpful (and time-saving) to state all the relevant parameters of the layers, or to provide the code.

One parameter that was completely missing from the paper, and was therefore hard to reproduce correctly, was the choice of activation functions. We contacted the authors but received no response in time for submission. In the end, we chose to implement ReLU (rectified linear unit) activation functions for the convolutional layers and the first fully connected layer. We expect that ReLU activations are the ones used by the authors, since they are typically used in CNNs. We chose to implement a softmax activation function for the output layer.

While some of the hyperparameters of the model were stated very clearly in the paper, others were missing. The number of epochs, the optimizer (with its parameters) and the batch sizes were provided and copied for our implementation. However, no information was provided on the loss function used, or on whether the batches were provided to the CNN in shuffled order. On the other hand, the hardware, operating system and programming environment were specified by the authors. They also provided a good description of the libraries used (with version numbers). This made it easy to compare the final running times and make deductions as to why there might be differences.

C. Testing Accuracy on Different Corpora

Once we had implemented this model, the second step was to try to reproduce the performance the authors claimed it achieved on the different corpora used.

One issue we ran into was figuring out how the loss used to compare models in Table 6 of the original paper was calculated. The authors titled the table "Loss comparison among traditional models", but did not include more details. We contacted the authors using the OpenReview platform and they quickly answered. They informed us that the values presented in Table 6 were the error rates on the test set. Once this was made clear, we were able to try to reproduce the error rates on the three datasets mentioned earlier.

D. Comparing Running Time

In the original paper, the authors stress on several occasions that their model achieves fast training times (especially compared with the running times in [6]). It therefore seemed important to us to be able to reproduce the same training time. However, the authors trained their network on GPUs (either an Nvidia 1080ti or an Nvidia 930M) that were not available to us. Even if they were, the exact same running time might not be achievable because of differences in the implementations of the model. For example, because of limited memory, we are not able to store the encodings of all the documents of a large corpus. Therefore we need to recompute the encoding of a document at each epoch. This means that the encoding of a document is calculated 5 times in total (once for each epoch). We do not know if the authors had to use a similar trick, or if they had a larger amount of memory to work with.

In order to train the model, we obtained Google Cloud Computing credits which enabled us to train the model on an Nvidia K80 GPU.

Since we did not have the same GPU, our goal was not to get the same training time, but to get the same proportion between training times on different datasets. For example, on the Nvidia 930M GPU, the authors report a training time of 25 minutes per epoch on the AG news corpus, and a training time of 116 minutes per epoch on the DBpedia dataset. It thus took about 4.6 times longer to train the model on the DBpedia dataset. For reproduction, we wanted to see if we would reach a similar factor between the training times of our implementation of the model on those 2 datasets. Note that the factors are not the same between the two GPUs used in the original paper. We consider the factors obtained on the 930M GPU to be more reliable, as the training time on this GPU is longer, and therefore the reported time is less sensitive to rounding.

IV. EMPIRICAL RESULTS

In this section we provide the empirical results obtained during our reproduction and compare them with the results from the original paper.

A. Error Rates

Table I shows the error rates obtained on the test set for the three datasets that were analyzed. The error rates obtained are very close to the ones from the original paper, but not exactly the same. Interestingly, we achieve better results for two of the datasets (ag_news and yelp_polarity) but worse results on the dbpedia dataset.

TABLE I: Comparison of the test error (in %) between original and replicated results.

Dataset         Original Results    Reproduced Results
ag_news         12.33               11.89
dbpedia         2.07                2.33
yelp_polarity   7.96                7.84

B. Running Times

Table II shows the training time per epoch reported in the original paper and obtained in our reproduction. For better comparison, the original paper tested its running times on two different GPUs. Our times are similar to the times achieved in the original paper when testing on the 1080ti GPU. These times are significantly faster than the ones obtained on the 930M GPU. What is important to notice is that we obtained a similar proportion between training times on the different datasets, even though the model seems to train a little faster than expected on the yelp_polarity dataset.
TABLE II: Time comparison of training time (per epoch) between the reports of the original paper and our reproduction.

Dataset         Original Time (Nvidia 1080ti GPU)   Original Time (Nvidia 930M GPU)   Reproduced Time (Nvidia Tesla K80)
ag_news         3 min                               25 min                            4.87 min
dbpedia         18 min                              116 min                           23.11 min
yelp_polarity   21 min                              119 min                           20.17 min
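To make the encoding scheme of Section II-A concrete, the following is a minimal Python sketch of our reading of it. This is not the authors' code: the function names are ours, and details the paper leaves open (tie-breaking between equally frequent characters, and the treatment of characters unseen at training time) are handled here with simple assumptions.

```python
from collections import Counter

def build_codes(corpus):
    """Rank characters by frequency over the training corpus and give the
    character of rank i (0-indexed, most frequent first) the code
    '1' + '0' * i + '1'.  Ties keep first-seen order -- an assumption,
    since the paper does not specify tie-breaking."""
    counts = Counter(ch for doc in corpus for ch in doc)
    return {ch: "1" + "0" * i + "1"
            for i, (ch, _) in enumerate(counts.most_common())}

def encode_document(doc, codes, max_words=128, max_digits=256):
    """Encode one document as a 128 x 256 matrix of 0/1 digits: one row
    per word, each row being the concatenated character codes, truncated
    to 256 digits and zero-padded.  Characters without a code (unseen at
    training time) are skipped -- again an assumption."""
    matrix = []
    for word in doc.split()[:max_words]:
        digits = "".join(codes.get(ch, "") for ch in word)[:max_digits]
        row = [int(d) for d in digits]
        matrix.append(row + [0] * (max_digits - len(row)))
    # Pad short documents with all-zero rows up to 128 words.
    matrix += [[0] * max_digits for _ in range(max_words - len(matrix))]
    return matrix
```

With a training text in which e, a and t are the three most frequent characters, build_codes assigns them the codes 11, 101 and 1001, matching the example given in Section II-A.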