Abstract: This is a reproduction of the paper "compressing encoding of words for use in character-level convolutional networks for text classification", submitted to the ICLR2018 conference. We reproduce its proposed encoding scheme and convolutional network. We try to replicate the authors' computation for three of the analyzed datasets. We were able to achieve similar results in both test error and running time for all three datasets. While most of the main ideas in the paper are explained well and could therefore be replicated, some implementation details are missing.

I. INTRODUCTION

This project is part of the ICLR2018 Reproducibility Challenge. The goal of the challenge is to improve the quality of submissions by highlighting the importance of reproducibility. In this paper we present our review of reproducibility by reproducing one of the papers submitted to the ICLR2018 conference.

Zhang, Zhao, and LeCun [1] show that convolutional neural networks (CNNs) can be used effectively on text classification tasks given enough data. For this, they encode their training data at the character level and let a multi-layered CNN classify it. They use a one-hot encoding scheme to represent the different characters.

For this project, we try to replicate the results of the paper "compressing encoding of words for use in character-level convolutional networks for text classification", submitted on the OpenReview platform for ICLR2018. The authors of this paper try to improve the model of Zhang, Zhao, and LeCun [1] by using a more efficient character encoding scheme, leading to faster computation times for the network.

Instead of using a one-hot encoding, they propose an encoding that depends on the frequency of occurrence of the characters. It is claimed that this encoding scheme is competitive in accuracy while allowing for faster training times. With the new encoding, characters that appear often in the document can be represented in a compressed way, resulting in smaller input matrices for the network and thus potentially shorter training time.

We chose to replicate this paper because we found the approach of using convolutional networks in a natural language setting to be very interesting. This approach might change the way we see natural language problems, since it shows that classification can be done even when looking solely at the smallest pieces of language.

We will verify the paper's claims by replicating the authors' investigation process. We implement their proposed encoding scheme from the description in the paper, try to replicate their chosen CNN architecture, and compare the results for three of the datasets used in the original paper. We analyze both accuracy and running times. We further discuss the general reproducibility of the paper, and focus on aspects that were difficult to reproduce without further information.

II. TECHNICAL SUMMARY OF METHOD EMPLOYED

In this section we briefly present the approach of the authors. We first describe the encoding of documents, then present the network architecture used to train the model, and finally give an overview of how this model was tested.

A. Encoding Procedure

Given a number of documents, each belonging to a distinct category, each document is first encoded to be interpretable by a CNN. The documents are encoded at the character level. First, the characters are ranked by frequency over the entire training set. Each character is then encoded depending on its rank: if a character has rank i (counting from 0 for the most frequent character), its encoding starts and ends with a 1 and has i times the digit 0 in between. For example, if the three most frequent characters in a corpus are e, a and t, then their encodings are 11, 101 and 1001, respectively. The same scheme is applied until the character with the lowest rank is reached. This results in a distinct encoding for each character in the documents.

Once this is done, each document is encoded as a matrix, with each row representing a word of the document. Words are encoded by concatenating the codes of their characters, up to a maximum size of 256 digits. The maximum number of words represented per document is set to 128. Hence each document is represented by a matrix of size 128x256. If a word's representation is shorter than 256 digits, or if a document consists of fewer than 128 words, the remaining columns or rows are simply padded with 0s.

Compared to the model by Zhang, Zhao, and LeCun [1], which used the first 1012 characters of each document as input and a one-hot encoding for each character (resulting in a 1012 by 70 matrix for each document), this approach is much more efficient, since it can fit more information into a smaller matrix.

B. Network Architecture

For classification, the authors use a CNN with multiple layers. From the figure posted in the original paper we reconstructed the following network architecture: The input is fed into 4 parallel convolutional layers (256 filters, kernel
size = 3). The output of each of these is max-pooled (pool size = 3). The outputs of these layers are then concatenated in a merge layer. Two more convolutional layers (256 filters, kernel size = 5 for both) and two max-pooling layers (pool sizes 3 and 4) follow this step, in the order Conv->MaxPool->Conv->MaxPool. The result of this is flattened and passed to a fully connected layer consisting of 128 neurons. The final fully connected layer consists of a number of neurons equal to the number of distinct classes of the current problem. The activation functions of the network are not specified in the paper.

C. Testing

The performance of the model is assessed in two ways: the training time and the error rate. The authors compare the running time on 2 GPUs, an Nvidia GeForce 1080ti and an Nvidia GeForce 930M, and they compare the test error rate between their model and the results provided in the paper by Zhang, Zhao, and LeCun [1]. They claim that the training time is faster, even on a less powerful GPU, and that the error rate is comparable to the model proposed by Zhang, Zhao, and LeCun [1].

III. REPRODUCIBILITY METHODOLOGY

In this section we discuss which aspects of the original paper we tried to reproduce and how we approached the reproduction.

A. Data Collection

The original paper contains little description of the datasets used, except mentioning that they are the same as the ones in Zhang, Zhao, and LeCun [1], and that a detailed description can be found in that paper. In the paper by Zhang, Zhao, and LeCun [1], the way in which the data was procured is not entirely clear. Some of the datasets are publicly available, well-maintained with proper documentation, and accessible, such as the Yahoo! Answers Topic Classification dataset. Other datasets were extracted from public websites, such as the DBpedia Ontology Classification dataset. However, it was not clear how the information was scraped from the websites, and no corresponding code was provided. Moreover, there were other datasets for which a link is provided, but the information behind the link is outdated and no longer useful, as was the case for the AG news dataset. Specifically, a link was provided to the raw data from which the news articles were extracted, but almost all the links to the news articles were no longer functional.

The lack of clarity in terms of data access caused confusion when trying to reproduce the results. A significant amount of time had to be spent tracking down the datasets used. However, upon examining the website of the lab at which the paper by Zhang, Zhao, and LeCun [1] was written, a link to the actual data was discovered; the paper itself does not include a direct reference to it. Within the content provided, train and test sets are given separately, making the split easily reproducible.

To make the paper under review more reproducible in the future, it should include a direct link to the datasets it used, as well as an original description of the data and the corresponding code. Overall, if the link to the dataset is directly provided, the data collection process is highly reproducible.

To account for the scope of this project, we decided to focus on only three of the eight provided datasets. We chose the following 3 datasets:

- AG's news dataset: a corpus made up of news articles classified into 4 classes, with 120,000 training examples and 7,600 testing examples [1].
- DBpedia ontology dataset: a corpus of Wikipedia articles classified into 14 categories, with 560,000 training examples and 70,000 testing examples [2].
- Yelp polarity dataset: a corpus of Yelp reviews classified as either positive or negative, with 560,000 training examples and 38,000 testing examples [3].

B. Implementation of the Model

In order to reproduce the results from the paper, it was first necessary to implement the model given in the article, namely both the encoding and the CNN architecture, since the code was not provided to us. We chose to use an environment as close as possible to the one used by the authors of the article. The coding was hence done in Python 3.6 and the CNN was implemented using Keras 2.0 [4], with Theano as backend [5].

a) Encoding and Preprocessing: Using the description in the original paper, we wrote an encoding function. The compressed code of each character in the training set was calculated in the way described above. We then created a dictionary of the characters present in the training set and mapped them to their respective encodings (based on frequency in the training set). Each batch of data sent to the CNN is encoded using this function.

Several difficulties emerged while preprocessing the data. The authors wrote that they preprocessed the data by using only lowercase characters. However, no further instructions on data preprocessing were given. This led to some confusion when we took a closer look at the datasets: for example, some datasets contained many occurrences of the character \ between 2 words instead of a space (the readme files informed us that these represent newline characters). However, no explanation was given in the original paper on how to encode these characters, or whether to ignore them. We chose the simplest approach and treated all characters as regular characters.

The datasets include a title and a description. It is not stated in the paper which of these should be used for classification. We decided, consistent with the procedure used by Zhang and LeCun [6], to use the concatenation of title and description as data samples. For easier reproduction, clear explanations of all the preprocessing decisions should be provided.

b) Convolutional Neural Network: More challenges for reproducibility came up during the implementation of the CNN. The authors described their network architecture solely using a table with the columns Layer type, shape, # of parameters,
connected to. We deduced our final implemented network structure from this table. Implementing the CNN was not a straightforward task, as many parameters of the network had to be inferred by looking at the values in the table and making deductions about the network structure. For instance, we deduced that the convolutional layers use 256 filters and valid padding with a stride of 1 by looking at the described output shape (None, 126, 256). For reproducibility purposes, it would have been helpful (and time-saving) to state all the relevant parameters of the layers, or to provide the code.

One parameter that was completely missing from the paper, and was therefore hard to reproduce correctly, was the choice of activation functions. We contacted the authors but received no response in time for submission. In the end, we chose to implement ReLU (rectified linear unit) activation functions for the convolutional layers and the first fully connected layer. We expect that ReLU activations are the ones used by the authors, since they are typically used in CNNs. We chose to implement a softmax activation function for the output layer.

While some of the hyperparameters of the model were stated very clearly in the paper, others were missing. The number of epochs, the optimizer (with its parameters) and the batch sizes were provided and copied for our implementation. However, no information was provided on the loss function used, or on whether the batches were provided to the CNN in shuffled order. On the other hand, the hardware, operating system and programming environment were specified by the authors. They also provided a good description of the libraries used (with version numbers). This made it easy to compare the final running times and make deductions as to why there might be differences.

C. Testing Accuracy on Different Corpora

Once we had implemented this model, the second step was to try to reproduce the performance the authors claimed it achieved on the different corpora used.

One issue we ran into was figuring out how the loss used to compare models in Table 6 of the original paper was calculated. The authors titled the table "Loss comparison among traditional models", but did not include more details. We contacted the authors using the OpenReview platform and they quickly answered. They informed us that the values presented in Table 6 were the error rates on the test set. Once this was made clear, we were able to try to reproduce the error rates on the three datasets mentioned earlier.

D. Comparing Running Time

In the original paper, the authors stress on several occasions that their model achieves fast training times (especially compared with the running times in [6]). It therefore seemed important to us to be able to reproduce the same training time. However, the authors trained their network on GPUs (either an Nvidia 1080ti or an Nvidia 930M) that were not available to us. Even if they were, the exact same running time might not be achievable because of differences in the implementations of the model. For example, because of limited memory, we are not able to store the encodings of all the documents of a large corpus. Therefore we need to recompute the encoding of a document at each epoch. This means that the encoding of a document is calculated 5 times in total (once for each epoch). We do not know if the authors had to use a similar trick, or if they had a larger amount of memory to work with.

In order to train the model, we obtained Google Cloud Computing credits which enabled us to train the model on an Nvidia K80 GPU.

Since we did not have the same GPU, our goal was not to get the same training time, but to get the same proportion between training times on different datasets. For example, on the Nvidia 930M GPU, the authors report a training time of 25 minutes per epoch on the AG news corpus, and a training time of 116 minutes per epoch on the DBpedia dataset. It thus took about 4.6 times longer to train the model on the DBpedia dataset. For reproduction, we wanted to see if we would reach a similar factor between the training times of our implementation of the model on those 2 datasets. Note that the factors are not the same between the two GPUs used in the original paper. We consider the factors obtained on the 930M GPU to be more reliable, as the training time on this GPU is longer, and therefore the reported time is less sensitive to rounding.

IV. EMPIRICAL RESULTS

In this section we provide the empirical results obtained during our reproduction and compare them with the results from the original paper.

A. Error Rates

Table I shows the error rates obtained on the test set for the three datasets that were analyzed. The error rates obtained are very close to the ones from the original paper, but not exactly the same. Interestingly, we achieve better results for two of the datasets (ag_news and yelp_polarity) but worse results on the dbpedia dataset.

TABLE I: Comparison of the test error (in %) between original and replicated results.

Dataset         Original Results    Reproduced Results
ag_news         12.33               11.89
dbpedia         2.07                2.33
yelp_polarity   7.96                7.84

B. Running Times

Table II shows the training time per epoch reported in the original paper and obtained in our reproduction. For better comparison, the original paper tested its running times on two different GPUs. Our times are similar to the times achieved in the original paper when testing on the 1080ti GPU. These times are significantly faster than the ones obtained on the 930M GPU. What is important to notice is that we obtained a similar proportion between training times on the different datasets, even though the model seems to train a little faster than expected on the yelp_polarity dataset.
TABLE II: Time comparison of training time (per epoch) between the reports of the original paper and our reproduction.

Dataset         Original Time (Nvidia 1080ti GPU)   Original Time (Nvidia 930M GPU)   Reproduced Time (Nvidia Tesla K80)
ag_news         3 min                               25 min                            4.87 min
dbpedia         18 min                              116 min                           23.11 min
yelp_polarity   21 min                              119 min                           20.17 min
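To make the encoding scheme of Section II-A concrete, the following is a minimal Python sketch of our reading of it. This is not the authors' code: the function names are ours, and details the paper leaves open (tie-breaking between equally frequent characters, and the treatment of characters unseen at training time) are handled here with simple assumptions.

```python
from collections import Counter

def build_codes(corpus):
    """Rank characters by frequency over the training corpus and give the
    character of rank i (0-indexed, most frequent first) the code
    '1' + '0' * i + '1'.  Ties keep first-seen order -- an assumption,
    since the paper does not specify tie-breaking."""
    counts = Counter(ch for doc in corpus for ch in doc)
    return {ch: "1" + "0" * i + "1"
            for i, (ch, _) in enumerate(counts.most_common())}

def encode_document(doc, codes, max_words=128, max_digits=256):
    """Encode one document as a 128 x 256 matrix of 0/1 digits: one row
    per word, each row being the concatenated character codes, truncated
    to 256 digits and zero-padded.  Characters without a code (unseen at
    training time) are skipped -- again an assumption."""
    matrix = []
    for word in doc.split()[:max_words]:
        digits = "".join(codes.get(ch, "") for ch in word)[:max_digits]
        row = [int(d) for d in digits]
        matrix.append(row + [0] * (max_digits - len(row)))
    # Pad short documents with all-zero rows up to 128 words.
    matrix += [[0] * max_digits for _ in range(max_words - len(matrix))]
    return matrix
```

With a training text in which e, a and t are the three most frequent characters, build_codes assigns them the codes 11, 101 and 1001, matching the example given in Section II-A.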