SEMESTER 2, 2018
DATA ANALYSIS CHALLENGE:
CLASSIFYING NEWS ARTICLES
Nikhil Taralkar 28980433
Kunal Sharma 29009626
Varun Mathur 28954114
Contents
1. Introduction
2. Libraries Used
3. Pre-processing Steps
4. Feature Selection
5. Model/Algorithm for classifiers
5.1 Naïve Bayes
5.2 Auto-Encoder
5.3 SVM (Support Vector Machines)
6. Conclusion
7. References
1. Introduction
This project asks us to classify a set of news articles (text documents). A set of training documents (training_docs), testing documents (testing_docs) and training labels (training_labels) is provided, from which a classifier must be built to predict the class label of each testing article.
2. Libraries Used
S.No | Library used | Description
1 | e1071 | Contains the SVM implementation. This package must be loaded for the svm function to run.
3. Pre-processing Steps
The following pre-processing steps were applied:
• Case normalization: while counting frequencies, tokens are folded to lower case, so that a word spelled in capitals and in lower case is treated as the same word. This improves the analysis of the documents.
• Tokenization: the text is segmented into smaller units such as words, numbers and alphanumeric strings. In the process, some characters such as punctuation marks are discarded.
• Removal of stop words: stop words are words that occur frequently but carry very little information. Removing them improved the analysis of our text.
• Removal of the most/least frequent words: words that appeared in more than 95% of the documents, or in fewer than 5% of them, were removed.
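The steps above can be sketched in Python using only the standard library. The stop-word list and the tokenizer pattern below are illustrative assumptions, not the exact code used for the report:

```python
import re
from collections import Counter

# Illustrative stop-word list; the actual list used is an assumption.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "and", "to", "it"}

def preprocess(doc):
    """Lower-case, tokenize, and remove stop words from one document."""
    tokens = re.findall(r"[a-z0-9]+", doc.lower())  # case normalization + tokenization
    return [t for t in tokens if t not in STOP_WORDS]

def filter_by_document_frequency(docs_tokens, low=0.05, high=0.95):
    """Drop words appearing in more than 95% or fewer than 5% of documents."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # count each word once per document
    keep = {w for w, c in df.items() if low < c / n_docs < high}
    return [[t for t in tokens if t in keep] for tokens in docs_tokens]

docs = ["The market rose sharply.", "The team won the match.", "The market fell."]
tokenized = [preprocess(d) for d in docs]
filtered = filter_by_document_frequency(tokenized)
```

The document-frequency thresholds (5% and 95%) are the ones stated above; on a corpus this small they leave every content word in place.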
4. Feature Selection
Feature selection determines which properties of the text the classifier uses as predictors. The following features were extracted to process our data:
• Unigram feature: the unigrams (single words) were identified and processed. This allows the frequency of each word to be used as a predictor.
• Bigram feature: each pair of adjacent words forms a bigram. A bigram supports predicting a word based on the one before it. The bigrams were identified and processed.
• TF-IDF (Term Frequency-Inverse Document Frequency): tf-idf evaluates how important a word is to a document. Words with a high tf-idf score occur frequently in that document and carry the most information about it; the scheme also reduces the weight of words that are common across all documents.
We used Python for the processing and the generation of these features. The feature-selection code was run on both the training and the testing documents.
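A minimal sketch of the three features, assuming the plain weighting tf(t, d) · log(N / df(t)); the exact tf-idf variant used for the report is not stated:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list (n=1: unigrams, n=2: bigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs_tokens):
    """tf-idf(t, d) = tf(t, d) * log(N / df(t)), computed per document."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # document frequency: one count per document
    scores = []
    for tokens in docs_tokens:
        tf = Counter(tokens)    # raw term frequency within this document
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores

docs = [["stocks", "rise"], ["stocks", "fall"], ["team", "wins"]]
bigrams = ngrams(["stocks", "rise", "today"], 2)
weights = tf_idf(docs)
```

A word that appears in every document gets idf = log(1) = 0, which is how the common-word down-weighting described above falls out of the formula.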
5. Model/Algorithm for classifiers
5.1 Naïve Bayes
Naïve Bayes is a probabilistic classifier that applies Bayes' theorem under the assumption that features are independent given the class.
This classifier was run on both the training_docs and testing_docs, and the accuracy obtained was as follows:
Accuracy = 44.7%
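The report does not reproduce the classifier code; the following is a minimal multinomial Naïve Bayes sketch with Laplace smoothing, in the spirit of the Nazrul reference listed in Section 7. The training documents and labels are made up for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""
    vocab = {w for d in docs for w in d}
    class_docs = defaultdict(list)
    for d, y in zip(docs, labels):
        class_docs[y].append(d)
    priors, word_counts, totals = {}, {}, {}
    for y, ds in class_docs.items():
        priors[y] = math.log(len(ds) / len(docs))      # log P(class)
        counts = Counter(w for d in ds for w in d)     # word counts per class
        word_counts[y] = counts
        totals[y] = sum(counts.values())
    return vocab, priors, word_counts, totals, alpha

def predict_nb(model, doc):
    """Return the class with the highest log-posterior for the document."""
    vocab, priors, word_counts, totals, alpha = model
    best, best_score = None, -math.inf
    for y in priors:
        score = priors[y]
        for w in doc:
            if w in vocab:  # ignore words never seen in training
                p = (word_counts[y][w] + alpha) / (totals[y] + alpha * len(vocab))
                score += math.log(p)
        if score > best_score:
            best, best_score = y, score
    return best

train = [["stocks", "rise"], ["market", "stocks"], ["team", "wins"], ["match", "team"]]
labels = ["business", "business", "sport", "sport"]
model = train_nb(train, labels)
print(predict_nb(model, ["stocks", "market"]))  # prints "business"
```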
5.2 Auto-Encoder
The auto-encoder classifier is based on neural networks. Given contextual information about a document (in the form of unigrams, bigrams, etc.), this classifier predicts its class label (e.g., positive, negative, neutral).
An artificial neural network is a non-linear, non-parametric model, whereas many other statistical methods are parametric models that demand a stronger statistical background. Neural networks are used to solve highly complex problems with too many variables to be simplified. They can detect complex non-linear relationships between the dependent and independent variables, as well as possible interactions between predictor variables.
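The report does not give the network architecture. As an illustration of the neural-network idea described above, here is a minimal one-hidden-layer network trained by gradient descent; numpy, the toy bag-of-words data, the layer size and the learning rate are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy bag-of-words matrix (rows = documents, columns = vocabulary terms)
# and binary class labels; both are made up for illustration.
X = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
y = np.array([[1], [1], [0], [0]], dtype=float)

# One hidden layer of 3 units; small random initial weights.
W1 = rng.normal(scale=0.5, size=(4, 3))
W2 = rng.normal(scale=0.5, size=(3, 1))

for _ in range(2000):                            # plain gradient descent
    H = sigmoid(X @ W1)                          # hidden activations
    out = sigmoid(H @ W2)                        # predicted class probability
    grad_out = out - y                           # cross-entropy gradient at output
    grad_h = (grad_out @ W2.T) * H * (1 - H)     # back-propagated hidden gradient
    W2 -= 0.5 * H.T @ grad_out
    W1 -= 0.5 * X.T @ grad_h

pred = (sigmoid(sigmoid(X @ W1) @ W2) > 0.5).astype(int)
```

On this tiny, linearly separable toy set the network fits the training labels exactly; real text data would of course need far more capacity and regularisation.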
This classifier was run on both the training_docs and testing_docs, and the accuracy obtained was as follows:
Accuracy = 65.611%
5.3 SVM (Support Vector Machines)
An SVM is a discriminative classifier defined by a separating hyperplane. Given labelled training data, the algorithm outputs an optimal hyperplane which can categorize new examples. SVMs are useful in the case of non-regularity in the data, i.e., when the data is not regularly distributed. SVMs deliver a unique solution, which is an advantage over neural networks, whose multiple solutions are associated with local minima and may not be robust across different samples. SVMs also work well on unbalanced data.
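The report trained the SVM with the svm function from R's e1071 package; a roughly equivalent Python sketch, assuming scikit-learn and made-up training documents, would be:

```python
# Sketch only: scikit-learn stand-in for the R e1071 svm() call used in the
# report, with invented example documents and labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_docs = ["stocks rise on strong earnings", "market falls after report",
              "team wins the final match", "striker scores twice"]
train_labels = ["business", "business", "sport", "sport"]

# tf-idf features feeding a linear SVM, mirroring Sections 4 and 5.3.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_docs, train_labels)

pred = clf.predict(["market rises on earnings"])
```

A linear kernel is a common default for high-dimensional sparse text features; the kernel and cost settings actually used in the report are not stated.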
This classifier was run on both the training_docs and testing_docs, and the accuracy obtained was as follows:
Accuracy = 66.44%
6. Conclusion
Different classifiers were built to classify the testing articles and predict the class label of each. The accuracy obtained with the Naïve Bayes classifier is 44.7%, with the Auto-Encoder classifier 65.611%, and with SVM (Support Vector Machines) 66.44%. The best result is obtained with the SVM classifier.
7. References
Nazrul, S. (2017). Multinomial Naive Bayes Classifier for Text Analysis (Python). Retrieved from https://towardsdatascience.com/multinomial-naive-bayes-classifier-for-text-analysis-python-8dd6825ece67