
Warner and Hirschberg [6] detected hate speech on
the basis of different aspects including religion. They
defined hate speech in their work and then gathered
data from Yahoo and the American Jewish Congress (AJC):
Yahoo provided data from its news groups, and the
AJC provided URLs of websites marked as offensive. The
data gathered from Yahoo was not as lengthy as that in
Attenberg’s URLs, which contained huge text
descriptions. They first classified the data at paragraph
level and then asked annotators to manually annotate
this data set. They focused on stereotypes and thus
decided to build a language model for stereotypes to
mark hate speech. They built an anti-Semitic speech
classifier first. They identified 9000 paragraphs
matching their regular expression and then removed
those paragraphs that were not offensive. Seven
further categories were then chosen to annotate the data.
After this annotation of their gold corpus, they used
a two-fold cross-validation classifier to obtain a refined
data set. A template-based classification strategy was
adopted along with Brown clusters and part-of-speech
tagging. The log-odds ratio was used to select
features, of which there were initially 4379. These
features were then fed to an SVM classifier, which reduced
them to 3537 features after an elimination
process (based on a log-odds threshold of 1.5). They
reported a baseline accuracy of 0.91. Using an SVM
with a linear kernel function and 10-fold cross-validation,
they obtained an overall accuracy of 0.96, precision of 0.59,
recall of 0.68 and F-measure of 0.63.
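To make the feature-selection step and the reported F-measure concrete, the following is a minimal Python sketch and not the implementation used in [6]. The function names, the add-one smoothing, the use of document frequencies and the use of the absolute value of the log-odds ratio are all assumptions made for illustration; only the threshold of 1.5 and the precision and recall figures come from the text above.

```python
import math
from collections import Counter


def log_odds_ratio(pos_count, neg_count, pos_total, neg_total, alpha=1.0):
    """Smoothed log-odds ratio of a feature between two classes.

    alpha is an assumed add-one style smoothing term that avoids log(0).
    """
    p_pos = (pos_count + alpha) / (pos_total + 2 * alpha)
    p_neg = (neg_count + alpha) / (neg_total + 2 * alpha)
    return math.log(p_pos / (1 - p_pos)) - math.log(p_neg / (1 - p_neg))


def select_features(pos_docs, neg_docs, threshold=1.5):
    """Keep features whose absolute log-odds ratio reaches the threshold.

    Each document is a list of feature strings; a feature is counted
    once per document (document frequency). This is an assumption.
    """
    pos_df = Counter(f for doc in pos_docs for f in set(doc))
    neg_df = Counter(f for doc in neg_docs for f in set(doc))
    selected = {}
    for feat in set(pos_df) | set(neg_df):
        score = log_odds_ratio(pos_df[feat], neg_df[feat],
                               len(pos_docs), len(neg_docs))
        if abs(score) >= threshold:
            selected[feat] = score
    return selected


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)
```

With the figures quoted above, `f_measure(0.59, 0.68)` rounds to 0.63, which agrees with the reported F-measure.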
Motivated by the work in [6], Kwok and Wang
[4] proposed a method for detecting hate speech
against black people on Twitter. They assembled hundreds of
tweets to analyse keywords or sentiments indicating
hate speech. To judge the severity of arguments, a
questionnaire was circulated to students of different
races. A training dataset of 24582 tweets was
pre-processed to correct spelling variations, remove
stop words, eliminate URLs, etc. To classify
tweets, a Naive Bayes (NB) classifier was used to separate racist from
non-racist tweets, and prominent features were
identified from those tweets. The classifier showed
an accuracy of 86%. Later, a unigram approach was
used to build the vocabulary: 9437 unique words
were found in the racist training dataset and 8401
unique words in the non-racist dataset. An accuracy of
76% was achieved using 10-fold cross-validation.
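The pipeline described above (pre-processing followed by a unigram Naive Bayes classifier) can be sketched roughly as follows. This is an illustrative sketch, not the code of [4]: the stop-word list, the regular expressions, the class name `UnigramNaiveBayes` and the add-one smoothing are assumptions, and spelling correction is omitted.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in [4]).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in"}


def preprocess(tweet):
    """Lowercase, drop URLs, keep alphabetic tokens, remove stop words."""
    text = re.sub(r"https?://\S+", " ", tweet.lower())
    tokens = re.findall(r"[a-z']+", text)
    return [t for t in tokens if t not in STOP_WORDS]


class UnigramNaiveBayes:
    """Multinomial Naive Bayes over unigram counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.word_counts = {c: Counter() for c in self.classes}
        self.doc_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        # Vocabulary is the union of the unique words seen in each class.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, doc):
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for c in self.classes:
            counts = self.word_counts[c]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[c] / total_docs)
            for tok in doc:
                if tok in self.vocab:  # ignore words unseen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best
```

In this sketch, `len(model.word_counts[c])` gives the number of unique words per class, which corresponds to the per-class vocabulary sizes quoted above.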
Burnap and Williams [9] used a machine learning
approach to classify tweets as hateful or
antagonistic, focusing on race, ethnicity and religion.
The motivation behind this work was the extensive public
reaction on social media to the murder of drummer
Lee Rigby in Woolwich, UK. They collected 450,000
tweets during the peak period of this event. A
sample of 2000 tweets was then annotated by four
people; tweets on which three of them (75%) agreed were
kept and the others were discarded. In the feature
selection phase, they employed the Stanford Lexical
Parser along with a context-free lexical model to
extract typed dependencies from tweets. They used bag of
words (BoW) as features in two phases. In the first phase,
they used all typed dependencies as features, and in the
second phase they carried out a meta-analysis to find the best
features for hate speech classification. For this purpose,
they used Bayesian logistic regression. They used
classifiers available in WEKA, such as Random Forest decision
trees and SVM. They achieved an F-measure of 95% using
10-fold cross-validation.
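Two steps of this procedure, filtering tweets by annotator agreement and splitting data for 10-fold cross-validation, can be illustrated with the short Python sketch below. The function names, the data layout (a dict from tweet id to a list of labels) and the interleaved fold assignment are assumptions for illustration; [9] used WEKA, not this code.

```python
from collections import Counter


def filter_by_agreement(annotations, min_agreement=0.75):
    """Keep tweets whose most common label reaches the agreement level.

    annotations maps a tweet id to the list of labels given by annotators.
    Returns a dict mapping each kept tweet id to its agreed label.
    """
    kept = {}
    for tweet_id, labels in annotations.items():
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            kept[tweet_id] = label
    return kept


def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

With four annotators and `min_agreement=0.75`, a tweet is kept only when at least three of the four labels match, so a 2-2 split is discarded.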
Ting et al. [17] proposed an architecture for
discovering hate groups on Facebook with the help
of social network analysis and text mining. The extracted
features include keywords that are frequently used in
groups. Sureka et al. [19] proposed an approach
based on data mining and social network
analysis for discovering hate-promoting videos, users
and their hidden communities on YouTube. Chen et al.
[23] presented a framework for identifying
extremist videos on YouTube. The authors extracted
International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835 Volume-4, Issue-1, Jan.-2017
