
Sur: A tool for pattern identification and sentiment detection in microtexts

Carlos Zuniga-Solis, Jose A. Olivas
SMILE Research Group, University of Castilla-La Mancha, Ciudad Real, Castilla-La Mancha, Spain

Abstract

Knowledge exchange and opinion sharing over the Internet have reached levels never experienced before. People from different regions and socio-cultural backgrounds can now create as much web content as they wish. This data represents a massive source of information that is useful for understanding many aspects of society. Through sentiment analysis it is possible to leverage this highly topical data to identify people's perception of a certain topic. Different approaches have been implemented to detect sentiment in microtexts, mainly using lexical ontologies and classification models. In this work, a tool designed for sentiment detection in microtexts, named Sur, is presented. This tool leverages inductive learning to find differentiating patterns in opinions about a given topic. Using the identified patterns, Sur creates decision trees to classify microtexts as supportive or unsupportive towards the analyzed topic.

Keywords: Induction, Microtexts, Sentiment Analysis.

Introduction

The advancement of web technologies over the last decade has allowed users to generate as much content on the Internet as they wish. This has led to knowledge exchange and opinion sharing at levels never experienced before, and social networks have been crucial facilitators of this process. The ease with which people can create social network messages is one of the key success factors of these web sites. According to the Twitter blog, in June 2011 Twitter users sent 200 million tweets per day, enough text to fill a 10-million-page book every day. Through these 140-character, unstructured and conversational writings, 100 million users express what they are doing, what they are thinking and how they are feeling. By reading a person's tweets, it is possible to understand many things about that person. Therefore, studying the data of hundreds of thousands of tweets might help to develop a better understanding of our society at large. A powerful technique to leverage this huge amount of data is sentiment analysis.

Sentiment analysis is the task of identifying positive and negative opinions, emotions, and evaluations [1]. Through sentiment analysis of tweets, researchers have found that Americans seem considerably happier on Sunday morning than on Thursday evening, and that West Coast residents might be happier than their East Coast counterparts [2]. Different approaches converge in the search for the optimal sentiment detection technique. A very straightforward method is to use sets of terms with positive and negative connotations to classify the sentiment of the words forming a message. However, the explosion of Web 2.0 and wider Internet coverage have allowed people worldwide to express their opinion about any topic. In this scenario, the positive or negative meaning of a word becomes relative to the region or sociocultural context of a debate or conversation. For this reason it might be more suitable to use an approach that learns from the contextual relationships of words instead of their linguistic relationships. This could be accomplished through inductive learning, leveraging its capability to discover distinguishing patterns and thus identifying recurrent concepts in positive or negative messages in any context.

In this work a tool for sentiment detection in microtexts is presented. The developed tool, named Sur, leverages inductive learning and decision trees to classify microtexts as supportive or unsupportive towards a given topic. A test case using real-life data is also presented, showing promising results.

State of the art

After the release of the Twitter developer platform in 2006, researchers obtained instant access to what might be the largest dataset in history, with large numbers of new, highly topical messages available every second. This provided a great opportunity to analyze thousands of points of view and to find useful information for many different fields using specialized software for sentiment detection. The Twitter developer platform consists of a set of APIs that allow developers to integrate their custom applications with this social network. One of these APIs, the Search API, is particularly useful for data extraction tasks, since it allows a custom application to query for tweets; thus it is possible to obtain and store thousands of tweets about a certain topic within a given date range in a matter of seconds. The increasing number of studies about sentiment analysis published since the release of the platform, and the growing number of countries conducting research on this topic, may be good indicators of how much the platform has facilitated this work.

The ease with which messages on social networks can be created has been one of the key factors for the success of the most popular social network web sites. A message on a social network falls under the category of microtext. Microtexts are very short individual contributions, often consisting of a single sentence that can be as short as a single word. They are further characterized by informal and unstructured grammar. These messages have a conversational tone and usually include colloquial expressions. They are usually not edited before being sent, which makes them error prone [3]. These characteristics allow inexperienced users to quickly become creators of content. Nevertheless, they also make microtexts particularly hard to analyze for sentiment detection: extracting sentiment information from a sentence consisting of just a few words is not an easy task.

For this reason researchers have implemented different methodological approaches in order to obtain the best results from the available data. Lexical ontologies and classification models stand out as the preferred techniques. A literature review of studies related to sentiment detection in microtexts found 24 studies. Implementations of classification models were found in 14 of the reviewed works, lexical ontologies in 7, and the remaining 3 works used third-party software or other techniques. According to this review, when classification models were implemented, Naive Bayes classifiers and SVMs were predominant. To a lesser extent, bag-of-words and k-nearest-neighbor classification models were used. In the studies that detected sentiment using lexical ontologies, there was a strong tendency towards the English lexical database WordNet and its derivatives, such as WordNet-Affect or SentiWordNet.

The reason why classification models seem to be more popular among sentiment detection researchers is their capacity to leverage context information without depending on it. The same classification model could be used for sentiment detection in different languages. Lexical ontologies, on the other hand, are domain dependent and require a significant effort to generate lists of words for each domain [4]. To avoid the burden of creating a lexical ontology for every domain, several studies took a simple approach and used a generic ontology such as WordNet. However, this method is not always suitable, since the meaning of many terms and expressions is often context dependent.

The importance of the context in which a term is used becomes clear when the variety of practical applications found in the reviewed literature is analyzed. As suggested by many studies, sentiment detection in social network messages could be a useful tool to quickly identify, or even anticipate, situations affecting a certain group of people positively or negatively. Flu outbreaks, radicalization of opinions, and the loss of popularity of a tourist destination are some of the examples found. As can be seen, the variety of topics addressed might generate accuracy problems if a generic lexical ontology is used.

Architecture and Methodology

In order to classify microtexts as supportive or unsupportive towards a given topic while leveraging the context of the information, the sentiment detection software Sur¹ was created. Sur consists of two modules: Microtext extraction and Text analysis.

3.1 Microtext extraction

The data extraction module allows extracting microtexts from 5 sources: the digital versions of the Spanish newspapers El País, El Mundo and ABC, Twitter, and Facebook. To obtain microtexts from the digital versions of El País, El Mundo and ABC, a custom process was built to extract the comments from a web article. To retrieve microtexts from Twitter, an interface was built on top of Twitter's Search API, which allows querying for Twitter content in the same way a user would search for tweets using the social network's own search engine. Once the search is done, the tweets found can be stored for later analysis. Whilst Twitter's Search API allows retrieving thousands of tweets in a matter of seconds, a very important limitation must be considered: Twitter's index includes only the last 6 to 9 days of tweets.

¹ Sur is a Bribri word that means fishing arrow. For the Bribri, a native Costa Rican tribe, arrows are also capable of creating cracks through which light and thought may pass.

To obtain microtexts from Facebook, Sur leverages Facebook's Graph API². This interface grants access to the different objects of this social network and allows navigating the relationships between them. For microtext extraction, Sur offers an interface to access the comments on posts of public pages. The user must enter the URL of a public Facebook page and a date range in which to search for its posts. Once the search has finished, Sur shows the user the posts found, their creation dates, and the number of comments received. The last step is to select one of the retrieved posts, whose comments are then extracted and stored for later analysis.

3.2 Text analysis

The text analysis capabilities offered by Sur are designed to be used after a dataset about a certain topic has been obtained through the Microtext extraction module. The Text analysis module allows using the retrieved microtexts as a training set or as an evaluation set. The evaluation set is used for automatic sentiment detection supported by a previously created training set. In order to create a training set, Sur allows experts to classify the microtexts by their sentiment as positive, negative, neutral or noise. The noise category should be used in those cases where a message does not offer any point of view strictly related to the subject matter addressed. Once the microtexts have been classified, they receive linguistic pre-processing and the resulting data is stored in a text index. The pre-processing consists of the following tasks: tokenization, punctuation removal, lowercasing, stopword removal and stemming.

Once the training set has been pre-processed and indexed, Sur lets users inspect the term-frequency distribution of the classified microtexts, filtered by the sentiment assigned by the expert. Sur also has the ability to find recurring groups of terms in the positive and negative training sets. This functionality allows finding differentiating patterns between the positive and negative microtexts of a certain subject. It has been implemented following the rules of the First Order Inductive Learner algorithm [5] in order to obtain patterns that are exclusive to positive or to negative comments. Rarely used words are not useful for finding representative patterns; for this reason, Sur ignores words that appear in fewer than 20% of the microtexts forming the positive or negative training set. To find and remove these words Sur internally uses its term-frequency functionality. When searching for differentiating term patterns, it is necessary to choose the number of terms included in the patterns to find. This makes it possible to identify the patterns with the widest coverage throughout the training sets.

After a training set has been stored in a text index, it is possible to perform automatic sentiment detection on evaluation sets by creating decision trees based on the differentiating term patterns of the training set. Sur performs this process through implementations of the ID3 and CART algorithms. To create the ID3 or CART decision tree the user must choose the number of terms that make up the patterns to find. For this reason, it is recommended to perform prior tests on the training set using Sur's pattern identification functionality, in order to determine the number of terms that covers as many microtexts as possible. The terms of the patterns found are later used to create the nodes of the decision trees. For this reason, Sur offers the possibility to specify a minimal number of occurrences for the patterns to use, which allows pruning the tree before it is created.

The following step is to decide whether to use the ID3 or the CART algorithm to create the decision tree. For the ID3 algorithm, the terms of the patterns found are the nodes of the tree, and the possible values for every term are whether or not the term is present in a given microtext, in other words true or false. For the CART algorithm, each node of the tree is a condition composed of a term and one of its possible values in a microtext: true or false. Given the nature of the CART algorithm, every node has two children, one for those cases where the condition is met and another for those cases where it is not. The leaves of the ID3 and CART trees are the sentiment classes 'positive' and 'negative'. Because every term takes only two values in the methodology used in Sur, both algorithms produce binary trees. Once the decision tree has been created, the user can use it for sentiment detection on individual examples.
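The ID3 variant described in this section, where every node tests whether a term is present in a microtext, can be illustrated with a minimal sketch. This is not Sur's implementation: the function names (`entropy`, `information_gain`, `build_id3`, `classify`) and the four toy training examples are assumptions made for illustration, and the tree is built with the standard information-gain criterion of ID3.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, term):
    """Entropy reduction obtained by splitting on presence/absence of `term`."""
    gain = entropy([label for _, label in examples])
    for present in (True, False):
        subset = [label for terms, label in examples if (term in terms) == present]
        if subset:
            gain -= len(subset) / len(examples) * entropy(subset)
    return gain

def build_id3(examples, terms):
    """Return a leaf label, or a dict node with a True branch and a False branch."""
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not terms:
        return majority
    best = max(terms, key=lambda t: information_gain(examples, t))
    if information_gain(examples, best) <= 1e-12:
        return majority  # no remaining term separates the classes
    remaining = [t for t in terms if t != best]
    node = {"term": best}
    for present in (True, False):
        subset = [(ts, lab) for ts, lab in examples if (best in ts) == present]
        node[present] = build_id3(subset, remaining) if subset else majority
    return node

def classify(tree, terms):
    """Walk the tree following the presence/absence of each tested term."""
    while isinstance(tree, dict):
        tree = tree[tree["term"] in terms]
    return tree

# Toy training set (invented for illustration): sets of stems with a label.
training = [
    ({"garzon", "juez", "ley"}, "positive"),
    ({"derech", "juez", "ley"}, "positive"),
    ({"espa", "garzon", "juez"}, "negative"),
    ({"garzon", "juez", "pais"}, "negative"),
]
vocabulary = sorted({t for terms, _ in training for t in terms})
tree = build_id3(training, vocabulary)
print(tree)
print(classify(tree, {"juez", "ley"}))
```

Because each term has only the values true and false, every internal node has exactly two branches, which is why the resulting tree is binary.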
² https://developers.facebook.com/docs/reference/api/
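The linguistic pre-processing tasks listed in Section 3.2 can be sketched with the standard library alone. The stopword list and the `naive_stem` suffix-stripping rule below are simplified assumptions for illustration, not the stemmer or resources used by Sur; a real system would rely on a proper Spanish stemmer and stopword list.

```python
import re

# Tiny, hypothetical Spanish stopword list (for illustration only).
STOPWORDS = {"el", "la", "los", "las", "de", "del", "en", "y", "que", "es", "un", "una"}

def naive_stem(token):
    """Strip a few common Spanish endings; a crude stand-in for a real stemmer."""
    for suffix in ("os", "as", "o", "a"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 4:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop punctuation, lowercase, remove stopwords, then stem."""
    tokens = re.findall(r"\w+", text)        # tokenization; punctuation is discarded
    tokens = [t.lower() for t in tokens]     # lowercasing
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [naive_stem(t) for t in tokens]   # stemming

print(preprocess("El estado de derecho."))
print(preprocess("La derecha, otra vez"))
```

Even this crude rule shows how two different words can collapse to the same stem once their endings are removed, so any contextual distinction between them is no longer visible to later stages.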

Practical case

In order to test Sur's sentiment detection accuracy, a topic relevant to Spanish society was selected: the conviction of former Spanish judge Baltasar Garzon Real. This topic was chosen because the views it generated on social networks were so polarized that even a slightly informed person could easily tell whether a comment supports or rejects Garzon's conviction. Nevertheless, despite the ease with which a person could detect the sentiment in the related microtexts, the automatic classification of messages as supportive or rejective faces significant challenges. This apparently trivial case illustrates the complexities of Artificial Intelligence and Natural Language Processing.

The first step was to find articles related to Garzon's conviction in the digital versions of the Spanish newspapers El Mundo and El País. These two daily newspapers were selected because of the opposing political leanings of their regular readers. Once the articles were chosen, a total of 9760 comments were retrieved by Sur, 2780 from El Mundo and 6980 from El País. Subsequently, 4000 of the retrieved comments were classified by an expert, 2000 from El País and 2000 from El Mundo. Purely for classification purposes, comments supporting Garzon's conviction were categorized as positive and those rejecting it as negative. A total of 963 comments were classified as positive and 675 as negative; the remaining 2362 were marked as noise. The positive and negative microtexts received linguistic pre-processing and were stored in a text index as the training set for this test case.

Following the linguistic pre-processing of the microtexts, an interesting problem related to the context of this particular case emerged. Both Spanish terms "derecho" and "derecha" stem to "derech". Nevertheless, in the microtexts retrieved, "derecho" was frequently used as part of expressions such as "estado de derecho" (rule of law) and "derecho de defensa" (right of defense), which are very common in opinions in favor of Garzon's conviction. On the other hand, the term "derecha" referred to right-wing politics, a very common reference in the microtexts rejecting the judgement. In this case, stemming discards the contextual value of the term. This problem might be solved in the future by leveraging synonyms and multi-term expressions in the linguistic pre-processing stage.

The next step was to identify the best differentiating patterns in order to determine the optimal parameters to specify later, when the decision trees were created. It was determined that only those terms with a frequency equal to or greater than 20% of the microtexts in the training set would be taken into account for pattern identification. Thus, thousands of seldom used terms were removed, improving system performance and result quality. Very few patterns consisting of 5 and 4 terms were found, and those patterns covered only a minimal percentage of the microtexts in the dataset. On the other hand, a significant number of 3-term patterns emerged, enough to cover the entire dataset if put together. Table 1 shows a sample of the most frequently found patterns.

Table 1. Sample of frequently found patterns

Pattern                  Occurrences   Sentiment
[garzon, juez, ley]      53            Positive
[derech, juez, ley]      38            Positive
[espa, garzon, juez]     31            Negative
[garzon, juez, pais]     28            Negative
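The idea behind the pattern search (discard infrequent terms, then look for fixed-size term groups that occur in only one of the two training sets) can be sketched as follows. This is a brute-force enumeration written for illustration and is not the First Order Inductive Learner procedure used by Sur. The function names and the toy microtexts are assumptions, and so is the reading of the frequency rule: a term is kept if it reaches the threshold in either training set. The threshold is raised to 50% only because the toy sets are tiny.

```python
from collections import Counter
from itertools import combinations

def frequent_terms(docs, min_share=0.20):
    """Terms that appear in at least `min_share` of the documents."""
    counts = Counter(term for doc in docs for term in set(doc))
    return {term for term, n in counts.items() if n / len(docs) >= min_share}

def pattern_counts(docs, vocabulary, size):
    """Count every `size`-term combination (restricted to `vocabulary`) per document."""
    counts = Counter()
    for doc in docs:
        kept = sorted(set(doc) & vocabulary)
        counts.update(combinations(kept, size))
    return counts

def exclusive_patterns(positive, negative, size=3, min_share=0.20):
    """Patterns occurring in one training set and never in the other."""
    vocabulary = frequent_terms(positive, min_share) | frequent_terms(negative, min_share)
    pos = pattern_counts(positive, vocabulary, size)
    neg = pattern_counts(negative, vocabulary, size)
    only_pos = {p: n for p, n in pos.items() if p not in neg}
    only_neg = {p: n for p, n in neg.items() if p not in pos}
    return only_pos, only_neg

# Toy pre-processed microtexts (invented for illustration).
positive = [["garzon", "juez", "ley"], ["garzon", "juez", "ley", "delit"], ["derech", "juez", "ley"]]
negative = [["espa", "garzon", "juez"], ["garzon", "juez", "pais"], ["espa", "garzon", "juez", "pais"]]

only_pos, only_neg = exclusive_patterns(positive, negative, size=3, min_share=0.5)
for pattern, n in sorted(only_pos.items(), key=lambda item: -item[1]):
    print("positive", list(pattern), n)
for pattern, n in sorted(only_neg.items(), key=lambda item: -item[1]):
    print("negative", list(pattern), n)
```

Running the same search with different values of `size` and comparing how many microtexts the resulting patterns cover is one way to choose the pattern length before building the trees.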

Because of the differentiating capabilities of the patterns, and given that 3-term patterns offered broad dataset coverage, it was decided to use these patterns for the later creation of the decision trees. In order to test the sentiment detection capabilities of the presented approach, the training set built from the 4000 microtexts classified by an expert was used to create ID3 and CART decision trees. Subsequently, a set of 186 comments was retrieved from articles related to Garzon's sentence in the digital version of the Spanish newspaper ABC. Only comments expressing positive or negative opinions were taken into account. Of the 186 retrieved comments, an expert classified 130 as positive and 56 as negative. These 186 comments were then automatically classified by Sur and compared to the results given by the expert to determine Sur's accuracy. Table 2 shows the results of the comparison.

Table 2. Automatic classification accuracy

Method   Accurate positive   Accurate negative   Average accuracy
CART     107                 37                  77.4%
ID3      105                 38                  76.8%

As the results show, neither classification method is considerably more accurate than the other. It is also worth mentioning that both methods classify the positive examples better than the negative ones. An analysis of the negative patterns found reveals a large number of different term sets, each with a low average number of occurrences. This suggests that, for this particular topic, negative microtexts use a richer variety of terms than positive ones, which might make negative sentiment harder to detect.
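The figures in Table 2 can be checked with a short calculation. The sketch below assumes that the "Average accuracy" column is the share of all 186 evaluation comments that were classified correctly, which reproduces 77.4% for CART and gives about 76.9% for ID3 (reported as 76.8% in the table). The per-class rates are derived here and are not reported in the original table.

```python
# Expert labels in the evaluation set and correct classifications from Table 2.
totals = {"positive": 130, "negative": 56}
results = {
    "CART": {"positive": 107, "negative": 37},
    "ID3": {"positive": 105, "negative": 38},
}

def summarize(correct, totals):
    """Return per-class accuracy and overall accuracy for one method."""
    per_class = {label: correct[label] / totals[label] for label in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return per_class, overall

for method, correct in results.items():
    per_class, overall = summarize(correct, totals)
    print(
        f"{method}: positive {per_class['positive']:.1%}, "
        f"negative {per_class['negative']:.1%}, overall {overall:.1%}"
    )
```

Under this reading, both methods recognize roughly four out of five positive comments but only about two out of three negative ones, which is the gap discussed above.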

Conclusions and open issues

This work presents a tool for pattern identification and sentiment detection in microtexts based on inductive learning and classification trees. The tool, named Sur, is still a prototype; however, it is capable of leveraging the vast amount of available data to extract information from context instead of using linguistic relationships. Sur has only been tested with the data previously discussed; it will shortly be tested using different data sources, such as Twitter, and new topics. Moreover, adding new functionality such as multi-term expression detection and synonym pre-processing will probably improve the accuracy of the results, which for this first test may be considered good.

References

[1] Wilson, T., Wiebe, J., Hoffman, P.: Recognizing contextual polarity in phrase-level sentiment analysis. Proc. of the conference on Human Language Technology and Empirical Methods in Natural Language Processing. pp. 347-354 (2005).
[2] Savage, N.: Twitter as medium and message. Communications of the ACM. 54, 18 (2011).
[3] Ellen, J.: All about microtext: A working definition and a survey of current microtext research within artificial intelligence and natural language processing. ICAART 2011 Proc. of the 3rd International Conference on Agents and Artificial Intelligence. pp. 329-336 (2011).
[4] Guerra, P.H.C., Veloso, A., Meira, W., Almeida, V.: From bias to opinion: A transfer-learning approach to real-time sentiment analysis. Proc. of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 150-158 (2011).
[5] Quinlan, J.R.: Learning logical definitions from relations. Machine Learning. 5, pp. 239-266 (1990).
