Predicting the Popularity of Micro-Reviews in Foursquare
Marisa Vasconcelos, Jussara Almeida and Marcos Gonçalves
{marisav,jussara,mgoncalv}@dcc.ufmg.br
Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

ABSTRACT
Location-based social networks (LBSNs), such as Foursquare, allow users to post micro-reviews, or tips, about the places (venues) they visit and to "like" previously posted reviews. Tips may help attract future visitors, besides providing valuable feedback to business owners. In particular, the number of likes a tip receives ultimately reflects its popularity or helpfulness among users. In that context, accurate predictions of which tips will attract more attention (i.e., likes) can drive the design of automatic tip filtering and recommendation schemes and the creation of more effective marketing strategies. Previous efforts to automatically assess the helpfulness of online reviews targeted mainly more verbose and formally structured reviews, often exploiting textual features. However, tips are typically much shorter and contain more informal content. Thus, we here propose and evaluate new regression and classification based models to predict the future popularity of tips. Our experimental evaluation shows that a linear regression strategy, using features of both the user who posted the tip and the venue where it was posted as predictors, produces results that are at least as good as those produced by more sophisticated approaches and/or by using each set of features in isolation.

Categories and Subject Descriptors
H.3.5 [Online Information Services]: Web-based services; J.4 [Computer Applications]: Social and behavioral sciences

Keywords
Popularity Prediction, Content Filtering, Location-based Social Networks, Micro-reviews
1. INTRODUCTION
The success of online social networks and GPS (Global Positioning System) based applications has led to an abundance of data on personal interests and activities shared on the Web. Indeed, the so-called location-based social networks (LBSNs), such as Foursquare (http://www.foursquare.com) and Google+ (https://plus.google.com), as well as social networks that have incorporated location-based services (e.g., Facebook and Yelp), allow users to share not only where they are but also location-tagged content, including personal reviews about visited places.
Our present focus is on Foursquare, where users can post micro-reviews, or tips, with their opinions about registered venues. Tips can serve as feedback to help other users choose places to visit. For example, a tip recommending a hotel or disapproving of the service of a restaurant may guide others in their choices. Tips nourish the relationship between users and real businesses, being key features to attract future visitors, besides providing valuable feedback to business owners. Unlike check-ins, which are visible only to the user's friends, tips are visible to everyone, and thus may greatly impact online information sharing and business marketing.

SAC 2014, March 24-28, 2014, Gyeongju, Republic of Korea. Copyright 2014 ACM 978-1-4503-2469-4/14/03. http://dx.doi.org/10.1145/2554850.2554911
Users can also evaluate previously posted tips by "liking" them as a sign of agreement with their content. The number of likes can be seen as feedback from other users on the helpfulness, or popularity, of the tip. Thus, it is expected that both users and venue owners may have particular interest in retrieving tips that have already received, or have the potential of receiving, a large number of likes, as they reflect the opinion of more people and thus should be more valuable. However, finding and retrieving those tips may be a challenge given the presence of an increasing amount of spam content [18], unhelpful and possibly misleading information. Moreover, Foursquare displays the tips posted at a given venue sorted by the number of likes received (the only option for mobile users), which contributes to the rich-get-richer effect [13], as well-evaluated tips are more visible and may keep receiving more feedback, while recently posted tips may have difficulty being seen.
Thus, predicting, as early as possible, the future popularity of a tip may help determine its potential as valuable information, which enables the design of effective tip recommendation and filtering tools to better meet the needs of both users and venue owners. For example, venue owners can benefit from rapid feedback about more widespread opinions on their businesses and associated events (e.g., promotions) [11], while users, who often access LBSNs using small-screen mobile devices, might enjoy an interface that favors tips that are more likely to become popular. The identification of potentially popular tips may also drive the design of cost-effective marketing strategies for the LBSN environment, and help minimize the rich-get-richer effect, as those tips may receive greater visibility in the system.
Previous efforts towards predicting the popularity or helpfulness of online reviews focused mostly on textual features [11, 13, 15].
However, compared to reviews in other systems (e.g., TripAdvisor and Yelp), tips are typically much more concise (restricted to 200 characters), and may contain more subjective and informal content (e.g., "I love this airport", "Go Yankees"). Moreover, users, often on mobile devices, tend to be more direct and brief, avoiding many details about specific characteristics of the venue. Thus, some aspects exploited by existing solutions, especially those related to the textual content, may not be adequate for predicting the popularity of shorter texts such as tips.
Moreover, unlike other systems, Foursquare does not let users assign unhelpfulness signals to tips. Thus, the lack of a like does not imply that a tip was not helpful or interesting, which makes the prediction of tip popularity much harder. Foursquare also enables the analysis of aspects that are specific to online social networks, where properties of the social graph may impact content popularity [1]. Finally, the prediction of tip popularity is also inherently different from previous analyses of information diffusion on Twitter [10] and news sites [2], which exploited mainly aspects related to the user who posted the information. A tip is associated with a venue and tends to be less ephemeral. Thus its popularity may also be affected by features of the target venue.
In this context, we here introduce the problem of predicting the future popularity of Foursquare tips, estimating a tip's popularity by the number of likes it receives. More specifically, we analyze various aspects related to the user who posted the tip, the venue where it was posted and its content, and investigate the potential benefits of exploiting these aspects to estimate the popularity level of a tip in the future. To that end, we propose and evaluate several regression and classification models as well as various sets of user, venue and tip content features as predictor variables.
Our evaluation on a large dataset, with over 1.8 million users, 6 million tips and 5 million likes, indicates that a simpler multivariate linear regression strategy produces results that are at least as good as those obtained with the more sophisticated Support Vector Regression (SVR) algorithm and Support Vector Machine (SVM) classification algorithm [8]. Moreover, using sets of features related to both the tip's author and the venue where it was posted as predictors leads to more accurate predictions than exploiting features of only one of these entities, while adding features related to the tip's content leads to no further significant improvements.

2. RELATED WORK
The task of assessing the helpfulness or quality of a review is typically addressed by employing classification or regression-based methods [11, 13, 15]. For example, Kim et al. [11] used SVR to rank reviews according to their helpfulness, exploiting features such as the length and the unigrams of a review and the rating score given by the reviewers. O'Mahony et al. [15] proposed a classification-based approach to recommend TripAdvisor reviews, using features related to the user's reviewing history and the scores assigned to the hotels. Similarly, Liu et al. [13] proposed a non-linear regression model that incorporated the reviewer's expertise, the review's timeliness and its writing style for predicting the helpfulness of movie reviews. The authors also used their model as a classifier to retrieve only reviews having a predicted helpfulness higher than a threshold. These prior studies are based mostly on content features, which are suitable for more verbose and objective reviews. Such features may not be adequate for predicting the popularity of Foursquare tips, which tend to be more concise and subjective.
Other related studies focused on designing models for information diffusion in blogs [7] and Twitter [2, 10], often exploiting features of the social graph, like the number of friends and their probability of influence [1]. Textual features extracted from the messages (e.g., hashtags) or the topic of the message, and user features (e.g., number of followers), have also been used to predict content popularity in these systems [2, 10]. Unlike blog posts and tweets, tips are associated with specific venues, and tend to remain associated with them (and thus visible) for longer. Thus, a tip's popularity may be affected by features of the target venue.
Finally, we are not aware of any previous study of popularity prediction of tips or other micro-reviews. The only previous analysis of tips and likes on Foursquare was done in [18], where the authors uncovered different profiles related to how users exploit these features. In comparison with that work, we here focus on a different (though related) problem: the prediction of tip popularity.

3. POPULARITY PREDICTION MODELS
Our goal is to develop models to predict the popularity level of a tip at a given future time. We estimate the popularity of a tip by the number of likes received, and we perform predictions at the time the tip is posted. To that end, we categorize tips into various popularity levels, defined based on the number of likes received up to a certain point in time p after they were posted. As in [2, 10, 13], we choose to predict the popularity level instead of the number of likes because the latter is harder, particularly given the skewed distribution of the number of likes per tip [18], and the former should be good enough for most purposes. We consider p equal to 1 month and two popularity levels: low and high. Tips that receive at most 4 likes during the first month after being posted have low popularity, whereas tips that receive at least 5 likes in the same period have high popularity.
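As a minimal illustration (not the authors' code; the function name and threshold default are ours, with the threshold of 5 taken from the definition above), the two-level labeling rule can be sketched as:

```python
def popularity_level(num_likes, threshold=5):
    """Map a tip's like count in its first month to a popularity level.

    Tips with fewer than `threshold` likes are labeled 'low'; the rest
    are labeled 'high', following the paper's two-level scheme.
    """
    return "high" if num_likes >= threshold else "low"
```
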
We note that the availability of enough examples from each class for learning the model parameters impacts the definition of the popularity levels. We also experimented with other numbers of levels (e.g., three) and range definitions, finding, in general, similar results. Alternatively, we tried to group tips using various features (discussed below) and clustering techniques (e.g., k-means, x-means, spectral and density-based clustering). However, as previously observed for quality prediction in question answering services [12], the resulting clusters were not stable (i.e., results varied with the seeds used). This may be due to the lack of discriminative features specifically for the clustering process, which does not imply that the features cannot be useful for the prediction task, our main goal here.
As in [11, 13, 15], we exploit both regression and classification techniques in the design of our tip popularity prediction models. We experiment with two types of regression algorithms, which produce as output a real value that is rounded to the nearest integer representing a popularity level, as done for predicting the helpfulness of movie reviews in [13]. Our first approach is an ordinary least squares (OLS) multivariate linear regression model. It estimates the popularity level of a tip t at a given point in time p, R(t, p), as a linear function of k predictor variables x1, x2, ..., xk, i.e., R(t, p) = β0 + β1 x1 + β2 x2 + ... + βk xk. Model parameters β0, β1, ..., βk are determined by the minimization of the least squared errors [8] on the training data (see Section 5.1).
We also consider the more sophisticated Support Vector Regression (SVR) algorithm [8], a state-of-the-art method for regression learning. Unlike the OLS model, SVR does not consider errors that are within a certain distance of the true value (within the margin). It also allows the use of different kernel functions, which helps solve a larger set of problems, compared to linear regression.
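A single-predictor sketch of this approach (illustrative only: the paper's model uses k = 129 predictors and a general least-squares solver, but the principle is the same) fits the betas that minimize the squared error and then rounds the real-valued output to an integer level:

```python
def fit_simple_ols(xs, ys):
    """Ordinary least squares with one predictor: closed-form slope
    and intercept that minimize the sum of squared errors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
         / sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx
    return b0, b1

def predict_level(b0, b1, x):
    # Round the real-valued regression output to the nearest integer,
    # interpreted as a popularity level, as done in [13].
    return round(b0 + b1 * x)
```
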
We use both linear and radial basis function (RBF) kernels, available in the LIBSVM package [3], as the latter handles non-linear relationships. We also experiment with the Support Vector Machine (SVM) algorithm [8] to classify tips into popularity levels. The goal of SVM is to find an optimal separating hyperplane in the feature space that gives the largest minimum distance to the training examples. As with SVR, we use both linear and RBF kernels in our experiments.
We compare the accuracy of the OLS, SVR and SVM models against that of a simple (baseline) strategy that uses the median number of likes of tips previously posted by a user u as an estimate of the number of likes that a new tip posted by u will receive (and thus its popularity level).
Having presented our models, we now turn to the predictor variables x1, x2, ..., xk. These variables are features of the tip t whose popularity level we are trying to predict. We here consider k = 129 features related to the user who posted t, the venue where t was posted, and t's textual content. Our selected features are grouped into three sets:
User features: this set includes features related to the user's activities (e.g., number of tips, number of likes received/given) and her social network (e.g., number of friends/followers, percentage of likes coming from her social network, numbers of tips posted and likes given by her social network). We consider two scenarios when computing user features: (1) all tips posted by the user, and (2) only tips posted by the user at venues of the same category as the venue where t was posted. We refer to the latter as user/cat features.
Venue features: this set of features captures the activities at the venue where t was posted or its visibility. Examples are the number of tips posted at the venue, the number of likes received by those tips, the numbers of check-ins and unique visitors, and the venue category.
Tip content features: this set contains features, proposed in [4, 5, 16, 14], related to the structure, syntax, readability and sentiment of the tip's content. It includes the numbers of characters, words, and URLs or e-mail addresses in the tip, readability metrics, and features based on Part-Of-Speech tags, such as the percentages of nouns and adjectives. It also includes three scores (positive, negative and neutral) that capture the tip's sentiment. These scores are computed using SentiWordNet [6], an established sentiment lexicon for supporting opinion mining in English texts. In SentiWordNet, each term is associated with a numerical score in the range [0, 1] for positive, negative and objectivity (neutral) sentiment information. We compute the scores of a tip by taking the averages of the corresponding scores over all words in the tip that appear in SentiWordNet. To handle negation, we adapted a technique proposed in [17] that reverses the polarity of words between a negation word (no, didn't, etc.) and the next punctuation mark.
(A complete list of all features used is available at https://sites.google.com/site/sac2014sub/.)

Table 1: Overview of Selected Features

Type     Feature                  Min     Mean       Max      CV
User     # posted tips              1     3.72     5,791    3.25
         # likes received           0     3.13   208,619   63.40
         median # likes rec.        0     0.48     657.0    2.77
         mean # likes rec.          0     0.58    858.10    2.70
         social net (SN) size       0    44.79   318,890   15.95
         % likes from SN            0     0.71         1    0.43
Venue    # posted tips              1     2.13     2,419    2.23
         # likes received           0     1.80     7,103   11.60
         median # likes             0     0.45       390    2.81
         # check-ins                0   217.33   484,683    6.18
         # visitors                 0    87.35   167,125    5.87
Content  # words                    1    10.25        66    0.78
         # characters               1    59.78       200    0.75
         # URLs                     0     0.02         9    8.27

4. FEATURE ANALYSIS
In this section, we first describe the dataset used in our experiments (Section 4.1), and then briefly analyze some selected features (Section 4.2).

4.1 Dataset
We crawled Foursquare using the system API from August to October 2011, collecting data on more than 13 million users.
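The negation-handling step described above can be sketched as follows (a minimal illustration, not the authors' implementation; the negation word list and tokenization are assumptions): it marks the tokens whose SentiWordNet polarity should be reversed, namely those between a negation word and the next punctuation mark.

```python
import re

# Illustrative negation word list; the paper does not enumerate its list.
NEGATIONS = {"no", "not", "never", "didnt", "didn't", "dont", "don't"}

def negation_scope(tokens):
    """Return a flag per token: True if the token's polarity should be
    reversed because it lies between a negation word and the next
    punctuation mark (technique adapted from [17])."""
    flags, in_scope = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]", tok):
            in_scope = False          # punctuation closes the scope
            flags.append(False)
        elif tok.lower() in NEGATIONS:
            flags.append(False)       # the negation word itself is not flipped
            in_scope = True
        else:
            flags.append(in_scope)
    return flags
```
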
We believe that this represents a large fraction of the total user population at the time, which reportedly varied from 10 to 15 million between June and December 2011 (see https://foursquare.com/infographics/10million and http://www.socialmedianews.com.au/foursquare-reaches-15-million-users/). Our complete dataset contains almost 16 million venues and over 10 million tips. However, to avoid introducing biases towards very old or very recent tips, we restricted our analyses to tips and likes created between January 1st, 2010 and May 31st, 2011. This filter left us with over 6 million tips and almost 5.8 million likes, posted at slightly more than 3 million venues by more than 1.8 million users. We note that around 34% of the tips in our analyzed dataset received, during the considered period, at least one like.

4.2 Feature Characterization
Table 1 presents statistics of the selected features for users and venues with at least one tip in our dataset. The table shows, for each feature, the minimum, mean and maximum values and the coefficient of variation (CV), which is the ratio of the standard deviation to the mean. We note that most features have very large variations (large CVs), reflecting their heavy-tailed distributions (omitted due to space constraints), as observed in [18]. In particular, the CVs of the numbers of likes received by tips previously posted by the user and at the venue are very large. Indeed, most users posted very few tips and/or received few likes, while most tips and likes were posted by few users. For example, 46% and 48% of the users posted only one tip and received only one like, while only 1,499 and 2,318 users posted more than 100 tips and received more than 100 likes, respectively. These heavy-tailed distributions suggest that tips may experience the rich-get-richer effect. Moreover, the median number of likes per user is, on average, only 0.48. Thus, many users have this feature equal to 0. This will impact our prediction results, as discussed in Section 5.
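For reference, the coefficient of variation reported in Table 1 is simply the (population) standard deviation divided by the mean; a quick sketch:

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """CV = standard deviation / mean. Large CVs (as in Table 1) signal
    heavy-tailed feature distributions."""
    return pstdev(values) / mean(values)
```
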
We find that the Spearman correlation coefficient (ρ), a non-parametric measure of statistical dependence between two variables [19], computed over the numbers of tips and likes received by each user is moderate (ρ=0.54), and between the numbers of tips and likes given by each user is even lower (ρ=0.37). Thus, in general, users who tip more do not necessarily receive or give more feedback. We also find that the social network is quite sparse among users who post tips: a user has only 44 friends/followers, on average, while 37% of the users have at most 10 friends (followers), although the maximum reaches 318,890. Moreover, the fraction of likes coming from the user's social network tends to be reasonably large (70% on average). Thus, the user's social network does influence the popularity of tips.
Regarding venue features, there is also a heavy concentration of activities (check-ins, visitors, tips, and likes) on few venues. The maximum number of likes per venue exceeds 7,000, but is only 1.8 on average. The correlation between either the number of check-ins or the number of unique visitors and the number of tips of the venue is moderate (around 0.52). Thus, to some extent, popular venues do tend to attract more tips, although this is not a strong trend. The correlation between the number of tips and the total number of likes per venue is also moderate (ρ=0.5). Thus, in general, tipping tends to be somewhat effective in attracting visibility (likes) to a venue. The same correlation exists between the total number of likes and the number of check-ins (ρ=0.5), but we cannot infer any causality relationship as temporal information is not available.
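Spearman's ρ, used throughout this section, is the Pearson correlation of the rank-transformed values; a self-contained sketch (in practice a library routine such as scipy.stats.spearmanr would be used):

```python
def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks,
    with ties receiving their average rank."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vs[order[j + 1]] == vs[order[i]]:
                j += 1                      # extend the run of tied values
            avg = (i + j) / 2 + 1           # average rank for the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```
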
However, the correlation between the median number of likes and the number of check-ins is weaker (ρ=0.3), implying that not all tips posted at the same venue receive comparable numbers of likes. This is probably due to the various levels of interestingness of those tips, but might also reflect the rich-get-richer effect. Finally, regarding content features, we find that most tips are very short, with, on average, around 60 characters and 10 words, and at most 200 characters (the limit imposed by the application) and 66 words. Moreover, the vast majority (98%) of the tips carry no URL or e-mail address, and we find no correlation between the size of the tip and the number of likes it receives (ρ<0.07), as one might expect.

5. PREDICTION MODEL EVALUATION
We now evaluate the models proposed to predict, at the time a tip t is posted, the popularity level t will achieve 1 month later. We discuss our experimental setup (Section 5.1), our main results (Section 5.2), and the relative importance of the features for prediction accuracy (Section 5.3).

5.1 Experimental Setup
Our methodology consists of dividing the available data into training and test sets, learning model parameters using the training set, and evaluating the learned model on the test set. We split the tips chronologically into training and test sets, rather than randomly, to avoid including in the training set tips that were posted after tips for which predictions will be performed. We build 5 runs, thus producing 5 results, by sliding the window containing the training and test sets.
To learn the models, we consider the most recent tip posted by each user in the training set as a candidate for popularity prediction, and use information related to the other tips to compute the features used as predictor variables.
All tips in the test set are taken as candidates for prediction, and their feature values are computed using all information (e.g., tips, likes) available until the moment when the tip was posted (i.e., when the prediction is performed). For each candidate in both training and test sets, we compute the feature values by first applying a logarithmic transformation to the raw numbers to reduce their large variability, normalizing the results between 0 and 1. Moreover, in order to have enough historical data about users who posted tips, we consider only users who posted at least 5 tips. To control for the age of the tip, we only consider tips that were posted at least 1 month before the end of the training/test sets. We also focus on tips written in English, since the tools used to compute some textual features are available only for that language. After applying these filters, we ended up with 707,254 tips that are candidates for prediction.
The distribution of candidate tips across the popularity levels is very skewed, with 99.5% of the tips having low popularity. Such imbalance, particularly in the training set, poses great challenges to regression/classification accuracy. Indeed, initial results from using all available data were very poor. A widely used strategy to cope with the effects of class imbalance in the training data is under-sampling [9]. Suppose there are n tips in the smallest category in the training set. We produce a balanced training set by randomly selecting equal-sized samples from each category, each with n tips. Note that under-sampling is performed only on the training set. The test set remains unchanged. Because of the random selection of tips for under-sampling, we perform this operation 5 times for each sliding window, thus producing 25 different results in total. The results reported in the next sections are averages of these 25 results, along with corresponding 95% confidence intervals.
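The under-sampling step can be sketched as follows (an illustrative implementation, not the authors' code; the function signature is ours): sample n examples per class, where n is the size of the smallest class, and leave the test set untouched.

```python
import random

def undersample(examples, label_of, seed=0):
    """Balance a training set by drawing, per class, a random sample of
    size equal to the smallest class. Applied to the training set only."""
    rng = random.Random(seed)
    by_class = {}
    for ex in examples:
        by_class.setdefault(label_of(ex), []).append(ex)
    n = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n))
    return balanced
```

Repeating the call with different seeds mirrors the paper's five under-sampling repetitions per sliding window.
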
OLS model parameters are defined by minimizing the least squared errors of predictions for the candidate tips in the training set. This can be done as their popularity level (i.e., number of likes one month after posting) is known. Best values for the SVM and SVR parameters are defined in a similar manner, using cross-validation within the training set.
We assess the accuracy of our prediction models using recall, precision, macro-averaged precision, and macro-averaged recall. The precision of class i is the ratio of the number of tips correctly assigned to i to the total number of tips predicted as i, whereas the recall is the ratio of the number of tips correctly classified to the number of tips in class i. The macro-averaged precision (recall) is the average precision (recall) computed across classes.

5.2 Prediction Results
[Figure 1: Effectiveness of Alternative Prediction Models and Sets of Predictor Variables. (a) Macro-Averaged Precision; (b) Macro-Averaged Recall; (c) Recall for High Popularity Tips]
Figures 1-a) and 1-b) show macro-averaged precision and recall, along with 95% confidence intervals, for 30 strategies for predicting a tip's popularity level. These strategies emerge from the combination of five prediction algorithms with alternative sets of predictor variables. In particular, for the OLS, SVR (linear and RBF kernels) and SVM algorithms (for SVM we show only results for the linear kernel, as they are similar for the RBF kernel), we consider the following sets of predictors: only user features, only venue features, only content features, all venue and user features, and all user, venue and content features. We also consider only user features restricted to the category of the venue where the tip was posted (user/cat features). For predictions using the median number of likes of the user, here referred to as the median strategy, we compute this number over all tips of the user and only over the tips posted at venues of the same (target) category (the figures show results of the median prediction model only for the user and user/cat sets of predictor variables, since they are the same for the other sets).
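The macro-averaged metrics defined above can be sketched directly from their definitions (illustrative code; the function name is ours): per-class precision and recall, averaged with equal weight per class regardless of class size.

```python
def macro_precision_recall(true, pred):
    """Macro-averaged precision and recall: compute each class's
    precision/recall, then average across classes with equal weight."""
    classes = sorted(set(true) | set(pred))
    precs, recs = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        predicted = sum(p == c for p in pred)   # tips predicted as class c
        actual = sum(t == c for t in true)      # tips actually in class c
        precs.append(tp / predicted if predicted else 0.0)
        recs.append(tp / actual if actual else 0.0)
    return sum(precs) / len(precs), sum(recs) / len(recs)
```

Because the macro average weights classes equally, a rare class (here, high popularity) with low precision drags the macro precision down, which is exactly the effect discussed for Figures 1-a,b).
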
The significance of the results was tested using two statistical approaches (one-way ANOVA and Kruskal-Wallis [19]) with 95% confidence.
Figure 1-a) shows that there is little difference in macro-averaged precision across the prediction algorithms, except for SVR with a linear kernel using only content features, which performs worse. The same degradation of SVR is observed in terms of macro-averaged recall (Figure 1-b). For that metric, the superiority of SVM, SVR and OLS over the simpler median strategy is clear. For any of those strategies, the best macro-averaged recall was obtained using user and venue features, with gains of up to 64% over the other sets of predictors. Comparing the different techniques using user and venue features, we see small gains in macro-averaged recall for SVM and OLS over SVR (up to 1.15% and 0.84%, respectively), while OLS and SVM produce statistically tied results. However, we note that OLS is much simpler than both SVM and SVR, as will be discussed below.
We note that recall results are higher than precision results in Figures 1-a,b). This is because the precision for the smaller class (high popularity) is very low even with a high recognition rate (large number of true positives). Recall that under-sampling is applied only to the training set, and thus the class distribution in the test set is still very unbalanced and dominated by the larger class (low popularity). Thus, even small false negative rates for the larger class result in very low precision for the other (smaller) class. For that reason, towards a deeper analysis of the prediction strategies, we focus primarily on the recall metric and discuss separate results for each class, with special interest in the high popularity class.
The focus on recall is particularly interesting for tip filtering and recommendation tools, where users (regular users or venue/business owners) may want to retrieve most of the potentially popular tips at the possible expense of some noise (tips in the low popularity class).
Figure 1-c) shows the average recall of tips for the high popularity level. Note that, for any prediction model, there are no significant differences in the recall of high popularity tips across the various sets of predictors, provided that user features are included. Moreover, the use of venue features jointly with user features improves the precision of the high popularity class (omitted due to space constraints) by up to 46% (35% for the OLS algorithm). That is, adding venue features reduces the amount of noise when trying to retrieve potentially popular tips, with no significant impact on the recall of that class, and is thus preferable over exploiting only user features. Note also that including content features as predictors leads to no further (statistically significant) gains.
Comparing OLS, SVR and SVM with user and venue features as predictors, we find only limited gains (if any) of the more sophisticated SVR and SVM algorithms over the simpler OLS strategy. In particular, SVR (with RBF kernel) leads to a small improvement (2% on average) in the recall of the high popularity level, being statistically tied with SVM. However, in terms of precision for that class, OLS outperforms SVR by 13.5% on average, being statistically tied with SVM.
We observe that SVR tends to overestimate the popularity of a tip, thus slightly improving the true positives of the high popularity class (and thus its recall) but also increasing the false negatives of the low popularity class, which ultimately introduces a lot of noise into the predictions for the high popularity class (hurting its precision). Note also that the limited gains in recall of SVR come at the expense of a 30 times longer model learning process, for a fixed training set. Thus, the simpler OLS model produces results that, from a practical perspective, are very competitive with (if not better than) those obtained with SVM and SVR.

5.3 Feature Importance
We now analyze the relative importance of each feature for prediction accuracy by using the Information Gain feature selection technique [5] to rank features according to their discriminative capacity for the prediction task. Due to space constraints, we focus on the most discriminative features.
The three most important features are related to the tip's author. They are the average, total and standard deviation of the number of likes received by tips posted by the user. Thus, the feedback received on previous tips of the user is the most important factor for predicting the popularity level of her future tips. Figure 2a, which shows the complementary cumulative distributions of the best of these features (average number of likes) for tips in each class, clearly indicates that it is very discriminative of tips with different (future) popularity levels. Similar gaps exist between the distributions of the other two aforementioned features.
Features related to the social network of the tip's author are also important. The number of friends/followers of the author and the total number of likes given by them occupy the 4th and 9th positions of the ranking, respectively. Moreover, we find that authors of tips that achieve high popularity tend to have more friends/followers.
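The Information Gain ranking can be sketched from its definition (an illustrative implementation; it assumes discretized feature values, which is standard for this criterion): the gain of a feature is the entropy of the class labels minus the conditional entropy of the labels given the feature's value.

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((labels.count(l) / n) * log2(labels.count(l) / n)
                for l in set(labels))

def information_gain(feature_values, labels):
    """IG(feature) = H(labels) - sum_v P(v) * H(labels | feature = v).
    Features are then ranked by decreasing IG."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond
```

A feature that perfectly separates the classes attains the full label entropy as its gain, while an uninformative feature scores zero.
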
The best venue feature, which occupies the 6th position of the ranking, is the number of unique visitors (Figure 2b). Moreover, the total number of check-ins (7th position), the venue category (8th position) and the total number of likes received by tips posted at the venue (11th position) are also very discriminative, appearing above other user features, such as the user's number of tips (19th position), in the ranking.
Figure 2: Important Features for Tip Popularity. (a) Best user feature; (b) Best venue feature.
Figure 3: Recall for tips in the high popularity level when removing one feature at a time.
Finally, we extend our evaluation of the importance of different sets of features by assessing the accuracy of the OLS strategy as we remove one feature at a time, in increasing order of importance given by the Information Gain. Figure 3 shows the impact on the average recall of the high popularity class as each feature is removed, starting with the complete set of user, venue and content features. For example, the second bar shows results after removing the least discriminative feature (number of common misspellings per tip). Note that removing many of the least discriminative features has no significant impact on recall, indicating that these features are redundant. Accuracy loss is observed only after we start removing features in the top-10 positions. Among those, the largest losses occur when the number of check-ins at the venue and the size of the user's social network are removed, which reinforces the importance of venue and social network features for the prediction task. Similar results are also obtained for macro-averaged recall (omitted). In sum, using the top-10 most important features produces predictions that are as accurate as those obtained with the complete set of features. We also tested whether multicollinearity exists among the different predictors, which could affect the accuracy of the OLS model.
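A simple first check for multicollinearity is the pairwise Pearson correlation between predictors. The sketch below is self-contained Python with hypothetical toy values (e.g. two venue features that tend to grow together); it is an illustration of the check, not the paper's actual data or procedure.

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation coefficient between two equal-length sequences
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy values: venue's total likes vs. total check-ins (strongly correlated)
total_likes = [1.0, 2.0, 3.0, 4.0, 5.0]
total_checkins = [2.0, 4.1, 5.9, 8.2, 10.0]
print(round(pearson(total_likes, total_checkins), 3))  # → 0.999
```

Strongly correlated predictor pairs like this one are candidates for removal; as reported next, dropping them did not change accuracy in this study.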
We found that, despite the strong correlations between some pairs of predictors, removing some of these variables from the model does not improve accuracy.
6. CONCLUSIONS AND FUTURE WORK
We have tackled the problem of predicting the future popularity of micro-reviews (or tips) in Foursquare. To that end, we proposed and evaluated classification- and regression-based strategies exploring different algorithms and various sets of features as predictors. Our experimental results showed that the simpler OLS algorithm produces results that are statistically as good as (if not better than) those obtained with the more sophisticated SVM and SVR methods, particularly for tips in the high popularity level but also when considering average performance across both popularity levels analyzed. This work is a first effort to identify the most relevant aspects for predicting the popularity of Foursquare tips. This is a challenging problem which, in comparison with previous related efforts, has unique aspects and an inherently different nature, and may depend on a non-trivial combination of various user, venue and content features. Nevertheless, the knowledge derived from such an effort may bring valuable insights into the design of more effective automatic tip filtering and ranking strategies.
7. ACKNOWLEDGMENTS
This research is partially funded by the Brazilian National Institute of Science and Technology for the Web (MCT/CNPq/INCT grant number 573871/2008-6), CNPq, CAPES and FAPEMIG.
8. REFERENCES
[1] E. Bakshy, I. Rosenn, C. Marlow, and L. Adamic. The Role of Social Networks in Information Diffusion. In WWW, 2012.
[2] R. Bandari, S. Asur, and B. Huberman. The Pulse of News in Social Media: Forecasting Popularity. In ICWSM, 2012.
[3] C. Chang and C. Lin. LIBSVM: A Library for Support Vector Machines, 2001.
[4] B. Chen, J. Guo, B. Tseng, and J. Yang. User Reputation in a Comment Rating Environment. In KDD, 2011.
[5] D. Dalip, M. Gonçalves, M. Cristo, and P. Calado.
Automatic Assessment of Document Quality in Web Collaborative Digital Libraries. JDIQ, 2(3), 2011.
[6] A. Esuli and F. Sebastiani. SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining. In LREC, 2006.
[7] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins. Information Diffusion Through Blogspace. In WWW, 2004.
[8] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2008.
[9] H. He and E. Garcia. Learning from Imbalanced Data. IEEE TKDE, 21(9), 2009.
[10] L. Hong, O. Dan, and B. Davison. Predicting Popular Messages in Twitter. In WWW, 2011.
[11] S.-M. Kim, P. Pantel, T. Chklovski, and M. Pennacchiotti. Automatically Assessing Review Helpfulness. In EMNLP, 2006.
[12] B. Li, T. Jin, M. Lyu, I. King, and B. Mak. Analyzing and Predicting Question Quality in Community Question Answering Services. In WWW, 2012.
[13] Y. Liu, X. Huang, A. An, and X. Yu. Modeling and Predicting the Helpfulness of Online Reviews. In ICDM, 2008.
[14] E. Momeni and G. Sageder. An Empirical Analysis of Characteristics of Useful Comments in Social Media. In WebSci, 2013.
[15] M. O'Mahony and B. Smyth. Learning to Recommend Helpful Hotel Reviews. In RecSys, 2009.
[16] M. O'Mahony and B. Smyth. Using Readability Tests to Predict Helpful Product Reviews. In RIAO, 2010.
[17] B. Pang, L. Lee, and S. Vaithyanathan. Thumbs Up? Sentiment Classification Using Machine Learning Techniques. In EMNLP, 2002.
[18] M. Vasconcelos, S. Ricci, J. Almeida, F. Benevenuto, and V. Almeida. Tips, Dones and Todos: Uncovering User Profiles in Foursquare. In WSDM, 2012.
[19] D. Zwillinger and S. Kokoska. CRC Standard Probability and Statistics Tables and Formulae. Chapman & Hall, 2000.