
A Study on Reducing Supervision in Splog Detection by SVM

Takehito Utsuro, Taichi Katayama, Yuuki Sato


University of Tsukuba, Tsukuba 305-8573, JAPAN

Takayuki Yoshinaka
Tokyo Denki University, Tokyo 101-8457, JAPAN

Yasuhide Kawada
Navix Co., Ltd., Tokyo 141-0031, JAPAN

Tomohiro Fukuhara
University of Tokyo, Kashiwa 277-8568, JAPAN

ABSTRACT

This paper studies how to reduce the amount of human supervision for identifying splogs / authentic blogs in the context of continuously updating splog data sets year by year. We especially discuss the case where authentic blogs are dominant in the input blog homepage data (e.g., 90% of input data are authentic blog homepages). In such a situation, in order to avoid performance damage of updated classifiers, it is quite necessary to incorporate a certain selective sampling technique which outputs samples under a less imbalanced distribution. To tackle this problem, we introduce a confidence measure of Support Vector Machines (SVMs) to the task of separating highly confident samples and less confident samples in terms of identifying splogs / authentic blogs.

Categories and Subject Descriptors

H.3.0 [INFORMATION STORAGE AND RETRIEVAL]: General

General Terms

Reliability

Keywords

spam blog detection, SVM, confidence measure, splog changes

1. INTRODUCTION

Weblogs or blogs are considered to be one of personal journals, market or product commentaries. While traditional search engines continue to discover and index blogs, the blogosphere has produced custom blog search and analysis engines, systems that employ specialized information retrieval techniques. There are several previous works and services on blog analysis systems. [11] proposed a system called blogWatcher that collects and analyzes Japanese blog articles. [2] proposed a system called BlogPulse that analyzes trends of blog articles. With respect to blog analysis services on the Internet, there are several commercial and non-commercial services such as Technorati^1, BlogPulse^2, kizasi.jp^3, and blogWatcher^4. With respect to multilingual blog services, Globe of Blogs^5 provides a retrieval function of blog articles across languages. Best Blogs in Asia Directory^6 also provides a retrieval function for Asian language blogs. Blogwise^7 also analyzes multilingual blog articles.

As with most Internet-enabled applications, the ease of content creation and distribution makes the blogosphere spam prone [3, 1, 6, 9, 5]. Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting advertisements or raising the PageRank of target sites. [6] reported that for English blogs, around 88% of all pinging URLs (i.e., blog homepages) are splogs, which account for about 75% of all pings. Based on this estimation, as stated in [1, 8], splogs can cause problems including the degradation of information retrieval quality and the significant waste of network and storage resources. Several previous works [6, 9, 5] reported important characteristics of splogs. [9] reported characteristics of ping time series, in-degree/out-degree distributions, and typical words in splogs found in the TREC^8 Blog06 data collection. [6, 5] also reported the results of analyzing splogs in the BlogPulse data set. In the context of semi-automatically collecting web spam including splogs, [16] discuss how to collect spammer-targeted keywords to be used when collecting a large number of web spam pages efficiently. [12] also analyzes (Japanese) splogs based on various characteristics of keywords contained in them.

Along with those analyses of splogs reported in previous works, several splog detection techniques (e.g., [10, 4, 8]) have been proposed. [4] studied features for splog detection such as words, URLs, anchor texts, links, and HTML meta tags in supervised learning by SVMs.


^1 http://technorati.com/
^2 http://www.blogpulse.com/
^3 http://kizasi.jp/ (in Japanese)
^4 http://blogwatcher.pi.titech.ac.jp/ (in Japanese)
^5 http://www.globeofblogs.com/
^6 http://www.misohoni.com/bba/
^7 http://www.blogwise.com/
^8 http://trec.nist.gov/

Table 1: Statistics of Splogs / Authentic Blogs Data Sets

(a) Statistics of Total Data Sets Available
Data Sets         # of splogs   # of authentic blogs   total
Years 2007-2008   768           3318                   4086
Years 2008-2009   696           1122                   1818
total             1464          4440                   5904

(b) Statistics of Virtual Data Sets for Figure 1
Data Sets         # of splogs                   # of authentic blogs          total
Years 2007-2008   40×18=720                     40×18=720                     1440
Years 2008-2009   40×18=720 (40×10=400 used)    40×18=720 (40×10=400 used)    1440 (800 used)

(c) Statistics of Virtual Data Sets for Figure 2
Data Sets         # of splogs                   # of authentic blogs          total
Years 2007-2008   60×18=1080 (60×9=540 used)    60×18=1080 (60×9=540 used)    2160 (1080 used)
Years 2008-2009   60×18=1080 (60×10=600 used)   60×18=1080                    2160 (1680 used)

As features of SVMs, [8] studied temporal self-similarities of splogs such as posting times, post contents, and affiliated links. [10] also studied detecting link spam in splogs by comparing the language models among the blog post, the comment, and the pages linked by the comments.

However, splogs may change year by year. This is partially because the text content of splogs is mostly excerpted from other sources such as news articles, blog articles (posts), advertisement pages, and other web texts. Sources of splog contents such as those above may change day by day, and thus, splog contents excerpted from those sources may also change. Furthermore, a certain percentage of splogs may be created automatically, where their html structures are automatically generated and their text contents are excerpted from other sources. Such automatic procedures may also change year by year, and hence, it is quite reasonable to suppose that certain characteristics of generated splogs may change year by year. As we show in section 5.2, this is indirectly supported by the fact that the performance of applying an SVM model for splog detection trained with a Japanese splog data set developed in the years 2007-2008 [12] against another Japanese splog data set developed in the years 2008-2009 is relatively damaged compared with that of applying the same model against a held-out data set from the years 2007-2008. Therefore, in order to overcome such a difficulty regarding splog detection, it is quite necessary to design a framework for continuously updating splog data sets year by year. In such a framework, one of the most important issues is how to reduce human supervision in continuously updating splog data sets.

Here, this paper discusses another important issue, namely, the splogs / authentic blogs distribution in training/evaluation data sets, which has not been studied in previous works on splog detection. As presented in [12], in situations such as marketing based on analyzing blog posts including a certain keyword, the splog / authentic blog distribution in the collected blog posts may drastically vary from keyword to keyword. Therefore, in a practical situation where technologies for splog / authentic blog distinction are really required, the framework of updating splog data sets must stably keep human supervision to a minimum against any distribution of splogs / authentic blogs in the input data.

Based on this underlying motivation, this paper studies how to reduce the amount of human supervision for identifying splogs / authentic blogs within given input blog homepage data. As we present in sections 5.3 and 5.4, we especially discuss the case where authentic blogs are dominant in the input blog homepage data (e.g., 90% of input data are authentic blog homepages). In such a case, if we update an existing splog / authentic blog data set by simply annotating each of the input blog homepages with the splog / authentic blog distinction manually, authentic blogs are certainly dominant also in the updated data set. In the experimental evaluation of this paper, Support Vector Machines (SVMs) [15] tend to have their performance damaged when trained with data sets updated with such imbalanced input data. We further experimentally show that an SVM classifier trained with data sets updated with less imbalanced input data performs better. Finally, we conclude that a reduction of human supervision can actually be achieved by incorporating a certain selective sampling technique which outputs samples under a less imbalanced distribution. In order to tackle this problem, in this paper, we examine a fundamental technique of a confidence measure of Support Vector Machines (SVMs) [15], which can separate instances confidently expected to be splogs or authentic blogs from those with less confidence. We also show that, as such a confidence measure, the distance from the separating hyperplane to each test instance [14] performs well in outputting splogs / authentic blogs under a less imbalanced distribution.

2. SPLOGS / AUTHENTIC BLOGS DATA SETS


In this paper, we examine changes in splogs over time with two data sets of Japanese splog / authentic blog homepages, where one set is developed in the years 2007-2008 (from September 2007 to February 2008) [12], and the other set is developed in the years 2008-2009 (from December 2008 to January 2009). As shown in Table 1, in both data sets, blog homepages are collected according to certain procedures, and then splog / authentic blog judgement is manually annotated. Here, roughly speaking, the splog / authentic blog judgement is based on the criterion below, while a detailed discussion of this criterion is given in [12]:

1. If one of the following holds for the given homepage, then it is mostly^9 a splog:
(a) The feature "originally written text" does not hold.

(b) The feature "originally written text" holds, and at least one of the features "links to affiliated sites", "advertisement articles (posts)", or "articles (posts) with adult content" holds.

2. Otherwise, the given homepage is an authentic blog.

^9 By "mostly", we mean that it is usually necessary to judge by considering the contents of each blog.

The reason why the splog / authentic blog distributions differ between the two data sets is that the candidate blog homepages to which the splog / authentic blog distinction is annotated are collected according to different procedures. Since it was our first experience of developing a splog data set, for the one developed in the years 2007-2008, we examined characteristics of splogs in advance, and we collected candidate blog homepages which satisfy the requirements below:

i) We collect various keywords, including those which are supposed to be frequently included in splogs. Then, for each keyword, we simply collect blog homepages which contain the keyword.

ii) In our observation, the rate of splogs among the blog homepages that contain a given keyword may be higher on the burst date than on other dates. Based on this tendency, we collect blog homepages containing a given keyword on the date of its most frequent occurrence.

iii) Out of the blog homepages collected for a given keyword on the date of ii), we prefer blog homepages with more posts per day over those with fewer posts per day.

When developing the other data set in the years 2008-2009, on the other hand, we collect candidate blog homepages more efficiently. According to our analysis [12] as well as further observation, splog homepages tend to share outlinks to affiliated sites. Based on this tendency, in the years 2008-2009, we collect splog / authentic blog homepages which share outlinks with splog homepages included in the data set developed in the years 2007-2008^10. More specifically, first, the URLs of the outlinks are extracted from the html files of the splog / authentic blog homepages in the data set developed in the years 2007-2008. Then, the set of blacklist URLs is constructed by collecting URLs which satisfy both of the following two requirements:

i) The URL is not included in the html files of any of the training instances of authentic blog homepages.

ii) The URL is included in the html files of the training instances of splog homepages, and its total frequency in the whole training splog homepages is more than one.

^10 One may argue that candidate blog homepages sharing outlinks with splogs taken from the data set of the years 2007-2008 could be biased toward those similar to the ones included in the data set of the years 2007-2008. However, as we show in section 5.2, the SVM model trained with the data set of the years 2007-2008 performs worse against splogs / authentic blogs of the years 2008-2009 than against those of the years 2007-2008. Thus, at present, we conclude that our data set of the years 2008-2009 is worthy of further analysis of changes in splogs over time and of being examined through splog detection research activities.

About 5,000 blacklist URLs are collected, and finally, for each blacklist URL, candidate blog homepages which include an outlink to it are collected. The data set shown in Table 1 is developed from a part of all the collected candidate blog homepages.
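The following minimal Python sketch illustrates this blacklist construction, assuming the outlink URLs of each homepage have already been extracted from its html files into per-homepage lists; all function and variable names are illustrative only.

    from collections import Counter

    def build_blacklist(splog_outlinks, authentic_outlinks):
        """Blacklist URLs: absent from every authentic blog homepage and occurring
        more than once in total across the splog homepages."""
        # splog_outlinks / authentic_outlinks: one list of outlink URLs per homepage,
        # extracted in advance from its html files.
        splog_freq = Counter(url for page in splog_outlinks for url in page)
        authentic_urls = {url for page in authentic_outlinks for url in page}
        return {url for url, freq in splog_freq.items()
                if freq > 1 and url not in authentic_urls}

Candidate blog homepages for the years 2008-2009 are then those whose html files contain an outlink to some URL in the returned set.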

3. FEATURES FOR SPLOG DETECTION


This section describes the features of SVMs for splog detection. All of those features are evaluated through splog detection performance, and in the evaluation of section 5, only about half of them are manually selected, since the set of those selected features performs best compared with other combinations out of all the features.

3.1 Blacklist/Whitelist URLs


Given the training instances of splog / authentic blog homepages, the URLs of the outlinks are extracted from their html files. Then, the set of whitelist URLs is constructed by collecting URLs which satisfy both of the following two requirements: i) The URL is not included in the html files of any of the training instances of splog homepages. ii) The URL is included in the html files of the training instances of authentic blog homepages, and its total frequency in the whole training authentic blog homepages is more than one. From the Japanese splog data set developed in the years 2007-2008 [12] (section 2), about 13,000 whitelist URLs are collected. Next, given a whitelist URL u, the following weight is calculated and used as the value of the whitelist URLs feature:

    Σ_u log(total frequency of u in the whole training instances of authentic blog homepages) × (total frequency of u in the test instance)

The set of blacklist URLs is also constructed according to a similar procedure (as described in section 2). About 5,000 blacklist URLs are collected and evaluated as a feature for splog detection. However, the blacklist URLs feature does not contribute to improving the performance of the best combination of features.
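The whitelist construction and the weight above can be read as the following minimal Python sketch, assuming per-homepage outlink URL lists as in section 2 and a natural logarithm (the base of the logarithm is not specified here); names are illustrative only.

    import math
    from collections import Counter

    def build_whitelist(splog_outlinks, authentic_outlinks):
        # Mirror image of the blacklist: URLs absent from every splog homepage and
        # occurring more than once in total across the authentic blog homepages.
        authentic_freq = Counter(url for page in authentic_outlinks for url in page)
        splog_urls = {url for page in splog_outlinks for url in page}
        return {url: freq for url, freq in authentic_freq.items()
                if freq > 1 and url not in splog_urls}

    def whitelist_feature(test_outlinks, whitelist_freq):
        # Sum over whitelist URLs u of
        #   log(frequency of u in the training authentic blog homepages)
        #   * (frequency of u in the test instance)
        test_freq = Counter(test_outlinks)
        return sum(math.log(whitelist_freq[u]) * test_freq[u]
                   for u in test_freq if u in whitelist_freq)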

3.2 Noun Phrases


As previously reported in [16, 12], splogs and authentic blogs tend to have word distributions different from each other, and certain types of words may appear in splogs more often than in authentic blogs. In this paper, we introduce a feature for observing occurrences of noun phrases, so that such a difference can be detected. First, the body texts of splog / authentic blog posts are extracted and morphologically analyzed^11, and noun phrases are extracted from them. Then, the set of splog noun phrases is constructed by collecting noun phrases which satisfy the following requirement.
^11 The Japanese morphological analyzer ChaSen (http://chasen-legacy.sourceforge.jp/) and the lexicon ipadic are used.

Given a noun phrase w,

    P(w) = (the prob. of w in the whole training instances of splog homepages) / (the prob. of w in the whole training instances of authentic blog homepages) > 0

From the Japanese splog data set developed in the years 2007-2008, about 130,000 splog noun phrases are collected. Next, the following weight is calculated and used as the value of the splog noun phrase feature:

    Σ_w log P(w) × (total frequency of w in the test instance)

The set of authentic blog noun phrases is also constructed according to a similar procedure and is evaluated as a feature for splog detection. However, the authentic blog noun phrase feature does not contribute to improving the performance of the best combination of features.
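A minimal Python sketch of the splog noun phrase feature follows, assuming P(w) is the ratio of relative frequencies defined above, a natural logarithm, and precomputed noun phrase frequency counts (morphological analysis and noun phrase extraction are abstracted away); skipping noun phrases unseen in authentic blog homepages is an added assumption, and all names are illustrative.

    import math

    def splog_noun_phrases(splog_np_counts, authentic_np_counts):
        # splog_np_counts / authentic_np_counts: noun phrase frequency dictionaries
        # over the whole training splog / authentic blog homepages.
        n_splog = sum(splog_np_counts.values())
        n_authentic = sum(authentic_np_counts.values())
        ratios = {}
        for w, f in splog_np_counts.items():
            f_auth = authentic_np_counts.get(w, 0)
            if f_auth == 0:
                continue              # assumption: skip noun phrases unseen in authentic blogs
            p_w = (f / n_splog) / (f_auth / n_authentic)
            if p_w > 0:               # the requirement P(w) > 0 stated above
                ratios[w] = p_w
        return ratios

    def splog_np_feature(test_np_counts, ratios):
        # Sum over splog noun phrases w of log P(w) * (frequency of w in the test instance)
        return sum(math.log(ratios[w]) * c
                   for w, c in test_np_counts.items() if w in ratios)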

3.3 Noun Phrases in Anchor Texts and Linked URLs

Another type of feature, which is more specific than the blacklist/whitelist features and the noun phrase features but more effective in splog detection, is that of (loose) tuples of noun phrases in anchor texts and their linked URLs. In order to introduce this feature, given a noun phrase w and a splog / authentic blog homepage s, first we define the frequencies AncfB(w, s) and AncfW(w, s) of the noun phrase w in s:

    AncfB(w, s) = # of times of w in s s.t. w is included in an anchor text of an outlink to a blacklist URL or to a post of a splog homepage included in the training data set.

    AncfW(w, s) = # of times of w in s s.t. w is included in an anchor text of an outlink to a whitelist URL or to a post of an authentic blog homepage included in the training data set.

Then, the set of splog anchor text noun phrases outlinked to blacklist URLs, i.e., those whose total frequency Σ_s AncfB(w, s) throughout the whole training splog homepages s is more than one, is constructed, where from the Japanese splog data set developed in the years 2007-2008, about 2,000 anchor text noun phrases are collected. Next, given an anchor text noun phrase w, the following weight is calculated and used as the value of a feature named "anchor text noun phrase outlinked to blacklist URLs":

    Σ_w log( Σ_{training splog homepage s} AncfB(w, s) ) × (total freq. of w in the test instance)

In a similar procedure, the set of splog anchor text noun phrases outlinked to whitelist URLs, i.e., those whose total frequency Σ_s AncfW(w, s) throughout the whole training splog homepages s is more than one, is constructed, where from the Japanese splog data set developed in the years 2007-2008, about 320 anchor text noun phrases are collected. Next, given an anchor text noun phrase w, the following weight is calculated and used as the value of a feature named "anchor text noun phrase outlinked to whitelist URLs":

    Σ_w log( Σ_{training splog homepage s} AncfW(w, s) ) × (total freq. of w in the test instance)
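Both anchor text features follow the same counting-and-weighting pattern, illustrated by the minimal Python sketch below; the computation of AncfB(w, s) or AncfW(w, s) from anchor texts and link targets is taken as given, the natural logarithm is an assumption, and all names are illustrative.

    import math
    from collections import Counter

    def anchor_feature(ancf_per_splog_homepage, test_np_counts):
        # ancf_per_splog_homepage: one Counter per training splog homepage s, holding
        # AncfB(w, s) (or AncfW(w, s)) for each noun phrase w found in s.
        total = Counter()
        for ancf in ancf_per_splog_homepage:
            total.update(ancf)
        # Set construction: keep noun phrases whose total frequency over all training
        # splog homepages is more than one.
        selected = {w: f for w, f in total.items() if f > 1}
        # Feature value: sum over selected w of
        #   log(sum over s of Ancf(w, s)) * (frequency of w in the test instance)
        return sum(math.log(selected[w]) * c
                   for w, c in test_np_counts.items() if w in selected)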

3.4 Link Structure

In addition to the features introduced so far, we also examine features for representing link structures, such as the out-degree, the maximum number of outlinks from a blog homepage to any one URL, and the number of mutual links to other blog homepages. However, none of those features contributes to improving the performance of the best combination of features.

4. SPLOG DETECTION AND A CONFIDENCE MEASURE

4.1 Splog Detection by SVMs

As a tool for learning SVMs, we use TinySVM (http://chasen.org/~taku/software/TinySVM/). As the kernel function, we compare the linear and the polynomial (2nd order) kernels, which perform mostly comparably. Thus, in section 5, we show the results with the linear kernel.
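The experiments use TinySVM; purely as an illustration, the sketch below reproduces the same setup (a linear-kernel SVM over the features of section 3) with scikit-learn as a stand-in, since it also exposes the signed decision value needed for the confidence measure of section 4.2. This is not a description of the actual tool chain, and the tiny feature matrix is a toy example.

    import numpy as np
    from sklearn.svm import SVC

    # X: feature matrix (one row per blog homepage, columns = features of section 3);
    # y: labels, +1 for splog and -1 for authentic blog.  Toy values for illustration.
    X = np.array([[0.7, 2.1, 0.0],
                  [0.1, 0.0, 1.5],
                  [0.9, 1.8, 0.2],
                  [0.0, 0.3, 2.0]])
    y = np.array([1, -1, 1, -1])

    clf = SVC(kernel="linear")   # the polynomial (2nd order) kernel would be kernel="poly", degree=2
    clf.fit(X, y)

    # Signed decision value w.x + b for each instance, proportional to its distance
    # from the separating hyperplane; with labels {-1, +1}, positive values fall on
    # the +1 (splog) side.  This is what section 4.2 thresholds.
    distances = clf.decision_function(X)
    print(clf.predict(X), distances)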

4.2 A Confidence Measure

As a confidence measure of SVM learning, we employ the distance from the separating hyperplane to each test instance [14]^12. More specifically, we introduce two lower bounds, LBD_s and LBD_ab, for the distance from the separating hyperplane to each test instance, where LBD_s is for the test instances judged as splogs, while LBD_ab is for those judged as authentic blogs. If a test instance x is judged as a splog, but its distance from the separating hyperplane is not greater than LBD_s, then the decision is not regarded as confident and is rejected. In the case of x being judged as an authentic blog, the lower bound LBD_ab is considered in a similar fashion.

^12 [14] studied this measure in the context of active learning [7, 14, 13], one of the major minimally supervised approaches in the machine learning and statistical natural language processing communities, where, in the active learning framework, the least confident samples are collected, manually annotated, and added to the training data set.
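The rejection rule of this section can be written as a thin wrapper around the signed decision value, as in the minimal Python sketch below; the convention that positive values correspond to the splog side follows the previous sketch and is an assumption about the sign of the labels.

    def confident_label(decision_value, lbd_s, lbd_ab):
        """Return "splog" / "authentic" for confident decisions, or None for rejected ones.

        decision_value: signed SVM output (positive side assumed to be splog);
        lbd_s, lbd_ab:  the lower bounds LBD_s and LBD_ab of section 4.2.
        """
        if decision_value > 0:
            # judged as a splog; keep only if the distance exceeds LBD_s
            return "splog" if decision_value > lbd_s else None
        # judged as an authentic blog; keep only if the distance exceeds LBD_ab
        return "authentic" if -decision_value > lbd_ab else None

    # e.g. labels = [confident_label(d, 1.0, 0.5) for d in distances],
    # where the two thresholds are tuned on held-out data (section 5.4).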

5. EXPERIMENTAL ANALYSIS ON REDUCING SUPERVISION IN SPLOG DETECTION

5.1 Evaluation Measures

Throughout this paper, the performance of splog / authentic blog detection is shown with plots of recall and precision. The performance curve of splog detection is plotted by varying the lower bound LBD_s on the splog side of the data space, while that of authentic blog detection is plotted by varying the lower bound LBD_ab on the authentic blog side of the data space.
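A recall-precision curve of splog detection can be traced by sweeping LBD_s over candidate thresholds and scoring only the instances that survive rejection, as in the minimal sketch below (the authentic blog side is symmetric with LBD_ab); names are illustrative only.

    def splog_pr_curve(decision_values, gold_is_splog, thresholds):
        # decision_values: signed SVM outputs for the evaluation instances
        # gold_is_splog:   parallel list of booleans (True = manually judged splog)
        # thresholds:      candidate values of the lower bound LBD_s
        n_gold_splogs = sum(gold_is_splog)
        curve = []
        for lbd_s in thresholds:
            predicted = [d > 0 and d > lbd_s for d in decision_values]
            tp = sum(p and g for p, g in zip(predicted, gold_is_splog))
            n_pred = sum(predicted)
            precision = tp / n_pred if n_pred else 1.0
            recall = tp / n_gold_splogs if n_gold_splogs else 0.0
            curve.append((recall, precision))
        return curve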

[Figure 1: Evaluation Results of Splog / Authentic Blog Detection (Training: 9/18 of Ds07-08 (Years 2007-2008) + Tr, Evaluation: 1/18 of Ds08-09 (Years 2008-2009)). Panels: (a) Splog Detection, (b) Authentic Blog Detection.]

5.2 Estimating Changes in Splogs over Time through Performance of Splog Detection

In the first experimental evaluation, we compare the two data sets developed in the years 2007-2008 and 2008-2009 through the performance of splog / authentic blog detection. Here, we trained two classifiers, one with the data set developed in the years 2007-2008, and the other with the mixture of the two data sets^13. Then, we evaluate the two classifiers against held-out data taken from the data set developed in the years 2008-2009. In Figure 1, the plots labeled with Tr = none and |Tr| = 9/18, Tr ⊆ Ds07-08 indicate the performance of those trained with the data set from the years 2007-2008, where the training data size for the former is half of that for the latter. Those with |Tr| = 9/18, Tr ⊆ Ds08-09 and |Tr| = 5/18, Tr ⊆ Ds08-09 indicate the performance of those trained with the mixture of the two data sets. As clearly shown in the results of Figure 1, the best performance is obtained when the classifier is trained with the mixture of the two data sets. This is obviously because both (a part of) the training data and the evaluation data are from the data set of the same period. Furthermore, with more training data from the years 2008-2009, the classifier performs better. This result clearly indicates that (at least) splogs change over time (in this case, about one year), and that their changes are somehow not very easy for the SVM classifier trained in the past to catch up with.

^13 The specifications of the training and evaluation data sets are given in Table 1 (b). Out of the total splogs / authentic blogs data sets available in Table 1 (a), training data sets of the same size are constructed both from the years 2007-2008 and from the years 2008-2009. Here, note that within both the training and evaluation data sets, the distribution of splogs and authentic blogs is evenly 50% and 50%.

5.3 Correlation of Splogs / Authentic Blogs Distribution and Performance of Splog / Authentic Blog Detection

In the next experimental evaluation, we especially examine the case where authentic blogs are dominant in the input novel blog homepage data, i.e., in the data set from the years 2008-2009. Here, we assume that 90% of the novel blog homepages are authentic, and all the classifiers are evaluated against held-out data consisting of 10% splogs and 90% authentic blogs taken from the years 2008-2009. Again, the specifications of the training and evaluation data sets are given in Table 1 (c). In this evaluation, all the classifiers except one are trained with the mixture of the two data sets from the years 2007-2008 and from the years 2008-2009. As shown in Figure 2, for those from the years 2007-2008, the distribution of splogs and authentic blogs is evenly 50% and 50%, while for those from the years 2008-2009, the distributions and the scales of their sizes are 2 to 2, 5 to 5, 1 to 9, 3 to 9, and 9 to 9. Compared with the best performing classifier, trained with a data set totally balanced between splogs and authentic blogs for both the years 2007-2008 and the years 2008-2009 (labeled with |Tr(splog)| = |Tr(authentic)| = 9/18), those trained with data sets updated with the most imbalanced input data (labeled with |Tr(splog)| = 1/18, |Tr(authentic)| = 9/18) perform much worse. In addition, the other classifiers trained with data sets updated with less imbalanced input data perform better. These results clearly show that in a situation where authentic blogs are dominant in the input novel blog homepage data, it is quite necessary to incorporate a certain selective sampling technique which outputs samples under a less imbalanced distribution.

5.4 Reducing Supervision in Updating Balanced Labeled Data Set by a Confidence Measure

Based on the evaluation results in the previous section, we examine the confidence measure introduced in section 4.2, and show that it is capable of selectively filtering the input novel blog homepages, which have a quite imbalanced distribution of splogs and authentic blogs, into a much less imbalanced distribution. Here, let us focus on the plots labeled with Tr = none in Figure 2, which indicate a situation where the classifier trained with the data set from the years 2007-2008 is faced with input novel blog homepages with a quite imbalanced distribution of splogs and authentic blogs. By varying the two lower bounds LBD_s and LBD_ab for the distance from the separating hyperplane to each novel blog homepage, this classifier can be easily tuned with a small amount of held-out data from the years 2008-2009 so that it performs with the best precision on both the splog side and the authentic blog side. Especially on the splog side, the classifier can select splogs with 65% precision. Thus, we can conclude that our framework with this confidence measure is capable of selectively filtering the input novel blog homepages with a quite imbalanced distribution of splogs and authentic blogs into a much less imbalanced distribution.
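The selective filtering described here can be sketched as follows, reusing confident_label from the section 4.2 sketch; the lower bounds passed in are assumed to have been tuned on the small held-out set mentioned above, and all names are illustrative.

    def select_for_annotation(homepages, decision_values, lbd_s, lbd_ab):
        # Keep only confidently judged homepages; since the confident splog side is the
        # scarce class, this raises the splog rate in the sample handed to annotators.
        selected_splogs, selected_authentic = [], []
        for page, d in zip(homepages, decision_values):
            label = confident_label(d, lbd_s, lbd_ab)   # from the section 4.2 sketch
            if label == "splog":
                selected_splogs.append(page)
            elif label == "authentic":
                selected_authentic.append(page)
        return selected_splogs, selected_authentic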

6. CONCLUSION

This paper studied how to reduce the amount of human supervision for identifying splogs / authentic blogs in the context of continuously updating splog data sets year by year. We especially discussed the case where authentic blogs are dominant in the input blog homepage data (e.g., 90% of input data are authentic blog homepages). In such a situation, in order to avoid performance damage of updated classifiers, it is quite necessary to incorporate a certain selective sampling technique which outputs samples under a less imbalanced distribution. To tackle this problem, we finally introduced a confidence measure of Support Vector Machines (SVMs) to the task of separating highly confident samples and less confident samples in terms of identifying splogs / authentic blogs. Future work includes introducing other features, such as the ping time series studied in previous works [9, 6, 5]. We are also working on incrementally updating classifiers by training with less imbalanced updated training data, and the results will be reported in the near future.

7. REFERENCES

[1] Wikipedia, Spam blog. http://en.wikipedia.org/wiki/Spam_blog.
[2] N. Glance, M. Hurst, and T. Tomokiyo. BlogPulse: Automated trend discovery for Weblogs. In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004.
[3] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. 1st AIRWeb, pages 39-47, 2005.
[4] P. Kolari, T. Finin, and A. Joshi. SVMs for the blogosphere: Blog identification and splog detection. In Proc. 2006 AAAI Spring Symp. Computational Approaches to Analyzing Weblogs, pages 92-99, 2006.
[5] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social media. Tutorial at ICWSM, 2007.
[6] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proc. 3rd Ann. Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006.
[7] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In Proc. 17th SIGIR, pages 3-12, 1994.
[8] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proc. 3rd AIRWeb, pages 1-8, 2007.
[9] C. Macdonald and I. Ounis. The TREC Blogs06 collection: Creating and analysing a blog test collection. Technical Report TR-2006-224, University of Glasgow, Department of Computing Science, 2006.
[10] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In Proc. 1st AIRWeb, 2005.
[11] T. Nanno, T. Fujiki, Y. Suzuki, and M. Okumura. Automatically collecting, monitoring, and mining Japanese weblogs. In WWW Alt. '04: Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, pages 320-321. ACM Press, 2004.
[12] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami, H. Nakagawa, and N. Kando. Analysing features of Japanese splogs and characteristics of keywords. In Proc. 4th AIRWeb, 2008.
[13] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proc. 17th ICML, pages 839-846, 2000.
[14] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proc. 17th ICML, pages 999-1006, 2000.
[15] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
[16] Y. Wang, M. Ma, Y. Niu, and H. Chen. Spam double-funnel: Connecting web spammers with advertisers. In Proc. 16th WWW Conf., pages 291-300, 2007.

[Figure 2: Evaluation Results of Splog / Authentic Blog Detection (Training: 9/18 of Ds07-08 (Years 2007-2008) + Tr, Evaluation: 1/18 of Splogs and 9/18 of Authentic Blogs in Ds08-09 (Years 2008-2009)). Panels: (1) Splog Detection: (1-a) |Splog| : |Authentic| = 1 : x (in Tr), (1-b) |Splog| : |Authentic| = 1 : 1 (in Tr); (2) Authentic Blog Detection: (2-a) |Splog| : |Authentic| = 1 : x (in Tr), (2-b) |Splog| : |Authentic| = 1 : 1 (in Tr).]
