1 M.Phil. Scholar, Department of Computer Science, Gobi Arts and Science College, Gobi, Tamil Nadu, India. 2 Associate Professor, Department of Computer Science, Gobi Arts and Science College, Gobi, Tamil Nadu, India.
ABSTRACT
The various kinds of spam on the internet are classified based on the properties of spam, such as spam content, type, and ranking. The impact of spam on social networks, email, images, content, and links is discussed, and techniques to prevent spam in each area are listed. The review also covers two groups of spam detection techniques and algorithms: content-based methods and link-based methods. Link-based methods are further subdivided into five groups: label propagation, link pruning and reweighting, label refinement, graph regularization, and feature-based methods. The review compares these five groups on factors such as the type of information used, algorithms, working, complexity, and mining techniques.

1. INTRODUCTION
Spam spreads through any type of information system: e-mail, the web, social networks, blogs, and review platforms. The concept of web spamdexing was introduced in 1996, and web spam was soon recognized as one of the key challenges for the search engine industry. Today, all search engine companies identify adversarial information retrieval as a main concern. First, spam spoils the quality of search results and deprives legitimate websites of revenue they might earn in the absence of spam. Second, it weakens users' trust in a search engine provider, a notable issue since a user can easily switch from one search engine to another. On websites, spam spreads through adult content distribution and malware, and also leads to phishing. For example, one study ranked 100 million webpages using the PageRank algorithm [1] and found that 11 of the top 20 results were pornographic websites that achieved high rankings through content and web link manipulation [2]. The web spam phenomenon occurs mainly because a very high fraction of webpage referrals comes from search engines, and users examine only the top-ranked results.
For 85% of queries only the first result page is requested, and only 3 to 5 links are clicked [3,4]. Website owners therefore attempt to manipulate search engine rankings. The manipulation takes different forms, such as undesired link creation, cloaking, click fraud, and tag spam. It has been shown that 6% of English-language websites can be classified as spam [5]. Other reports find 22% of spam at the host level, or put the figure at 16.5%. The goal of this survey is to give an overview of the various kinds of spam on the internet and the algorithms used in spam detection, and to encourage further research in the area of adversarial information retrieval. The next section of this review explains web spam classification. Section 3 describes spam detection algorithms, and Section 4 describes the key principles to apply in spam detection. Section 5 concludes this review.

2. CLASSIFICATION OF WEB SPAMS
In this section the various kinds of spam and the techniques for preventing them are discussed. Based on recent research and the impact of spam on various areas of the internet, the spam classification is given in Fig. 1.
Fig. 1. Classification of internet spam

2.1 Content Spam
Content spam is the first and most common form of web spam. It is so common because most search engines rank pages using information retrieval models based on web page content, such as the vector space model and statistical language models. Based on document structure, content spamming is subdivided into five types: title spamming, body spamming, meta-tags spamming, anchor text spamming, and URL spamming.
International Journal of Computer Trends and Technology (IJCTT), volume 4, Issue 9, Sep 2013, ISSN: 2231-2803, http://www.ijcttjournal.org
Title Spamming: Search engines usually give higher importance to terms that appear in the title of a document, so it makes sense for spammers to include spam terms in the document title.
Body Spamming: Spam terms are included in the document body. This technique is among the simplest and most popular, and it is almost as old as search engines themselves.
Meta-Tags Spamming: The HTML meta tags that appear in the document header have always been a target of spamming. Because of heavy spamming, search engines currently give low priority to these tags, or even ignore them completely.
Anchor Text Spamming: Search engines assign higher weight to anchor text terms, as they are supposed to offer a summary of the pointed-to document. Therefore, spam terms are sometimes included in the anchor text of HTML hyperlinks to a page. This technique differs from meta-tag spamming in that the spam terms are added not to the target page itself but to other pages that point to the target. Because anchor text is indexed for both pages, the spamming affects the ranking of both the source and the target page.
URL Spamming: Some search engines also break down the URL of a page into a set of terms that are used to establish the significance of the page.
To exploit this, spammers sometimes create long URLs that include sequences of spam terms, and sometimes create a page's URL from words likely to appear in the targeted set of queries.

2.2 Social Network Spam
In the past few years the growth of social networking sites has been very rapid. People converse with their friends, chat, and share multimedia content. Sites like Facebook and Twitter are constantly among the top 20 most-viewed websites on the internet [7]. Statistics show that, compared with other sites, people spend more time on social networks. The rise of social networks allows them to collect a huge amount of personal information about users and their friends. On a social network a person can reach any other person, which attracts malicious parties. In 2008, 83% of users received at least one friend request or message from a stranger [8]. Unfortunately, social networking sites do not provide strong authentication mechanisms, and it is easy to impersonate a person and slip into a person's network of trust [9]. The explosive growth of unwanted messages has encouraged the development of numerous spam filtering techniques [10]. The literature also discusses features that may allow one to differentiate a spammer from valid users, such as node degree and frequency of messages. Spam detection and user behaviour have been studied for a long time. Bayesian classification algorithms are used to differentiate suspicious behaviour from normal behaviour. A Bayesian spam filter is superior to a static keyword-based spam filter because it can continuously adapt to new spam by learning keywords from new spam emails. A few current spam filters use social networks to aid spam detection. To address the drawbacks of the Bayesian spam filter, a user-friendly filter called SOAP (Social network Aided Personalized and effective spam filter) has been proposed [11]. Unlike previous filtering techniques that focus on parsing keywords (e.g., the Bayesian filter), SOAP exploits the social relationships among email correspondents.
SOAP detects spam adaptively and automatically. It integrates three components into a basic Bayesian filter: social closeness-based spam filtering, social interest-based spam filtering, and adaptive trust management. SOAP can significantly improve the performance of Bayesian spam filters in terms of accuracy, attack-resilience, and efficiency of spam detection.

2.3 Email Spam
The most common form of communication on the internet is email. With the huge growth of email, unwanted e-mail also emerged very quickly, at one point accounting for almost 90% of all email messages. The cost of sending these e-mails is very close to zero, making it easy to reach a high number of potential consumers [12]. In this situation, spam consumes resources: time spent reading unwanted messages, bandwidth, CPU, and disk. The email system design can easily be abused by spammers who send falsified information. All email on the internet is sent via the Simple Mail Transfer Protocol (SMTP). SMTP is designed to capture information about the path an email message travels from sender to receiver, but it provides no security: email is not confidential, and there is no way to authenticate the identity of the email source. The lack of security in SMTP, and specifically the lack of reliable information identifying the email source, is often exploited by spammers and allows for considerable fraud on the internet. For handling email spam, a novel hybrid model, Partitioned Logistic Regression, has been introduced, which has several advantages over both naive Bayes and logistic regression [13]. An ant-colony-based spam filter has also been developed to evaluate and predict spam messages; compared with techniques such as the multi-layer perceptron, naive Bayes, and Ripper classifiers, it is an alternative tool for predicting spam and yields better accuracy. A rule-based filter offers lightweight and accurate detection of email spam [14].
This filter cascades three filters: one for fingerprints of message bodies, one for white and black lists of email addresses in the message header, and one for words specific to spam and legitimate email in the message header. The method has a high throughput of about 90 emails per second.
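As an illustration of the cascade just described, here is a minimal sketch in Python. The fingerprints, address lists, trigger words, and the two-hit threshold are all invented for the example, not taken from the cited filter [14]; a real system would learn these from labelled mail.

```python
import hashlib

# Hypothetical data for illustration: known-spam body fingerprints,
# address lists, and spam-indicative words (all made up here).
SPAM_FINGERPRINTS = {hashlib.sha1(b"buy cheap meds now").hexdigest()}
BLACKLIST = {"promo@spam.example"}
WHITELIST = {"alice@work.example"}
SPAM_WORDS = {"winner", "lottery", "viagra", "cheap"}

def classify(sender: str, subject: str, body: str) -> str:
    """Cascade of three cheap checks, in the spirit of the rule-based
    filter described above (stages and thresholds are assumptions)."""
    # Stage 1: exact fingerprint match on the message body.
    if hashlib.sha1(body.encode()).hexdigest() in SPAM_FINGERPRINTS:
        return "spam"
    # Stage 2: white/black lists on the sender address.
    if sender in WHITELIST:
        return "ham"
    if sender in BLACKLIST:
        return "spam"
    # Stage 3: count spam-specific words in subject and body.
    words = (subject + " " + body).lower().split()
    hits = sum(1 for w in words if w in SPAM_WORDS)
    return "spam" if hits >= 2 else "ham"
```

Each stage is cheaper than the next-most-accurate one, which is what gives a cascade its speed: most messages are decided before the word-level check runs.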
2.4 Image Spam
Recently, spammers have proliferated image spam: emails that contain the text of the spam message in a human-readable image instead of the message body. It consists of embedding the spam message into images that are sent as email attachments. Its goal is to circumvent the analysis of the email's textual content performed by spam filters, including automatic text classifiers. Since attached images are displayed by default by most email clients, the message is conveyed directly to the user as soon as the email is opened. The simplest kind of image spam can be viewed as a screenshot of plain text written in a standard text editor. Detecting image spam with conventional content filters is very difficult, and new techniques are needed to filter these messages. Spam images are often constructed by introducing random changes to a given template image, to make signature-based detection techniques ineffective, and are obfuscated to prevent optical character recognition (OCR) tools from reading the embedded text. Ironically, some text obfuscation techniques used against OCR tools are very similar to those exploited to design CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart). A common type of CAPTCHA requires the user to type letters and/or digits from a distorted image that appears on the screen. Such tests are commonly used to prevent unwanted internet bots from accessing websites, since a normal human can easily read a CAPTCHA while a bot cannot process the image letters and therefore cannot answer properly, or at all. A comprehensive solution to image spam filtering has been presented that combines cluster analysis of spam images on the server side with active-learning classification on the client side [15].
Extensive experimental evaluations of both the server-side and client-side algorithms on a real image spam dataset collected from an e-mail server demonstrated the efficiency of this method. A machine learning method has also been proposed to distinguish spam images from images in normal emails [16]. It extracts efficient global image features to train a binary classifier, achieving promising preliminary results on a limited sample database: fairly good performance in 5-fold cross-validation on a data set with 928 spam images and 810 natural images.

2.5 Click Spam
In this type of spam, spammers generate fraudulent clicks to steer ranking functions toward their websites. To achieve this, spammers submit queries to a search engine and click on the links that point to their target pages [17,18]. Online advertising is another motivation for spammers to generate fake clicks [19].
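The global-image-feature idea of Section 2.4 can be sketched as follows. The specific features (contrast fraction, histogram entropy) and the threshold rule are invented stand-ins for the trained binary classifier of the cited work [16]; they only illustrate why text rendered into an image is statistically unlike a natural photograph.

```python
import math

def global_features(pixels):
    """Compute a few global features of a grayscale image, given as a
    list of rows of 0-255 ints. The feature choice is illustrative,
    not the feature set of the cited work."""
    flat = [p for row in pixels for p in row]
    n = len(flat)
    mean = sum(flat) / n
    var = sum((p - mean) ** 2 for p in flat) / n
    # Fraction of near-binary pixels: text rendered into an image
    # tends to be high-contrast (background plus dark glyphs).
    binary_frac = sum(1 for p in flat if p < 32 or p > 223) / n
    # Entropy of a 16-bin intensity histogram.
    hist = [0] * 16
    for p in flat:
        hist[p // 16] += 1
    entropy = -sum((c / n) * math.log2(c / n) for c in hist if c)
    return {"mean": mean, "var": var,
            "binary_frac": binary_frac, "entropy": entropy}

def looks_like_text_image(pixels):
    f = global_features(pixels)
    # Assumed rule standing in for a trained binary classifier:
    # high-contrast, low-entropy images resemble rendered text.
    return f["binary_frac"] > 0.8 and f["entropy"] < 2.0
```

A real detector would feed such features into a learned classifier rather than hand-set thresholds, but the separation between rendered-text images and natural photographs is already visible in these simple statistics.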
2.6 Cloaking and Redirection
Cloaking is a search engine optimization technique in which the content presented to the search engine differs from that presented in the user's browser. Redirection sends users automatically to another URL after the requested URL is loaded. Both cloaking and redirection are used in search engine spamming [20]. To differentiate users from crawlers, spammers analyze the user-agent field of the HTTP request and keep track of the IP addresses used by search engine crawlers. Cloaking is very hard to detect. One cloaking detection method identifies cloaking by using the most frequent words from the MSN query log and the highest revenue-generating words from the MSN advertisement log. Cloaking can also be detected by comparing crawls made with different user-agent strings and IP addresses. Spammers also purchase sites that have terminated their operation and fill them with spam. For some time these sites appear with their previous content both in the search engine index and in the input for the spam classifier; meanwhile the crawler fetches the new content, believed to be honest, follows its links, and prioritizes its processing in support of the spammer's target.

2.7 Link Spam
Link spam is also called comment spam or blog spam. This kind of spam targets weblogs, guestbooks, and discussion boards. Any web application that displays hyperlinks submitted by visitors, or the referring URLs of web visitors, may be a target. Link spamming occurs in internet guestbooks, where spammers repeatedly fill a guestbook with links to their own site. The two types of link spam are outgoing link spam and incoming link spam.

3. ALGORITHMS
The algorithms are categorized into three groups. The first consists of techniques that analyze content features, such as word counts or language models, and content duplication.
The second consists of techniques that analyze link-based information such as neighbour-graph connectivity, perform link-based trust and distrust propagation, link pruning, and graph-based label smoothing, and study statistical anomalies. Finally, the last group includes algorithms that exploit click-stream data, user behaviour data, query popularity information, and HTTP session information.

3.1 Content-based Spam Detection
In this type of spam detection, web pages can be identified through statistical analysis, since spam pages are usually generated automatically using phrase stitching and weaving techniques [26]. Researchers have found that the URLs of spam pages have an exceptional number of dots, dashes, and digits, and exceptional length. They report that 80 of the 100 longest discovered host names refer to adult websites, while 11 refer to financial-credit-related websites. They also show that the pages themselves have a duplicating nature: most spam pages that reside on the same host have very low word-count variance. Another interesting observation is that spam page content changes very rapidly.
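The URL statistics just described translate directly into lexical features. The sketch below extracts them; the thresholds in `suspicious_url` are invented for illustration, not taken from the cited measurements, and a real detector would feed the features into a trained classifier instead.

```python
def url_features(url: str) -> dict:
    """Simple lexical features of the kind the content-based studies
    above report as discriminative for spam URLs."""
    host = url.split("//")[-1].split("/")[0]
    return {
        "length": len(url),
        "dots": host.count("."),
        "dashes": host.count("-"),
        "digits": sum(c.isdigit() for c in host),
    }

def suspicious_url(url: str) -> bool:
    f = url_features(url)
    # Invented heuristic thresholds, standing in for a learned model.
    return (f["dots"] > 4 or f["dashes"] > 3
            or f["digits"] > 5 or f["length"] > 80)
```

For example, a keyword-stuffed host name such as `cheap-pills-best-price-now.deals.example.com` trips the dash threshold, while an ordinary URL does not.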
Table 1. Comparison of link-based spam detection methods
Specifically, they studied the average amount of week-to-week change of all the web pages on a given host and found that the most volatile spam hosts can be detected with 97.2% accuracy based only on this feature.
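Several of the link-based methods surveyed next propagate labels from a small hand-labelled seed set. A minimal TrustRank-style sketch of this label propagation is shown below; the toy graph, seed set, and parameter values (the conventional damping factor 0.85 and a fixed iteration count) are assumptions for illustration, not the cited algorithms themselves.

```python
def propagate_trust(graph, seeds, alpha=0.85, iters=50):
    """TrustRank-style label propagation: a biased PageRank whose
    teleport mass goes only to a seed set of known-good pages.
    graph maps each page to the list of pages it links to."""
    nodes = list(graph)
    n_seeds = len(seeds)
    # Start with all trust concentrated uniformly on the seeds.
    trust = {v: (1.0 / n_seeds if v in seeds else 0.0) for v in nodes}
    for _ in range(iters):
        # Teleport component: only seeds receive restart mass.
        nxt = {v: ((1 - alpha) / n_seeds if v in seeds else 0.0)
               for v in nodes}
        # Propagate each page's trust equally along its out-links.
        for v, out in graph.items():
            if out:
                share = alpha * trust[v] / len(out)
                for w in out:
                    nxt[w] += share
        trust = nxt
    return trust
```

Pages reachable from the seeds accumulate trust, while an isolated spam clique receives none; thresholding the resulting scores separates the two, which is the essence of the approximate-isolation principle discussed in Section 4.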
3.2 Link-based Spam Detection
All link-based spam detection algorithms can be subdivided into five groups. The first group exploits the topological relationship between the web pages and a set of pages for which labels are known [21]. The second group focuses on identifying suspicious nodes and links and subsequently downweighting them [22]. The third extracts link-based features for each node and uses various machine learning algorithms to perform spam detection [23]. The fourth group uses the idea of label refinement based on web graph topology, in which preliminary labels predicted by a base algorithm are modified by propagation through the hyperlink graph or by a stacked classifier [24]. Finally, there is a group of algorithms based on graph regularization techniques for web spam detection [25]. Table 1 compares these link-based spam detection methods; the main criteria used for comparison are algorithms, working, mining technique, type of information used, and complexity. Among all the algorithms, HITS and PageRank are the most important, because most link-based methods build on the PageRank and HITS algorithms.

Label Propagation. Algorithms: PageRank, TrustRank, HITS. Working: exploits the topological relationship between web pages. Mining technique: WSM. Information used: topological relationship. Complexity: internal structure.
Link Pruning and Reweighting. Algorithms: PageRank. Working: identifies suspicious nodes and links and subsequently downweights them. Mining technique: WSM, WCM. Information used: downweighting of nodes and links. Complexity: relationship between the nodes.
Label Refinement. Algorithms: clustering algorithms. Working: uses label refinement based on the web graph topology. Mining technique: WCM. Information used: URL. Complexity: limited data sets are allowed.
Graph Regularization. Algorithms: PageRank. Working: based on graph regularization methods. Mining technique: WSM. Information used: structural patterns. Complexity: URL classification.
Feature-Based Methods. Algorithms: Truncated PageRank. Working: extracts link-based features for each node and applies various machine learning algorithms. Mining technique: WSM. Information used: link-based features of each node. Complexity: not specified.

4. KEY PRINCIPLES
Analyzing all the related work devoted to web spam mining, a set of underlying principles can be identified that are frequently used in algorithm construction. Due to its machine-generated nature and its focus on search engine manipulation, spam shows abnormal properties: a high level of duplicate content and links, rapid changes of content, and language models that deviate significantly from those built for the normal web. Spam pages deviate from power-law distributions of numerous web graph statistics such as PageRank or the number of in-links. Spammers mostly target popular queries and queries with high advertising value. Spammers build their link farms with the aim of boosting rankings as high as possible, and therefore link farms have specific topologies that can be theoretically analyzed for optimality. According to experiments, the principle of approximate isolation of good pages holds: good pages mostly link to good pages, while bad pages link either to good pages or to a few selected spam target pages. It has also been observed that connected pages have some level of semantic similarity (the topical locality of the web), and therefore label smoothing using the web graph is a useful strategy. Numerous algorithms use the idea of trust and distrust propagation with various similarity measures, propagation strategies, and seed selection heuristics. Because one spammer can have many pages under one website and use them all to boost the ranking of some target pages, it makes sense to analyze the host graph, or even to perform clustering and consider clusters as the logical unit of link support. In addition to traditional page content and links, there are many other sources of information, such as user behaviour or HTTP requests, and there is hope that more will be developed in the near future. Clever feature engineering is especially important for web spam detection. Although new and sophisticated features can push the state of the art further, proper selection and training of machine learning models is also of high importance.

5. CONCLUSION
The different kinds of spam that affect the internet are classified, and the techniques used to fight spam are studied. The problems caused by different kinds of spam on websites and the solutions available to filter spam are also discussed. For filtering spam in social networks, the advanced Bayesian filtering technique SOAP is suggested; for filtering email spam, a rule-based filter using data mining concepts is suggested. All the types of spam on the internet are analyzed and effective spam filtering methods are listed. Numerous algorithms for web spam detection are also discussed, and their characteristics and underlying ideas analyzed. Finally, the key principles behind anti-spam algorithms are summarized.
REFERENCES
[1] Eiron N, McCurley K S, and Tomlin J A. Ranking the web frontier. In Proceedings of the 13th International Conference on World Wide Web, WWW04, New York, NY, 2004.
[2] Henzinger M R, Motwani R, and Silverstein C. Challenges in web search engines. SIGIR Forum, 36, 2002.
[3] Silverstein C, Marais H, Henzinger M, and Moricz M. Analysis of a very large web search engine query log. SIGIR Forum, 33, Sept. 1999.
[4] Joachims T, Granka L, Pan B, Hembrooke H, and Gay G. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR05, Salvador, Brazil, 2005.
[5] Adali S, Liu T, and Magdon-Ismail M. Optimal link bombs are uncoordinated. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, AIRWeb05, Chiba, Japan, 2005.
[6] Castillo C, Donato D, Gionis A, Murdock V, and Silvestri F. Know your neighbors: web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR07, Amsterdam, The Netherlands, 2007.
[7] http://www.alexa.com/topsites/global
[8] Harris Interactive public relations research. A study of social network scams, 2008.
[9] Moyer S and Hamiel N. Satan is on my friends list: attacking social networks. http://www.blackhat.com/html/bh-usa-08/bh-usa-08-archive.html, 2008.
[10] Brown G, Howe T, Ihbe M, Prakash A, and Borders K. Social networks and context-aware spam. In ACM Conference on Computer Supported Cooperative Work, 2008.
[11] Li Z and Shen H. SOAP: a Social network Aided Personalized and effective spam filter to clean your e-mail box. In Proceedings of IEEE INFOCOM, April 2011.
[12] Cheng Y and Li C. Personalized spam filtering with semi-supervised classifier ensemble. In IEEE/WIC/ACM International Conference on Web Intelligence, 2006.
[13] Chang M-W, Yih W-T, and Meek C. Partitioned logistic regression for spam filtering. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 97-105, 2008.
[14] Takesue M. Cascaded simple filters for accurate and lightweight email-spam detection. In Fourth International Conference on Emerging Security Information, Systems and Technologies (SECURWARE), July 2010.
[15] Gao Y et al. A comprehensive approach to image spam detection: from server to client solution. IEEE Transactions on Information Forensics and Security, Dec. 2010.
[16] Gao Y et al. Image spam hunter. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2008.
[17] Chellapilla K and Chickering D M. Improving cloaking detection using search query popularity and monetizability, 2006.
[18] Daswani N and Stoppelman M. The anatomy of Clickbot.A. In Proceedings of the First Workshop on Hot Topics in Understanding Botnets, Berkeley, CA, 2007. USENIX Association.
[19] Cheng Z, Gao B, Sun B, Jiang Y, and Liu T Y. Let web spammers expose themselves. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM11, Hong Kong, China, 2011.
[20] Dalvi N, Domingos P, Mausam, Sanghai S, and Verma D. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD04, WA, USA, 2004.
[21] Benczúr A A, Csalogány K, Sarlós T, and Uher M. SpamRank: fully automatic link spam detection (work in progress). In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, AIRWeb05, May 2005.
[22] Da Costa Carvalho A L, Chirita P A, de Moura E S, Calado P, and Nejdl W. Site level noise removal for search engines. In Proceedings of the 15th International Conference on World Wide Web, WWW06, Edinburgh, Scotland, 2006.
[23] Amitay E, Carmel D, Darlow A, Lempel R, and Soffer A. The connectivity sonar: detecting site functionality by structural patterns. In Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia, Nottingham, UK, 2003.
[24] Gan Q and Suel T. Improving web spam classifiers using link structure. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, AIRWeb07, Banff, Alberta, 2007.
[25] Abernethy J, Chapelle O, and Castillo C. Graph regularization methods for web spam detection. Machine Learning, Vol. 81, Nov. 2010.
[26] Gyöngyi Z and Garcia-Molina H. Web spam taxonomy. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, AIRWeb05, Chiba, Japan, May 2005.