Você está na página 1de 6

International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 9, September 2013

ISSN: 2231-2803    http://www.ijcttjournal.org


A Study on Web Spam Classification and Algorithms

K.K. Arthi, M.Sc.¹, Dr. V. Thiagarasu, M.Sc., PGDCA, M.Phil., B.Ed., Ph.D.²

¹M.Phil. Scholar, Department of Computer Science, Gobi Arts and Science College, Gobi, Tamil Nadu, India
²Associate Professor, Department of Computer Science, Gobi Arts and Science College, Gobi, Tamil Nadu, India

ABSTRACT
The various spams on the internet are classified based on properties such as spam content, type, and ranking. The impact of spam on social networks, email, images, content, and links is discussed, and techniques are listed to prevent spam in each of these areas. The review also covers two groups of spam detection techniques and algorithms: content-based methods and link-based methods. Link-based methods are further subdivided into five groups: label propagation, link pruning and reweighting, label refinement, graph regularization, and feature-based methods. The review compares these five groups on factors such as the type of information used, algorithms, working, complexity, and mining techniques.
1. INTRODUCTION
Spam spreads through any type of information system, such as e-mail, the web, social networks, or blog and review platforms. The concept of web spamdexing was introduced in 1996, and web spam was soon recognized as one of the key challenges for the search engine industry. Nowadays, all search engine companies treat it as a main concern in information retrieval. First, spam spoils the quality of search results and deprives legitimate websites of the revenue they might earn in the absence of spam. Second, it weakens users' trust in a search engine provider, which is a notable issue since a user can easily switch from one search engine to another.
On websites, spam spreads through adult content distribution and malware, and also leads to phishing. For example, one study ranked 100 million web pages using the PageRank algorithm [1] and found that 11 of the top 20 results were pornographic websites that achieved their high ranking through content and web link manipulation [2]. The web spam phenomenon occurs mainly because the fraction of webpage referrals that come from search engines is very high, and users examine only the top-ranked results: for 85% of queries only the first result page is requested, and only 3 to 5 links are clicked [3,4]. Website owners therefore attempt to manipulate search engine rankings. The manipulation takes different forms, such as undesired link creation, cloaking, click fraud, and tag spam. It has been shown that 6% of English-language websites can be classified as spam [5]; other reports find that 22% of hosts are spam, while another places the figure at 16.5%. The goal of this survey is to give an overview of the various spams on the internet and the algorithms used in spam detection, and also to encourage further research in the area of adversarial information retrieval.
The next section of this review explains web spam classification. Section 3 describes spam detection algorithms, and Section 4 describes the key principles applied in spam detection. Section 5 concludes this review.
2. CLASSIFICATION OF WEB SPAMS
In this section the various spams and the techniques for preventing them are discussed. Based on recent research and the impact of spam in various areas of the internet, the spam classification is given in Fig. 1.

Fig. 1. Classification of internet spam
2.1 Content Spam
Content spam is the first and most common form of web spam. It is so common because most search engines rank web pages using information retrieval models based on page content, such as the vector space model and statistical models.
Based on the document structure, content spamming is subdivided into five types: title spamming, body spamming, meta-tag spamming, anchor text spamming, and URL spamming.
Title Spamming: Search engines usually give higher importance to terms that appear in the title of a document. Thus, it makes sense for spammers to include spam terms in the document title.
Body Spamming: The spam terms are included in the document body. This spamming technique is among the simplest and most popular ones, and it is almost as old as search engines themselves.
Meta-Tag Spamming: The HTML meta tags that appear in the document header have long been a target of spamming. Because of the heavy spamming, search engines currently give low priority to these tags, or even ignore them completely.
Anchor Text Spamming: Search engines assign higher weight to anchor text terms, as they are supposed to offer a summary of the pointed-to document. Therefore, spam terms are sometimes included in the anchor text of HTML hyperlinks to a page. This technique differs from meta-tag spamming in that the spam terms are added not to the target page itself but to the other pages that point to the target. Since anchor text gets indexed for both pages, this spamming has an impact on the ranking of both the source and target pages.
URL Spamming: Some search engines also break down the URL of a page into a set of terms that are used to establish the relevance of the page. To exploit this, spammers sometimes create long URLs that include sequences of spam terms, or compose the URL of a page from words that are expected to appear in the targeted set of queries.
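
To make these five surfaces concrete, the sketch below (our illustration, not from any cited paper) separates the term streams that title, body, meta-tag, anchor-text, and URL spamming each target, so that per-field term statistics could be fed to a detector. It uses only the Python standard library; all names are ours.

```python
# A minimal sketch of extracting the five content-spam surfaces.
from html.parser import HTMLParser
from urllib.parse import urlparse
import re

class SpamSurfaceExtractor(HTMLParser):
    """Collects title, body text, meta keywords, and anchor text."""
    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "body": [], "meta": [], "anchor": []}
        self._stack = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)
        if tag == "meta":
            attrs = dict(attrs)
            if attrs.get("name", "").lower() in ("keywords", "description"):
                self.fields["meta"].append(attrs.get("content", ""))

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if "title" in self._stack:
            self.fields["title"].append(data)
        elif "a" in self._stack:
            self.fields["anchor"].append(data)
        else:
            self.fields["body"].append(data)

def spam_surfaces(url, html):
    parser = SpamSurfaceExtractor()
    parser.feed(html)
    surfaces = {k: " ".join(v).split() for k, v in parser.fields.items()}
    # URL spamming target: treat the URL itself as a bag of terms.
    parts = urlparse(url)
    surfaces["url"] = re.split(r"[./\-_?=&]+", parts.netloc + parts.path)
    return surfaces
```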
2.2 Social Network Spam
In the past few years the growth of social networking sites has been very high. People converse with their friends, chat, and share multimedia content. Sites like Facebook and Twitter are constantly among the top 20 most viewed websites on the internet [7]. Statistics show that, compared with other sites, people spend more time on social networks. The rise in popularity of social networks allows them to collect a huge amount of personal information about their users. In a social network a person can reach any other person, which is attractive to malicious parties. In 2008, 83% of users received at least one unknown friend request or message [8]. Unfortunately, social networking sites do not provide strong authentication mechanisms, and it is easy to impersonate a person and slip into a person's network of trust [9]. The explosive growth of unwanted messages has encouraged the development of numerous spam filtering techniques [10], which also discuss features that may allow one to differentiate a spammer from valid users, such as node degree and frequency of messages. Spam detection and the behaviour of users have been studied for a long time; Bayesian classification algorithms are used to differentiate suspicious behaviours from normal ones.
A Bayesian spam filter is superior to a static keyword-based spam filter because it can continuously adapt to new spam by learning keywords in new spam emails. Few current spam filters use social networks to aid spam detection. To address the drawbacks of the Bayesian spam filter, a user-friendly filter called the Social network Aided Personalized and effective spam filter (SOAP) has been proposed [11]. Unlike previous filtering techniques that focus on parsing keywords (e.g., the Bayesian filter), SOAP exploits the social relationships among email correspondents and detects spam adaptively and automatically. SOAP integrates three components into a basic Bayesian filter: social closeness-based spam filtering, social interest-based spam filtering, and adaptive trust management. SOAP can significantly improve the performance of Bayesian spam filters in terms of accuracy, attack resilience, and efficiency of spam detection.
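
As a minimal illustration of the adaptive Bayesian filtering idea described above (not SOAP itself, which additionally weights scores by social closeness and interests [11]), the following sketch relearns word likelihoods from every newly labelled message:

```python
import math
from collections import Counter

class NaiveBayesFilter:
    """Minimal Bayesian filter: relearns word likelihoods from each
    newly labelled message instead of using a fixed keyword list."""
    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.msgs = Counter()

    def train(self, tokens, label):
        self.words[label].update(tokens)
        self.msgs[label] += 1

    def score(self, tokens):
        # Log-posterior ratio with add-one smoothing; > 0 means spam.
        s = math.log((self.msgs["spam"] + 1) / (self.msgs["ham"] + 1))
        n_spam = sum(self.words["spam"].values()) + 1
        n_ham = sum(self.words["ham"].values()) + 1
        for t in tokens:
            s += math.log((self.words["spam"][t] + 1) / n_spam)
            s -= math.log((self.words["ham"][t] + 1) / n_ham)
        return s

f = NaiveBayesFilter()
f.train("cheap pills buy now".split(), "spam")
f.train("meeting notes attached".split(), "ham")
print(f.score("buy cheap pills".split()) > 0)  # True: classified as spam
```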
2.3 Email Spam
The most common form of communication on the internet is email. With the huge growth of email, the burden of unwanted e-mail also emerged very quickly, with spam accounting for almost 90% of all email messages. The cost of sending these e-mails is very close to zero, making it easy to reach a large number of potential consumers [12]. In this situation, spam consumes resources: time spent reading unwanted messages, bandwidth, CPU, and disk.
The design of the email system can easily be abused by spammers who supply false information. All email on the Internet is sent via the Simple Mail Transfer Protocol (SMTP), which is designed to record information about the path an email message travels from sender to receiver. The SMTP protocol provides no security, however: email is not confidential, and there is no way to authenticate the identity of the email source. This lack of security in SMTP, and specifically the lack of reliable information identifying the email source, is often exploited by spammers and allows for considerable fraud on the Internet. For handling email spam, a novel hybrid model, Partitioned Logistic Regression, has been introduced, which has several advantages over both naive Bayes and logistic regression [13]. An ant-colony-based spam filter has also been developed to evaluate and predict spam messages; compared with other techniques such as multi-layer perceptron, naive Bayes, and Ripper classifiers, it is an alternative tool for predicting spam and also yields better accuracy. A rule-based filter offers lightweight and accurate detection of email spam [14]. This filter cascades three filters: one for fingerprints of message bodies, another for white and black lists of email addresses in the header, and a last one for words specific to spam and legitimate email in the message header. The method achieves a high throughput of about 90 emails per second.
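
A sketch in the spirit of the cascade in [14] is shown below; the three stages mirror the description above, while the fingerprinting scheme, lists, and word sets are illustrative placeholders of our own:

```python
import hashlib

# Illustrative placeholder data; a real deployment would load these.
KNOWN_SPAM_FINGERPRINTS: set = set()
WHITELIST = {"friend@example.org"}
BLACKLIST = {"spammer@example.net"}
SPAM_WORDS = {"viagra", "lottery", "winner"}

def fingerprint(body: str) -> str:
    return hashlib.sha1(body.lower().encode("utf-8")).hexdigest()

def classify(sender: str, subject: str, body: str) -> str:
    # Stage 1: fingerprint of the message body against known spam.
    if fingerprint(body) in KNOWN_SPAM_FINGERPRINTS:
        return "spam"
    # Stage 2: white and black lists on the sender address in the header.
    if sender in WHITELIST:
        return "ham"
    if sender in BLACKLIST:
        return "spam"
    # Stage 3: words specific to spam in the message header.
    if set(subject.lower().split()) & SPAM_WORDS:
        return "spam"
    return "ham"

print(classify("x@example.com", "you are a lottery winner", ""))  # spam
```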

2.4 Image Spam
Recently, spammers have proliferated image spam: emails which contain the text of the spam message in a human-readable image instead of the message body. It consists in embedding the spam message into images which are sent as email attachments. Its goal is to circumvent the analysis of the email's textual content performed by spam filters, including automatic text classifiers. Since attached images are displayed by default by most email clients, the message is conveyed to the user as soon as the email is opened. The simplest kind of image spam can be viewed as a screenshot of plain text written in a standard text editor. Detecting image spam with conventional content filters is very difficult, so new techniques are needed to filter these messages. Spam images are often constructed by introducing random changes to a given template image, to make signature-based detection techniques ineffective, and are obfuscated to prevent optical character recognition (OCR) tools from reading the embedded text.
Ironically, some text obfuscation techniques used against OCR tools are very similar to those used to design the Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA). A common type of CAPTCHA requires the user to type letters and/or digits from a distorted image that appears on the screen. Such tests are commonly used to prevent unwanted internet bots from accessing websites, since a normal human can easily read a CAPTCHA while a bot cannot process the image letters and therefore cannot answer properly, or at all. A comprehensive solution to image spam filtering has been presented which combines cluster analysis of spam images on the server side with active-learning classification on the client side [15]. Extensive experimental evaluations of both the server-side and client-side algorithms on a real image spam dataset collected from an e-mail server demonstrated the efficiency of this method. A machine learning method has also been proposed to distinguish spam images from images in normal emails [16]. It extracts efficient global image features to train an advanced binary classifier, achieving promising preliminary results on a limited sample database: fairly good performance in 5-fold cross-validation on a data set with 928 spam images and 810 natural images.
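
As a hedged illustration of the global-feature idea (the exact feature set of [16] differs and is not reproduced here), the sketch below computes a few whole-image statistics using the Pillow library; the resulting vectors would then train any binary classifier:

```python
from PIL import Image  # assumes the Pillow library is installed

def global_features(path: str) -> dict:
    """A few simple whole-image statistics; spam screenshots of text
    tend to have few distinct colours and banner-like shapes."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    colors = img.getcolors(maxcolors=w * h)  # list of (count, rgb)
    n_colors = len(colors) if colors else w * h
    return {
        "width": w,
        "height": h,
        "aspect_ratio": w / h,
        "distinct_colors": n_colors,
        "color_density": n_colors / float(w * h),  # low for flat text
    }
```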
2.5 Click Spam
In this type of spam, spammers generate fraudulent clicks to manipulate click-through statistics in favour of their websites. To achieve this goal, spammers submit queries to a search engine and click on the links that point to their target pages [17,18]. Online advertising is another motivation for spammers to generate fake clicks [19].

2.6 Cloaking and Redirection
Cloaking is a search engine optimization technique in which the content presented to the search engine is different from that presented in the user's browser. Redirection sends users automatically to another URL after the requested URL has loaded. Both cloaking and redirection are used in search engine spamming [20]. To differentiate users from crawlers, spammers analyze the user-agent field of the HTTP request and keep track of the IP addresses used by search engine crawlers. Cloaking is very hard to detect. One cloaking detection method identifies cloaking by using the most frequent words from the MSN query log and the highest revenue-generating words from the MSN advertisement log. Cloaking can also be detected by comparing crawls made with different user-agent strings and IP addresses.
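
A minimal sketch of this crawl-comparison idea follows; the user-agent strings and the similarity threshold are illustrative assumptions, not values from the literature:

```python
# Fetch the same URL as a "browser" and as a "crawler" and compare
# the word sets; a low overlap suggests cloaked content.
import urllib.request

def fetch_words(url, user_agent):
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return set(resp.read().decode("utf-8", errors="ignore").lower().split())

def looks_cloaked(url, threshold=0.5):
    browser = fetch_words(url, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    crawler = fetch_words(url, "Googlebot/2.1 (+http://www.google.com/bot.html)")
    jaccard = len(browser & crawler) / max(1, len(browser | crawler))
    return jaccard < threshold
```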
Spammers also purchase sites whose operation has terminated and fill them with spam. For some time these sites appear with their previous content both in the search engine index and in the input to the spam classifier. Meanwhile the crawler fetches the new content, believes it to be honest, follows its links, and prioritizes its processing in favour of the spammer's target.
2.7 Link Spam
Link spam is also called comment spam or blog spam. This kind of spam targets weblogs, guestbooks, and discussion boards. Any web application that displays hyperlinks submitted by visitors, or the referring URLs of web visitors, may be a target. Link spamming occurs in internet guestbooks, where spammers repeatedly fill a guestbook with links to their own site. The two types of link spam are outgoing link spam and incoming link spam.
3. ALGORITHMS
The algorithms are categorized into three groups. The first consists of techniques that analyze content features, such as word counts or language models, and content duplication. The second uses link-based information such as neighbour-graph connectivity, performing link-based trust and distrust propagation, link pruning, graph-based label smoothing, and the study of statistical anomalies. Finally, the last group includes algorithms that exploit click-stream data, user behaviour data, query popularity information, and HTTP session information.
3.1 Content-Based Spam Detection
In this type of detection, spam web pages can be identified through statistical analysis, since spam pages are usually generated automatically using phrase stitching and weaving techniques [26]. Researchers found that the URLs of spam pages have an exceptional number of dots, dashes, and digits, and exceptional length. They report that 80 of the 100 longest discovered host names refer to adult websites, while 11 refer to financial credit-related websites. They also show that the pages themselves have a duplicating nature: most spam pages residing on the same host have very low word count variance. Another interesting observation is that spam page content changes very rapidly. Specifically, researchers studied the average amount of week-to-week change of all the web pages on a given host, and found that the most volatile spam hosts can be detected with 97.2% accuracy based on this feature alone.
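
The URL statistics mentioned above are straightforward to compute; the sketch below (our illustration) extracts the dot, dash, digit, and length features from a host name:

```python
# Spam URLs tend to have exceptional numbers of dots, dashes,
# digits, and overall length; the host name here is a made-up example.
def url_features(host: str) -> dict:
    return {
        "length": len(host),
        "dots": host.count("."),
        "dashes": host.count("-"),
        "digits": sum(c.isdigit() for c in host),
    }

print(url_features("free-credit-repair-0.loans4u.example.com"))
# {'length': 40, 'dots': 3, 'dashes': 3, 'digits': 2}
```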

3.2 Link-Based Spam Detection
All link-based spam detection algorithms can be subdivided into five groups. The first group exploits the topological relationship between web pages and a set of pages for which labels are known [21]. The second group focuses on the identification of suspicious nodes and links and their subsequent downweighting [22]. The third extracts link-based features for each node and uses various machine learning algorithms to perform spam detection [23]. The fourth group uses the idea of label refinement based on web graph topology, where preliminary labels predicted by a base algorithm are modified by propagation through the hyperlink graph or by a stacked classifier [24]. Finally, there is a group of algorithms based on graph regularization techniques for web spam detection [25].
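
As an illustration of the first group, the sketch below implements a simple TrustRank-style propagation, in which trust flows from a hand-labelled seed set along hyperlinks with a decay factor; the graph, seed set, and parameters are toy assumptions:

```python
# Minimal label-propagation sketch in the TrustRank style.
def propagate_trust(graph, seeds, beta=0.85, iters=20):
    """graph: {page: [outlinked pages]}; seeds: trusted pages."""
    trust = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in graph}
    seed_dist = dict(trust)
    for _ in range(iters):
        # Each iteration restarts a little mass at the seeds and
        # spreads the rest along outlinks, damped by beta.
        nxt = {p: (1 - beta) * seed_dist[p] for p in graph}
        for p, links in graph.items():
            for q in links:
                nxt[q] = nxt.get(q, 0.0) + beta * trust[p] / max(1, len(links))
        trust = nxt
    return trust  # low trust on well-linked pages hints at spam

web = {"good": ["a"], "a": ["b"], "b": [], "spam": ["spam2"], "spam2": ["spam"]}
print(propagate_trust(web, seeds={"good"}))
```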
Table 1 compares the link-based spam detection methods discussed above. The main criteria used for comparison are algorithms, working, type of information used, and complexity. Among all the algorithms, HITS and PageRank are the most important ones, because most of the link-based methods build upon them.

Table 1. Comparison of link-based spam detection methods

Label Propagation. Algorithms: PageRank, TrustRank. Working: exploits the topological relationship between the web pages. Mining technique used: WSM (web structure mining). Type of information used: topological relationship. Complexity: internal structure.

Link Pruning and Reweighting. Algorithms: HITS, PageRank. Working: identifies suspicious nodes and links and subsequently downweights them. Mining technique used: WSM, WCM (web content mining). Type of information used: downweighting of nodes and links. Complexity: relationship between the nodes.

Label Refinement. Algorithms: clustering algorithms. Working: uses the idea of label refinement based on the web graph topology. Mining technique used: WCM. Type of information used: link-based features of each node. Complexity: limited data sets are allowed.

Graph Regularization. Algorithms: PageRank. Working: works on the basis of graph regularization methods. Mining technique used: WSM. Type of information used: URL. Complexity: URL classification.

Feature-Based Methods. Algorithms: Truncated PageRank. Working: extracts link-based features for each node and uses various machine learning algorithms. Mining technique used: WSM. Type of information used: structural patterns. Complexity: not specified.
4. KEY PRINCIPLES
Having analyzed the related work devoted to the topic of web spam mining, one can identify a set of underlying principles that are frequently used in constructing detection algorithms.
Due to its machine-generated nature and its focus on search engine manipulation, spam shows abnormal properties: high levels of duplicate content and links, rapid changes of content, and language models built for spam pages that deviate significantly from the models built for the normal web.
Spam pages deviate from the power-law distributions observed for numerous web graph statistics, such as PageRank or the number of in-links.
Spammers mostly target popular queries and
queries with high advertising value.
Spammers build their link farms with the aim of boosting rankings as high as possible, and therefore link farms have specific topologies that can be theoretically analyzed for optimality.
According to experiments, the principle of approximate isolation of good pages holds: good pages mostly link to good pages, while bad pages link either to good pages or to a few selected spam target pages. It has also been observed that connected pages have some level of semantic similarity (the topical locality of the web), and therefore label smoothing using the web graph is a useful strategy.
Numerous algorithms use the idea of trust and distrust propagation, with various similarity measures, propagation strategies, and seed selection heuristics.
Because one spammer can control many pages under one website and use them all to boost the ranking of some target pages, it makes sense to analyze the host graph, or even to perform clustering and consider clusters as the logical unit of link support.
In addition to traditional page content and links, there are many other sources of information, such as user behaviour or HTTP requests, and more will likely be exploited in the near future. Clever feature engineering is especially important for web spam detection. Even though new and sophisticated features can push the state of the art further, proper selection and training of machine learning models is also of high importance.
5. CONCLUSION
The different spams which affect the internet have been classified, and the techniques used to fight against spam have been studied. The problems caused by different spam on websites and the solutions available to filter it have also been discussed. For filtering spam in social networks the advanced Bayesian filtering technique SOAP is suggested, and for filtering email spam a rule-based filter using data mining concepts is suggested. All types of internet spam have been analyzed and effective spam filtering methods listed. The numerous algorithms for web spam detection have also been discussed, along with their characteristics and underlying ideas. Finally, the key principles behind anti-spam algorithms have been summarized.

REFERENCES

[1] Eiron N, McCurley K S, and Tomlin J A. Ranking the web frontier. In Proceedings of the 13th International Conference on World Wide Web, WWW '04, New York, NY, 2004.

[2] Henzinger M R, Motwani R, and Silverstein C. Challenges in web search engines. SIGIR Forum, 36, 2002.

[3] Silverstein C, Marais H, Henzinger M, and Moricz M. Analysis of a very large web search engine query log. SIGIR Forum, 33, Sept. 1999.

[4] Joachims T, Granka L, Pan B, Hembrooke H, and Gay G. Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, Salvador, Brazil, 2005.

[5] Adali S, Liu T, and Magdon-Ismail M. Optimal link bombs are uncoordinated. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, AIRWeb '05, Chiba, Japan, 2005.

[6] Castillo C, Donato D, Gionis A, Murdock V, and Silvestri F. Know your neighbors: web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '07, Amsterdam, The Netherlands, 2007.

[7] http://www.alexa.com/topsites/global

[8] Harris Interactive Public Relations Research. A study of social network scams, 2008.

[9] Moyer S and Hamiel N. Satan is on my friends list: attacking social networks. http://www.blackhat.com/html/bh-usa-08/bh-usa-08-archive.html, 2008.

[10] Brown G, Howe T, Ihbe M, Prakash A, and Borders K. Social networks and context-aware spam. In ACM Conference on Computer Supported Cooperative Work, 2008.

[11] Li Z and Shen H. SOAP: a Social network Aided Personalized and effective spam filter to clean your e-mail box. In Proceedings of IEEE INFOCOM, April 2011.

[12] Cheng Y and Li C. Personalized spam filtering with semi-supervised classifier ensemble. In IEEE/WIC/ACM International Conference on Web Intelligence, 2006.

[13] Chang M-W, Yih W-T, and Meek C. Partitioned logistic regression for spam filtering. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 97-105, 2008.

[14] Takesue M. Cascaded simple filters for accurate and lightweight email-spam detection. In Fourth International Conference on Emerging Security Information, Systems and Technologies (SECURWARE), July 2010.

[15] Gao Y, Choudhary A, and Hua G. A comprehensive approach to image spam detection: from server to client solution. IEEE Transactions on Information Forensics and Security, Dec. 2010.

[16] Gao Y, et al. Image spam hunter. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2008.

[17] Chellapilla K and Chickering D M. Improving cloaking detection using search query popularity and monetizability, 2006.

[18] Daswani N and Stoppelman M. The anatomy of Clickbot.A. In Proceedings of the First Workshop on Hot Topics in Understanding Botnets, Berkeley, CA, 2007. USENIX Association.

[19] Cheng Z, Gao B, Sun B, Jiang Y, and Liu T-Y. Let web spammers expose themselves. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM '11, Hong Kong, China, 2011.

[20] Dalvi N, Domingos P, Mausam, Sanghai S, and Verma D. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '04, Seattle, WA, USA, 2004.

[21] Benczúr A A, Csalogány K, Sarlós T, and Uher M. SpamRank: fully automatic link spam detection (work in progress). In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, AIRWeb '05, May 2005.

[22] Da Costa Carvalho A L, Chirita P A, de Moura E S, Calado P, and Nejdl W. Site level noise removal for search engines. In Proceedings of the 15th International Conference on World Wide Web, WWW '06, Edinburgh, Scotland, 2006.

[23] Amitay E, Carmel D, Darlow A, Lempel R, and Soffer A. The connectivity sonar: detecting site functionality by structural patterns. In Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia, Nottingham, UK, 2003.

[24] Gan Q and Suel T. Improving web spam classifiers using link structure. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, AIRWeb '07, Banff, Alberta, 2007.

[25] Abernethy J, Chapelle O, and Castillo C. Graph regularization methods for web spam detection. Machine Learning, Vol. 81, Nov. 2010.

[26] Gyongyi Z and Garcia-Molina H. Web spam taxonomy. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, AIRWeb '05, Chiba, Japan, May 2005.
