
Spam Detection

Jingrui He 10/08/2007

Spam Types

Email Spam

Unsolicited commercial email

Blog Spam

Unwanted comments in blogs


Splogs

Fake blogs created to boost PageRank

From a Learning Point of View

Spam Detection

Classification problem (ham vs. spam)

Feature Extraction

A Learning Approach to Spam Detection based on Social Networks. H.Y. Lam and D.Y. Yeung.

Fast Classifier

Relaxed Online SVMs for Spam Filtering. D. Sculley and G.M. Wachman.

A Learning Approach to Spam Detection based on Social Networks


H.Y. Lam and D.Y. Yeung, CEAS 2007

Problem Statement

$n$ email accounts; sender set $S$; receiver set $R$
Labeled sender set $L \subseteq S$ such that each sender $v_i \in L$ has a known label (spam or legitimate)

Goal

Assign a spam score to each remaining account in $S \setminus L$

System Flow Chart

Social Network from Logs

Directed Graph $G = (V, E)$: each node is an email account

Directed Edge $e_{ij}$: email sent from account $v_i$ to account $v_j$

Edge Weight $w_{ij}$: the number of emails sent from $v_i$ to $v_j$
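As a minimal sketch (not from the paper), such a weighted graph can be represented with nested dictionaries; the (sender, recipient) log format and all names below are assumptions for illustration.

```python
from collections import defaultdict

def build_email_graph(log):
    """Weighted directed graph from an email log.

    `log` is assumed to be an iterable of (sender, recipient) pairs, one
    pair per delivered email.  Returns {sender: {recipient: weight}},
    where the weight w_ij is the number of emails sent from i to j.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for sender, recipient in log:
        graph[sender][recipient] += 1
    return graph

# Toy example: alice mails bob twice, bob replies once.
g = build_email_graph([("alice", "bob"), ("alice", "bob"), ("bob", "alice")])
print(g["alice"]["bob"])  # 2, i.e. the edge weight from alice to bob
```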

System Flow Chart

Features from Email Social Networks

In-count / Out-count

The sum of in-coming / out-going edge weights

In-degree / Out-degree

The number of email accounts that a node receives emails from / sends emails to
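Continuing the illustrative sketch above (again not the paper's code), the four count/degree features can be read directly off that dictionary:

```python
def count_degree_features(graph, node):
    """In-count, out-count, in-degree, out-degree for one account.

    `graph` is the {sender: {recipient: weight}} dict from the previous sketch.
    """
    out_edges = graph.get(node, {})
    out_count = sum(out_edges.values())                        # total emails sent
    out_degree = len(out_edges)                                # distinct recipients
    in_count = sum(w.get(node, 0) for w in graph.values())     # total emails received
    in_degree = sum(1 for w in graph.values() if node in w)    # distinct senders
    return in_count, out_count, in_degree, out_degree
```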

Features from Email Social Networks

Communication Reciprocity (CR)

The percentage of interactive neighbors that a node has:

$CR(v_i) = \dfrac{|O(v_i) \cap I(v_i)|}{|O(v_i)|}$

$I(v_i)$: the set of accounts that sent emails to $v_i$
$O(v_i)$: the set of accounts that received emails from $v_i$
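A small sketch of CR on the same dictionary representation (an illustration, not the authors' implementation):

```python
def communication_reciprocity(graph, node):
    """Fraction of a node's recipients that also sent email back to it."""
    out_set = set(graph.get(node, {}))                      # O(v): accounts node sent to
    in_set = {s for s, w in graph.items() if node in w}     # I(v): accounts that sent to node
    if not out_set:
        return 0.0
    return len(out_set & in_set) / len(out_set)
```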

Features from Email Social Networks

Communication Interaction Average (CIA)

The level of interaction between a sender and each of the corresponding recipients

Features from Email Social Networks

Clustering Coefficient (CC)

Friends-of-friends relationship between email accounts


$CC(v_i) = \dfrac{2\,E_i}{k_i(k_i - 1)}$

$E_i$: the number of connections between the neighbors of $v_i$
$k_i$: the number of neighbors of $v_i$
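A sketch of CC on the undirected view of the email graph (illustrative; the dictionary representation is the one assumed above):

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Clustering coefficient of one account, ignoring edge direction."""
    # Neighbors in either direction.
    neighbors = set(graph.get(node, {})) | {s for s, w in graph.items() if node in w}
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Connections among the neighbors themselves (either direction counts once).
    links = sum(
        1 for a, b in combinations(neighbors, 2)
        if b in graph.get(a, {}) or a in graph.get(b, {})
    )
    return 2.0 * links / (k * (k - 1))
```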

System Flow Chart

Preprocessing

Sender Feature Vector

$\mathbf{x}_i = (\text{in-count}, \text{out-count}, \text{in-degree}, \text{out-degree}, CR, CIA, CC)$

Problematic? The raw features are not equally informative.

Weighted Features: scale each feature by a weight before measuring similarity.

System Flow Chart

Assigning Spam Score

Similarity Weighted k-NN method


Gaussian similarity:

$w_{ij} = \exp\!\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$

Similarity-weighted mean of the k-NN scores:

$\hat{y}_i = \dfrac{\sum_{j:\, \mathbf{x}_j \in N_k(\mathbf{x}_i)} w_{ij}\, y_j}{\sum_{j:\, \mathbf{x}_j \in N_k(\mathbf{x}_i)} w_{ij}}$

$N_k(\mathbf{x}_i)$: the set of $k$ nearest neighbors of $\mathbf{x}_i$

Score scaling maps $\hat{y}_i$ to the final spam score
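A compact NumPy sketch of this scoring step; k, sigma, and the assumption that labels are numeric scores are illustrative choices, not the paper's settings:

```python
import numpy as np

def knn_spam_scores(X_unlabeled, X_labeled, y_labeled, k=5, sigma=1.0):
    """Similarity-weighted k-NN spam scores for unlabeled senders.

    X_* are (weighted) feature matrices; y_labeled holds the numeric
    labels/scores of the labeled senders.
    """
    scores = []
    for x in X_unlabeled:
        d2 = np.sum((X_labeled - x) ** 2, axis=1)   # squared distances to labeled senders
        nn = np.argsort(d2)[:k]                     # indices of the k nearest neighbors
        w = np.exp(-d2[nn] / (2 * sigma ** 2))      # Gaussian similarities
        scores.append(np.dot(w, y_labeled[nn]) / np.sum(w))
    return np.array(scores)
```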

Experiments

Enron Dataset: 9150 Senders

Legitimate Enron senders: email transactions within the Enron email domain
5000 generated spam accounts
120 senders sampled from each class

Results Averaged over 100 Runs

Number of Nearest Neighbors

Feature Weights (CC)

Feature Weights (CIA)

Feature Weights (CR)

Feature Weights

In/Out-Count & In/Out-Degree

Smaller weights on these features give better performance

Final Weights

In/Out-count & In/Out-degree: 1
CR: 1
CIA: 10
CC: 15

Conclusion

Legitimacy Score

No content needed

Can Be Combined with Content-Based Filters

More Sophisticated Classifiers

SVM, boosting, etc.

Classifiers Using Combined Features

Relaxed Online SVMs for Spam Filtering


D. Sculley and G.M. Wachman, SIGIR 2007

Anti-Spam Controversy

Support Vector Machines (SVMs)

Academic Researchers

Statistically robust
State-of-the-art performance

Practitioners

Quadratic in the number of training examples: impractical!

Solution: Relaxed Online SVMs

Background: SVMs

Data set: $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$; class label $y_i$: $+1$ for spam, $-1$ for ham
Classifier: $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$; to find: $\mathbf{w}$ and $b$
Tradeoff parameter $C$ and slack variables $\xi_i$

Minimize (maximize the margin + minimize the loss):

$\dfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$

Constraints:

$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$
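For concreteness, a minimal scikit-learn sketch of fitting this soft-margin objective; the toy corpus, bag-of-words features, and the value of C are illustrative assumptions, not the paper's setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy corpus: +1 = spam, -1 = ham.
texts = ["cheap meds now", "win money fast", "meeting at noon", "lunch tomorrow?"]
labels = [1, 1, -1, -1]

X = CountVectorizer(binary=True).fit_transform(texts)  # binary bag-of-words features
clf = LinearSVC(C=100.0)                                # large C: penalize slack heavily
clf.fit(X, labels)
print(clf.predict(X))
```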

Online SVMs

Tuning the Tradeoff Parameter C

SpamAssassin data set: 6034 examples

Large C preferred

Email Spam and SVMs

TREC05P-1: 92189 messages
TREC06P: 37822 messages

Blog Comment Spam and SVMs

Leave-One-Out Cross-Validation
50 Blog Posts; 1024 Comments

Splogs and SVMs

Leave-One-Out Cross-Validation
1380 Examples

Computational Cost

Online SVMs: Quadratic Training Time

Relaxed Online SVMs (ROSVM)

Objective Function of SVMs:

$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \dfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$

Large C Preferred

Minimizing training error is more important than maximizing the margin
Full margin maximization is not necessary
Relax this requirement

ROSVM

Three Ways to Relax SVMs (1)

Only Optimize Over the Recent p Examples

Dual form of SVMs:

$\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha_i - \dfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$

Constraints: $0 \le \alpha_i \le C$

Only the $\alpha_i$ of the $p$ most recent examples are optimized; each older $\alpha_j$ keeps the last value found for it when it was still among the recent $p$ examples.

Three Ways to Relax SVMs (2)

Only Update on Actual Errors


Original online SVMs: update when $y_i f(\mathbf{x}_i) < 1$

Relaxed: update only when $y_i f(\mathbf{x}_i) < m$

$m = 0$: mistake-driven online SVMs
No significant degradation in performance
Significantly reduces cost

ROSVM

Three Ways to Relax SVMs (3)

Reduce the Number of Iterations in Iterative SVMs


SMO: repeated passes over the training set to minimize the objective function
Parameter T: the maximum number of iterations
T = 1: little impact on performance
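A rough sketch of how the three relaxations could be combined in one online loop. This is not the authors' implementation: the window size p, margin threshold m, iteration cap T, and the use of scikit-learn's SVC are all illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def rosvm_like_loop(stream, p=100, m=0.0, T=1, C=100.0):
    """Toy ROSVM-style loop over a stream of (x, y) pairs, y in {-1, +1}.

    Relaxation 1: keep only the p most recent examples in the working set.
    Relaxation 2: retrain only when the current model scores y * f(x) < m.
    Relaxation 3: cap the solver's iterations (a crude stand-in for T passes).
    """
    X_win, y_win, clf, predictions = [], [], None, []
    for x, y in stream:
        score = clf.decision_function([x])[0] if clf is not None else 0.0
        predictions.append(np.sign(score))                 # 0.0 until the first fit
        X_win.append(x)
        y_win.append(y)
        X_win, y_win = X_win[-p:], y_win[-p:]              # relaxation 1: sliding window
        if (clf is None or y * score < m) and len(set(y_win)) > 1:  # relaxation 2
            clf = SVC(kernel="linear", C=C,
                      max_iter=max(T * len(X_win), 1))     # relaxation 3
            clf.fit(np.array(X_win), np.array(y_win))
    return predictions

# Toy usage with random 2-D data.
rng = np.random.default_rng(0)
stream = [(rng.normal(loc=y, scale=1.0, size=2), y) for y in rng.choice([-1, 1], size=50)]
print(rosvm_like_loop(stream)[:10])
```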

Testing Reduced Size

Testing Reduced Iterations

Testing Reduced Updates

Online SVMs and ROSVM

ROSVM:

Email Spam

Blog Comment Spam

Splog Data Set
