
Spam Detection

Jingrui He 10/08/2007

Spam Types

Email Spam

Unsolicited commercial email

Blog Spam

Unwanted comments in blogs


Splogs

Fake blogs created to boost PageRank

From a Learning Point of View

Spam Detection

Classification problem (ham vs. spam)

Feature Extraction

A Learning Approach to Spam Detection based on Social Networks. H.Y. Lam and D.Y. Yeung.

Fast Classifier

Relaxed Online SVMs for Spam Filtering. D. Sculley and G.M. Wachman.

A Learning Approach to Spam Detection based on Social Networks


H.Y. Lam and D.Y. Yeung, CEAS 2007

Problem Statement

$n$ email accounts; sender set $S$; receiver set $R$
Labeled sender set $L \subseteq S$ such that each sender $v_i \in L$ has a known label (spam or legitimate)

Goal

Assign a spam score to each remaining account in $S \setminus L$

System Flow Chart

Social Network from Logs

Directed Graph $G = (V, E)$: each node is an email account

Directed Edge $e_{ij}$: email sent from account $v_i$ to account $v_j$

Edge Weight $w_{ij}$: the number of emails sent from $v_i$ to $v_j$
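As a minimal sketch (not from the paper), such a weighted graph can be represented with nested dictionaries; the (sender, recipient) log format and all names below are assumptions for illustration.

```python
from collections import defaultdict

def build_email_graph(log):
    """Weighted directed graph from an email log.

    `log` is assumed to be an iterable of (sender, recipient) pairs, one
    pair per delivered email.  Returns {sender: {recipient: weight}},
    where the weight w_ij is the number of emails sent from i to j.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for sender, recipient in log:
        graph[sender][recipient] += 1
    return graph

# Toy example: alice mails bob twice, bob replies once.
g = build_email_graph([("alice", "bob"), ("alice", "bob"), ("bob", "alice")])
print(g["alice"]["bob"])  # 2, i.e. the edge weight from alice to bob
```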

System Flow Chart

Features from Email Social Networks

In-count / Out-count

The sum of in-coming / out-going edge weights

In-degree / Out-degree

The number of email accounts that a node receives emails from / sends emails to
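Continuing the illustrative sketch above (again not the paper's code), the four count/degree features can be read directly off that dictionary:

```python
def count_degree_features(graph, node):
    """In-count, out-count, in-degree, out-degree for one account.

    `graph` is the {sender: {recipient: weight}} dict from the previous sketch.
    """
    out_edges = graph.get(node, {})
    out_count = sum(out_edges.values())                        # total emails sent
    out_degree = len(out_edges)                                # distinct recipients
    in_count = sum(w.get(node, 0) for w in graph.values())     # total emails received
    in_degree = sum(1 for w in graph.values() if node in w)    # distinct senders
    return in_count, out_count, in_degree, out_degree
```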

Features from Email Social Networks

Communication Reciprocity (CR)

The percentage of interactive neighbors that a node has:

$CR(v_i) = \dfrac{|O(v_i) \cap I(v_i)|}{|O(v_i)|}$

$I(v_i)$: the set of accounts that sent emails to $v_i$
$O(v_i)$: the set of accounts that received emails from $v_i$
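A small sketch of CR on the same dictionary representation (an illustration, not the authors' implementation):

```python
def communication_reciprocity(graph, node):
    """Fraction of a node's recipients that also sent email back to it."""
    out_set = set(graph.get(node, {}))                      # O(v): accounts node sent to
    in_set = {s for s, w in graph.items() if node in w}     # I(v): accounts that sent to node
    if not out_set:
        return 0.0
    return len(out_set & in_set) / len(out_set)
```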

Features from Email Social Networks

Communication Interaction Average (CIA)

The level of interaction between a sender and each of the corresponding recipients

Features from Email Social Networks

Clustering Coefficient (CC)

Friends-of-friends relationship between email accounts


$CC(v_i) = \dfrac{2\,E_i}{k_i(k_i - 1)}$

$E_i$: the number of connections between the neighbors of $v_i$
$k_i$: the number of neighbors of $v_i$
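A sketch of CC on the undirected view of the email graph (illustrative; the dictionary representation is the one assumed above):

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Clustering coefficient of one account, ignoring edge direction."""
    # Neighbors in either direction.
    neighbors = set(graph.get(node, {})) | {s for s, w in graph.items() if node in w}
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Connections among the neighbors themselves (either direction counts once).
    links = sum(
        1 for a, b in combinations(neighbors, 2)
        if b in graph.get(a, {}) or a in graph.get(b, {})
    )
    return 2.0 * links / (k * (k - 1))
```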

System Flow Chart

Preprocessing

Sender Feature Vector

$\mathbf{x}_i = (\text{in-count}, \text{out-count}, \text{in-degree}, \text{out-degree}, CR, CIA, CC)$

Problematic? The raw features are not equally informative.

Weighted Features: scale each feature by a weight before measuring similarity.

System Flow Chart

Assigning Spam Score

Similarity Weighted k-NN method


Gaussian similarity:

$w_{ij} = \exp\!\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$

Similarity-weighted mean of the k-NN scores:

$\hat{y}_i = \dfrac{\sum_{j:\, \mathbf{x}_j \in N_k(\mathbf{x}_i)} w_{ij}\, y_j}{\sum_{j:\, \mathbf{x}_j \in N_k(\mathbf{x}_i)} w_{ij}}$

$N_k(\mathbf{x}_i)$: the set of $k$ nearest neighbors of $\mathbf{x}_i$

Score scaling maps $\hat{y}_i$ to the final spam score
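A compact NumPy sketch of this scoring step; k, sigma, and the assumption that labels are numeric scores are illustrative choices, not the paper's settings:

```python
import numpy as np

def knn_spam_scores(X_unlabeled, X_labeled, y_labeled, k=5, sigma=1.0):
    """Similarity-weighted k-NN spam scores for unlabeled senders.

    X_* are (weighted) feature matrices; y_labeled holds the numeric
    labels/scores of the labeled senders.
    """
    scores = []
    for x in X_unlabeled:
        d2 = np.sum((X_labeled - x) ** 2, axis=1)   # squared distances to labeled senders
        nn = np.argsort(d2)[:k]                     # indices of the k nearest neighbors
        w = np.exp(-d2[nn] / (2 * sigma ** 2))      # Gaussian similarities
        scores.append(np.dot(w, y_labeled[nn]) / np.sum(w))
    return np.array(scores)
```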

Experiments

Enron Dataset: 9150 Senders

Legitimate Enron senders: email transactions within the Enron email domain
5000 generated spam accounts
120 senders sampled from each class

Results Averaged over 100 Runs

Number of Nearest Neighbors

Feature Weights (CC)

Feature Weights (CIA)

Feature Weights (CR)

Feature Weights

In/Out-Count & In/Out-Degree

Smaller weights on these features give better performance

Final Weights

In/Out-count & In/Out-degree: 1
CR: 1
CIA: 10
CC: 15

Conclusion

Legitimacy Score

No content needed

Can Be Combined with Content-Based Filters

More Sophisticated Classifiers

SVM, boosting, etc.

Classifiers Using Combined Features

Relaxed Online SVMs for Spam Filtering


D. Sculley and G.M. Wachman, SIGIR 2007

Anti-Spam Controversy

Support Vector Machines (SVMs)

Academic Researchers

Statistically robust
State-of-the-art performance

Practitioners

Quadratic in the number of training examples: impractical!

Solution: Relaxed Online SVMs

Background: SVMs

Data set: $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$; class label $y_i$: $+1$ for spam, $-1$ for ham
Classifier: $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$; to find: $\mathbf{w}$ and $b$
Tradeoff parameter $C$ and slack variables $\xi_i$

Minimize (maximize the margin + minimize the loss):

$\dfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$

Constraints:

$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$
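For concreteness, a minimal scikit-learn sketch of fitting this soft-margin objective; the toy corpus, bag-of-words features, and the value of C are illustrative assumptions, not the paper's setup:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy corpus: +1 = spam, -1 = ham.
texts = ["cheap meds now", "win money fast", "meeting at noon", "lunch tomorrow?"]
labels = [1, 1, -1, -1]

X = CountVectorizer(binary=True).fit_transform(texts)  # binary bag-of-words features
clf = LinearSVC(C=100.0)                                # large C: penalize slack heavily
clf.fit(X, labels)
print(clf.predict(X))
```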

Online SVMs

Tuning the Tradeoff Parameter C

SpamAssassin data set: 6034 examples

Large C preferred

Email Spam and SVMs

TREC05P-1: 92189 messages
TREC06P: 37822 messages

Blog Comment Spam and SVMs

Leave-One-Out Cross-Validation
50 Blog Posts; 1024 Comments

Splogs and SVMs

Leave-One-Out Cross-Validation
1380 Examples

Computational Cost

Online SVMs: Quadratic Training Time

Relaxed Online SVMs (ROSVM)

Objective Function of SVMs:

$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \ \dfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i$

Large C Preferred

Minimizing training error is more important than maximizing the margin
Full margin maximization is not necessary
Relax this requirement

ROSVM

Three Ways to Relax SVMs (1)

Only Optimize Over the Recent p Examples

Dual form of SVMs:

$\max_{\boldsymbol{\alpha}} \ \sum_{i=1}^{n} \alpha_i - \dfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$

Constraints: $0 \le \alpha_i \le C$

Only the $\alpha_i$ of the $p$ most recent examples are optimized; each older $\alpha_j$ keeps the last value found for it when it was still among the recent $p$ examples.

Three Ways to Relax SVMs (2)

Only Update on Actual Errors


Original online SVMs: update when $y_i f(\mathbf{x}_i) < 1$

Relaxed: update only when $y_i f(\mathbf{x}_i) < m$

$m = 0$: mistake-driven online SVMs
No significant degradation in performance
Significantly reduces cost

ROSVM

Three Ways to Relax SVMs (3)

Reduce the Number of Iterations in Iterative SVMs


SMO: repeated passes over the training set to minimize the objective function
Parameter T: the maximum number of iterations
T = 1: little impact on performance
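A rough sketch of how the three relaxations could be combined in one online loop. This is not the authors' implementation: the window size p, margin threshold m, iteration cap T, and the use of scikit-learn's SVC are all illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def rosvm_like_loop(stream, p=100, m=0.0, T=1, C=100.0):
    """Toy ROSVM-style loop over a stream of (x, y) pairs, y in {-1, +1}.

    Relaxation 1: keep only the p most recent examples in the working set.
    Relaxation 2: retrain only when the current model scores y * f(x) < m.
    Relaxation 3: cap the solver's iterations (a crude stand-in for T passes).
    """
    X_win, y_win, clf, predictions = [], [], None, []
    for x, y in stream:
        score = clf.decision_function([x])[0] if clf is not None else 0.0
        predictions.append(np.sign(score))                 # 0.0 until the first fit
        X_win.append(x)
        y_win.append(y)
        X_win, y_win = X_win[-p:], y_win[-p:]              # relaxation 1: sliding window
        if (clf is None or y * score < m) and len(set(y_win)) > 1:  # relaxation 2
            clf = SVC(kernel="linear", C=C,
                      max_iter=max(T * len(X_win), 1))     # relaxation 3
            clf.fit(np.array(X_win), np.array(y_win))
    return predictions

# Toy usage with random 2-D data.
rng = np.random.default_rng(0)
stream = [(rng.normal(loc=y, scale=1.0, size=2), y) for y in rng.choice([-1, 1], size=50)]
print(rosvm_like_loop(stream)[:10])
```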

Testing Reduced Size

Testing Reduced Iterations

Testing Reduced Updates

Online SVMs and ROSVM

ROSVM:

Email Spam

Blog Comment Spam

Splog Data Set
