Jingrui He 10/08/2007
Spam Types
Email Spam
Blog Spam
Splogs
Spam Detection
Feature Extraction
A Learning Approach to Spam Detection based on Social Networks. H.Y. Lam and D.Y. Yeung
Relaxed Online SVMs for Spam Filtering. D. Sculley and G.M. Wachman
Fast Classifier
Problem Statement
n Email Accounts Sender Set: ; Receiver Set Labeled Sender Set: s.t.
Goal
in
to
Edge Weight
=
is the number of emails to
sent from
In-count / Out-count
In-degree / Out-degree
The number of email accounts that a node receives emails from / sends emails to
The level of interaction between a sender and each of the corresponding recipients
Number of neighbors of
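The degree and count features above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `degree_features` and the `(sender, receiver) -> email count` dictionary are hypothetical, and two assumptions are made that the slides do not confirm, namely that the raw email count is used as the edge weight and that in-count / out-count are sums of incoming / outgoing edge weights.

```python
from collections import defaultdict


def degree_features(edge_weights):
    """Compute per-account degree and count features.

    edge_weights: dict mapping (sender, receiver) -> number of emails sent.
    Assumption: the raw email count is used directly as the edge weight.
    """
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    in_count = defaultdict(int)
    out_count = defaultdict(int)

    for (sender, receiver), weight in edge_weights.items():
        if weight <= 0:
            continue  # no emails exchanged, so no edge
        out_degree[sender] += 1       # one more account this sender writes to
        in_degree[receiver] += 1      # one more account this receiver hears from
        out_count[sender] += weight   # assumed: sum of outgoing edge weights
        in_count[receiver] += weight  # assumed: sum of incoming edge weights

    nodes = {node for edge in edge_weights for node in edge}
    return {
        node: {
            "in_degree": in_degree[node],
            "out_degree": out_degree[node],
            "in_count": in_count[node],
            "out_count": out_count[node],
        }
        for node in nodes
    }


# Toy network: a sends 3 emails to b and 1 to c; b sends 2 emails to a.
edges = {("a", "b"): 3, ("a", "c"): 1, ("b", "a"): 2}
features = degree_features(edges)
```

In this toy network, account `a` writes to two accounts but receives from one, which is the kind of asymmetry the degree features are meant to expose.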
Preprocessing
Problematic?
Weighted Features
Score scaling
$y_i = \frac{\sum_{j:\,x_j\,\ldots} w_{ij}\, y_j}{\sum_{j:\,x_j\,\ldots} w_{ij}}$
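Reading the formula on this slide as a weighted average of the scores y_j of other senders, with weights w_ij, the computation can be sketched as follows. The function name `legitimacy_score` and the example values are hypothetical. The set of senders j that the sum runs over is not recoverable from the slide, so the sketch simply averages over whatever lists are passed in, and the label convention (1 for legitimate, 0 for spam) is an assumption.

```python
def legitimacy_score(weights, scores):
    """Weighted average of sender scores: sum(w_ij * y_j) / sum(w_ij).

    weights: the w_ij values for one unlabeled sender i.
    scores:  the matching y_j values (assumed: 1 = legitimate, 0 = spam).
    """
    if len(weights) != len(scores):
        raise ValueError("weights and scores must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * y for w, y in zip(weights, scores)) / total


# Two legitimate senders and one spam sender, with hypothetical weights.
score = legitimacy_score([0.5, 0.25, 0.25], [1, 1, 0])
```

Because the weights are normalized by their sum, the result stays within the range of the input scores regardless of how the weights are scaled.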
Experiments
Legitimate Enron senders: email transactions within the Enron email domain
5000 generated spam accounts
120 senders from each class
Feature Weights
Final Weights
Conclusion
Legitimacy Score
No content needed
Anti-Spam Controversy
Statistically robust
State-of-the-art performance
Quadratic in the number of training examples
Impractical!
Practitioners
Background: SVMs
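Data Set: $X = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
Class Label $y_i$: 1 for spam; -1 for ham
Classifier: $f(x) = w \cdot x + b$
Tradeoff parameter: $C$
To Find: $w$ and $b$
Slack variable: $\xi_i$
Minimize: $\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$
Constraints: $y_i f(x_i) \ge 1 - \xi_i$, $\xi_i \ge 0$
First term: maximizing the margin
Second term: minimizing the loss function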
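To make the trade-off between the margin term and the loss term concrete, here is a small sketch that evaluates the standard soft-margin SVM objective for a given linear classifier. It assumes the usual formulation (half the squared norm of the weight vector plus C times the total slack, with each slack equal to the hinge loss); the function name `svm_objective` and the toy data are hypothetical.

```python
def svm_objective(w, b, C, examples):
    """Evaluate 0.5 * ||w||^2 + C * sum of slacks for a linear classifier.

    examples: list of (x, y) pairs with y in {+1, -1}.
    Each slack is the smallest value satisfying y * f(x) >= 1 - slack,
    i.e. the hinge loss max(0, 1 - y * f(x)).
    """
    margin_term = 0.5 * sum(wi * wi for wi in w)
    total_slack = 0.0
    for x, y in examples:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b  # f(x) = w . x + b
        total_slack += max(0.0, 1.0 - y * score)
    return margin_term + C * total_slack


# Toy data: the second example lies inside the margin and incurs slack 0.5.
data = [([2.0, 0.0], 1), ([0.5, 0.0], 1), ([-1.0, 0.0], -1)]
objective = svm_objective(w=[1.0, 0.0], b=0.0, C=2.0, examples=data)
```

Raising C makes the slack term dominate, which matches the observation that minimizing training error matters more than maximizing the margin when C is large.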
Online SVMs
Large C preferred
Computational Cost
Minimizing training error more important than maximizing the margin
Full margin maximization not necessary
Relax this requirement
ROSVM
Constraints
when
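Online SVM: update when $y_i f(x_i) < 1$, where $f$ is the classifier
ROSVM: update when $y_i f(x_i) < m$, with $0 \le m \le 1$
m=0: mistake driven online SVMs
No significant degradation in performance
Significantly reduced cost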
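The effect of relaxing the update condition can be illustrated with a short sketch. It assumes the update rule is a threshold on the functional margin y * f(x): an online SVM re-optimizes whenever the margin is below 1, while the relaxed variant uses a smaller threshold m, with m = 0 giving a mistake-driven learner. The names `needs_update` and `count_updates` and the example scores are hypothetical, and the sketch only counts how often an update would be triggered; it does not perform the optimization itself.

```python
def needs_update(label, score, m=1.0):
    """Return True when the example's functional margin falls below m.

    label: +1 for spam, -1 for ham.
    score: classifier output f(x) for the example.
    m:     margin threshold (assumed: 1.0 = online SVM, 0.0 = mistake driven).
    """
    return label * score < m


def count_updates(stream, m):
    """Count how many examples in a stream would trigger an update."""
    return sum(1 for label, score in stream if needs_update(label, score, m))


# Hypothetical (label, score) pairs; only the last one is misclassified.
stream = [(1, 2.0), (1, 0.5), (-1, -0.2), (-1, 0.3)]
updates_full = count_updates(stream, m=1.0)
updates_relaxed = count_updates(stream, m=0.0)
```

On this stream the full threshold triggers three updates and the mistake-driven threshold only one, which is where the cost reduction comes from.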
ROSVM
SMO: repeated passes over the training set to minimize the objective function
Parameter T: the maximum number of iterations
T=1: little impact on performance
ROSVM:
Email Spam