
IADIS European Conference Data Mining 2007

DATA MINING TECHNIQUES FOR SUSPICIOUS EMAIL DETECTION: A COMPARATIVE STUDY


S. Appavu alias Balamurugan 1
Faculty, Dept. of Information Technology, Thiagarajar College of Engineering, Madurai-15

Dr. R. Rajaram 2
Prof. & Head, Dept. of Computer Science, Thiagarajar College of Engineering, Madurai-15

G. Athiappan 3
Thiagarajar College of Engineering, Madurai-15

M. Muthupandian 4
Thiagarajar College of Engineering, Madurai-15

ABSTRACT
Email has become one of the fastest and most economical forms of communication. This paper applies classification data mining to the task of suspicious email detection, based on deception theory. Email data was classified using four different classifiers (Neural Network, SVM, Naive Bayesian and Decision Tree). The experiment was performed using WEKA, with different features used to classify the email corpus into suspicious and normal emails. Experimental results show that a simple ID3 classifier, which builds a binary tree, gives promising detection rates.

KEYWORDS
Data Mining, Decision Tree, Neural Network, Naive Bayes, SVM, WEKA.

1. INTRODUCTION
E-mail has become one of today's standard means of communication. Email data is also growing rapidly, creating a need for automated analysis. To detect crime, a range of techniques should be applied to discover and identify patterns and to make predictions. Data mining has emerged to address the problem of understanding ever-growing volumes of structured data by finding patterns within the data that can be used to develop useful knowledge. As individuals increase their use of electronic communication, research into detecting deception in these new forms of communication has grown. Models of deception assume that deception leaves a footprint. Work by various researchers suggests that deceptive writing is characterized by a reduced frequency of first-person pronouns and exclusive words and an elevated frequency of negative emotion words and action verbs [1]. We apply this model of deception, together with a novel set of rich features, to an email dataset. We preprocess the email body and train the system using different data mining classification algorithms [5] that categorize each email as suspicious or normal.


ISBN: 978-972-8924-40-9 2007 IADIS

1.1 Motivation
Concern about national security has increased significantly since the terrorist attacks of 11 September 2001. The CIA, FBI and other federal agencies are actively collecting domestic and foreign intelligence to prevent future attacks. These efforts have in turn motivated us to collect data and undertake this work.

1.2 Organization of the Paper


The paper is organized as follows: Section 2 defines the problem statement and reviews related work in this area. Section 3 describes the proposed suspicious email detection methods. Experimental results and performance evaluation are presented in Section 4. Finally, Section 5 concludes the paper and points out some potential future work.

2. PROBLEM STATEMENT AND RELATED WORKS


It is hard to remember what our lives were like without email. It ranks alongside the web as one of the most useful features of the Internet, and billions of messages are sent each year. Though email was originally developed for sending simple text messages, it has become more robust in the last few years. It is therefore one possible source of data from which potential problems can be detected. The problem is thus to find a system that identifies deception in communication through emails.

Keila and Skillicorn [6] developed a method based on singular value decomposition to detect unusual and deceptive communication in email. The problem with this approach is that it does not deal with incomplete data in an efficient and elegant way, and it cannot incorporate new data incrementally without reprocessing the entire matrix. Yang et al. [9] reported a cross-experiment comparison of 14 classification methods, including decision tree, Naive Bayesian, neural networks, linear least-squares fit and Rocchio. kNN was one of the top performers, and it scaled well to very large and noisy classification problems. Kiritchenko et al. [7] reduced the classification error by discovering temporal relations in an email sequence in the form of temporal sequence patterns and embedding the discovered information into content-based learning methods. Cui et al. [4] proposed a model based on a neural network to classify personal emails, using principal component analysis as a preprocessor for the NN to reduce the data in terms of both dimensionality and size. In a classification experiment for spam filtering, the decision tree showed better results than the NB, NN or SVM classifiers [8].

3. SUSPICIOUS EMAIL DETECTION METHODS


Classification is the process of finding a set of models that describe and distinguish data classes and concepts, so that the model can be used to predict the class of objects whose class label is unknown. Figure 1 shows a general framework of the classification process. In this example, objects correspond to email messages and object class labels correspond to email categories. Every email message contains many features, such as keywords, sender details and attached files, that are used to predict the category of the message. Classification is a two-step process:
1. Build a classification model using training data. Every object in the data must be pre-classified, i.e. its class label must be known. In Figure 1, the features keyword, sender and file attached are used to build a model that explains how these attributes determine the category of each message.
2. Test the model generated in the preceding step by assigning class labels to data objects in a test dataset. The test data is different from the training data. Every element of the test data is also pre-classified in advance. The accuracy of the classification model is determined by comparing the true class labels in the test set with those assigned by the model.
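The two steps can be illustrated with a minimal Python sketch. The one-attribute lookup model, the feature names and the four training emails below are hypothetical and chosen only for illustration; they are not the classifiers or the corpus used in this study.

```python
from collections import Counter, defaultdict

def build_model(training, feature):
    """Step 1: learn, for each value of one feature, the majority class label."""
    votes = defaultdict(Counter)
    for features, label in training:
        votes[features[feature]][label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in votes.items()}

def evaluate(model, test, feature, default="Normal"):
    """Step 2: label pre-classified test objects and compare with the true labels."""
    correct = sum(1 for features, label in test
                  if model.get(features[feature], default) == label)
    return correct / len(test)

# Hypothetical pre-classified emails (assumption, not the paper's data).
training = [
    ({"keyword": "yes", "file_att": "yes"}, "Suspicious"),
    ({"keyword": "yes", "file_att": "no"}, "Suspicious"),
    ({"keyword": "no", "file_att": "no"}, "Normal"),
    ({"keyword": "no", "file_att": "yes"}, "Normal"),
]
test = [
    ({"keyword": "yes", "file_att": "no"}, "Suspicious"),
    ({"keyword": "no", "file_att": "yes"}, "Normal"),
]

model = build_model(training, "keyword")
accuracy = evaluate(model, test, "keyword")
```

The test set is kept separate from the training set, so the accuracy reflects how the model behaves on emails it has not seen.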


Figure 1. Classification Process

In this paper, we identify the best-performing classifier for the task of classifying emails as suspicious or normal through an experimental study using WEKA. The study consists of three steps: email preprocessing, building the classifier, and validation.

3.1 E-Mail Preprocessing


Prior to indexing and classification, a number of preprocessing steps were performed:
1. Emails were converted to plain text from .mbox files.
2. Headers and HTML components were removed.
3. The body of the message was extracted.
4. The message body was tokenized into words; stop words such as articles and prepositions were removed, word prefixes and suffixes were stripped, and words were converted to lower case.
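The preprocessing steps can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the stop-word list is a small hypothetical sample, the message is assumed to be single-part, HTML is stripped with a simple regular expression, and prefix/suffix stripping is omitted.

```python
import re
from email import message_from_string

# Illustrative stop-word list (assumption; not the list used in the study).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "is", "and"}

def preprocess(raw_email):
    """Drop headers, strip HTML tags, tokenize, lower-case, remove stop words."""
    message = message_from_string(raw_email)   # separates headers from the body
    body = message.get_payload()               # assumes a single-part message
    body = re.sub(r"<[^>]+>", " ", body)       # crude removal of HTML tags
    tokens = re.findall(r"[a-z]+", body.lower())
    return [t for t in tokens if t not in STOP_WORDS]

raw = (
    "From: a@example.com\n"
    "Subject: Hi\n"
    "\n"
    "<p>The attack is planned at the Station</p>\n"
)
tokens = preprocess(raw)
```

For the sample message above, the remaining tokens are the content words of the body, which would then be matched against the feature definitions.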

Database Used in experiment:


The dataset for the experiment is saved in a file with the extension .arff (Attribute-Relation File Format). The data are stored in the following format: the dataset name is given with the @relation tag, the attribute information with @attribute, and the @data label denotes the start of the instances. The possible values of each attribute must be listed within the @attribute label as {v1, v2, ..., vn}, where v1, v2, ... denote the possible attribute values. The final attribute represents the class, specifying whether the email is SUSPICIOUS or NORMAL.
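As a hedged illustration of this layout, the following Python sketch renders a small ARFF document. The relation name, the attribute names and the two instances are assumptions made for the example, not the authors' actual file.

```python
def to_arff(relation, attributes, rows):
    """Render nominal attributes and instances as ARFF text."""
    lines = [f"@relation {relation}", ""]
    for name, values in attributes:
        # Nominal attributes list their possible values in braces.
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in rows]
    return "\n".join(lines)

# Hypothetical attributes; the last one is the class attribute.
attributes = [
    ("keywords", ["yes", "no"]),
    ("sender", ["Friend", "Office", "Relative", "Others"]),
    ("class", ["Suspicious", "Normal"]),
]
rows = [
    ["yes", "Friend", "Suspicious"],
    ["no", "Office", "Normal"],
]
arff = to_arff("suspicious_email", attributes, rows)
print(arff)
```

The class attribute is placed last, matching the convention described above.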
Table 1. Feature selection


Table 2. A Sample Training Data Set

A    B         C    D        E    F    G        H    I    J    K    L    M    Class
yes  Friend    yes  forward  yes  yes  Text     yes  no   no   no   no   yes  Suspicious
no   Office    no   forward  no   yes  Pdf      yes  no   no   no   no   yes  Normal
yes  Relative  no   forward  no   yes  Picture  no   no   yes  yes  no   yes  Suspicious
no   Others    yes  created  yes  no   Picture  no   yes  no   yes  yes  no   Normal
no   Friend    yes  created  no   yes  Text     no   no   yes  no   no   yes  Suspicious
yes  Relative  no   created  no   no   Picture  yes  yes  yes  yes  yes  no   Suspicious
yes  Relative  yes  forward  yes  no   Text     no   yes  no   yes  yes  no   Suspicious
no   Friend    yes  forward  no   no   Picture  yes  no   yes  no   yes  yes  Normal

A -> Keywords (bomb, blast, attack, hijack, etc.), B -> Sender (relatives, office, friends, others), C -> Subject, D -> Forward/Created (forward, created), E -> file_att, F -> Virus_scan, G -> format, H -> video, I -> Audio, J -> *.exe, K -> periodic, L -> junk, M -> bulk, Class -> Suspicious/Normal.

A. NAIVE BAYES

The Bayes classification technique analyzes the relationship between each independent attribute and the dependent attribute to derive a conditional probability for each relationship. A prediction is made by combining the effects of the independent variables on the dependent variable to classify a new case. We used 50% of the available email dataset (5000 emails) for training the Naive Bayes classifier and the remaining 50% (5000 emails) for validation (to test the performance of the Bayesian classifier). In the validation set, 924 emails out of 5000 were incorrectly classified. The accuracy of suspicious email detection with the Bayesian classifier is around 76.9%.
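The way a Naive Bayes classifier combines per-attribute conditional probabilities can be sketched in Python as below. This is a minimal illustration, not the WEKA implementation used in the experiments: the data are attributes A (keywords) and E (file_att) from the sample rows of Table 2, and the add-one (Laplace) smoothing is an assumption added to avoid zero probabilities.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(label for _, label in rows)
    value_counts = defaultdict(Counter)   # (attribute index, label) -> value counts
    domains = defaultdict(set)            # attribute index -> observed values
    for features, label in rows:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains

def predict(model, features):
    """Pick the class maximizing log P(class) + sum of log P(value | class)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        log_p = math.log(count / total)
        for i, value in enumerate(features):
            # Add-one smoothing (assumption, not stated in the paper).
            numer = value_counts[(i, label)][value] + 1
            denom = count + len(domains[i])
            log_p += math.log(numer / denom)
        scores[label] = log_p
    return max(scores, key=scores.get)

# (keywords, file_att) -> class, taken from the eight sample rows of Table 2.
rows = [
    (("yes", "yes"), "Suspicious"), (("no", "no"), "Normal"),
    (("yes", "no"), "Suspicious"), (("no", "yes"), "Normal"),
    (("no", "no"), "Suspicious"), (("yes", "no"), "Suspicious"),
    (("yes", "yes"), "Suspicious"), (("no", "no"), "Normal"),
]
model = train_naive_bayes(rows)
with_keyword = predict(model, ("yes", "yes"))
without_keyword = predict(model, ("no", "no"))
```

On this tiny sample, an email containing a suspicious keyword and an attachment is labeled Suspicious, while one with neither is labeled Normal.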

B. NEURAL NETWORK

The classification procedure using the NN has three steps: data preprocessing, training, and testing. Feature selection is the process of selecting a set of features that is more informative for the task while removing irrelevant or redundant features. For the training data, the features selected in the data preprocessing step were fed into the NN, and an email classifier was generated through the NN. The learning rate and momentum were set to 0.3 and 0.2, respectively. Out of 5000 emails in the validation set, 63 emails were misclassified, giving an accuracy of 98.74%.

C. SVM

SVMs are a relatively new learning process, influenced highly by advances in statistical learning theory. This classifier separates two classes, which are learned from training examples. The overall aim is to generalize well to test data. This is achieved by introducing a separating hyperplane that maximizes the margin between the two classes; this is known as the optimum separating hyperplane. Out of 5000 test samples, it correctly classified 4495 instances. The accuracy is 89.9%.

D. DECISION TREE

D.1. J48

The J48 classifier is WEKA's implementation of the C4.5 decision tree algorithm. It creates a binary tree. The accuracy rate of detecting suspicious emails using J48 is 96.04%.

D.2. ID3

The ID3 algorithm is based on the construction of a tree to model the classification process. Once the complete tree is constructed, it is applied to each message to classify it. The accuracy rate for the ID3 classifier is 99.4%.
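ID3 conventionally builds its tree by choosing, at each node, the attribute with the highest information gain. The Python sketch below illustrates that selection step only; it is not the WEKA implementation used here, and it uses just attributes A (keywords) and E (file_att) from the sample rows of Table 2.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    labels = [label for _, label in rows]
    remainder = 0.0
    for value in {features[attribute] for features, _ in rows}:
        subset = [label for features, label in rows
                  if features[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

# Attributes A and E with class labels, from the eight sample rows of Table 2.
rows = [
    ({"keywords": "yes", "file_att": "yes"}, "Suspicious"),
    ({"keywords": "no", "file_att": "no"}, "Normal"),
    ({"keywords": "yes", "file_att": "no"}, "Suspicious"),
    ({"keywords": "no", "file_att": "yes"}, "Normal"),
    ({"keywords": "no", "file_att": "no"}, "Suspicious"),
    ({"keywords": "yes", "file_att": "no"}, "Suspicious"),
    ({"keywords": "yes", "file_att": "yes"}, "Suspicious"),
    ({"keywords": "no", "file_att": "no"}, "Normal"),
]
gains = {name: information_gain(rows, name) for name in ("keywords", "file_att")}
best_attribute = max(gains, key=gains.get)
```

On this sample, every email with a suspicious keyword is labeled Suspicious, so splitting on keywords removes far more uncertainty than splitting on file_att and would be chosen as the root test.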

4. EXPERIMENTAL RESULTS
Data mining classifiers were applied to the task of suspicious email detection. Experiments were carried out on a small email corpus. For the experimental setting, sets of 10,000 emails were used, each a mixture of 5000 training samples and 5000 test samples. The training dataset was given as input to WEKA, and the classification techniques decision tree (ID3 and J48), Naive Bayes, Neural Network and SVM were applied. The experimental results show that a simple ID3 algorithm (decision tree classifier) gives better classification accuracy for suspicious email detection. To evaluate the classifiers on the testing dataset, we defined an accuracy measure as follows: Accuracy (%) = (correctly_classified_emails / total_emails) * 100. An experiment measuring performance against the size of the dataset was conducted using datasets of different sizes. For example, for the dataset of 5000 emails, the accuracy was 99.4% using the ID3 classifier.
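The accuracy measure can be checked with a few lines of Python. The counts below are the ones reported in Section 3 for the Neural Network (63 of 5000 misclassified) and the SVM (4495 of 5000 correctly classified); the function name is chosen here for illustration.

```python
def accuracy_percent(correct, total):
    """Accuracy (%) = correctly classified emails / total emails * 100."""
    return correct / total * 100

nn_accuracy = accuracy_percent(5000 - 63, 5000)   # 63 misclassified emails
svm_accuracy = accuracy_percent(4495, 5000)       # 4495 correctly classified

print(round(nn_accuracy, 2), round(svm_accuracy, 2))
```

Both values agree with the corresponding entries for data size 5000 in Table 3.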
Table 3. Classification accuracy (%) with respect to data size

Effect of dataset on Performance

Data size   NN      SVM     Naive Bayes   J48     ID3
1000        93.7    85.3    77.3          84.8    98.8
2000        98.8    85.4    76.6          85.53   99.1
3000        98.76   88.56   76.83         88.93   98.9
4000        98.77   94.4    76.9          89.05   99.2
5000        98.74   89.9    76.9          96.04   99.4


Figure 2. Comparison of classifiers w.r.t. data Size and accuracy

According to the experimental study on the testing datasets, the best classification results in the experiment were obtained with the ID3 classifier.

5. CONCLUSION AND FUTURE WORKS


Email is an important vehicle for communication. It is one possible source of data from which potential problems can be detected. In this paper, all emails were classified as suspicious or normal. The experiments show that a simple ID3 classifier can provide better classification results for suspicious email detection than the other classifiers tested. In the near future, we plan to incorporate other techniques, such as different methods of feature selection. The proposed work will be helpful for identifying suspicious emails and will assist investigators in getting information in time to take effective action to reduce criminal activity.

REFERENCES
[1] S. Appavu alias Balamurugan, R. Rajaram and S. Senthamarai Kannan, 2007. A Novel Data Mining Approach to Detect Deceptive Communication in Email Text. Proceedings of National Conference of Advanced Computing, MIT, Chennai, India.
[2] S. Appavu alias Balamurugan, R. Rajaram et al., 2007. Association Rule Mining for Suspicious Email Detection: A Data Mining Approach. Proceedings of IEEE International Conference of Intelligence and Security Informatics, New Jersey, USA.
[3] W. Cohen et al., 1996. Learning Rules that Classify Email. In Proceedings of the AAAI Spring Symposium on Machine Learning in Information Access.
[4] B. Cui, A. Mondal, J. Shen, G. Cong and K. Tan, 2005. On Effective Email Classification via Neural Networks. In Proceedings of DEXA, pp. 85-94.
[5] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques.
[6] P. S. Keila and D. B. Skillicorn, 2005. Detecting Unusual and Deceptive Communication in Email. Technical report.
[7] S. Kiritchenko, S. Matwin and S. Abu-Hakima, 2004. Email Classification with Temporal Features. Intelligent Information Systems, pp. 523-533.
[8] Seongwook Youn and Dennis McLeod. A Comparative Study for Email Classification.
[9] Y. Yang et al., 2004. An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, Vol. 1, No. 1/2, pp. 67-88.
