
Email Spam Filtering

Presented by: PRADEEP KUMAR
B.Tech Final Year, Computer Science & Engg.
Roll no: 0504210041
Session: 2008-09

CONTENTS
Synopsis.
What is e-mail?
What is spam?
What is the problem with spam?
Statistics.
Who benefits from spam?
Spam filtering technique.
Combining spam filter.
Challenges.
Conclusion.

Synopsis
This seminar presents e-mail spam filtering and shows the working principle by which spam is separated from legitimate e-mail. "E-mail spam" refers to unsolicited bulk e-mail. An e-mail is spam if it meets two criteria:

Bulk.
Unsolicited.

A spam mail must have both properties, but not every mail that has them is necessarily spam. This seminar therefore presents the various filtering techniques that keep spam out of the inbox. Mails flagged by the filter are automatically moved to a folder called spam or junk mail. If any of them turn out to be useful, they can still be read from that folder; otherwise they are deleted automatically after a certain period of time. The report thus shows how mail is categorized as ham or spam. It also defines e-mail and spam, describes the problems spam causes, and outlines the challenges that arise during filtering.

What Is E-Mail
Electronic mail, often abbreviated to e-mail, email, or originally eMail, is a store-and-forward method of writing, sending, receiving and saving messages over electronic communication systems. Computers use the TCP/IP protocol suite to send email messages in the form of packets. The first thing you need to send and receive emails is an email address. When you create an account with an Internet Service Provider, you are usually given an email address for sending and receiving emails. If this isn't the case, you can create an email address / account at web sites such as Yahoo, Hotmail and Lycos.

The following is a typical sequence of events that takes place when Alice composes a message using her mail user agent (MUA). She types in, or selects from an address book, the e-mail address of her correspondent. She hits the "send" button.

1. Her MUA formats the message in Internet e-mail format and uses the Simple Mail Transfer Protocol (SMTP) to send the message to the local mail transfer agent (MTA), in this case smtp.a.org, run by Alice's Internet Service Provider (ISP).

2. The MTA looks at the destination address provided in the SMTP protocol (not from the message header), in this case bob@b.org. An Internet e-mail address is a string of the form localpart@exampledomain.com, which is known as a Fully Qualified Domain Address (FQDA). The part before the @ sign is the local part of the address, often the username of the recipient, and the part after the @ sign is a domain name. The MTA looks up this domain name in the Domain Name System to find the mail exchange servers accepting messages for that domain.

3. The DNS server for the b.org domain, ns.b.org, responds with an MX record listing the mail exchange servers for that domain, in this case mx.b.org, a server run by Bob's ISP.

4. smtp.a.org sends the message to mx.b.org using SMTP, which delivers it to the mailbox of the user bob.

5. Bob presses the "get mail" button in his MUA, which picks up the message using the Post Office Protocol (POP3).
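The address handling and lookup in this sequence can be sketched in Python. This is only an illustration: the MX_RECORDS table is a hypothetical stand-in for a real DNS query, and the function names split_address and route are invented for the example.

```python
# Hypothetical MX table standing in for a real DNS lookup (assumption).
MX_RECORDS = {"b.org": ["mx.b.org"]}


def split_address(address):
    """Split an e-mail address into (local part, domain) at the last '@'."""
    local_part, sep, domain = address.rpartition("@")
    if not sep or not local_part or not domain:
        raise ValueError(f"not a valid address: {address!r}")
    return local_part, domain.lower()


def route(envelope_recipient):
    """Return the mail exchange servers listed for the recipient's domain."""
    _, domain = split_address(envelope_recipient)
    return MX_RECORDS.get(domain, [])


local_part, domain = split_address("bob@b.org")
print(local_part, domain)   # local part and domain of the recipient
print(route("bob@b.org"))   # servers found in the toy MX table
```

A real MTA would query DNS for MX records and then open an SMTP connection to one of the returned hosts; the dictionary lookup only mirrors the shape of that step.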

What Is E-mail Spam

"E-mail spam" refers to "unsolicited bulk email" that involves nearly identical messages sent to numerous recipients by e-mail. An email message is spam if it meets two criteria:

1. Bulk: the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients.

2. Unsolicited: the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent.


Spam: Unsolicited commercial email
Ham: Valid email
False Positive: A valid email that was erroneously classified as spam
False Negative: A spam email that was erroneously classified as valid
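To make the last two definitions concrete, the following Python sketch counts false positives and false negatives for a filter's output. The label lists are made-up example data, and the function name count_errors is an assumption of this sketch.

```python
def count_errors(actual, predicted):
    """Count false positives and false negatives.

    Both arguments are equal-length lists of the labels "spam" or "ham".
    """
    false_positives = sum(
        1 for a, p in zip(actual, predicted) if a == "ham" and p == "spam"
    )
    false_negatives = sum(
        1 for a, p in zip(actual, predicted) if a == "spam" and p == "ham"
    )
    return false_positives, false_negatives


# Made-up labels for five mails: what they really are vs. what a filter said.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "spam", "ham", "ham", "ham"]

fp, fn = count_errors(actual, predicted)
print("false positives:", fp, "false negatives:", fn)
```

In this example the second mail is a false positive (valid mail marked as spam) and the fourth is a false negative (spam that got through).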

Spam is flooding the Internet with many copies of the same message, in an attempt to force the message on people who would not otherwise choose to receive it. Most spam is commercial advertising, often for dubious products, get-rich-quick schemes, or quasi-legal services. Spam costs the sender very little to send; most of the costs are paid by the recipient or the carriers rather than by the sender. Email spam lists are often created by scanning Usenet postings, stealing Internet mailing lists, or searching the Web for addresses. Email spam typically costs users money out of pocket to receive. Many people (anyone with measured phone service) read or receive their mail while the meter is running, so to speak, so spam costs them additional money. On top of that, it costs ISPs and online services money to transmit spam, and these costs are passed on directly to subscribers. One particularly nasty variant of email spam is sending spam to mailing lists (public or private email discussion forums). Because many mailing lists limit activity to their subscribers, spammers use automated tools to subscribe to as many mailing lists as possible, so that they can grab the lists of addresses or use the mailing list as a direct target for their attacks.

What Is The Problem With Spam


Spam's direct effects include the consumption of computer and network resources and the cost in human time and attention of dismissing unwanted messages. In addition, spam has costs stemming from the kinds of spam messages sent, from the ways spammers send them, and from the arms race between spammers and those who try to stop or control spam. There are also the opportunity costs of those who refuse to use spam-afflicted systems. Finally, there are the direct and indirect costs borne by the victims, both those related to the spamming itself and those related to other crimes that usually accompany it, such as financial theft, identity theft, data and intellectual property theft, virus and other malware infection, fraud, and deceptive marketing.

"Spamming is the scourge of electronic mail and newsgroups on the Internet. It can seriously interfere with the operation of public services, to say nothing of the effect it may have on any individual's e-mail mail system. Spammers are, in effect, taking resources away from users and service suppliers without compensation and without authorization." --Vint Cerf, ISOC Chairman

Some problems related to the spam are:

Cost shifting
Fraud
Theft
Harm to the marketplace
Consumer perception
Global implication

Statistics

1978 - An e-mail spam is sent to 600 addresses.
1994 - First large-scale spam sent to 6000 newsgroups, reaching millions of people.
2005 - (June) 30 billion per day.
2006 - (June) 55 billion per day.
2006 - (December) 85 billion per day.
2007 - (February) 90 billion per day.

Who Benefits From Spam

Some spam can be assumed to be noise sent only to degrade the training of spam filters. In most cases, however, spammers make money from sending the spam itself, not from selling the products or services it advertises. Suppose a spammer charges $10,000 for a million e-mails. An unsuspecting company, or an individual hoping to get rich quick, pays the spammer that amount to circulate a link to its website and sell a product. The spammer does not promise that the product will sell, only that a million e-mails will be delivered. What normally happens is that the customer goes bankrupt and the spammer keeps the money. A spammer who can get past the spam filters can promise customers better visibility.

Spam Controlling Technique


Filtering is a computing technique for separating unwanted e-mail from wanted e-mail. Most filters work well at stopping messages that contain known bad keywords and the like. The average filter seems to stop about 60% of spam. Spam-controlling techniques can be divided into two categories:

1. Fight back technique

1.1. Reporting spam to ISP

Most people consider unsolicited mail to be an annoyance. Some think of their bulk mail folder as something simply to be deleted in its entirety. Others feel that all this junk mail is beyond contempt and needs to be stopped. Spam is very common in mailboxes everywhere. The messages are practically identical, and you can wind up with dozens, if not hundreds, of copies of the same message over the life of your email address. No Internet service provider (ISP) wants its servers to be used for spam, and any ISP would welcome a message from you reporting a spammer's activity and abuse of its system. One way to fight back is therefore to report spam to the spammer's ISP.

Disadvantages:
1. Disguised spammers.
2. Naive users cannot interpret the email headers.

1.2. Fight back filter

The majority of spam messages contain links to web pages. A spam filter could automatically retrieve those URLs and crawl the linked pages, which would increase the load on the spammer's server. If all recipients of the spam did this at the same time, the server might crash, and so the cost of spamming would increase.

2. Filtering technique
This type of spam-controlling technique uses specific methods to filter spam out of incoming email. The methods are as follows:

2.1. Challenge response filter

A challenge-response system is a type of spam filter that automatically sends a reply with a challenge to the (alleged) sender of an incoming e-mail. In this reply, the sender is asked to perform some action to assure delivery of the original message, which would otherwise not be delivered. The action is typically one that can be performed once with relatively little effort but requires great effort when performed in large numbers, which effectively filters out spammers. This technique filters out almost all spam, since no spammer would be interested in taking the extra effort to prove himself or herself.
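A minimal Python sketch of the challenge-response idea follows. It assumes the challenge is a random token the sender must send back; the class name, the method names, and the use of a token rather than some other action are all assumptions of this sketch, not part of the seminar material.

```python
import secrets


class ChallengeResponseFilter:
    """Hold mail from unknown senders until they answer a challenge."""

    def __init__(self):
        self.verified = set()   # senders who already answered a challenge
        self.pending = {}       # token -> (sender, message) awaiting a reply

    def receive(self, sender, message):
        """Deliver at once for verified senders, otherwise issue a challenge."""
        if sender in self.verified:
            return ("delivered", message)
        token = secrets.token_hex(8)
        self.pending[token] = (sender, message)
        return ("challenge", token)

    def respond(self, token):
        """Release the held message when a valid challenge token comes back."""
        if token not in self.pending:
            return None
        sender, message = self.pending.pop(token)
        self.verified.add(sender)
        return message


mail_filter = ChallengeResponseFilter()
status, token = mail_filter.receive("alice@a.org", "Hello Bob")
print(status)                                           # first mail is held
print(mail_filter.respond(token))                       # released after reply
print(mail_filter.receive("alice@a.org", "Again")[0])   # now delivered directly
```

Answering once is cheap for a real correspondent, while a bulk sender would have to answer a separate challenge for every recipient.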

2.2. Black and white list

A "white list" is used to ensure that certain emails won't be filtered out. A "black list" is used to ensure that certain emails are definitely marked as spam. Blacklists of misbehaving servers or known spammers are collected by several sites, and the sender ID in the email is compared with the blacklist.

White lists are complementary to black lists and contain addresses of trusted contacts.

Blacklists and white lists should be used for first-level filtering (before content checks are applied) and not as the only tool for making a decision.

Disadvantage: Lists are prone to misconfiguration, and a legitimate server that has been listed incorrectly may be unable to get itself removed.
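First-level list filtering might look like the following Python sketch. The addresses in both lists are made up, and the third result, "unknown", is an assumption of this sketch standing for mail that is passed on to the content checks.

```python
# Hypothetical lists; real deployments load these from maintained sources.
WHITELIST = {"bob@b.org"}
BLACKLIST = {"spammer@bulk.example"}


def first_level_check(sender):
    """Return 'ham', 'spam', or 'unknown' based on the lists alone."""
    sender = sender.strip().lower()
    if sender in WHITELIST:
        return "ham"
    if sender in BLACKLIST:
        return "spam"
    return "unknown"  # not decided here: hand over to content-based checks


for address in ("bob@b.org", "spammer@bulk.example", "carol@c.org"):
    print(address, "->", first_level_check(address))
```

Checking the white list first means a trusted contact is never rejected even if the address also ends up on a blacklist by mistake.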

2.3. Content based filter


It is not a good idea to filter mails based on blacklists alone. A wiser decision is to consider the actual content of the email. Almost all successful spam filters use this technique. There are two major types: rule-based and Bayesian.

2.3.1. Rule based
Rule-based filters apply a set of static rules to decide whether a mail is spam or not. Rules can look for particular words and phrases, lots of uppercase characters, exclamation points, special characters, Web links, HTML messages, background colors, crazy Subject lines, and so on.

Rules are given scores, based on importance.
Incoming mails are parsed and checked for known malicious patterns.
The total score is calculated for the triggered rules.
If final score > threshold, classify as spam. Otherwise, classify as legitimate mail.
The threshold is decided by the user.

Advantages:
Easy to implement.
No training required.

Disadvantages:
The rules are static and too general.
Spammers find new ways to deceive the rules.
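The scoring scheme described above can be sketched in Python as follows. The four rules, their scores, and the threshold of 3.0 are invented for illustration; a real filter would use a much larger, tuned rule set.

```python
import re


def mostly_uppercase(text):
    """True when more than half of the letters are uppercase."""
    letters = [c for c in text if c.isalpha()]
    return bool(letters) and sum(c.isupper() for c in letters) > len(letters) / 2


# Hypothetical rules: (description, test function, score).
RULES = [
    ("spam phrase", lambda t: "get rich quick" in t.lower(), 2.0),
    ("many exclamation points", lambda t: t.count("!") >= 3, 1.0),
    ("mostly uppercase", mostly_uppercase, 1.5),
    ("web link", lambda t: re.search(r"https?://", t) is not None, 0.5),
]

THRESHOLD = 3.0  # chosen by the user


def score(text):
    """Add up the scores of all rules triggered by the text."""
    return sum(points for _, test, points in RULES if test(text))


def classify(text):
    """Spam when the total score exceeds the threshold, ham otherwise."""
    return "spam" if score(text) > THRESHOLD else "ham"


print(classify("GET RICH QUICK!!! VISIT http://example.com NOW"))
print(classify("Hi Bob, the meeting is at 10 tomorrow."))
```

Because the rules are fixed, a spammer who learns them can reword a message to stay under the threshold, which is the weakness noted above.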

2.3.2. Bayesian filter

Bayesian filters are the latest in spam filtering technology and the most successful. Bayes classifiers have been used extensively in the field of pattern recognition. Given an unlabeled example, the classifier calculates the most likely classification together with a degree of probability.

Steps in Bayes filtering:
Training
Validation
Implementation

Training starts with two collections of mails: one of spam and one of legitimate mail. For every word in these emails, the filter calculates a spam probability based on the proportion of spam occurrences. Bayesian filters are quite accurate and adapt automatically as spam evolves. They minimize false positives because they consider evidence of innocence as well as evidence of spam.

Bayes probability:

Pr(spam | words) = Pr(spam) * Pr(words | spam) / Pr(words)

A probability closer to 1 is classified as spam and one closer to 0 as ham; 0.5 is set as the threshold.
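A small Python sketch of a Bayesian filter follows. It makes two assumptions that the text does not state: words are treated as independent given the class (the "naive" Bayes simplification), and add-one smoothing is used so that unseen words do not produce zero probabilities. The class name and the four training mails are invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a mail into lowercase words."""
    return text.lower().split()


class NaiveBayesFilter:
    """Word-count Bayes classifier; train on both classes before use."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.mail_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.mail_counts[label] += 1

    def spam_probability(self, text):
        """Pr(spam | words), normalised over the two classes."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_mails = sum(self.mail_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # log Pr(label)
            log_score = math.log(self.mail_counts[label] / total_mails)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # log Pr(word | label) with add-one smoothing (assumption)
                count = self.word_counts[label][word] + 1
                log_score += math.log(count / (total_words + len(vocabulary)))
            log_scores[label] = log_score
        # Dividing by Pr(words) is the same as normalising the two scores.
        highest = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - highest)
        ham = math.exp(log_scores["ham"] - highest)
        return spam / (spam + ham)

    def classify(self, text, threshold=0.5):
        return "spam" if self.spam_probability(text) > threshold else "ham"


bayes = NaiveBayesFilter()
bayes.train("win money now", "spam")
bayes.train("cheap money offer", "spam")
bayes.train("meeting tomorrow at noon", "ham")
bayes.train("project report attached", "ham")

print(bayes.classify("win cheap money"))
print(bayes.classify("project meeting tomorrow"))
```

Working in logarithms avoids numeric underflow when a mail contains many words, and the final normalisation plays the role of the Pr(words) denominator in the formula.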

Combining Spam Filter

Goal: The combined filter aims to improve on the performance of the individual filters.

Combined Filter = Original Filter (OF) + Received Filter (RF)

Maximum gain: The gain is greatest when the received filter contains some feature sets not found in the original filter.

E.g. Original Filter = {Share Market, Higher Studies}

Received Filter = {Share Market, Job Alerts}
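Using the example feature sets above, the combination can be sketched in Python. Reading the "+" in the formula as a set union is an assumption of this sketch; the seminar does not specify how the two filters are merged.

```python
original_filter = {"Share Market", "Higher Studies"}
received_filter = {"Share Market", "Job Alerts"}

# Assumption: "+" is interpreted as the union of the two feature sets.
combined_filter = original_filter | received_filter

# Features the received filter contributes beyond the original one.
gained = received_filter - original_filter

print(sorted(combined_filter))
print(sorted(gained))
```

Here the gain comes from "Job Alerts", the one feature set that the original filter did not already cover.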

Challenges
Decisions (spam / ham) are made by both filters individually.

Decisions agree: no problem.
Decisions disagree: due to the difference in feature sets.
How do we select the correct decision or filter? Who selects it?

Conclusion

1. We discussed the techniques to kill spam.
2. Comparison between various techniques.
3. So far, Bayesian seems to be reliable.
4. Discussed a new approach to combine filters.
5. Does not analyze other types of spam, apart from email spam.
6. Future work:
a) Learning techniques for filter selector.
b) Better similarity measures.
