NAIVE BAYES: An Implementation Of Email Spam Filtering


In World Wide Web jargon, email spam is a threat that constantly looms over Internet users. As technology evolves, so do the methods for guarding against breaches of online security. With techniques like Naïve Bayes, Random Forest, Neural Networks, and Support Vector Machines (SVM), unwanted and irrelevant e-mails can be filtered out or blocked as spam.

E-mail marketing is one of the most cost-effective ways for companies to advertise their products or services. However, it also floods users with unwanted, irrelevant messages in the form of spam. This clutter may lead users to overlook their relevant (non-spam) mail as well, and a considerable amount of time is spent manually deleting unwanted mail from the inbox.

With the rise of email marketing, people have become increasingly intolerant of unsolicited mail. As a result, email spam filters have gradually become stricter and now filter automatically. Several machine learning algorithms are in common use for blocking spam, but Naïve Bayes is one of the most frequently used and widely implemented methods for filtering such emails out of the user’s inbox.

A Naïve Bayes classifier closely tracks words that occur frequently in spam, in non-spam, and in both. It is an algorithm that uses Bayes’ theorem to categorise words or objects. It is commonly used in text analysis, medical diagnosis, and spam filtering, and its most successful implementation is seen in spam filters.

So let us dive deep into the common spam-filtering techniques, and into how Naïve Bayes helps to identify spam e-mails.

Blacklist – A prevalent spam-filtering technique that stops unwanted email by blocking messages from a preset list of senders, created by an individual or by an organization’s systems administrator. A blacklist is a record of Internet Protocol (IP) addresses or email addresses that have previously been used to send spam. When a message arrives, the spam filter checks its sender against the blacklisted email and IP addresses. If the sender appears in the record, the message is identified as spam and rejected.

However, a blacklist can sometimes misidentify legitimate senders as spammers. These false positives arise when junk mail is sent from an IP address that is also used by legitimate email users. Conversely, since many spammers regularly switch email and IP addresses to avoid being traced, a blacklist may not catch the newest outbreaks immediately.
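The blacklist check described above can be sketched in a few lines of Python. This is only a minimal illustration: the names `BLACKLISTED_EMAILS`, `BLACKLISTED_NETWORKS` and `is_blacklisted` are hypothetical, and the addresses come from ranges reserved for documentation.

```python
import ipaddress

# Hypothetical blacklist entries (assumptions for illustration only).
BLACKLISTED_EMAILS = {"offers@spam.example"}
BLACKLISTED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def is_blacklisted(sender_email: str, sender_ip: str) -> bool:
    """Return True if the sender's email address or IP is on the blacklist."""
    if sender_email.lower() in BLACKLISTED_EMAILS:
        return True
    ip = ipaddress.ip_address(sender_ip)
    return any(ip in network for network in BLACKLISTED_NETWORKS)


print(is_blacklisted("offers@spam.example", "198.51.100.7"))  # True
print(is_blacklisted("alice@example.org", "203.0.113.42"))    # True
print(is_blacklisted("alice@example.org", "198.51.100.7"))    # False
```

The second call shows how a false positive can arise: a legitimate sender is rejected only because the mail came from a blacklisted IP range.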

Whitelist – A whitelist works in exactly the opposite way to a blacklist. Instead of specifying which senders or addresses to block, it lets you list which senders or addresses to accept email from. These senders are placed on a trusted-users list. Many spam filters let you use a whitelist alongside another spam-filtering technique to reduce the number of legitimate emails that are accidentally flagged as spam. However, a filter that relies on the whitelist alone automatically blocks any legitimate sender who has not been approved in advance.

A few anti-spam applications use a variation called an automatic whitelist. Under this system, an unknown sender’s IP or email address is checked against a database. If the address has no spamming history, the email is delivered to the recipient’s inbox and the sender is added to the whitelist.
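As a rough sketch of the automatic whitelist idea, assume the trusted list and the spam-history database are simple in-memory sets. The names `whitelist`, `spam_history` and `accept_sender` are invented for this example.

```python
# Hypothetical data stores (assumptions for illustration only).
whitelist = {"boss@example.org"}          # trusted senders
spam_history = {"offers@spam.example"}    # senders known to have sent spam


def accept_sender(sender: str) -> bool:
    """Accept trusted senders; auto-whitelist unknown senders with no spam history."""
    sender = sender.lower()
    if sender in whitelist:
        return True
    if sender in spam_history:
        return False
    whitelist.add(sender)  # no spamming history: trust this sender from now on
    return True


print(accept_sender("new.client@example.net"))  # True, and the sender is now whitelisted
print(accept_sender("offers@spam.example"))     # False
```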

Greylist – A relatively new spam-filtering method that exploits the fact that many spammers attempt to send a batch of junk email only once. Under greylisting, the receiving mail server initially rejects mail from unknown senders and replies to the originating server with a failure message. A legitimate server will attempt to send the mail again; when it does, the greylist assumes the email is not spam and lets it through to the recipient’s inbox. At that point, the sender’s IP or email address is added to a list of allowed senders.

Greylisting, however, is not ideal for time-sensitive messages, because the initial rejection delays the delivery of important mail. On the plus side, it requires relatively few system resources.
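The greylisting flow can be illustrated with a deliberately simplified sketch. It keys only on the sender's IP and ignores timing, whereas real mail servers typically also track the envelope sender and recipient and enforce a minimum retry delay. The names `allowed_senders`, `pending_senders` and `receive` are hypothetical.

```python
# Hypothetical in-memory state (assumptions for illustration only).
allowed_senders = set()   # senders that have retried successfully
pending_senders = set()   # senders whose first attempt was temporarily rejected


def receive(sender_ip: str) -> str:
    """Return 'accept' or 'retry-later' for a delivery attempt from sender_ip."""
    if sender_ip in allowed_senders:
        return "accept"
    if sender_ip in pending_senders:
        # Second attempt: a legitimate server retries, so trust it from now on.
        pending_senders.discard(sender_ip)
        allowed_senders.add(sender_ip)
        return "accept"
    pending_senders.add(sender_ip)
    return "retry-later"


print(receive("198.51.100.7"))  # retry-later (first attempt is rejected)
print(receive("198.51.100.7"))  # accept (the server retried)
print(receive("198.51.100.7"))  # accept (now on the allowed list)
```

A spammer that sends once and never retries stays in `pending_senders` and is never delivered.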

Bayesian Filter – The most advanced form of content-based filtering, which requires the user to train the system over time by manually flagging emails as legitimate or spam. The filter then extracts words and phrases from the flagged emails and sorts them into a high-probability list and a low-probability list, according to how strongly each is associated with spam. To evaluate an incoming email, the filter scans its content and compares it against the two lists to compute the probability that the email is spam. For example, if the word “Free” has occurred 65 times in spam emails but only 4 times in legitimate ones, then about 94 percent of its occurrences (65 of 69) were in spam, so an incoming message containing “Free” is very likely junk.

However, because this technique needs a training period before it works well, the user has to be patient and keep deleting some junk emails manually, at least in the beginning. Because the filter keeps building its word-probability lists from the text of the emails received, it becomes more effective the more it is used.
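The word-probability idea can be sketched as follows. This is one simple way to do it, not the method of any particular filter: it assumes the training set contains comparable amounts of spam and legitimate mail, treats words as independent, and clamps probabilities to avoid division by zero. The counts for "meeting" and all function names are invented for illustration.

```python
from math import prod

# Hypothetical training counts: word -> (occurrences in spam, occurrences in legitimate mail).
word_counts = {"free": (65, 4), "meeting": (2, 40)}


def word_spam_probability(word: str) -> float:
    """Share of a word's occurrences that came from spam; 0.5 for unseen words."""
    spam, legit = word_counts.get(word, (0, 0))
    if spam + legit == 0:
        return 0.5
    # Clamp so that no single word can force the result to exactly 0 or 1.
    return min(0.99, max(0.01, spam / (spam + legit)))


def message_spam_probability(text: str) -> float:
    """Combine the per-word probabilities, treating the words as independent."""
    probs = [word_spam_probability(w) for w in text.lower().split()]
    spam_score = prod(probs)
    legit_score = prod(1 - p for p in probs)
    return spam_score / (spam_score + legit_score)


print(round(word_spam_probability("free"), 3))              # 0.942
print(round(message_spam_probability("free meeting"), 3))   # 0.448
```

The second result shows why the whole message matters: a word typical of legitimate mail offsets the evidence from "free".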

SPAM DETECTION IN GMAIL

Gmail’s spam filter is regarded as one of the best at keeping the inbox free of junk messages. To judge the authenticity of an email, numerous rules are applied to each message that passes through Google’s data centers. Each rule describes a characteristic of spam email and carries an associated numerical value. The weighted values of the rules that a message matches are combined in an equation to produce a score. This score is then compared against the sensitivity threshold set by the user’s spam filter, and the message is classified as valid or spam accordingly.
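The general pattern of weighted rules and a sensitivity threshold can be sketched like this. The rules, weights and threshold below are invented for illustration and are not Gmail's actual rules, which are not described in this article.

```python
# Hypothetical rules: (description, test function, weight). Not Gmail's real rules.
RULES = [
    ("mentions 'free'", lambda msg: "free" in msg.lower(), 2.0),
    ("three or more exclamation marks", lambda msg: msg.count("!") >= 3, 1.5),
    ("written entirely in capitals", lambda msg: msg.isupper(), 1.0),
]


def spam_score(message: str) -> float:
    """Sum the weights of all rules that the message matches."""
    return sum(weight for _, test, weight in RULES if test(message))


def is_spam(message: str, threshold: float = 3.0) -> bool:
    """Flag the message when its score reaches the (assumed) sensitivity threshold."""
    return spam_score(message) >= threshold


print(spam_score("FREE PRIZE!!! CLICK NOW"))  # 4.5
print(is_spam("FREE PRIZE!!! CLICK NOW"))     # True
print(is_spam("Lunch at noon?"))              # False
```

Lowering `threshold` makes the filter more sensitive, at the cost of more false positives.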

TEXT CLASSIFICATION USING NAÏVE BAYES

The task of categorizing documents on the basis of their content is called text classification. A document is often represented by the set of words that appear in it. In simple terms, this representation records which words appear in the document, and how often, while ignoring their order. There are two probabilistic document models, the Bernoulli and the Multinomial document models. Both represent a document through its words and rely on the Naïve Bayes assumption that words occur independently of one another, given the class of the document.

Bernoulli Document Model – This model records a Boolean value for each word in the vocabulary: 1 if the word appears in the document being examined and 0 if it does not. It does not count how many times a word recurs, but it does take into account the words that are absent from the document. Absent words are factored into the conditional probabilities, so the absence of a word also counts as evidence.

Multinomial Document Model – This model represents a document by how many times each word occurs in it. Words that do not occur in the document are ignored entirely, unlike in the Bernoulli model, which takes absent words into consideration as well.
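The difference between the two document models is easiest to see on a small example. The sketch below uses a made-up four-word vocabulary, and the function names are hypothetical. The multinomial likelihood omits the constant multinomial coefficient, which does not affect classification.

```python
from collections import Counter
from math import prod
from typing import List

# Hypothetical vocabulary, chosen only for illustration.
VOCABULARY = ["free", "offer", "meeting", "report"]


def bernoulli_vector(text: str) -> List[int]:
    """1 if the vocabulary word is present in the document, 0 if it is absent."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in VOCABULARY]


def multinomial_vector(text: str) -> List[int]:
    """Number of times each vocabulary word occurs in the document."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in VOCABULARY]


def bernoulli_likelihood(vector: List[int], probs: List[float]) -> float:
    """Present words contribute p, absent words contribute (1 - p)."""
    return prod(p if x else 1 - p for x, p in zip(vector, probs))


def multinomial_likelihood(vector: List[int], probs: List[float]) -> float:
    """Each occurrence contributes p; absent words (count 0) contribute nothing."""
    return prod(p ** count for count, p in zip(vector, probs))


doc = "free offer free free"
print(bernoulli_vector(doc))    # [1, 1, 0, 0]
print(multinomial_vector(doc))  # [3, 1, 0, 0]
```

In `bernoulli_likelihood` the two absent words still change the result through the `1 - p` factors, while in `multinomial_likelihood` a count of zero yields a factor of 1.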

Naïve Bayes is one of the oldest methods of spam filtering. It computes the probability that an email is spam or non-spam from the terms it contains. Because it is not computationally intensive, the Naïve Bayes classifier remains an efficient way to block spam or adult content today.
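To tie the pieces together, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one (Laplace) smoothing, working in log space to avoid numerical underflow. The four training messages and the names `TRAINING`, `train` and `classify` are invented for illustration; a real filter would be trained on thousands of flagged emails.

```python
import math
from collections import Counter

# Tiny hypothetical training set, invented for illustration.
TRAINING = [
    ("free offer click now", "spam"),
    ("free prize offer", "spam"),
    ("meeting agenda attached", "legitimate"),
    ("project report for the meeting", "legitimate"),
]


def train(examples):
    """Count words per class and documents per class."""
    word_counts = {"spam": Counter(), "legitimate": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["legitimate"])
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen in training
            # Add-one smoothing keeps unseen word/class pairs from zeroing the product.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING)
print(classify("free offer", model))      # spam
print(classify("meeting report", model))  # legitimate
```

Training and classification each need only a single pass over the words, which is why the method is cheap to run.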
