Multinomial Naive Bayes for Spam Detection

stillbigjosh
3 min read · Aug 11, 2018


Here, we will demonstrate how a Naive Bayes classifier solves an age-old internet problem: spam. To solve this, we will follow a few easy steps. Before you proceed, I'm assuming you have a basic understanding of Naive Bayes classifiers. If you don't: Naive Bayes is a classification algorithm used to solve certain classification problems. By 'certain' I mean its usage is limited. Naive Bayes makes an assumption about your data: that the features are conditionally independent of one another given the class, which is where the 'naive' in the name comes from. It would be the wrong algorithm to use in a scenario where your features are strongly dependent.
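For reference, the classifier is just Bayes' theorem with that independence assumption baked in. For a subject line made of words w1 … wn, each class is scored as

P(class | w1, …, wn) ∝ P(class) · P(w1 | class) · … · P(wn | class)

and the class with the higher score wins.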

What do we need?
1. A sample dataset of spam and legitimate email subjects, to narrow down our word list. Since this is basically a demonstration, we won't be working with large datasets, just a handful of popular spam mail subjects; do note that when working on a large-scale project, the dataset should be robust.
2. A Python 3.x interpreter.
3. In-depth knowledge of Multinomial Naive Bayes classification.

How do we go about this?
We have samples of popular spam email subjects from wordsthatclick.com
1. save big on all vehicles
2. melt fat away
3. you were recommended into the global professional network
4. how to grow 3+ inches taller in just a matter of weeks
5. you’ve won a lottery
We also have samples of non spam emails
1. thanks for signing up for our newsletter
2. your profile was recently changed
3. confirm your new account
4. recommended courses for you
5. customer invoice
But do keep in mind that spammers are not stupid; they keep getting better at bypassing spam detection, which means the better the dataset, the better the algorithm.
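To make the walkthrough concrete, here is a minimal sketch of these ten subject lines as labelled Python lists (the variable names spam_subjects and ham_subjects are my own, not taken from the linked repository):

```python
# Sample subject lines, labelled by class. In a real project these
# would come from a much larger labelled corpus.
spam_subjects = [
    "save big on all vehicles",
    "melt fat away",
    "you were recommended into the global professional network",
    "how to grow 3+ inches taller in just a matter of weeks",
    "you've won a lottery",
]
ham_subjects = [
    "thanks for signing up for our newsletter",
    "your profile was recently changed",
    "confirm your new account",
    "recommended courses for you",
    "customer invoice",
]
```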

Technicality?
1. Convert the data (both the spam and the non-spam subjects) into a frequency table. This will be done in code, using Artificial Intelligence's favorite pet, Python (see the sketch after this list).

2. Create a likelihood table from those frequencies.

3. Use the Naive Bayes equation to calculate the posterior probability for each class (spam or non-spam). The class with the highest posterior probability is the outcome of the prediction, given a mail with the subject line "new professional courses is recommended for you".

4. This step and step 3 are solved simultaneously, by computing the smoothed probability of each word in the new subject line under each class, applying Laplace smoothing to handle the words missing from our datasets, and then adding up the results for each of these words. The class with the highest posterior probability is the category this new mail belongs to.
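Here is a minimal sketch of steps 1, 2, and 4, building on the lists above. It follows this article's recipe of adding the smoothed per-word likelihoods; note that a textbook Multinomial Naive Bayes would multiply them (or sum their logs) instead, and the exact figures depend on tokenization and smoothing choices, so the author's script in the linked repository may differ in detail:

```python
from collections import Counter

def word_counts(subjects):
    """Step 1: the frequency table, i.e. how often each word appears in a class."""
    counts = Counter()
    for subject in subjects:
        counts.update(subject.lower().split())
    return counts

spam_counts = word_counts(spam_subjects)
ham_counts = word_counts(ham_subjects)

# Vocabulary across both classes; its size is the Laplace smoothing denominator.
vocabulary = set(spam_counts) | set(ham_counts)

def likelihood(word, counts):
    """Step 2: Laplace-smoothed likelihood of a word given a class.

    Adding 1 to every count gives unseen words a small nonzero
    probability instead of zeroing out the whole score.
    """
    total = sum(counts.values())
    return (counts[word] + 1) / (total + len(vocabulary))

def score(subject, counts):
    """Step 4, the article's recipe: add up the smoothed per-word scores."""
    return sum(likelihood(word, counts) for word in subject.lower().split())
```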
First, we compute the posterior probability of the class "spam" given the predictor ("new professional courses is recommended for you").
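With the sketch above, that is one call (don't expect it to reproduce the figure below digit for digit, since that depends on the exact smoothing constants in the author's script):

```python
new_subject = "new professional courses is recommended for you"
# Score the new subject line against the spam frequency table.
print(score(new_subject, spam_counts))
```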

The posterior probability of class 'spam' given the predictor (the new mail subject line) was found to be 0.8839285714285714.

Next, we compute the posterior probability of class ‘non spam’ given the same predictor.
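And the same computation against the non-spam table:

```python
# Score the same subject line against the non-spam frequency table.
print(score(new_subject, ham_counts))
```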

The posterior probability here comes out to 1.4375. (Notice that this is greater than 1: because we added the smoothed per-word likelihoods instead of multiplying them, these totals are unnormalized scores rather than true probabilities, and only the comparison between them matters.) Finally, since it's a universal truth that 1.4375 > 0.8839285714285714, the new mail in our inbox with the subject line "new professional courses is recommended for you" has been classified as a "non spam" email by the algorithm. We are safe. :)

The entire source code can be found on my GitHub: https://github.com/stillbigjosh/naivebayes-spamdetection

Thank you.
