From Zero to Hero: Laplace Additive Smoothing for Naive Bayes Classifier.

Anupama Kandala
5 min read · May 6, 2023


The significance of probability in Machine Learning (ML) and Natural Language Processing (NLP) algorithms cannot be overstated. These algorithms rely on probabilistic models to make predictions and decisions based on input data. In NLP, probabilities are used for language modelling, estimating the likelihood of a sentence or word sequence, and generating text. However, sometimes these probabilities can be zero, which can seriously hamper the performance of the model. Fortunately, Laplace smoothing is a technique that can be used to smooth categorical data and avoid zero probabilities in such scenarios.

Laplace smoothing (not the Laplacian smoothing used in image processing), also known as additive smoothing, is a simple yet powerful approach used in a broad range of applications. The basic idea is to add a small constant value to each count or frequency to avoid zero probabilities, which would otherwise cause the problems discussed above.

Naive Bayes Classifier

What is Naive Bayes Classification? A naive Bayes classifier is an algorithm that applies Bayes’ theorem to classify objects, under the strong (“naive”) assumption that the attributes of a data point are independent of one another. Refer to this article for a mathematical explanation of the algorithm: https://towardsdatascience.com/a-mathematical-explanation-of-naive-bayes-in-5-minutes-44adebcdb5f8

Naive Bayes email spam classification.

To understand how Laplace smoothing handles unseen events, let’s consider a common example. Suppose we have a dataset of emails and we want to classify them as spam or not spam. We can use a Naive Bayes classifier, which estimates the probability of each word occurring in spam and non-spam emails. The mathematics of Naive Bayes spam classification is laid out in the image attached below.

A mathematical explanation of Naive Bayes Email Classification.
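As a rough illustration of this setup, here is a minimal Python sketch that estimates the per-word probabilities for each class by simple counting, with no smoothing yet. The toy emails and variable names are made up for this example, not taken from the article’s dataset.

```python
from collections import Counter

# Toy corpus (made up for illustration): tokenized emails per class.
spam_emails = [["win", "money", "now"], ["free", "money", "offer"]]
ham_emails  = [["meeting", "at", "noon"], ["project", "report", "attached"]]

def word_probabilities(emails):
    """Maximum-likelihood estimate of P(word | class): the fraction of
    emails in the class that contain each word (no smoothing yet)."""
    n_emails = len(emails)
    doc_counts = Counter(word for email in emails for word in set(email))
    return {word: count / n_emails for word, count in doc_counts.items()}

p_word_given_spam = word_probabilities(spam_emails)
p_word_given_ham  = word_probabilities(ham_emails)

print(p_word_given_spam.get("money", 0.0))   # 1.0 -> 'money' appears in every toy spam email
print(p_word_given_spam.get("amount", 0.0))  # 0.0 -> never seen in spam: the zero-frequency problem
```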

The Zero-Frequency Problem

When a word is not present in any spam email, the estimated probability of that word appearing in spam emails is zero. Because Naive Bayes multiplies the per-word probabilities together, the probability of the entire email being classified as spam then also becomes zero. This scenario is not desirable, since it results in inaccurate predictions.

If the word ‘amount’ is absent from the spam word list provided in the example, any email containing the word ‘amount’ will be scored as not spam, regardless of how spammy the other words in the email appear to be.

Zero-Frequency Problem
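A tiny sketch of how one unseen word collapses the whole Naive Bayes product; the probability values below are illustrative placeholders, not computed from a real dataset.

```python
import math

# Per-word spam probabilities (toy values for illustration).
p_word_given_spam = {"win": 0.5, "money": 1.0, "amount": 0.0}

test_email = ["win", "money", "amount"]

# Naive Bayes multiplies the per-word probabilities; a single zero
# collapses the whole product, so the email can never be scored as spam.
score = math.prod(p_word_given_spam.get(w, 0.0) for w in test_email)
print(score)  # 0.0 -- the unseen word 'amount' zeroes out the spam score
```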

The Solution - Laplace Smoothing

Laplace Additive Smoothing is used to alleviate the zero-probability problem. Adding a small constant value (α) to each count ensures that no probability estimate is zero. The formula for Laplace smoothing is:

P(xᵢ | y) = (count(xᵢ,y) + α) / (count(y) + α*N)

Where:

  • P(xᵢ | y) is the probability of observing feature xᵢ (i.e., a specific word) given that the email is in class y (i.e., spam or non-spam).
  • count(xᵢ, y) is the number of times feature xᵢ appears in emails of class y (i.e., the number of spam or non-spam emails that contain the word).
  • count(y) is the total number of emails in class y (i.e., the total number of spam or non-spam emails).
  • N is the total number of possible feature values (i.e., the total number of distinct words in the dataset).
  • α (alpha) is a smoothing parameter (usually set to 1).
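Here is a minimal Python sketch of this formula, following the definitions above; the toy corpus and the function name are assumptions made purely for illustration.

```python
def smoothed_prob(word, emails, vocab, alpha=1.0):
    """Laplace-smoothed P(word | class) following the formula above:
    (count(word, class) + alpha) / (count(class) + alpha * N)."""
    count_word = sum(1 for email in emails if word in email)  # count(x_i, y)
    count_class = len(emails)                                 # count(y)
    n_vocab = len(vocab)                                      # N: distinct words in the dataset
    return (count_word + alpha) / (count_class + alpha * n_vocab)

# Toy data reused for illustration.
spam_emails = [["win", "money", "now"], ["free", "money", "offer"]]
ham_emails  = [["meeting", "at", "noon"], ["project", "report", "attached"]]
vocab = {w for email in spam_emails + ham_emails for w in email}

print(smoothed_prob("money", spam_emails, vocab))   # > 0, as before
print(smoothed_prob("amount", spam_emails, vocab))  # > 0 now, instead of collapsing to 0
```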

What is the importance of the hyperparameter α (alpha)?

Laplace’s rule of succession sets the default value of α to 1, but for some datasets or problems a different value works better. The choice of α can significantly affect the performance of the model: too much smoothing can result in underfitting and degrade performance. A validation dataset or cross-validation is therefore used to determine an appropriate value of α.
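One possible way to tune α is a cross-validated grid search with scikit-learn’s MultinomialNB, whose alpha parameter is exactly this smoothing constant. The tiny dataset and parameter grid below are placeholders; a real problem would use far more data, more folds, and a finer grid.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: raw email texts and 0/1 spam labels.
texts = ["win money now", "free money offer", "meeting at noon", "project report attached"]
labels = [1, 1, 0, 0]

# Bag-of-words counts fed into multinomial Naive Bayes; alpha is the Laplace smoothing constant.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())

# Try a few candidate alphas with cross-validation (cv kept small only because the toy set is tiny).
param_grid = {"multinomialnb__alpha": [0.01, 0.1, 0.5, 1.0, 2.0]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
```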

Continuing with the spam email classifier problem, let’s add a constant value of 1 to the frequency of every word in the text corpus. This makes all the frequencies non-zero, while the relative probabilities are only slightly shifted.

Laplace Additive Smoothing.

It can be observed that adding a constant has helped us classify the test email correctly even when a word in the test email never appears in the training data. So, to generalize better, we need to smooth, or regularize, the estimates.

Laplace smoothing is not limited to Naive Bayes classifiers; it can also be applied to other probabilistic text models. For example, it can be used to smooth the counts in a bag-of-words model for text classification.
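For instance, here is a small sketch (with made-up counts) of additive smoothing applied directly to bag-of-words counts, so that a word never observed in the collection still receives a small non-zero probability.

```python
from collections import Counter

# Toy bag-of-words counts for a document collection (illustrative values).
counts = Counter({"project": 5, "report": 3, "deadline": 1})
vocab = ["project", "report", "deadline", "invoice"]  # 'invoice' never observed

alpha = 1.0
total = sum(counts[w] for w in vocab)

# Additive smoothing over the whole vocabulary: unseen words get a small,
# non-zero probability instead of 0, and the values still sum to 1.
smoothed = {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}
print(smoothed)
```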

Limitations

Despite being a useful technique, Laplace smoothing has certain limitations. For instance, it assumes that all features have equal prior probabilities, which might not hold in some scenarios. It also assumes that a single constant value α (alpha) is suitable for all features, which may not be the optimal choice in some cases.

Conclusion

Laplace smoothing, also referred to as additive smoothing, is used to avoid zero probabilities, improve the accuracy and reliability of statistical models, and prevent overfitting. It has diverse applications in several fields, including machine learning and NLP, and is an essential approach that every data scientist should have in their toolbox. So the next time you encounter a dataset with zero probabilities, consider using Laplace smoothing to make your models more robust and accurate.

Hope this helps! Looking forward to sharing another captivating article with you in the near future!

You can also connect with me on LinkedIn: https://www.linkedin.com/in/anupama-k-79770b17a/


Anupama Kandala

Senior Data Scientist at Harman. I write about Python, Gen AI, Stats, Data Science and Artificial Intelligence.