Summary: The e1071 package contains the naiveBayes function. It allows numeric and factor variables to be used in the naive Bayes model. Laplace smoothing prevents zero conditional probabilities for factor levels that never appear with a class in the training data. Predictions can be made for the most likely class or as a matrix of probabilities for every class.
Tutorial Time: 20 minutes
Data Being Used: Simulated data for response to an email campaign. Includes binary purchase history, email open history, sales in past 12 months, and a response variable to the current email. See “Data Used” section at the bottom to get the R script to generate the dataset.
Training a Naive Bayes Classifier
Before you start building a Naive Bayes classifier, check that you know how a naive Bayes classifier works.
To get started in R, you’ll need to install the e1071 package, which is developed and maintained at TU Wien (Vienna University of Technology).
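If you don’t already have it, installing from CRAN is one line (shown here with the library call the rest of the tutorial assumes):

install.packages("e1071") # one-time download from CRAN
library(e1071)            # load the package into your session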
Default Parameters
library(e1071)

# Default parameters
nb_default <- naiveBayes(response~., data=train[,-4])
default_pred <- predict(nb_default, test, type="class")

table(default_pred, test$response, dnn=c("Prediction","Actual"))
#          Actual
#Prediction   0   1
#         0 138  16
#         1  12  14
The naiveBayes function is a simple, elegant implementation of the naive Bayes algorithm. There are really only a handful of parameters you should consider.
naiveBayes(formula, data, laplace = 0, subset, na.action = na.pass)
- The formula is the traditional Y~X1+X2+…+Xn
- The data is typically a dataframe of numeric or factor variables.
- laplace provides a smoothing effect (as discussed below).
- subset lets you train on only a selected subset of your data, based on some boolean filter.
- na.action lets you determine what to do when you hit a missing value in your dataset. (A sketch of subset and na.action in use follows this list.)
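For illustration, here is a minimal, optional sketch of those last two parameters; the boolean filter used for subset is an assumption for demonstration, not something the rest of the tutorial relies on:

# Train only on customers who purchased previously; leave NAs untouched
# (na.pass is already the default na.action).
nb_subset <- naiveBayes(response ~ ., data = train,
                        subset = purchased_previously == 1,
                        na.action = na.pass)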
The only parameter we have reason to change in this instance is the Laplace smoothing value. Since the naive Bayes algorithm is so simple, we don’t need to spend much time setting up parameters in the first place.
Laplace Smoothing
nb_laplace1 <- naiveBayes(response~., data=train, laplace=1)
laplace1_pred <- predict(nb_laplace1, test, type="class")

table(laplace1_pred, test$response, dnn=c("Prediction","Actual"))
#          Actual
#Prediction   0   1
#         0 143  11
#         1   7  19
The naiveBayes function includes the laplace parameter. Whatever positive value you set it to is added to the count of every factor level in every class, so no conditional probability ever collapses to zero.
We can see that the conditional probabilities for the two models are now different. The bigger the Laplace smoothing value, the more you pull the conditional probabilities toward uniform and the more alike the classes look. We’ll talk more about these conditional probability tables in the Structure of naiveBayes Model Object section.
nb_default$tables$opened_previously
#   opened_previously
#Y           0         1
#  0 0.8142857 0.1857143
#  1 0.8142857 0.1857143

nb_laplace1$tables$opened_previously
#   opened_previously
#Y           0         1
#  0 0.8125000 0.1875000
#  1 0.8055556 0.1944444
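If you want to verify the smoothing by hand, the arithmetic is simple. The helper below is a sketch of my own (smoothed_prob is not part of e1071), assuming the standard add-k rule: add laplace to each cell of the class-by-level count table, then divide by the class count plus laplace times the number of levels.

# Hand-rolled Laplace-smoothed conditional probability (illustrative helper)
smoothed_prob <- function(x, y, level, class, laplace = 1) {
  counts <- table(y, x)            # class-by-level counts
  n_levels <- nlevels(factor(x))   # number of levels of the feature
  (counts[class, level] + laplace) /
    (sum(counts[class, ]) + laplace * n_levels)
}

# Example: smoothed P(opened_previously = "0" | response = "1")
smoothed_prob(train$opened_previously, train$response, "0", "1", laplace = 1)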
Data Allowed in Training
The naiveBayes function takes in numeric or factor variables in a data frame or a numeric matrix. It’s important to note that a single vector will not work as the input data, but a vector is fine for the dependent variable (Y).
- Factor variables and character variables are accepted.
- Character variables are coerced into factors (see the quick check after this list).
- Numeric variables are assumed to follow a normal distribution within each class.
- Each numeric value is then scored against that class-conditional distribution.
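As a quick check of the coercion point: test_var from the data script was created as a plain character vector (on older versions of R, data.frame may already have converted it to a factor via stringsAsFactors), yet the fitted model still carries a conditional probability table for it.

str(train$test_var)        # character (or factor, depending on your R version)
nb_default$tables$test_var # a conditional probability table, just like a factor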
Structure of naiveBayes Model Object
Here’s the boring stuff:
- apriori holds the class distribution of the response in your training set, which serves as the prior probability for each class.
- levels lists the allowable classes in your model.
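Both are easy to inspect on the fitted model:

nb_default$apriori # class counts for the response in the training data
nb_default$levels  # the class labels the model can predict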
Now on to the interesting stuff!
The tables attribute stores the conditional probabilities for each factor attribute and class combination. We can easily calculate these same tables ourselves using table and prop.table.
prop.table(table(train$response, train$opened_previously,
                 dnn=c("Response","Past Opened")), 1) # Probabilities across rows
#         Past Opened
#Response         0         1
#       0 0.7959770 0.2040230
#       1 0.8055556 0.1944444

nb_default$tables$opened_previously
#   opened_previously
#Y           0         1
#  0 0.7959770 0.2040230
#  1 0.8055556 0.1944444
You can see how much cleaner it is to use the naiveBayes results rather than calculating the tables by hand. Imagine having tens of factors. Even with copy+paste, it would be a chore.
For the continuous / numeric variables, instead of a conditional probability table, the naiveBayes function returns a table of the mean and standard deviation (in that order) for each class.
summary(train$sales_12mo)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 20.60   43.49   52.08   51.33   58.76   76.76

nb_default$tables$sales_12mo # Returns mean and then standard deviation
#   sales_12mo
#Y       [,1]      [,2]
#  0 49.64311 10.767762
#  1 59.46683  4.748088
These mean and standard deviation calculations define the normal distribution for each class. You can mirror what the naiveBayes function is doing by using dnorm(x, mean=, sd=) for each class; note it is the density (dnorm), not the cumulative probability (pnorm), that feeds the class scores.
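Here is a short sketch (my own, not from the original tutorial) that pulls the stored means and standard deviations out of the model and scores a hypothetical sales value under each class:

means <- nb_default$tables$sales_12mo[, 1] # per-class means
sds   <- nb_default$tables$sales_12mo[, 2] # per-class standard deviations

x <- 55 # a hypothetical sales_12mo value
dnorm(x, mean = means["0"], sd = sds["0"]) # density under class 0
dnorm(x, mean = means["1"], sd = sds["1"]) # density under class 1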
Bucket Numeric Variables
If you weren’t satisfied with using a Gaussian distribution, you could manually discretize / bucketize your numeric variables using functions like hist() or cut().
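As a sketch (the breakpoints below are made up for illustration), cut() turns the numeric column into a factor, which naiveBayes will then describe with a conditional probability table instead of a Gaussian:

# Bucket sales_12mo into four illustrative ranges
breaks <- c(-Inf, 40, 50, 60, Inf)
labels <- c("low", "mid_low", "mid_high", "high")
train$sales_bucket <- cut(train$sales_12mo, breaks = breaks, labels = labels)
test$sales_bucket  <- cut(test$sales_12mo,  breaks = breaks, labels = labels)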
Predicting with Naive Bayes Classifier
After creating the naive Bayes model object, you can use the generic predict function to create a prediction.
default_pred <- predict(nb_default, test, type="class")
predict will, by default, return the class with the highest probability for each predicted row.
Return Matrix of Class Probabilities
default_raw_pred <- predict(nb_default, test, type="raw")
The predict function allows you to specify whether you want the most probable class or the probability for every class. Nothing changes except that the type parameter is set to "raw".
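As a sanity check (a sketch, not from the original tutorial), you can recover the type="class" answer from the raw matrix by taking the most probable column in each row:

manual_class <- colnames(default_raw_pred)[max.col(default_raw_pred)]
all(manual_class == as.character(default_pred)) # TRUE, barring exact ties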
If your data isn’t well distributed across the class variable, R can have trouble handling it. If any class has only one instance, the naiveBayes model will still train, and predicting the most likely class will also work. It’s type="raw" that will fail.
Per this StackOverflow post, you will need to duplicate your data. The duplication of every row keeps the proportions the same but allows the raw prediction method to work.
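A sketch of that workaround (assuming simple row duplication is acceptable for your use case):

# Stack the training data on itself: class proportions are unchanged,
# but every class now has at least two instances.
train_doubled <- rbind(train, train)
nb_doubled <- naiveBayes(response ~ ., data = train_doubled)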
Data Used
set.seed(1)
no_resp <- 500
resp <- 100
response <- factor(c(rep(0, no_resp), rep(1, resp)))
purchased_previously <- factor(c(sample(0:1, no_resp, prob=c(0.6, 0.4), replace=T),
                                 sample(0:1, resp, prob=c(0.2, 0.8), replace=T)))
opened_previously <- factor(sample(0:1, (no_resp+resp), prob=c(0.8, 0.2), replace=T))
sales_12mo <- c(rnorm(n=no_resp, mean=50, sd=10),
                rnorm(n=resp, mean=60, sd=5))
none_open_buy <- factor(c(sample(0:1, no_resp, prob=c(0.8, 0.2), replace=T),
                          rep(1, resp)))
test_var <- sample(LETTERS[1:2], (resp+no_resp), replace=T)

naive_data <- data.frame(purchased_previously = purchased_previously,
                         opened_previously = opened_previously,
                         sales_12mo = sales_12mo,
                         none_open_buy = none_open_buy,
                         test_var = test_var,
                         response = response)

naive_data <- naive_data[sample(1:nrow(naive_data), nrow(naive_data)),]

train <- naive_data[1:(nrow(naive_data)*.7),]
test  <- naive_data[(nrow(naive_data)*.7+1):nrow(naive_data),]