Summary: The e1071 package contains the naiveBayes function. It allows numeric and factor variables to be used in the naive Bayes model. Laplace smoothing prevents zero conditional probabilities for factor levels that never appear with a class in the training data. Predictions can be made for the most likely class or as a matrix of probabilities for every class.
Tutorial Time: 20 minutes
Data Being Used: Simulated data for response to an email campaign. Includes binary purchase history, email open history, sales in past 12 months, and a response variable to the current email. See “Data Used” section at the bottom to get the R script to generate the dataset.
Training a Naive Bayes Classifier
Before you start building a Naive Bayes classifier, check that you know how a naive Bayes classifier works.
To get started in R, you’ll need to install the e1071 package, which is developed and maintained at TU Wien (Vienna University of Technology).
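If you don’t already have it, installing from CRAN is one line (shown here with the library call the rest of the tutorial assumes):

install.packages("e1071") # one-time download from CRAN
library(e1071)            # load the package into your session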
Default Parameters
library(e1071)

# Default parameters
nb_default <- naiveBayes(response~., data=train[,-4])
default_pred <- predict(nb_default, test, type="class")

table(default_pred, test$response, dnn=c("Prediction","Actual"))
#          Actual
#Prediction   0   1
#         0 138  16
#         1  12  14
The naiveBayes function is a simple, elegant implementation of the naive Bayes algorithm. There are really only a handful of parameters you should consider.
naiveBayes(formula, data, laplace = 0, subset, na.action = na.pass)
- The formula is the traditional Y~X1+X2+…+Xn
- The data is typically a dataframe of numeric or factor variables.
- laplace provides a smoothing effect (as discussed below).
- subset lets you train on only a selected subset of your data, based on some boolean filter.
- na.action lets you determine what to do when you hit a missing value in your dataset. (A sketch of subset and na.action in use follows this list.)
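For illustration, here is a minimal, optional sketch of those last two parameters; the boolean filter used for subset is an assumption for demonstration, not something the rest of the tutorial relies on:

# Train only on customers who purchased previously; leave NAs untouched
# (na.pass is already the default na.action).
nb_subset <- naiveBayes(response ~ ., data = train,
                        subset = purchased_previously == 1,
                        na.action = na.pass)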
The only parameter we have reason to change in this instance is the Laplace smoothing value. Since the naive Bayes algorithm is so simple, we don’t need to spend much time setting up parameters in the first place.
Laplace Smoothing
nb_laplace1 <- naiveBayes(response~., data=train, laplace=1)
laplace1_pred <- predict(nb_laplace1, test, type="class")

table(laplace1_pred, test$response, dnn=c("Prediction","Actual"))
#          Actual
#Prediction   0   1
#         0 143  11
#         1   7  19
The naiveBayes function includes the laplace parameter. Whatever positive value you set it to is added to the count of every factor level in every class, so no conditional probability ever collapses to zero.
We can see that the conditional probabilities for the two models are now different. The bigger the Laplace smoothing value, the more you pull the conditional probabilities toward uniform and the more alike the classes look. We’ll talk more about these conditional probability tables in the Structure of naiveBayes Model Object section.
nb_default$tables$opened_previously
#   opened_previously
#Y           0         1
#  0 0.8142857 0.1857143
#  1 0.8142857 0.1857143

nb_laplace1$tables$opened_previously
#   opened_previously
#Y           0         1
#  0 0.8125000 0.1875000
#  1 0.8055556 0.1944444
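If you want to verify the smoothing by hand, the arithmetic is simple. The helper below is a sketch of my own (smoothed_prob is not part of e1071), assuming the standard add-k rule: add laplace to each cell of the class-by-level count table, then divide by the class count plus laplace times the number of levels.

# Hand-rolled Laplace-smoothed conditional probability (illustrative helper)
smoothed_prob <- function(x, y, level, class, laplace = 1) {
  counts <- table(y, x)            # class-by-level counts
  n_levels <- nlevels(factor(x))   # number of levels of the feature
  (counts[class, level] + laplace) /
    (sum(counts[class, ]) + laplace * n_levels)
}

# Example: smoothed P(opened_previously = "0" | response = "1")
smoothed_prob(train$opened_previously, train$response, "0", "1", laplace = 1)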
Data Allowed in Training
The naiveBayes function takes in numeric or factor variables in a data frame or a numeric matrix. It’s important to note that a single vector will not work as the input data, but a vector is fine for the dependent variable (Y).
- Factor variables and character variables are accepted.
- Character variables are coerced into factors (see the quick check after this list).
- Numeric variables are assumed to follow a normal distribution within each class.
- Each numeric value is then scored against that class-conditional distribution.
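As a quick check of the coercion point: test_var from the data script was created as a plain character vector (on older versions of R, data.frame may already have converted it to a factor via stringsAsFactors), yet the fitted model still carries a conditional probability table for it.

str(train$test_var)        # character (or factor, depending on your R version)
nb_default$tables$test_var # a conditional probability table, just like a factor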
Structure of naiveBayes Model Object
Here’s the boring stuff:
- apriori holds the class distribution of the response in your training set, which serves as the prior probability for each class.
- levels lists the allowable classes in your model.
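Both are easy to inspect on the fitted model:

nb_default$apriori # class counts for the response in the training data
nb_default$levels  # the class labels the model can predict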
Now on to the interesting stuff!
The tables attribute stores the conditional probabilities for each factor attribute and class combination. We can easily calculate these same tables ourselves using table and prop.table.
prop.table(table(train$response, train$opened_previously,
                 dnn=c("Response","Past Opened")), 1) # Probabilities across rows
#         Past Opened
#Response         0         1
#       0 0.7959770 0.2040230
#       1 0.8055556 0.1944444

nb_default$tables$opened_previously
#   opened_previously
#Y           0         1
#  0 0.7959770 0.2040230
#  1 0.8055556 0.1944444
You can see how much cleaner it is to use the naiveBayes results rather than calculating the tables by hand. Imagine having tens of factors. Even with copy+paste, it would be a chore.
For the continuous / numeric variables, instead of a conditional probability table, the naiveBayes function returns a table of the mean and standard deviation (in that order) for each class.
summary(train$sales_12mo)
#  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 20.60   43.49   52.08   51.33   58.76   76.76

nb_default$tables$sales_12mo # Returns mean and then standard deviation
#   sales_12mo
#Y       [,1]      [,2]
#  0 49.64311 10.767762
#  1 59.46683  4.748088
These mean and standard deviation calculations define the normal distribution for each class. You can mirror what the naiveBayes function is doing by using dnorm(x, mean=, sd=) for each class; note it is the density (dnorm), not the cumulative probability (pnorm), that feeds the class scores.
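Here is a short sketch (my own, not from the original tutorial) that pulls the stored means and standard deviations out of the model and scores a hypothetical sales value under each class:

means <- nb_default$tables$sales_12mo[, 1] # per-class means
sds   <- nb_default$tables$sales_12mo[, 2] # per-class standard deviations

x <- 55 # a hypothetical sales_12mo value
dnorm(x, mean = means["0"], sd = sds["0"]) # density under class 0
dnorm(x, mean = means["1"], sd = sds["1"]) # density under class 1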
Bucket Numeric Variables
If you weren’t satisfied with using a Gaussian distribution, you could manually discretize / bucketize your numeric variables using functions like hist() or cut().
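As a sketch (the breakpoints below are made up for illustration), cut() turns the numeric column into a factor, which naiveBayes will then describe with a conditional probability table instead of a Gaussian:

# Bucket sales_12mo into four illustrative ranges
breaks <- c(-Inf, 40, 50, 60, Inf)
labels <- c("low", "mid_low", "mid_high", "high")
train$sales_bucket <- cut(train$sales_12mo, breaks = breaks, labels = labels)
test$sales_bucket  <- cut(test$sales_12mo,  breaks = breaks, labels = labels)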
Predicting with Naive Bayes Classifier
After creating the naive Bayes model object, you can use the generic predict function to create a prediction.
default_pred <- predict(nb_default, test, type="class")
predict will, by default, return the class with the highest probability for each predicted row.
Return Matrix of Class Probabilities
default_raw_pred <- predict(nb_default, test, type="raw")
The predict function allows you to specify whether you want the most probable class or the probability for every class. Nothing changes except that the type parameter is set to "raw".
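As a sanity check (a sketch, not from the original tutorial), you can recover the type="class" answer from the raw matrix by taking the most probable column in each row:

manual_class <- colnames(default_raw_pred)[max.col(default_raw_pred)]
all(manual_class == as.character(default_pred)) # TRUE, barring exact ties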
If your data isn’t well distributed across the class variable, R can have trouble handling it. If any class has only one instance, the naiveBayes model will still train, and predicting the most likely class will also work. It’s type="raw" that will fail.
Per this StackOverflow post, you will need to duplicate your data. The duplication of every row keeps the proportions the same but allows the raw prediction method to work.
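A sketch of that workaround (assuming simple row duplication is acceptable for your use case):

# Stack the training data on itself: class proportions are unchanged,
# but every class now has at least two instances.
train_doubled <- rbind(train, train)
nb_doubled <- naiveBayes(response ~ ., data = train_doubled)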
Data Used
set.seed(1)
no_resp <- 500
resp <- 100
response <- factor(c(rep(0, no_resp), rep(1, resp)))
purchased_previously <- factor(c(sample(0:1, no_resp, prob=c(0.6, 0.4), replace=T),
                                 sample(0:1, resp, prob=c(0.2, 0.8), replace=T)))
opened_previously <- factor(sample(0:1, (no_resp+resp), prob=c(0.8, 0.2), replace=T))
sales_12mo <- c(rnorm(n=no_resp, mean=50, sd=10),
                rnorm(n=resp, mean=60, sd=5))
none_open_buy <- factor(c(sample(0:1, no_resp, prob=c(0.8, 0.2), replace=T),
                          rep(1, resp)))
test_var <- sample(LETTERS[1:2], (resp+no_resp), replace=T)

naive_data <- data.frame(purchased_previously = purchased_previously,
                         opened_previously = opened_previously,
                         sales_12mo = sales_12mo,
                         none_open_buy = none_open_buy,
                         test_var = test_var,
                         response = response)

naive_data <- naive_data[sample(1:nrow(naive_data), nrow(naive_data)),]

train <- naive_data[1:(nrow(naive_data)*.7),]
test  <- naive_data[(nrow(naive_data)*.7+1):nrow(naive_data),]