Machine Learning: Ruby and the Naive Bayes Theorem

We’ve all heard of the “Big Data” trend in computer science: everyone and their grandmother is trying to garner insights from large datasets. Often, the tool for the job is machine learning. A large portion of the Ruby community revolves around web development; even here, some basic machine learning knowledge can come in handy when applied creatively. This article will get you up to speed with a simple (but often quite effective) machine learning technique: the Naive Bayes Classifier.

We’ll go through the underlying theory, which does require some math background but not too much. Then, we’ll build a reusable implementation of the classifier and work it through an example dataset.

Classifiers

First of all, what is a “classifier”, specifically? The concept is pretty simple. Given some descriptions (called “features”) about a given object, the classifier classifies this object into a category. Obviously, this is not a perfect process and every now and then, we get an object that has characteristics that could fit into both categories. Classifiers are generally rated on the accuracy of their classifications.

An exemplary example of classification is spam detection. Given a piece of text, the classifier has to decide whether or not it is spam. A good question to ask would be, “What characteristics about the piece of text determine whether or not it is spam?”. These characteristics or features can be things like the length of the text, the presence of certain words, or the number of links in the email. The classifier then uses the values of these features (they are either boolean or real numbers) to make the classification.

The Naive Bayes Classifier and Bayes’ Theorem

So far, we’ve considered the general ideas behind a classifier, but not the actual mechanism. There are lots of ways to actually go about classifying stuff, each with its own strengths and weaknesses. We’ll consider the Naive Bayes Classifier (NBC) in this article.

The examples below can be copied and pasted into the text box on this page for cleaner rendering.

Let’s say we have classes (i.e. categories) \[y_1, y_2, …, y_k\] and we are given the set of observations \[X = x_1, x_2, …, x_n\] The NBC is trying to find the value of:

\[P(y_k \vert X = X = x_1, x_2, …, x_n)\]

In other words, given an object and some set of feature values, the NBC tries to figure out the probability of that object being part of a given class. We do this for every class and then classify the object into the class that gave us the highest probability. Of course, so far we’ve just specified what we want to find, not exactly how to find it. That’s where Bayes’ theorem comes into play. It says:

\[P(A \vert B) = \frac{P(B \vert A) \cdot P(A)}{B} \]

In English, that says:

To find the probability of event A happening (for example, rain) given that event B happens (for example, cloudy skies)
multiply the probabilty of B happening given A (i.e. cloudy skies given that its raining) and the probability of A happening (i.e. chance of raining)
divide by the probability of B (i.e. chance of cloudy skies).

That is a bit complicated; reading the explanation a couple times until it “sets” might help (drop questions in the comments below!).

We can use Bayes’ Theorem to say:

\[P(y_k \vert X = x_1, x_2, …, x_n) = \frac{P(X = x_1, x_2, .. x_n \vert y_k) P(y_k)}{P(x_1, x_2, … x_n)} \]

That’s a pretty complicated looking expression. But, it is just an application of Bayes’ Theorem. Instead of rain and cloudy skies, we are using the class and the feature values.

We still don’t know how to figure out \[P(X = x_1, x_2, .. x_n \vert y_k)\] (which is the probability of the feature values occurring, given they are part of a certain class.) Here we make something called the “independence assumption”, which says that “given a certain class, the values of the features are independent of one another.”

For example, if we know that it is breakfast time, the chance of it being sunny and me being asleep are independent of one another. There are cases where this assumption is catastrophically wrong, but there are lots of cases where it is reasonably accurate.

As it happens, we can write this down as our expression:

\[P(y_k \vert X = X = x_1, x_2, …, x_n) = \frac{P(x_1 \vert y_k) P(x_2 \vert y_k)…\cdot P(x_k \vert y_k) P(y_k)}{P(x_1, x_2, … x_n)} \]

We can think of this probability as a score. Once our NBC looks at the feature values, it assigns each of the classes a score and the highest one is the right classification for the feature values. But the denominator of the expression has nothing to do with the class \[y_k\] and is therefore constant for all the classes so we can just get rid of it. Our final score expression can be written as:

\[\text{Class score for } y_k = P(x_1 \vert y_k) P(x_2 \vert y_k)…\cdot P(x_k \vert y_k) P(y_k)\]

So, we’ve broken down the problem to the point where we need to know only two types of things: the chance that a given feature value is part of a class \[P(x_i \vert y_k)\] and the chance of a class occurring \[P(y_k)\]

The latter can be made pretty simple if the classes are supposed to be of roughly the same size:

\[P(y_k) = \frac{1}{k}\]

The second one is a bit more difficult. Fortunately, this 18th century guy called Gauss figured it out for us (using the normal distribution, which doesn’t work everywhere but should for many cases):

\[P(x_i | y_k) = \frac{1}{\sqrt{2\pi\sigma^2_c}}\,e^{ -\frac{(v-\mu_c)^2}{2\sigma^2_c} } \]

In the above expression, \[\sigma_c\] represents the standard deviation of the i’th feature of the \[y_k\] class, and the \[\mu_c\] represents the mean of the same. If this equation seems a bit like mumbo jumbo, the only thing you really need to understand from it is that it uses the idea that values far away from the mean value have a lower chance of occurring (or, more accurately, it uses a normal distribution).

Enough math for the time-being; let’s get to the Ruby.

Implementation

First of all, we need to work out how to represent our training data. We need to have class names and for each class name, a corresponding set of training feature values. We can make it look like this:

{
  "class1" => [
    [feature1, feature2, feature3, ..., featureN],
    [feature1, feature2, feature3, ..., featureN]
  ],

  "class2" => [
    [feature1, feature2, feature3, ..., featureN],
    [feature1, feature2, feature3, ..., featureN]
  ],
}

For parts of the implementation (such as when computing the Gaussian probabilities), we need to compute statistical information about certain sets. To do that, we’ll be using the descriptive_statistics gem.

We can start by sketching out some of the basic methods:

#used for mean, standard deviation, etc.
require 'descriptive_statistics'

class NaiveBayes

  #training data format:
  #{class-name: [[parameter1, parameter2, parameter3, ...],
  #              [parameter1, parameter2, parameter3,...]}
  def initialize(training_data, dim)
    @training_data = training_data
    @dimension = dim

  end

  def num_classes
    return @training_data.length
  end

  def classes
    return @training_data.keys
  end

Notice that @dimension refers to the number of features we have in every feature set (denoted “N” in the training data format described above). Now that we have some utility methods, we can actually get cracking on the meat of the NBC algorithm. In order to compute \[P(x_i \vert y_k)\] we have to compute the mean and standard deviation of the occurrences of a feature.

Here’s what I mean. If we have the following training set:

{
  ...
  "class1" => [
  [1, 1, 2],
  [1, 2, 2],
  [1, 16, 2],
  [1, 4, 2]
  ]
}

Then the standard deviation for the first and last feature (which don’t change) would be 0, whereas the same measure for the second feature is a positive number. In order to compute this information, we need a method that can put all the features of a certain index in a given class into one set. Ask and you shall receive:

def feature_set(index, class_name)
  feature_set = []
  training_set = @training_data[class_name]

  training_set.length.times do |i|
    feature_set << training_set[i][index]
  end

  return feature_set
end

This also outlines something that differentiates machine learning code from web code. On the web, you’re able to see errors much more quickly than you would in many ML applications. Since there’s always possibility of your classifier’s classification being wrong, it is difficult to detect whether the source of error is a coding mistake or an algorithm deficiency. So, in my experience, ML code is generally more explicit than the stuff we write for the web.

Now, we can put together the code for \[P(x_i \vert y_k)\]

#given a class, this method determines the probability
#of a certain value ocurring for a given feature
#index: index of the feature in consideration in the training data
#value: the value of the feature for which we are finding the probability
#class_name: name of the class in consideration
def feature_probability(index, value, class_name)
  #get the feature value set
  fs = feature_set(index, class_name)

  #statistical properties of the feature set
  fs_std = fs.standard_deviation
  fs_mean = fs.mean
  fs_var = fs.variance

  #deal with the edge case of a 0 standard deviation
  if fs_std == 0
    return fs_mean == value ? 1.0 : 0.0
  end

  #calculate the gaussian probability
  pi = Math::PI
  e = Math::E

  exp = -((value - fs_mean)**2)/(2*fs_var)
  prob = (1.0/(Math.sqrt(2*pi*fs_var))) * (e**exp)

  return prob
end

Going back to our final expression for the class score:

\[\text{Class score for } y_k = P(x_1 \vert y_k) P(x_2 \vert y_k)…\cdot P(x_k \vert y_k) P(y_k)\]

We need to multiply together the \[P(x_i \vert y_k)\] for all the indices finally multiplying that by \[P(y_k)\] to get the result. Ruby makes it easy:

#multiply together the feature probabilities for all of the 
#features in a class for given values
def feature_mult(feature_values, class_name)
  res = 1.0
  feature_values.length.times do |i|
    res *= feature_probability(i, feature_values[i], class_name)
  end

  return res
end

#this is where we compute the final naive Bayesian probability
#for a given set of features being a part of a given class.
def class_probability(feature_values, class_name)
  class_fraction = 1.0 / num_classes
  feature_bayes = feature_mult(feature_values, class_name)
  res = feature_bayes * class_fraction
  return res
end

Finally, we just have to sort the classes by their score and then return the one with the highest score (that is the classification for a given set of feature values):

#This the method we should be calling!
#Given a set of feature values, it decides
#what class to categorize them under
def classify(feature_values)
  res = classes.sort_by do |class_name|
    class_probability(feature_values, class_name)
  end

  return res[-1]
end

The code for all of these pieces is very straightforward (except possibly the feature_probability function, but it is essentially the translation of a formula into Ruby) and that’s in large part thanks to Ruby. Blocks, nice loop syntax, easy sorting with a key block, etc. all make Ruby a very pleasant language for machine learning.

Test Drive

Now that we’ve put together the NaiveBayes class, we can test it out. The Iris dataset can be considered something of a standard in the machine learning community. It provides data about petal and sepal dimensions for certain Iris flowers and a species label for each entry.

An example of an entry would be:

4.8,3.0,1.4,0.3,Iris-setosa

Our goal is to train our Naive Bayes Classifier to take feature values (the numbers) and tell us what kind of species the numbers are likely to be. Essentially, we just have to put together the training data set and quiz the classifier with a separate file of entries which are not part of the training set. You can find both the code and the data in this article’s repository. As it happens, the NBC is able to correctly identify 11 out of 12 feature values, which is pretty good considering the small training set.

Wrapping It Up

The go to language for MLers these days seems to be Python. In part, this might be due to the fact that Python has extensive library support in the form of scikit-learn and numpy and also because Ruby’s web-centered community is not extremely active in machine learning. I think that Ruby is a nice language to implement machine learning algorithms in (particularly simple ones) and also I feel that Web developers would also do well to pick up some ML tricks!

Frequently Asked Questions (FAQs) about Machine Learning with Ruby and Naive Bayes Theorem

How does the Naive Bayes theorem work in machine learning?

The Naive Bayes theorem is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. This assumption is ‘naive’ because it is rarely true in real-world data sets. However, the technique is particularly suited to large data sets and can often outperform more sophisticated classification methods.

Why use Ruby for machine learning?

Ruby is a high-level, interpreted programming language that is easy to write and read. It has a clean syntax that makes it a good choice for beginners. Ruby also has a number of libraries and tools for machine learning, including the ruby-band library for data analysis and the classifier-reborn gem for implementing the Naive Bayes algorithm.

How accurate is the Naive Bayes classifier?

The accuracy of the Naive Bayes classifier can vary depending on the data set. It performs well when the features are independent, but its performance can degrade if this assumption is violated. However, it is often used in text classification and spam filtering, where it achieves high accuracy.

Can I use Naive Bayes for regression problems?

Naive Bayes is a classification algorithm, which means it is used to predict discrete outcomes. It is not suitable for regression problems, where the goal is to predict a continuous outcome.

How do I handle continuous data with Naive Bayes?

Naive Bayes is typically used with categorical data. For continuous data, you can use a version of Naive Bayes called Gaussian Naive Bayes. This version assumes that the continuous values associated with each class are distributed according to a Gaussian (normal) distribution.

What are the limitations of the Naive Bayes classifier?

The main limitation of the Naive Bayes classifier is the assumption of feature independence. If features are correlated, the classifier can produce incorrect probabilities. However, despite this limitation, it often performs well in practice.

How do I choose the right features for my Naive Bayes model?

Feature selection is an important step in building a Naive Bayes model. You should choose features that are relevant to the prediction task and independent of each other. Techniques such as correlation analysis and mutual information can help identify relevant and independent features.

How do I evaluate the performance of my Naive Bayes model?

You can use metrics such as accuracy, precision, recall, and F1 score to evaluate the performance of your Naive Bayes model. You can also use a confusion matrix to visualize the performance of the classifier.

Can I use Naive Bayes for multi-class classification problems?

Yes, Naive Bayes can be used for multi-class classification problems. In this case, the classifier calculates the probability of each class given the features, and predicts the class with the highest probability.

How do I handle missing data with Naive Bayes?

Missing data can be a challenge for any machine learning algorithm. With Naive Bayes, you can handle missing data by ignoring the missing values when calculating probabilities, or by using techniques such as mean imputation or regression imputation to fill in the missing values.