13 min read · Jun 21, 2022

The beauty of Bayesian statistics is, at the same time, one of its most annoying features: we often get answers in the form of “well, the number is somewhere between…”

If, on the one hand, this might be frustrating to business strategists or policymakers looking for straightforward guidance, on the other hand it is a fantastic way of knowing how wrong you can be.

To start with a simple example, let us say that we are given the task of finding the distribution of heights of students in a high school classroom. Maybe this school requires a uniform, and the uniform company wants to have some sense of how many uniforms of each size to produce, and knowing the distribution of heights will allow us to know what sizes to make.

Say we have a variable, *h*, that denotes the height of these students. What if we want to find out how *h* is distributed? That is, we might already have an idea of what the distribution looks like (height is usually normally distributed), but what are the parameters of said distribution? And why do we even care? Well, we care because knowing how a variable is distributed helps us predict what new data will look like. If we know height is normally distributed, for example, we know that a new data point will likely be close to the mean (if the mean is 1.75 m, then we expect a new student who joins the class to have a height close to that mean). Knowing the parameters of the distribution also allows us to sample from it (so we can create a sample of ‘fake’ students and thus have an idea of how many students of each height we might expect). Therefore, finding the distribution of a variable helps us with prediction problems.
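As a tiny illustration of that last point, here is how one might sample a handful of ‘fake’ students in Python; the mean of 1.75 m comes from the example above, while the 0.1 m standard deviation is just an illustrative choice:

```python
import numpy as np

# Hypothetical model: heights ~ N(1.75, 0.1)
rng = np.random.default_rng(7)

# Sample five 'fake' students from that distribution
fake_students = rng.normal(loc=1.75, scale=0.1, size=5)
print(fake_students)
```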

Bayesian inference is a method for figuring out the distribution of variables (like the distribution of the heights *h*). The interesting feature of Bayesian inference is that it is up to the statistician (or data scientist) to use their prior knowledge as a means to improve the guess of what the distribution looks like.

Bayesian inference depends on the principal formula of Bayesian statistics: Bayes’ theorem. Bayes’ theorem takes in our assumptions about what the distribution looks like along with a new piece of data, and outputs an updated distribution. For data science, Bayes’ theorem is usually presented as such:

P(θ|Data) = P(Data|θ) · P(θ) / P(Data)

Statisticians also gave each component of this theorem a name:

- P(θ) is the ‘prior’
- P(Data|θ) is the ‘likelihood’
- P(Data) is the ‘evidence’
- P(θ|Data) is the ‘posterior’

Let's go over them to understand them a bit better.

We can see from Bayes' theorem that the prior is a probability: P(θ). First, let's dig into what ‘θ’ means. θ is often expressed as our hypothesis for the model that best describes the variable we are trying to investigate. Let's go back to the example of heights. We infer, from background knowledge and common sense, that height is normally distributed in a class. Formally:

h∼N(μ,σ)

Where N denotes the normal distribution, μ denotes the mean and σ denotes the standard deviation.

Now, our prior isn’t exactly the expression above. Instead, it is our assumption for how each of the parameters, μ and σ, is distributed. Note that this is where the defining characteristic of Bayesian statistics shows up: how do we find the distribution of these parameters? Well, amusingly, we “make them up” based on our prior knowledge. If we have very little prior knowledge, we can choose a very uninformative prior so as not to bias the process at all. For example, we can posit that μ, the mean height, is somewhere between 1.65m and 1.8m. If we want an uninformative prior, we can say that μ is uniformly distributed along that interval. If, instead, we believe that the mean height is somewhat skewed toward a value closer to 1.65m than to 1.8m, we can define μ to follow a beta distribution (rescaled onto our interval, since the standard beta distribution is defined on [0, 1]), with “hyper” parameters α and β. We can take a look at these options below:

Note how the y-axis gives us the ‘probability density’, i.e. how likely we think it is that the true μ is the one on the x-axis. Also, note that the beta and uniform distributions lead to different conclusions about what the value of μ may be. If we choose the uniform distribution, we are saying that we have no inclination as to whether μ is close to any particular value within our range; we just think it is somewhere in there. If we choose the beta distribution, we are fairly sure that the ‘real’ value of μ is somewhere between 1.68m and 1.72m, as shown by the peak along the blue line.

Note that we are discussing the prior for μ, but that our model actually has two parameters: N(μ,σ). Generally, we can define a prior over σ as well. However, if we are feeling lucky about our guess for σ, or if we want to simplify the process for the sake of an example, we can just set σ to be a fixed value, like 0.1m.

The likelihood is expressed as P(Data|θ). The ‘data’ in this case would be an observed value for the height. Say we get to measure one student, picked at random, and their height is 1.7m. Consider that with this datum we can now have a sense of how good each option for θ is. We do this by asking: if a particular option for θ, call it θ1, were the true one, how ‘likely’ would it be for us to observe a height of 1.7m? How about θ2: how likely would it be to observe a height of 1.7m if θ2 were the ‘correct’ model?

Indeed, what we want at this stage is a function that will systematically look at every possibility for the model θ and find the probability that this model would have ‘produced’ or ‘predicted’ the observed value. Remember that in this case, our model θ is defined as N(μ,σ), i.e. a normal distribution with mean μ and standard deviation σ, and that we are keeping σ constant to simplify the process. Therefore, our function will take in possible values of μ as our ‘independent’ variable, along with our observed data point of 1.7m, and it will output the probability that each model is the correct model as our ‘dependent’ variable.

Note that we run into a slight hiccup here: the ‘probability’ that a particular model is the correct one would technically be zero because, in theory, the possibilities for θ are endless. For this reason, I discretized the possibilities for θ, making it such that there are 50 options for θ between 1.65m and 1.8m. We then use the probability density function of each proposed model to evaluate the probability density of the observed datum under that model. The probability density isn’t the same as the probability; it just gives us a relative measure of how likely it is that the point would be observed under each of the model options. To get true probabilities, we would have to ‘normalize’ the probability densities, making it such that adding all of their values would give us 1.

This turns out not to be a huge deal, however, because we can still compare how likely each model is. It’s as if we are asking: constraining the possibilities to this discrete set of models, which somewhat comprehensively covers the plausible range, which model is the best? Nevertheless, we are going to normalize the probability densities as well, as you will see below.

This function is called the ‘likelihood’ function, and we typically define it by the probability ‘density’ function (PDF) of the model we are proposing, evaluated at the new data point. Thus, we want the “PDF” of the distribution N(μ, 0.1m) for the data point 1.7m.

Note that for those who have used PDFs before, this may seem a bit backward. With PDFs, we usually have a fixed model, e.g. N(1.8,0.1), and we use it to evaluate the probability density of different values of our variable *h*. This would mean that we would have the variable *h* on the x-axis, and the probability density on the y-axis.

For our current purposes, however, we are varying the distribution/model itself. This means that our x-axis will actually have the different possibilities for our variable μ, and our y-axis will have the probability density for each of these possibilities. Have a look at the code below, which implements our likelihood function:
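A minimal sketch of such a likelihood function, with scipy assumed and the plotting step omitted; σ is held at the fixed value of 0.1m from earlier:

```python
import numpy as np
from scipy import stats

SIGMA = 0.1  # standard deviation, held fixed as a simplification
mu_grid = np.linspace(1.65, 1.80, 50)  # 50 candidate values for mu

def likelihood(datum, mu_values, sigma=SIGMA):
    """Evaluate the PDF of N(mu, sigma) at the observed datum,
    once for each candidate value of mu."""
    return stats.norm.pdf(datum, loc=mu_values, scale=sigma)

# One measured student: 1.7 m
likelihoods = likelihood(1.7, mu_grid)
```

Plotting `mu_grid` on the x-axis against `likelihoods` on the y-axis gives the visualization described above.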

See how our function simply shows us which values of μ are most likely? See how this gives us a relative understanding of the probability, which statisticians prefer to call the ‘likelihood’ for each possible μ? With this evaluation of the likelihood, we get a preemptive inclination as to what the ‘best’ value for μ might be. We then have to combine this with the prior, however, to get our final ‘guess’ on what the best value of μ is.

Some statisticians call P(*Data*) the ‘evidence’. The meaning of this term is pretty straightforward: it is the probability of observing the value *Data*. However, it is tough to calculate directly. Thankfully, we have a good trick up our sleeves.

Consider the following expression:

P(Data) = ∫ P(Data|θ) · P(θ) dθ

Can you see why it makes sense? Intuitively, what we are saying is that if we were to sum the probability evaluations over every hypothesis θ, we’d exhaust all the hypothesis probabilities and be left with just P(Data). Note that P(θ) is a probability distribution, and it sums to 1 if we add together its outputs for every θ. Also note that ∫P(Data|θ)·P(θ)dθ is equivalent to finding the area under the curve of the graph with P(Data|θ)·P(θ) on the y-axis and θ on the x-axis; we will do exactly this in the next step.

The left-hand side of Bayes’ theorem, P(θ|Data), is called the ‘posterior’. It is our updated understanding of how the variable is distributed, given that we witnessed the data and that we started with a prior about it.

How do we get the posterior? Going back to the equation:

P(θ|Data) = P(Data|θ) · P(θ) / P(Data)

The first step, then, is to multiply the likelihood (P(Data|θ)) and the prior (P(θ)):
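This multiplication might look like the following sketch, reusing the uniform prior over a 50-point grid and the single observation of 1.7m from earlier:

```python
import numpy as np
from scipy import stats

mu_grid = np.linspace(1.65, 1.80, 50)
prior = stats.uniform.pdf(mu_grid, loc=1.65, scale=0.15)   # uniform prior
likelihoods = stats.norm.pdf(1.7, loc=mu_grid, scale=0.1)  # datum: 1.7 m

# Element-wise product of likelihood and prior, for every candidate mu
unnormalized_posterior = likelihoods * prior
```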

We can see that, if we were to sum the probability densities in this unnormalized posterior, we would get a value different from 1.

Note that we still have ‘P(Data)’ to contend with, hence why I call this the ‘unnormalized_posterior’.

Remember from above that:

P(Data) = ∫ P(Data|θ) · P(θ) dθ

We can simply compute that integral, make that final division, and there we go: we get the posterior. That being said, it’s nice to think a bit more about why we divide by P(Data). Note that calculating P(Data) is the same as finding the area between the unnormalized posterior and the x-axis. Because the posterior is a probability distribution, the area bounded by its PDF must equal 1. To make sure that this is the case, we find the current area under the curve, and then we divide each probability density value by that number. This gives us the normalized version!

Also, note that it would be very challenging to compute the integral analytically. Instead, we can rely on an approximation using scipy and the ‘trapezium’ (trapezoidal) rule, which needs only the x-axis and y-axis values to estimate the area. This is the implementation I used below:
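A sketch of that normalization step, using `trapezoid` from `scipy.integrate` (scipy's name for the trapezium rule):

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid  # the 'trapezium' rule

mu_grid = np.linspace(1.65, 1.80, 50)
prior = stats.uniform.pdf(mu_grid, loc=1.65, scale=0.15)
likelihoods = stats.norm.pdf(1.7, loc=mu_grid, scale=0.1)
unnormalized_posterior = likelihoods * prior

# P(Data): the area under the unnormalized posterior
evidence = trapezoid(unnormalized_posterior, mu_grid)
posterior = unnormalized_posterior / evidence
```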

Can we verify that this new distribution represents a valid PDF? Remember that for that to be the case, the area under the curve must equal 1. We can quickly compute that value to check:
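Rebuilding the posterior from the previous steps so the snippet stands alone, the check might look like:

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

mu_grid = np.linspace(1.65, 1.80, 50)
unnormalized = (stats.norm.pdf(1.7, loc=mu_grid, scale=0.1)
                * stats.uniform.pdf(mu_grid, loc=1.65, scale=0.15))
posterior = unnormalized / trapezoid(unnormalized, mu_grid)

# The area under a valid PDF must equal 1
area = trapezoid(posterior, mu_grid)
print(area)  # approximately 1.0
```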

Note that so far, our Bayesian inference went through the following steps:

- Defining our prior(s)
- Defining our likelihood function
- Observing one datum
- Using Bayes’ theorem to find the posterior distribution (and using the trapezium rule to normalize the posterior).

In practice, we don’t stop there: observing one datum is probably not enough for us to have a high degree of confidence in our model! The idea behind Bayesian inference is that it is iterative and that every new observation makes us better informed about the ‘true’ distribution of our variable of interest.

How do we move forward, then? The next steps are to:

1. Observe a new data point.
2. Take the posterior we just found, and have that become our new prior!
3. Use the same likelihood function to evaluate the likelihood of our different hypotheses given this newly observed data point.
4. Multiply our new prior and likelihood values to get the unnormalized posterior, then use the ‘trapezium’ rule to normalize it.

We then repeat steps 1 through 4 for however many data points we have! To showcase this process, let me first generate 1000 new data points:
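A sketch of the data generation and the full updating loop; the random seed, the ‘true’ mean of 1.7m, and the uniform starting prior are illustrative assumptions:

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid

# Simulate 1000 student heights from the 'true' model N(1.7, 0.1)
rng = np.random.default_rng(42)
TRUE_MU, SIGMA = 1.7, 0.1
data = rng.normal(TRUE_MU, SIGMA, size=1000)

mu_grid = np.linspace(1.65, 1.80, 50)
prior = stats.uniform.pdf(mu_grid, loc=1.65, scale=0.15)

for datum in data:
    likelihoods = stats.norm.pdf(datum, loc=mu_grid, scale=SIGMA)
    unnormalized = likelihoods * prior
    # Normalize, and let the posterior become the new prior
    prior = unnormalized / trapezoid(unnormalized, mu_grid)

posterior = prior
posterior_mean = trapezoid(mu_grid * posterior, mu_grid)
```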

See how, as we get more and more data, our model gets closer and closer to the ‘truth’? In other words, our model’s mean value converges to the real value of μ, whose true value is 1.7m, and the uncertainty about our guess (i.e. the standard deviation of our distribution) shrinks! This means that more data = more accurate AND precise guesses.

Below, we can see how the increased number of data leads to a predicted mean μ that is ever closer to the ‘true’ value of μ, 1.7m.

Notice from the code that I used a cumulative trapezium function to find the mean of the distribution output by our posterior. I’ll leave it to the reader to re-create the code and investigate why and how this makes sense!

At this stage, all that is left for us to do is to get the final predicted mean of our model (the one informed by the most data) and take that to be our ‘true’ model. After observing data from 1000 students, we take the final mean and have that as the mean of our model. Furthermore, we can get the 99% credible interval as well, which helps us understand how ‘wrong’ our predicted mean can be.
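One way this final step might be sketched, rebuilding the posterior and then inverting its cumulative distribution (via `cumulative_trapezoid`, echoing the cumulative trapezium function mentioned above) to get the 99% interval:

```python
import numpy as np
from scipy import stats
from scipy.integrate import trapezoid, cumulative_trapezoid

# Rebuild the posterior from 1000 simulated observations (illustrative seed)
rng = np.random.default_rng(42)
data = rng.normal(1.7, 0.1, size=1000)

mu_grid = np.linspace(1.65, 1.80, 50)
prior = stats.uniform.pdf(mu_grid, loc=1.65, scale=0.15)
for datum in data:
    unnormalized = stats.norm.pdf(datum, loc=mu_grid, scale=0.1) * prior
    prior = unnormalized / trapezoid(unnormalized, mu_grid)
posterior = prior

posterior_mean = trapezoid(mu_grid * posterior, mu_grid)

# Cumulative distribution of the posterior, then the 0.5% and 99.5% quantiles
cdf = cumulative_trapezoid(posterior, mu_grid, initial=0)
lower, upper = np.interp([0.005, 0.995], cdf, mu_grid)
```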

Now that we have our improved model, we can use it to make predictions! Based on the final model we arrived at, our model is specified as:

- N(μ,σ)
- μ=1.699m
- σ=0.1m

We can now use this model to answer potentially interesting business-related questions! For example:

## How many students can we expect to be taller than 1.75m in a class of 100 students?

Using our distribution, we can answer the question in two ways. The first is the analytical way: we can use the cumulative distribution function (CDF) of the normal distribution to find how much of the density lies below 1.75m, and then subtract that value from 1 to obtain the density above 1.75m:
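A sketch of that computation, using the final model N(1.699, 0.1) from above:

```python
from scipy import stats

# P(height > 1.75) = 1 - CDF(1.75) under N(1.699, 0.1)
p_taller = 1 - stats.norm.cdf(1.75, loc=1.699, scale=0.1)
print(p_taller)  # approximately 0.305
```

Equivalently, `stats.norm.sf(1.75, loc=1.699, scale=0.1)` (the survival function) computes 1 − CDF directly.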

This indicates that there is about a 30% chance that any given student will be taller than 1.75m. In a class of 100 students, this means that we expect about 30 students to be taller than 1.75m. The other way we might answer this question is through simulation: we can use our model to sample 100 students and count how many are taller than 1.75m:

True Bayesian statisticians rarely simulate anything only once, however. We want to be able to capture our uncertainty and variance. Therefore, we can perform the above simulation one hundred times and plot that distribution:
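A sketch of the repeated simulation (plotting omitted; a histogram of `counts` gives the distribution described):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate 100 classes of 100 students each, counting the tall students
counts = np.array([
    (rng.normal(1.699, 0.1, size=100) > 1.75).sum()
    for _ in range(100)
])
```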

We can also find the 99% credible interval for our estimate from the simulation, which is an important advantage of this method:
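One way to extract that interval from the simulated counts is with empirical percentiles (the simulation is repeated here so the snippet stands alone):

```python
import numpy as np

rng = np.random.default_rng(1)
counts = np.array([
    (rng.normal(1.699, 0.1, size=100) > 1.75).sum()
    for _ in range(100)
])

# Empirical 99% interval: the 0.5th and 99.5th percentiles of the counts
lower, upper = np.percentile(counts, [0.5, 99.5])
print(lower, upper)
```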

With our simulated mean as well as our bounds, we can now have a really good idea of how many students taller than 1.75m we might expect. If we needed that information in order to produce school uniforms for sale, for instance, we could take a conservative approach and produce 40 large-sized uniforms, expecting that 99% of the time we would have enough uniforms to sell. This helps us decide how much redundancy we might want to build in. Whereas producing 30 large uniforms would leave no room for error, a more interesting question is: how much room for error should we leave? With the help of Bayesian statistics and Bayesian inference, we can find a very compelling answer to questions of this type!