Tedium Free MLE

Published

October 19, 2021

Introduction

Maximum likelihood estimation has the dubious honor of being difficult for humans and machines alike (difficult for machines at least in the naïve formulation that doesn’t use log-likelihood).

MLE is challenging for humans because it requires the multiplication of \(n\) likelihood expressions, which is time-consuming and error-prone; this is the tedium we’re trying to avoid. Fortunately, computers are very good at repeated multiplication, even repeated symbolic multiplication.

Problem Formulation and Example

MLE estimates parameters of an assumed probability distribution, given data \(x_i\) observed independently from the same distribution. If that distribution has probability function \(f(\cdot)\), then the likelihood of \(x_i\) is \(f(x_i)\).

As the \(x_i\)s are independent, the likelihood of all \(x_i\)s will be the product of their individual likelihoods. In mathematical notation, the product will be:

\[\prod_{i=1}^{n} f(x_i)\]

Probability functions (mass functions or density functions) like our \(f(\cdot)\) typically have parameters. For instance, the Gaussian distribution has parameters \(\mu\) and \(\sigma^2\), and the Poisson distribution has rate parameter \(\lambda\). We use MLE to estimate these parameters, so they are the unknowns in the expression and they will appear in each \(f(x_i)\) term. We can restate the problem as an equality with the generic parameter \(\theta\):

\[L(\theta) = \prod_{i=1}^{n} f(x_i)\]

The expression \(L(\theta)\) is the likelihood. In order to find the MLE it is necessary to maximize this function; that is, to find the value of \(\theta\) for which \(L(\theta)\) is as large as possible. This process is probably easier to show than to describe. In particular, we’ll be demonstrating the usefulness of the sympy module in making these symbolic calculations.
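Before going symbolic, a plain-Python sketch may help fix ideas (the poisson_pmf helper and the sample data here are illustrative, not from the original post): with the data held fixed, the likelihood is just a function of the parameter, and better parameter values give larger outputs.

import math

# Hypothetical helper (for illustration): the Poisson pmf lambda**x * exp(-lambda) / x!
def poisson_pmf(x, lam):
    return lam**x * math.exp(-lam) / math.factorial(x)

# With the data held fixed, the likelihood is a function of the unknown rate.
def likelihood(lam, data=(2, 4, 1)):
    total = 1.0
    for x in data:
        total *= poisson_pmf(x, lam)
    return total

likelihood(1.0)  # ~0.001037
likelihood(2.0)  # ~0.006610 -- larger, so lam = 2 explains these data better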

Example

Say we observed values \([3,1,2]\) generated from a Poisson distribution. What is the likelihood function of \(\lambda\)?

Importing the necessities and setting up some symbols and expressions:

from sympy.stats import Poisson, density, E, variance
from sympy import Symbol, simplify
from sympy.abc import x

# Declare the unknown rate as a positive symbol; positivity lets solve()
# discard the spurious lambda = 0 root later on.
lambda_ = Symbol("lambda", positive=True)

f = Poisson("f", lambda_)
density(f)(x)

\(\displaystyle \frac{\lambda^{x} e^{- \lambda}}{x!}\)

sympy gives us a symbolic representation of the Poisson density to work with via the Poisson() object, which keeps track of all of the terms internally.
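That object is more than a pretty-printer: the E and variance helpers imported earlier read those internal terms back out. For a Poisson, both come back as \(\lambda\), since the mean and variance coincide:

E(f)

\(\displaystyle \lambda\)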

The likelihood expression is the product of the probability function evaluated at these three points:

L_ = density(f)(3) * density(f)(1) * density(f)(2)
L_

\(\displaystyle \frac{\lambda^{6} e^{- 3 \lambda}}{12}\)

That’s our expression for the likelihood \(L(\theta)\) 🙂 In order to maximize the expression, we’ll take its derivative and then solve for the value of the parameter \(\lambda\) at which the derivative equals 0. This value of \(\lambda\) will maximize the likelihood.

Finding the derivative using sympy:

from sympy import diff

dL_ = diff(L_, lambda_)
dL_

\(\displaystyle - \frac{\lambda^{6} e^{- 3 \lambda}}{4} + \frac{\lambda^{5} e^{- 3 \lambda}}{2}\)

Setting the derivative \(\frac{dL}{d\theta}\) equal to zero:

from sympy import Eq

dLeqz = Eq(dL_, 0)
dLeqz

\(\displaystyle - \frac{\lambda^{6} e^{- 3 \lambda}}{4} + \frac{\lambda^{5} e^{- 3 \lambda}}{2} = 0\)

And finally, solving the equation for \(\lambda\):

from sympy import solve

solve(dLeqz, lambda_)
[2]

And that’s our answer!
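As a quick sanity check (an aside, not part of the original derivation): the Poisson MLE is known to be the sample mean, and \((3+1+2)/3 = 2\). We can also probe the likelihood on either side of 2:

[L_.subs(lambda_, v).evalf(4) for v in (1.8, 2.0, 2.2)]
[0.01280, 0.01322, 0.01282]

The middle value is the largest, consistent with \(\lambda = 2\) being the maximizer.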

Complications

There is a slight wrinkle with this approach. It is susceptible to numerical instability, which (luckily) did not affect us in this example. This is how MLE can become difficult for computers too.

Likelihoods are usually very small numbers, and computers simply can’t represent numbers that are too small or too large. Multiplying very small numbers together repeatedly makes very VERY small numbers that can sometimes disappear completely. Without getting too distracted by the minutiae of numerical stability or underflow, we can still appreciate some bizarre behavior that results when floats are misused:

6.89 + .1
6.989999999999999
(0.1)**512
0.0

In the second scenario, we can imagine having 512 data points and finding that each one contributes a likelihood factor of 0.1 (times some expression in our parameter). Our product would then look like \(g(\theta) \cdot (0.1)^{512}\). The computer just told us that the \((0.1)^{512}\) factor is zero, so the whole likelihood collapses to zero and we’re left unable to find the parameters for our MLE.
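To see how thoroughly the maximum is lost, here is a sketch with a second, made-up per-point factor of 0.12: the two products should be ordered, but both collapse to the same 0.0:

0.1 ** 512
0.0
0.12 ** 512
0.0

Even though \(0.12^{512}\) is about \(10^{40}\) times larger than \(0.1^{512}\), both fall below the smallest positive float (roughly \(5 \times 10^{-324}\)) and compare as equal.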

Solution

What do we do instead? Is there any way to make these numbers bigger, without changing the problem or solution? Is there an equivalent problem with bigger numbers?

Adding a constant or multiplying by a constant doesn’t fix the problem: it just adds terms to the expression, which ends up at zero anyhow. However, these functions do have the one property we need to be sure we are solving an equivalent problem: they preserve the order of the inputs in the outputs. We call such functions monotonic.
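A tiny numeric check of that order-preserving property, with doubling standing in for any monotonic function (a sketch; the values are arbitrary):

a, b = 3e-200, 7e-200
(a < b, 2 * a < 2 * b)
(True, True)

Order in, order out: applying a monotonic function to \(L(\theta)\) moves the values around but leaves the location of the maximum, the thing we actually care about, unchanged.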

The monotonic functions also include the log function, which has some very nice properties, not least of which is that it makes our calculations immune to the problems we saw above: log turns the product into a sum, \(\log \prod_{i=1}^{n} f(x_i) = \sum_{i=1}^{n} \log f(x_i)\), so the vanishingly small factors become ordinary-sized summands. And because log is monotonic, the \(\theta\) that maximizes \(\log L(\theta)\) also maximizes \(L(\theta)\). Calculating the log-likelihood:

from sympy import log

logL_ = simplify(log(L_))
logL_

\(\displaystyle - 3 \lambda + 6 \log{\left(\lambda \right)} - \log{\left(12 \right)}\)

And then taking the derivative as before:

dlogL_ = diff(logL_, lambda_)
dlogL_

\(\displaystyle -3 + \frac{6}{\lambda}\)

Setting equal to zero:

dlogLeqz = Eq(dlogL_, 0)
dlogLeqz

\(\displaystyle -3 + \frac{6}{\lambda} = 0\)

And solving:

solve(dlogLeqz, lambda_)
[2]

The two solutions agree! That agreement is necessary but not sufficient to show the two methods are equivalent in general; the general guarantee comes from the monotonicity of log, which ensures that \(L(\theta)\) and \(\log L(\theta)\) are maximized by the same \(\theta\).
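As one more data point (a sketch, not from the original post; the sample [4, 0, 5, 3] is made up), both pipelines can be rerun end to end on different data and still agree:

from sympy import Eq, Symbol, diff, log, simplify, solve
from sympy.stats import Poisson, density

lam = Symbol("lambda", positive=True)
g = Poisson("g", lam)
data = [4, 0, 5, 3]  # made-up sample

# Build the likelihood as the product of per-point densities.
L = 1
for xi in data:
    L *= density(g)(xi)

# Maximize the likelihood directly, and via the log-likelihood.
solve(Eq(diff(L, lam), 0), lam), solve(Eq(diff(simplify(log(L)), lam), 0), lam)
([3], [3])

Both routes land on \(\lambda = 3\), the sample mean of the made-up data, as the theory predicts.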