Contributed by: Venkat Murali
LinkedIn Profile: https://www.linkedin.com/in/venkat-murali-3753bab/
Maximum likelihood estimation is a method that determines values for the parameters of a model. It is the statistical method of estimating the parameters of a probability distribution by maximizing the likelihood function. The parameter value that maximizes the likelihood function is called the maximum likelihood estimate.
Development:
This principle was originally developed by Ronald Fisher in the 1920s. He stated that the best-fitting probability distribution is the one that makes the observed data “most likely”, which means the parameter vector chosen is the one that maximizes the likelihood function.
Goal:
The goal of maximum likelihood estimation is to make inferences about the population that is most likely to have generated the sample, i.e., the joint probability distribution of the random variables.
Before proceeding further, let us understand the key difference between two terms used in statistics – likelihood and probability – which is very important for data scientists and data analysts.
Difference between Likelihood and Probability:
S. No | Likelihood | Probability
--- | --- | ---
1 | Refers to past events with known outcomes | Refers to the occurrence of future events
2 | I flipped a coin 10 times and obtained 10 heads. What is the likelihood that the coin is fair? Given the fixed outcomes (data), what is the likelihood of different parameter values? | I flipped a coin 10 times. What is the probability of it landing heads or tails every time? Given the fixed parameter (p = 0.5), what is the probability of different outcomes?
3 | Likelihoods do not add up to 1 | Probabilities add up to 1
In everyday language, probability is described simply as the chance of an event happening, but as the table above shows, likelihood has a more specific meaning in statistics.
Simple Explanation – Maximum Likelihood Estimation using MS Excel.
Problem: What is the probability of heads when a single coin is tossed 40 times and 19 heads are observed?
Observation: When the assumed probability of heads on a single toss is low, in the range of 0% to 10%, the probability of getting 19 heads in 40 tosses is also very low. As we move to higher values, in the range of 30% to 40%, the likelihood of getting 19 heads in 40 tosses rises higher and higher.
After this initial increase, the likelihood gradually decreases beyond a certain probability value, which is the intermediate point, or peak. The probability value at this peak is the maximum likelihood estimate.
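The same search can be reproduced programmatically. Below is a minimal sketch (an addition for illustration, not part of the original Excel walkthrough) that evaluates the binomial likelihood of 19 heads in 40 tosses over a grid of candidate probabilities and reports where it peaks:
import numpy as np
from scipy.stats import binom
# candidate values for the probability of heads on a single toss
p_grid = np.linspace(0.01, 0.99, 99)
# likelihood of observing 19 heads in 40 tosses for each candidate p
likelihood = binom.pmf(19, 40, p_grid)
# the grid value where the likelihood peaks is close to 19/40 = 0.475
print(p_grid[np.argmax(likelihood)])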
Major Steps in MLE:
- Perform a certain experiment to collect the data.
- Choose a parametric model of the data, with certain modifiable parameters.
- Formulate the likelihood as an objective function to be maximized.
- Maximize the objective function and derive the parameters of the model.
Examples:
- Toss a coin – to find the probabilities of heads and tails
- Throw a dart – to find the PDF of the distance to the bull’s eye
- Sample a group of animals – to estimate the number of animals in the population
How machine learning algorithms use maximum likelihood estimation and how it helps in estimating results
Logistic regression is a model for binary classification used in real-time practical applications. The parameters of a logistic regression model can be estimated by the probabilistic framework called maximum likelihood estimation.
A probability distribution for the target variable (the labeled class) must be assumed, followed by a likelihood function that calculates the probability of observing the outcome given the input data and the model. This function can be optimized to find the set of parameters that results in the largest summed log-likelihood over the training dataset.
In maximum likelihood estimation, we maximize the conditional probability of observing the data (X) given a specific probability distribution and its parameters (theta, θ):
- P(X; θ), where X is the set of all observations from 1 to n:
P(x1, x2, x3, …, xn; θ)
- The resulting conditional probability is known as the likelihood of observing the data with the given model parameters and is denoted L:
- L(X; θ)
The joint probability can also be written as the product of the conditional probabilities of the individual observations given the distribution parameters:
- Product, i = 1 to n, of P(xi; θ)
Taking the logarithm of the likelihood turns this product into a sum:
- Sum, i = 1 to n, of log P(xi; θ)
Because the logarithm is used, this function is known as the log-likelihood function. It is common in optimization problems to prefer minimizing a cost function rather than maximizing an objective.
Therefore, the negative of the log-likelihood function is used; it is known as the negative log-likelihood function.
- Minimize: Sum, i = 1 to n, of −log P(xi; θ)
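To make the negative log-likelihood concrete, here is a small sketch (an illustrative addition, reusing the 19-heads-in-40-tosses data from earlier); minimizing it recovers the intuitive estimate 19/40:
import numpy as np
from scipy.optimize import minimize_scalar
# negative log-likelihood of observing 19 heads and 21 tails
# when each toss lands heads with probability p
def neg_log_likelihood(p):
    return -(19 * np.log(p) + 21 * np.log(1 - p))
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method='bounded')
print(result.x)  # approximately 0.475, i.e., 19/40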
- The Maximum Likelihood Estimation framework can be used as a basis for estimating the parameters of many different machine learning models for regression and classification predictive modeling. This includes the logistic regression model.
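As a hedged illustration of that point (this code is not from the original article; the synthetic data, parameter names, and the neg_log_likelihood helper are assumptions made for the sketch), a one-feature logistic regression can be fitted by minimizing its negative log-likelihood with scipy:
import numpy as np
from scipy.optimize import minimize
rng = np.random.default_rng(0)
x = rng.normal(size=200)                      # one synthetic feature
true_w, true_b = 2.0, -0.5                    # assumed "true" parameters
y = rng.binomial(1, 1 / (1 + np.exp(-(true_w * x + true_b))))  # binary labels
def neg_log_likelihood(params):
    w, b = params
    z = w * x + b
    log_p = -np.logaddexp(0, -z)              # log(sigmoid(z)), numerically stable
    log_1mp = -np.logaddexp(0, z)             # log(1 - sigmoid(z))
    return -np.sum(y * log_p + (1 - y) * log_1mp)
result = minimize(neg_log_likelihood, x0=[0.0, 0.0], method='BFGS')
print(result.x)  # the estimates should land near (2.0, -0.5)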
Let X1, X2, X3, ⋯, Xn be a random sample from a distribution with a parameter θ. Suppose that we have observed X1 = x1, X2 = x2, ⋯, Xn = xn.
- If the Xi are discrete, then the likelihood function is defined as
L(x1, x2, ⋯, xn; θ) = PX1X2⋯Xn(x1, x2, ⋯, xn; θ).
- If the Xi are jointly continuous, then the likelihood function is defined as
L(x1, x2, ⋯, xn; θ) = fX1X2⋯Xn(x1, x2, ⋯, xn; θ).
In some problems, it is easier to work with the log likelihood function given by
ln L (x1, x2, ⋯, xn ; θ).
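As a brief worked example (added here for illustration, not part of the original text), suppose X1, ⋯, Xn are an i.i.d. sample from a normal distribution with unknown mean μ and known standard deviation σ. Because the observations are independent, the joint density factorizes into a product of the individual densities, so
ln L(x1, ⋯, xn; μ) = Sum, i = 1 to n, of ln f(xi; μ, σ) = −(n/2) ln(2πσ²) − (1/(2σ²)) · Sum, i = 1 to n, of (xi − μ)².
Setting the derivative with respect to μ to zero gives μ̂ = (x1 + ⋯ + xn)/n, i.e., the sample mean is the maximum likelihood estimate of the mean. This normal log-likelihood is exactly what the Python example below evaluates with stats.norm.logpdf.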
MLE in Python:
Implementing MLE in a data science project can be quite simple, with a variety of approaches and mathematical techniques available. Below is one approach to get started with programming MLE.
Step 1: Import libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
import scipy.stats as stats
import pymc3 as pm3
import numdifftools as ndt
import statsmodels.api as sm
from statsmodels.base.model import GenericLikelihoodModel
Step 2: Generate Data
N = 1000
x = np.linspace(0,200,N)
e = np.random.normal(loc = 0.0, scale = 5.0, size = N)
y = 3*x + e
df = pd.DataFrame({'y': y, 'x': x})
df['constant'] = 1
Step 3: Visualize the Plot
sns.regplot(x=df.x, y=df.y)
Step 4: Scatter Plot with OLS Line and confidence intervals
Step 5: Modeling OLS with Statsmodels
We created regression-like continuous data, so we will use sm.OLS to calculate the best coefficients; its log-likelihood (LL) will serve as the benchmark.
# split features and target
X = df[['constant', 'x']]
# fit model and summarize
sm.OLS(y, X).fit().summary()
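For reference (an added snippet, not part of the original walkthrough), the fitted coefficients and the benchmark log-likelihood can also be pulled directly from the results object:
res = sm.OLS(y, X).fit()
print(res.params)   # should be close to [0, 3] for the constant and x
print(res.llf)      # the log-likelihood value reported in the summary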
Maximizing the Log-Likelihood to Solve for Optimal Coefficients:
We use a combination of packages and functions to see if we can calculate the same OLS results above using MLE methods.
Because scipy.optimize provides a minimize function but no maximize, we will minimize the negative of the log-likelihood, which is the standard approach. A simple function is built for this below.
Define the likelihood function:
def MLERegression(params):
    # unpack our parameters: intercept, slope, and noise standard deviation
    intercept, beta, sd = params[0], params[1], params[2]
    # predicted values at our parameters
    yhat = intercept + beta*x
    # compute the normal log-PDF of the observed values around the mean (yhat)
    # with standard deviation sd, and negate the sum to get the negative log-likelihood
    negLL = -np.sum(stats.norm.logpdf(y, loc=yhat, scale=sd))
    return negLL
Minimizing the Cost Function:
guess = np.array([5, 5, 2])
results = minimize(MLERegression, guess, method='Nelder-Mead', options={'disp': True})
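As a quick check (an added line, assuming the data generated above), the fitted parameters can be read from the optimizer's result object; the intercept should be near 0, the slope near 3, and the noise standard deviation near 5, matching the data-generating process:
print(results.x)  # results.x holds the fitted [intercept, beta, sd]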
If you find this interesting and wish to learn more, upskill with Great Learning’s PGP Artificial Intelligence and Machine Learning Course today!