From Theory to Code in Machine Learning (Part 2): Maximum Likelihood Estimation for Classification

Implementing Logistic Regression from Scratch in Python

Xinzhe Li, PhD in Language Intelligence
Level Up Coding

--


The previous article discussed what Maximum Likelihood Estimation (MLE) is and how to use it for linear regression. This article gives an in-depth description of how to use MLE for classification tasks, where the target variable is categorical. Specifically, this article covers:

  • the derivation of a general classification objective under the MLE principle
  • the application of the objective to optimize Logistic Regression models.

Section 1: Classification Objective under the MLE Principle

Let’s start with binary classification, which has only 2 classes, and then extend it to multi-class classification with more than 2 classes. For example, each passenger in the Titanic accident either survived or did not survive. For numerical computation, the categorical variable is first encoded as a dummy variable with values {0, 1}, e.g., {0: Not Survived, 1: Survived}.
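
As a concrete illustration of this encoding (with made-up labels rather than the actual Titanic data):

# Map the categorical outcome to the dummy variable {0, 1}
labels = ["Not Survived", "Survived", "Survived", "Not Survived"]
y = [1 if label == "Survived" else 0 for label in labels]
print(y)  # [0, 1, 1, 0]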

Let’s look at the general likelihood function we derived there, the product of the conditional probabilities over all n observations:

L(θ) = ∏_{i=1}^{n} Pr(y_i|x_i, θ)

As we did for linear regression, the first goal is to find the probability distribution for Pr(y_i|x_i, θ). For binary classification, the nature of binary outcomes suggests a Bernoulli distribution:

Pr(Y = y) = p^y (1 − p)^{1 − y},  y ∈ {0, 1}

, where p represents the probability that Y = 1.

Recap: In the previous article, the target variable of linear regression was assumed to follow a normal distribution.

Intuitive understanding of the Bernoulli distribution:

  • When y=1: The formula simplifies to P(Y=1)=p. This makes intuitive sense because it directly states that the probability of Y being 1 is exactly p, the parameter we defined as the probability of Y being 1 given the input x.
  • When y=0: The formula simplifies to P(Y=0)=1−p. This is logical because if the probability of Y being 1 is p, then the probability of Y being 0 must be 1−p, since Y can only be 0 or 1 in this binary outcome scenario (a quick numeric check of both cases is sketched below).
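
Here is that quick check, a throwaway sketch with an arbitrarily chosen p:

def bernoulli_pmf(y, p):
    """Pr(Y = y) = p**y * (1 - p)**(1 - y) for y in {0, 1}."""
    return p**y * (1 - p)**(1 - y)

p = 0.7  # arbitrary example value for Pr(Y = 1)
print(bernoulli_pmf(1, p))  # 0.7, i.e. p
print(bernoulli_pmf(0, p))  # 0.3 (up to floating-point rounding), i.e. 1 - p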

Derived Likelihood and Negative Log Likelihood (NLL): Substituting the Bernoulli distribution into the general likelihood, the likelihood L(θ) can be derived as:

L(θ) = ∏_{i=1}^{n} p_i^{y_i} (1 − p_i)^{1 − y_i}

Correspondingly, the negative log-likelihood ℓ(θ) can be derived as:

ℓ(θ) = − ∑_{i=1}^{n} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ]

p can be directly obtained through a logistic regression model or a neural network with a sigmoid output layer. Hence, p_i is replaced by the model prediction ŷ_i in that context.
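
As a reference implementation of this NLL (averaged over the samples, which is the version the gradient-descent code later minimizes), a minimal NumPy sketch could look as follows; the helper name negative_log_likelihood and the eps clipping are just illustrative choices:

import numpy as np

def negative_log_likelihood(y, y_pred, eps=1e-12):
    """Average binary NLL (cross-entropy); eps avoids log(0)."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))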

p_i = ŷ_i under Logistic Regression: Logistic regression directly predicts the probability p by applying the sigmoid function to a linear combination of the features:

ŷ_i = σ(β0 + β1 x_i) = 1 / (1 + e^{−(β0 + β1 x_i)})

The outputs of the sigmoid function are suitable for representing the p_i because they lie between 0 and 1.
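
The code in the next section relies on a sigmoid helper; a minimal NumPy version is:

import numpy as np

def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1 / (1 + np.exp(-z))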

Section 2: Optimizing Logistic Regression under the MLE Objective

Algebraic Solution

In the previous article, the parameters of linear regression, β0 and β1, could be optimized algebraically via a closed-form solution: setting the derivatives of the loss function to zero. Let’s do the same thing for logistic regression. First, we calculate the derivatives of the NLL (averaged over the n samples, as the code below also does) with respect to β1 and β0, respectively:

∂ℓ/∂β1 = (1/n) ∑_{i=1}^{n} (ŷ_i − y_i) x_i

∂ℓ/∂β0 = (1/n) ∑_{i=1}^{n} (ŷ_i − y_i)

You can watch the video below for the detailed derivations of the above expressions.

The predicted probabilities ŷ_i that appear in these derivatives are outputs of the sigmoid function applied to the linear combination of the inputs and model parameters. Because of this nonlinear dependency, setting the derivatives to zero does not yield equations that can be solved in closed form.
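
If you prefer code over calculus, the derivative expressions can also be sanity-checked numerically. The sketch below, using made-up data and the sigmoid helper defined earlier, compares the analytic gradient of the averaged NLL with a centred finite-difference estimate:

import numpy as np

# Sanity check on made-up data
rng = np.random.default_rng(0)
X_chk = rng.normal(size=(50, 1))
y_chk = (rng.random(50) < 0.5).astype(float)
b1, b0 = np.array([0.3]), -0.2  # arbitrary parameter values

def mean_nll(w, c):
    p = sigmoid(X_chk @ w + c)
    return -np.mean(y_chk * np.log(p) + (1 - y_chk) * np.log(1 - p))

y_pred_chk = sigmoid(X_chk @ b1 + b0)
analytic = np.dot(X_chk.T, y_pred_chk - y_chk) / len(y_chk)  # d(NLL)/d(beta1)
eps = 1e-6
numeric = (mean_nll(b1 + eps, b0) - mean_nll(b1 - eps, b0)) / (2 * eps)
print(analytic, numeric)  # the two numbers should agree closely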

Gradient Descent

To find the minimum of an objective function (imagine the lowest point in a valley), gradient descent starts from an arbitrary point and uses the slope of the function at that point (i.e., the derivatives above) to determine which direction to move in order to descend towards the valley’s bottom. Each step adjusts the position based on the slope, and this process is repeated iteratively until the lowest possible point (minimum loss) is approached.
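
To make the picture concrete before applying it to logistic regression, here is the same idea on a toy one-dimensional function (nothing to do with the model, purely for intuition):

# Minimize f(x) = x**2, whose slope is f'(x) = 2 * x.
x = 5.0    # arbitrary starting point
lr = 0.1   # learning rate (step size)
for _ in range(100):
    x -= lr * 2 * x   # step against the slope
print(x)  # very close to 0, the minimum of f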

  1. Initialize Parameters: Start with initial guesses for β0​, β1​, etc.
beta1 = np.zeros(n_features)
beta0 = 0

2. Compute Gradients: Use the above expressions for derivatives:

d_beta1 = (1 / n_samples) * np.dot(X.T, (y_pred - y))
d_beta0 = (1 / n_samples) * np.sum(y_pred - y)

3. Update Parameters: Adjust each parameter in the direction that most reduces the loss, which is the opposite direction of the gradient:

β1 ← β1 − α · ∂ℓ/∂β1,  β0 ← β0 − α · ∂ℓ/∂β0

, where α is the learning rate, a small positive scalar determining the size of each step.

beta1 -= lr * d_beta1
beta0 -= lr * d_beta0

4. Iterate: Repeat the process of computing gradients and updating parameters until the changes are very small or a maximum number of iterations is reached.

for _ in range(num_iter):
    # Compute the predicted values
    y_pred = sigmoid(np.dot(X, beta1) + beta0)

    # Compute the gradients
    d_beta1 = (1 / n_samples) * np.dot(X.T, (y_pred - y))
    d_beta0 = (1 / n_samples) * np.sum(y_pred - y)

    # Update the weights and bias
    beta1 -= lr * d_beta1
    beta0 -= lr * d_beta0

Putting these steps together, the final function for estimating a logistic regression model is:

def logistic_regression(X, y, lr=0.01, num_iter=1000):
    """
    Perform logistic regression on the dataset (X, y) using gradient descent.

    Parameters:
    - X: numpy array of shape (n_samples, n_features), the input data.
    - y: numpy array of shape (n_samples,), the target values.
    - lr: float, the learning rate for gradient descent.
    - num_iter: int, the number of iterations for gradient descent.

    Returns:
    - beta1: numpy array of shape (n_features,), the weights of the logistic regression model.
    - beta0: float, the bias of the logistic regression model.
    """
    n_samples, n_features = X.shape
    beta1 = np.zeros(n_features)
    beta0 = 0

    for _ in range(num_iter):
        # Compute the predicted values
        y_pred = sigmoid(np.dot(X, beta1) + beta0)

        # Compute the gradients
        d_beta1 = (1 / n_samples) * np.dot(X.T, (y_pred - y))
        d_beta0 = (1 / n_samples) * np.sum(y_pred - y)

        # Update the weights and bias
        beta1 -= lr * d_beta1
        beta0 -= lr * d_beta0

    return beta1, beta0
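
Step 4 above also mentions stopping when the changes become very small, whereas logistic_regression only runs for a fixed number of iterations. A possible variant with a tolerance-based stopping rule (a sketch reusing the same gradients; the name logistic_regression_tol and the tol parameter are illustrative) could be:

def logistic_regression_tol(X, y, lr=0.01, num_iter=1000, tol=1e-8):
    """Like logistic_regression, but stops early once the parameter updates become tiny."""
    n_samples, n_features = X.shape
    beta1 = np.zeros(n_features)
    beta0 = 0.0

    for _ in range(num_iter):
        y_pred = sigmoid(np.dot(X, beta1) + beta0)
        d_beta1 = (1 / n_samples) * np.dot(X.T, (y_pred - y))
        d_beta0 = (1 / n_samples) * np.sum(y_pred - y)
        beta1 -= lr * d_beta1
        beta0 -= lr * d_beta0
        # Stop when the largest parameter update falls below the tolerance
        if max(np.max(np.abs(lr * d_beta1)), abs(lr * d_beta0)) < tol:
            break

    return beta1, beta0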

Final Verification

Below is the test code to verify the result.

np.random.seed(42)
X = np.random.randn(100, 1)
true_w = np.array([0.5])
true_b = 0.1
y = sigmoid(np.dot(X, true_w) + true_b)

beta1, beta0 = logistic_regression(X, y, num_iter=1000, lr=0.01)
print(f"beta_0: {beta0}, beta_1: {beta1}")
# beta_0: 0.07599928462722418, beta_1: [0.42017187]

beta1, beta0 = logistic_regression(X, y, num_iter=10000, lr=0.01)
print(f"beta_0: {beta0}, beta_1: {beta1}")
# beta_0: 0.0999999943039172, beta_1: [0.49999998]

Here is the output of the first run: beta_0: 0.076, beta_1: 0.42, which differs noticeably from true_b and true_w. But with more iterations (10,000 instead of 1,000), we get a nearly perfect result: beta_0: 0.0999..., beta_1: 0.49999..., very close to true_b and true_w.
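
One last remark: in this test, y holds the true probabilities produced by the sigmoid rather than hard 0/1 labels, which is convenient for checking whether the parameters are recovered. On a real dataset y would contain binary labels, and class predictions would be obtained by thresholding the predicted probabilities, for example:

# Hypothetical usage: convert predicted probabilities to 0/1 class labels
probs = sigmoid(np.dot(X, beta1) + beta0)
predictions = (probs >= 0.5).astype(int)
print(predictions[:10])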
