Logistic regression from scratch in R
Figure 1: Logistic function
Introduction
Logistic regression is a method for modelling the probability of an event happening via a linear combination of its independent variables [1]. The logistic function converts the log-odds into a probability \(\in [0, 1]\). The log-odds is the logarithm of the odds ratio: the probability an event will occur divided by the probability it will not occur [2]:

\[
\log\left(\frac{p}{1 - p}\right) = z = \beta_0 + \beta_1 x_1 + \dots + \beta_d x_d,
\qquad
p = \sigma(z) = \frac{e^{z}}{1 + e^{z}}.
\]
The "regression" part of logistic regression is the process of estimating the logistic function's parameters, which are the coefficients \(\beta_0 ... \beta_d\) in the linear combination. If an event has \(d\) independent variables, then the model will have \(d+1\) parameters. The parameters are optimised by minimising a loss function with gradient descent, where the loss function is the negated log-likelihood:
We use the negated log-likelihood because it is convex, easily differentiable, and commonly used to measure the accuracy of classification models. The negative log-likelihood measures the disagreement between the predicted probability of a class and the actual class label; minimising it is equivalent to maximising the likelihood that the model predicts the correct class.
Gradient descent
The gradient descent algorithm minimises a loss function for a fixed number of epochs or until the change in error falls below some convergence threshold \(c\) [3]. In each epoch \(t\), the gradient of the loss function at the current parameters \(\lambda^{t-1}\) is computed; the initial parameters are uniformly random. The parameters are then adjusted in the direction of steepest descent, i.e. against the gradient, to produce \(\lambda^{t}\). The goal is to find the loss function's global minimum.
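Concretely, each epoch applies the update

\[
\lambda^{t} = \lambda^{t-1} - \eta \, \nabla_{\lambda}\,\ell(\lambda^{t-1}),
\]

where \(\eta\) is the learning rate; this corresponds to the λ_new line in the implementation below.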
The following R function implements the gradient descent algorithm via recursion. The function requires the following eight parameters:
- learning machine: a generalisation of a "machine" that can learn from experience.
- independent variables: X, the inputs.
- dependent variable: Y, the outputs.
- loss function: for measuring the quality of a fit.
- gradient of loss function: for calculating the direction of the steepest gradient.
- learning rate: how much to adjust parameters in each iteration.
- initial parameters: randomised.
- max epochs: stop after n epochs.
Calling gradientDescent will recursively descend until max epochs
has been reached. The output is a chain of error values (the loss
function output) and lambda values (candidate model parameters) at
each epoch.
# gradient descent optimisation algorithm.
# F  : learning machine;
# X  : independent variables;
# Y  : dependent variable;
# ℓ  : loss function;
# ℓ_ : the gradient of ℓ at λ;
# η  : learning rate;
# λ  : initial parameters;
# T  : number of epochs.
gradientDescent <- function(F, X, Y, ℓ, ℓ_, η, λ, T) {
  descend <- function(λ, T) {
    # test the parameters in λ
    Y_hat <- F(X, λ)
    error <- ℓ(Y, Y_hat)
    if(T > 1) {
      # compute new weights by stepping against the gradient
      λ_new <- λ - (η * ℓ_(λ, Y, Y_hat))
      # recursive case
      return(c(list("error"=error, "lambda"=λ),
               descend(λ_new, T - 1)))
    } else {
      # base case
      return(list("error"=error, "lambda"=λ))
    }
  }
  # optimise
  return(descend(λ, T))
}
Logistic regression
The following function implements the logistic function \(\sigma\). The function works for single scalar values as well as vectors.
# logistic function σ(x).
# x : scalar or vector.
#
# if x is a vector, σ will be applied element-wise
σ <- function(x) {
  e <- exp(1)
  return(e ** x / (1 + (e ** x)))
}
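A quick, illustrative check of its behaviour on a scalar and a vector:

σ(0)              # 0.5
σ(c(-10, 0, 10))  # approximately 0, 0.5, 1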
The log likelihood measures the quality of a model using \(Y\), the known classes, and \(\hat{Y}\), the predicted classes:
# log likelihood function.
# Y : known labels;
# Y_hat : predicted labels.
logLikelihood <- function(Y, Y_hat) {
  return(sum(log((Y * Y_hat) + ((1 - Y) * (1 - Y_hat)))))
}
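A quick, illustrative check: perfectly confident, correct predictions give a log likelihood of 0, while less confident predictions give a negative value.

logLikelihood(c(1, 0), c(1, 0))      # 0
logLikelihood(c(1, 0), c(0.9, 0.1))  # 2 * log(0.9), roughly -0.21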
Finally, the following gradient function computes and returns a vector of the \(d+1\) partial derivatives of the negated log likelihood loss function. It is implemented using recursion, i.e. it computes each partial derivative via a chain of function calls to itself.
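Note that gradient reads X_tilde from the enclosing environment and calls σ_, the derivative of the logistic function, which is not shown in the snippets above; a minimal definition consistent with σ would be:

# derivative of the logistic function: σ'(x) = σ(x) * (1 - σ(x)).
# x : scalar or vector (applied element-wise).
# (assumed definition; not listed in the original snippets)
σ_ <- function(x) {
  return(σ(x) * (1 - σ(x)))
}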
# evaluates to a vector of d+1 partial derivatives.
# λ     : current model parameters;
# Y     : known labels;
# Y_hat : predicted labels;
# i     : index of the partial derivative (initially d + 1).
gradient <- function(λ, Y, Y_hat, i) {
  # compute the i-th partial derivative
  sgn   <- Y - (1 - Y)                            # +1 for class 1, -1 for class 0
  lik   <- (Y * Y_hat) + ((1 - Y) * (1 - Y_hat))  # per-observation likelihood
  slope <- σ_(t(X_tilde) %*% λ) * X_tilde[i,]     # derivative of σ(X~^T λ) w.r.t. λ_i
  partial <- -sum((sgn / lik) * slope)
  if(i > 1) {
    # recursive case
    # build list in reverse since i starts at d + 1 and not 0.
    return(c(gradient(λ, Y, Y_hat, i - 1), partial))
  } else {
    # base case
    return(partial)
  }
}
Fitting a model
In order to make a prediction we first need to train a model on a training set \(T = (X, Y)\). A common practice is to use a 70:30 or 80:20 split for training vs testing respectively. In some cases, a 60:20:20 split is used for training, validation, and testing respectively. The idea is to have enough data for training while keeping enough aside for testing and validation. The splits should be constructed by randomly sampling from the full dataset.
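As an illustration only (the synthetic experiment later in this post trains and evaluates on the same data), a random 70:30 split could be constructed as follows, assuming X stores one observation per column and Y is a vector of labels:

# illustrative 70:30 split by random sampling (assumes X is d x N).
set.seed(1)
N         <- ncol(X)
train_idx <- sample(N, size=floor(0.7 * N))
X_train   <- X[, train_idx, drop=FALSE]
Y_train   <- Y[train_idx]
X_test    <- X[, -train_idx, drop=FALSE]
Y_test    <- Y[-train_idx]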
In the following R code, the "train" variable holds the model's entire training history, i.e. the error \(\ell\) and the candidate parameters \(\lambda\) at each step. The final parameters are at the end of the chain and need to be retrieved before the model can be used to make predictions.
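The call below assumes that d, X_tilde, η, T, and initλ are already in scope. They are not all defined in this post's snippets, but a minimal setup might look like this (values are illustrative; makeX_tilde is defined further down):

# illustrative setup for the training call (assumed values, not from the post).
d       <- nrow(X)              # number of independent variables
N       <- ncol(X)              # number of observations
X_tilde <- makeX_tilde(X, N)    # X with an extra dimension of ones (defined below)
η       <- 0.01                 # learning rate
T       <- 200                  # training epochs
initλ   <- function(n) runif(n) # uniformly random initial parameters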
train <- gradientDescent(
  # learning machine
  F=(function(X, λ) σ(t(X) %*% λ)),
  # training dataset ([1, X], Y)
  X=X_tilde, Y=Y,
  # loss function and gradient of loss function at λ
  ℓ=(function(Y, Y_hat) -logLikelihood(Y, Y_hat)),
  ℓ_=(function(λ, Y, Y_hat) gradient(λ, Y, Y_hat, d + 1)),
  # learning rate, training epochs, and initial λ parameters
  η=η, T=T, λ=initλ(d + 1))
# helper function: extracts the best parameters from the sequence of
# results returned by gradientDescent.
# r : output of gradientDescent.
bestParameters <- function(r) {
  stopifnot(length(r) > 0)
  # get the very last vector of parameters
  return(unlist(tail(r[names(r) == "lambda"], n=1)))
}
Recall: if an event has \(d\) independent variables, then the model will have \(d+1\) parameters, due to the bias \(\beta_0\). Therefore, we need to add an extra dimension of ones to the independent variables matrix, which we will call \(\tilde{X}\).
# helper function: adds a new dimension to X containing only ones.
# X : a matrix;
# N : number of observations in X.
#
# X~ = [1, X^T]^T
makeX_tilde <- function(X, N) {
  ones <- rep(1, N)
  return(t(cbind(ones, t(X))))
}
Making a prediction
We can test the algorithm by fitting a model to a synthetic dataset
\((X, Y)\). The independent variables in \(X\) consist of random numbers
between \(1\) and \(10\). The first half of \(X\) is \(\le 5\) and the second
half is \(\ge 5\). The first half of \(Y\) is false (\(0\)) and the second
half is true (\(1\)).
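The post does not list the data-generation code; a minimal sketch of such a dataset, assuming one independent variable stored as a 1 x N matrix, might look like:

# illustrative synthetic dataset (assumed shapes: X is 1 x N, Y has length N).
N <- 100
x_low  <- runif(N / 2, min=1, max=5)   # first half: values <= 5, class 0
x_high <- runif(N / 2, min=5, max=10)  # second half: values >= 5, class 1
X <- matrix(c(x_low, x_high), nrow=1)
Y <- c(rep(0, N / 2), rep(1, N / 2))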
The following three plots visualise the experiment. The middle plot shows the gradient descent algorithm doing its job: the loss clearly decreases and converges after about 100 epochs.
Figure 2: Logistic regression synthetic dataset (results)
Once we have found the optimal parameters \(\lambda\), a prediction can be made for \(X\) by passing \(X\) and \(\lambda\) to the logistic function \(\sigma\):
# computes σ(X~^T λ) and returns the predicted
# probabilities for each x in X.
# X : a matrix;
# N : number of observations in X;
# λ : d+1 model parameters.
predict <- function(X, N, λ) {
  # make X~ and apply the logistic function to the linear combination.
  X_tilde <- makeX_tilde(X, N)
  return(σ(t(X_tilde) %*% λ))
}
Transforming predictions (probabilities) into classifications (classes 0 or 1) is simple: round each probability to zero or one, which is equivalent to applying a threshold of 0.5.
# computes σ(X~^T λ) and returns the predicted
# probabilities rounded to either 0 or 1.
# X : a matrix;
# N : number of observations in X;
# λ : d+1 model parameters.
classify <- function(X, N, λ) {
  # compute probabilities and round them to the nearest class
  probs <- predict(X, N, λ)
  return(round(probs))
}
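Putting it together, a short illustrative check on the synthetic data (assuming the train history, X, Y, and N from above are in scope) could look like:

# illustrative end-to-end usage (assumes the objects defined above).
λ_best <- bestParameters(train)
Y_hat  <- classify(X, N, λ_best)
mean(Y_hat == Y)   # fraction of correctly classified observations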