Logistic regression from scratch in R
Figure 1: Logistic function
Introduction
Logistic regression is a method for modelling the probability of an event happening via a linear combination of its independent variables [1]. The logistic function converts the log-odds into a probability \(\in [0, 1]\). The log-odds is the logarithm of the odds ratio: the probability an event will occur divided by the probability it will not occur [2]:

\[
\log\left(\frac{p}{1 - p}\right) = z = \beta_0 + \beta_1 x_1 + \dots + \beta_d x_d,
\qquad
p = \sigma(z) = \frac{e^{z}}{1 + e^{z}}.
\]
The "regression" part of logistic regression is the process of estimating the logistic function's parameters, which are the coefficients \(\beta_0 ... \beta_d\) in the linear combination. If an event has \(d\) independent variables, then the model will have \(d+1\) parameters. The parameters are optimised by minimising a loss function with gradient descent, where the loss function is the negated log-likelihood:
We use the negated log-likelihood because it is convex, easily differentiable, and commonly used to measure the accuracy of classification models. The negative log-likelihood measures the disagreement between the predicted probability of a class and the actual class label; minimising it is equivalent to maximising the likelihood that the model predicts the correct class.
Gradient descent
The gradient descent algorithm minimises a loss function for a fixed number of epochs or until the change in error falls below some convergence threshold \(c\) [3]. In each epoch \(t\), the gradient of the loss function at the current parameters \(\lambda^{t-1}\) is computed; the initial parameters are uniformly random. The parameters are then adjusted in the direction of steepest descent, i.e. against the gradient, to produce \(\lambda^{t}\). The goal is to find the loss function's global minimum.
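Concretely, each epoch applies the update

\[
\lambda^{t} = \lambda^{t-1} - \eta \, \nabla_{\lambda}\,\ell(\lambda^{t-1}),
\]

where \(\eta\) is the learning rate; this corresponds to the λ_new line in the implementation below.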
The following R function implements the gradient descent algorithm via recursion. The function requires the following eight parameters:
- learning machine: a generalisation of a "machine" that can learn from experience.
- independent variables: X, the inputs.
- dependent variable: Y, the outputs.
- loss function: for measuring the quality of a fit.
- gradient of loss function: for calculating the direction of the steepest gradient.
- learning rate: how much to adjust parameters in each iteration.
- initial parameters: randomised.
- max epochs: stop after n epochs.
Calling gradientDescent will recursively descend until max epochs
has been reached. The output is a chain of error values (the loss
function output) and lambda values (candidate model parameters) at
each epoch.
# gradient descent optimisation algorithm.
# F  : learning machine;
# X  : independent variables;
# Y  : dependent variable;
# ℓ  : loss function;
# ℓ_ : the gradient of ℓ at λ;
# η  : learning rate;
# λ  : initial parameters;
# T  : number of epochs.
gradientDescent <- function(F, X, Y, ℓ, ℓ_, η, λ, T) {
  descend <- function(λ, T) {
    # test the parameters in λ
    Y_hat <- F(X, λ)
    error <- ℓ(Y, Y_hat)
    if(T > 1) {
      # compute new weights by stepping against the gradient
      λ_new <- λ - (η * ℓ_(λ, Y, Y_hat))
      # recursive case
      return(c(list("error"=error, "lambda"=λ),
               descend(λ_new, T - 1)))
    } else {
      # base case
      return(list("error"=error, "lambda"=λ))
    }
  }
  # optimise
  return(descend(λ, T))
}
Logistic regression
The following function implements the logistic function \(\sigma\). The function works for single scalar values as well as vectors.
# logistic function σ(x).
# x : scalar or vector.
#
# if x is a vector, σ will be applied element-wise
σ <- function(x) {
  e <- exp(1)
  return(e ** x / (1 + (e ** x)))
}
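A quick, illustrative check of its behaviour on a scalar and a vector:

σ(0)              # 0.5
σ(c(-10, 0, 10))  # approximately 0, 0.5, 1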
The log likelihood measures the quality of a model using \(Y\), the known classes, and \(\hat{Y}\), the predicted classes:
# log likelihood function.
# Y : known labels;
# Y_hat : predicted labels.
logLikelihood <- function(Y, Y_hat) {
  return(sum(log((Y * Y_hat) + ((1 - Y) * (1 - Y_hat)))))
}
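A quick, illustrative check: perfectly confident, correct predictions give a log likelihood of 0, while less confident predictions give a negative value.

logLikelihood(c(1, 0), c(1, 0))      # 0
logLikelihood(c(1, 0), c(0.9, 0.1))  # 2 * log(0.9), roughly -0.21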
Finally, the following gradient function computes and returns a vector of the \(d+1\) partial derivatives of the negated log likelihood loss function. It is implemented using recursion, i.e. it computes each partial derivative via a chain of function calls to itself.
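Note that gradient reads X_tilde from the enclosing environment and calls σ_, the derivative of the logistic function, which is not shown in the snippets above; a minimal definition consistent with σ would be:

# derivative of the logistic function: σ'(x) = σ(x) * (1 - σ(x)).
# x : scalar or vector (applied element-wise).
# (assumed definition; not listed in the original snippets)
σ_ <- function(x) {
  return(σ(x) * (1 - σ(x)))
}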
# evaluates to a vector of d+1 partial derivatives.
# λ     : current model parameters;
# Y     : known labels;
# Y_hat : predicted labels;
# i     : index of the partial derivative (initially d + 1).
gradient <- function(λ, Y, Y_hat, i) {
  # compute the i-th partial derivative
  sgn   <- Y - (1 - Y)                            # +1 for class 1, -1 for class 0
  lik   <- (Y * Y_hat) + ((1 - Y) * (1 - Y_hat))  # per-observation likelihood
  slope <- σ_(t(X_tilde) %*% λ) * X_tilde[i,]     # derivative of σ(X~^T λ) w.r.t. λ_i
  partial <- -sum((sgn / lik) * slope)
  if(i > 1) {
    # recursive case
    # build list in reverse since i starts at d + 1 and not 0.
    return(c(gradient(λ, Y, Y_hat, i - 1), partial))
  } else {
    # base case
    return(partial)
  }
}
Fitting a model
In order to make a prediction we first need to train a model on a training set \(T = (X, Y)\). A common practice is to use a 70:30 or 80:20 split for training vs testing respectively. In some cases, a 60:20:20 split is used for training, validation, and testing respectively. The idea is to have enough data for training while keeping enough aside for testing and validation. The splits should be constructed by randomly sampling from the full dataset.
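As an illustration only (the synthetic experiment later in this post trains and evaluates on the same data), a random 70:30 split could be constructed as follows, assuming X stores one observation per column and Y is a vector of labels:

# illustrative 70:30 split by random sampling (assumes X is d x N).
set.seed(1)
N         <- ncol(X)
train_idx <- sample(N, size=floor(0.7 * N))
X_train   <- X[, train_idx, drop=FALSE]
Y_train   <- Y[train_idx]
X_test    <- X[, -train_idx, drop=FALSE]
Y_test    <- Y[-train_idx]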
In the following R code, the "train" variable holds the model's entire training history, i.e. the error \(\ell\) and the candidate parameters \(\lambda\) at each step. The final parameters are at the end of the chain and need to be retrieved before the model can be used to make predictions.
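The call below assumes that d, X_tilde, η, T, and initλ are already in scope. They are not all defined in this post's snippets, but a minimal setup might look like this (values are illustrative; makeX_tilde is defined further down):

# illustrative setup for the training call (assumed values, not from the post).
d       <- nrow(X)              # number of independent variables
N       <- ncol(X)              # number of observations
X_tilde <- makeX_tilde(X, N)    # X with an extra dimension of ones (defined below)
η       <- 0.01                 # learning rate
T       <- 200                  # training epochs
initλ   <- function(n) runif(n) # uniformly random initial parameters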
train <- gradientDescent(
  # learning machine
  F=(function(X, λ) σ(t(X) %*% λ)),
  # training dataset ([1, X], Y)
  X=X_tilde, Y=Y,
  # loss function and gradient of loss function at λ
  ℓ=(function(Y, Y_hat) -logLikelihood(Y, Y_hat)),
  ℓ_=(function(λ, Y, Y_hat) gradient(λ, Y, Y_hat, d + 1)),
  # learning rate, training epochs, and initial λ parameters
  η=η, T=T, λ=initλ(d + 1))
# helper function: extracts the best parameters from the sequence of
# results returned by gradientDescent.
# r : output of gradientDescent.
bestParameters <- function(r) {
  stopifnot(length(r) > 0)
  # get the very last vector of parameters
  return(unlist(tail(r[names(r) == "lambda"], n=1)))
}
Recall: if an event has \(d\) independent variables, then the model will have \(d+1\) parameters, due to the bias \(\beta_0\). Therefore, we need to add an extra dimension of ones to the independent variables matrix, which we will call \(\tilde{X}\).
# helper function: adds a new dimension to X containing only ones.
# X : a matrix;
# N : number of observations in X.
#
# X~ = [1, X^T]^T
makeX_tilde <- function(X, N) {
  ones <- rep(1, N)
  return(t(cbind(ones, t(X))))
}
Making a prediction
We can test the algorithm by fitting a model to a synthetic dataset
\((X, Y)\). The independent variables in \(X\) consist of random numbers
between \(1\) and \(10\). The first half of \(X\) is \(\le 5\) and the second
half is \(\ge 5\). The first half of \(Y\) is false (\(0\)) and the second
half is true (\(1\)).
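The post does not list the data-generation code; a minimal sketch of such a dataset, assuming one independent variable stored as a 1 x N matrix, might look like:

# illustrative synthetic dataset (assumed shapes: X is 1 x N, Y has length N).
N <- 100
x_low  <- runif(N / 2, min=1, max=5)   # first half: values <= 5, class 0
x_high <- runif(N / 2, min=5, max=10)  # second half: values >= 5, class 1
X <- matrix(c(x_low, x_high), nrow=1)
Y <- c(rep(0, N / 2), rep(1, N / 2))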
The following three plots visualise the experiment. The middle plot shows the gradient descent algorithm doing its job: the loss clearly decreases and converges after about 100 epochs.
Figure 2: Logistic regression synthetic dataset (results)
Once we have found the optimal parameters \(\lambda\), a prediction can be made for \(X\) by passing \(X\) and \(\lambda\) to the logistic function \(\sigma\):
# computes σ(X~^T λ) and returns the predicted
# probabilities for each x in X.
# X : a matrix;
# N : number of observations in X;
# λ : d+1 model parameters.
predict <- function(X, N, λ) {
  # make X~ and apply the logistic function to the linear combination.
  X_tilde <- makeX_tilde(X, N)
  return(σ(t(X_tilde) %*% λ))
}
Transforming predictions (probabilities) into classifications (classes 0 or 1) is simple: round each probability to zero or one, which is equivalent to applying a threshold of 0.5.
# computes σ(X~^T λ) and returns the predicted
# probabilities rounded to either 0 or 1.
# X : a matrix;
# N : number of observations in X;
# λ : d+1 model parameters.
classify <- function(X, N, λ) {
  # compute probabilities and round them to the nearest class
  probs <- predict(X, N, λ)
  return(round(probs))
}
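Putting it together, a short illustrative check on the synthetic data (assuming the train history, X, Y, and N from above are in scope) could look like:

# illustrative end-to-end usage (assumes the objects defined above).
λ_best <- bestParameters(train)
Y_hat  <- classify(X, N, λ_best)
mean(Y_hat == Y)   # fraction of correctly classified observations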