Logistic Regression from Scratch

Machine Learning | 28 September 2019

During my journey as a Machine Learning (ML) practitioner, I found it has become ultimately easy for any human with limited knowledge on algorithms to take advantage of free python libraries such as scikit-learn to solve a ML problem. Truth be said, it’s easy and sometimes no brainer to achieve this, as there are so many codes available in GitHub, Medium, Kaggle etc., You just need some amount of time looking at these codes to arrive at a solution to a problem of your choice.

But, what if we learn every algorithm or procedures behind each machine learning pipeline that does all the heavy lifting for us inside these amazing libraries. In this blog post and the series of blog posts to come, I will be focusing on implementing machine learning algorithms from scratch using python and numpy.

Sure you might argue with me for the first paragraph. But learning how each algorithm work behind the scenes is very important to use these algorithms and bring in customized features in any domain (say hardware design).

In this blog post, we will implement logistic regression from scratch using python and numpy to a binary classification problem.

I assume that you have knowledge on python programming and scikit-learn, because at the end we will compare our implementation (from scratch) with scikit-learn’s implementation.

Dataset

We will use the breast cancer dataset from scikit-learn for this implementation. This is a binary classification problem i.e., each data point in the training data belong to one of two classes. Below code loads the dataset and prints out the important information we seek.

logistic_regression.pycode
1
2
3
4
5
6
7
8
9
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

print(data.keys())
print("No.of.data points (rows) : {}".format(len(data.data)))
print("No.of.features (columns) : {}".format(len(data.feature_names)))
print("No.of.classes            : {}".format(len(data.target_names)))
print("Class names              : {}".format(list(data.target_names)))
1
2
3
4
5
6
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

No.of.data points (rows) : 569
No.of.features (columns) : 30
No.of.classes            : 2
Class names              : ['malignant', 'benign']

If you wish to know more about the dataset, use data.DESCR to print out the entire description of the dataset.

As you can see, the dataset has data and target keys from which we can access the data in this dataset. There are 569 rows and 30 columns. To describe the dataset mathematically, we will use this notation \( [x_i, h(x_i), y_i] \).

where

  • \( x_i \) denotes a single data point in the dataset.
  • \( h(x_i) \) is th feature vector for that single data point which has \( [h_1(x_i), h_2(x_i) … h_{30}(x_i)] \).
  • \( y_i \) is the target class for that single data point.

We can easily convert this dataset into a pandas dataframe for better data analysis. To view the datatype of each column, we use the below code. As every column is numeric (float64), we don’t want to perform any data preprocessing here.

logistic_regression.pycode
1
2
3
import pandas as pd
df = pd.DataFrame(data.data)
print(df.dtypes)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
0     float64
1     float64
2     float64
3     float64
4     float64
5     float64
6     float64
7     float64
8     float64
9     float64
10    float64
11    float64
12    float64
13    float64
14    float64
15    float64
16    float64
17    float64
18    float64
19    float64
20    float64
21    float64
22    float64
23    float64
24    float64
25    float64
26    float64
27    float64
28    float64
29    float64
dtype: object

Supervised Learning

In a nutshell, what we try to solve in this problem is -

  • We have some training data X_train along with its class names y_train.
  • We train a model (set of algorithms) with this training data (the magic happens here and we are yet to see it!)
  • We use the trained model to predict the class y_test of unseen data point X_test.

This is called supervised learning problem (if you don’t know yet) because we use a training dataset with class names already made available to us (nicely).

The flowchart that you could expect before diving into logistic regression implementation might look like the one shown below.

Figure 1. Supervised Learning using Logistic Regression.

Logistic regression is a type of generalized linear classification algorithm which follows a beautiful procedure to learn from data. To learn means,

  • Weight: We define a weight value (parameter) for each feature (column) in the dataset.
  • Linear Predictor (score): We compute weighted sum for each data point in the dataset.
  • Link Function: We use a link function to transform that weighted sum to the probability range \( [0, 1] \).
  • Log-Likelihood: We use the log-likelihood function as the quality metric to evaluate the prediction of the model i.e., how well the model has predicted y_predict when compared with ground truth y_train.
  • Gradient Ascent: We use gradient ascent algorithm to update the weights (parameters) by trying to maximize the likelihood.
  • Prediction: We take these learned weights and make predictions when new data point is given to the model.

Note: To understand how the above steps work, we need to have some knowledge on probability, statistics, calculus and linear algebra.

Linear Predictor (score)

First, we define a weight value for each column (feature) in our dataset. As we have 30 features (columns) in the breast cancer dataset, we will have 30 weights [ \( \mathbf W_1, \mathbf W_2 … \mathbf W_{30}\) ]. We compute the score (weighted sum) for each data point as follows.

$$ \begin{align} score & = \mathbf W_0 + (\mathbf W_1 * h_1(x_i)) + (\mathbf W_2 * h_2(x_i)) + ... + (\mathbf W_{30} * h_{30}(x_i)) \\ & = \mathbf W^T h(x_i) \end{align} $$

Notice we have \( \mathbf W_0 \) with no coefficient which is called the bias or intercept which must be learnt from the training data. If you need to understand what bias is, please watch this excellent video.

As we have numeric values, the score for each data point might fall within a range \( [-\infty, +\infty]\). Recall that our aim is to predict “given a new data point, tell me whether it’s malignant (0) or benign (1)”. This means, prediction from the ML model must be either 0 or 1. How are we going to achieve this? The answer is link function.

If you give any input to a link function (say sigmoid), it transforms that input value to a range \( [0, 1] \). In our case, anything below 0.5 is assumed to be malignant, and anything above or equal to 0.5 is assumed to be benign.

Figure 2. Link Function over a linear predictor (score).

Below code is used to find the sigmoid value for a given input score.

logistic_regression.pycode
1
2
3
4
5
6
def sigmoid(score):
  return (1 / (1 + np.exp(-score)))

def predict_probability(features, weights):
  score = np.dot(features, weights)
  return sigmoid(score)

In the above code, features, weights and score correspond to the matrices shown below.

$$ \begin{align} [features] = \begin{bmatrix} h(x_1)^T \\ h(x_2)^T \\ . \\ . \\ . \\ h(x_{569})^T \\ \end{bmatrix} = \begin{bmatrix} h_0(x_1) & h_1(x_1) & . & . & . & h_{30}(x_1) \\ h_0(x_2) & h_1(x_2) & . & . & . & h_{30}(x_2) \\ . & . & . & . & . & . \\ . & . & . & . & . & . \\ . & . & . & . & . & . \\ h_0(x_{569}) & h_1(x_{569}) & . & . & . & h_{30}(x_{569}) \\ \end{bmatrix} \end{align} $$
$$ [score] = [features] \mathbf w = \begin{bmatrix} h(x_1)^T \\ h(x_2)^T \\ . \\ . \\ . \\ h(x_{569})^T \end{bmatrix} \mathbf w = \begin{bmatrix} h(x_1)^T \mathbf w \\ h(x_2)^T \mathbf w \\ . \\ . \\ . \\ h(x_{569})^T \mathbf w \end{bmatrix} = \begin{bmatrix} \mathbf w^T h(x_1) \\ \mathbf w^T h(x_2) \\ . \\ . \\ . \\ \mathbf w^T h(x_{569}) \end{bmatrix} $$

But wait! how will the output value of this link function be the same as the ground truth value for a particular data point? It can’t be as we are randomizing the weights for the features which will throw out some random value as the prediction.

The whole point in learning algorithm is to adjust these weights based on the training data to arrive at a sweet spot that makes the ML model have low bias and low variance.

Training the classifier = Learning the weight coefficients (with low bias and low variance).

How do we adjust these weights? We need to define a quality metric that compares the output prediction of the ML model with the original ground truth class value.

After evaluating the quality metric, we use gradient ascent algorithm to update the weights in a way that the quality metric reaches a global optimum value. Interesting isn’t it?

Compute Likelihood

How do we measure “how well the classifier fits the training data”? Using likelihood. We need to choose weight coefficients \(\mathbf w\) that maximizes likelihood given below.

$$ \prod_{i=1}^N P(y_i | \mathbf x_i, \mathbf w) $$

For a binary classification problem, it turns out that we can use log-likelihood as the quality metric which makes computations and derivatives simpler.

$$ l(\mathbf w) = ln \prod_{i=1}^N P(y_i | \mathbf x_i, \mathbf w) $$

After picking the log-likelihood function, we must know it’s derivative with respect to a weight coefficient so that we can use gradient ascent to update that weight.

We use the below equation to calculate the log-likelihood for the classifier.

$$ ll(\mathbf w) = \sum_{i=1}^N ((\mathbf 1[y_i = +1] - 1) \mathbf w^T h(\mathbf w_i) - ln(1 + exp(-\mathbf w^T h(x_i)))) $$

We will understand the formation of these equations in a separate post. But for now, let us focus on implementing everything in code.

We define the below function to compute log-likelihood. Notice that we sum over all the training examples.

logistic_regression.pycode
1
2
3
4
5
def compute_log_likelihood(features, label, weights):
  indicator = (label==+1)
  scores    = np.dot(features, weights)
  ll        = np.sum((np.transpose(np.array([indicator]))-1)*scores - np.log(1. + np.exp(-scores)))
  return ll

Compute Derivative

Once we have the log-likelihood equation, we can compute its derivative with respect to a single weight coefficient using the below formula.

$$ \frac{\partial l}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf x_i) (\mathbf 1[y_i = +1] - P(y_i = +1|\mathbf x_i, \mathbf w)) $$

The above equation might look scary. But its easy to write in code.

  • The term \((\mathbf 1[y_i = +1] - P(y_i = +1|\mathbf x_i, \mathbf w)\) is nothing but the difference between indicators and predictions which is equal to errors.
  • \(h_j(\mathbf x_i)\) is the feature value of a training example \(\mathbf x_i \) for a single column \(j\).

We find the derivative of log-likelihood with respect to each of the weight coefficient \( \mathbf w \) which in turn depends on its feature column.

Notice that we sum over all the training examples, and the derivative that we return is a single number.

logistic_regression.pycode
1
2
3
def feature_derivative(errors, feature):
  derivative = np.dot(np.transpose(errors), feature)
  return derivative

Gradient Ascent

Now, we have all the ingredients to perform gradient ascent. The magic of this tutorial happens here!

Think of gradient ascent similar to hill-climbing. To reach the top of the hill (which is the global maximum), we choose a parameter called learning-rate. This defines the step-size that we need to take each iteration before we update the weight coefficients.

The steps that we will perform in gradient ascent are as follows.

  1. Initialize weights vector \( \mathbf w \) to random values or zero using np.zeros().
  2. Predict the class probability \( P(y_i = +1|\mathbf x_i, \mathbf w) \) for all training examples using predict_probability function and save to a variable named predictions. The shape of this variable would be y_train.shape.
  3. Calculate the indicator value for all training examples by comparing the label against \( +1 \) and save it to a variable named indicators. The shape of this variable would also be y_train.shape.
  4. Calculate the errors as the difference between indicators and predictions and save it to a variable named errors.
  5. Important step: For each \( j^{th} \) weight coefficient, compute it’s derivative using feature_derivative function with the \( j^{th} \) column of features. Increment the \( j^{th} \) coefficient using \( lr * derivative\) where \( lr \) is the learning rate for this algorithm which we handpick.
  6. Do steps 2 to 5 for epochs times (number of iterations) and return the learned weight coefficients.

Below is the code to perform logistic regression using gradient ascent optimization algorithm.

logistic_regression.pycode
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
# logistic regression without L2 regularization
def logistic_regression(features, labels, lr, epochs):

  # add bias (intercept) with features matrix
  bias      = np.ones((features.shape[0], 1))
  features  = np.hstack((bias, features))

  # initialize the weight coefficients
  weights = np.zeros((features.shape[1], 1))

  logs = []

  # loop over epochs times
  for epoch in range(epochs):

    # predict probability for each row in the dataset
    predictions = predict_probability(features, weights)

    # calculate the indicator value
    indicators = (labels==+1)

    # calculate the errors
    errors = np.transpose(np.array([indicators])) - predictions

    # loop over each weight coefficient
    for j in range(len(weights)):

      # calculate the derivative of jth weight cofficient
      derivative = feature_derivative(errors, features[:,j])
      weights[j] += lr * derivative

    # compute the log-likelihood
    ll = compute_log_likelihood(features, labels, weights)
    logs.append(ll)

  import matplotlib.pyplot as plt
  x = np.linspace(0, len(logs), len(logs))
  fig = plt.figure()
  plt.plot(x, logs)
  fig.suptitle('Training the classifier (without L2)')
  plt.xlabel('Epoch')
  plt.ylabel('Log-likelihood')
  fig.savefig('train_without_l2.jpg')
  plt.show()

  return weights

Split the dataset

To test our classifier’s performance, we will split the original dataset into training and testing. We choose a test_size parameter value to split the dataset into train and test using scikit-learn’s train_test_split function as shown below.

logistic_regression.pycode
1
2
3
4
5
6
7
8
from sklearn.model_selection import train_test_split
# split the dataset into training and testing 
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20, random_state=9)

print("X_train : " + str(X_train.shape))
print("y_train : " + str(y_train.shape))
print("X_test : " + str(X_test.shape))
print("y_test : " + str(y_test.shape))
1
2
3
4
X_train : (455, 30)
y_train : (455,)
X_test : (114, 30)
y_test : (114,)

Train the classifier

As we already learnt, training the classifier means learning the weight coefficients. To train the classifier, we

  • Add intercept or bias to the feature matrix.
  • Initialize the weight coefficients to zeros.
  • Handpick the hyper-parameters learning rate and epochs.
  • Use logistic_regression() function that we have just built and pass in the ingredients.
logistic_regression.pycode
1
2
3
4
5
6
# hyper-parameters
learning_rate = 1e-7
epochs        = 500

# perform logistic regression
learned_weights = logistic_regression(X_train, y_train, learning_rate, epochs)
Figure 3. Increasing log-likelihood during training (without L2 regularization).

Test the classifier

To make predictions using the trained classifier, we use X_test data (testing data), learned_weights and predict_probability() function.

To find the accuracy between ground truth class values y_test and logistic regression predicted class values predictions, we use scikit-learn’s accuracy_score() function as shown below.

logistic_regression.pycode
1
2
3
4
5
6
7
8
9
10
11
from sklearn.metrics import accuracy_score
# make predictions using learned weights on testing data
bias_train     = np.ones((X_train.shape[0], 1))
bias_test      = np.ones((X_test.shape[0], 1))
features_train = np.hstack((bias_train, X_train))
features_test  = np.hstack((bias_test, X_test))

test_predictions  = (predict_probability(features_test, learned_weights).flatten()>0.5)
train_predictions = (predict_probability(features_train, learned_weights).flatten()>0.5)
print("Accuracy of our LR classifier on training data: {}".format(accuracy_score(np.expand_dims(y_train, axis=1), train_predictions)))
print("Accuracy of our LR classifier on testing data: {}".format(accuracy_score(np.expand_dims(y_test, axis=1), test_predictions)))
1
2
Accuracy of our LR classifier on training data: 0.9164835164835164
Accuracy of our LR classifier on testing data: 0.9298245614035088

Reduce Overfitting

Overfitting is a mandatory problem that we need to solve when it comes to machine learning. After training, we have the learned weight coefficients which must not overfit the training dataset.

When the decision boundary traced by the learned weight coefficients fits the training data extremely well, we have this overfitting problem. Often, overfitting is associated with very large estimated weight coefficients. This leads to overconfident predictions which is not very good for a real-world classifier.

To solve this, we need to measure the magnitude of weight coefficients. There are two approaches to measure it.

L1 norm: Sum of absolute value

\(\lVert \mathbf w \rVert _1 = |\mathbf w_0| + |\mathbf w_1| + |\mathbf w_2| … + |\mathbf w_N| \)

L2 norm: Sum of squares

\(\lVert \mathbf w \rVert _2^2 = \mathbf w_0^2 + \mathbf w_1^2 + \mathbf w_2^2 … + \mathbf w_N^2 \)

L2 Regularization

We will use L2 norm (sum of squares) to reduce overshooting weight coefficients. It turns out that, instead of using likelihood function alone as the quality metric, what if we subtract \(\lambda \lVert \mathbf w \rVert _2^2\) from it, where \(\lambda\) is a hyper-parameter to control bias-variance tradeoff due to this regularization.

So, our new quality metric with regularization to combat overconfidence problem would be

$$ l(w) - \lambda \lVert \mathbf w \rVert _2^2 $$
  • Large \(\lambda \): High bias, low variance.
  • Small \(\lambda \): Low bias, high variance.

Recall to perform gradient ascent, we need to know the derivative of quality metric to update the weight coefficients. Thus, the new derivative equation would be

$$ \frac{\partial l(\mathbf w)}{\partial \mathbf w_j} - 2 \lambda \mathbf w_j $$

Let’s understand the regularization impact on penalizing weight coefficients.

  • If \( \mathbf w_j > 0\), then \(- 2 \lambda \mathbf w_j < 0\), thus it decreases \( \mathbf w_j > 0\) resulting in \( \mathbf w_j \) closer to 0.
  • If \( \mathbf w_j < 0\), then \(- 2 \lambda \mathbf w_j > 0\), thus it increases \( \mathbf w_j > 0\) resulting in \( \mathbf w_j \) closer to 0.

When it comes to code, we need to update feature_derivative() function, compute_log_likelihood() function and logistic_regression() function with whatever we have learnt so far about L2 regularization as shown below.

logistic_regression.pycode
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
# feature derivative computation with L2 regularization
def l2_feature_derivative(errors, feature, weight, l2_penalty, feature_is_constant):
  derivative = np.dot(np.transpose(errors), feature)
  
  if not feature_is_constant:
    derivative -= 2 * l2_penalty * weight

  return derivative

# log-likelihood computation with L2 regularization
def l2_compute_log_likelihood(features, labels, weights, l2_penalty):
  indicator = (label==+1)
  scores    = np.dot(features, weights)
  ll        = np.sum((np.transpose(np.array([indicator]))-1)*scores - np.log(1. + np.exp(-scores))) - (l2_penalty * np.sum(weights[1:]**2))
  return ll

# logistic regression with L2 regularization
def l2_logistic_regression(features, labels, lr, epochs, l2_penalty):

  # add bias (intercept) with features matrix
  bias      = np.ones((features.shape[0], 1))
  features  = np.hstack((bias, features))

  # initialize the weight coefficients
  weights = np.zeros((features.shape[1], 1))

  logs = []

  # loop over epochs times
  for epoch in range(epochs):

    # predict probability for each row in the dataset
    predictions = predict_probability(features, weights)

    # calculate the indicator value
    indicators = (labels==+1)

    # calculate the errors
    errors = np.transpose(np.array([indicators])) - predictions

    # loop over each weight coefficient
    for j in range(len(weights)):

      isIntercept = (j==0)

      # calculate the derivative of jth weight cofficient
      derivative = l2_feature_derivative(errors, features[:,j], weights[j], l2_penalty, isIntercept)
      weights[j] += lr * derivative

    # compute the log-likelihood
    ll = l2_compute_log_likelihood(features, labels, weights, l2_penalty)
    logs.append(ll)

  import matplotlib.pyplot as plt
  x = np.linspace(0, len(logs), len(logs))
  fig = plt.figure()
  plt.plot(x, logs)
  fig.suptitle('Training the classifier (with L2)')
  plt.xlabel('Epoch')
  plt.ylabel('Log-likelihood')
  fig.savefig('train_with_l2.jpg')
  plt.show()

  return weights

Now, we can perform logistic regression with L2 regularization on this dataset using the below code.

logistic_regression.pycode
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
# logistic regression with regularization
def lr_with_regularization():
  # hyper-parameters
  learning_rate = 1e-7
  epochs        = 300000
  l2_penalty    = 0.001

  # perform logistic regression and get the learned weights
  learned_weights = l2_logistic_regression(X_train, y_train, learning_rate, epochs, l2_penalty)

  # make predictions using learned weights on testing data
  bias_train     = np.ones((X_train.shape[0], 1))
  bias_test      = np.ones((X_test.shape[0], 1))
  features_train = np.hstack((bias_train, X_train))
  features_test  = np.hstack((bias_test, X_test))

  test_predictions  = (predict_probability(features_test, learned_weights).flatten()>0.5)
  train_predictions = (predict_probability(features_train, learned_weights).flatten()>0.5)
  print("Accuracy of our LR classifier on training data: {}".format(accuracy_score(np.expand_dims(y_train, axis=1), train_predictions)))
  print("Accuracy of our LR classifier on testing data: {}".format(accuracy_score(np.expand_dims(y_test, axis=1), test_predictions)))

  # using scikit-learn's logistic regression classifier
  model = LogisticRegression(random_state=9)
  model.fit(X_train, y_train)
  sk_test_predictions  = model.predict(X_test)
  sk_train_predictions = model.predict(X_train)
  print("Accuracy of scikit-learn's LR classifier on training data: {}".format(accuracy_score(y_train, sk_train_predictions)))
  print("Accuracy of scikit-learn's LR classifier on testing data: {}".format(accuracy_score(y_test, sk_test_predictions)))

  visualize_weights(np.squeeze(learned_weights), 'weights_with_l2.jpg')

# visualize weight coefficients
def visualize_weights(weights, title):
  import matplotlib.pyplot as plt
  x = np.linspace(0, len(weights), len(weights))

  fig = plt.figure()
  plt.bar(x, weights, align='center', alpha=0.5)
  plt.xlabel("Weight Index (Feature Column Number)")
  plt.ylabel("Weight Coefficient")
  plt.title('Visualizing Weights')
  plt.tight_layout()
  fig.savefig(title)

  plt.show()

lr_without_regularization()
1
2
3
4
Accuracy of our LR classifier on training data: 0.9406593406593406
Accuracy of our LR classifier on testing data: 0.9385964912280702
Accuracy of scikit-learn's LR classifier on training data: 0.9648351648351648
Accuracy of scikit-learn's LR classifier on testing data: 0.9385964912280702
Figure 4. Visualizing learnt weight coefficients after training.
Figure 5. Increasing log-likelihood during training (with L2 regularization).

Conclusion

Thus, we have implemented our very own logistic regression classifier using python and numpy with/without L2 regularization, and compared it with scikit-learn’s implementation.

We have achieved the same test accuracy as scikit-learn’s implementation and what a way to achieve it on our own!

One key take away from this post is that, we still need to manually tune these hyper-parameters (learning_rate, epochs and l2_penalty) to reach the global maximum. If you found some approach to automate this task, please leave it out in the comments so that I as well as others can learn it.

In case if you found something useful to add to this article or you found a bug in the code or would like to improve some points mentioned, feel free to write it down in the comments. Hope you found something useful here.