During my journey as a Machine Learning (ML) practitioner, I found it has become ultimately easy for any human with limited knowledge on algorithms to take advantage of free python libraries such as scikit-learn to solve a ML problem. Truth be said, it’s easy and sometimes no brainer to achieve this, as there are so many codes available in GitHub, Medium, Kaggle etc., You just need some amount of time looking at these codes to arrive at a solution to a problem of your choice.

But, what if we learn every algorithm or procedures behind each machine learning pipeline that does all the heavy lifting for us inside these amazing libraries. In this blog post and the series of blog posts to come, I will be focusing on implementing machine learning algorithms from scratch using python and numpy.

Sure you might argue with me for the first paragraph. But learning how each algorithm work behind the scenes is very important to use these algorithms and bring in customized features in any domain (say hardware design).

In this blog post, we will implement logistic regression from scratch using python and numpy to a binary classification problem.

I assume that you have knowledge on python programming and scikit-learn, because at the end we will compare our implementation (from scratch) with scikit-learn’s implementation.

### Dataset

We will use the breast cancer dataset from scikit-learn for this implementation. This is a binary classification problem i.e., each data point in the training data belong to one of two classes. Below code loads the dataset and prints out the important information we seek.

1
2
3
4
5
6
7
8
9

from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
print(data.keys())
print("No.of.data points (rows) : {}".format(len(data.data)))
print("No.of.features (columns) : {}".format(len(data.feature_names)))
print("No.of.classes : {}".format(len(data.target_names)))
print("Class names : {}".format(list(data.target_names)))

1
2
3
4
5
6

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
No.of.data points (rows) : 569
No.of.features (columns) : 30
No.of.classes : 2
Class names : ['malignant', 'benign']

If you wish to know more about the dataset, use data.DESCR to print out the entire description of the dataset.

As you can see, the dataset has data and target keys from which we can access the data in this dataset. There are 569 rows and 30 columns. To describe the dataset mathematically, we will use this notation \( [x_i, h(x_i), y_i] \).

where

- \( x_i \) denotes a single data point in the dataset.
- \( h(x_i) \) is th feature vector for that single data point which has \( [h_1(x_i), h_2(x_i) … h_{30}(x_i)] \).
- \( y_i \) is the target class for that single data point.

We can easily convert this dataset into a pandas dataframe for better data analysis. To view the datatype of each column, we use the below code. As every column is numeric (float64), we don’t want to perform any data preprocessing here.

1
2
3

import pandas as pd
df = pd.DataFrame(data.data)
print(df.dtypes)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31

0 float64
1 float64
2 float64
3 float64
4 float64
5 float64
6 float64
7 float64
8 float64
9 float64
10 float64
11 float64
12 float64
13 float64
14 float64
15 float64
16 float64
17 float64
18 float64
19 float64
20 float64
21 float64
22 float64
23 float64
24 float64
25 float64
26 float64
27 float64
28 float64
29 float64
dtype: object

### Supervised Learning

In a nutshell, what we try to solve in this problem is -

- We have some training data X_train along with its class names y_train.
- We train a model (set of algorithms) with this training data (the magic happens here and we are yet to see it!)
- We use the trained model to predict the class y_test of unseen data point X_test.

This is called supervised learning problem (if you don’t know yet) because we use a training dataset with class names already made available to us (nicely).

The flowchart that you could expect before diving into logistic regression implementation might look like the one shown below.

Logistic regression is a type of generalized linear classification algorithm which follows a beautiful procedure to learn from data. To learn means,

**Weight**: We define a weight value (parameter) for each feature (column) in the dataset.**Linear Predictor (score)**: We compute weighted sum for each data point in the dataset.**Link Function**: We use a link function to transform that weighted sum to the probability range \( [0, 1] \).**Log-Likelihood**: We use the log-likelihood function as the*quality metric*to evaluate the prediction of the model i.e., how well the model has predicted y_predict when compared with ground truth y_train.**Gradient Ascent**: We use gradient ascent algorithm to*update the weights (parameters)*by trying to maximize the likelihood.**Prediction**: We take these learned weights and make predictions when new data point is given to the model.

**Note**: To understand how the above steps work, we need to have some knowledge on probability, statistics, calculus and linear algebra.

### Linear Predictor (score)

First, we define a weight value for each column (feature) in our dataset. As we have 30 features (columns) in the breast cancer dataset, we will have 30 weights [ \( \mathbf W_1, \mathbf W_2 … \mathbf W_{30}\) ]. We compute the score (weighted sum) for each data point as follows.

Notice we have \( \mathbf W_0 \) with no coefficient which is called the *bias* or *intercept* which must be learnt from the training data. If you need to understand what bias is, please watch this excellent video.

As we have numeric values, the score for each data point might fall within a range \( [-\infty, +\infty]\). Recall that our aim is to predict *“given a new data point, tell me whether it’s malignant (0) or benign (1)”*. This means, prediction from the ML model must be either 0 or 1. How are we going to achieve this? The answer is *link function*.

### Link Function

If you give any input to a link function (say sigmoid), it transforms that input value to a range \( [0, 1] \). In our case, anything below 0.5 is assumed to be malignant, and anything above or equal to 0.5 is assumed to be benign.

Below code is used to find the sigmoid value for a given input score.

1
2
3
4
5
6

def sigmoid(score):
return (1 / (1 + np.exp(-score)))
def predict_probability(features, weights):
score = np.dot(features, weights)
return sigmoid(score)

In the above code, features, weights and score correspond to the matrices shown below.

But wait! how will the output value of this link function be the same as the ground truth value for a particular data point? It can’t be as we are randomizing the weights for the features which will throw out some random value as the prediction.

The whole point in learning algorithm is to *adjust these weights* based on the training data to arrive at a *sweet spot* that makes the ML model have *low bias* and *low variance*.

Training the classifier = Learning the weight coefficients (with low bias and low variance).

How do we adjust these weights? We need to define a *quality metric* that compares the output prediction of the ML model with the original ground truth class value.

After evaluating the quality metric, we use *gradient ascent algorithm* to update the weights in a way that the quality metric reaches a global optimum value. Interesting isn’t it?

### Compute Likelihood

How do we measure “how well the classifier fits the training data”? Using likelihood. We need to choose weight coefficients \(\mathbf w\) that maximizes likelihood given below.

For a binary classification problem, it turns out that we can use log-likelihood as the quality metric which makes computations and derivatives simpler.

After picking the log-likelihood function, we must know it’s derivative with respect to a weight coefficient so that we can use gradient ascent to update that weight.

We use the below equation to calculate the log-likelihood for the classifier.

We will understand the formation of these equations in a separate post. But for now, let us focus on implementing everything in code.

We define the below function to compute log-likelihood. Notice that we sum over all the training examples.

1
2
3
4
5

def compute_log_likelihood(features, label, weights):
indicator = (label==+1)
scores = np.dot(features, weights)
ll = np.sum((np.transpose(np.array([indicator]))-1)*scores - np.log(1. + np.exp(-scores)))
return ll

### Compute Derivative

Once we have the log-likelihood equation, we can compute its derivative with respect to a single weight coefficient using the below formula.

The above equation might look scary. But its easy to write in code.

- The term \((\mathbf 1[y_i = +1] - P(y_i = +1|\mathbf x_i, \mathbf w)\) is nothing but the difference between indicators and predictions which is equal to errors.
- \(h_j(\mathbf x_i)\) is the feature value of a training example \(\mathbf x_i \) for a single column \(j\).

We find the derivative of log-likelihood with respect to each of the weight coefficient \( \mathbf w \) which in turn depends on its feature column.

Notice that we sum over all the training examples, and the derivative that we return is a single number.

1
2
3

def feature_derivative(errors, feature):
derivative = np.dot(np.transpose(errors), feature)
return derivative

### Gradient Ascent

Now, we have all the ingredients to perform gradient ascent. The magic of this tutorial happens here!

Think of gradient ascent similar to hill-climbing. To reach the top of the hill (which is the global maximum), we choose a parameter called *learning-rate*. This defines the *step-size* that we need to take each iteration before we update the weight coefficients.

The steps that we will perform in gradient ascent are as follows.

- Initialize weights vector \( \mathbf w \) to random values or zero using np.zeros().
- Predict the class probability \( P(y_i = +1|\mathbf x_i, \mathbf w) \) for all training examples using predict_probability function and save to a variable named predictions. The shape of this variable would be y_train.shape.
- Calculate the indicator value for all training examples by comparing the label against \( +1 \) and save it to a variable named indicators. The shape of this variable would also be y_train.shape.
- Calculate the errors as the difference between indicators and predictions and save it to a variable named errors.
**Important step**: For each \( j^{th} \) weight coefficient, compute it’s derivative using feature_derivative function with the \( j^{th} \) column of features. Increment the \( j^{th} \) coefficient using \( lr * derivative\) where \( lr \) is the learning rate for this algorithm which we handpick.- Do steps 2 to 5 for epochs times (number of iterations) and return the learned weight coefficients.

Below is the code to perform logistic regression using gradient ascent optimization algorithm.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46

# logistic regression without L2 regularization
def logistic_regression(features, labels, lr, epochs):
# add bias (intercept) with features matrix
bias = np.ones((features.shape[0], 1))
features = np.hstack((bias, features))
# initialize the weight coefficients
weights = np.zeros((features.shape[1], 1))
logs = []
# loop over epochs times
for epoch in range(epochs):
# predict probability for each row in the dataset
predictions = predict_probability(features, weights)
# calculate the indicator value
indicators = (labels==+1)
# calculate the errors
errors = np.transpose(np.array([indicators])) - predictions
# loop over each weight coefficient
for j in range(len(weights)):
# calculate the derivative of jth weight cofficient
derivative = feature_derivative(errors, features[:,j])
weights[j] += lr * derivative
# compute the log-likelihood
ll = compute_log_likelihood(features, labels, weights)
logs.append(ll)
import matplotlib.pyplot as plt
x = np.linspace(0, len(logs), len(logs))
fig = plt.figure()
plt.plot(x, logs)
fig.suptitle('Training the classifier (without L2)')
plt.xlabel('Epoch')
plt.ylabel('Log-likelihood')
fig.savefig('train_without_l2.jpg')
plt.show()
return weights

### Split the dataset

To test our classifier’s performance, we will split the original dataset into training and testing. We choose a test_size parameter value to split the dataset into train and test using scikit-learn’s train_test_split function as shown below.

1
2
3
4
5
6
7
8

from sklearn.model_selection import train_test_split
# split the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20, random_state=9)
print("X_train : " + str(X_train.shape))
print("y_train : " + str(y_train.shape))
print("X_test : " + str(X_test.shape))
print("y_test : " + str(y_test.shape))

1
2
3
4

X_train : (455, 30)
y_train : (455,)
X_test : (114, 30)
y_test : (114,)

### Train the classifier

As we already learnt, training the classifier means learning the weight coefficients. To train the classifier, we

- Add intercept or bias to the feature matrix.
- Initialize the weight coefficients to zeros.
- Handpick the hyper-parameters
*learning rate*and*epochs*. - Use logistic_regression() function that we have just built and pass in the ingredients.

1
2
3
4
5
6

# hyper-parameters
learning_rate = 1e-7
epochs = 500
# perform logistic regression
learned_weights = logistic_regression(X_train, y_train, learning_rate, epochs)

### Test the classifier

To make predictions using the trained classifier, we use X_test data (testing data), learned_weights and predict_probability() function.

To find the accuracy between ground truth class values y_test and logistic regression predicted class values predictions, we use scikit-learn’s accuracy_score() function as shown below.

1
2
3
4
5
6
7
8
9
10
11

from sklearn.metrics import accuracy_score
# make predictions using learned weights on testing data
bias_train = np.ones((X_train.shape[0], 1))
bias_test = np.ones((X_test.shape[0], 1))
features_train = np.hstack((bias_train, X_train))
features_test = np.hstack((bias_test, X_test))
test_predictions = (predict_probability(features_test, learned_weights).flatten()>0.5)
train_predictions = (predict_probability(features_train, learned_weights).flatten()>0.5)
print("Accuracy of our LR classifier on training data: {}".format(accuracy_score(np.expand_dims(y_train, axis=1), train_predictions)))
print("Accuracy of our LR classifier on testing data: {}".format(accuracy_score(np.expand_dims(y_test, axis=1), test_predictions)))

1
2

Accuracy of our LR classifier on training data: 0.9164835164835164
Accuracy of our LR classifier on testing data: 0.9298245614035088

### Reduce Overfitting

Overfitting is a mandatory problem that we need to solve when it comes to machine learning. After training, we have the learned weight coefficients which must not *overfit* the training dataset.

When the decision boundary traced by the learned weight coefficients fits the training data extremely well, we have this overfitting problem. Often, overfitting is associated with very large estimated weight coefficients. This leads to overconfident predictions which is not very good for a real-world classifier.

To solve this, we need to measure the magnitude of weight coefficients. There are two approaches to measure it.

**L1 norm**: Sum of absolute value

\(\lVert \mathbf w \rVert _1 = |\mathbf w_0| + |\mathbf w_1| + |\mathbf w_2| … + |\mathbf w_N| \)

**L2 norm**: Sum of squares

\(\lVert \mathbf w \rVert _2^2 = \mathbf w_0^2 + \mathbf w_1^2 + \mathbf w_2^2 … + \mathbf w_N^2 \)

### L2 Regularization

We will use L2 norm (sum of squares) to reduce overshooting weight coefficients. It turns out that, instead of using likelihood function alone as the quality metric, what if we subtract \(\lambda \lVert \mathbf w \rVert _2^2\) from it, where \(\lambda\) is a hyper-parameter to control bias-variance tradeoff due to this regularization.

So, our new quality metric with regularization to combat overconfidence problem would be

- Large \(\lambda \): High bias, low variance.
- Small \(\lambda \): Low bias, high variance.

Recall to perform gradient ascent, we need to know the derivative of quality metric to update the weight coefficients. Thus, the new derivative equation would be

Let’s understand the regularization impact on penalizing weight coefficients.

- If \( \mathbf w_j > 0\), then \(- 2 \lambda \mathbf w_j < 0\), thus it decreases \( \mathbf w_j > 0\) resulting in \( \mathbf w_j \) closer to 0.
- If \( \mathbf w_j < 0\), then \(- 2 \lambda \mathbf w_j > 0\), thus it increases \( \mathbf w_j > 0\) resulting in \( \mathbf w_j \) closer to 0.

When it comes to code, we need to update feature_derivative() function, compute_log_likelihood() function and logistic_regression() function with whatever we have learnt so far about L2 regularization as shown below.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64

# feature derivative computation with L2 regularization
def l2_feature_derivative(errors, feature, weight, l2_penalty, feature_is_constant):
derivative = np.dot(np.transpose(errors), feature)
if not feature_is_constant:
derivative -= 2 * l2_penalty * weight
return derivative
# log-likelihood computation with L2 regularization
def l2_compute_log_likelihood(features, labels, weights, l2_penalty):
indicator = (label==+1)
scores = np.dot(features, weights)
ll = np.sum((np.transpose(np.array([indicator]))-1)*scores - np.log(1. + np.exp(-scores))) - (l2_penalty * np.sum(weights[1:]**2))
return ll
# logistic regression with L2 regularization
def l2_logistic_regression(features, labels, lr, epochs, l2_penalty):
# add bias (intercept) with features matrix
bias = np.ones((features.shape[0], 1))
features = np.hstack((bias, features))
# initialize the weight coefficients
weights = np.zeros((features.shape[1], 1))
logs = []
# loop over epochs times
for epoch in range(epochs):
# predict probability for each row in the dataset
predictions = predict_probability(features, weights)
# calculate the indicator value
indicators = (labels==+1)
# calculate the errors
errors = np.transpose(np.array([indicators])) - predictions
# loop over each weight coefficient
for j in range(len(weights)):
isIntercept = (j==0)
# calculate the derivative of jth weight cofficient
derivative = l2_feature_derivative(errors, features[:,j], weights[j], l2_penalty, isIntercept)
weights[j] += lr * derivative
# compute the log-likelihood
ll = l2_compute_log_likelihood(features, labels, weights, l2_penalty)
logs.append(ll)
import matplotlib.pyplot as plt
x = np.linspace(0, len(logs), len(logs))
fig = plt.figure()
plt.plot(x, logs)
fig.suptitle('Training the classifier (with L2)')
plt.xlabel('Epoch')
plt.ylabel('Log-likelihood')
fig.savefig('train_with_l2.jpg')
plt.show()
return weights

Now, we can perform logistic regression with L2 regularization on this dataset using the below code.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47

# logistic regression with regularization
def lr_with_regularization():
# hyper-parameters
learning_rate = 1e-7
epochs = 300000
l2_penalty = 0.001
# perform logistic regression and get the learned weights
learned_weights = l2_logistic_regression(X_train, y_train, learning_rate, epochs, l2_penalty)
# make predictions using learned weights on testing data
bias_train = np.ones((X_train.shape[0], 1))
bias_test = np.ones((X_test.shape[0], 1))
features_train = np.hstack((bias_train, X_train))
features_test = np.hstack((bias_test, X_test))
test_predictions = (predict_probability(features_test, learned_weights).flatten()>0.5)
train_predictions = (predict_probability(features_train, learned_weights).flatten()>0.5)
print("Accuracy of our LR classifier on training data: {}".format(accuracy_score(np.expand_dims(y_train, axis=1), train_predictions)))
print("Accuracy of our LR classifier on testing data: {}".format(accuracy_score(np.expand_dims(y_test, axis=1), test_predictions)))
# using scikit-learn's logistic regression classifier
model = LogisticRegression(random_state=9)
model.fit(X_train, y_train)
sk_test_predictions = model.predict(X_test)
sk_train_predictions = model.predict(X_train)
print("Accuracy of scikit-learn's LR classifier on training data: {}".format(accuracy_score(y_train, sk_train_predictions)))
print("Accuracy of scikit-learn's LR classifier on testing data: {}".format(accuracy_score(y_test, sk_test_predictions)))
visualize_weights(np.squeeze(learned_weights), 'weights_with_l2.jpg')
# visualize weight coefficients
def visualize_weights(weights, title):
import matplotlib.pyplot as plt
x = np.linspace(0, len(weights), len(weights))
fig = plt.figure()
plt.bar(x, weights, align='center', alpha=0.5)
plt.xlabel("Weight Index (Feature Column Number)")
plt.ylabel("Weight Coefficient")
plt.title('Visualizing Weights')
plt.tight_layout()
fig.savefig(title)
plt.show()
lr_without_regularization()

1
2
3
4

Accuracy of our LR classifier on training data: 0.9406593406593406
Accuracy of our LR classifier on testing data: 0.9385964912280702
Accuracy of scikit-learn's LR classifier on training data: 0.9648351648351648
Accuracy of scikit-learn's LR classifier on testing data: 0.9385964912280702

### Conclusion

Thus, we have implemented our very own logistic regression classifier using python and numpy with/without L2 regularization, and compared it with scikit-learn’s implementation.

We have achieved the **same test accuracy as scikit-learn’s implementation** and what a way to achieve it on our own!

One key take away from this post is that, we still need to manually tune these hyper-parameters (learning_rate, epochs and l2_penalty) to reach the global maximum. If you found some approach to automate this task, please leave it out in the comments so that I as well as others can learn it.

In case if you found something useful to add to this article or you found a bug in the code or would like to improve some points mentioned, feel free to write it down in the comments. Hope you found something useful here.