
Welcome back! If you followed along with the previous installment, you should be familiar enough with all of the players in the game, so to speak, to understand how the weights and biases are calculated and updated, so let's get started!
Forward Propagation
Logistic regression operates similarly to a single neuron in a neural network, that is to say, there is only one node where we compute our prediction and adjust our weights. Since there is only one node in a single layer, we can initialize both the weight and the bias to 0. (If this were a neural network with multiple layers, initializing everything to 0 would cause every node in a layer to compute the same values, making each layer behave like a single node, which is the equivalent of what we are doing now with our single-layer example.) If you recall, to train our network we run the sigmoid function σ(z) = 1 / (1 + e^(-z))
where z = w^T * x + b
. To get a better understanding of what's happening, we will manually work through our first epoch. We'll start by using the sigmoid function to get our prediction, then use that prediction to calculate the loss (the error for a single training example) for each training example using the given weights and bias, and then take the average of those losses to give us our cost (the average error across all training examples), which we will then use in our backward propagation step.
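If it helps to see the forward pass as code, here is a minimal Python sketch (the names sigmoid and forward are my own helpers, not from any library):

import numpy as np

def sigmoid(z):
    # σ(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, x):
    # z = w^T * x + b; with a single feature this is just w * x + b
    z = w * x + b
    return sigmoid(z)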
The Loss Function
To calculate the loss, we need a function that produces a number close to 0 as the prediction approaches the actual value, and a number that grows toward positive infinity as the prediction moves away from it. In this case we will use the binary cross-entropy function, so let's look at it and see how it works. The binary cross-entropy loss function is commonly used for binary classification problems, where the goal is to predict one of two possible outcomes (e.g., true or false, 0 or 1). It measures the performance of a model by comparing its predicted probabilities with the actual labels in the training data. The binary cross-entropy loss function is defined as follows: L(y, ŷ) = -(y * log(ŷ) + (1 - y) * log(1 - ŷ))
, where y is the actual value, ŷ is the prediction, and log is the natural logarithm. Looking at the formula, we can separate the two sides of the equation at the plus sign. If the actual value of the training sample is 1 (true), the right side of the equation becomes 0, so only the first part of the equation is in play: -log(ŷ). In this case the loss will approach 0 as the prediction approaches 1, and it will approach positive infinity as the prediction approaches 0. Let's show that with a few examples:
-log(0.9) = 0.105
-log(0.99) = 0.010
-log(0.05) = 3.0
-log(0.000005) = 12.2
When the true value of the training sample is 0 or false, the right side of the equation is now in effect and we are working with -log(1 - ŷ)
. Looking at this, we can already see why it will do the exact opposite of the other side of the equation: a predicted value approaching 1 now sends the loss toward positive infinity, and a predicted value near 0 sends the loss toward 0. Let's use the same examples and see what we get this time:
-log(1 - 0.9) = 2.3
-log(1 - 0.99) = 4.6
-log(1 - 0.05) = 0.05
-log(1 - 0.000005) = 0.000005
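If you want to verify these numbers yourself, here is a quick throwaway Python sketch using NumPy's natural log (np.log):

import numpy as np

# y = 1 case: the loss is -log(ŷ)
for y_hat in [0.9, 0.99, 0.05, 0.000005]:
    print(-np.log(y_hat))

# y = 0 case: the loss is -log(1 - ŷ)
for y_hat in [0.9, 0.99, 0.05, 0.000005]:
    print(-np.log(1 - y_hat))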
So, in short, what this formula does is take our prediction and return a loss value that is closer to 0 the more correct our prediction is. Now let's examine our cost function.
The Cost Function
The cost function is used to take all of our losses and average them, so that we can have an idea of how well our model is doing based on the current weights and bias after every epoch. The function looks like this: J = (1/m) * Σ L(y(i), ŷ(i)) = -(1/m) * Σ [y(i) * log(ŷ(i)) + (1 - y(i)) * log(1 - ŷ(i))]. That is to say, the sum of the losses over all of our training examples, divided by the number of training examples m.
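In code, the cost is just the mean of the per-example losses. A minimal sketch (cost is my own helper name):

import numpy as np

def cost(Y, Y_hat):
    # Binary cross-entropy cost: the average loss over all m training examples
    m = Y.shape[0]
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
    return np.sum(losses) / m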
Now that we know what the loss and cost functions look like and what they do, let's manually go through the forward propagation step of our first epoch and see what we end up with:
Given:
X = [15 72 15 31 15 1002 15 32 15 16]
Y = [1 0 1 0 1 0 1 0 1 0]
Initial weights: w = 0 <- We only have a single feature in this toy example, the number itself, so there is only 1 weight
Initial bias: b = 0
Loss function: We'll use the binary cross-entropy loss
For i = 1:
x(1) = 15, y(1) = 1
ŷ(1) = σ(w^T * x(1) + b) = σ(0*15 + 0) = 0.5
L(1) = -(y(1) * log(ŷ(1)) + (1 - y(1)) * log(1 - ŷ(1))) = -(1 * log(0.5) + 0 * log(0.5)) = -(-0.693) = 0.693
For i = 2:
x(2) = 72, y(2) = 0
ŷ(2) = σ(0*72 + 0) = 0.5
L(2) = -(0 * log(0.5) + (1 - 0) * log(1 - 0.5)) = -(0 + log(0.5)) = -(-0.693) = 0.693
For i = 3:
x(3) = 15, y(3) = 1
ŷ(3) = 0.5
L(3) = 0.693
For i = 4:
x(4) = 31, y(4) = 0
ŷ(4) = 0.5
L(4) = 0.693
For i = 5:
x(5) = 15, y(5) = 1
ŷ(5) = 0.5
L(5) = 0.693
For i = 6:
x(6) = 1002, y(6) = 0
ŷ(6) = 0.5
L(6) = 0.693
For i = 7:
x(7) = 15, y(7) = 1
ŷ(7) = 0.5
L(7) = 0.693
For i = 8:
x(8) = 32, y(8) = 0
ŷ(8) = 0.5
L(8) = 0.693
For i = 9:
x(9) = 15, y(9) = 1
ŷ(9) = 0.5
L(9) = 0.693
For i = 10:
x(10) = 16, y(10) = 0
ŷ(10) = 0.5
L(10) = 0.693
Since all of our losses were 0.693, due to initializing our weights and bias to 0, the cost will be 0.693 as well:
Cost = 0.693
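As a sanity check, the whole forward pass we just did by hand can be reproduced in a few lines of Python:

import numpy as np

X = np.array([15, 72, 15, 31, 15, 1002, 15, 32, 15, 16])
Y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
w, b = 0.0, 0.0

Y_hat = 1.0 / (1.0 + np.exp(-(w * X + b)))  # every prediction is 0.5
losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))
print(losses.mean())  # 0.693...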
Backward Propagation
Now we are ready to use our losses to figure out how we will adjust our weights to reduce the cost. To do this we need to find the derivative of our loss with respect to our weights, or dL/dw
, as well as the derivative of our loss with respect to the bias, or dL/db
. The derivative is the slope of the line tangent to a point on a curve, and we can use that slope to figure out the next step we should take to minimize our cost. The reason we need to do this is that we don't know the shape of the curve or plane we are traversing; we have to take steps and see if we are moving in the correct direction based on the cost we get after each epoch. Think of it as if you were in a pitch-black room with a floor that slopes in many directions and has many small hills, and your goal is to reach the absolute lowest point in the room. You would feel where your feet are, poke your foot a little to the left or right, and feel which way leads further down. When you found that direction, you would take one small step in it, a small step rather than a large one, in case you made a mistake and stepped over the low point onto higher ground. This, in a manner of speaking, is what the entire algorithm does: as we find our losses and cost, we take a small step, scaled by the learning rate α (alpha), in the direction of the derivatives we discover. Deriving these involves a little calculus that we won't get into today (maybe in a later post), but suffice it to say that when it's all worked out we get dL/dw = (1/m) * Σ (ŷ(i) - y(i)) * x(i)
, that is to say, the average over all training examples of (the prediction minus the actual value) times the example. Once again, using samples with only one feature makes this very easy to reason about. For dL/db we calculate dL/db = (1/m) * Σ (ŷ(i) - y(i))
, the same as the above without the need to multiply in the sample itself.
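As code, the two gradients are one line each. A sketch under the same single-feature setup (gradients is my own helper name):

import numpy as np

def gradients(X, Y, Y_hat):
    # dL/dw = (1/m) * Σ (ŷ(i) - y(i)) * x(i)
    # dL/db = (1/m) * Σ (ŷ(i) - y(i))
    m = X.shape[0]
    dw = np.sum((Y_hat - Y) * X) / m
    db = np.sum(Y_hat - Y) / m
    return dw, db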
To perform the update we just need to subtract the learning rate times each derivative from the weight and bias respectively, yielding: w = w - α * dL/dw and b = b - α * dL/db
Now let's do the backward propagation manually for the first epoch and see what we get.
dL/dw = (1/10) * [(0.5 - 1)*15 + (0.5 - 0)*72 + (0.5 - 1)*15 + (0.5 - 0)*31 + (0.5 - 1)*15 + (0.5 - 0)*1002 + (0.5 - 1)*15 + (0.5 - 0)*32 + (0.5 - 1)*15 + (0.5 - 0)*16]
= (1/10) * [-7.5 + 36 - 7.5 + 15.5 - 7.5 + 501 - 7.5 + 16 - 7.5 + 8]
= (1/10) * 539
= 53.9
dL/db = (1/10) * [(0.5 - 1) + (0.5 - 0) + (0.5 - 1) + (0.5 - 0) + (0.5 - 1) + (0.5 - 0) + (0.5 - 1) + (0.5 - 0) + (0.5 - 1) + (0.5 - 0)]
= (1/10) * [-0.5 + 0.5 - 0.5 + 0.5 - 0.5 + 0.5 - 0.5 + 0.5 - 0.5 + 0.5]
= 0
Now we can update w and b using a learning rate α:
w = w - α * dL/dw
b = b - α * dL/db
Let's use a learning rate of α = 0.01
w = 0 - 0.01 * 53.9 = -0.539
b = 0 - 0.01 * 0 = 0
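Putting the whole first epoch together in Python reproduces the numbers we just computed by hand:

import numpy as np

X = np.array([15, 72, 15, 31, 15, 1002, 15, 32, 15, 16])
Y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])
w, b, alpha = 0.0, 0.0, 0.01

Y_hat = 1.0 / (1.0 + np.exp(-(w * X + b)))  # all 0.5 on the first epoch
dw = np.sum((Y_hat - Y) * X) / X.shape[0]   # 53.9
db = np.sum(Y_hat - Y) / X.shape[0]         # 0.0
w = w - alpha * dw                          # -0.539
b = b - alpha * db                          # 0.0
print(w, b)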
Now that we have our updated weight and bias we can start the next epoch and try to reduce our losses. In the next post we will put all of this into code so that we can add more items to our training set and run many epochs in order to get a model that works well, because, of course, doing this by hand would be incredibly time-consuming once our training set had, say, 1000 items rather than 10.
Thanks for reading, and stay tuned for the next installment!