Deep Learning: Logistic Regression
by: Manas Reddy
Over the past few decades, the digitization of our society has led to massive amounts of data being gathered and stored. Combine this increase in the scale of stored information with advances in hardware computational power and algorithmic innovations, and the field of artificial intelligence (AI) has jumped into the spotlight, as machines seem to possess the ‘magical’ ability to learn without being told explicitly what to do.
What makes this field so interesting is that it can perform tasks without us ever having to handle them, accounting for errors or noisy information along the way. The computer should, in theory, be able to “adapt” and still produce the correct output. Artificial intelligence is also being used to drastically improve existing methods or find entirely new ways to solve long-standing problems, for example early cancer detection through early genetic screening in babies.
So the question arises: how does this so-called “Artificial Intelligence” actually work?
Here is an example of a very simple “neuron”. Neurons, as you know, are the building blocks of the brain. Likewise, in AI, neurons form the basic building blocks of any network. A neuron mainly does three things (see the sketch after this list):
- Takes input from the data, shown here as x1, x2, x3
- Applies some function to that data
- Finally, produces the output
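To make those three steps concrete, here is a minimal sketch of such a neuron in Python. The `neuron` function name is my own, and the “function” applied here is just a placeholder sum; the weighted sum and activation a real neuron uses are introduced in the sections below.

```python
# A minimal sketch of a single neuron: inputs -> function -> output.
# The "function" here is a placeholder sum, purely for illustration.

def neuron(x1, x2, x3):
    # 1. Take inputs from the data
    inputs = [x1, x2, x3]
    # 2. Apply some function to the data (a plain sum, as a stand-in)
    result = sum(inputs)
    # 3. Produce the output
    return result

print(neuron(1.0, 2.0, 3.0))  # 6.0
```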
Sounds pretty straightforward, but much lies beneath the surface. An AI can contain hundreds, if not thousands, of such neurons to produce a suitable output.
LOGISTIC REGRESSION: First Principles Method
So, as we understood, a neuron consists of inputs, and a function is performed on those inputs to give you an output. But to determine how important each input is to the output, we add weights to the inputs, as in many cases some inputs have more importance than others. For example, in creating an algorithm to work out how much a house should be priced at in a market, the square footage of the house is more important to the result than the type of roof tiles laid on it. Thus weights are added, depicted as w1, w2, w3, etc. To offset the zero condition, a bias b is added.
Here, represented in matrix form, we map the weights as a [1, 3] matrix, as there is only one output and three inputs, so the neuron computes

z = w1·x1 + w2·x2 + w3·x3 + b

This equation closely resembles a line equation, y = mx + c.
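As a small sketch of that computation (assuming NumPy; the feature values, weights, and bias below are made up purely for illustration):

```python
import numpy as np

# Hypothetical house-pricing inputs (values made up for illustration).
x = np.array([[1500.0],  # x1: square footage
              [3.0],     # x2: number of bedrooms
              [2.0]])    # x3: roof-tile type, numerically encoded

w = np.array([[0.5, 0.03, 0.01]])  # [1, 3] weight matrix, illustrative values
b = -1.0                           # bias, offsetting the zero condition

z = np.dot(w, x) + b  # [1,3] x [3,1] -> a single number
print(z)
```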
The SIGMOID Function
This is what the sigmoid function looks like: an S-shaped curve whose output is always squashed between 0 and 1, saturating toward those two values at its extremes, which makes it a perfect function for binary classification. The formula for the sigmoid function is

σ(z) = 1 / (1 + e^(-z))

Applying this activation to z gives a value close to 0 or close to 1. If z is very small, i.e., very negative, we get 1 / (1 + a value that is very big), which is approximately 0; and if z is very big, e^(-z) approaches 0, so we get roughly 1 / 1 = 1. That's why the sigmoid function is preferred for binary classification.
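A quick sketch in Python (assuming NumPy) shows this saturating behaviour:

```python
import numpy as np

def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^(-z)); the output always lies between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-10))  # ~0.000045 -> very negative z pushes the output toward 0
print(sigmoid(0))    # 0.5       -> the midpoint
print(sigmoid(10))   # ~0.99995  -> very positive z pushes the output toward 1
```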
Logistic Regression
Now for the hero of the story: Logistic Regression. Logistic regression in AI actually works on the same principles as logistic regression in statistics. Imposter syndrome who? It borrows those principles, fine-tunes them to match its needs, and gives the output. Logistic regression is widely used in machine learning to answer simple yes/no, true/false questions, i.e., binary classification of data: for example, determining whether a picture shows a cat or not, or whether a COVID patient's lung scan shows a lung infection.
So, from the equations above:

z = wᵀx + b
a = σ(z)

where z is the linear combination of the inputs and a is the activation, our predicted output.
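Putting the two equations together, here is a minimal sketch of a single prediction (assuming NumPy; `predict` is my own name, and the weights, bias, and input are made-up values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    z = np.dot(w.T, x) + b  # z = w^T x + b : the linear part
    a = sigmoid(z)          # a = sigmoid(z), read as P(y = 1 | x)
    # Threshold at 0.5, the usual convention, for a yes/no answer
    return a, (a > 0.5).astype(int)

# Hypothetical example: 3 features, weights and bias chosen arbitrarily.
w = np.array([[0.4], [-0.3], [0.1]])
b = 0.2
x = np.array([[1.0], [2.0], [0.5]])
prob, label = predict(w, b, x)
print(prob, label)
```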
LOSS and COST Functions
To effectively train our model, we must guide it toward what's desired and away from what's not, so we attach a loss function to the neuron's output, and the goal of the machine is to minimize the loss to give the optimal output.
On observation, we deduce that the loss function optimal for logistic regression is the cross-entropy loss:

L(a, y) = -(y·log(a) + (1 - y)·log(1 - a))

(The squared-error loss used in linear regression is avoided here, because combined with the sigmoid it would make the optimization non-convex.)
And the cost function is given as the average of the loss over all m training examples:

J(w, b) = (1/m) Σ L(a⁽ⁱ⁾, y⁽ⁱ⁾)
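In code, a minimal sketch (assuming NumPy; A holds hypothetical predicted activations and Y the true labels for m = 4 examples):

```python
import numpy as np

def loss(a, y):
    # Cross-entropy loss for a single example:
    # L(a, y) = -(y * log(a) + (1 - y) * log(1 - a))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def cost(A, Y):
    # The cost J is the average loss over all m training examples.
    m = Y.shape[1]
    return np.sum(loss(A, Y)) / m

# Hypothetical predictions A and true labels Y.
A = np.array([[0.9, 0.2, 0.7, 0.1]])
Y = np.array([[1,   0,   1,   0]])
print(cost(A, Y))  # small cost: the predictions match the labels well
```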
Gradient Descent
Plotting the cost function J(w, b) gives us a convex, bowl-shaped surface with a single lowest point.
The reason we use this cost is precisely that it is convex, with one global optimum. Imagine rolling a ball from any corner of the bowl: it settles at the place with the least loss, right at the bottom. Similarly, the global optimum is the lowest point of the surface, and to reach it we must lower the value iteratively until we arrive there. The best way to lower the value is by using a derivative, since the definition of a derivative is how much one value changes when another is incremented by an infinitesimally small amount. So the update formulas for the weights (w) and bias (b) are:

w := w - α·(∂J/∂w)
b := b - α·(∂J/∂b)
where α is the learning rate, or the “stride” the ball takes on its way to the bottom, to use the previous analogy.
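As a sketch, one gradient-descent step in Python would look like this (`gradient_descent_step` is my own name; the gradients dw and db are assumed already computed, which is what backpropagation below is for):

```python
# One gradient-descent step; dw and db are the gradients of the cost J
# with respect to w and b, assumed already computed.
def gradient_descent_step(w, b, dw, db, alpha=0.01):
    w = w - alpha * dw  # step the weights downhill
    b = b - alpha * db  # step the bias downhill
    return w, b
```

Repeating this step moves the parameters a little further down the bowl each time.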
Forward Propagation
In order to get an output, the input data must be fed in the forward direction, so that it passes through the function. Each hidden layer accepts the input data, processes it as per the activation function, and passes the result to the successive layer.
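For logistic regression, forward propagation for all m examples at once can be sketched like this (assuming NumPy; `forward` is my own name, and X is a hypothetical matrix holding one example per column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, X):
    # X has shape (n, m): n input features, m examples fed forward at once.
    Z = np.dot(w.T, X) + b  # linear step, z = w^T x + b, for every example
    A = sigmoid(Z)          # activation step for every example
    return A
```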
Backward Propagation
In order to find the gradient, one must use backpropagation. But now the question arises: what the heck is a gradient, and why do we need to find it? The gradient measures how the different weights we used at the beginning of the function, i.e., w1, w2, w3, each affect the overall cost. We want a low cost, as a low cost indicates that the model is performing well and is accurate. So after one forward propagation, we move backwards through the computation, from the output toward the inputs, to work out how much each weight and the bias should change.
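Here is a minimal sketch of backpropagation for logistic regression, tied into a full training loop (assuming NumPy; `backward` is my own name, and the tiny dataset is made up for illustration). For this model the gradients work out to dZ = A - Y, dw = (1/m)·X·dZᵀ, and db = (1/m)·Σ dZ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward(X, A, Y):
    # Gradients of the cost J with respect to w and b.
    m = X.shape[1]
    dZ = A - Y                # prediction error for every example
    dw = np.dot(X, dZ.T) / m  # dJ/dw: how each weight affects the cost
    db = np.sum(dZ) / m       # dJ/db: how the bias affects the cost
    return dw, db

# Putting one full training loop together (toy data, made up for illustration):
X = np.array([[0.0, 1.0, 2.0, 3.0]])  # 1 feature, 4 examples
Y = np.array([[0,   0,   1,   1]])    # labels
w, b = np.zeros((1, 1)), 0.0

for _ in range(1000):
    A = sigmoid(np.dot(w.T, X) + b)     # forward propagation
    dw, db = backward(X, A, Y)          # backward propagation
    w, b = w - 0.1 * dw, b - 0.1 * db   # gradient-descent update

print(sigmoid(np.dot(w.T, X) + b))  # predictions move toward Y
```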
These are the key functions that make up Logistic Regression.