Fundamentals of Deep Learning: First Principles Part II
By: Manas Reddy
“Okay, the last article was getting wayyy too long, and who likes reading 20-page articles? Picking up where we left off.”
1.7 Activation Functions
Now that we've successfully depicted a small neural network and its fundamental workings, many questions arise:
“Are we gonna set the weights? How do we use custom inputs? How do we optimize the functions?” “All questions will be answered and thou shalt redeem your destiny along this quest for knowledge” ~ Master Oogway
The activation function is supposed to mimic the “activation” of a neuron in the brain: it lets us activate or deactivate certain neurons in our network.
We’ve learned that a basic neural network consists of multiplying the weights by the inputs and adding the biases to get an output. But a raw matrix of values isn’t much to draw inferences from; we usually want an accurate output, or at least a desired range of outputs. So we introduce the concept of Activation Functions: the output we get from the multiplying and adding is pipelined into an activation function.
There are many activation functions, but we’ll discuss the three major ones. Namely the Step Function, Sigmoid Function, and ReLU Function.
1. Step Function
A pretty straightforward function: it outputs either 0 or 1, depending on whether its input crosses a threshold. As one can probably understand, most problems we need to solve don’t have answers of just zero or one, so this activation function isn’t too popular.
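Here’s a minimal NumPy sketch of the idea (my own illustration; thresholding at 0 is just the usual convention):

```python
import numpy as np

def step(z):
    # outputs 1 once the weighted sum crosses the threshold (here 0), else 0
    return np.where(z >= 0, 1, 0)

print(step(np.array([-2.0, -0.5, 0.0, 3.0])))  # [0 0 1 1]
```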
2. Sigmoid Function
The main reason we use the sigmoid function is that its output lies between 0 and 1. It is therefore especially useful for models where we have to predict a probability as the output: since probabilities only exist in the range 0 to 1, sigmoid is the right choice.
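As a quick sketch (again my own illustration, not code from the original post):

```python
import numpy as np

def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^(-z)): squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ≈ [0.0067 0.5 0.9933]
```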
3. ReLU Function
The ReLU activation function is the most widely used activation, owing to its significance in Convolutional Neural Networks and Deep Learning.
The ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.
Range: [0, ∞)
Both the function and its derivative are monotonic.
The issue is that all negative values become zero immediately, which reduces the model’s ability to fit or train on the data properly: any negative input given to the ReLU activation function turns into zero right away, so negative values are never mapped appropriately.
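The ReLU itself is a one-liner; a small sketch:

```python
import numpy as np

def relu(z):
    # zero when z < 0, identity when z >= 0
    return np.maximum(0, z)

print(relu(np.array([-3.0, -0.1, 0.0, 2.5])))  # [0.  0.  0.  2.5]
```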
1.8 The Role of Activation Functions
Now that we’ve seen how these individual functions work, let’s see how they impact our neural network’s structure. The neurons in a neural network effectively try to “graph” the relationship between inputs and outputs so we can draw inferences. But most of the time the most accurate fit isn’t a linear function; it could be convex, a cosine curve, or anything else. To accurately fit such a curve, activation functions are used: the first neuron sets a point of activation along the curve, and a second neuron can set a point of deactivation. Now, remember that we never really discussed how the weights and biases are used.
Now imagine you have a typical sine curve. Given all the activation functions above, which one would you use?
Pretty daunting, right? Interestingly, we use the ReLU function, because it’s a typical non-linear function. The bias of the first neuron is used to offset the ReLU function “horizontally”.
With that offset, the point of activation moves along the X-axis, but we’re still far from the actual sine wave. The second neuron, as we know, takes the output of the first neuron, and the second neuron’s bias can be used to offset the output “vertically”.
If we then change the weights of the second neuron, the graph changes again. Similarly, if we feed a neural net with many such neurons, we can eventually fit the curve by adjusting the various weights and biases; a rough sketch of this idea follows.
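To make the idea concrete, here’s a sketch (my own construction, not the article’s figure): each hidden ReLU neuron gets a bias that shifts its activation point along the X-axis, and an output weight that bends the line up or down at that point. With a handful of such neurons, whose breakpoints are hand-picked here rather than learned, the network traces the sine curve piece by piece.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hand-picked breakpoints ("points of activation") along one period of the sine
knots = np.linspace(0, 2 * np.pi, 9)
targets = np.sin(knots)

# Slope of each straight segment between consecutive breakpoints
slopes = np.diff(targets) / np.diff(knots)

# Each hidden neuron is a ReLU hinge: its bias shifts it horizontally to a breakpoint,
# and its output weight is the *change* in slope needed at that breakpoint
out_weights = np.concatenate(([slopes[0]], np.diff(slopes)))
biases = -knots[:-1]                       # relu(x + bias) switches on once x passes the knot

x = np.linspace(0, 2 * np.pi, 200)
hidden = relu(x[:, None] + biases)         # shape (200, 8): one column per hidden neuron
approx = hidden @ out_weights + targets[0]

print(np.max(np.abs(approx - np.sin(x))))  # ≈ 0.07: the hinges trace the sine reasonably well
```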
1.9 Classifying Spiral Data
Okay, now let's actually work on a problem: classifying a spiral dataset.
“Finally the juicy bits”
Let's start by coding the spiral itself using NumPy and Python; this isn't required, but it's fun to know.
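The original post shows this as a screenshot; a sketch along the lines of the well-known nnfs / CS231n spiral generator (create_spiral is a name I’ve made up) would be:

```python
import numpy as np
import matplotlib.pyplot as plt

def create_spiral(points, classes):
    X = np.zeros((points * classes, 2))            # (x, y) coordinates
    y = np.zeros(points * classes, dtype='uint8')  # class labels
    for class_number in range(classes):
        ix = range(points * class_number, points * (class_number + 1))
        r = np.linspace(0.0, 1, points)                                    # radius grows outwards
        t = np.linspace(class_number * 4, (class_number + 1) * 4, points)  # angle range per class
        t += np.random.randn(points) * 0.2                                 # a little noise
        X[ix] = np.c_[r * np.sin(t * 2.5), r * np.cos(t * 2.5)]
        y[ix] = class_number
    return X, y

X, y = create_spiral(100, 3)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='brg')
plt.show()
```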
We’re gonna get our Deep Learning algorithm to classify this spiral and identify its 3 classes (blue, yellow, red).
We’re also gonna import a new library, nnfs, created by our good friend Sentdex. All this library does for us is initialise things consistently and give us our new dataset. All the other Python code remains the same. Here’s a brief overview of what our code looks like now.
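Roughly (a sketch of the code the article is describing, using the Layer_Dense name mentioned in the text; the ReLU and SoftMax classes come later):

```python
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()                 # fixes the random seed and the default dtype

X, y = spiral_data(100, 3)  # 100 points per class, 3 classes

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        # small random weights, biases start at zero
        self.weights = 0.10 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

layer1 = Layer_Dense(2, 5)  # 2 inputs, because each sample is just an (x, y) coordinate
layer1.forward(X)
print(layer1.output[:5])    # first five rows of the layer's raw output
```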
Several things to note. “nnfs.init()” does what “np.random.seed(0)” did previously, and also pins the default data type, since with NumPy it can otherwise change between runs. “X, y = spiral_data(100,3)” draws from the spiral dataset with 100 points per class and three classes. Finally, in “Layer_Dense(2,5)” the first argument is 2 instead of the previous 4 because of the number of inputs in the spiral dataset: the input shape is 2, since each sample is just an (x, y) coordinate.
1.10 SoftMax Activation
Now we’ve seen the activation function on each neuron and how our layers are represented with the ReLU activation function. But a problem presents itself: with ReLU, all negative inputs are converted directly to zero. So what happens when the final layer’s values are negative, or some important values are negative? The output would then be either zero or just some raw value X, which isn’t ideal for classifying those inputs. So we apply a SoftMax function to the output neurons.
The Softmax function is easy to understand. First, to solve the problem of negative values, we use exponentiation, i.e. in layman’s terms we simply raise Euler’s number e to the power of each value x. Pretty straightforward.
This solves the issue: no output (y) can be negative, even when the input (x) is negative, since eˣ is always positive.
Now another question arises: given a batch of inputs, how does one compare the results obtained? The most obvious method is to compare probabilities, i.e. divide each exponentiated value by the sum of the exponentiated values for that sample. This process is known as Normalization. Implementing exponentiation and normalization together is known as SoftMax Activation.
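With some made-up output values, the two steps look like this (a minimal sketch):

```python
import numpy as np

layer_outputs = [4.8, 1.21, 2.385]            # example raw outputs from the final layer

exp_values = np.exp(layer_outputs)            # exponentiation: no value can be negative now
norm_values = exp_values / np.sum(exp_values) # normalization: divide by the sum

print(norm_values)          # ≈ [0.895 0.025 0.080]
print(np.sum(norm_values))  # ≈ 1.0
```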
Using simple inputs, this is exactly how SoftMax Activation is implemented: we obtain values in the range of 0 to 1, and the sum of the values obtained is equal to 1.
One more key problem: the earlier concept of data explosion occurs here again. On exponentiation, some values become very large very quickly, on the order of 10²⁰ or even 10¹⁰⁰, which is a nightmare to deal with. This phenomenon is called Overflow.
To combat overflow, we subtract the largest number in each sample from every value, making the largest value zero and everything else negative. This works in our favor, as the largest exponentiated value can only be 1, since e⁰ is always 1, so after exponentiation all values lie in (0, 1]. Crucially, the resulting probabilities don’t change, because the common factor cancels out during normalization.
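A quick sanity check of the trick, with made-up numbers:

```python
import numpy as np

def softmax(values):
    exp = np.exp(values - np.max(values))  # shift so the largest value becomes 0
    return exp / np.sum(exp)

v = np.array([1000.0, 1001.0, 1002.0])
# np.exp(v) would overflow to [inf inf inf] (with a RuntimeWarning)
print(softmax(v))  # stable: ≈ [0.09 0.245 0.665]
```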
Implementing SoftMax in the spiral data code
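Putting it all together, the forward pass might look like the sketch below (class names follow the nnfs-style code the article builds on; the hidden and output layer sizes here are my own choices):

```python
import numpy as np
import nnfs
from nnfs.datasets import spiral_data

nnfs.init()

class Layer_Dense:
    def __init__(self, n_inputs, n_neurons):
        self.weights = 0.10 * np.random.randn(n_inputs, n_neurons)
        self.biases = np.zeros((1, n_neurons))
    def forward(self, inputs):
        self.output = np.dot(inputs, self.weights) + self.biases

class Activation_ReLU:
    def forward(self, inputs):
        self.output = np.maximum(0, inputs)

class Activation_Softmax:
    def forward(self, inputs):
        # subtract the row-wise max for stability, then exponentiate
        exp_values = np.exp(inputs - np.max(inputs, axis=1, keepdims=True))
        # normalize each row into probabilities that sum to 1
        self.output = exp_values / np.sum(exp_values, axis=1, keepdims=True)

X, y = spiral_data(100, 3)

dense1 = Layer_Dense(2, 3)          # hidden layer: 2 inputs (x, y), 3 neurons
activation1 = Activation_ReLU()
dense2 = Layer_Dense(3, 3)          # output layer: one neuron per spiral class
activation2 = Activation_Softmax()

dense1.forward(X)
activation1.forward(dense1.output)
dense2.forward(activation1.output)
activation2.forward(dense2.output)

print(activation2.output[:5])       # each row ≈ [0.33 0.33 0.33] before any training
```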
Boom, just like that we’ve almost implemented a complete working Neural Network. We just have to optimise it to get the least error and improve accuracy, which will be done in Part 3.