We will use a single perceptron neural network model to solve a simple classification problem.
Classification is the problem of identifying which of a set of categories an observation belongs to. In the case of only two categories, it is called a binary classification problem.
The neural network components are shown in the following scheme:
The input layer contains two nodes $x_1$ and $x_2$. Weight vector $W = \begin{bmatrix} w_1 & w_2 \end{bmatrix}$ and bias ($b$) are the parameters to be updated during the model training.
For every training example $x^{(i)} = \begin{bmatrix} x_1^{(i)} & x_2^{(i)} \end{bmatrix}^T$:
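As a minimal sketch (the use of NumPy, the variable names, and the small random initialization are assumptions, not part of the original scheme), the parameters could be set up as follows:

```python
import numpy as np

# Assumed shapes: W is a (1, 2) row vector of weights, b is a scalar bias.
rng = np.random.default_rng(0)
W = rng.standard_normal((1, 2)) * 0.01  # small random starting weights
b = 0.0                                 # bias starts at zero
```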
$$ z^{(i)} = w_1x_1^{(i)} + w_2x_2^{(i)} + b = Wx^{(i)} + b.\tag{1} $$
But we cannot take a real number $z^{(i)}$ as the output, since we need to perform classification.
We will use a single perceptron with a sigmoid activation function, defined as
$$ a = \sigma\left(z\right) = \frac{1}{1+e^{-z}}.\tag{2} $$
Then a threshold value of 0.5 can be used for predictions: 1 if $a > 0.5$ and 0 otherwise. Putting it all together, mathematically the single perceptron neural network with a sigmoid activation function can be expressed as:
$$
\begin{align*} z^{(i)} &= W x^{(i)} + b, \\\\ a^{(i)} &= \sigma \left( z^{(i)} \right).\tag{3} \end{align*}
$$
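As a sketch of equations (1)–(3) for a single training example, assuming NumPy and illustrative example values:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation, equation (2): maps any real z into (0, 1).
    return 1 / (1 + np.exp(-z))

# One training example x of shape (2, 1), weights W of shape (1, 2), scalar bias b.
x = np.array([[0.5], [-1.2]])
W = np.array([[0.3, -0.1]])
b = 0.2

z = W @ x + b                        # linear part, equation (1)
a = sigmoid(z)                       # activation, equation (3)
prediction = (a > 0.5).astype(int)   # apply the 0.5 threshold to get the class label
```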
If you have $m$ training examples organized in the columns of a $(2 \times m)$ matrix $X$, you can apply the activation function element-wise. So the model can be written as:
$$ \begin{align*}Z &= WX + b, \\A &= \sigma(Z),\end{align*}\tag{4} $$
where $b$ is broadcast to a vector of size $(1 \times m)$.
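The vectorized forward pass over all $m$ examples might look like the sketch below (the `forward_propagation` name and the example data are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_propagation(X, W, b):
    # X: (2, m) inputs, W: (1, 2) weights, b: scalar bias.
    Z = W @ X + b      # (1, m): b is broadcast across the m columns
    A = sigmoid(Z)     # (1, m): element-wise sigmoid, equation (4)
    return A

# Example with m = 4 training examples stored in the columns of X.
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.5, -0.5, -1.0]])
W = np.array([[0.3, -0.1]])
b = 0.2
A = forward_propagation(X, W, b)     # shape (1, 4)
```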
When dealing with classification problems, the most commonly used cost function is the log loss, which is described by the following equation:
$$ \mathcal{L}\left(W, b\right) = \frac{1}{m}\sum_{i=1}^{m} L^{(i)}\left(W, b\right) = \frac{1}{m}\sum_{i=1}^{m} \left( -y^{(i)}\log\left(a^{(i)}\right) - \left(1-y^{(i)}\right)\log\left(1- a^{(i)}\right) \right),\tag{5} $$
where $y^{(i)} \in \{0,1\}$ are the original labels and $a^{(i)}$ are the continuous output values of the forward propagation step (elements of the array $A$).
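Equation (5) translates directly into a vectorized computation; here is a sketch (the `compute_cost` name is an assumption):

```python
import numpy as np

def compute_cost(A, Y):
    # A: (1, m) activations from forward propagation, Y: (1, m) labels in {0, 1}.
    m = Y.shape[1]
    # Log loss, equation (5), averaged over the m training examples.
    return -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m

# Example usage with m = 4.
A = np.array([[0.8, 0.4, 0.6, 0.1]])
Y = np.array([[1, 0, 1, 0]])
cost = compute_cost(A, Y)
```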
We want to minimize the cost function during training. To implement gradient descent, we calculate the partial derivatives using the chain rule; they tell us exactly in which direction to move each weight and the bias in order to reduce the loss:
$$ \begin{align}\frac{\partial \mathcal{L} }{ \partial w_1 } &= \frac{1}{m}\sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right)x_1^{(i)},\\\\\frac{\partial\mathcal{L} }{ \partial w_2 } &= \frac{1}{m}\sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right)x_2^{(i)},\\\\\frac{\partial\mathcal{L} }{ \partial b } &= \frac{1}{m}\sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right).\end{align}\tag{6} $$
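The partial derivatives in equation (6) can also be computed in vectorized form. Below is a sketch of one gradient descent step; the function names and the learning rate `alpha` are illustrative assumptions:

```python
import numpy as np

def backward_propagation(A, X, Y):
    # A: (1, m) activations, X: (2, m) inputs, Y: (1, m) labels.
    m = X.shape[1]
    dZ = A - Y                 # (1, m): common factor (a - y) in equation (6)
    dW = (dZ @ X.T) / m        # (1, 2): partial derivatives with respect to w1 and w2
    db = np.sum(dZ) / m        # scalar: partial derivative with respect to b
    return dW, db

def update_parameters(W, b, dW, db, alpha=1.2):
    # One gradient descent step: move the parameters against the gradient.
    W = W - alpha * dW
    b = b - alpha * db
    return W, b
```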