We will use a single perceptron neural network model to solve a simple classification problem.

Classification is the problem of identifying which of a set of categories an observation belongs to. When there are only two categories, it is called a binary classification problem.

The neural network components are shown in the following scheme:

*(Figure: a single perceptron with input nodes $x_1$ and $x_2$, weights $w_1$ and $w_2$, bias $b$, and a sigmoid activation producing the output.)*

The input layer contains two nodes $x_1$ and $x_2$. The weight vector $W = \begin{bmatrix} w_1 & w_2 \end{bmatrix}$ and the bias $b$ are the parameters to be updated during model training.

For every training example $x^{(i)} = \begin{bmatrix} x_1^{(i)} \\ x_2^{(i)} \end{bmatrix}$:

$$ z^{(i)} = w_1x_1^{(i)} + w_2x_2^{(i)} + b = Wx^{(i)} + b.\tag{1} $$

But the real number $z^{(i)}$ cannot serve as the output directly, because we need to perform classification.

We will use a single perceptron with the sigmoid activation function, defined as

$$ a = \sigma\left(z\right) = \frac{1}{1+e^{-z}}.\tag{2} $$
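
As a minimal sketch of equation (2) in NumPy (the function name `sigmoid` is an illustrative choice, not prescribed by the text):

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid, equation (2): squashes any real z into the interval (0, 1).
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-3.0, 0.0, 3.0])))  # approximately [0.047, 0.5, 0.953]
```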

A threshold value of 0.5 can then be used for predictions: predict 1 if $a > 0.5$ and 0 otherwise. Putting it all together, the single perceptron neural network with sigmoid activation function can be expressed mathematically as:

$$ \begin{align*} z^{(i)} &= W x^{(i)} + b, \\ a^{(i)} &= \sigma\left( z^{(i)} \right). \end{align*}\tag{3} $$
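
To make equations (1)–(3) concrete, here is a small sketch that pushes one training example through the model and applies the 0.5 threshold; every numeric value below (weights, bias, inputs) is an arbitrary illustrative choice:

```python
import numpy as np

# Arbitrary illustrative parameters and one training example.
w1, w2, b = 0.5, -1.2, 0.3
x1, x2 = 2.0, 1.0

z = w1 * x1 + w2 * x2 + b      # equation (1): z = 0.1
a = 1 / (1 + np.exp(-z))       # equation (2): a is approximately 0.525
y_hat = 1 if a > 0.5 else 0    # threshold at 0.5: predict class 1
```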

If you have $m$ training examples organized in the columns of the $(2 \times m)$ matrix $X$, you can apply the activation function element-wise. So the model can be written as:

$$ \begin{align*} Z &= WX + b, \\ A &= \sigma(Z), \end{align*}\tag{4} $$

where $b$ is broadcast to a vector of size $(1 \times m)$.
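
A possible vectorized implementation of model (4), assuming the weights are stored as a `(1, 2)` NumPy array `W`, the bias as a scalar `b`, and the training examples as the columns of a `(2, m)` array `X` (the name `forward_propagation` is an assumption, not taken from the text):

```python
import numpy as np

def sigmoid(z):
    # Element-wise sigmoid activation, equation (2).
    return 1 / (1 + np.exp(-z))

def forward_propagation(X, W, b):
    # X: (2, m) inputs, W: (1, 2) weights, b: scalar bias.
    Z = np.matmul(W, X) + b    # (1, m); the scalar b is broadcast across all m columns
    A = sigmoid(Z)             # (1, m) activations
    return A

# m = 4 training examples with arbitrary values.
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 2.0]])
W = np.array([[0.5, -1.2]])
b = 0.3
print(forward_propagation(X, W, b).shape)  # (1, 4)
```

NumPy broadcasting adds the scalar `b` to every column of the `(1, m)` product `WX`, which is exactly the broadcasting described above.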

When dealing with classification problems, the most commonly used cost function is the log loss, which is described by the following equation:

$$ \mathcal{L}\left(W, b\right) = \frac{1}{m}\sum_{i=1}^{m} L\left(a^{(i)}, y^{(i)}\right) = \frac{1}{m}\sum_{i=1}^{m} \left( -y^{(i)}\log\left(a^{(i)}\right) - \left(1-y^{(i)}\right)\log\left(1- a^{(i)}\right) \right),\tag{5} $$

where $y^{(i)} \in \{0,1\}$ are the original labels and $a^{(i)}$ are the continuous output values of the forward propagation step (elements of the array $A$).
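
Under the same layout assumptions, a sketch of cost (5) with `A` and `Y` as `(1, m)` arrays of activations and labels (the name `compute_cost` is illustrative):

```python
import numpy as np

def compute_cost(A, Y):
    # Log loss (5), averaged over the m training examples.
    m = Y.shape[1]
    log_loss = -Y * np.log(A) - (1 - Y) * np.log(1 - A)
    return np.sum(log_loss) / m

A = np.array([[0.9, 0.2, 0.6]])  # example activations from forward propagation
Y = np.array([[1, 0, 1]])        # example labels
print(compute_cost(A, Y))        # approximately 0.28
```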

We want to minimize the cost function during training. To implement gradient descent, we calculate the partial derivatives using the chain rule; the partial derivatives tell us exactly in which direction to move each of the weights and the bias in order to reduce the loss:

$$ \begin{align*} \frac{\partial \mathcal{L} }{ \partial w_1 } &= \frac{1}{m}\sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right)x_1^{(i)}, \\ \frac{\partial \mathcal{L} }{ \partial w_2 } &= \frac{1}{m}\sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right)x_2^{(i)}, \\ \frac{\partial \mathcal{L} }{ \partial b } &= \frac{1}{m}\sum_{i=1}^{m} \left(a^{(i)} - y^{(i)}\right). \end{align*}\tag{6} $$
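
Finally, a sketch of the gradients (6) and a single gradient descent step, under the same shape assumptions as above; the function names and the learning rate value are illustrative:

```python
import numpy as np

def backward_propagation(A, X, Y):
    # Gradients (6) of the log loss with respect to W and b.
    m = X.shape[1]
    dZ = A - Y                       # (1, m) array of (a - y) terms
    dW = np.matmul(dZ, X.T) / m      # (1, 2): averages (a - y) * x_j over the m examples
    db = np.sum(dZ) / m              # scalar
    return dW, db

def update_parameters(W, b, dW, db, learning_rate=1.2):
    # One gradient descent step: move W and b against their gradients.
    W = W - learning_rate * dW
    b = b - learning_rate * db
    return W, b
```

Repeating forward propagation, cost evaluation, backward propagation, and the parameter update for a number of iterations drives the cost (5) down.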