I have a neural network for multi-class classification (3 classes) with the following architecture:
Input layer has 2 neurons for 2 input features
There is one hidden layer with 4 neurons
Output layer has 3 neurons corresponding to 3 classes to be predicted
The sigmoid activation function is used for the hidden-layer neurons and the softmax activation function is used for the output layer.
The parameters used in the network are as follows:
Weights from the input layer to the hidden layer have shape (4, 2)
Biases for the hidden layer have shape (1, 4)
Weights from the hidden layer to the output layer have shape (3, 4)
Biases for the output layer have shape (1, 3)
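For reference, this is roughly how I set the parameters up (random values just to make the shapes concrete; my actual initialisation may differ):

import numpy as np

np.random.seed(0)
W1 = np.random.randn(4, 2) * 0.01   # input -> hidden weights, shape (4, 2)
b1 = np.zeros((1, 4))               # hidden-layer biases, shape (1, 4)
W2 = np.random.randn(3, 4) * 0.01   # hidden -> output weights, shape (3, 4)
b2 = np.zeros((1, 3))               # output-layer biases, shape (1, 3)
X = np.random.randn(5, 2)           # m = 5 dummy training examples with 2 features each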
The forward propagation is coded as follows:
Z1 = np.dot(X, W1.T) + b1 # Z1.shape = (m, 4); 'm' is number of training examples
A1 = sigmoid(Z1) # A1.shape = (m, 4)
Z2 = np.dot(W2, A1.T) + b2.T # Z2.shape = (3, m)
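The sigmoid used above is just the usual element-wise logistic function, included here so the snippet is self-contained:

def sigmoid(z):
    # element-wise logistic, maps every entry into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))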
Now 'Z2' has to be fed into the softmax activation function so that, for each training example, the three output neurons produce probabilistic activations summing up to one.
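By "probabilistic activations summing up to one" I mean the standard softmax, applied per training example over the three logits z_1, z_2, z_3:

softmax(z)_k = exp(z_k) / (exp(z_1) + exp(z_2) + exp(z_3)),   for k = 1, 2, 3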
The code I have for the 3 output neurons is:
o1 = np.exp(Z2[0,:])/np.exp(Z2[0,:]).sum() # o1.shape = (m,)
o2 = np.exp(Z2[1,:])/np.exp(Z2[1,:]).sum() # o2.shape = (m,)
o3 = np.exp(Z2[2,:])/np.exp(Z2[2,:]).sum() # o3.shape = (m,)
I was expecting each of o1, o2 and o3 to be a vector of shape (3,).
My aim is to take 'Z2' of shape (m, n) and apply the softmax activation so that each training example ends up with a (1, n) row of probabilities across the 'n' output neurons.
Here, 'm' is the number of training examples and 'n' is the number of classes.
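In other words, if the final output were arranged as an array 'probs' of shape (m, n) ('probs' is just a placeholder name for whatever the result is called), I would expect a check like the following to pass:

row_sums = probs.sum(axis=1)       # one sum per training example
assert np.allclose(row_sums, 1.0)  # each example's class probabilities should sum to 1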
What am I doing wrong?
Thanks!