For classical neural networks you have two steps:
- Feeding inputs through the network
- Backpropagation of the error and correction of the weights (synapses)
The second one is where gradient descent is used.
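To make that concrete before the network example, here is a minimal sketch (illustration only, not from the linked post) of gradient descent on a one-dimensional loss: repeatedly step against the derivative, scaled by a learning rate.

# Minimal gradient-descent sketch (illustration only):
# minimize L(w) = (w - 3)**2, whose derivative is dL/dw = 2*(w - 3).
w = 0.0        # initial guess
alpha = 0.1    # learning rate
for _ in range(100):
    grad = 2 * (w - 3)   # derivative of the loss at the current w
    w -= alpha * grad    # step downhill; this "-=" is the descent
print(w)  # converges towards 3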
This is the example from your link http://iamtrask.github.io/2015/07/27/python-network-part2/
import numpy as np

# Input data: 4 samples with 3 features each (the third column acts as a bias input)
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
y = np.array([[0,1,1,0]]).T
alpha, hidden_dim = (0.5, 4)
# Weights ("synapses") initialized uniformly in [-1, 1)
synapse_0 = 2*np.random.random((3,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
for j in range(60000):
    # Forward pass: sigmoid of the weighted input sums
    layer_1 = 1/(1+np.exp(-(np.dot(X,synapse_0))))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1))))
    # Backpropagation: error times sigmoid derivative, pulled back layer by layer
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    # Gradient-descent step on the weights
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))
In the forward step you apply f(x) = 1/(1+exp(-x)) (the activation function) to the weighted sum of a neuron's inputs (the dot product, a.k.a. scalar product, is just a compact way to write that weighted sum), which gives the neuron's activation.
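As a quick illustration of that forward step (made-up numbers, not from the post), the dot product builds the weighted sum and the sigmoid squashes it into (0, 1):

import numpy as np

x = np.array([0, 1, 1])           # one input row (made-up values)
w = np.array([0.2, -0.4, 0.7])    # weights into a single hidden neuron (made-up values)
weighted_sum = np.dot(x, w)       # dot product = weighted sum of the inputs
activation = 1/(1 + np.exp(-weighted_sum))   # f(x) = 1/(1+exp(-x))
print(weighted_sum, activation)   # 0.3 and roughly 0.574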
The gradient descent is hidden in the backpropagation, in the lines where you calculate the layer_x_delta values:
- layer_2*(1-layer_2) is the derivative (the gradient) of the f above, evaluated at layer_2. So the learning delta is essentially following this gradient in the right direction (a quick numerical check of this identity follows after this list).
- In layer_1_delta you take the delta calculated for the second layer, pull it backwards linearly with np.dot (again just a weighted sum), and then scale it by the same gradient form x*(1-x), applied to layer_1.
- Then you change the weights according to the delta (error) at the target neuron and the activation of the source neuron (layer_1.T.dot(layer_2_delta)). alpha is just a learning rate (usually 0 < alpha < 1) to avoid overcorrection.
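Here is the quick numerical check promised above (illustration only): the analytic form s*(1-s) agrees with a finite-difference estimate of the sigmoid's slope.

import numpy as np

def sigmoid(x):
    return 1/(1 + np.exp(-x))

x = 0.3                             # arbitrary point
s = sigmoid(x)
analytic = s * (1 - s)              # the layer*(1-layer) form used in the code
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # finite difference
print(analytic, numeric)            # both are about 0.2445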
I hope you can get something out of this answer!
The -= is the descent step, e.g. synapse_0 -= synapse_0_derivative; the lines before that calculate the gradient. Most NN optimizers are based on the gradient-descent idea, where backpropagation is used to calculate the gradients, and in nearly all cases stochastic gradient descent is used for the optimization, which is a little bit different from pure gradient descent. These are basics, and an ML course would help much more than this kind of blog post. The activation is 1/(1+exp(-x)), which has x*(1-x) as its derivative; you can find both of those expressions in the code with x filled in.
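To illustrate the difference between pure (full-batch) gradient descent and stochastic gradient descent, here is a small sketch with made-up data and a simple linear least-squares model (not from the post): batch descent makes one update per pass over all samples, while SGD updates after every single sample.

import numpy as np

# Made-up regression data for the sketch: y is the sum of the two features.
X = np.array([[1., 2.], [2., 1.], [3., 4.], [4., 3.]])
y = np.array([3., 3., 7., 7.])
alpha = 0.01

def grad(Xb, yb, w):
    # gradient of the mean squared error over the (mini)batch
    return 2 * Xb.T.dot(Xb.dot(w) - yb) / len(yb)

w_batch = np.zeros(2)   # weights trained with full-batch gradient descent
w_sgd = np.zeros(2)     # weights trained with stochastic gradient descent
for epoch in range(200):
    w_batch -= alpha * grad(X, y, w_batch)            # one update per full pass
    for i in np.random.permutation(len(y)):           # one update per sample
        w_sgd -= alpha * grad(X[i:i+1], y[i:i+1], w_sgd)
print(w_batch, w_sgd)   # both end up close to [1, 1]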