For classical neural networks you have two steps:
- Feeding inputs through the network
- Backpropagation of the error and correction of the weights (synapses)
The second one is where gradient descent is used.
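To make that concrete before the network example, here is a minimal sketch (illustration only, not from the linked post) of gradient descent on a one-dimensional loss: repeatedly step against the derivative, scaled by a learning rate.

# Minimal gradient-descent sketch (illustration only):
# minimize L(w) = (w - 3)**2, whose derivative is dL/dw = 2*(w - 3).
w = 0.0        # initial guess
alpha = 0.1    # learning rate
for _ in range(100):
    grad = 2 * (w - 3)   # derivative of the loss at the current w
    w -= alpha * grad    # step downhill; this "-=" is the descent
print(w)  # converges towards 3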
This is the example from your link http://iamtrask.github.io/2015/07/27/python-network-part2/
import numpy as np

# Input data: 4 samples with 3 features each (the third column acts as a bias input)
X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
y = np.array([[0,1,1,0]]).T
alpha, hidden_dim = (0.5, 4)
# Weights ("synapses") initialized uniformly in [-1, 1)
synapse_0 = 2*np.random.random((3,hidden_dim)) - 1
synapse_1 = 2*np.random.random((hidden_dim,1)) - 1
for j in range(60000):
    # Forward pass: sigmoid of the weighted input sums
    layer_1 = 1/(1+np.exp(-(np.dot(X,synapse_0))))
    layer_2 = 1/(1+np.exp(-(np.dot(layer_1,synapse_1))))
    # Backpropagation: error times sigmoid derivative, pulled back layer by layer
    layer_2_delta = (layer_2 - y)*(layer_2*(1-layer_2))
    layer_1_delta = layer_2_delta.dot(synapse_1.T) * (layer_1 * (1-layer_1))
    # Gradient-descent step on the weights
    synapse_1 -= (alpha * layer_1.T.dot(layer_2_delta))
    synapse_0 -= (alpha * X.T.dot(layer_1_delta))
In the forward step you apply f(x) = 1/(1+exp(-x)) (the activation function) to the weighted sum of a neuron's inputs (the dot product, a.k.a. scalar product, is just a compact way to write that weighted sum), which gives the neuron's activation.
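As a quick illustration of that forward step (made-up numbers, not from the post), the dot product builds the weighted sum and the sigmoid squashes it into (0, 1):

import numpy as np

x = np.array([0, 1, 1])           # one input row (made-up values)
w = np.array([0.2, -0.4, 0.7])    # weights into a single hidden neuron (made-up values)
weighted_sum = np.dot(x, w)       # dot product = weighted sum of the inputs
activation = 1/(1 + np.exp(-weighted_sum))   # f(x) = 1/(1+exp(-x))
print(weighted_sum, activation)   # 0.3 and roughly 0.574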
The gradient descent is hidden in the backpropagation, in the lines where you calculate the layer_x_delta values:
- layer_2*(1-layer_2) is the derivative (the gradient) of the f above, evaluated at layer_2. So the learning delta is essentially following this gradient in the right direction (a quick numerical check of this identity follows after this list).
- In layer_1_delta you take the delta calculated for the second layer, pull it backwards linearly with np.dot (again just a weighted sum), and then scale it by the same gradient form x*(1-x), applied to layer_1.
- Then you change the weights according to the delta (error) at the target neuron and the activation of the source neuron (layer_1.T.dot(layer_2_delta)). alpha is just a learning rate (usually 0 < alpha < 1) to avoid overcorrection.
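Here is the quick numerical check promised above (illustration only): the analytic form s*(1-s) agrees with a finite-difference estimate of the sigmoid's slope.

import numpy as np

def sigmoid(x):
    return 1/(1 + np.exp(-x))

x = 0.3                             # arbitrary point
s = sigmoid(x)
analytic = s * (1 - s)              # the layer*(1-layer) form used in the code
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # finite difference
print(analytic, numeric)            # both are about 0.2445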
I hope you can get something out of this answer!
The -= is the descent step, e.g. synapse_0 -= synapse_0_derivative; the lines before that calculate the gradient. Most NN optimizers are based on the gradient-descent idea, where backpropagation is used to calculate the gradients, and in nearly all cases stochastic gradient descent is used for the optimization, which is a little bit different from pure gradient descent. These are basics, and an ML course would help much more than this kind of blog post. The activation is 1/(1+exp(-x)), which has x*(1-x) as its derivative; you can find both of those expressions in the code with x filled in.
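To illustrate the difference between pure (full-batch) gradient descent and stochastic gradient descent, here is a small sketch with made-up data and a simple linear least-squares model (not from the post): batch descent makes one update per pass over all samples, while SGD updates after every single sample.

import numpy as np

# Made-up regression data for the sketch: y is the sum of the two features.
X = np.array([[1., 2.], [2., 1.], [3., 4.], [4., 3.]])
y = np.array([3., 3., 7., 7.])
alpha = 0.01

def grad(Xb, yb, w):
    # gradient of the mean squared error over the (mini)batch
    return 2 * Xb.T.dot(Xb.dot(w) - yb) / len(yb)

w_batch = np.zeros(2)   # weights trained with full-batch gradient descent
w_sgd = np.zeros(2)     # weights trained with stochastic gradient descent
for epoch in range(200):
    w_batch -= alpha * grad(X, y, w_batch)            # one update per full pass
    for i in np.random.permutation(len(y)):           # one update per sample
        w_sgd -= alpha * grad(X[i:i+1], y[i:i+1], w_sgd)
print(w_batch, w_sgd)   # both end up close to [1, 1]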