
I was reading about momentum and I was trying to implement the momentum update equation in my mini-batch gradient descent code. (image: the momentum update equations)
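
As far as I understand it (this is my reading of the equation, not a quote from the image), the update is the classical momentum form:

$$z \leftarrow \beta z + \nabla f(w), \qquad w \leftarrow w - \alpha z$$

where $w$ stands for the parameters (m and b here), $\beta$ is the momentum coefficient (betha = 0.81 in my code) and $\alpha$ is the learning rate (stepper = 0.0001).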

The problem is that it is not working: the regression line ends up far away from the ideal line, and I'm not sure whether the implementation is correct.

(plot: the fitted regression line drifting away from the ideal line)

import math

def stochastic_gradient_descent_step(m, b, data_sample):

    n_points = data_sample.shape[0]  # size of the mini-batch
    m_grad = 0
    b_grad = 0
    stepper = 0.0001  # this is the learning rate
    z_m = 1.0
    z_b = 1.0
    betha = 0.81

    for i in range(n_points):

        # Get the current pair (x, y)
        x = data_sample[i, 0]
        y = data_sample[i, 1]
        if math.isnan(x) or math.isnan(y):  # skip rows with missing data instead of crashing
            # print("is nan")
            continue

        # Accumulate the partial derivative for each value in the data
        # Partial derivative with respect to 'm'
        dm = -((2 / n_points) * x * (y - (m * x + b)))

        # Partial derivative with respect to 'b'
        db = -((2 / n_points) * (y - (m * x + b)))

        # Update the accumulated gradient
        m_grad = m_grad + dm
        b_grad = b_grad + db

    # Calculate the momentum
    z_m = betha * z_m + m_grad
    z_b = betha * z_b + b_grad

    # Set the new 'better' updated 'm' and 'b'
    m_updated = m - stepper * z_m
    b_updated = b - stepper * z_b

    return m_updated, b_updated


Edited

I have now edited my code. As Sasha suggested, I put the gradient calculation in one function and the momentum update in another, and I made z_m and z_b global so they don't lose their value between iterations.

z_m = 0.0  # initialise to 0
z_b = 0.0  # initialise to 0

def getGradient(m, b, data_sample):
    n_points = data_sample.shape[0]  # size of the mini-batch
    m_grad = 0
    b_grad = 0

    for i in range(n_points):

        # Get the current pair (x, y)
        x = data_sample[i, 0]
        y = data_sample[i, 1]
        if math.isnan(x) or math.isnan(y):  # skip rows with missing data instead of crashing
            # print("is nan")
            continue

        # Accumulate the partial derivative for each value in the data
        # Partial derivative with respect to 'm'
        dm = -((2 / n_points) * x * (y - (m * x + b)))

        # Partial derivative with respect to 'b'
        db = -((2 / n_points) * (y - (m * x + b)))

        # Update the accumulated gradient
        m_grad = m_grad + dm
        b_grad = b_grad + db

    return m_grad, b_grad

def calculateMomentum(m, b, m_grad, b_grad, betha=0.81, stepper=0.0001):
    global z_m, z_b
    # Calculate the momentum (z_m and z_b keep their value between calls)
    z_m = betha * z_m + m_grad
    z_b = betha * z_b + b_grad
    # Set the new 'better' updated 'm' and 'b'
    m_updated = m - stepper * z_m
    b_updated = b - stepper * z_b
    return m_updated, b_updated
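
For context, this is roughly how the two functions are called from the mini-batch loop. It is only a sketch: the synthetic data, sample_minibatch, and the iteration count are placeholders for what is actually in the linked notebook.

    import numpy as np

    # synthetic data just for illustration: roughly y = 3x + 2 plus noise
    rng = np.random.default_rng(0)
    x_vals = rng.uniform(0, 100, 500)
    y_vals = 3 * x_vals + 2 + rng.normal(0, 5, 500)
    data = np.column_stack([x_vals, y_vals])

    def sample_minibatch(data, batch_size=32):
        # pick a random mini-batch of rows (placeholder for the notebook's sampling code)
        idx = rng.choice(data.shape[0], batch_size, replace=False)
        return data[idx]

    m, b = 0.0, 0.0
    for _ in range(1000):
        batch = sample_minibatch(data)
        m_grad, b_grad = getGradient(m, b, batch)
        # z_m and z_b live at module level, so the momentum carries over between iterations
        m, b = calculateMomentum(m, b, m_grad, b_grad)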

Now the regression line is calculated correctly (maybe). With plain SGD the final error is 59706304 and with momentum it is 56729062, but the difference could just come from the random mini-batches chosen when the gradient is calculated.
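
The error I quote above comes from a squared-error style measure over the whole dataset; the exact function is in the linked notebook, but it is along these lines (compute_error is just a placeholder name, not the notebook's function):

    def compute_error(m, b, data):
        # mean squared error of y = m*x + b over the full dataset, skipping NaN rows
        total = 0.0
        count = 0
        for i in range(data.shape[0]):
            x = data[i, 0]
            y = data[i, 1]
            if math.isnan(x) or math.isnan(y):
                continue
            total += (y - (m * x + b)) ** 2
            count += 1
        return total / count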

(plot: the regression line computed after the edit)

  • Not working is the classic useless description here on SO! Commented Jul 27, 2017 at 0:00
  • Sorry, I will update it. Commented Jul 27, 2017 at 0:03
  • You can see the rest of the code in my github file github.com/matvi/GradientDescent/blob/master/SGD.ipynb Commented Jul 27, 2017 at 0:06
  • That momentum usage makes no sense here! The momentum is a form of state between weight updates. Yours only lives for one update and is then lost when the function finishes (your code would need to be refactored). Apart from that, those calculations also look wrong (imagine a gradient of 0.00001; you are always adding 0.81 to that; obviously that's not good). Commented Jul 27, 2017 at 0:15
  • I know what momentum is used for. But you don't seem to understand the logic. It's a state which persists between mini-batches, so those can't be local variables within your mini-batch function. I think you've got the idea and can refactor your code. You probably want to avoid doing the step in that function at all, making it a pure calc-gradient function. Then the momentum smoothing can be used in one outer function. Commented Jul 27, 2017 at 15:19

1 Answer


First of all, the initialisation is invalid: z_m and z_b should be initialised to 0 (that is your first guess of the gradient). Second, in the current functional form you never "store" z_m or z_b for the next iteration, so they get reset (to the invalid value of 1) on every call.
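
A minimal sketch of one way to keep that state alive between mini-batch steps is to hold the momentum terms outside the step function and pass them in and out explicitly (momentum_step and mini_batches are illustrative names, not taken from your notebook):

    def momentum_step(m, b, m_grad, b_grad, z_m, z_b, betha=0.81, stepper=0.0001):
        # z_m and z_b come from the previous call; they start at 0
        z_m = betha * z_m + m_grad
        z_b = betha * z_b + b_grad
        return m - stepper * z_m, b - stepper * z_b, z_m, z_b

    m, b = 0.0, 0.0
    z_m, z_b = 0.0, 0.0  # first guess of the gradient
    for batch in mini_batches:  # however you sample your mini-batches
        m_grad, b_grad = getGradient(m, b, batch)
        m, b, z_m, z_b = momentum_step(m, b, m_grad, b_grad, z_m, z_b)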


1 Comment

Sasha says that "it's a state which persists between mini-batches".
