
I am coding gradient descent in MATLAB. For two features, my update step is:

temp0 = theta(1,1) - (alpha/m)*sum((X*theta-y).*X(:,1));
temp1 = theta(2,1) - (alpha/m)*sum((X*theta-y).*X(:,2));
theta(1,1) = temp0;
theta(2,1) = temp1;

However, I want to vectorize this code and make it work for any number of features. For the vectorization part, it turns out that what I am trying to do is a single matrix multiplication:

theta = theta - (alpha/m) * (X' * (X*theta-y));

This looks right, but when I tried it, I realized that it doesn't seem to work for gradient descent because the parameters are not updated simultaneously.

So, how can I vectorize this code and make sure the parameters are updated at the same time?


7 Answers


For the vectorized version, try the following (two steps, to make the simultaneous update explicit):

 gradient = (alpha/m) * X' * (X*theta - y)   % computed entirely from the old theta
 theta = theta - gradient                    % all parameters updated at once
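For context, here is a minimal sketch of how these two steps sit inside a full descent loop (num_iters is a hypothetical iteration count; X is assumed to already include the bias column of ones):

% Sketch only: repeat the two-step simultaneous update num_iters times.
for iter = 1:num_iters
    gradient = (alpha/m) * X' * (X*theta - y);  % uses the old theta throughout
    theta = theta - gradient;                   % every entry changes at once
end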
  • Can you please explain the equation gradient = (alpha/m) * X' * (X*theta - y), or guide me to a book or a URL that explains it? Thank you.
    – Amr Bedair
    Commented Oct 9, 2018 at 20:48
  • Pre-computing the gradient splits the theta update into two steps and makes sure the update is simultaneous, because the second step is then atomic (a short numeric check follows these comments).
    – S.Arora
    Commented Nov 4, 2018 at 13:17
  • This one should be the accepted answer, because it's much more efficient: it takes more advantage of vectorization.
    Commented Feb 20, 2019 at 10:15
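For the explanation requested above: the j-th entry of X' * (X*theta - y) is exactly sum((X*theta - y).*X(:,j)), i.e. the per-feature sum from the question. A small made-up example to verify this numerically:

X = [1 2; 1 3; 1 4];                        % m = 3 examples, bias column + 1 feature
y = [2; 3; 5];
theta = [0.1; 0.2];
err = X*theta - y;                          % residuals
g_vec  = X' * err;                          % vectorized per-feature sums
g_loop = [sum(err.*X(:,1)); sum(err.*X(:,2))];
disp(max(abs(g_vec - g_loop)))              % 0, up to floating-point rounding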

Your vectorization is correct. I also tried both versions of your code, and they gave me the same theta. Just remember not to use your updated theta within your second implementation.

The following also works, but it is less concise than your second implementation:

Error = X * theta - y;           % residuals for all m training examples
for i = 1:2                      % one pass per feature column
    S(i) = sum(Error.*X(:,i));
end

theta = theta - alpha * (1/m) * S'
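If you want this loop form for any number of features rather than a hard-coded 1:2, a sketch of the generalization (assuming X is m x n) might look like:

Error = X * theta - y;                      % residuals for all m examples
n = size(X, 2);
S = zeros(n, 1);                            % preallocate as a column vector
for i = 1:n
    S(i) = sum(Error .* X(:, i));           % per-feature gradient sum
end
theta = theta - alpha * (1/m) * S;          % column vector, so no transpose needed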
  • I don't understand. When theta has two parameters, I tried the first version (with temp) and the second (completely vectorized). I worked on the same problem with both codes and performed 1500 iterations for each version. In the end I didn't get the same result, and I could see that the first version, at least, converges faster. (I tried both versions on the first assignment of the Coursera ML class, and the results were clearly different.)
    – bigTree
    Commented Dec 23, 2013 at 4:45
  • I tried with only one 2x1 vector as theta; after one step, both of your implementations got the same new theta. I am not sure whether other parts of your code influence the result. You can track every step to observe the change of theta. By the way, are both theta initializations the same? Are they set to zeros or random numbers?
    – lennon310
    Commented Dec 23, 2013 at 4:50
  • theta clearly doesn't change the same way in both implementations. I am not sure, though, whether the second version simply converges more slowly or whether it is not doing the correct update step.
    – bigTree
    Commented Dec 23, 2013 at 4:53
  • The convergence should also be the same if you train on all the samples together instead of in sequence (see the sketch after these comments).
    – lennon310
    Commented Dec 23, 2013 at 5:09
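To illustrate that comment, here is a hedged sketch of the two training modes for a single pass (the sequential variant is shown only for contrast, not as the assignment's method):

% Batch: one update from the gradient over all m examples at once.
theta_batch = theta - (alpha/m) * X' * (X*theta - y);

% Sequential: m small updates, one example at a time (order matters).
theta_seq = theta;
for k = 1:m
    xk = X(k, :)';                               % one example as a column
    theta_seq = theta_seq - alpha * xk * (xk' * theta_seq - y(k));
end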

In order to update the parameters simultaneously, you need to keep the values of theta(1..n) in a temporary vector, and after the operation copy the temporary values back into the original theta vector.

This is the code that I use for this purpose:

Temp update

tempChange = zeros(length(theta), 1);

tempChange = theta - (alpha/m) * (X' * (X*theta-y));

Actual update

theta = tempChange;


Here is the vectorized form of gradient descent; it works for me in Octave.
Remember that X is a matrix with ones in the first column (since theta_0 * 1 is theta_0). Each column of X is a feature (n of them), and each row is a training example (m of them), so X is an m x (n+1) matrix. The y column vector could be, for example, house prices. It's good to have a cost function to check whether you are finding a minimum.
Choose a value for alpha, maybe alpha = 0.001, and try changing it each time you run the code. num_iters is the number of iterations you want it to run.

function theta = gradientDescent(X, y, theta, alpha, num_iters)

m = length(y); % number of training examples

for iter = 1:num_iters
    % simultaneous update: the right-hand side uses only the previous theta
    theta = theta - (alpha/m) * (X') * ((X*theta) - y);
end

end

see the full explanation here: https://www.coursera.org/learn/machine-learning/resources/QQx8l
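A hypothetical call, using made-up data, to show how the function above is wired together (the ones column is prepended so theta(1) acts as the intercept):

data = [1 2; 2 3; 3 5; 4 7];                % made-up examples; last column is y
X = [ones(size(data,1), 1), data(:, 1:end-1)];
y = data(:, end);
theta = zeros(size(X, 2), 1);               % start from all-zero parameters
theta = gradientDescent(X, y, theta, 0.01, 1500);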


I am very new to this topic, but here is my take: if you compute X*theta beforehand, the vectorized operation that adjusts theta does not need a temp variable. In other words: if you computed X*theta while updating the theta vector, theta(1) would be updated before theta(2) and would hence change X*theta; but if we compute X*theta as y_pred first and then do the vectorized operation on theta, it is fine.

So my suggestion is (without using a temp):

y_pred = X*theta;  % theta is [1;1] and X is an m x 2 matrix
theta = theta - (alpha/m) * (X' * (y_pred-y));

Please correct me if I am wrong.


In Python, a vectorized gradient descent implementation for linear regression with MSE loss may look like the following:

import numpy as np

def grad_step(X, y, θ, α):
    # one gradient descent step on the MSE loss: θ = θ - α * ∇J(θ)
    m = X.shape[0]
    return θ - (α / m) * 2 * X.T @ (X @ θ - y)

def compute_loss(X, y, θ):
    # mean squared error over the m examples
    return np.mean((X @ θ - y) ** 2)

# run gradient descent (X and y assumed already defined)
θ = np.zeros(X.shape[1]).reshape(-1, 1)
for _ in range(100):
    θ = grad_step(X, y, θ, α=0.01)


Using matrix differentiation rules, the gradient of the cost J(θ) = (1/m) (Xθ - y)ᵀ(Xθ - y) works out to

∇J(θ) = (2/m) Xᵀ(Xθ - y),

which is exactly the expression that grad_step applies.
