
I am coding gradient descent in MATLAB. For two features, my update step is:

temp0 = theta(1,1) - (alpha/m)*sum((X*theta-y).*X(:,1));
temp1 = theta(2,1) - (alpha/m)*sum((X*theta-y).*X(:,2));
theta(1,1) = temp0;
theta(2,1) = temp1;

However, I want to vectorize this code and make it work for any number of features. For the vectorization part, it turns out that what I am trying to do is a single matrix multiplication:

theta = theta - (alpha/m) * (X' * (X*theta-y));

This looks right, but when I tried it, I realized that it doesn't seem to work for gradient descent because the parameters are not updated simultaneously.

So, how can I vectorize this code and make sure the parameters are updated at the same time?


7 Answers


For the vectorized version, try the following (two steps, to make the simultaneous update explicit):

 gradient = (alpha/m) * X' * (X*theta - y)   % computed entirely from the old theta
 theta = theta - gradient                    % all parameters updated at once
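For context, here is a minimal sketch of how these two steps sit inside a full descent loop (num_iters is a hypothetical iteration count; X is assumed to already include the bias column of ones):

% Sketch only: repeat the two-step simultaneous update num_iters times.
for iter = 1:num_iters
    gradient = (alpha/m) * X' * (X*theta - y);  % uses the old theta throughout
    theta = theta - gradient;                   % every entry changes at once
end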
  • Can you please explain the equation gradient = (alpha/m) * X' * (X*theta - y), or guide me to a book or a URL that explains it? Thank you.
    – Amr Bedair
    Commented Oct 9, 2018 at 20:48
  • Pre-computing the gradient splits the theta update into two steps and makes sure the update is simultaneous, because the second step is then atomic (a short numeric check follows these comments).
    – S.Arora
    Commented Nov 4, 2018 at 13:17
  • This one should be the accepted answer, because it's much more efficient: it takes more advantage of vectorization.
    Commented Feb 20, 2019 at 10:15
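For the explanation requested above: the j-th entry of X' * (X*theta - y) is exactly sum((X*theta - y).*X(:,j)), i.e. the per-feature sum from the question. A small made-up example to verify this numerically:

X = [1 2; 1 3; 1 4];                        % m = 3 examples, bias column + 1 feature
y = [2; 3; 5];
theta = [0.1; 0.2];
err = X*theta - y;                          % residuals
g_vec  = X' * err;                          % vectorized per-feature sums
g_loop = [sum(err.*X(:,1)); sum(err.*X(:,2))];
disp(max(abs(g_vec - g_loop)))              % 0, up to floating-point rounding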

Your vectorization is correct. I also tried both versions of your code, and they gave me the same theta. Just remember not to use your updated theta within your second implementation.

The following also works, but it is less concise than your second implementation:

Error = X * theta - y;           % residuals for all m training examples
for i = 1:2                      % one pass per feature column
    S(i) = sum(Error.*X(:,i));
end

theta = theta - alpha * (1/m) * S'
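If you want this loop form for any number of features rather than a hard-coded 1:2, a sketch of the generalization (assuming X is m x n) might look like:

Error = X * theta - y;                      % residuals for all m examples
n = size(X, 2);
S = zeros(n, 1);                            % preallocate as a column vector
for i = 1:n
    S(i) = sum(Error .* X(:, i));           % per-feature gradient sum
end
theta = theta - alpha * (1/m) * S;          % column vector, so no transpose needed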
  • I don't understand. When theta has two parameters, I tried the first version (with temp) and the second (completely vectorized). I worked on the same problem with both codes and performed 1500 iterations for each version. In the end I didn't get the same result, and I could see that the first version, at least, converges faster. (I tried both versions on the first assignment of the Coursera ML class, and the results were clearly different.)
    – bigTree
    Commented Dec 23, 2013 at 4:45
  • I tried with only one 2x1 vector as theta; after one step, both of your implementations got the same new theta. I am not sure whether other parts of your code influence the result. You can track every step to observe the change of theta. By the way, are both theta initializations the same? Are they set to zeros or random numbers?
    – lennon310
    Commented Dec 23, 2013 at 4:50
  • theta clearly doesn't change the same way in both implementations. I am not sure, though, whether the second version simply converges more slowly or whether it is not doing the correct update step.
    – bigTree
    Commented Dec 23, 2013 at 4:53
  • The convergence should also be the same if you train on all the samples together instead of in sequence (see the sketch after these comments).
    – lennon310
    Commented Dec 23, 2013 at 5:09
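To illustrate that comment, here is a hedged sketch of the two training modes for a single pass (the sequential variant is shown only for contrast, not as the assignment's method):

% Batch: one update from the gradient over all m examples at once.
theta_batch = theta - (alpha/m) * X' * (X*theta - y);

% Sequential: m small updates, one example at a time (order matters).
theta_seq = theta;
for k = 1:m
    xk = X(k, :)';                               % one example as a column
    theta_seq = theta_seq - alpha * xk * (xk' * theta_seq - y(k));
end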

In order to update the parameters simultaneously, you need to keep the values of theta(1..n) in a temporary vector, and after the operation copy the temporary values back into the original theta vector.

This is the code that I use for this purpose:

Temp update

tempChange = zeros(length(theta), 1);

tempChange = theta - (alpha/m) * (X' * (X*theta-y));

Actual update

theta = tempChange;


Here is the vectorized form of gradient descent; it works for me in Octave.
Remember that X is a matrix with ones in the first column (since theta_0 * 1 is theta_0). Each column of X is a feature (n of them), and each row is a training example (m of them), so X is an m x (n+1) matrix. The y column vector could be, for example, house prices. It's good to have a cost function to check whether you are finding a minimum.
Choose a value for alpha, maybe alpha = 0.001, and try changing it each time you run the code. num_iters is the number of iterations you want it to run.

function theta = gradientDescent(X, y, theta, alpha, num_iters)

m = length(y); % number of training examples

for iter = 1:num_iters
    % simultaneous update: the right-hand side uses only the previous theta
    theta = theta - (alpha/m) * (X') * ((X*theta) - y);
end

end

see the full explanation here: https://www.coursera.org/learn/machine-learning/resources/QQx8l
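A hypothetical call, using made-up data, to show how the function above is wired together (the ones column is prepended so theta(1) acts as the intercept):

data = [1 2; 2 3; 3 5; 4 7];                % made-up examples; last column is y
X = [ones(size(data,1), 1), data(:, 1:end-1)];
y = data(:, end);
theta = zeros(size(X, 2), 1);               % start from all-zero parameters
theta = gradientDescent(X, y, theta, 0.01, 1500);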


I am very new to this topic, but here is my take: if you compute X*theta beforehand, the vectorized operation that adjusts theta does not need a temp variable. In other words: if you computed X*theta while updating the theta vector, theta(1) would be updated before theta(2) and would hence change X*theta; but if we compute X*theta as y_pred first and then do the vectorized operation on theta, it is fine.

So my suggestion is (without using a temp):

y_pred = X*theta;  % theta is [1;1] and X is an m x 2 matrix
theta = theta - (alpha/m) * (X' * (y_pred-y));

Please correct me if I am wrong.


In Python, a vectorized gradient descent implementation for linear regression with MSE loss may look like the following:

import numpy as np

def grad_step(X, y, θ, α):
    # one gradient descent step on the MSE loss: θ = θ - α * ∇J(θ)
    m = X.shape[0]
    return θ - (α / m) * 2 * X.T @ (X @ θ - y)

def compute_loss(X, y, θ):
    # mean squared error over the m examples
    return np.mean((X @ θ - y) ** 2)

# run gradient descent (X and y assumed already defined)
θ = np.zeros(X.shape[1]).reshape(-1, 1)
for _ in range(100):
    θ = grad_step(X, y, θ, α=0.01)


Using matrix differentiation rules, the gradient of the cost J(θ) = (1/m) (Xθ - y)ᵀ(Xθ - y) works out to

∇J(θ) = (2/m) Xᵀ(Xθ - y),

which is exactly the expression that grad_step applies.
