
Questions tagged [gradient-descent]

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.
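The update rule described above can be sketched in a few lines (a minimal illustration; the quadratic objective, learning rate, and step count are arbitrary choices):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Take steps proportional to the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)   # x_{k+1} = x_k - lr * grad f(x_k)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=[0.0])
```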

7 votes
1 answer
206 views

I’m trying to understand the common assumptions in machine-learning optimization theory, where a “well-behaved” loss function is often required to be both L-Lipschitz and β-smooth (i.e., have β-...
Antonios Sarikas
2 votes
0 answers
32 views

In the paper "Deep Residual Learning for Image Recognition", it is mentioned that "When deeper networks are able to start converging, a degradation problem has been exposed: with ...
Vignesh N
0 votes
0 answers
49 views

LightGBM is a specific implementation of gradient boosted decision trees. One notable difference is how samples used for calculating variance gain in split points are picked. In the algorithm, ...
yanis-falaki
2 votes
1 answer
64 views

When fitting neural networks, I often run stochastic gradient descent multiple times and take the run with the lowest training loss. I'm trying to look up research literature on this practice, but I'm ...
Jacob Maibach
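The restart-and-select practice this question describes can be sketched as follows (a hypothetical `train_once` stands in for a full SGD run; only the selection logic over multiple seeds is the point):

```python
import numpy as np

def train_once(seed):
    """Stand-in for one SGD training run: a real version would initialize
    the network from `seed`, run SGD, and return the final training loss."""
    rng = np.random.default_rng(seed)
    params = rng.normal(size=3)
    loss = float(np.sum(params ** 2))  # surrogate for the training loss
    return loss, params

# Run "SGD" several times from different seeds and keep the best run.
runs = [train_once(seed) for seed in range(5)]
best_loss, best_params = min(runs, key=lambda r: r[0])
```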
4 votes
1 answer
98 views

I was going through the algorithm for Stochastic Gradient Descent in a multilayer network from the book Machine Learning by Tom Mitchell, and it shows the formula for the weight-update rule. However, I don't ...
Machine123
10 votes
3 answers
2k views

Consider a neural network with 2 or more layers. After we update the weights in layer 1, the input to layer 2 ($a^{(1)}$) has changed, so $\partial z/\partial w$ is no longer correct, as $z$ has changed to $z^*$ and $z^*$ $\...
Yaron
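A common resolution to this puzzle is that backpropagation evaluates every layer's gradient at the current weights before any weight moves; the update is simultaneous, not sequential. A tiny two-layer sketch (hypothetical scalar weights and a ReLU hidden unit, not taken from the question):

```python
# Two-layer network y = w2 * relu(w1 * x), loss L = 0.5 * (y - target)^2.
# Backprop computes BOTH gradients at the current weights first; the
# weights are then updated together, so dz/dw is evaluated before any
# layer's parameters have changed.
w1, w2 = 0.5, -0.3          # illustrative scalar weights
x, target = 1.0, 2.0

a1 = max(w1 * x, 0.0)       # hidden activation (ReLU)
y = w2 * a1                 # network output
err = y - target            # dL/dy

g2 = err * a1                                       # dL/dw2 at old weights
g1 = err * w2 * (1.0 if w1 * x > 0 else 0.0) * x    # dL/dw1 at old weights

lr = 0.1
w1, w2 = w1 - lr * g1, w2 - lr * g2                 # simultaneous update
```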
1 vote
0 answers
88 views

Trying to learn basic machine learning, I wrote my own code for logistic regression where I minimize the usual log likelihood using gradient descent. This is the plot of the error function through a ...
user470820
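A minimal version of that setup, assuming full-batch gradient descent on the mean negative log-likelihood (the toy data, learning rate, and step count here are illustrative, not from the question):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Minimize the mean negative log-likelihood by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y)   # gradient of the mean NLL
        w -= lr * grad
    return w

# Toy data with an intercept column: label is 1 iff the feature is positive.
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
```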
2 votes
0 answers
59 views

Say I have a biased estimator, for example estimating $\log \mathbb{E}[f_\theta(x)]$ using Monte Carlo. Does this imply that $\nabla_\theta \log \mathbb{E}[f_\theta(x)]$ is also biased if estimated ...
Alberto
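The bias of the plug-in estimator is easy to demonstrate numerically: by Jensen's inequality $\mathbb{E}[\log \hat m] \le \log \mathbb{E}[\hat m]$, so $\log\big(\tfrac{1}{n}\sum_i f_\theta(x_i)\big)$ systematically underestimates $\log \mathbb{E}[f_\theta(x)]$. A sketch with the illustrative choice $f(x) = e^x$, $x \sim \mathcal{N}(0,1)$, for which $\log \mathbb{E}[f] = 1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_estimate(n):
    """Plug-in Monte Carlo estimate of log E[f(x)] with f(x) = exp(x)."""
    x = rng.normal(size=n)             # x ~ N(0, 1), so E[exp(x)] = exp(1/2)
    return np.log(np.mean(np.exp(x)))  # log of the sample mean

# Average many independent plug-in estimates: by Jensen's inequality the
# average sits strictly below the true value log E[exp(x)] = 0.5.
estimates = [plugin_estimate(10) for _ in range(20000)]
mean_estimate = float(np.mean(estimates))
```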
5 votes
1 answer
125 views

Context: There are many methods to solve least squares, but most of them cost on the order of $k n^3$ flops. Using gradient descent, one computes $A x_i$ and uses the error to update $x_{i+1} = x_i - c \times \mathrm{g}...
uranus
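That iteration can be written down directly (assuming the gradient of $\tfrac12\|Ax-b\|^2$, i.e. $A^\top(Ax-b)$; the problem size, step size, and step count are illustrative). Each step costs two matrix-vector products, $O(mn)$, versus the roughly cubic cost of a direct factorization:

```python
import numpy as np

def lstsq_gd(A, b, lr=0.01, steps=5000):
    """Minimize ||Ax - b||^2 by gradient descent."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        residual = A @ x - b
        x -= lr * (A.T @ residual)   # gradient of 0.5 * ||Ax - b||^2
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                       # consistent system: exact solution exists
x_hat = lstsq_gd(A, b)
```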
0 votes
0 answers
56 views

I'm implementing Nesterov Accelerated Gradient Descent (NAG) on an Extreme Learning Machine (ELM) with one hidden layer. My loss function is the Mean Squared Error (MSE) with $L^2$ regularization. The ...
Paolo Pedinotti
3 votes
1 answer
91 views

The whole point behind Nesterov optimization is to calculate the gradient not at the current parameter values $\theta_t$, but at $\theta_t + \beta m$, where $\beta$ is the momentum coefficient and $m$ ...
Antonios Sarikas
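The lookahead evaluation the question refers to can be sketched in a few lines (one common formulation of Nesterov momentum; the quadratic test function and hyperparameters are illustrative):

```python
import numpy as np

def nag(grad, theta0, lr=0.1, beta=0.9, steps=200):
    """Nesterov momentum: the gradient is evaluated at the lookahead
    point theta + beta * m rather than at theta itself."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta + beta * m)   # lookahead gradient
        m = beta * m - lr * g
        theta = theta + m
    return theta

# Minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = nag(lambda t: 2.0 * t, theta0=[5.0])
```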
1 vote
0 answers
65 views

That is, because the error is coming from the end of the neural network (ie at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
Null Six
14 votes
6 answers
3k views

I am taking a deep learning in Python class this semester and we are doing linear algebra. Last lecture we "invented" linear regression with gradient descent (did least squares the lecture ...
Lukas
1 vote
0 answers
49 views

I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper. First of all, the main thing I am interested ...
kklaw
1 vote
0 answers
60 views

In many online machine-learning courses and videos (such as Andrew Ng's Coursera course), when it comes to regression (for example, regressing $Y$ on features $X$), although we have the closed form ...
ExcitedSnail
