
Questions tagged [gradient-descent]

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point. For stochastic gradient descent there is also the [sgd] tag.
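The update rule described above can be sketched in a few lines (a minimal illustration; the quadratic objective, learning rate, and step count are arbitrary choices):

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Take steps proportional to the negative gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)   # x_{k+1} = x_k - lr * grad f(x_k)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=[0.0])
```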

7 votes
1 answer
206 views

I’m trying to understand the common assumptions in machine-learning optimization theory, where a “well-behaved” loss function is often required to be both L-Lipschitz and β-smooth (i.e., have β-...
Antonios Sarikas
2 votes
0 answers
32 views

In the paper "Deep Residual Learning for Image Recognition", it is mentioned that "When deeper networks are able to start converging, a degradation problem has been exposed: with ...
Vignesh N
0 votes
0 answers
49 views

LightGBM is a specific implementation of gradient boosted decision trees. One notable difference is how samples used for calculating variance gain in split points are picked. In the algorithm, ...
yanis-falaki
2 votes
1 answer
64 views

When fitting neural networks, I often run stochastic gradient descent multiple times and take the run with the lowest training loss. I'm trying to look up research literature on this practice, but I'm ...
Jacob Maibach
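The restart-and-select practice this question describes can be sketched as follows (a hypothetical `train_once` stands in for a full SGD run; only the selection logic over multiple seeds is the point):

```python
import numpy as np

def train_once(seed):
    """Stand-in for one SGD training run: a real version would initialize
    the network from `seed`, run SGD, and return the final training loss."""
    rng = np.random.default_rng(seed)
    params = rng.normal(size=3)
    loss = float(np.sum(params ** 2))  # surrogate for the training loss
    return loss, params

# Run "SGD" several times from different seeds and keep the best run.
runs = [train_once(seed) for seed in range(5)]
best_loss, best_params = min(runs, key=lambda r: r[0])
```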
4 votes
1 answer
98 views

I was going through the algorithm for Stochastic Gradient Descent in a multilayer network from the book Machine Learning by Tom Mitchell, and it shows the formula for the weight-update rule. However, I don't ...
Machine123
10 votes
3 answers
2k views

Consider a neural network with 2 or more layers. After we update the weights in layer 1, the input to layer 2 ($a^{(1)}$) has changed, so $\partial z/\partial w$ is no longer correct, as $z$ has changed to $z^*$ and $z^*$ $\...
Yaron
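A common resolution to this puzzle is that backpropagation evaluates every layer's gradient at the current weights before any weight moves; the update is simultaneous, not sequential. A tiny two-layer sketch (hypothetical scalar weights and a ReLU hidden unit, not taken from the question):

```python
# Two-layer network y = w2 * relu(w1 * x), loss L = 0.5 * (y - target)^2.
# Backprop computes BOTH gradients at the current weights first; the
# weights are then updated together, so dz/dw is evaluated before any
# layer's parameters have changed.
w1, w2 = 0.5, -0.3          # illustrative scalar weights
x, target = 1.0, 2.0

a1 = max(w1 * x, 0.0)       # hidden activation (ReLU)
y = w2 * a1                 # network output
err = y - target            # dL/dy

g2 = err * a1                                       # dL/dw2 at old weights
g1 = err * w2 * (1.0 if w1 * x > 0 else 0.0) * x    # dL/dw1 at old weights

lr = 0.1
w1, w2 = w1 - lr * g1, w2 - lr * g2                 # simultaneous update
```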
1 vote
0 answers
88 views

Trying to learn basic machine learning, I wrote my own code for logistic regression where I minimize the usual log likelihood using gradient descent. This is the plot of the error function through a ...
user470820
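A minimal version of that setup, assuming full-batch gradient descent on the mean negative log-likelihood (the toy data, learning rate, and step count here are illustrative, not from the question):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Minimize the mean negative log-likelihood by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y) / len(y)   # gradient of the mean NLL
        w -= lr * grad
    return w

# Toy data with an intercept column: label is 1 iff the feature is positive.
X = np.array([[-2.0, 1.0], [-1.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = fit_logistic(X, y)
```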
2 votes
0 answers
59 views

Say I have a biased estimator, for example estimating $\log \mathbb{E}[f_\theta(x)]$ using Monte Carlo. Does this imply that $\nabla_\theta \log \mathbb{E}[f_\theta(x)]$ is also biased if estimated ...
Alberto
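The bias of the plug-in estimator is easy to demonstrate numerically: by Jensen's inequality $\mathbb{E}[\log \hat m] \le \log \mathbb{E}[\hat m]$, so $\log\big(\tfrac{1}{n}\sum_i f_\theta(x_i)\big)$ systematically underestimates $\log \mathbb{E}[f_\theta(x)]$. A sketch with the illustrative choice $f(x) = e^x$, $x \sim \mathcal{N}(0,1)$, for which $\log \mathbb{E}[f] = 1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_estimate(n):
    """Plug-in Monte Carlo estimate of log E[f(x)] with f(x) = exp(x)."""
    x = rng.normal(size=n)             # x ~ N(0, 1), so E[exp(x)] = exp(1/2)
    return np.log(np.mean(np.exp(x)))  # log of the sample mean

# Average many independent plug-in estimates: by Jensen's inequality the
# average sits strictly below the true value log E[exp(x)] = 0.5.
estimates = [plugin_estimate(10) for _ in range(20000)]
mean_estimate = float(np.mean(estimates))
```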
5 votes
1 answer
125 views

Context: There are many methods to solve least squares, but most of them cost on the order of $k n^3$ flops. Using gradient descent, one computes $A x_i$ and uses the error to update $x_{i+1} = x_i - c \times \mathrm{g}...
uranus
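That iteration can be written down directly (assuming the gradient of $\tfrac12\|Ax-b\|^2$, i.e. $A^\top(Ax-b)$; the problem size, step size, and step count are illustrative). Each step costs two matrix-vector products, $O(mn)$, versus the roughly cubic cost of a direct factorization:

```python
import numpy as np

def lstsq_gd(A, b, lr=0.01, steps=5000):
    """Minimize ||Ax - b||^2 by gradient descent."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        residual = A @ x - b
        x -= lr * (A.T @ residual)   # gradient of 0.5 * ||Ax - b||^2
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                       # consistent system: exact solution exists
x_hat = lstsq_gd(A, b)
```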
0 votes
0 answers
56 views

I'm implementing Nesterov Accelerated Gradient Descent (NAG) on an Extreme Learning Machine (ELM) with one hidden layer. My loss function is the Mean Squared Error (MSE) with $L^2$ regularization. The ...
Paolo Pedinotti
3 votes
1 answer
91 views

The whole point behind Nesterov optimization is to calculate the gradient not at the current parameter values $\theta_t$, but at $\theta_t + \beta m$, where $\beta$ is the momentum coefficient and $m$ ...
Antonios Sarikas
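The lookahead evaluation the question refers to can be sketched in a few lines (one common formulation of Nesterov momentum; the quadratic test function and hyperparameters are illustrative):

```python
import numpy as np

def nag(grad, theta0, lr=0.1, beta=0.9, steps=200):
    """Nesterov momentum: the gradient is evaluated at the lookahead
    point theta + beta * m rather than at theta itself."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta + beta * m)   # lookahead gradient
        m = beta * m - lr * g
        theta = theta + m
    return theta

# Minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = nag(lambda t: 2.0 * t, theta0=[5.0])
```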
1 vote
0 answers
65 views

That is, because the error is coming from the end of the neural network (ie at the output layer) and trickles back via backpropagation to the start of the neural network, does that mean that the ...
Null Six
14 votes
6 answers
3k views

I am taking a deep learning in Python class this semester and we are doing linear algebra. Last lecture we "invented" linear regression with gradient descent (did least squares the lecture ...
Lukas
1 vote
0 answers
49 views

I have been reading the following paper: https://arxiv.org/pdf/1706.05350, and I am having a hard time with some claims and derivations made in the paper. First of all, the main thing I am interested ...
kklaw
1 vote
0 answers
60 views

In many online machine-learning courses and videos (such as Andrew Ng's Coursera course), when it comes to regression (for example, regressing $Y$ on features $X$), although we have the closed form ...
ExcitedSnail
