Gradient descent optimization
I am trying to understand gradient descent optimization in ML algorithms. I understand that there is a cost function, where the aim is to minimize the error $\hat{y} - y$. Now, in a scenario where weights $w_1, w_2$ are being optimized to give the minimum error, and the optimization is done through partial derivatives, does each step change both $w_1$ and $w_2$? Or is it a combination, where for a few iterations only $w_1$ is changed, and once $w_1$ no longer reduces the error, the derivative moves on to $w_2$, to reach a local minimum? The application can be a linear regression model, a logistic regression model, or boosting algorithms.
optimization gradient-descent
asked 1 hour ago
Pb89
2 Answers
Gradient descent is applied to both $w_1$ and $w_2$ in each iteration. During each iteration, the parameters are updated according to the gradient. They will likely have different partial derivatives.
See https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression.
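To make the simultaneous update concrete, here is a minimal sketch in plain Python. It assumes a one-feature linear model $\hat{y} = w_1 x + w_2$ with mean squared error; the function name `gradient_step`, the toy data, and the learning rate are made up for illustration and are not taken from the linked article.

```python
def gradient_step(w1, w2, xs, ys, eta):
    """One gradient descent step for y_hat = w1 * x + w2 under mean squared error."""
    n = len(xs)
    # Partial derivatives of (1/n) * sum((y_hat - y)^2) with respect to w1 and w2.
    grad_w1 = sum(2 * (w1 * x + w2 - y) * x for x, y in zip(xs, ys)) / n
    grad_w2 = sum(2 * (w1 * x + w2 - y) for x, y in zip(xs, ys)) / n
    # Both weights are updated in the same step, each using its own partial derivative.
    return w1 - eta * grad_w1, w2 - eta * grad_w2


# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w1, w2 = 0.0, 0.0
history = [(w1, w2)]
for _ in range(2000):
    w1, w2 = gradient_step(w1, w2, xs, ys, eta=0.05)
    history.append((w1, w2))

print(history[1])  # after a single step, both w1 and w2 have already moved
print(w1, w2)      # approaches the generating values 2 and 1
```

Note that the two partial derivatives differ, so the two weights move by different amounts within the same step.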
answered 58 mins ago
SmallChess
Gradient descent updates all parameters at each step. You can see this in the update rule:
$$
w^{(t+1)} = w^{(t)} - \eta \nabla f\left(w^{(t)}\right).
$$
Since the gradient of the loss function $\nabla f(w)$ is vector-valued with dimension matching that of $w$, all parameters are updated at each iteration.
The learning rate $\eta$ is a positive number that re-scales the gradient. Taking too large a step can endlessly bounce you across the loss surface with no improvement in your loss function; too small a step can mean tediously slow progress towards the optimum.
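The effect of the step size can be seen in a small sketch. It assumes a hypothetical quadratic loss $f(w) = w_1^2 + 4 w_2^2$, chosen only because its gradient is easy to write down; the helper names `loss`, `grad`, and `run` and the three learning rates are illustrative assumptions.

```python
def loss(w):
    """Hypothetical quadratic loss f(w) = w1^2 + 4 * w2^2 with its minimum at (0, 0)."""
    w1, w2 = w
    return w1 ** 2 + 4 * w2 ** 2


def grad(w):
    """Gradient of the loss above: one partial derivative per parameter."""
    w1, w2 = w
    return (2 * w1, 8 * w2)


def run(eta, steps=50, start=(1.0, 1.0)):
    """Apply w <- w - eta * grad(w) for a fixed number of steps and return the final loss."""
    w = start
    for _ in range(steps):
        g = grad(w)
        # Every component of w is updated in each step.
        w = (w[0] - eta * g[0], w[1] - eta * g[1])
    return loss(w)


for eta in (0.001, 0.1, 0.3):
    print(eta, run(eta))
```

With these particular numbers, the smallest rate makes only slow progress in 50 steps, the middle rate gets very close to the minimum, and the largest rate overshoots along $w_2$ on every step so the loss grows instead of shrinking. Which rates count as "too large" or "too small" depends on the loss.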
edited 2 mins ago
answered 43 mins ago
Sycorax
So the algorithm may try different combinations, such as increasing $w_1$ and decreasing $w_2$, based on the direction given by the partial derivatives, in order to reach a local minimum? And just to confirm, the algorithm will not necessarily always find the global minimum?
– Pb89
26 mins ago
And do the partial derivatives also determine how much $w_1$ and $w_2$ should be increased or decreased, or is that set by the learning rate/shrinkage, while the partial derivatives only provide the direction of descent?
– Pb89
24 mins ago
The gradient is a vector, so it gives a direction and a magnitude. A vector can be arbitrarily rescaled by a positive scalar and it will have the same direction, but the rescaling will change its magnitude.
– Sycorax
13 mins ago
If the magnitude is also given by the gradient, then what is the role of shrinkage or the learning rate?
– Pb89
12 mins ago
The learning rate rescales the gradient. Suppose $\nabla f(x)$ has a large norm (length). Taking a large step will move you to a distant part of the loss surface (jumping from one mountain to another). The core justification of gradient descent is that it relies on a linear approximation of the loss in the vicinity of $w^{(t)}$. That approximation is always inexact, but it's probably worse the farther away you move; hence, you want to take small steps, so you use some small $\eta$, where 'small' is entirely problem-specific.
– Sycorax
8 mins ago
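As a rough worked version of the linear-approximation argument, assume $f$ is twice continuously differentiable (an assumption not stated in the thread) and write $g = \nabla f\left(w^{(t)}\right)$. A Taylor expansion of the loss after one step gives
$$
f\left(w^{(t)} - \eta g\right) = f\left(w^{(t)}\right) - \eta \|g\|^2 + \frac{\eta^2}{2} g^\top H g,
$$
where $H$ is the Hessian of $f$ at some point on the segment between $w^{(t)}$ and $w^{(t)} - \eta g$. The linear term $-\eta \|g\|^2$ always lowers the loss, while the remainder scales with $\eta^2$, so it is negligible for small $\eta$ and can dominate for large $\eta$.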