Gradient descent optimization

I am trying to understand gradient descent optimization in ML algorithms. I understand that there is a cost function, and that the aim is to minimize the error $\hat{y} - y$. Suppose two weights, w1 and w2, are being optimized to give the minimum error. When the optimization proceeds through partial derivatives, does each step change both w1 and w2? Or does it alternate, so that for a few iterations only w1 is changed and, once changing w1 no longer reduces the error, the derivative with respect to w2 takes over, until a local minimum is reached? The application could be a linear regression model, a logistic regression model, or a boosting algorithm.










      optimization gradient-descent






asked 1 hour ago by Pb89





















          2 Answers
Gradient descent is applied to both w1 and w2 in every iteration. During each iteration, both parameters are updated according to the gradient. Their partial derivatives will generally differ, so the two weights typically change by different amounts.

See https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression.
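
To make the simultaneous update concrete, here is a minimal sketch (not taken from the linked article). It assumes a two-weight linear model `y = w1*x1 + w2*x2`, a mean-squared-error cost, and a small made-up data set; the names `partials` and `gradient_descent`, the learning rate, and the step count are all hypothetical choices for illustration.

```python
# Hypothetical toy data generated from y = 2*x1 - 3*x2 (no noise).
data = [((1.0, 2.0), -4.0), ((2.0, 1.0), 1.0), ((3.0, 3.0), -3.0), ((0.5, 1.5), -3.5)]


def partials(w1, w2):
    """Partial derivatives of the mean squared error with respect to w1 and w2."""
    n = len(data)
    d1 = sum(2 * (w1 * x1 + w2 * x2 - y) * x1 for (x1, x2), y in data) / n
    d2 = sum(2 * (w1 * x1 + w2 * x2 - y) * x2 for (x1, x2), y in data) / n
    return d1, d2


def gradient_descent(w1=0.0, w2=0.0, eta=0.05, steps=2000):
    """Run plain gradient descent and record every (w1, w2) pair visited."""
    history = [(w1, w2)]
    for _ in range(steps):
        # Both partial derivatives are evaluated at the same current point...
        d1, d2 = partials(w1, w2)
        # ...and both weights are moved in the same step.
        w1, w2 = w1 - eta * d1, w2 - eta * d2
        history.append((w1, w2))
    return history


history = gradient_descent()
print("after 1 step:", history[1])
print("after all steps:", history[-1])
```

Already after the first step both entries of `history[1]` differ from the starting point, and with these assumed settings the final pair should be close to the weights (2, -3) used to generate the data.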







answered 58 mins ago by SmallChess























Gradient descent updates all parameters at each step. You can see this in the update rule:

$$
w^{(t+1)} = w^{(t)} - \eta \nabla f\left(w^{(t)}\right).
$$

Since the gradient of the loss function $\nabla f(w)$ is vector-valued with dimension matching that of $w$, all parameters are updated at each iteration.

The learning rate $\eta$ is a positive number that re-scales the gradient. Taking too large a step can endlessly bounce you across the loss surface with no improvement in your loss function; too small a step can mean tediously slow progress towards the optimum.
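
The effect of the step size can be seen on a toy problem. The sketch below is only an illustration under stated assumptions: the loss is a hypothetical quadratic bowl $f(w) = w_1^2 + w_2^2$, and the starting point, the step count, and the three learning rates are arbitrary choices, not values from the answer.

```python
def loss(w):
    """Hypothetical quadratic loss f(w) = w1^2 + w2^2, minimized at (0, 0)."""
    return w[0] ** 2 + w[1] ** 2


def grad(w):
    """Gradient of the loss above: one partial derivative per parameter."""
    return (2 * w[0], 2 * w[1])


def run(eta, steps=50, w=(3.0, -4.0)):
    """Apply w <- w - eta * grad(w) for a fixed number of steps; return the final loss."""
    for _ in range(steps):
        g = grad(w)
        w = (w[0] - eta * g[0], w[1] - eta * g[1])
    return loss(w)


# eta = 1.0 is too large here, eta = 0.001 is too small, eta = 0.3 is moderate.
results = {eta: run(eta) for eta in (1.0, 0.001, 0.3)}
for eta, final_loss in results.items():
    print(f"eta={eta}: loss after 50 steps = {final_loss}")
```

For this particular loss, `eta = 1.0` flips the sign of both weights at every step, so the iterate jumps back and forth across the bowl and the loss never improves; `eta = 0.001` lowers the loss only slightly in 50 steps; `eta = 0.3` gets very close to the minimum. Which values count as "too large" or "too small" depends entirely on the loss being minimized.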







answered 43 mins ago by Sycorax (edited 2 mins ago)












• So the algorithm may try different combinations, such as increasing w1 and decreasing w2, based on the direction given by the partial derivatives, in order to reach a local minimum? And just to confirm: the algorithm will not necessarily find the global minimum?
  – Pb89, 26 mins ago

• Does the partial derivative also determine how much w1 and w2 should be increased or decreased, or is that done by the learning rate/shrinkage, with the partial derivative providing only the direction of descent?
  – Pb89, 24 mins ago

• The gradient is a vector, so it gives both a direction and a magnitude. A vector can be rescaled by any positive scalar and keep the same direction, but the rescaling changes its magnitude.
  – Sycorax, 13 mins ago

• If the magnitude is also given by the gradient, then what is the role of shrinkage or the learning rate?
  – Pb89, 12 mins ago

• The learning rate rescales the gradient. Suppose $\nabla f(w)$ has a large norm (length). Taking a large step will move you to a distant part of the loss surface (jumping from one mountain to another). The core justification of gradient descent is that it's a linear approximation in the vicinity of $w^{(t)}$. That approximation is always inexact, but it's probably worse the farther away you move; hence, you want to take small steps, so you use some small $\eta$, where 'small' is entirely problem-specific.
  – Sycorax, 8 mins ago
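
The linear approximation mentioned in the last comment can be written out explicitly. As a sketch, assuming $f$ is twice differentiable and writing $\Delta$ for the step taken from $w^{(t)}$ (a symbol introduced here for illustration), the first-order Taylor expansion together with the gradient descent step gives:

$$
f\left(w^{(t)} + \Delta\right) \approx f\left(w^{(t)}\right) + \nabla f\left(w^{(t)}\right)^\top \Delta,
\qquad
\Delta = -\eta \nabla f\left(w^{(t)}\right)
\;\Longrightarrow\;
f\left(w^{(t+1)}\right) \approx f\left(w^{(t)}\right) - \eta \left\lVert \nabla f\left(w^{(t)}\right) \right\rVert^2 .
$$

The predicted decrease is proportional to $\eta$, but the error of the approximation grows with $\lVert \Delta \rVert^2 = \eta^2 \lVert \nabla f(w^{(t)}) \rVert^2$, so the prediction can only be trusted when $\eta$ is small relative to the curvature of $f$.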
















