Why is there a square in MSE (mean squared error)?

Please forgive me for such a beginner question; I'm learning stats and machine learning.

I'm trying to understand Mean Squared Error.

I understand the "Mean Error", the mean of the errors between real and predicted values. What puzzles me is why we take the square of the errors.

If it's just to keep the values positive, then why don't we simply take absolute values?

I just want to understand what value squaring brings to the actual loss function.

Thanks







  • Short answer: because the square function is nicely differentiable, whereas the absolute value function is not (it has a kink at zero).
    – user2974951
    Feb 19 at 12:15

  • Any form of distance is acceptable; putting one above the others requires turning it into (the opposite of) a utility function for which the numerical values start making sense. Even parameterisation-free solutions require a choice between functional distances.
    – Xi'an
    Feb 19 at 13:21

  • There are so many good reasons that it's hard to write a good comprehensive (and historically accurate) answer. Anyone who's up to the task should add one reason that's not yet mentioned: minimizing the sum of squared errors corresponds to maximum likelihood for normally distributed errors.
    – JiK
    Feb 19 at 17:15

  • Related: stats.stackexchange.com/questions/48267/…
    – kedarps
    Feb 19 at 21:36
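Two of the comments above (differentiability, and the link to maximum likelihood under normal errors) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the toy data `y`, the grid of candidate constants, and the choice `sigma = 1`. It shows that the squared error has a gradient everywhere while the absolute error does not at zero, and that the constant minimizing the sum of squared errors also minimizes the Gaussian negative log-likelihood.

```python
import math

def squared_error_grad(e):
    # Derivative of e**2 with respect to e: defined everywhere, equals 0 at e == 0
    return 2.0 * e

def absolute_error_grad(e):
    # Derivative of |e|: +1 or -1 away from zero, undefined at the kink e == 0
    if e == 0:
        raise ValueError("|e| is not differentiable at e = 0")
    return 1.0 if e > 0 else -1.0

def gaussian_nll(residuals, sigma=1.0):
    # Negative log-likelihood of residuals under i.i.d. N(0, sigma^2) noise:
    # a constant plus (sum of squared residuals) / (2 * sigma^2)
    n = len(residuals)
    sse = sum(r * r for r in residuals)
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + sse / (2 * sigma ** 2)

# Hypothetical data: predict every value of y with a single constant c
y = [1.0, 2.0, 4.0, 9.0]
candidates = [i / 100 for i in range(0, 1001)]  # c = 0.00, 0.01, ..., 10.00

best_by_sse = min(candidates, key=lambda c: sum((v - c) ** 2 for v in y))
best_by_nll = min(candidates, key=lambda c: gaussian_nll([v - c for v in y]))

print(best_by_sse, best_by_nll)  # both equal the sample mean of y
```

Because the negative log-likelihood is an increasing function of the sum of squared errors, the two criteria always pick the same constant; here that constant is the sample mean.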















machine-learning loss-functions
















asked Feb 19 at 12:12 by rummykhan

















3 Answers


















MSE has some desirable properties, such as being easily differentiable (as @user2974951 comments), which helps with further analysis. Differentiability of the objective function is in general very important for analytical calculations. Taking absolute values instead gives the Mean Absolute Error (MAE for short), which also has its applications; it's not as if we always prefer one of MSE or MAE. Another reason might be to penalize large errors more: if your error is large, its square is much larger. For example, if some error term $e_i$ is $999$ and another, $e_j$, is $50$, and we have to choose which term to decrease by $1$, MAE is indifferent between the two. MSE, however, targets the larger one, since the decrease in the square is larger.

answered Feb 19 at 12:25 by gunes
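The arithmetic in the answer above can be checked directly. This is only a sketch: the helper names `mae` and `mse` and the two-element error lists are assumptions made for illustration, using the answer's values 999 and 50.

```python
def mae(errors):
    # Mean absolute error of a list of error terms
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    # Mean squared error of a list of error terms
    return sum(e * e for e in errors) / len(errors)

errors = [999.0, 50.0]
reduce_large = [998.0, 50.0]  # shrink the large error by 1
reduce_small = [999.0, 49.0]  # shrink the small error by 1

# MAE improves by the same amount either way
mae_gain_large = mae(errors) - mae(reduce_large)
mae_gain_small = mae(errors) - mae(reduce_small)

# MSE improves far more when the large error is reduced
mse_gain_large = mse(errors) - mse(reduce_large)
mse_gain_small = mse(errors) - mse(reduce_small)

print(mae_gain_large, mae_gain_small)  # 0.5 0.5
print(mse_gain_large, mse_gain_small)  # 998.5 49.5
```

The MSE gains are $(999^2 - 998^2)/2 = 998.5$ and $(50^2 - 49^2)/2 = 49.5$, so an optimizer driven by MSE has roughly twenty times more incentive to work on the large error.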








  • MSE is also much more amenable to linear algebra analysis.
    – Acccumulation
    Feb 19 at 16:36

  • Squaring also provides some convexity, which is important in many applications, e.g. error functions for neural networks. Non-convex loss functions can take more work to justify convergence.
    – Brevan Ellefsen
    Feb 19 at 23:01
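As one illustration of the linear algebra remark above: with squared error, fitting a straight line has a closed-form solution, because setting the gradient to zero gives linear (normal) equations. The sketch below assumes a single feature plus an intercept; the function name `fit_line_least_squares` and the toy data are hypothetical.

```python
def fit_line_least_squares(xs, ys):
    # Closed-form least-squares fit of y = slope * x + intercept.
    # These formulas are what the normal equations reduce to for one feature.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line_least_squares(xs, ys)
print(slope, intercept)  # 2.0 1.0
```

No comparably simple formula exists for the line minimizing the mean absolute error; that problem is usually solved iteratively or by linear programming.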


















If $\hat\theta$ is an estimator of the parameter $\theta$, then the MSE $\mathbb{E}[(\hat\theta - \theta)^2]$ is the sum of the variance of $\hat\theta$ and the squared bias:

\begin{align*}
\mathbb{E}[(\hat\theta - \theta)^2] &= \mathbb{E}\big[\hat\theta^2 - 2\hat\theta\theta + \theta^2\big] \\
&= \mathbb{E}[\hat\theta^2] - 2\theta\mathbb{E}[\hat\theta] + \theta^2 \\
&= \mathbb{E}[\hat\theta^2] - \mathbb{E}[\hat\theta]^2 + \mathbb{E}[\hat\theta]^2 - 2\theta\mathbb{E}[\hat\theta] + \theta^2 \\
&= \text{Var}(\hat\theta) + (\mathbb{E}[\hat\theta] - \theta)^2 \\
&= \text{Var}(\hat\theta) + \text{Bias}(\hat\theta)^2
\end{align*}

The MSE thus combines two important characteristics of an estimator: bias and variance. An estimator with a small bias but a large variance is of little use. Conversely, an estimator may be very precise, i.e. have a small variance, but if it has a large bias it is also of little use. The MSE takes both into account.

Moreover, if $\hat\theta_n$ depends on the sample size $n$ and $\text{MSE}(\hat\theta_n) \to 0$ as $n \to +\infty$ (so that both variance and bias converge to zero), then $\hat\theta_n$ is consistent, i.e. it converges in probability to $\theta$. This follows from Markov's inequality: $P(|\hat\theta_n - \theta| > \varepsilon) \le \text{MSE}(\hat\theta_n)/\varepsilon^2$ for every $\varepsilon > 0$.

answered Feb 19 at 12:47 by winperikle
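The decomposition of the MSE into variance plus squared bias, derived in the answer above, can be verified numerically. This is a sketch under stated assumptions: the estimator is given a made-up discrete distribution (`values`, `probs`), the true parameter `theta` is set to 2, and expectations are computed exactly by enumeration rather than by simulation.

```python
def expectation(values, probs, f=lambda v: v):
    # Exact expectation of f(X) for a discrete random variable X
    return sum(p * f(v) for v, p in zip(values, probs))

theta = 2.0               # true parameter (assumed for illustration)
values = [1.0, 2.0, 4.0]  # values the estimator can take
probs = [0.25, 0.5, 0.25] # their probabilities

mean_hat = expectation(values, probs)
mse = expectation(values, probs, lambda v: (v - theta) ** 2)
variance = expectation(values, probs, lambda v: (v - mean_hat) ** 2)
bias = mean_hat - theta

print(mse)                   # 1.25
print(variance + bias ** 2)  # 1.25
```

Here the estimator has bias $0.25$ and variance $1.1875$, and $1.1875 + 0.25^2 = 1.25$ matches the MSE computed directly.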








  • I'm not sure that the bias/variance argument is all that compelling. Yes, as variance and/or bias goes up, so does MSE. But that's also the case for mean absolute error. The formula might be different or more complicated, but at the "we want an error expression which is high when either the bias or the variance is high" level at which you're arguing, mean absolute error (as well as a host of other error metrics) also suffices. -- To argue for MSE specifically, you would have to show why that particular functional form (V + B^2) has advantages (e.g. from a bias/variance tradeoff perspective).
    – R.M.
    Feb 19 at 16:08


















I think some of the answers here aren’t fully answering the question. If we are penalizing the error more, wouldn’t squaring it be rather arbitrary? If the mean squared error for one estimator (MSE1) is larger than the mean squared error for another estimator (MSE2), then sqrt(MSE1) > sqrt(MSE2) (proof: https://math.stackexchange.com/questions/1494484/using-proof-by-contradiction-to-show-that-xy-implies-sqrt-x-sqrt-y/1494511). The order is preserved, so taking the square root changes nothing about how estimators are ranked and, in fact, does not further penalize anything.

The mean squared error (MSE) is the “distance” between the true value and the estimated value. The distance you are used to seeing is the Euclidean distance in one dimension (i.e. sqrt((difference between two points)^2)). But how can we measure the distance between other objects? For example, how can we measure the distance between two functions? At some points the “y-value” is higher for one function, and at other points it is higher for the other. In order to define a distance between two functions, we need a more abstract definition of distance. We will call this abstract distance a metric, and we would like it to have the following properties:
1. the distance between two objects cannot be negative
2. the distance from “A to B” is the same as the distance from “B to A”
3. the distance from “A to C” is less than or equal to the distance from “A to B” plus the distance from “B to C”

So, coming back to our example of how to measure the distance between two functions: if we define the distance between two functions as the absolute value of the difference between their maximum y-values, then that metric satisfies the three properties. So if g(x) takes values from 1 to 5 over all possible values of x and f(x) takes values from 2 to 4, then the distance between g and f is 5 - 4 = 1.

Now, getting back to your original question: the squared difference between the true value and the estimated value already behaves like a distance in the sense that matters here. It is non-negative and symmetric, and its square root satisfies all three properties, so for ranking estimators we don’t need to take the square root. It is the same for variance. The variance is the weighted sum of the squared distances between the possible outcomes and the mean. The standard deviation is the square root of the variance. The reason we sometimes use the standard deviation as the measure of dispersion is that the variance is in squared units. For example, (5 feet - 1 foot)^2 = 16 feet^2. How can we compare 16 feet^2 with anything that is measured in just feet? By taking the square root, we can compare 4 feet with other things measured in feet.

So, to summarize, it doesn’t really matter whether you take the square root; you are still just measuring the distance between two things. For variance, we want to compare it to other things with the same unit, so we use the standard deviation. MSE is only being compared to other MSEs, so there is no need to take the square root.

Note: some of the things I wrote are not so rigorously shown or stated, but I just wanted to give you the idea of how it works.
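The two points in the answer above, that the square root preserves the ranking of estimators and that it restores the original units, can be seen in a small sketch. The prediction lists `pred_a` and `pred_b` and the target values `y_true` are made up for illustration.

```python
import math

def mse(y_true, y_pred):
    # Mean of the squared differences between targets and predictions
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0]
pred_a = [2.0, 5.0, 9.0]  # errors 1, 0, -2
pred_b = [3.0, 4.0, 7.0]  # errors 0, 1, 0

mse_a, mse_b = mse(y_true, pred_a), mse(y_true, pred_b)
rmse_a, rmse_b = math.sqrt(mse_a), math.sqrt(mse_b)

# sqrt is increasing on non-negative numbers, so both criteria rank the same way
same_ranking = (mse_a > mse_b) == (rmse_a > rmse_b)

# The units example from the answer: (5 ft - 1 ft)^2 is in ft^2, its root is in ft
squared_gap = (5.0 - 1.0) ** 2
gap = math.sqrt(squared_gap)

print(same_ranking)      # True
print(squared_gap, gap)  # 16.0 4.0
```

Whichever of MSE or RMSE is used, `pred_b` is preferred; the only thing the square root changes is the unit in which the error is reported.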





































    $begingroup$

    I think some of the answers here aren’t fully answering the question. If we are penalizing the error more, wouldn’t squaring it be rather arbitrary? If the mean squared error for one estimator (MSE1) is larger than the mean squared error for another estimator (MSE2), then sqrt(MSE1) > sqrt(MSE2) (proof: https://math.stackexchange.com/questions/1494484/using-proof-by-contradiction-to-show-that-xy-implies-sqrt-x-sqrt-y/1494511). The order is preserved, so taking the square root changes nothing about the comparison and, in fact, does not penalize anything further.
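To make the ordering point concrete, here is a minimal sketch in plain Python. The data and the names `mse`, `pred_a` and `pred_b` are made up for illustration; the only thing it is meant to show is that comparing two estimators by MSE or by its square root gives the same ranking.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical true values and two hypothetical sets of predictions.
y_true = [3.0, 5.0, 2.0, 7.0]
pred_a = [2.5, 5.0, 4.0, 8.0]
pred_b = [3.0, 4.5, 2.5, 7.5]

mse_a, mse_b = mse(y_true, pred_a), mse(y_true, pred_b)
rmse_a, rmse_b = math.sqrt(mse_a), math.sqrt(mse_b)

# The square root is monotonically increasing on non-negative numbers,
# so both comparisons agree on which estimator is worse.
same_ranking = (mse_a > mse_b) == (rmse_a > rmse_b)
print(mse_a, mse_b, same_ranking)
```

With these numbers, estimator A is worse under both measures; only the scale of the reported error changes.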



    The mean squared error (MSE) is the “distance” between the true value and the estimated value. The distance you are used to seeing is Euclidean distance in one dimension (i.e. sqrt((difference between two points)^2) ). But how can we measure the distance between other objects? For example, how can we measure the distance between two functions? At some points, the “y-value” is higher for one function, and at other points the “y-value” is higher for the other function. In order to define a distance between two functions, we need a more abstract definition of distance. We will call this abstract distance a metric, and we would like it to satisfy the following properties:
    1. the distance between two objects cannot be negative
    2. the distance from “A to B” is the same as the distance from “B to A”
    3. the distance from “A to C” is less than or equal to the distance from “A to B” plus the distance from “B to C”



    So, coming back to our example of how to measure the distance between two functions: if we define the distance between two functions as the absolute value of the difference between their maximum y-values, then that definition satisfies the three properties. So if g(x) takes values from 1 to 5 over all possible values of x and f(x) takes values from 2 to 4, then the distance between g and f is 5 - 4 = 1.
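Here is a rough sketch of that function distance, assuming each function is represented by a handful of sampled y-values. The name `max_gap` and the third function `h_values` are hypothetical and only there so the three properties can be checked on a concrete case.

```python
def max_gap(f_values, g_values):
    """Absolute difference between the largest sampled values of two functions."""
    return abs(max(f_values) - max(g_values))

g_values = [1, 3, 5, 2]  # g takes values from 1 to 5
f_values = [2, 4, 3, 2]  # f takes values from 2 to 4
h_values = [0, 6, 1, 1]  # hypothetical third function, for the triangle check

d_gf = max_gap(g_values, f_values)  # |5 - 4|

# Check the three listed properties on this example.
non_negative = d_gf >= 0
symmetric = max_gap(g_values, f_values) == max_gap(f_values, g_values)
triangle = max_gap(g_values, h_values) <= (
    max_gap(g_values, f_values) + max_gap(f_values, h_values)
)
print(d_gf, non_negative, symmetric, triangle)
```

Note that this distance is zero for any two functions that share the same maximum, even if they differ everywhere else, so it is a fairly coarse way to compare functions.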



    Now, getting back to your original question: the squared difference between the true value and the estimated value is non-negative and symmetric, so it satisfies the first two properties. Strictly speaking, it is the square root that also satisfies the third (for example, 0 and 2 are 4 apart in squared terms, which is more than 1 + 1 going via 1), but because the square root preserves order, comparisons come out the same either way, so we don’t need to take it. It is the same for variance. The variance is the weighted sum of the squared distances between the possible outcomes and the mean. The standard deviation is the square root of the variance. The reason we sometimes use the standard deviation as the measure of dispersion is that the variance is in squared units. For example, (5 feet - 1 foot)^2 = 16 feet^2. How can we compare 16 feet^2 with anything that is in just feet? By taking the square root, we can compare 4 feet with other things measured in feet.
    So, to summarize, it doesn’t really matter whether you take the square root; either way you are measuring how far apart two things are. For variance, we want to compare it to other things with the same unit, so we use the standard deviation. MSE is only being compared to other MSEs, so there is no need to take the square root.
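The units point can be illustrated with a small sketch using Python's standard `statistics` module. The two-point data set is just the 5 feet and 1 foot example reused as a tiny population; the variable names are made up for illustration.

```python
import statistics

heights_ft = [1.0, 5.0]  # the two lengths from the example, in feet

# Squared gap is in feet^2; its square root is back in feet.
squared_gap = (heights_ft[1] - heights_ft[0]) ** 2
gap = squared_gap ** 0.5

# Population variance is in feet^2; population standard deviation is in feet.
variance = statistics.pvariance(heights_ft)
std_dev = statistics.pstdev(heights_ft)

print(squared_gap, gap, variance, std_dev)
```

The variance and the standard deviation carry the same information about spread; the standard deviation is just expressed in the original unit, which is the same relationship as between MSE and its square root.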



    Note: some of the things I wrote are not so rigorously shown or stated, but I just wanted to give you the idea of how it works.















    $endgroup$



























        answered Mar 3 at 16:46









        Yisroel Cahn



























