Why is there a square in MSE (mean squared error)?
Please forgive me for such a beginner question; I'm learning statistics and machine learning.

I'm trying to understand mean squared error. I understand the "mean error", the mean of the errors between the real and predicted values. What puzzles me is why we take the square of the errors. If it is only to keep the values positive, why don't we just take absolute values? I want to understand what value squaring adds to the loss function.

Thanks.

Tags: machine-learning, loss-functions

Asked Feb 19 at 12:12 by rummykhan
Comments:

- Short answer: because the square function is nicely differentiable, while the absolute value function is not. – user2974951, Feb 19 at 12:15

- Any form of distance is acceptable; to put one above the others requires turning it into (the opposite of) a utility function for which the numerical values start making sense. Even parameterisation-free solutions require a choice between functional distances. – Xi'an, Feb 19 at 13:21

- There are so many good reasons that it is hard to write a comprehensive (and historically accurate) answer. Anyone who is up to the task should add one reason that is not yet mentioned: minimizing the sum of squared errors corresponds to maximum likelihood under normally distributed errors. – JiK, Feb 19 at 17:15

- Related: stats.stackexchange.com/questions/48267/… – kedarps, Feb 19 at 21:36
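The comments above give the usual reasons (differentiability, the Gaussian maximum-likelihood connection). One concrete consequence of the choice of loss can be seen numerically. The minimal Python sketch below is not part of the original thread and uses made-up data; it shows that the constant prediction minimizing the mean squared error is the sample mean, while the one minimizing the mean absolute error is the sample median, which is why the squared loss reacts more strongly to outliers.

    # Sketch (illustrative data): compare where MSE and MAE are minimized
    # when predicting a single constant c for a sample with a few outliers.
    import numpy as np

    rng = np.random.default_rng(0)
    y = rng.normal(loc=5.0, scale=2.0, size=1001)   # observed values
    y[:10] += 50                                    # a few large outliers

    candidates = np.linspace(y.min(), y.max(), 5001)
    mse = [np.mean((y - c) ** 2) for c in candidates]
    mae = [np.mean(np.abs(y - c)) for c in candidates]

    print("argmin MSE ~", candidates[np.argmin(mse)], " sample mean   =", y.mean())
    print("argmin MAE ~", candidates[np.argmin(mae)], " sample median =", np.median(y))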
3 Answers
MSE has some desirable properties, such as easier differentiability (as @user2974951 comments), which matters for further analysis: differentiability of the objective function is in general very important for analytical calculations. Taking absolute values instead gives the mean absolute error (MAE), which also has its applications; it is not the case that we always prefer MSE over MAE. Another reason is that squaring penalizes large errors more: if one error term $e_i$ is $999$ and another, $e_j$, is $50$, and we must choose which one to decrease by $1$, MAE is indifferent between the two, whereas MSE favors reducing the larger one, since the decrease in its square is much greater.

Answered Feb 19 at 12:25 by gunes
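To make the example in this answer concrete, here is a small sketch of my own (not the answerer's), using the same error values of 999 and 50. Reducing either error by 1 lowers MAE by the same amount, but lowers MSE far more when the larger error is the one reduced.

    # Two error terms, 999 and 50: compare the effect of reducing each by 1.
    errors = [999.0, 50.0]

    def mae(e):            # mean absolute error
        return sum(abs(x) for x in e) / len(e)

    def mse(e):            # mean squared error
        return sum(x * x for x in e) / len(e)

    reduce_big   = [998.0, 50.0]
    reduce_small = [999.0, 49.0]

    print("MAE drop, big vs small:", mae(errors) - mae(reduce_big),
          mae(errors) - mae(reduce_small))      # identical drops (0.5 each)
    print("MSE drop, big vs small:", mse(errors) - mse(reduce_big),
          mse(errors) - mse(reduce_small))      # 998.5 vs 49.5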
Comments:

- MSE is also much more amenable to linear-algebra analysis. – Acccumulation, Feb 19 at 16:36

- Squaring also provides some convexity, which is important in many applications, e.g. error functions for neural networks; non-convex loss functions can take more work to justify convergence. – Brevan Ellefsen, Feb 19 at 23:01
If $\hat\theta$ is an estimator of the parameter $\theta$, then the MSE $\mathbb{E}[(\hat\theta - \theta)^2]$ is the sum of the variance of $\hat\theta$ and the squared bias:

\begin{align*}
\mathbb{E}[(\hat\theta - \theta)^2] &= \mathbb{E}\big[\hat\theta^2 - 2\hat\theta\theta + \theta^2\big] \\
&= \mathbb{E}[\hat\theta^2] - 2\theta\,\mathbb{E}[\hat\theta] + \theta^2 \\
&= \mathbb{E}[\hat\theta^2] - \mathbb{E}[\hat\theta]^2 + \mathbb{E}[\hat\theta]^2 - 2\theta\,\mathbb{E}[\hat\theta] + \theta^2 \\
&= \mathrm{Var}(\hat\theta) + (\mathbb{E}[\hat\theta] - \theta)^2 \\
&= \mathrm{Var}(\hat\theta) + \mathrm{Bias}(\hat\theta)^2
\end{align*}

The MSE thus combines two important characteristics of an estimator: bias and variance. An estimator may have a small bias, but if it has a large variance it is not very useful. Conversely, an estimator may be very precise, i.e. have a small variance, but if it has a large bias it is also not very useful. The MSE takes both into account.

Moreover, one useful property of the MSE is that if $\hat\theta_n$ depends on the sample size $n$ and $\mathrm{MSE}(\hat\theta_n) \to 0$ as $n \to +\infty$ (so both variance and bias converge to zero), then $\hat\theta_n$ is consistent, i.e. it converges in probability to $\theta$.

Answered Feb 19 at 12:47 by winperikle
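As a quick numerical check of the decomposition above (my own sketch, not from the answer; the shrunk estimator is an arbitrary illustration), simulate many samples, apply a deliberately biased estimator of a normal mean, and compare the empirical MSE with variance plus squared bias.

    import numpy as np

    rng = np.random.default_rng(1)
    theta, n, reps = 3.0, 20, 200_000

    samples = rng.normal(loc=theta, scale=1.0, size=(reps, n))
    theta_hat = 0.9 * samples.mean(axis=1) + 0.1   # a shrunk (biased) estimator

    mse  = np.mean((theta_hat - theta) ** 2)
    var  = np.var(theta_hat)
    bias = np.mean(theta_hat) - theta

    print(f"MSE          = {mse:.5f}")
    print(f"Var + Bias^2 = {var + bias**2:.5f}")   # agrees: the identity also holds for empirical moments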
Comment:

- I'm not sure that the bias/variance argument is all that compelling. Yes, as variance and/or bias goes up, so does MSE. But that is also the case for mean absolute error. The formula might be different or more complicated, but at the level you are arguing ("we want an error expression which is high when either the bias or the variance is high"), mean absolute error (as well as a host of other error metrics) also suffices. To argue for MSE specifically, you would have to show why that particular functional form ($V + B^2$) has advantages (e.g. from a bias/variance trade-off perspective). – R.M., Feb 19 at 16:08
I think some of the answers here are not fully answering the question. If the point were just to penalize the error more, wouldn't squaring be rather arbitrary? If the mean squared error of one estimator (MSE1) is larger than that of another (MSE2), then sqrt(MSE1) > sqrt(MSE2) (proof: https://math.stackexchange.com/questions/1494484/using-proof-by-contradiction-to-show-that-xy-implies-sqrt-x-sqrt-y/1494511). The order is preserved, so taking the square root changes nothing and, in particular, does not penalize anything further.

The mean squared error (MSE) is a "distance" between the true value and the estimated value. The distance you are used to seeing is Euclidean distance in one dimension (i.e. sqrt((difference between two points)^2)). But how can we measure the distance between other objects, for example between two functions? At some points the "y-value" is higher for one function, and at other points it is higher for the other. To define a distance between two functions we need a more abstract definition of distance. We call such an abstract distance a metric, and we would like it to satisfy the following properties:

1. the distance between two objects cannot be negative;
2. the distance from "A to B" is the same as the distance from "B to A";
3. the distance from "A to C" is less than or equal to the distance from "A to B" plus the distance from "B to C".

Coming back to the example of measuring the distance between two functions: if we define the distance between two functions as the absolute value of the difference between their maximum y-values, then that metric satisfies the three properties. So if g(x) can take values from 1 to 5 over all possible x, and f(x) can take values from 2 to 4 over all x, then the distance between g and f is 5 - 4 = 1.

Getting back to your original question, the answer is that squaring the difference between the true value and the estimated value satisfies those properties of a distance (so we don't need to take a square root). It is the same for variance: the variance is the weighted sum of the squared distances between the possible outcomes and the mean, and the standard deviation is the square root of the variance. The reason we sometimes use the standard deviation as the measure of dispersion is that the variance is in squared units. For example, (5 feet - 1 feet)^2 = 16 feet^2; how can we compare 16 feet^2 with anything measured in plain feet? By taking the square root we can compare 4 feet with other things measured in feet.

To summarize, it doesn't really matter whether you take the square root; either way you are measuring the distance between two things. For variance we want to compare against other quantities in the same units, so we use the standard deviation. MSE is only ever compared with other MSEs, so there is no need to take the square root.

Note: some of the things I wrote are not rigorously shown or stated; I just wanted to give you the idea of how it works.

Answered Mar 3 at 16:46 by Yisroel Cahn
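To illustrate the ordering point in this answer, here is a tiny sketch (the model names and MSE values are made up) showing that ranking models by MSE or by its square root gives the same order, since the square root is monotone increasing.

    import math

    # hypothetical per-model mean squared errors
    mse_by_model = {"model_a": 4.0, "model_b": 2.25, "model_c": 9.0}
    rmse_by_model = {name: math.sqrt(v) for name, v in mse_by_model.items()}

    rank_mse  = sorted(mse_by_model,  key=mse_by_model.get)
    rank_rmse = sorted(rmse_by_model, key=rmse_by_model.get)

    print(rank_mse)                   # ['model_b', 'model_a', 'model_c']
    print(rank_rmse)                  # same order
    assert rank_mse == rank_rmse      # sqrt preserves the ordering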