Metrics to determine K in K-fold cross-validation
Consider a scenario where the dataset in hand is quite large, say 50000 samples that are well balanced between two classes. What metrics can be used to decide the value of K in K-fold cross-validation? In other words, is 5-fold CV enough, or should I go for 10-fold CV?
The rule of thumb is: the higher K, the better. But, putting the computational cost aside, what can be used to decide the value of K? Should we look at the overall performance, e.g. the average accuracy? That is, if accuracy(5-fold CV) ≈ accuracy(10-fold CV), can we opt for 5-fold CV? Is the standard deviation of the performance across folds important, i.e. the lower the better?
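A minimal sketch of such a comparison, assuming scikit-learn; the synthetic data and the logistic-regression model are placeholders for the real 50000-sample dataset and whatever model is actually being validated:

```python
# Hypothetical 5-fold vs 10-fold comparison; the synthetic dataset and the
# logistic-regression model stand in for the real data and model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=50000, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

for k in (5, 10):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{k}-fold CV: mean accuracy {scores.mean():.4f}, "
          f"std across folds {scores.std():.4f}")
```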
cross-validation accuracy performance
asked 1 hour ago by NCL
3 Answers
First of all, choosing K is basically a heuristic: it depends on the data and the model. In my opinion, 5 is a good choice most of the time; it does not need too much computation power or time, but you still have to try and see which value works better for your data. There is no free lunch!
I would also suggest another CV idea. For example, if you use 5-fold CV (without stratification or shuffling), you simply divide your data into 5 equal folds; equal here means that every fold has the same size, but each fold can still have a different distribution. So you can choose your folds manually: plot the distribution of the target variable and try to build folds that follow the same pattern.
You can also select between models built with different K based on a criterion such as the AIC.
answered 56 mins ago by silverstone
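A rough sketch of the fold-inspection idea (not from the original answer; it assumes scikit-learn, and the random labels y are a placeholder for the real target) comparing per-fold class balance with and without stratification:

```python
# Compare per-fold class balance for plain K-fold vs stratified K-fold.
# 'y' is a placeholder binary target; replace it with the real labels.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=50000)   # stand-in for the real target variable
X = np.zeros((len(y), 1))            # features do not matter for the split itself

splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
}
for name, splitter in splitters.items():
    print(name)
    for i, (_, test_idx) in enumerate(splitter.split(X, y)):
        # Fraction of the positive class in this fold's held-out part
        print(f"  fold {i}: positive rate {y[test_idx].mean():.3f}")
```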
You should ask yourself: why are we even doing cross-validation?
It is not to get better accuracy. You are trying to get a better estimate of the accuracy (or another metric) on unseen data; you want to know how well the model generalizes.
If you try to grid search for the "best K", you are going to either waste some data or get a worse estimate of the metric.
Wasting data: you split your data into two sets, grid search for K on one of them, and then run cross-validation (with the "best K") on the second set. Don't do this.
Getting a worse estimate: you grid search for the "best K" and choose the one that gives the best result according to your chosen metric. But now you have brought in information that you shouldn't have, and your estimate becomes too optimistic. That is the exact opposite of what you wanted when you started with cross-validation. Don't do this either.
So what should you do? Pick the largest K that makes sense for the problem you are trying to solve, and don't put the computational cost aside: the computational cost should determine K.
answered 42 mins ago by ExabytE
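To make the cost argument concrete, a small timing sketch (not part of the original answer; it assumes scikit-learn, and the dataset and model are placeholders) that reports both the CV estimate and its wall-clock cost for a few values of K:

```python
# Timing sketch: CV estimate and wall-clock cost for several values of K.
# The synthetic dataset and logistic-regression model are placeholders.
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=50000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

for k in (5, 10, 20):
    start = time.perf_counter()
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv)
    elapsed = time.perf_counter() - start
    print(f"K={k:2d}: estimate {scores.mean():.4f}, wall time {elapsed:.1f}s")
```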
"The rule of thumb is the higher K, the better."
I think a better rule of thumb is: the larger your dataset, the less important $k$ is.
Still, it is useful to have a general understanding of the impact of $k$ on the performance estimator (leaving aside computational costs):
- Increasing $k$ decreases the bias, because the training sets become larger and therefore represent the data better.
- Increasing $k$ increases the variance of the estimator, because the training sets become more similar to each other.
Also note that there is no unbiased estimator of the variance of the $k$-fold CV estimate. Together this means that there is no metric that can tell you the best $k$ if you leave computational costs aside. Some empirical studies suggest that 10 is a reasonable default.
And to be clear, $k$ is not a hyper-parameter you want to tune to find the best accuracy. If you find yourself performing $k_2$-fold CV to find the best $k_1$, something should hopefully feel wrong.
answered 34 mins ago by oW_
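A rough empirical illustration of the variance point above (not from the original answer; it assumes scikit-learn, and the synthetic dataset and logistic-regression model are placeholders): repeat $k$-fold CV with different shufflings and look at how much the overall estimate moves for each $k$.

```python
# Empirical sketch: repeat k-fold CV with different shufflings and look at the
# spread of the overall CV estimate for each k. Dataset and model are
# illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

for k in (2, 5, 10):
    estimates = []
    for seed in range(10):
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
        estimates.append(cross_val_score(model, X, y, cv=cv).mean())
    print(f"k={k:2d}: CV estimate {np.mean(estimates):.4f} "
          f"+/- {np.std(estimates):.4f} across 10 repetitions")
```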