Binary predictor with highly skewed distribution
I am running a linear regression model and I have a binary predictor that has a highly skewed distribution. For example, one category represents 96% of the data. In terms of frequency, the other 4% represents 26 observations.
Should I keep or remove this binary predictor variable, and what is the rationale either way? Thank you in advance!
regression binary-data skewness predictor
edited Oct 3 at 22:46
asked Oct 3 at 22:26
curiousmind
1 Answer
In general, it's not an issue; you should keep it if it makes sense for it to be in the model, which presumably it does or it wouldn't be there to begin with.
Consider, for example, a model for weekly sales of chayote squash in the New Orleans area (see https://en.wikipedia.org/wiki/Chayote, in the "Americas" section). Such a model would likely need a dummy variable for Thanksgiving week in order to capture the very large increase in chayote sales at Thanksgiving (more than 5x "regular" sales). This dummy variable would take on the value "1" once every 52 weeks and "0" the rest of the time, so the "not Thanksgiving week" category represents roughly 98% of the data. If we take the dummy variable out, our Thanksgiving forecasts will be terrible, and all the rest of our forecasts will likely be a lot worse too, because they would be affected by the Thanksgiving data points in various ways (e.g., trends look much steeper if Thanksgiving is near the end of the modeling horizon, ...).
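To make the Thanksgiving example concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration only: five years of weekly data, a flat baseline of 100 units, Thanksgiving falling at week index 47 of each year, a jump to about 600 units that week, and Gaussian noise. It fits ordinary least squares with and without the dummy and compares the fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 years of weekly data, Thanksgiving at week index 47.
n_weeks = 52 * 5
week = np.arange(n_weeks)
thanksgiving = (week % 52 == 47).astype(float)  # "1" once every 52 weeks

# Assumed data-generating process: baseline 100, +500 in Thanksgiving week.
sales = 100.0 + 500.0 * thanksgiving + rng.normal(0.0, 10.0, n_weeks)


def fit_ols(X, y):
    """Return the least-squares coefficient vector for y ~ X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Model with intercept, linear trend, and the skewed dummy.
X_with = np.column_stack([np.ones(n_weeks), week, thanksgiving])
# Same model with the dummy removed.
X_without = np.column_stack([np.ones(n_weeks), week])

beta_with = fit_ols(X_with, sales)
beta_without = fit_ols(X_without, sales)

resid_with = sales - X_with @ beta_with
resid_without = sales - X_without @ beta_without

rmse_with = np.sqrt(np.mean(resid_with ** 2))
rmse_without = np.sqrt(np.mean(resid_without ** 2))

# Average absolute error in the Thanksgiving weeks when the dummy is omitted.
tg_mask = thanksgiving == 1.0
tg_error_without = np.mean(np.abs(resid_without[tg_mask]))

print("share of Thanksgiving weeks:", thanksgiving.mean())
print("estimated Thanksgiving effect:", beta_with[2])
print("RMSE with dummy:", rmse_with, "without dummy:", rmse_without)
print("mean abs. Thanksgiving error without dummy:", tg_error_without)
```

Under these assumptions the dummy is "1" in under 2% of rows, yet the model that includes it recovers the Thanksgiving effect, while the model without it misses the Thanksgiving weeks by several hundred units and has a much larger overall error.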
It's important, however, to note the following caveat. @Henry's comment in response to the OP is of course correct; if you only have one observation for one of the two categories, including the dummy variable will, in effect, simply remove that observation from the data set, and all your (other) parameter estimates will be the same as if you had just deleted that observation.
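The caveat is easy to check numerically. The sketch below uses made-up data (50 points from an assumed line y = 2 + 3x plus noise, with the first point shifted to play the lone "rare category" observation) and compares two fits: one with a dummy for that single observation, and one with the observation deleted.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y = 2 + 3x + noise, with one unusual observation.
n = 50
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)
y[0] += 25.0  # the single observation in the rare category

# Dummy that is 1 only for that one observation.
d = np.zeros(n)
d[0] = 1.0

# Fit A: keep all rows, include the dummy.
X_dummy = np.column_stack([np.ones(n), x, d])
beta_dummy, *_ = np.linalg.lstsq(X_dummy, y, rcond=None)

# Fit B: drop the observation, no dummy.
X_drop = np.column_stack([np.ones(n - 1), x[1:]])
beta_drop, *_ = np.linalg.lstsq(X_drop, y[1:], rcond=None)

# Intercept and slope agree between the two fits (up to rounding).
same_coefs = np.allclose(beta_dummy[:2], beta_drop)

# The dummy absorbs the observation completely: its residual is zero.
resid0 = y[0] - X_dummy[0] @ beta_dummy

print("other coefficients identical:", same_coefs)
print("residual of the single observation:", resid0)
```

The dummy coefficient simply takes whatever value makes the fitted value for that row equal to its observed value, so the row contributes nothing to the estimation of the intercept and slope.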
Thanks for your answer. I have made some edits to my question; does your response still hold? It seems it does, I just wanted to confirm with you.
– curiousmind Oct 3 at 22:56
Yes, it does. I'll leave the caveat in there so that the answer is more widely applicable than just to the case where you have several observations in the "rare" category.
– jbowman Oct 3 at 23:01
Thank you. This answer is helpful.
– curiousmind Oct 3 at 23:27
answered Oct 3 at 22:46
jbowman