Binary predictor with highly skewed distribution


I am fitting a linear regression model that includes a binary predictor with a highly skewed distribution: one category accounts for 96% of the data, and the other 4% amounts to only 26 observations.

Should I keep or remove this binary predictor, and what is the rationale for doing so? Thank you in advance!










Tags: regression, binary-data, skewness, predictor

asked Oct 3 at 22:26 by curiousmind; edited Oct 3 at 22:46
          1 Answer
          In general, the skew is not an issue in itself; you should keep the predictor if it makes sense for it to be in the model, which presumably it does or it wouldn't be there to begin with.



          Consider, for example, a model for weekly sales of chayote squash in the New Orleans area (see https://en.wikipedia.org/wiki/Chayote, down in the "Americas" section). Such a model would likely need a dummy variable for Thanksgiving week to capture the very large increase in chayote sales at Thanksgiving (more than 5x "regular" sales). This dummy variable takes the value "1" once every 52 weeks and "0" the rest of the time, so the "not Thanksgiving week" category represents roughly 98% of the data. If we take the dummy variable out, our Thanksgiving forecasts will be terrible, and most likely the rest of our forecasts will be a lot worse as well, because the Thanksgiving data points distort the other estimates in various ways (e.g., trends look much steeper if Thanksgiving is near the end of the modeling horizon).
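
The effect of dropping such a dummy can be sketched with simulated data. Every figure below is made up for illustration and is not taken from real chayote sales: three years of weekly data, a baseline of 100 units, a trend of 0.1 units per week, a Thanksgiving bump of 400 units, and Thanksgiving placed at week 47 of each year. The fit is plain ordinary least squares via NumPy.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical weekly sales over three years (all numbers are invented).
n_weeks = 156
t = np.arange(n_weeks, dtype=float)
thanksgiving = (t % 52 == 47).astype(float)  # 1 in three weeks, 0 otherwise

true_slope = 0.1
sales = (100.0 + true_slope * t + 400.0 * thanksgiving
         + rng.normal(0.0, 5.0, n_weeks))


def ols(X, y):
    """Return the least-squares coefficients for y ~ X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


ones = np.ones(n_weeks)
with_dummy = ols(np.column_stack([ones, t, thanksgiving]), sales)
without_dummy = ols(np.column_stack([ones, t]), sales)

print("share of non-Thanksgiving weeks:", 1.0 - thanksgiving.mean())
print("trend with dummy:   ", with_dummy[1])
print("trend without dummy:", without_dummy[1])
```

In this setup the last Thanksgiving falls near the end of the series, so the fit without the dummy should attribute part of the spike to the trend and report a steeper slope than the one used to generate the data.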



          It's important, however, to note the following caveat. @Henry's comment in response to the OP is of course correct; if you only have one observation for one of the two categories, including the dummy variable will, in effect, simply remove that observation from the data set, and all your (other) parameter estimates would be the same as if you had just deleted that observation.
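
The caveat can be checked numerically. This is a minimal sketch on simulated data (the sample size, coefficients, and position of the lone observation are arbitrary choices): it fits the model once with a dummy that equals 1 for a single observation, and once with that observation deleted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data; the numbers are arbitrary and only serve the illustration.
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

lone = 10          # index of the only observation in the "rare" category
y[lone] += 8.0     # make it clearly different from the rest

dummy = np.zeros(n)
dummy[lone] = 1.0

# Fit 1: intercept, x, and the single-observation dummy.
X_with_dummy = np.column_stack([np.ones(n), x, dummy])
beta_with_dummy, *_ = np.linalg.lstsq(X_with_dummy, y, rcond=None)

# Fit 2: intercept and x only, with the lone observation deleted.
keep = dummy == 0
X_deleted = np.column_stack([np.ones(keep.sum()), x[keep]])
beta_deleted, *_ = np.linalg.lstsq(X_deleted, y[keep], rcond=None)

residuals = y - X_with_dummy @ beta_with_dummy

print("intercept and slope, dummy included:    ", beta_with_dummy[:2])
print("intercept and slope, observation deleted:", beta_deleted)
print("residual of the lone observation:", residuals[lone])
```

The dummy's coefficient absorbs the lone observation completely, so its residual is zero (up to floating-point error) and the intercept and slope agree between the two fits.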






          Thanks for your answer. I have made some edits to my question; does your response still hold? It seems it does, I just wanted to confirm with you.
          – curiousmind, Oct 3 at 22:56

          Yes, it does. I'll leave the caveat in there so that the answer is more widely applicable than just to the case where you have several observations in the "rare" category.
          – jbowman, Oct 3 at 23:01

          Thank you. This answer is helpful.
          – curiousmind, Oct 3 at 23:27










          answered Oct 3 at 22:46 by jbowman






