Linear regression when Y is bounded and discrete













11












The question is straightforward: is it appropriate to use linear regression when Y is bounded and discrete (e.g. a test score from 1 to 100, or a predefined ranking from 1 to 17)? In this case, is it merely "not good" to use linear regression, or is it totally wrong to use it?























      regression multiple-regression least-squares linear bounds
















      asked Mar 4 at 15:04









      Master Shi




















          6 Answers


















          8












When a response or outcome $Y$ is bounded, various questions arise in fitting a model, including the following:

1. Any model that could predict values for the response outside those bounds is in principle dubious. Hence a linear model might be problematic, as there are no bounds on $\hat Y = Xb$ for predictors $X$ and coefficients $b$ whenever the $X$ are themselves unbounded in one or both directions. However, the relationship might be weak enough for this not to bite, and/or predictions might well remain within bounds over the observed or plausible range of the predictors. At one extreme, if the response is some mean $+$ noise, it hardly matters which model one fits.

2. As the response can't exceed its bounds, a nonlinear relationship is often more plausible, with predicted responses tailing off to approach the bounds asymptotically. Sigmoid curves or surfaces such as those predicted by logit or probit models are attractive in this regard and are no longer difficult to fit. A response such as literacy (or the fraction adopting any new idea) often shows such a sigmoid curve in time, and plausibly with almost any other predictor.

3. A bounded response can't have the variance properties expected in plain or vanilla regression. Necessarily, as the mean response approaches the lower or upper bound, the variance approaches zero.

A model should be chosen according to what works and knowledge of the underlying generating process. Whether the client or audience knows about particular model families may also guide practice.

Note that I am deliberately avoiding blanket judgments such as good/not good, appropriate/not appropriate, right/wrong. All models are approximations at best, and which approximation appeals, or is good enough for a project, isn't so easy to predict. I typically favour logit models as first choice for bounded responses myself, but even that preference is based partly on habit (e.g. my avoiding probit models for no very good reason) and partly on where I will report results, usually to readerships that are, or should be, statistically well informed.

Your examples of discrete scales are scores 1-100 (in assignments I mark, 0 is certainly possible!) or rankings 1-17. For scales like that, I would usually think of fitting continuous models to responses scaled to [0, 1]. There are, however, practitioners of ordinal regression models who would happily fit such models to scales with a fairly large number of discrete values. I would be happy for them to reply if they are so minded.
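Points 1 and 2 above can be illustrated with a minimal sketch on simulated data. Everything specific here is an assumption made for illustration and not part of the answer: the data-generating process, the seed, and the helper name `fit_fractional_logit`. The sketch fits ordinary least squares and a fractional logit (estimated by Newton steps on the Bernoulli quasi-likelihood) to a score rescaled to [0, 1], then predicts beyond the observed range of the predictor.

```python
import numpy as np

def fit_fractional_logit(X, y, n_iter=50):
    """Fit E[y | X] = 1 / (1 + exp(-X @ b)) for y in [0, 1].

    Hypothetical helper: Newton's method on the Bernoulli
    quasi-log-likelihood, which is concave in b.
    """
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (y - mu)                           # score vector
        hess = X.T @ (X * (mu * (1.0 - mu))[:, None])   # X' W X
        step = np.linalg.solve(hess, grad)
        b += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-2, 2, n)

# Assumed process: a score out of 100 whose mean follows a logistic curve
p = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(100, p) / 100.0        # response rescaled to [0, 1]

X = np.column_stack([np.ones(n), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_frac = fit_fractional_logit(X, y)

# Predict inside and well outside the observed range of x
x_new = np.array([-4.0, 0.0, 4.0])
X_new = np.column_stack([np.ones(x_new.size), x_new])
pred_ols = X_new @ b_ols
pred_frac = 1.0 / (1.0 + np.exp(-X_new @ b_frac))

print("OLS predictions:             ", pred_ols)
print("fractional logit predictions:", pred_frac)
```

Under these assumptions the linear fit leaves [0, 1] once the predictor moves far enough from the observed data, while the logit fit cannot. Whether that matters in practice depends, as the answer says, on the range of predictors one actually cares about.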




















            6












I work in health services research. We collect patient-reported outcomes, e.g. physical function or depressive symptoms, and they are frequently scored in the format you mentioned: a 0 to N scale generated by summing all the individual questions in the scale.

The vast majority of the literature I've reviewed has just used a linear model (or a hierarchical linear model if the data stem from repeat observations). I've yet to see anyone use @NickCox's suggestion of a (fractional) logit model, although it is a perfectly plausible model.

Item response theory strikes me as another plausible statistical model to apply. This is where you assume some latent trait $\theta$ causes responses to the questions through a logistic or ordered logistic model. That inherently handles the issues of boundedness and possible non-linearity that Nick raised.

The graph below stems from my forthcoming dissertation work. I fit a linear model (red) to a depressive symptom question score that has been converted to Z-scores, and an (explanatory) IRT model (blue) to the same questions. Basically, the coefficients for both models are on the same scale (i.e. in standard deviations). There's actually a fair bit of agreement in the size of the coefficients. As Nick alluded to, all models are wrong. But the linear model may not be too wrong to use.

[Figure: coefficient estimates from the linear model (red) and the explanatory IRT model (blue)]

That said, a fundamental assumption of almost all current IRT models is that the trait in question is bipolar, i.e. its support is $-\infty$ to $\infty$. That's probably not true of depressive symptoms. Models for unipolar latent traits are still under development, and standard software can't fit them. A lot of the traits we're interested in in health services research are likely to be unipolar, e.g. depressive symptoms, other aspects of psychopathology, patient satisfaction. So the IRT model may be wrong as well.

(Note: the model above was fit using Phil Chalmers' mirt package in R. Graph produced using ggplot2 and ggthemes. Color scheme draws from the Stata default color scheme.)
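To make the IRT idea concrete, here is a minimal sketch of how a logistic item response model yields a bounded, nonlinear expected sum score. It is not the model from the answer (which was fit with mirt in R). It assumes binary items under a two-parameter logistic model, and the function name `expected_sum_score` and all parameter values are made up for illustration.

```python
import numpy as np

def expected_sum_score(theta, a, b):
    """Expected sum score at latent trait value(s) theta.

    Assumes binary items with P(item = 1 | theta) = logistic(a * (theta - b)),
    where a is the item discrimination and b the item difficulty.
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))[:, None]
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # shape: (n_theta, n_items)
    return p.sum(axis=1)

# Hypothetical parameters for a five-item scale
a = np.array([1.0, 1.5, 0.8, 1.2, 2.0])      # discriminations
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])    # difficulties

theta = np.linspace(-6, 6, 7)
tcc = expected_sum_score(theta, a, b)

for t, s in zip(theta, tcc):
    print(f"theta = {t:5.1f}   expected sum score = {s:.3f}")
```

The expected score rises monotonically with the latent trait but flattens near 0 and near the number of items, which is the boundedness and non-linearity the answer refers to.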








            • 4
              Just because linear models are widely used does not mean they are appropriate. Many people use linear models because that is all they know or are comfortable with.
              – qwr, Mar 5 at 5:46










            • The medical literature is especially rife with poor practice that's propagated by "this is what this field/journal does" type ideology. As a general rule, I would neither use nor avoid something just because of its appearance, however common, in medical research.
              – LSC, Mar 5 at 11:38


















            1












Take a look at the predicted values and check whether they have roughly the same distribution as the original Ys. If they do, linear regression is probably fine, and you will gain little by improving your model.
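The check suggested above might be sketched as follows. The data are simulated (a ranking on 1 to 17 that piles up near the top of the scale), and the choice of percentiles is arbitrary. Both are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(0.0, 1.0, n)

# Assumed process: a latent score rounded and clipped to the ranking 1..17
latent = 14.0 + 3.0 * x + rng.normal(0.0, 2.0, n)
y = np.clip(np.rint(latent), 1, 17)

# Ordinary least squares fit and fitted values
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta

# Compare the two distributions at a few percentiles
qs = [0, 5, 25, 50, 75, 95, 100]
print("percentile  observed  predicted")
for q, obs, pred in zip(qs, np.percentile(y, qs), np.percentile(y_hat, qs)):
    print(f"{q:>10}  {obs:8.2f}  {pred:9.2f}")

# Share of fitted values that fall outside the admissible range
share_outside = np.mean((y_hat < 1) | (y_hat > 17))
print(f"share of predictions outside [1, 17]: {share_outside:.3f}")
```

Fitted values estimate conditional means, so they are usually less spread out than the observed Ys even when the model is adequate. The comparison is most informative near the bounds, where predictions outside the admissible range are the clearest warning sign.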




















              1












A linear regression may "adequately" describe such data, but it's unlikely. Many assumptions of linear regression tend to be violated in this type of data to such a degree that linear regression becomes ill-advised. I'll just choose a few assumptions as examples:

1. Normality - Even ignoring the discreteness of such data, it tends to exhibit extreme violations of normality because the distributions are "cut off" by the bounds.

2. Homoscedasticity - This type of data tends to violate homoscedasticity. Variances tend to be larger when the actual mean is towards the center of the range than when it is near the edges.

3. Linearity - Since the range of Y is bounded, the assumption is automatically violated: a straight line with nonzero slope in an unbounded predictor must eventually leave the bounds.

The violations of these assumptions are mitigated if the data tend to fall around the center of the range, away from the edges. But really, linear regression is not the optimal tool for this kind of data. Much better alternatives might be binomial regression or Poisson regression.
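The homoscedasticity point can be made precise with a standard inequality. The symbols $a$, $b$ and $\mu$ are notation introduced here for the lower bound, the upper bound and the mean $E[Y]$; they do not appear in the answer. Since $a \le Y \le b$, we have $(Y - a)(b - Y) \ge 0$, and taking expectations gives

$$E[Y^2] \le (a + b)\mu - ab,$$

so that

$$\operatorname{Var}(Y) = E[Y^2] - \mu^2 \le (a + b)\mu - ab - \mu^2 = (\mu - a)(b - \mu).$$

The bound is largest, $(b - a)^2 / 4$, when $\mu$ sits at the midpoint of the range, and it shrinks to zero as $\mu$ approaches either bound. For a score on 0 to 100, for example, the variance can be as large as $50 \times 50 = 2500$ at a mean of 50 but at most $95 \times 5 = 475$ at a mean of 95.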








              • 2
                It's hard to see that Poisson regression is a candidate for doubly bounded responses.
                – Nick Cox, Mar 4 at 17:20


















              0












If the response takes only a few categories, you may be able to use classification methods, or ordinal regression if your response variable is ordinal.

Plain linear regression will give you neither discrete categories nor bounded predictions. The latter can be fixed by using a logit model, as in logistic regression. For something like a test score with 100 categories (1-100), you might as well simplify your prediction and treat the score as a bounded continuous response.
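For the ordinal-regression option, a minimal sketch of how a proportional-odds (ordered logit) model turns a linear predictor into probabilities over ordered categories. The function name `ordered_logit_probs`, the cutpoints and the value of the linear predictor are all made up for illustration. A real analysis would estimate them with an existing routine.

```python
import numpy as np

def ordered_logit_probs(eta, cutpoints):
    """Category probabilities under a proportional-odds model.

    Assumes P(Y <= k) = logistic(c_k - eta) for increasing cutpoints c_k,
    so K cutpoints define K + 1 ordered categories.
    """
    cutpoints = np.asarray(cutpoints, dtype=float)
    cum = 1.0 / (1.0 + np.exp(-(cutpoints - eta)))   # P(Y <= k)
    cum = np.concatenate([[0.0], cum, [1.0]])
    return np.diff(cum)                              # P(Y = k)

# Hypothetical values: three cutpoints give four ordered categories
probs = ordered_logit_probs(eta=0.3, cutpoints=[-1.0, 0.0, 1.5])
print(probs, probs.sum())
```

Predictions from such a model are, by construction, probabilities over the admissible categories, so they are both discrete and bounded.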




















                0












Use a cdf (cumulative distribution function). If your model is y = xb + e, change it to y = cdf(xb + e). You will need to rescale your dependent variable to fall between 0 and 1: if it takes only positive values, divide by the maximum, then multiply your model predictions by the same number to return to the original scale. Then check the fit and see whether the bounded predictions improve things.

You probably want to use a canned algorithm to take care of the statistics for you.
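The rescaling and back-transformation steps might be sketched as below. Two things differ from the answer and are assumptions of this sketch. First, it rescales using both known bounds (min-max) and does not simply divide by the maximum, so that the lowest possible score maps to 0. Second, it applies the logistic cdf to the linear predictor alone, i.e. as an inverse link for the mean, which is the usual logit reading. The helper names and the example linear-predictor values are hypothetical.

```python
import numpy as np

def to_unit(y, lo, hi):
    """Map scores on [lo, hi] to [0, 1]."""
    return (np.asarray(y, dtype=float) - lo) / (hi - lo)

def from_unit(u, lo, hi):
    """Map values on [0, 1] back to the original [lo, hi] scale."""
    return lo + (hi - lo) * np.asarray(u, dtype=float)

def logistic_cdf(z):
    """Standard logistic cdf, used here as the inverse link."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

lo, hi = 1, 100                       # bounds of the test score
scores = np.array([1, 25, 50.5, 100])
u = to_unit(scores, lo, hi)           # values in [0, 1], ready for a logit-type model

# Hypothetical linear-predictor values xb from some fitted model
linear_predictor = np.array([-3.0, 0.0, 3.0])
pred = from_unit(logistic_cdf(linear_predictor), lo, hi)

print(u)
print(pred)                           # always inside [1, 100]
```

Whatever the linear predictor, the back-transformed predictions stay within the bounds of the original scale.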












                • This seems to confuse two facts: (1) bounded responses should be scaled to between 0 and 1 for logit, probit and similar models to apply; (2) cdfs also vary between 0 and 1. In treating a fractional response as such, you aren't modelling its cdf.
                  – Nick Cox, Mar 5 at 13:08











                    Answer beginning "When a response or outcome $Y$ is bounded": answered Mar 4 at 16:21 by Nick Cox, edited Mar 5 at 11:48























                        6












                        $begingroup$

                        I work in health services research. We collect patient-reported outcomes, e.g. physical function or depressive symptoms, and they are frequently scored in the format you mentioned: a 0 to N scale generated by summing up all the individual questions in the scale.



                        The vast majority of the literature I've reviewed has just used a linear model (or a hierarchical linear model if the data stem from repeat observations). I've yet to see anyone use @NickCox's suggestion for a (fractional) logit model, although it is a perfectly plausible model.



                        Item response theory strikes me as another plausible statistical model to apply. This is where you assume some latent trait $theta$ causes responses to the questions using a logistic or ordered logistic model. That inherently handles the issues of boundedness and possible non-linearity that Nick raised.



                        The graph below stems from my forthcoming dissertation work. This is where I fit a linear model (red) to a depressive symptom question score that's been converted to Z-scores, and an (explanatory) IRT model in blue to the same questions. Basically, the coefficients for both model are on the same scale (i.e. in standard deviations). There's actually a fair bit of agreement in the size of the coefficients. As Nick alluded to, all models are wrong. But the linear model may not be too wrong to use.



                        enter image description here



                        That said, a fundamental assumption of almost all current IRT models is that the trait in question is bipolar, i.e. its support is $-infty$ to $infty$. That's probably not true of depressive symptoms. Models for unipolar latent traits are still under development, and standard software can't fit them. A lot of the traits in health services research that we're interested in are likely to be unipolar, e.g. depressive symptoms, other aspects of psychopathology, patient satisfaction. So the IRT model may also be wrong as well.



                        (Note: the model above was fit usint Phil Chalmers' mirt package in R. Graph produced using ggplot2 and ggthemes. Color scheme draws from the Stata default color scheme.)






                        share|cite|improve this answer











                        $endgroup$








                        • 4




                          $begingroup$
                          Just because linear models are widely used, does not mean they are appropriate. Many people use linear models because that is only what they know or are comfortable with.
                          $endgroup$
                          – qwr
                          Mar 5 at 5:46










                        • $begingroup$
                          The medical literature is especially rife with poor practice that's propagated by "this is what this field/journal does" type ideaology. As a general rule, I would not use or fail to use something just because of it's appearance, however common, in medical research.
                          $endgroup$
                          – LSC
                          Mar 5 at 11:38















                        6












                        $begingroup$

                        I work in health services research. We collect patient-reported outcomes, e.g. physical function or depressive symptoms, and they are frequently scored in the format you mentioned: a 0 to N scale generated by summing up all the individual questions in the scale.



                        The vast majority of the literature I've reviewed has just used a linear model (or a hierarchical linear model if the data stem from repeat observations). I've yet to see anyone use @NickCox's suggestion for a (fractional) logit model, although it is a perfectly plausible model.



                        Item response theory strikes me as another plausible statistical model to apply. This is where you assume some latent trait $theta$ causes responses to the questions using a logistic or ordered logistic model. That inherently handles the issues of boundedness and possible non-linearity that Nick raised.



                        The graph below stems from my forthcoming dissertation work. This is where I fit a linear model (red) to a depressive symptom question score that's been converted to Z-scores, and an (explanatory) IRT model in blue to the same questions. Basically, the coefficients for both model are on the same scale (i.e. in standard deviations). There's actually a fair bit of agreement in the size of the coefficients. As Nick alluded to, all models are wrong. But the linear model may not be too wrong to use.



                        enter image description here



                        That said, a fundamental assumption of almost all current IRT models is that the trait in question is bipolar, i.e. its support is $-infty$ to $infty$. That's probably not true of depressive symptoms. Models for unipolar latent traits are still under development, and standard software can't fit them. A lot of the traits in health services research that we're interested in are likely to be unipolar, e.g. depressive symptoms, other aspects of psychopathology, patient satisfaction. So the IRT model may also be wrong as well.



                        (Note: the model above was fit usint Phil Chalmers' mirt package in R. Graph produced using ggplot2 and ggthemes. Color scheme draws from the Stata default color scheme.)






                        share|cite|improve this answer











                        $endgroup$








                        • 4




                          $begingroup$
                          Just because linear models are widely used, does not mean they are appropriate. Many people use linear models because that is only what they know or are comfortable with.
                          $endgroup$
                          – qwr
                          Mar 5 at 5:46










                        • $begingroup$
                          The medical literature is especially rife with poor practice that's propagated by "this is what this field/journal does" type ideology. As a general rule, I would not use or fail to use something just because of its appearance, however common, in medical research.
                          $endgroup$
                          – LSC
                          Mar 5 at 11:38





















                        edited Mar 4 at 21:46

























                        answered Mar 4 at 19:58









                        Weiwen Ng












                        1












                        $begingroup$

                        Take a look at the predicted values and check whether they have roughly the same distribution as the original Ys. If they do, linear regression is probably fine, and you will gain little by improving your model.
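One way to run this check is sketched below in plain Python. It assumes a single predictor and made-up data on a 1 to 17 ranking; with real data you would substitute your own fitted values. Comparing ranges and counting out-of-bounds predictions is only a crude stand-in for comparing the full distributions.

```python
def fit_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def share_out_of_bounds(preds, lo, hi):
    """Fraction of predictions falling outside the admissible range [lo, hi]."""
    return sum(p < lo or p > hi for p in preds) / len(preds)

# Hypothetical data: a ranking bounded between 1 and 17.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 1, 2, 4, 7, 10, 14, 16, 17, 17]

intercept, slope = fit_ols(x, y)
preds = [intercept + slope * xi for xi in x]
print("observed range: ", min(y), max(y))
print("predicted range:", round(min(preds), 2), round(max(preds), 2))
print("share outside [1, 17]:", share_out_of_bounds(preds, 1, 17))
```

In this toy data the fitted line leaves the 1 to 17 range at both ends, which is the kind of mismatch that would argue against the plain linear model.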

















                        $endgroup$

























                            edited Mar 4 at 16:07









                            Nick Cox











                            answered Mar 4 at 16:06









                            mzuba






















                                1












                                $begingroup$

                                A linear regression may "adequately" describe such data, but it's unlikely. Many assumptions of linear regression tend to be violated in this type of data to such a degree that linear regression becomes ill-advised. I'll just choose a few assumptions as examples:

                                1. Normality - Even ignoring the discreteness, such data tend to exhibit extreme violations of normality because the distributions are "cut off" by the bounds.

                                2. Homoscedasticity - This type of data tends to violate homoscedasticity. Variances tend to be larger when the actual mean is towards the center of the range than when it is near the edges.

                                3. Linearity - Since the range of Y is bounded while a straight line is not, the assumption cannot hold over the whole range of the predictors.

                                The violations of these assumptions are mitigated if the data tend to fall around the center of the range, away from the edges. But really, linear regression is not the optimal tool for this kind of data. Much better alternatives might be binomial regression or Poisson regression.
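The homoscedasticity point can be illustrated with a short Python sketch. It assumes, as an idealization that is not part of the answer above, that a 0 to N score behaves like a binomial count with N trials; the value N = 100 is hypothetical.

```python
def binomial_variance(n_trials, mean):
    """Variance of a Binomial(n_trials, p) count, written in terms of its mean n*p."""
    p = mean / n_trials
    return n_trials * p * (1.0 - p)

N = 100  # hypothetical maximum score
for mean in (5, 25, 50, 75, 95):
    print(f"mean={mean:3d}  variance={binomial_variance(N, mean):6.2f}")
```

Under that assumption the variance peaks when the mean sits at the middle of the range and shrinks towards either bound, so a constant error variance cannot hold across the scale.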















                                $endgroup$








                                • 2




                                  $begingroup$
                                  It's hard to see that Poisson regression is a candidate for doubly bounded responses.
                                  $endgroup$
                                  – Nick Cox
                                  Mar 4 at 17:20

























                                answered Mar 4 at 16:54









                                Stat_Programmer












                                0












                                $begingroup$

                                If the response takes only a few categories, you may be able to use classification methods, or ordinal regression if the response variable is ordinal.



                                Plain linear regression will give you neither discrete categories nor a bounded response variable. The latter can be fixed by using a logit model, as in logistic regression. For something like a test score with 100 categories (1-100), you might as well simplify your prediction and use a bounded response variable.
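For the ordinal case, a minimal Python sketch of a proportional-odds (cumulative logit) model is given below. This is one common form of ordinal regression, picked here as an assumption. The cutpoints and linear predictor values are hypothetical and only show that predictions are confined to the defined categories.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(eta, cutpoints):
    """P(Y = k) for k = 1..K under a cumulative logit model.

    eta is the linear predictor x*b; cutpoints must be increasing and
    define P(Y <= k) = logistic(cutpoints[k-1] - eta).
    """
    cum = [logistic(c - eta) for c in cutpoints] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

CUTPOINTS = [-1.0, 0.0, 1.5]  # hypothetical thresholds for 4 ordered categories
for eta in (-2.0, 0.0, 2.0):
    probs = category_probs(eta, CUTPOINTS)
    print(eta, [round(p, 3) for p in probs])
```

Each prediction is a probability distribution over the K categories, so it can never fall outside the scale or between categories. A larger linear predictor shifts the mass towards the higher ones.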















                                $endgroup$



























                                    answered Mar 5 at 5:45









                                    qwr






















                                        0












                                        $begingroup$

                                        Use a CDF (cumulative distribution function, from statistics). If your model is y=xb+e, then change it to y=cdf(xb+e). You will need to rescale your dependent variable to fall between 0 and 1. If it consists of positive numbers, divide by the max, then take your model predictions and multiply them by the same number.
                                        Then check the fit and see whether the bounded predictions improve things.



                                        You probably want to use a canned algorithm to take care of the statistics for you.
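One reading of this suggestion is a fractional logit model, sketched below in plain Python. This is an assumption rather than what the answer literally states: the sketch uses the logistic CDF and models the mean, E[y] = cdf(xb), instead of y = cdf(xb+e). The data and the 0 to 100 scale are made up, and the hand-rolled Newton solver is only for illustration; in practice a canned GLM routine would do this.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_fractional_logit(x, y01, iters=25):
    """Fit E[y] = logistic(b0 + b1*x) for y in [0, 1] by Newton's method
    on the Bernoulli quasi-log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(iters):
        p = [logistic(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Gradient of the quasi-log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y01, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y01, p, x))
        # Entries of the negated Hessian (a 2x2 symmetric matrix).
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical test scores on a 0-100 scale.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
score = [3, 5, 12, 20, 41, 55, 78, 90, 94, 99]
MAX_SCORE = 100

y01 = [s / MAX_SCORE for s in score]                      # rescale to [0, 1]
b0, b1 = fit_fractional_logit(x, y01)
preds = [MAX_SCORE * logistic(b0 + b1 * xi) for xi in x]  # back to the 0-100 scale
print([round(p, 1) for p in preds])
```

Because the predictions pass through the logistic function before being multiplied back, they always stay strictly inside the 0 to 100 range.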















                                        $endgroup$












                                        • $begingroup$
                                          This seems to confuse two facts: (1) bounded responses should be scaled to between 0 and 1 for logit, probit and similar models to apply; (2) cdfs also vary between 0 and 1. In treating a fractional response as such, you aren't modelling its cdf.
                                          $endgroup$
                                          – Nick Cox
                                          Mar 5 at 13:08

























                                        answered Mar 5 at 11:20









                                        dougie

















