How shall I understand what the GNU utilities “comm” and “diff” do in terms of ordered sets?

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP











up vote
2
down vote

favorite












I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.



In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)



  • the difference operation on two sets results in a set of the elements in the first set but not in the second.


  • When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.


Similar operations to set difference exist on files, such as comm from coreutils and diff from diffutils. But we can't think of a file as a set of lines, but as an ordered set of lines, because the lines are ordered by their line numbers naturally.



Moreover, comm and diff also work differently from each other.



What do comm and diff try to accomplish at concept level (at input and output level) respectively? I suspect I may need some basic knowledge on ordered sets to understand that. I don't expect an explanation at their implementation/algorithmic level, but that might help (I heard that version control is using same algorithms to accomplish incremental copy.)



Specifically, given two files, for each line in each file, how do comm and diff determine



  • whether the line also occur in the other file?

  • if it does, whether its occurrences in the two files are the same or different?

by taking into account the order between the lines in each file?



How does diff decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?



How do comm and diff differ when both are used for taking subtraction of two files?



Thanks.



Thanks.










share|cite|improve this question



























    up vote
    2
    down vote

    favorite












    I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.



    In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)



    • the difference operation on two sets results in a set of the elements in the first set but not in the second.


    • When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.


    Similar operations to set difference exist on files, such as comm from coreutils and diff from diffutils. But we can't think of a file as a set of lines, but as an ordered set of lines, because the lines are ordered by their line numbers naturally.



    Moreover, comm and diff also work differently from each other.



    What do comm and diff try to accomplish at concept level (at input and output level) respectively? I suspect I may need some basic knowledge on ordered sets to understand that. I don't expect an explanation at their implementation/algorithmic level, but that might help (I heard that version control is using same algorithms to accomplish incremental copy.)



    Specifically, given two files, for each line in each file, how do comm and diff determine



    • whether the line also occur in the other file?

    • if it does, whether its occurrences in the two files are the same or different?

    by taking into account the order between the lines in each file?



    How does diff decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?



    How do comm and diff differ when both are used for taking subtraction of two files?



    Thanks.



    Thanks.










    share|cite|improve this question

























      up vote
      2
      down vote

      favorite









      up vote
      2
      down vote

      favorite











      I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.



      In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)



      • the difference operation on two sets results in a set of the elements in the first set but not in the second.


      • When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.


      Similar operations to set difference exist on files, such as comm from coreutils and diff from diffutils. But we can't think of a file as a set of lines, but as an ordered set of lines, because the lines are ordered by their line numbers naturally.



      Moreover, comm and diff also work differently from each other.



      What do comm and diff try to accomplish at concept level (at input and output level) respectively? I suspect I may need some basic knowledge on ordered sets to understand that. I don't expect an explanation at their implementation/algorithmic level, but that might help (I heard that version control is using same algorithms to accomplish incremental copy.)



      Specifically, given two files, for each line in each file, how do comm and diff determine



      • whether the line also occur in the other file?

      • if it does, whether its occurrences in the two files are the same or different?

      by taking into account the order between the lines in each file?



      How does diff decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?



      How do comm and diff differ when both are used for taking subtraction of two files?



      Thanks.



      Thanks.










      share|cite|improve this question















      I am looking for some help to understand two GNU utility programs in terms of ordered sets. If you happen to know their usages, you might be able to understand what I am asking here.



      In mathematics, a set doesn't impose order between its elements. (if a set does, then it is called an ordered set, a different concept)



      • the difference operation on two sets results in a set of the elements in the first set but not in the second.


      • When taking the intersection of two sets, if an element is considered in both sets, it doesn't matter where it appears in each set.


      Similar operations to set difference exist on files, such as comm from coreutils and diff from diffutils. But we can't think of a file as a set of lines, but as an ordered set of lines, because the lines are ordered by their line numbers naturally.



      Moreover, comm and diff also work differently from each other.



      What do comm and diff try to accomplish at concept level (at input and output level) respectively? I suspect I may need some basic knowledge on ordered sets to understand that. I don't expect an explanation at their implementation/algorithmic level, but that might help (I heard that version control is using same algorithms to accomplish incremental copy.)



      Specifically, given two files, for each line in each file, how do comm and diff determine



      • whether the line also occur in the other file?

      • if it does, whether its occurrences in the two files are the same or different?

      by taking into account the order between the lines in each file?



      How does diff decide whether some line "occurs in both files but differ" or "occurs in one file but not the other"?



      How do comm and diff differ when both are used for taking subtraction of two files?



      Thanks.



      Thanks.







      elementary-set-theory algorithms computer-science order-theory






      share|cite|improve this question















      share|cite|improve this question













      share|cite|improve this question




      share|cite|improve this question








      edited 4 hours ago

























      asked 5 hours ago









      Tim

      16k20118306




      16k20118306




















          2 Answers
          2






          active

          oldest

          votes

















          up vote
          5
          down vote













          $mathttcomm$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.



          $mathttdiff$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $mathttdiff$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.






          share|cite|improve this answer




















          • Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
            – Tim
            4 hours ago











          • The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
            – Rob Arthan
            4 hours ago










          • Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
            – Tim
            2 hours ago










          • A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
            – Tim
            1 hour ago

















          up vote
          2
          down vote













          I don't know about comm, but the math name for what diff figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.



          Although you do have to think about it kind of backwards, as the output of diff consists of all the parts that aren't in the longest common subsequence.



          Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.






          share|cite|improve this answer






















            Your Answer





            StackExchange.ifUsing("editor", function ()
            return StackExchange.using("mathjaxEditing", function ()
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
            );
            );
            , "mathjax-editing");

            StackExchange.ready(function()
            var channelOptions =
            tags: "".split(" "),
            id: "69"
            ;
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function()
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled)
            StackExchange.using("snippets", function()
            createEditor();
            );

            else
            createEditor();

            );

            function createEditor()
            StackExchange.prepareEditor(
            heartbeatType: 'answer',
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader:
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            ,
            noCode: true, onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            );



            );













             

            draft saved


            draft discarded


















            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f2989204%2fhow-shall-i-understand-what-the-gnu-utilities-comm-and-diff-do-in-terms-of-o%23new-answer', 'question_page');

            );

            Post as a guest






























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes








            up vote
            5
            down vote













            $mathttcomm$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.



            $mathttdiff$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $mathttdiff$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.






            share|cite|improve this answer




















            • Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
              – Tim
              4 hours ago











            • The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
              – Rob Arthan
              4 hours ago










            • Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
              – Tim
              2 hours ago










            • A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
              – Tim
              1 hour ago














            up vote
            5
            down vote













            $mathttcomm$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.



            $mathttdiff$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $mathttdiff$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.






            share|cite|improve this answer




















            • Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
              – Tim
              4 hours ago











            • The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
              – Rob Arthan
              4 hours ago










            • Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
              – Tim
              2 hours ago










            • A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
              – Tim
              1 hour ago












            up vote
            5
            down vote










            up vote
            5
            down vote









            $mathttcomm$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.



            $mathttdiff$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $mathttdiff$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.






            share|cite|improve this answer












            $mathttcomm$ does a set operation on two sets of text lines on the assumption that the two sets are represented as sorted input files, so it can work through both sequences deciding which lines are in file 1 only, which are in file 2 only and which are in both without having to backtrack.



            $mathttdiff$ is quite different. It takes two text files and tries to find an efficient way of describing the second file as a result of editing the first. This is a loosely defined problem. $mathttdiff$ uses some form of the longest common subsequence algorithm to solve it. The results can be counter-intuitive.







            share|cite|improve this answer












            share|cite|improve this answer



            share|cite|improve this answer










            answered 4 hours ago









            Rob Arthan

            28.3k42865




            28.3k42865











            • Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
              – Tim
              4 hours ago











            • The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
              – Rob Arthan
              4 hours ago










            • Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
              – Tim
              2 hours ago










            • A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
              – Tim
              1 hour ago
















            • Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
              – Tim
              4 hours ago











            • The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
              – Rob Arthan
              4 hours ago










            • Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
              – Tim
              2 hours ago










            • A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
              – Tim
              1 hour ago















            Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
            – Tim
            4 hours ago





            Thanks. As far as my question are concerned, I don't require comm to work on sorted files only.
            – Tim
            4 hours ago













            The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
            – Rob Arthan
            4 hours ago




            The documentation for $mathttcomm$ is quite clear that the input files must be sorted. What $mathttcomm$ does on input files that are not sorted is implementation defined: it could produce an error message, or crash or deliver an arbitrary output.
            – Rob Arthan
            4 hours ago












            Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
            – Tim
            2 hours ago




            Thanks. The LCS algorithm explains how diff decides if a line exists in one file but not in the other. How does diff decide whether a line occurs in both files but differ (indicated by "!" in context mode of diff, "fct" i.e. replacement in normal mode of diff, and "|" in side by side mode of diff)
            – Tim
            2 hours ago












            A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
            – Tim
            1 hour ago




            A second question: when comm operates on two sorted files, does it also generate the same result as LCS algorithm applies to the sorted files?
            – Tim
            1 hour ago










            up vote
            2
            down vote













            I don't know about comm, but the math name for what diff figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.



            Although you do have to think about it kind of backwards, as the output of diff consists of all the parts that aren't in the longest common subsequence.



            Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.






            share|cite|improve this answer


























              up vote
              2
              down vote













              I don't know about comm, but the math name for what diff figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.



              Although you do have to think about it kind of backwards, as the output of diff consists of all the parts that aren't in the longest common subsequence.



              Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.






              share|cite|improve this answer
























                up vote
                2
                down vote










                up vote
                2
                down vote









                I don't know about comm, but the math name for what diff figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.



                Although you do have to think about it kind of backwards, as the output of diff consists of all the parts that aren't in the longest common subsequence.



                Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.






                share|cite|improve this answer














                I don't know about comm, but the math name for what diff figures out is "longest common subsequence". If you search using that term you will find tons of information about it; maybe start with the Wikipedia article.



                Although you do have to think about it kind of backwards, as the output of diff consists of all the parts that aren't in the longest common subsequence.



                Also, as to terminology, "ordered sets" isn't quite right, after all $mathbbR$ is an ordered set. I'd describe them as (finite) sequences.







                share|cite|improve this answer














                share|cite|improve this answer



                share|cite|improve this answer








                edited 4 hours ago

























                answered 4 hours ago









                JonathanZ

                1,970613




                1,970613



























                     

                    draft saved


                    draft discarded















































                     


                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function ()
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fmath.stackexchange.com%2fquestions%2f2989204%2fhow-shall-i-understand-what-the-gnu-utilities-comm-and-diff-do-in-terms-of-o%23new-answer', 'question_page');

                    );

                    Post as a guest













































































                    Popular posts from this blog

                    How to check contact read email or not when send email to Individual?

                    How many registers does an x86_64 CPU actually have?

                    Nur Jahan