How can I call an external command from within Miller (mlr)’s DSL?

Suppose I have the following CSV:



$ cat test.csv
id,domain
1,foo.com
2,bar.com


Using mlr put, I can easily map any function over a field in the CSV, as long as I can define it in the Miller DSL. So, for example, mlr --csv put '$id = $id + 1' will increment the id by 1 for each record.



But what if I can’t define the function in Miller’s DSL, perhaps because it is not pure? Suppose I wanted to map each domain in the CSV to an IP address. I’d like to do something like mlr --csv put '$ip = shell("nslookup $domain")'. Is there an easy way to do this?



Currently I am extracting the input field into a separate file, rewriting it in a separate shell script, and adding the result back in with mlr join. However, this is pretty messy, because my CSV is full of quoted commas and newlines, which I need to carefully handle myself rather than relying on Miller.










      shell csv miller






      edited Feb 15 at 11:54







      sjy

















      asked Jan 29 at 8:12









      sjy




















          Calling external commands from the Miller DSL



          The Miller DSL reference deals with calling external commands in the section on redirected-output statements:




          The print, dump, tee, emitf, emit, and emitp keywords all allow you to redirect output to one or more files or pipe-to commands.




          I couldn’t find this in the documentation (other than by inference from the examples), but the syntax for using these statements with a pipe-to command seems to be statement | quoted-shell-command, unquoted-mlr-expression. For example:



          $ mlr --csv put 'tee | "tr [a-z] [A-Z]", $*' test.csv
          id,domain
          1,foo.com
          2,bar.com
          ID,DOMAIN
          1,FOO.COM
          2,BAR.COM


          Note that the piped output appears after Miller’s output (in this case, the unchanged input, as tee does not affect the stream and put emits it). By suppressing put’s output with -q, and extracting a single field with print $domain rather than tee $*, we can get a list of IP addresses:



          $ mlr --csv put -q 'print | "xargs dig +short", $domain' test.csv
          23.23.86.44
          104.27.138.186
          104.27.139.186


          Miller didn’t do much for us here; we still had to use xargs to convert stdin into an argument (because dig does not accept domains on stdin). Moreover, dig printed two addresses for bar.com, so the output lines no longer match the input records one-to-one. Since mlr adheres to the Unix philosophy, it would have been easier just to join a pipe to the end of mlr --csv --headerless-csv-output cut -f domain test.csv if this was all I needed.
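The plain-pipe alternative mentioned above might look like the following sketch. The function name domains is made up for illustration; the mlr flags are the ones used elsewhere in this answer.

```shell
# Hypothetical helper: print the "domain" column of a CSV file,
# one value per line, with the header row suppressed.
domains() {
  mlr --csv --headerless-csv-output cut -f domain "$1"
}
```

With that in place, domains test.csv | xargs -n 1 dig +short runs one lookup per domain. Miller still does the CSV parsing, so quoted commas in other fields are not a problem.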



          Joining output from external commands to your input



          What I really wanted to do was assign the result of calling an external command to an in-stream variable in the Miller DSL, and as far as I can tell, this is not possible. However, by swapping xargs for GNU parallel, we can use the --tag option to keep track of the argument we gave dig, and benefit from flexible, concurrent I/O:



          $ mlr --csv --headerless-csv-output cut -f domain test.csv | parallel --tag dig +short
          foo.com 23.23.86.44
          bar.com 104.27.139.186
          bar.com 104.27.138.186
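If GNU parallel is not available, the tagging that --tag provides can be approximated in plain shell. This is only a sketch: tag_each is a made-up name, it runs the lookups sequentially rather than concurrently, and it assumes no value contains a tab.

```shell
# Hypothetical helper: tag_each CMD [ARGS...]
# Reads one value per line on stdin, runs CMD ARGS VALUE for each, and
# prints "VALUE<TAB>LINE" for every line the command writes to stdout.
tag_each() {
  while IFS= read -r value; do
    # </dev/null stops the command from swallowing the rest of our stdin
    "$@" "$value" </dev/null | while IFS= read -r line; do
      printf '%s\t%s\n' "$value" "$line"
    done
  done
}
```

It would be used in place of parallel --tag, for example mlr --csv --headerless-csv-output cut -f domain test.csv | tag_each dig +short.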


          Since we are dealing with CSV, parallel can actually handle this on its own, although we need to access fields by position ({2}) rather than name (domain):



          $ < test.csv parallel -C "," --skip-first-line --tagstring {2} dig +short {2}
          foo.com 23.23.86.44
          bar.com 104.27.139.186
          bar.com 104.27.138.186


          This is a tab-separated list of (domain, ip) pairs, so we can convert it back to CSV with a header using mlr --t2c --implicit-csv-header label domain,ip. Then, since both our output and our original test.csv have a domain field, we can use mlr join to produce a single output table, and mlr nest to implode the multiple values for bar.com:



          $ mlr --csv cut -f domain test.csv |
          parallel --skip-first-line --tag dig +short |
          mlr --t2c --implicit-csv-header label domain,ip |
          mlr --c2p --barred join -f test.csv -j domain then \
          nest --implode --values --across-records -f ip
          +---------+----+-------------------------------+
          | domain  | id | ip                            |
          +---------+----+-------------------------------+
          | foo.com | 1  | 23.23.86.44                   |
          | bar.com | 2  | 104.27.138.186;104.27.139.186 |
          +---------+----+-------------------------------+
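Newer Miller releases (Miller 6, for example) document a DSL function called system that runs a command string and returns its stdout, which is closer to the shell(...) call the question asks for. The sketch below assumes your mlr has that function, so check your version's function list first. The name with_command_output is made up for illustration, and the put -s flag is used to pass the command in as the out-of-stream variable @cmd.

```shell
# Hypothetical helper: with_command_output FILE CMD
# Adds an "ip" field holding the output of `CMD <domain>` for each record.
# Assumes a Miller version whose DSL provides system().
# The field value is pasted into a shell command line unescaped, so only
# run this on data you trust.
with_command_output() {
  # The trailing "| xargs" joins multi-line output into one
  # space-separated line, so each record keeps a single value.
  mlr --csv put -s cmd="$2" \
    '$ip = system(@cmd . " " . $domain . " | xargs")' "$1"
}
```

For the lookup in the question this would be called as with_command_output test.csv 'dig +short'. It starts one shell per record, so the parallel pipeline above is likely to be faster on large inputs.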





                edited Feb 15 at 15:19

























                answered Feb 15 at 13:28









                sjy