Script Processes in Parallel

I would like to parse an Apache access log with respect to IPs. I used the following code, but it took nearly 90 seconds.



grep "^$CLIENT_IP" /var/log/http/access.log > /tmp/access-$CLIENT_IP.log


Then I tried the following alternative.



sed -i -e "/^$CLIENT_IP/w /tmp/access-$CLIENT_IP.log" -e '//d' /var/log/http/access.log


Even this took 60+ seconds.



There are 1200 IPs to parse. Is there any way to implement parallelism to reduce the runtime?







asked Apr 10 at 17:06 – SivaPrasath


















  • Why did you use -i with sed? Did that not modify the original log file?
    – Kusalananda
    Apr 10 at 17:09











  • Is it a requirement to save the parsed log to separate files?
    – Kusalananda
    Apr 10 at 17:38










  • -i is used to move the lines from one file to another. Yes, that's the requirement.
    – SivaPrasath
    Apr 10 at 19:24










  • No, -i is for doing in-place editing of the input file.
    – Kusalananda
    Apr 10 at 19:25










  • I'm deleting those copied lines from the input file.
    – SivaPrasath
    Apr 10 at 19:39














3 Answers

















Accepted answer (score 3):










I'm assuming that you're doing this in a shell loop over all IP addresses, possibly with the IP addresses coming from a text file. Yes, that would be slow, with one invocation of sed or grep per IP address.
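
For reference, a minimal sketch of the kind of per-IP loop assumed here (ip.list is a hypothetical file with one address per line; this is not code from the question):

while read -r CLIENT_IP; do
    # one full scan of the log per IP address; this is what makes it slow
    grep "^$CLIENT_IP" /var/log/http/access.log > "/tmp/access-$CLIENT_IP.log"
done < ip.list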



Instead, you may get away with a single use of sed, if you prepare carefully.



First, we have to create a sed script, and we do that from a file ip.list which contains the IP addresses, one address per line:



sed -e 'h' \
    -e 's/\./\\./g' \
    -e 's#.*#/^&[[:blank:]]/w /tmp/access-#' \
    -e 'G' \
    -e 's/\n//' \
    -e 's/$/.log/' ip.list >ip.sed


This sed stuff does, for each IP address,



  1. Copy the address to the "hold space" (an extra buffer in sed).

  2. Change each . in the "pattern space" (the input line) into \. (to match the dots literally; your code did not do this).

  3. Prepend /^ and append [[:blank:]]/w /tmp/access- to the pattern space.

  4. Append the unmodified input line from the hold space to the pattern space with a newline in-between.

  5. Delete that newline.

  6. Append .log to the end of the line (and implicitly output the result).

For a file that contains



127.0.0.1
10.0.0.1
10.0.0.100


this would create the sed script



/^127\.0\.0\.1[[:blank:]]/w /tmp/access-127.0.0.1.log
/^10\.0\.0\.1[[:blank:]]/w /tmp/access-10.0.0.1.log
/^10\.0\.0\.100[[:blank:]]/w /tmp/access-10.0.0.100.log


Note that you will have to match a blank character (space or tab) after the IP address, otherwise the log entries for 10.0.0.100 would go into the /tmp/access-10.0.0.1.log file. Your code omitted this.



This can then be used on your log file (no looping):



sed -n -f ip.sed /var/log/http/access.log


I haven't ever tested writing to 1200 files from one and the same sed script. If it doesn't work, then try the below awk variation instead.
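
If sed should hit a limit on the number of files it can keep open for the w commands, one possible fallback (a sketch, not part of the original answer) is to split ip.sed into chunks and run each chunk separately, at the cost of re-reading the log once per chunk:

# Hypothetical fallback: run the generated sed script in chunks of 300 commands.
# The chunk size is arbitrary; ip.sed and the log path are as above.
split -l 300 ip.sed ip.sed.chunk.
for f in ip.sed.chunk.*; do
    sed -n -f "$f" /var/log/http/access.log
done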




A similar solution with awk involves reading the IP addresses into an array first and then matching them against each row. This requires one single awk invocation:



awk 'FNR == NR { list[$1] = 1; next }
     $1 in list { name = $1 ".log"; print >>name; close(name) }' ip.list /var/log/http/access.log


Here, we give awk both the IP list and the log file at the same time. When NR == FNR we know we're still reading the first file (the list), and we add the IP numbers into the associative array list as keys, and continue with the next line of input.



If the FNR == NR condition is not true, we're reading from the second file (the log file) and we test whether the very first field of the input line is a key in list (this is a plain string comparison, not a regular expression match). If it is, we append the line to the appropriately named file.



We have to be careful with closing the output file, as we might otherwise run out of opened file descriptors. So there's going to be a lot of opening and closing files for appending, but it's still going to be faster than calling awk (or any utility) once per IP address.
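
As a side note, if the per-IP files should land under /tmp with the access- prefix used in the question, the output name can be built accordingly (a small variation on the program above, not part of the original answer):

awk 'FNR == NR { list[$1] = 1; next }
     $1 in list { name = "/tmp/access-" $1 ".log"; print >>name; close(name) }' ip.list /var/log/http/access.log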




I'd be interested in knowing if these things work for you and what the approximate running time might be. I have tested the solutions only on extremely small sets of data.




Of course, we could go with your idea of just brute forcing it by throwing multiple instances of e.g. grep on the system in parallel:



Ignoring the fact that we don't match the dots in the IP addresses correctly, we might do something like



xargs -P 4 -n 100 sh -c '
    for n do
        grep "^$n[[:blank:]]" /var/log/http/access.log >"/tmp/access-$n.log"
    done' sh <ip.list


Here, xargs will give at most 100 IP addresses at a time from the ip.list file to a short shell script, and it will run at most four parallel invocations of that script.



The short shell script:



for n do
    grep "^$n[[:blank:]]" /var/log/http/access.log >"/tmp/access-$n.log"
done


This will just iterate over the 100 IP addresses that xargs gives it on its command line and apply pretty much the same grep command that you had; the difference is that there will be four of these loops running in parallel.



Increase -P 4 to -P 16 or something related to the number of CPUs that you have. The speedup probably would not be linear as each parallel instance of grep would read from and write to the same disk.
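
For example, the job count could be derived from the machine rather than hard-coded (nproc is GNU coreutils, not POSIX, so treat this as an assumption about the target system):

# Run as many parallel loops as there are CPUs; fall back to 4 if nproc is missing.
xargs -P "$(nproc 2>/dev/null || echo 4)" -n 100 sh -c '
    for n do
        grep "^$n[[:blank:]]" /var/log/http/access.log >"/tmp/access-$n.log"
    done' sh <ip.list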



Except for the -P flag to xargs, all things in this answer should be able to run on any POSIX system. The -P flag for xargs is non-standard but implemented in GNU xargs and on BSD systems.






answered Apr 10 at 19:46 (edited Apr 10 at 20:41) – Kusalananda






















  • It worked... thanks
    – SivaPrasath
    Apr 15 at 9:38

















Answer (score 1):













For various approaches:
https://stackoverflow.com/questions/9066609/fastest-possible-grep



In addition to that, if you're doing this a lot then an SSD is probably the way to go. Touching the HD is the killer for something like this.



You have a large number of different greps to run. Make a script which launches the grep commands (say, one per core) into the background, tracks when they're done, and launches more as they complete.



When I was doing it I could get all 12 cores running at 100% CPU usage, but you may find your resource limit to be something else. Given that all your jobs want the same file, if you're not on an SSD you might want to copy that file around so they're not sharing it.
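
A rough sketch of the launch-and-refill pattern described above (it assumes bash 4.3+ for wait -n; ip.list and the log path are assumed names, not given in this answer):

#!/bin/bash
max_jobs=$(nproc)          # one grep per core, as suggested above
while read -r ip; do
    grep "^$ip[[:blank:]]" /var/log/http/access.log > "/tmp/access-$ip.log" &
    # keep at most max_jobs greps running; refill as soon as one finishes
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n
    done
done < ip.list
wait    # wait for the remaining background greps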






answered Apr 10 at 17:37 – Dark Matter



























Answer (score 0):













If /var/log/http/access.log is bigger than RAM and thus cannot be cached, then running more processes in parallel can be a good alternative to reading access.log multiple times - especially if you have multiple cores. This will run one grep per IP in parallel (+ a couple of helping wrapping processes).



pargrep() {
    # Send standard input to grep with different match strings in parallel
    # This command would be enough if you only have 250 match strings
    parallel --pipe --tee grep ^{} '>' /tmp/access-{}.log ::: "$@"
}

export -f pargrep
# Standard input is tee'ed to several pargreps.
# Each pargrep gets 250 match strings and thus starts 250 processes.
# For 1200 ips this starts 3600 processes taking around 1 GB RAM,
# but it reads access.log only once
cat /var/log/http/access.log |
    parallel --pipe --tee -N250 pargrep :::: ips





answered Apr 15 at 22:49 (edited Apr 15 at 23:01) – Ole Tange





















