Script Processes in Parallel
I would like to parse an Apache access log with respect to client IPs. I used the following code, but it took nearly 90 seconds:
grep "^$CLIENT_IP" /var/log/http/access.log > /tmp/access-$CLIENT_IP.log
Then I tried the alternative below:
sed -i -e "/^$CLIENT_IP/w /tmp/access-$CLIENT_IP.log" -e '//d' /var/log/http/access.log
but even this took 60+ seconds.
There are 1200 IPs to parse. Is there any way to introduce parallelism to reduce the runtime?
shell-script shell sed grep parallelism
Comments:
Why did you use -i with sed? Did that not modify the original log file? – Kusalananda, Apr 10 at 17:09
Is it a requirement to save the parsed log to separate files? – Kusalananda, Apr 10 at 17:38
-i is used for moving the lines from one file to another. Yes, that's the requirement. – SivaPrasath, Apr 10 at 19:24
No, -i is for doing in-place editing of the input file. – Kusalananda, Apr 10 at 19:25
I'm deleting those copied lines from the input file. – SivaPrasath, Apr 10 at 19:39
3 Answers
I'm assuming that you're doing this in a shell loop over all IP addresses, possibly with the IP addresses coming from a text file. Yes, that would be slow, with one invocation of sed or grep per IP address.
Instead, you may get away with a single use of sed, if you prepare carefully.
First, we have to create a sed script, and we do that from a file ip.list which contains the IP addresses, one address per line:
sed -e 'h' \
    -e 's/\./\\./g' \
    -e 's#.*#/^&[[:blank:]]/w /tmp/access-#' \
    -e 'G' \
    -e 's/\n//' \
    -e 's/$/.log/' ip.list >ip.sed
This sed script does the following for each IP address:
- Copy the address to the "hold space" (an extra buffer in sed).
- Change each . in the "pattern space" (the input line) into \. (to match the dots properly; your code did not do this).
- Prepend /^ and append [[:blank:]]/w /tmp/access- to the pattern space.
- Append the unmodified input line from the hold space to the pattern space, with a newline in-between.
- Delete that newline.
- Append .log to the end of the line (and implicitly output the result).
For a file that contains
127.0.0.1
10.0.0.1
10.0.0.100
this would create the sed script
/^127\.0\.0\.1[[:blank:]]/w /tmp/access-127.0.0.1.log
/^10\.0\.0\.1[[:blank:]]/w /tmp/access-10.0.0.1.log
/^10\.0\.0\.100[[:blank:]]/w /tmp/access-10.0.0.100.log
Note that you will have to match a blank character (space or tab) after the IP address, otherwise the log entries for 10.0.0.100 would go into the /tmp/access-10.0.0.1.log file. Your code omitted this.
This can then be used on your log file (no looping):
sed -n -f ip.sed /var/log/http/access.log
I haven't ever tested writing to 1200 files from one and the same sed script. If it doesn't work, then try the awk variation below instead.
A similar solution with awk involves reading the IP addresses into an array first and then matching them against each row. This requires a single awk invocation:
awk 'FNR == NR { list[$1] = 1; next }
     $1 in list { name = $1 ".log"; print >>name; close(name) }' ip.list /var/log/http/access.log
Here, we give awk both the IP list and the log file at the same time. When FNR == NR, we know we're still reading the first file (the list), so we add the IP numbers into the associative array list as keys and continue with the next line of input.
If the FNR == NR condition is not true, we're reading from the second file (the log file), and we test whether the very first field of the input line is a key in list (this is a plain string comparison, not a regular expression match). If it is, we append the line to the appropriately named file.
We have to be careful to close the output files, as we might otherwise run out of open file descriptors. So there's going to be a lot of opening and closing of files for appending, but it's still going to be faster than calling awk (or any utility) once per IP address.
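If you want the awk variant to produce the same /tmp/access-IP.log files as the sed approach (as written, it creates IP.log files in the current directory), a minimal tweak would be the following; the /tmp/access- prefix here is my assumption, not something this awk command specifies:
awk 'FNR == NR { list[$1] = 1; next }
     $1 in list { name = "/tmp/access-" $1 ".log"; print >>name; close(name) }
    ' ip.list /var/log/http/access.log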
I'd be interested in knowing if these things work for you and what the approximate running time might be. I have tested the solutions only on extremely small sets of data.
Of course, we could go with your idea of just brute forcing it, throwing multiple instances of e.g. grep at the system in parallel.
Ignoring the fact that we don't match the dots in the IP addresses correctly, we might do something like
xargs -P 4 -n 100 sh -c '
    for n do
        grep "^$n[[:blank:]]" /var/log/http/access.log >"/tmp/access-$n.log"
    done' sh <ip.list
Here, xargs will give at most 100 IP addresses at a time from the ip.list file to a short shell script, and it will arrange for four parallel invocations of that script.
The short shell script:
for n do
    grep "^$n[[:blank:]]" /var/log/http/access.log >"/tmp/access-$n.log"
done
This will just iterate over the 100 IP addresses that xargs gives it on its command line, and apply pretty much the same grep command that you had; the difference is that there will be four of these loops running in parallel.
Increase -P 4 to -P 16 or something related to the number of CPUs that you have. The speedup would probably not be linear, as each parallel instance of grep would read from and write to the same disk.
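As a small convenience, the pool size can be derived from the machine instead of being hard-coded; this sketch assumes the GNU coreutils nproc utility is available (it is not POSIX):
xargs -P "$(nproc)" -n 100 sh -c '
    for n do
        grep "^$n[[:blank:]]" /var/log/http/access.log >"/tmp/access-$n.log"
    done' sh <ip.list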
Except for the -P flag to xargs, everything in this answer should run on any POSIX system. The -P flag for xargs is non-standard, but it is implemented in GNU xargs and on BSD systems.
It worked... thanks – SivaPrasath, Apr 15 at 9:38
For various approaches, see:
https://stackoverflow.com/questions/9066609/fastest-possible-grep
In addition to that:
If you're doing this a lot, then an SSD is probably the way to go. Touching the HD is the killer for something like this.
You have a large number of different greps to run. Make a script that launches the grep commands (say, one per core) into the background, tracks when they're done, and launches more as they complete; a sketch of this pattern follows below.
When I was doing this I could get all 12 cores running at 100% CPU usage, but you may find your resource limit to be something else. Given that all your jobs want the same file, if you're not on an SSD you might want to copy that file around so they're not sharing it.
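A minimal sketch of that background-job pattern, assuming bash 4.3+ (for wait -n) and the ip.list / access.log names used in the question:
#!/bin/bash
# Sketch only: launch one grep per IP in the background, keeping at most
# $max_jobs greps running at once; refill the pool as jobs finish.
max_jobs=4                       # e.g. one per core; adjust to your machine

while read -r ip; do
    grep "^$ip[[:blank:]]" /var/log/http/access.log > "/tmp/access-$ip.log" &
    # If the pool is full, wait for any one background grep to finish
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
        wait -n
    done
done < ip.list
wait    # wait for the remaining greps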
If /var/log/http/access.log is bigger than RAM and thus cannot be cached, then running more processes in parallel can be a good alternative to reading access.log multiple times - especially if you have multiple cores. This will run one grep per IP in parallel (plus a couple of helper wrapper processes).
pargrep() {
    # Send standard input to grep with different match strings in parallel
    # This command would be enough if you only have 250 match strings
    parallel --pipe --tee grep ^{} '>' /tmp/access-{}.log ::: "$@"
}
export -f pargrep

# Standard input is tee'ed to several pargreps.
# Each pargrep gets 250 match strings and thus starts 250 processes.
# For 1200 IPs this starts 3600 processes taking around 1 GB RAM,
# but it reads access.log only once
cat /var/log/http/access.log |
    parallel --pipe --tee -N250 pargrep :::: ips
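The ips file here is assumed to contain one IP address per line, like ip.list in the accepted answer. If you don't already have such a list, one way to derive it from the log itself (my addition, not part of this answer) is:
# Take the first field (the client IP) of every log line and de-duplicate it
awk '{ print $1 }' /var/log/http/access.log | sort -u > ips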