Filtering a large file with a large filter
I want to extract all lines of $file1 that start with a string stored in $file2.

$file1 is 4 GB in size with about 20 million lines; $file2 has 2 million lines, is about 140 MB in size, and contains two columns separated by ",". The maximal line length of both files is well below 1000, they are sorted with LC_ALL=C, and $file1 can contain any additional characters except .

Unexpectedly, this command

parallel --pipepart -a $file1 grep -Ff $file2

consumes an extreme amount of memory and gets killed by the OS.

The command works if I limit the number of threads:

parallel --pipepart -j 8 -a $file1 grep -Ff $file2

For the last command, htop reveals that each grep -Ff $file2 thread constantly occupies 12.3 GB of memory. I assume this demand comes from the dictionary grep builds from $file2.

How can I achieve such a filter more efficiently?

shell-script text-processing grep performance gnu-parallel

asked Feb 14 at 18:19, edited Feb 15 at 11:41 – katosh
You may want to split $file2 up into smaller pieces and run multiple times over these patterns. Feeding two million patterns into grep is an extremely suboptimal thing to do. – Kusalananda, Feb 14 at 18:23

@Kusalananda Can I realize such a split with GNU parallel? I am using -F as I am looking for partial matches at the beginning of the line. – katosh, Feb 14 at 18:26

@Kusalananda Why would it be suboptimal? I would argue: splitting $file2 only means reading $file1 multiple times, potentially with more comparisons, as grep may optimize its dictionary depending on the file given after -f. – katosh, Feb 14 at 20:36

I mean, it's suboptimal in the sense that grep was never made for matching 2M strings from one file in another. If $file1 has any form of structure that you could use, that would be beneficial. For example, you may only want to match against a particular column in a dataset and only look for exact matches; in that case join is a much better tool. – Kusalananda, Feb 14 at 20:42

The matches are from the beginning of the line until the second ",". But there might be additional commas. join only uses one field, and I am not sure what it does when the number of fields per line differs. – katosh, Feb 14 at 20:48
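For illustration, here is a minimal sketch of the idea discussed in these comments. It is not from the thread, and it assumes that $file2 has exactly two comma-separated columns with no embedded quoting, so that each line of $file2 is itself the key to compare against everything before the second comma in $file1:

# Hypothetical alternative to grep -Ff: load the 2 million keys of $file2 into
# an awk hash once, then stream $file1 and print every line whose first two
# comma-separated fields form a key present in $file2.
awk -F, 'NR==FNR { keys[$0]; next } ($1 FS $2) in keys' "$file2" "$file1" > matches

This reads $file1 only once and keeps a single copy of the key set in memory (on the order of a few hundred megabytes including hash overhead, rather than ~12 GB per grep process), at the cost of running single-threaded.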
1 Answer
It is covered in man parallel:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions

EXAMPLE: Grepping n lines for m regular expressions.

The simplest solution to grep a big file for a lot of regexps is:

grep -f regexps.txt bigfile

Or if the regexps are fixed strings:

grep -F -f regexps.txt bigfile

There are 3 limiting factors: CPU, RAM, and disk I/O.

RAM is easy to measure: If the grep process takes up most of your free memory (e.g. when running top), then RAM is a limiting factor.

CPU is also easy to measure: If the grep takes >90% CPU in top, then the CPU is a limiting factor, and parallelization will speed this up.

It is harder to see if disk I/O is the limiting factor, and depending on the disk system it may be faster or slower to parallelize. The only way to know for certain is to test and measure.

Limiting factor: RAM

The normal grep -f regexps.txt bigfile works no matter the size of bigfile, but if regexps.txt is so big it cannot fit into memory, then you need to split this.

grep -F takes around 100 bytes of RAM and grep takes about 500 bytes of RAM per 1 byte of regexp. So if regexps.txt is 1% of your RAM, then it may be too big.
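For reference, plugging the numbers from the question into that estimate (a rough back-of-the-envelope check, not an exact figure): $file2 holds about 140 MB of fixed strings, so

# ~140 MB of patterns * ~100 bytes of RAM per pattern byte for grep -F
echo $(( 140 * 100 )) MB    # => 14000 MB, i.e. roughly 14 GB per grep process

which is in line with the ~12.3 GB per grep -Ff $file2 thread observed in htop, and explains why running one grep per CPU thread exhausted memory.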
If you can convert your regexps into fixed strings do that. E.g. if the lines you are looking for in bigfile all look like:

ID1 foo bar baz Identifier1 quux
fubar ID2 foo bar baz Identifier2

then your regexps.txt can be converted from:

ID1.*Identifier1
ID2.*Identifier2

into:

ID1 foo bar baz Identifier1
ID2 foo bar baz Identifier2

This way you can use grep -F which takes around 80% less memory and is much faster.

If it still does not fit in memory you can do this:

parallel --pipepart -a regexps.txt --block 1M grep -Ff - -n bigfile |
  sort -un | perl -pe 's/^\d+://'

The 1M should be your free memory divided by the number of CPU threads and divided by 200 for grep -F and by 1000 for normal grep. On GNU/Linux you can do:

free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 }
            END { print sum }' /proc/meminfo)
percpu=$((free / 200 / $(parallel --number-of-threads)))k

parallel --pipepart -a regexps.txt --block $percpu --compress \
  grep -F -f - -n bigfile |
  sort -un | perl -pe 's/^\d+://'
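A purely hypothetical worked example of that formula (the free-memory and thread figures below are made up and do not come from the question):

# Suppose the awk snippet above reports about 64 GB (67108864 kB) of free/cached
# memory and parallel --number-of-threads reports 16. For grep -F the block size
# is then 67108864 / 16 / 200 ~= 20971 kB, i.e. --block of roughly 20M, so each
# concurrent grep builds a dictionary of about 20 MB * 100 bytes/byte ~= 2 GB,
# or about 32 GB across the 16 greps, which stays within the 64 GB available.
echo $(( 67108864 / 16 / 200 ))k    # => 20971k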
If you can live with duplicated lines and wrong order, it is faster to do:

parallel --pipepart -a regexps.txt --block $percpu --compress \
  grep -F -f - bigfile

Limiting factor: CPU

If the CPU is the limiting factor parallelization should be done on the regexps:

cat regexp.txt | parallel --pipe -L1000 --round-robin --compress \
  grep -f - -n bigfile |
  sort -un | perl -pe 's/^\d+://'

The command will start one grep per CPU and read bigfile one time per CPU, but as that is done in parallel, all reads except the first will be cached in RAM. Depending on the size of regexp.txt it may be faster to use --block 10m instead of -L1000.

Some storage systems perform better when reading multiple chunks in parallel. This is true for some RAID systems and for some network file systems. To parallelize the reading of bigfile:

parallel --pipepart --block 100M -a bigfile -k --compress \
  grep -f regexp.txt

This will split bigfile into 100MB chunks and run grep on each of these chunks. To parallelize both reading of bigfile and regexp.txt combine the two using --fifo:

parallel --pipepart --block 100M -a bigfile --fifo cat regexp.txt \
  \| parallel --pipe -L1000 --round-robin grep -f -

If a line matches multiple regexps, the line may be duplicated.

Bigger problem

If the problem is too big to be solved by this, you are probably ready for Lucene.

answered Feb 15 at 16:17 – Ole Tange
This makes it clear. Thank you very much! – katosh, Feb 15 at 17:34