Filtering a large file with a large filter

I want to extract all lines of $file1 that start with a string stored in $file2.



$file1 is about 4 GB with roughly 20 million lines; $file2 is about 140 MB with 2 million lines and contains two columns separated by ,. The maximal line length in both files is well below 1000 characters, both files are sorted with LC_ALL=C, and $file1 can contain any additional characters except .



Unexpectedly, this command



parallel --pipepart -a $file1 grep -Ff $file2


consumes an extreme amount of memory and gets killed by the OS.



The command works if I limit the number of jobs:



parallel --pipepart -j 8 -a $file1 grep -Ff $file2


For the last command, htop reveals that each grep -Ff $file2 process constantly occupies 12.3 GB of memory. I assume this demand comes from the dictionary grep builds from $file2.



How can I achieve such a filter more efficiently?










shell-script text-processing grep performance gnu-parallel

asked Feb 14 at 18:19, edited Feb 15 at 11:41 – katosh



















  • You may want to split $file2 up into smaller pieces and run multiple times over these patterns. Feeding two million patterns into grep is an extremely suboptimal thing to do.

    – Kusalananda, Feb 14 at 18:23












  • @Kusalananda Can I realize such a split with gnu parallel? I am using -F as I am looking for partial matches at the beginning of the line.

    – katosh, Feb 14 at 18:26











  • @Kusalananda Why would it be suboptimal? I would argue: splitting $file2 only means reading $file1 multiple times, potentially with more comparisons overall, since grep may optimize its dictionary depending on the file given to -f.

    – katosh, Feb 14 at 20:36






  • I mean, it's suboptimal in the sense that grep was never made for matching 2M strings from one file in another. If $file1 has any form of structure that you could use, that would be beneficial. For example, if you only want to match against a particular column in a dataset, and only look for exact matches, then join is a much better tool.

    – Kusalananda, Feb 14 at 20:42











  • The matches are from the beginning of the line until the second ,. But there might be additional commas. join only uses one field and I am not sure what it does when the number of fields per line differs.

    – katosh, Feb 14 at 20:48
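Regarding the join limitation discussed above: since the key is everything up to the second comma, one way around join's single-field restriction is a hash lookup in awk. The following is only a sketch, not something from this thread; it assumes a line of $file1 should be kept exactly when its first two comma-separated fields equal some line of $file2, and works with any POSIX awk.

# Sketch: one pass over $file2 to build a hash of keys, one pass over $file1 to filter.
awk -F, '
  NR == FNR { keys[$0]; next }   # first file ($file2): store each whole line as a key
  ($1 "," $2) in keys            # second file ($file1): print lines whose prefix up to
                                 # the second comma is a stored key
' "$file2" "$file1"

The hash holds one entry per line of $file2 (about 2 million short keys), which should need far less memory than the roughly 12 GB grep -F uses for the same pattern set.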















1 Answer

It is covered in man parallel: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions




EXAMPLE: Grepping n lines for m regular expressions.



The simplest solution to grep a big file for a lot of regexps is:



grep -f regexps.txt bigfile


Or if the regexps are fixed strings:



grep -F -f regexps.txt bigfile


There are 3 limiting factors: CPU, RAM, and disk I/O.



RAM is easy to measure: If the grep process takes up most of your free
memory (e.g. when running top), then RAM is a limiting factor.



CPU is also easy to measure: If the grep takes >90% CPU in top, then
the CPU is a limiting factor, and parallelization will speed this up.



It is harder to see if disk I/O is the limiting factor, and depending
on the disk system it may be faster or slower to parallelize. The only
way to know for certain is to test and measure.



Limiting factor: RAM



The normal grep -f regexps.txt bigfile works no matter the size of
bigfile, but if regexps.txt is so big it cannot fit into memory, then
you need to split this.



grep -F takes around 100 bytes of RAM and grep takes about 500 bytes
of RAM per 1 byte of regexp. So if regexps.txt is 1% of your RAM, then
it may be too big.



If you can convert your regexps into fixed strings, do that. E.g. if
the lines you are looking for in bigfile all look like:



ID1 foo bar baz Identifier1 quux
fubar ID2 foo bar baz Identifier2


then your regexps.txt can be converted from:



ID1.*Identifier1 
ID2.*Identifier2


into:



ID1 foo bar baz Identifier1
ID2 foo bar baz Identifier2


This way you can use grep -F which takes around 80% less memory and is
much faster.



If it still does not fit in memory you can do this:



parallel --pipepart -a regexps.txt --block 1M grep -Ff - -n bigfile |
  sort -un | perl -pe 's/^\d+://'


The 1M should be your free memory divided by the number of CPU threads
and divided by 200 for grep -F and by 1000 for normal grep. On
GNU/Linux you can do:



free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 }
            END { print sum }' /proc/meminfo)
percpu=$((free / 200 / $(parallel --number-of-threads)))k

parallel --pipepart -a regexps.txt --block $percpu --compress \
  grep -F -f - -n bigfile |
  sort -un | perl -pe 's/^\d+://'


If you can live with duplicated lines and wrong order, it is faster to
do:



parallel --pipepart -a regexps.txt --block $percpu --compress \
  grep -F -f - bigfile


Limiting factor: CPU



If the CPU is the limiting factor parallelization should be done on
the regexps:



cat regexp.txt | parallel --pipe -L1000 --round-robin --compress \
  grep -f - -n bigfile |
  sort -un | perl -pe 's/^\d+://'


The command will start one grep per CPU and read bigfile one time per
CPU, but as that is done in parallel, all reads except the first will
be cached in RAM. Depending on the size of regexp.txt it may be faster
to use --block 10m instead of -L1000.



Some storage systems perform better when reading multiple chunks in
parallel. This is true for some RAID systems and for some network file
systems. To parallelize the reading of bigfile:



parallel --pipepart --block 100M -a bigfile -k --compress \
  grep -f regexp.txt


This will split bigfile into 100MB chunks and run grep on each of
these chunks. To parallelize both reading of bigfile and regexp.txt
combine the two using --fifo:



parallel --pipepart --block 100M -a bigfile --fifo cat regexp.txt \
  \| parallel --pipe -L1000 --round-robin grep -f - {}


If a line matches multiple regexps, the line may be duplicated.



Bigger problem



If the problem is too big to be solved by this, you are probably ready
for Lucene.







– Ole Tange, answered Feb 15 at 16:17
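For the concrete numbers in the question, the quoted rule of thumb of about 100 bytes of RAM per byte of fixed-string pattern already accounts for the observation: 140 MB × 100 ≈ 14 GB, close to the 12.3 GB seen per grep process. Adapting the "Limiting factor: RAM" recipe to the question's files ($file2 as the pattern file, $file1 as the big file) would look roughly like the sketch below; it is derived from the answer above, not a command given in it.

# Block size: free RAM (kB from /proc/meminfo) divided by the number of
# CPU threads and by ~200 for grep -F, as in the recipe above.
free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 }
            END { print sum }' /proc/meminfo)
percpu=$((free / 200 / $(parallel --number-of-threads)))k

# Split $file2 (the patterns) into blocks that fit in RAM; each grep -F scans
# all of $file1 with one block. -n plus sort -un removes duplicate lines that
# appear when a line matches patterns from several blocks.
parallel --pipepart -a "$file2" --block "$percpu" --compress \
  grep -F -f - -n "$file1" |
  sort -un | perl -pe 's/^\d+://'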























  • This makes it clear. Thank you very much!

    – katosh, Feb 15 at 17:34









