Filtering a large file with a large filter
I want to extract all lines of $file1 that start with a string stored in $file2.

$file1 is 4 GB in size with about 20 million lines; $file2 has 2 million lines, is about 140 MB in size, and contains two columns separated by ",". The maximal line length of both files is well below 1000, they are sorted with LC_ALL=C, and $file1 can contain any additional characters except .

Unexpectedly, this command

parallel --pipepart -a $file1 grep -Ff $file2

consumes an extreme amount of memory and gets killed by the OS.

The command works if I limit the number of threads:

parallel --pipepart -j 8 -a $file1 grep -Ff $file2

For the last command, htop reveals that each grep -Ff $file2 thread constantly occupies 12.3 GB of memory. I assume this demand comes from the dictionary grep builds from $file2.

How can I achieve such a filter more efficiently?

shell-script text-processing grep performance gnu-parallel

asked Feb 14 at 18:19, edited Feb 15 at 11:41 – katosh
You may want to split $file2 up into smaller pieces and run multiple times over these patterns. Feeding two million patterns into grep is an extremely suboptimal thing to do. – Kusalananda, Feb 14 at 18:23

@Kusalananda Can I realize such a split with GNU parallel? I am using -F as I am looking for partial matches at the beginning of the line. – katosh, Feb 14 at 18:26

@Kusalananda Why would it be suboptimal? I would argue: splitting $file2 only means reading $file1 multiple times, potentially with more comparisons, as grep may optimize its dictionary depending on the file given after -f. – katosh, Feb 14 at 20:36

I mean, it's suboptimal in the sense that grep was never made for matching 2M strings from one file in another. If $file1 has any form of structure that you could use, that would be beneficial. For example, you may only want to match against a particular column in a dataset and only look for exact matches; in that case join is a much better tool. – Kusalananda, Feb 14 at 20:42

The matches are from the beginning of the line until the second ",". But there might be additional commas. join only uses one field, and I am not sure what it does when the number of fields per line differs. – katosh, Feb 14 at 20:48
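For illustration, here is a minimal sketch of the idea discussed in these comments. It is not from the thread, and it assumes that $file2 has exactly two comma-separated columns with no embedded quoting, so that each line of $file2 is itself the key to compare against everything before the second comma in $file1:

# Hypothetical alternative to grep -Ff: load the 2 million keys of $file2 into
# an awk hash once, then stream $file1 and print every line whose first two
# comma-separated fields form a key present in $file2.
awk -F, 'NR==FNR { keys[$0]; next } ($1 FS $2) in keys' "$file2" "$file1" > matches

This reads $file1 only once and keeps a single copy of the key set in memory (on the order of a few hundred megabytes including hash overhead, rather than ~12 GB per grep process), at the cost of running single-threaded.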
1 Answer
It is covered in man parallel:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions

EXAMPLE: Grepping n lines for m regular expressions.

The simplest solution to grep a big file for a lot of regexps is:

grep -f regexps.txt bigfile

Or if the regexps are fixed strings:

grep -F -f regexps.txt bigfile

There are 3 limiting factors: CPU, RAM, and disk I/O.

RAM is easy to measure: If the grep process takes up most of your free memory (e.g. when running top), then RAM is a limiting factor.

CPU is also easy to measure: If the grep takes >90% CPU in top, then the CPU is a limiting factor, and parallelization will speed this up.

It is harder to see if disk I/O is the limiting factor, and depending on the disk system it may be faster or slower to parallelize. The only way to know for certain is to test and measure.

Limiting factor: RAM

The normal grep -f regexps.txt bigfile works no matter the size of bigfile, but if regexps.txt is so big it cannot fit into memory, then you need to split this.

grep -F takes around 100 bytes of RAM and grep takes about 500 bytes of RAM per 1 byte of regexp. So if regexps.txt is 1% of your RAM, then it may be too big.
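For reference, plugging the numbers from the question into that estimate (a rough back-of-the-envelope check, not an exact figure): $file2 holds about 140 MB of fixed strings, so

# ~140 MB of patterns * ~100 bytes of RAM per pattern byte for grep -F
echo $(( 140 * 100 )) MB    # => 14000 MB, i.e. roughly 14 GB per grep process

which is in line with the ~12.3 GB per grep -Ff $file2 thread observed in htop, and explains why running one grep per CPU thread exhausted memory.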
If you can convert your regexps into fixed strings do that. E.g. if the lines you are looking for in bigfile all look like:

ID1 foo bar baz Identifier1 quux
fubar ID2 foo bar baz Identifier2

then your regexps.txt can be converted from:

ID1.*Identifier1
ID2.*Identifier2

into:

ID1 foo bar baz Identifier1
ID2 foo bar baz Identifier2

This way you can use grep -F which takes around 80% less memory and is much faster.

If it still does not fit in memory you can do this:

parallel --pipepart -a regexps.txt --block 1M grep -Ff - -n bigfile |
  sort -un | perl -pe 's/^\d+://'

The 1M should be your free memory divided by the number of CPU threads and divided by 200 for grep -F and by 1000 for normal grep. On GNU/Linux you can do:

free=$(awk '/^((Swap)?Cached|MemFree|Buffers):/ { sum += $2 }
            END { print sum }' /proc/meminfo)
percpu=$((free / 200 / $(parallel --number-of-threads)))k

parallel --pipepart -a regexps.txt --block $percpu --compress \
  grep -F -f - -n bigfile |
  sort -un | perl -pe 's/^\d+://'
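A purely hypothetical worked example of that formula (the free-memory and thread figures below are made up and do not come from the question):

# Suppose the awk snippet above reports about 64 GB (67108864 kB) of free/cached
# memory and parallel --number-of-threads reports 16. For grep -F the block size
# is then 67108864 / 16 / 200 ~= 20971 kB, i.e. --block of roughly 20M, so each
# concurrent grep builds a dictionary of about 20 MB * 100 bytes/byte ~= 2 GB,
# or about 32 GB across the 16 greps, which stays within the 64 GB available.
echo $(( 67108864 / 16 / 200 ))k    # => 20971k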
If you can live with duplicated lines and wrong order, it is faster to do:

parallel --pipepart -a regexps.txt --block $percpu --compress \
  grep -F -f - bigfile

Limiting factor: CPU

If the CPU is the limiting factor parallelization should be done on the regexps:

cat regexp.txt | parallel --pipe -L1000 --round-robin --compress \
  grep -f - -n bigfile |
  sort -un | perl -pe 's/^\d+://'

The command will start one grep per CPU and read bigfile one time per CPU, but as that is done in parallel, all reads except the first will be cached in RAM. Depending on the size of regexp.txt it may be faster to use --block 10m instead of -L1000.

Some storage systems perform better when reading multiple chunks in parallel. This is true for some RAID systems and for some network file systems. To parallelize the reading of bigfile:

parallel --pipepart --block 100M -a bigfile -k --compress \
  grep -f regexp.txt

This will split bigfile into 100MB chunks and run grep on each of these chunks. To parallelize both reading of bigfile and regexp.txt combine the two using --fifo:

parallel --pipepart --block 100M -a bigfile --fifo cat regexp.txt \
  \| parallel --pipe -L1000 --round-robin grep -f -

If a line matches multiple regexps, the line may be duplicated.

Bigger problem

If the problem is too big to be solved by this, you are probably ready for Lucene.

answered Feb 15 at 16:17 – Ole Tange
This makes it clear. Thank you very much! – katosh, Feb 15 at 17:34