Remove multiple strings from file on command line, high performance [closed]
Is there an elegant, high-performance one-liner way to remove multiple complete strings from an input?
I process large text files, e.g., 1 million lines in inputfile, and 100k matching strings in hitfile. I have a perl script which loads the hitfile into a hash, and then checks all 'words' in each line of an inputfile, but for my workflow I'd prefer a simple command to my script.
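A minimal sketch of that hash-based approach (not my actual script; it assumes one word per line in hitfile and treats runs of \w characters as "words") would be something like this:
perl -CSD -lne '
  BEGIN{ open my $h, "<", "hitfile" or die $!; chomp(my @k = <$h>); @hit{@k} = () }
  print join "", map { exists $hit{$_} ? "" : $_ } split /(\w+)/;
' inputfile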
The functionality I seek is equivalent to this:
perl -pe 's/\b(string1|string2|string3)\b//g'
or this method of nested sed's:
sed -e "$(sed 's:.*:s/&//ig:' hitfile)" inputfile
or looping in the shell:
while read w; do sed -i "s/$w//ig" inputfile ; done < hitfile
But those are way too expensive. This slightly more-efficient method works (How to delete all occurrences of a list of words from a text file?) but it's still very slow:
perl -Mopen=locale -Mutf8 -lpe '
  BEGIN{ open(A, "hitfile"); chomp(@k = <A>) }
  for $w (@k) { s/([ ,._;-])\Q$w\E([ ,._;-])/$1$2/g }' inputfile
But are there any other tricks to do this more concisely? Some other unix command or method I'm overlooking? I don't need regexes; I only need to compare pure/exact strings against a hash (for speed), i.e. "pine" should not match "pineapple", but it should match "(pine)".
For example, one idea I had was to expand the words in a file into separate lines:
Before:
Hello, world!
After:
Hello
,
world
!
And then process with grep -vf, and finally re-build/join the lines (see the sketch below).
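A rough sketch of that pipeline (assuming GNU sed/grep/tr, and a marker byte such as \x01 that never occurs in the input, used to remember where the original lines ended; hits are dropped only on exact whole-token matches, so "pine" is removed but "pineapple" is kept):
sed 's/$/\x01/' inputfile \
  | grep -oE '[[:alnum:]]+|[^[:alnum:]]' \
  | grep -Fvxf hitfile \
  | tr -d '\n' \
  | tr '\001' '\n'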
Any other ideas that would run fast and easy?
awk command-line perl ruby tr
closed as unclear what you're asking by G-Man, Rui F Ribeiro, schily, slm♦ Jul 11 at 13:26
Please clarify your specific problem or add additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.
What is the size of your file? Are you certain that your bottleneck is the CPU and not I/O?
– undercat
Jul 10 at 19:08
My file sizes vary, some have <50 chars per line, others have 500+ chars per line. Typical file size might be 100 MB for the input file.
– MichaelCodes
Jul 10 at 19:28
I would first build a Set data structure (available in Ruby by default, in Perl from CPAN), where I put each word from hitfile, and then go through inputfile, split every line into words, and apply the transformation.
– user1934428
Jul 11 at 6:34
I already have a solution in perl with a hash, which is very fast. However, it's a whole script, and I have long had the sense that there's a more cunning method which would also fit neatly on the command line.
– MichaelCodes
Jul 12 at 7:24
1 Answer
How big is your hitfile exactly? Could you show some actual examples of what you're trying to do? Since you haven't provided more details on your input data, this is just one idea to try out and benchmark against your real data.
Perl regexes are capable of becoming pretty big, and a single regex would allow you to modify the input file in a single pass. Here, I'm using /usr/share/dict/words as an example for building a huge regex; mine has ~99k lines and is ~1MB in size.
use warnings;
use strict;
use open qw/:std :encoding(UTF-8)/;
my ($big_regex) = do {
    open my $wfh, '<', '/usr/share/dict/words' or die $!;
    chomp( my @words = <$wfh> );
    # longest entries first, each quotemeta'd, joined into one big alternation
    map { qr/\b(?:$_)\b/ } join '|', map { quotemeta }
        sort { length $b <=> length $a } @words;
};
while (<>) {
    s/$big_regex//g;
    print;
}
I don't need regex, I only need to compare pure/exact strings against a hash (for speed). i.e. "pine" should not match "pineapple", but it should match "(pine)".
If "pine" should not match "pineapple", you need to check the characters before and after the occurrence of "pine" in the input as well. While certainly possible with fixed string methods, it sounds like the regex concept of word boundaries (b
) is what you're after.
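For example, this bash here-string demo removes the standalone "pine" (and the parenthesized one) but leaves "pineapple" untouched; it prints ", () and pineapple":
perl -pe 's/\bpine\b//g' <<< 'pine, (pine) and pineapple'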
Is there an elegant, high-performance one-liner way ... for my workflow I'd prefer a simple command to my script.
I'm not sure I agree with this sentiment. What's wrong with perl script.pl? You can use it with shell redirections/pipes just like a one-liner. Putting code into a script will unclutter your command line, and allow you to do complex things without trying to jam it all into a one-liner. Plus, short does not necessarily mean fast.
Another reason you might want to use a script is if you have multiple input files. With the code I showed above, building the regex is fairly expensive, so calling the script multiple times will be expensive - processing multiple files in a single script will eliminate that overhead. I love the UNIX principle, but for big data, calling multiple processes (sometimes many times over) and piping data between them is not always the most efficient method, and streamlining it all in a single program can help.
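For instance, if the script above is saved as remove_words.pl (the name is only illustrative), the big regex is built once and then reused for every file passed on the command line:
perl remove_words.pl input1.txt input2.txt > cleaned.txt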
Update: As per the comments, enough rope to shoot yourself in the foot 😉 Code that does the same as above in a one-liner:
perl -CDS -ple 'BEGIN{ local $/; ($r) = map { qr/\b(?:$_)\b/ } join "|", map { quotemeta } sort { length $b <=> length $a } split /\n/, <> } s/$r//g' /usr/share/dict/words input.txt
Thanks for your code, which is shorter than my script. I used a different method, checking each "word" against a hash in perl, also with good performance. My hitfile is currently exactly 100k lines (short lines, mostly dictionary words). I didn't know that Perl regexes can be so huge. Still... the heart of my question, and I don't understand what's unclear about this... is my curiosity over whether reconfiguring the workflow could allow it to be done (fast) as a command-line one-liner.
– MichaelCodes
Jul 12 at 7:29
@MichaelCodes "... whether reconfiguring the workflow could allow it to be done (fast) as a command-line one-liner" Then I did misunderstand, I thought your emphasis was on speed, not code length. (I'm still curious, is my solution faster than your existing script?) Although I still don't think that one-liners are always better, see my edit, I've jammed that script into a one-liner ☺
– haukex
Jul 12 at 8:57
That works nicely and fast, thanks!
– MichaelCodes
Jul 27 at 18:04