Remove multiple strings from file on command line, high performance [closed]

Is there an elegant, high-performance one-liner way to remove multiple complete strings from an input?



I process large text files, e.g., 1 million lines in inputfile and 100k matching strings in hitfile. I have a perl script which loads the hitfile into a hash and then checks each 'word' on every line of the inputfile, but for my workflow I'd prefer a simple command to my script.



The functionality I seek is equivalent to this:



perl -pe 's/\b(string1|string2|string3)\b//g'


or this method of nested sed calls:



sed -e "$(sed 's:.*:s/&//ig:' hitfile)" inputfile


or looping in the shell:



while read -r w; do sed -i "s/$w//ig" inputfile; done < hitfile


But those are way too expensive. This slightly more efficient method (from How to delete all occurrences of a list of words from a text file?) works, but it's still very slow:



perl -Mopen=locale -Mutf8 -lpe '
  BEGIN{ open(A, "hitfile"); chomp(@k = <A>) }
  for $w (@k) { s/(^|[ ,.—_;-])\Q$w\E([ ,.—_;-]|$)/$1$2/g }' inputfile


But are there any other tricks to do this more concisely? Some other unix command or method I'm overlooking? I don't need regexes; I only need to compare pure/exact strings against a hash (for speed), i.e. "pine" should not match "pineapple", but it should match "(pine)".



For example, one idea I had was to expand the words in a file into separate lines:



Before:



Hello, world!


After:



¶
Hello
,
world
!


And then filter the exploded tokens with grep -vf, and re-build/join the lines.
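
For instance, a rough sketch of that pipeline (just an illustration, assuming GNU sed/grep/awk, a UTF-8 locale, one exact token per line in hitfile, and that the marker ¶ never occurs in the data):

# 1) mark each original line with a "¶" token of its own,
# 2) put every word and every single non-word character on its own line
#    (spaces are kept as tokens so the line can be rebuilt),
# 3) drop tokens that exactly match a hitfile entry (-F fixed string, -x whole line),
# 4) glue the tokens back together, turning the markers into newlines again
sed 's/^/¶\n/' inputfile \
  | grep -oE '¶|[[:alnum:]]+|[^[:alnum:]]' \
  | grep -Fxvf hitfile \
  | awk 'BEGIN{ORS=""} /^¶$/{if (NR>1) print "\n"; next} {print} END{print "\n"}'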



Any other ideas that would run fast and easy?







asked Jul 10 at 18:06 by MichaelCodes

closed as unclear what you're asking by G-Man, Rui F Ribeiro, schily, slm♦ Jul 11 at 13:26


Please clarify your specific problem or add additional details to highlight exactly what you need. As it's currently written, it’s hard to tell exactly what you're asking. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.










  • What is the size of your file? Are you certain that your bottleneck is the CPU and not I/O? – undercat, Jul 10 at 19:08

  • My file sizes vary; some have <50 chars per line, others have 500+ chars per line. A typical input file might be 100 MB. – MichaelCodes, Jul 10 at 19:28

  • I would first build a Set data structure (available in Ruby by default, in Perl from CPAN), put each word from hitfile into it, then go through inputfile, split every line into words, and apply the transformation. – user1934428, Jul 11 at 6:34

  • I already have a solution in perl with a hash, which is very fast. However, it's a whole script, and I have long had the sense that there's a more cunning method which would also fit neatly on the command line. – MichaelCodes, Jul 12 at 7:24

1 Answer


How big is your hitfile exactly? Could you show some actual examples of what you're trying to do? Since you haven't provided more details on your input data, this is just one idea to try out and benchmark against your real data.



Perl regexes are capable of becoming pretty big, and a single regex would allow you to modify the input file in a single pass. Here, I'm using /usr/share/dict/words as an example for building a huge regex; mine has ~99k lines and is about 1 MB.



use warnings;
use strict;
use open qw/:std :encoding(UTF-8)/;

my ($big_regex) = do {
    open my $wfh, '<', '/usr/share/dict/words' or die $!;
    chomp( my @words = <$wfh> );
    # quote metacharacters and sort longest-first, then build one big alternation
    map { qr/\b(?:$_)\b/ }
        join '|', map { quotemeta($_) }
        sort { length $b <=> length $a } @words;
};

while (<>) {
    s/$big_regex//g;
    print;
}
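
A hypothetical invocation (remove_words.pl is just a placeholder name for the script above; it reads the files named on its command line, or stdin, and writes to stdout):

perl remove_words.pl inputfile > cleaned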




I don't need regex, I only need to compare pure/exact strings against a hash (for speed). i.e. "pine" should not match "pineapple", but it should match "(pine)".




If "pine" should not match "pineapple", you need to check the characters before and after the occurrence of "pine" in the input as well. While certainly possible with fixed string methods, it sounds like the regex concept of word boundaries (b) is what you're after.




Is there an elegant, high-performance one-liner way ... for my workflow I'd prefer a simple command to my script.




I'm not sure I agree with this sentiment. What's wrong with perl script.pl? You can use it with shell redirections/pipes just like a one-liner. Putting code into a script will unclutter your command line, and allow you to do complex things without trying to jam it all into a one-liner. Plus, short does not necessarily mean fast.



Another reason you might want to use a script is if you have multiple input files. With the code I showed above, building the regex is fairly expensive, so calling the script multiple times will be expensive too; processing multiple files in a single invocation eliminates that overhead. I love the UNIX principle, but for big data, calling multiple processes (sometimes many times over) and piping data between them is not always the most efficient method, and streamlining it all in a single program can help.




Update: As per the comments, enough rope to shoot yourself in the foot 😉 here is code that does the same as above as a one-liner (the BEGIN block slurps the first file, the word list, before the -p loop processes the remaining files):



perl -CDS -ple 'BEGIN{local $/; ($r) = map {qr/\b(?:$_)\b/} join "|", map quotemeta, sort {length $b <=> length $a} split /\n/, <>} s/$r//g' /usr/share/dict/words input.txt





answered Jul 11 at 8:22 by haukex, edited Jul 12 at 8:58

  • Thanks for your code, which is shorter than my script. I used a different method, checking each "word" against a hash in perl, also with good performance. My hitfile is currently exactly 100k lines (short lines, mostly dictionary words). I didn't know that perl regexes could be so huge. Still... the heart of my question, and I don't understand what's unclear about this, is my curiosity over whether reconfiguring the workflow could allow it to be done (fast) as a command-line one-liner. – MichaelCodes, Jul 12 at 7:29

  • @MichaelCodes "... whether reconfiguring the workflow could allow it to be done (fast) as a command-line one-liner" Then I did misunderstand; I thought your emphasis was on speed, not code length. (I'm still curious: is my solution faster than your existing script?) Although I still don't think that one-liners are always better, see my edit; I've jammed that script into a one-liner ☺ – haukex, Jul 12 at 8:57

  • That works nicely and fast, thanks! – MichaelCodes, Jul 27 at 18:04