Parsing only lines that have 9 periods

I have 90 gig of data culled from 13.5 Terabytes.



I have tried sort -u | uniq on data that has been awk'd from the 13.5T of syslog data.



Some malformed data was apparent so I reran the parse with awk and 'seen' like so:



 awk -F, '!seen[$1]++' inputfile > outputfile


This turned out to be the most time-efficient approach, but the output still included some malformed data. Maybe there are malformed log entries, or some lines got munged during the sorting, uniq'ing and awk'ing. I do not care whether there is a better way of parsing the original data, since I have a large enough sample size: losing a little data out of 13.5T is OK.



There are 3 IP addresses per valid line.



Since there are 3 periods in an IPv4 address, I need something that will keep only the lines that have exactly 9 "."'s.

  • 2
    Do you care how valid the IP's appear? In other words, are you OK selecting a line like: 20.20.20.20 foo.bar.baz 300.300.300.300 ?
    – Jeff Schaller
    Nov 29 '17 at 20:06
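
If validity does matter, a pure period count is not enough: a line such as 1.2.3.4 5.6.7.8 999.10.11.12 has nine periods but no valid third address. Below is a minimal sketch of a stricter filter. It assumes the three addresses are the first three whitespace-separated fields of each line, which the question does not confirm, and the names keep_valid_ipv4_triples and is_ipv4 are made up for illustration.

```shell
# Keep lines whose first three whitespace-separated fields are dotted quads
# with every octet in the range 0..255. The field layout is an assumption.
keep_valid_ipv4_triples() {
    awk '
        # Return 1 if s looks like a.b.c.d with each part a number <= 255.
        function is_ipv4(s,    n, parts, i) {
            n = split(s, parts, "[.]")
            if (n != 4) return 0
            for (i = 1; i <= 4; i++) {
                if (parts[i] !~ /^[0-9]+$/ || parts[i] + 0 > 255) return 0
            }
            return 1
        }
        is_ipv4($1) && is_ipv4($2) && is_ipv4($3)
    '
}

printf '%s\n' \
    '1.2.3.4 5.6.7.8 9.10.11.12 Keep' \
    '20.20.20.20 foo.bar.baz 300.300.300.300' \
    '1.2.3.4 5.6.7.8 999.10.11.12' |
    keep_valid_ipv4_triples
```

Only the first of the three sample lines is printed. Note that this checks the three leading fields and ignores whatever follows them, so it is a different test from counting periods across the whole line.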

edited Nov 29 '17 at 19:55
Jeff Schaller

asked Nov 29 '17 at 19:46
0xffffff

1 Answer

Let's take this as a test file:



$ cat testfile
1.2.3.4 5.6.7.8 9.10.11.12 Keep
1.2.3.4 5.6.7.8 9.10.11 Bad: Missing 1
1.2.3.4 5.6.7.8 9.10.11.12. Bad: Extra period


Using grep



To select lines with exactly nine periods:



$ grep -E '^([^.]*\.){9}[^.]*$' testfile
1.2.3.4 5.6.7.8 9.10.11.12 Keep


[^.]*\. matches any number of non-period characters followed by a period. ([^.]*\.){9} matches exactly nine such sequences. The ^ at the beginning requires that the regex match starting at the beginning of the line. The [^.]*$ means that, between the end of the nine sequences and the end of the line, only non-period characters are allowed.
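
One way to sanity-check the pattern is to count the periods independently and compare the two results. The sketch below assumes a POSIX shell with grep -E, tr and wc available; the helper names count_periods and has_nine_periods are made up for illustration. The period inside the group is escaped and the repeat count is written in braces, since an unescaped . would match any character.

```shell
# Count the literal periods in a string by deleting every other character.
count_periods() {
    printf '%s' "$1" | tr -cd '.' | wc -c | tr -d '[:space:]'
}

# Succeed only if the string contains exactly nine periods.
has_nine_periods() {
    printf '%s\n' "$1" | grep -Eq '^([^.]*\.){9}[^.]*$'
}

for line in \
    '1.2.3.4 5.6.7.8 9.10.11.12 Keep' \
    '1.2.3.4 5.6.7.8 9.10.11 Bad: Missing 1' \
    '1.2.3.4 5.6.7.8 9.10.11.12. Bad: Extra period'
do
    if has_nine_periods "$line"; then verdict=keep; else verdict=drop; fi
    printf '%s periods -> %s\n' "$(count_periods "$line")" "$verdict"
done
```

For the three sample lines this reports 9, 8 and 10 periods, and only the first is marked keep.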



Using sed



$ sed -En '/^([^.]*\.){9}[^.]*$/p' testfile
1.2.3.4 5.6.7.8 9.10.11.12 Keep


The -n option tells sed not to print unless we explicitly ask it to. The p following the regex explicitly asks sed to print those lines which match the regex.



Using awk



$ awk '/^([^.]*\.){9}[^.]*$/' testfile
1.2.3.4 5.6.7.8 9.10.11.12 Keep


Or, using awk's ability to define a character to separate fields (hat tip: Jeff Schaller):



$ awk -F. 'NF==10' testfile
1.2.3.4 5.6.7.8 9.10.11.12 Keep
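
The reason NF==10 works is that a line containing N periods splits into N+1 fields when the period is the separator, so nine periods give ten fields. A small sketch to make that visible, using the same three sample lines; the function names fields_per_line and keep_nine_periods are made up for illustration.

```shell
# Print the number of fields awk sees on each line when "." is the separator.
fields_per_line() {
    awk -F. '{ print NF }'
}

# Keep only lines with ten "."-separated fields, i.e. exactly nine periods.
keep_nine_periods() {
    awk -F. 'NF == 10'
}

# Prints 10, 9 and 11, one per line, for the three sample lines.
printf '%s\n' \
    '1.2.3.4 5.6.7.8 9.10.11.12 Keep' \
    '1.2.3.4 5.6.7.8 9.10.11 Bad: Missing 1' \
    '1.2.3.4 5.6.7.8 9.10.11.12. Bad: Extra period' |
    fields_per_line
```

With the file names from the question, the filter would be applied as keep_nine_periods < inputfile > outputfile. It reads the input as a stream, so it does not need to hold the 90 gig in memory.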

  • how about awk -F. NF==10 ?
    – Jeff Schaller
    Nov 29 '17 at 20:01

  • @JeffSchaller Yes, excellent. (I had to look up in the docs to verify that FS is not treated as a regex when it is a single character.)
    – John1024
    Nov 29 '17 at 20:06

edited Nov 29 '17 at 20:16

answered Nov 29 '17 at 19:58
John1024