Parsing only lines that have 9 periods

I have 90 GB of data culled from 13.5 TB of syslog data.
I extracted it with awk and then tried sort -u | uniq on the result.
Some malformed data was apparent, so I reran the parse with awk and the 'seen' idiom, like so:
awk -F, '!seen[$1]++' inputfile > outputfile
This turned out to be the most time-efficient approach, but the output still includes some malformed data. Maybe there are malformed log entries, or some lines got munged during the sorting, uniq'ing, and awk'ing. I do not need a better way of parsing the original data: my sample size is large enough that losing a little data out of 13.5 TB is OK.
There are 3 IP addresses per valid line.
Since there are 3 periods in an IP address, I need something that will keep only the lines that have exactly 9 periods.
text-processing awk sed grep
Do you care how valid the IPs appear? In other words, are you OK selecting a line like: 20.20.20.20 foo.bar.baz 300.300.300.300?
– Jeff Schaller
Nov 29 '17 at 20:06
edited Nov 29 '17 at 19:55
Jeff Schaller
asked Nov 29 '17 at 19:46
0xffffff
1 Answer
Let's take this as a test file:
$ cat testfile
1.2.3.4 5.6.7.8 9.10.11.12 Keep
1.2.3.4 5.6.7.8 9.10.11 Bad: Missing 1
1.2.3.4 5.6.7.8 9.10.11.12. Bad: Extra period
Using grep
To select lines with exactly nine periods:
$ grep -E '^([^.]*\.){9}[^.]*$' testfile
1.2.3.4 5.6.7.8 9.10.11.12 Keep
[^.]*\. matches any number of non-period characters followed by a literal period (inside brackets the period needs no escape; outside them it must be written \. or it would match any character). ([^.]*\.){9} matches exactly nine such sequences. The ^ at the beginning requires that the regex match starting at the beginning of the line. The [^.]*$ means that, between the end of the nine sequences and the end of the line, only non-period characters are allowed.
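Before running a filter like this over 90 GB, it may help to check how many lines it would keep and how many it would drop. The following is a minimal sketch using grep's -c (count) and -v (invert) options on the test file above; the variable names re, kept, and dropped are only illustrative, and the regex is written with the escaped period and the {9} interval.

```shell
# Recreate the three-line test file from the answer.
printf '%s\n' \
  '1.2.3.4 5.6.7.8 9.10.11.12 Keep' \
  '1.2.3.4 5.6.7.8 9.10.11 Bad: Missing 1' \
  '1.2.3.4 5.6.7.8 9.10.11.12. Bad: Extra period' > testfile

# Nine groups of "non-periods followed by a literal period",
# then no further periods up to the end of the line.
re='^([^.]*\.){9}[^.]*$'

kept=$(grep -cE "$re" testfile)      # lines with exactly nine periods
dropped=$(grep -cvE "$re" testfile)  # every other line
echo "kept=$kept dropped=$dropped"
```

On the test file this reports one kept line and two dropped lines. On real data, a surprisingly large dropped count would be a hint that the valid lines contain periods outside the three IP addresses.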
Using sed
$ sed -En '/^([^.]*\.){9}[^.]*$/p' testfile
1.2.3.4 5.6.7.8 9.10.11.12 Keep
The -n option tells sed not to print unless we explicitly ask it to. The p following the regex explicitly asks sed to print those lines which match the regex.
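The same selection can also be expressed the other way around: leave sed's default printing on and delete every line that does not match. This is only a sketch of that alternative, not something from the original answer; the variable name result is illustrative.

```shell
# Recreate the three-line test file from the answer.
printf '%s\n' \
  '1.2.3.4 5.6.7.8 9.10.11.12 Keep' \
  '1.2.3.4 5.6.7.8 9.10.11 Bad: Missing 1' \
  '1.2.3.4 5.6.7.8 9.10.11.12. Bad: Extra period' > testfile

# No -n here, so sed prints each line by default.
# The ! negates the address: d deletes the lines that do NOT match.
result=$(sed -E '/^([^.]*\.){9}[^.]*$/!d' testfile)
echo "$result"
```

Both forms should produce the same output; which one reads better is largely a matter of taste.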
Using awk
$ awk '/^([^.]*\.){9}[^.]*$/' testfile
1.2.3.4 5.6.7.8 9.10.11.12 Keep
Or, using awk's ability to define a character to separate fields (hat tip: Jeff Schaller):
$ awk -F. 'NF==10' testfile
1.2.3.4 5.6.7.8 9.10.11.12 Keep
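The NF==10 test works because nine separators split a line into ten fields, so NF - 1 is the number of periods. The sketch below prints that count for each line and then chains the filter with the de-duplication command from the question. The file name sample and its repeated "Keep" line are made up for illustration, and the variable names counts and unique are likewise hypothetical.

```shell
# Hypothetical sample: the answer's test file plus one repeated valid line.
printf '%s\n' \
  '1.2.3.4 5.6.7.8 9.10.11.12 Keep' \
  '1.2.3.4 5.6.7.8 9.10.11 Bad: Missing 1' \
  '1.2.3.4 5.6.7.8 9.10.11.12. Bad: Extra period' \
  '1.2.3.4 5.6.7.8 9.10.11.12 Keep' > sample

# With "." as the field separator, NF - 1 is the number of periods on the line.
counts=$(awk -F. '{ print NF - 1 }' sample | paste -sd ' ' -)
echo "$counts"

# Keep only nine-period lines, then de-duplicate on the first
# comma-separated field, as in the question.
unique=$(awk -F. 'NF==10' sample | awk -F, '!seen[$1]++')
echo "$unique"
```

Filtering before de-duplicating keeps malformed lines out of the seen array, which may also reduce its memory footprint on a large input.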
How about awk -F. NF==10?
– Jeff Schaller
Nov 29 '17 at 20:01
@JeffSchaller Yes, excellent. (I had to look it up in the docs to verify that FS is not treated as a regex when it is a single character.)
– John1024
Nov 29 '17 at 20:06
edited Nov 29 '17 at 20:16
answered Nov 29 '17 at 19:58
John1024