How can I standardize the phone numbers in a text file?

I periodically receive a text file with phone numbers formatted in wildly different ways: ##########, ###-###-####, (###) ###-####, etc. Usually there are ten digits, but I've seen +1 (###) ###-####.
Eventually the file gets imported into a database, but for reasons I won't go into, it'd be handy to have the phone numbers in a standard format, (###) ###-####.
The only constant is that the phone numbers always fall between the second and third tab character on each line.
Is there a way to do this from the command line?
text-processing
asked May 3 at 23:47
Chuck
The phone number system you mention in your question was introduced by Germany in the 1950s and soon withdrawn because it was not flexible enough. The US later imported that scheme, and unless you only need to deal with US phone numbers, you need to be aware of much longer phone numbers. IIRC, current international rules allow the country code (1 in the case of the US, but up to three digits in general) plus the rest of the phone number to be up to 16 digits. BTW: the area code could be from one to five digits.
– schily
May 4 at 10:23
To standardize, one would do well to use a standard. An appropriate standard for telephone numbers is E.164 from the International Telecommunications Union.
– JdeBP
May 4 at 10:38
This particular system uses only US phone numbers, and the "standard" is for human readability.
– Chuck
May 4 at 17:14
3 Answers
accepted
This should cover you as long as the file is as you have described. The command preserves information before and after the phone number and formats it in the way that you asked for. If the output looks good, add the -i option to sed to edit it in place or provide it with output redirection using > output_file at the end.
sed -E "s/(.*\t.*\t)\+?1?[[:space:]]?\(?([0-9]{3})\)?.*([0-9]{3}).*([0-9]{4})(.*)/\1(\2) \3-\4\5/g" filename
I tested it on a file containing this text:
jfk 902-765-9292 hat jump cat
jk 902 819 2244 hat jump cat
98 902 823-4456 hat jump cat
78h +1 075 242 1566 hat jump cat
jklj kjlj +1 075-242-1566 hat jump cat
jk jkj +1 (075) 242-1566 hat jump cat
kj (204) 799-9810 hat jump cat
kj 89 (204)-799-9810 hat jump cat
The output was:
jfk (902) 765-9292 hat jump cat
jk (902) 819-2244 hat jump cat
98 (902) 823-4456 hat jump cat
78h (075) 242-1566 hat jump cat
jklj kjlj (075) 242-1566 hat jump cat
jk jkj (075) 242-1566 hat jump cat
kj (204) 799-9810 hat jump cat
kj 89 (204) 799-9810 hat jump cat
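One way to check whether the output "looks good" before adding -i is to list any rows whose phone field is still not in the target format. The following is only a sketch and not part of the tested command above: check_phones is a hypothetical helper name, and it assumes the phone number is the third tab-separated field, as described in the question.

```shell
# check_phones (hypothetical helper): print the line number and phone field of
# every row whose third tab-separated field is not exactly "(###) ###-####".
# Exit status is 0 when every row conforms and 1 otherwise.
check_phones() {
  awk -F '\t' '
    $3 !~ /^\([0-9][0-9][0-9]\) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$/ {
      print NR ": " $3
      bad = 1
    }
    END { exit bad + 0 }
  ' "$@"
}
```

Piping the sed output into check_phones should print nothing and return success once every row has been normalized; any row it lists needs a closer look before the file is edited in place.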
edited May 4 at 4:12
answered May 4 at 1:30
Jeff H.
You can construct a regular expression that matches any of the formats and captures the digits, then re-substitutes them in your desired format.
For example, to match and capture a sequence of three decimal digits optionally surrounded by parentheses with an Extended Regular Expression (ERE), you can write \(?([0-9]{3})\)? while [- ]? matches an optional hyphen or space. Building up in this way,
\(?([0-9]{3})\)?[- ]?([0-9]{3})[- ]?([0-9]{4})
will match 3 digits optionally parenthesized, optionally followed by a hyphen or space, then 3 more digits optionally followed by a hyphen or space, followed by 4 digits.
Applying the expression in a sed substitution:
$ cat <<EOF | sed -E 's/\(?([0-9]{3})\)?[- ]?([0-9]{3})[- ]?([0-9]{4})/(\1) \2-\3/g'
I periodically receive a text file with phone numbers formatted
in wildly different ways: 123 456-7890, 123 456-7890, 123 456-7890,
etc. Usually there's ten digits, but I've seen +1 555 456-7890.
EOF
I periodically receive a text file with phone numbers formatted
in wildly different ways: (123) 456-7890, (123) 456-7890, (123) 456-7890,
etc. Usually there's ten digits, but I've seen +1 (555) 456-7890.
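As the last line of that output shows, a leading +1 is left in place. A possible extension, offered only as a sketch, is to add an optional country-code group in front of the pattern and leave it out of the replacement. The function name fmt_phone is an assumption for illustration, and because the substitution applies to the whole line, other digit runs that look like phone numbers would be rewritten too.

```shell
# fmt_phone (hypothetical helper): read text on stdin and rewrite US phone
# numbers as (###) ###-####, dropping an optional leading "+1" or "1".
fmt_phone() {
  sed -E 's/(\+?1[- ]?)?\(?([0-9]{3})\)?[- ]?([0-9]{3})[- ]?([0-9]{4})/(\2) \3-\4/g'
}

# Try it on the formats mentioned in the question.
printf '%s\n' '1234567890' '123-456-7890' '(123) 456-7890' '+1 (555) 456-7890' | fmt_phone
```

The country-code group is capture group 1, so the area code, exchange, and line number move to groups 2, 3, and 4 in the replacement.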
answered May 4 at 1:18
steeldriver
You need to match the field and then re-format it; here's an awk script that looks for three variations and re-formats them (before default-printing the reconstituted line):
$3 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
  $3="(" substr($3, 1, 3) ") " substr($3, 4, 3) "-" substr($3, 7, 4)
}
$3 ~ /^[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$/ {
  $3="(" substr($3, 1, 3) ") " substr($3, 5, 3) "-" substr($3, 9, 4)
}
$3 ~ /^\+1 \([0-9][0-9][0-9]\) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$/ {
  $3="(" substr($3, 5, 3) ") " substr($3, 10, 3) "-" substr($3, 14, 4)
}
1
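Save that to a file, perhaps phone.awk, then call it with: awk -F $'\t' -v OFS=$'\t' -f phone.awk < input. Setting OFS matters because assigning to $3 makes awk rebuild the line with the output field separator, which defaults to a space rather than a tab.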
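If more variations turn up than are worth listing one pattern at a time, a different approach is to strip everything except digits from the third field and rebuild it. This is a sketch under stated assumptions rather than a drop-in replacement for the script above: normalize_phones is a hypothetical name, the input is assumed to hold only US numbers, and fields that do not reduce to exactly ten digits are left untouched.

```shell
# normalize_phones (hypothetical helper): rewrite the third tab-separated field
# as (###) ###-#### whenever it contains exactly ten digits, optionally
# preceded by the US country code 1. Other rows pass through unchanged.
normalize_phones() {
  awk -F '\t' -v OFS='\t' '
  {
    digits = $3
    gsub(/[^0-9]/, "", digits)            # keep only the digits
    if (length(digits) == 11 && substr(digits, 1, 1) == "1")
      digits = substr(digits, 2)          # drop a leading country code
    if (length(digits) == 10)             # reformat only complete numbers
      $3 = "(" substr(digits, 1, 3) ") " substr(digits, 4, 3) "-" substr(digits, 7, 4)
    print
  }' "$@"
}
```

It can be run as normalize_phones input > output, or fed from a pipe. Working on a copy of the field means a row that fails the length check is printed exactly as it arrived.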
answered May 4 at 2:05
Jeff Schaller