How can I standardize the phone numbers in a text file?

I periodically receive a text file with phone numbers formatted in wildly different ways: ##########, ###-###-####, (###) ###-####, etc. Usually there are ten digits, but I've seen +1 (###) ###-####.

Eventually the file gets imported into a database, but for reasons I won't go into, it'd be handy to have the phone numbers have a standard format, (###) ###-####.

The only constant is that the phone numbers always fall between the second and third tab character on each line.

Is there a way to do this from the command line?

asked May 3 at 23:47

Chuck

The phone number system you mention in your question was introduced by Germany in the 1950s and withdrawn soon afterwards because it was not flexible enough. The US later imported that scheme, and unless you only need to deal with US phone numbers, you need to be aware of much longer phone numbers. IIRC, current international rules require that the country code (1 in the case of the US, but up to three digits in general) plus the rest of the phone number may be up to 16 digits. BTW: the area code could be from one to 5 digits.
– schily
May 4 at 10:23

To standardize, one would do well to use a standard. An appropriate standard for telephone numbers is E.164 from the International Telecommunication Union.
– JdeBP
May 4 at 10:38

This particular system uses only US phone numbers, and the "standard" is for human readability.
– Chuck
May 4 at 17:14

3 Answers

accepted

This should cover you as long as the file is as you have described. The command preserves the text before and after the phone number and formats the number the way you asked for. If the output looks good, add the -i option to sed to edit the file in place, or redirect the output with > output_file at the end.

sed -E "s/(.*\t.*\t)\+?1?[[:space:]]?\(?([0-9]{3})\)?.*([0-9]{3}).*([0-9]{4})(.*)/\1(\2) \3-\4\5/g" filename

I tested it on a file containing this text:

 jfk 902-765-9292 hat jump cat
 jk 902 819 2244 hat jump cat
 98 902 823-4456 hat jump cat
 78h +1 075 242 1566 hat jump cat
jklj kjlj +1 075-242-1566 hat jump cat
jk jkj +1 (075) 242-1566 hat jump cat
 kj (204) 799-9810 hat jump cat
kj 89 (204)-799-9810 hat jump cat

The output was:

 jfk (902) 765-9292 hat jump cat
 jk (902) 819-2244 hat jump cat
 98 (902) 823-4456 hat jump cat
 78h (075) 242-1566 hat jump cat
jklj kjlj (075) 242-1566 hat jump cat
jk jkj (075) 242-1566 hat jump cat
 kj (204) 799-9810 hat jump cat
kj 89 (204) 799-9810 hat jump cat
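
Because the expression above uses .* between the digit groups, digits or extra tabs in later columns could in principle be pulled into the match. A more conservative variant, offered only as a sketch, anchors the match at the start of the line and never lets a digit group cross a tab. The helper name fmt_third_field and the sample line are made up for illustration, and the literal tab is built with printf so the script does not depend on sed understanding \t.

```shell
#!/bin/sh
# Sketch: reformat only the third tab-separated field as (###) ###-####.
# fmt_third_field is a hypothetical helper name, not part of the answer above.

tab=$(printf '\t')   # literal tab, so the regex does not rely on sed's \t

fmt_third_field() {
  # \1 = first two fields including their tabs
  # \3 = optional country code "1" or "+1" plus separators (dropped)
  # \4 \5 \6 = area code, exchange, line number
  # \7 = the tab (or end of line) that closes the third field
  sed -E "s/^(([^${tab}]*${tab}){2})(\+?1[^0-9${tab}]*)?\(?([0-9]{3})[^0-9${tab}]*([0-9]{3})[^0-9${tab}]*([0-9]{4})[^0-9${tab}]*(${tab}|\$)/\1(\4) \5-\6\7/"
}

# Demo on one made-up line; lines whose third field does not look like
# a US number are passed through unchanged.
printf 'jk\tjkj\t+1 (075) 242-1566\that jump cat\n' | fmt_third_field
```

To use it on a real file, something like fmt_third_field < filename > output_file lets you compare the result with the original before replacing anything.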

edited May 4 at 4:12

answered May 4 at 1:30

Jeff H.


You can construct a regular expression that matches any of the formats, and captures the digits then re-substitutes them in your desired format.

For example, to match and capture a sequence of three decimal digits optionally surrounded by parentheses with an Extended Regular Expression (ERE), you can write \(?([0-9]{3})\)? while [- ]? matches an optional hyphen or space. Building up in this way

\(?([0-9]{3})\)?[- ]?([0-9]{3})[- ]?([0-9]{4})

will match 3 digits optionally parenthesized, optionally followed by a hyphen or space, then 3 more digits optionally followed by a hyphen or space, followed by 4 digits.

Applying the expression in a sed substitution:

$ cat <<EOF | sed -E 's/\(?([0-9]{3})\)?[- ]?([0-9]{3})[- ]?([0-9]{4})/(\1) \2-\3/g'
I periodically receive a text file with phone numbers formatted 
in wildly different ways: 123 456-7890, 123 456-7890, 123 456-7890, 
etc. Usually there's ten digits, but I've seen +1 555 456-7890.
EOF
I periodically receive a text file with phone numbers formatted 
in wildly different ways: (123) 456-7890, (123) 456-7890, (123) 456-7890, 
etc. Usually there's ten digits, but I've seen +1 (555) 456-7890.
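
To see which input formats this expression does and does not handle, it can help to feed it one sample number per line. The following is only a sketch: the wrapper name normalize and the sample numbers are invented for the test, not taken from the asker's file.

```shell
#!/bin/sh
# Sketch: run the substitution over one sample number per line.
# normalize is a hypothetical wrapper name; the sample numbers are made up.

normalize() {
  sed -E 's/\(?([0-9]{3})\)?[- ]?([0-9]{3})[- ]?([0-9]{4})/(\1) \2-\3/g'
}

printf '%s\n' \
  '1234567890' \
  '123-456-7890' \
  '(123) 456-7890' \
  '+1 (555) 456-7890' \
  '123.456.7890' | normalize
```

The first three samples come out as (123) 456-7890. The +1 prefix is kept as-is because the expression only matches the ten-digit part, and the dotted number is left unchanged because a dot is not one of the separators in [- ].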

answered May 4 at 1:18

steeldriver


You need to match the field and then re-format it; here's an awk script that looks for three variations and re-formats them (before default-printing the reconstituted line):

$3 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/ {
  $3="(" substr($3, 1, 3) ") " substr($3, 4, 3) "-" substr($3, 7, 4)
}

$3 ~ /^[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$/ {
  $3="(" substr($3, 1, 3) ") " substr($3, 5, 3) "-" substr($3, 9, 4)
}

$3 ~ /^\+1 \([0-9][0-9][0-9]\) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]$/ {
  $3="(" substr($3, 5, 3) ") " substr($3, 10, 3) "-" substr($3, 14, 4)
}

1

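Save that to a file, perhaps phone.awk, then call it with: awk -F $'\t' -v OFS=$'\t' -f phone.awk < input. Setting OFS matters because assigning to $3 makes awk rebuild the line with the output field separator, which defaults to a single space and would otherwise replace the tabs.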
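
Rather than listing each input format, another option is to throw away everything in the field that is not a digit and format whatever remains. The sketch below assumes every valid US number reduces to 10 digits, or to 11 digits with a leading 1; fields that reduce to anything else are left untouched. The helper name fmt_phone_field and the sample line are hypothetical.

```shell
#!/bin/sh
# Sketch: normalize the third tab-separated field by stripping non-digits.
# fmt_phone_field is a hypothetical helper name.

fmt_phone_field() {
  awk 'BEGIN { FS = OFS = "\t" }   # keep tabs when the record is rebuilt
  {
    digits = $3
    gsub(/[^0-9]/, "", digits)                 # drop everything but digits
    if (length(digits) == 11 && substr(digits, 1, 1) == "1")
      digits = substr(digits, 2)               # drop the US country code
    if (length(digits) == 10)                  # otherwise leave the field alone
      $3 = "(" substr(digits, 1, 3) ") " substr(digits, 4, 3) "-" substr(digits, 7, 4)
    print
  }'
}

# Demo on one made-up line.
printf 'jk\tjkj\t+1 (075) 242-1566\that jump cat\n' | fmt_phone_field
```

This trades precision for coverage: it accepts any punctuation between the digits, so it is worth checking the fields it leaves unchanged before importing the file.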

answered May 4 at 2:05

Jeff Schaller
