Merge multiple CSV files with partially matching key column
I have 100 csv files, each containing 2 columns. The first is taxonomy and the second counts. Each file has about 10000 rows. The taxonomy data in each file is only partially shared, with a total of about 50000 unique values. I need to merge these into one table where missing taxa in one file are assigned the value 0. The result should be a table (csv or tsv) with 50000 rows and 101 columns.
A simplified example would be:
File 1 (R1.csv):
A,1
B,20
C,30
File 2 (R2.csv):
C,1
D,13
E,15
F,19
File 3 (R3.csv):
A,1
B,4
E,2
G,6
H,8
Expected result:
Taxa,R1,R2,R3
A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8
Any idea how to do this with bash scripting?
Tags: bash, text-processing, awk, join
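To experiment with the answers below, the three sample files from the question can be recreated like this (note that join-based approaches additionally require each file to be sorted on the key column, which these already are):

```shell
# Recreate the sample inputs from the question.
printf 'A,1\nB,20\nC,30\n'          > R1.csv
printf 'C,1\nD,13\nE,15\nF,19\n'    > R2.csv
printf 'A,1\nB,4\nE,2\nG,6\nH,8\n'  > R3.csv

# join(1) needs its inputs sorted on the join field; sort -c checks this.
for f in R1.csv R2.csv R3.csv; do
  sort -c -t, -k1,1 "$f" || echo "$f is not sorted on the key column"
done
```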
Wouldn't csvkit be more suitable and pleasant to use for this task? – smbear, Apr 13 at 12:27
edited Apr 12 at 18:21
asked Apr 11 at 13:51 by Mina Bizic
3 Answers
This is going to be painful:
join -t, -j1 -a1 -e0 -o auto r1.csv r2.csv > r12a.csv
join -t, -j1 -a2 -e0 -o auto r1.csv r2.csv > r12b.csv
sort -u r12?.csv > r12.csv
join -t, -j1 -a1 -e0 -o auto r12.csv r3.csv > r123a.csv
join -t, -j1 -a2 -e0 -o auto r12.csv r3.csv > r123b.csv
sort -u r123{a,b}.csv
- the first join prints unpaired lines from file x (-ax) with a default value (-e0); -o auto tells join which fields to print, so missing fields come out as 0
- sort -u sorts and keeps only unique records
I am not sure awk code would be more readable.
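To address scaling this up: GNU join accepts -a1 and -a2 together, which turns the two-pass merge above into a single full outer join, and that in turn can be folded over many files in a loop. This is only a sketch under the question's example file names; the inputs must stay sorted on the key column, and the header line is typed by hand here:

```shell
# Sample inputs from the question (each sorted on the key column).
printf 'A,1\nB,20\nC,30\n'          > R1.csv
printf 'C,1\nD,13\nE,15\nF,19\n'    > R2.csv
printf 'A,1\nB,4\nE,2\nG,6\nH,8\n'  > R3.csv

cp R1.csv merged.csv
for f in R2.csv R3.csv; do            # ...extend the list up to R100.csv
  # -a1 -a2 keep unpaired lines from both sides; -e0 -o auto fill gaps with 0
  join -t, -j1 -a1 -a2 -e0 -o auto merged.csv "$f" > tmp.csv
  mv tmp.csv merged.csv
done

# Prepend the header row to get the final table.
{ printf 'Taxa,R1,R2,R3\n'; cat merged.csv; } > result.csv
```

Because join preserves the sort order of its sorted inputs, the rows of result.csv come out sorted on the taxon, matching the expected output in the question.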
Can this be extended to process all 100 files? – Apr 11 at 16:57
I used awk, which handles multiple files at once:

sed 's/,R[1-9]\+\.csv:/,/g' <(awk -v HEADER="$(printf ",%s:" R{1..3}.csv)" -F, '
    { seen[$1] = seen[$1] "," FILENAME ":" $2 }
    END { print HEADER; for (x in seen) print x seen[x] }' R{1..3}.csv |
  awk -F, 'NR==1 { split($0, arr, /,/); next }
    { SEP = ""; fld = 1
      for (x in arr) { printf ($0 ~ arr[x]) ? SEP""$(fld++) : ",0"; SEP = "," }
      print "" }')
The output:
A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8
Code break-down:
awk -F, '{ seen[$1] = seen[$1] "," FILENAME ":" $2 }
    END { print HEADER; for (x in seen) print x seen[x] }' R{1..3}.csv
This is the main part of the code: it joins the second columns of all the files into one line per key, recording which file each value came from. seen is an array whose key is the first column and whose value accumulates ,FILENAME:$2. That is, seen[$1]=seen[$1]","FILENAME":"$2 appends a comma, the name of the file currently being processed (FILENAME in awk), a colon, and the second-column value $2 to the entry for the current key.

awk executes the END block after all records/lines have been read; there we print the HEADER, then use a for-loop to iterate over the array seen, printing each key followed by its accumulated value.

The result so far:
A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8
OK, now we know which file each existing value came from, and which files have no data for a given key. To fill the missing files' data with 0, I used shell command substitution to generate a header line containing all the file names, and passed it to awk as the HEADER variable:

awk -v HEADER="$(printf ",%s:" R{1..3}.csv)" ...

Later we will use this HEADER line to fill the missing files' data with 0. Now our input is in this format:
$ awk -v HEADER="$(printf ",%s:" R{1..3}.csv)" -F, '
    { seen[$1] = seen[$1] "," FILENAME ":" $2 }
    END { print HEADER; for (x in seen) print x seen[x] }' R{1..3}.csv
,R1.csv:,R2.csv:,R3.csv:
A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8
Next I used the awk script below to fill the missing files' data with 0; I copied it from my answer to the question "Formatting and fill missed data based on column".

... | awk -F, 'NR==1 { split($0, arr, /,/); next }
    { SEP = ""; fld = 1
      for (x in arr) { printf ($0 ~ arr[x]) ? SEP""$(fld++) : ",0"; SEP = "," }
      print "" }'
In the end, sed 's/,R[1-9]\+\.csv:/,/g' is used to replace the file names still present in the result with a single comma ,.
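For comparison, here is a sketch of the same merge as a single portable awk pass plus a sort: it numbers the input files, remembers each (taxon, file) count, and prints 0 where a pair is absent. The file names and the hand-written header are assumptions taken from the question's example:

```shell
# Sample inputs from the question.
printf 'A,1\nB,20\nC,30\n'          > R1.csv
printf 'C,1\nD,13\nE,15\nF,19\n'    > R2.csv
printf 'A,1\nB,4\nE,2\nG,6\nH,8\n'  > R3.csv

awk -F, '
  FNR == 1 { nf++ }                   # nf = index of the file being read
  { key[$1] = 1; count[$1, nf] = $2 } # note every taxon and its per-file count
  END {
    for (k in key) {
      line = k
      for (i = 1; i <= nf; i++)
        line = line "," (((k, i) in count) ? count[k, i] : 0)
      print line
    }
  }' R1.csv R2.csv R3.csv | sort -t, -k1,1 > body.csv

# Prepend the header row.
{ printf 'Taxa,R1,R2,R3\n'; cat body.csv; } > result.csv
```

Unlike the join-based answers, this does not require the inputs to be sorted, at the cost of holding all ~50000 keys in memory (which is modest here).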
There are a lot of ways to handle CSVs on the command line, including the join approach in Archemar's answer. However, given your requirements, I would advise using Python. I tested this script with Python 3.5; it should solve the problem, or at least give you a good start:
import os, re, argparse
import csv

parser = argparse.ArgumentParser(
    description='join csvs with rows of the form \\w+,[0-9]+, '
                'inserting a zero for a row label if it does not exist.')
parser.add_argument('infiles', type=str, help='infile names', nargs='+')
args = parser.parse_args()

d = {}
file_idx = 0
for infile in args.infiles:
    with open(infile, 'r') as f:
        for line in f:
            parsed_line = re.match(r'(\w+),([0-9]+)', line)
            if not parsed_line:
                print("line {} not parsed in file {}".format(line, infile))
                continue
            if parsed_line.group(1) in d:
                d[parsed_line.group(1)].append(parsed_line.group(2))
            else:
                l = [0] * file_idx
                l.append(parsed_line.group(2))
                d[parsed_line.group(1)] = l
    for k in d:
        if len(d[k]) == file_idx:
            d[k].append(0)
        if not len(d[k]) == file_idx + 1:
            print("problem with file {}, dict {}, key {}".format(infile, d, k))
    file_idx = file_idx + 1

## output time
with open('results.csv', 'w') as csvfile:
    cwriter = csv.writer(csvfile)
    header = [os.path.splitext(x)[0] for x in args.infiles]
    header.insert(0, 'Taxa')
    cwriter.writerow(header)
    for k in sorted(d.keys()):
        d[k].insert(0, k)
        cwriter.writerow(d[k])
answered Apr 11 at 14:53 by Archemar
edited May 9 at 17:50, answered Apr 11 at 19:46
answered Apr 11 at 17:55 by hhoke1