Merge multiple CSV files with partially matching key column

up vote
2
down vote

favorite

I have 100 csv files each containing 2 columns. The first is taxonomy and the second counts. Each file has about 10000 rows. The taxonomy data in each file is only partially shared with a total of about 50000 unique values. I need to merge these into one table where missing taxa in one file are assigned the value 0. The result should be a table (csv or tsv) with 50000 rows and 101 columns.
A simplified example would be:
File 1 (R1.csv):

A,1
B,20
C,30

File 2 (R2.csv):

C,1
D,13
E,15
F,19

File 3 (R3.csv):

A,1
B,4
E,2
G,6
H,8

Expected result:

Taxa,R1,R2,R3
A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8

Any idea how to do this with bash scripting?

edited Apr 12 at 18:21

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

asked Apr 11 at 13:51

Mina Bizic

111

Wouldn't csvkit be more suitable and pleasant to use for this task?
â€“Â smbear
Apr 13 at 12:27

add a commentÂ |Â

up vote
2
down vote

favorite

A,1
B,20
C,30

File 2 (R2.csv):

C,1
D,13
E,15
F,19

File 3 (R3.csv):

A,1
B,4
E,2
G,6
H,8

Expected result:

Taxa,R1,R2,R3
A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8

Any idea how to do this with bash scripting?

edited Apr 12 at 18:21

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

asked Apr 11 at 13:51

Mina Bizic

111

Wouldn't csvkit be more suitable and pleasant to use for this task?
â€“Â smbear
Apr 13 at 12:27

add a commentÂ |Â

up vote
2
down vote

favorite

A,1
B,20
C,30

File 2 (R2.csv):

C,1
D,13
E,15
F,19

File 3 (R3.csv):

A,1
B,4
E,2
G,6
H,8

Expected result:

Taxa,R1,R2,R3
A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8

Any idea how to do this with bash scripting?

edited Apr 12 at 18:21

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

asked Apr 11 at 13:51

Mina Bizic

111

A,1
B,20
C,30

File 2 (R2.csv):

C,1
D,13
E,15
F,19

File 3 (R3.csv):

A,1
B,4
E,2
G,6
H,8

Expected result:

Taxa,R1,R2,R3
A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8

Any idea how to do this with bash scripting?

edited Apr 12 at 18:21

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

asked Apr 11 at 13:51

Mina Bizic

111

edited Apr 12 at 18:21

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

edited Apr 12 at 18:21

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

edited Apr 12 at 18:21

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

asked Apr 11 at 13:51

Mina Bizic

111

asked Apr 11 at 13:51

Mina Bizic

111

asked Apr 11 at 13:51

Mina Bizic

111

Wouldn't csvkit be more suitable and pleasant to use for this task?
â€“Â smbear
Apr 13 at 12:27

add a commentÂ |Â

Wouldn't csvkit be more suitable and pleasant to use for this task?
â€“Â smbear
Apr 13 at 12:27

Wouldn't csvkit be more suitable and pleasant to use for this task?
â€“Â smbear
Apr 13 at 12:27

add a commentÂ |Â

3 Answers
3

active

oldest

votes

up vote
1
down vote

this is going to be painfull

 join -t, -j1 -a1 -e0 -o auto r1.csv r2.csv > r12a.csv
 join -t, -j1 -a2 -e0 -o auto r1.csv r2.csv > r12b.csv
 sort -u r12?.csv > r12.csv
 join -t, -j1 -a1 -e0 -o auto r12.csv r3.csv > r123a.csv
 join -t, -j1 -a2 -e0 -o auto r12.csv r3.csv > r123b.csv
 sort -u r123a,b.csv

the first join print unpaired value in file x (-ax) with a default (-e0), -o auto tell join to print 0

sort -u sort and keep unique record.

I an not sure awk code would be more readable.

answered Apr 11 at 14:53

Archemar

18.9k93366

This can be extended to process all 100 files?
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Apr 11 at 16:57

add a commentÂ |Â

up vote
1
down vote

I used awk which handling multiple files at once:

sed 's/,R[1-9]+.csv:/,/g' <(awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
 seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x]' R1..3.csv 
|awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
 for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""')

The output:

A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8

Code break-down:

awk -F, ' seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x] ' R1..3.csv

The main part of the code that join all second column in all files into one together printing with the value belongs to which file having same first column. The seen here is an array name with the key as first column and value with ,FILENAME:$2 in appending mode.

In seen[$1]=seen[$1]","FILENAME":"$2; means print a comma , followed by current processing file FILENAME with awk, a colon : followed by second column value $2 when it has same first column seen[$1]=... and append into the same key index =seen[$1]... and save in same key's value.

The END statement, awk executing this block in end when all records/line read, and we used a for-loop to iterate over the array seen and print the key first and the value of the key in next.

Will result:

A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8

OK, now we know exiting values are from which files and which files doesn't have those data. For this to filling non-existing files' data with 0, I used shell command substitution to generate a header line congaing all files name and passed to awk as HEADER -variable:

awk -v HEADER="$(printf ",%s:" R1..3.csv)" ...

later we will use this HEADER line and fill missed files' data with 0. Now our input is like this format:

$ awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
 seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x]' R1..3.csv 
,R1.csv:,R2.csv:,R3.csv:
A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8

Next I used another below awk script to fill the non-exiting files' data with 0 which I copied from my another answer to the question "Formatting and fill missed data based on column".

... |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
 for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""'

In the end, the sed 's/,R[1-9]+.csv:/,/g' is used to replace the existing filess name in result with single comma ,.

edited May 9 at 17:50

answered Apr 11 at 19:46

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

add a commentÂ |Â

up vote
0
down vote

There are a lot of ways to handle CSVs on the command line, or the answer by Archemar. However, because of your requirements, I would advise using python. I tested out this script in python 3.5 and should solve the problem, or at least give you a good start:

import os,re,argparse
import csv

parser = argparse.ArgumentParser(description='join csvs with rows of the 
 form w+,[1-9], inserting a zero for a row label if it does not 
 exist.')
parser.add_argument('infiles', type=str, help='infile names', nargs='+')
args = parser.parse_args()

d = 
file_idx = 0
for infile in args.infiles:
 with open(infile, 'r') as f:
 for line in f:
 parsed_line = re.match('(w+),([0-9]+)', line)
 if not parsed_line:
 print("line not parsed in file ".format(line, infile))
 continue
 if parsed_line.group(1) in d:
 d[parsed_line.group(1)].append(parsed_line.group(2))
 else:
 l = [0]*(file_idx)
 l.append(parsed_line.group(2))
 d[parsed_line.group(1)]=l
 for k in d:
 if (len(d[k]) == file_idx):
 d[k].append(0)
 if not(len(d[k]) == file_idx+1):
 print("problem with file , dict , key ".format(f,d,k))
 file_idx = file_idx + 1

## output time
with open('results.csv','w') as csvfile:
 cwriter = csv.writer(csvfile)
 header = [os.path.splitext(x)[0] for x in args.infiles]
 header.insert(0,'Taxa')
 cwriter.writerow(header)
 for k in sorted(d.keys()):
 d[k].insert(0,k)
 cwriter.writerow(d[k])

answered Apr 11 at 17:55

hhoke1

31416

add a commentÂ |Â

Your Answer

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f437026%2fmerge-multiple-csv-files-with-partially-matching-key-column%23new-answer', 'question_page');

);

Post as a guest

Name

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

up vote
1
down vote

this is going to be painfull

 join -t, -j1 -a1 -e0 -o auto r1.csv r2.csv > r12a.csv
 join -t, -j1 -a2 -e0 -o auto r1.csv r2.csv > r12b.csv
 sort -u r12?.csv > r12.csv
 join -t, -j1 -a1 -e0 -o auto r12.csv r3.csv > r123a.csv
 join -t, -j1 -a2 -e0 -o auto r12.csv r3.csv > r123b.csv
 sort -u r123a,b.csv

the first join print unpaired value in file x (-ax) with a default (-e0), -o auto tell join to print 0

sort -u sort and keep unique record.

I an not sure awk code would be more readable.

answered Apr 11 at 14:53

Archemar

18.9k93366

This can be extended to process all 100 files?
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Apr 11 at 16:57

add a commentÂ |Â

up vote
1
down vote

this is going to be painfull

 join -t, -j1 -a1 -e0 -o auto r1.csv r2.csv > r12a.csv
 join -t, -j1 -a2 -e0 -o auto r1.csv r2.csv > r12b.csv
 sort -u r12?.csv > r12.csv
 join -t, -j1 -a1 -e0 -o auto r12.csv r3.csv > r123a.csv
 join -t, -j1 -a2 -e0 -o auto r12.csv r3.csv > r123b.csv
 sort -u r123a,b.csv

the first join print unpaired value in file x (-ax) with a default (-e0), -o auto tell join to print 0

sort -u sort and keep unique record.

I an not sure awk code would be more readable.

answered Apr 11 at 14:53

Archemar

18.9k93366

This can be extended to process all 100 files?
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Apr 11 at 16:57

add a commentÂ |Â

up vote
1
down vote

this is going to be painfull

 join -t, -j1 -a1 -e0 -o auto r1.csv r2.csv > r12a.csv
 join -t, -j1 -a2 -e0 -o auto r1.csv r2.csv > r12b.csv
 sort -u r12?.csv > r12.csv
 join -t, -j1 -a1 -e0 -o auto r12.csv r3.csv > r123a.csv
 join -t, -j1 -a2 -e0 -o auto r12.csv r3.csv > r123b.csv
 sort -u r123a,b.csv

the first join print unpaired value in file x (-ax) with a default (-e0), -o auto tell join to print 0

sort -u sort and keep unique record.

I an not sure awk code would be more readable.

answered Apr 11 at 14:53

Archemar

18.9k93366

this is going to be painfull

 join -t, -j1 -a1 -e0 -o auto r1.csv r2.csv > r12a.csv
 join -t, -j1 -a2 -e0 -o auto r1.csv r2.csv > r12b.csv
 sort -u r12?.csv > r12.csv
 join -t, -j1 -a1 -e0 -o auto r12.csv r3.csv > r123a.csv
 join -t, -j1 -a2 -e0 -o auto r12.csv r3.csv > r123b.csv
 sort -u r123a,b.csv

the first join print unpaired value in file x (-ax) with a default (-e0), -o auto tell join to print 0

sort -u sort and keep unique record.

I an not sure awk code would be more readable.

answered Apr 11 at 14:53

Archemar

18.9k93366

answered Apr 11 at 14:53

Archemar

18.9k93366

answered Apr 11 at 14:53

Archemar

18.9k93366

answered Apr 11 at 14:53

Archemar

18.9k93366

This can be extended to process all 100 files?
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Apr 11 at 16:57

add a commentÂ |Â

This can be extended to process all 100 files?
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Apr 11 at 16:57

This can be extended to process all 100 files?
â€“Â ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·
Apr 11 at 16:57

add a commentÂ |Â

up vote
1
down vote

I used awk which handling multiple files at once:

sed 's/,R[1-9]+.csv:/,/g' <(awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
 seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x]' R1..3.csv 
|awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
 for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""')

The output:

A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8

Code break-down:

awk -F, ' seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x] ' R1..3.csv

Will result:

A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8

awk -v HEADER="$(printf ",%s:" R1..3.csv)" ...

later we will use this HEADER line and fill missed files' data with 0. Now our input is like this format:

$ awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
 seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x]' R1..3.csv 
,R1.csv:,R2.csv:,R3.csv:
A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8

Next I used another below awk script to fill the non-exiting files' data with 0 which I copied from my another answer to the question "Formatting and fill missed data based on column".

... |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
 for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""'

In the end, the sed 's/,R[1-9]+.csv:/,/g' is used to replace the existing filess name in result with single comma ,.

edited May 9 at 17:50

answered Apr 11 at 19:46

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

add a commentÂ |Â

up vote
1
down vote

I used awk which handling multiple files at once:

sed 's/,R[1-9]+.csv:/,/g' <(awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
 seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x]' R1..3.csv 
|awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
 for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""')

The output:

A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8

Code break-down:

awk -F, ' seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x] ' R1..3.csv

Will result:

A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8

awk -v HEADER="$(printf ",%s:" R1..3.csv)" ...

later we will use this HEADER line and fill missed files' data with 0. Now our input is like this format:

$ awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
 seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x]' R1..3.csv 
,R1.csv:,R2.csv:,R3.csv:
A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8

Next I used another below awk script to fill the non-exiting files' data with 0 which I copied from my another answer to the question "Formatting and fill missed data based on column".

... |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
 for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""'

In the end, the sed 's/,R[1-9]+.csv:/,/g' is used to replace the existing filess name in result with single comma ,.

edited May 9 at 17:50

answered Apr 11 at 19:46

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

add a commentÂ |Â

up vote
1
down vote

I used awk which handling multiple files at once:

sed 's/,R[1-9]+.csv:/,/g' <(awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
 seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x]' R1..3.csv 
|awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
 for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""')

The output:

A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8

Code break-down:

awk -F, ' seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x] ' R1..3.csv

Will result:

A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8

awk -v HEADER="$(printf ",%s:" R1..3.csv)" ...

later we will use this HEADER line and fill missed files' data with 0. Now our input is like this format:

$ awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
 seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x]' R1..3.csv 
,R1.csv:,R2.csv:,R3.csv:
A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8

Next I used another below awk script to fill the non-exiting files' data with 0 which I copied from my another answer to the question "Formatting and fill missed data based on column".

... |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
 for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""'

In the end, the sed 's/,R[1-9]+.csv:/,/g' is used to replace the existing filess name in result with single comma ,.

edited May 9 at 17:50

answered Apr 11 at 19:46

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

I used awk which handling multiple files at once:

sed 's/,R[1-9]+.csv:/,/g' <(awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
 seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x]' R1..3.csv 
|awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
 for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""')

The output:

A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8

Code break-down:

awk -F, ' seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x] ' R1..3.csv

Will result:

A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8

awk -v HEADER="$(printf ",%s:" R1..3.csv)" ...

later we will use this HEADER line and fill missed files' data with 0. Now our input is like this format:

$ awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
 seen[$1]=seen[$1]","FILENAME":"$2; 
 END print HEADER; for (x in seen) print x seen[x]' R1..3.csv 
,R1.csv:,R2.csv:,R3.csv:
A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8

Next I used another below awk script to fill the non-exiting files' data with 0 which I copied from my another answer to the question "Formatting and fill missed data based on column".

... |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
 for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""'

In the end, the sed 's/,R[1-9]+.csv:/,/g' is used to replace the existing filess name in result with single comma ,.

edited May 9 at 17:50

answered Apr 11 at 19:46

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

edited May 9 at 17:50

answered Apr 11 at 19:46

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

answered Apr 11 at 19:46

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

answered Apr 11 at 19:46

ÃŽÂ±Ã’Â“sÃÂ½ÃŽÂ¹ÃŽÂ·

14.8k82462

add a commentÂ |Â

up vote
0
down vote

import os,re,argparse
import csv

parser = argparse.ArgumentParser(description='join csvs with rows of the 
 form w+,[1-9], inserting a zero for a row label if it does not 
 exist.')
parser.add_argument('infiles', type=str, help='infile names', nargs='+')
args = parser.parse_args()

d = 
file_idx = 0
for infile in args.infiles:
 with open(infile, 'r') as f:
 for line in f:
 parsed_line = re.match('(w+),([0-9]+)', line)
 if not parsed_line:
 print("line not parsed in file ".format(line, infile))
 continue
 if parsed_line.group(1) in d:
 d[parsed_line.group(1)].append(parsed_line.group(2))
 else:
 l = [0]*(file_idx)
 l.append(parsed_line.group(2))
 d[parsed_line.group(1)]=l
 for k in d:
 if (len(d[k]) == file_idx):
 d[k].append(0)
 if not(len(d[k]) == file_idx+1):
 print("problem with file , dict , key ".format(f,d,k))
 file_idx = file_idx + 1

## output time
with open('results.csv','w') as csvfile:
 cwriter = csv.writer(csvfile)
 header = [os.path.splitext(x)[0] for x in args.infiles]
 header.insert(0,'Taxa')
 cwriter.writerow(header)
 for k in sorted(d.keys()):
 d[k].insert(0,k)
 cwriter.writerow(d[k])

answered Apr 11 at 17:55

hhoke1

31416

add a commentÂ |Â

up vote
0
down vote

import os,re,argparse
import csv

parser = argparse.ArgumentParser(description='join csvs with rows of the 
 form w+,[1-9], inserting a zero for a row label if it does not 
 exist.')
parser.add_argument('infiles', type=str, help='infile names', nargs='+')
args = parser.parse_args()

d = 
file_idx = 0
for infile in args.infiles:
 with open(infile, 'r') as f:
 for line in f:
 parsed_line = re.match('(w+),([0-9]+)', line)
 if not parsed_line:
 print("line not parsed in file ".format(line, infile))
 continue
 if parsed_line.group(1) in d:
 d[parsed_line.group(1)].append(parsed_line.group(2))
 else:
 l = [0]*(file_idx)
 l.append(parsed_line.group(2))
 d[parsed_line.group(1)]=l
 for k in d:
 if (len(d[k]) == file_idx):
 d[k].append(0)
 if not(len(d[k]) == file_idx+1):
 print("problem with file , dict , key ".format(f,d,k))
 file_idx = file_idx + 1

## output time
with open('results.csv','w') as csvfile:
 cwriter = csv.writer(csvfile)
 header = [os.path.splitext(x)[0] for x in args.infiles]
 header.insert(0,'Taxa')
 cwriter.writerow(header)
 for k in sorted(d.keys()):
 d[k].insert(0,k)
 cwriter.writerow(d[k])

answered Apr 11 at 17:55

hhoke1

31416

add a commentÂ |Â

up vote
0
down vote

import os,re,argparse
import csv

parser = argparse.ArgumentParser(description='join csvs with rows of the 
 form w+,[1-9], inserting a zero for a row label if it does not 
 exist.')
parser.add_argument('infiles', type=str, help='infile names', nargs='+')
args = parser.parse_args()

d = 
file_idx = 0
for infile in args.infiles:
 with open(infile, 'r') as f:
 for line in f:
 parsed_line = re.match('(w+),([0-9]+)', line)
 if not parsed_line:
 print("line not parsed in file ".format(line, infile))
 continue
 if parsed_line.group(1) in d:
 d[parsed_line.group(1)].append(parsed_line.group(2))
 else:
 l = [0]*(file_idx)
 l.append(parsed_line.group(2))
 d[parsed_line.group(1)]=l
 for k in d:
 if (len(d[k]) == file_idx):
 d[k].append(0)
 if not(len(d[k]) == file_idx+1):
 print("problem with file , dict , key ".format(f,d,k))
 file_idx = file_idx + 1

## output time
with open('results.csv','w') as csvfile:
 cwriter = csv.writer(csvfile)
 header = [os.path.splitext(x)[0] for x in args.infiles]
 header.insert(0,'Taxa')
 cwriter.writerow(header)
 for k in sorted(d.keys()):
 d[k].insert(0,k)
 cwriter.writerow(d[k])

answered Apr 11 at 17:55

hhoke1

31416

import os,re,argparse
import csv

parser = argparse.ArgumentParser(description='join csvs with rows of the 
 form w+,[1-9], inserting a zero for a row label if it does not 
 exist.')
parser.add_argument('infiles', type=str, help='infile names', nargs='+')
args = parser.parse_args()

d = 
file_idx = 0
for infile in args.infiles:
 with open(infile, 'r') as f:
 for line in f:
 parsed_line = re.match('(w+),([0-9]+)', line)
 if not parsed_line:
 print("line not parsed in file ".format(line, infile))
 continue
 if parsed_line.group(1) in d:
 d[parsed_line.group(1)].append(parsed_line.group(2))
 else:
 l = [0]*(file_idx)
 l.append(parsed_line.group(2))
 d[parsed_line.group(1)]=l
 for k in d:
 if (len(d[k]) == file_idx):
 d[k].append(0)
 if not(len(d[k]) == file_idx+1):
 print("problem with file , dict , key ".format(f,d,k))
 file_idx = file_idx + 1

## output time
with open('results.csv','w') as csvfile:
 cwriter = csv.writer(csvfile)
 header = [os.path.splitext(x)[0] for x in args.infiles]
 header.insert(0,'Taxa')
 cwriter.writerow(header)
 for k in sorted(d.keys()):
 d[k].insert(0,k)
 cwriter.writerow(d[k])

answered Apr 11 at 17:55

hhoke1

31416

answered Apr 11 at 17:55

hhoke1

31416

answered Apr 11 at 17:55

hhoke1

31416

answered Apr 11 at 17:55

hhoke1

31416

add a commentÂ |Â

draft saved

draft discarded

draft saved

draft discarded

Post as a guest

Name

搜尋此網誌

mjhjmtu