Merge multiple CSV files with partially matching key column

The name of the pictureThe name of the pictureThe name of the pictureClash Royale CLAN TAG#URR8PPP











up vote
2
down vote

favorite












I have 100 csv files each containing 2 columns. The first is taxonomy and the second counts. Each file has about 10000 rows. The taxonomy data in each file is only partially shared with a total of about 50000 unique values. I need to merge these into one table where missing taxa in one file are assigned the value 0. The result should be a table (csv or tsv) with 50000 rows and 101 columns.
A simplified example would be:
File 1 (R1.csv):



A,1
B,20
C,30


File 2 (R2.csv):



C,1
D,13
E,15
F,19


File 3 (R3.csv):



A,1
B,4
E,2
G,6
H,8


Expected result:



Taxa,R1,R2,R3
A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8


Any idea how to do this with bash scripting?







share|improve this question






















  • Wouldn't csvkit be more suitable and pleasant to use for this task?
    – smbear
    Apr 13 at 12:27














up vote
2
down vote

favorite












I have 100 csv files each containing 2 columns. The first is taxonomy and the second counts. Each file has about 10000 rows. The taxonomy data in each file is only partially shared with a total of about 50000 unique values. I need to merge these into one table where missing taxa in one file are assigned the value 0. The result should be a table (csv or tsv) with 50000 rows and 101 columns.
A simplified example would be:
File 1 (R1.csv):



A,1
B,20
C,30


File 2 (R2.csv):



C,1
D,13
E,15
F,19


File 3 (R3.csv):



A,1
B,4
E,2
G,6
H,8


Expected result:



Taxa,R1,R2,R3
A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8


Any idea how to do this with bash scripting?







share|improve this question






















  • Wouldn't csvkit be more suitable and pleasant to use for this task?
    – smbear
    Apr 13 at 12:27












up vote
2
down vote

favorite









up vote
2
down vote

favorite











I have 100 csv files each containing 2 columns. The first is taxonomy and the second counts. Each file has about 10000 rows. The taxonomy data in each file is only partially shared with a total of about 50000 unique values. I need to merge these into one table where missing taxa in one file are assigned the value 0. The result should be a table (csv or tsv) with 50000 rows and 101 columns.
A simplified example would be:
File 1 (R1.csv):



A,1
B,20
C,30


File 2 (R2.csv):



C,1
D,13
E,15
F,19


File 3 (R3.csv):



A,1
B,4
E,2
G,6
H,8


Expected result:



Taxa,R1,R2,R3
A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8


Any idea how to do this with bash scripting?







share|improve this question














I have 100 csv files each containing 2 columns. The first is taxonomy and the second counts. Each file has about 10000 rows. The taxonomy data in each file is only partially shared with a total of about 50000 unique values. I need to merge these into one table where missing taxa in one file are assigned the value 0. The result should be a table (csv or tsv) with 50000 rows and 101 columns.
A simplified example would be:
File 1 (R1.csv):



A,1
B,20
C,30


File 2 (R2.csv):



C,1
D,13
E,15
F,19


File 3 (R3.csv):



A,1
B,4
E,2
G,6
H,8


Expected result:



Taxa,R1,R2,R3
A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8


Any idea how to do this with bash scripting?









share|improve this question













share|improve this question




share|improve this question








edited Apr 12 at 18:21









αғsнιη

14.8k82462




14.8k82462










asked Apr 11 at 13:51









Mina Bizic

111




111











  • Wouldn't csvkit be more suitable and pleasant to use for this task?
    – smbear
    Apr 13 at 12:27
















  • Wouldn't csvkit be more suitable and pleasant to use for this task?
    – smbear
    Apr 13 at 12:27















Wouldn't csvkit be more suitable and pleasant to use for this task?
– smbear
Apr 13 at 12:27




Wouldn't csvkit be more suitable and pleasant to use for this task?
– smbear
Apr 13 at 12:27










3 Answers
3






active

oldest

votes

















up vote
1
down vote













this is going to be painfull



 join -t, -j1 -a1 -e0 -o auto r1.csv r2.csv > r12a.csv
join -t, -j1 -a2 -e0 -o auto r1.csv r2.csv > r12b.csv
sort -u r12?.csv > r12.csv
join -t, -j1 -a1 -e0 -o auto r12.csv r3.csv > r123a.csv
join -t, -j1 -a2 -e0 -o auto r12.csv r3.csv > r123b.csv
sort -u r123a,b.csv


  • the first join print unpaired value in file x (-ax) with a default (-e0), -o auto tell join to print 0


  • sort -u sort and keep unique record.

I an not sure awk code would be more readable.






share|improve this answer




















  • This can be extended to process all 100 files?
    – Î±Ò“sнιη
    Apr 11 at 16:57

















up vote
1
down vote













I used awk which handling multiple files at once:



sed 's/,R[1-9]+.csv:/,/g' <(awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
seen[$1]=seen[$1]","FILENAME":"$2;
END print HEADER; for (x in seen) print x seen[x]' R1..3.csv
|awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""')


The output:



A,1,0,1
B,20,0,4
C,30,1,0
D,0,13,0
E,0,15,2
F,0,19,0
G,0,0,6
H,0,0,8



Code break-down:



awk -F, ' seen[$1]=seen[$1]","FILENAME":"$2; 
END print HEADER; for (x in seen) print x seen[x] ' R1..3.csv


The main part of the code that join all second column in all files into one together printing with the value belongs to which file having same first column. The seen here is an array name with the key as first column and value with ,FILENAME:$2 in appending mode.



In seen[$1]=seen[$1]","FILENAME":"$2; means print a comma , followed by current processing file FILENAME with awk, a colon : followed by second column value $2 when it has same first column seen[$1]=... and append into the same key index =seen[$1]... and save in same key's value.



The END statement, awk executing this block in end when all records/line read, and we used a for-loop to iterate over the array seen and print the key first and the value of the key in next.



Will result:



A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8


OK, now we know exiting values are from which files and which files doesn't have those data. For this to filling non-existing files' data with 0, I used shell command substitution to generate a header line congaing all files name and passed to awk as HEADER -variable:



awk -v HEADER="$(printf ",%s:" R1..3.csv)" ...


later we will use this HEADER line and fill missed files' data with 0. Now our input is like this format:



$ awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
seen[$1]=seen[$1]","FILENAME":"$2;
END print HEADER; for (x in seen) print x seen[x]' R1..3.csv
,R1.csv:,R2.csv:,R3.csv:
A,R1.csv:1,R3.csv:1
B,R1.csv:20,R3.csv:4
C,R1.csv:30,R2.csv:1
D,R2.csv:13
E,R2.csv:15,R3.csv:2
F,R2.csv:19
G,R3.csv:6
H,R3.csv:8


Next I used another below awk script to fill the non-exiting files' data with 0 which I copied from my another answer to the question "Formatting and fill missed data based on column".



... |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""'


In the end, the sed 's/,R[1-9]+.csv:/,/g' is used to replace the existing filess name in result with single comma ,.






share|improve this answer





























    up vote
    0
    down vote













    There are a lot of ways to handle CSVs on the command line, or the answer by Archemar. However, because of your requirements, I would advise using python. I tested out this script in python 3.5 and should solve the problem, or at least give you a good start:



    import os,re,argparse
    import csv

    parser = argparse.ArgumentParser(description='join csvs with rows of the
    form w+,[1-9], inserting a zero for a row label if it does not
    exist.')
    parser.add_argument('infiles', type=str, help='infile names', nargs='+')
    args = parser.parse_args()

    d =
    file_idx = 0
    for infile in args.infiles:
    with open(infile, 'r') as f:
    for line in f:
    parsed_line = re.match('(w+),([0-9]+)', line)
    if not parsed_line:
    print("line not parsed in file ".format(line, infile))
    continue
    if parsed_line.group(1) in d:
    d[parsed_line.group(1)].append(parsed_line.group(2))
    else:
    l = [0]*(file_idx)
    l.append(parsed_line.group(2))
    d[parsed_line.group(1)]=l
    for k in d:
    if (len(d[k]) == file_idx):
    d[k].append(0)
    if not(len(d[k]) == file_idx+1):
    print("problem with file , dict , key ".format(f,d,k))
    file_idx = file_idx + 1

    ## output time
    with open('results.csv','w') as csvfile:
    cwriter = csv.writer(csvfile)
    header = [os.path.splitext(x)[0] for x in args.infiles]
    header.insert(0,'Taxa')
    cwriter.writerow(header)
    for k in sorted(d.keys()):
    d[k].insert(0,k)
    cwriter.writerow(d[k])





    share|improve this answer




















      Your Answer







      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "106"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      convertImagesToLinks: false,
      noModals: false,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );








       

      draft saved


      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f437026%2fmerge-multiple-csv-files-with-partially-matching-key-column%23new-answer', 'question_page');

      );

      Post as a guest






























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes








      up vote
      1
      down vote













      this is going to be painfull



       join -t, -j1 -a1 -e0 -o auto r1.csv r2.csv > r12a.csv
      join -t, -j1 -a2 -e0 -o auto r1.csv r2.csv > r12b.csv
      sort -u r12?.csv > r12.csv
      join -t, -j1 -a1 -e0 -o auto r12.csv r3.csv > r123a.csv
      join -t, -j1 -a2 -e0 -o auto r12.csv r3.csv > r123b.csv
      sort -u r123a,b.csv


      • the first join print unpaired value in file x (-ax) with a default (-e0), -o auto tell join to print 0


      • sort -u sort and keep unique record.

      I an not sure awk code would be more readable.






      share|improve this answer




















      • This can be extended to process all 100 files?
        – Î±Ò“sнιη
        Apr 11 at 16:57














      up vote
      1
      down vote













      this is going to be painfull



       join -t, -j1 -a1 -e0 -o auto r1.csv r2.csv > r12a.csv
      join -t, -j1 -a2 -e0 -o auto r1.csv r2.csv > r12b.csv
      sort -u r12?.csv > r12.csv
      join -t, -j1 -a1 -e0 -o auto r12.csv r3.csv > r123a.csv
      join -t, -j1 -a2 -e0 -o auto r12.csv r3.csv > r123b.csv
      sort -u r123a,b.csv


      • the first join print unpaired value in file x (-ax) with a default (-e0), -o auto tell join to print 0


      • sort -u sort and keep unique record.

      I an not sure awk code would be more readable.






      share|improve this answer




















      • This can be extended to process all 100 files?
        – Î±Ò“sнιη
        Apr 11 at 16:57












      up vote
      1
      down vote










      up vote
      1
      down vote









      this is going to be painfull



       join -t, -j1 -a1 -e0 -o auto r1.csv r2.csv > r12a.csv
      join -t, -j1 -a2 -e0 -o auto r1.csv r2.csv > r12b.csv
      sort -u r12?.csv > r12.csv
      join -t, -j1 -a1 -e0 -o auto r12.csv r3.csv > r123a.csv
      join -t, -j1 -a2 -e0 -o auto r12.csv r3.csv > r123b.csv
      sort -u r123a,b.csv


      • the first join print unpaired value in file x (-ax) with a default (-e0), -o auto tell join to print 0


      • sort -u sort and keep unique record.

      I an not sure awk code would be more readable.






      share|improve this answer












      this is going to be painfull



       join -t, -j1 -a1 -e0 -o auto r1.csv r2.csv > r12a.csv
      join -t, -j1 -a2 -e0 -o auto r1.csv r2.csv > r12b.csv
      sort -u r12?.csv > r12.csv
      join -t, -j1 -a1 -e0 -o auto r12.csv r3.csv > r123a.csv
      join -t, -j1 -a2 -e0 -o auto r12.csv r3.csv > r123b.csv
      sort -u r123a,b.csv


      • the first join print unpaired value in file x (-ax) with a default (-e0), -o auto tell join to print 0


      • sort -u sort and keep unique record.

      I an not sure awk code would be more readable.







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Apr 11 at 14:53









      Archemar

      18.9k93366




      18.9k93366











      • This can be extended to process all 100 files?
        – Î±Ò“sнιη
        Apr 11 at 16:57
















      • This can be extended to process all 100 files?
        – Î±Ò“sнιη
        Apr 11 at 16:57















      This can be extended to process all 100 files?
      – Î±Ò“sнιη
      Apr 11 at 16:57




      This can be extended to process all 100 files?
      – Î±Ò“sнιη
      Apr 11 at 16:57












      up vote
      1
      down vote













      I used awk which handling multiple files at once:



      sed 's/,R[1-9]+.csv:/,/g' <(awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
      seen[$1]=seen[$1]","FILENAME":"$2;
      END print HEADER; for (x in seen) print x seen[x]' R1..3.csv
      |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
      for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""')


      The output:



      A,1,0,1
      B,20,0,4
      C,30,1,0
      D,0,13,0
      E,0,15,2
      F,0,19,0
      G,0,0,6
      H,0,0,8



      Code break-down:



      awk -F, ' seen[$1]=seen[$1]","FILENAME":"$2; 
      END print HEADER; for (x in seen) print x seen[x] ' R1..3.csv


      The main part of the code that join all second column in all files into one together printing with the value belongs to which file having same first column. The seen here is an array name with the key as first column and value with ,FILENAME:$2 in appending mode.



      In seen[$1]=seen[$1]","FILENAME":"$2; means print a comma , followed by current processing file FILENAME with awk, a colon : followed by second column value $2 when it has same first column seen[$1]=... and append into the same key index =seen[$1]... and save in same key's value.



      The END statement, awk executing this block in end when all records/line read, and we used a for-loop to iterate over the array seen and print the key first and the value of the key in next.



      Will result:



      A,R1.csv:1,R3.csv:1
      B,R1.csv:20,R3.csv:4
      C,R1.csv:30,R2.csv:1
      D,R2.csv:13
      E,R2.csv:15,R3.csv:2
      F,R2.csv:19
      G,R3.csv:6
      H,R3.csv:8


      OK, now we know exiting values are from which files and which files doesn't have those data. For this to filling non-existing files' data with 0, I used shell command substitution to generate a header line congaing all files name and passed to awk as HEADER -variable:



      awk -v HEADER="$(printf ",%s:" R1..3.csv)" ...


      later we will use this HEADER line and fill missed files' data with 0. Now our input is like this format:



      $ awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
      seen[$1]=seen[$1]","FILENAME":"$2;
      END print HEADER; for (x in seen) print x seen[x]' R1..3.csv
      ,R1.csv:,R2.csv:,R3.csv:
      A,R1.csv:1,R3.csv:1
      B,R1.csv:20,R3.csv:4
      C,R1.csv:30,R2.csv:1
      D,R2.csv:13
      E,R2.csv:15,R3.csv:2
      F,R2.csv:19
      G,R3.csv:6
      H,R3.csv:8


      Next I used another below awk script to fill the non-exiting files' data with 0 which I copied from my another answer to the question "Formatting and fill missed data based on column".



      ... |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
      for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""'


      In the end, the sed 's/,R[1-9]+.csv:/,/g' is used to replace the existing filess name in result with single comma ,.






      share|improve this answer


























        up vote
        1
        down vote













        I used awk which handling multiple files at once:



        sed 's/,R[1-9]+.csv:/,/g' <(awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
        seen[$1]=seen[$1]","FILENAME":"$2;
        END print HEADER; for (x in seen) print x seen[x]' R1..3.csv
        |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
        for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""')


        The output:



        A,1,0,1
        B,20,0,4
        C,30,1,0
        D,0,13,0
        E,0,15,2
        F,0,19,0
        G,0,0,6
        H,0,0,8



        Code break-down:



        awk -F, ' seen[$1]=seen[$1]","FILENAME":"$2; 
        END print HEADER; for (x in seen) print x seen[x] ' R1..3.csv


        The main part of the code that join all second column in all files into one together printing with the value belongs to which file having same first column. The seen here is an array name with the key as first column and value with ,FILENAME:$2 in appending mode.



        In seen[$1]=seen[$1]","FILENAME":"$2; means print a comma , followed by current processing file FILENAME with awk, a colon : followed by second column value $2 when it has same first column seen[$1]=... and append into the same key index =seen[$1]... and save in same key's value.



        The END statement, awk executing this block in end when all records/line read, and we used a for-loop to iterate over the array seen and print the key first and the value of the key in next.



        Will result:



        A,R1.csv:1,R3.csv:1
        B,R1.csv:20,R3.csv:4
        C,R1.csv:30,R2.csv:1
        D,R2.csv:13
        E,R2.csv:15,R3.csv:2
        F,R2.csv:19
        G,R3.csv:6
        H,R3.csv:8


        OK, now we know exiting values are from which files and which files doesn't have those data. For this to filling non-existing files' data with 0, I used shell command substitution to generate a header line congaing all files name and passed to awk as HEADER -variable:



        awk -v HEADER="$(printf ",%s:" R1..3.csv)" ...


        later we will use this HEADER line and fill missed files' data with 0. Now our input is like this format:



        $ awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
        seen[$1]=seen[$1]","FILENAME":"$2;
        END print HEADER; for (x in seen) print x seen[x]' R1..3.csv
        ,R1.csv:,R2.csv:,R3.csv:
        A,R1.csv:1,R3.csv:1
        B,R1.csv:20,R3.csv:4
        C,R1.csv:30,R2.csv:1
        D,R2.csv:13
        E,R2.csv:15,R3.csv:2
        F,R2.csv:19
        G,R3.csv:6
        H,R3.csv:8


        Next I used another below awk script to fill the non-exiting files' data with 0 which I copied from my another answer to the question "Formatting and fill missed data based on column".



        ... |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
        for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""'


        In the end, the sed 's/,R[1-9]+.csv:/,/g' is used to replace the existing filess name in result with single comma ,.






        share|improve this answer
























          up vote
          1
          down vote










          up vote
          1
          down vote









          I used awk which handling multiple files at once:



          sed 's/,R[1-9]+.csv:/,/g' <(awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
          seen[$1]=seen[$1]","FILENAME":"$2;
          END print HEADER; for (x in seen) print x seen[x]' R1..3.csv
          |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
          for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""')


          The output:



          A,1,0,1
          B,20,0,4
          C,30,1,0
          D,0,13,0
          E,0,15,2
          F,0,19,0
          G,0,0,6
          H,0,0,8



          Code break-down:



          awk -F, ' seen[$1]=seen[$1]","FILENAME":"$2; 
          END print HEADER; for (x in seen) print x seen[x] ' R1..3.csv


          The main part of the code that join all second column in all files into one together printing with the value belongs to which file having same first column. The seen here is an array name with the key as first column and value with ,FILENAME:$2 in appending mode.



          In seen[$1]=seen[$1]","FILENAME":"$2; means print a comma , followed by current processing file FILENAME with awk, a colon : followed by second column value $2 when it has same first column seen[$1]=... and append into the same key index =seen[$1]... and save in same key's value.



          The END statement, awk executing this block in end when all records/line read, and we used a for-loop to iterate over the array seen and print the key first and the value of the key in next.



          Will result:



          A,R1.csv:1,R3.csv:1
          B,R1.csv:20,R3.csv:4
          C,R1.csv:30,R2.csv:1
          D,R2.csv:13
          E,R2.csv:15,R3.csv:2
          F,R2.csv:19
          G,R3.csv:6
          H,R3.csv:8


          OK, now we know exiting values are from which files and which files doesn't have those data. For this to filling non-existing files' data with 0, I used shell command substitution to generate a header line congaing all files name and passed to awk as HEADER -variable:



          awk -v HEADER="$(printf ",%s:" R1..3.csv)" ...


          later we will use this HEADER line and fill missed files' data with 0. Now our input is like this format:



          $ awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
          seen[$1]=seen[$1]","FILENAME":"$2;
          END print HEADER; for (x in seen) print x seen[x]' R1..3.csv
          ,R1.csv:,R2.csv:,R3.csv:
          A,R1.csv:1,R3.csv:1
          B,R1.csv:20,R3.csv:4
          C,R1.csv:30,R2.csv:1
          D,R2.csv:13
          E,R2.csv:15,R3.csv:2
          F,R2.csv:19
          G,R3.csv:6
          H,R3.csv:8


          Next I used another below awk script to fill the non-exiting files' data with 0 which I copied from my another answer to the question "Formatting and fill missed data based on column".



          ... |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
          for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""'


          In the end, the sed 's/,R[1-9]+.csv:/,/g' is used to replace the existing filess name in result with single comma ,.






          share|improve this answer














          I used awk which handling multiple files at once:



          sed 's/,R[1-9]+.csv:/,/g' <(awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
          seen[$1]=seen[$1]","FILENAME":"$2;
          END print HEADER; for (x in seen) print x seen[x]' R1..3.csv
          |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
          for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""')


          The output:



          A,1,0,1
          B,20,0,4
          C,30,1,0
          D,0,13,0
          E,0,15,2
          F,0,19,0
          G,0,0,6
          H,0,0,8



          Code break-down:



          awk -F, ' seen[$1]=seen[$1]","FILENAME":"$2; 
          END print HEADER; for (x in seen) print x seen[x] ' R1..3.csv


          The main part of the code that join all second column in all files into one together printing with the value belongs to which file having same first column. The seen here is an array name with the key as first column and value with ,FILENAME:$2 in appending mode.



          In seen[$1]=seen[$1]","FILENAME":"$2; means print a comma , followed by current processing file FILENAME with awk, a colon : followed by second column value $2 when it has same first column seen[$1]=... and append into the same key index =seen[$1]... and save in same key's value.



          The END statement, awk executing this block in end when all records/line read, and we used a for-loop to iterate over the array seen and print the key first and the value of the key in next.



          Will result:



          A,R1.csv:1,R3.csv:1
          B,R1.csv:20,R3.csv:4
          C,R1.csv:30,R2.csv:1
          D,R2.csv:13
          E,R2.csv:15,R3.csv:2
          F,R2.csv:19
          G,R3.csv:6
          H,R3.csv:8


          OK, now we know exiting values are from which files and which files doesn't have those data. For this to filling non-existing files' data with 0, I used shell command substitution to generate a header line congaing all files name and passed to awk as HEADER -variable:



          awk -v HEADER="$(printf ",%s:" R1..3.csv)" ...


          later we will use this HEADER line and fill missed files' data with 0. Now our input is like this format:



          $ awk -v HEADER="$(printf ",%s:" R1..3.csv)" -F, '
          seen[$1]=seen[$1]","FILENAME":"$2;
          END print HEADER; for (x in seen) print x seen[x]' R1..3.csv
          ,R1.csv:,R2.csv:,R3.csv:
          A,R1.csv:1,R3.csv:1
          B,R1.csv:20,R3.csv:4
          C,R1.csv:30,R2.csv:1
          D,R2.csv:13
          E,R2.csv:15,R3.csv:2
          F,R2.csv:19
          G,R3.csv:6
          H,R3.csv:8


          Next I used another below awk script to fill the non-exiting files' data with 0 which I copied from my another answer to the question "Formatting and fill missed data based on column".



          ... |awk -F, 'NR==1split($0,arr,/,/);next SEP=""; fld=1;
          for (x in arr)printf ($0 ~ arr[x])?SEP""$(fld++):",0";SEP=",";print ""'


          In the end, the sed 's/,R[1-9]+.csv:/,/g' is used to replace the existing filess name in result with single comma ,.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited May 9 at 17:50

























          answered Apr 11 at 19:46









          αғsнιη

          14.8k82462




          14.8k82462




















              up vote
              0
              down vote













              There are a lot of ways to handle CSVs on the command line, or the answer by Archemar. However, because of your requirements, I would advise using python. I tested out this script in python 3.5 and should solve the problem, or at least give you a good start:



              import os,re,argparse
              import csv

              parser = argparse.ArgumentParser(description='join csvs with rows of the
              form w+,[1-9], inserting a zero for a row label if it does not
              exist.')
              parser.add_argument('infiles', type=str, help='infile names', nargs='+')
              args = parser.parse_args()

              d =
              file_idx = 0
              for infile in args.infiles:
              with open(infile, 'r') as f:
              for line in f:
              parsed_line = re.match('(w+),([0-9]+)', line)
              if not parsed_line:
              print("line not parsed in file ".format(line, infile))
              continue
              if parsed_line.group(1) in d:
              d[parsed_line.group(1)].append(parsed_line.group(2))
              else:
              l = [0]*(file_idx)
              l.append(parsed_line.group(2))
              d[parsed_line.group(1)]=l
              for k in d:
              if (len(d[k]) == file_idx):
              d[k].append(0)
              if not(len(d[k]) == file_idx+1):
              print("problem with file , dict , key ".format(f,d,k))
              file_idx = file_idx + 1

              ## output time
              with open('results.csv','w') as csvfile:
              cwriter = csv.writer(csvfile)
              header = [os.path.splitext(x)[0] for x in args.infiles]
              header.insert(0,'Taxa')
              cwriter.writerow(header)
              for k in sorted(d.keys()):
              d[k].insert(0,k)
              cwriter.writerow(d[k])





              share|improve this answer
























                up vote
                0
                down vote













                There are a lot of ways to handle CSVs on the command line, or the answer by Archemar. However, because of your requirements, I would advise using python. I tested out this script in python 3.5 and should solve the problem, or at least give you a good start:



                import os,re,argparse
                import csv

                parser = argparse.ArgumentParser(description='join csvs with rows of the
                form w+,[1-9], inserting a zero for a row label if it does not
                exist.')
                parser.add_argument('infiles', type=str, help='infile names', nargs='+')
                args = parser.parse_args()

                d =
                file_idx = 0
                for infile in args.infiles:
                with open(infile, 'r') as f:
                for line in f:
                parsed_line = re.match('(w+),([0-9]+)', line)
                if not parsed_line:
                print("line not parsed in file ".format(line, infile))
                continue
                if parsed_line.group(1) in d:
                d[parsed_line.group(1)].append(parsed_line.group(2))
                else:
                l = [0]*(file_idx)
                l.append(parsed_line.group(2))
                d[parsed_line.group(1)]=l
                for k in d:
                if (len(d[k]) == file_idx):
                d[k].append(0)
                if not(len(d[k]) == file_idx+1):
                print("problem with file , dict , key ".format(f,d,k))
                file_idx = file_idx + 1

                ## output time
                with open('results.csv','w') as csvfile:
                cwriter = csv.writer(csvfile)
                header = [os.path.splitext(x)[0] for x in args.infiles]
                header.insert(0,'Taxa')
                cwriter.writerow(header)
                for k in sorted(d.keys()):
                d[k].insert(0,k)
                cwriter.writerow(d[k])





                share|improve this answer






















                  up vote
                  0
                  down vote










                  up vote
                  0
                  down vote









                  There are a lot of ways to handle CSVs on the command line, or the answer by Archemar. However, because of your requirements, I would advise using python. I tested out this script in python 3.5 and should solve the problem, or at least give you a good start:



                  import os,re,argparse
                  import csv

                  parser = argparse.ArgumentParser(description='join csvs with rows of the
                  form w+,[1-9], inserting a zero for a row label if it does not
                  exist.')
                  parser.add_argument('infiles', type=str, help='infile names', nargs='+')
                  args = parser.parse_args()

                  d =
                  file_idx = 0
                  for infile in args.infiles:
                  with open(infile, 'r') as f:
                  for line in f:
                  parsed_line = re.match('(w+),([0-9]+)', line)
                  if not parsed_line:
                  print("line not parsed in file ".format(line, infile))
                  continue
                  if parsed_line.group(1) in d:
                  d[parsed_line.group(1)].append(parsed_line.group(2))
                  else:
                  l = [0]*(file_idx)
                  l.append(parsed_line.group(2))
                  d[parsed_line.group(1)]=l
                  for k in d:
                  if (len(d[k]) == file_idx):
                  d[k].append(0)
                  if not(len(d[k]) == file_idx+1):
                  print("problem with file , dict , key ".format(f,d,k))
                  file_idx = file_idx + 1

                  ## output time
                  with open('results.csv','w') as csvfile:
                  cwriter = csv.writer(csvfile)
                  header = [os.path.splitext(x)[0] for x in args.infiles]
                  header.insert(0,'Taxa')
                  cwriter.writerow(header)
                  for k in sorted(d.keys()):
                  d[k].insert(0,k)
                  cwriter.writerow(d[k])





                  share|improve this answer












                  There are a lot of ways to handle CSVs on the command line, or the answer by Archemar. However, because of your requirements, I would advise using python. I tested out this script in python 3.5 and should solve the problem, or at least give you a good start:



                  import os,re,argparse
                  import csv

                  parser = argparse.ArgumentParser(description='join csvs with rows of the
                  form w+,[1-9], inserting a zero for a row label if it does not
                  exist.')
                  parser.add_argument('infiles', type=str, help='infile names', nargs='+')
                  args = parser.parse_args()

                  d =
                  file_idx = 0
                  for infile in args.infiles:
                  with open(infile, 'r') as f:
                  for line in f:
                  parsed_line = re.match('(w+),([0-9]+)', line)
                  if not parsed_line:
                  print("line not parsed in file ".format(line, infile))
                  continue
                  if parsed_line.group(1) in d:
                  d[parsed_line.group(1)].append(parsed_line.group(2))
                  else:
                  l = [0]*(file_idx)
                  l.append(parsed_line.group(2))
                  d[parsed_line.group(1)]=l
                  for k in d:
                  if (len(d[k]) == file_idx):
                  d[k].append(0)
                  if not(len(d[k]) == file_idx+1):
                  print("problem with file , dict , key ".format(f,d,k))
                  file_idx = file_idx + 1

                  ## output time
                  with open('results.csv','w') as csvfile:
                  cwriter = csv.writer(csvfile)
                  header = [os.path.splitext(x)[0] for x in args.infiles]
                  header.insert(0,'Taxa')
                  cwriter.writerow(header)
                  for k in sorted(d.keys()):
                  d[k].insert(0,k)
                  cwriter.writerow(d[k])






                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Apr 11 at 17:55









                  hhoke1

                  31416




                  31416






















                       

                      draft saved


                      draft discarded


























                       


                      draft saved


                      draft discarded














                      StackExchange.ready(
                      function ()
                      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f437026%2fmerge-multiple-csv-files-with-partially-matching-key-column%23new-answer', 'question_page');

                      );

                      Post as a guest













































































                      Popular posts from this blog

                      How to check contact read email or not when send email to Individual?

                      Displaying single band from multi-band raster using QGIS

                      How many registers does an x86_64 CPU actually have?