How to split a single file into multiple files based on a column in linux?




























I have a text file with the following information:



Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10 7779 BCM GRCh38
USH2A TCGA-BD-A2L6-01A-11D-A20W-10 7399 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
PRDM16 TCGA-G3-A7M5-01A-11D-A33Q-10 63976 BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10 55735 BCM GRCh38
HNRNPCL2 TCGA-G3-A7M5-01A-11D-A33Q-10 440563 BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10 84970 BCM GRCh38
NFYC TCGA-G3-A7M5-01A-11D-A33Q-10 4802 BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10 3652 BCM GRCh38


As you can see, there are multiple samples. I want to split the file into multiple files based on the column "Tumor_Sample_Barcode". Each output file should be named samplename.txt, and each should keep the header line.



First output - TCGA-BD-A2L6-01A-11D-A20W-10.txt



Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10 7779 BCM GRCh38
USH2A TCGA-BD-A2L6-01A-11D-A20W-10 7399 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38


Second output - TCGA-O8-A75V-01A-11D-A32G-10.txt



Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38


Third output - TCGA-G3-A7M5-01A-11D-A33Q-10.txt



Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
PRDM16 TCGA-G3-A7M5-01A-11D-A33Q-10 63976 BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10 55735 BCM GRCh38
HNRNPCL2 TCGA-G3-A7M5-01A-11D-A33Q-10 440563 BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10 84970 BCM GRCh38
NFYC TCGA-G3-A7M5-01A-11D-A33Q-10 4802 BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10 3652 BCM GRCh38


How can I do this in Linux?


































  • Thank you for the reply. But I don't see any headers in the output files. How to get the columns names also in the outputs?

    – user3351523
    Jan 31 '18 at 12:51






  • @user3351523, "headers" should be the next step. The first step should be posting testable input (as text, not as an image).

    – RomanPerekhrest
    Jan 31 '18 at 12:53












  • Yes, sorry for that. I posted test table input as text now. How to get the headers in output files?

    – user3351523
    Jan 31 '18 at 12:57















linux files split














asked Jan 31 '18 at 12:41 by user3351523
edited Jan 31 '18 at 12:56 by user3351523






















1 Answer














Awk solution:



awk 'NR==1{ h=$0 } NR>1{ print (!a[$2]++ ? h ORS $0 : $0) > $2".txt" }' file



  • NR==1{ h=$0 } - capture the 1st line/record as the header line (NR is the current record number, $0 contains the current line)
  • NR>1 - for all records except the first one:
    • <cond> ? <operand_1> : <operand_2> - classical ternary operator
    • !a[$2]++ - true only for the 1st occurrence of the barcode value $2, which is used as a key of associative array a
    • h ORS $0 - common header line concatenated with ORS (output record separator, defaults to \n) and the current record $0
    • print ... > $2".txt" - print the given expression (or the current line if nothing was specified) into file <barcode_value>.txt
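The pieces explained above can be tried on a small input. The following is a minimal sketch, assuming a scratch directory and an input file named sample.txt (both hypothetical, not from the original post), with the one-liner spread over several lines and commented:

```shell
#!/bin/sh
# Hypothetical scratch directory and file name, used only for this illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A shortened copy of the table from the question: one header line, two barcodes.
cat > sample.txt <<'EOF'
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
EOF

awk '
NR == 1 { h = $0 }          # remember the header line
NR > 1 {
    out = $2 ".txt"         # one output file per barcode value
    # First row seen for this barcode: header, ORS (newline), then the row.
    # Later rows for the same barcode: the row only.
    print (!a[$2]++ ? h ORS $0 : $0) > out
}' sample.txt

head TCGA*.txt
```

Building the file name in a variable first (out) also sidesteps the question of how a given awk parses a concatenation placed directly after the > redirection.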



Or a more self-explanatory version:



awk 'NR==1 {header = $0; next}
!header_printed[$2]++ {print header > $2".txt"}
{print > $2".txt"}' < file
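If the input contains a very large number of distinct barcodes, some awk implementations may run out of file descriptors, because every output file stays open until awk exits. A possible variant, sketched here as an assumption rather than as part of the original answer, keeps only one output file open at a time by calling close() and appending with >>. The scratch directory and the file name sample.txt are hypothetical, and the rows are deliberately interleaved to show that unsorted input still works:

```shell
#!/bin/sh
# Hypothetical scratch directory and file name, used only for this illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Rows for the two barcodes are interleaved on purpose.
cat > sample.txt <<'EOF'
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
EOF

awk '
NR == 1 { header = $0; next }
{
    out = $2 ".txt"
    if (out != prev) {              # barcode changed: release the previous file
        if (prev != "") close(prev)
        prev = out
    }
    if (!seen[out]++) {             # first sight: create or truncate, write header
        print header > out
        close(out)
    }
    print >> out                    # append the current row
}' sample.txt

head TCGA*.txt
```

The header is written with > so that a file left over from an earlier run is truncated; every data row is then appended with >>, which is what allows a closed file to be reopened without losing its contents.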


Viewing results:



$ head TCGA*.txt
==> TCGA-BD-A2L6-01A-11D-A20W-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10 7779 BCM GRCh38
USH2A TCGA-BD-A2L6-01A-11D-A20W-10 7399 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38

==> TCGA-G3-A7M5-01A-11D-A33Q-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
PRDM16 TCGA-G3-A7M5-01A-11D-A33Q-10 63976 BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10 55735 BCM GRCh38
HNRNPCL2 TCGA-G3-A7M5-01A-11D-A33Q-10 440563 BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10 84970 BCM GRCh38
NFYC TCGA-G3-A7M5-01A-11D-A33Q-10 4802 BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10 3652 BCM GRCh38

==> TCGA-O8-A75V-01A-11D-A32G-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38



To name each output file after only the first 15 characters of the barcode value (e.g. TCGA-BD-A2L6-01.txt):



awk 'NR==1{ h=$0 } NR>1{ print (!a[$2]++ ? h ORS $0 : $0) > substr($2, 1, 15)".txt" }' file
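One caveat with shortened file names: the array a is keyed on the full barcode, so two different barcodes that share the same first 15 characters would each write a header into the same file. The sketch below keys the check on the 15-character prefix instead. It is an illustration under assumptions: the scratch directory, the file name sample.txt, and the second barcode (ending in 01B-...) are hypothetical and do not appear in the question's data.

```shell
#!/bin/sh
# Hypothetical scratch directory and file name, used only for this illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The 01B barcode is invented here to create two barcodes with the same prefix.
cat > sample.txt <<'EOF'
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SOS1 TCGA-BD-A2L6-01B-11D-A20W-10 6654 BCM GRCh38
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
EOF

awk '
NR == 1 { h = $0; next }
{
    key = substr($2, 1, 15)         # e.g. TCGA-BD-A2L6-01
    out = key ".txt"
    if (!seen[key]++) print h > out # header once per output file, not per barcode
    print > out
}' sample.txt

head TCGA*.txt
```

With the question's sample data the two approaches give the same result, since each barcode there has a distinct prefix.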






























  • Thank you very much !! Could you please explain the command?

    – user3351523
    Jan 31 '18 at 13:13












  • Please explain the command and could you also tell me how to get the output files with only 0-15 substring in the sample names like TCGA-BD-A2L6-01.txt, TCGA-G3-A7M5-01.txt and TCGA-O8-A75V-01.txt

    – user3351523
    Jan 31 '18 at 13:31






  • @user3351523, yes, see my explanation

    – RomanPerekhrest
    Jan 31 '18 at 13:47


















answered Jan 31 '18 at 13:06 by RomanPerekhrest
edited Jan 31 '18 at 13:53











