How to split a single file into multiple files based on a column in Linux?

I have a text file with the following content:

Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10 7779 BCM GRCh38
USH2A TCGA-BD-A2L6-01A-11D-A20W-10 7399 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
PRDM16 TCGA-G3-A7M5-01A-11D-A33Q-10 63976 BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10 55735 BCM GRCh38
HNRNPCL2 TCGA-G3-A7M5-01A-11D-A33Q-10 440563 BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10 84970 BCM GRCh38
NFYC TCGA-G3-A7M5-01A-11D-A33Q-10 4802 BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10 3652 BCM GRCh38

As you can see, there are multiple samples. I want to split the file into multiple files based on the column "Tumor_Sample_Barcode". The output files need to be named after the sample (samplename.txt).

First output - TCGA-BD-A2L6-01A-11D-A20W-10.txt

Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10 7779 BCM GRCh38
USH2A TCGA-BD-A2L6-01A-11D-A20W-10 7399 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38

Second output - TCGA-O8-A75V-01A-11D-A32G-10.txt

Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38

Third output - TCGA-G3-A7M5-01A-11D-A33Q-10.txt

Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
PRDM16 TCGA-G3-A7M5-01A-11D-A33Q-10 63976 BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10 55735 BCM GRCh38
HNRNPCL2 TCGA-G3-A7M5-01A-11D-A33Q-10 440563 BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10 84970 BCM GRCh38
NFYC TCGA-G3-A7M5-01A-11D-A33Q-10 4802 BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10 3652 BCM GRCh38

How can I do this on Linux?

edited Jan 31 '18 at 12:56

asked Jan 31 '18 at 12:41

user3351523

Thank you for the reply. But I don't see any headers in the output files. How can I get the column names in the outputs as well?

– user3351523
Jan 31 '18 at 12:51

@user3351523, "headers" should be the next step. The first step should be posting a testable input (as text, not as an image).

– RomanPerekhrest
Jan 31 '18 at 12:53

Yes, sorry for that. I have now posted the test input as text. How can I get the headers in the output files?

– user3351523
Jan 31 '18 at 12:57

1 Answer

Awk solution:

awk 'NR==1{ h=$0 } NR>1{ print (!a[$2]++? h ORS $0 : $0) > $2".txt" }' file

NR==1{ h=$0 } - capture the 1st line/record as the header line (NR holds the current record number, $0 contains the current line)

NR>1{ ... } - for all records except the first one:
- <cond>? <operand_1> : <operand_2> - classical ternary operator
- !a[$2]++? - true only on the 1st occurrence of the barcode value $2, which is used as a key of the associative array a
- h ORS $0 - the common header line concatenated with ORS (output record separator, defaults to \n) and the current record $0
- print ... > $2".txt" - print the header plus the current line (on the 1st occurrence) or just the current line into the file <barcode_value>.txt
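To see the !a[$2]++ idiom in isolation, here is a minimal sketch on made-up three-line input. The function name first_or_repeat and the sample values are only illustrative and not part of the command above. It labels each value of the 2nd column as seen for the first time or as a repeat:

```shell
# Illustrative helper (hypothetical name): label each 2nd-column value
# as "first" on its first occurrence and "repeat" afterwards.
first_or_repeat() {
  awk '{ print $2, (!a[$2]++ ? "first" : "repeat") }'
}

printf '%s\n' 'x A' 'y B' 'z A' | first_or_repeat
# A first
# B first
# A repeat
```

a[$2]++ returns the old count (0 on the first occurrence) and then increments it, so the negation is true exactly once per key.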

A more self-explanatory version of the same command:

awk 'NR==1 { header = $0; next }
 !header_printed[$2]++ { print header > $2".txt" }
 { print > $2".txt" }' < file

Viewing results:

$ head TCGA*.txt
==> TCGA-BD-A2L6-01A-11D-A20W-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10 7779 BCM GRCh38
USH2A TCGA-BD-A2L6-01A-11D-A20W-10 7399 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38

==> TCGA-G3-A7M5-01A-11D-A33Q-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
PRDM16 TCGA-G3-A7M5-01A-11D-A33Q-10 63976 BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10 55735 BCM GRCh38
HNRNPCL2 TCGA-G3-A7M5-01A-11D-A33Q-10 440563 BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10 84970 BCM GRCh38
NFYC TCGA-G3-A7M5-01A-11D-A33Q-10 4802 BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10 3652 BCM GRCh38

==> TCGA-O8-A75V-01A-11D-A32G-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38

To name each output file after only the first 15 characters of the barcode value:

awk 'NR==1{ h=$0 } NR>1{ print (!a[$2]++? h ORS $0 : $0) > substr($2, 1, 15)".txt" }' file
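If the input contains very many distinct barcodes, some awk implementations may run out of open file descriptors, because the commands above keep every output file open until awk exits. A possible sketch for that situation closes each file after every write and reopens it in append mode. The function name split_by_barcode, the scratch directory, and the shortened sample input are hypothetical and only serve the demonstration. The output name is built in parentheses as a precaution, since an unparenthesized concatenation after > is not portable across awk implementations.

```shell
# Hypothetical helper: split the file given as $1 by its 2nd column,
# closing each output file after every write.
split_by_barcode() {
  awk '
    NR == 1 { header = $0; next }          # remember the header line
    {
      out = ($2 ".txt")
      if (!seen[$2]++) {                   # first record for this barcode:
        print header > out                 # create/truncate and write header
        close(out)
      }
      print >> out                         # append the current record
      close(out)
    }
  ' "$1"
}

# Demo in a scratch directory with a shortened sample input.
cd "$(mktemp -d)" || exit 1
printf '%s\n' \
  'Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build' \
  'MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38' \
  'TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38' \
  'SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38' > file
split_by_barcode file
head TCGA*.txt
```

Reopening a file for every record is slower than keeping it open, so this variant is only worth it when the number of distinct barcodes is large.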

edited Jan 31 '18 at 13:53

answered Jan 31 '18 at 13:06

RomanPerekhrest

Thank you very much! Could you please explain the command?

– user3351523
Jan 31 '18 at 13:13

Please explain the command. Could you also tell me how to get output files named with only the first 15 characters of the sample names, like TCGA-BD-A2L6-01.txt, TCGA-G3-A7M5-01.txt and TCGA-O8-A75V-01.txt?

– user3351523
Jan 31 '18 at 13:31

@user3351523, yes, see my explanation

– RomanPerekhrest
Jan 31 '18 at 13:47
