How to split a single file into multiple files based on a column in linux?
I have a text file with the following information:
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10 7779 BCM GRCh38
USH2A TCGA-BD-A2L6-01A-11D-A20W-10 7399 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
PRDM16 TCGA-G3-A7M5-01A-11D-A33Q-10 63976 BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10 55735 BCM GRCh38
HNRNPCL2 TCGA-G3-A7M5-01A-11D-A33Q-10 440563 BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10 84970 BCM GRCh38
NFYC TCGA-G3-A7M5-01A-11D-A33Q-10 4802 BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10 3652 BCM GRCh38
As you can see, there are multiple samples. I want to split the file into multiple files based on the column "Tumor_Sample_Barcode". The output files need to be named samplename.txt.
First output - TCGA-BD-A2L6-01A-11D-A20W-10.txt
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10 7779 BCM GRCh38
USH2A TCGA-BD-A2L6-01A-11D-A20W-10 7399 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38
Second output - TCGA-O8-A75V-01A-11D-A32G-10.txt
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
Third output - TCGA-G3-A7M5-01A-11D-A33Q-10.txt
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
PRDM16 TCGA-G3-A7M5-01A-11D-A33Q-10 63976 BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10 55735 BCM GRCh38
HNRNPCL2 TCGA-G3-A7M5-01A-11D-A33Q-10 440563 BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10 84970 BCM GRCh38
NFYC TCGA-G3-A7M5-01A-11D-A33Q-10 4802 BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10 3652 BCM GRCh38
How can I do this in Linux?
linux files split
Thank you for the reply. But I don't see any headers in the output files. How to get the columns names also in the outputs?
– user3351523
Jan 31 '18 at 12:51
@user3351523, "headers" should be the next step. The first step should be posting testable input (as text, not as an image)
– RomanPerekhrest
Jan 31 '18 at 12:53
Yes, sorry for that. I have now posted the test table input as text. How to get the headers in the output files?
– user3351523
Jan 31 '18 at 12:57
asked Jan 31 '18 at 12:41 by user3351523, edited Jan 31 '18 at 12:56
1 Answer
Awk solution:
awk 'NR==1{ h=$0 }NR>1{ print (!a[$2]++? h ORS $0 : $0) > $2".txt" }' file
- NR==1{ h=$0 } - capture the 1st line/record as the header line (NR points to the record number, $0 contains the current line)
- NR>1{ ... } - for all records except the first one:
  - <cond>? <operand_1> : <operand_2> - classical ternary operator
  - !a[$2]++? - check for the 1st occurrence of the barcode value $2, used as a key of the associative array a
  - h ORS $0 - common header line concatenated with ORS (output record separator, defaults to \n) and the current record $0
  - print ... > $2".txt" - print custom content or the current line (if nothing was specified) into the file <barcode_value>.txt
Or a more self-explanatory version:
awk 'NR==1 {header = $0; next}
!header_printed[$2]++ {print header > $2".txt"}
{print > $2".txt"}' < file
Viewing results:
$ head TCGA*.txt
==> TCGA-BD-A2L6-01A-11D-A20W-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10 7779 BCM GRCh38
USH2A TCGA-BD-A2L6-01A-11D-A20W-10 7399 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38
==> TCGA-G3-A7M5-01A-11D-A33Q-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
PRDM16 TCGA-G3-A7M5-01A-11D-A33Q-10 63976 BCM GRCh38
DNAJC11 TCGA-G3-A7M5-01A-11D-A33Q-10 55735 BCM GRCh38
HNRNPCL2 TCGA-G3-A7M5-01A-11D-A33Q-10 440563 BCM GRCh38
C1orf94 TCGA-G3-A7M5-01A-11D-A33Q-10 84970 BCM GRCh38
NFYC TCGA-G3-A7M5-01A-11D-A33Q-10 4802 BCM GRCh38
IPP TCGA-G3-A7M5-01A-11D-A33Q-10 3652 BCM GRCh38
==> TCGA-O8-A75V-01A-11D-A32G-10.txt <==
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
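The listing above can be reproduced from scratch with a minimal sketch like the following. It assumes a POSIX shell with awk and mktemp available; the scratch directory, the shortened sample table, and the names workdir and out are illustrative and not part of the original answer. The output name is built in a variable first, because an unparenthesized concatenation after > is reportedly parsed differently by some awk implementations.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A shortened copy of the sample table from the question (rows deliberately interleaved).
cat > file <<'EOF'
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
SLC30A1 TCGA-BD-A2L6-01A-11D-A20W-10 7779 BCM GRCh38
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
PRDM16 TCGA-G3-A7M5-01A-11D-A33Q-10 63976 BCM GRCh38
FLG TCGA-O8-A75V-01A-11D-A32G-10 2312 BCM GRCh38
EOF

awk '
NR == 1 { header = $0; next }                  # remember the header, then skip it
{ out = $2 ".txt" }                            # output name taken from column 2
!header_printed[$2]++ { print header > out }   # first record seen for this barcode
{ print > out }                                # every data line goes to its sample file
' file

# Show what was written.
head TCGA*.txt
```

Within a single awk run, > truncates a file only when it is first opened, so interleaved rows for the same barcode still end up in one file.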
To name each file after the first 15 characters of the barcode value (e.g. TCGA-BD-A2L6-01.txt):
awk 'NR==1{ h=$0 }NR>1{ print (!a[$2]++? h ORS $0 : $0) > substr($2, 1, 15)".txt" }' file
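If the input contains many distinct barcodes, some awk implementations may run into a limit on simultaneously open files. A hedged sketch of one workaround, combined with the 15-character file names: write the header with > once per output file, append data lines with >>, and close() the file after each write. The names workdir, out, and started are illustrative; closing after every line trades speed for safety and is an assumption of this sketch, not part of the original answer.

```shell
# Scratch directory and a tiny illustrative input (not the full table).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > file <<'EOF'
Hugo_Symbol Tumor_Sample_Barcode Entrez_Gene_Id Center NCBI_Build
MTHFR TCGA-BD-A2L6-01A-11D-A20W-10 4524 BCM GRCh38
TMEM51 TCGA-O8-A75V-01A-11D-A32G-10 55092 BCM GRCh38
SOS1 TCGA-BD-A2L6-01A-11D-A20W-10 6654 BCM GRCh38
EOF

awk '
NR == 1 { header = $0; next }          # remember the header, then skip it
{
    out = substr($2, 1, 15) ".txt"     # e.g. TCGA-BD-A2L6-01.txt
    if (!(out in started)) {           # first record for this output file
        started[out] = 1
        print header > out             # ">" creates or truncates the file
        close(out)
    }
    print >> out                       # ">>" appends below the header
    close(out)                         # keep at most one output file open
}
' file

ls
```

Because the header is written with >, rerunning the command replaces earlier output files instead of appending to them.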
Thank you very much !! Could you please explain the command?
– user3351523
Jan 31 '18 at 13:13
Please explain the command and could you also tell me how to get the output files with only 0-15 substring in the sample names like TCGA-BD-A2L6-01.txt, TCGA-G3-A7M5-01.txt and TCGA-O8-A75V-01.txt
– user3351523
Jan 31 '18 at 13:31
@user3351523, yes, see my explanation
– RomanPerekhrest
Jan 31 '18 at 13:47
answered Jan 31 '18 at 13:06 by RomanPerekhrest, edited Jan 31 '18 at 13:53