How to concatenate RNA-seq files generated in different lanes [closed]
I have very large RNA-seq files generated in different lanes. A few of the file names are shown below.
MC9_FNEN_638A_S19_L008_R1_001.fastq.gz
MC9_FNEN_638A_S19_L008_R2_001.fastq.gz
MC9_FNEN_638A_S9_L001_R1_001.fastq.gz
MC9_FNEN_638A_S9_L001_R2_001.fastq.gz
MC9_FNEN_638A_S9_L002_R1_001.fastq.gz
MC9_FREN_638A_S9_L002_R2_001.fastq.gz
MC9_FREN_638A_S9_L006_R1_001.fastq.gz
MC9_FREN_638A_S9_L006_R2_001.fastq.gz
MC9_FREN_638A_S9_L008_R1_001.fastq.gz
MC9_FREN_638A_S9_L008_R2_001.fastq.gz
MC9_ZH_637A_S74_L001_R1_001.fastq.gz
MC9_ZH_637A_S74_L001_R2_001.fastq.gz
MC9_ZH_637A_S74_L003_R1_001.fastq.gz
MC9_ZH_637A_S74_L003_R2_001.fastq.gz
MC9_ZH_637A_S74_L007_R1_001.fastq.gz
MC9_ZH_637A_S74_L007_R2_001.fastq.gz
MC9_ZH_637A_S74_L008_R1_001.fastq.gz
MC9_ZH_637A_S74_L008_R2_001.fastq.gz
MC9_ZH_637A_S84_L008_R1_001.fastq.gz
MC9_ZH_637A_S84_L008_R2_001.fastq.gz
DR14_DCRP_479C_S50_L001_R1_001.fastq.gz
DR14_DCRP_479C_S50_L001_R2_001.fastq.gz
DR14_DCRP_479C_S50_L002_R1_001.fastq.gz
DR14_DCRP_479C_S50_L002_R2_001.fastq.gz
DR14_DCRP_479C_S50_L006_R1_001.fastq.gz
DR14_DCRP_479C_S50_L006_R2_001.fastq.gz
DR14_DCRP_479C_S50_L007_R1_001.fastq.gz
DR14_DCRP_479C_S50_L007_R2_001.fastq.gz
DR14_DCRP_479C_S50_L008_R1_001.fastq.gz
DR14_DCRP_479C_S50_L008_R2_001.fastq.gz
I want to concatenate all the sequences generated in different lanes, separately for the forward and the reverse reads. For example, the first 10 lines are sequence files from the same animal and specific tissue (MC9_FREN). I want to concatenate all the forward reads (XXXXX_R1_001.fastq.gz) that were generated in different lanes into the file MC9_FREN_R1.fastq.gz, and all the reverse reads (XXXX_R2_001.fastq.gz) into MC9_FREN_R2.fastq.gz:
cat MC9_FREN_638A_S19_L008_R1_001.fastq.gz MC9_FREN_638A_S9_L001_R1_001.fastq.gz MC9_FREN_638A_S9_L002_R1_001.fastq.gz MC9_FREN_638A_S9_L007_R1_001.fastq.gz MC9_FREN_638A_S9_L008_R1_001.fastq.gz > MC9_FREN_R1.fastq.gz
cat MC9_FREN_638A_S19_L008_R2_001.fastq.gz MC9_FREN_638A_S9_L001_R2_001.fastq.gz MC9_FREN_638A_S9_L002_R2_001.fastq.gz MC9_FREN_638A_S9_L007_R2_001.fastq.gz MC9_FREN_638A_S9_L008_R2_001.fastq.gz > MC9_FREN_R2.fastq.gz
cat MC9_ZH_637A_S74_L001_R1_001.fastq.gz MC9_ZH_637A_S74_L003_R1_001.fastq.gz MC9_ZH_637A_S74_L007_R1_001.fastq.gz MC9_ZH_637A_S74_L008_R1_001.fastq.gz MC9_ZH_637A_S84_L008_R1_001.fastq.gz > MC9_ZH_R1.gz
cat MC9_ZH_637A_S74_L001_R2_001.fastq.gz MC9_ZH_637A_S74_L003_R2_001.fastq.gz MC9_ZH_637A_S74_L007_R2_001.fastq.gz MC9_ZH_637A_S74_L008_R2_001.fastq.gz MC9_ZH_637A_S84_L008_R2_001.fastq.gz > MC9_ZH_R2.gz
cat DR14_DCRP_479C_S50_L001_R1_001.fastq.gz DR14_DCRP_479C_S50_L002_R1_001.fastq.gz DR14_DCRP_479C_S50_L006_R1_001.fastq.gz DR14_DCRP_479C_S50_L007_R1_001.fastq.gz DR14_DCRP_479C_S50_L008_R1_001.fastq.gz > DR14_DCRP_R1.gz
cat DR14_DCRP_479C_S50_L001_R2_001.fastq.gz DR14_DCRP_479C_S50_L002_R2_001.fastq.gz DR14_DCRP_479C_S50_L006_R2_001.fastq.gz DR14_DCRP_479C_S50_L007_R2_001.fastq.gz DR14_DCRP_479C_S50_L008_R2_001.fastq.gz > DR14_DCRP_R2.gz
linux bioinformatics
edited Apr 12 at 15:51
asked Apr 10 at 12:57
desu
closed as unclear what you're asking by Kiwy, Jeff Schaller, Timothy Martin, roaima, Eliah Kagan Apr 10 at 18:23
Please clarify your specific problem or add additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.
Yes? What is the question? You have now concatenated the files. Is that the type of "merging" that you want to do?
– Kusalananda Apr 10 at 13:01
I have a very large number of files, and writing each file name manually would be time-consuming. I wonder if you could write a command line using a regular expression.
– desu Apr 10 at 13:05
What is the logic behind what files should be concatenated?
– Kusalananda Apr 10 at 13:06
For example, the first 10 lines are sequence files from the same animal and specific tissue (MC9_PREN). I want to merge all XXXXX_R1_001.fastq.gz and put them in the file MC9_PREN_R1.fastq.gz, and all XXXX_R2_001.fastq.gz into MC9_PREN_R2.fastq.gz
– desu Apr 10 at 13:12
I don't think you are using the word "merge" in the way that we, as computing people, would expect. Please update your question to provide a short worked example of what you are trying to achieve.
– roaima Apr 10 at 13:27
2 Answers
Accepted answer (score 2)
The following loop gives us the unique filename prefixes of the FastQ files in the current directory. It relies on the fact that there will always be four underscores (_) between the filename prefix that we want and the R1 or R2 later in the filename.
for name in *.fastq.gz; do
    printf '%s\n' "${name%_*_*_*_R[12]*}"
done | uniq
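To see what the parameter expansion in this loop does, here is a small sketch on a single file name taken from the question. No files need to exist for it to run; the variable `prefix` is introduced here only for illustration.

```shell
#!/bin/sh
# One file name from the question (the file itself is not needed).
name=MC9_FREN_638A_S9_L002_R2_001.fastq.gz

# ${name%pattern} removes the shortest suffix that matches the pattern.
# The pattern needs four underscores, the last one directly before R1 or R2,
# so everything from "_638A" onwards is removed.
prefix=${name%_*_*_*_R[12]*}

printf '%s\n' "$prefix"    # prints MC9_FREN
```

Because the shell expands `*.fastq.gz` in sorted order, identical prefixes end up on adjacent lines, which is what `uniq` needs.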
The following is equivalent, but does not use a loop (rather than deleting the last bit of the filename, this keeps the first bit of the filename):
printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq
With the given list of files, either of the above returns
DR14_DCRP
MC9_FNEN
MC9_FREN
MC9_ZH
We then read these prefixes and create our concatenated files:
for name in *.fastq.gz; do
    printf '%s\n' "${name%_*_*_*_R[12]*}"
done | uniq |
while IFS= read -r prefix; do
    cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
    cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
done
or, using the sed code from above,
printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq |
while IFS= read -r prefix; do
    cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
    cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
done
None of the code above uses bash-specific (or GNU-specific) features, so it should work in any POSIX shell.
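Concatenating the compressed files directly with cat works because a gzip file may consist of several compressed members back to back, and gzip decompresses them as one stream. A minimal sketch to check this on throwaway data, assuming gzip and mktemp are available (mktemp is not part of POSIX); the names lane1.fastq.gz, lane2.fastq.gz and merged.fastq.gz are made up for the demonstration:

```shell
#!/bin/sh
# Work in a scratch directory so that no real data is touched.
tmpdir=$(mktemp -d) || exit 1
cd "$tmpdir" || exit 1

# Two tiny stand-ins for per-lane FASTQ files (one 4-line record each).
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > lane1.fastq.gz
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c > lane2.fastq.gz

# Concatenate the compressed files without decompressing them.
cat lane1.fastq.gz lane2.fastq.gz > merged.fastq.gz

# The merged file is still valid gzip and holds both records.
gzip -t merged.fastq.gz && echo "merged file is valid gzip"
lines=$(gzip -dc merged.fastq.gz | wc -l)
echo "records: $((lines / 4))"    # prints records: 2

# Clean up the scratch directory.
cd / && rm -rf "$tmpdir"
```

The same line-count comparison (sum of the per-lane files against the merged file) is a cheap sanity check on real data, although decompressing very large files takes a while.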
UPDATE: I work with bioinformaticians, and a colleague of mine commented:
One should not just simply merge fastq files... In an ideal world, one should map each lane separately, adding an appropriate RG, and then merge the BAM files. Because lane-specific effects exist, etc. It can be more or less important, depending on the downstream application of course.
For questions about this, please refer to the Bioinformatics Stack Exchange site.
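If you follow the per-lane route described in the quoted comment, each lane's alignment needs its own read group (RG). As a rough sketch only: the sample prefix and the lane can be pulled out of the file names from the question with standard parameter expansion. Which RG fields to set, and how to pass them on, depends on the aligner, so the ID/SM/PU layout below is an assumption rather than a recommendation, and the variables `sample`, `lane` and `rg` are hypothetical names.

```shell
#!/bin/sh
# One file name from the question (the file itself is not needed).
name=DR14_DCRP_479C_S50_L007_R1_001.fastq.gz

# Sample prefix: the same expansion as in the loop above.
sample=${name%_*_*_*_R[12]*}    # DR14_DCRP

# Lane: drop everything from "_R1_"/"_R2_" onwards, then keep the last field.
tmp=${name%_R[12]_*}            # DR14_DCRP_479C_S50_L007
lane=${tmp##*_}                 # L007

# Hypothetical tab-separated read-group line; adjust the fields to your aligner.
rg=$(printf '@RG\tID:%s.%s\tSM:%s\tPU:%s' "$sample" "$lane" "$sample" "$lane")
printf '%s\n' "$rg"
```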
edited Apr 12 at 15:57
answered Apr 10 at 13:23
Kusalananda
Answer (score 1)
Bash solution:
for f in *.fastq.gz; do
    # Skip any file whose name does not follow the expected pattern.
    [[ "$f" =~ ^([^_]+_[^_]+)_.*(_[^_]+)_[0-9]+\.fastq\.gz$ ]] || continue
    cat "$f" >> "${BASH_REMATCH[1]}${BASH_REMATCH[2]}.fastq.gz"
done
^([^_]+_[^_]+)_.*(_[^_]+)_[0-9]+\.fastq\.gz$ - the crucial regex pattern. It captures the first two underscore-separated fields into the 1st capture group (e.g. MC9_FREN) and the R-named suffix into the 2nd capture group (e.g. _R1).
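A small sketch of how the two capture groups come out for one file name from the question. It needs bash, since `[[ ... =~ ... ]]` and `BASH_REMATCH` are bash features; the variables `re` and `out` are introduced here only for illustration.

```shell
#!/usr/bin/env bash
# One file name from the question (the file itself is not needed).
name=MC9_ZH_637A_S74_L003_R1_001.fastq.gz

# Keeping the regex in a variable avoids quoting surprises inside [[ ]].
re='^([^_]+_[^_]+)_.*(_[^_]+)_[0-9]+\.fastq\.gz$'

if [[ $name =~ $re ]]; then
    # Group 1 is the sample prefix, group 2 is the read tag.
    out="${BASH_REMATCH[1]}${BASH_REMATCH[2]}.fastq.gz"
    printf '%s\n' "$out"    # prints MC9_ZH_R1.fastq.gz
else
    printf 'no match: %s\n' "$name" >&2
fi
```

Note that the loop appends with `>>`, so running it a second time in the same directory adds every lane to the merged files again; removing the old merged files first avoids that.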
answered Apr 10 at 13:18
RomanPerekhrest