How to concatenate RNA-seq files generated in different lanes [closed]

I have very large RNA-seq files generated in different lanes. A few of the file names are shown below.

MC9_FNEN_638A_S19_L008_R1_001.fastq.gz
MC9_FNEN_638A_S19_L008_R2_001.fastq.gz
MC9_FNEN_638A_S9_L001_R1_001.fastq.gz
MC9_FNEN_638A_S9_L001_R2_001.fastq.gz
MC9_FNEN_638A_S9_L002_R1_001.fastq.gz
MC9_FREN_638A_S9_L002_R2_001.fastq.gz
MC9_FREN_638A_S9_L006_R1_001.fastq.gz
MC9_FREN_638A_S9_L006_R2_001.fastq.gz
MC9_FREN_638A_S9_L008_R1_001.fastq.gz
MC9_FREN_638A_S9_L008_R2_001.fastq.gz
MC9_ZH_637A_S74_L001_R1_001.fastq.gz
MC9_ZH_637A_S74_L001_R2_001.fastq.gz
MC9_ZH_637A_S74_L003_R1_001.fastq.gz
MC9_ZH_637A_S74_L003_R2_001.fastq.gz
MC9_ZH_637A_S74_L007_R1_001.fastq.gz
MC9_ZH_637A_S74_L007_R2_001.fastq.gz
MC9_ZH_637A_S74_L008_R1_001.fastq.gz
MC9_ZH_637A_S74_L008_R2_001.fastq.gz
MC9_ZH_637A_S84_L008_R1_001.fastq.gz
MC9_ZH_637A_S84_L008_R2_001.fastq.gz
DR14_DCRP_479C_S50_L001_R1_001.fastq.gz
DR14_DCRP_479C_S50_L001_R2_001.fastq.gz
DR14_DCRP_479C_S50_L002_R1_001.fastq.gz
DR14_DCRP_479C_S50_L002_R2_001.fastq.gz
DR14_DCRP_479C_S50_L006_R1_001.fastq.gz
DR14_DCRP_479C_S50_L006_R2_001.fastq.gz
DR14_DCRP_479C_S50_L007_R1_001.fastq.gz
DR14_DCRP_479C_S50_L007_R2_001.fastq.gz
DR14_DCRP_479C_S50_L008_R1_001.fastq.gz
DR14_DCRP_479C_S50_L008_R2_001.fastq.gz

I want to concatenate the sequences generated in different lanes, separately for the forward and the reverse reads. For example, the first 10 lines are sequence files from the same animal and tissue (MC9_FREN). I want to concatenate all forward reads (XXXXX_R1_001.fastq.gz) from the different lanes into a file named MC9_FREN_R1.fastq.gz, and all reverse reads (XXXXX_R2_001.fastq.gz) into MC9_FREN_R2.fastq.gz. Written out by hand, the commands would be:

cat MC9_FREN_638A_S19_L008_R1_001.fastq.gz MC9_FREN_638A_S9_L001_R1_001.fastq.gz MC9_FREN_638A_S9_L002_R1_001.fastq.gz MC9_FREN_638A_S9_L007_R1_001.fastq.gz MC9_FREN_638A_S9_L008_R1_001.fastq.gz > MC9_FREN_R1.fastq.gz
cat MC9_FREN_638A_S19_L008_R2_001.fastq.gz MC9_FREN_638A_S9_L001_R2_001.fastq.gz MC9_FREN_638A_S9_L002_R2_001.fastq.gz MC9_FREN_638A_S9_L007_R2_001.fastq.gz MC9_FREN_638A_S9_L008_R2_001.fastq.gz > MC9_FREN_R2.fastq.gz
cat MC9_ZH_637A_S74_L001_R1_001.fastq.gz MC9_ZH_637A_S74_L003_R1_001.fastq.gz MC9_ZH_637A_S74_L007_R1_001.fastq.gz MC9_ZH_637A_S74_L008_R1_001.fastq.gz MC9_ZH_637A_S84_L008_R1_001.fastq.gz > MC9_ZH_R1.fastq.gz
cat MC9_ZH_637A_S74_L001_R2_001.fastq.gz MC9_ZH_637A_S74_L003_R2_001.fastq.gz MC9_ZH_637A_S74_L007_R2_001.fastq.gz MC9_ZH_637A_S74_L008_R2_001.fastq.gz MC9_ZH_637A_S84_L008_R2_001.fastq.gz > MC9_ZH_R2.fastq.gz
cat DR14_DCRP_479C_S50_L001_R1_001.fastq.gz DR14_DCRP_479C_S50_L002_R1_001.fastq.gz DR14_DCRP_479C_S50_L006_R1_001.fastq.gz DR14_DCRP_479C_S50_L007_R1_001.fastq.gz DR14_DCRP_479C_S50_L008_R1_001.fastq.gz > DR14_DCRP_R1.fastq.gz
cat DR14_DCRP_479C_S50_L001_R2_001.fastq.gz DR14_DCRP_479C_S50_L002_R2_001.fastq.gz DR14_DCRP_479C_S50_L006_R2_001.fastq.gz DR14_DCRP_479C_S50_L007_R2_001.fastq.gz DR14_DCRP_479C_S50_L008_R2_001.fastq.gz > DR14_DCRP_R2.fastq.gz

edited Apr 12 at 15:51

asked Apr 10 at 12:57

desu

544

closed as unclear what you're asking by Kiwy, Jeff Schaller, Timothy Martin, roaima, Eliah Kagan Apr 10 at 18:23

Please clarify your specific problem or add additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. See the How to Ask page for help clarifying this question. If this question can be reworded to fit the rules in the help center, please edit the question.

3

Yes? What is the question? You have now concatenated the files. Is that the type of "merging" that you want to do?
– Kusalananda
Apr 10 at 13:01

I have a very large number of files, so writing out each file name by hand would be time-consuming. Could you write a command line that uses a regular expression?
– desu
Apr 10 at 13:05

2

What is the logic behind which files should be concatenated?
– Kusalananda
Apr 10 at 13:06

For example, the first 10 lines are sequence files from the same animal and tissue (MC9_FREN). I want to merge all XXXXX_R1_001.fastq.gz files into MC9_FREN_R1.fastq.gz and all XXXXX_R2_001.fastq.gz files into MC9_FREN_R2.fastq.gz.
– desu
Apr 10 at 13:12

1

I don't think you are using the word "merge" in the way that we, as computing people, would expect. Please update your question to provide a short worked example of what you are trying to achieve.
– roaima
Apr 10 at 13:27


2 Answers

accepted

The following loop gives us the unique filename prefixes of the FastQ files in the current directory. It relies on the fact that there will always be four underscores (_) between the filename prefix that we want and the R1 or R2 later in the filename.

for name in *.fastq.gz; do
    printf '%s\n' "${name%_*_*_*_R[12]*}"
done | uniq

The following is equivalent, but does not use a loop (rather than deleting the last bit of the filename, this keeps the first bit of the filename):

printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq

With the given list of files, either of the above returns

DR14_DCRP
MC9_FNEN
MC9_FREN
MC9_ZH
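To see what the parameter expansion does to a single name, here is a minimal sketch. It assumes the naming scheme from the question; the function name sample_prefix is hypothetical and only used for illustration.

```shell
#!/bin/sh
# sample_prefix is a hypothetical helper: it strips the shortest suffix
# matching _*_*_*_R[12]* (ID, sample number, lane, read and everything after).
sample_prefix() {
    printf '%s\n' "${1%_*_*_*_R[12]*}"
}

sample_prefix MC9_ZH_637A_S74_L001_R1_001.fastq.gz
sample_prefix DR14_DCRP_479C_S50_L008_R2_001.fastq.gz
```

The two calls print MC9_ZH and DR14_DCRP. A name with fewer than four underscores before the read field is left unchanged, which is a quick way to spot files that do not follow the scheme.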

We then read these prefixes and create our concatenated files:

for name in *.fastq.gz; do
    printf '%s\n' "${name%_*_*_*_R[12]*}"
done | uniq |
while read -r prefix; do
    cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
    cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
done

or, using the sed code from above,

printf '%s\n' *.fastq.gz | sed 's/^\([^_]*_[^_]*\).*/\1/' | uniq |
while read -r prefix; do
    cat "$prefix"*R1*.fastq.gz >"${prefix}_R1.fastq.gz"
    cat "$prefix"*R2*.fastq.gz >"${prefix}_R2.fastq.gz"
done

None of the code above uses bash-specific (or GNU-specific) features, so it should work in any POSIX shell.
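Plain cat is enough for compressed input because a gzip file may consist of several compressed members back to back, and gzip -d decompresses them in sequence. The following is a small self-contained check rather than part of the solution: the file names and read contents are made up, and it relies on mktemp and gzip, which are widely available but not part of POSIX.

```shell
#!/bin/sh
# Demonstration only: lane1/lane2 and the reads in them are invented.
tmp=$(mktemp -d) || exit 1
trap 'rm -rf "$tmp"' EXIT

printf '@read1\nACGT\n+\nIIII\n' | gzip -c >"$tmp/lane1.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c >"$tmp/lane2.fastq.gz"

# Concatenate the compressed files directly, without decompressing first.
cat "$tmp/lane1.fastq.gz" "$tmp/lane2.fastq.gz" >"$tmp/merged.fastq.gz"

# The result decompresses to the records of both inputs, in order.
merged=$(gzip -dc "$tmp/merged.fastq.gz")
printf '%s\n' "$merged"
```

The merged file decompresses to the read1 record followed by the read2 record, so the lanes do not need to be decompressed and recompressed.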

UPDATE: I work with bioinformaticians, and a colleague of mine commented:

One should not just simply merge fastq files... In an ideal world, one should map each lane separately, adding an appropriate RG, and then merge the BAM files. Because lane-specific effects exist, etc. It can be more or less important, depending on the downstream application of course.

For questions about this, please refer to the Bioinformatics Stack Exchange site.
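If each lane is to be mapped separately, a first step is to pair up the R1 and R2 file of every lane instead of merging them. A minimal sketch, assuming the naming scheme from the question; mate_of and list_lane_pairs are hypothetical helpers, and the mapping step itself is left out because it depends on the aligner.

```shell
#!/bin/sh
# mate_of is a hypothetical helper: given an R1 file name, print the name
# of the matching R2 file by replacing the last "_R1_" with "_R2_".
mate_of() {
    printf '%s_R2_%s\n' "${1%_R1_*}" "${1##*_R1_}"
}

# Print one "sample lane R1 R2" line per lane found in the current
# directory; R1 files without a mate are reported on standard error.
list_lane_pairs() {
    for r1 in *_R1_*.fastq.gz; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        r2=$(mate_of "$r1")
        sample=${r1%_*_*_*_R1_*}          # e.g. MC9_ZH
        lane=${r1%_R1_*}                  # drop "_R1_001.fastq.gz"
        lane=${lane##*_}                  # keep the last field, e.g. L001
        if [ -e "$r2" ]; then
            printf '%s %s %s %s\n' "$sample" "$lane" "$r1" "$r2"
        else
            printf 'missing mate for %s\n' "$r1" >&2
        fi
    done
}

list_lane_pairs
```

Each output line carries the sample prefix and lane, which is the information a read group would typically be built from.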

edited Apr 12 at 15:57

answered Apr 10 at 13:23

Kusalananda

102k

Bash solution:

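for f in *.fastq.gz; do
    # Skip names that do not match, so BASH_REMATCH is never stale or empty.
    [[ "$f" =~ ^([^_]+_[^_]+)_.*(_[^_]+)_[0-9]+\.fastq\.gz$ ]] || continue
    # Append to <prefix>_R1.fastq.gz or <prefix>_R2.fastq.gz.
    cat "$f" >> "${BASH_REMATCH[1]}${BASH_REMATCH[2]}.fastq.gz"
done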

^([^_]+_[^_]+)_.*(_[^_]+)_[0-9]+\.fastq\.gz$ is the crucial regex pattern: it captures the first two underscore-separated fields into the 1st capture group (e.g. MC9_FREN) and the read suffix into the 2nd capture group (e.g. _R1).
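Because the loop appends with >>, running it twice would write every read twice, so it can help to preview which output each input maps to before concatenating anything. A minimal POSIX sketch of the same pattern using sed; target_name is a hypothetical helper and not part of the answer above.

```shell
#!/bin/sh
# target_name is a hypothetical helper: it prints the output file that a
# given input would be appended to, or nothing if the name does not match.
target_name() {
    printf '%s\n' "$1" |
        sed -n 's/^\([^_][^_]*_[^_][^_]*\)_.*\(_[^_][^_]*\)_[0-9][0-9]*\.fastq\.gz$/\1\2.fastq.gz/p'
}

# Preview only: nothing is read or written.
for f in *.fastq.gz; do
    out=$(target_name "$f")
    if [ -n "$out" ]; then
        printf '%s -> %s\n' "$f" "$out"
    fi
done
```

Already merged files such as MC9_FREN_R1.fastq.gz do not match the pattern, so they are not listed as inputs.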

answered Apr 10 at 13:18

RomanPerekhrest

22.4k
