concatenating multiple fastq files

I have a folder with almost 100 files, organized in groups of 16 files each. I need to concatenate each of the 16 files of each group into a single file. For example, one group of file names is:



randomString_$groupName- 


I have a folder with almost 100 samples. The samples were run on a NextSeq 500 and are single-stranded. Each sample was run on 4 flow cells, and the NextSeq 500 has 4 lanes, so 16 fastq files are generated per sample (see the example below). I want to concatenate all the files of a sample into one output file, named for example 102697-001-001_R1.fastq.gz.



HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz

HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz

HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz

HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L001_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L002_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L003_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L004_R1.fastq.gz


All of the files above should be concatenated into a single file named 102697-001-001_R1.fastq.gz (so keeping the string between the two first _ and after the last _ as the name).



I have tried:



$ cat HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
      HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
      HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
      HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
      HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
      HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
      HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
      HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
      HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
      HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
      HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
      HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
      HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L001_R1.fastq.gz \
      HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L002_R1.fastq.gz \
      HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L003_R1.fastq.gz \
      HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L004_R1.fastq.gz > 102697-001-001_R1.fastq.gz


and it works, but as I have a lot of files, I don't want to do this manually.










  • Are these files compressed by regular gzip, or are they compressed by bgzip and indexed with tabix? (i.e. do you also have to regenerate any Tabix indexes?)
    – Kusalananda
    Sep 26 '17 at 8:16










  • Yes they are just regular .gzip files.
    – H.K
    Sep 26 '17 at 8:20







  • You might want to ask this sort of question over on Bioinformatics next time.
    – terdon♦
    Sep 26 '17 at 8:39














files merge bioinformatics






asked Sep 26 '17 at 8:03 by H.K; edited May 1 at 11:04 by Jeff Schaller











1 Answer (accepted)

for name in ./*.fastq.gz; do
    rnum=${name##*_}
    rnum=${rnum%%.*}

    sample=${name#*_}
    sample=${sample%%_*}

    cat "$name" >>"${sample}_$rnum.fastq.gz"
done


This would iterate over all compressed Fastq files in the current directory and extract the sample name into the shell variable sample. For all the filenames shown in the question, this would be 102697-001-001.



The rnum variable will hold the R# bit at the end of the filename.



The sample name is extracted by taking the filename and first removing everything up to and including the first _ character, and then removing everything after and including the first _ character from that result. The value for the rnum variable is extracted in a similar manner.
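
To make the individual substitutions easier to follow, here is a small sketch that applies them step by step to one of the file names from the question. No real file is needed, since only the string is manipulated. The `${...}` braces are required for these parameter expansions.

```sh
# One file name from the question, used here purely as a string.
name=./HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz

rnum=${name##*_}       # remove longest prefix ending in "_"     -> R1.fastq.gz
rnum=${rnum%%.*}       # remove longest suffix starting with "." -> R1

sample=${name#*_}      # remove shortest prefix ending in "_"    -> 102697-001-001_ATTACTCG-...
sample=${sample%%_*}   # remove longest suffix starting with "_" -> 102697-001-001

printf '%s\n' "${sample}_$rnum.fastq.gz"   # prints 102697-001-001_R1.fastq.gz
```

In `"${sample}_$rnum"`, the braces around `sample` matter: without them the shell would look for a variable called `sample_`.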



The file is then simply appended onto the end of the aggregated file using cat >>.
The output filename will be constructed from the sample name, the R#, and the string .fastq.gz. For the shown files, this will be 102697-001-001_R1.fastq.gz.
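
One consequence of appending with `>>` is that running the loop a second time would append every input file again to the outputs of the first run. The following is only a sketch of one possible guard, not part of the original answer: it assumes a hypothetical helper function `merge_fastq` that writes into a separate output directory and refuses to run if that directory already exists.

```sh
# merge_fastq INDIR OUTDIR
# Hypothetical helper (an assumption for illustration): concatenates the
# *.fastq.gz files in INDIR per sample and read number into OUTDIR.
merge_fastq() {
    indir=$1
    outdir=$2

    # mkdir fails if OUTDIR already exists, so stale output is never appended to.
    mkdir "$outdir" || return 1

    for name in "$indir"/*.fastq.gz; do
        [ -e "$name" ] || continue      # the glob matched nothing

        base=${name##*/}                # strip the directory part
        rnum=${base##*_}
        rnum=${rnum%%.*}
        sample=${base#*_}
        sample=${sample%%_*}

        cat "$name" >>"$outdir/${sample}_$rnum.fastq.gz"
    done
}
```

It could be called as `merge_fastq . ./merged`, where `./merged` is an arbitrary directory name chosen for this example. Because the merged files land in a subdirectory, they are also not picked up by the `*.fastq.gz` glob on a later run.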



Gzip-compressed files do not have to be uncompressed in order to concatenate them. Uncompressing the resulting file will give you the uncompressed concatenation of all Fastq files.
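
This property of gzip can be checked on a small scale. The sketch below uses two made-up single-record Fastq files in a temporary directory (the file names and read contents are invented for the demonstration), concatenates the compressed files with `cat`, and decompresses the result.

```sh
tmpdir=$(mktemp -d)

# Two tiny, made-up Fastq records, each compressed separately.
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$tmpdir/a.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c > "$tmpdir/b.fastq.gz"

# Concatenate the compressed files without decompressing them first.
cat "$tmpdir/a.fastq.gz" "$tmpdir/b.fastq.gz" > "$tmpdir/merged.fastq.gz"

# Decompressing the result yields both records, in order.
gzip -dc "$tmpdir/merged.fastq.gz"
```

The merged file contains two gzip members back to back, and `gzip -dc` decompresses them one after the other.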




Alternative way of doing this with bash, using a regular expression to figure out the output file name:



for name in ./*.fastq.gz; do
    if [[ "$name" =~ _([0-9-]+)_.*(..)\.fastq\.gz ]]; then
        outfile="${BASH_REMATCH[1]}_${BASH_REMATCH[2]}.fastq.gz"

        cat "$name" >>"$outfile"
    fi
done


The filename is matched against the regular expression



_([0-9-]+)_.*(..)\.fastq\.gz


The two groups (bits in parentheses) will pick out the relevant parts of the filename for us. The first group captures a string that only consists of characters that are either digits or dashes. This group needs to be surrounded by _ on either side. The only place in the filename that this bit matches is the sample name.



After the first group, and the _ after it, we allow for any number of any characters (.*) up to the (..)\.fastq\.gz bit. The \.fastq\.gz will match the .fastq.gz string at the end of the filename, so the last group, (..), captures the R1 immediately before that (the . pattern will match any one character, while \. will match a literal dot).



The two captured groups are stored as index 1 and 2 in the BASH_REMATCH array (the name is short for "Bash Regular Expression Match"), and we use these in constructing the output file name.
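
As a quick check of what the two groups capture, the sketch below matches a single file name from the question against the expression and prints the resulting output name. It needs bash, since `[[ ... =~ ... ]]` and `BASH_REMATCH` are bash features. Storing the expression in a variable (here called `re`, a name chosen for this example) avoids quoting problems.

```bash
#!/bin/bash
# One file name from the question, used here purely as a string.
name=./HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz

# The backslashes make the dots literal; an unescaped dot matches any character.
re='_([0-9-]+)_.*(..)\.fastq\.gz'

if [[ $name =~ $re ]]; then
    # BASH_REMATCH[1] is the sample name, BASH_REMATCH[2] the two characters
    # immediately before ".fastq.gz".
    outfile="${BASH_REMATCH[1]}_${BASH_REMATCH[2]}.fastq.gz"
    printf '%s\n' "$outfile"   # prints 102697-001-001_R1.fastq.gz
fi
```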






  • @H.K If this answer solved your issue, please take a moment and accept it by clicking on the check mark to the left. That will mark the question as answered and is the way thanks are expressed on the Stack Exchange sites.
    – terdon♦
    Sep 26 '17 at 8:51










  • @kusalananda Can you kindly give a description of the bash code?
    – H.K
    Sep 26 '17 at 9:11










  • @H.K The thing I added at the end? I will add an explanation.
    – Kusalananda
    Sep 26 '17 at 9:11










  • @Kusalananda I think there is some problem with this code: for name in ./*.fastq.gz; do rnum=${name##*_}; rnum=${rnum%%.*}; sample=${name#*_}; sample=${sample%%_*}; cat "$name" >"${sample}_$rnum.fastq.gz"; done. When I concatenate just the first sample manually, the output file size is different from what this loop creates. The code that you posted before putting in rnum was working OK. The bash version is working perfectly OK.
    – H.K
    Sep 27 '17 at 10:06











  • @H.K Yes, you have found a bug in my code. The cat > should be cat >> as in the bash code. I will update the answer at once. Thanks!
    – Kusalananda
    Sep 27 '17 at 10:18











Your Answer







StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
convertImagesToLinks: false,
noModals: false,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













 

draft saved


draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f394479%2fconcatenating-multiple-fastq-files%23new-answer', 'question_page');

);

Post as a guest






























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes








up vote
3
down vote



accepted










for name in ./*.fastq.gz; do
rnum=$name##*_
rnum=$rnum%%.*

sample=$name#*_
sample=$sample%%_*

cat "$name" >>"$sample_$rnum.fastq.gz"
done


This would iterate over all compressed Fastq files in the current directory and extract the sample name into the shell variable sample. For all the filenames shown in the question, this would be 102697-001-001.



The rnum variable will hold the R# bit at the end of the filename.



The sample name is extracted by taking the filename and first removing everything up to and including the first _ character, and then removing everything after and including the first _ character from that result. The value for the rnum variable is extracted in a similar manner.



The file is then simply appended onto the end of the aggregated file using cat >>.
The output filename will be constructed from the sample name, the R#, and the string .fastq.gz. For the shown files, this will be 102697-001-001_R1.fastq.gz.



Gzip compressed files do not have to be uncompressed in order to concatenated them. Uncompressing the resulting file will give you the uncompressed concatenation of all Fastq files.




Alternative way of doing this with bash, using a regular expression to figure out the output file name:



for name in ./*.fastq.gz; do
if [[ "$name" =~ _([0-9-]+)_.*(..).fastq.gz ]]; then
outfile="$BASH_REMATCH[1]_$BASH_REMATCH[2].fastq.gz"

cat "$name" >>"$outfile"
fi
done


The filename is matched against the regular expression



_([0-9-]+)_.*(..).fastq.gz


The two groups (bits in parentheses) will pick out the relevant parts of the filename for us. The first group captures a string that only consists of characters that are either digits or dashes. This group needs to be surrounded by _ on either side. The only place in the filename that this bit matches is the sample name.



After the first group, and the _ after it, we allow for any number of any characters (.*) up to the (..).fastq.gz bit. The .fastq.gz will match the .fastq.gz string at the end of the filename, so the last group, (..), captures the R1 immediately before that (the . pattern will match any one character, while . will match a dot).



The two captured groups are stored as index 1 and 2 in the BASH_REMATCH array (the name is short for "Bash Regular Expression Match"), and we use these in constructing the output file name.






share|improve this answer


















  • 1




    @H.K If this answer solved your issue, please take a moment and accept it by clicking on the check mark to the left. That will mark the question as answered and is the way thanks are expressed on the Stack Exchange sites.
    – terdon♦
    Sep 26 '17 at 8:51










  • @kusalananda Can you kindly give the description of bash file.
    – H.K
    Sep 26 '17 at 9:11










  • @H.K The sting I added at the end? I will add an explanation.
    – Kusalananda
    Sep 26 '17 at 9:11










  • @Kusalananda I tink there is some problem with thisfor name in ./*.fastq.gz; do rnum=$name##* rnum=$rnum%%.* sample=$name#* sample=$sample%%_* cat "$name" >"$sample_$rnum.fastq.gz" done I think there is some problem with this code, when i try amnually just the 1st sample the output file size is different from what this loop creates. The code that you posted before putting in rnum, it was working Ok. The bash is working perfectly Ok
    – H.K
    Sep 27 '17 at 10:06











  • @H.K Yes, you have found a bug in my code. The cat > should be cat >> as in the bash code. I will update the answer at once. Thanks!
    – Kusalananda
    Sep 27 '17 at 10:18















up vote
3
down vote



accepted










for name in ./*.fastq.gz; do
rnum=$name##*_
rnum=$rnum%%.*

sample=$name#*_
sample=$sample%%_*

cat "$name" >>"$sample_$rnum.fastq.gz"
done


This would iterate over all compressed Fastq files in the current directory and extract the sample name into the shell variable sample. For all the filenames shown in the question, this would be 102697-001-001.



The rnum variable will hold the R# bit at the end of the filename.



The sample name is extracted by taking the filename and first removing everything up to and including the first _ character, and then removing everything after and including the first _ character from that result. The value for the rnum variable is extracted in a similar manner.



The file is then simply appended onto the end of the aggregated file using cat >>.
The output filename will be constructed from the sample name, the R#, and the string .fastq.gz. For the shown files, this will be 102697-001-001_R1.fastq.gz.



Gzip compressed files do not have to be uncompressed in order to concatenated them. Uncompressing the resulting file will give you the uncompressed concatenation of all Fastq files.




Alternative way of doing this with bash, using a regular expression to figure out the output file name:



for name in ./*.fastq.gz; do
if [[ "$name" =~ _([0-9-]+)_.*(..).fastq.gz ]]; then
outfile="$BASH_REMATCH[1]_$BASH_REMATCH[2].fastq.gz"

cat "$name" >>"$outfile"
fi
done


The filename is matched against the regular expression



_([0-9-]+)_.*(..).fastq.gz


The two groups (bits in parentheses) will pick out the relevant parts of the filename for us. The first group captures a string that only consists of characters that are either digits or dashes. This group needs to be surrounded by _ on either side. The only place in the filename that this bit matches is the sample name.



After the first group, and the _ after it, we allow for any number of any characters (.*) up to the (..).fastq.gz bit. The .fastq.gz will match the .fastq.gz string at the end of the filename, so the last group, (..), captures the R1 immediately before that (the . pattern will match any one character, while . will match a dot).



The two captured groups are stored as index 1 and 2 in the BASH_REMATCH array (the name is short for "Bash Regular Expression Match"), and we use these in constructing the output file name.






share|improve this answer


















  • 1




    @H.K If this answer solved your issue, please take a moment and accept it by clicking on the check mark to the left. That will mark the question as answered and is the way thanks are expressed on the Stack Exchange sites.
    – terdon♦
    Sep 26 '17 at 8:51










  • @kusalananda Can you kindly give the description of bash file.
    – H.K
    Sep 26 '17 at 9:11










  • @H.K The sting I added at the end? I will add an explanation.
    – Kusalananda
    Sep 26 '17 at 9:11










  • @Kusalananda I tink there is some problem with thisfor name in ./*.fastq.gz; do rnum=$name##* rnum=$rnum%%.* sample=$name#* sample=$sample%%_* cat "$name" >"$sample_$rnum.fastq.gz" done I think there is some problem with this code, when i try amnually just the 1st sample the output file size is different from what this loop creates. The code that you posted before putting in rnum, it was working Ok. The bash is working perfectly Ok
    – H.K
    Sep 27 '17 at 10:06











  • @H.K Yes, you have found a bug in my code. The cat > should be cat >> as in the bash code. I will update the answer at once. Thanks!
    – Kusalananda
    Sep 27 '17 at 10:18













up vote
3
down vote



accepted







up vote
3
down vote



accepted






for name in ./*.fastq.gz; do
rnum=$name##*_
rnum=$rnum%%.*

sample=$name#*_
sample=$sample%%_*

cat "$name" >>"$sample_$rnum.fastq.gz"
done


This would iterate over all compressed Fastq files in the current directory and extract the sample name into the shell variable sample. For all the filenames shown in the question, this would be 102697-001-001.



The rnum variable will hold the R# bit at the end of the filename.



The sample name is extracted by taking the filename and first removing everything up to and including the first _ character, and then removing everything after and including the first _ character from that result. The value for the rnum variable is extracted in a similar manner.



The file is then simply appended onto the end of the aggregated file using cat >>.
The output filename will be constructed from the sample name, the R#, and the string .fastq.gz. For the shown files, this will be 102697-001-001_R1.fastq.gz.



Gzip compressed files do not have to be uncompressed in order to concatenated them. Uncompressing the resulting file will give you the uncompressed concatenation of all Fastq files.




Alternative way of doing this with bash, using a regular expression to figure out the output file name:



for name in ./*.fastq.gz; do
if [[ "$name" =~ _([0-9-]+)_.*(..).fastq.gz ]]; then
outfile="$BASH_REMATCH[1]_$BASH_REMATCH[2].fastq.gz"

cat "$name" >>"$outfile"
fi
done


The filename is matched against the regular expression



_([0-9-]+)_.*(..).fastq.gz


The two groups (bits in parentheses) will pick out the relevant parts of the filename for us. The first group captures a string that only consists of characters that are either digits or dashes. This group needs to be surrounded by _ on either side. The only place in the filename that this bit matches is the sample name.



After the first group, and the _ after it, we allow for any number of any characters (.*) up to the (..).fastq.gz bit. The .fastq.gz will match the .fastq.gz string at the end of the filename, so the last group, (..), captures the R1 immediately before that (the . pattern will match any one character, while . will match a dot).



The two captured groups are stored as index 1 and 2 in the BASH_REMATCH array (the name is short for "Bash Regular Expression Match"), and we use these in constructing the output file name.






share|improve this answer














for name in ./*.fastq.gz; do
rnum=$name##*_
rnum=$rnum%%.*

sample=$name#*_
sample=$sample%%_*

cat "$name" >>"$sample_$rnum.fastq.gz"
done


This would iterate over all compressed Fastq files in the current directory and extract the sample name into the shell variable sample. For all the filenames shown in the question, this would be 102697-001-001.



The rnum variable will hold the R# bit at the end of the filename.



The sample name is extracted by taking the filename and first removing everything up to and including the first _ character, and then removing everything after and including the first _ character from that result. The value for the rnum variable is extracted in a similar manner.



The file is then simply appended onto the end of the aggregated file using cat >>.
The output filename will be constructed from the sample name, the R#, and the string .fastq.gz. For the shown files, this will be 102697-001-001_R1.fastq.gz.



Gzip compressed files do not have to be uncompressed in order to concatenated them. Uncompressing the resulting file will give you the uncompressed concatenation of all Fastq files.




Alternative way of doing this with bash, using a regular expression to figure out the output file name:



for name in ./*.fastq.gz; do
    if [[ "$name" =~ _([0-9-]+)_.*(..)\.fastq\.gz ]]; then
        outfile="${BASH_REMATCH[1]}_${BASH_REMATCH[2]}.fastq.gz"

        cat "$name" >>"$outfile"
    fi
done


The filename is matched against the regular expression



_([0-9-]+)_.*(..)\.fastq\.gz


The two groups (bits in parentheses) will pick out the relevant parts of the filename for us. The first group captures a string that only consists of characters that are either digits or dashes. This group needs to be surrounded by _ on either side. The only place in the filename that this bit matches is the sample name.



After the first group, and the _ after it, we allow for any number of any characters (.*) up to the (..)\.fastq\.gz bit. The \.fastq\.gz will match the .fastq.gz string at the end of the filename, so the last group, (..), captures the R1 immediately before that (the . pattern will match any one character, while \. will match a literal dot).



The two captured groups are stored as index 1 and 2 in the BASH_REMATCH array (the name is short for "Bash Regular Expression Match"), and we use these in constructing the output file name.
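As an illustration, this is what the match looks like for one of the filenames from the question. It needs bash, since [[ ... =~ ... ]] and BASH_REMATCH are not available in plain sh. Index 0 of the array holds the whole matched text; indices 1 and 2 hold the two groups.

```shell
#!/bin/bash
# One of the filenames from the question, with the ./ prefix the loop would add.
name=./HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L004_R1.fastq.gz

if [[ "$name" =~ _([0-9-]+)_.*(..)\.fastq\.gz ]]; then
    printf 'whole match: %s\n' "${BASH_REMATCH[0]}"
    printf 'group 1:     %s\n' "${BASH_REMATCH[1]}"   # the sample name
    printf 'group 2:     %s\n' "${BASH_REMATCH[2]}"   # the R# bit
    outfile="${BASH_REMATCH[1]}_${BASH_REMATCH[2]}.fastq.gz"
    printf 'output file: %s\n' "$outfile"
else
    printf 'no match: %s\n' "$name" >&2
fi
```

Filenames that do not match the expression are simply skipped by the loop, which is why the cat sits inside the if.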















edited Sep 27 '17 at 10:18

























answered Sep 26 '17 at 8:31









Kusalananda








  • 1




    @H.K If this answer solved your issue, please take a moment and accept it by clicking on the check mark to the left. That will mark the question as answered and is the way thanks are expressed on the Stack Exchange sites.
    – terdon♦
    Sep 26 '17 at 8:51










  • @kusalananda Can you kindly give the description of bash file.
    – H.K
    Sep 26 '17 at 9:11










  • @H.K The sting I added at the end? I will add an explanation.
    – Kusalananda
    Sep 26 '17 at 9:11










  • @Kusalananda I think there is some problem with this code: for name in ./*.fastq.gz; do rnum=${name##*_}; rnum=${rnum%%.*}; sample=${name#*_}; sample=${sample%%_*}; cat "$name" >"${sample}_$rnum.fastq.gz"; done. When I manually concatenate just the first sample, the output file size is different from what this loop creates. The code that you posted before adding rnum was working OK. The bash version is working perfectly OK.
    – H.K
    Sep 27 '17 at 10:06











  • @H.K Yes, you have found a bug in my code. The cat > should be cat >> as in the bash code. I will update the answer at once. Thanks!
    – Kusalananda
    Sep 27 '17 at 10:18































 
