concatenating multiple fastq files

I have a folder with almost 100 files, organized in groups of 16 files each. I need to concatenate each of the 16 files of each group into a single file. For example, one group of file names is:

randomString_$groupName-

I have a folder with almost 100 samples. The samples were run on the NextSeq 500 and are single stranded. Each sample was run on 4 flow cells, and each flow cell on the NextSeq 500 has 4 lanes, so 16 fastq files are generated per sample (see the example below). I want to concatenate all of these files into one output file named 102697-001-001_R1.fastq.gz.

HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz

HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz

HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz

HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L001_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L002_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L003_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L004_R1.fastq.gz

All of the files above should be concatenated into a single file named 102697-001-001_R1.fastq.gz (so keeping the string between the two first _ and after the last _ as the name).

I have tried:

$ cat HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz > 102697_001_001_R1.fastq.gz

This works, but as I have a lot of files, I don't want to do it manually.

edited May 1 at 11:04

Jeff Schaller

32.3k849110

asked Sep 26 '17 at 8:03

H.K

495

Are these files compressed by regular gzip, or are they compressed by bgzip and indexed with tabix? (i.e. do you also have to regenerate any Tabix indexes?)
– Kusalananda
Sep 26 '17 at 8:16

Yes, they are just regular gzip files.
– H.K
Sep 26 '17 at 8:20

You might want to ask this sort of question over on Bioinformatics next time.
– terdon♦
Sep 26 '17 at 8:39


files merge bioinformatics

1 Answer

accepted

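for name in ./*.fastq.gz; do
    # Read identifier: the text after the last "_", minus the extensions (e.g. R1)
    rnum=${name##*_}
    rnum=${rnum%%.*}

    # Sample name: the text between the first and second "_" (e.g. 102697-001-001)
    sample=${name#*_}
    sample=${sample%%_*}

    # Append this file to the aggregated file for its sample and read
    cat "$name" >>"${sample}_$rnum.fastq.gz"
done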

This would iterate over all compressed Fastq files in the current directory and extract the sample name into the shell variable sample. For all the filenames shown in the question, this would be 102697-001-001.

The rnum variable will hold the R# bit at the end of the filename.

The sample name is extracted by taking the filename and first removing everything up to and including the first _ character, and then removing everything after and including the first _ character from that result. The value for the rnum variable is extracted in a similar manner.
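
As a minimal sketch of these expansions, the following applies them to one of the filenames from the question. No files need to exist for this to run; it only manipulates the string. The variable names match the loop above.

```bash
# One of the filenames from the question, in the form the ./*.fastq.gz glob yields.
name=./HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz

# ${name#*_} removes the shortest prefix matching "*_" (up to the first "_").
sample=${name#*_}
# ${sample%%_*} removes the longest suffix matching "_*" (from the first "_" on).
sample=${sample%%_*}

# ${name##*_} removes the longest prefix matching "*_" (up to the last "_").
rnum=${name##*_}
# ${rnum%%.*} removes the longest suffix matching ".*" (from the first "." on).
rnum=${rnum%%.*}

# Print the output filename that the loop would append to.
printf '%s\n' "${sample}_$rnum.fastq.gz"
```

Note that the braces in `${sample}_$rnum` matter: without them the shell would look for a variable called `sample_`, which is unset.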

The file is then simply appended onto the end of the aggregated file using cat >>.
The output filename will be constructed from the sample name, the R#, and the string .fastq.gz. For the shown files, this will be 102697-001-001_R1.fastq.gz.
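
Because the loop appends with >> and writes its output next to the input files, running it a second time would append everything again and would also pick up the merged files as input. One possible variation, sketched here under assumptions, writes the merged files into a separate directory. The function name merge_fastq and the directory name used in the usage note are hypothetical and not part of the original answer.

```bash
# merge_fastq is a hypothetical helper name, not part of the original answer.
# Usage: merge_fastq INPUT_DIR OUTPUT_DIR
merge_fastq() {
    indir=$1
    outdir=$2
    mkdir -p "$outdir" || return 1

    for name in "$indir"/*.fastq.gz; do
        # If nothing matched, the pattern is left as-is; skip it.
        [ -e "$name" ] || continue

        # Strip the directory part first, so that an "_" in the
        # directory name cannot confuse the expansions below.
        base=${name##*/}

        rnum=${base##*_}
        rnum=${rnum%%.*}
        sample=${base#*_}
        sample=${sample%%_*}

        cat "$name" >>"$outdir/${sample}_$rnum.fastq.gz"
    done
}
```

It could be called as `merge_fastq . merged`. Since the function still appends, the output directory should be empty (or removed) before each run.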

Gzip compressed files do not have to be uncompressed in order to concatenate them. Uncompressing the resulting file will give you the uncompressed concatenation of all Fastq files.
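
This property can be checked on a small scale. The sketch below assumes that gzip and mktemp are available; the file names and the two tiny Fastq records are made up for illustration.

```bash
# Work in a throwaway directory so nothing in the current directory is touched.
tmpdir=$(mktemp -d)

# Two small gzip files standing in for two lanes of the same sample.
printf '@read1\nACGT\n+\nIIII\n' | gzip >"$tmpdir/lane1.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip >"$tmpdir/lane2.fastq.gz"

# Concatenate the compressed files directly, without decompressing them.
cat "$tmpdir/lane1.fastq.gz" "$tmpdir/lane2.fastq.gz" >"$tmpdir/merged.fastq.gz"

# Decompress the result and count lines; each record above is 4 lines long.
lines=$(gzip -dc "$tmpdir/merged.fastq.gz" | wc -l)
echo "lines in merged file: $lines"

# gzip -t exits with status 0 if the merged file passes its integrity check.
gzip -t "$tmpdir/merged.fastq.gz" && echo "merged file is valid gzip"

rm -r "$tmpdir"
```

The same kind of check works on real data: the line count of the merged file should equal the sum of the line counts of its inputs.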

Alternative way of doing this with bash, using a regular expression to figure out the output file name:

for name in ./*.fastq.gz; do
    if [[ "$name" =~ _([0-9-]+)_.*(..)\.fastq\.gz ]]; then
        outfile="${BASH_REMATCH[1]}_${BASH_REMATCH[2]}.fastq.gz"

        cat "$name" >>"$outfile"
    fi
done

The filename is matched against the regular expression

_([0-9-]+)_.*(..)\.fastq\.gz

The two groups (bits in parentheses) will pick out the relevant parts of the filename for us. The first group captures a string that only consists of characters that are either digits or dashes. This group needs to be surrounded by _ on either side. The only place in the filename that this bit matches is the sample name.

After the first group, and the _ after it, we allow for any number of any characters (.*) up to the (..)\.fastq\.gz bit. The \.fastq\.gz will match the .fastq.gz string at the end of the filename, so the last group, (..), captures the R1 immediately before that (the . pattern will match any one character, while \. will match a literal dot).

The two captured groups are stored as index 1 and 2 in the BASH_REMATCH array (the name is short for "Bash Regular Expression Match"), and we use these in constructing the output file name.
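
To see what ends up in BASH_REMATCH, the match can be tried on a single filename from the question. This is a small sketch that requires bash (the [[ ... =~ ... ]] test is not available in plain sh). The regular expression is the one described above, with the literal dots written as \. so that they only match a dot.

```bash
# One of the filenames from the question, in the form the ./*.fastq.gz glob yields.
name=./HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz

if [[ "$name" =~ _([0-9-]+)_.*(..)\.fastq\.gz ]]; then
    # BASH_REMATCH[0] holds the whole match; [1] and [2] hold the two groups.
    printf 'sample: %s\n' "${BASH_REMATCH[1]}"
    printf 'read:   %s\n' "${BASH_REMATCH[2]}"

    outfile="${BASH_REMATCH[1]}_${BASH_REMATCH[2]}.fastq.gz"
    printf 'output: %s\n' "$outfile"
else
    printf 'no match for %s\n' "$name" >&2
fi
```

Filenames that do not match the expression are skipped by the loop, so a quick test like this on a few representative names can show in advance which files would be left out.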

edited Sep 27 '17 at 10:18

answered Sep 26 '17 at 8:31

Kusalananda

106k14209327

@H.K If this answer solved your issue, please take a moment and accept it by clicking on the check mark to the left. That will mark the question as answered and is the way thanks are expressed on the Stack Exchange sites.
– terdon♦
Sep 26 '17 at 8:51

@kusalananda Can you kindly give a description of the bash code?
– H.K
Sep 26 '17 at 9:11

@H.K The sting I added at the end? I will add an explanation.
– Kusalananda
Sep 26 '17 at 9:11

@Kusalananda I think there is some problem with this code: for name in ./*.fastq.gz; do rnum=${name##*_}; rnum=${rnum%%.*}; sample=${name#*_}; sample=${sample%%_*}; cat "$name" >"${sample}_$rnum.fastq.gz"; done. When I manually concatenate just the first sample, the output file size is different from what this loop creates. The code that you posted before adding rnum was working OK. The bash version is working perfectly OK.
– H.K
Sep 27 '17 at 10:06

@H.K Yes, you have found a bug in my code. The cat > should be cat >> as in the bash code. I will update the answer at once. Thanks!
– Kusalananda
Sep 27 '17 at 10:18