concatenating multiple fastq files
I have a folder with almost 100 samples. The samples were run on a NextSeq 500 and are single stranded. Each sample was run on 4 flowcells, and the NextSeq 500 has 4 lanes, so 16 fastq files are generated per sample (see the example below). I want to concatenate the 16 files of each sample into a single output file, which for the sample below would be named 102697-001-001_R1.fastq.gz. For example, the files for one sample are:
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz
HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L001_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L002_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L003_R1.fastq.gz
HJJMYBGXX_102697-001-001_ATTACTCG-GCCTCTAT_L004_R1.fastq.gz
All of the files above should be concatenated into a single file named 102697-001-001_R1.fastq.gz (that is, keeping the string between the first two `_` characters and the string after the last `_` as the name).
I have tried:
    $ cat HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
        HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
        HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
        HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
        HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
        HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
        HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
        HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
        HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
        HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
        HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
        HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
        HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
        HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
        HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
        HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
        HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
        HGVLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz \
        HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz \
        HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L002_R1.fastq.gz \
        HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L003_R1.fastq.gz \
        HGWWHBGXX_102697-001-001_ATTACTCG-AGGCTATA_L004_R1.fastq.gz > 102697_001_001_R1.fastq.gz
This works, but as I have a lot of files, I don't want to do it manually.
files merge bioinformatics
Are these files compressed by regular `gzip`, or are they compressed by `bgzip` and indexed with `tabix`? (i.e. do you also have to regenerate any Tabix indexes?) – Kusalananda, Sep 26 '17 at 8:16
Yes, they are just regular .gzip files. – H.K, Sep 26 '17 at 8:20
You might want to ask this sort of question over on Bioinformatics next time. – terdon♦, Sep 26 '17 at 8:39
edited May 1 at 11:04 by Jeff Schaller
asked Sep 26 '17 at 8:03 by H.K
1 Answer (accepted)
    for name in ./*.fastq.gz; do
        rnum=${name##*_}
        rnum=${rnum%%.*}
        sample=${name#*_}
        sample=${sample%%_*}
        cat "$name" >>"${sample}_$rnum.fastq.gz"
    done
This iterates over all compressed Fastq files in the current directory and extracts the sample name into the shell variable `sample`. For all the filenames shown in the question, this would be `102697-001-001`.
The `rnum` variable will hold the `R#` bit at the end of the filename.
The sample name is extracted by taking the filename and first removing everything up to and including the first `_` character, and then removing everything after and including the first `_` character from that result. The value for the `rnum` variable is extracted in a similar manner.
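To make the extraction steps above concrete, here is a small sketch that applies the same parameter expansions to a single filename taken from the question and prints the resulting output name. Nothing is read or written on disk; the intermediate values in the comments are what each expansion should leave behind for this particular name.

```bash
# One filename from the question, with the ./ prefix the loop's glob would add.
name=./HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz

rnum=${name##*_}      # remove the longest prefix ending in "_"    -> R1.fastq.gz
rnum=${rnum%%.*}      # remove the longest suffix starting at "."  -> R1

sample=${name#*_}     # remove the shortest prefix ending in "_"   -> 102697-001-001_ATTACTCG-...
sample=${sample%%_*}  # remove the longest suffix starting at "_"  -> 102697-001-001

# The braces around "sample" keep the shell from reading "$sample_" as a variable name.
printf '%s\n' "${sample}_$rnum.fastq.gz"
```

Running this snippet before the real loop is a cheap way to check that the names in your own directory follow the same pattern.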
Each file is then simply appended onto the end of the aggregated file using `cat >>`.
The output filename is constructed from the sample name, the `R#`, and the string `.fastq.gz`. For the files shown, this will be `102697-001-001_R1.fastq.gz`.
Gzip-compressed files do not have to be uncompressed in order to be concatenated. Uncompressing the resulting file gives you the uncompressed concatenation of all the Fastq files.
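If you want to convince yourself of this before touching real data, the following sketch builds two tiny gzip files in a temporary directory, concatenates them with `cat`, and decompresses the result. The file names and contents are made up for the demonstration.

```bash
# Throwaway working directory; removed again at the end.
tmp=$(mktemp -d)

# Two small, independently compressed files.
printf 'first\n'  | gzip >"$tmp/a.gz"
printf 'second\n' | gzip >"$tmp/b.gz"

# Concatenate the compressed files without decompressing them.
cat "$tmp/a.gz" "$tmp/b.gz" >"$tmp/both.gz"

# Decompressing the result yields the contents of both files, in order.
combined=$(gzip -dc "$tmp/both.gz")
printf '%s\n' "$combined"

rm -r "$tmp"
```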
An alternative way of doing this with `bash`, using a regular expression to figure out the output file name:
    for name in ./*.fastq.gz; do
        if [[ "$name" =~ _([0-9-]+)_.*(..)\.fastq\.gz ]]; then
            outfile="${BASH_REMATCH[1]}_${BASH_REMATCH[2]}.fastq.gz"
            cat "$name" >>"$outfile"
        fi
    done
The filename is matched against the regular expression
    _([0-9-]+)_.*(..)\.fastq\.gz
The two groups (bits in parentheses) pick out the relevant parts of the filename for us. The first group captures a string that consists only of digits and dashes. This group needs to be surrounded by `_` on either side. The only place in the filename that this bit matches is the sample name.
After the first group, and the `_` after it, we allow for any number of any characters (`.*`) up to the `(..)\.fastq\.gz` bit. The `\.fastq\.gz` will match the `.fastq.gz` string at the end of the filename, so the last group, `(..)`, captures the `R1` immediately before that (the `.` pattern will match any one character, while `\.` will match a dot).
The two captured groups are stored at index 1 and 2 in the `BASH_REMATCH` array (the name is short for "Bash Regular Expression Match"), and we use these in constructing the output file name.
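As a quick check of the regular expression, this sketch matches it against one filename from the question and prints the two captured groups. It needs `bash` (the `[[ ... =~ ... ]]` test is not available in plain `sh`), and the variable names `sample` and `rnum` are only used here for illustration.

```bash
# One filename from the question, with the ./ prefix the loop's glob would add.
name=./HGTLWBGXX_102697-001-001_ATTACTCG-AGGCTATA_L001_R1.fastq.gz

if [[ "$name" =~ _([0-9-]+)_.*(..)\.fastq\.gz ]]; then
    sample=${BASH_REMATCH[1]}   # first group: digits and dashes between underscores
    rnum=${BASH_REMATCH[2]}     # second group: the two characters before .fastq.gz
    printf 'sample=%s rnum=%s\n' "$sample" "$rnum"
else
    printf 'no match for %s\n' "$name" >&2
fi
```

Files whose names do not match the pattern are skipped by the loop in the answer, so a run of this check over a few of your own names shows which ones would be left out.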
@H.K If this answer solved your issue, please take a moment and accept it by clicking on the check mark to the left. That will mark the question as answered and is the way thanks are expressed on the Stack Exchange sites. – terdon♦, Sep 26 '17 at 8:51
@kusalananda Can you kindly give a description of the bash version? – H.K, Sep 26 '17 at 9:11
@H.K The thing I added at the end? I will add an explanation. – Kusalananda, Sep 26 '17 at 9:11
@Kusalananda I think there is some problem with this code: `for name in ./*.fastq.gz; do rnum=${name##*_}; rnum=${rnum%%.*}; sample=${name#*_}; sample=${sample%%_*}; cat "$name" >"${sample}_$rnum.fastq.gz"; done`. When I manually concatenate just the first sample, the output file size is different from what this loop creates. The code that you posted before adding rnum was working OK. The bash version is working perfectly OK. – H.K, Sep 27 '17 at 10:06
@H.K Yes, you have found a bug in my code. The `cat >` should be `cat >>` as in the `bash` code. I will update the answer at once. Thanks! – Kusalananda, Sep 27 '17 at 10:18
edited Sep 27 '17 at 10:18
answered Sep 26 '17 at 8:31 by Kusalananda