Hi, just a general question: my experience is that there seems to be no standard way of defining paired-end reads in a FASTQ file.
For example, Trinity expects paired-end reads to be marked with
Code:
@nameofthesequence /1
or
Code:
@nameofthesequence /2
in the header line of a FASTQ file, and expects the pairs to be in the same order in two separate files for left and right reads (as I understand it).
Trinity does add the /1 or /2 during the assembly process, but some preprocessing steps (e.g. in silico read normalization) expect these tags to be there already.
Some programs expect the pairs to be interleaved in a single file, some don't (leading to scripts like shufflesequences.pl etc. and Galaxy tools to interleave and un-interleave things).
I was wondering, is there some standard way of defining paired-end sequences that I'm not aware of? If not, could we, as a community, come up with one? Thoughts?
As a side issue, I've seen some unsafe code for adding the /1 and /2 tags to the header lines of FASTQ files; for example, prior to in silico read normalisation as described here:
Anything that matches on the @ at the start of the FASTQ header line is potentially unsafe, since @ is itself a valid quality score character. The @ (and any other unique bits after it) can turn up in the quality scores, even at the very start of a quality score line, so the start of a quality score line might be indistinguishable from a header line. e.g.:
Code:
sed -i '/^@M00/ s/\ .\+/\/1/g' *_R1.fastq
This is potentially unsafe since it searches for @M00 at the start of the header line (the @ is standard FASTQ, the M00 is presumably some tag from a MiSeq), and it's possible (given millions/billions of reads) that some quality score lines might start with @M00 too. My alternative approach is to select the header lines by position instead, and add /1 (or /2 for right reads) to every fourth line:
Code:
sed '1~4 s/$/ \/1/g' your_fastq_file.fastq > your_new_fastq_file.fastq
(for left reads), or
Code:
sed '1~4 s/$/ \/2/g' your_fastq_file.fastq > your_new_fastq_file.fastq
(for right reads).
This simply adds ' /1' (i.e. a space, a slash and a 1) to the end of every 4th line, starting with the first line (the 1~4 address is a GNU sed extension). If your file is in strict four-line FASTQ format this should work (works for me anyway). It wouldn't be too hard to modify this to add the tags to interleaved paired-end FASTQ files too. You can use the sed -i option to edit the file in place rather than redirecting to a new file if you want.
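For the interleaved case, here is a minimal sketch of the same position-based idea. I haven't run it on real data. It assumes strict four-line records with each left read immediately followed by its mate, so the headers fall on lines 1 and 5 of every 8-line block. The function name tag_interleaved is just something I made up for illustration, and I've used awk so it doesn't depend on GNU sed.

```shell
# Hypothetical helper: tag the headers of an interleaved FASTQ stream on stdin.
# Assumes strict 4-line records, left read first, then its mate.
# Usage: tag_interleaved < interleaved.fastq > interleaved_tagged.fastq
tag_interleaved() {
    awk '
        NR % 8 == 1 { $0 = $0 " /1" }   # header line of the left read
        NR % 8 == 5 { $0 = $0 " /2" }   # header line of the right read
        { print }                        # all other lines pass through unchanged
    '
}
```

Because lines are selected by position rather than by a leading @, a quality line that happens to start with @ is left alone. With GNU sed the same thing should be possible with two addresses, 1~8 for the /1 tag and 5~8 for the /2 tag.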