**GenoMax** · 01-16-2014, 02:44 PM

The way paired-end reads are denoted for Illumina sequencing has changed over the years. You can find the evolution here: http://en.wikipedia.org/wiki/FASTQ_format (Illumina sequence identifiers)

**danwiththeplan** · 01-16-2014, 02:50 PM

Originally posted by GenoMax

The way paired-end reads are denoted for Illumina sequencing has changed over the years. You can find the evolution here: http://en.wikipedia.org/wiki/FASTQ_format (Illumina sequence identifiers)

I agree that page is very useful regarding the evolution of the FASTQ format with regard to quality encoding and Illumina "machine" tags, but it actually contains no information at all on how paired-end reads are defined. EDIT: sorry, yes it does.

/1 	the member of a pair, /1 or /2 (paired-end or mate-pair reads only)

But this seems to be very inconsistently applied: I'm working on data right now that is paired-end, but these tags are not in it. This is possibly because the Illumina 1.8 / 1.9 pipeline no longer adds these tags.
I guess I was just saying that it all seems a bit inconsistent (does a read have a tag or not, are the files produced as interleaved files or not), which can be problematic for tool development.

**danwiththeplan** · 01-16-2014, 03:02 PM

Actually, to answer my own question: I see that even Illumina 1.8 (and presumably 1.9) has designations for paired-end reads. It's just that they are now buried in the middle of the header line, and some tools still expect them at the end, as in earlier pipeline versions.
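
For illustration, here is a minimal Python sketch of a reader that accepts both conventions. It assumes the header layouts described on the Wikipedia page linked above: the older style ends the identifier with /1 or /2, while the 1.8+ style puts the pair member first in a space-separated `read:filtered:control:index` field. The function name `pair_member` and both patterns are hypothetical, not taken from any existing tool.

```python
import re

# Assumed 1.8+ layout: "@<instrument fields> <read>:<Y|N>:<control bits>:<index>"
NEW_STYLE = re.compile(r"^@?\S+\s+([12]):[YN]:\d+(?::\S*)?")
# Assumed older layout: identifier ending in "/1" or "/2"
OLD_STYLE = re.compile(r"^@?\S+/([12])$")


def pair_member(header):
    """Return 1 or 2 for a paired read header, or None if no tag is found."""
    header = header.strip()
    match = NEW_STYLE.match(header)
    if match:
        return int(match.group(1))
    # Old-style tags sit at the end of the first whitespace-separated token.
    tokens = header.split()
    if tokens:
        match = OLD_STYLE.match(tokens[0])
        if match:
            return int(match.group(1))
    return None


if __name__ == "__main__":
    print(pair_member("@HWUSI-EAS100R:6:73:941:1973#0/1"))
    print(pair_member("@EAS139:136:FC706VJ:2:2104:15343:197393 2:N:18:ATCACG"))
```

Returning None rather than guessing leaves it to the caller to decide what to do with headers that carry no pair tag at all.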

**Zapages** · 01-16-2014, 05:21 PM

If you don't know the format type, I usually run FastQC to see what it is.

Overall the basics are:

Sanger/Illumina 1.8+ (1.9) are phred33
Illumina 1.3 to 1.5 are phred64
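
If FastQC is not at hand, the offset can often be guessed from the range of quality characters. The sketch below is a rough heuristic, not what FastQC does: it assumes characters below ASCII 59 only occur with a +33 offset, and that characters above 'J' (ASCII 74) indicate +64, which holds for typical Illumina data but not necessarily for other platforms. The function name `guess_phred_offset` is hypothetical.

```python
def guess_phred_offset(quality_lines):
    """Guess 33 or 64 from quality strings; return None if ambiguous."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    if min(codes) < 59:
        # Characters this low cannot occur in the +64 encodings.
        return 33
    if max(codes) > 74:
        # Assumption: +33 Illumina qualities rarely go above 'J' (Q41).
        return 64
    # Everything falls in the overlap between the two encodings.
    return None


if __name__ == "__main__":
    print(guess_phred_offset(["!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF"]))
```

Checking more reads makes an ambiguous result less likely, since a single high-quality read may fall entirely in the overlapping range.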

Also please take a look at page 27 in this pdf document: https://www.msi.umn.edu/sites/defaul...Module%201.pdf

All the best.

**gringer** · 01-16-2014, 05:38 PM

Originally posted by Zapages

If you don't know the format type, I usually run FastQC to see what it is.

Overall the basics are:

Sanger/Illumina 1.8+ (1.9) are phred33
Illumina 1.3 to 1.5 are phred64

It seems like you've missed the nature of this question, which was about how the end pair status was encoded in the sequence header, rather than how the quality scores are encoded.

**danwiththeplan** · 01-16-2014, 05:59 PM

Originally posted by gringer

It seems like you've missed the nature of this question, which was about how the end pair status was encoded in the sequence header, rather than how the quality scores are encoded.

Correct, but I really should have looked at the wiki, since this info is included.

However, it was also more of an observation that how the paired-end status is encoded has changed without warning and is very inconsistently applied in practice, which is a bit of a pain for developers of analysis tools. It was also meant to warn about the unsafe code some people seem to be using to apply the old-style paired-end tags (which some tools still look for) to new formats.
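
The thread does not say which code is meant, but one well-known pitfall is rewriting every line that starts with "@": quality strings can also begin with "@", so that approach can corrupt them. A more cautious sketch, assuming strict four-line records (no wrapped sequences) and the 1.8+ header layout from the Wikipedia page, only touches the first line of each record. The name `add_old_style_tags` is hypothetical, and the rest of the header comment (filter flag, index) is dropped here, which may or may not suit a given tool.

```python
import re

# Assumed 1.8+ layout: "@<identifier> <read>:<Y|N>:<control bits>..."
NEW_STYLE = re.compile(r"^(@\S+)\s+([12]):[YN]:\d+")


def add_old_style_tags(lines):
    """Yield FASTQ lines with 1.8+ headers rewritten to end in /1 or /2."""
    for index, line in enumerate(lines):
        line = line.rstrip("\n")
        # Only the first line of each four-line record is a header.
        if index % 4 == 0:
            match = NEW_STYLE.match(line)
            if match:
                line = f"{match.group(1)}/{match.group(2)}"
        yield line


if __name__ == "__main__":
    record = [
        "@EAS139:136:FC706VJ:2:2104:15343:197393 2:N:0:ATCACG",
        "GATTACA",
        "+",
        "@@@FFFF",  # quality line that happens to start with "@"
    ]
    for out_line in add_old_style_tags(record):
        print(out_line)
```

Headers that do not match the assumed layout are passed through unchanged, so files that already use the old style are left alone.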

Also, FastQC is not that great in my opinion. Or rather, it was originally designed to do one thing (analysing Illumina data, where all the reads are the same length and shorter than 100 bp), but people then go and use it for absolutely everything. For example, if your reads are longer than that, it compresses the end of the reads into the last bit of the graph. It also does not deal particularly well with reads that are not all the same length (e.g. Ion Torrent). I prefer custom R scripts in those cases.

Is there a standard file format for FASTQ paired-end reads?

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News