
**BAJ** · 04-16-2009, 04:50 AM

maybe you can also describe the various header lines and what they mean...
Illumina gives something like this:
@HWI-EAS285:1:1:1582:1499#0/1
swift outputs:
@L1-100:474:2

Unfortunately I don't know what the numbers mean. The "@HWI-EAS285" and "@L1" parts are user-specified names.
In the Illumina header, the ":1" that follows refers to the lane, and the next number to the tile (I believe).
I am inclined to believe the remaining numbers are the x/y coordinates in the registered images, but I don't know for sure...

Thx, Bernd

**dcjamison** · 04-16-2009, 05:16 AM

Very nice. One minor issue:

"Illumina 1.3 format encodes a Phred quality score from 0 to 40 using ASCII 64 to 99."

The "99" should be 104, or else the range is only 0 to 35.
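The arithmetic behind that correction can be checked with a small sketch. It assumes only the ASCII offset of 64 quoted above; the function names are made up for illustration.

```python
OFFSET = 64  # ASCII offset for Illumina 1.3, as quoted above


def decode_quality(qual_string, offset=OFFSET):
    """Turn a FASTQ quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in qual_string]


def encode_quality(scores, offset=OFFSET):
    """Turn a list of integer scores back into a quality string."""
    return "".join(chr(q + offset) for q in scores)


# '@' is ASCII 64, 'c' is ASCII 99, 'h' is ASCII 104
print(decode_quality("@ch"))  # [0, 35, 40]
```

So a top character of ASCII 99 would cap the score at 35, while ASCII 104 gives the stated maximum of 40.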

Also, as a very minor quibble, I think the scores are still computed using the 4-color probability of the original Solexa scoring method, and just the -5 to 0 range is calibrated back into the positive range. So calling it a Phred quality score may be misleading.

Curt

**Torst** · 04-16-2009, 05:01 PM

Originally posted by BAJ:

maybe you can also describe the various header lines and what they mean... Illumina gives something like this:
@HWI-EAS285:1:1:1582:1499#0/1

I am not 100% sure of the fields, and my colleague has contacted Illumina for clarification, but what I do know I have added to the Wiki page:

http://en.wikipedia.org/wiki/FASTQ_format

@HWUSI-EAS100R:6:73:941:1973#0/1

HWUSI-EAS100R the unique instrument name
6 flowcell lane
73 tile number within the flowcell
941 'x'-coordinate of the cluster within the tile
1973 'y'-coordinate of the cluster within the tile
#0 unknown
/1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
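As a rough illustration, the fields listed above can be pulled out with a regular expression. This is only a sketch that assumes the exact layout of the example header; the field names (including `tag` for the part after `#`, whose meaning is unknown here) are labels chosen for this example, not official Illumina terms.

```python
import re

# Layout assumed from the example: @instrument:lane:tile:x:y#tag/pair
HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<tag>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_header(line):
    """Split an Illumina-style FASTQ header line into its fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised header: %r" % line)
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y"):
        fields[key] = int(fields[key])
    if fields["pair"] is not None:  # absent for single-end reads
        fields["pair"] = int(fields["pair"])
    return fields


print(parse_header("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```

The `tag` field is kept as a string so that whatever follows the `#` survives unchanged.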

**Torst** · 04-16-2009, 05:05 PM

Curt,

Originally posted by dcjamison:

Very nice. One minor issue:
"Illumina 1.3 format encodes a Phred quality score from 0 to 40 using ASCII 64 to 99." The "99" should be 104, or else the range is only 0 to 35.
Also, as a very minor quibble, I think the scores are still computed using the 4-color probability of the original Solexa scoring method, and just the -5 to 0 range is calibrated back into the positive range. So calling it a Phred quality score may be misleading.

I have fixed the 99/104 typo, thank you for replying!

The 1.3 Pipeline user manual says it uses pure Phred scores, -10*log10(e), but it does NOT clarify how it maps them to ASCII. As these cannot be negative, I am somewhat confused.
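To make the sign issue concrete, here is a sketch comparing the Phred formula from the manual with the older Solexa score. The Solexa formula used here, -10*log10(e/(1-e)), is the commonly cited odds-based definition and is an assumption added for illustration; it is not taken from the Pipeline manual.

```python
import math


def phred_score(e):
    """Phred quality for error probability e (0 < e <= 1)."""
    return -10 * math.log10(e)


def solexa_score(e):
    """Solexa-style quality for error probability e (0 < e < 1).

    Assumption: the odds-based definition -10*log10(e / (1 - e)).
    """
    return -10 * math.log10(e / (1 - e))


for e in (0.75, 0.1, 0.0001):
    print(e, round(phred_score(e), 2), round(solexa_score(e), 2))
```

For e above 0.5 the Solexa-style score drops below zero while the Phred score stays positive, which is why only the former needs room for negative values in its ASCII mapping.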

**chris** · 04-17-2009, 01:03 AM

That's a useful page, thanks for setting it up.

Regarding the Phred -> Solexa quality scores, I think it's worth mentioning this paper:

http://nar.oxfordjournals.org/cgi/content/abstract/36/16/e105

They show (in Table 3) that the Solexa error rates are not comparable to Phred at the same score, e.g. Phred has an error rate of 0.01% at score 40, but Solexa has a calculated error of 0.43% at score 40.

Overall, Solexa is overly optimistic at high quality scores and overly pessimistic at low quality scores.
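The Phred side of that comparison follows directly from the definition, as this short sketch shows. The 0.43% figure is the one quoted from the paper above; the function names are invented for the example.

```python
import math


def phred_error_percent(q):
    """Error rate, in percent, implied by Phred score q."""
    return 100 * 10 ** (-q / 10)


def phred_from_error_percent(percent):
    """Phred score that corresponds to a given error rate in percent."""
    return -10 * math.log10(percent / 100)


print(phred_error_percent(40))         # about 0.01 (percent)
print(phred_from_error_percent(0.43))  # Phred-scale value of a 0.43% error rate
```

On this reading, an observed 0.43% error rate corresponds to a Phred score in the low twenties rather than 40.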

**clivey** · 04-17-2009, 01:45 AM

You simply need to 'recalibrate' the scores so that Q40 means Q40, etc. Some software tools are available to do this, and it is not hard to write something.
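One generic way to do this, sketched below, is to align reads to a known reference, count how often bases at each reported score are actually wrong, and map each reported score to the Phred value of the observed error rate. This is only an illustration of the idea under those assumptions, not the method of any particular tool; the input format and function name are hypothetical.

```python
import math
from collections import defaultdict


def empirical_quality(observations):
    """Build a {reported score: empirical Phred score} table.

    observations: iterable of (reported_score, is_error) pairs, one per
    aligned base, where is_error is True for a mismatch to the reference.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for score, is_error in observations:
        totals[score] += 1
        if is_error:
            errors[score] += 1
    table = {}
    for score in totals:
        # Add-one smoothing so zero observed errors does not give infinity
        rate = (errors[score] + 1) / (totals[score] + 2)
        table[score] = -10 * math.log10(rate)
    return table


# Toy data: 998 bases reported as Q40, 3 of them wrong
obs = [(40, False)] * 995 + [(40, True)] * 3
print(empirical_quality(obs))
```

Real mismatches also include true variants, so in practice known polymorphic sites would need to be excluded before counting.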

**chris** · 04-17-2009, 02:17 AM

I don't think it matters that Q40 != Q40, just as long as people are aware of the fact, which I didn't think was the case in this thread.

**dlepp** · 04-17-2009, 07:17 AM

Originally posted by clivey:

You simply need to 'recalibrate' the scores so that Q40 means Q40, etc. Some software tools are available to do this, and it is not hard to write something.

I wonder if you could explain the recalibration and point towards some tools?

Thanks.

**ohlsson** · 07-22-2009, 05:35 AM

Great job, Torst! I have been struggling to get a grip on those Illumina FASTQ headers for a month now, but somehow I missed your wiki page.
I'm still not clear on one point though. I have a heap of data from a multiplexed run on Illumina GA2. The read headers largely fit your description, but what puzzles me is the index part:
@HWI-EAS178:1:1:2:1349#TGGCAT/1
As you can see, instead of an index number I have a short nucleotide sequence, which I suppose is meant to be the multiplex index sequence. As a rule, these 6-mer tags do not appear in the read sequence that follows. Do you think that they represent the multiplex index tags?

Many thanks for any suggestions!
/Ingemar

**Torst** · 08-04-2009, 02:19 AM

ohlsson,

The nucleotide sequence instead of the number must be new for GAPipeline 1.4. We are about to finish a multiplex run, so I will check what our files look like and let you know. But I suspect you are right and that it is the barcode for the multiplex. I think they are usually 6 or 7 base pairs long.

**jkbonfield** · 08-04-2009, 02:34 AM

They are indeed the multiplex barcode samples, but I think they're the sequenced DNA rather than the closest matching barcode. So you'll need to write your own code to do the matching (Illumina do not provide such a tool IIRC).

I'm not really sure what to make of this notation though. They don't seem entirely consistent between file formats either. I've seen other files that had #0/1, implying it's a number and not a string.

**Torst** · 08-04-2009, 06:21 PM

Originally posted by jkbonfield:

They are indeed the multiplex barcode samples, but I think they're the sequenced DNA rather than the closest matching barcode. So you'll need to write your own code to do the matching (Illumina do not provide such a tool IIRC).

From the manual:

The split_on_index.py script identifies all read index sequences that are identical to the reference index sequences, or that differ by a user-defined number of bases. It then breaks up the rows of the export.txt or sorted.txt file and places each row into a separate file, one for each sample.

In order for this process to work, you need the following:

* All samples in a lane are aligned to the same target sequences. The output will be stored in the GERALD directory in export.txt and sorted.txt files.

* A sample sheet, which is an xml configuration file entered during cluster generation. The sample sheet associates index sequences with sample IDs

Sounds like the right tool for the job?

**ohlsson** · 08-04-2009, 11:23 PM

Ah, interesting! I will try to find that python script and see how it works.

I already coded a pretty simple Perl script that separates reads by exact matching of the header tag to a list of barcodes. It seems to work pretty well: for a mixture of four indexed samples, roughly one fifth of the reads were sorted to each of the four barcodes used, and one fifth were left unsorted (due to mismatches, so yes jkbonfield, I also think that the tag in the header is sequenced DNA).
Interestingly, each of the eight unused barcodes got only a few hits, in the region of 1-20 reads (out of ~20 million), so the number of false-positives was very low.


Solution to Sanger/Solexa/Illumina FASTQ confusion
