"#" in illumina reads fastq quality line

  • "#" in illumina reads fastq quality line

    Hi all,

    I am using prinseq to trim the low-quality tails of Illumina reads. From the manual, I found that -trim_qual_left and -trim_qual_right trim a sequence by quality score from the 5' or 3' end, using a given threshold score. Here are the parameters I used:

    Code:
    #trim 
    -trim_qual_right 33
    -trim_qual_left 33
    I am not sure whether I chose the correct parameters, or whether 33 is too high.

    In addition, I used the two options below to trim poly-A/T tails:
    Code:
     
    -trim_tail_right 10
    -trim_tail_left 10
    While the program was running, I checked the filtered-out reads in the so-called "bad_reads" file and found that the quality lines of most reads contain a very long run of "#########"; some consist entirely of "#", e.g.:

    Code:
    @HWI-ST538:217:C0NFWACXX:4:1101:19625:1943 1:Y:0:GAGTGG
    CNCGTCCCTTGATATGTTGTAATTCGTCTTTCATTTCCATTATGATGGCATCTGCAGCATCCTGCCAGAGACCTTTCAGATGAATATTTTCTTGCTGCAA
    +
    ####################################################################################################
    @HWI-ST538:217:C0NFWACXX:4:1101:20481:1941 1:Y:0:GAGTGG
    TNCATACTTTCGTTCCTTTCTCTTTATACGGATCGACTTCGTTCCAAGCTGTGGGAATCTTGACCGTGTTGTGCATCAGGGGTCATCTGCTTCGGTCATT
    +
    3#[email protected]<@[email protected]:[email protected]?)>:>><>>@?9???8?4((--<(97<;):)7>7>???9?>???>)<>=99=?##############################
    @HWI-ST538:217:C0NFWACXX:4:1101:20349:1946 1:Y:0:GAGTGG
    CNGCGCTGCTGCCAACTAGTAAAGGAAGTATTCATTAAAATGCAGGGAGACCGCAGGAATGGGGACATGTTCCCCTTTGGGGACCCTTTTGGCAGCTTCG
    +
    ;#[email protected]=?<>>>??9??>.8=9>@>@<?>?=?>?>????>?>?<<=5=???<??<?9>?########################################
    I am not quite clear on the meaning of the symbols in the quality line. Do multiple "#" really mean these sequences are bad? In the "good reads" file, none of the reads contains "#" in its quality line. I am afraid that I did something wrong and discarded reads that should have been kept.

    The encoding is Illumina 1.9 according to FastQC.

    Any advice is highly appreciated.

    -alice
    Last edited by doublealice; 06-09-2012, 03:18 PM.

  • #2
    Since your data is from Illumina 1.9, it uses the Sanger FASTQ encoding, so '#' (ASCII 35) means PHRED quality 2, which has a special meaning with Illumina as the "Read Segment Quality Control Indicator". Under the old Illumina FASTQ encoding, Q2 was a 'B' (ASCII 66).
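To make the encoding concrete, here is a minimal Python sketch of how a quality character maps to a PHRED score. The function name `phred_scores` is only illustrative and is not part of prinseq or any other tool; the offsets are 33 for the Sanger / Illumina 1.8+ encoding and 64 for the older Illumina encoding.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of PHRED scores.

    offset=33 is the Sanger / Illumina 1.8+ encoding;
    offset=64 is the older Illumina 1.3-1.7 encoding.
    """
    return [ord(ch) - offset for ch in quality_line]


# '#' is ASCII 35, so with offset 33 it decodes to PHRED 2.
print(phred_scores("#"))

# 'B' is ASCII 66, so with offset 64 it also decodes to PHRED 2.
print(phred_scores("B", offset=64))
```

Note that the 33 here is the ASCII offset of the encoding, which is a different thing from a PHRED quality threshold.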

    See http://seqanswers.com/forums/showpos...91&postcount=3

    i.e. the run of PHRED 2 means the read failed in some specific way according to the Illumina software. Even without this knowledge, PHRED quality 2 is very bad and should be clipped/discarded.
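As a rough illustration of the clipping idea, the sketch below removes a trailing run of PHRED 2 characters from a single read. It assumes the Sanger encoding (offset 33) and uses a hypothetical helper name, `clip_q2_tail`; it is not how prinseq does it internally, just a way to see what the "#" tail amounts to.

```python
def clip_q2_tail(seq, qual, offset=33):
    """Drop the trailing run of PHRED 2 bases from a read.

    Returns the clipped (sequence, quality) pair. A read whose quality
    line is entirely PHRED 2 comes back as two empty strings.
    """
    q2_char = chr(2 + offset)            # '#' when offset is 33
    kept = len(qual.rstrip(q2_char))     # bases remaining before the Q2 tail
    return seq[:kept], qual[:kept]


# Toy read: the last four bases carry the Q2 marker and are clipped.
print(clip_q2_tail("ACGTACGT", "??>?####"))
```

A read that comes back empty, or very short, is one you would normally discard rather than keep.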

    • #3
      maubp, thanks very much. This is very helpful.
