
  • Illumina HiSeq - first base quality score and trimming

    Hi all,

    We had an external company run RNA-seq for us and I'm now knee-deep in trying to assemble these sequences. The platform used was Illumina HiSeq 2000, producing a couple of fq files containing paired-end data. I've noticed that some of the sequences in file 1 begin with an N with a quality score of B. I've read in other threads here that this is a low quality score, equivalent to 2. The paired sequences in file 2 don't seem to have this issue, although they may end with a B-quality base. Here's an example:

    Code:
    @ABC123:1:1101:1423:1934#/1
    NACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCA
    +
    BP\aceeca]cgcdgcegfdgdgdcgd_aa^cSXcgecaW^eeg_[aW\Za_fghhh]ddgdbaabbccZ_R`Z`T\KTTZZ`b^WXX]bY_bY`baa[[
    
    @ABC123:1:1101:1423:1934#/2
    GACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCAAACGAGACCA
    +
    _^aceeeegegggfffefb`eeaffggcgh_cgffhghhibeffgfgfegdfgighhghhihggcghigdgggggdabc_abbb`a_`_ccb`Z_bcccB

    I don't think it's a huge issue in the data as a whole as FastQC doesn't flag any problem with the number of Ns at the first base, so it's likely a small subset of the sequences.

    Nevertheless, I'd like to remove these bases and am struggling to find a tool that does what I need (or, perhaps more likely, am struggling to use the available tools correctly). The FASTX-Toolkit only seems to remove bases from the 3' end. When I use Trimmomatic with the options PE -phred64 LEADING:3 TRAILING:3, it happily removes the poor-quality bases from the 3' end but not the 5' end, so in the above example the final A of the file 2 sequence is removed, but not the first N of file 1. I don't know if this is because it is an N rather than a nucleotide or if it's due to its position in the sequence.

    Any advice on the nature of these initial Ns in Illumina data and how best to remove them would be much appreciated!
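For readers following along, here is a rough Python sketch of what end trimming by quality amounts to for the example read above. It assumes phred64 encoding (score = ASCII code minus 64, so `B` decodes to 2) and drops bases from either end while their score is below a threshold of 3. The function name `trim_low_quality_ends` is made up for illustration; this is not Trimmomatic's own code, only the same idea in miniature.

```python
PHRED_OFFSET = 64  # assumption: phred64 encoding, as discussed in this thread


def trim_low_quality_ends(seq, qual, threshold=3, offset=PHRED_OFFSET):
    """Drop bases from both ends while their quality score is below threshold."""
    scores = [ord(c) - offset for c in qual]
    start = 0
    end = len(seq)
    # Walk in from the 5' end past low-quality bases.
    while start < end and scores[start] < threshold:
        start += 1
    # Walk in from the 3' end past low-quality bases.
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return seq[start:end], qual[start:end]


# First ten bases and quality characters of the /1 read in the example.
read1_seq = "NACGAGACCA"
read1_qual = "BP\\aceeca]"
trimmed_seq, trimmed_qual = trim_low_quality_ends(read1_seq, read1_qual)
print(trimmed_seq)   # the leading N (quality B = 2) is gone
print(trimmed_qual)
```

Bases in the middle of a read are never touched by this approach; only runs of low scores at the very ends are removed.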

  • #2
    Trimmomatic's LEADING:3 command should remove very low quality bases from the 5' end.

    What version of Illumina's software was used to produce your fastq files?
    The quality encodings used by Illumina have changed a few times.
    See



    Since v1.8 of Illumina's pipeline, they have used the phred33 quality scale,
    which might explain why trimmomatic didn't remove the N base from the 5' end of your sequence.



    • #3
      Hi mastal, thanks for your reply.

      We weren't given the info on Illumina software when we received the data, but I've gone back to request that just to make sure.

      Going by the wiki page, it does look to be phred64: most of the quality scores tend to be in the range of "[\]^_`abcdefghi" ASCII characters. I've tried running Trimmomatic with the -phred33 option and it removes nothing, not even the 3' bases that were removed before.
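The reasoning about character ranges can be sketched in a few lines of Python. This is only a heuristic under stated assumptions: any character below `;` (ASCII 59) can only occur in phred33 data, and otherwise the data is guessed to be phred64. The function name `guess_phred_offset` is hypothetical, and a small sample of uniformly high-quality phred33 reads could fool it.

```python
def guess_phred_offset(quality_strings):
    """Guess the quality offset (33 or 64) from the lowest character seen."""
    lowest = min(ord(c) for q in quality_strings for c in q)
    # Characters below ';' (ASCII 59) do not occur in any 64-based encoding.
    return 33 if lowest < 59 else 64


# Start of the /1 quality string from the first post.
sample = [r"BP\aceeca]cgcdgcegfdgdgdcgd_aa^cSX"]
offset = guess_phred_offset(sample)
print(offset)

# The same character decodes very differently under the two offsets.
print(ord("B") - 64)  # 2: below a threshold of 3, so it would be trimmed
print(ord("B") - 33)  # 33: well above the threshold, so nothing is trimmed
```

The last two lines are consistent with what was observed: under -phred33 every character in these reads decodes to a high score, so a threshold of 3 never removes anything.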

      I've also tried editing the first base of the first sequence to a G rather than an N, and it's still not removing it - the LEADING command just doesn't seem to be working for me, it's baffling.

      EDIT: OK, I've just tried using the single end (SE) version on the first sequence file only, and it's removed the leading Ns. As I'm not removing any reads, just trimming the 5' and 3' ends where there is a low quality base, I suppose running it this way on each file in turn should do the trick and I shouldn't (in theory) have any problems with unmatched reads. I'd still prefer to get it working using the PE function just in case, though.

      FURTHER EDIT: Problem resolved. Short answer: I'm an idiot. Long answer: my Trimmomatic command was missing a path for the second unpaired output file. Add one of these in, and it worked fine.
      Last edited by beej; 10-15-2013, 03:14 AM. Reason: Further investigation results & resolution
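Since the fix came down to a missing output path, here is a small Python sketch that builds a Trimmomatic paired-end command and refuses to proceed unless all four outputs are given. The argument order follows Trimmomatic's documented PE usage (two inputs, then paired and unpaired outputs for each mate, then the trimming steps). The jar name and all file names are placeholders, and `build_trimmomatic_pe_command` is a made-up helper. One plausible explanation for the original symptom, not confirmed in the thread, is that with an output path missing the first step (LEADING:3) was read as a file name, leaving only TRAILING:3 in effect.

```python
def build_trimmomatic_pe_command(jar, in1, in2, out1_paired, out1_unpaired,
                                 out2_paired, out2_unpaired, steps,
                                 phred="-phred64"):
    """Return the argument list for a Trimmomatic paired-end (PE) run."""
    outputs = [out1_paired, out1_unpaired, out2_paired, out2_unpaired]
    if any(not path for path in outputs):
        raise ValueError("PE mode needs four output paths: "
                         "paired and unpaired for each mate")
    return ["java", "-jar", jar, "PE", phred, in1, in2, *outputs, *steps]


cmd = build_trimmomatic_pe_command(
    jar="trimmomatic.jar",            # placeholder jar path
    in1="reads_1.fq",                 # placeholder input names
    in2="reads_2.fq",
    out1_paired="reads_1.paired.fq",
    out1_unpaired="reads_1.unpaired.fq",
    out2_paired="reads_2.paired.fq",
    out2_unpaired="reads_2.unpaired.fq",
    steps=["LEADING:3", "TRAILING:3"],
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the placeholder paths point at real files.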



      • #4
        Some programs give you the option to simply remove the first and last one, two or three bases from every read. I'm guessing that this is a pre-emptive way of dealing with read ends that may have lower quality scores.
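As a rough illustration of that kind of fixed cropping, the Python sketch below removes a set number of bases from each end of every read regardless of quality. The function name `crop_fixed` and its defaults are made up for this example; Trimmomatic's HEADCROP step is one real option of this type for the 5' end.

```python
def crop_fixed(seq, qual, head=1, tail=1):
    """Remove `head` bases from the 5' end and `tail` bases from the 3' end."""
    end = len(seq) - tail
    if end < head:
        # The read is too short to survive the crop.
        return "", ""
    return seq[head:end], qual[head:end]


print(crop_fixed("NACGAGACCA", "BP\\aceeca]"))
```

Unlike quality-based trimming, this shortens every read by the same amount, including reads whose end bases were fine.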
