  • Trimmomatic quality trimming

    I have been using Trimmomatic to trim adapters and low-quality bases. In general, I have been pleased with the performance, but I just ran some low-quality samples through it and Trimmomatic doesn't appear to be trimming correctly based on quality. In certain cases the per-base quality even looks worse after trimming than before. I have set my cutoff to 10, so I would expect everything below that to be cut off. Furthermore, I have specified a sliding-window minimum quality of 18. A couple of examples:

    [Figures: per-base quality plots before and after trimming, for two samples]


    In each case I ran the following commands:
    Code:
    java -Xmx2g -classpath /usr/local/bin/trimmomatic/trimmomatic.jar org.usadellab.trimmomatic.TrimmomaticSE -phred33 sample.fastq.gz sample.trimmed.fastq ILLUMINACLIP:/Volumes/Storage_1/Sequencing_1/References/Contaminants/contaminants.fasta:2:40:12 LEADING:10 TRAILING:10 SLIDINGWINDOW:4:18 MINLEN:18

  • #2
    I tried running these samples through PrinSeq and cutadapt as well with very similar results. This means that the problem isn't specific to Trimmomatic, but I'm still interested to hear if anybody knows what is causing this? I guess it only happens on really low-quality reads?


    • #3
      Originally posted by kga1978
      I have been using Trimmomatic to trim adapters and low-quality bases. In general, I have been pleased with the performance, but I just ran some low-quality samples through it and Trimmomatic doesn't appear to be trimming correctly based on quality.
      Strange indeed.

      Is your data really phred33 as suggested in the command line? Illumina 1.5 is normally phred64.


      • #4
        To be perfectly honest, I'm not sure - the quality score thing is doing my head in (damn you, Illumina!). I assumed if it was phred64, my maximum score would be higher than 40, no? I'll try and rerun with phred64 and see what happens.


        • #5
          Originally posted by kga1978
          To be perfectly honest, I'm not sure - the quality score thing is doing my head in (damn you, Illumina!). I assumed if it was phred64, my maximum score would be higher than 40, no? I'll try and rerun with phred64 and see what happens.
          If the data really is phred64 but Trimmomatic is told that it is phred33, Trimmomatic will interpret each score as 31 higher than it really is, and so will barely trim anything since the quality appears 'excellent'. I really should add a warning if the quality scores are outside the expected range, as this is nearly always caused by the wrong phred33/phred64 selection, and it results in either no trimming or almost everything being trimmed, depending on the direction of the mistake.
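
A minimal sketch of the offset mix-up described above. The quality string is made up for illustration, and `decode_quality` is a hypothetical helper, not part of Trimmomatic:

```python
def decode_quality(qual_string, offset):
    """Convert FASTQ quality characters to integer phred scores for a given ASCII offset."""
    return [ord(ch) - offset for ch in qual_string]


# Hypothetical quality string from a phred64-encoded read:
# good bases followed by a tail of 'B' characters.
quals = "hhhhggfeBBBB"

as_phred64 = decode_quality(quals, 64)  # the correct reading for this example
as_phred33 = decode_quality(quals, 33)  # what a tool sees if told the data is phred33

print(as_phred64)  # [40, 40, 40, 40, 39, 39, 38, 37, 2, 2, 2, 2]
print(as_phred33)  # [71, 71, 71, 71, 70, 70, 69, 68, 33, 33, 33, 33]
```

Read with the wrong offset, the 'B' tail looks like quality 33 rather than 2, so a cutoff of 10 or a window threshold of 18 would never be triggered.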

          In any case, you really shouldn't see a significant percentage of the reads with base calls much below the sliding window threshold. In FastQC, for example, the yellow bars should mostly sit above the threshold, though the whiskers will tend to dip below it. On really bad data, you might also see the yellow bars drop in the last few bases, an artefact of 'under-testing' as the sliding window runs off the end of the reads; this is to be expected.
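
The end-of-read effect can be illustrated with a simplified sliding-window rule. This is a sketch under stated assumptions, not Trimmomatic's actual code: `sliding_window_keep` is a hypothetical function that cuts at the start of the first window whose mean quality is below the threshold, and the scores are invented.

```python
def sliding_window_keep(quals, window=4, min_mean=15):
    """Return how many leading bases to keep.

    Simplified rule (an assumption, not Trimmomatic's exact behaviour):
    scan windows from the 5' end and cut at the start of the first window
    whose mean quality is below min_mean. Reads shorter than one window
    are kept whole here.
    """
    for start in range(0, len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_mean:
            return start
    return len(quals)


tail_one_bad = [30, 30, 30, 30, 30, 30, 30, 5]   # a single poor base at the very end
tail_four_bad = [30, 30, 30, 30, 5, 5, 5, 5]     # a long poor tail

print(sliding_window_keep(tail_one_bad))   # 8: the last window averages 23.75, so nothing is cut
print(sliding_window_keep(tail_four_bad))  # 3: the window starting at index 3 averages 11.25
```

The single bad base at the end of the first read survives because the only window covering it is dominated by three good bases, which is one way to picture the 'under-testing' at the read end.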

          Here's an example of some really low quality data pre/post trimming, using sliding window 4 wide, quality 15.

          [Figures: quality plots for Untrimmed Forward, Untrimmed Reverse, Trimmed Forward Paired, Trimmed Forward Unpaired, Trimmed Reverse Paired, and Trimmed Reverse Unpaired]


          • #6
            Hi Tony,

            Got it. I reran some of the reads and most of them got better with phred64 (I mostly use trimming for adapters though - my aligner takes into consideration quality). However, as you said, really bad reads still fall off dramatically in the end - probably due to the sliding window. So, just to be clear - am I correct in the following?

            Casava 1.3 - 1.7: Use Phred64
            Casava 1.8+: Use Phred33
            454 data (although Trimmomatic can't do this right now): Use Phred33

            Thanks for following up.


            • #7
              Originally posted by kga1978
              Got it. I reran some of the reads and most of them got better with phred64 (I mostly use trimming for adapters though - my aligner takes into consideration quality). However, as you said, really bad reads still fall off dramatically in the end - probably due to the sliding window.
              How far in do you see the low bases, i.e. those below the threshold cut-off? Just the last few? Do your new plots look anything like the ones I posted?

              Originally posted by kga1978
              So, just to be clear - am I correct in the following?

              Casava 1.3 - 1.7: Use Phred64
              Casava 1.8+: Use Phred33
              454 data (although Trimmomatic can't do this right now): Use Phred33
              I believe so, though I generally verify by looking at the scores by eye, and checking here. Occasionally I've seen data in the 'wrong' phred encoding because someone decided to be 'helpful'.
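
Checking the scores by eye can be approximated with a small heuristic. This is only a sketch: `guess_phred_offset` is a hypothetical helper, it ignores the older Solexa+64 encoding, and it can be fooled by phred33 data that happens to contain only high scores.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest quality character seen.

    Heuristic only: characters below '@' (ASCII 64) cannot occur in phred64
    data, while data whose lowest character is '@' or higher is probably
    phred64. Returns None when no quality characters were given.
    """
    codes = [ord(ch) for q in quality_strings for ch in q]
    if not codes:
        return None
    return 33 if min(codes) < 64 else 64


# Hypothetical quality lines, as they would appear on the 4th line of a FASTQ record.
print(guess_phred_offset(["IIIIHHGF#"]))     # 33, because '#' is ASCII 35
print(guess_phred_offset(["hhhhggfeBBBB"]))  # 64, because the lowest character is 'B' (ASCII 66)
```

Running it over a few thousand reads rather than a single record makes the guess more reliable, since a low-quality character is then more likely to show up.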


              • #8
                Actually, it's all good: for the one that had a dramatic drop-off at the end, I had forgotten to change to phred64!

                This is what the data looks like now:

                [Figure: per-base quality plot after rerunning with phred64]


                • #9
                  Originally posted by tonybolger
                  If the data really is phred64 but Trimmomatic is told that it is phred33, Trimmomatic will interpret each score as 31 higher than it really is, and so will barely trim anything since the quality appears 'excellent'. I really should add a warning if the quality scores are outside the expected range, as this is nearly always caused by the wrong phred33/phred64 selection, and it results in either no trimming or almost everything being trimmed, depending on the direction of the mistake.

                  In any case, you really shouldn't see a significant percentage of the reads with base calls much below the sliding window threshold. In FastQC, for example, the yellow bars should mostly sit above the threshold, though the whiskers will tend to dip below it. On really bad data, you might also see the yellow bars drop in the last few bases, an artefact of 'under-testing' as the sliding window runs off the end of the reads; this is to be expected.

                  Here's an example of some really low quality data pre/post trimming, using sliding window 4 wide, quality 15.

                  [Figures: quality plots for Untrimmed Forward, Untrimmed Reverse, Trimmed Forward Paired, Trimmed Forward Unpaired, Trimmed Reverse Paired, and Trimmed Reverse Unpaired]

                  OK, I get this part, but my question is: if I want to use TopHat for mapping, which of these files should I use? The forward paired and reverse paired ones? And what about the unpaired ones? I am new to Trimmomatic and TopHat, so sorry if this seems a stupid question.
                  Thanks in advance.


                  • #10
                    I don't think TopHat and Bowtie will let you use paired and unpaired reads in the same run, so you would have to do two runs: one with the R1_paired.fastq and R2_paired.fastq files, and another with the R1_unpaired.fastq and R2_unpaired.fastq files.


                    • #11
                      How is the quality score evaluated?

                      Hi,
                      I wondered whether anybody can explain to me how the quality scores used by the program
                      are actually calculated.
                      E.g. leading trimming is often run with a value of 3; obviously that won't be a phred score. So what is it?


                      • #12
                        Originally posted by ebioman
                        Hi,
                        I wondered whether anybody can explain to me how the quality scores used by the program
                        are actually calculated.
                        E.g. leading trimming is often run with a value of 3; obviously that won't be a phred score. So what is it?
                        It's a phred score

                        Historically, the Illumina pipeline occasionally created reads with one (or more rarely two) N base calls at the start and, more often, a run of trailing 'B' quality scores at the end. N base calls are treated as phred score zero, and 'B' is quality 2 (in phred64 encoding), so trimming both ends for all scores below 3 removes these artefacts.
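
A rough sketch of trimming both ends at a threshold of 3, as described above. `trim_ends` is a hypothetical function and the scores are invented; it only illustrates the idea and is not taken from Trimmomatic's source.

```python
def trim_ends(quals, leading=3, trailing=3):
    """Return (start, end) of the slice left after removing low-quality ends.

    Bases are dropped from the start while their score is below `leading`,
    then from the end while their score is below `trailing`.
    """
    start, end = 0, len(quals)
    while start < end and quals[start] < leading:
        start += 1
    while end > start and quals[end - 1] < trailing:
        end -= 1
    return start, end


# Hypothetical read: one N call at the start (treated as quality 0)
# and a tail of 'B' scores (quality 2).
quals = [0, 34, 36, 38, 38, 35, 2, 2, 2]
start, end = trim_ends(quals)
print(quals[start:end])  # [34, 36, 38, 38, 35]
```

A threshold of 3 is just high enough to catch both artefacts (0 and 2) while leaving every genuinely scored base alone.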


                        • #13
                          Thanks, that was as short as it was informative! I always thought it might be some other internal score and tried desperately to work out how it was calculated.


                          • #14
                            Originally posted by ebioman
                            Thanks, that was as short as it was informative! I always thought it might be some other internal score and tried desperately to work out how it was calculated.
                            I had just the same doubt!! Very informative indeed...


                            • #15
                              Need help with Trimmomatic

                              Hi everyone,

                              I have been having some issues with my command line for Trimmomatic.

                              This is what I've been using:
                              java -jar /Users/omriadini/Desktop/Trimmomatic-0.33/trimmomatic-0.33.jar SE -threads 4 -trimlog /Users/omriadini/Desktop/156\ L001/L002.trimLog /Volumes/omri\ hard\ drive/Ally\'s\ stuff/Liron\'s\ Project/Raw\ Data/NRF2-1_S17/NRF2-1_S17_L001_R1_001.fastq trimmed.NRF2-1_S17_L001_R1_001.fastq ILLUMINACLIP:/Users/omriadini/Desktop/156\ L001/Truseq_NEBnext_adapter_sequences\ \(1\).txt:2:30:10 HEADCROP:12 MAXINFO:0:40:0.5 MINLEN:36

                              However, this is the response I get every time:
                              ILLUMINACLIP: Using 0 prefix pairs, 48 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
                              Quality encoding detected as phred33
                              Input Reads: 7877072 Surviving: 0 (0.00%) Dropped: 7877072 (100.00%)
                              TrimmomaticSE: Completed successfully

                              I am not sure why it keeps dropping all of my reads.

                              Any ideas?

                              Thanks in advance.
