  • Trimming left end (5') of reads??

    Can anyone explain why there is a sequence bias in the first 15bp of Illumina reads? I am pretty sure this is not an adapter leftover. The researchers who did lettuce transcriptome identified the same issue, with results at:

    And we saw the same bias in the first 15bp of our reads also. I think I read somewhere that it's caused by GC content. Even after removing low & medium quality reads, we still see the bias in the first 10-15nt. Can anyone explain?

  • #2
    Short answer, the random hexamer priming is "not so random". Illumina has acknowledged this in one of their FAQs:

    Q482. Why is GC high in the first few bases?
    It is perfectly normal to observe both a slight GC bias and a distinctly non-random base composition over the first 12 bases of the data. This is observed when looking, for instance, at the IVC (intensity versus cycle number) plots which are part of the output of the Pipeline. In genomic DNA sequencing, the base composition is usually quite uniform across all bases; but in mRNA-Seq, the base composition is noticeably uneven across the first 10 to 12 bases. Illumina believes this effect is caused by the "not so random" nature of the random priming process used in the protocol. This may explain why there is a slight overall G/C bias in the starting positions of each read. The first 12 bases probably represent the sites that were being primed by the hexamers used in the random priming process. The first twelve bases in the random priming full-length cDNA sequencing protocol (mRNA-seq) always have IVC plots that look like what has been described. This is because the random priming is not truly random and the first twelve bases (the length of two hexamers) are biased towards sequences that prime more efficiently. This is entirely normal and expected.
    There was also a publication which investigated this:

    Hansen KD, Brenner SE, Dudoit S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res 2010 Apr.;
    Last edited by kmcarr; 04-12-2013, 12:35 PM. Reason: Hyperlink reference
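To see the effect described in the FAQ on your own data, you can tabulate the base composition at each of the first few cycles. The following is only a minimal sketch: the function name `base_composition` and the toy reads are made up for illustration, and in practice a tool such as FastQC reports the same kind of table from a real FASTQ file.

```python
from collections import Counter


def base_composition(reads, n_positions=15):
    """For each of the first n_positions cycles, return the fraction of A/C/G/T.

    Fractions are relative to all bases seen at that position (including N).
    """
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counts[i][base] += 1

    table = []
    for position_counts in counts:
        total = sum(position_counts.values())
        table.append(
            {b: (position_counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return table


# Toy reads, invented for illustration only (not real data).
toy_reads = ["GTACGT", "GTTCGA", "GCACGT", "ATACGT"]

for pos, fractions in enumerate(base_composition(toy_reads, n_positions=3), start=1):
    print(pos, fractions)
```

In a library without priming bias the four fractions should be roughly flat from cycle to cycle; a skew confined to the first 10 to 15 positions is the pattern discussed in this thread.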

    • #3
      Is it recommended to trim these first bases then? It sounds like they are valid mRNA sequence, even though there is a preference to certain reads from the "random" priming. The researchers who did lettuce transcriptome created better assemblies when they trimmed this region. I don't understand why this occurred. Maybe in the process of trimming the reads they removed some poor quality regions in the 5' end??

      • #4
        Originally posted by blindtiger454
        Is it recommended to trim these first bases then? It sounds like they are valid mRNA sequence, even though there is a preference to certain reads from the "random" priming. The researchers who did lettuce transcriptome created better assemblies when they trimmed this region. I don't understand why this occurred. Maybe in the process of trimming the reads they removed some poor quality regions in the 5' end??
        I have carefully studied the UC Davis poster in the past, and what strikes me is that the effect of trimming the 5' end appears nearly identical to that of trimming the 3' end, so I'm not convinced by their conclusion that it is important to trim the initial 15nt. However, I have heard from other researchers that these bases do present a particular problem for de novo assembly with de Bruijn graph assemblers (which covers just about all of the most popular short-read assemblers, including Velvet). The thinking is that the k-mer diversity of the first 15nt is significantly lower than that of the remainder of the read, which seems to cause problems for the assembler.

        If you are doing a de novo assembly why not give it a try both ways and see what your results are?

        On the other hand if I am mapping the reads to a genome (vs de novo) I never trim the 5' ends of RNA-Seq reads and I find they map perfectly well.
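One rough way to check the k-mer diversity argument on your own reads is to compare the first 15 nt against the rest of each read. This is a sketch under stated assumptions: `kmer_diversity` is a hypothetical helper, the ratio of distinct to total k-mers is only one crude measure (it depends on read depth and on k), and it is not what any assembler computes internally.

```python
def kmer_diversity(reads, start, end, k=5):
    """Fraction of distinct k-mers among all k-mers found in read[start:end]."""
    seen = set()
    total = 0
    for read in reads:
        window = read[start:end]
        for i in range(len(window) - k + 1):
            seen.add(window[i:i + k])
            total += 1
    return len(seen) / total if total else 0.0


# Toy reads, invented for illustration: identical 5' ends, different tails.
toy_reads = ["ACGTACGGTTCA", "ACGTACCATGAA", "ACGTACTTGCAG"]

head = kmer_diversity(toy_reads, 0, 6, k=3)   # first 6 nt of each read
tail = kmer_diversity(toy_reads, 6, 12, k=3)  # remaining 6 nt
print(f"5' window diversity: {head:.2f}, rest of read: {tail:.2f}")
```

On real data you would use `start=0, end=15` for the 5' window and compare it with the remainder of the read, ideally on a random subsample so that both windows contribute a similar number of k-mers.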

        • #5
          Thanks for the information. Our reads are 55bp, and they are from a tetraploid plant. Given the large number of paralogues and the allelic diversity in plants, I want to do minimal trimming for the assembly. It's bad enough having 55bp. The UC Davis folks had 80bp reads. If I trimmed my reads down to 40bp, I'm afraid the assembler would incorrectly assemble paralogues. Sometimes 15 nucleotides is all the difference between 2 closely related transcripts/genes.

          • #6
            FASTQ Trimmer tool

            hi guys,
            I'm new to this forum. Can anyone tell me how to decide how many bases I should trim with FASTQ Trimmer? What is the ideal score, and which values do I have to look at (Q1, median or Q3)?

            Thanks!

            • #7
              bump

              • #8
                I sorted that out...if anyone needs info glad to help

                • #9
                  Hi folks - hope some of you can help me clarify something about adapter contamination and adapter trimming.

                  I made TruSeq Illumina libraries and sequenced them for 100bp paired end reads.

                  When I view the 'per base sequence content' with fastQC I get something that looks like adapter contamination. I then used cutadapt to remove the adapter sequence. The 'per base sequence content' before and after cutadapt is shown in the attached pdf.

                  Now this is all fine and dandy but what I find a bit confusing is why the adapter sequence is at the beginning of the read. My understanding was that adapter contamination mainly arises when the read is too short so at the end of the read the sequencer starts to sequence the adapter.

                  So why does the adapter appear at the beginning of the read and not at the end?

                  Am I misunderstanding something? I would love to have a clarification of this.

                  Thanks,
                  blanco

                  • #10
                    Originally posted by blanco
                    Hi folks - hope some of you can help me clarify something about adapter contamination and adapter trimming.

                    I made TruSeq Illumina libraries and sequenced them for 100bp paired end reads.

                    When I view the 'per base sequence content' with fastQC I get something that looks like adapter contamination. I then used cutadapt to remove the adapter sequence. The 'per base sequence content' before and after cutadapt is shown in the attached pdf.

                    Now this is all fine and dandy but what I find a bit confusing is why the adapter sequence is at the beginning of the read. My understanding was that adapter contamination mainly arises when the read is too short so at the end of the read the sequencer starts to sequence the adapter.

                    So why does the adapter appear at the beginning of the read and not at the end?


                    Am I misunderstanding something? I would love to have a clarification of this.

                    Thanks,
                    blanco
                    You can get adapter-dimer (where the DNA insert size is effectively 0), meaning that you sequence only adapter (hence it appears at the 5' end). If this is the case, I believe cutadapt will just remove those reads from your fastq file (maybe someone can confirm).
                    Those peaks don't look like dimer to me; they look more like the random priming issue. When you have bad adapter contamination, you can actually read the adapter sequence in your %base graph (see attached plot of a run that had 10% adapter dimer).
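A quick way to tell the two cases apart is to count how many reads begin with the adapter itself. The snippet below is a hedged sketch: `fraction_starting_with` is a hypothetical helper, the toy reads are invented, and the `ADAPTER_PREFIX` value is an assumption (the commonly quoted start of the TruSeq adapter), so substitute the sequence that matches your own kit.

```python
def fraction_starting_with(reads, adapter_prefix):
    """Fraction of reads whose 5' end is the given adapter prefix."""
    reads = list(reads)
    if not reads:
        return 0.0
    prefix = adapter_prefix.upper()
    hits = sum(1 for read in reads if read.upper().startswith(prefix))
    return hits / len(reads)


# Assumption: commonly quoted TruSeq adapter start; check your kit's documentation.
ADAPTER_PREFIX = "AGATCGGAAGAGC"

# Toy reads, invented for illustration: two adapter-dimer-like reads, two normal reads.
toy_reads = [
    ADAPTER_PREFIX + "ACACGTCTG",
    "GTTCAGGATCCAGT",
    "CCATGGTTAGCA",
    ADAPTER_PREFIX + "GTCGT",
]

print(f"{fraction_starting_with(toy_reads, ADAPTER_PREFIX):.0%} of reads start with the adapter")
```

A fraction near zero alongside a skewed first 10 to 15 cycles points towards the priming bias rather than adapter dimer.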

                    • #11
                      I have the same problem too, with exactly the same ACGT bias over the first 15 bp/cycles. I asked an Illumina representative, who said that this is due to the random hexamer priming, as mentioned above.

                      • #12
                        What if it's WGS and not RNA-Seq? I see the same thing with the Nextera XT kit on the MiSeq. Is it a non-random recognition site for the tagmentation enzyme?

                        • #13
                          Hi IBseq

                          Originally posted by IBseq
                          I sorted that out...if anyone needs info glad to help
                          I need help. Can you please help me to trim both ends 5' and 3'?

                          Thanks in advance.

                          • #14
                            Originally posted by nareshvasani
                            I need help. Can you please help me to trim both ends 5' and 3'?

                            Thanks in advance.

                            You can use cutadapt to trim both the 5' and 3' ends. The fastx_clipper can only trim the 3' end. With cutadapt, you must first run cutadapt -g, and then run cutadapt -a on the processed output. If you use -g and -a at the same time, it will only cut one end.
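The two-pass order described above could be scripted roughly as follows. This sketch only builds and prints the two command lines and does not run them. The function name, the file names and the adapter sequences are placeholders chosen for illustration; `-g`, `-a` and `-o` are the cutadapt options for a 5' adapter, a 3' adapter and the output file.

```python
import shlex


def two_pass_cutadapt(five_prime, three_prime, reads_in, intermediate, reads_out):
    """Build two cutadapt commands: 5' adapter first (-g), then 3' adapter (-a)."""
    first = ["cutadapt", "-g", five_prime, "-o", intermediate, reads_in]
    second = ["cutadapt", "-a", three_prime, "-o", reads_out, intermediate]
    return [first, second]


# Placeholder adapter sequences and file names; replace with your own.
commands = two_pass_cutadapt(
    five_prime="ACGTACGTACGT",
    three_prime="TGCATGCATGCA",
    reads_in="reads.fastq",
    intermediate="reads.5p_trimmed.fastq",
    reads_out="reads.trimmed.fastq",
)

for cmd in commands:
    print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to `subprocess.run(cmd, check=True)`, assuming cutadapt is installed and on your PATH.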

                            • #15
                              Originally posted by nareshvasani
                              I need help. Can you please help me to trim both ends 5' and 3'?

                              Thanks in advance.
                              I always use the fastx_trimmer; you can use the -f and -l options to set the first and the last base to be kept.
                              Last edited by Michael.Ante; 09-25-2013, 07:23 AM. Reason: typo
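To make the meaning of the -f and -l options concrete, here is a minimal Python sketch of the same coordinate convention: both values are 1-based and inclusive. The function `trim_record` is hypothetical and only mimics the trimming of a single read; it is not part of the FASTX-Toolkit.

```python
def trim_record(seq, qual, first=1, last=None):
    """Keep bases first..last (1-based, inclusive) of one read and its qualities."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have the same length")
    end = len(seq) if last is None else last
    return seq[first - 1:end], qual[first - 1:end]


# Toy 55 bp read, invented for illustration (same length as the reads in post #5).
seq = "ACGT" * 13 + "ACG"
qual = "I" * 55

# Dropping the first 15 bases (keeping base 16 onwards) leaves 40 bp.
trimmed_seq, trimmed_qual = trim_record(seq, qual, first=16)
print(len(trimmed_seq), trimmed_seq[:8])

# Keeping bases 2 to 5 of a short read.
print(trim_record("ACGTACGT", "IIIIIIII", first=2, last=5))
```

So removing the first 15 bases corresponds to -f 16, and -l can be used in the same call to cap the read at a given last base.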
