  • wenhuang
    Member
    • Feb 2010
    • 30

    Overlapping paired end - tophat

    Hi,

    I have a paired-end (2x75) Illumina data set in which the mates may overlap. The selected fragment size was 240 bp; after subtracting adapter/primer sequences, about 120 bp of insert was left, so the two 75 bp reads overlap by about 30 bp.

    My questions are:

    1) Is this going to affect TopHat alignment? How should the -m option be specified?

    2) When counting coverage, my intuition is that the overlapping bases would be counted twice even though they appear in the library only once. Is there any way to get around this?

    3) Is this going to affect Cufflinks transcript assembly and quantitation?

    Thanks for your help!
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    I don't know how TopHat reacts to it, but I can already tell you that Bowtie won't like it, and hence TopHat will fail, too.

    I'm currently working with a similar data set and noted that Bowtie fails to find an alignment for an overlapping paired read (and so does Eland). I ended up aligning the two ends separately and then stitching things together manually.

    Of course, this is not an ideal solution.

    Simon


    • KevinLam
      Senior Member
      • Nov 2009
      • 204

      #3
      In reply to Simon Anders (#2):
      How did you stitch them? With samtools merge?
      http://kevin-gattaca.blogspot.com/

      • KevinLam
        Senior Member
        • Nov 2009
        • 204

        #4
        In reply to wenhuang's original post:
        Why not convert your paired-end data into single-end data? Since there is a 30 bp overlap, the mates should assemble into a single read quite nicely, so you would end up with 120 bp single-end data.

        • wenhuang
          Member
          • Feb 2010
          • 30

          #5
          In reply to Simon Anders (#2): My alignment did not seem to have much of a problem. Here is a sample of the first few alignments. It appears to me that the two reads were processed separately, but I am not sure about that.

          HWUSI-EAS787_0001:5:70:1610:809#AAATAG 99 chr1 5312 255 81M = 5366 0 GCGAGGAAAGAAATGCACTAAGTAAAAAACTTAGTCATTTTTTAAAGAGAATTAAAATGAAGTCCAATTCCTTTGAGTTAC HGHHIHHHGHHHGGGHHHHHHHHIHHHGHFHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHEHHFHEHGHHG NM:i:0
          HWUSI-EAS787_0001:5:70:1610:809#AAATAG 147 chr1 5366 255 81M = 5312 0 AAATGAAGTCCAATTCCTTTGAGTTACAAATTTACAATCACTACTCAGTAATTAAAACTATTCAGTTATAGTGAACTGATT IHFHHIHBGHHHHHGHHFEHHHHHHHHHHHHHHHHHHHHEHHGHHHHHHHHHHHHGGHHHHHHHHHHIHHHHHHGHHHHHH NM:i:0

          HWUSI-EAS787_0001:5:30:1504:1763#TTGTCG 163 chr1 5822 255 81M = 5860 0 CCAGAGCCCACAGCTTACTTTTGGTGGTACCCATCCTAAGGGTCTGGGCAAACATATAACGATAAATGTCCATCATTATAA HHGHHGGFHHHHHHHHHEHHHHHHHHHHHEHHGHDEGHHHHHBBBGGG7FHH2HEHBHH0FHEFHC+?6><CC-CEDDBA@ NM:i:0
          HWUSI-EAS787_0001:5:30:1504:1763#TTGTCG 83 chr1 5860 255 81M = 5822 0 AGGGTCTGGGCAAACATATAACGATAAATGTCCATCATTATAATATCACACAGAGTAGTTTCACTGCCCTGAAACTCTTTT G@CBFHE?G=HHGIHHHHGHGHBHGHHHEGHDHHGHHFFHHHHHHHHHHGHHGHGFHCHHGHHHHFHHHHHHHHHHHHHHH NM:i:0
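One way to check how the mates in these records were treated is to decode the FLAG values and work out how many reference bases both mates cover. The sketch below is only an illustration: it copies the FLAG, POS and CIGAR fields from the four records above by hand, and the helper names (`decode_flag`, `aligned_end`, `pair_overlap`) are made up for this example. It assumes ungapped alignments, which is all the `81M` CIGAR strings here require.

```python
# FLAG, POS and CIGAR copied from the four records shown above.
records = {
    "pair_1": [(99, 5312, "81M"), (147, 5366, "81M")],
    "pair_2": [(163, 5822, "81M"), (83, 5860, "81M")],
}

# Bit meanings as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def decode_flag(flag):
    """Return the names of the SAM flag bits that are set in `flag`."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def aligned_end(pos, cigar):
    """Last reference base (1-based) covered by an ungapped '<n>M' alignment."""
    if not cigar.endswith("M") or not cigar[:-1].isdigit():
        raise ValueError("this sketch only handles '<n>M' CIGAR strings")
    return pos + int(cigar[:-1]) - 1


def pair_overlap(mates):
    """Number of reference bases covered by both mates of a pair."""
    (_, pos_a, cigar_a), (_, pos_b, cigar_b) = mates
    start = max(pos_a, pos_b)
    end = min(aligned_end(pos_a, cigar_a), aligned_end(pos_b, cigar_b))
    return max(0, end - start + 1)


for name, mates in records.items():
    for flag, pos, cigar in mates:
        print(name, flag, pos, decode_flag(flag))
    print(name, "overlap:", pair_overlap(mates), "bp")
```

For these records the proper-pair bit (0x2) is set on all four mates, and the two pairs overlap by 27 bp and 43 bp respectively.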




          • wenhuang
            Member
            • Feb 2010
            • 30

            #6
            In reply to KevinLam (#4):
            I think this is a decent solution. Many of my reads suffer from bad quality at the end, though. Can you recommend a tool that could do this job? Thanks!

            • KevinLam
              Senior Member
              • Nov 2009
              • 204

              #7
              In reply to wenhuang (#6):
              The only tool I know of that can do this is phrap, but I am not sure how long it would take when applied to this many reads.

              • Cole Trapnell
                Senior Member
                • Nov 2008
                • 213

                #8
                In reply to wenhuang's original post:
                As of TopHat 1.0.13, you should be able to specify a negative inner distance of -30. TopHat maps the reads independently and uses a different algorithm from Bowtie for handling the ends. The coverage.wig file displays depth of read coverage, not depth of physical coverage, so those bases will be double-counted, as you suggest. However, Cufflinks operates at the fragment level, not the read level, and so should do the right thing here.
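To make the difference between read coverage and physical (fragment) coverage concrete, here is a minimal sketch. It is not how TopHat builds coverage.wig or how Cufflinks counts fragments; the functions `read_depth` and `fragment_depth` and the single example pair are hypothetical, and spliced alignments are ignored. It uses the numbers from the original post: 75 bp reads on a roughly 120 bp insert, which gives an inner distance of 120 - 2*75 = -30.

```python
from collections import Counter


def read_depth(pairs, read_len):
    """Per-base depth when each mate is counted on its own."""
    depth = Counter()
    for left_start, right_start in pairs:
        for start in (left_start, right_start):
            for pos in range(start, start + read_len):
                depth[pos] += 1
    return depth


def fragment_depth(pairs, read_len):
    """Per-base depth when each pair is counted once over its whole span."""
    depth = Counter()
    for left_start, right_start in pairs:
        for pos in range(left_start, right_start + read_len):
            depth[pos] += 1
    return depth


read_len = 75
insert_len = 120
inner_distance = insert_len - 2 * read_len  # negative when the mates overlap

# One hypothetical pair: left mate starts at 1, right mate starts at 46.
pairs = [(1, 1 + insert_len - read_len)]

by_read = read_depth(pairs, read_len)
by_fragment = fragment_depth(pairs, read_len)

double_counted = sum(1 for d in by_read.values() if d == 2)
print("inner distance:", inner_distance)
print("bases counted twice at read level:", double_counted)
print("maximum fragment-level depth:", max(by_fragment.values()))
```

At the read level the 30 overlapping bases get a depth of 2, while at the fragment level every base of the 120 bp insert is counted once.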


                • ecabot
                  Junior Member
                  • Jul 2008
                  • 6

                  #9
                  Here are more details about Wen's run which was 2x75.

                  The minimum fragment size, including flanking adapters, is 150 bp. Thus, fragments with the smallest insert could be diagrammed like this, with 32 bases of overlapping cDNA:


                  [adapter:59][cDNA 32][adapter:59]
                  o~~~~~~~~~~~> (with 43bp of adapter)
                  <~~~~~~~~~~~~o


                  I am assuming, however, that reads this short would fail to map because of the high proportion of adapter-derived sequence embedded in the reads.


                  These considerations lead me to the following questions:


                  1) Does a negative inner distance of, for example, -30 reflect an expected mean of 30 bp of overlap, or does it specify a maximum amount of overlap?

                  After all, most of Wen's reads don't overlap, and the overlap could be as high as a full 75 bp for a 193 bp fragment. If I were to calculate the actual mean inner distance, treating overlaps as negative distances, the overall mean might well turn out to be positive.

                  2) If we were to trim the adapters this would invariably lead to a distribution of read lengths rather than a uniform 75 bases. Can Bowtie and TopHat deal with unequal read lengths or is this likely to be a problem?
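The arithmetic behind these numbers can be sketched as follows. This does not answer how TopHat interprets the inner distance; it only works through the geometry described above, assuming 59 bp of adapter on each side and 75 bp reads. The function name `describe_fragment` and the list of fragment sizes are hypothetical, chosen only to show that a library containing some overlapping pairs can still have a positive mean inner distance.

```python
from statistics import mean

READ_LEN = 75           # 2x75 run
ADAPTER_TOTAL = 2 * 59  # 59 bp of adapter on each side of the cDNA


def describe_fragment(fragment_len, read_len=READ_LEN, adapter_total=ADAPTER_TOTAL):
    """Return insert size, inner distance, mate overlap and adapter read-through."""
    insert = fragment_len - adapter_total
    if insert < 0:
        raise ValueError("fragment is shorter than the two adapters")
    return {
        "insert": insert,
        # Negative when the mates overlap.
        "inner_distance": insert - 2 * read_len,
        # cDNA bases seen by both mates; capped by insert and read length.
        "overlap": max(0, min(insert, read_len, 2 * read_len - insert)),
        # Adapter bases at the 3' end of each read when the insert is short.
        "adapter_read_through": max(0, read_len - insert),
    }


smallest = describe_fragment(150)      # 32 bp of cDNA, 43 bp of adapter per read
full_overlap = describe_fragment(193)  # 75 bp of cDNA, mates overlap completely

# Hypothetical fragment sizes, only to illustrate the mean.
hypothetical_fragments = [193, 240, 300, 350, 400]
mean_inner = mean([describe_fragment(f)["inner_distance"] for f in hypothetical_fragments])

print(smallest)
print(full_overlap)
print("mean inner distance:", mean_inner)
```

With these made-up fragment sizes, two of the five pairs overlap, yet the mean inner distance comes out positive.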


                  • ecabot
                    Junior Member
                    • Jul 2008
                    • 6

                    #10
                    Here is how the diagram from my previous posting should look (with dots replacing whitespace). Sorry for the confusion.

                    [adapter:59][cDNA 32][adapter:59]
                    .............................o~~~~~~~~~~~> (with 43bp of adapter)
                    ...........<~~~~~~~~~~~~o


                    • Auction
                      Member
                      • Jul 2009
                      • 24

                      #11
                      In reply to Simon Anders (#2):
                      In my case, Bowtie 0.12.3 (and also BWA) seems to work well for overlapping paired-end reads. I have 2x59 reads, and for many records the ISIZE is less than 118 while the FLAG field indicates that they are mapped as proper pairs.

                      • Cole Trapnell
                        Senior Member
                        • Nov 2008
                        • 213

                        #12
                        In reply to Simon Anders (#2):
                        TopHat and Bowtie use completely different procedures to handle paired ends, and their policies are not the same. TopHat maps the left and right reads independently, and recent versions should have no trouble with paired end libraries with negative inner distances and overlapping reads. With TopHat 1.0.13 and Cufflinks 0.8.0, I have processed an RNA-Seq library size selected to 100bp and sequenced with 2x76bp GAII. The mean inner distance in this case is negative, and the TopHat/Cufflinks stack produced fine results.

                        To answer a previous question - TopHat will not handle reads of different lengths gracefully, so if you make "virtual" long reads from overlapping mates, make sure to trim the products down to a uniform length.
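Trimming merged reads to a uniform length could look roughly like the sketch below. The function name `trim_to_uniform` and the record layout (name, sequence, quality) are assumptions for illustration. Dropping reads that are shorter than the target and cutting from the 3' end are policy choices made here, not something TopHat prescribes.

```python
def trim_to_uniform(records, target_len):
    """Keep reads of at least target_len and cut them to exactly that length.

    `records` is an iterable of (name, sequence, quality) tuples.
    Reads shorter than target_len are dropped.
    """
    trimmed = []
    for name, seq, qual in records:
        if len(seq) != len(qual):
            raise ValueError(f"{name}: sequence and quality lengths differ")
        if len(seq) < target_len:
            continue
        trimmed.append((name, seq[:target_len], qual[:target_len]))
    return trimmed


# Hypothetical merged reads of different lengths.
merged = [
    ("read_a", "ACGTACGTAC", "IIIIIIIIII"),
    ("read_b", "ACGTACG", "IIIIIII"),
    ("read_c", "ACGTACGTACGT", "IIIIIIIIIIII"),
]

uniform = trim_to_uniform(merged, target_len=10)
for name, seq, qual in uniform:
    print(name, seq, qual)
```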


                        • ACTGangster
                          Junior Member
                          • Sep 2009
                          • 8

                          #13
                          Another possible solution

                          I had to edit this post. I wrote a program that assembles overlapping paired ends from Illumina. It used to be public, but now it's private because I want to do a paper on it.

                          If you want a copy, you can e-mail me and I'll send it to you.

                          I tested it on 1.5 million reads that overlap by ~25 bp, and it assembled about 78% into larger contigs, which can then be de novo assembled. In the overlapping region, it chooses the nucleotide with the best quality score (if there is a discrepancy). If there is a discrepancy and the quality scores are the same, it chooses the appropriate ambiguous nucleotide.
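The merging rule described above can be sketched as follows. This is not the Stitch program's code; it is a small illustration of the stated rule, with hypothetical names (`reverse_complement`, `merge_pair`) and a known overlap length passed in rather than detected. Two details are assumptions made here: when both mates agree, the higher of the two quality characters is kept, and on a tie the first mate's quality character is kept. Quality characters are compared directly, which works as long as both mates use the same encoding.

```python
# IUPAC codes for the six two-base ambiguities.
IUPAC = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (used to orient mate 2)."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(seq1, qual1, seq2, qual2, overlap):
    """Merge mate 1 with mate 2, where mate 2 is already oriented like mate 1.

    The last `overlap` bases of mate 1 are assumed to align with the first
    `overlap` bases of mate 2.
    """
    if overlap <= 0 or overlap > min(len(seq1), len(seq2)):
        raise ValueError("overlap must be between 1 and the shorter read length")
    merged_seq = list(seq1[:-overlap])
    merged_qual = list(qual1[:-overlap])
    columns = zip(seq1[-overlap:], qual1[-overlap:], seq2[:overlap], qual2[:overlap])
    for base1, q1, base2, q2 in columns:
        if base1 == base2:
            base, q = base1, max(q1, q2)  # assumption: keep the higher quality
        elif q1 > q2:
            base, q = base1, q1
        elif q2 > q1:
            base, q = base2, q2
        else:
            # Discrepancy with equal quality: use the ambiguity code.
            base, q = IUPAC.get(frozenset((base1, base2)), "N"), q1
        merged_seq.append(base)
        merged_qual.append(q)
    merged_seq.extend(seq2[overlap:])
    merged_qual.extend(qual2[overlap:])
    return "".join(merged_seq), "".join(merged_qual)


# Hypothetical mates with a 3 bp overlap; mate 2 is given as sequenced.
mate2_seq = reverse_complement("ACCGGA")
mate2_qual = "IIIIII"[::-1]
print(merge_pair("ACGTAC", "IIII5I", mate2_seq, mate2_qual, 3))
print(merge_pair("ACGTAC", "IIIIII", "TGCGGT", "IIIIII", 3))
```

In the first call the low-quality `A` in mate 1 loses to the `C` in mate 2; in the second, the `A`/`G` disagreement at equal quality becomes `R`.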
                          Last edited by ACTGangster; 07-24-2010, 05:26 PM. Reason: makebettered


                          • Zigster
                            Jeremy Leipzig
                            • May 2009
                            • 117

                            #14
                            I uploaded a python script I wrote for this to SVAR:
                            --
                            Jeremy Leipzig
                            Bioinformatics Programmer

                            • ACTGangster
                              Junior Member
                              • Sep 2009
                              • 8

                              #15
                              stitch

                              I open-sourced my Stitch program as I do not plan on writing a paper on it specifically.



                              It runs on as many cores as you have. I processed 20 million reads in 40 minutes on a 16-core Mac Pro.
