  • assembly strategies for repetitive dna

    Hi all,

    I was hoping to get some feedback on an assembly strategy of repetitive dna using Illumina 100-PE data.

    I've seen some strategies, such as ALLPATHS-LG, that use multiple libraries of increasingly larger insert size to resolve repetitive regions. For example, the assembly of the potato genome (which is ~60% repetitive) used 7 libraries with insert sizes ranging from 200 bp to 20,000 bp.

    We're running a pilot on a few samples to see how well the assembly works but, because of coverage issues, we will only be creating 2 libraries per sample.

    So here is my question:

    Is it better to create the libraries with insert sizes that are close in range (e.g., 200 bp and 500 bp) or far apart (e.g., 200 bp and 5,000 bp)? I can see pros and cons of both, but wanted to elicit some advice before going forward.

    Thanks,

    John

  • #2
    One more thing while I'm at it...

    Has anyone used Telescoper (DOI:10.1093/bioinformatics/bts399)? If so, I'd be interested in hearing about your experiences with it.


    • #3
      Assembly of repetitive DNA

      The big question: what is repeating/how much is repeating? Sounds vaguely leninist.

      One person who has experience with this is Matt Riley at U. Tennessee, at least in microbial genomes.

      Anyway, if you have tandem repeats of a few bp, then read length is your big factor; if you have repeats of a gene, then you need jumps/paired ends. If you have repeats of gene clusters, you may need something more substantial. 40kb jumps are possible and published; PacBio reads are another option; an Optical Map may be the answer. Of course, you'll need to 'fill in' the map or fix the SNPs in the SMRT reads. Joint assemblies are performed by a number of groups; the folks at NCBI, the FDA, and UMD (Mihai Pop) are familiar with the strategies.

      Ultimately, if you have the worst case scenario, some sort of scale-free nesting of repeats within repeats, you would need all these solutions combined.

      Hope that points you toward some ideas.


      • #4
        Originally posted by bckirkup
        The big question: what is repeating/how much is repeating? Sounds vaguely leninist.
        Leninist indeed

        Thanks for your response... It is VERY helpful.

        I'm interested in one chromosome which is probably a worst case scenario - variable sized microsatellites, minisatellites, transposable elements, and variable sized rDNA arrays. Quite frankly, it's a mess.

        I'm not looking for a complete assembly, but the chromosome is ~40 Mb and I would like to generate scaffolds large enough to give me something to work with. Using a published Illumina dataset (86 bp PE), I wasn't able to assemble anything larger than 1 kb in the repetitive regions, although in non-repetitive regions I was getting scaffolds as large as 400 kb.

        My goal is to predict functional motifs from the assembly (ex. TF binding sites, transposable element content, CNV in satellite sequences) and identify variation in this chromosome across populations.


        • #5
          Ugh. (sorry, I meant to say "What a challenging project!")

          Regarding your initial question (close vs. large size differences in the two libraries), the larger difference will be more useful for assembly. The ideal is to have paired end read jumps that span the repeats, which delimits the number of copies in the intervening region. You can calculate transposable element content and satellite CNV by read depth (although their positions will be difficult/impossible to assign).
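A rough sketch of the read-depth idea, assuming you already have per-base depths for a repeat family and for a set of single-copy control regions (the function name and the numbers below are made up for illustration, not taken from any real dataset or tool):

```python
from statistics import median


def estimate_copy_number(repeat_depths, single_copy_depths):
    """Estimate repeat copy number as median repeat depth / median single-copy depth."""
    baseline = median(single_copy_depths)
    if baseline <= 0:
        raise ValueError("single-copy baseline depth must be positive")
    return median(repeat_depths) / baseline


# Hypothetical per-base depths: a single-copy control region and a satellite repeat.
single_copy = [28, 30, 31, 29, 32, 30]
satellite = [590, 610, 600, 605, 598]

print(estimate_copy_number(satellite, single_copy))
```

Medians are used here only to blunt the effect of outlier positions; in practice GC bias and mapping ambiguity would also need correcting before the ratio means much.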

          Note that coverage issues do not necessarily limit you to two library sizes. For assembly, it would be more useful to have additional jump libraries sequenced at lower depth. IIRC, ALLPATHS-LG sequenced large (10kbp) jump libraries at 1/25th the depth of the shorter-sized inserts. A similar hybrid approach may also be your best bet.
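To see why a jump library can be sequenced much more shallowly, it may help to compare sequence coverage (bases read) with physical coverage (bases spanned by inserts). The sketch below uses the ~40 Mb chromosome and 100 bp reads from this thread; the pair counts are hypothetical, chosen so the jump library gets 1/25th the pairs of the short library:

```python
def sequence_coverage(n_pairs, read_len, genome_size):
    """Fold coverage from sequenced bases: two reads per pair."""
    return 2 * n_pairs * read_len / genome_size


def physical_coverage(n_pairs, insert_size, genome_size):
    """Fold coverage from the span of each insert, sequenced or not."""
    return n_pairs * insert_size / genome_size


GENOME = 40_000_000   # ~40 Mb chromosome mentioned earlier in the thread
READ_LEN = 100        # 100 bp paired-end reads

# Hypothetical pair counts: the jump library gets 1/25th of the short library.
short_pairs, short_insert = 10_000_000, 200
jump_pairs, jump_insert = 400_000, 10_000

print(sequence_coverage(short_pairs, READ_LEN, GENOME),
      physical_coverage(short_pairs, short_insert, GENOME))
print(sequence_coverage(jump_pairs, READ_LEN, GENOME),
      physical_coverage(jump_pairs, jump_insert, GENOME))
```

Under these assumptions the jump library has only 2x sequence coverage but 100x physical coverage, which is the quantity that matters for linking contigs across repeats.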

          Good luck!


          • #6
            Originally posted by HESmith
            Ugh. (sorry, I meant to say "What a challenging project!")
            haha...indeed!

            Thanks for your comments. This is making me think that starting with 2 libraries for the pilot (probably 250 bp and 1 kb) at a higher depth, and then running another 2 libraries (5 kb and 10 kb) at a lower depth, would be a good strategy.


            • #7
              Jumping libraries are a possibility, but the cost and difficulty of making the libraries are a consideration, and there may be value in sequencing through the entire region. The current PacBio C2 XL chemistry averages 5 kb read length, with 10% of reads at 10 kb or longer; the maximum is usually ~20 kb.

              If you would like more information about potentially doing a PacBio library, we can discuss it, and I can help make some introductions to labs in the area to get your challenging region sequenced. You can email me: akieu at pacificbiosciences.com
