  • Assembly strategies for repetitive DNA

    Hi all,

    I was hoping to get some feedback on a strategy for assembling repetitive DNA using Illumina 100 bp paired-end (PE) data.

    I've seen some strategies, such as ALLPATHS-LG, which use multiple libraries of increasingly larger insert size to resolve repetitive regions. For example, assembly of the potato genome (which is ~60% repetitive) used 7 libraries with insert sizes ranging from 200 bp to 20,000 bp.

    We're running a pilot on a few samples to see how well the assembly works, but because of coverage issues, we will only be creating 2 libraries per sample.

    So here is my question:

    Is it better to create the libraries with insert sizes that are close in range (e.g. 200 bp and 500 bp) or far apart (e.g. 200 bp and 5,000 bp)? I can see pros and cons of both, but wanted to elicit some advice before going forward.

    Thanks,

    John

  • #2
    One more thing while I'm at it...

    Has anyone used Telescoper (DOI:10.1093/bioinformatics/bts399)? If so, I'd be interested in hearing about your experiences with it.


    • #3
      Assembly of repetitive DNA

      The big question: what is repeating/how much is repeating? Sounds vaguely leninist.

      One person who has experience with this is Matt Riley at U. Tennessee, at least in microbial genomes.

      Anyway, if you have tandem repeats of a few bp, then read length is your big factor; if you have repeats of a gene, then you need jumps/paired ends. If you have repeats of gene clusters, you may need something more substantial. 40kb jumps are possible and published; PacBio reads are another option; an Optical Map may be the answer. Of course, you'll need to 'fill in' the map or fix the SNPs in the SMRT reads. Joint assemblies are performed by a number of groups; the folks at NCBI, the FDA, and UMD (Mihai Pop) are familiar with the strategies.

      Ultimately, if you have the worst case scenario, some sort of scale-free nesting of repeats within repeats, you would need all these solutions combined.

      Hope that points you toward some ideas.


      • #4
        Originally posted by bckirkup
        The big question: what is repeating/how much is repeating? Sounds vaguely leninist.
        Leninist indeed

        Thanks for your response... It is VERY helpful.

        I'm interested in one chromosome which is probably a worst case scenario - variable sized microsatellites, minisatellites, transposable elements, and variable sized rDNA arrays. Quite frankly, it's a mess.

        I'm not looking for a complete assembly, but the chromosome is ~40 Mb and I would like to generate scaffolds large enough to give me something to work with. Using a published Illumina dataset (86 bp PE), I wasn't able to assemble anything larger than 1 kb in the repetitive regions, although in non-repetitive regions I was getting scaffolds as large as 400 kb.

        My goal is to predict functional motifs from the assembly (ex. TF binding sites, transposable element content, CNV in satellite sequences) and identify variation in this chromosome across populations.


        • #5
          Ugh. (sorry, I meant to say "What a challenging project!")

          Regarding your initial question (close vs. large size differences in the two libraries), the larger difference will be more useful for assembly. The ideal is to have paired end read jumps that span the repeats, which delimits the number of copies in the intervening region. You can calculate transposable element content and satellite CNV by read depth (although their positions will be difficult/impossible to assign).
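
          A minimal sketch of the read-depth idea, assuming you already have a mean depth for reads mapped to a repeat consensus and a mean depth for known single-copy sequence. The function name and the numbers are hypothetical, and real data would need corrections (GC bias, mapping ambiguity) that are not shown here.

```python
def estimate_copy_number(repeat_depth: float, single_copy_depth: float) -> float:
    """Estimate repeat copy number as the ratio of mean read depths.

    repeat_depth: mean depth of reads mapped to one consensus copy of the repeat.
    single_copy_depth: mean depth over sequence known to be single copy.
    """
    if single_copy_depth <= 0:
        raise ValueError("single_copy_depth must be positive")
    return repeat_depth / single_copy_depth


# Hypothetical numbers for illustration only: a satellite consensus covered
# at 1500x in a sample whose single-copy regions sit at 30x.
copies = estimate_copy_number(repeat_depth=1500.0, single_copy_depth=30.0)
print(f"Estimated copies: {copies:.1f}")
```

          The ratio gives copy number per genome, not position, which matches the caveat above: the copies can be counted but not placed.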

          Note that coverage issues do not necessarily limit you to two library sizes. For assembly, it would be more useful to have additional jump libraries sequenced at lower depth. IIRC, ALLPATHS-LG sequenced large (10kbp) jump libraries at 1/25th the depth of the shorter-sized inserts. A similar hybrid approach may also be your best bet.
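
          To see why a jump library at low sequence depth is still useful, here is a rough back-of-the-envelope sketch using the standard coverage relation (depth = bases sequenced / target size). The 50x short-insert depth, the 200 bp insert, and the 40 Mb target are illustrative assumptions, not figures from ALLPATHS-LG; the function names are hypothetical.

```python
from math import ceil


def read_pairs_for_depth(target_size: int, sequence_depth: float, read_length: int) -> int:
    """Read pairs needed to reach a given sequence depth with paired-end reads."""
    return ceil(sequence_depth * target_size / (2 * read_length))


def physical_coverage(n_pairs: int, insert_size: int, target_size: int) -> float:
    """Average number of inserts (fragments) spanning any given position."""
    return n_pairs * insert_size / target_size


target = 40_000_000   # assumed target size, ~40 Mb
read_len = 100        # 100 bp paired-end reads

# Short-insert library at an assumed 50x sequence depth.
short_pairs = read_pairs_for_depth(target, 50, read_len)
# 10 kb jump library at 1/25th of that depth, i.e. 2x.
jump_pairs = read_pairs_for_depth(target, 50 / 25, read_len)

print("short-insert pairs:", short_pairs)
print("jump pairs:", jump_pairs)
print("short physical coverage:", physical_coverage(short_pairs, 200, target))
print("jump physical coverage:", physical_coverage(jump_pairs, 10_000, target))
```

          Under these assumptions the jump library needs 25 times fewer read pairs, yet each position is still spanned by more inserts than in the short library, because each insert covers 10 kb rather than 200 bp.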

          Good luck!


          • #6
            Originally posted by HESmith
            Ugh. (sorry, I meant to say "What a challenging project!")
            haha...indeed!

            Thanks for your comments. This is making me think that starting with 2 libraries for the pilot (probably 250 bp and 1 kb) at a higher depth, and then running another 2 libraries (5 kb and 10 kb) at a lower depth, would be a good strategy.


            • #7
              Jumping libraries are a possibility, but the cost and difficulty of making the libraries are a consideration, and there may be value in sequencing through the entire region. The current PacBio C2 XL chemistry averages 5 kb read length, with 10% of reads at 10 kb or longer; the maximum is usually ~20 kb.

              If you would like more information about potentially doing a PacBio library, we can discuss it, and I can help make some introductions to labs in the area to get your challenging region sequenced. You can email me: akieu at pacificbiosciences.com
