  • Genes with multiple copies assembling as single contig

    Hi all,

    I'm doing a de novo assembly of a cyanobacterial genome with SPAdes. All is working well, but when a gene is present in multiple copies (e.g. the 16S rRNA gene), all reads from those copies appear to be collapsed into a single contig.

    Coverage of these contigs appears to correspond quite well to the expected number of copies in the genome (i.e. normal coverage is ~50x; for a contig containing a gene with four copies, coverage is ~200x).
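The coverage reasoning above can be written as a quick calculation. This is only a rough sketch: the function name is hypothetical, and it assumes coverage scales linearly with copy number, which may not hold exactly in real data.

```python
def estimate_copy_number(contig_coverage: float, baseline_coverage: float) -> int:
    """Estimate how many genomic copies were collapsed into one contig.

    Assumes coverage scales linearly with copy number (an approximation).
    """
    if baseline_coverage <= 0:
        raise ValueError("baseline_coverage must be positive")
    # Ratio of repeat-contig coverage to single-copy coverage, rounded to a whole copy count.
    return max(1, round(contig_coverage / baseline_coverage))


# Values from the post: ~50x baseline, ~200x on the collapsed contig.
print(estimate_copy_number(200, 50))
```

With the numbers quoted in the post, the ratio is 200 / 50, i.e. four copies.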

    Does anyone know of a method to prevent this from happening so that each of the copies assemble separately in different contigs?

    Cheers
    N

  • #2
    Hi Cyanoevo,

    I have the exact same problem. Did you ever find an answer?

    Cheers,

    Eduardo

    • #3
      You can try mapping reads to a 16S copy, then clustering the reads that mapped, then assembling the clusters. This will work if the reads are sufficiently long (for Illumina, merging them may be useful) and the 16S copies are sufficiently different. If not, you'll just get one cluster. You probably need overlapping 2x250bp reads at a minimum (insert size around 400bp+) to have a good chance.

      You can cluster like this with Dedupe (packaged with BBMap):

      dedupe.sh in=merged.fq -Xmx30g am=f ac=f fo c rnc=f mcs=50 mo=350 pto pattern=cluster_%.fq

      The "mo=350" specifies a min overlap of 350bp. This should be around 80%-90% of your read length. If you have single-ended 250bp reads, set it to 200; if you have merged reads with an insert size of around 400bp, try 350. If you have 100bp non-overlapping reads, don't bother, they're too short.
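As a rough illustration of the 80%-90% rule of thumb above, the range of "mo" values for a given read length could be computed as follows. The function name is hypothetical, and this is just arithmetic on the stated rule, not part of Dedupe itself.

```python
def suggested_min_overlap(read_length: int, low_pct: int = 80, high_pct: int = 90) -> tuple[int, int]:
    """Return (low, high) bounds for the min-overlap setting, in bp.

    Uses integer percentages to avoid floating-point rounding surprises.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return read_length * low_pct // 100, read_length * high_pct // 100


# 250bp single-ended reads and ~400bp merged reads, as in the post.
for length in (250, 400):
    low, high = suggested_min_overlap(length)
    print(f"{length}bp reads: mo between {low} and {high}")
```

The values suggested in the post (200 for 250bp reads, 350 for ~400bp merged reads) both fall inside these ranges.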

      For this kind of situation, which is very sensitive to chimeras, I recommend merging reads with BBMerge using the "vstrict" flag.

      • #4
        Thanks Brian, I'll give it a try. I anticipate that I'm going to get one cluster because the reads are seemingly identical. It's suggestive that the coverage for the rRNA operon is about 3 times the coverage of the neighboring genes so at a minimum I'll report that in the submission.
        I guess the alternative would be going back to the wet lab to check how many copies there are.

        Cheers,

        Eduardo

        • #5
          Originally posted by ecastron:
          Thanks Brian, I'll give it a try. I anticipate that I'm going to get one cluster because the reads are seemingly identical. It's suggestive that the coverage for the rRNA operon is about 3 times the coverage of the neighboring genes so at a minimum I'll report that in the submission.
          I guess the alternative would be going back to the wet lab to check how many copies there are.
          Eduardo
          Since you have multiple copies of the 16S gene, you need an insert length long enough to let the assembler resolve this repetitive region. Otherwise, indeed, everything will end up in a single contig. Given the length of the 16S gene, you'd need at least mate pairs with an insert length of > 2-3 kb, or long reads (PacBio / Nanopore).
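The idea that a read or read pair must reach across the whole repeat can be sketched as a simple check. Everything here is an illustrative assumption rather than an assembler's actual rule: the function name, the 100 bp of unique flanking sequence required on each side, and the ~1,500 bp figure used for a typical 16S gene.

```python
def can_span_repeat(repeat_length: int, span_length: int, anchor: int = 100) -> bool:
    """Return True if a read or insert covers the repeat plus unique sequence on both sides.

    `anchor` is a hypothetical amount of unique flanking sequence (bp) needed on each side.
    """
    return span_length >= repeat_length + 2 * anchor


SIXTEEN_S_LENGTH = 1500  # assumed typical 16S rRNA gene length in bp

print(can_span_repeat(SIXTEEN_S_LENGTH, 300))   # short paired-end insert
print(can_span_repeat(SIXTEEN_S_LENGTH, 3000))  # mate-pair insert
```

Under these assumptions a 300 bp insert cannot bridge the repeat, while a 3 kb mate-pair insert can. If the whole rRNA operon is repeated rather than just the 16S gene, the required span would be correspondingly longer.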

          • #6
            Thanks for the reply! That was my impression: that I wouldn't be able to resolve it with a 300 bp insert library, only with mate pairs or long-read technology.

            Cheers,

            @ecastron
