  • Newbler - Isogroup with isotigs AND contigs

    Dear all,

    I thought an isogroup would consist of either isotigs OR contigs. Now I found the following in one of my assemblies:
    Code:
    $ grep isogroup00001 454Isotigs.fna
    >isotig00001  gene=isogroup00001  length=472  numContigs=2
    >isotig00002  gene=isogroup00001  length=542  numContigs=2
    >contig00048  gene=isogroup00001  length=536
    >contig00049  gene=isogroup00001  length=629
    ... (200 more >contig with isogroup00001)

    $ grep isogroup00099 454Isotigs.fna
    >isotig01656  gene=isogroup00099  length=491  numContigs=2
    >isotig01657  gene=isogroup00099  length=415  numContigs=3
    >isotig01658  gene=isogroup00099  length=383  numContigs=2
    >isotig01659  gene=isogroup00099  length=326  numContigs=2
    >isotig01660  gene=isogroup00099  length=176  numContigs=2
    >contig02143  gene=isogroup00099  length=518
    Any ideas why this happens or what it means??

    And btw: the number of isogroups in the 454NewblerMetric file is larger than the number of distinct isogroups I can find in 454Isotigs.fna, and both numbers differ from the number of distinct isogroups I can find in 454IsotigsLayout... any explanations?
    Last edited by dschika; 01-11-2011, 07:02 AM.

  • #2
    dschika,

    First off, don't worry: you didn't do anything wrong. I've seen this behavior myself many times with the Newbler cDNA assembler. I can't give you a complete explanation, but there are some possibilities.

    - If a contig is > 500bp but no path connects it to other contigs to form an isotig, it will end up in the output as a contig.

    - If the number of contigs in an isogroup exceeds the isogroup threshold (set by the -ig parameter) the isogroup will not be traversed to identify isotigs. Contigs > 500bp will be reported in 454Isotigs.fna but nothing about this isogroup will be reported in 454IsotigsLayout.txt.

    - If the number of isotigs in an isogroup exceeds the isotig threshold (set by the -it parameter) isogroup traversal will stop and its contigs (>500bp) will appear in 454Isotigs.fna.

    - If the number of contigs in an isotig exceeds the isotig contig count threshold (set by the -icc parameter) further traversal of that isotig will stop. If the contigs in that isotig are not part of any other isotig they will be reported as contigs in the output files.

    As you can see, there are many complex ways in which contigs may be reported in the final output and some isogroups may not. I give Roche/454 credit for attempting to create a true transcriptome assembler, but the output from it can be incredibly difficult to deal with.
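To see how often this happens in a given assembly, the headers in 454Isotigs.fna can be tallied per isogroup. The following is only a rough sketch, assuming every header looks like the ones quoted in the first post (`>isotigNNNNN  gene=isogroupNNNNN ...` or `>contigNNNNN  gene=isogroupNNNNN ...`); the function name `classify_isogroups` and the labels it returns are made up for illustration and are not part of Newbler.

```python
import re
from collections import defaultdict

# Assumed header layout, taken from the grep output in the first post:
#   >isotig00001  gene=isogroup00001  length=472  numContigs=2
#   >contig00048  gene=isogroup00001  length=536
HEADER_RE = re.compile(r"^>(isotig|contig)\d+\s+gene=(isogroup\d+)")


def classify_isogroups(lines):
    """Label each isogroup by the kinds of sequences reported for it.

    `lines` is any iterable of text lines, for example an open file handle
    on 454Isotigs.fna. Lines that are not matching headers are skipped.
    """
    kinds = defaultdict(set)
    for line in lines:
        match = HEADER_RE.match(line.strip())
        if match:
            kind, isogroup = match.groups()
            kinds[isogroup].add(kind)

    labels = {}
    for isogroup, seen in kinds.items():
        if seen == {"isotig", "contig"}:
            labels[isogroup] = "mixed"
        elif seen == {"isotig"}:
            labels[isogroup] = "isotigs only"
        else:
            labels[isogroup] = "contigs only"
    return labels


# A few headers copied from the first post, used as a small in-memory example.
SAMPLE_HEADERS = [
    ">isotig00001  gene=isogroup00001  length=472  numContigs=2",
    ">contig00048  gene=isogroup00001  length=536",
    ">isotig01656  gene=isogroup00099  length=491  numContigs=2",
    ">contig02143  gene=isogroup00099  length=518",
]

for isogroup, label in sorted(classify_isogroups(SAMPLE_HEADERS).items()):
    print(isogroup, label)
```

On a real assembly the same function can be given the open file directly, e.g. `with open("454Isotigs.fna") as fh: labels = classify_isogroups(fh)`. It only describes what ended up in the file; it cannot tell which of the thresholds above caused a contig to be reported.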

    • #3
      Thank you for your reply, kmcarr!

      So if I want to say how many possible genes, i.e. isogroups, are in the assembly, the best way would be to take the number of distinct isogroups I can find in 454Isotigs.fna, since that file also contains the sequences?

      • #4
        My answer to that question would be a qualified yes. As a first approximation, the number of genes identified can be estimated by the number of isogroups reported, but there are always exceptions to this. You will probably find that the larger isogroups (which also happen to be those at the beginning of the isogroup list) contain contigs and isotigs which clearly derive from multiple, distinct genes. This occurs in almost every cDNA assembly project I run (though it may have something to do with the oddball samples we tend to sequence). Given even a small region of similarity, such as a conserved domain, the isogroup clustering can link together reads from several different transcripts.

        On the other end of the spectrum, for very low coverage transcripts, you may find these split into multiple isogroups. Without enough reads to sufficiently cover the length of the transcript they may not all be linked together into a single isogroup. You may end up with 2 or more isogroups (and hence [con/iso]tigs) which represent different regions of the same transcript.
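For the first-approximation gene count, and to pick out the oversized isogroups worth inspecting by hand, something along these lines may help. It is a sketch under the assumption that each header in 454Isotigs.fna carries a `gene=isogroupNNNNN` field as in the output quoted earlier in the thread; `isogroup_sizes`, `largest_isogroups` and the `min_members` cutoff are hypothetical names chosen for this example, and any cutoff value is arbitrary.

```python
import re
from collections import Counter

# Assumes headers of the form ">isotig00001  gene=isogroup00001  length=..."
GENE_RE = re.compile(r"^>\S+\s+gene=(isogroup\d+)")


def isogroup_sizes(lines):
    """Count how many sequences (isotigs plus contigs) each isogroup has."""
    sizes = Counter()
    for line in lines:
        match = GENE_RE.match(line.strip())
        if match:
            sizes[match.group(1)] += 1
    return sizes


def largest_isogroups(sizes, min_members):
    """Return (isogroup, member count) pairs with at least `min_members`
    sequences, largest first. `min_members` is an arbitrary cutoff."""
    return [(name, n) for name, n in sizes.most_common() if n >= min_members]


# Headers copied from the first post (isogroup00001 is truncated there,
# so only the four lines that were actually shown are used).
SAMPLE_HEADERS = [
    ">isotig00001  gene=isogroup00001  length=472  numContigs=2",
    ">isotig00002  gene=isogroup00001  length=542  numContigs=2",
    ">contig00048  gene=isogroup00001  length=536",
    ">contig00049  gene=isogroup00001  length=629",
    ">isotig01656  gene=isogroup00099  length=491  numContigs=2",
    ">isotig01657  gene=isogroup00099  length=415  numContigs=3",
    ">isotig01658  gene=isogroup00099  length=383  numContigs=2",
    ">isotig01659  gene=isogroup00099  length=326  numContigs=2",
    ">isotig01660  gene=isogroup00099  length=176  numContigs=2",
    ">contig02143  gene=isogroup00099  length=518",
]

sizes = isogroup_sizes(SAMPLE_HEADERS)
print("distinct isogroups:", len(sizes))
print("isogroups with 5 or more sequences:", largest_isogroups(sizes, 5))
```

The number of distinct isogroups is only the starting estimate; the large isogroups this lists are the ones most likely to mix several genes, and the low-coverage transcripts split across isogroups will not show up in this count at all.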

        These problems are not at all unique to the gsAssembler; similar problems occurred when I used the TGICL pipeline for assembling 454 cDNA reads. The simple truth is that de novo assembly of transcriptomes is very, very hard, in some ways harder than genomes. There is no perfect assembler or optimal set of parameters which will take your reads and spit out a perfect set of transcript sequences. (And no matter how many times I tell the researchers I work with they still don't seem to believe me!)
