  • Newbler - Isogroup with isotigs AND contigs

    Dear all,

    I thought an isogroup would consist of either isotigs OR contigs, but I found the following in one of my assemblies:
    Code:
    grep isogroup00001 454Isotigs.fna
    >isotig00001  gene=isogroup00001  length=472  numContigs=2
    >isotig00002  gene=isogroup00001  length=542  numContigs=2
    >contig00048  gene=isogroup00001  length=536  
    >contig00049  gene=isogroup00001  length=629  
    ... (200 more >contig with isogroup00001)
    
    grep isogroup00099 454Isotigs.fna
    >isotig01656  gene=isogroup00099  length=491  numContigs=2
    >isotig01657  gene=isogroup00099  length=415  numContigs=3
    >isotig01658  gene=isogroup00099  length=383  numContigs=2
    >isotig01659  gene=isogroup00099  length=326  numContigs=2
    >isotig01660  gene=isogroup00099  length=176  numContigs=2
    >contig02143  gene=isogroup00099  length=518
    Any ideas why this happens or what it means?

    And btw: the number of isogroups in the 454NewblerMetrics.txt file is larger than the number of distinct isogroups I can find in 454Isotigs.fna, and both numbers differ from the number of distinct isogroups I can find in 454IsotigsLayout.txt. Any explanations?
    Last edited by dschika; 01-11-2011, 07:02 AM.

  • #2
    dschika,

    First off, don't worry; you didn't do anything wrong. I've seen this behavior myself many times with the Newbler cDNA assembler. I can't give you a complete explanation, but there are some possibilities.

    - If a contig is > 500 bp but no path connects it to other contigs to form an isotig, it will end up in the output as a contig.

    - If the number of contigs in an isogroup exceeds the isogroup threshold (set by the -ig parameter), the isogroup will not be traversed to identify isotigs. Contigs > 500 bp will be reported in 454Isotigs.fna, but nothing about this isogroup will be reported in 454IsotigsLayout.txt.

    - If the number of isotigs in an isogroup exceeds the isotig threshold (set by the -it parameter), isogroup traversal will stop and its contigs (> 500 bp) will appear in 454Isotigs.fna.

    - If the number of contigs in an isotig exceeds the isotig contig count threshold (set by the -icc parameter), further traversal of that isotig will stop. If the contigs in that isotig are not part of any other isotig, they will be reported as contigs in the output files.
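    To see which isogroups are affected by the cases above, one option is to tally the record types per isogroup straight from the FASTA headers. This is only a sketch: it assumes every header line carries a gene=isogroupNNNNN field, as in the grep output at the top of the thread, and the function name classify_isogroups is made up for illustration (it is not part of Newbler).

```shell
# Hypothetical helper: classify each isogroup in a 454Isotigs.fna-style file
# by the kinds of records (isotig / contig) that carry its gene= tag.
# Output columns: isogroup, class, number of isotigs, number of contigs.
classify_isogroups() {
    awk '
        /^>/ {
            gene = ""
            # Find the gene=isogroupNNNNN field, wherever it sits in the header.
            for (i = 2; i <= NF; i++)
                if ($i ~ /^gene=/) gene = substr($i, 6)
            if (gene == "") next
            if ($1 ~ /^>isotig/)      { iso[gene]++; seen[gene] = 1 }
            else if ($1 ~ /^>contig/) { con[gene]++; seen[gene] = 1 }
        }
        END {
            for (g in seen) {
                if (!(g in con))      kind = "isotigs_only"
                else if (!(g in iso)) kind = "contigs_only"
                else                  kind = "mixed"
                print g, kind, iso[g] + 0, con[g] + 0
            }
        }
    ' "$1" | sort
}
```

    Running classify_isogroups 454Isotigs.fna and looking at the "mixed" and "contigs_only" rows should point to the isogroups where traversal stopped or was skipped, if the explanations above hold.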

    As you can see, there are many complex ways in which contigs may be reported in the final output, and in which some isogroups may not be reported at all. I give Roche/454 credit for attempting to create a true transcriptome assembler, but the output from it can be incredibly difficult to deal with.
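    If you want to check which isogroups are missing from one output file relative to another, a rough comparison of the ID sets can help. The sketch below assumes that isogroup IDs appear literally as isogroupNNNNN in both files (this is true of the FASTA headers shown in this thread; for 454IsotigsLayout.txt it is an assumption). The function name isogroups_only_in is hypothetical.

```shell
# Hypothetical helper: print isogroup IDs that occur in the first file
# but not in the second. Uses temp files so it also works in plain sh.
isogroups_only_in() {
    ids_a=$(mktemp)
    ids_b=$(mktemp)
    grep -o 'isogroup[0-9][0-9]*' "$1" | sort -u > "$ids_a"
    grep -o 'isogroup[0-9][0-9]*' "$2" | sort -u > "$ids_b"
    # comm -23 keeps lines unique to the first sorted input.
    comm -23 "$ids_a" "$ids_b"
    rm -f "$ids_a" "$ids_b"
}
```

    For example, isogroups_only_in 454Isotigs.fna 454IsotigsLayout.txt would list isogroups that have sequences but no layout entry, which is what the -ig case above would predict.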



    • #3
      Thank you for your reply, kmcarr!

      So if I want to state how many possible genes (i.e., isogroups) are in the assembly, would the best way be to take the number of distinct isogroups I can find in 454Isotigs.fna, since that file also contains the sequences?



      • #4
        My answer to that question would be a qualified yes. As a first approximation, the number of genes identified can be estimated by the number of isogroups reported, but there are always exceptions to this. You will probably find that the larger isogroups (which also happen to be those at the beginning of the isogroup list) contain contigs and isotigs which clearly derive from multiple, distinct genes. This occurs in almost every cDNA assembly project I run (though it may have something to do with the oddball samples we tend to sequence). Given even a small region of similarity, such as a conserved domain, the isogroup clustering can link together reads from several different transcripts.

        At the other end of the spectrum, very low coverage transcripts may be split into multiple isogroups. Without enough reads to cover the full length of the transcript, they may not all be linked together into a single isogroup. You may end up with 2 or more isogroups (and hence [con/iso]tigs) which represent different regions of the same transcript.

        These problems are not at all unique to the gsAssembler; similar problems occurred when I used the TGICL pipeline for assembling 454 cDNA reads. The simple truth is that de novo assembly of transcriptomes is very, very hard, in some ways harder than genomes. There is no perfect assembler or optimal set of parameters which will take your reads and spit out a perfect set of transcript sequences. (And no matter how many times I tell the researchers I work with they still don't seem to believe me!)
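        For the first-approximation count discussed above, something along these lines should work. It is a sketch that assumes the gene=isogroupNNNNN header field shown earlier in the thread; the function names count_isogroups and largest_isogroups are made up for illustration, and grep -o is a GNU/BSD extension rather than strict POSIX.

```shell
# Hypothetical helper: count the distinct isogroup IDs in FASTA header lines.
count_isogroups() {
    grep '^>' "$1" | grep -o 'gene=isogroup[0-9][0-9]*' | sort -u | wc -l | tr -d ' '
}

# Hypothetical helper: list isogroups by number of member sequences,
# largest first. The optional second argument limits the rows (default 10).
# Output columns: isogroup, number of sequences.
largest_isogroups() {
    grep '^>' "$1" | grep -o 'gene=isogroup[0-9][0-9]*' | sort | uniq -c |
        sort -k1,1nr -k2,2 | head -n "${2:-10}" |
        awk '{ sub(/^gene=/, "", $2); print $2, $1 }'
}
```

        The second function is there because the largest isogroups are the ones most likely to lump several genes together, so they are worth inspecting by hand before trusting the total.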
