  • Newbler - Isogroup with isotigs AND contigs

    Dear all,

    I thought an isogroup would consist of either isotigs OR contigs. However, I found the following in one of my assemblies:
    Code:
    grep isogroup00001 454Isotigs.fna
    >isotig00001  gene=isogroup00001  length=472  numContigs=2
    >isotig00002  gene=isogroup00001  length=542  numContigs=2
    >contig00048  gene=isogroup00001  length=536  
    >contig00049  gene=isogroup00001  length=629  
    ... (200 more >contig with isogroup00001)
    
    grep isogroup00099 454Isotigs.fna
    >isotig01656  gene=isogroup00099  length=491  numContigs=2
    >isotig01657  gene=isogroup00099  length=415  numContigs=3
    >isotig01658  gene=isogroup00099  length=383  numContigs=2
    >isotig01659  gene=isogroup00099  length=326  numContigs=2
    >isotig01660  gene=isogroup00099  length=176  numContigs=2
    >contig02143  gene=isogroup00099  length=518
    Any ideas why this happens or what it means??

    And by the way: the number of isogroups in the 454NewblerMetric file is larger than the number of distinct isogroups I can find in 454Isotigs.fna, and both numbers differ from the number of distinct isogroups I can find in 454IsotigsLayout. Any explanations?
    Last edited by dschika; 01-11-2011, 07:02 AM.

  • #2
    dschika,

    First off, don't worry: you didn't do anything wrong. I've seen this behavior myself many times with the Newbler cDNA assembler. I can't give you a complete explanation, but there are some possibilities.

    - If a contig is > 500bp but no path connects it to other contigs to form an isotig, it will end up in the output as a contig.

    - If the number of contigs in an isogroup exceeds the isogroup threshold (set by the -ig parameter), the isogroup will not be traversed to identify isotigs. Contigs > 500bp will be reported in 454Isotigs.fna, but nothing about this isogroup will be reported in 454IsotigsLayout.txt.

    - If the number of isotigs in an isogroup exceeds the isotig threshold (set by the -it parameter), isogroup traversal will stop and its contigs (> 500bp) will appear in 454Isotigs.fna.

    - If the number of contigs in an isotig exceeds the isotig contig count threshold (set by the -icc parameter), further traversal of that isotig will stop. If the contigs in that isotig are not part of any other isotig, they will be reported as contigs in the output files.

    As you can see, there are many complex ways in which contigs may end up in the final output, and in which some isogroups may be left out of it. I give Roche/454 credit for attempting to create a true transcriptome assembler, but its output can be incredibly difficult to deal with.
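To see how often contigs and isotigs share an isogroup in a given assembly, one option is to tally the FASTA headers per isogroup. The following is a minimal sketch, assuming the header layout shown in the grep output earlier in this thread (`>isotigNNNNN  gene=isogroupNNNNN  ...`); the function names `tally_isogroups` and `mixed_isogroups` are illustrative and not part of Newbler.

```python
import re
from collections import Counter, defaultdict

# Header layout assumed from the grep output in this thread, e.g.
# >isotig00001  gene=isogroup00001  length=472  numContigs=2
HEADER_RE = re.compile(r"^>(isotig|contig)\d+\s+gene=(isogroup\d+)")


def tally_isogroups(lines):
    """Return {isogroup: Counter({'isotig': n, 'contig': m})} from FASTA lines."""
    tally = defaultdict(Counter)
    for line in lines:
        match = HEADER_RE.match(line)
        if match:  # sequence lines and non-matching headers are skipped
            kind, isogroup = match.groups()
            tally[isogroup][kind] += 1
    return dict(tally)


def mixed_isogroups(tally):
    """Sorted names of isogroups holding at least one isotig AND one contig."""
    return sorted(g for g, c in tally.items() if c["isotig"] and c["contig"])


# Headers copied from the original post (the elided contig lines are omitted).
sample = [
    ">isotig00001  gene=isogroup00001  length=472  numContigs=2",
    ">isotig00002  gene=isogroup00001  length=542  numContigs=2",
    ">contig00048  gene=isogroup00001  length=536  ",
    ">contig00049  gene=isogroup00001  length=629  ",
    ">isotig01656  gene=isogroup00099  length=491  numContigs=2",
    ">isotig01657  gene=isogroup00099  length=415  numContigs=3",
    ">isotig01658  gene=isogroup00099  length=383  numContigs=2",
    ">isotig01659  gene=isogroup00099  length=326  numContigs=2",
    ">isotig01660  gene=isogroup00099  length=176  numContigs=2",
    ">contig02143  gene=isogroup00099  length=518",
]

tally = tally_isogroups(sample)
print(mixed_isogroups(tally))
```

On a real assembly, passing an open file handle for 454Isotigs.fna to `tally_isogroups` instead of `sample` should work the same way, since the function only looks at header lines. This only reports which isogroups are mixed; it does not tell you which of the thresholds above was responsible.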

    • #3
      Thank you for your reply, kmcarr!

      So if I want to state how many possible genes (i.e. isogroups) are in the assembly, the best way would be to count the distinct isogroups I can find in 454Isotigs.fna, since that file also contains the sequences?

      • #4
        My answer to that question would be a qualified yes. As a first approximation, the number of genes identified can be estimated by the number of isogroups reported, but there are always exceptions to this. You will probably find that the larger isogroups (which also happen to be those at the beginning of the isogroup list) contain contigs and isotigs which clearly derive from multiple, distinct genes. This occurs in almost every cDNA assembly project I run (though it may have something to do with the oddball samples we tend to sequence). Given even a small region of similarity, such as a conserved domain, the isogroup clustering can link together reads from several different transcripts.
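As a rough illustration of that first approximation, the distinct `gene=isogroup...` tags in 454Isotigs.fna can be counted and the largest isogroups listed for manual inspection. This is a sketch under the assumption that every header carries a `gene=` tag in the layout shown earlier in the thread; `isogroup_sizes`, `largest_isogroups` and the `top_n` cutoff are hypothetical names chosen for the example.

```python
import re
from collections import Counter

# Matches any record header that carries a gene=isogroupNNNNN tag.
GENE_RE = re.compile(r"^>\S+\s+gene=(isogroup\d+)")


def isogroup_sizes(lines):
    """Count how many isotig/contig records carry each isogroup tag."""
    sizes = Counter()
    for line in lines:
        match = GENE_RE.match(line)
        if match:
            sizes[match.group(1)] += 1
    return sizes


def largest_isogroups(sizes, top_n=10):
    """Return the top_n (isogroup, record count) pairs, largest first."""
    return sizes.most_common(top_n)


# Headers copied from the original post; the elided
# isogroup00001 contig lines are not reproduced here.
sample = [
    ">isotig00001  gene=isogroup00001  length=472  numContigs=2",
    ">isotig00002  gene=isogroup00001  length=542  numContigs=2",
    ">contig00048  gene=isogroup00001  length=536  ",
    ">contig00049  gene=isogroup00001  length=629  ",
    ">isotig01656  gene=isogroup00099  length=491  numContigs=2",
    ">isotig01657  gene=isogroup00099  length=415  numContigs=3",
    ">isotig01658  gene=isogroup00099  length=383  numContigs=2",
    ">isotig01659  gene=isogroup00099  length=326  numContigs=2",
    ">isotig01660  gene=isogroup00099  length=176  numContigs=2",
    ">contig02143  gene=isogroup00099  length=518",
]

sizes = isogroup_sizes(sample)
print("distinct isogroups:", len(sizes))
print("largest:", largest_isogroups(sizes, top_n=1))
```

The count of distinct isogroups is only the first-approximation gene estimate; the ranked list is there to point at the isogroups most worth checking by hand for merged genes.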

        On the other end of the spectrum, very low coverage transcripts may be split into multiple isogroups. Without enough reads to sufficiently cover the length of the transcript, the reads may not all be linked together into a single isogroup. You may end up with 2 or more isogroups (and hence [con/iso]tigs) which represent different regions of the same transcript.

        These problems are not at all unique to the gsAssembler; similar problems occurred when I used the TGICL pipeline for assembling 454 cDNA reads. The simple truth is that de novo assembly of transcriptomes is very, very hard, in some ways harder than assembly of genomes. There is no perfect assembler or optimal set of parameters which will take your reads and spit out a perfect set of transcript sequences. (And no matter how many times I tell the researchers I work with, they still don't seem to believe me!)
