  • kmcarr
    replied
    Originally posted by dschika
    Thank you for your reply, kmcarr!

    So if I want to report how many possible genes (i.e. isogroups) are in the assembly, would the best way be to count the distinct isogroups I can find in 454Isotigs.fna, since that file also contains the sequences?
    My answer to that question would be a qualified yes. As a first approximation, the number of genes identified can be estimated by the number of isogroups reported, but there are always exceptions. You will probably find that the larger isogroups (which also happen to be those at the beginning of the isogroup list) contain contigs and isotigs which clearly derive from multiple, distinct genes. This occurs in almost every cDNA assembly project I run (though it may have something to do with the oddball samples we tend to sequence). Given even a small region of similarity, such as a conserved domain, the isogroup clustering can link together reads from several different transcripts.

    On the other end of the spectrum, for very low coverage transcripts, you may find these split into multiple isogroups. Without enough reads to sufficiently cover the length of the transcript they may not all be linked together into a single isogroup. You may end up with 2 or more isogroups (and hence [con/iso]tigs) which represent different regions of the same transcript.

    These problems are not at all unique to the gsAssembler; similar problems occurred when I used the TGICL pipeline for assembling 454 cDNA reads. The simple truth is that de novo assembly of transcriptomes is very, very hard, in some ways harder than genomes. There is no perfect assembler or optimal set of parameters which will take your reads and spit out a perfect set of transcript sequences. (And no matter how many times I tell the researchers I work with they still don't seem to believe me!)
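
The first-approximation count described above can be sketched with a small shell function. This is only an illustration, not part of Newbler: the function name `count_isogroups` is made up here, and it assumes every header in 454Isotigs.fna carries a `gene=isogroupNNNNN` field, as in the grep output quoted in this thread.

```shell
# Hypothetical helper: count distinct isogroup IDs among the FASTA headers.
# Assumption: each header line has a whitespace-separated "gene=<id>" field.
count_isogroups() {
    awk '/^>/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^gene=/) { split($i, kv, "="); seen[kv[2]] = 1 }
    }
    END { n = 0; for (g in seen) n++; print n }' "$1"
}

# Usage: count_isogroups 454Isotigs.fna
```

Given the caveats above (large isogroups merging several genes, low-coverage transcripts split across isogroups), the number printed is best read as a rough estimate rather than a gene count.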

  • dschika
    replied
    Thank you for your reply, kmcarr!

    So if I want to report how many possible genes (i.e. isogroups) are in the assembly, would the best way be to count the distinct isogroups I can find in 454Isotigs.fna, since that file also contains the sequences?

  • kmcarr
    replied
    Originally posted by dschika
    Dear all,

    I thought an isogroup would consist of either isotigs OR contigs. Now I found the following in one of my assemblies:
    Code:
    grep isogroup00001 454Isotigs.fna
    >isotig00001  gene=isogroup00001  length=472  numContigs=2
    >isotig00002  gene=isogroup00001  length=542  numContigs=2
    >contig00048  gene=isogroup00001  length=536  
    >contig00049  gene=isogroup00001  length=629  
    ... (200 more >contig with isogroup00001)
    
    grep isogroup00099 454Isotigs.fna
    >isotig01656  gene=isogroup00099  length=491  numContigs=2
    >isotig01657  gene=isogroup00099  length=415  numContigs=3
    >isotig01658  gene=isogroup00099  length=383  numContigs=2
    >isotig01659  gene=isogroup00099  length=326  numContigs=2
    >isotig01660  gene=isogroup00099  length=176  numContigs=2
    >contig02143  gene=isogroup00099  length=518
    Any ideas why this happens or what it means??

    And btw: the number of isogroups in the 454NewblerMetrics.txt file is larger than the number of distinct isogroups I can find in 454Isotigs.fna, and both numbers differ from the number of distinct isogroups I can find in 454IsotigsLayout.txt. Any explanations?
    dschika,

    First off, don't worry; you didn't do anything wrong. I've seen this behavior myself many times with the Newbler cDNA assembler. I can't give you a complete explanation, but there are some possibilities.

    - If a contig is > 500bp but no path connects it to other contigs to form an isotig, it will end up in the output as a contig.

    - If the number of contigs in an isogroup exceeds the isogroup threshold (set by the -ig parameter) the isogroup will not be traversed to identify isotigs. Contigs > 500bp will be reported in 454Isotigs.fna but nothing about this isogroup will be reported in 454IsotigsLayout.txt.

    - If the number of isotigs in an isogroup exceeds the isotig threshold (set by the -it parameter) isogroup traversal will stop and its contigs (>500bp) will appear in 454Isotigs.fna.

    - If the number of contigs in an isotig exceeds the isotig contig count threshold (set by the -icc parameter) further traversal of that isotig will stop. If the contigs in that isotig are not part of any other isotig they will be reported as contigs in the output files.

    As you can see, there are many complex ways in which contigs may be reported in the final output, and in which some isogroups may not be reported at all. I give Roche/454 credit for attempting to create a true transcriptome assembler, but the output from it can be incredibly difficult to deal with.
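
To get a feel for how often contigs end up alongside isotigs, the isogroups can be tallied by what they contain. The sketch below is an illustration only: the function name `classify_isogroups` is invented here, and it assumes sequence names begin with `isotig` or `contig` and that headers carry a `gene=isogroupNNNNN` field, as in the grep output quoted in this thread.

```shell
# Hypothetical helper: tally isogroups that hold only isotigs, only contigs, or both.
# Assumptions: names start with ">isotig" or ">contig"; headers have a "gene=<id>" field.
classify_isogroups() {
    awk '/^>/ {
        kind = ($1 ~ /^>isotig/) ? "I" : "C"
        for (i = 2; i <= NF; i++)
            if ($i ~ /^gene=/) {
                split($i, kv, "=")
                has[kv[2], kind] = 1   # remember which kinds each isogroup contains
                groups[kv[2]] = 1
            }
    }
    END {
        for (g in groups) {
            if (((g, "I") in has) && ((g, "C") in has)) mixed++
            else if ((g, "I") in has) isotig_only++
            else contig_only++
        }
        printf "isotigs_only=%d contigs_only=%d mixed=%d\n", isotig_only, contig_only, mixed
    }' "$1"
}

# Usage: classify_isogroups 454Isotigs.fna
```

The mixed and contigs-only isogroups are the ones worth checking against the thresholds listed above.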

  • dschika
    started a topic Newbler - Isogroup with isotigs AND contigs

    Dear all,

    I thought an isogroup would consist of either isotigs OR contigs. Now I found the following in one of my assemblies:
    Code:
    grep isogroup00001 454Isotigs.fna
    >isotig00001  gene=isogroup00001  length=472  numContigs=2
    >isotig00002  gene=isogroup00001  length=542  numContigs=2
    >contig00048  gene=isogroup00001  length=536  
    >contig00049  gene=isogroup00001  length=629  
    ... (200 more >contig with isogroup00001)
    
    grep isogroup00099 454Isotigs.fna
    >isotig01656  gene=isogroup00099  length=491  numContigs=2
    >isotig01657  gene=isogroup00099  length=415  numContigs=3
    >isotig01658  gene=isogroup00099  length=383  numContigs=2
    >isotig01659  gene=isogroup00099  length=326  numContigs=2
    >isotig01660  gene=isogroup00099  length=176  numContigs=2
    >contig02143  gene=isogroup00099  length=518
    Any ideas why this happens or what it means??

    And btw: the number of isogroups in the 454NewblerMetrics.txt file is larger than the number of distinct isogroups I can find in 454Isotigs.fna, and both numbers differ from the number of distinct isogroups I can find in 454IsotigsLayout.txt. Any explanations?
    Last edited by dschika; 01-11-2011, 07:02 AM.
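
One way to see where the isogroup counts diverge is to list the IDs present in one file but absent from the other. The sketch below is a rough illustration: the function names are invented here, and the only assumption about either file is that isogroup IDs appear as `isogroup` followed by digits. Nothing else about the layout of 454IsotigsLayout.txt is relied on.

```shell
# Hypothetical helpers for comparing isogroup IDs between two output files.
# Assumption: an ID is the literal "isogroup" followed by one or more digits.
isogroup_ids() {
    grep -o 'isogroup[0-9][0-9]*' "$1" | sort -u
}

# Print IDs found in the first file but not in the second.
isogroups_missing_from() {
    ids_a=$(mktemp)
    ids_b=$(mktemp)
    isogroup_ids "$1" > "$ids_a"
    isogroup_ids "$2" > "$ids_b"
    comm -23 "$ids_a" "$ids_b"   # lines unique to the first sorted list
    rm -f "$ids_a" "$ids_b"
}

# Usage: isogroups_missing_from 454Isotigs.fna 454IsotigsLayout.txt
```

Swapping the two arguments shows the reverse direction, that is, IDs mentioned in the layout file that have no sequence in 454Isotigs.fna.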
