
  • cDNA analysis 454 assembler

    Hello,
    Could anybody explain, from experience, the output files from a 454 cDNA assembly (isotigs, contigs, graph, etc.)? For example, which file should be used for further analysis, 454AllIsotigs or AllContigs, and what exactly is the difference? How can the graph be visualized? It is impossible to understand anything from the graph.txt output file. Thanks a lot!

  • #2
    Isotigs are transcripts, built out of the contigs. Different isogroups within the same isogroup represent alternative splice variants. This makes the isogroup the equivalent of a gene.

    Take this with a grain of salt, though: it is based on mining the contig graph for subgraphs (isogroups) and traversing all possible paths through each subgraph (isotigs). We find, for example, small variations (SNPs, indels) generating almost identical isotigs. So perhaps clustering the isotigs using CD-HIT would help.

    Visualizing the graph is a wish we all have.



    • #3
      more about cDNA

      Thanks a lot. I have read your blog, which explains things very well. Still, some questions are left:
      1. In the file 454AllContigs, there are some "contigs" with one or a few nucleotides.
      What are those "contigs"?
      2. Some isogroups include only contigs and no isotigs (the first 2 groups in our case), and the short "contigs" from the previous question are also assigned to these isogroups. So what is such an isogroup? Is it all the same gene? Different genes? Why are there no isotigs?
      3. In the file " 454 graph" there is the scaffold section, however, we had non-paired end sequencing, so what is the basis for this scaffold?
      4. Which of the files are recommended for further analysis, such as blast? The 454Isotigs.fna ? The 454AllContigs.fna (and then how all the very short sequences should be treated?)



      • #4
        1. In the file 454AllContigs, there are some "contigs" with one or a few nucleotides.
        What are those "contigs"?
        These very small contigs seem to be produced when Newbler has difficulty resolving the edges of real contigs. We often see these in very highly abundant transcripts, presumably because the number of sequencing errors is high enough to make Newbler think these are real variations. So if the edge of an exon looks like:


        ...CATGCATGAAA
        ...CATGCATGAAA
        ...CATGCATGAAA
        ...CATGCATGAAAA
        ...CATGCATGAAAA


        Newbler might consider that fourth 'A' in the last two reads to be a separate exon/contig.
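As a rough illustration of why these reads probably come from the same sequence, the sketch below collapses every run of identical bases to a single base. This is not something Newbler does; `compress_homopolymers` is a hypothetical helper, and the reads are the five from the example above. If the reads differ only in the length of a homopolymer run, they become identical after compression.

```python
from itertools import groupby


def compress_homopolymers(seq: str) -> str:
    """Collapse each run of identical bases into one base (hypothetical helper)."""
    return "".join(base for base, _run in groupby(seq))


# The five read ends from the example above.
reads = [
    "CATGCATGAAA",
    "CATGCATGAAA",
    "CATGCATGAAA",
    "CATGCATGAAAA",
    "CATGCATGAAAA",
]

# A set keeps only the distinct compressed sequences.
compressed = {compress_homopolymers(read) for read in reads}
print(compressed)  # {'CATGCATGA'}
```

A single distinct sequence after compression suggests the extra 'A' is a run-length difference rather than a separate exon.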


        2. Some isogroups include only contigs and no isotigs (the first 2 groups in our case), and the short "contigs" from the previous question are also assigned to these isogroups. So what is such an isogroup? Is it all the same gene? Different genes? Why are there no isotigs?
        The isotigs are computed by traversing the contig graph, and Newbler has limits on how deep it will recurse when doing this. So if you have a bunch of these false contigs, it will eventually give up on trying to produce isotigs. You can try increasing the default limits, but in my experience even the maximum allowed values are not always sufficient.
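To see why a few false contigs can exhaust such a limit, here is a toy sketch, assuming an acyclic contig graph stored as a dict of successor lists. It is not Newbler's actual algorithm; `enumerate_paths`, `bubble_chain`, and the `max_paths` cutoff are hypothetical names chosen for illustration. Each two-way branch doubles the number of paths, so a short chain of spurious branches already produces more paths than a modest limit allows.

```python
def enumerate_paths(graph, start, max_paths):
    """List every path from start to a node without successors.

    Returns None as soon as more than max_paths paths have been found,
    mimicking an assembler that gives up on an isogroup.
    Assumes the graph has no cycles.
    """
    paths = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        successors = graph.get(node, [])
        if not successors:
            paths.append(path)
            if len(paths) > max_paths:
                return None
            continue
        for nxt in successors:
            stack.append((nxt, path + [nxt]))
    return paths


def bubble_chain(n):
    """Build a toy graph of n consecutive two-way branches (hypothetical)."""
    graph = {}
    for i in range(n):
        graph[f"n{i}"] = [f"a{i}", f"b{i}"]
        graph[f"a{i}"] = [f"n{i + 1}"]
        graph[f"b{i}"] = [f"n{i + 1}"]
    graph[f"n{n}"] = []
    return graph


few = enumerate_paths(bubble_chain(3), "n0", max_paths=100)
many = enumerate_paths(bubble_chain(10), "n0", max_paths=100)
print(len(few))  # 8 paths: 2 ** 3
print(many)      # None: 2 ** 10 paths exceed the limit
```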

        4. Which of the files are recommended for further analysis, such as blast? The 454Isotigs.fna ? The 454AllContigs.fna (and then how all the very short sequences should be treated?)
        Unfortunately, the only way to make sure your further analyses are using all your data is to take the 454Isotigs.fna plus the larger contigs from those isogroups where proper isotig formation failed.
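A minimal sketch of that selection step, under stated assumptions: the set of contig IDs belonging to isogroups without isotigs (`failed_contig_ids`) is a hypothetical input that would have to be derived separately from the Newbler output, and the `min_length` cutoff is an arbitrary value for illustration. The FASTA records are given as in-memory lines so the example runs on its own.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_contigs(records, failed_contig_ids, min_length=100):
    """Keep contigs from failed isogroups that reach min_length bases.

    failed_contig_ids and min_length are assumptions for this sketch.
    """
    for header, seq in records:
        contig_id = header.split()[0]
        if contig_id in failed_contig_ids and len(seq) >= min_length:
            yield header, seq


# Made-up records standing in for 454AllContigs.fna.
contig_lines = [
    ">contig00001 length=5 numreads=3",
    "ACGTA",
    ">contig00002 length=12 numreads=9",
    "ACGTACGTACGT",
    ">contig00003 length=14 numreads=7",
    "ACGTACGTACGTAA",
]

selected = list(
    select_contigs(read_fasta(contig_lines), {"contig00001", "contig00002"}, min_length=10)
)
for header, seq in selected:
    print(f">{header}\n{seq}")
```

The selected records could then be appended to the isotig sequences to form the input for downstream analyses.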



        • #5
          Originally posted by litali
          3. In the file " 454 graph" there is the scaffold section, however, we had non-paired end sequencing, so what is the basis for this scaffold?
          Scaffolding is not really scaffolding here, just a description of the relation between the contigs and the isotigs. The same description is given in different ways in the 454IsotigsLayout.txt and 454Isotigs.txt files



          • #6
            Originally posted by flxlex
            Different isogroups within the same isogroup represent alternative splice variants.
            I guess you meant: Different "isotigs" within the same isogroup represent (...)



            • #7
              Originally posted by CHRYSES
              I guess you meant: Different "isotigs" within the same isogroup represent (...)
              Yep. Thanks...



              • #8
                Hi all!
                I did a Newbler transcriptome assembly a year ago, and it was very difficult to find information about the output (flxlex, thank you very much for your blog!). I tried to find out how many reads were assembled, and I got different results depending on which file I looked at. For instance, according to 454AllContigs.fna, 12310 reads were assembled in a sample identified by a MID tag (multiplexed); I got this number by summing the numreads= values in the last column. The 454NewblerMetrics.txt file, however, reports:
                numberAssembled = 6603;
                numberPartial = 5359;
                numberSingleton = 8674;
                numberRepeat = 1101;
                numberOutlier = 723;
                Total reads = 22460
                What could be the reason for this discrepancy?
                I did the assembly with the release 1.1.03.24 of Newbler.
                Regards,



                • #9
                  Originally posted by jordi
                  Hi all!
                  What could be the reason for this discrepancy?
                  I suspect that some of the reads are being split among the contigs. Such reads would be counted twice.
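The arithmetic behind the comparison can be sketched as follows. The function sums the `numreads=` values from FASTA header lines, as described in the question; the header lines in the demo are made up. The assumption that only the assembled and partially assembled reads contribute to contigs is mine, not something the thread confirms.

```python
import re

NUMREADS = re.compile(r"numreads=(\d+)")


def total_numreads(fasta_lines):
    """Sum the numreads= values found on FASTA header lines."""
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            match = NUMREADS.search(line)
            if match:
                total += int(match.group(1))
    return total


# Made-up headers, only to show the parsing.
demo = [">contig00001  length=812  numreads=40", "ACGT", ">contig00002  length=95  numreads=3", "ACGT"]
print(total_numreads(demo))  # 43

# Figures quoted from 454NewblerMetrics.txt in the question.
metrics = {
    "numberAssembled": 6603,
    "numberPartial": 5359,
    "numberSingleton": 8674,
    "numberRepeat": 1101,
    "numberOutlier": 723,
}
assert sum(metrics.values()) == 22460  # matches "Total reads"

# Assumption: only assembled and partial reads end up in contigs.
in_contigs = metrics["numberAssembled"] + metrics["numberPartial"]
excess = 12310 - in_contigs
print(in_contigs, excess)  # 11962 348
```

Under that assumption, the header sum exceeds the metrics count by 348, which would be consistent with some reads being counted in more than one contig.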



                  • #10
                    mmmmm, Ponder

                    Hello.
                    I also want to make sure every possible sequence is used in my further data analyses;

                    Originally posted by flxlex
                    "Isotigs are transcripts, build out of the contigs."
                    Originally posted by cram
                    "Unfortunately, the only way to make sure your further analyses are using all your data is to take the 454Isotigs.fna plus the larger contigs from those isogroups where proper isotig formation failed."
                    Originally posted by flxlex
                    CD-hit would help
                    Thanks flxlex, that program is a real help.

                    To clarify; would combining the Isotig.fna and the contigs.fna files into a single file and then running CD-hit give you a comprehensive, non-redundant set of transcripts from your 454 transcriptome for further analyses?

                    Are there any single reads anywhere else that are neither contigs nor isotigs but are still useful?

                    Thank you for any advice,

                    John.



                    • #11
                      Originally posted by poisson200
                      To clarify; would combining the Isotig.fna and the contigs.fna files into a single file and then running CD-hit give you a comprehensive, non-redundant set of transcripts from your 454 transcriptome for further analyses?
                      Hmm, that could actually work, hadn't thought of that. I always thought of running CD-HIT per isogroup with some looping script. Taking all contigs and isotigs into a CD-HIT run might collapse paralogues, though...

                      Are there any single reads anywhere else that are neither contigs nor isotigs but are still useful?
                      Yep, but so far, newbler does not output them in a separate file. You can get the IDs of the singleton reads from the 454ReadStatus file. Further, check this post:

                      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
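A minimal sketch of pulling the singleton IDs out of the read status file. The layout is an assumption on my part: tab-separated lines with the read accession in the first column and the status in the second, and "Singleton" as the status label. Check your own file before relying on this. The read names in the demo are made up.

```python
def singleton_ids(status_lines, status_label="Singleton"):
    """Return read IDs whose status column equals status_label.

    Assumes tab-separated lines: accession, status, then other columns.
    """
    ids = []
    for line in status_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] == status_label:
            ids.append(fields[0])
    return ids


# Made-up lines standing in for the read status file.
demo_status = [
    "Accno\tRead Status\n",
    "read001\tAssembled\n",
    "read002\tSingleton\n",
    "read003\tOutlier\n",
    "read004\tSingleton\n",
]

print(singleton_ids(demo_status))  # ['read002', 'read004']
```

The resulting IDs can then be used to extract the corresponding reads from the original read file with whatever tool you normally use for that.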



                      • #12
                        Hi flxlex,
                        Thanks for the quick reply and the answers.

                        Originally posted by flxlex
                        Taking all contigs and isotigs into a CD-HIT run might collapse paralogues, though...
                        Looking at CD-hit, by default it looks for 98% identity or greater, which I think should be stringent enough not to collapse any paralogs (paralogs would have to be from a very recent gene duplication event or from a CNV for that to happen) but it is a good point to bear in mind.

                        To correct myself: for cd-hit-est the identity threshold is 0.9 by default, so for my purposes it should be set to 0.98 explicitly.
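For what it is worth, a sketch of how such a run could be launched from a script. It uses the `-i` (input), `-o` (output) and `-c` (identity threshold) options of cd-hit-est; the file names are hypothetical, and the combined FASTA file is assumed to have been created beforehand. All other options are left at their defaults.

```python
import shutil
import subprocess


def cdhit_est_command(input_fasta, output_fasta, identity=0.98):
    """Build the cd-hit-est argument list with an explicit identity threshold."""
    return ["cd-hit-est", "-i", input_fasta, "-o", output_fasta, "-c", str(identity)]


def run_cdhit_est(input_fasta, output_fasta, identity=0.98):
    """Run cd-hit-est if it is installed; raise otherwise."""
    cmd = cdhit_est_command(input_fasta, output_fasta, identity)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("cd-hit-est was not found on PATH")
    subprocess.run(cmd, check=True)


# Hypothetical file names: isotigs plus selected contigs, merged beforehand.
command = cdhit_est_command("isotigs_plus_contigs.fna", "nonredundant.fna")
print(" ".join(command))
# cd-hit-est -i isotigs_plus_contigs.fna -o nonredundant.fna -c 0.98
```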

                        Thanks again,

                        John.
                        Last edited by poisson200; 10-28-2010, 05:36 AM.
