  • Isogroup Sequence

    Hello all,

    I have a question about a transcriptome that was sequenced by Titanium 454 and assembled using Newbler's cDNA option. I'd like to do a blast search with my "unigenes" (i.e. isogroups, any contigs/isotigs not in an isogroup, and singletons). Of course, I have many more isotigs than isogroups, so it would be helpful if I could just blast the consensus sequence of an isogroup instead of blasting all the isotigs and contigs in the isogroup. Does anyone know how one might assemble all of the isotigs in an isogroup? Preferably it would be an automated process so that I don't have to do it 17,000-ish times.

    I think the real kicker is that I don't have a reference genome for this transcriptome. Best I could do is a (relatively) closely related organism from a different genus within the family. This is going to make my head spin when I try to quantitate gene expression, so if anyone has a suggestion for mapping and quantitating a nematode transcriptome without a good reference genome I would appreciate any advice!!!

  • #2
    A transcriptome project usually has multiple samples being sequenced, e.g., a time-course experiment or a tissue differential experiment. What I do is create one assembly using the reads from all of the samples. The assembly is then annotated with the 'Blast2Go' program. Then I count the number of reads from each sample that contribute to each isotig (and singleton) in the combined assembly. This gives a rough idea of expression. It does over-represent isogroups (since multiple isotigs can be in an isogroup); however, it also allows for the detection of differential expression of isotigs, which is probably more important.

    Using your relatively closely related organism and running the 454 mapping program might be better than the above.

    As an example, here are the columns in the spreadsheet that I give to my customers. Each column header is shown next to its value from the first line of data. In this example all of the samples have roughly the same number of reads. If there are only two samples, my scripts compute a differential ratio. With more than two samples I leave that up to the customers, since they know their experiment better than I do.

    Column header                       First line of data
    Contig                              isotig15101
    Contig length                       13055
    Total reads                         1762
    Reads avg length                    342
    Total bases                         602669
    Coverage                            46.2
    Sample 372 reads                    69
    Sample 373 reads                    68
    Sample 374 reads                    44
    (etc. for the rest of the samples)
    Isogroup                            isogroup11437
    Blast hit                           nadh dehydrogenase subunit 5
    GO terms                            organelle inner membrane;establishment of localization...
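The Coverage value in the example row is consistent with total bases divided by contig length, and a two-sample differential ratio could be computed along the lines below. This is only a sketch: the function names, the pseudocount, and the library sizes are assumptions made for illustration, not the scripts described in the post.

```python
def coverage(total_bases: int, contig_length: int) -> float:
    """Average depth: sequenced bases assigned to a contig per contig base."""
    return total_bases / contig_length


def differential_ratio(reads_a: int, reads_b: int,
                       library_a: int, library_b: int,
                       pseudocount: float = 1.0) -> float:
    """Ratio of library-size-scaled read counts for two samples.

    ASSUMPTION: scaling by total reads per sample and adding a pseudocount
    (to avoid dividing by zero) are illustrative choices only.
    """
    rate_a = (reads_a + pseudocount) / library_a
    rate_b = (reads_b + pseudocount) / library_b
    return rate_a / rate_b


# Values from the example row: 602669 total bases over a 13055 bp isotig.
depth = coverage(602669, 13055)
print(round(depth, 1))

# Samples 372 and 373 (69 and 68 reads); the library sizes are hypothetical.
ratio = differential_ratio(69, 68, library_a=100_000, library_b=100_000)
print(round(ratio, 3))
```

With equal library sizes the ratio reduces to a plain ratio of the (pseudocount-adjusted) read counts.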



    • #3
      I'm currently trying to do something similar. I want to map the raw reads against the isotigs, but since mapping ignores reads that aren't unique (and that would skew the data), I need to use only one isotig from each isogroup.

      It takes some data manipulation, but you can identify which isotig in an isogroup is the largest (more likely to get a BLAST hit and to represent most of the exons) from the 454isotigslayout.txt file, then use a grep-like function in your language of choice to get the sequence of that isotig from the 454isotigs.fna file. If your data is like mine, many isogroups will have only one isotig anyway.
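A minimal sketch of picking the longest isotig per isogroup is shown below. Instead of reading the layout file, it measures each sequence directly and takes the isogroup from the FASTA header. The `gene=isogroupNNNNN` header token is an assumption about how the isotig FASTA is labelled, so check your own file before relying on it; the function names and the tiny example records are made up.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isotig_per_isogroup(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each isogroup to the (isotig name, sequence) of its longest isotig."""
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in read_fasta(lines):
        name = header.split()[0]
        # ASSUMPTION: the isogroup appears as a "gene=..." token in the header.
        match = re.search(r"gene=(\S+)", header)
        group = match.group(1) if match else name  # no isogroup: keep as its own group
        if group not in best or len(seq) > len(best[group][1]):
            best[group] = (name, seq)
    return best


# Hypothetical records standing in for the isotig FASTA file.
example = """\
>isotig00001 gene=isogroup00001 length=12
ACGTACGTACGT
>isotig00002 gene=isogroup00001 length=8
ACGTACGT
>isotig00007 gene=isogroup00002 length=4
ACGT
""".splitlines()

representatives = longest_isotig_per_isogroup(example)
for group, (name, seq) in sorted(representatives.items()):
    print(f">{name} {group} length={len(seq)}")
    print(seq)
```

For a real file, pass an open file handle instead of `example`; the output is a FASTA with one representative per isogroup.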

      Also, there is something weird in the 454allcontigs.fna file for contigs that form an isotig: the sequence of the previous contig is appended to the output sequence, so the listed length is incorrect.



      • #4
        Actually, I'm wanting to quantify expression because I'd like to compare with another species. It seems like it would be easier to do this if I were working with 2 libraries from the same organism. When we did the sequencing, one of my libraries came out MUCH better than the other, so I need to do a TON of normalization in order to compare the expression of orthologs in my two organisms. I know that I need to normalize to the size of the gene, but, as I said, I don't have a reference. The next best thing would be to normalize to the total length of the isogroup, but I can't figure out how to find that.



        • #5
          Jeremy, I know it's been a LONG time since my original post, but it seems like there's a perl script on this page that might be useful for you.

          It tells you exactly which reads are in each isogroup. Is that what you were trying to do?



          • #6
            I am not sure if "total length of the isogroup" makes much sense. For example below is an isogroup from one of my recent projects. I have put dots (.) in place of spaces so that the alignment looks better.



            isogroup00001 numIsotigs=6 numContigs=5
            ...Length : .1379 ..768 ...11 .1644 .1597 (bp)
            ...Contig : 20827 20828 20831 20829 20830 Total:
            isotig00001 >>>>> >>>>> >>>>> >>>>> >>>>> 5399
            isotig00002 ..... >>>>> ..... >>>>> >>>>> 4020
            isotig00003 >>>>> >>>>> ..... >>>>> >>>>> 5388
            isotig00004 >>>>> >>>>> ..... >>>>> ..... 3791
            isotig00005 ..... ..... <<<<< ..... <<<<< 1608
            isotig00006 >>>>> >>>>> ..... ..... ..... 2147


            The different isotigs inside the given isogroup have different lengths. So what is the isogroup length? The longest of the isotigs? The sum of the lengths on the 2nd line? An average of the individual isotig lengths?
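To make the three candidate definitions concrete, here is a small sketch that parses the layout block exactly as posted above (dots included) and computes each one. The parsing rules are inferred from this one example, so treat them as assumptions rather than a full parser for the layout file.

```python
import re
from statistics import mean

LAYOUT = """\
isogroup00001 numIsotigs=6 numContigs=5
...Length : .1379 ..768 ...11 .1644 .1597 (bp)
...Contig : 20827 20828 20831 20829 20830 Total:
isotig00001 >>>>> >>>>> >>>>> >>>>> >>>>> 5399
isotig00002 ..... >>>>> ..... >>>>> >>>>> 4020
isotig00003 >>>>> >>>>> ..... >>>>> >>>>> 5388
isotig00004 >>>>> >>>>> ..... >>>>> ..... 3791
isotig00005 ..... ..... <<<<< ..... <<<<< 1608
isotig00006 >>>>> >>>>> ..... ..... ..... 2147
"""


def candidate_lengths(block: str) -> dict:
    """Return three possible 'isogroup lengths' for one layout block."""
    contig_lengths = []
    isotig_totals = {}
    for raw in block.splitlines():
        line = raw.lstrip(". ")  # drop the padding dots/spaces before labels
        if line.startswith("Length"):
            contig_lengths = [int(n) for n in re.findall(r"\d+", line)]
        elif line.startswith("isotig"):
            tokens = line.split()
            isotig_totals[tokens[0]] = int(tokens[-1])  # last column is the total
    return {
        "longest_isotig": max(isotig_totals.values()),
        "sum_of_contigs": sum(contig_lengths),
        "mean_isotig": mean(isotig_totals.values()),
    }


lengths = candidate_lengths(LAYOUT)
print(lengths)
# {'longest_isotig': 5399, 'sum_of_contigs': 5399, 'mean_isotig': 3725.5}
```

In this isogroup the longest isotig and the contig sum agree because isotig00001 uses every contig; the mean is much lower.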



            • #7
              I suppose I would say that the "total isogroup length" is equal to the length of an isotig with all the exons included. So would I just take the longest isotig then?



              • #8
                Originally posted by SammyGirl:
                "I suppose I would say that the "total isogroup length" is equal to the length of an isotig with all the exons included. So would I just take the longest isotig then?"

                Not necessarily. While the longest isotig is likely to contain all of the exons (my example above does), it is possible for the longest isotig to not contain all of the contigs. So summing up the lengths in the "Length:" line is the correct way. On the other hand, it would not be too far wrong to just take the longest isotig, and some people might argue that this is even more correct.



                • #9
                  So now that I'm looking at your data, it makes sense. If I sum the lengths of all 5 contigs in your example, I get the total length of isotig00001 because it contains all the contigs. Of course, when I went back to my file, I found a lot of instances where there are contigs reported for an isogroup that aren't included in any isotigs. Do you have any idea why that is? Could it be because the person who did my assembly set a contig length cutoff?



                  • #10
                    Originally posted by SammyGirl:
                    "... I found a lot of instances where there are contigs reported for an isogroup that aren't included in any isotigs. Do you have any idea why that is?"

                    I suspect that it is because the assembler did not use the '-rip' option. Thus any given read could be scattered over multiple contigs. These shorter and ripped up reads would not be included in isotigs.

                    There may be another reason as well. Without looking at the data it is hard to tell.



                    • #11
                      Originally posted by SammyGirl:
                      "Actually, I'm wanting to quantify expression because I'd like to compare with another species. It seems like it would be easier to do this if I were working with 2 libraries from the same organism. When we did the sequencing, one of my libraries came out MUCH better than the other, so I need to do a TON of normalization in order to compare the expression of orthologs in my two organisms. I know that I need to normalize to the size of the gene, but, as I said, I don't have a reference. The next best thing would be to normalize to the total length of the isogroup, but I can't figure out how to find that."

                      I did the cDNA assembly of both sequence files together to produce a single set of isotigs representing both samples, then mapped each sample against the file of non-redundant isotigs that I generated (using gsmapper). gsmapper outputs a file with the number of reads per isotig, which I plugged directly into DESeq (R package) to identify differential expression. (Edit: not quite directly. Some zeroes need to be added for isotigs that no reads mapped to in one sample, so that they can be compared with the other sample that did have reads map.)
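The zero-filling step could look roughly like the sketch below. It assumes the per-sample read counts have already been pulled out of the mapping output into dictionaries; the sample names, the counts, and the function name are hypothetical, and parsing the gsmapper files themselves is not shown.

```python
from typing import Dict, List, Tuple


def build_count_table(
    counts_by_sample: Dict[str, Dict[str, int]]
) -> Tuple[List[str], List[list]]:
    """Combine per-sample read counts into one table, using 0 where a
    sample had no reads mapped to an isotig."""
    samples = list(counts_by_sample)
    # Every isotig seen in at least one sample gets a row.
    isotigs = sorted(set().union(*counts_by_sample.values()))
    header = ["isotig"] + samples
    rows = [
        [isotig] + [counts_by_sample[s].get(isotig, 0) for s in samples]
        for isotig in isotigs
    ]
    return header, rows


# Hypothetical counts: isotig00002 has no reads in sample_A,
# isotig00003 has none in sample_B.
counts = {
    "sample_A": {"isotig00001": 12, "isotig00003": 7},
    "sample_B": {"isotig00001": 9, "isotig00002": 4},
}

header, rows = build_count_table(counts)
for row in [header] + rows:
    print("\t".join(str(value) for value in row))
```

The tab-separated output has one row per isotig and one integer column per sample, which is the general shape a count-based package such as DESeq expects.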
                      Last edited by Jeremy; 11-10-2010, 10:34 PM.



                      • #12
                        I have been thinking about how to get around the problem of multiple isotigs per isogroup.

                        In most of my cases the longest isotig in an isogroup does not use all of the contigs, so just taking the longest isotig will cause some contigs (exons) to be excluded. Using only the largest isotig as a reference therefore means that any differential expression identified may in fact represent the same expression level but from different mRNA isoforms, if one of the isoforms uses exons not included in the reference file.

                        You could represent each isogroup by taking all the contigs within it, but then the output file for the contigs has other contig data appended to the beginning of it, requiring even more data manipulation. Plus, this approach would not identify different isoforms expressed at the same level.

                        Allowing reads to map to multiple locations and using all isotigs would get around this problem, but it will result in an artificially inflated read count for isotigs in a large isogroup and may allow some reads, such as poly A, to be included that would otherwise be identified as repeats.

                        Has anybody else dealt with this?



                        • #13
                          Drawing on all of my (VERY, VERY LITTLE) knowledge of programming, I managed to come up with a script that uses the 454IsotigsLayout.txt file to look up the contigs that are included in the isotigs of an isogroup (as indicated by the '>>>>>' and '<<<<<' symbols under the contig names). I summed the lengths of these contigs and used that number to do my gene size correction. It's not perfect, but it was the best thing I could come up with.
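The script itself was not posted, so the following is only a guess at the idea: sum the lengths of contigs that carry a '>>>>>' or '<<<<<' mark in at least one isotig row, then divide a read count by that length. The layout block below is hypothetical (its third contig is used by no isotig), the parsing is modelled on the example in post #6, and reads per kilobase is an assumed form of the size correction.

```python
import re

# Hypothetical layout block; contig 00103 is not used by any isotig.
LAYOUT = """\
isogroup00042 numIsotigs=2 numContigs=3
   Length :   500   300   200 (bp)
   Contig : 00101 00102 00103 Total:
isotig00090 >>>>> >>>>> .....   800
isotig00091 >>>>> ..... .....   500
"""


def used_contig_length(block: str) -> int:
    """Sum the lengths of contigs marked as used by at least one isotig."""
    lengths = []
    used = set()
    for raw in block.splitlines():
        line = raw.lstrip(". ")  # labels may be padded with spaces or dots
        if line.startswith("Length"):
            lengths = [int(n) for n in re.findall(r"\d+", line)]
        elif line.startswith("isotig"):
            # Columns after the isotig name line up with the contigs;
            # the trailing total is cut off by slicing to len(lengths).
            marks = line.split()[1:][:len(lengths)]
            for column, mark in enumerate(marks):
                if mark.startswith((">", "<")):
                    used.add(column)
    return sum(lengths[i] for i in used)


def reads_per_kb(read_count: int, length_bp: int) -> float:
    """ASSUMPTION: size correction expressed as reads per kilobase."""
    return read_count / (length_bp / 1000)


isogroup_length = used_contig_length(LAYOUT)
print(isogroup_length)                    # 800 (the unused 200 bp contig is skipped)
print(reads_per_kb(40, isogroup_length))  # 50.0
```

Contigs listed for the isogroup but absent from every isotig do not contribute, which matches the situation described in post #9.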



                          • #14
                            urgent for answer

                            Originally posted by westerman:
                            "I suspect that it is because the assembler did not use the '-rip' option. Thus any given read could be scattered over multiple contigs. These shorter and ripped up reads would not be included in isotigs.

                            There may be another reason as well. Without looking at the data it is hard to tell."

                            Excuse me, I would like to know: for a cDNA assembly, should reads be assembled into multiple contigs or not? If not, in what situation should reads be assembled into multiple contigs? I urgently need an answer...
