**westerman** · 10-04-2010, 10:59 AM

Usually with a transcriptome project you will have multiple samples being sequenced, for example a time-course experiment or a tissue differential experiment. What I do is create an assembly using the reads from all of the samples. The resulting isotigs are then annotated via the 'Blast2Go' program. Then I count up the number of reads from each sample that contribute to isotigs (and singletons) in the combined assembly. This gives a rough idea of expression. It does over-represent isogroups (since multiple isotigs can be in an isogroup); however, it also allows for the detection of differential expression of isotigs, which is probably more important.
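The per-sample counting described above can be sketched roughly as follows. This is only an illustration: it assumes you have already extracted which isotig each read contributes to and which sample each read came from. The input dictionaries and the function name `count_reads` are hypothetical, not something the assembler produces directly.

```python
from collections import defaultdict

def count_reads(read_to_isotig, read_to_sample):
    """Return {isotig: {sample: read count}} for reads placed in the assembly."""
    counts = defaultdict(lambda: defaultdict(int))
    for read_id, isotig in read_to_isotig.items():
        sample = read_to_sample.get(read_id)
        if sample is None:
            continue  # read of unknown origin: skip it
        counts[isotig][sample] += 1
    # Convert the nested defaultdicts to plain dicts for easier printing.
    return {iso: dict(per_sample) for iso, per_sample in counts.items()}

# Toy data (hypothetical read IDs and sample names).
read_to_isotig = {"r1": "isotig15101", "r2": "isotig15101", "r3": "isotig00002"}
read_to_sample = {"r1": "372", "r2": "373", "r3": "372"}
table = count_reads(read_to_isotig, read_to_sample)
```

Each inner dictionary then corresponds to one row of per-sample read counts for an isotig.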

Mapping your reads against your relatively closely related organism with the 454 mapping program might work better than the above.

As an example, here are the columns in the spreadsheet that I give to my customers. The column headers are on the left and the first line of data is on the right. In this example all of the samples are roughly the same in number of reads. If there are only two samples then my scripts come up with a differential ratio. With more than two samples I leave that up to the customers since they know their experiment better than I do.

| Column | First line of data |
|---|---|
| Contig | isotig15101 |
| Contig length | 13055 |
| Total reads | 1762 |
| Reads avg length | 342 |
| Total bases | 602669 |
| Coverage | 46.2 |
| Sample 372 reads | 69 |
| Sample 373 reads | 68 |
| Sample 374 reads | 44 |
| etc. for the rest of the samples | (not shown) |
| Isogroup | isogroup11437 |
| Blast hit | nadh dehydrogenase subunit 5 |
| GO terms | organelle inner membrane;establishment of localization... |

**Jeremy** · 10-05-2010, 12:03 AM

I'm currently trying to do something similar. I want to map the raw reads against the isotigs, but since mapping ignores reads that aren't unique (which would skew the data), I need to use only one isotig from each isogroup.

It takes some data manipulation, but you can identify which isotig in an isogroup is the largest (and so more likely to get a BLAST hit and to represent most of the exons) from the 454IsotigsLayout.txt file, then use a grep-like function in your language of choice to get the sequence of that isotig from the 454Isotigs.fna file. If your data is like mine then many isogroups will only have one isotig anyway.
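A minimal sketch of that selection step is below. Note that it takes a shortcut: instead of reading lengths from the layout file, it measures the sequences in the isotig FASTA itself. It assumes the header of each record carries a `gene=isogroupNNNNN` field; check your own file, since that header format is an assumption here, and the function names are made up for illustration.

```python
import re

def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:], []])
        elif line and records:
            records[-1][1].append(line)
    return [(header, "".join(parts)) for header, parts in records]

def longest_isotig_per_isogroup(text):
    """Return {isogroup: (isotig name, sequence)}, keeping the longest isotig."""
    best = {}
    for header, seq in read_fasta(text):
        name = header.split()[0]
        match = re.search(r"gene=(\S+)", header)  # assumed header field
        group = match.group(1) if match else name  # no field: treat as its own group
        if group not in best or len(seq) > len(best[group][1]):
            best[group] = (name, seq)
    return best

# Toy FASTA (hypothetical sequences).
example = """>isotig00001 gene=isogroup00001 length=8
ACGTACGT
>isotig00002 gene=isogroup00001 length=4
ACGT
>isotig00003 gene=isogroup00002 length=6
GGC
CCA
"""
best = longest_isotig_per_isogroup(example)
```

Writing the kept records back out as FASTA gives a one-isotig-per-isogroup reference for mapping.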

Also, there is something weird in the 454AllContigs.fna file for contigs that form an isotig: the sequence of the previous contig is appended to the output sequence, so the listed length is incorrect.

**SammyGirl** · 10-22-2010, 09:06 AM

Actually, I'm wanting to quantify expression because I'd like to compare with another species. It seems like it would be easier to do this if I were working with 2 libraries from the same organism. When we did the sequencing, one of my libraries came out MUCH better than the other, so I need to do a TON of normalization in order to compare the expression of orthologs in my two organisms. I know that I need to normalize to the size of the gene, but, as I said, I don't have a reference. The next best thing would be to normalize to the total length of the isogroup, but I can't figure out how to find that.

**SammyGirl** · 10-22-2010, 09:56 AM

Jeremy, I know it's been a LONG time since my original post, but it seems like there's a perl script on this page that might be useful for you.

Detection of alternative splicing events from 454 output - SEQanswers

http://seqanswers.com/forums/showthread.php?p=26445#post26445

It tells you exactly which reads are in each isogroup. Is that what you were trying to do?

**westerman** · 10-22-2010, 10:07 AM

I am not sure if "total length of the isogroup" makes much sense. For example below is an isogroup from one of my recent projects. I have put dots (.) in place of spaces so that the alignment looks better.

isogroup00001 numIsotigs=6 numContigs=5
...Length : .1379 ..768 ...11 .1644 .1597 (bp)
...Contig : 20827 20828 20831 20829 20830 Total:
isotig00001 >>>>> >>>>> >>>>> >>>>> >>>>> 5399
isotig00002 ..... >>>>> ..... >>>>> >>>>> 4020
isotig00003 >>>>> >>>>> ..... >>>>> >>>>> 5388
isotig00004 >>>>> >>>>> ..... >>>>> ..... 3791
isotig00005 ..... ..... <<<<< ..... <<<<< 1608
isotig00006 >>>>> >>>>> ..... ..... ..... 2147

The different isotigs inside the given isogroup have different lengths. So what is the isogroup length? The longest of the isotigs? The sum of the lengths on the 2nd line? An average of the individual isotig lengths?
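To make the three candidates concrete, here is a small calculation using the numbers from the layout above. It is just arithmetic on the posted values, not a recommendation of which definition to use.

```python
# Contig lengths from the "Length" line of isogroup00001 above.
contig_lengths = [1379, 768, 11, 1644, 1597]

# Isotig totals from the right-hand column of the same layout.
isotig_lengths = {
    "isotig00001": 5399,
    "isotig00002": 4020,
    "isotig00003": 5388,
    "isotig00004": 3791,
    "isotig00005": 1608,
    "isotig00006": 2147,
}

longest = max(isotig_lengths.values())                           # longest isotig
contig_sum = sum(contig_lengths)                                 # sum of the Length line
mean_isotig = sum(isotig_lengths.values()) / len(isotig_lengths) # average isotig length

print(longest, contig_sum, mean_isotig)  # 5399 5399 3725.5
```

In this isogroup the first two definitions agree because isotig00001 contains every contig; the average is much smaller.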

**SammyGirl** · 10-22-2010, 10:15 AM

I suppose I would say that the "total isogroup length" is equal to the length of an isotig with all the exons included. So would I just take the longest isotig then?

**westerman** · 10-22-2010, 10:28 AM

Originally posted by SammyGirl:

I suppose I would say that the "total isogroup length" is equal to the length of an isotig with all the exons included. So would I just take the longest isotig then?

Not necessarily. While the longest isotig is likely to contain all of the exons (my example above does), it is possible for the longest isotig to not contain all of the contigs. So summing up the lengths in the "Length:" line is the correct way. On the other hand, it would not be too far wrong to just take the longest isotig, and some people might argue that this is even more correct.

**SammyGirl** · 10-22-2010, 10:33 AM

So now that I'm looking at your data, it makes sense. If I sum the lengths of all 5 contigs in your example, I get the total length of isotig00001 because it contains all the contigs. Of course, when I went back to my file, I found a lot of instances where there are contigs reported for an isogroup that aren't included in any isotigs. Do you have any idea why that is? Could it be because the person who did my assembly set a contig length cutoff?

**westerman** · 10-22-2010, 10:38 AM

Originally posted by SammyGirl:

... I found a lot of instances where there are contigs reported for an isogroup that aren't included in any isotigs. Do you have any idea why that is?

I suspect that it is because the assembler did not use the '-rip' option. Thus any given read could be scattered over multiple contigs. These shorter and ripped up reads would not be included in isotigs.

There may be another reason as well. Without looking at the data it is hard to tell.

**Jeremy** · 11-10-2010, 10:28 PM

Originally posted by SammyGirl:

Actually, I'm wanting to quantify expression because I'd like to compare with another species. It seems like it would be easier to do this if I were working with 2 libraries from the same organism. When we did the sequencing, one of my libraries came out MUCH better than the other, so I need to do a TON of normalization in order to compare the expression of orthologs in my two organisms. I know that I need to normalize to the size of the gene, but, as I said, I don't have a reference. The next best thing would be to normalize to the total length of the isogroup, but I can't figure out how to find that.

I did the cDNA assembly of both sequence files together to produce a single set of isotigs representing both samples, then used gsmapper to map each sample against the file of non-redundant isotigs that I generated. gsmapper outputs a file with the number of reads per isotig, which I plugged directly into DESeq (an R package) to identify differential expression. (Edit: not quite directly. Zeroes need to be added where no reads from one sample mapped to an isotig, so that it can be compared with the other sample that did have reads map.)
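The zero-filling step mentioned in the edit can be sketched as below. It assumes the per-isotig read counts for each sample have already been read into dictionaries; the parsing of the mapper output is not shown, and the names `build_count_table`, `sample_a` and `sample_b` are hypothetical.

```python
def build_count_table(counts_a, counts_b):
    """Merge two {isotig: reads} dicts into rows, filling missing isotigs with 0."""
    isotigs = sorted(set(counts_a) | set(counts_b))
    return [(iso, counts_a.get(iso, 0), counts_b.get(iso, 0)) for iso in isotigs]

# Toy counts (hypothetical): isotig00002 has no reads in sample B and
# isotig00003 has none in sample A.
sample_a = {"isotig00001": 12, "isotig00002": 5}
sample_b = {"isotig00001": 30, "isotig00003": 7}

rows = build_count_table(sample_a, sample_b)

# Tab-separated text with one row per isotig, suitable for loading elsewhere.
tsv = "\n".join("\t".join(str(field) for field in row) for row in rows)
```

Every isotig seen in either sample now has a count in both columns, so the two samples line up row by row.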

**Jeremy** · 11-11-2010, 11:57 PM

I have been thinking about how to get around the problem of multiple isotigs per isogroup.

In most of my cases the longest isotig in an isogroup does not use all of the contigs, so just taking the longest isotig will cause some contigs (exons) to be excluded. Using only the largest isotig as a reference therefore means that any differential expression identified may in fact represent the same expression level but from different mRNA isoforms, if one of the isoforms uses exons not included in the reference file.
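One way to see how often this happens is to list, for each isogroup, the contigs that the longest isotig leaves out. The sketch below assumes the layout has already been parsed into two dictionaries and approximates an isotig's length as the sum of its contig lengths; the function name and the toy data are hypothetical.

```python
def contigs_missed_by_longest(contig_lengths, isotig_contigs):
    """Return (longest isotig, contigs used by other isotigs but not by it).

    contig_lengths: {contig: length in bp}
    isotig_contigs: {isotig: list of contigs it traverses}
    """
    def isotig_length(isotig):
        # Approximation: isotig length = sum of its contig lengths.
        return sum(contig_lengths[c] for c in isotig_contigs[isotig])

    longest = max(isotig_contigs, key=isotig_length)
    used_anywhere = set().union(*isotig_contigs.values())
    missed = sorted(used_anywhere - set(isotig_contigs[longest]))
    return longest, missed

# Toy isogroup (hypothetical): the longest isotig skips two short contigs.
contig_lengths = {"c1": 1000, "c2": 50, "c3": 900, "c4": 30}
isotig_contigs = {
    "isotigA": ["c1", "c3"],        # 1900 bp
    "isotigB": ["c1", "c2", "c4"],  # 1080 bp
}
longest, missed = contigs_missed_by_longest(contig_lengths, isotig_contigs)
print(longest, missed)  # isotigA ['c2', 'c4']
```

Reads from the missed contigs would have nowhere to map if only the longest isotig were used as the reference.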

You could represent each isogroup by taking all the contigs within it, but the output file for the contigs has other contig data appended to the beginning of each sequence, which requires even more data manipulation. Plus, this approach would not identify different isoforms expressed at the same level.

Allowing reads to map to multiple locations and using all isotigs would get around this problem, but it will result in an artificially inflated read count for isotigs in a large isogroup, and it may allow some reads, such as poly-A reads, to be included that would otherwise be identified as repeats.

Has anybody else dealt with this?

**SammyGirl** · 11-12-2010, 06:55 AM

Drawing on all of my (VERY, VERY LITTLE) knowledge of programming, I managed to come up with a script that uses the 454IsotigsLayout.txt file to look up the contigs that are included in the isotigs of an isogroup (as indicated by the '>>>>>' and '<<<<<' symbols under the contig names). I summed the lengths of these contigs and used that number to do my gene size correction. It's not perfect, but it was the best thing I could come up with.
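For anyone wanting to try the same thing, a rough sketch of that approach is below. This is not the original script. It assumes the dotted layout exactly as posted earlier in this thread, where every column is a separate whitespace-delimited token; the raw file uses spaces instead of dots, so there the columns would have to be located by character position. The function name is made up.

```python
def used_contig_length(layout_lines):
    """Sum the lengths of contigs marked '>>>>>' or '<<<<<' in any isotig row."""
    lengths, contigs, used = [], [], set()
    for line in layout_lines:
        tokens = line.split()
        if not tokens:
            continue
        label = tokens[0].strip(".")
        if label == "Length":
            # Tokens between ':' and '(bp)' are the dot-padded contig lengths.
            lengths = [int(t.strip(".")) for t in tokens[2:-1]]
        elif label == "Contig":
            # Tokens between ':' and 'Total:' are the contig names.
            contigs = tokens[2:-1]
        elif label.startswith("isotig"):
            # Tokens between the isotig name and its total are the markers.
            for contig, mark in zip(contigs, tokens[1:-1]):
                if ">" in mark or "<" in mark:
                    used.add(contig)
    length_of = dict(zip(contigs, lengths))
    return sum(length_of[c] for c in used)

# The isogroup posted earlier in the thread, in its dotted form.
layout = """isogroup00001 numIsotigs=6 numContigs=5
...Length : .1379 ..768 ...11 .1644 .1597 (bp)
...Contig : 20827 20828 20831 20829 20830 Total:
isotig00001 >>>>> >>>>> >>>>> >>>>> >>>>> 5399
isotig00002 ..... >>>>> ..... >>>>> >>>>> 4020
isotig00003 >>>>> >>>>> ..... >>>>> >>>>> 5388
isotig00004 >>>>> >>>>> ..... >>>>> ..... 3791
isotig00005 ..... ..... <<<<< ..... <<<<< 1608
isotig00006 >>>>> >>>>> ..... ..... ..... 2147
""".splitlines()

total = used_contig_length(layout)
print(total)  # 5399
```

A contig that shows only '.....' in every isotig row is left out of the sum, which is the case of contigs reported for an isogroup but not included in any isotig.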

**shaojingwang** · 11-15-2010, 08:02 PM

urgent for answer

Originally posted by westerman:

I suspect that it is because the assembler did not use the '-rip' option. Thus any given read could be scattered over multiple contigs. These shorter and ripped up reads would not be included in isotigs.

There may be another reason as well. Without looking at the data it is hard to tell.

Excuse me, I would like to know: in a cDNA assembly, should reads be assembled into multiple contigs or not? If not, in what situation should reads be assembled into multiple contigs? An answer is urgently needed...

Isogroup Sequence