**vinay052003** · 04-15-2010, 02:30 PM

I would align each consensus sequence to the reference genome and then compare the alignments with the putative transcripts from UCSC, NCBI or Ensembl. That will show you novel splice variants in your data.
You might need to write some script/code for the processing.
Another good way is to look at the ASAP2 database. They have alternative splice variants detected from UniGene/dbEST. There are a couple of other such databases, but I can't recall their names off the top of my head right now.

**westerman** · 04-16-2010, 07:37 AM

I am not really sure what type of information you want nor in what format. The IsotigsLayout file probably contains what you need. All of the isogroups with more than one isotig are representative of alternative splicing events so, looking at a simple example:

<pre>
>isogroup01387  numIsotigs=2  numContigs=3
  Length :       429    264    460   (bp)
  Contig :     00424  08522  05780   Total:
  isotig03834  >>>>>         >>>>>   889
  isotig03835         >>>>>  >>>>>   724
</pre>

If what you want is information on which contigs are invariant and which are involved in splicing, then I suppose you could parse the above information. Or you could try the Isotigs.ace file.

If you want to know about isotig 'A' being a splice variant of isotig 'B' then the above will list that.

If you want to find novel splicing then, yes, alignment back to the reference genome could be useful. Or looking at a database of known splicing.

In any case it does seem like some data manipulation will be required.
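
One way to pull out the invariant and variable contigs is with plain set operations, sketched below in Python. The isotig-to-contig membership is typed in by hand and is an assumption inferred from the totals in the example (429 + 460 = 889 and 264 + 460 = 724); a real script would read it from 454IsotigsLayout.txt. The function name `split_contigs` is made up for illustration.

```python
def split_contigs(isotigs):
    """Return (invariant, variable) contig sets for one isogroup.

    isotigs maps an isotig name to the set of contigs it threads through.
    """
    members = list(isotigs.values())
    invariant = set.intersection(*members)       # present in every isotig
    variable = set.union(*members) - invariant   # present in only some isotigs
    return invariant, variable


# Membership inferred from the example totals (an assumption, see above).
isogroup01387 = {
    "isotig03834": {"contig00424", "contig05780"},
    "isotig03835": {"contig08522", "contig05780"},
}

invariant, variable = split_contigs(isogroup01387)
print("invariant:", sorted(invariant))
print("variable: ", sorted(variable))
```

Under that assumed membership, contig05780 is shared by both isotigs and the other two contigs are the alternatives.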

**Khanjan** · 04-19-2010, 05:49 AM

Hey Vinay,

I am doing this for the prairie vole transcriptome. Its genome has not yet been sequenced, so we have no reference to compare against.

Thanks,
Khanjan

**Khanjan** · 04-20-2010, 06:11 AM

Originally posted by westerman

... The IsotigsLayout file probably contains what you need. All of the isogroups with more than one isotig are representative of alternative splicing events ...

In any case it does seem like some data manipulation will be required.

Yes, that is right. I am doing the same thing. It just gets complicated when a larger number of contigs is involved. Some are 5' or 3' extensions, while some are indels of several bases.
Also, one more peculiar thing I found: there is a discrepancy between 454ReadStatus.txt and 454AllContigs.fna regarding the number of reads forming a contig.

For example, the def line of a contig in 454AllContigs.fna says
>contig01359 length=1060 numreads=402 and so on,

However, if I grep for contig01359 in 454ReadStatus.txt to obtain all the read IDs that went into the contig, I get only 383 reads.
Am I doing something wrong here? The contig I described above is an isotig. I obtained all the isotigs for the isogroup, but the number of reads given in the def line is not the same as that obtained from 454ReadStatus.txt.
Has anyone had the same problem?

Thanks a lot,
Khanjan

**westerman** · 04-20-2010, 06:38 AM

Originally posted by Khanjan

... but the number of reads defined in the def line is not same as that obtained from ReadStatus.
Has anyone had the same problem??

Can't say that I have ever noticed a problem. To double-check, I just looked at my most recent 454 project. I did not check everything; however, in a spot check of 4 different contigs ranging in length from 543 to 1644 bases and in coverage from 20 reads to 1309 reads, I found that the AllContigs.fna and ReadStatus files gave me the same read numbers. In other words, I see no problem.

As for the rest of your post, I am still not certain what output you want (an example would be useful). In addition to parsing the IsotigsLayout and/or the Isotigs.ace file, using both the RefLink.txt and Isotigs.fna files might prove worthwhile.

**Khanjan** · 04-20-2010, 06:58 AM

That's nice! But then I do not understand what is going on at my end. I will paste some examples; if they make any sense to you, please do tell me.

<pre>
$ grep "contig01359" 454AllContigs.fna
>contig01359 length=1060 numreads=402 gene=isogroup00051 status=isotig
$ grep -c "contig01359" 454ReadStatus.txt
383
$ grep "contig01552" 454AllContigs.fna
>contig01552 length=595 numreads=1236 gene=isogroup00065 status=isotig
$ grep -c "contig01552" 454ReadStatus.txt
15
$ grep "contig13419" 454AllContigs.fna
>contig13419 length=308 numreads=2 gene=isogroup00997 status=isotig
$ grep -c "contig13419" 454ReadStatus.txt
7
$ grep "contig13419" 454ReadStatus.txt
GEK13OB02HUIS6 Assembled contig13419 71 - contig13419 15 +
GCTUXKD02I3A06 Assembled contig13419 1 + contig27255 398 -
GCTUXKD02IFIQQ PartiallyAssembled contig27255 518 - contig13419 269 +
GCTUXKD02GE4SE PartiallyAssembled contig27255 518 - contig13419 273 +
GCTUXKD02J1OTX PartiallyAssembled contig27255 518 - contig13419 302 +
GCTUXKD02IMAKY PartiallyAssembled contig27255 518 - contig13419 245 +
GCTUXKD02ICDIT Assembled contig13419 1 + contig27255 339 -
</pre>

**Khanjan** · 04-20-2010, 07:27 AM

OK, it seems 454ReadStatus.txt lists only one or at most two contigs per read. So, is there any other way of getting all the reads that went into a contig?

Thanks a lot,
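
The pattern can be checked with a short Python sketch. It assumes each 454ReadStatus.txt line is whitespace-separated as accession, status, 5' contig, 5' position, 5' strand, 3' contig, 3' position, 3' strand, which is how the lines pasted above read; the function name is made up. Since only the contigs holding the two ends of a read are listed, a per-contig count taken from this file need not match numreads.

```python
from collections import defaultdict

SAMPLE = """\
GEK13OB02HUIS6 Assembled contig13419 71 - contig13419 15 +
GCTUXKD02I3A06 Assembled contig13419 1 + contig27255 398 -
GCTUXKD02IFIQQ PartiallyAssembled contig27255 518 - contig13419 269 +
GCTUXKD02GE4SE PartiallyAssembled contig27255 518 - contig13419 273 +
GCTUXKD02J1OTX PartiallyAssembled contig27255 518 - contig13419 302 +
GCTUXKD02IMAKY PartiallyAssembled contig27255 518 - contig13419 245 +
GCTUXKD02ICDIT Assembled contig13419 1 + contig27255 339 -
"""


def reads_per_contig(lines):
    """Map each contig to the set of reads whose 5' or 3' end lands in it."""
    reads = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 8:   # skip the header and reads that were not placed
            continue
        accession, five_contig, three_contig = fields[0], fields[2], fields[5]
        reads[five_contig].add(accession)
        reads[three_contig].add(accession)
    return reads


counts = {c: len(r) for c, r in reads_per_contig(SAMPLE.splitlines()).items()}
print(counts)
```

On the seven pasted lines this gives 7 reads for contig13419 and 6 for contig27255, the same 7 that `grep -c` reports.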

**westerman** · 04-20-2010, 09:36 AM

Ah, I now see what you are talking about. And indeed my files show the same -- at least for some (if not most) of the contigs listed as isotigs that also have more than one isotig per isogroup. That latter condition is important.

Short answer: I don't know why this is occurring.

When I get some time I will explore this some more.

**kmcarr** · 04-20-2010, 12:51 PM

Counting reads in a contig is a very tricky proposition when you have used the cDNA assembly mode of the Roche gsAssembler; a single read may thread through multiple contigs. Do you count it multiple times, only once, or fractionally depending on the proportion of the read in each contig? I think these are the issues Roche was trying to deal with, which led to this clear-as-mud solution.

Frankly, when I use the cDNA assembly tool in the gsAssembler, I ignore the 454AllContigs.fna file entirely. I use the 454Isotigs.fna for further analysis. Now it is possible to count reads in an isotig but remember, reads are not necessarily unique to a single isotig. The same read may be contained in multiple isotigs. Now you're back to the same questions as above.

If a read is assembled, it belongs to one and only one isogroup. I choose to focus on this for read counts. (As westerman pointed out, this really only matters if there is more than one isotig in the isogroup.) I've attached a short Perl script I use to parse the 454ReadStatus.txt and 454IsotigsLayout.txt files to count the number of reads assigned to each isogroup. It outputs two files, one just listing the counts for each isogroup, and a second listing the ID of each read in each isogroup.

[Usage notes: The script must be run from the assembly directory. The only input is the file name prefix you want for the two output files. If an isogroup contains multiple isotigs it will print multiple lines to the count file (one line for each isotig). This is done because the count file is merged into a table which contains one row for each isotig.]

Attached Files

membersFromLayout.pl (1.2 KB, 110 views)
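
For readers without the attachment, the idea of counting each read once per isogroup can be sketched in Python. This is not kmcarr's script: as a simplification it takes the contig-to-isogroup mapping from the `gene=` field of the 454AllContigs.fna def lines rather than from 454IsotigsLayout.txt, and it assumes the eight-column 454ReadStatus.txt layout seen earlier in the thread. The def line for contig27255 in the sample is invented for illustration, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Captures the contig name and the isogroup from a def line such as
# ">contig13419 length=308 numreads=2 gene=isogroup00997 status=isotig"
HEADER = re.compile(r"^>(\S+).*\bgene=(\S+)")


def contig_to_isogroup(fasta_lines):
    """Build {contig: isogroup} from FASTA def lines carrying gene=..."""
    mapping = {}
    for line in fasta_lines:
        match = HEADER.match(line)
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


def reads_per_isogroup(status_lines, mapping):
    """Map each isogroup to the set of reads assigned to any of its contigs."""
    groups = defaultdict(set)
    for line in status_lines:
        fields = line.split()
        if len(fields) != 8:   # skip the header and reads that were not placed
            continue
        accession = fields[0]
        for contig in (fields[2], fields[5]):
            isogroup = mapping.get(contig)
            if isogroup:
                groups[isogroup].add(accession)   # a set: each read counts once
    return groups


headers = [
    ">contig13419 length=308 numreads=2 gene=isogroup00997 status=isotig",
    ">contig27255 length=600 numreads=9 gene=isogroup00997 status=isotig",  # invented
]
status = [
    "GCTUXKD02I3A06 Assembled contig13419 1 + contig27255 398 -",
    "GCTUXKD02IFIQQ PartiallyAssembled contig27255 518 - contig13419 269 +",
]

groups = reads_per_isogroup(status, contig_to_isogroup(headers))
for isogroup, reads in sorted(groups.items()):
    print(isogroup, len(reads))
```

Storing accessions in a set is what keeps a read that spans two contigs of the same isogroup from being counted twice.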

**Jeremy** · 09-28-2010, 11:50 PM

I just noticed something weird in the 454AllContigs.fna from a cDNA de novo assembly that is probably relevant to this topic. Many of the contigs with status=isotig have the sequence of the previous contig appended to them.

e.g.
>contig08566 length=114 numreads=3 gene=isogroup00004 status=isotig
length of sequence is 114 (first contig listed in this isogroup)
>contig09051 length=184 numreads=3 gene=isogroup00004 status=isotig
length of sequence is 298 (114+184) and the first 114 bp are identical to contig08566 (second contig listed in this isogroup)
>contig08352 length=25 numreads=3 gene=isogroup00004 status=isotig
length of sequence is 323 (298+25) and the first 298 bp are identical to contig09051 (third contig listed in this isogroup)
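
A quick way to see how widespread this is, sketched in Python: read the FASTA records, compare each sequence's actual length with the `length=` value in its def line, and flag records whose sequence starts with the whole of the previous record's sequence. The helper names and the toy sequences are made up; only the def-line format comes from the file.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_appended(records):
    """Return names of records whose sequence has the previous one as a prefix.

    Assumes every def line carries a length=N field.
    """
    flagged = []
    prev_seq = ""
    for header, seq in records:
        declared = int(re.search(r"length=(\d+)", header).group(1))
        if len(seq) != declared and prev_seq and seq.startswith(prev_seq):
            flagged.append(header.split()[0])
        prev_seq = seq
    return flagged


# Toy records mimicking the pattern described above (sequences are invented).
toy = [
    ">contigA length=4 numreads=3 gene=isogroup00004 status=isotig",
    "ACGT",
    ">contigB length=3 numreads=3 gene=isogroup00004 status=isotig",
    "ACGTGGA",      # 4 + 3 bases: the previous sequence plus its own
    ">contigC length=2 numreads=3 gene=isogroup00004 status=isotig",
    "ACGTGGATT",    # 7 + 2 bases
]

flagged = find_appended(read_fasta(toy))
print(flagged)
```

To run it on a real assembly, pass an open file handle for 454AllContigs.fna to `read_fasta` in place of the toy list.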

**flxlex** · 10-04-2010, 11:35 PM

@Jeremy: I can confirm the same thing for one of my assemblies.... OK, I just checked only one isogroup but it is really strange...

**Jeremy** · 10-04-2010, 11:50 PM

I spoke to one of Roche's bioinformatics consultants (South East Asia region) about the problem and she had noticed it also and has brought up the problem with the programming team. I guess they will fix the code error in the next release.

**martin2** · 11-09-2010, 08:19 AM

Originally posted by kmcarr

Counting reads in a contig is a very tricky proposition when you have used the cDNA assembly mode of the Roche gsAssembler; a single read may thread through multiple contigs. Do you count it multiple times, only once, fractionally depending on the proportion of the read in each contig? I think these are the issues Roche was trying to deal with which led to this clear as mud solution.

Use the -rip option when starting the assembly, or edit 454AssemblyProject.xml
and set <ripMode>true</ripMode>. Newbler should then include each read in the assembly only once.

To be honest, it does not work well for me (maybe Roche only promises that a read does not appear in two contigs, rather than in two isotigs; maybe that is the point). I still have 2 isotigs with the same reads included, so consed later on barfs that it cannot import the "contigs" (in its terminology) because of duplicated read names.

**kmcarr** · 11-09-2010, 08:51 AM

The -rip option does not apply to the cDNA assembly mode of gsAssembler. You will get reads split into multiple contigs as the assembler tries to resolve conflicting branches. Also, as you said, since the same contig may be part of more than one isotig, the same read, even if not split, may be a member of more than one isotig. The only unique assignment of reads is to isogroups.

Detection of alternative splicing events from 454 output

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News