**Kennels** · 07-16-2013, 07:55 PM

There are a few things you could do.

1. As described here: http://trinityrnaseq.sourceforge.net...stimation.html, you could filter out lowly expressed components, though that comes with the caveats described on that page.

2. You could reduce the redundancy of your data set with software like cd-hit-est, TGICL clustering, fastanrdb from the Exonerate package, or Minimus2, which will collapse/cluster similar sequences into a longest representative sequence. I typically run cd-hit-est with 100% identity first, and then use TGICL. This will generate a 'unigene' set based on the identity threshold, which should substantially reduce the number of contigs in your assembly. You can set the cd-hit-est threshold lower if you have a rough idea of the level of similarity between paralogous sequences.
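As a rough sketch of the first clustering pass in option 2, the snippet below only builds the cd-hit-est command line for a given identity threshold. The function name and the file names are made up for illustration; `-i`, `-o`, `-c` and `-T` are the cd-hit-est options for input, output, identity threshold and threads. The word size (`-n`) is left at its default here, so check the cd-hit user guide for the value that matches a lower threshold.

```python
from typing import List


def cd_hit_est_command(fasta_in: str, fasta_out: str,
                       identity: float = 1.0, threads: int = 1) -> List[str]:
    """Build (but do not run) a cd-hit-est command line.

    identity is the -c threshold as a fraction, e.g. 1.0 for 100% identity.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be a fraction in (0, 1]")
    return [
        "cd-hit-est",
        "-i", fasta_in,            # input assembly (hypothetical file name)
        "-o", fasta_out,           # representative sequences; a .clstr file is written alongside
        "-c", f"{identity:.2f}",   # sequence identity threshold
        "-T", str(threads),        # number of threads
    ]


cmd = cd_hit_est_command("Trinity.fasta", "Trinity.nr100.fasta")
print(" ".join(cmd))
# To actually run it (requires cd-hit-est on your PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list makes it easy to rerun with a lower `identity` once you have an idea of how similar your paralogs are.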

You can then map your original reads back to this 'consensus' assembly to identify and visualize the polymorphic sites and/or call SNPs (reads with a couple of mismatches to the consensus assembly will show up as SNPs).

Note though that by doing method 2, you might be clustering wrong/chimeric transcripts and in essence enriching for these wrong sequences, though in my experience they are not a high percentage and you can validate them anyway with blastx for your genes of interest (this won't apply to ncRNA though).

You should also filter out transcripts shorter than a certain length (use a cutoff of at least 200 nt), though I think Trinity already does this by default.
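If you want to apply the length cutoff yourself, something along these lines would do it. This is just a minimal sketch with a hand-rolled FASTA reader; the function names and the example records are made up, and 200 nt is simply the cutoff mentioned above.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Tuple[str, str]],
                     min_len: int = 200) -> List[Tuple[str, str]]:
    """Keep only records whose sequence is at least min_len nt long."""
    return [(h, s) for h, s in records if len(s) >= min_len]


# Made-up example: one short contig and one 250 nt contig split over two lines.
EXAMPLE_FASTA = [
    ">comp1_c0_seq1",
    "ACGTACGTAC",
    ">comp2_c0_seq1",
    "A" * 125,
    "C" * 125,
]

kept = filter_by_length(read_fasta(EXAMPLE_FASTA), min_len=200)
print([header for header, _ in kept])
```

For a real assembly you would pass an open file handle instead of the example list, e.g. `read_fasta(open("Trinity.fasta"))` (hypothetical file name).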

**atcghelix** · 07-16-2013, 09:47 PM

Thanks very much--I really appreciate your help. I think I almost understand.

If I collapse similar contigs into representatives using something like cd-hit-est, do I lose information about whether paralogs exist for a particular gene/transcript? Or is this something I could determine during the read-mapping step? My end goal is to select a few thousand single-copy genes/exons to sequence in the future for a few hundred individuals.

**Kennels** · 07-16-2013, 10:08 PM

No, you won't lose this information. You can't tell directly from the fasta output how many sequences were clustered into each representative, but cd-hit-est writes a second file (the .clstr file alongside the output fasta) listing all the transcripts (component headers in your case) that were put in each cluster. From this file, you can identify transcripts that had no other similar sequences (i.e. 'single' copy within the definition of your clustering). Be aware though that a 'single copy' transcript from your RNAseq assembly isn't necessarily truly single copy - other paralogs may simply not have been expressed enough or couldn't be assembled properly.
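To pull the 'single' copy candidates out of the cluster file, a small parser like the one below should work. It's a sketch that assumes the usual .clstr layout (a `>Cluster N` line followed by one line per member, with the sequence name between `>` and `...`); the function names and example lines are made up. Note that cd-hit truncates long names in the .clstr file by default, so the ids may be shortened versions of your fasta headers.

```python
import re
from typing import Dict, Iterable, List

# Member lines look like: "0\t2799nt, >comp1_c0_seq1... *"
MEMBER_RE = re.compile(r">(\S+?)\.\.\.")


def parse_clstr(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each cluster name to the list of sequence ids it contains."""
    clusters: Dict[str, List[str]] = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = []
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def singleton_ids(clusters: Dict[str, List[str]]) -> List[str]:
    """Return ids of sequences that are the only member of their cluster."""
    return [members[0] for members in clusters.values() if len(members) == 1]


# Made-up example in .clstr layout: cluster 0 has two members, cluster 1 has one.
EXAMPLE_CLSTR = [
    ">Cluster 0",
    "0\t2799nt, >comp1_c0_seq1... *",
    "1\t2214nt, >comp1_c0_seq2... at +/99.50%",
    ">Cluster 1",
    "0\t1500nt, >comp3_c0_seq1... *",
]

clusters = parse_clstr(EXAMPLE_CLSTR)
print(singleton_ids(clusters))
```

The singletons are only 'single' copy at the identity threshold you clustered with, so the caveat about unexpressed or unassembled paralogs still applies.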

You won't (easily) be able to tell how many paralogs exist from read mapping, only whether they are likely to exist, based on the number of polymorphisms you see. However, you can probably infer that a sequence is 'single' copy from the absence of any polymorphisms, but again you need to keep in mind the expression levels of paralogs in your RNAseq data.
hth

**atcghelix** · 07-16-2013, 10:13 PM

Yes, very helpful--thanks again!

Trinity paralog filtering