
**peromhc** · 06-18-2013, 06:48 PM

Try filtering out contigs that have an FPKM of less than 1, or 0.5. This should get rid of a large number of contigs that are likely junk. There are tools in Trinity (RSEM or eXpress) to do this.

Also, you could try clustering with cd-hit-est to get rid of redundancy.
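The FPKM filter suggested above can be sketched in a few lines of Python. This is a minimal illustration, assuming a tab-delimited abundance table with `transcript_id` and `FPKM` columns (the layout of RSEM's `isoforms.results`); the function name, the example contig IDs, and the values are made up for the example.

```python
import csv
import io


def contigs_above_fpkm(results_file, min_fpkm=1.0):
    """Return the IDs of contigs whose FPKM is at least min_fpkm.

    results_file is an open, tab-delimited text file with a header row
    containing 'transcript_id' and 'FPKM' columns (assumed layout).
    """
    reader = csv.DictReader(results_file, delimiter="\t")
    return [row["transcript_id"] for row in reader
            if float(row["FPKM"]) >= min_fpkm]


# Tiny made-up table standing in for a real abundance file.
example = io.StringIO(
    "transcript_id\tFPKM\n"
    "contig_a\t12.40\n"
    "contig_b\t0.30\n"
    "contig_c\t1.00\n"
)
kept = contigs_above_fpkm(example, min_fpkm=1.0)
print(kept)
```

The returned IDs could then be used to subset the assembly FASTA before clustering.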

**shi** · 06-18-2013, 08:23 PM

Dear Dario1984,

You may try the Subread aligner, which can deal with a large number of contigs.

The Subread package

http://subread.sourceforge.net/

Best wishes,

Wei

**Dario1984** · 06-18-2013, 09:00 PM

Thanks for alerting me to the CD-HIT program. I wasn't aware of it. Have you already published a journal article using those two steps?

**alexdobin** · 06-19-2013, 01:09 PM

Originally posted by Dario1984

What advice do researchers who have previously done RNA-seq on a non-model organism have? I have RNA-seq data on sea urchin. The current version of the genome has 174772 contigs. I have so far tried generating a genome index with STAR. It used up all of the RAM, and the author said the mapping performance wasn't good on any genomes with more than 50000 contigs. I have also tried de-novo assembly with Trinity, and the number of genes and isoforms found was unrealistically large. Does anyone have a success story to share?

To avoid RAM problems with STAR when the genome has a large number of contigs, try reducing --genomeChrBinNbits (=18 by default) to a smaller number, ~14 or less. The mapping speed will be slow by STAR's standards, but it may still be adequate.
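For choosing a concrete value, the STAR manual suggests scaling this parameter as min(18, log2(max(GenomeLength/NumberOfReferences, ReadLength))). A rough sketch of that arithmetic follows. The contig count and the 50-base read length come from this thread; the genome length is a hypothetical placeholder, not a figure given here, and rounding down to an integer is a choice made for this sketch.

```python
import math


def suggested_chr_bin_nbits(genome_length, n_contigs, read_length, cap=18):
    """Rule of thumb: min(cap, log2(max(genome_length / n_contigs, read_length))).

    The result is rounded down to an integer (a choice made for this sketch).
    """
    mean_contig_length = genome_length / n_contigs
    return min(cap, int(math.log2(max(mean_contig_length, read_length))))


# HYPOTHETICAL genome length; substitute the real assembly size.
genome_length = 800_000_000
n_contigs = 174_772   # contig count quoted in the thread
read_length = 50      # read length mentioned later in the thread

print(suggested_chr_bin_nbits(genome_length, n_contigs, read_length))
```

The value printed would then be passed to `--genomeChrBinNbits` when generating the genome index.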

**Kennels** · 06-19-2013, 07:33 PM

Originally posted by Dario1984

Thanks for alerting me to the CD-HIT program. I wasn't aware of it. Have you already published a journal article using those two steps?

This paper should be a good reference:
https://www.biomedcentral.com/1471-2164/13/392

**Dario1984** · 06-20-2013, 10:00 PM

I used Subread on the data. Because the seed has to be matched exactly, it isn't suitable for mapping to a related organism's genome. Only 11% of my reads mapped. I can see it would be great for mapping to a high-quality reference genome, such as the human genome sequence.

**Jeremy** · 06-20-2013, 10:17 PM

Most of the sequences in the Trinity assembly will be background RNA (remember, something like 80% of the genome is transcribed) and assembly junk. As mentioned already, mapping the reads back to the Trinity assembly and excluding low-count sequences will remove this junk. I prefer to use raw read counts, because then you can easily see what proportion of reads map to the 20-40K Trinity sequences you are left with. I have done something like that: from 370,000 Trinity sequences, 96% of the reads mapped to about 38,000 sequences, and the rest were discarded.
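One possible way to carry out a raw-count filter of this kind is sketched below. The post does not describe the exact procedure used, so this is only an assumption: rank the sequences by mapped read count and keep the top ones until they account for a chosen share of all mapped reads. The function name and the toy counts are made up.

```python
def contigs_covering_fraction(read_counts, fraction=0.96):
    """Keep the highest-count contigs until they hold `fraction` of mapped reads.

    read_counts maps contig ID -> number of reads mapped to that contig.
    Returns (kept contig IDs, share of reads they actually account for).
    """
    total = sum(read_counts.values())
    if total == 0:
        return [], 0.0
    kept, running = [], 0
    ranked = sorted(read_counts.items(), key=lambda item: item[1], reverse=True)
    for contig, count in ranked:
        if running >= fraction * total:
            break
        kept.append(contig)
        running += count
    return kept, running / total


# Toy counts for illustration only.
counts = {"seq_a": 900, "seq_b": 70, "seq_c": 20, "seq_d": 10}
kept, share = contigs_covering_fraction(counts, fraction=0.96)
print(kept, share)
```

Reporting the achieved share alongside the kept IDs makes it easy to see how many reads are lost with the discarded sequences.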

**shi** · 06-21-2013, 03:32 AM

Originally posted by Dario1984

I used Subread on the data. Because the seed has to be matched exactly, it isn't suitable for mapping to a related organism's genome. Only 11% of my reads mapped. I can see it would be great for mapping to a high-quality reference genome, such as the human genome sequence.

Hi Dario,

Could you please provide a bit more information about your data, such as the read length and whether it is single-end or paired-end? Many things can contribute to low mappability. Although Subread does not allow mismatches in the seeds, the seeds are quite short (16 bp), so I do not really think this is why you got a low mapping percentage when mapping your reads to a related species.

One thing that may be worth trying is to set -m=1 to test how many reads have a 16 bp substring that perfectly matches the reference. If you still get a low percentage, this may simply tell you that your reads are very different from the reference.

Best regards,

Wei
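The idea behind that check (does a read share at least one exact 16 bp substring with the reference?) can be illustrated independently of any aligner. The sketch below is not Subread's algorithm, only a brute-force toy that holds every reference k-mer in memory, so it suits small test sequences rather than a whole genome. All names and sequences are made up, and the example uses k=4 to keep the toy data short.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def kmers(seq, k=16):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_with_exact_seed(reads, reference_seqs, k=16):
    """Fraction of reads sharing at least one exact k-mer with the reference.

    Both strands of every reference sequence are considered.
    """
    if not reads:
        return 0.0
    ref_kmers = set()
    for ref in reference_seqs:
        ref_kmers |= kmers(ref, k)
        ref_kmers |= kmers(reverse_complement(ref), k)
    hits = sum(1 for read in reads if kmers(read, k) & ref_kmers)
    return hits / len(reads)


# Toy data: one forward-strand hit, one reverse-strand hit, one miss.
reference = ["AACAGTCCGA"]
reads = ["CAGTC", "GGACT", "TTTTT"]
print(fraction_with_exact_seed(reads, reference, k=4))
```

A low fraction from a check like this would point to sequence divergence rather than to the aligner's settings.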

**Wallysb01** · 06-21-2013, 09:18 AM

What happens if you only take the 50000 biggest contigs from your reference? A lot of the time these draft assemblies have many small contigs that aren't going to contain useful information for gene expression analysis anyway. That is, they mostly won't contain coding regions, or if they do, it's only one, maybe two exons, and you can't assign orthology anyway.
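Subsetting a reference to its largest contigs needs nothing beyond the standard library. This is a minimal sketch, assuming a plain FASTA file small enough to hold in memory; the function names and the toy records are made up.

```python
import heapq
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def largest_contigs(handle, n=50000):
    """Return the n longest (header, sequence) records, longest first."""
    return heapq.nlargest(n, read_fasta(handle), key=lambda rec: len(rec[1]))


# Toy FASTA for illustration only.
fasta = io.StringIO(">c1\nACGT\nAC\n>c2\nA\n>c3\nACGTACGTAC\n")
for header, seq in largest_contigs(fasta, n=2):
    print(f">{header}\n{seq}")
```

With a real file, the handle would come from `open()` on the reference FASTA, and the printed records would be written to a new file before building the index.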

**Dario1984** · 06-26-2013, 05:00 PM

I think the related genome is too distant. I took 100 random reads and used BLAST to get an impression of what the mapping would be like. Two representative examples, each showing one read of a 50-base read pair, are:

Code:

>Scaffold915 
          Length = 323013

 Score = 42.1 bits (21), Expect = 0.006
 Identities = 39/45 (86%)
 Strand = Plus / Plus

                                                           
Query: 6      ttccagacaaaacagacaacaaatcataatcataaatatcatttg 50
              |||| ||||||| ||||||||  || |||| ||||||||||||||
Sbjct: 261960 ttcctgacaaaatagacaacatttcttaattataaatatcatttg 262004

and

Code:

>Scaffold476 
          Length = 632255

 Score = 40.1 bits (20), Expect = 0.025
 Identities = 20/20 (100%)
 Strand = Plus / Minus

                                  
Query: 8      caagaatttttttgatgaaa 27
              ||||||||||||||||||||
Sbjct: 568677 caagaatttttttgatgaaa 568658

I will proceed by implementing the filtering strategies for de-novo assembly.

RNA-seq Mapping to Many Contigs
