What advice do researchers who have previously done RNA-seq on a non-model organism have? I have RNA-seq data on sea urchin. The current version of the genome has 174772 contigs. So far I have tried generating a genome index with STAR. It used up all of the RAM, and the author said the mapping performance wasn't good on any genome with more than 50000 contigs. I have also tried de novo assembly with Trinity, and the number of genes and isoforms found was unrealistically large. Does anyone have a success story to share?
-
Try filtering out contigs that have an FPKM of less than 1, or 0.5. This should get rid of a large number of contigs that are likely junk. There are tools bundled with Trinity (RSEM or eXpress) to do this.
Also, you could try clustering with cd-hit-est to get rid of redundancy.
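As a rough illustration of the expression filter suggested above, here is a minimal Python sketch. It assumes an RSEM-style tab-separated table with `transcript_id` and `FPKM` columns and a FASTA file of Trinity contigs. The function names, the column defaults, and the toy data are made up for the example, so adjust them to whatever your quantification tool actually writes.

```python
import csv
import io


def ids_above_fpkm(table_text, min_fpkm=1.0, id_col="transcript_id", fpkm_col="FPKM"):
    """Return the set of contig IDs whose FPKM is at least min_fpkm.

    table_text is the content of a tab-separated table with a header row.
    The column names are assumptions; change them to match your file.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return {row[id_col] for row in reader if float(row[fpkm_col]) >= min_fpkm}


def filter_fasta(fasta_text, keep_ids):
    """Keep only FASTA records whose ID (first word of the header) is in keep_ids."""
    out = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            out.append(line)
    return "\n".join(out) + ("\n" if out else "")


if __name__ == "__main__":
    # Toy data, not real results.
    table = "transcript_id\tFPKM\nc1\t12.4\nc2\t0.3\nc3\t1.0\n"
    fasta = ">c1 len=8\nACGTACGT\n>c2 len=4\nGGCC\n>c3 len=6\nTTGACA\n"
    print(filter_fasta(fasta, ids_above_fpkm(table, min_fpkm=1.0)), end="")
```

The filtered FASTA could then be passed to cd-hit-est for the redundancy step; that part is not shown here.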
-
Dear Dario1984,
You may try the Subread aligner, which can deal with a large number of contigs.
Best wishes,
Wei
-
Originally posted by Dario1984: Thanks for alerting me to the CD-HIT program. I wasn't aware of it. Have you published a journal article using those two steps already?
https://www.biomedcentral.com/1471-2164/13/392
-
I used Subread on the data. Because the seed has to be matched exactly, it isn't suitable for mapping to a related organism's genome. 11% of my reads mapped. I can see it would be great for mapping to a high-quality reference genome, such as the human genome sequence.
-
Most of the sequences in the Trinity assembly will be background RNA (remember, something like 80% of the genome is transcribed) and assembly junk. As already mentioned, mapping the reads back to the Trinity assembly and excluding low-count sequences will remove this junk. I prefer to use raw read counts, because then you can easily see what portion of the reads map to the 20-40K Trinity sequences you are left with. I have done something like that: from 370,000 Trinity sequences, 96% of the reads mapped to about 38,000 Trinity sequences, and the rest were discarded.
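To make the raw-count idea concrete, here is a small Python sketch. It assumes you already have a per-sequence count of mapped reads (for example, parsed from your aligner's output into a dict). The function name, the `min_reads` cutoff, and the toy counts are hypothetical, chosen only to show how the retained fraction of reads falls out of the filter.

```python
def filter_by_read_count(read_counts, min_reads=10):
    """Drop sequences with fewer than min_reads mapped reads.

    read_counts: dict mapping sequence ID -> raw mapped read count.
    Returns (kept_counts, fraction_of_mapped_reads_retained).
    """
    kept = {seq_id: n for seq_id, n in read_counts.items() if n >= min_reads}
    total = sum(read_counts.values())
    fraction = sum(kept.values()) / total if total else 0.0
    return kept, fraction


if __name__ == "__main__":
    # Toy counts, not real data.
    counts = {"seq_a": 900, "seq_b": 60, "seq_c": 30, "seq_d": 10}
    kept, fraction = filter_by_read_count(counts, min_reads=50)
    print(f"{len(kept)} of {len(counts)} sequences kept, "
          f"holding {fraction:.1%} of mapped reads")
```

Trying a few cutoffs and watching how the retained fraction changes is one way to choose a threshold without relying on a normalized unit.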
-
Originally posted by Dario1984: I used Subread on the data. Because the seed has to be matched exactly, it isn't suitable for mapping to a related organism's genome. 11% of my reads mapped. I can see it would be great for mapping to a high-quality reference genome, such as the human genome sequence.
Could you please provide a bit more information about your data, such as the read length and whether it is single-end or paired-end? Many things can contribute to low mappability. Although Subread does not allow mismatches in the seeds, the seeds are quite short (16 bp), so I do not think this is why you got a low mapping percentage when mapping your reads to a related species.
One thing worth trying is to set -m=1 to test how many reads have a 16 bp substring that perfectly matches the reference. If you still get a low percentage, this may simply tell you that your reads are very different from the reference.
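The question "how many reads share at least one exact 16 bp substring with the reference" can also be sketched independently of any aligner. The Python below is only an illustration of that check, not a description of how Subread works internally. It holds every reference k-mer in memory, so it suits small test sets rather than a whole genome, and the function names and toy sequences are invented for the example.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_with_exact_seed(reads, reference_seqs, k=16):
    """Fraction of reads sharing at least one exact k-mer with the reference.

    Both strands of each reference sequence are indexed.
    Reads shorter than k can never match and count as misses.
    """
    index = set()
    for ref in reference_seqs:
        index |= kmers(ref, k)
        index |= kmers(revcomp(ref), k)
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if not kmers(read, k).isdisjoint(index))
    return hits / len(reads)


if __name__ == "__main__":
    # Toy sequences with k=4 so the example stays readable.
    reference = ["AAAACCCCGG"]
    reads = ["AACCCC", "CCGGGG", "GAGAGA", "ACA"]
    print(fraction_with_exact_seed(reads, reference, k=4))
```

A low fraction from a check like this would point to sequence divergence rather than to aligner settings.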
Best regards,
Wei
-
What happens if you take only the 50000 biggest contigs from your reference? Draft assemblies often have many small contigs that aren't going to contain useful information for gene expression analysis anyway. They will mostly not contain coding regions, or if they do, it's only one or maybe two exons, and you can't assign orthology anyway.
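Selecting the largest contigs is straightforward to script. The following Python sketch assumes the reference is a plain FASTA file small enough to read as text. The helper names and the toy records are hypothetical, and the default of 50000 simply mirrors the number discussed in this thread.

```python
import heapq


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def largest_contigs(fasta_text, n=50000):
    """Return the n longest (header, sequence) records, longest first."""
    return heapq.nlargest(n, parse_fasta(fasta_text), key=lambda rec: len(rec[1]))


if __name__ == "__main__":
    # Toy records, not real contigs.
    fasta = ">a\nAC\n>b\nACGTAC\nGT\n>c\nACGT\n"
    for header, seq in largest_contigs(fasta, n=2):
        print(f">{header}\n{seq}")
```

Before indexing the reduced reference, it may be worth checking how much of the total assembly length the kept contigs cover, since that indicates how much sequence the cut throws away.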
-
I think the related genome is too distant. I took 100 random reads and used BLAST to get an impression of what the mapping would be like. Two representative examples, each from one read of a 50-base read pair, are:
Code:
>Scaffold915
          Length = 323013

 Score = 42.1 bits (21), Expect = 0.006
 Identities = 39/45 (86%)
 Strand = Plus / Plus

Query: 6      ttccagacaaaacagacaacaaatcataatcataaatatcatttg 50
              |||| ||||||| ||||||||  || |||| ||||||||||||||
Sbjct: 261960 ttcctgacaaaatagacaacatttcttaattataaatatcatttg 262004
Code:
>Scaffold476
          Length = 632255

 Score = 40.1 bits (20), Expect = 0.025
 Identities = 20/20 (100%)
 Strand = Plus / Minus

Query: 8      caagaatttttttgatgaaa 27
              ||||||||||||||||||||
Sbjct: 568677 caagaatttttttgatgaaa 568658