  • NGSfan
    replied
    Originally posted by klh View Post
    Hi Jean-marc,

    I'm not quite sure I understand how you obtain your estimate of 25%. A 1500nt transcript with 10 x 150nt exons contains 1476 25mers, 216 of which (9 * 24) cross splice boundaries. I make this ~15% of reads crossing splice junctions in this example.

    But your point about the dependence on transcript length, number of exons per transcript and read length is well made. Performing a similar calculation on the 1,372 protein-coding transcripts on human chr22 (Ensembl-53) gives:

    Read-length 25 => ~9% cross splice junctions
    36 => ~13%
    50 => ~18%

    However, this calculation assumes (a) uniform coverage across all transcripts, and (b) uniform expression of all transcripts. Both of these are clearly gross simplifications! A figure of 3% might sound low, but if the sample contains a number of highly-expressed transcripts with long/few exons, it becomes more reasonable.

    Kevin
    Hi Kevin,

    Thanks for the nice explanation. I think your points on the assumptions make it much easier to understand why one would observe 3% instead of the expected higher frequency.

  • klh
    replied
    Hi Jean-marc,

    Originally posted by jmaury View Post
    Hello,

    For example, say you sequence a transcript of 1500nt and obtain 555 reads of 25nt (so an average coverage of 9.25X). If your mRNA contains 10 exons, around 25% of the reads will fall on splice junctions (so you'll only map 75% of the reads, without considering sequencing errors and artefacts).

    Jean-marc
    I'm not quite sure I understand how you obtain your estimate of 25%. A 1500nt transcript with 10 x 150nt exons contains 1476 25mers, 216 of which (9 * 24) cross splice boundaries. I make this ~15% of reads crossing splice junctions in this example.

    But your point about the dependence on transcript length, number of exons per transcript and read length is well made. Performing a similar calculation on the 1,372 protein-coding transcripts on human chr22 (Ensembl-53) gives:

    Read-length 25 => ~9% cross splice junctions
    36 => ~13%
    50 => ~18%

    However, this calculation assumes (a) uniform coverage across all transcripts, and (b) uniform expression of all transcripts. Both of these are clearly gross simplifications! A figure of 3% might sound low, but if the sample contains a number of highly-expressed transcripts with long/few exons, it becomes more reasonable.
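
    A quick way to sanity-check numbers like these is a short script. Below is a minimal sketch (Python; the toy exon structure is the 10 x 150nt example above, not the real chr22 annotation, and the function name is just illustrative):

    ```python
    # Minimal sketch: fraction of read placements that cross an exon-exon
    # junction in a spliced transcript, assuming uniform coverage and exons
    # at least as long as the read (so no read spans two junctions).

    def junction_fraction(exon_lengths, read_len):
        tx_len = sum(exon_lengths)
        n_placements = tx_len - read_len + 1   # all possible read start positions
        n_junctions = len(exon_lengths) - 1
        # a read crosses a given junction iff it starts at one of the
        # (read_len - 1) positions immediately upstream of that junction
        crossing = n_junctions * (read_len - 1)
        return crossing / n_placements

    # the 1500nt transcript with 10 x 150nt exons: 216/1476, i.e. ~15%
    print(junction_fraction([150] * 10, 25))

    # read-length trend for the same toy transcript (the chr22 figures above
    # come from the real annotation, so the absolute numbers differ)
    for rl in (36, 50):
        print(rl, round(junction_fraction([150] * 10, rl), 3))
    ```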

    Kevin

  • NGSfan
    replied
    Originally posted by Michael.James.Clark View Post
    In some cases that's true. For mouse? Not as big a problem.

    My advice is to generate a reference genome with all possible splice variants and align to it.

    Otherwise you will see a massive drop in coverage at all exon-exon junctions as I said earlier, which will result in a large number of "unmappable reads" when you're only robust against two mismatches.

    We've done it this way in our lab with human and it has worked.

    Wow, this is interesting. What percentage of reads did you recover when you did this? How long are your reads?

    One thing is for sure: the longer the reads get (50, 75, 100, etc.), the greater the chance they will cross a splice junction. So the recovery effect will be stronger for longer reads.

    When I use the Mortazavi dataset (25bp) and compare the genome reference vs UCSC Known Genes transcripts (no introns), I see a very small increase in reads recovered, ~2%, much like their paper states. Of course, when switching over to a transcript reference, I am also losing reads that fall in places not covered by the set of annotated transcripts.

    Using a reference of transcripts with all splice variants instead of the whole genome has its caveats as well - you will miss novel junctions that are not yet documented. But of course, this will be a very small number of reads.

    You could run a newer program called TopHat that will handle splice junctions, but it only captures ~80% of them.

    The mystery to me is that jmaury's argument makes sense - one would expect more splice junction reads, so this is quite odd - isn't it?
    Last edited by NGSfan; 04-24-2009, 02:14 AM.

  • NGSfan
    replied
    Hi guys, the paper is:

    "Mapping and quantifying mammalian transcriptomes by RNA-Seq"

    (Mortazavi et al., Nature Methods 5:621-628, 2008)

    "Splice-crossing reads, such as are shown for Myf6 (Fig. 1b), were identified by mapping otherwise unassigned sequence reads to a library of all known splice events in all University of California Santa Cruz genome database (UCSC) Mouse July 2007 (mm9) gene model splices. When we summed over the entire dataset, including all otherwise unmappable reads, splice-spanning reads comprised approx3% (Supplementary Table 1), which is consistent with splice frequency in gene models across the genome."

  • jmaury
    replied
    Hello,

    Originally posted by NGSfan View Post
    According to most estimates (Mortazavi, for example) only 3% of all reads (mapped and unmapped) fall on splice junctions - so this is quite small really.
    I would like to say: "Don't underestimate splice junctions!"

    Like Michael.James.Clark, I'm very surprised by the figure of 3% (could you give us a link to this article)!

    For example, say you sequence a transcript of 1500nt and obtain 555 reads of 25nt (so an average coverage of 9.25X). If your mRNA contains 10 exons, around 25% of the reads will fall on splice junctions (so you'll only map 75% of the reads, without considering sequencing errors and artefacts).
    To obtain 3% with 25nt reads, your initial transcript would have to contain only two exons, so the number of unmapped reads is highly correlated with the number of exons per transcript and the read length!

    Cheers,

    Jean-marc

  • Michael.James.Clark
    replied
    Originally posted by francesco.vezzi View Post
    We have performed an experiment like that, but the problem is that the exon/intron junctions are often not perfectly known. If you take short reads from the transcriptome and align them against the reference sequence allowing gaps (which lets you place, for example, half of a read in one exon and the other half in the following exon), you will probably discover that some exon junctions are not totally correct, and maybe you can determine new exons.

    So the experiment you are proposing is perfect in theory, but in practice I'm not sure it will work.

    Francesco
    In some cases that's true. For mouse? Not as big a problem.

    My advice is to generate a reference genome with all possible splice variants and align to it.

    Otherwise you will see a massive drop in coverage at all exon-exon junctions as I said earlier, which will result in a large number of "unmappable reads" when you're only robust against two mismatches.

    We've done it this way in our lab with human and it has worked.
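
    For concreteness, here is a minimal sketch of what building such a junction reference can look like (Python; the data layout, coordinates and function name are illustrative assumptions, not our lab's actual pipeline). The idea is to paste together read_len - 1 bases from each side of every annotated junction, so that any junction-crossing read fits entirely within one junction sequence:

    ```python
    # Minimal sketch (illustrative, not an actual pipeline): build a
    # splice-junction reference by pasting the last (read_len - 1) bases of
    # each exon to the first (read_len - 1) bases of the next.

    def junction_reference(genome, transcripts, read_len):
        """genome: dict chrom -> sequence; transcripts: dict tx_id ->
        (chrom, [(exon_start, exon_end), ...]), 0-based end-exclusive."""
        flank = read_len - 1
        junctions = {}
        for tx_id, (chrom, exons) in transcripts.items():
            seq = genome[chrom]
            for i in range(len(exons) - 1):
                ls, le = exons[i]          # left exon
                rs, re = exons[i + 1]      # right exon
                left = seq[max(ls, le - flank):le]
                right = seq[rs:min(re, rs + flank)]
                junctions[f"{tx_id}_junc{i}"] = left + right
        return junctions

    # toy usage: one two-exon transcript on a made-up chromosome
    genome = {"chr1": "ACGT" * 250}
    transcripts = {"txA": ("chr1", [(0, 150), (400, 550)])}
    print(junction_reference(genome, transcripts, 25)["txA_junc0"])  # 48nt
    ```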

  • francesco.vezzi
    replied
    Originally posted by Michael.James.Clark View Post
    Generate an adequate transcriptome reference and align to that.
    We have performed an experiment like that, but the problem is that the exon/intron junctions are often not perfectly known. If you take short reads from the transcriptome and align them against the reference sequence allowing gaps (which lets you place, for example, half of a read in one exon and the other half in the following exon), you will probably discover that some exon junctions are not totally correct, and maybe you can determine new exons.

    So the experiment you are proposing is perfect in theory, but in practice I'm not sure it will work.

    Francesco

  • Michael.James.Clark
    replied
    Originally posted by NGSfan View Post
    Yes, I'm using the entire genome, so I would assume rRNA should be mapping as well, no?
    Yes, it should. Of course, you can check.

    Originally posted by NGSfan View Post
    Yes, 50-60% mapped is really odd (talking about simply mapping reads to anything, 2 mismatches max, not those uniquely mapping, which is only 44%), especially considering that the whole genome is used as a reference, so things like poly-A, repeat regions, and rRNA should still be mapping.
    Have you considered using something other than whole genome as a reference? Since you're using mouse, it should be easy to do.

    Generate an adequate transcriptome reference and align to that. In our lab, we've used both Refseq and UCSC known genes as a reference successfully, although I feel we will do a better job the more permissive we become.

    I've definitely seen in my own data even coverage across exon-exon junctions, and used exon coverage to identify splice variants.

    For discovering novel transcripts, whole genome is fine. But for looking at expression, splice variants, et cetera, I think we're better off using a transcriptome reference.

    Originally posted by NGSfan View Post
    According to most estimates (Mortazavi, for example) only 3% of all reads (mapped and unmapped) fall on splice junctions - so this is quite small really.
    Can you reference the source for some of these estimates so we can evaluate that claim? I'd just like to take a look at the papers myself. Thanks.

    Also, what alignment algorithm are you using? You say you are only robust against two mismatches. That's pretty low. Consider that if you have a SNP and a sequencing error, you've filled your quota. It also means that, since you're using the whole genome with 25-base reads, as soon as a read starts within ~23 bases of an exon junction it won't align. Considering the size of a lot of exons, you'll just completely miss quite a few of them.

    Again, I think aligning to a better reference will help you out on that front.
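
    To put rough numbers on that two-mismatch arithmetic, here is a minimal sketch (Python; it assumes, as a simplification, that every base a read hangs over the junction into the intron counts as a mismatch):

    ```python
    # Minimal sketch: read start positions near a junction that become
    # unmappable against the genomic sequence, assuming each base hanging
    # past the junction (into the intron) counts as a mismatch.

    def lost_positions_per_junction(read_len, max_mismatches):
        # overhangs of 1..max_mismatches bases are absorbed by the mismatch
        # quota; overhangs of (max_mismatches + 1)..(read_len - 1) bases fail
        return (read_len - 1) - max_mismatches

    # 25-base reads, 2 mismatches: 22 of the 24 junction-crossing start
    # positions per junction cannot align
    print(lost_positions_per_junction(25, 2))
    ```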
    Last edited by Michael.James.Clark; 04-23-2009, 09:21 AM.

  • NGSfan
    replied
    Originally posted by Michael.James.Clark View Post
    50-60% being "unmappable" sounds strangely high to me.

    In my experience, the major issue with RNA-seq is ribosomal RNA contamination. You must do something like a poly-A pull down for these experiments, or the majority of your data will be ribosomal RNA.

    For example, in an experiment I ran over a year ago (before RNA-seq was as well established as it is now), I used oligo(dT) cDNA synthesis thinking that would enrich for mRNA sequences enough to keep the rRNA sequences at a low level. Turns out that wasn't the case, and about 70% of my data from that lane of Solexa data aligned to ribosomal RNA sequence.
    Originally posted by Michael.James.Clark View Post
    Do you know if your reference contains rRNA?
    Yes, I'm using the entire genome, so I would assume rRNA should be mapping as well, no?

    Yes, 50-60% mapped is really odd (talking about simply mapping reads to anything, 2 mismatches max, not those uniquely mapping, which is only 44%), especially considering that the whole genome is used as a reference, so things like poly-A, repeat regions, and rRNA should still be mapping.

  • francesco.vezzi
    replied
    Originally posted by NGSfan View Post
    It would be nice to know and figure out how to recover these reads, and take more advantage of the data.
    It looks like a philosophical problem... what to do with something that at first sight has no importance?

    Where did you get the data that originated this discussion? I want to align it with our tool and see what happens...

  • NGSfan
    replied
    Originally posted by Melissa View Post
    Can you define these unmappable reads more precisely? Are you referring to the raw data or to high-quality reads after filtering? If you are referring to filtered reads, then sequencing errors cannot contribute to this problem. I believe these reads result from technical/experimental problems rather than the nature of the sequences - for example, low-quality reads due to the platform's temperature problems, and artifacts created at the edges of the flow cell.
    I am referring to the primary sequence data - all the raw reads generated from an RNA-seq experiment.

    What you mention about low-temperature problems and flow cell edges is interesting - I did not know about those issues. It would be interesting to know what fraction of reads is unmappable for those reasons.

    Originally posted by Melissa View Post
    Most RNA-seq data usually contain 30-40% rRNA. After filtering out low-quality reads, contaminating reads and poly-A tails, 50-60% of reads sounds REALLY good to begin with. So, the answer is NO to whether it's a waste to get only 50-60% of reads. High redundancy is another reason why some reads are not useful at all.
    But if you are aligning to the entire genome, the rRNA reads should align as well, no?

    Originally posted by Melissa View Post
    I'm not sure why some reads cannot map to the reference genome. Well, they just don't. Maybe some reads that overlap/span exon splice junctions are lost after mapping. RNA processing and other regulatory mechanisms sound like a good explanation.
    According to most estimates (Mortazavi, for example) only 3% of all reads (mapped and unmapped) fall on splice junctions - so this is quite small really.

  • NGSfan
    replied
    Originally posted by francesco.vezzi View Post
    Well, this depends a lot on what reference you are aligning against. We have an Illumina Genome Analyzer, and two years ago we sequenced the genome of the grapevine. When we sequence the reference plant that we used to obtain the assembly, we align more than 74% of the reads with parameters quite similar to yours. If we sequence another variety of grapevine, we are able to align only 60% of all the reads. This is not strange, because the two organisms are different.
    Yes, it definitely seems that one can get higher rates of mapped reads when one aligns reads generated from genomic DNA.

    But for RNA-seq, looking at the literature (mouse RNA-seq studies: Mortazavi, Pan, Sultan, etc.), the trend is that 50-60% of all the generated reads can be mapped when using the entire genome as the reference sequence.

    Originally posted by francesco.vezzi View Post
    You are using 25-base reads, which means they are really old (Illumina can now produce 75bp paired-end reads) and were probably analysed with the old pipeline. You can find one interesting dataset by having a look at http://tinyurl.com/68aeq3 and at the article "De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads".
    For sure longer reads will help, especially for genome assembly and re-sequencing. But the problem for RNA-seq appears to be different.

    Originally posted by francesco.vezzi View Post
    As for the non-trivial information in unaligned reads, there are a lot of possibilities, like repeated regions with a lot of errors (compared to the reference sequence) or totally new inserts.
    Actually, the 50-60% mapped reads I mentioned include reads that map to multiple locations on the genome (i.e. repeat regions, paralogs, etc.). If you count just the reads that *uniquely* map to one location in the reference genome, then the percentage drops to 44%.


    Originally posted by francesco.vezzi View Post
    No, I disagree with you. If we consider an Illumina experiment with 7 lanes, 50% of the reads means more than 2 gigabases of data, and all of it is obtained at a fraction of the cost of the methods available only a year ago.
    The problem is the opposite: there is too much data to analyse, and we are more or less using instruments that were developed for a totally different kind of data.

    Very true - compared to Sanger sequencing this is much more cost-effective. But if people want to use RNA-seq for de novo profiling or, taken a step further, quantitative gene expression measurement, then they should be aware that, according to currently published results, 40-50% of the data is not usable - that's 40-50% of a ~$6300 sequencing run. This is fine for labs with lots of money to burn, but smaller labs will have to consider it more carefully before they treat RNA-seq as something routine.

    It would be nice to know and figure out how to recover these reads, and take more advantage of the data.
    Last edited by NGSfan; 04-23-2009, 05:33 AM.

  • francesco.vezzi
    replied
    Originally posted by jmaury View Post
    Hi,
    Recently I've worked on RNA-Seq from the grapevine genome, and we've mapped around 80% of the initial reads; but the average number of exons per gene is less than 5, relatively low compared to mammalian species.
    We have performed similar experiments on grapevine (I work at IGA in Udine) and obtained the same results. Actually, the number of reads that span exon-intron junctions is really low.

  • jmaury
    replied
    Hi,

    As mentioned by Melissa, exon/exon junctions will significantly reduce the number of mappable reads! So the number of mapped reads will be determined by the characteristics of your genome; in particular, the number of exons per gene has to be taken into account!
    Recently I've worked on RNA-Seq from the grapevine genome, and we've mapped around 80% of the initial reads; but the average number of exons per gene is less than 5, relatively low compared to mammalian species.
    You'll find more information here: http://seqanswers.com/forums/showthread.php?t=1015

    Hope this helps,

    Cheers,
    Jean-Marc

  • Michael.James.Clark
    replied
    Originally posted by NGSfan View Post
    Hi everyone!

    I must say, I'm very happy to find a community where we can discuss this new technology.

    I have searched the forum, but could not turn up a thread that discusses the issue of unmappable RNAseq reads.

    According to the article "The digital generation" by Nathan Blow, Dr. Liu is quoted as saying that it is not unusual that only "40-50%" of the data generated are mappable. There is some mention that perhaps this unmappable sequence is from antisense transcripts or artefacts of the RNA processing.

    Interestingly, J. Shendure mentions being able to achieve 95% mapping with genomic DNA.

    Losing 50-60% of the RNA-seq data seems quite high. Has anyone looked into this more carefully? Are the majority of these unmappable reads just full of sequencing errors? Could there be contamination? And what is meant by artefacts in making a sequencing library from RNA? What would these artefacts look like to make them unmappable?

    Thanks for any thoughts.
    What is your reference genome? Whole genome? Or just transcriptome? And if the latter, what's your definition of transcriptome? In the distant past, I used Refseq as a reference genome, but I now think that's too limiting.

    50-60% being "unmappable" sounds strangely high to me.

    In my experience, the major issue with RNA-seq is ribosomal RNA contamination. You must do something like a poly-A pull down for these experiments, or the majority of your data will be ribosomal RNA.

    For example, in an experiment I ran over a year ago (before RNA-seq was as well established as it is now), I used oligo(dT) cDNA synthesis thinking that would enrich for mRNA sequences enough to keep the rRNA sequences at a low level. Turns out that wasn't the case, and about 70% of my data from that lane of Solexa data aligned to ribosomal RNA sequence.

    In that experiment, only an additional 9.5% of the data aligned to Refseq. That means that only 20.5% of my data was "unmappable" for whatever reason. This was also using an older version of our aligner (BFAST), so it's possible more of them would be aligned if I were to re-run the data.

    Suffice it to say, in that experiment my major issue was the rRNA contamination. As Melissa pointed out, a poly-A pulldown can do a good job alleviating this problem and, from what I've read, can enrich your sample for mRNA such that you only end up with 30-40% rRNA contamination after a single round of poly-A purification.

    That said, if your reference genome doesn't include ribosomal RNA, it's possible the "unmapped" reads are mostly ribosomal RNA. Do you know if your reference contains rRNA?
