  • why low mapping rates for RNAseq?

    Hi everyone!

    I must say, I'm very happy to find a community where we can discuss this new technology.

    I have searched the forum, but could not turn up a thread that discusses the issue of unmappable RNAseq reads.

    According to the article "The digital generation" by Nathan Blow, Dr. Liu is quoted as saying that it is not unusual that only "40-50%" of the data generated are mappable. There is some mention that perhaps this unmappable sequence comes from antisense transcripts or artefacts of RNA processing.

    Interestingly, J. Shendure mentions being able to achieve 95% mapping with genomic DNA.

    Losing 50-60% of the RNA-seq data seems quite high. Has anyone looked into this more carefully? Are the majority of these unmappable reads just full of sequencing errors? Could there be contamination? And what is meant by artefacts in making a sequencing library from RNA? What would these artefacts look like, to make them unmappable?

    Thanks for any thoughts.

  • #2
    Hi,
    I think you must be more precise. What is the length of the reads you want to place, and how many errors do you allow per read? Are reads that map to multiple locations considered placed or not?

    Actually, I was wondering about unplaced reads over the weekend (to the delight of my girlfriend). In particular, I think that if we exclude low-quality reads, the unplaced reads hide some non-trivial information...



    • #3
      Hi Francesco!

      Thanks for joining in on the discussion.

      For example, I have examined some published data with 25 bp reads, aligned to the mouse genome allowing at most 2 mismatches. About 60% of the reads are alignable under those parameters. I looked at the unmapped reads for possible contamination, but only about 2% mapped to human or E. coli, for example.

      Perhaps I'll take another look at the unmapped reads and allow 3 or 4 mismatches, and see how many more reads I can recover for mapping. But I'm sure many will still be unmapped, and I wonder if this is just because there are more errors than advertised by Illumina, or what else it could be.

      By non-trivial, do you mean some functional sequences? Any guesses what these other sequences could be? I'm very curious.

      In my opinion it seems really odd that, after spending several thousand dollars on an expensive experiment, you only get to use 50-60% of the data, no?
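      As a first sanity check on the error hypothesis, I may bin the unmapped reads by mean base quality and see whether they are simply low-quality reads. A minimal sketch in Python, assuming the unmapped reads are in a FASTQ file (the file name is hypothetical) with Phred+33 qualities; older Illumina pipelines used Phred+64, so the offset may need changing:

      Code:
      # Bin unmapped reads by mean base quality: if most unmapped reads
      # are low quality, sequencing error is the likely culprit.
      # Assumes "unmapped.fastq" (hypothetical name) and Phred+33;
      # set OFFSET = 64 for the old Illumina 1.3+ encoding.
      from collections import Counter

      OFFSET = 33

      def mean_quality(qual):
          return sum(ord(c) - OFFSET for c in qual) / len(qual)

      bins = Counter()
      with open("unmapped.fastq") as fh:
          for i, line in enumerate(fh):
              if i % 4 == 3:  # every 4th line is the quality string
                  q = mean_quality(line.rstrip("\n"))
                  bins[int(q // 5) * 5] += 1  # 5-point quality bins

      for low in sorted(bins):
          print(f"mean Q {low}-{low + 4}: {bins[low]} reads")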



      • #4
        Originally posted by NGSfan View Post
        Hi Francesco!
        In my opinion it seems really odd that, after spending several thousand dollars on an expensive experiment, you only get to use 50-60% of the data, no?
        Well, this depends a lot on what reference you are aligning against. We have an Illumina Genome Analyzer, and two years ago we sequenced the genome of the grapevine. When we sequence the reference plant that we used to build the assembly, we align more than 74% of the reads with parameters quite similar to yours. If we sequence another variety of grapevine, we are able to align only 60% of all the reads. This is not strange, because the two organisms are different.

        You are using 25-base reads, which means they are really old (Illumina can now produce 75 bp paired-end reads) and were probably analysed with the old pipeline. You can find one interesting data set by having a look at http://tinyurl.com/68aeq3 and at the article "De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads".

        About the non-trivial information in the unaligned reads: there are a lot of possibilities, such as repeated regions with many errors (compared to the reference sequence) or totally new insertions.

        Originally posted by NGSfan View Post
        In my opinion it seems really odd that, after spending several thousand dollars on an expensive experiment, you only get to use 50-60% of the data, no?
        No, I disagree with you. If we consider an Illumina experiment with 7 lanes, 50% of the reads means more than 2 gigabases of data, and all of it is obtained at a fraction of the cost of the methods available only a year ago.
        The problem is the opposite: there is too much data to analyse, and we are more or less using instruments that were developed for totally different kinds of data.

        Obviously, this is only the opinion of a four-month PhD student...



        • #5
          Unmappable reads are new to me. I'm particularly interested in the difficulties of de novo transcriptome assembly.

          Can you define these unmappable reads more precisely? Are you referring to the raw data, or to high-quality reads after filtering? If you are referring to filtered reads, then sequencing errors cannot contribute to this problem. I believe these reads result from technical/experimental problems rather than from the nature of the sequences; for example, low-quality reads due to the platform's temperature problem, or artifacts created at the edges of the flow cell.

          Most RNA-seq data sets contain 30-40% rRNA. After filtering low-quality reads, contaminating reads, and poly-A tails, 50-60% mappable sounds REALLY good to begin with. So, the answer is NO to whether it's a waste to get only 50-60% of the reads. High redundancy is another reason why some reads are not useful at all.
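
          If you want a rough estimate of the rRNA fraction before running a full alignment, a crude exact k-mer screen against known rRNA sequences can work. This is a sketch only: the file names are hypothetical, it checks the forward strand only, and a real screen would use an aligner:

          Code:
          # Flag any read sharing an exact 20-mer with a known rRNA
          # sequence. File names are hypothetical; forward strand only
          # (add reverse complements for a fuller screen).
          K = 20

          def read_fasta(path):
              name, seq = None, []
              with open(path) as fh:
                  for line in fh:
                      line = line.rstrip("\n")
                      if line.startswith(">"):
                          if name is not None:
                              yield name, "".join(seq)
                          name, seq = line[1:], []
                      else:
                          seq.append(line.upper())
              if name is not None:
                  yield name, "".join(seq)

          rrna_kmers = set()
          for _, seq in read_fasta("rrna.fa"):  # known rRNA sequences
              for i in range(len(seq) - K + 1):
                  rrna_kmers.add(seq[i:i + K])

          total = hits = 0
          with open("reads.fastq") as fh:
              for i, line in enumerate(fh):
                  if i % 4 == 1:  # sequence lines
                      seq = line.rstrip("\n").upper()
                      total += 1
                      if any(seq[j:j + K] in rrna_kmers
                             for j in range(len(seq) - K + 1)):
                          hits += 1

          print(f"{hits}/{total} reads ({100 * hits / total:.1f}%) look like rRNA")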

          I'm not sure why some reads cannot map to the reference genome. Well, they just don't. Maybe some reads that overlap or span exon splice junctions are lost in mapping. RNA processing and other regulatory mechanisms sound like a good explanation.

          Cheers,
          Melissa



          • #6
            Originally posted by NGSfan View Post
            According to the article "The digital generation" by Nathan Blow, Dr. Liu is quoted as saying that it is not unusual that only "40-50%" of the data generated are mappable. [...] Losing 50-60% of the RNA-seq data seems quite high. Has anyone looked into this more carefully?
            What is your reference genome? The whole genome, or just the transcriptome? And if the latter, what's your definition of the transcriptome? In the distant past I used RefSeq as a reference, but I now think that's too limiting.

            50-60% being "unmappable" sounds strangely high to me.

            In my experience, the major issue with RNA-seq is ribosomal RNA contamination. You must do something like a poly-A pull down for these experiments, or the majority of your data will be ribosomal RNA.

            For example, in an experiment I ran over a year ago (before RNA-seq was as well established as it is now), I used oligo(dT) cDNA synthesis thinking that would enrich for mRNA sequences enough to keep the rRNA sequences at a low level. Turns out that wasn't the case, and about 70% of my data from that lane of Solexa data aligned to ribosomal RNA sequence.

            In that experiment, only an additional 9.5% of the data aligned to RefSeq. That means only 20.5% of my data was "unmappable" for whatever reason. This was also using an older version of our aligner (BFAST), so it's possible more of the reads would align if I re-ran the data.

            Suffice it to say, in that experiment my major issue was the rRNA contamination. As Melissa pointed out, a poly-A pulldown can do a good job of alleviating this problem and, from what I've read, can enrich your sample for mRNA such that you only end up with 30-40% rRNA contamination after a single round of poly-A purification.

            That said, if your reference genome doesn't include ribosomal RNA, it's possible the "unmapped" reads are mostly ribosomal RNA. Do you know if your reference contains rRNA?
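
            One quick way to check, if your reference is a FASTA with descriptive headers (a minimal sketch; the file name is hypothetical, and rRNA genes hiding in unannotated contigs won't show up):

            Code:
            # Print reference FASTA headers that mention rRNA.
            # "reference.fa" is a hypothetical file name.
            count = 0
            with open("reference.fa") as fh:
                for line in fh:
                    if line.startswith(">") and "rrna" in line.lower():
                        count += 1
                        print(line.rstrip("\n"))
            print(f"{count} rRNA-annotated sequences found")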



            • #7
              Hi,

              As mentioned by Melissa, exon/exon junctions will significantly reduce the number of mappable reads! So the number of mapped reads will be determined by the characteristics of your genome; in particular, the number of exons per gene has to be taken into account!
              Recently I've worked on RNA-Seq data from the grapevine genome, and we've mapped around 80% of the initial reads; but the average number of exons per gene is less than 5, relatively low compared to mammalian species.
              You'll find more information here: http://seqanswers.com/forums/showthread.php?t=1015

              Hope this helps,

              Cheers,
              Jean-Marc



              • #8
                Originally posted by jmaury View Post
                Hi,
                Recently I've worked on RNA-Seq data from the grapevine genome, and we've mapped around 80% of the initial reads; but the average number of exons per gene is less than 5, relatively low compared to mammalian species.
                We have done similar experiments on grapevine (I work at IGA in Udine), obtaining the same results. Actually, the number of reads that span exon/intron junctions is really low.



                • #9
                  Originally posted by francesco.vezzi View Post
                  Well, this depends a lot on what reference you are aligning against. We have an Illumina Genome Analyzer, and two years ago we sequenced the genome of the grapevine. When we sequence the reference plant that we used to build the assembly, we align more than 74% of the reads with parameters quite similar to yours. If we sequence another variety of grapevine, we are able to align only 60% of all the reads. This is not strange, because the two organisms are different.
                  Yes, it definitely seems that one can get higher rates of mapped reads when aligning reads generated from genomic DNA.

                  But for RNA-seq, looking at the literature (the mouse RNA-seq studies of Mortazavi, Pan, Sultan, etc.), the trend is that 50-60% of all the generated reads can be mapped when using the entire genome as the reference sequence.

                  Originally posted by francesco.vezzi View Post
                  You are using 25-base reads, which means they are really old (Illumina can now produce 75 bp paired-end reads) and were probably analysed with the old pipeline. You can find one interesting data set by having a look at http://tinyurl.com/68aeq3 and at the article "De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads".
                  For sure, longer reads will help, especially for genome assembly and re-sequencing. But the problem for RNA-seq appears to be different.

                  Originally posted by francesco.vezzi View Post
                  About the non-trivial information in the unaligned reads: there are a lot of possibilities, such as repeated regions with many errors (compared to the reference sequence) or totally new insertions.
                  Actually, the 50-60% mapped reads I mentioned includes reads that map to multiple locations on the genome (i.e. repeat regions, paralogs, etc.). If you count just the reads that *uniquely* map to one location in the reference genome, the percentage drops to 44%.
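
                  For anyone who wants to reproduce this kind of breakdown: if your aligner can emit SAM with NH:i: tags (not all do), a small script can tally unmapped, uniquely mapped, and multi-mapped reads. A sketch; the file name is hypothetical:

                  Code:
                  # Tally unmapped / unique / multi-mapped reads from a SAM file.
                  # Assumes the aligner writes NH:i: tags; file name hypothetical.
                  counts = {"unmapped": 0, "unique": 0, "multi": 0, "no_nh": 0}
                  with open("alignments.sam") as fh:
                      for line in fh:
                          if line.startswith("@"):  # header lines
                              continue
                          fields = line.rstrip("\n").split("\t")
                          flag = int(fields[1])
                          if flag & 0x4:  # 0x4 = segment unmapped
                              counts["unmapped"] += 1
                              continue
                          if flag & 0x100:  # skip secondary alignments,
                              continue      # so each read counts once
                          nh = next((f for f in fields[11:]
                                     if f.startswith("NH:i:")), None)
                          if nh is None:
                              counts["no_nh"] += 1
                          elif int(nh.split(":")[2]) == 1:
                              counts["unique"] += 1
                          else:
                              counts["multi"] += 1
                  print(counts)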


                  Originally posted by francesco.vezzi View Post
                  No, I disagree with you. If we consider an Illumina experiment with 7 lanes, 50% of the reads means more than 2 gigabases of data, and all of it is obtained at a fraction of the cost of the methods available only a year ago.
                  The problem is the opposite: there is too much data to analyse, and we are more or less using instruments that were developed for totally different kinds of data.

                  Very true: compared to Sanger sequencing this is much more cost-effective. But if people want to use RNA-seq for de novo profiling or, taken a step further, quantitative gene expression measurement, then they should be aware that, according to current published results, 40-50% of the data is not usable; that's 40-50% of a ~$6300 sequencing run. This is fine for labs with lots of money to burn, but smaller labs will have to consider it more carefully before they treat RNA-seq as routine.

                  It would be nice to figure out how to recover these reads and take more advantage of the data.



                  • #10
                    Originally posted by Melissa View Post
                    Can you define these unmappable reads more precisely? Are you referring to the raw data, or to high-quality reads after filtering? If you are referring to filtered reads, then sequencing errors cannot contribute to this problem. I believe these reads result from technical/experimental problems rather than from the nature of the sequences; for example, low-quality reads due to the platform's temperature problem, or artifacts created at the edges of the flow cell.
                    I am referring to the primary sequence data: all the raw reads generated from an RNA-seq experiment.

                    That's interesting, what you mention about the low-temperature problem and the flow-cell edges; I did not know about those issues. It would be interesting to know what fraction of reads are unmappable because of them.

                    Originally posted by Melissa View Post
                    Most RNA-seq data sets contain 30-40% rRNA. After filtering low-quality reads, contaminating reads, and poly-A tails, 50-60% mappable sounds REALLY good to begin with. So, the answer is NO to whether it's a waste to get only 50-60% of the reads. High redundancy is another reason why some reads are not useful at all.
                    But if you are aligning to the entire genome, the rRNA reads should align as well, no?

                    Originally posted by Melissa View Post
                    I'm not sure why some reads cannot map to the reference genome. Well, they just don't. Maybe some reads that overlap or span exon splice junctions are lost in mapping. RNA processing and other regulatory mechanisms sound like a good explanation.
                    According to most estimates (Mortazavi, for example), only 3% of all reads (mapped and unmapped) fall on splice junctions, so this is quite small really.



                    • #11
                      Originally posted by NGSfan View Post
                      It would be nice to figure out how to recover these reads and take more advantage of the data.
                      It looks like a philosophical problem... what to do with something that at first sight has no importance?

                      Where did you get the data that originated this discussion? I want to align them with our tool and see what happens...



                      • #12
                        Originally posted by Michael.James.Clark View Post
                        50-60% being "unmappable" sounds strangely high to me.

                        In my experience, the major issue with RNA-seq is ribosomal RNA contamination. You must do something like a poly-A pull down for these experiments, or the majority of your data will be ribosomal RNA.

                        For example, in an experiment I ran over a year ago (before RNA-seq was as well established as it is now), I used oligo(dT) cDNA synthesis thinking that would enrich for mRNA sequences enough to keep the rRNA sequences at a low level. Turns out that wasn't the case, and about 70% of my data from that lane of Solexa data aligned to ribosomal RNA sequence.
                        Originally posted by Michael.James.Clark View Post
                        Do you know if your reference contains rRNA?
                        Yes, I'm using the entire genome, so I would assume the rRNA should map as well, no?

                        Yes, 50-60% mapped is really odd (talking about simply mapping reads to anything with at most 2 mismatches, not about uniquely mapping reads, which are only 44%), especially considering that the whole genome is used as the reference, so things like poly-A, repeat regions, and rRNA should still map.



                        • #13
                          Originally posted by NGSfan View Post
                          Yes, I'm using the entire genome, so I would assume the rRNA should map as well, no?
                          Yes, it should. Of course, you can check.

                          Originally posted by NGSfan View Post
                          Yes, 50-60% mapped is really odd (talking about simply mapping reads to anything with at most 2 mismatches, not about uniquely mapping reads, which are only 44%), especially considering that the whole genome is used as the reference, so things like poly-A, repeat regions, and rRNA should still map.
                          Have you considered using something other than the whole genome as a reference? Since you're using mouse, it should be easy to do.

                          Generate an adequate transcriptome reference and align to that. In our lab we've used both RefSeq and UCSC known genes as references successfully, although I feel we will do a better job the more permissive we become.

                          I've definitely seen even coverage across exon-exon junctions in my own data, and have used exon coverage to identify splice variants.

                          For discovering novel transcripts, whole genome is fine. But for looking at expression, splice variants, et cetera, I think we're better off using a transcriptome reference.

                          Originally posted by NGSfan View Post
                          According to most estimates (Mortazavi, for example), only 3% of all reads (mapped and unmapped) fall on splice junctions, so this is quite small really.
                          Can you reference the source for some of these estimates so we can evaluate that claim? I'd just like to take a look at the papers myself. Thanks.

                          Also, what alignment algorithm are you using? You say you are only robust against two mismatches; that's pretty low. Consider that if you have a SNP and a sequencing error, you've filled your quota. It also means, since you're using the whole genome with 25-base reads, that as soon as a read starts within about 23 bases of an exon junction, it won't align. Considering the size of a lot of exons, you'll completely miss quite a few of them.
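
                          To make that argument concrete, here's a back-of-the-envelope version with the assumptions spelled out (the mean exon length is an assumption for illustration, not a measurement):

                          Code:
                          # A read hanging `overhang` bases past an exon end
                          # mismatches the genomic reference at those bases, so
                          # it fails once the overhang exceeds the mismatch budget.
                          read_len = 25
                          max_mismatches = 2
                          mean_exon_len = 300  # assumed for illustration

                          lost_starts = sum(1 for overhang in range(1, read_len)
                                            if overhang > max_mismatches)
                          frac = lost_starts / mean_exon_len
                          print(f"{lost_starts} start positions lost per junction; "
                                f"~{100 * frac:.0f}% of reads in a "
                                f"{mean_exon_len} bp exon")

                          With these numbers it comes out to 22 start positions per junction, in line with the ~23-base window above, and the fraction of reads lost scales inversely with exon length.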

                          Again, I think aligning to a better reference will help you out on that front.



                          • #14
                            Originally posted by Michael.James.Clark View Post
                            Generate an adequate transcriptome reference and align to that.
                            We have performed an experiment like that, but the problem is that the exon/intron junctions are often not perfectly known. If you take short reads from the transcriptome and align them against the reference sequence allowing gaps (which lets you place, for example, half of a read in one exon and the other half in the following exon), you will probably discover that some exon junctions are not totally correct, and maybe you can identify new exons.

                            So the experiment you are proposing is perfect in theory, but in practice I'm not sure it will work.

                            Francesco



                            • #15
                              Originally posted by francesco.vezzi View Post
                              We have performed an experiment like that, but the problem is that the exon/intron junctions are often not perfectly known. If you take short reads from the transcriptome and align them against the reference sequence allowing gaps (which lets you place, for example, half of a read in one exon and the other half in the following exon), you will probably discover that some exon junctions are not totally correct, and maybe you can identify new exons.

                              So the experiment you are proposing is perfect in theory, but in practice I'm not sure it will work.

                              Francesco
                              In some cases that's true. For mouse? Not as big a problem.

                              My advice is to generate a reference that includes all possible splice variants (i.e. the junction sequences) and align to it; see the sketch at the end of this post.

                              Otherwise you will see a massive drop in coverage at all exon-exon junctions, as I said earlier, which will result in a large number of "unmappable" reads when you're only robust against two mismatches.

                              We've done it this way in our lab with human data, and it has worked.
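
                              Here's roughly what I mean by junction sequences: a minimal sketch, assuming you already have the genome in memory and 0-based half-open exon coordinates per transcript (all names and inputs here are hypothetical):

                              Code:
                              # For each pair of consecutive exons, join the last
                              # (read_len - 1) bases of one exon to the first
                              # (read_len - 1) bases of the next, so any read
                              # crossing the junction fits inside one sequence.
                              read_len = 25
                              flank = read_len - 1

                              def junction_sequences(genome, transcripts):
                                  # genome: chrom name -> sequence string
                                  # transcripts: tx id -> ordered list of
                                  # (chrom, start, end), 0-based half-open
                                  for tx_id, exons in transcripts.items():
                                      for i in range(len(exons) - 1):
                                          c1, s1, e1 = exons[i]
                                          c2, s2, _ = exons[i + 1]
                                          left = genome[c1][max(s1, e1 - flank):e1]
                                          right = genome[c2][s2:s2 + flank]
                                          yield f"{tx_id}_junc{i}", left + right

                              # Toy example:
                              genome = {"chr1": "ACGT" * 250}
                              transcripts = {"txA": [("chr1", 0, 100),
                                                     ("chr1", 300, 400)]}
                              for name, seq in junction_sequences(genome,
                                                                  transcripts):
                                  print(f">{name}\n{seq}")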
