Why low mapping
Thanks for the reply. I removed adaptors from the reads with cutadapt. I'm also curious how the reads map. Attached please see pic for how these map and also the presence of such a lot of N's. Should I worry about the N's and remove it or should I rather leave it. Any suggestions?
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
Originally posted by Lizex View PostHi All
I have been mapping reads generated on a MiSeq instrument (150bp) against the apple genome. Reads were processed (trimmed and filtered using the fastx_toolkit). Reads mapped are over a 100bp long. Using Tophat (version 1.4) to align the reads, I got an average of 12% of reads mapped for e.g. out of 5 Gb of reads (read1 and read2) only 600 Mb of data in the accepted_hits.bam file. I take this as reads mapped. Is there something wrong or am I missing something here? Any suggestions as to how address this very low mapping. I'm sure that the percentage reads should be much better than 12%?
Leave a comment:
-
RNA-Seq reads mapping very low
Hi All
I have been mapping reads generated on a MiSeq instrument (150bp) against the apple genome. Reads were processed (trimmed and filtered using the fastx_toolkit). Reads mapped are over a 100bp long. Using Tophat (version 1.4) to align the reads, I got an average of 12% of reads mapped for e.g. out of 5 Gb of reads (read1 and read2) only 600 Mb of data in the accepted_hits.bam file. I take this as reads mapped. Is there something wrong or am I missing something here? Any suggestions as to how address this very low mapping. I'm sure that the percentage reads should be much better than 12%?
Leave a comment:
-
Hello to everyone and thanks for adding to the discussion here. Sorry I did not have the time to reply. I really appreciated everyone's input and I even learned more things than I expected.
Alim, so from what I understand from your post, you think that the ~60% mapping rate of raw 25bp reads from your experiment is most likely due to the higher error rates of the old illumina sequencing chemistry.
The newer chemistry then, should be better and help to increase the mapping rates by reducing the sequencing errors. I will take a closer look at this once the longer read data sets are available.
Leave a comment:
-
Definitely longer reads will help aid in the mapping process for any seq experiment.
It may not be necessarily correct to map to the middle exon. For example if the exons are 40bp, 20bp, and 20bp respectively. Then in the majority of the cases (reads spanning the exons) the first exon contributes the most sequence to the read. If you want to find the "best fit" exon, it would be the first.
As long as it is admitted that the ultimate goal is to reconstruct the isoforms, which the above is an intermediate step, or the goal is to measure gene expression, where we simply care about the abundance of transcripts from a given gene and so do not need to find the isoforms, then I would tend to agree with any intermediate approach.
Its a difficult problem, and I am still unsure of the best path of attack.
Leave a comment:
-
Melissas statment was that longer reads increases the possibility of getting a unique (exonic) sequence, and the op's question was about the fraction of unmappale reads. So a read spanning 3 exons would then be correctly aligned to the middle exon. Of course as you say the problem lies in defining the isoforms. Re 2) I did not say you have to be able to align all parts of the reads, but rather to increase the likelihood of finding the alignment of a read compared to having only one 25 mere from the same transcript.
Leave a comment:
-
Originally posted by Chipper View PostThis should not be a problem (for mapping) as long as your aligner is capable of finding the best match in the read. AFAIK it is only one method that claims to do this though, but at least AB is starting to go in this direction with the RNA-seq pipeline (by identifying the best match from either end of the read). And of course you could split your 2x100 b reads into 8x25 and use ISAS to map those reads really fast.
2. You cannot split the reads since you do not know where the splice junctions exist in the read. If you did split, you could use any aligner since they are all fast (with a few exceptions).
Leave a comment:
-
Originally posted by nilshomer View PostThat's not necessarily true. Consider 100bp paired end reads. You could have reads that have more than one exon junction (even three or four if the exons are small). This is a significant bioinformatic challenge, especially when we wish to find novel transcripts and new exons.
Leave a comment:
-
In the later case, you cannot map to all known or predicted exons (or all possible junctions) since you might be missing new exons (especially if the sample has a high mutation rate i.e. cancer, which could create gene fusions). So really you have to map back the full DNA reference, tolerating that a read can span many junctions. My cursory look has not found anything that can handle millions, if not billions of these reads (we want to sequence 1000's of human samples at good coverage), while searching for such junctions.
Finally, the mapping above is with very low sensitivity. If you allow only 2 mismatches, then that is requiring that the reads you map have <2% error rate (100bp reads), which is unrealistic and creates the majority of "unmapped" reads (you were not sensitive enough).
On the other hand, this means that people like us have a job!
Leave a comment:
-
Hi Melissa,
Let me see if I can answer your questions.
The reference to "pooled tissues" is that the RNA-samples (coming from a third party) were not from a single mouse brain but from many mouse brains pooled (same goes for the liver sample and the muscle sample) and not that we mixed brain+liver+splice in most of the analysis or in any of the sequencing. Note that these RNA samples are from the inbred mouse strain that was sequenced, so we do not expect to have many transcripts affected by variation/SNPs/etc.....
- The total raw reads for all of the reads, including the unmapped ones adds up to 216M reads, but since only 65% of the reads mapped on average, we used 140M reads in the analysis.
- Remember that we map the reads to the mouse reference genome in addition to known splice junctions. Assuming that novel transcripts are like existing transcripts, then most of their reads should map to the mouse reference genome. While there are undoubtedly thousands of junctions that we are missing, these are unlikely to be from highly expressed transcripts (say the top 10 expressed genes in muscle or liver), which are the best characterized transcripts & contribute disproportionately to the read counts.
- A better explanation for the 40% that aren't mapping is that they are likely to have the same distribution of unique/splice/multi reads as the other 60%, but with more than 2 mismatches... most of them would map to the same places as the other reads, which means primarily on known transcripts!
- the numbers for supplemental table 1 are just for illustration purposes. ERANGE does all of its calculations using mapped (unique + spliced + multi) reads.
- actually we see a surprising amount of intron retention in some genes! Clearly, there is still much work to be done in getting the genes models to be comprehensive even in genomes as extensively studied as human and mouse (and C elegans & drosophila too).
I hope this helps!
Ali
Leave a comment:
-
Confused
Originally posted by alim View PostHi,
The fraction of reads crossing known or novel junctions is a function of both the read length and the distribution of exons lengths. For mammalian genomes like human and mouse (I claim no knowledge of the grapevine genome), less than 3% of 25 mers end up crossing a known splice junctions; this number in the same samples & genomes goes up to 20% by the time you use 75 mers. This numbers does not change dramatically in a well annotated genome when you include novel splices!
It's worth pointing out that since RNA-seq is a numbers game, many people, such as our lab, used raw reads rather than the solexa quality-filtered reads (and we can discuss whether Illumina's pre-1.0 quality scores or filters made any sense at all); this strategy is referred to sometimes as "quality-filtering by mapping".
One particularly interesting aspect of Illumina's reads is that a fraction of the reads accumulate most of the errors in the 3' end of the reads [in fact we took 32mers and truncated them into 25mers just to avoid the errors accumulating in bases 26-32]. Given the error rates with the old reagents, it's not surprising at all that 40% of the reads would have more than 2 errors in 25 bp.
Ali
Great to have you here! Thanks for the explanation on filtering. That really clear things a lot.
Part of the paper can be confusing. The RNA-seq was carried out on 3 types of tissues (brain, liver, muscle). The Methods stated that total RNA is pooled from tissues. That statement is misleading.
It'll be great you can clear me out here:
-The total data in your supplementary table 1 doesn't add up to 140 million reads
-What I understand from the paper is that all the "unmappable" reads that neither map to the genome and splice junctions are potentially novel transcript, is that right?
(Just to clarify from the author although others have stated that in the previous post)
-Regarding the percentage of reads at splice junctions...Don't you think that splice-spanning reads should be divided by (total reads that map to genome) instead of the entire datasets?
By now, everyone know that 25 bp reads are too short to map anything uniquely. With reads of only 25bp long, I would imagine all the reads will align to the genome.
Thanks in advance,
Melissa
Oops...I realized I made a big mistake when I said "Maybe some reads that are overlapping/spinning exon splice junctions are lost after mapping. " I forgot that RNA-seq data is not supposed to contain introns (unless intron retention occurs... which is less likely to happen compared to plants).
Leave a comment:
-
fraction of splice-crossing reads is a function of read length
Hi,
It's nice to see a spirited discussion of what the fractions of reads crossing splice junctions should be (and to be quoted in that debate ). I'll just restate what is in retrospect obvious, based on looking at a couple year's worth of data.
The fraction of reads crossing known or novel junctions is a function of both the read length and the distribution of exons lengths. For mammalian genomes like human and mouse (I claim no knowledge of the grapevine genome), less than 3% of 25 mers end up crossing a known splice junctions; this number in the same samples & genomes goes up to 20% by the time you use 75 mers. This numbers does not change dramatically in a well annotated genome when you include novel splices!
It's worth pointing out that since RNA-seq is a numbers game, many people, such as our lab, used raw reads rather than the solexa quality-filtered reads (and we can discuss whether Illumina's pre-1.0 quality scores or filters made any sense at all); this strategy is referred to sometimes as "quality-filtering by mapping".
One particularly interesting aspect of Illumina's reads is that a fraction of the reads accumulate most of the errors in the 3' end of the reads [in fact we took 32mers and truncated them into 25mers just to avoid the errors accumulating in bases 26-32]. Given the error rates with the old reagents, it's not surprising at all that 40% of the reads would have more than 2 errors in 25 bp.
Nowadays, 2x75 reads on a good run with the new reagents are of far better quality than those 25mers.
Ali
Leave a comment:
-
Hi Kevin,
Originally posted by klh View PostHi Jean-marc,
I'm not quite sure I understand how you obtain your estimate of 25%. A 1500nt transcript with 10 x 150nt exons contains 1475 25mers, 216 of which (9 * 24) cross splice boundaries. I make this ~15% of reads crossing splice junctions in this example.
But according to the Mortazavi et al paper, they've mapped RNA-Seq against the mouse transcriptome, so reads that come from an unknwon splice junction won't fall into the categorie of reads that cross a splice boundaries... However the number of unknown splice junctions in the mouse genome may be low !!!
JM
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...-
Channel: Articles
04-22-2024, 07:01 AM -
-
by seqadmin
Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...-
Channel: Articles
04-04-2024, 04:25 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Yesterday, 11:49 AM
|
0 responses
15 views
0 likes
|
Last Post
by seqadmin
Yesterday, 11:49 AM
|
||
Started by seqadmin, 04-24-2024, 08:47 AM
|
0 responses
16 views
0 likes
|
Last Post
by seqadmin
04-24-2024, 08:47 AM
|
||
Started by seqadmin, 04-11-2024, 12:08 PM
|
0 responses
61 views
0 likes
|
Last Post
by seqadmin
04-11-2024, 12:08 PM
|
||
Started by seqadmin, 04-10-2024, 10:19 PM
|
0 responses
60 views
0 likes
|
Last Post
by seqadmin
04-10-2024, 10:19 PM
|
Leave a comment: