  • dalesan
    replied
    Originally posted by geneart View Post
    Hi, I have a very basic question about read mapping. For differential expression analysis of NGS data, many papers I have read mention that they discard non-uniquely mapping reads. However, I could not find a good summarized explanation for doing so. From what I gather and understand, the more uniquely a read maps, the more certain we can be in calling its location, since the technique itself could introduce some mismatches and bring about non-specific mapping; the read depth at a unique location would still account for any naturally existing SNPs.
    Have I understood this right, or is there a better explanation of why we take only uniquely mapping reads to perform differential expression?
    Thanks in advance
    From what I have understood, reads that map to multiple locations in the genome cannot be reliably used in calculating differential gene expression. There is no biologically meaningful way to know where in the genome such a read really belongs, so including multi-reads confounds any estimate of the "true" level of gene expression (since one such read could align to multiple genes).

    Imagine you have 100 reads that do not map uniquely. How will you determine where to assign them? Do you split them evenly across locations or implement some other ad hoc solution? In any case, I think it turns out to be a guessing game that may bias your results.



  • gringer
    replied
    If a read from a transcript maps to multiple *genomic* locations, then you can't be confident about which particular transcript that read came from. The transcript reported for a multiply-mapped read will not be all of the mapped locations (and even if it were, that would distort the statistics through multiple counting), so the counts do not give an accurate representation of your sample.

    However, mapping to a transcriptome and then discarding multi-mapping reads doesn't make sense to me (assuming you're working with a species that has transcript isoforms).



  • dpryan
    replied
    It'd be good to know how many of the reads that map to miRNAs are also multimappers.
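    One way to check this, sketched minimally below with pysam, assuming the BAM carries the NH (number of hits) tag, which TopHat writes but some aligners do not, and that "mirna_hits.bam" (a hypothetical file) contains the reads assigned to miRNAs:

    ```python
    import pysam

    unique = multi = 0
    with pysam.AlignmentFile("mirna_hits.bam", "rb") as bam:
        for read in bam:
            if read.is_unmapped or read.is_secondary:
                continue  # count each read once, via its primary alignment
            if read.has_tag("NH") and read.get_tag("NH") > 1:
                multi += 1
            else:
                unique += 1

    total = unique + multi
    if total:
        print(f"{multi}/{total} mapped reads ({100 * multi / total:.1f}%) are multimappers")
    ```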



  • geneart
    replied
    dpryan: you are correct. By non-unique I meant not that the reads themselves are non-unique, but that their mapping locations are. In essence, for my differential analysis I kept reads mapping to a unique location and disregarded reads mapping to multiple locations.
    I am looking at miRNA, and hence was wondering whether it matters that I discarded reads mapping to multiple locations. I did use single-end sequencing. I had 95% of reads mapped to the genome, but only 4% of these mapped uniquely.

    As I am looking at miRNA expression in exosomes, I expect to have all other kinds of reads mapping to tRNA, rRNA, etc., so the ambiguity is amplified even more. That is the reason I consider only uniquely mapped reads. Does this hold good? Any suggestions on this, with respect to my final question about miRNAs? Opinions appreciated. Thanks very much in advance.
    Last edited by geneart; 04-20-2014, 03:01 AM.
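    For reference, a minimal pysam sketch of the unique-read filter described above, again assuming the aligner writes the NH tag (file names are hypothetical):

    ```python
    import pysam

    # Keep only primary alignments of reads with exactly one reported hit.
    with pysam.AlignmentFile("mapped.bam", "rb") as bam_in, \
         pysam.AlignmentFile("unique.bam", "wb", template=bam_in) as bam_out:
        for read in bam_in:
            if read.is_unmapped or read.is_secondary:
                continue
            if read.has_tag("NH") and read.get_tag("NH") == 1:
                bam_out.write(read)
    ```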



  • dpryan
    replied
    @Wallysb01: Non-unique reads have nothing to do with duplicates. Non-unique in this context refers to multimappers, which may or may not be used in RNA-seq, depending on the tool being used and the question being asked. I think most people agree with you that removing duplicates from RNA-seq datasets is a good way to shoot yourself in the foot.



  • Wallysb01
    replied
    Originally posted by geneart View Post
    Hi, I have a very basic question about read mapping. For differential expression analysis of NGS data, many papers I have read mention that they discard non-uniquely mapping reads. However, I could not find a good summarized explanation for doing so. From what I gather and understand, the more uniquely a read maps, the more certain we can be in calling its location, since the technique itself could introduce some mismatches and bring about non-specific mapping; the read depth at a unique location would still account for any naturally existing SNPs.
    Have I understood this right, or is there a better explanation of why we take only uniquely mapping reads to perform differential expression?
    Thanks in advance
    I think in general people should not be disregarding non-unique reads, especially if you're doing single-end sequencing. With RNA-seq it's entirely possible that many genes are sequenced at absurd coverage, and duplicates are simply a normal outcome of the sampling process. If you were to remove the duplicates, you'd just be reducing your power to detect expression changes in your most highly expressed genes: every gene would effectively be capped at one read per base pair of its length in every sample and condition, so if a gene is expressed above that level, you can't detect changes in it after removing duplicates.

    Now, if you have some reason to believe that the libraries were over-amplified and you see lots of duplicates even in your more lowly expressed genes, then you may want to remove duplicates; but you should still keep in mind the issues mentioned above and expect that your most highly expressed genes won't come out as differentially expressed.
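    A toy calculation of the cap described above (the numbers are hypothetical, and strand and mate positions are ignored for simplicity):

    ```python
    # After duplicate removal, a gene can contribute at most one read per
    # distinct alignment start position, regardless of its true expression.
    gene_length = 2000  # hypothetical gene length in bp
    read_length = 100   # hypothetical single-end read length

    max_dedup_count = gene_length - read_length + 1
    print(max_dedup_count)  # 1901: counts saturate here after deduplication
    ```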



  • Brian Bushnell
    replied
    This is much more of a problem when mapping to a transcriptome than to a genome, which is one reason I recommend genome mapping for RNA-seq. But either way, you can do one of three things with ambiguously-mapping reads, each with disadvantages:

    1) Discard them, causing underrepresentation of transcripts homologous to other transcripts.
    2) Pick one site at random, which will overrepresent the transcripts that occur less frequently and underrepresent the ones that occur more frequently.
    3) Pick all top mapping sites, which will overrepresent everything.

    There's no perfect answer; they'll all incur a bias. It probably doesn't matter too much which one you go with as long as every 'treatment' you compare uses identical methodology, and the same read length (since longer reads will have less ambiguity).
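    A toy sketch of the three options, using made-up reads where each read is listed with the transcripts it maps to equally well (not the output of any particular aligner):

    ```python
    import random
    from collections import Counter

    # Each read = the list of transcripts it maps to equally well (made up).
    reads = [["tA"], ["tA"], ["tA"], ["tA", "tB"], ["tA", "tB"], ["tB"]]

    def discard(reads):  # option 1: drop ambiguous reads
        return Counter(hits[0] for hits in reads if len(hits) == 1)

    def pick_random(reads, seed=0):  # option 2: one site at random
        rng = random.Random(seed)
        return Counter(rng.choice(hits) for hits in reads)

    def count_all(reads):  # option 3: count every top site
        return Counter(t for hits in reads for t in hits)

    for name, fn in [("discard", discard), ("random", pick_random), ("all", count_all)]:
        print(name, dict(fn(reads)))
    # "discard" underrepresents both tA and tB, "all" inflates both, and
    # "random" lands in between but varies with the seed.
    ```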



  • geneart
    replied
    Mapping uniquely

    Hi, I have a very basic question about read mapping. For differential expression analysis of NGS data, many papers I have read mention that they discard non-uniquely mapping reads. However, I could not find a good summarized explanation for doing so. From what I gather and understand, the more uniquely a read maps, the more certain we can be in calling its location, since the technique itself could introduce some mismatches and bring about non-specific mapping; the read depth at a unique location would still account for any naturally existing SNPs.
    Have I understood this right, or is there a better explanation of why we take only uniquely mapping reads to perform differential expression?
    Thanks in advance



  • dalesan
    replied
    Updated results and summary

    Hello All,

    I thought I'd chime in again with some updated results after re-running my DEG analysis using bowtie2 to map reads to both the Arabidopsis transcriptome (one representative isoform per locus) and the Arabidopsis genome.

    For the transcriptome alignments, I used bowtie2 and allowed local alignments. For the genome mapping, I used tophat (which only runs bowtie2 in end-to-end mode, i.e. no local alignments). I used DESeq2 for my DEG analyses.

    My original question was whether it's "better" to run a DE analysis on a well-characterized organism using its transcriptome or its genome. What I've come to discover, at least for Arabidopsis, is that if you have the time it's a good idea to do both: in my case I was able to recover an additional 10-15% DEGs by considering the unique DEGs found in each of the mapping scenarios. For example, in my re-analysis using bowtie2 instead of bowtie1, I uncovered a total of 1667 DEGs, 1348 (~81%) of which were in common between the transcriptome and genome mappings.

    As I had previously conducted DEG analyses with bowtie1 alignments, I decided to look at the differences in the DEGs found between the bowtie1 and bowtie2 mappings against the genome (using tophat), and between the bowtie1 and bowtie2 mappings against the transcriptome.

    I was very happy to find little difference in the genome mappings: 97.5% of the differentially expressed genes were shared across the two alignment versions, which is fantastic. It's important to note that end-to-end alignment mode was used by default here, as bowtie2's local alignment option isn't supported when mapping to the genome in tophat.



    I next checked the overlap between the bowtie1 and bowtie2 transcriptome alignments. Here there was less concordance: only 77.3% of the differentially expressed genes were shared across the two alignment versions.



    I imagine this is largely attributable to my having used the local alignment option during mapping. Notably, a greater percentage of my raw reads mapped as a result of invoking local alignment: roughly 65-70% (bowtie2) versus 55-57% (bowtie1).

    So the next question is: which of the two transcriptome alignments is more "trustworthy"? For now, I can't really say. If anyone has insight into whether it's worth using bowtie2's local alignment option, I'd love to hear it.

    In summary, I found 1764 DEGs using bowtie1 and 1667 using bowtie2 in a combined analysis of transcriptome + genome mappings. There was great agreement (97.5% shared DEGs) between bowtie1 and bowtie2 in the genome mappings, presumably because end-to-end mode was used during alignment. I observed considerably less concordance in the transcriptome alignments (77.3% shared DEGs), probably because I invoked the local alignment option in bowtie2; however, roughly 10% more of my raw reads mapped to the transcriptome with local alignment. As for my original question of whether it's better to use the transcriptome or the genome for mapping: if you have access to both, and the resources, use both. I was able to recover an additional 10-15% DEGs by considering the unique DEGs found in each mapping scenario. Going forward, I plan on using the DEGs from the bowtie2 mappings rather than the bowtie1 mappings for all downstream analyses.
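    A minimal sketch of how such a concordance figure can be computed, assuming shared DEGs are expressed as a fraction of the union of the two DEG sets (the post doesn't specify the denominator, and the gene IDs below are hypothetical placeholders):

    ```python
    # Toy example: fraction of DEGs shared between two mapping runs.
    bowtie1_degs = {"AT1G01010", "AT1G01020", "AT1G01030", "AT1G01040"}
    bowtie2_degs = {"AT1G01010", "AT1G01020", "AT1G01030", "AT1G01050"}

    shared = bowtie1_degs & bowtie2_degs
    union = bowtie1_degs | bowtie2_degs
    print(f"{100 * len(shared) / len(union):.1f}% of DEGs shared")  # 60.0%
    ```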

    I'd love to hear your feedback, and I hope this short comparison is useful for someone.

    Cheers,
    Dale



  • dalesan
    replied
    Originally posted by gringer View Post
    Is there any particular reason why you used bowtie and not bowtie2? Were you specifically telling tophat to use bowtie, rather than bowtie2 (the default)?

    I ask because if the bowtie versions are different, you'll be comparing a bit more than just genome vs transcriptome.

    Additional question, did you use the transcriptome GTF file when mapping using tophat? I assume not, because that is likely to result in all transcriptome reads being picked up.
    Actually, there was no particular reason that I chose bowtie1. I just read up on the differences between bowtie1 and bowtie2, and it seems bowtie2 has some nice improvements (affine gap penalties, local alignments, better handling of read pairs). Looks like I'll be re-running my pipeline one more time to see the differences between one alignment method and the other.

    And yes, I did tell tophat to use bowtie1. I did in fact use the transcriptome GTF file when mapping with tophat. I don't think it would pick up all transcripts, because my transcriptome index was built from a fasta file containing only the longest/most representative gene model for each gene.

    In your experience, have you noticed a big difference between bowtie1 and bowtie2? Does it warrant re-doing my analysis?

    Thanks for your questions!



  • gringer
    replied
    Originally posted by dalesan View Post
    I used bowtie v1.0.0 to map to a filtered transcriptome containing only the longest gene-model isoform of each gene.... For the genome alignment, I used tophat v2.0.10 and observed that 75-80% of the reads aligned.
    Is there any particular reason why you used bowtie and not bowtie2? Were you specifically telling tophat to use bowtie, rather than bowtie2 (the default)?

    I ask because if the bowtie versions are different, you'll be comparing a bit more than just genome vs transcriptome.

    Additional question, did you use the transcriptome GTF file when mapping using tophat? I assume not, because that is likely to result in all transcriptome reads being picked up.



  • dalesan
    replied
    Results of my comparison of mapping to transcriptome vs genome for DEG analysis

    So, I've finally gotten around to comparing the results of my differential gene expression analysis based on mapping to the transcriptome and genome of Arabidopsis.

    I used bowtie v1.0.0 to map to a filtered transcriptome containing only the longest gene-model isoform of each gene. I have about 30 million paired-end reads for each of my 4 samples (2 control, 2 treated), and roughly 45-50% of these reads mapped to the transcriptome. For the genome alignment, I used tophat v2.0.10 and observed that 75-80% of the reads aligned.

    After summarizing counts, I used DESeq2 v1.2.10 for the DEG analysis.

    Similar to sazz, I didn't observe a huge difference, but one certainly exists.

    What I plan on doing is using the combined information from both analyses and working with the total number of DEGs found (1764), rather than just the intersection (1278), or the 1536 and 1506 found in the transcriptome and genome mappings, respectively.

    Can you think of any reason to object to this line of reasoning?
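    As a sanity check, the totals quoted above are consistent with simple inclusion-exclusion on the two DEG sets:

    ```python
    # Inclusion-exclusion on the DEG counts quoted above.
    transcriptome, genome, intersection = 1536, 1506, 1278
    union = transcriptome + genome - intersection
    print(union)  # 1764, the combined total reported in the post
    ```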



  • rskr
    replied
    Originally posted by gringer View Post
    I would recommend mapping to the genome, but using the transcriptome as a mapping template to pick up splice boundaries, etc. In other words, something like what Tophat does. Mapping to the genome makes novel isoforms a bit easier to pick up, and mapping to the transcriptome will give you more descriptive output (e.g. proper gene names) with a bit less work. I would expect thaliana to have a fairly well-annotated transcriptome, so you'd be losing a lot by ignoring annotated genetic features.
    IMO it is obvious that Tophat went to transcriptome mapping because they were unable to solve the pseudogene problem. What remains to be seen is whether using the genome actually brings anything to the table besides huge hardware requirements and short leading and trailing non-coding isoforms. Could whatever it does bring to the table be done later, with the reads that don't map to a transcript, in an analysis other than differential expression, such as an isoform search?

    Furthermore, I think most poorly characterized organisms get their transcriptomes done first, since they are easier and provide the majority of the useful information, which rather renders the argument about uncharacterized organisms moot.



  • gringer
    replied
    I would recommend mapping to the genome, but using the transcriptome as a mapping template to pick up splice boundaries, etc. In other words, something like what Tophat does. Mapping to the genome makes novel isoforms a bit easier to pick up, and mapping to the transcriptome will give you more descriptive output (e.g. proper gene names) with a bit less work. I would expect thaliana to have a fairly well-annotated transcriptome, so you'd be losing a lot by ignoring annotated genetic features.
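    A hedged sketch of this approach with TopHat, which accepts known transcripts via -G/--GTF and maps against them before falling back to the genome (all paths and the index name below are hypothetical):

    ```python
    import subprocess

    # TopHat first aligns reads to the transcriptome built from the GTF,
    # then maps the remaining reads to the genome.
    subprocess.run(
        [
            "tophat",
            "-G", "annotation.gtf",   # known transcripts as mapping template
            "-o", "tophat_out",       # output directory
            "genome_bt2_index",       # bowtie2 index prefix for the genome
            "reads_1.fastq", "reads_2.fastq",
        ],
        check=True,
    )
    ```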



  • dalesan
    replied
    Originally posted by Brian Bushnell View Post
    Also, mapping to a genome is more objective and repeatable. Mapping to a transcriptome is very subjective, as there are a huge number of ways to design one. Add a single gene, or a single transcript, and the mappings of all reads may be affected. So, how do you choose which transcripts and isoforms to include? All of them? Just the longest for each gene? Just a full concatenation of all exons per gene? Just the ones that were known prior to date XYZ, or also the two new ones your lab found that you think are relevant? You'll get different results based on this purely subjective decision, possibly allowing results to be tweaked as desired.
    Excellent points, Brian. I hadn't thought of it this way, in terms of the repeatability aspect. In my analysis I've limited the mapping to just the longest isoform in the annotation. Nevertheless, I'm curious to see how the results compare when I get back to my desk tomorrow.

