edgeR/DESeq and multi-mapping reads

Lars_R replied

08-20-2013, 01:25 AM
That is interesting. I have similar data in rat, where the problem was not nearly as big (rRNA ~ 30%).

I have used bowtie2 directly so far, but for no other reason than the fact that I was more familiar with it. I will run the analysis again, using tophat2.
dpryan replied

08-16-2013, 06:13 AM
I guess I'm not surprised that it's rRNA. When we ran gels of RNA from mouse sperm, we saw a LOT of that, so it makes sense that it's similar in humans. Did you do ribosomal depletion? I suspect we're working on related things in different organisms.

Filtering by flag won't be that useful, since you probably had the aligner output only the best match. In general, you can filter by MAPQ. Are you using bowtie2 directly or through tophat2? If you're using bowtie2 through tophat2, which would be the normal way to go about things, then I think a MAPQ of 255 still means a unique alignment. If you're actually using bowtie2 directly, then you might just raise your MAPQ threshold (bowtie2's MAPQs only vaguely correspond to reality).
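To make the MAPQ suggestion concrete, here is a minimal sketch in plain Python that keeps only aligned records at or above a chosen MAPQ. The function name `filter_by_mapq`, the threshold of 10, and the toy records are all assumptions for illustration; a sensible cutoff depends on the aligner and the data.

```python
def filter_by_mapq(sam_lines, min_mapq):
    """Yield SAM header lines plus aligned records with MAPQ >= min_mapq."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        mapq = int(fields[4])   # column 5: mapping quality
        if flag & 0x4:
            # 0x4 means the read is unmapped; skip it.
            continue
        if mapq >= min_mapq:
            yield line


# Toy records with hypothetical read names and values.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "readA\t0\tchr1\t100\t42\t50M\t*\t0\t0\t*\t*",
    "readB\t0\tchr1\t200\t1\t50M\t*\t0\t0\t*\t*",
    "readC\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

kept = [
    line.split("\t")[0]
    for line in filter_by_mapq(toy_sam, min_mapq=10)
    if not line.startswith("@")
]
print(kept)
```

On a real BAM file the same filter is what `samtools view -q` applies; the sketch only shows which column is being tested.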
Lars_R replied

08-16-2013, 05:15 AM
As suggested by dpryan, I looked at where the reads map. About 70% map to the rRNA genes in the RepeatMasker track. Fortunately (I guess) this turns out to be expected in sperm, since newer data show that rRNA is degraded but still present in large amounts, and similar levels have been observed (doi: 10.1093/molehr/gar054). Being mainly interested in epigenetics, I am thus tempted to exclude rRNA genes from further analysis, but that still leaves me with quite a few multimappers.

I read somewhere on seqanswers about the issues with multimappers:
Assigning multimappers to every location they map to can lead to some genes being perceived as differentially expressed. Say gene1 is highly expressed and gene2 lowly expressed: if many reads map to both genes, a small change in gene1 can make gene2 appear differentially expressed (the same problem arises when reads are assigned randomly).
Removing multimappers can lead to genes being lost. Say gene3 has a pseudogene: most reads would map to both and thus be discarded.
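A toy calculation may help show both effects. Every number here is made up for illustration (the read totals, the shared fractions, and the helper names `count_to_all` and `count_unique_only` are assumptions, not values from the data discussed in this thread).

```python
def count_to_all(true_gene1, true_gene2, shared_fraction):
    """Counts when gene1 reads that also match gene2 are credited to both genes."""
    shared = true_gene1 * shared_fraction
    return {"gene1": true_gene1, "gene2": true_gene2 + shared}


def count_unique_only(true_reads, shared_fraction):
    """Counts left for a gene once its multimapping reads are discarded."""
    return true_reads * (1.0 - shared_fraction)


# gene2 is truly unchanged (10 reads); only gene1 changes between conditions.
cond_a = count_to_all(true_gene1=1000, true_gene2=10, shared_fraction=0.2)
cond_b = count_to_all(true_gene1=1500, true_gene2=10, shared_fraction=0.2)
apparent_fold_gene2 = cond_b["gene2"] / cond_a["gene2"]
print(f"apparent gene2 fold change: {apparent_fold_gene2:.2f}")

# gene3 shares 95% of its reads with a pseudogene, so most of it disappears.
gene3_kept = count_unique_only(true_reads=500, shared_fraction=0.95)
print(f"gene3 reads kept after removing multimappers: {gene3_kept:.0f}")
```

With these made-up numbers gene2 appears to change by roughly 1.5-fold although its true expression is constant, and gene3 keeps only about 5% of its reads.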

To start with I might look at the unique ones and meanwhile sort the multimappers, removing those that map to features I'm not so interested in, and doing as Wolfgang suggested.

Sorting the reads into single- and multi-mappers has been more difficult. Maybe I am doing it wrong, but the primary FLAG does not seem to agree with what bowtie2 reports: none of the reads have the 0x100 (non-primary) FLAG set.
http://sourceforge.net/apps/mediawik...from_SAM.2FBAM. suggests filtering based on the quality, using "-q 1" to get reliable hits, but about half my reads pass that filter, although only 3% are uniquely mapping according to bowtie2.

Am I missing something?
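One possible way to reconcile these numbers, offered as a sketch and not a confirmed diagnosis: in its default mode bowtie2 writes a single record per read, so 0x100 is never set, and it adds an XS:i tag only when it found another alignment for that read. Counting reads by the presence of XS:i might therefore come closer to bowtie2's own summary than FLAG or "-q 1" does. The function name and toy records are assumptions, and the XS:i reading applies to bowtie2 output only (TopHat uses an XS:A tag for strand).

```python
from collections import Counter


def classify_alignment(sam_line):
    """Label one SAM record as unaligned, secondary, unique or multi."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:
        return "unaligned"
    if flag & 0x100:
        return "secondary"
    # Optional fields start at column 12 and look like TAG:TYPE:VALUE.
    tags = {}
    for field in fields[11:]:
        name, _type, value = field.split(":", 2)
        tags[name] = value
    # Assumption: XS:i present means bowtie2 found another alignment.
    return "multi" if "XS" in tags else "unique"


# Toy bowtie2-style records (hypothetical names, positions and scores).
toy_records = [
    "r1\t0\tchr1\t100\t42\t50M\t*\t0\t0\t*\t*\tAS:i:0",
    "r2\t0\tchr1\t300\t1\t50M\t*\t0\t0\t*\t*\tAS:i:-5\tXS:i:-5",
    "r3\t0\tchr2\t500\t23\t50M\t*\t0\t0\t*\t*\tAS:i:0\tXS:i:-12",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\tYT:Z:UU",
]

labels = [classify_alignment(rec) for rec in toy_records]
summary = Counter(labels)
print(summary)

# r2 and r3 both have MAPQ >= 1 yet carry XS:i, so "-q 1" would keep them.
passes_q1 = [rec.split("\t")[0] for rec in toy_records
             if not int(rec.split("\t")[1]) & 0x4 and int(rec.split("\t")[4]) >= 1]
print(passes_q1)
```

In the toy data three reads pass a MAPQ cutoff of 1 but only one lacks XS:i, which is the same kind of gap as the one described above.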

Last edited by Lars_R; 08-16-2013, 05:18 AM. Reason: Formating
bruce01 replied

08-16-2013, 02:52 AM
@dpryan, thanks for the detailed reply. I really just want to characterize the extent and distribution of multimappers so I can put my mind at ease about selecting 'primary' alignments with a simple awk on the flags. It is also purely out of interest: how many multimappers map to the same feature? Is the distribution of multimappers the same across samples and conditions? I fully appreciate that multimappers exist due to protein domains etc., and have even discussed doing work on this with my PI. But for this instance, thankfully, I am not asking any biological questions at all =D
dpryan replied

08-16-2013, 02:04 AM
I'm pretty uncomfortable with the idea of assigning a count to both features. I would suggest that it's a much better idea to simply exclude those reads (I would expect an effective increase in variance from including these sorts of reads, which is usually the exact opposite of what's desired). In most cases (i.e., likely not Lars'), multimapping is due simply to the fact that reads are relatively short and there are a lot of shared elements (protein motifs, gene families whose members have regions of high homology, etc.) or low-complexity areas throughout the genome. In Lars' case, these multimappers might also arise from, for example, Alu or other repeats that could be aberrantly transcribed in sperm. The literature also suggests that RNA is generally degraded in sperm (though I'm not sure how true this actually is, since these are older papers), so one could suggest that only certain classes of gene families are spared from this, which could thereby lead to such an over-abundance of multimappers. There are more possibilities, but you get the gist.

Regarding using multimappers as evidence of duplication or such, that's sort of putting the cart before the horse (at least in the case of RNAseq; this is commonly done with exome or other DNAseq experiments). If you already have a reference genome against which you're aligning reads, it would generally make more sense to simply BLAST your gene (or BAC, since this is human) and see if it pops up elsewhere. Granted, multimappers might suggest that there's something unannotated, though many aligners (e.g. tophat) will map to the transcriptome first, so you wouldn't end up seeing these anyway. Using a Fisher's test with RNAseq data here would give you an answer, just not to the biological question that you're asking.
bruce01 replied

08-16-2013, 12:56 AM
I have been wondering about this recently, though I only have ~10% multimappers vs. unique so might not be as crucial as for OP.

My strategy is to take all multimappers and find where they map to; if to the same feature then remove all but 'primary' alignment; else make 'new' reads. So for example 'read' maps to 2 features, I make 'read_1' which maps to feature_1, 'read_2' mapping to feature_2. I do this because randomly assigning to feature_1 or feature_2 skews data more than assigning to both. Being conservative we would just throw out the data, but for OP that is not really an option (and I feel it is a waste too...).
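The strategy described here could be sketched roughly as follows. The input layout (a dict from read name to the features hit by each of its alignments, primary alignment first) and the function name are assumptions made for the example. Note that this credits one fragment to several features, which is exactly what dpryan's reply argues against.

```python
from collections import Counter


def split_multimappers(read_to_features):
    """Return (read_name, feature) pairs following the strategy in the post.

    read_to_features maps a read name to the features hit by each of its
    alignments, with the primary alignment listed first (assumed layout).
    """
    assignments = []
    for read, features in read_to_features.items():
        # Drop repeated features while keeping the original order.
        distinct = list(dict.fromkeys(features))
        if len(distinct) == 1:
            # All alignments fall in one feature: keep only the primary.
            assignments.append((read, distinct[0]))
        else:
            # Different features: make one 'new' read per feature.
            for i, feature in enumerate(distinct, start=1):
                assignments.append((f"{read}_{i}", feature))
    return assignments


# Toy input with hypothetical read and feature names.
hits = {
    "readA": ["feature_1"],
    "readB": ["feature_1", "feature_1"],
    "readC": ["feature_1", "feature_2"],
}

assigned = split_multimappers(hits)
counts = Counter(feature for _read, feature in assigned)
print(assigned)
print(counts)
```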

Also, can multimapping be used as evidence of duplications, pseudogenes etc.? Has anyone published on this sort of analysis? Wolfgang, is this what you mean by 'paralog equivalence classes'? Would a basic over-representation test work, i.e. Fisher's exact test on the multimappers shared by two genes vs. those shared by each of the two genes and all other genes?
Wolfgang Huber replied

08-15-2013, 12:35 PM
Originally posted by Lars_R:

The other options seem to be
1. randomly assigning the multi-reads, as bowtie2 usually does, if I am correct (but how does that affect the models, when so much is randomly assigned?)

2. saving the n best hits and using that mapping, but I believe edgeR/DESeq assumes each count is a unique read, so that might not be good either.

Neither of these is helpful for the purpose of differential expression analysis. As dpryan suggests, a good way to move forward would seem to be to identify the paralogous regions in the genome that attract these multimappers, and to combine them into 'paralog equivalence classes'. In this way, while you cannot map these reads uniquely to genomic loci, you can map them uniquely to these equivalence classes, and then proceed with DESeq2 as usual.

Hope this helps
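One way the 'paralog equivalence classes' idea could be prototyped is sketched below. It is not the DESeq2 authors' implementation: the union-find grouping, the function names, and the toy gene names are assumptions. Genes that share at least one multimapping read are merged into a class, and each read is then counted once for its class.

```python
from collections import Counter


def build_classes(read_to_genes):
    """Map each gene to a class label; genes sharing a read share a class."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for genes in read_to_genes.values():
        genes = sorted(set(genes))
        if not genes:
            continue
        root = find(genes[0])
        for gene in genes[1:]:
            other = find(gene)
            if other != root:
                # Keep the alphabetically smaller root for stable labels.
                low, high = min(root, other), max(root, other)
                parent[high] = low
                root = low

    members = {}
    for gene in list(parent):
        members.setdefault(find(gene), set()).add(gene)
    return {gene: "|".join(sorted(members[find(gene)])) for gene in list(parent)}


def count_by_class(read_to_genes):
    """Count each read exactly once, for the class its genes belong to."""
    class_of = build_classes(read_to_genes)
    counts = Counter()
    for genes in read_to_genes.values():
        if genes:
            counts[class_of[genes[0]]] += 1
    return counts


# Toy input with hypothetical gene names.
toy_hits = {
    "r1": ["geneA"],
    "r2": ["geneA", "geneA_ps"],
    "r3": ["geneA_ps"],
    "r4": ["geneB"],
    "r5": ["geneB", "geneC"],
    "r6": ["geneC", "geneD"],
}

class_counts = count_by_class(toy_hits)
print(class_counts)
```

The resulting table has one integer count per class and sample, which is the kind of input a count-based model expects. One caveat of this simple grouping is that classes chain transitively (geneB and geneD end up together through geneC), so on real data a single promiscuous repeat could merge many genes into one large class.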
dpryan replied

08-13-2013, 05:59 AM
The ones I have locally are mouse, so I can't give any specific recommendations from personal experience. Have a look at GEO datasets GSE39527 and GSE42326 to get a start (you can just search for "((sperm) AND homo sapiens[Organism]) AND "high throughput sequencing"[Platform Technology Type]" in GEO for a full list). The raw reads are available in SRA, which you could alternatively search to get the same results.
Lars_R replied

08-13-2013, 05:50 AM
Not yet, though that is a good idea. I was still trying to figure out how to approach the analysis.

The data is human. Are there any datasets you can recommend?
dpryan replied

08-13-2013, 05:47 AM
Have you looked at where the multi-mappers are mapping? Depending on the organism you're using, there are a few sperm RNAseq datasets out there, so you might be able to compare some of your results with those of others to see if something has gone amiss.
Lars_R started a topic edgeR/DESeq and multi-mapping reads

08-13-2013, 05:20 AM
edgeR/DESeq and multi-mapping reads

I have an RNAseq data-set in which the majority (90%) of the reads are multi-reads (multi-hits / multi-mapped reads). The data is from sperm, so the high level of multi-reads may not be a sign that something is terribly wrong (they could come from e.g. piRNA).

But I am unsure how best to handle these reads. I do not particularly care about alternative splicing, so I was thinking of using edgeR/DESeq rather than Cufflinks/Cuffdiff.

Simple probabilistic assignments of the reads would not work with edgeR/DESeq. In this post http://seqanswers.com/forums/showthread.php?t=26661 using Cufflinks to assign the reads and estimate raw counts is suggested, but does that work with the models used in edgeR/DESeq?

The other options seem to be
1. randomly assigning the multi-reads, as bowtie2 usually does, if I am correct (but how does that affect the models, when so much is randomly assigned?)

2. saving the n best hits and using that mapping, but I believe edgeR/DESeq assumes each count is a unique read, so that might not be good either.

How would you handle such data?
Tags: cuffdiff, cufflinks, deseq, edger, multihit reads
