Seqanswers Leaderboard Ad

**dpryan** · 08-13-2013, 05:47 AM

Have you looked at where the multi-mappers are mapping? Depending on the organism you're using, there are a few sperm RNAseq datasets out there, so you might be able to compare some of your results with that of others to see if something has gone amiss.

**Lars_R** · 08-13-2013, 05:50 AM

Not yet, though that is a good idea. I was still trying to figure out how to approach the analysis.

The data is human, are there any datasets you can recommend?

**dpryan** · 08-13-2013, 05:59 AM

The ones I have locally are mouse, so I can't give any specific recommendations from personal experience. Have a look at GEO datasets GSE39527 and GSE42326 to get a start (you can just search for "((sperm) AND homo sapiens[Organism]) AND "high throughput sequencing"[Platform Technology Type]" in GEO for a full list). The raw reads are available in SRA, which you could alternatively search to get the same results.

**Wolfgang Huber** · 08-15-2013, 12:35 PM

Originally posted by Lars_R View Post

The other options seems to be
1. randomly assigning the multi-reads, how bowtie2 usually does, if I am correct (but how does that affect the models, when so much is randomly assigned?)

2. saving the n best hits and using that mapping, but I believe edgeR/DESeq assumes each count is a unique read, so that might not be good either.

Neither of these is helpful for the purpose of differential expression analysis. As dpryan suggests, a good way to move forward would seem to be to identify the paralogous regions in the genome that attract these multimappers, and to combine them into 'paralog equivalence classes'. In this way, while you cannot map these reads uniquely to genomic loci, you can map them uniquely to these equivalence classes, and then proceed with DESeq2 as usual.

Hope this helps

**bruce01** · 08-16-2013, 12:56 AM

I have been wondering about this recently, though I only have ~10% multimappers vs. unique so might not be as crucial as for OP.

My strategy is to take all multimappers and find where they map to; if to the same feature then remove all but 'primary' alignment; else make 'new' reads. So for example 'read' maps to 2 features, I make 'read_1' which maps to feature_1, 'read_2' mapping to feature_2. I do this because randomly assigning to feature_1 or feature_2 skews data more than assigning to both. Being conservative we would just throw out the data, but for OP that is not really an option (and I feel it is a waste too...).

Also, can multimapping be used as evidence of duplications, pseudogenes etc? Has anyone published on this sort of analysis? Wolfgang, is this what you mean by 'paralog equivalence classes'? Would a basic over-representation work, i.e. Fishers Exact test on two genes found vs. those two genes with all other genes?

**dpryan** · 08-16-2013, 02:04 AM

I'm pretty uncomfortable with the idea of assigning a count to both features, I would suggest that it's a much better idea to simply exclude those reads (I would expected an effective increase in variance from including these sorts of reads, which is usually the exact opposite of what's desired). In most cases (i.e., likely not Lars'), multimapping is due simply to the fact that reads are relatively short and there are a lot of shared elements (protein motifs, gene families where the members have high homology regions, etc.) or low complexity areas throughout the genome. In Lars' case, these multimappers might also arise from, for example, Alu or other repeats that could be aberrantly transcribed in sperm. The literature also suggests that RNA is generally degraded in sperm (though I'm not sure how true this actually is, these are from older papers), so one could suggest that only certain classes of gene families are spared from this, which could thereby lead to such an over-abundance of multimappers. There are more possibilities, but you get the gist.

Regarding using multimappers as evidence of duplication or such, that's sort of putting the cart before the horse (at least in the case of RNAseq, as this is commonly done with exome or other DNAseq experiments). If you already have a reference genome against which you're aligning reads, it would generally make more sense to simply blast your gene (or BAC, since this is human) and see if it pops up elsewhere. Granted, multimappers might suggest that there's something unannotated, though many aligners (e.g. tophat) will map to the transcriptome first, so you wouldn't end up seeing these anyway. Using a Fisher's test with RNAseq data here would give you an answer, just not to the biological question that you're asking

**bruce01** · 08-16-2013, 02:52 AM

@dpryan, thanks for the detailed reply. I really just want to classify the extent and distribution of multimappers so I can put my mind at ease about using 'primary' alignments by a simple awk of flags. Also purely out of interest, like how many multimappers are to same features? Is the distribution of multimappers the same across samples, conditions? I fully appreciate that multimappers exist due to protein domains etc, have even discussed with my PI about doing work on this. But for this instance, thankfully I am not asking any biological questions at all=D

**Lars_R** · 08-16-2013, 05:15 AM

As suggested by by dpryan I look at the locations of where the reads map. About 70% map to the repeatMasker track's rRNA genes. Fortunately (I guess) this turns out to be expected with sperm, since new data shows that rRNA is degraded but still present in large amounts, and similar levels are observed (doi: 10.1093/molehr/gar054). Being mainly interested in epigenetics, I am thus tempted to exclude rRNA genes from further analysis, but that still leaves me with quite a few multimappers.

I read somewhere on seqanswers the issues with multimappers:
Assigning multimappers to all reads can lead to some genes being perceived as differentially expressed, say gene1 is highly expressed and gene2 lowly expressed. If many reads map to both genes a small change in gene1 can cause gene2 to appear differentially expressed (same problem when randomly assigning reads)
Removing multimappers can lead to genes being lost, say gene3 has a pseudo-gene, most reads would map to both and thus be discarded.

To start with I might look at the unique ones and meanwhile sort the multimappers, removing those that map to features I'm not so interested in, and doing as Wolfgang suggested.

Sorting the reads into single- and multi-mappers have been more difficult, maybe I am doing it wrong but the primary FLAG does not seem to agree with what bowtie2 reports. None of the reads have the 0x100 (non-primary) FLAG.
http://sourceforge.net/apps/mediawik...from_SAM.2FBAM. suggests filtering based on the quality, using "-q 1" to get reliable hits, but about half my reads pass that filter, although only 3% are uniquely mapping according to bowtie2.

Am I missing something?

**dpryan** · 08-16-2013, 06:13 AM

I guess I'm not surprised that it's rRNA. When we ran gels of RNA from mouse sperm, we saw a LOT of that, so I'm not surprised its similar in humans. Did you do ribosomal depletion? I suspect we're working on related things in different organisms

Filtering by flag won't be that useful, since you probably had the aligner only output the best match. In general, you can filtering by MAPQ. Are you using bowtie2 directly or through tophat2? If you're using bowtie2 through tophat2, which would be the normal way to go about things, then I think a MAPQ of 255 still means unique alignment. If you're actually using bowtie2 directly, then you might just up your MAPQ threshold (bowtie2's MAPQs only vaguely correspond to reality).

**Lars_R** · 08-20-2013, 01:25 AM

That is interesting. I have similar data in rat, where the problem was not nearly as big (rRNA ~ 30%).

I have used bowtie2 directly so far, but for no other reason than the fact that I was more familiar with it. I will run the analysis again, using tophat2.

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 24 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

edgeR/DESeq and multi-mapping reads

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News