
**SEQquestions** · 05-02-2013, 10:41 PM

Hi Hilary, I am facing the same problem. At first I wrote a wrapper for a programme called SEQEM (http://www.ncbi.nlm.nih.gov/pubmed/21385047), which worked fine until the amount of data we were producing became too much for the programme. *Snag 1* I was then able to get BM-MAP to work and use the sam+ files to assign the most probable location, but it was also unable to cope with the level of data that we are producing from a HiSeq2000 run. *Snag 2*

One possibility for you would be to use RSEM. This uses an expectation maximisation approach to assign multiply mapped reads (for which the 'expected_count' output can be used in edgeR/DESeq or the programme EBSeq can, apparently, deal with RSEM output to give you DEGs). However, with RSEM you have to align to a transcriptome unless your genome is unspliced, as RSEM cannot cope with spliced reads. *Snag 3*
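To make the expectation maximisation idea concrete, here is a toy Python sketch of how multiply mapped reads can be shared out between transcripts. It is not RSEM's actual implementation: the function name `em_assign`, the read and length inputs, and the simple model (a read's weight for a transcript is the transcript's current abundance divided by its length) are all assumptions made for illustration.

```python
def em_assign(reads, lengths, n_iter=50):
    """Toy EM for multireads (illustrative only, not RSEM itself).

    reads   -- list of tuples; each tuple holds the transcript ids one read aligns to
    lengths -- dict mapping transcript id to length in bp
    Returns a dict of expected read counts per transcript.
    """
    transcripts = sorted(lengths)
    # Start from equal abundances (fraction of reads coming from each transcript).
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        # E-step: split each read across its hits in proportion to theta / length.
        counts = {t: 0.0 for t in transcripts}
        for hits in reads:
            weights = {t: theta[t] / lengths[t] for t in hits}
            total = sum(weights.values())
            for t, w in weights.items():
                counts[t] += w / total
        # M-step: re-estimate abundances from the expected counts.
        n = sum(counts.values())
        theta = {t: counts[t] / n for t in transcripts}
    return counts


# Hypothetical data: 8 reads unique to A, 2 unique to B, 10 reads hitting both.
reads = [("A",)] * 8 + [("B",)] * 2 + [("A", "B")] * 10
lengths = {"A": 1000, "B": 1000}
expected = em_assign(reads, lengths)
print({t: round(c, 2) for t, c in expected.items()})  # A ends up near 16, B near 4
```

The shared reads end up split roughly 8:2, following the evidence from the uniquely mapped reads, which is the intuition behind using an 'expected_count' column as input to a count-based test.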

If you are not using matched samples then I believe that Cuffdiff would be a very good choice for you as it can deal with gene-level as well as isoform-level analysis. My issue is that I have paired samples (disease and non-disease tissues from the same patient) and Cuffdiff does not use this info, meaning I don't get the power I want from the study. *Snag 4* I am, therefore, wanting to use edgeR or DESeq... which means I am back to square one with how to assign multireads so I can get my count data....

I am going to attempt to use Cufflinks to assign multiply mapped reads in the probabilistic manner it employs using the -u parameter, and then use the output FPKM values, per sample, to get count data (i.e. multiply by gene length and reads mapped) and then make use of the paired tests in edgeR (*phew*) but if anyone else has suggestions/comments regarding the issue of both a) using multireads in my analysis and b) performing tests on matched samples, I would love to hear them.
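The FPKM back-calculation described above can be sketched in Python as follows. This assumes the textbook definition FPKM = fragments × 10^9 / (length in bp × total mapped fragments); the function name `fpkm_to_fragments` and the example numbers are hypothetical, and because Cufflinks may normalise with an effective rather than a raw length, the result should be treated as an approximation only.

```python
def fpkm_to_fragments(fpkm, length_bp, total_mapped_fragments):
    """Invert FPKM = fragments * 1e9 / (length_bp * total_mapped_fragments).

    Assumes the plain FPKM definition; any effective-length or other
    normalisation applied upstream is NOT undone here.
    """
    return fpkm * length_bp * total_mapped_fragments / 1e9


# Hypothetical example: a 2,000 bp gene with FPKM 10 in a library
# of 20 million mapped fragments.
estimated = fpkm_to_fragments(10.0, 2000, 20_000_000)
print(round(estimated))  # 400
```

The values would need rounding to integers before going into edgeR or DESeq, which is part of why this counts as a fudge.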

As an aside, I am also using RSEM so I can see how the 2 different approaches affect the lists of significant DEGs I get at the end. If you are interested in the results I will post them here.

Cheers

EDIT: I have realised that there is a raw_frags column in the read_group_tracking output from cuffdiff so I am going to use that rather than fudging the FPKM

**Hilary April Smith** · 05-06-2013, 08:09 AM

Hi. Statistically our design is more complex than pairwise comparisons, so we also have trouble applying Cufflinks/Cuffdiff. I'm not (yet ... pending if I can find the time to teach myself) adept at perl/python parsing so I admit I was also a bit swayed against BM-MAP due to being unsure of an efficient way to convert the sam+ into a usable, standard sam format.

I know there are some issues in converting from the Cufflinks output to edgeR and other count-based approaches (http://seqanswers.com/forums/showthread.php?t=5793). Though I think there's some option for raw counts now vs FPKM in one of the outputs.
If memory serves me right, RSEM also uses the FPKM approach and wouldn't be easily compatible with edgeR/DESeq.

So I have yet to find a good solution. I was thinking of a hybrid approach: basing most analysis on the more conservative (and hopefully easily defensible) route of discarding multi-mapped reads (using htseq with the "union" option, then edgeR) and just for comparison, perhaps also reporting what a Cufflinks/Cuffdiff run would yield. It's not ideal, especially because our data is single-end so of course we can't assign all reads -- and for our purposes it'd be good enough to know if a read belonged to gene family X (or to gene Y vs knowing the exact isoform -- we're looking for the big picture first).
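For anyone wanting to see how many reads the conservative route would throw away, here is a minimal Python sketch that splits SAM records into uniquely and multiply mapped reads using the standard `NH:i:` optional tag. It is not htseq-count's code: the function names and the example records are made up for illustration, and it assumes the aligner writes an NH tag (records without one are treated as unique here).

```python
def is_unique(sam_line):
    """Return True if the SAM record reports a single alignment (NH:i:1).

    Assumption: a record with no NH tag is treated as uniquely mapped.
    """
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional fields start at column 12
        if tag.startswith("NH:i:"):
            return int(tag[5:]) == 1
    return True


def split_reads(sam_lines):
    """Return (unique_read_names, multi_read_names), skipping header lines."""
    unique, multi = [], []
    for line in sam_lines:
        if line.startswith("@"):
            continue
        name = line.split("\t", 1)[0]
        (unique if is_unique(line) else multi).append(name)
    return unique, multi


# Hypothetical records: read1 maps once, read2 is reported with three alignments.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tchr1\t100\t50\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "read2\t0\tchr2\t200\t0\t50M\t*\t0\t0\t*\t*\tNH:i:3",
]
unique, multi = split_reads(example)
print(len(unique), "unique;", len(multi), "multi-mapped")
```

Comparing the two tallies per sample gives a rough idea of how much signal the discard-multireads approach gives up.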

Thank you for your insight. The Tuxedo suite/Cufflinks has some really nice features, as does the htseq/edgeR/DESeq route; it's too bad there isn't a way to combine the best of both worlds. I think the fact that the uncertainty from mapping multi-hits isn't accounted for with edgeR etc., also adds a level of error -- and that the extent of the possible error / P value inflation isn't known (unless someone has done a simulation study I'm not aware of ...).
Best,
Hilary

**Simon Anders** · 05-06-2013, 11:07 AM

See here for an explanation of the rationale behind htseq-count's discarding of non-unique alignments: http://sourceforge.net/p/htseq/support-requests/10/


How to rescue multi-reads when using htseq to generate edgeR/DESeq counts?
