Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to rescue multi-reads when using htseq to generate edgeR/DESeq counts?

    Hi. After using Bowtie2/Tophat2 for mapping an RNA-Seq study, I ran my data through htseq and then have aimed for methods like edgeR/DESeq that allow a neg. binomial distribution -- I gathered for differential gene expression, this may be more appropriate than Cufflinks/CuffDiff...?

    Yet because we just have Single End reads (100 bp HiSeq), there are a lot of multi-reads (=multi-hits / multi-mapped reads). HTSeq scores these as ambiguous and discards them; I'd like to rescue the multi-reads. Yet the Mortazavi method E-RANGE uses RPKM, and I'm unsure if I could reliably convert it back to raw counts for edgeR/DESeq. I found a newer package (BM-MAP) but am concerned (1) of the level of dependence it has on the estimate of polymorphism rates (if this is critical -- that could be a problem because I'm not sure if I can ascertain too reliably transcriptome-wide), and (2) I'm not sufficiently adept at coding to know how to translate the sam+ file it generates (appending a mapping probability to the end of each line in the sam file) into something I can put through htseq or some other program to generate counts, or to write my own script to do this. (3) Even if I CAN make this work, is it viable (or is the uncertainty due to using a Probability for multi-mapped, enough to make the DESeq/edgeR output unreliable)?

    I'm tempted to just run Cuffdiff instead -- to avoid losing the potential information gleaned by including multi-reads (for downstream analysis it may be enough to know if a certain gene family is up/down-regulated, even if we can't get at the exact gene). Yet again I gather that may be more for isoform-specific comparisons vs differential Gene expression?

    If anyone has time, suggestions are very welcome!
    ~Hilary

  • #2
    Hi Hilary, I am facing the same problem. At first I wrote a wrapper for a programme called SEQEM (http://www.ncbi.nlm.nih.gov/pubmed/21385047), which worked fine until the amount of data we were producing became too much for the programme *Snag 1* I was then able to get BM-MAP to work and use the sam+ files to assign the most probable location, but it was also unable to cope with the level of data that we are producing from a HiSeq2000 run. *Snag 2*

    One possibility for you would be to use RSEM. This uses an expectation maximisation approach to assign multiply mapped reads (for which the 'expected_count' output can be used in edgeR/DESeq or the programme EBSeq can, apparently, deal with RSEM output to give you DEGs). However, with RSEM you have to align to a transcriptome unless your genome is unspliced, as RSEM cannot cope with spliced reads. *Snag 3*

    If you are not using matched samples then I believe that Cuffdiff would be a very good choice for you as it can deal with gene-level as well as isoform-level. My issue is that I have paired samples (disease and non-disease tissues from the same patient) and Cuffdiff does not use this info, meaning I don't get the power I want from the study. *Snag 4* I am, therefore, wanting to use edgeR or DESeq... which means I am back to square one with how to assign multireads so I can get my count data....

    I am going to attempt to use Cufflinks to assign multiply mapped reads in the probabilistic manner it employs using the -u parameter, and then use the output FPKM values, per sample, to get count data (i.e. multiply by gene length and reads mapped) and then make use of the paired tests in edgeR (*phew*) but if anyone else has suggestions/comments regarding the issue of both a) using multireads in my analysis and b) performing tests on matched samples, I would love to hear them.

    As an aside, I am also using RSEM so I can see how the 2 different approaches affects the lists of significantly DEGs I get at the end. If you are interested in the results I will post them here

    Cheers

    EDIT: I have realised that there is a raw_frags column in the read_group_tracking output from cuffdiff so I am going to use that rather than fudging the FPKM
    Last edited by SEQquestions; 05-02-2013, 11:05 PM. Reason: spelling mistakes and a little more info needed

    Comment


    • #3
      Hi. Statistically our design is more complex than pairwise comparisons, so we also have trouble applying Cufflinks/Cuffdiff. I'm not (yet ... pending if I can find the time to teach myself) adept at perl/python parsing so I admit I was also a bit swayed against BM-MAP due to being unsure of an efficient way to convert the sam+ into a usable, standard sam format.

      I know there are some issues in converting from the Cufflinks output to edgeR and other count-based approaches (http://seqanswers.com/forums/showthread.php?t=5793). Though I think there's some option for raw counts now vs FPKM in one of the outputs.
      If memory serves me right, RSEM also uses the FPKM approach and wouldn't be easily compatible with edgeR/DESeq.

      So I have yet to find a good solution. I was thinking of a hybrid approach: basing most analysis on the more conservative (and hopefully easily defensible) route of discarding multi-mapped reads (using htseq with the "union" option, then edgeR) and just for comparison, perhaps also reporting what a Cufflinks/Cuffdiff run would yield. It's not ideal, especially because our data is single-end so of course we can't assign all reads -- and for our purposes it'd be good enough to know if a read belonged to gene family X (or to gene Y vs knowing the exact isoform -- we're looking for the big picture first).

      Thank you for your insight. The Tuxedo suite/Cufflinks has some really nice features, as does the htseq/edgeR/DESeq route; it's too bad there isn't a way to combine the best of both worlds. I think the fact that the uncertainty from mapping multi-hits isn't accounted for with edgeR etc., also adds a level of error -- and that the extent of the possible error / P value inflation isn't known (unless someone has done a simulation study I'm not aware of ...).
      Best,
      Hilary

      Comment


      • #4
        See here for an explanation of the rational behind htseq-count's discarding of non-unique alignments: http://sourceforge.net/p/htseq/support-requests/10/

        Comment

        Latest Articles

        Collapse

        • seqadmin
          The Impact of AI in Genomic Medicine
          by seqadmin



          Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
          02-26-2024, 02:07 PM
        • seqadmin
          Multiomics Techniques Advancing Disease Research
          by seqadmin


          New and advanced multiomics tools and technologies have opened new avenues of research and markedly enhanced various disciplines such as disease research and precision medicine1. The practice of merging diverse data from various ‘omes increasingly provides a more holistic understanding of biological systems. As Maddison Masaeli, Co-Founder and CEO at Deepcell, aptly noted, “You can't explain biology in its complex form with one modality.”

          A major leap in the field has
          ...
          02-08-2024, 06:33 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 02-28-2024, 06:12 AM
        0 responses
        28 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 02-23-2024, 04:11 PM
        0 responses
        74 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 02-21-2024, 08:52 AM
        0 responses
        83 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 02-20-2024, 08:57 AM
        0 responses
        69 views
        0 likes
        Last Post seqadmin  
        Working...
        X