Isoform expression quantification from RNA-seq: flood of tools


  • Maayanster
    replied
    I'm posting some data from various tools compared to wet lab validated data. There are 3 datasets for 3 different types of analysis:
    • Pairwise DE - ALEXA seq data. In house libraries for two cell lines: MIP101 and MIP5FU
      validation data: q-pcr from Griffith et al. (in Alexa-seq paper). Expression values for each library, as well as fold changes provided. http://www.nature.com/nmeth/journal/...meth.1503.html
    • Group DE - 6 in-house libraries from the MAGIC project, groups 3 and 4 (3 libs each)
      validation data: Microarray expression results from first MAGIC paper (Northcott et al 2010). http://jco.ascopubs.org/content/earl....4324.abstract This data is a list of differentially expressed genes between groups 3 and 4, not a quantification on the isoform level. It doesn't provide a real "ground truth" set, but rather just a subset of genes and transcripts that may be biologically interesting to look at.


    The first two datasets have wet-lab experimental transcript expression values. The third dataset doesn't have actual validated transcript expression values to compare to, so I just did pairwise comparisons between the tools for a subset of transcripts from interesting genes.

    The results are in this Google document. I didn't spend any time making it look nice, so you will need to zoom in to see the plots.
    https://docs.google.com/document/d/1...it?usp=sharing

    For single-library quantification, SailFish wins since it's the fastest and is qualitatively equal to the next best option, eXpress. For pairwise DE it's less clear. For group DE we can't really draw any conclusions because there's no ground truth.

    ** Note: cufflinks was sometimes run twice, with different alignments: "cuff gsc" means cufflinks was run on in-house spliced alignments, "cuff tophat" means cufflinks was run on tophat alignments.



  • Maayanster
    replied
    Hi Magnus,
    Thanks for this! I did find the email. I will re-send it.



  • magnusr
    replied
    Hi, this is a useful overview, thanks for putting it together.

    I'm one of the BitSeq authors. You mentioned that you hit a bug in stage 2, but Peter Glaus (the main BitSeq developer) doesn't have a record of a bug report. Could you email me or Peter with the details?

    Best wishes,
    Magnus



  • Maayanster
    replied
    I've been searching for a while for a decent set of transcript-specific qPCR validations to compare these tools against ground truth, and recent reading has yielded some info. Unfortunately a lot of the info is buried deep in the supplementary material. Here's a summary.

    The ALEXA paper http://www.nature.com/nmeth/journal/...meth.1503.html
    1. transcript expression
    - qpcr on two colorectal cancer cell lines, MIP101 and MIP5FU. 192 amplicons in 152 genes representing various event types (skipped exons, etc.)
    qpcr validation results are here: http://www.alexaplatform.org/alexa_s...ltsPackage.zip

    the cuffdiff2 paper http://www.nature.com/nbt/journal/v3.../nbt.2450.html (mostly in the supplementary)
    1. gene expression:
    - MAQC qpcr dataset (Brain and stratagene UHR as treatment and control)

    2. transcript expression:
    - ALEXA q-pcr dataset
    - simulated data using their own protocol

    For the genes, they compare both the quantification and fold-changes between the two treatments
    For the transcripts, they have plots that separate the fold-change comparisons into deciles by the level of expression.
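A rough sketch of what "fold changes separated into deciles by expression level" could look like in practice. This is only one plausible reading of the description above, not the cuffdiff2 authors' code: binning by the mean of the two conditions and the pseudocount of 1.0 are both assumptions, and the function names are made up for illustration.

```python
import math
from collections import defaultdict


def log2_fold_change(a, b, pseudocount=1.0):
    """log2(b / a), with a pseudocount (assumed value) to avoid dividing by zero."""
    return math.log2((b + pseudocount) / (a + pseudocount))


def decile_index(values):
    """Assign each value a decile by rank: 0 = lowest 10%, 9 = highest 10%."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    deciles = [0] * n
    for rank, i in enumerate(order):
        deciles[i] = min(9, rank * 10 // n)
    return deciles


def fold_changes_by_decile(expr_a, expr_b):
    """Group per-transcript log2 fold changes by decile of mean expression."""
    mean_expr = [(a + b) / 2 for a, b in zip(expr_a, expr_b)]
    groups = defaultdict(list)
    for d, a, b in zip(decile_index(mean_expr), expr_a, expr_b):
        groups[d].append(log2_fold_change(a, b))
    return dict(groups)
```

Running this once on the qPCR values and once on a tool's estimates gives two sets of per-decile fold changes that can then be compared decile by decile.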

    The SailFish paper http://arxiv.org/pdf/1308.3700.pdf
    1. gene expression:
    - they also use the MAQC data for gene-level expression. They only use the brain data, without doing any DE.

    2. transcript expression
    - data simulated using the Flux Simulator

    For both types of comparisons to the "ground truth" they use four statistics: Pearson, Spearman, RMSE, and MedPE, which evaluate different types of variations.
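For anyone wanting to compute the same four statistics on their own estimates, here is a minimal standard-library sketch. The MedPE definition used here (median of |estimate - truth| / truth, as a percentage, over entries with non-zero truth) is an assumption on my part; check the SailFish paper for the exact definition they use.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation: sensitive to linear agreement in the raw values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with ties given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson on ranks, so only the ordering matters."""
    return pearson(ranks(x), ranks(y))


def rmse(truth, estimate):
    """Root-mean-square error: dominated by large absolute deviations."""
    return math.sqrt(sum((e - t) ** 2 for t, e in zip(truth, estimate)) / len(truth))


def medpe(truth, estimate):
    """Median percentage error (assumed definition), skipping zero-truth entries."""
    return median(abs(e - t) / t * 100 for t, e in zip(truth, estimate) if t > 0)
```

A tool that doubles every expression value would score perfectly on Pearson and Spearman but poorly on RMSE and MedPE, which is why reporting all four is more informative than any one alone.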

    The Nature Methods paper (Steijger et al.), also in the supplementary http://www.nature.com/nmeth/journal/...meth.2714.html
    1. transcript expression
    - a custom assay for 109 alternatively spliced genes using the NanoString nCounter in order to compare transcript quantification (no DE)

    The statistic they use for comparison is Pearson correlation.

    The MATS paper http://nar.oxfordjournals.org/content/40/8/e61.full
    the two treatments were a human breast cancer cell line (MDA-MB-231) with ectopic expression of the epithelial-specific splicing factor ESRP1 and an empty vector (EV) control
    1. transcript expression
    - RT-PCR for 164 exons that are known as regulated by the ESRP1 gene

    The RT-PCR data is available but one would have to ask the authors for sequenced libraries. This same data was also used in the recent rSeqDiff paper

    So it seems that the MAQC dataset (which was developed for microarray validations) is only useful for gene level evaluations. This made sense when I dug into the actual data. More info on that dataset here: http://www.biostars.org/p/85219/
    For transcript-level evaluations we have simulations, the ALEXA q-pcr data, and the new NanoString validation data from the Nature Methods paper.

    I've also been looking at the various RNA simulation methods available, but maybe I'll leave that for another thread.

    I've been using the Alexa validation data myself, which is very helpful for looking at pairwise comparisons, but not useful for comparing groups of libraries.

    Two notes from the blogosphere:
    1. People might like to take a look at the ****storm on Lior Pachter's blog over GTEx's isoform analysis choices. http://liorpachter.wordpress.com/201...of-their-data/
    2. Article on Getting Genetics Done blog featuring eXpress. It's also copied on the RNA-seq blog. http://gettinggeneticsdone.blogspot....h-express.html
    Last edited by Maayanster; 12-04-2013, 04:49 PM.



  • Maayanster
    replied
    I haven't tried rDiff.


    So I just looked at the RNA-seq blog, and there's like 8 new transcript assembly/quantification tools since the summer. Sigh. There's no way I'm keeping up, it's too much work to get these programs running, and so often things crap out on the last step.
    http://www.rna-seqblog.com/category/...ression-tools/

    Just wanted to mention that I'm still updating post #60 which has all the summaries together, since I've been running things on the same datasets in an organized manner, and still finding out new things.



  • Tomnl
    replied
    Hi all

    I have found this thread very useful. Thank you everyone!
    (In particular for the comparison paper http://arxiv.org/abs/1304.5952 and post http://seqanswers.com/forums/showpos...1&postcount=60.)

    I also found the paper's description of the differences between tools for differential splicing and tools for differential isoform expression very informative.

    Based on what I have read and what my data is (paired end, multiple samples, 2 conditions) I decided to compare BitSeq/RSEM/Cuffdiff for differential expression of isoforms.

    I have also decided to compare rDiff, DiffSplice and Cuffdiff for differential splicing.

    However, I am currently having a bit of trouble with rDiff, http://seqanswers.com/forums/showthr...509#post118509, has anybody tried rDiff here?

    Cheers
    Tom



  • elsagc
    replied
    Originally posted by EduEyras
    Without replicates MISO works reasonably well, but you need a pre-calculated set of events. It is not difficult to build one anew, but it requires some work.

    With replicates, we've seen that requiring |deltaPSI| > 0.25 and BF > 2 gives a false positive rate of less than 1%.

    Cufflinks can also work well with just one replicate, provided that you estimate a lower bound for FPKM to know an isoform is expressed. If you don't use tophat for mapping, be careful with the data format in BAM. Also, make sure to use the option for "RABT", which quantifies known and novel isoforms.

    Have a look at http://arxiv.org/abs/1304.5952 for further methods. Let me know if I can help with anything else.

    Good luck

    E.
    Hi E,
    I have found your article very useful. Thanks for sharing. As you mentioned before, MISO requires a pre-calculated set of events. I was wondering if you could share which tool or method you use to calculate the splicing events?
    Thanks,
    Elsa



  • alittleboy
    replied
    Originally posted by krespim
    Nice list.



    As far as I know, and I have been using MISO regularly, it does not give information on isoforms. It is very much "exon-centric".




    I also feel the same. When I choose a tool I always look for information on validation rates, that is, were the predictions reproduced at the experimental level? It really does not matter whether it uses Bayesian inference or a binomial distribution if the predictions are not validated in the "real data". I also take into account ease of use (very often compiling the tools is a nightmare), and whether the output is understandable.

    I know these are practical/trivial considerations, but IMO they are worth taking into account.
    Hi @krespim:

    I had a question on MISO and posted here. Can you help me with that? Thanks! ;-)



  • pravee1216
    replied
    Thanks, E. It's a good article.

    Thanks for reminding me of the RABT option. With it, the transcripts.gtf files produced by cufflinks came out 400 times bigger.

    What is the suggested value of --min-isoform-fraction (lower bound FPKM)? By default it is 10% (0.1).

    Raj



  • skdhanraj
    replied
    Nice compilation. Thank you all



  • EduEyras
    replied
    Without replicates MISO works reasonably well, but you need a pre-calculated set of events. It is not difficult to build one anew, but it requires some work.

    With replicates, we've seen that requiring |deltaPSI| > 0.25 and BF > 2 gives a false positive rate of less than 1%.
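As an illustration, applying the thresholds above to a list of events could look like the sketch below. The record layout and the field names `delta_psi` and `bayes_factor` are hypothetical, chosen for readability; they are not MISO's actual output column names, so the parsing step would need adapting to the real comparison files.

```python
def significant_events(events, min_abs_delta_psi=0.25, min_bayes_factor=2.0):
    """Keep events passing both cutoffs: |deltaPSI| > 0.25 and BF > 2."""
    return [
        event
        for event in events
        if abs(event["delta_psi"]) > min_abs_delta_psi
        and event["bayes_factor"] > min_bayes_factor
    ]


# Hypothetical example records, not real MISO output.
example_events = [
    {"event": "SE_1", "delta_psi": 0.40, "bayes_factor": 15.0},   # passes both
    {"event": "SE_2", "delta_psi": -0.30, "bayes_factor": 1.5},   # BF too low
    {"event": "SE_3", "delta_psi": 0.10, "bayes_factor": 50.0},   # change too small
    {"event": "SE_4", "delta_psi": -0.55, "bayes_factor": 8.0},   # passes both
]

kept = [event["event"] for event in significant_events(example_events)]
```

Note that the absolute value matters: an event with a large negative deltaPSI is as interesting as one with a large positive change.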

    Cufflinks can also work well with just one replicate, provided that you estimate a lower bound for FPKM to know an isoform is expressed. If you don't use tophat for mapping, be careful with the data format in BAM. Also, make sure to use the option for "RABT", which quantifies known and novel isoforms.

    Have a look at http://arxiv.org/abs/1304.5952 for further methods. Let me know if I can help with anything else.

    Good luck

    E.



  • pravee1216
    replied
    Based on experience, can anyone suggest a better tool for studying spliced isoform expression between two conditions (using single-end data without replicates)? Setting up Alexa-seq is very complex, as there is no pre-built database for C. elegans and the package has not been updated in a long time. Please share your experience with other tools and the accuracy of their results.

    Thanks
    Raj



  • shi
    replied
    Dear @pengchy,

    One possible way to do this is to perform a differential expression analysis on junctions, in much the same way as you would for genes (using, for example, edgeR and limma voom). However, to perform this kind of analysis, you will need the read count for each junction in each condition.

    The subjunc program detects exon-exon junctions and outputs the number of supporting reads for each junction. It outputs a bed file in which the first three columns give the location information of the discovered exon-exon junctions and the fifth column gives the number of supporting reads for each junction.
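To get from per-sample BED files to the junction-by-sample count table that a count-based DE tool expects, something along these lines could work. It is a sketch that relies only on the column layout described above (columns 1 to 3 for location, column 5 for the supporting read count); the function names, the skipping of header lines, and the choice to fill missing junctions with 0 are my own assumptions.

```python
import io


def read_junction_counts(handle):
    """Map (chrom, start, end) -> supporting read count from one BED file."""
    counts = {}
    for line in handle:
        line = line.strip()
        # Skip blank lines and possible header lines (assumed, may not occur).
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        junction = (fields[0], int(fields[1]), int(fields[2]))
        counts[junction] = int(fields[4])
    return counts


def count_matrix(per_sample):
    """Combine per-sample counts into rows (junctions) by columns (samples)."""
    samples = list(per_sample)
    junctions = sorted(set().union(*[c.keys() for c in per_sample.values()]))
    # A junction not reported in a sample is treated as having 0 reads.
    rows = [[per_sample[s].get(j, 0) for s in samples] for j in junctions]
    return junctions, samples, rows


# Tiny made-up example standing in for two BED files.
bed_a = io.StringIO("chr1\t100\t200\tJ1\t12\n")
bed_b = io.StringIO("chr1\t100\t200\tJ1\t3\nchr2\t50\t80\tJ2\t7\n")
junctions, samples, rows = count_matrix(
    {"a": read_junction_counts(bed_a), "b": read_junction_counts(bed_b)}
)
```

The resulting rows can be written out as a tab-separated table and loaded into R for edgeR or limma voom, with one row per junction in place of one row per gene.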

    If you are interested in using it, have a look at this short tutorial - http://bioinf.wehi.edu.au/subjunc

    Hope this is helpful.

    Best wishes,
    Wei



  • pengchy
    replied
    Is there a method to detect differentially expressed junctions? The input would be junction information, like the output of tophat, and the output would be the differentially expressed junctions, analogous to what existing tools report for isoforms and exons.

    Thank you.



  • EduEyras
    replied
    Great work. We have put the pre-print of the paper in the arXiv: http://arxiv.org/abs/1304.5952

    I hope it is useful.

    E

