Isoform expression quantification from RNA-seq: flood of tools




  • #16
    I agree that there is the need to be able to do the comparisons systematically, and have a good dataset to perform them.

    Also, I think these methods papers provide comparisons that do not always make sense. Perhaps they want to illustrate a specific point (e.g. that Cuffdiff2 is better than doing RSEM/IsoEM/etc. + edgeR/DESeq).

    But there are other methods that have been explicitly developed to calculate the differential expression of isoforms between two conditions (at least they claim so in their papers 8-) ). I've been able to gather these (including Cuffdiff2):


    Some were already mentioned here. So wouldn't it make sense to compare similar methods with each other rather than all against all?



    • #17
      The obvious comparison that needs to be made that I haven't seen and which was strangely absent in the Cuffdiff2 paper is a comparison between Cuffdiff and Cuffdiff2.


      • #18

        I am not able to access http://regulatorygenomics.upf.edu/So..._and_splicing/



        • #19
          Sorry! I'm updating the list. I hope it can be back up today!
          Stay tuned



          • #20
            DSGseq summary:
            This program uses gapped alignments to the genome to detect differential splicing between groups of technical and biological replicates in two treatments. You can't compare just two samples; two samples per group is the minimum.
            It ranks differentially spliced genes by the authors' negative binomial (NB) statistic, which focuses on differences in expression. The NB statistic is provided per gene and per exon. A threshold used in the paper is NB > 5. The program doesn't support reconstruction of isoforms or quantification of specific isoforms, which apparently is computationally harder.
            I found it easy to get it to run using the example data provided and the instructions. You need to run a preparation step on the gene annotation. Starting from BAM files, you also need to run two preparation steps on each library, first to convert it to BED, and then to get the counts.
            While the paper clearly says that transcript annotation information is not necessary for the algorithm, you do need to provide a gene annotation file in refFlat format, which the output is based on.
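
For readers unfamiliar with the counting step, here is a minimal sketch of what "getting the counts" amounts to: tallying how many aligned reads overlap each annotated exon. This is not DSGseq's own script. The function name `count_reads_per_exon` and the tuple layout are assumptions made for illustration, with intervals in BED-style 0-based, half-open coordinates and made-up example data.

```python
from collections import defaultdict

def count_reads_per_exon(exons, reads):
    """Count reads overlapping each exon.

    exons: list of (chrom, start, end, exon_id), BED-style 0-based half-open.
    reads: list of (chrom, start, end) intervals, same convention.
    Illustrative only; a real tool would use an interval index.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, exon_id in exons:
        by_chrom[chrom].append((start, end, exon_id))

    counts = {exon_id: 0 for _, _, _, exon_id in exons}
    for chrom, r_start, r_end in reads:
        for e_start, e_end, exon_id in by_chrom.get(chrom, []):
            # Half-open intervals overlap when each starts before the other ends.
            if r_start < e_end and e_start < r_end:
                counts[exon_id] += 1
    return counts

# Made-up exons and reads, purely for demonstration.
exons = [("chr1", 100, 200, "geneA_exon1"), ("chr1", 300, 400, "geneA_exon2")]
reads = [("chr1", 150, 186), ("chr1", 190, 226), ("chr1", 350, 386), ("chr2", 150, 186)]
print(count_reads_per_exon(exons, reads))
```

A read that spans two exons is counted once for each in this sketch; how a real counting script handles such reads is a design choice this example does not try to reproduce.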
            The developers are unresponsive so no help is at hand if you get stuck.
            Last edited by Maayanster; 03-21-2013, 02:12 PM.


            • #21

              The list is back up


              • #22
                Hi Eduardo, thanks for all the work.

                The categorization is still a bit unwieldy though. For example, the approaches in DSGseq, DEXseq, and DiffSplice are very similar (they all bypass transcriptome reconstruction and isoform quantification and focus only on differential exon expression), but they're in different categories.

                Also, you've included tools that are purely qualitative/visual (like spliceSeq and JuncBase) in the same categories as tools that have statistical outputs... I don't know, maybe a "qualitative" section would be good?


                • #23

                  thanks for the comments. You're right, DSGSeq and DEXSeq are quite similar. However, as I understood, DSGSeq provides abundances for isoforms and calculates changes at the isoform level, whereas DEXSeq does that only at exon level.

                  DEXSeq and DiffSplice are in the same category already, since they provide the differential expression at the "event level", where that is an exon for DEXSeq and an "alternative splicing module" for DiffSplice.

                  SpliceSeq and JuncBASE provide a simple statistical test to compare the read content of each event under two conditions. It is much simpler than other methods as they do not model the read densities, but still they provide a measure of differential splicing. That could be a totally valid approach in some contexts.

                  I agree with you that this organization is not right from the point of view of the methodology. Do you think grouping the methods by methodology would be better?
                  I thought that perhaps it would be more useful for the end user if they were grouped according to what they aim to calculate (e.g. regulated splicing events, quantification of isoforms from the annotation, etc.), while also taking into account what sort of input you would need: annotations, a genome, events, junctions... It's not easy, as many of the papers are not clear enough even about what they themselves do.

                  Thanks for your comments



                  • #24
                    I'm quite sure that DSGseq, like DEXseq and DiffSplice, does not supply expression at the isoform level. The output is basically a file with the genes ranked by their NB statistic, each of which also has an NB statistic per exon.

                    From the paper:
                    As discussed above, for the purpose of inferring differential splicing, we do not necessarily need to estimate the expression of isoforms as differential splicing can be reflected from read distributions across the exons composing the isoforms. In this way, we also don't need to know isoform structures or even the existence of multiple isoforms. We use the negative binomial (NB) distribution to model read-counts on all exons of a gene. It considers over-dispersion in read-count distribution and borrows information across samples to get better estimation of the signal of each exon


                    On the other hand, one can also perceive that identifying differences in splicing patterns or isoform proportions is relatively an easier task than inferring splicing isoforms and estimating their expressions
                    As for suggestions for categorization, it's tricky and I think you're doing an awesome job! I might have more suggestions as I continue to test out the various options and learn more about them!
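
To make the "over-dispersion" point in the quoted passage concrete: under a common negative binomial parameterization the variance is mu + phi * mu^2, so it exceeds the Poisson variance mu whenever phi > 0. The sketch below gives a method-of-moments estimate of phi from replicate counts of a single exon. It is a textbook estimator, not DSGseq's procedure (which, per the quote, also borrows information across samples); the function name and the counts are made up.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments dispersion for var = mu + phi * mu**2.

    Returns 0.0 when the sample variance does not exceed the mean
    (no evidence of over-dispersion relative to Poisson).
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance, needs at least 2 replicates
    return max(0.0, (var - mu) / mu ** 2)

replicates = [80, 100, 120]  # made-up read counts for one exon
print(moment_dispersion(replicates))
```

With only two or three replicates such an estimate is very noisy, which is one reason methods in this area pool information across exons or samples.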


                    • #25

                      thanks a lot for the clarification. Even though the abstract seems clear about what they do, the paper contains some sentences that led me to think that they produce differential expression of isoforms:

                      ..."Therefore, we can detect differences in isoform proportion from the information at all exons without having to estimate the expression of all isoforms."...

                      ..."For studying differential splicing, we are comparing the read count vector Yi = {Yij} conditional on Mi. That is, we focus on the proportion of isoform expression instead of the overall expression of the whole gene." ...
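
The idea in the second quote, comparing the exon count vector conditional on the gene total, can be illustrated by converting counts to within-gene proportions. In the made-up example below the gene doubles in overall expression with no change in exon usage, so the proportions are identical. This is a toy illustration with hypothetical names, not the statistic DSGseq computes.

```python
def exon_proportions(exon_counts):
    """Return each exon's share of the gene's total read count."""
    total = sum(exon_counts)
    if total == 0:
        return [0.0] * len(exon_counts)
    return [c / total for c in exon_counts]

control = [50, 30, 20]    # made-up exon counts, gene total 100
treated = [100, 60, 40]   # gene total doubled, same exon usage
print(exon_proportions(control))
print(exon_proportions(treated))
```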

                      Thanks again for your comments and suggestions



                      • #26
                        ... although, looking again at the output given in:

                        There it does provide an NB statistic per isoform (the line corresponding to column 7).
                        And this is reported per NM_ ... hence they must report the changes per isoform after all.


                        • #27
                          The complete results file from running their test data has only 20804 lines, and each gene symbol appears once, so it is per gene. The refseq transcript id (NM_XXX) is probably just the longest transcript or something. Worth asking the authors.
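
A check along these lines is easy to script. The sketch below assumes a tab-delimited results file and takes the index of the gene-symbol column as a parameter, since the actual column layout of the DSGseq output is not stated here. The function name and the demo rows are hypothetical.

```python
from collections import Counter
import io

def duplicated_symbols(handle, symbol_col=0, has_header=False):
    """Return gene symbols that appear on more than one line.

    handle: any iterable of tab-delimited text lines.
    symbol_col: 0-based column holding the gene symbol (an assumption).
    """
    lines = iter(handle)
    if has_header:
        next(lines, None)
    seen = Counter(
        line.rstrip("\n").split("\t")[symbol_col]
        for line in lines
        if line.strip()
    )
    return sorted(sym for sym, n in seen.items() if n > 1)

# Tiny in-memory stand-in for a results file (made-up rows).
demo = io.StringIO("GENE1\tNM_a\t7.2\nGENE2\tNM_b\t1.1\nGENE1\tNM_c\t0.4\n")
print(duplicated_symbols(demo))
```

An empty list means every symbol occurs exactly once, which is what you would expect if the output is one row per gene.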


                          • #28
                            Yes, I already asked them this morning. No answer yet.


                            • #29
                              DEXseq summary (so far)
                              This is similar to DSGseq and DiffSplice insofar as isoform reconstruction and quantification are skipped and differential exon expression is tested instead. Whereas the other two tools say that they don't need an annotation for their statistics, this program is based only on annotated exons, and uses the supplied transcript annotation in the form of a GFF file.
                              It also needs at least two replicates per group.
                              I found the usage of this program extremely tedious (as a MATLAB person). To install it you also need to install NumPy and HTSeq. To prepare the data (similarly to DSGseq) you need to run a preparation step on the annotation, and another preparation step on every sample separately to collect the counts (both using Python scripts). Then you switch to R, where you need to prepare something called an ExonCountSet object. To do this you first need to make an R data.frame from the files that come out of the counting step. You also need to define a bunch of parameters in the R console. Then you can finally run the analysis. Despite the long instructional PDF, none of this is especially clear, and it's a rather tedious process compared to the others I've tried so far. In the end, I ran makeCompleteDEUAnalysis, printed out a table of the results, and called it a day. I tried to plot some graphics too, but couldn't because "semi-transparency is not supported on this device". If anyone wants a copy of the workflow I used, send me a message; trying to figure it out on your own might take weeks.

                              DiffSplice summary
                              This is a similar approach for exon-centric differential expression to DEXseq and DSGseq (no attempt to reconstruct or quantify specific isoforms). It also supports groups of treatments, with a minimum of 2 samples per group. The SAM inputs and various rather detailed parameters are supplied in two config files. I found this very convenient. In the data config file you can specify treatment group ID, individual IDs, and sample IDs, which determine how the shuffling in their permutation test is done. It was unclear to me what the sample IDs are (as opposed to the individual ID).
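
As a generic illustration of the kind of permutation test mentioned, the sketch below shuffles group labels and asks how often the shuffled difference in means is at least as large as the observed one. It is a textbook label-permutation test, not DiffSplice's implementation, and it ignores the individual/sample ID structure that DiffSplice uses to constrain its shuffling. All names and values are made up.

```python
import random

def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided label-permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        # Small tolerance so floating-point summation order cannot hide a tie.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)

# Made-up per-sample values for two treatment groups.
print(permutation_pvalue([0.62, 0.58, 0.65], [0.31, 0.35, 0.28]))
```

With only a few samples per group the number of distinct relabelings is small, which puts a floor on how small the p-value can get regardless of effect size.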
                              DiffSplice prefers alignments that come from TopHat or MapSplice because it looks for the XS (strand) tag which BWA doesn't create. There's no need to do a separate preparation step on the alignments. However, if you want you can separate the three steps of the analysis using parameters for selective re-running. This program is user friendly and the doc page makes sense.
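
Before handing alignments to a tool that relies on the strand tag, it can help to check how many records actually carry it. The sketch below parses SAM text and counts mapped records lacking an `XS:A:` optional field. It matches type `A` specifically, since some aligners use an `XS` tag with an integer type for a different purpose. The function name and the demo records are made up, and whether DiffSplice requires the tag on every record is not something this check can tell you.

```python
def count_missing_xs(sam_lines):
    """Count mapped SAM records that lack an XS:A strand tag.

    Returns (n_mapped, n_missing_xs). Header lines (starting with '@')
    and unmapped records (FLAG bit 0x4) are skipped.
    """
    n_mapped = 0
    n_missing = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:
            continue
        n_mapped += 1
        # Optional TAG:TYPE:VALUE fields follow the 11 mandatory columns.
        optional = fields[11:]
        if not any(tag.startswith("XS:A:") for tag in optional):
            n_missing += 1
    return n_mapped, n_missing

# Made-up SAM records for demonstration.
demo = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tchr1\t101\t50\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tXS:A:+",
    "r2\t0\tchr1\t201\t50\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
print(count_missing_xs(demo))
```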
                              On the downside, when the program has bad inputs or stops in the middle there are no errors or warnings: it just completes in an unreasonably short time and you get no results.
                              DiffSplice appears to be sensitive to rare deviations from the SAM spec, because while I'm able to successfully run it on mini datasets, the whole datasets are crashing it. I ran Picard's FixMateInformation and ValidateSamFile tools to see if they would make my data acceptable (mates are fine, and SAM files are valid! woot), but no dice. It definitely isn't due to the presence of unaligned reads.
                              Last edited by Maayanster; 04-25-2013, 12:51 PM.


                              • #30
                                eXpress, BitSeq, and RSEM are all based on the approach of aligning RNA-seq reads to the transcriptome, not the genome.

                                eXpress summary:
                                This program can take a BAM file in a stream, or a complete SAM or BAM file.
                                It produces a set of isoforms and a quantification of said isoforms. There is no built-in differential expression function (yet), so they recommend inputting the rounded effective counts that eXpress produces into EdgeR or DEGSeq.
                                I used bowtie2 for the alignments to the transcriptome. Once you have those, using eXpress is extremely simple and fun. There's also a cloud version available on Galaxy, though running from the command line is so simple in this case I don't see any advantage to that. Definite favorite!
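
The rounding step can be sketched as follows for a tab-delimited results table. The column names `target_id` and `eff_counts` are assumptions about the output layout (passed as parameters so they can be changed), and the demo rows are made up. Building the per-sample count matrix and running the DE tool itself are outside this sketch.

```python
import csv
import io

def rounded_effective_counts(handle, id_col="target_id", count_col="eff_counts"):
    """Map each target ID to its effective count rounded to an integer.

    Column names are parameters because they are assumed here,
    not taken from the post. Python's round() sends exact halves
    to the nearest even integer.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: int(round(float(row[count_col]))) for row in reader}

# Made-up two-transcript table standing in for one sample's results.
demo = io.StringIO(
    "target_id\teff_counts\n"
    "tx1\t12.4\n"
    "tx2\t0.6\n"
)
print(rounded_effective_counts(demo))
```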

                                RSEM + EBSeq summary:
                                This also generates isoforms and quantifies them. It also needs to be followed by an external DE tool; they recommend EBSeq, which is actually included in the latest RSEM release.
                                RSEM can't tolerate any gaps in your transcriptome alignment, including the indels bowtie2 supports. Hence, you either need to align ahead of time with bowtie and input a SAM/BAM, or use the bowtie that's built into the RSEM call and input a fasta/fastq. For me this was unfortunate because we don't keep fastq files on hand (only Illumina qseq files), which bowtie doesn't take as inputs. So for us I think this will be too slow. However, it does work! I successfully followed the instructions to execute EBSeq, which is conveniently included as an RSEM function and gives intelligible results. Together, this workflow is complete.
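
One way to see whether an existing alignment file would run into this is to scan the CIGAR strings for gap operations. The sketch below flags mapped SAM records whose CIGAR contains an insertion (I), deletion (D), or skipped region (N). Treating exactly these three operations as "gaps" is an assumption based on the description above, and the function names and demo records are made up.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def has_gap(cigar):
    """True if a CIGAR string contains an insertion, deletion, or skipped region."""
    return any(op in "IDN" for _, op in CIGAR_OP.findall(cigar))

def gapped_read_names(sam_lines):
    """Yield names of mapped SAM records whose CIGAR contains I, D, or N."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x4:  # unmapped record
            continue
        if has_gap(fields[5]):
            yield fields[0]

# Made-up SAM records for demonstration.
demo = [
    "r1\t0\ttx1\t11\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t0\ttx1\t21\t42\t3M2I3M\t*\t0\t0\tACGTACGT\tIIIIIIII",
]
print(list(gapped_read_names(demo)))
```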

                                BitSeq summary
                                This, like DEXSeq, is an R Bioconductor package. I found the manual a lot easier to understand than DEXSeq's.
                                They recalculate the probability of each alignment, come up with a set of isoforms, quantify them, and also provide a DE function. In this way, it is the most complete tool I've tried so far, since all the other tools have assumed, skipped, or left out at least one of these stages. Also, BitSeq automatically generates results files, which is useful for people who don't know R. However, I only successfully ran up to the isoform quantification stage; the DE stage threw an error that I have yet to resolve (Google came up empty and I didn't get a reply from the developer).
                                For running BitSeq I used the same bowtie2 alignments to the transcriptome as for eXpress. You need to run the function getExpression on each sample separately. Then you make a list of the result objects in each treatment group and run the function getDE on those (this is where I got the error).
                                Last edited by Maayanster; 04-25-2013, 09:13 AM.