Announcement

Collapse
No announcement yet.

Isoform expression quanification from rna seq: flood of tools

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    iReckon summary:
    iReckon generates isoforms and quantifies them. However, this is based on gapped alignment to the genome (unlike eXpress, RSEM and BitSeq which are based on alignments to the transcriptome). It doesn't have a built in DE function, so each sample is run separately.
    This tool is a little curious because it requires both a gapped alignment to the genome, and the unaligned reads in fastq or fasta format with a reference genome. Since it requires a BWA executable, it's doing some re-alignment. iReckon claims to generate novel isoforms with low false positives by taking into consideration a whole slew of biological and technical biases.
    One irritating thing in getting the program running is that you need to re-format your refgene annotation file using an esoteric indexing tool from the Savant genome browser package. If you happen to use IGV, this is a bit tedious. Apparently, this will change in the next version. Also, iReckon takes up an enormous amount of memory and scratch space. For a library with 350 million reads, you would need about 800 G scratch space. Apparently everything (run time, RAM, and space) is linear to the number of reads, so this program would be a alright for a subset of the library or for lower coverage libraries.
    Last edited by Maayanster; 04-25-2013, 09:19 AM.

    Comment


    • #32
      Cufflinks + cuffdiff2 summary:
      This pipeline, like iReckon, is based on gapped alignment to the genome. It requires the XS tag, so if you're not using tophat to align your RNA, you need to add that tag. I also found out that our gapped aligner introduces some pesky 0M and 0N's in the cigars, since cufflinks doesn't tolerate these. But with these matters sorted out, it's pretty easy to use.
      So far I'm really liking the versatility. You can run cufflinks for transcriptome reconstruction and isoform quantification in a variety of different modes. For example, with annotations and novel transcript discovery, with annotations and no novel discovery, with no annotations, and with annotations to be ignored in the output. For differential expression, cuffdiff 2 can be run with the results of the transcript quantification from cufflinks to include novel transcripts, or, it can be run directly from the alignment bam files with an annotation. Unlike the exon-based approaches, you don't need to have more than one library in each treatment group, though if you do it's better to keep them separate than to merge them.

      I'm posting a slide I made that shows the appraches used by the 8 programs I tried. The winners so far in each rough category are:
      exon-based DE: DSGseq
      transcript quantification and DE from genome alignments: cufflinks+cuffdiff2
      transcript quantification from transcriptome alignments: eXpress

      These winners are mostly based on how practical these programs are to use. A couple (DEXseq and DiffSplice) I gave up on, the former due to undue complication and the latter due to an undiscovered formatting issue with our bam files which crashes the program. RSEM and iReckon worked fine, but both require re-alignment so they're resource-intense. I enjoyed BitSeq, but it's run from the R console which is impractical for us.
      Last edited by Maayanster; 03-12-2013, 03:50 PM.

      Comment


      • #33
        This is a useful overview, thanks. I thought RSEM also had a mode for quantification from genome alignments, though (where it maps with Bowtie) in addition to the transcriptome alignments?

        Comment


        • #34
          Originally posted by kopi-o View Post
          This is a useful overview, thanks. I thought RSEM also had a mode for quantification from genome alignments, though (where it maps with Bowtie) in addition to the transcriptome alignments?
          As far as I know it doesn't, but please post a link if I'm wrong.

          from the RSEM website:
          However, please note that RSEM does * not * support gapped alignments. So make sure that your aligner does not produce alignments with intersions/deletions

          Comment


          • #35
            Hi @Maayanster,

            Regarding DEXSeq, have you seen the pasilla vignette in R? It indicates how to generate an ExonCountSet object. Alternatively, in the newest version of DEXSeq in the svn (1.5.9) you will find the code to generate an ExonCountSet object from transcriptDb objects and bam files.

            Are you getting any error message? If so, could you add the output of your sessionInfo()? and a reproducible code?

            Alejandro

            Comment


            • #36
              Hi Alejandro,
              I was trying to follow the instructions sections 7 and 8 of the manual (which are sort of in the wrong logical order) with my own data, not with the pasilla. This was a few months ago, so I can't exactly remember, but I believe I managed to create the ExonCountSet object after reading a bunch of forum posts about it. But then I got stuck because some parameter in the makeCompleteDEUAnalysis step was missing and I had no idea what it was supposed to be.

              I guess I'll go and install the pasilla dataset and try again.

              Comment


              • #37
                Hi again Alejandro,
                So the pasilla vignette does not indicate how to create an exonCoutSet object. That object is already in the pasilla package. To figure out how to get the data organized into the object, I found this thread much more useful than the documentation:
                http://grokbase.com/t/r/bioconductor...ountset-object

                Please take this as constructive criticism, because I've spent quite a lot of time struggling with your software, and maybe nobody has said this to you before. The documentation for DEXseq needs an overhaul. The manual is so long, but it starts with an over-complicated example that's way beyond the level of detail most people are even interested in. The section that is supposed to explain how to run an analysis in one step includes parameters that aren't explained, and that most users don't need to use. The section covering the very first steps of how to set up the data come later in the manual, which makes no sense. Even worse, there are no actual code examples for how to make the count files, annotations, design dataframe, and exonCountSet object.

                You're really missing a simple step by step workflow for getting the data from a bam file into R, and running the basic functionalities. I think that for non-R users (actually, even for R users) the manual is extremely cumbersome to use, totally in the wrong order, and full of both too much and too little detail in all the wrong places. And judging from the number of threads on technical problems in dexseq, I'm not the only one struggling. It's great that your team is so responsive to people's issues on seqanswers, but you would save yourself a lot of time answering questions by having a logical one-pager workflow.

                here are the headers that i made for myself with the code for each step that I ran/am trying to run:
                -reference preparation step
                -counting step call (on each library separately)
                -R data.frame preparation
                -R ExonCountSet object preparation
                -R analysis step

                right now I'm still stuck:
                >samples = data.frame(
                + condition = c("gr3","gr3","gr3","gr4","gr4","gr4"),
                + replicate = c(1:3,1:3),
                + row.names = c("A08586","A08606","A08616","A08473", "A08485", "A08500"),
                + stringsAsFactors = TRUE,
                + check.names = FALSE
                + )

                > annotationfile = '/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq/Homo_sapiens.GRCh37.69_dexseq.gtf'
                > fullFilenames<- list.files("/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq",full.names=TRUE,pattern="dexseq.count")

                exonCountSet <- read.HTSeqCounts(countfiles = fullFilenames,design = samples,flattenedfile = annotationfile)
                Error: all(unlist(lapply(design, class)) == "factor") is not TRUE

                Comment


                • #38
                  Hi @Maayanster,

                  Thanks for your feedback. To be honest, I am surprised that you mention that there are no examples of how to generate the input files and objects since there is a whole section (3. Exon count files) in the pasilla vignette of how to generate these files. The section 4 of the pasilla vignette and the help of the function "read.HTSeqCounts" explain how to generate an ExonCountSet object from these. However, I am very glad that you mention this, and its clear evidence that we need to restructure our vignettes. So thanks again!

                  Regarding the error message, in your "samples" object, just remove the "replicate" row. I mean instead of what you have, just do:

                  Code:
                  samples = data.frame(
                  condition = c("gr3","gr3","gr3","gr4","gr4","gr4"),
                  row.names = c("A08586","A08606","A08616","A08473", "A08485", "A08500"),
                  stringsAsFactors = TRUE,
                  check.names = FALSE
                  )
                  
                  annotationfile = '/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq/Homo_sapiens.GRCh37.69_dexseq.gtf'
                  fullFilenames<- list.files("/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq",full.names=TRUE,pattern="dexseq.count")
                  
                  exonCountSet <- read.HTSeqCounts(countfiles = fullFilenames,design = samples,flattenedfile = annotationfile)
                  Let me know if you have any additional questions!
                  Alejandro

                  Comment


                  • #39
                    Actually since you have only a single variable design, there is no need to create a data frame.

                    Just do:

                    Code:
                    annotationfile = '/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq/Homo_sapiens.GRCh37.69_dexseq.gtf'
                    fullFilenames<- list.files("/projects/mkreitzman_prj/expression_quantification_testing/testing/DEXseq",full.names=TRUE,pattern="dexseq.count")
                    
                    exonCountSet <- read.HTSeqCounts(countfiles = fullFilenames,design =  c("gr3","gr3","gr3","gr4","gr4","gr4"),flattenedfile = annotationfile)
                    That should work!

                    Comment


                    • #40
                      Thanks Alejandro, that worked.
                      I must be looking at the wrong manual or something... this is the one I have.
                      http://watson.nci.nih.gov/bioc_mirro...doc/DEXSeq.pdf

                      Comment


                      • #41
                        This is the good one :

                        http://bioconductor.org/packages/rel...ml/DEXSeq.html

                        Comment


                        • #42
                          In R/Bioconductor packages, each version is coupled to its documentation files. The bioconductor guys in Seattle are very strict about this because of very good reasons.

                          So in general I would not recommend using a version of a Bioconductor package with a manual find in google, but rather use the documentation file attach to this specific version. For example, in DEXSeq you could look for the vignette corresponding to the version that you are using like this:

                          library(DEXSeq)
                          browseVignettes("DEXSeq")

                          Comment


                          • #43
                            I've added MISO to the programs I'm testing.
                            summary to come.

                            Comment


                            • #44
                              Originally posted by Maayanster View Post
                              I've added MISO to the programs I'm testing.
                              summary to come.
                              Based on my experience it can take a long time to run. I think that you've mentioned that you are working with a small set of your data but it still may be worth running it in parallel.

                              Comment


                              • #45
                                Thanks for this great thread.

                                I'm just starting out in this area and I am looking at the list of programs you listed, but it is not clear to me if there is a particular pipeline to:

                                1. identify isoforms
                                2. quantify isoforms

                                from just RNA seq reads and an assembled de novo transcritpome, with no genome and no annotation information. Is this possible?
                                Thanks

                                Comment

                                Working...
                                X