Without starting some kind of pointless back and forth, I am wondering what are the advantages of using Cuffdiff or DESeq to identify differentially expressed genes from RNA-seq data. The statistical side is a little over my head at the moment but is also what spurs my interest in this question. I find Simon Anders posts very informative so I am inclined to use DESeq but everything R and Bioconductor is a little biologist unfriendly. The Cuffdiff workflow is easier to follow but am I sacrificing anything by using Cuffdiff?
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
There's a discussion of this going on over at the bioconductor mailing list. I'm quoting some of the recent posts from the last two days:
Dear Bioconductor list,
Sometime ago Simon Anders explained the difference
between DeSeq and Cuffdiff as follows:
"If you have two samples, cuffdiff tests, for each transcript, whether
there is evidence that the concentration of this transcript is not the
same in the two samples.
If you have two different experimental conditions, with replicates for
each condition, DESeq tests, whether, for a given gene, the change in
expression strength between the two conditions is large as compared to
the variation within each replicate group."
Current language on the Cuffdiff site suggests that the current version
of that program tests for whether the change is significant compared to
changes in each condition.
Can someone please comment on the relative merits of Cuffdiff and
DeSeq. I ask here because our sequencing core delivers results
based on Cuffdiff and I want to know if I should redo it using
DeSeq,I would greatly appreciate any guidance in this matter.
Thanks and best wishes,
Rich
Not directly relevant to gene-level RNA-seq DE calls, but rather for
exon-level DE,
I found it useful to read this:
In particular, section 4.3 on page 11, and supplementary figures S7 and S8
on page 19.
I was informed by a coworker that since everyone uses
BowTie-TopHat-Cufflinks-Cuffdiff, it is the sensible thing to do.
Conversations with people who know what they are doing (Terry Speed &
BCGSC) suggest the matter is not yet settled.
So I retrieved ~1TB of BAMs, extracted the reads, and started looking into
how that compares to DEXSeq and/or subread.
It would be incredibly informative if the Cufflinks and DEXSeq authors had
time to weigh in on their strengths/weaknesses. DEXSeq & cummeRbund both
offer nice tools for exploring the results; I am curious which pipeline
fits best for my needs.
Thanks for bringing this up.
I'm also in the same boat as Rich. I run a new bioinformatics core
here and I'm building a pipeline for RNA-seq. Cufflinks for some time
has supported biological replicates, and I'm also curious about the
relative merits of using
bowtie/tophat-cufflinks-cuffmerge-cuffdiff-cummeRbund versus using
tophat-HTSeq?-customScriptForCreatingMatrix?-DESeq. Cufflinks also
gives me a host of other tests (differential splicing load,
differential TSS usage, differential coding output, etc), which also
seem useful for certain applications.
On a related note, does anyone have a workflow for taking multiple bam
files, running HTSeq-count (or another program), plus some other
program or custom script, to produce a matrix of counts as input to
DESeq?
Stephen
Dear Stephen,
To your related note, you could have a look at the easyRNASeq package (bioC 2.10) for R (2.15). It reads in your annotation, your bam files and generate a count table for DESeq, all in R. It can actually do the first step of DESeq (estimating size library and dispersion) and give you back a normalized countDataSet object plus some validation plots as described in the DESeq vignettes (the same is true for edgeR). I'm about to push some changes in SVN to correct an issue that prevented the package vignette to be build. The package should be available in a couple of days as binary or you could install it directly from SVN. See http://wiki.fhcrc.org/bioc/SvnHowTo. The package URL is: https://hedgehog.fhcrc.org/bioconduc...cks/easyRNASeq.
Getting the proper set of annotation is definitely the most important step in the whole process and the one that requires most attention, the rest is then pretty straightforward. What I mean by annotation is the description of your feature of interest, be it gene, transcript, exon, enhancers, etc... as genomic loci (chr, start, width, etc...). The main issue in defining the annotation is to avoid counting reads multiple times and the kind of annotation needed does of course depends on your project. If you are interested in looking at isoforms differential expression, you probably want to define synthetic exons (to avoid double counting) and process the obtained count table with DEXseq. If you're looking at gene expression, you would want to create gene models to avoid multiple counting and use these to create your count table. If you are interested in eRNAs, you can define enhancer loci as the count "feature". All this can be done relatively easily in R. Once you have the proper annotation that suits your need, running easyRNASeq is very straightforward. easyRNASeq accepts both RangedData and GRangesList as annotation input, amo ng other formats. easyRNASeq is in addition able to fetch annotations for you from different sources, but most of the time these would need to be post-processed. You can look at my post: "u! sing easyRNASeq examples" from 2 days ago for some examples and comments in addition to the vignette content.
Cheers,
Nico
I've fixed a bug in easyRNASeq and made a couple of modifications to deal with novelties introduced in the IRanges and in the DESeq packages. I've just committed that whole bunch, so it will take ~2 days before a binary package is available. You can always install from SVN in the meanwhile.
I've listed below the different examples you sent me in our email exchange last week and I've added some comments which I hope will prove useful to you. I've created a new "thread" to make it easier to read, and I'm confident it will prove useful to others. This leads me to thanking you for your patience, for providing the data and for giving me the opportunity to post such an example
I hope this helps, do not hesitate to contact me if you have further questions and to suggest improvements.
Cheers,
NicoCode:## load the library library(easyRNASeq) ## load the chromosome sizes ## Note that you can just provide a named list; using the BSgenome is not necessary (especially as the Hsapiens ones are huge >800Mb) library(BSgenome.Hsapiens.UCSC.hg19) chr.sizes=as.list(seqlengths(Hsapiens)) ## set the wd directory to wherever you have your bam files ## that's just for this example, you could give that directory ## directly to the filesDirectory argument of the function setwd("Desktop/Francesco") ## get the bam filenames bamfiles=dir(getwd(),pattern="*\\.bam$") ## run easyRNASeq to get an RNAseq object. ## we use biomaRt to fetch the annotation directly from ensembl ## rnaSeq <- easyRNASeq(filesDirectory=getwd(), organism="Hsapiens", chr.sizes=chr.sizes, readLength=100L, annotationMethod="biomaRt", format="bam", count="exons", filenames=bamfiles[1], outputFormat="RNAseq" ) ## Fetching the annotation that way is time consuming ## Using the RNAseq object you can extract the genic information ## and save them as an rda for a faster processing time ## later on gAnnot <- genomicAnnotation(rnaSeq) ## There are 244 "chromosomes" in that annotation, let's keep only what we need gAnnot <- gAnnot[space(gAnnot) %in% paste("chr",c(1:22,"X","Y","M"),sep=""),] save(gAnnot,file="gAnnot.rda") ## you could use it this way countTable <- easyRNASeq(filesDirectory=getwd(), organism="Hsapiens", chr.sizes=chr.sizes, readLength=100L, annotationMethod="rda", annotationFile="gAnnot.rda", format="bam", count="exons", filenames=bamfiles[1] ) ## using different annotation (makeTranscriptDB) library(GenomicFeatures) hg19.tx <- makeTranscriptDbFromUCSC( genome="hg19", tablename="refGene") ## easyRNASeq can deal with GRangesList object, so no need to modify it much, i.e. no need to convert it to a RangedData gAnnot <- exons(hg19.tx) ## change the metadata column name to suit easyRNASeq colnames(elementMetadata(gAnnot)) <- "exon" ## finally turn it into a GrangesList gAnnot <- split(gAnnot,seqnames(gAnnot)) ## you could save it as an rda object, you can as well use it directly ## by using the annotationMethod "env" and the annotationObject arguments. ## In addition we will be selecting for a single chromosome: chr1 ## as the corresponding name in your bam file is 1, that's what we'll use countTable <- easyRNASeq(filesDirectory=getwd(), organism="Hsapiens", chr.sizes=chr.sizes, readLength=100L, annotationMethod="env", annotationObject=gAnnot, format="bam", count="exons", filenames=bamfiles[1], chr.sel="1" ) ## applying DESeq ## this will not yield very sensitive results as we have no replicates (biological) ## These are important for DESeq to accurately model the technical and biological variance ## With no replicates for every condition, the dispersion will be based on a "pooled" estimate ## making the differential expression call lose sensitivity. In addition, DESeq is in such case ## using a conservative approach (which is good) so you'd get even less significant results. ## Defining the conditions conditions <- c("A","B") names(conditions) <- bamfiles ## running DESeq ## NOTE that the two last arguments are for the DESeq estimateDispersions method. They are just transferred through. ## There are necessary as we have only 1 replicate per condition; see the DESeq vignette for more details. countDataSet <- easyRNASeq(filesDirectory=getwd(), organism="Hsapiens", chr.sizes=chr.sizes, readLength=100L, annotationMethod="env", annotationObject=gAnnot, format="bam", count="exons", filenames=bamfiles, chr.sel="1", outputFormat="DESeq", normalize=TRUE, conditions=conditions, fitType="local", method="blind" ) ## finally my session info > sessionInfo() R Under development (unstable) (2012-02-07 r58290) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods [8] base other attached packages: [1] GenomicFeatures_1.7.16 AnnotationDbi_1.17.15 easyRNASeq_1.1.6 [4] ShortRead_1.13.12 latticeExtra_0.6-19 RColorBrewer_1.0-5 [7] Rsamtools_1.7.29 DESeq_1.7.6 locfit_1.5-6 [10] lattice_0.20-0 akima_0.5-7 Biobase_2.15.3 [13] BSgenome_1.23.2 GenomicRanges_1.7.24 Biostrings_2.23.6 [16] IRanges_1.13.24 genomeIntervals_1.11.0 intervals_0.13.3 [19] edgeR_2.5.9 limma_3.11.11 biomaRt_2.11.1 [22] BiocGenerics_0.1.4 loaded via a namespace (and not attached): [1] annotate_1.33.2 bitops_1.0-4.1 DBI_0.2-5 genefilter_1.37.0 [5] geneplotter_1.33.1 grid_2.15.0 hwriter_1.3 RCurl_1.91-1 [9] RSQLite_0.11.1 rtracklayer_1.15.7 splines_2.15.0 survival_2.36-12 [13] tools_2.15.0 XML_3.9-4 xtable_1.6-0 zlibbioc_1.1.1
Comment
-
One of the issues I faced with DESeq was going from alignments (bam files) to the counts table that DESeq requires, e.g.:
Code:Sample1 Sample2 Sample3 Sample4 Sample5 Gene1 2614 7000 7119 6689 4893 Gene2 1162 2939 72 4988 390 Gene3 5799 8533 9263 8413 5635 Gene4 8129 4023 1524 2270 1816 Gene5 3536 4320 4927 2659 6795
htseq-count is dead simple to use. It takes an alignment and a GTF annotation file (e.g. htseq-count [options] <alignment> <gtf>) and produces a list of features and counts of reads that overlap those features (how overlapping reads are counted is specified by the options).
WASH5P 3
WDR18 201
ZNF555 44
ZNF57 82
ZNF77 55
psiTPTE22 0
tAKR 0
no_feature 6321
ambiguous 196
too_low_aQual 0
not_aligned 0
alignment_not_unique 15890
I hacked out a perl script that takes a list of bam files, runs samtools view on them, pipes that output into htseq-count, and drops an output file. It does this for each bam file, dropping a temporary file containing all the features and counts for that particular alignment. I then pipe together a series of join commands to glue together all these temporary count text files that HTSeq produces. This gets me my table that I can use with DESeq.
I've heard far too many examples of people using tophat-cufflinks-cuffmerge-cuffdiff for their RNA-seq workflow, and getting extremely different results from one version of cufflinks to the next. But when it takes some perl hacking around a python script to munge alignments into a format that R can read, it's currently easier IMHO to use the cufflinks solution. In addition, cufflinks gives you other useful information about splicing load, differential TSS usage, differential coding output, etc.
I would like to see this discussion continue, and if anyone has any better solution for going from bam alignments to a DESeq-ready counts table, I'm all ears.
Comment
-
Originally posted by turnersd View PostI hacked out a perl script that takes a list of bam files, runs samtools view on them, pipes that output into htseq-count, and drops an output file. It does this for each bam file, dropping a temporary file containing all the features and counts for that particular alignment.
...
I would like to see this discussion continue, and if anyone has any better solution for going from bam alignments to a DESeq-ready counts table, I'm all ears.
I've got a python script lying around that takes in multiple BAM files (or SAM files, if that's your liking) and spits out counts for each feature in a three-column format (feature, experiment, count) using HTSeq. I'm mapping to a transcriptome for the reads so haven't bothered with GTF files yet, but that's something that can be fairly easily plugged in using the HTSeq library. It would probably also be fairly easy to modify HTSeq-count to support no GFF files and multiple experiments.
Once you have the three column format, it's a somewhat easy process to convert into a count matrix using the xtabs function in R.
... but I'm sitting at the wrong computer at the moment. I'll attach / embed the useful scripts tomorrow after I get in to work.
Taking note of what else has been mentioned, easyRNASeq (which I hadn't heard about before) sounds like an even better tool for this given that it stays in R. As long as the BAM file reading/processing is quick enough (e.g. considering 5-50GB BAM files for each experiment), there's no reason not to do everything in R.
Comment
-
Originally posted by chadn737 View PostThe vignette for DESeq is exceptionally well written and makes it easy for a biologist with no R experience to follow it. I say this having at one time being that biologist with NO understanding of R.
Originally posted by gringer View PostTaking note of what else has been mentioned, easyRNASeq (which I hadn't heard about before) sounds like an even better tool for this given that it stays in R. As long as the BAM file reading/processing is quick enough (e.g. considering 5-50GB BAM files for each experiment), there's no reason not to do everything in R.
Anyway, enough of my ranting against Bioconductor. I'm still interested in opinions about what is happening under the hood of these methods.
Originally posted by turnersd View PostI've heard far too many examples of people using tophat-cufflinks-cuffmerge-cuffdiff for their RNA-seq workflow, and getting extremely different results from one version of cufflinks to the next.--------------
Ethan
Comment
-
I have HTseq-count in my pipeline. At the end I just import all the individual count files into R and create the table there.
Or for those who are very unfamiliar with perl, R, or python. Just open the count files that are output by HTseq-count in excel. Then copy and paste to make your table and save it as a csv file. I realize this is labor intensive, but I would expect basically every scientist to at least know excel.
Comment
-
Originally posted by ETHANolAnyway, enough of my ranting against Bioconductor. I'm still interested in opinions about what is happening under the hood of these methods.
For single isoform genes, Cuffdiff models the variance in fragment counts across replicates using the negative binomial distribution, similar to the method described by Anders and Huber (2010)
However since Cufflinks attempts to test differential expression for isoforms, their model gets a lot more complex and they attempt adjust expression values with reads that map to multiple isoforms/genes.
My understanding of statistics is far to rudimentary to actually speak on which performs better. For my own purposes, looking for expression differences between isoforms does not make my results anymore meaningful and also makes the types of analysis I am interested in far more difficult.
Also, in every paper I have seen that compares various methods of testing differential expression, DESeq and EdgeR give very similar results, and DESeq gives fairly similar results to BaySeq, whereas Cufflinks can differ considerably from all of these.
Cufflinks can also take a considerable time to run and even with extra steps, I can finish up using DESeq well before Cufflinks ever completes.
If its not obvious by now, I prefer DESeq and for that matter EdgeR, over Cufflinks. It fits my needs better.
Comment
-
Hi all,
Originally posted by chadn737 View PostFor my own purposes, looking for expression differences between isoforms does not make my results anymore meaningful and also makes the types of analysis I am interested in far more difficult.
The increased difficulty in interpretation of cuffdiff output is something that I have tried to address with the 'cummeRbund' R/Bioconductor package. The hope is that making the cuffdiff output manageable in a familiar environment, while at the same time maintaining the relationships between genes, isoforms, promoters, etc, will begin to make the interpretation of cuffdiff output more engaging.
Comment
-
Originally posted by gringer View Post... but I'm sitting at the wrong computer at the moment. I'll attach / embed the useful scripts tomorrow after I get in to work.
Code:$ ~/scripts/sam2rawCounts.py $(for x in $(ls -d Sample_*); do \ echo ${x}=$(ls ${x}/*contigs*.bam); done) > contig_reads.txt
Code:$ ~/scripts/sam2rawCounts.py $(for x in $(cat experiment_dirs.txt); do \ echo ${x}=$(ls ${x}/*contigs*.bam); done) > contig_reads.txt
Code:# get counts from transcriptome mapping reads.3col.df <- read.table("contig_reads.txt", header = TRUE, stringsAsFactors = FALSE); ## convert to raw count table formatted for DESeq experiment.counts <- xtabs(Reads ~ Contig + Experiment, data = reads.3col.df); ## assume no biological replicates experiment.conditions <- sub("^Sample_","",colnames(experiment.counts)); ## convert to DESeq CountDataSet object experiment.cds <- newCountDataSet(unclass(experiment.counts), experiment.conditions);
Attached Files
Comment
-
In regards to the question of how to get whole-number counts appropriate for using DESeq, I would recommend using RSEM for quantification: http://deweylab.biostat.wisc.edu/rsem/
RSEM only does quantification of known-transcripts but does so in a very stable manner (comapred to cufflinks) that gives gene-level counts that are the sum of the transcript-level counts for that gene (which cufflinks does not do). It also is more stable from version to version and has less issues in "failing" to produce a count for certain isoforms/genes.
RSEM can take BAM files as input, but expects them to be aligned to the transcriptome (i.e. won't take the output of TopHat that includes both genomic and txome alignments) so you may end up re-doing alignment to use it. But RSEM will happily call bowtie for you if you give it fastq as input.
It will do probablistic assignment of multi-reads giving you counts that can be fractions of whole numbers, but simply rounding these numbers before inputting them to DEseq should give you better results than simply doing a raw counting of reads in genes/exons.
Comment
-
DESeq/edgeR also allow you to use generalized linear modelling to use more complicated statistical models. For example to do paired same analysis (e.g. healthy tissue vs cancer tissue from 10 patients). In this sort of situation I find that cuffdiff finds no genes/isoforms are differentially expressed because the patient-to-patient variance is too high.
On the other hand, I'm still to find a solution to the read counting that truly satisfies me. DESeq and edgeR get away without normalising for length because the two entities being compared are the same length. However, if you sum over all exons in a gene then this isn't necessarily the case: different exons might be being used in different samples. You can test exon-by-exon, but then have to work out how you go from differentially expressed exons to differentially expressed genes. I admit I haven't looked at RSEM, i'll go and do that.
Comment
-
Using cuffdiff means you are coming from tophat-cufflinks-cuffmerge workflow. I have found that this workflow is bugged and mixes up gene annotations. It can not be said with 100% certainty if the reported FPMKs indeed belong to the corresponding gene/transcript.
Please view my thread if interested:
Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
I am now using DESeq for my analysis.Last edited by epi; 03-09-2012, 09:23 AM.
Comment
-
Originally posted by epi View PostUsing cuffdiff means you are coming from tophat-cufflinks-cuffmerge workflow. I have found that this workflow is bugged and mixes up gene annotations. It can not be said with 100% certainty if the reported FPMKs indeed belong to the corresponding gene/transcript.
Please view my thread if interested:
Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
I am now using DESeq for my analysis.
Regarding use of Cufflinks or RSEM with count-based tools like DEseq or edgeR - the authors of DEseq have noted on a number of occasions that it is NOT a good idea to use RSEM or Cufflinks with their tool. The reason is that those tools produce (either explicitly or internally during estimation of gene and transcript FPKM) an *estimate* of the number of fragments that came from each transcript. However, that estimate comes with *uncertainty*, and those tools don't account for it when testing for DE. Think of it this way - when you have multiple isoforms, you have to decide to assign your fragments among them. In addition to the biological variability that's present between replicates, there's now also variability in this assignment process. Those tools don't account for it, so using them with the estimated counts from transcript-level tools will lead to false positives.
Comment
-
Originally posted by sudders View Post
On the other hand, I'm still to find a solution to the read counting that truly satisfies me. DESeq and edgeR get away without normalising for length because the two entities being compared are the same length. However, if you sum over all exons in a gene then this isn't necessarily the case: different exons might be being used in different samples. .
Comment
Latest Articles
Collapse
-
by seqadmin
The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...-
Channel: Articles
11-06-2024, 07:24 PM -
-
by seqadmin
Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...-
Channel: Articles
10-18-2024, 07:11 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, Today, 11:09 AM
|
0 responses
12 views
0 likes
|
Last Post
by seqadmin
Today, 11:09 AM
|
||
Started by seqadmin, Today, 06:13 AM
|
0 responses
16 views
0 likes
|
Last Post
by seqadmin
Today, 06:13 AM
|
||
Started by seqadmin, 11-01-2024, 06:09 AM
|
0 responses
30 views
0 likes
|
Last Post
by seqadmin
11-01-2024, 06:09 AM
|
||
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks
by seqadmin
Started by seqadmin, 10-30-2024, 05:31 AM
|
0 responses
21 views
0 likes
|
Last Post
by seqadmin
10-30-2024, 05:31 AM
|
Comment