Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Benchmarking RNA-seq DE-tools

    I am interested in trying to benchmark some DE-tools, mainly for my own understanding, practice and as a bit of a "sanity-check" on some stuff I'm working on. Having a data set with "true" expressions would be preferable, and as far as I understand it that would be qPCR (for comparison with RNA-seq). I thus searched for any data set that contained both RNA-seq and qPCR data, and found this SEQC paper.

    They have performed several RNA-seq experiments at several different sequencing sites, as well as around 20,000 PrimePCR reactions (in addition to TaqMan assays for ~1000 genes from a previous study). Without knowing much about the details of the quality of this study/data set, I thought this seemed to be the perfect match for what I'm looking for. They have made their data available at the GEO: GSE47792. My first question is, then:

    1) Is this data set what I'm looking for? If not, do you know of any other good data set for benchmarking DE-software, preferably with "true" expression data?

    First, I tried to find the qPCR data, which I think I did at GSE56457. This data, however, seems to be raw qPCR data, and I have no idea how to analyze that. Just taking value(A) / value(B) (for sample A and B, which are some standard RNA samples) I find less than 20 DEGs for the entire data set, so I assume I have to do some sort of normalization for this to make any sense?

    2) Is there some simple way to analyze raw qPCR data for somebody with no experience whatsoever with it?

    Then I tried to find the RNA-seq data, which I think is at GSE49712. They provide both FPKM and counts from HTSeq, which is perfect. Just as a preliminary, exploratory analysis I checked for DEGs for the entire set with DESeq2 (i.e. fold changes) and compared that with fold changes based on FPKM (i.e. FPKM(A)/FPKM(B); I know this is wrong and bad, I just wanted to look at the data before I dug deeper into it). Strangely, I find that these two analyses correlated extremely well (Pearson r=0.991). Suprised, I find that (upon reading a bit more in details of data at GEO) the counts supplied are, in fact, normalized counts (using limma's voom function). Seeing as using FPKM in this way is not something one should do, I would not expect such a good correlation!

    3) Is the good correlation due to the counts being normalized, or is there some other problem?

    I have tried in vain to find raw counts. I then tried to find a SAM/BAM file, so that I could compute the counts myself, but it seems that any SRA file I can download (from here or using SRA tools) is un-aligned, at least if I understood it correctly. I downloaded a single SRA file from there (using SRA tools) and converted it to SAM, followed by trying to convert this to BAM, which didn't work. Googling lead me to believe that this was because of the un-aligned data, but my inexperience with GEO/SRA makes me unsure. I am now faced with the issue of having to align the FASTQ files myself. Each file pair is around 70 GB in size (due to the high depth of > 100 M reads per sample?), and I assume this would take ages to do.

    4) Is there some way for me to get the raw counts, or at least an aligned .BAM? Can I work backwards from the processed data to the raw data somehow?

    5) Am I going about this the wrong way, somehow? I know that you can use simulated data for benchmarking, but I'd prefer data with "true" values if possible. How do you generally go about benchmarking your DE-software(s) of choice?

  • #2
    Is this data set what I'm looking for?
    Absolutely. It is the gold standard. Except that they used a non-stranded library protocol, which was already obsolete when the paper was published.

    Is there some simple way to analyze raw qPCR data for somebody with no experience whatsoever with it?
    It's not raw, but I guess this is the best you're going to get, so the remaining steps are just arithmetic. Cq values are logarithmically related to template concentrations, and should be normalized against reference templates.

    Is the good correlation due to the counts being normalized, or is there some other problem?
    I don't understand exactly what two variables you're correlating, but it sounds like it's actually two representations of the same data? Then in that case, it's the badness of the correlation that would be due to different normalizations.

    Is there some way for me to get the raw counts, or at least an aligned .BAM? Can I work backwards from the processed data to the raw data somehow?
    It took me 732 CPU-hours to align the entire HiSeq data set for sample B with STAR. It's entirely realistic to just realign everything yourself.

    Comment


    • #3
      Thanks for the reply for my long post, much appreciated!

      Originally posted by jwfoley View Post
      Absolutely. It is the gold standard. Except that they used a non-stranded library protocol, which was already obsolete when the paper was published.
      Really? When I read the paper it gave me the impression that they were using paired-end reads. Looking at the supplementary (don't have access to the article from home) they did the read counting as with paired-end data. Are you sure, am I missing something?

      Originally posted by jwfoley View Post
      It's not raw, but I guess this is the best you're going to get, so the remaining steps are just arithmetic. Cq values are logarithmically related to template concentrations, and should be normalized against reference templates.
      Ok, so how do I go about doing this? Do I only need the Cq values that I already have, or some other values as well? I take it there is some software (bioconductor?) that can help me do this?

      Originally posted by jwfoley View Post
      I don't understand exactly what two variables you're correlating, but it sounds like it's actually two representations of the same data? Then in that case, it's the badness of the correlation that would be due to different normalizations.
      Basically, fold change [FPKM for sample A divided by FPKM for sample B, for all the genes in the set] vs. fold change [as calculated by DESeq2, the entire set, regardless of p-values]. My question is if the fold change values from DESeq2 are "wrong"/invalid/whatever, since the counts used to calculate fold change are already normalized (rather than being raw counts, which is what you should input into DESeq2 as far as I know).

      Originally posted by jwfoley View Post
      It took me 732 CPU-hours to align the entire HiSeq data set for sample B with STAR. It's entirely realistic to just realign everything yourself.
      Oh, I totally forgot there are other aligners than Tophat! I guess I'll have to learn how to operate that as well...

      Comment


      • #4
        Originally posted by ErikFas View Post
        Really? When I read the paper it gave me the impression that they were using paired-end reads. Looking at the supplementary (don't have access to the article from home) they did the read counting as with paired-end data. Are you sure, am I missing something?
        @jwfoley was referring to the data being "non-stranded" (i.e. libraries prepared without keeping strand information intact).

        Oh, I totally forgot there are other aligners than Tophat! I guess I'll have to learn how to operate that as well...
        Let me put a plug in for BBMap. Not much learning required to operate that

        Comment


        • #5
          Really? When I read the paper it gave me the impression that they were using paired-end reads. Looking at the supplementary (don't have access to the article from home) they did the read counting as with paired-end data. Are you sure, am I missing something?
          Yes, I'm afraid you are. As GenoMax said, stranded vs. unstranded is orthogonal to single-end vs. paired-end. A stranded paired-end library produced with most common kits would always have read 1 antisense- and read 2 sense-oriented relative to the transcript from which the fragment was derived. In these libraries, the orientation is random, so there is no way to know which strand the transcript was on, and when two transcripts overlap on opposite strands (not uncommon, especially not in plants), it's impossible or at least very difficult to know which transcript a given cDNA fragment came from.

          Stranded libraries are no more expensive and hardly any more difficult to make, so I assume SEQC either did this a long time ago or wanted to keep everything unstranded for a fair test with SOLiD and 454 if those didn't have stranded kits (though the fact that they even bothered with those platforms supports the "a long time ago" hypothesis).

          Ok, so how do I go about doing this? Do I only need the Cq values that I already have, or some other values as well? I take it there is some software (bioconductor?) that can help me do this?
          To get the fold change between two samples, it's roughly (2 ^ (Cq of your transcript in condition 1 - Cq of your transcript in condition 2)) / (2 ^ (Cq of reference transcript in condition 1 - Cq of reference transcript in condition 2)). If I remember right and am typing it right. Look up qPCR quantification in a source that can format mathematical formulas. This is pretty basic and you don't need Bioconductor, just a calculator.

          Basically, fold change [FPKM for sample A divided by FPKM for sample B, for all the genes in the set] vs. fold change [as calculated by DESeq2, the entire set, regardless of p-values]. My question is if the fold change values from DESeq2 are "wrong"/invalid/whatever, since the counts used to calculate fold change are already normalized (rather than being raw counts, which is what you should input into DESeq2 as far as I know).
          I wouldn't divide FPKM by FPKM: it's not a mathematically sound unit and can't be used upstream of any serious analysis, though it's tolerable as a figure scale for people who know what's a low or high value. If you want an FPKM-related fold change, cuffdiff can give you one, though I wouldn't use that for real science either.

          As for the counts, I don't know exactly what limma is doing to them but it seems like the logical thing is just to go inside limma and ask it for the fold change directly. Or give DESeq2 the unadulterated counts.

          And any software may have trouble with this data set because they all expect biological replicates and you only have technical replicates.

          Let me put a plug in for BBMap. Not much learning required to operate that
          I'm not familiar with BBMap but from a quick glance it doesn't look like it's splice-aware, so it's probably not suitable for eukaryotic RNA-seq.

          Comment


          • #6
            Originally posted by jwfoley View Post
            I'm not familiar with BBMap but from a quick glance it doesn't look like it's splice-aware, so it's probably not suitable for eukaryotic RNA-seq.
            BBMap is absolutely splice-aware. http://seqanswers.com/forums/showpos...25&postcount=6

            You may also want to check this thread out for other things BBMap can do (and do very efficiently, Brian's code is optimized): http://seqanswers.com/forums/showthread.php?t=58221.
            Last edited by GenoMax; 06-19-2015, 07:29 AM.

            Comment


            • #7
              Originally posted by jwfoley View Post
              I wouldn't divide FPKM by FPKM: it's not a mathematically sound unit and can't be used upstream of any serious analysis, though it's tolerable as a figure scale for people who know what's a low or high value. If you want an FPKM-related fold change, cuffdiff can give you one, though I wouldn't use that for real science either.
              Yeah, like I said in the first post, I'm not going to use this for anything, I'm just using it as a way to explore the data and get familiar with it.

              Originally posted by jwfoley View Post
              As for the counts, I don't know exactly what limma is doing to them but it seems like the logical thing is just to go inside limma and ask it for the fold change directly. Or give DESeq2 the unadulterated counts.
              I would love to give the raw counts, but the problem is that (as far as I can tell) the paper/GEO doesn't actually supply them, only normalized ones, hence all the fuss about doing another alignment myself (so I can get the raw counts).

              Originally posted by jwfoley View Post
              And any software may have trouble with this data set because they all expect biological replicates and you only have technical replicates.
              Oh, you can't use DESeq2/limma/edgeR/etc. for technical replicates at all? I seem to recall that they assume biological variation (and tries to see if the difference in expression is due to "real" differences between groups or just biological variation?), but does that mean it can't do technical replicates at all?

              Comment


              • #8
                Originally posted by GenoMax View Post
                BBMap is absolutely splice-aware. http://seqanswers.com/forums/showpos...25&postcount=6

                You may also want to check this thread out for other things BBMap can do (and do very efficiently, Brian's code is optimized): http://seqanswers.com/forums/showthread.php?t=58221.
                Okay, maybe splice-aware, but not transcriptome-aware, if you can't give it annotations...

                Ugh FINE I'll try BBMap, but I won't like it!

                Comment


                • #9
                  Originally posted by jwfoley View Post
                  Okay, maybe splice-aware, but not transcriptome-aware, if you can't give it annotations...
                  Does that matter? Once you have alignments you can use any annotation you like to get the counts.

                  Comment


                  • #10
                    Originally posted by ErikFas View Post
                    I would love to give the raw counts, but the problem is that (as far as I can tell) the paper/GEO doesn't actually supply them, only normalized ones, hence all the fuss about doing another alignment myself (so I can get the raw counts).
                    If this is a serious problem for you, PM me and maybe we can find a way for me to give you my STAR alignments.

                    Oh, you can't use DESeq2/limma/edgeR/etc. for technical replicates at all? I seem to recall that they assume biological variation (and tries to see if the difference in expression is due to "real" differences between groups or just biological variation?), but does that mean it can't do technical replicates at all?
                    I'm not saying you can't, just that I don't know the best way. Maybe it's completely sensible to e.g. tell DESeq2 that each HiSeq lane is a different sample, and then just use flow cell, sequencing site, and RNA sample as three different condition variables.

                    Comment


                    • #11
                      "My question is if the fold change values from DESeq2 are "wrong"/invalid/whatever, since the counts used to calculate fold change are already normalized (rather than being raw counts, which is what you should input into DESeq2 as far as I know)."

                      just to reiterate, the important thing is to not give counts to DESeq2 which are corrected for library size. It's not that the fold changes would be inaccurate (assuming the normalization is accurate), but that the point of performing statistical inference using the count model doesn't make sense if the input matrix isn't on the scale of counts or the sampling process that generated them.

                      Comment


                      • #12
                        Originally posted by jwfoley View Post
                        Okay, maybe splice-aware, but not transcriptome-aware, if you can't give it annotations...

                        Ugh FINE I'll try BBMap, but I won't like it!
                        What!!! Everyone who tries it likes it

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Essential Discoveries and Tools in Epitranscriptomics
                          by seqadmin


                          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
                          Today, 07:01 AM
                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        37 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        41 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        35 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        54 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X