I am interested in benchmarking some differential expression (DE) tools, mainly for my own understanding and practice, and as a bit of a "sanity check" on some things I'm working on. A data set with "true" expression values would be preferable, and as far as I understand it that would mean qPCR (for comparison with RNA-seq). I therefore searched for a data set that contains both RNA-seq and qPCR data, and found this SEQC paper.
They have performed several RNA-seq experiments at several different sequencing sites, as well as around 20,000 PrimePCR reactions (in addition to TaqMan assays for ~1000 genes from a previous study). Without knowing much about the quality of this study or data set, I thought it seemed to be the perfect match for what I'm looking for. They have made their data available on GEO: GSE47792. My first question is, then:
1) Is this data set what I'm looking for? If not, do you know of any other good data set for benchmarking DE-software, preferably with "true" expression data?
First, I tried to find the qPCR data, which I think I did at GSE56457. This data, however, seems to be raw qPCR data, and I have no idea how to analyze that. Just taking value(A) / value(B) (for samples A and B, which are standard RNA samples), I find fewer than 20 DEGs in the entire data set, so I assume I have to do some sort of normalization for this to make any sense?
2) Is there some simple way to analyze raw qPCR data for somebody with no experience whatsoever with it?
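To make question 2 concrete: assuming the raw values are Ct (cycle threshold) values, which the GEO record may or may not confirm, Ct is already on a log2 scale, so a plain ratio value(A) / value(B) would not be a fold change. A minimal sketch of the common 2^-ddCt calculation follows. It assumes roughly 100% amplification efficiency and a reference gene measured in both samples; the function name and the toy numbers are hypothetical and are not taken from GSE56457.

```python
import numpy as np

def ddct_log2_fold_change(ct_target_a, ct_ref_a, ct_target_b, ct_ref_b):
    """log2 fold change of sample A relative to sample B (2^-ddCt method).

    Assumes ~100% PCR efficiency, i.e. one cycle equals one doubling.
    Each argument is a list of replicate Ct values.
    """
    # Normalize the target gene to the reference gene within each sample.
    dct_a = np.mean(ct_target_a) - np.mean(ct_ref_a)
    dct_b = np.mean(ct_target_b) - np.mean(ct_ref_b)
    # A lower Ct means more template, hence the sign flip.
    return -(dct_a - dct_b)

# Toy replicate Ct values (made up for illustration only).
log2_fc = ddct_log2_fold_change(
    ct_target_a=[24.0, 24.2, 23.8],
    ct_ref_a=[18.0, 18.1, 17.9],
    ct_target_b=[26.0, 26.1, 25.9],
    ct_ref_b=[18.0, 18.0, 18.0],
)
fold_change = 2.0 ** log2_fc
```

With these toy values the target gene reaches threshold two cycles earlier in A than in B after reference normalization, which corresponds to a log2 fold change of about 2.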
Then I tried to find the RNA-seq data, which I think is at GSE49712. They provide both FPKM and counts from HTSeq, which is perfect. As a preliminary, exploratory analysis I computed fold changes for the entire set with DESeq2 and compared them with fold changes based on FPKM (i.e. FPKM(A)/FPKM(B); I know this is wrong and bad, I just wanted to look at the data before I dug deeper into it). Strangely, I found that these two analyses correlated extremely well (Pearson r=0.991). Surprised, I then found (upon reading the details of the data at GEO more closely) that the counts supplied are, in fact, normalized counts (using limma's voom function). Seeing as using FPKM in this way is not something one should do, I would not expect such a good correlation!
3) Is the good correlation due to the counts being normalized, or is there some other problem?
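For reference, the kind of comparison described above can be sketched as follows. This is only an illustration under stated assumptions, not the exact analysis: the helper names, the pseudocount of 1 and all numbers are hypothetical, and the count-based log2 fold changes are simply typed in rather than produced by DESeq2.

```python
import numpy as np

def log2_ratio(mean_a, mean_b, pseudocount=1.0):
    """Per-gene log2(A/B) with a pseudocount to avoid taking log of zero."""
    a = np.asarray(mean_a, dtype=float)
    b = np.asarray(mean_b, dtype=float)
    return np.log2(a + pseudocount) - np.log2(b + pseudocount)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Toy per-gene mean FPKM values for samples A and B (made up).
fpkm_a = [10.0, 200.0, 0.5, 35.0, 80.0]
fpkm_b = [12.0, 50.0, 4.0, 30.0, 160.0]
fpkm_lfc = log2_ratio(fpkm_a, fpkm_b)

# Hypothetical log2 fold changes from a count-based tool for the same genes.
count_lfc = np.array([-0.2, 1.9, -1.6, 0.3, -1.1])

r = pearson_r(fpkm_lfc, count_lfc)
```

Comparing on the log2 scale rather than on raw ratios keeps a few very large fold changes from dominating the correlation.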
I have tried in vain to find raw counts. I then tried to find a SAM/BAM file, so that I could compute the counts myself, but it seems that any SRA file I can download (from here or using SRA tools) is unaligned, at least if I understood it correctly. I downloaded a single SRA file from there (using SRA tools) and converted it to SAM, then tried to convert that to BAM, which didn't work. Googling led me to believe that this was because the data is unaligned, but my inexperience with GEO/SRA makes me unsure. I am now faced with having to align the FASTQ files myself. Each file pair is around 70 GB in size (due to the high depth of > 100 M reads per sample?), and I assume this would take ages to do.
4) Is there some way for me to get the raw counts, or at least an aligned .BAM? Can I work backwards from the processed data to the raw data somehow?
5) Am I going about this the wrong way, somehow? I know that you can use simulated data for benchmarking, but I'd prefer data with "true" values if possible. How do you generally go about benchmarking your DE-software(s) of choice?