Unconfigured Ad

**jwfoley** · 06-18-2015, 08:29 AM

Is this data set what I'm looking for?

Absolutely. It is the gold standard. Except that they used a non-stranded library protocol, which was already obsolete when the paper was published.

Is there some simple way to analyze raw qPCR data for somebody with no experience whatsoever with it?

It's not raw, but I guess this is the best you're going to get, so the remaining steps are just arithmetic. Cq values are logarithmically related to template concentrations, and should be normalized against reference templates.

Is the good correlation due to the counts being normalized, or is there some other problem?

I don't understand exactly what two variables you're correlating, but it sounds like it's actually two representations of the same data? Then in that case, it's the badness of the correlation that would be due to different normalizations.

Is there some way for me to get the raw counts, or at least an aligned .BAM? Can I work backwards from the processed data to the raw data somehow?

It took me 732 CPU-hours to align the entire HiSeq data set for sample B with STAR. It's entirely realistic to just realign everything yourself.

**ErikFas** · 06-19-2015, 06:03 AM

Thanks for the reply for my long post, much appreciated!

Originally posted by jwfoley View Post

Absolutely. It is the gold standard. Except that they used a non-stranded library protocol, which was already obsolete when the paper was published.

Really? When I read the paper it gave me the impression that they were using paired-end reads. Looking at the supplementary (don't have access to the article from home) they did the read counting as with paired-end data. Are you sure, am I missing something?

Originally posted by jwfoley View Post

It's not raw, but I guess this is the best you're going to get, so the remaining steps are just arithmetic. Cq values are logarithmically related to template concentrations, and should be normalized against reference templates.

Ok, so how do I go about doing this? Do I only need the Cq values that I already have, or some other values as well? I take it there is some software (bioconductor?) that can help me do this?

Originally posted by jwfoley View Post

I don't understand exactly what two variables you're correlating, but it sounds like it's actually two representations of the same data? Then in that case, it's the badness of the correlation that would be due to different normalizations.

Basically, fold change [FPKM for sample A divided by FPKM for sample B, for all the genes in the set] vs. fold change [as calculated by DESeq2, the entire set, regardless of p-values]. My question is if the fold change values from DESeq2 are "wrong"/invalid/whatever, since the counts used to calculate fold change are already normalized (rather than being raw counts, which is what you should input into DESeq2 as far as I know).

Originally posted by jwfoley View Post

It took me 732 CPU-hours to align the entire HiSeq data set for sample B with STAR. It's entirely realistic to just realign everything yourself.

Oh, I totally forgot there are other aligners than Tophat! I guess I'll have to learn how to operate that as well...

**GenoMax** · 06-19-2015, 06:21 AM

Originally posted by ErikFas View Post

Really? When I read the paper it gave me the impression that they were using paired-end reads. Looking at the supplementary (don't have access to the article from home) they did the read counting as with paired-end data. Are you sure, am I missing something?

@jwfoley was referring to the data being "non-stranded" (i.e. libraries prepared without keeping strand information intact).

Oh, I totally forgot there are other aligners than Tophat! I guess I'll have to learn how to operate that as well...

Let me put a plug in for BBMap. Not much learning required to operate that

**jwfoley** · 06-19-2015, 06:53 AM

Really? When I read the paper it gave me the impression that they were using paired-end reads. Looking at the supplementary (don't have access to the article from home) they did the read counting as with paired-end data. Are you sure, am I missing something?

Yes, I'm afraid you are. As GenoMax said, stranded vs. unstranded is orthogonal to single-end vs. paired-end. A stranded paired-end library produced with most common kits would always have read 1 antisense- and read 2 sense-oriented relative to the transcript from which the fragment was derived. In these libraries, the orientation is random, so there is no way to know which strand the transcript was on, and when two transcripts overlap on opposite strands (not uncommon, especially not in plants), it's impossible or at least very difficult to know which transcript a given cDNA fragment came from.

Stranded libraries are no more expensive and hardly any more difficult to make, so I assume SEQC either did this a long time ago or wanted to keep everything unstranded for a fair test with SOLiD and 454 if those didn't have stranded kits (though the fact that they even bothered with those platforms supports the "a long time ago" hypothesis).

Ok, so how do I go about doing this? Do I only need the Cq values that I already have, or some other values as well? I take it there is some software (bioconductor?) that can help me do this?

To get the fold change between two samples, it's roughly (2 ^ (Cq of your transcript in condition 1 - Cq of your transcript in condition 2)) / (2 ^ (Cq of reference transcript in condition 1 - Cq of reference transcript in condition 2)). If I remember right and am typing it right. Look up qPCR quantification in a source that can format mathematical formulas. This is pretty basic and you don't need Bioconductor, just a calculator.

Basically, fold change [FPKM for sample A divided by FPKM for sample B, for all the genes in the set] vs. fold change [as calculated by DESeq2, the entire set, regardless of p-values]. My question is if the fold change values from DESeq2 are "wrong"/invalid/whatever, since the counts used to calculate fold change are already normalized (rather than being raw counts, which is what you should input into DESeq2 as far as I know).

I wouldn't divide FPKM by FPKM: it's not a mathematically sound unit and can't be used upstream of any serious analysis, though it's tolerable as a figure scale for people who know what's a low or high value. If you want an FPKM-related fold change, cuffdiff can give you one, though I wouldn't use that for real science either.

As for the counts, I don't know exactly what limma is doing to them but it seems like the logical thing is just to go inside limma and ask it for the fold change directly. Or give DESeq2 the unadulterated counts.

And any software may have trouble with this data set because they all expect biological replicates and you only have technical replicates.

Let me put a plug in for BBMap. Not much learning required to operate that

I'm not familiar with BBMap but from a quick glance it doesn't look like it's splice-aware, so it's probably not suitable for eukaryotic RNA-seq.

**GenoMax** · 06-19-2015, 07:26 AM

Originally posted by jwfoley View Post

I'm not familiar with BBMap but from a quick glance it doesn't look like it's splice-aware, so it's probably not suitable for eukaryotic RNA-seq.

BBMap is absolutely splice-aware. http://seqanswers.com/forums/showpos...25&postcount=6

You may also want to check this thread out for other things BBMap can do (and do very efficiently, Brian's code is optimized): http://seqanswers.com/forums/showthread.php?t=58221.

**ErikFas** · 06-19-2015, 07:29 AM

Originally posted by jwfoley View Post

I wouldn't divide FPKM by FPKM: it's not a mathematically sound unit and can't be used upstream of any serious analysis, though it's tolerable as a figure scale for people who know what's a low or high value. If you want an FPKM-related fold change, cuffdiff can give you one, though I wouldn't use that for real science either.

Yeah, like I said in the first post, I'm not going to use this for anything, I'm just using it as a way to explore the data and get familiar with it.

Originally posted by jwfoley View Post

As for the counts, I don't know exactly what limma is doing to them but it seems like the logical thing is just to go inside limma and ask it for the fold change directly. Or give DESeq2 the unadulterated counts.

I would love to give the raw counts, but the problem is that (as far as I can tell) the paper/GEO doesn't actually supply them, only normalized ones, hence all the fuss about doing another alignment myself (so I can get the raw counts).

Originally posted by jwfoley View Post

And any software may have trouble with this data set because they all expect biological replicates and you only have technical replicates.

Oh, you can't use DESeq2/limma/edgeR/etc. for technical replicates at all? I seem to recall that they assume biological variation (and tries to see if the difference in expression is due to "real" differences between groups or just biological variation?), but does that mean it can't do technical replicates at all?

**jwfoley** · 06-19-2015, 07:31 AM

Originally posted by GenoMax View Post

BBMap is absolutely splice-aware. http://seqanswers.com/forums/showpos...25&postcount=6

You may also want to check this thread out for other things BBMap can do (and do very efficiently, Brian's code is optimized): http://seqanswers.com/forums/showthread.php?t=58221.

Okay, maybe splice-aware, but not transcriptome-aware, if you can't give it annotations...

Ugh FINE I'll try BBMap, but I won't like it!

**GenoMax** · 06-19-2015, 07:40 AM

Originally posted by jwfoley View Post

Okay, maybe splice-aware, but not transcriptome-aware, if you can't give it annotations...

Does that matter? Once you have alignments you can use any annotation you like to get the counts.

**jwfoley** · 06-19-2015, 07:44 AM

Originally posted by ErikFas View Post

I would love to give the raw counts, but the problem is that (as far as I can tell) the paper/GEO doesn't actually supply them, only normalized ones, hence all the fuss about doing another alignment myself (so I can get the raw counts).

If this is a serious problem for you, PM me and maybe we can find a way for me to give you my STAR alignments.

Oh, you can't use DESeq2/limma/edgeR/etc. for technical replicates at all? I seem to recall that they assume biological variation (and tries to see if the difference in expression is due to "real" differences between groups or just biological variation?), but does that mean it can't do technical replicates at all?

I'm not saying you can't, just that I don't know the best way. Maybe it's completely sensible to e.g. tell DESeq2 that each HiSeq lane is a different sample, and then just use flow cell, sequencing site, and RNA sample as three different condition variables.

**Michael Love** · 06-21-2015, 03:59 AM

"My question is if the fold change values from DESeq2 are "wrong"/invalid/whatever, since the counts used to calculate fold change are already normalized (rather than being raw counts, which is what you should input into DESeq2 as far as I know)."

just to reiterate, the important thing is to not give counts to DESeq2 which are corrected for library size. It's not that the fold changes would be inaccurate (assuming the normalization is accurate), but that the point of performing statistical inference using the count model doesn't make sense if the input matrix isn't on the scale of counts or the sampling process that generated them.

**Brian Bushnell** · 06-21-2015, 08:28 AM

Originally posted by jwfoley View Post

Okay, maybe splice-aware, but not transcriptome-aware, if you can't give it annotations...

Ugh FINE I'll try BBMap, but I won't like it!

What!!! Everyone who tries it likes it

Topics	Statistics	Last Post
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 25 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 42 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 48 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM
A New Single-Cell Method Maps DNA-Protein Interactions by SEQadmin2 Started by SEQadmin2, 06-04-2026, 08:59 AM	0 responses 49 views 0 reactions	Last Post by SEQadmin2 06-04-2026, 08:59 AM

Unconfigured Ad

Benchmarking RNA-seq DE-tools

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News