**Jon_Keats** · 04-18-2011, 08:35 AM

What do you want to detect? What tissue type? The number of reads and read length really depend on the immediate goal associated with your hypothesis and tissue type. If you want to have your cake and eat it too (gene expression, transcript expression, mutation detection), you will need more reads than for gene expression alone: say, 150 million versus 25 million. Also, replicates are nice to have for the expression estimates.

**gavin.oliver** · 04-18-2011, 11:45 AM

Hi, thanks for the reply.

I guess my ideal answer would tell me how the length/number of reads changes with application e.g. for gene expression vs transcript expression and discovery vs SNP detection.

As an example, looking at good vs poor prognosis colorectal tumor samples.

What number of replicates is desirable/computationally realistic?

**Simon Anders** · 04-18-2011, 10:19 PM

Usually, people do two to three replicates per condition. This is enough to estimate variance but not sufficient to overcome a bad signal-to-noise ratio (SNR). In your case, you should expect the SNR to be very bad: the signal (differences in expression due to difference in prognosis) will probably only very rarely be larger than the noise (differences due to the fact that each sample is from another patient with another genotype).

This is the reason that all these experiments attempting to link cancer prognosis to expression levels are done with tens of replicates (and then with microarrays, because that is still cheaper), and why, even so, they usually lead to nothing.

Are you sure you have the resources to do such a project? Your post does not sound as if you were aware that this is way more ambitious than your average RNA-Seq project.

**gavin.oliver** · 04-18-2011, 11:34 PM

Hi Simon & thanks again

In short, not certain at all

This is very much fact-finding.

Are there any published results of studies in this vein using (or attempting to use) NGS in place of microarray?

I guess I'm trying to get a concrete feel for how close NGS is to 'plugging into' areas traditionally using microarrays e.g. multivariate diagnostic/prognostic classification, SNP association studies etc.

**gavin.oliver** · 04-19-2011, 12:09 PM

No more luck here

If no one is aware of work in this area, let me ask in theory:

If one had the will/finances to profile 50 good prognosis and 50 poor prognosis tumors with 150 million reads per patient, what size of an undertaking would it be computationally? Is it huge? Or by generating read counts one patient at a time, does it become nothing more than a collection of 100 data matrices?

**pbluescript** · 04-19-2011, 12:36 PM

> Originally posted by gavin.oliver
>
> No more luck here
>
> If no one is aware of work in this area, let me ask in theory:
>
> If one had the will/finances to profile 50 good prognosis and 50 poor prognosis tumors with 150 million reads per patient, what size of an undertaking would it be computationally? Is it huge? Or by generating read counts one patient at a time, does it become nothing more than a collection of 100 data matrices?

That would be a massive undertaking. Even at the lowest level of sequencing (36 bp, single end), you would generate 540 gigabases of data (50*2*150,000,000*36). However, 36 bp reads aren't that great for RNA-Seq, especially in a complex genome. If you go for 100 bp paired-end reads, you'd be talking about 3 terabases of data, which is approaching what was generated for the pilot paper for 1000 Genomes published in Nature last year (4.9 terabases). To generate and analyze that amount of data would probably require millions of dollars.
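
For anyone who wants to redo this arithmetic with their own numbers, here is a minimal sketch. The function name `total_bases` is made up for illustration, and it assumes that in the paired-end case "150 million reads" means 150 million read pairs, which is the reading that reproduces the 3 terabase figure.

```python
def total_bases(samples: int, reads_per_sample: int, read_length: int,
                paired: bool = False) -> int:
    """Total sequenced bases for a cohort.

    Assumption: for paired-end runs, reads_per_sample counts read pairs,
    so each pair contributes two mates of read_length bases.
    """
    mates = 2 if paired else 1
    return samples * reads_per_sample * read_length * mates


SAMPLES = 50 * 2          # 50 good prognosis + 50 poor prognosis tumors
READS = 150_000_000       # reads (or read pairs) per patient

single_36 = total_bases(SAMPLES, READS, 36)
paired_100 = total_bases(SAMPLES, READS, 100, paired=True)

print(f"36 bp single end:  {single_36 / 1e9:.0f} gigabases")
print(f"100 bp paired end: {paired_100 / 1e12:.0f} terabases")
```

With the numbers from this thread it gives 540 gigabases and 3 terabases respectively.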

**Simon Anders** · 04-19-2011, 12:41 PM

If we are talking about analysing mRNA to see changes in gene expression, high-throughput sequencing will not provide much advantage over microarrays. With HTS, you might get better precision for your expression estimates (and even this only if you have enough reads), but the measurement precision is not the limiting factor here anyway, the patient-to-patient variation is.

You may hope to get better prognostic signatures by looking at features that are hard to see by microarrays, e.g., changes in splicing rather than expression, or appearance of fusion genes. This might be a long shot, though.

Finally, if you go for genomic sequencing and look for structural variants, you may hope to find something in the size range of variants too large for SNP chips and too small for array-CGH / tiling arrays. Again, whether cancer signatures are likely to be found there is anyone's guess.

**gavin.oliver** · 04-20-2011, 12:44 AM

These are great answers guys - thanks a lot

A few small things:

1) How to convert bases to bytes if talking about 3 terabases of data? I'm guessing it depends on the format it's supplied in?

2) As that data is mapped and converted to read counts, how much does it shrink?

3) How would that data likely be supplied by a sequencing provider? On disk?

Thanks for all your help.

**Jon_Keats** · 04-20-2011, 05:57 AM

Most suppliers will provide portable drives with the data as fastq files, which contain the read sequence and a quality value for each base (i.e., 2 characters per base). These files also contain read identifiers, which can be of varying lengths, so file size per base will vary by vendor and platform, but expect 3-4 bytes per base. Depending on the processing method, the intermediate files can take up a huge amount of space, and to be conservative, expect the final output BAM file, which you will likely want to keep, to be roughly the same size. From that file you can use a variety of programs to generate the count data, which ends up being very small: 1-5 MB per sample. If you did one sample per lane on the Illumina HiSeq using 50x50 reads, you would end up with two 6-15 GB fastq files, after alignment a single BAM file of 6-15 GB, and then a couple of count files in the 1-5 MB range.
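
As a rough back-of-the-envelope for the 100-sample scenario discussed earlier in the thread, the rules of thumb above can be sketched as follows. The 3 terabase total comes from the earlier estimate; treating the BAM as the same size as the fastq is the conservative assumption stated above, not a measurement, and intermediate files are not counted.

```python
TOTAL_BASES = 3 * 10**12        # 3 terabases, from the estimate earlier in the thread
BYTES_PER_BASE = (3, 4)         # rule of thumb above: 3-4 bytes per base in fastq
SAMPLES = 100
COUNT_FILE_BYTES = (1 * 10**6, 5 * 10**6)   # 1-5 MB of count data per sample

# fastq size range in bytes
fastq_low = TOTAL_BASES * BYTES_PER_BASE[0]
fastq_high = TOTAL_BASES * BYTES_PER_BASE[1]

# conservative assumption: final BAM files take about as much space as the fastq
bam_low, bam_high = fastq_low, fastq_high

# count tables for the whole cohort
counts_low = SAMPLES * COUNT_FILE_BYTES[0]
counts_high = SAMPLES * COUNT_FILE_BYTES[1]

TB, MB = 10**12, 10**6
print(f"fastq:  {fastq_low / TB:.0f}-{fastq_high / TB:.0f} TB")
print(f"BAM:    {bam_low / TB:.0f}-{bam_high / TB:.0f} TB")
print(f"counts: {counts_low / MB:.0f}-{counts_high / MB:.0f} MB")
```

So the raw and aligned data would be on the order of 9-12 TB each, while the count data for all 100 samples together would be a few hundred MB at most.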

Do consider the comments from Simon; he is definitely one of the more knowledgeable contributors to this forum. In general I agree with him that these types of questions are still best answered using microarrays rather than RNAseq, largely because there are many more nuances you need to consider when designing an RNAseq experiment, and to do it properly the cost will scare you. If you cut corners because of cost, you will likely end up regretting your decision later when you realize you can't do XYZ or the results are inaccurate. That being said, I have more hope that expression profiling does provide a good means to identify prognostic groups. In fact I often say the only thing it is good for is subset identification, and I have a number of similar studies running currently. The one issue you always run into, regardless of arrays or sequencing, is the old "garbage in, garbage out" phenomenon. If you cannot purify the tumor cells to high purity (i.e., 85% or greater as a minimum), you are likely to end up with feelings similar to Simon's comment "they usually lead to nothing". Even with that, I can tell you that in our field, where we can robustly purify tumor cells by magnetic sorting to an average of 95% purity, the really good risk models only fell out in cohorts of 250-350 patients.

**gavin.oliver** · 04-20-2011, 06:06 AM

Jon - thanks for the comprehensive response!

Rest assured I take Simon's (and your) responses very seriously and disagree with none of them!

I am just keen to build a strong concept of the why-nots in an area where I have limited practical experience.

So it's fair to say that prognostic classification should remain a microarray-based pursuit.

Do you feel that will change in the near future?

**Jon_Keats** · 04-20-2011, 06:58 AM

I don't think it should remain a microarray-based pursuit; the sequencing-based readouts have so many advantages that they will unquestionably be the future. The question is what to do today. On my end we are pushing forward with sequencing-based measurements for a couple of reasons. First, our institute is heavily invested in NGS technology and has sold off our rooms full of Affymetrix equipment to make room for all the sequencers. So internally our hands are a bit tied, though we still have our Agilent platforms, which I actually prefer anyway. Second, we and others have had success in our field using affy arrays, so we assume similar models should fall out of a sequencing-based study. But I have to say my major impetus was/is that those arrays may not exist in a couple of years, so we had better start moving our models to sequencing-based readouts to stay ahead of the curve. We also want to start integrating exome/genome sequencing with expression estimates, and the sequencing-based approach allows for things like allele-specific expression analysis that cannot be done on a conventional microarray platform.

The short answer is: if you have the money, and feel comfortable with NGS data or have collaborators who are, then go ahead. But like any experiment, maybe more so given the cost/risk, make sure the sample selection and analytical goal are specifically laid out in advance so library production and sequencing are performed correctly. Once that decision is made, I would always suggest a 3-sample test batch to see if you can handle the data and if the outlined plan is generating the data you need for your analytical goals; then ramp up to mass production of the 100-sample batch.

**gavin.oliver** · 04-20-2011, 07:08 AM

Apologies - I should have said "remain a microarray-based pursuit for now"

Our group have been firmly Affymetrix microarray-based for many years now and have been involved in some large scale prognostic classifier type work.

The thing is that I am trying to make the case for a move (tentative/partial at least) toward NGS, and I want to have a strong idea of what we can already do with it and what remains in the future, i.e. to what degree and in what applications it can already replace microarrays cost-effectively...

**Jon_Keats** · 04-20-2011, 07:40 AM

RNAseq is currently best used for small-scale test vs control comparisons or time series. But that largely assumes you want to look at both gene expression and transcript expression comparisons, where you need significant read depth for the latter. In your situation, I think that by limiting the analysis to "gene" expression you could generally replace affy arrays, getting rid of their many inaccuracies, for around double the cost per sample, likely less depending on vendor. Depending on the tissue, the cost could drop even more if you can multiplex, but the limitation will be relative gene expression. Take my situation of working on multiple myeloma, which is a plasma cell disease where the cell really is a factory producing immunoglobulin: we need to double the read count to get equal counts on the non-immunoglobulin genes compared to breast cancer, because ~50% of all the transcripts in the cell are immunoglobulin.
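
The "double the read count" point generalizes to any tissue with one dominant transcript class. A minimal sketch, assuming reads are sampled in proportion to transcript abundance (a simplification); `reads_needed` is a made-up helper name, and the 25 million target is borrowed from the gene-expression figure earlier in the thread purely as an example.

```python
def reads_needed(target_informative_reads: float, dominant_fraction: float) -> float:
    """Total reads to sequence so that the reads NOT taken up by a dominant
    transcript class (e.g. immunoglobulin) reach the target.

    Assumption: reads are drawn in proportion to transcript abundance.
    """
    if not 0 <= dominant_fraction < 1:
        raise ValueError("dominant_fraction must be in [0, 1)")
    return target_informative_reads / (1 - dominant_fraction)


# Example: aiming for 25 million informative reads per sample
print(reads_needed(25_000_000, 0.0))   # no dominant transcript class
print(reads_needed(25_000_000, 0.5))   # ~50% immunoglobulin, as in myeloma
```

At 50% immunoglobulin the required depth comes out at exactly twice the target, matching the doubling described above; the multiplier grows quickly as the dominant fraction rises.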

**gavin.oliver** · 04-20-2011, 10:15 AM

> Originally posted by Jon_Keats
>
> but the limitation will be relative gene expression. Take my situation of working on multiple myeloma, which is a plasma cell disease where the cell really is a factory producing immunoglobulin: we need to double the read count to get equal counts on the non-immunoglobulin genes compared to breast cancer, because ~50% of all the transcripts in the cell are immunoglobulin.

Is this where protocols like DSN normalisation can be of use? Or am I off the mark?

Planning an RNA-Seq Experiment