  • gavin.oliver
    Senior Member
    • Jan 2010
    • 110

    Planning an RNA-Seq Experiment

    Hi all,

    If planning an experiment to compare, for example, two types (stages) of human tumour, how many reads and how many biological replicates are currently considered acceptable?
  • Jon_Keats
    Senior Member
    • Mar 2010
    • 279

    #2
    What do you want to detect? What tissue type? The number of reads and the read length really depend on the immediate goal associated with your hypothesis and on the tissue type. If you want to have your cake and eat it too (gene expression, transcript expression, mutation detection), you will need more reads than for gene expression alone: say 150 million versus 25 million. Also, replicates are nice to have for the expression estimates.

    Comment

    • gavin.oliver
      Senior Member
      • Jan 2010
      • 110

      #3
      Hi, thanks for the reply.

      I guess my ideal answer would tell me how the length/number of reads changes with application e.g. for gene expression vs transcript expression and discovery vs SNP detection.

      As an example, looking at good vs poor prognosis colorectal tumor samples.

      What number of replicates is desirable/computationally realistic?

      Comment

      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #4
        Usually, people do two to three replicates per condition. This is enough to estimate variance but not sufficient to overcome a bad signal-to-noise ratio (SNR). In your case, you should expect the SNR to be very bad: the signal (differences in expression due to difference in prognosis) will probably only very rarely be larger than the noise (differences due to the fact that each sample is from another patient with another genotype).

        This is the reason that all these experiments attempting to link cancer prognosis to expression levels are done with tens of replicates (and then with microarrays, because that is still cheaper), and why, even so, they usually lead to nothing.
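To put rough numbers on the replicate argument above, here is a minimal sketch of a textbook sample-size calculation for comparing two groups on a single gene. It is not from the thread: the normal approximation, the default `alpha` and `power`, and the function name `replicates_per_group` are all assumptions made for illustration. A real expression study tests thousands of genes at once and would need a much stricter per-gene threshold, so treat the results as optimistic lower bounds.

```python
from math import ceil
from statistics import NormalDist


def replicates_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate replicates per group for a two-sided, two-sample comparison.

    effect_size is the signal-to-noise ratio: the mean difference between
    groups divided by the patient-to-patient standard deviation.
    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d)^2.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist().inv_cdf          # quantile function of the standard normal
    z_alpha = z(1 - alpha / 2)        # two-sided significance threshold
    z_power = z(power)                # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Hypothetical signal-to-noise ratios, from strong to weak.
for d in (2.0, 1.0, 0.5):
    print(f"SNR {d}: about {replicates_per_group(d)} replicates per group")
```

Under these assumptions, a signal twice the size of the noise needs only a handful of replicates per group, while a signal half the size of the noise already needs dozens, which is in line with the "tens of replicates" mentioned above.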

        Are you sure you have the resources to do such a project? Your post does not sound as if you were aware that this is way more ambitious than your average RNA-Seq project.

        Comment

        • gavin.oliver
          Senior Member
          • Jan 2010
          • 110

          #5
          Hi Simon & thanks again

          In short, not certain at all

          This is very much fact-finding.

          Are there any published results of studies in this vein using (or attempting to use) NGS in place of microarray?

          I guess I'm trying to get a concrete feel for how close NGS is to 'plugging into' areas traditionally using microarrays e.g. multivariate diagnostic/prognostic classification, SNP association studies etc.

          Comment

          • gavin.oliver
            Senior Member
            • Jan 2010
            • 110

            #6
            No more luck here

            If no one is aware of work in this area, let me ask in theory:

            If one had the will/finances to profile 50 good prognosis and 50 poor prognosis tumors with 150 million reads per patient, what size of an undertaking would it be computationally? Is it huge? Or by generating read counts one patient at a time, does it become nothing more than a collection of 100 data matrices?

            Comment

            • pbluescript
              Senior Member
              • Nov 2009
              • 224

              #7
              Originally posted by gavin.oliver
              No more luck here

              If no one is aware of work in this area, let me ask in theory:

              If one had the will/finances to profile 50 good prognosis and 50 poor prognosis tumors with 150 million reads per patient, what size of an undertaking would it be computationally? Is it huge? Or by generating read counts one patient at a time, does it become nothing more than a collection of 100 data matrices?
              That would be a massive undertaking. Even at the lowest level of sequencing (36 bp, single end), you would generate 540 gigabases of data (50*2*150,000,000*36). However, 36 bp reads aren't that great for RNA seq, especially in a complex genome. If you go for 100 bp, paired end reads, you'd be talking about 3 terabases of data, which is approaching what was generated for the pilot paper for 1000 Genomes published in Nature last year (4.9 terabases). To generate and analyze that amount of data would probably require millions of dollars.
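The arithmetic above can be reproduced with a short sketch. The function name `total_bases` is made up for illustration, and the paired-end case assumes that "150 million reads" means 150 million read pairs (two mates per fragment), which is the reading that gives the 3 terabase figure.

```python
def total_bases(samples: int, reads_per_sample: int, read_length: int, paired: bool = False) -> int:
    """Total sequenced bases across all samples.

    For paired-end runs, reads_per_sample is assumed to count read pairs,
    so each one contributes two mates of read_length bases.
    """
    mates = 2 if paired else 1
    return samples * reads_per_sample * read_length * mates


# 50 good + 50 poor prognosis tumours, 150 million reads each.
single_36 = total_bases(100, 150_000_000, 36)
paired_100 = total_bases(100, 150_000_000, 100, paired=True)

print(f"36 bp single end: {single_36 / 1e9:.0f} gigabases")
print(f"100 bp paired end: {paired_100 / 1e12:.0f} terabases")
```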

              Comment

              • Simon Anders
                Senior Member
                • Feb 2010
                • 995

                #8
                If we are talking about analysing mRNA to see changes in gene expression, high-throughput sequencing will not provide much advantage over microarrays. With HTS, you might get better precision for your expression estimates (and even this only if you have enough reads), but the measurement precision is not the limiting factor here anyway, the patient-to-patient variation is.

                You may hope to get better prognostic signatures by looking at features that are hard to see by microarrays, e.g., changes in splicing rather than expression, or appearance of fusion genes. This might be a long shot, though.

                Finally, if you go for genomic sequencing and look for structural variants, you may hope to find something in the size range of variants too large for SNP chips and too small for array-CGH / tiling arrays. Again, whether cancer signatures are likely to be found there is anyone's guess.

                Comment

                • gavin.oliver
                  Senior Member
                  • Jan 2010
                  • 110

                  #9
                  These are great answers guys - thanks a lot

                  A few small things:

                  1) How to convert bases to bytes if talking about 3 terabases of data? I'm guessing it depends on the format it's supplied in?

                  2) As that data is mapped and converted to read counts, how much does it shrink?

                  3) How would that data likely be supplied by a sequencing provider? On disk?

                  Thanks for all your help.

                  Comment

                  • Jon_Keats
                    Senior Member
                    • Mar 2010
                    • 279

                    #10
                    Most suppliers will provide portable drives with the data as fastq files, which contain the read sequence and a quality value for each base (i.e. 2 characters per base). These files also contain read identifiers, which can be of varying lengths, so file size per base will vary by vendor and platform, but expect 3-4 bytes per base. Depending on the processing method, the intermediate files can take up a huge amount of space; to be conservative, expect the final output BAM file, which you will likely want to keep, to be roughly the same size. From that file you can use a variety of programs to generate the count data, which ends up being very small: 1-5 MB per sample. If you ran one sample per lane on the Illumina HiSeq using 50x50 reads, you would end up with two 6-15 GB fastq files, a single 6-15 GB BAM file after alignment, and then a couple of count files in the 1-5 MB range.
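As a back-of-the-envelope sketch of the bases-to-bytes conversion, the rules of thumb above (3-4 bytes per base for fastq, a BAM of roughly the same size, 1-5 MB of counts per sample) can be applied to the 3 terabase, 100 sample scenario discussed earlier in the thread. The function name `estimate_storage` is hypothetical, decimal units are assumed (1 TB = 1e12 bytes), and treating the BAM as equal in size to the total fastq is a deliberately conservative assumption rather than a measured ratio.

```python
TB = 1e12  # decimal terabyte in bytes (assumption: vendors quote decimal units)


def estimate_storage(total_bases, n_samples,
                     bytes_per_base=(3.0, 4.0),
                     count_mb_per_sample=(1.0, 5.0)):
    """Return (low, high) byte estimates for fastq, BAM and count files."""
    fastq = tuple(total_bases * b for b in bytes_per_base)
    bam = fastq  # conservative assumption: BAM is about as large as the fastq
    counts = tuple(n_samples * mb * 1e6 for mb in count_mb_per_sample)
    return {"fastq": fastq, "bam": bam, "counts": counts}


# 100 samples, 150 million 100 bp read pairs each = 3 terabases.
est = estimate_storage(total_bases=3_000_000_000_000, n_samples=100)
for name, (low, high) in est.items():
    print(f"{name}: {low / TB:.4f} to {high / TB:.4f} TB")
```

The point of the sketch is the contrast in scale: the raw and aligned data run to terabytes, while the count tables for the whole cohort stay well under a gigabyte.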

                    Do consider the comments from Simon; he is definitely one of the more knowledgeable contributors to this forum. In general I agree with him that these types of questions are still best answered using microarrays rather than RNAseq, largely because there are many more nuances you need to consider when designing an RNAseq experiment, and to do it properly the cost will scare you. If you cut corners because of cost, you will likely end up regretting your decision later when you realize you can't do XYZ or the results are inaccurate. That being said, I have more hope that expression profiling does provide a good means to identify prognostic groups. In fact I often say the only thing it is good for is subset identification, and I have a number of similar studies running currently. The one issue you always run into, regardless of arrays or sequencing, is the old "garbage in, garbage out" phenomenon. If you cannot purify the tumor cells to high purity (i.e. 85% or greater minimum), you are likely to end up with feelings similar to Simon's comment "they usually lead to nothing". Even with that, I can tell you that in our field, where we can robustly purify tumor cells by magnetic sorting to an average of 95% purity, the really good risk models only fell out in cohorts of 250-350 patients.

                    Comment

                    • gavin.oliver
                      Senior Member
                      • Jan 2010
                      • 110

                      #11
                      Jon - thanks for the comprehensive response!

                      Rest assured I take Simon's (and your) responses very seriously and disagree with none of them!

                      I am just keen to build a strong concept of the why-nots in an area where I have limited practical experience.

                      So it's fair to say that prognostic classification should remain a microarray-based pursuit.

                      Do you feel that will change in the near future?
                      Last edited by gavin.oliver; 04-20-2011, 06:07 AM. Reason: typo

                      Comment

                      • Jon_Keats
                        Senior Member
                        • Mar 2010
                        • 279

                        #12
                        I don't think it should remain a microarray-based pursuit; the sequencing-based read outs have so many advantages that it is unquestionable they will ultimately be the future. The question is what to do today. On my end we are pushing forward with sequencing-based measurements for a couple of reasons. First, our institute is heavily invested in NGS technology and has sold off our rooms full of Affymetrix equipment to make room for all the sequencers. So internally our hands are a bit tied, though we still have our Agilent platforms, which I actually prefer anyway. Second, we and others have had success in our field using affy arrays, so we assume similar models should fall out of a sequencing-based study. But I have to say my major impetus was/is that those arrays may not exist in a couple of years, so we had better start moving our models to sequencing-based read outs so we can stay ahead of the curve. We also want to start integrating exome/genome sequencing with expression estimates, and the sequencing-based approach allows for things like allele-specific expression analysis that cannot be done on a conventional microarray platform.

                        The short answer is if you have the money, and feel comfortable with NGS data or have collaborators who are, then go ahead. But like any experiment, maybe more so given the cost/risk, make sure the sample selection and analytical goal are specifically laid out in advance so library production and sequencing are performed correctly. Once that decision is made, I would always suggest a 3 sample test batch to see if you can handle the data and if the outlined plan is generating the data you need for your analytical goals, then ramp up to mass production of the 100 sample batch.

                        Comment

                        • gavin.oliver
                          Senior Member
                          • Jan 2010
                          • 110

                          #13
                          Apologies - I should have said "remain a microarray-based pursuit for now"

                          Our group have been firmly Affymetrix microarray-based for many years now and have been involved in some large scale prognostic classifier type work.

                          The thing is that I am trying to argue for a move (tentative/partial at least) toward NGS, and I want to have a strong idea of what we can already do with it and what remains in the future, i.e. to what degree and in what applications it can already replace microarrays cost-effectively...

                          Comment

                          • Jon_Keats
                            Senior Member
                            • Mar 2010
                            • 279

                            #14
                            RNAseq is currently best used for small-scale test vs control comparisons or time series. But that largely assumes you want to look at gene expression and transcript expression comparisons, where you need significant read depth for the latter. In your situation, I think that by limiting the analysis to "gene" expression you could generally replace affy arrays, getting rid of their many inaccuracies, for around double the cost per sample, likely less depending on vendor. Depending on the tissue, the cost could drop even more if you can multiplex, but the limitation will be relative gene expression. Take my situation of working on multiple myeloma, a plasma cell disease where the cell really is a factory producing immunoglobulin: we need to double the read count to get equal counts on the non-immunoglobulin genes compared to breast cancer, because ~50% of all the transcripts in the cell are immunoglobulin.
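The doubling in the myeloma example follows from simple proportions, sketched below. The function name `reads_needed` is made up for illustration, and the 25 million target is borrowed from the gene expression figure quoted earlier in the thread. The sketch assumes reads are sampled in proportion to transcript abundance, which ignores transcript length and library preparation biases.

```python
def reads_needed(target_informative_reads: float, dominant_fraction: float) -> float:
    """Total reads required so that the non-dominant transcripts still
    receive target_informative_reads.

    dominant_fraction is the share of all transcripts taken up by one
    dominant class (e.g. immunoglobulin in plasma cells).
    """
    if not 0 <= dominant_fraction < 1:
        raise ValueError("dominant_fraction must be in [0, 1)")
    return target_informative_reads / (1 - dominant_fraction)


baseline = reads_needed(25_000_000, 0.0)   # tissue with no dominant transcript class
myeloma = reads_needed(25_000_000, 0.5)    # ~50% of transcripts are immunoglobulin

print(f"baseline: {baseline / 1e6:.0f} million reads")
print(f"myeloma:  {myeloma / 1e6:.0f} million reads")
```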

                            Comment

                            • gavin.oliver
                              Senior Member
                              • Jan 2010
                              • 110

                              #15
                              Originally posted by Jon_Keats
                              but the limitation will be relative gene expression. Take my situation of working on multiple myeloma which is a plasma cell disease where the cell really is a factory producing immunoglobulin we need to double the read count to get equal counts on the non-immunoglobulin genes compared to breast cancer because ~50% of all the transcripts in the cell are immunoglobulin.
                              Is this where protocols like DSN normalisation can be of use? Or am I off the mark?

                              Comment
