  • Planning an RNA-Seq Experiment

    Hi all,

    If planning an experiment to, for example, compare two types (stages) of human tumour, what number of reads and what number of biological replicates are currently considered acceptable?

  • #2
    What do you want to detect? What tissue type? The number of reads and the read length really depend on the immediate goal associated with your hypothesis and on the tissue type. If you want to have your cake and eat it too (gene expression, transcript expression, and mutation detection), you will need more reads than for gene expression alone: say 150 million versus 25 million. Also, replicates are nice to have for the expression estimates.
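As a rough sketch of the trade-off described above (the depths are the post's illustrative figures, not recommendations; real requirements depend on tissue, genome complexity, and goals):

```python
# Illustrative read depths quoted in the reply above; treat them as
# order-of-magnitude figures, not a design recommendation.
illustrative_depth = {
    "gene expression only": 25_000_000,
    "gene + transcript expression + mutation detection": 150_000_000,
}

# Read requirements scale with ambition: roughly 6x more reads
# for the full wish list than for gene expression alone.
ratio = (illustrative_depth["gene + transcript expression + mutation detection"]
         / illustrative_depth["gene expression only"])
print(ratio)  # 6.0
```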

    • #3
      Hi, thanks for the reply.

      I guess my ideal answer would tell me how the length/number of reads changes with application, e.g. gene expression vs. transcript expression, or discovery vs. SNP detection.

      As an example, looking at good vs poor prognosis colorectal tumor samples.

      What number of replicates is desirable/computationally realistic?

      • #4
        Usually, people do two to three replicates per condition. This is enough to estimate variance but not sufficient to overcome a bad signal-to-noise ratio (SNR). In your case, you should expect the SNR to be very bad: the signal (differences in expression due to differences in prognosis) will only very rarely be larger than the noise (differences due to the fact that each sample is from a different patient with a different genotype).

        This is the reason that all these experiments attempting to link cancer prognosis to expression levels are done with tens of replicates (and with microarrays, because those are still cheaper), and why, even so, they usually lead to nothing.

        Are you sure you have the resources to do such a project? Your post does not sound as if you were aware that this is way more ambitious than your average RNA-Seq project.

        • #5
          Hi Simon & thanks again

          In short, not certain at all

          This is very much fact-finding.

          Are there any published results of studies in this vein using (or attempting to use) NGS in place of microarray?

          I guess I'm trying to get a concrete feel for how close NGS is to 'plugging into' areas traditionally using microarrays e.g. multivariate diagnostic/prognostic classification, SNP association studies etc.

          • #6
            No more luck here

            If no one is aware of work in this area, let me ask in theory:

            If one had the will/finances to profile 50 good prognosis and 50 poor prognosis tumors with 150 million reads per patient, what size of an undertaking would it be computationally? Is it huge? Or, by generating read counts one patient at a time, does it become nothing more than a collection of 100 data matrices?

            • #7
              Originally posted by gavin.oliver
              If one had the will/finances to profile 50 good prognosis and 50 poor prognosis tumors with 150 million reads per patient, what size of an undertaking would it be computationally?
              That would be a massive undertaking. Even at the lowest level of sequencing (36 bp, single-end), you would generate 540 gigabases of data (50 × 2 × 150,000,000 × 36). However, 36 bp reads aren't that great for RNA-Seq, especially in a complex genome. If you go for 100 bp paired-end reads, you'd be talking about 3 terabases of data, which approaches what was generated for the 1000 Genomes pilot paper published in Nature last year (4.9 terabases). Generating and analyzing that amount of data would probably require millions of dollars.
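The arithmetic above can be checked with a quick sketch (the cohort size and per-patient read count are this thread's hypothetical numbers, not a recommendation):

```python
# Total bases for a hypothetical cohort: patients x reads per patient x read length.
# Paired-end sequencing yields two reads per fragment, doubling the base count.

def total_bases(patients, reads_per_patient, read_length, paired=False):
    """Return total sequenced bases for the whole cohort."""
    ends = 2 if paired else 1
    return patients * reads_per_patient * read_length * ends

cohort = 100            # 50 good-prognosis + 50 poor-prognosis patients
reads = 150_000_000     # reads per patient, as posed in the question

# 36 bp single-end: 540 gigabases, matching the figure above.
print(total_bases(cohort, reads, 36) / 1e9)                 # 540.0 (Gb)

# 100 bp paired-end: 3 terabases, matching the figure above.
print(total_bases(cohort, reads, 100, paired=True) / 1e12)  # 3.0 (Tb)
```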

              • #8
                If we are talking about analysing mRNA to see changes in gene expression, high-throughput sequencing will not provide much advantage over microarrays. With HTS, you might get better precision in your expression estimates (and even this only if you have enough reads), but measurement precision is not the limiting factor here anyway; the patient-to-patient variation is.

                You may hope to get better prognostic signatures by looking at features that are hard to see by microarrays, e.g., changes in splicing rather than expression, or appearance of fusion genes. This might be a long shot, though.

                Finally, if you go for genomic sequencing and look for structural variants, you may hope to find something in the size range of variants too large for SNP chips and too small for array-CGH / tiling arrays. Again, whether cancer signatures are likely to be found there is anyone's guess.

                • #9
                  These are great answers, guys - thanks a lot.

                  A few small things:

                  1) How do you convert bases to bytes when talking about 3 terabases of data? I'm guessing it depends on the format it's supplied in?

                  2) As that data is mapped and converted to read counts, how much does it shrink?

                  3) How would that data likely be supplied by a sequencing provider? On disk?

                  Thanks for all your help.

                  • #10
                    Most suppliers will provide portable drives with the data as FASTQ files, which contain the read sequence and quality values for each base (i.e. two characters per base). These files also contain read identifiers, which can be of varying lengths, so file size per base will vary by vendor and platform, but expect 3-4 bytes per base. Depending on the processing method, the intermediate files can consume a huge amount of space; to be conservative, expect the final BAM file that you will likely want to keep to be roughly the same size as the FASTQ input. From that file you can use a variety of programs to generate the count data, which ends up being very small: 1-5 MB per sample. If you did one sample per lane on the Illumina HiSeq using 50x50 reads, you would end up with two 6-15 GB FASTQ files, a single 6-15 GB BAM file after alignment, and then a couple of count files in the 1-5 MB range.
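The 3-4 bytes-per-base rule of thumb above can be turned into a rough size estimator. This is a sketch only: the 80 million reads per file is an assumed figure chosen to land inside the 6-15 GB per-file range quoted above, not a vendor specification.

```python
# Rough FASTQ size estimate from the bytes-per-base rule of thumb above.
# ~2 bytes/base for sequence + quality characters, plus headers, '+' lines,
# and newlines, brings the total to roughly 3-4 bytes per base.

def fastq_bytes(n_reads, read_length, bytes_per_base=3.0):
    """Approximate on-disk FASTQ size in bytes (assumed bytes_per_base)."""
    return n_reads * read_length * bytes_per_base

# Hypothetical 80 M reads per end at 50 bp: ~12 GB per FASTQ file,
# which falls inside the 6-15 GB range mentioned above.
size_gb = fastq_bytes(80_000_000, 50) / 1e9
print(round(size_gb, 1))  # 12.0
```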

                    Do consider many of the comments from Simon; he is definitely one of the more knowledgeable contributors to this forum. In general I agree with him that these types of questions are still best answered using microarrays rather than RNA-Seq, largely because there are many more nuances to consider when designing an RNA-Seq experiment, and to do it properly the cost will scare you. If you cut corners because of cost, you will likely end up regretting the decision later when you realize you can't do XYZ or the results are inaccurate.

                    That being said, I have more hope that expression profiling provides a good means to identify prognostic groups; in fact I often say the only thing it is good for is subset identification, and I have a number of similar studies running currently. The one issue you always run into, regardless of arrays or sequencing, is the old "garbage in, garbage out" phenomenon. If you cannot purify the tumor cells to high purity (i.e. 85% or greater, minimum), you are likely to end up with feelings similar to Simon's comment that "they usually lead to nothing". Even with that, I can tell you that in our field, where we can robustly purify tumor cells by magnetic sorting to an average of 95% purity, the really good risk models only fell out in cohorts of 250-350 patients.

                    • #11
                      Jon - thanks for the comprehensive response!

                      Rest assured I take Simon's (and your) responses very seriously and disagree with none of them!

                      I am just keen to build a strong concept of the why-nots in an area where I have limited practical experience.

                      So it's fair to say that prognostic classification should remain a microarray-based pursuit.

                      Do you feel that will change in the near future?
                      Last edited by gavin.oliver; 04-20-2011, 06:07 AM. Reason: typo

                      • #12
                        I don't think it should remain a microarray-based pursuit; sequencing-based readouts have so many advantages that it is unquestionable they will ultimately be the future. The question is what to do today. On my end, we are pushing forward with sequencing-based measurements for a couple of reasons. First, our institute is heavily invested in NGS technology and has sold off our rooms full of Affymetrix equipment to make room for all the sequencers, so internally our hands are a bit tied, though we still have our Agilent platforms, which I actually prefer anyway. Second, we and others have had success in our field using Affy arrays, so we assume similar models should fall out of a sequencing-based study. But I have to say my major impetus was/is that those arrays may not exist in a couple of years, so we had better start moving our models to sequencing-based readouts to stay ahead of the curve. We also want to start integrating exome/genome sequencing with expression estimates, and the sequencing-based approach allows for things like allele-specific expression analysis that cannot be done on a conventional microarray platform.

                        The short answer is: if you have the money, and feel comfortable with NGS data or have collaborators who are, then go ahead. But like any experiment, and maybe more so given the cost and risk, make sure the sample selection and analytical goal are specifically laid out in advance so library production and sequencing are performed correctly. Once that decision is made, I would always suggest a three-sample test batch to see if you can handle the data and whether the outlined plan is generating what you need for your analytical goals, then ramp up to mass production of the 100-sample batch.

                        • #13
                          Apologies - I should have said "remain a microarray-based pursuit for now"

                          Our group has been firmly Affymetrix microarray-based for many years now and has been involved in some large-scale prognostic-classifier work.

                          The thing is that I am trying to convince a move (tentative/partial at least) toward NGS, and I want a strong idea of what we can already do with it and what remains in the future, i.e. to what degree and in what applications it can already replace microarrays cost-effectively...

                          • #14
                            RNA-Seq is currently best used for small-scale test-vs-control comparisons or time series, largely assuming you want to look at gene expression and transcript expression comparisons, where you need significant read depth for the latter. In your situation, limiting the analysis to "gene" expression, you could generally replace Affy arrays, getting rid of their many inaccuracies, for around double the cost per sample, likely less depending on the vendor. Depending on the tissue, the cost could drop even more if you can multiplex, but the limitation will be relative gene expression. Take my situation of working on multiple myeloma, a plasma cell disease where the cell really is a factory producing immunoglobulin: compared to breast cancer, we need to double the read count to get equal counts on the non-immunoglobulin genes, because ~50% of all the transcripts in the cell are immunoglobulin.
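A minimal sketch of the read-budget arithmetic above, using the ~50% immunoglobulin fraction quoted for myeloma; the 25 M informative-read target is a hypothetical figure for illustration:

```python
# If a fraction of the library is taken up by an uninformative transcript class
# (here, immunoglobulin in myeloma), only the remainder counts toward other genes.

def reads_needed(target_informative_reads, uninformative_fraction):
    """Total reads required so the informative remainder hits the target."""
    return target_informative_reads / (1 - uninformative_fraction)

# To match a hypothetical 25 M informative reads when ~50% of transcripts
# are immunoglobulin, the total sequencing must double:
print(reads_needed(25_000_000, 0.5) / 1e6)  # 50.0 (million reads)
```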

                            • #15
                              Originally posted by Jon_Keats
                              ...but the limitation will be relative gene expression. Take my situation of working on multiple myeloma, a plasma cell disease where the cell really is a factory producing immunoglobulin: compared to breast cancer, we need to double the read count to get equal counts on the non-immunoglobulin genes, because ~50% of all the transcripts in the cell are immunoglobulin.
                              Is this where protocols like DSN normalisation can be of use? Or am I off the mark?
