  • shocker8786
    Member
    • Jan 2013
    • 28

    Pooling Samples for Sequencing

    I am working on an experiment in which we are going to be doing reduced representation bisulfite sequencing and RNA-seq. Our plan is to pool the samples for each treatment group and sequence each treatment group as a single sample. So for example if we have 4 samples for treatment A we will combine them and sequence the group as a single treatment A sample.

    My question is when using this method are there any issues with comparing pools of different sample size? I have two treatments, one has 4 samples and the other has 5. Can I use all the samples from each treatment, or do I have to remove one from the second group, so I have pools of 4 samples for each group?

    In other words, are there any issues associated with comparing pools of unequal sample size?
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    Sigh. Well, at least you ask before doing the experiment and ruining your project. No, the unequal sample sizes are not your problem.

    But how would you ever know whether an observed difference is statistically significant, i.e., large compared to what you observe between samples treated the same way, if you don't know how strong the differences between samples in the same treatment group are?

    Maybe I'm in a bad mood because it's early in the morning, but as you are the n-th person to ask this question here: I still don't get it. Why would anyone even think about pooling samples without multiplexing? I met people who claimed that they knew that the differences between equally treated samples are so small that they don't need to check, but curiously, these are only those people who have never done such an experiment.

    • shocker8786
      Member
      • Jan 2013
      • 28

      #3
      Thank you for your reply. I'm new to NGS analysis, so I may have this wrong, but my understanding was that when comparing differentially methylated sites between groups your statistics are based on comparing the number of methylated/unmethylated reads for each group.

      For example, you have a region where 50 reads are aligned in both pools. You would then determine statistical significance by comparing the methylated and unmethylated read counts of the two pools at that region.

      I was under this assumption based on the paper below.


      The sentence below was taken from the supplemental methods, where they explain how statistical significance was determined between the two cell lines.

      "For each methylation region, statistical significance of differential methylation was calculated using a Fisher’s exact test on a 2 × 2 contingency table of methylated and nonmethylated counts in the two cell lines."

      The way I interpret that is the reads are what give you statistical significance. If I'm mistaken would you be able to explain what I am missing? Thank you very much for your help, I really appreciate it!

      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #4
        Short answer: Using Fisher's exact test for this purpose is wrong. I don't have much time at the moment to look at it in detail, but the paper's analysis is most likely seriously flawed.

        Imagine you have 2 treated and 2 control samples:

        Control 1: 10 of 50 reads methylated
        Control 2: 30 of 50 reads methylated
        Treatment 1: 20 of 50 reads methylated
        Treatment 2: 40 of 50 reads methylated

        So, the methylation goes up by 10 reads, but between two samples within the same treatment group, the difference is 20 reads. Would you believe that this increase in methylation by 10 reads is due to the treatment? I'd rather say it is due to the same random variation that you see within group. Next time you do the same experiment, you might get the opposite result if things vary so much.

        Now, imagine you pooled the samples, so you see only the averages:

        Control: 20 of 50 reads methylated
        Treatment: 30 of 50 reads methylated

        Now you no longer know that there was a change of 20 between replicates, and might think that an increase by 10 is a lot. Fisher's exact test cannot know this either, which is why it is wrong to use this test.

        The advantage of pooling is, of course, precisely that you do not see that your results are unlikely to be reproducible, and hence are not discouraged from writing a paper anyway. The fact that referees still fail to spot this elementary mistake seems to help.
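A minimal sketch of the example above in Python, using only the standard library. It assumes each pool is sequenced to the combined depth of its two samples (100 reads per group). The second set of replicates (`tight`) is hypothetical and not from the thread; it is included only because it pools to exactly the same table. All function names are illustrative.

```python
from math import comb, sqrt
from statistics import mean, stdev

READS = 50  # reads per sample, as in the example above

# Methylated read counts per replicate.
noisy = {"control": [10, 30], "treatment": [20, 40]}   # numbers from the post
tight = {"control": [19, 21], "treatment": [29, 31]}   # hypothetical, low-variance case


def pooled_table(design):
    """Collapse replicates into one 2x2 table (meth, unmeth) per group."""
    table = []
    for group in ("control", "treatment"):
        meth = sum(design[group])
        table += [meth, READS * len(design[group]) - meth]
    return tuple(table)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalised hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    observed = weight(a)
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(row1 + row2, col1)


def welch_t(design):
    """Welch t statistic on the per-sample methylated fractions."""
    ctrl = [m / READS for m in design["control"]]
    trt = [m / READS for m in design["treatment"]]
    se = sqrt(stdev(ctrl) ** 2 / len(ctrl) + stdev(trt) ** 2 / len(trt))
    return (mean(trt) - mean(ctrl)) / se


for name, design in (("noisy", noisy), ("tight", tight)):
    p = fisher_exact_two_sided(*pooled_table(design))
    print(name, pooled_table(design), "Fisher p =", p, "t =", welch_t(design))
```

Both designs collapse to the same pooled table, so Fisher's exact test returns the same p-value for each. Only the per-sample statistic separates the case where the replicates agree from the case where they do not.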

        • microgirl123
          Senior Member
          • Jun 2012
          • 199

          #5
          I think what Simon is trying to say doesn't relate to NGS sequencing specifically. It relates to any set of samples you are trying to perform statistics on and get meaningful results. Basically, you cannot statistically compare two things unless you have replicates (n must be greater than 1 in your statistics formulas!). If you pool all your samples together into two groups, then you can't perform statistics because you only have one of each of two things (n=1).

          You should index each of your 4 samples for Treatment A and each of your 5 samples for Treatment B before pooling. Then you can perform your NGS analysis on the pooled sample and see how the differences between samples in Treatment A compare to the differences between samples in Treatment B.
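As a rough illustration of why indexing keeps the samples separable after pooling, here is a toy demultiplexing sketch. The barcode sequences, sample names, and the assumption that the index is the first four bases of each read are all made up for the example; real kits define their own indexes, and on many platforms the index is read separately.

```python
# Hypothetical 4-base sample indexes mapped to hypothetical sample names.
BARCODES = {"ACGT": "A1", "TGCA": "A2", "GATC": "B1"}
BARCODE_LENGTH = 4


def demultiplex(reads, barcodes=BARCODES):
    """Group reads by their leading barcode and strip the barcode off."""
    by_sample = {name: [] for name in barcodes.values()}
    unassigned = []
    for read in reads:
        sample = barcodes.get(read[:BARCODE_LENGTH])
        if sample is None:
            unassigned.append(read)  # unknown or mis-read barcode
        else:
            by_sample[sample].append(read[BARCODE_LENGTH:])
    return by_sample, unassigned


pooled_reads = ["ACGTTTAGG", "TGCACCGTA", "ACGTGGGCC", "NNNNACGTA", "GATCTTTTT"]
by_sample, unassigned = demultiplex(pooled_reads)
print({name: len(reads) for name, reads in by_sample.items()}, len(unassigned))
```

After this step each sample has its own set of reads again, so within-group variation can be estimated even though everything shared one lane.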

          • shocker8786
            Member
            • Jan 2013
            • 28

            #6
            Thank you very much for taking the time to explain, I understand what you are saying now. I cannot remember why the decision to pool was originally made, but your argument against it makes perfect sense. I'm definitely going to talk with my group about reconsidering our experimental design.

            Thanks again!

            • Simon Anders
              Senior Member
              • Feb 2010
              • 995

              #7
              Originally posted by microgirl123:
              "I think what Simon is trying to say doesn't relate to NGS sequencing specifically."
              Of course. But NGS is one of the few fields where people don't know this and nevertheless routinely get papers into high-ranking journals, which then causes newcomers to think that this is how it should be done.

              • Rick_R
                Junior Member
                • Sep 2013
                • 2

                #8
                I know this is many months after the original post, but I would like to pose a similar question.

                I work with cell lines, and can therefore produce many biological replicates. However, the cost of sequencing them all separately would be too high. One could sequence, say, 6 samples:
                1. Control A
                2. Control B
                3. Control C
                4. Treatment A
                5. Treatment B
                6. Treatment C

                Might it be better to sequence this instead:
                1. Control A + Control B
                2. Control C + Control D
                3. Control E + Control F
                4. Treatment A + Treatment B
                5. Treatment C + Treatment D
                6. Treatment E + Treatment F

                Is this a reasonable way to reduce the "noise" from biological variability/random variation while maintaining the number of samples sequenced?

                • Simon Anders
                  Senior Member
                  • Feb 2010
                  • 995

                  #9
                  Yes, it is.

                  It's still worth double-checking whether multiplexing really is that expensive: Even if you want to use only one lane for two samples, you can still gain information by marking the fragments from each sample with a barcode. You don't pay more for the sequencing, but you do pay extra for the steps up to the barcode ligation because they cannot be performed in a pooled fashion.
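On the pooling question answered at the top of this post, a back-of-the-envelope variance calculation shows why pooling pairs helps. It is only a sketch: it assumes that each individual contributes equally to its pool, that individuals are independent, and that the measurement splits into a biological variance $\sigma_b^2$ and a technical (library prep plus sequencing) variance $\sigma_t^2$. These symbols are introduced here and do not come from the thread.

```latex
% One library made from a pool of k individuals:
\operatorname{Var}(\text{library}) = \frac{\sigma_b^2}{k} + \sigma_t^2

% Mean of n such libraries in one treatment group:
\operatorname{Var}(\text{group mean}) = \frac{1}{n}\left(\frac{\sigma_b^2}{k} + \sigma_t^2\right)

% Three single samples (k = 1, n = 3) versus three pools of two (k = 2, n = 3):
\frac{\sigma_b^2 + \sigma_t^2}{3}
\quad\text{versus}\quad
\frac{\sigma_b^2/2 + \sigma_t^2}{3}
```

Under these assumptions the biological part of the noise is halved while the number of libraries, and therefore the ability to estimate within-group variation, stays the same.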

                  • aliceb
                    Member
                    • Jan 2010
                    • 18

                    #10
                    Hi all,

                    To dredge up an old question again, I was wondering if I could get an opinion on a pooling / not pooling design.

                    First, I understand that I want biological replicates! But is it better to work with replicates of pools or replicates of individuals? I'm leaning towards individuals because we can better call alleles, I think. But my main goal is to identify differentially expressed genes.

                    An example. We have 3 treatments to compare:

                    Option A: 5 individuals per treatment, giving me 15 libraries.
                    Option B: 5 pools (of 10 individuals?), again giving me 15 libraries, but summarizing 150 individuals.

                    Any thoughts on this option would be appreciated.

                    Thanks!

                    • Simon Anders
                      Senior Member
                      • Feb 2010
                      • 995

                      #11
                      Of course, B is the better option if you have so many samples anyway. (What are we talking about? Flies?) Unless you want to look at allele-specific expression, as you already noted. The trade-off here depends on how much signal you gain with B vs A and how much potentially interesting biology you lose by not being able to look at alleles.

                      The option I argued against is

                      Option C: Pool all the samples from each treatment, giving you 3 libraries in total.

                      It seems to be non-obvious to distressingly many practitioners why that one is not acceptable.


                      If it does not cost anything extra, you should consider

                      Option D: Label the cDNA from each individual with a barcode, then pool them all in one big library, spread over 15 sequencing lanes.

                      This offers you the most information, but requires you to do all the sample-prep steps up to the barcoding 150 times in parallel, which is practicable only if there are only a few steps before the pooling and/or you have suitable robotics or lots of patience.

                      • aliceb
                        Member
                        • Jan 2010
                        • 18

                        #12
                        Thanks for the reply! We're working with wasps that can be grown up, but high numbers will be a bit of a struggle. And as they're variable, sexual populations, there will certainly be information that is lost by pooling.

                        Option D sounds fantastic. But as I actually have 12 experimental lines to sequence (well, 3 blocks of 4 parallel lines), with at least 5 biological replicates each, I think it's outside of my budget and pipetting capacity.

                        Also, when it comes to pooling, do you have an opinion on how many individuals to use? It seems like pools of only 5 individuals might have problems with one weirdo dominating the response. But how high would one have to go to avoid that? This is where my number limitations come in. I would like 10 per pool, but might be limited to fewer.
                        Last edited by aliceb; 01-09-2014, 04:29 AM.
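One rough way to think about the pool-size question is sketched below. It is an idealisation, not a recommendation: it assumes every individual contributes the same amount of RNA to the pool, that effects add up on the measurement scale, and that individuals are independent. The standard deviations and the outlier size are arbitrary illustrative numbers.

```python
from math import sqrt


def pool_sd(sd_bio, sd_tech, k):
    """SD of one pooled library made from k equally contributing individuals."""
    return sqrt(sd_bio ** 2 / k + sd_tech ** 2)


def outlier_shift(delta, k):
    """How far one individual that deviates by delta moves the pooled value."""
    return delta / k


# Illustrative values: biological SD 1.0, technical SD 0.2, outlier 3 SDs away.
for k in (1, 2, 5, 10, 20):
    print(f"k={k:2d}  pool SD={pool_sd(1.0, 0.2, k):.3f}  "
          f"outlier shift={outlier_shift(3.0, k):.2f}")
```

Under these assumptions a single odd individual moves the pooled value by delta / k, and the biological part of the spread shrinks with the square root of k, so going from 5 to 10 individuals per pool helps, but with diminishing returns once the technical noise starts to dominate.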

                        • revAMI
                          Junior Member
                          • Jan 2014
                          • 1

                          #13
                          Library prep can be more expensive than the sequencing, so option D would have a significant added cost.

                          I have money for 18 preps and one run. I have three treatment groups and hundreds of samples. Is it better to pick six from each group at random, or do six pools (of how many?) for each group?

                          Pooling would reduce chance bias from biological variability, and give a stronger signal for the most changed genes. It would also be more emotionally satisfying to use more of my samples. On the other hand, it would make allele-specific expression and alternative splicing much harder to do.

                          This is in humans, so I'm not concerned about creating a de novo transcriptome.

                          Which would look better when applying for a follow-up grant to do more samples?
