Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • RNA-Seq, lower coverage shows more differential expression

    Hi All,

    I've recently being assessing some RNA-Seq data that I have from a pilot project. When we begin our actual study though we'll have much more samples and hence we want to use less coverage to save money where possible. Basically we're looking at differential expression between 2 conditions, however when I randomly extracted 1/3 of the reads for each file and mapped them (using the exact same pipeline as the whole files) and then looked at differential expression, I found about 4 times more differentially expressed genes than with the all of the data.

    Any ideas why? I've been doing this using DESeq

    I also performed some clustering and found that the samples from the pilot study tend to fall into 2 fairly distinct groups, but look at differential expression within those groups isn't viable because of the sample size (only 3 samples in each group) and so DESeq doesn't detect anything as being significantly differentially expressed.

  • #2
    Dear bob loblaw

    thank you. The behaviour you report is not reasonable, and somewhere in your workflow or tool chain there must be a mistake. Can you report the sequence of steps (script) that you perform, to reproduce your observation?

    Best wishes
    Wolfgang
    Wolfgang Huber
    EMBL

    Comment


    • #3
      Hi Wolfgang,

      I've located at one problem in my workflow that resulted in only fraction of my reads mapping. I'm hoping that this problem can be attributed to this. I'm currently re-processing my samples now. Thanks though!

      Comment


      • #4
        Normalizing libraries sequenced a different deep

        I have a question somehow related to this discussion.
        If I have libraries to compare that have been sequenced with different deep:
        example:
        Lib1; 12,000,000 reads
        Lib2; 17,000,000 reads
        Lib3; 9,000,000 reads

        It is ok to randomly pick up 9,000,000 reads from the Lib1 and 2 raw data?

        Comment


        • #5
          Originally posted by colaneri View Post
          I have a question somehow related to this discussion.
          If I have libraries to compare that have been sequenced with different deep:
          example:
          Lib1; 12,000,000 reads
          Lib2; 17,000,000 reads
          Lib3; 9,000,000 reads

          It is ok to randomly pick up 9,000,000 reads from the Lib1 and 2 raw data?
          I think so, assuming that aside from varying coverages they are all prepared in exactly the same way.

          Comment


          • #6
            Originally posted by colaneri View Post
            I have a question somehow related to this discussion.
            If I have libraries to compare that have been sequenced with different deep:
            example:
            Lib1; 12,000,000 reads
            Lib2; 17,000,000 reads
            Lib3; 9,000,000 reads

            It is ok to randomly pick up 9,000,000 reads from the Lib1 and 2 raw data?
            But why do this? All well designed software for differential expression analysis will account for varying library depth.

            Comment


            • #7
              Originally posted by kmcarr View Post
              But why do this? All well designed software for differential expression analysis will account for varying library depth.
              Thats a good point, although the poster didn't say how they were comparing the libraries.

              Comment


              • #8
                i good rule of thumb when doing random subsetting of data is to do it more than once so you can observe that the results are stable and not jumping around due to the random sampling. So...sure go ahead and subset down to 9,000,000 reads but run a few iterations of the entire pipeline. if the results are stable you're good. if not then subsetting may not be appropriate, for whatever reason.
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #9
                  Originally posted by Wolfgang Huber View Post
                  Dear bob loblaw

                  thank you. The behaviour you report is not reasonable, and somewhere in your workflow or tool chain there must be a mistake. Can you report the sequence of steps (script) that you perform, to reproduce your observation?

                  Best wishes
                  Wolfgang
                  Well my workflow is very simple. I'm using bowtie2 to map my reads to a reference database, and then I'm using samtools idxstats function to create a count table. Then I merge rows with duplicate IDs, then I put it into R and run DESeq on it. It's a very simple workflow.

                  I've also noticed that for the random subset of reads and the whole file the exact same proportion of reads is mapped (i.e. if 67.4% is mapped for the whole file, then in the subset file 67.4% is also mapped) which at least indicates that this difference in the number of differentially expressed genes between the whole file and subset isn't down to some non random effect in the subsetting process.

                  Any ideas?
                  Thanks

                  Comment


                  • #10
                    So, to rephrase, you're mapping to a transcriptome reference and not a genome, correct? When you do this its very important that the rows you merge be all features that share exonic sequence - as in all alternative isoforms of a gene and even multi copy genes that reside in separate loci. Otherwise you're going to run into some confusion for sure. Bowtie is not designed nor capable of making alignment decisions between features with shared sequence beyond total random selection. It does make the same decision each time you run it but that's by design...they do a random number seeding trick to ensure this happens. So even if you're using an Ensemble annotation for mouse or human simply merging counts based on any of the provided ids isn't enough to remove the ambiguity problem.
                    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                    Salk Institute for Biological Studies, La Jolla, CA, USA */

                    Comment


                    • #11
                      Originally posted by sdriscoll View Post
                      So, to rephrase, you're mapping to a transcriptome reference and not a genome, correct? When you do this its very important that the rows you merge be all features that share exonic sequence - as in all alternative isoforms of a gene and even multi copy genes that reside in separate loci. Otherwise you're going to run into some confusion for sure. Bowtie is not designed nor capable of making alignment decisions between features with shared sequence beyond total random selection. It does make the same decision each time you run it but that's by design...they do a random number seeding trick to ensure this happens. So even if you're using an Ensemble annotation for mouse or human simply merging counts based on any of the provided ids isn't enough to remove the ambiguity problem.
                      I'm mapping against the DNA sequences of predicted proteins from sequenced genomes. The only rows that I merge are the ones which have identical IDs I hadn't even considered merging isoforms...

                      We originally did it to cut down on computational time more than anything (although it actually doesn't make that big of a difference in terms of the size of the count table)
                      Last edited by bob-loblaw; 05-29-2013, 09:12 AM.

                      Comment


                      • #12
                        This is all sounds as if some mistake happens during counting. Your approach of using samtools idxstats is rathe runorthodox, and I wonder if it is correct. It might be much safer to use some well-tested tool to obtain a count table instead of using some home-brewn solution.

                        Comment


                        • #13
                          Originally posted by Simon Anders View Post
                          This is all sounds as if some mistake happens during counting. Your approach of using samtools idxstats is rathe runorthodox, and I wonder if it is correct. It might be much safer to use some well-tested tool to obtain a count table instead of using some home-brewn solution.
                          From what I can see idxstats works perfectly, it needs to a bit of tweeking in R so that DESeq can read it, but nothing major.

                          I am open to suggestions though, can you give some examples of those well tested tools? Thanks
                          Last edited by bob-loblaw; 05-29-2013, 08:36 AM.

                          Comment


                          • #14
                            Actually, I don't get it. How do you get per-gene count with idxstat? I thought it only tells you the number of reads mapped to each reference sequence, i.e., to each chromosome. How do you get individual genes?

                            Comment


                            • #15
                              Originally posted by Simon Anders View Post
                              Actually, I don't get it. How do you get per-gene count with idxstat? I thought it only tells you the number of reads mapped to each reference sequence, i.e., to each chromosome. How do you get individual genes?

                              Maybe that depends on the reference that you've mapped to?

                              For me it throws out a table which has every gene ID from the genome I'm mapping to, along with it's length, and how many reads are mapped to that reference. For some of them it just gives me and organism and coordinates for the genome, but I can get more information about that from a GFF file I have of the annotations.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              8 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X