Redundant reads are removed from ChIP-seq, what about RNA-seq?


  • Redundant reads are removed from ChIP-seq, what about RNA-seq?

    I have dealt with both ChIP-seq and RNA-seq analysis. In ChIP-seq, it's almost standard procedure to remove redundant reads that map to the same location with the same orientation. That's reasonable because it's very unlikely for sonication to break the genomic sequence at exactly the same position more than once during sample preparation. So if we see redundant reads, they are most likely PCR duplicates.

    However, it seems NOT to be standard to remove redundant reads for RNA-seq. My understanding is that the total coding sequence length is much shorter than the genome, which significantly increases the chance of the same location being selected for sequencing. But then how do you distinguish redundancy caused by PCR amplification from redundancy that arises by chance?

    I have this concern because I have seen certain genes with much higher read counts in one biological replicate than in the others. Probably more than 100-fold! That's very unlikely to happen because of biological variation; it is more likely related to PCR bias.

    Any thoughts?

    - L
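    For intuition on the question above, here is a rough back-of-the-envelope sketch (not from the thread; read and target-size numbers below are illustrative assumptions): under a uniform-sampling Poisson model, the expected number of positions hit two or more times by chance grows sharply as the sampled space shrinks, which is why RNA-seq sees far more coincidental duplicates than ChIP-seq.

```python
import math

def expected_chance_duplicates(n_reads, n_positions):
    """Expected number of start positions sampled 2+ times purely by
    chance, assuming reads land uniformly at random (Poisson model)."""
    lam = n_reads / n_positions                        # mean reads per position
    p_dup = 1 - math.exp(-lam) - lam * math.exp(-lam)  # P(count >= 2)
    return n_positions * p_dup

# Illustrative numbers: 30M reads over a ~3 Gb genome (ChIP-seq-like)
# versus 30M reads over a ~100 Mb transcriptome (RNA-seq-like).
chip = expected_chance_duplicates(30_000_000, 3_000_000_000)
rna = expected_chance_duplicates(30_000_000, 100_000_000)
```

    Under these assumed numbers, chance duplicates are orders of magnitude more common in the RNA-seq-like case, even before any PCR bias enters.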

  • #2
    Originally posted by asiangg View Post
    I have this concern because I have seen certain genes with much higher read counts in one biological replicate than in the others. Probably more than 100-fold! That's very unlikely to happen because of biological variation; it is more likely related to PCR bias.

    Any thoughts?

    - L
    Is this fold difference a function of total sequencing depth?
    Rather than looking at one gene, why don't you instead look at the entire genome? If this particular gene is an outlier, it may well be due to biological variation.

    Comment


    • #3
      The dynamic range between the lowest and highest expressed mRNAs in a typical cell has been estimated at 10^5 to 10^7. If you remove redundant reads, you can lose the ability to accurately measure this dynamic range. Duplicates might result from PCR amplification, but as library depth increases you expect duplicates to occur even if your library has no PCR-introduced amplification bias. In particular, for short mRNAs that are highly expressed you will see a lot of duplicates, especially if your reads are not paired or you are not evaluating duplicates at the level of read pairs.

      As 'RockChalkJayhawk' indicates, if your libraries are of different depths, this can produce a large apparent difference in read counts for a particular gene between replicates. Read counts and the occurrence of duplicates are pretty much meaningless when not considered in the context of library depth AND quality (high error rates can cause you to underestimate the presence of PCR-introduced amplification bias...).

      When we compare tag redundancy levels across RNA-Seq libraries, we examine the mapping positions of read pairs (the outer genome coordinates of the sequenced cDNA fragments) on a per-N-reads-mapped basis.
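      A minimal sketch of that kind of per-fragment redundancy calculation (an illustration of the idea, not the poster's actual pipeline): key each read pair by the outer coordinates of its fragment and report duplicates per million mapped pairs, so libraries of different depths can be compared.

```python
from collections import Counter

def redundancy_per_million(fragments):
    """fragments: iterable of (chrom, outer_start, outer_end) tuples,
    one per mapped read pair. Returns (distinct_fraction, dups_per_million)
    so redundancy is expressed relative to mapped depth."""
    counts = Counter(fragments)
    total = sum(counts.values())
    distinct = len(counts)
    dup_pairs = total - distinct          # pairs beyond the first at each position
    return distinct / total, dup_pairs / total * 1_000_000

# Toy example: one fragment seen 3 times, one seen once.
frags = [("chr1", 100, 300)] * 3 + [("chr1", 500, 700)]
distinct_frac, dpm = redundancy_per_million(frags)
```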

      Comment


      • #4
        Both "malachig" and "RockChalkJayhawk" have made a few good points. Yes, we should keep the redundant reads in the library.

        However, my response is: isn't PCR amplification or sequencing biased toward certain genes that are highly GC-rich or have certain sequence features? Do we really expect the redundancy rate to be the same for all genes?

        In my case, the replicates contain similar read counts, so sequencing depth should not cause such large differences.

        Comment


        • #5
          If you're keeping your PCR cycles reasonable (less than 20 cycles, ideally less than 15) bottlenecking just doesn't tend to be a problem, and if it is, you can just spot it by eye. Mapping artifacts are a problem, but they can be solved with a confidence metric like posterior probability instead of just using all unique best hits. If you think you have bias, check for it. Don't just throw away good data and make your results less quantitative to get rid of artifacts you might not have anyway.
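          A confidence filter of the kind described here can be sketched against plain SAM records (an illustration of the idea, not the poster's actual method): keep only alignments whose MAPQ, the aligner's phred-scaled estimate of the probability that the mapping is wrong, clears a threshold, instead of keeping all unique best hits.

```python
def filter_by_mapq(sam_lines, min_mapq=30):
    """Keep SAM alignments whose MAPQ (column 5) meets the threshold.
    MAPQ = -10 * log10(P(mapping position is wrong)), as assigned by
    the aligner; 30 here is an arbitrary illustrative cutoff."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):      # header lines pass through untouched
            kept.append(line)
            continue
        fields = line.split("\t")
        if int(fields[4]) >= min_mapq:
            kept.append(line)
    return kept

# Toy records: one confident alignment (MAPQ 60), one dubious (MAPQ 5).
records = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t200\t5\t50M\t*\t0\t0\t*\t*",
]
confident = filter_by_mapq(records)
```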

          Comment


          • #6
            Well, it sounds so easy!!

            However, would you let me know how you "spot it by eye"? As for "a confidence metric like posterior probability", could you share any details on how it is calculated? Any references? Thx!

            - L


            Comment


            • #7
              Originally posted by asiangg View Post
              isn't PCR amplification or sequencing biased toward certain genes that are highly GC-rich or have certain sequence features? Do we really expect the redundancy rate to be the same for all genes?

              In my case, the replicates contain similar read counts, so sequencing depth should not cause such large differences.
              I guess my response to this question would be: yes, maybe. Certainly GC content is a factor. We have found in our own libraries that both highly GC- and AT-rich regions can be under-represented. Other factors that could bias amplification or sequencing of one gene more than another include secondary structure (duplex or hairpin formation), repeat content, and probably many others.

              However, all of these things, including GC content, do not change between your replicates. How would the GC content of a particular gene cause bias in one replicate but not the others? Do you have reason to suspect that something went wrong (or at least differently) in the construction or sequencing of one replicate versus the others?

              It would be interesting to know more about the layout of your experiment. How many replicates are we talking about here? Is one library an outlier while all other replicates are highly similar by comparison? Is a standard number of PCR cycles used for amplification or is it varied to compensate for varying input amounts?

              I agree with 'jwfoley' regarding throwing out data. If you can identify and characterize some bias and it is systematic in a way that can be corrected, then you can deal with it. If you can identify a particular biological replicate as being a serious failure (i.e. something went wrong during sample prep., library construction or sequencing) then you might consider discarding the whole replicate (as long as you can justify this). Failing these options you may have to live with the variability. It could well be biological variability... in which case you are stuck with it.

              Comment


              • #8
                Originally posted by asiangg View Post
                Well, it sounds so easy!!

                However, would you let me know how you "spot it by eye"? As for "a confidence metric like posterior probability", could you share any details on how it is calculated? Any references? Thx!

                - L
                This came up in another thread and I explained it at the bottom. I'm not aware that anyone has published this, but I've heard a rumor that some similar approach will be used in new versions of popular short-read aligners.

                Comment


                • #9
                  Hi "jwfoley":

                  Thank you for suggesting the use of the MAPQ score. It seems very useful for RNA-seq.

                  Can you tell me the definition of "PCR bottlenecked"? How do you judge whether a sample is "PCR bottlenecked"?

                  Although I agree with using posterior mapping probability for RNA-seq, I still think we should remove redundant reads for ChIP-seq. For RNA-seq, we cannot distinguish PCR amplification from independent fragments, so let's keep the duplicates when they appear and hope that PCR amplification applies uniformly to all fragments.

                  But for ChIP-seq, it's extremely unlikely for the sonicator to break the same genomic location many times over. So removing redundant reads eliminates PCR and mapping artifacts altogether, and only in rare cases will it throw away useful information!

                  - L



                  Comment


                  • #10
                    Originally posted by asiangg View Post
                    Hi "jwfoley":

                    Can you tell me the definition of "PCR bottlenecked"? How do you judge whether a sample is "PCR bottlenecked"?
                    The telltale sign is "stacks" of reads that all start at the same position. Of course, those could also be mapping artifacts, and when I filter by posterior probability I basically stop seeing those in my data. But if your sample still has an abundance of stacks after filtering, it might be bottlenecked. Ideally you should just redo the experiment with fewer PCR cycles and more input.

                    If you have more than one sample and a stack is in the same place in all of them, that's probably a mapping artifact. If the stacks seem to occur in random places (and they're in exons where you expect them), it's more suggestive of bottlenecking.
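                    The "stacks" described here can also be flagged programmatically rather than purely by eye; a simple sketch (the threshold is an arbitrary illustrative assumption, not a published cutoff):

```python
from collections import Counter

def find_stacks(read_starts, min_stack=50):
    """Flag positions where an unusually large number of reads share the
    same start coordinate and strand -- the 'stacks' that suggest PCR
    bottlenecking or mapping artifacts.
    read_starts: iterable of (chrom, pos, strand) tuples."""
    counts = Counter(read_starts)
    return {pos: n for pos, n in counts.items() if n >= min_stack}

# Toy data: 60 reads piled at one position, a normal pair elsewhere.
starts = [("chr1", 100, "+")] * 60 + [("chr1", 200, "+")] * 2
stacks = find_stacks(starts, min_stack=50)
```

                    Positions that stack up in every sample are more suspicious as mapping artifacts; comparing the flagged positions across replicates follows the eyeballing rule given above.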


                    Originally posted by asiangg View Post
                    Although I agree with using posterior mapping probability for RNA-seq, I still think we should remove redundant reads for ChIP-seq. For RNA-seq, we cannot distinguish PCR amplification from independent fragments, so let's keep the duplicates when they appear and hope that PCR amplification applies uniformly to all fragments.
                    I still disagree. ChIP-seq is as quantitative as RNA-seq and you lose sensitivity by discarding data. A peak-caller won't work well if you've flattened all the peaks. As the throughput of our machines goes up, you'll be throwing away more and more perfectly good data; I suspect they're already well beyond the point where you toss out more signal than noise by doing this.

                    Really, the domain of sequences you'll pull out in transcription-factor ChIP-seq may be smaller than in RNA-seq: regardless of what proportion of the genome you think is transcribed, surely even less of it is bound by any individual protein. So the odds of duplicate reads containing signal rather than technical noise would actually be higher for TF ChIP-seq.

                    Originally posted by asiangg View Post
                    But for ChIP-seq, it's extremely unlikely for the sonicator to break the same genomic location many times over.
                    No, DNA is not equally strong in all places, and there are biases in where it likes to shear. Of course that's also true for RNA, and ligases have nucleotide preferences too. No amount of technical perfection will get around the heterogeneity of molecular biology.

                    Comment


                    • #11
                      Even if the genome is heterogeneous, it's still very unlikely that the same location will be broken twice or more. As library depth goes up, we can compensate by allowing 2 or 3 duplicates instead of just one.

                      If the antibody pull-down is very small, there is very little material for sequencing, and the machine ends up sequencing PCR products over and over, which results in a lot of duplicated reads. I don't think filtering by mapping quality can solve this problem.

                      - L


                      Comment


                      • #12
                        Originally posted by asiangg View Post
                        Even if the genome is heterogeneous, it's still very unlikely that the same location will be broken twice or more. As library depth goes up, we can compensate by allowing 2 or 3 duplicates instead of just one.

                        If the antibody pull-down is very small, there is very little material for sequencing, and the machine ends up sequencing PCR products over and over, which results in a lot of duplicated reads. I don't think filtering by mapping quality can solve this problem.

                        - L
                        If you are determined to throw away some of your data, at least inspect it in a browser and make sure that what you're throwing away looks like an artifact ("stacks" as described above) and not real signal. As I said, when I filter by probability I generally don't see those artifacts anymore, but your data may be different.

                        Comment
