SAMtools pileup of millions of reads from a single amplicon


  • SAMtools pileup of millions of reads from a single amplicon

    Hi all,


    We would like to pileup millions of reads from a single amplicon for ultra-sensitive mutation detection.

    Considering that SAMtools pileup is limited to several thousand reads at a given position, I am wondering whether you could suggest an alternative approach or workaround.


    Any feedback is highly appreciated!

  • #2
    Is that limit documented somewhere or based on personal experience?

    Heng Li has referred to pileup being able to handle 200 GB BAMs before (albeit not for one amplicon): http://seqanswers.com/forums/showthread.php?t=6680

    • #3
      I use

      samtools mpileup -BQ60 -d500000 -D -f

      for our low-variant detection. The "-d" option ("-d INT At a position, read maximally INT reads per input BAM. [250]") limits the depth of the pileup. I turn off the BAQ calculation (-B) because I find it depresses the scores of any variant. We allow only quality scores of 60 (-Q60) because our method greatly improves the quality scores; if you are looking at normal reads, you might skip that or set -Q to 30.

      • #4
        Given the error-prone nature of Illumina sequencing, there is a limit to how ultra sensitive you can be. I am skeptical that millions of reads will give you more true positives than a hundred thousand.

        • #5
          Originally posted by swbarnes2:
          Given the error-prone nature of Illumina sequencing, there is a limit to how ultra sensitive you can be. I am skeptical that millions of reads will give you more true positives than a hundred thousand.
          Agreed. The race to the bottom for ultra-sensitive variant detection seems to be conveniently ignoring the false positive rate right now, and it's quite disconcerting. Combined with your PCR-induced errors, you're asking for trouble.
          Last edited by Bukowski; 02-19-2014, 04:19 PM.

          • #6
            Originally posted by Bukowski:
            Agreed. The race to the bottom for ultra-sensitive variant detection seems to be conveniently ignoring the false positive rate right now, and it's quite disconcerting. Combined with your PCR-induced errors, you're asking for trouble.
            Of course, you're right! We are also thinking about these problems and are trying to address them using appropriate control samples.
            But that is a separate question; I just wanted to know whether it is possible to map millions of reads to a single location, process them with (m)pileup, and call variants on the result.

            • #7
              Originally posted by svos:
              Of course, you're right! We are also thinking about these problems and are trying to address them using appropriate control samples.
              But that is a separate question; I just wanted to know whether it is possible to map millions of reads to a single location, process them with (m)pileup, and call variants on the result.
              It's hard to say without knowing exactly how low you are trying to go, but I would NOT believe mpileup on anything less than a few % unless I had very solid spike-in data proving that the false positive and false negative rates were acceptable.

              • #8
                Originally posted by swbarnes2:
                It's hard to say without knowing exactly how low you are trying to go, but I would NOT believe mpileup on anything less than a few % unless I had very solid spike-in data proving that the false positive and false negative rates were acceptable.
                Again, you're right, but that's a separate problem... Hopefully we will have controls that allow us to perform such an analysis.

                The simple question is: is this kind of variant detection feasible on the technical / bioinformatic side, using e.g. (m)pileup or an alternative? Or will we already run into problems at this stage (setting aside the biological and sequencing issues)?

                • #9
                  Perhaps one solution is to compute it in sections (say, 1,000 reads at a time), computing a vector of A/C/G/T/- counts at each position along with confidences, and then combining those vectors in a second round of mpileup.

                  It's not possible with the current code, but in principle a formalized "reduced-reads"-style representation could yield a way to compute extreme-depth pileups in a memory-tractable manner.
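The additivity this idea relies on can be made concrete: per-position base counts summed over read chunks equal the counts from one full-depth pass. A minimal Python sketch, with pileup columns simulated as plain lists of base calls rather than actual samtools output (function names and the 1,000-read chunk size are illustrative assumptions):

```python
from collections import Counter

def pileup_counts(reads_at_position):
    """Count A/C/G/T/- calls among the reads covering one position."""
    return Counter(reads_at_position)

def chunked_pileup(all_reads, chunk_size=1000):
    """Process a deep column in fixed-size chunks and sum the
    per-chunk count vectors; counts are additive, so the totals
    match a single pass over all reads."""
    total = Counter()
    for start in range(0, len(all_reads), chunk_size):
        total += pileup_counts(all_reads[start:start + chunk_size])
    return total

# Simulated column: 5,000 reads, mostly 'A' with a low-frequency 'G' variant.
reads = ['A'] * 4950 + ['G'] * 40 + ['-'] * 10
assert chunked_pileup(reads, chunk_size=1000) == pileup_counts(reads)
print(chunked_pileup(reads))  # Counter({'A': 4950, 'G': 40, '-': 10})
```

Raw counts combine exactly; the harder part, as the post notes, is carrying per-chunk quality/confidence information through the merge, which plain summation does not address.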
