
Collapse PCR duplicates to a consensus read


  • Collapse PCR duplicates to a consensus read

    We are working on methods to detect very low frequencies of mutations in cancer patients. By drawing a blood sample and isolating the plasma, we are able to detect very small amounts of circulating tumor DNA (ctDNA). We want to be able to detect mutations at 1/100 or maybe 1/1000 of the germline DNA fraction, so we are using a targeted approach and sequencing to a depth of more than 10,000×. To be able to tell mutations from sequencing errors, we are adding a 6 bp unique identifier in the PCR reaction during library preparation. We want to find the PCR duplicates for each unique identifier (UID) and collapse these to a consensus sequence.

    UID1          ------G--------------------
    UID1          ---------------------------
    UID1          ---------------------------
    UID2           -------G----------C---------
    UID2           -------G--------------------
    UID2           -------G------A--------T----
    UID2                 ---------------------------
    UID2                 ---------------------------
    UID3                   ---------------------------
    UID3                       ---------------------T-----
    UID3                       ---------------------------
    As seen here, we have a REF->G mutation in UID2, but we want to weed out the other mutations found in the same UID. By taking the consensus sequence of all UIDs we will get this picture:

    UID1          ---------------------------
    UID2           -------G--------------------
    UID2                 ---------------------------
    UID3                   ---------------------------
    UID3                       ---------------------N-----
    We would also like the number of reads from which each consensus sequence is made to be stored, either as a tag in the BAM file or reflected in the quality score of the consensus sequence.

    Could you guide me on how this could be accomplished, or do you have any suggestions for the best tools to program this?
    Last edited by vang42; 09-30-2015, 09:21 AM. Reason: The drawing was malformed

  • #2
    Collapsing PCR dupes to consensus seq to find low frequency mutations

    I am interested in finding a solution to the exact same question. Does anyone have ideas?


    • #3
      4^6=4096 UIDs

      Either separate the FASTQ file into 4096 FASTQ files, by UID, before aligning, or align together, and then separate the BAM file into 4096 temporary BAM files.

      Run samtools mpileup separately on each individual BAM file.
      Determine the consensus sequence from the pileup results.

      Combine together the 4096 pileup results, being careful to preserve the information about the individual UIDs, while also computing the sums for all the UIDs.
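For illustration, the FASTQ-splitting step could be sketched in Python like this. It assumes the 6 bp UID is the first six bases of each read (if the UID was sequenced as a separate index read, the binning key would change accordingly), and the function name and record layout are just my own choices:

```python
from collections import defaultdict
from io import StringIO

def split_fastq_by_uid(handle, uid_len=6):
    """Bin FASTQ records by their leading UID sequence.

    Assumes the UID is the first `uid_len` bases of each read.
    Returns {uid: [(header, insert_seq, insert_qual), ...]},
    with the UID bases stripped from the sequence and qualities.
    """
    bins = defaultdict(list)
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        uid = seq[:uid_len]
        bins[uid].append((header, seq[uid_len:], qual[uid_len:]))
    return bins

# Tiny in-memory example: two reads share the UID 'AAAAAA'.
fastq = StringIO(
    "@read1\nAAAAAAACGT\n+\nIIIIIIIIII\n"
    "@read2\nAAAAAATCGT\n+\nIIIIIIIIII\n"
    "@read3\nCCCCCCACGT\n+\nIIIIIIIIII\n"
)
bins = split_fastq_by_uid(fastq)
print(sorted(bins))         # ['AAAAAA', 'CCCCCC']
print(len(bins["AAAAAA"]))  # 2
```

In practice each bin would be written out as its own FASTQ file (or the grouping done on the BAM side instead), but the binning logic is the same.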

      If your starting point is the BAM file, you could write a C++ program, using the OpenMP library for parallelism, and the samtools API for the pileup.

      It would then all run as a single command line. It's probably simpler to just use samtools mpileup as a binary tool, and write some scripts to parse the output, but it sounds more impressive to write a program.


      • #4
        Thank you blancha.

        As I see it, your method would collapse the reads in the last three UID3 reads, which is not the goal. I need to group together only the reads that would be defined as PCR duplicates. That would be all reads with the same 5' position (or pairs of reads for paired-end sequencing) within each UID group.


        • #5
          As I see it, your method would collapse the reads in the last three UID3 reads, which is not the goal. I need to group together only the reads that would be defined as PCR duplicates. That would be all reads with the same 5' position (or pairs of reads for paired-end sequencing) within each UID group.
          Yes, you are right.
          Well, let's just make a small adjustment to the algorithm, then.

          After having generated the 4096 BAM files for each UID, and sorted them by coordinates, if necessary, let's cycle through each BAM file, and then treat each group of reads with the same start coordinate individually. First, keep a record of the number of reads with the same start and end position, for your custom tag. Second, count the number of each nucleotide at each position, and pick the most common nucleotide at each position. Use this information to generate a new entry in a new BAM file, with the collapsed reads, with the custom tag for the number of reads. Let's call this optional tag XU.
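A minimal sketch of that collapsing step, with reads represented as plain (uid, start, seq) tuples rather than real BAM records (the actual BAM I/O, e.g. via pysam or the samtools API, and the writing of the XU tag are left out; reads in a duplicate group are assumed to have equal length):

```python
from collections import Counter, defaultdict

def collapse(reads):
    """Collapse PCR duplicates to consensus reads.

    `reads` is a list of (uid, start, seq) tuples; reads with the
    same UID and start coordinate are treated as duplicates.
    Returns (uid, start, consensus, xu) tuples, where xu is the
    duplicate count destined for the proposed custom XU tag.
    """
    groups = defaultdict(list)
    for uid, start, seq in reads:
        groups[(uid, start)].append(seq)
    out = []
    for (uid, start), seqs in sorted(groups.items()):
        # Most common nucleotide at each position (assumes equal lengths).
        consensus = "".join(
            Counter(col).most_common(1)[0][0] for col in zip(*seqs)
        )
        out.append((uid, start, consensus, len(seqs)))
    return out

reads = [
    ("UID2", 100, "ACGTA"),
    ("UID2", 100, "ACGTA"),
    ("UID2", 100, "ACTTA"),  # one sequencing error, outvoted
    ("UID2", 107, "GGGGG"),  # same UID, different start: kept separate
]
print(collapse(reads))
# [('UID2', 100, 'ACGTA', 3), ('UID2', 107, 'GGGGG', 1)]
```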

          You haven't mentioned the size of the library or the number of reads. If it's significant, I would write this program in C++, possibly using OpenMP for parallelism. Otherwise, I would write the program in Python.

          I'm assuming you are a programmer yourself, or at least have access to qualified programmers. Otherwise, this is the kind of program that I could write myself, if there really is a demand for it and I were provided a sample BAM file. It would also be important to know the number and length of the reads, and whether a Unix cluster is available to process them. It's a simple and amusing task for a programmer.

          There are a few additional details that are not clear, that a programmer would have to know, and a couple of options in the processing of the data. Is the UID in the sequence read, or was it sequenced separately? Instead of separating the BAM files, the BAM file could also be treated as a whole, and just be sorted first by coordinates, and then by UID, for processing. The UID with the same start and end coordinates could then be processed together. I'm not sure yet which approach is more efficient. One must also verify if any sequencing errors in the UID are possible, which again depends on where the UID was inserted.
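The sort-first alternative could look like this in Python, using itertools.groupby over reads sorted by coordinate and then UID. The tuple representation is just for illustration; in a real BAM the UID could be carried in a tag (the RX tag is one common convention, not something the original poster specified):

```python
from itertools import groupby
from operator import itemgetter

# Reads as (chrom, start, uid, seq) tuples standing in for BAM records.
reads = [
    ("chr1", 100, "AACGTT", "ACGT"),
    ("chr1", 100, "AACGTT", "ACGT"),
    ("chr1", 100, "GGCCAA", "ACGT"),
    ("chr1", 250, "AACGTT", "TTTT"),
]

# Sort by coordinate first, then UID, so each duplicate set is contiguous.
reads.sort(key=itemgetter(0, 1, 2))

# Each group is one (chrom, start, uid) duplicate set, ready to collapse.
for key, group in groupby(reads, key=itemgetter(0, 1, 2)):
    dupes = list(group)
    print(key, len(dupes))
```

This avoids materialising 4096 temporary files at the cost of one big sort, which is why it is not obvious which approach is more efficient.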
          Last edited by blancha; 10-31-2015, 03:58 AM. Reason: Added questions about details.


          • #6
            I'll make a small adjustment to my algorithm. I'll consider reads as duplicates if they have the same start position, without looking at the end position. I was thinking initially of taking into account different reads aligned at the same position that were hard-trimmed or soft-trimmed differently. Upon reflection, it isn't really useful to take into account this possibility. If reads have the same UID and start position, my algorithm would consider them to be duplicates.


            • #7
              I am sure that many would actually use such a program if you wrote and released it to the community.

              So far I have used a mixture of my own scripts with the script from this package

              If you wrote something fast and flexible, it would be valuable to many labs, since very deep sequencing is becoming more and more necessary in cancer genomics. It might be difficult to make a one-size-fits-all solution, since there are so many experimental procedures that would change the input FASTQ files.

              I would be happy to collaborate and provide examples and BAM files for you. The library size is typically up to 10^9 reads.


              • #8
                Just send me a private message through the forum, so we can exchange email addresses.
                I'll make the program, and the code publicly available on GitHub when finished, if there is a demand for it.

                For the sample files, there are a few options.
                If you can give me the location where the files are, I can download them myself.
                I also have my own FTP server at my research institute, which is very occasionally unreliable.
                There is also a service called Globus Connect, in Canada, which allows one to access and write files directly on one of Compute Canada's servers. You'd have to open an account with Globus Connect first, though, and use their software.
                I don't currently have a lot of capacity available on my Dropbox account.


                • #9
                  Sent you a private message, with my contact information.
                  Not sure if it worked though, since the message doesn't appear in the sent messages list.


                  • #10
                    I do not believe you can confidently identify 1/1000 mutations with Illumina reads, even with 10000x coverage. The error rate is too high and too nonrandom. Obviously it would be possible with sufficient coverage and a perfectly random error mode, but Illumina's errors are clearly nonrandom, as you can see from base frequency histograms. This is unfortunate as the most glaring biases would be trivial to correct using the realtime mapping data from spiked-in PhiX, but Illumina makes no effort to do that.
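A back-of-the-envelope calculation illustrates the problem, assuming a raw per-base substitution rate of 0.1% (the exact rate varies by instrument, cycle, and sequence context, and, as noted above, the errors are not random, which makes things worse than this naive picture):

```python
# At 10,000x coverage, how does a true 1/1000 variant compare with
# the background of raw sequencing errors at the same position?
coverage = 10_000
error_rate = 1e-3   # assumed raw substitution rate (illustrative only)
variant_af = 1e-3   # variant present in 1 of 1000 molecules

# Substitution errors are split across the three alternative bases
# (only under the idealised assumption that errors are uniform).
expected_errors_per_alt_base = coverage * error_rate / 3
expected_variant_reads = coverage * variant_af

print(expected_errors_per_alt_base, expected_variant_reads)
# roughly 3.3 erroneous reads per alternative base vs 10 variant reads
```

With the true signal only about three times the idealised background, and real errors clustering nonrandomly at particular positions, plain depth is not enough; this is exactly the gap the UID consensus approach is meant to close.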


                    • #11
                      Follow-up for PCR dupes without UIDs

                      Thanks vang and blancha for the insightful dialogue. I too would be interested in a streamlined program, if made available.

                      I also have the problem of working with aligned data that contains many PCR duplicates (as defined by position) but lacks UIDs. Do either of you have ideas on how to tackle that problem? I am a confident command-line user but still developing in terms of programming skills. I realize trying to collapse large library sizes will take some parallelization that I am less familiar with.


                      • #12
                        If you don't have UIDs, and just want to mark or remove the PCR duplicates, there are many tools available. No programming skills are required.
                        Picard tools MarkDuplicates is my favorite.

                        The case described by vang42 is a novel experimental approach to identify PCR artefacts in ultra-deep sequencing. @Brian Bushnell raises an interesting question about the validity of ultra-deep sequencing, but I won't get into that debate. Given the novelty of the approach, no off-the-shelf software exists.

                        Removing or marking duplicates without ultra-deep sequencing or UIDs is just a run-of-the-mill operation, that can be performed by multiple existing programs, notably Picard tools MarkDuplicates.


                        • #13
                          Thanks blancha. I am familiar with MarkDuplicates, but in my understanding that program assesses the duplicate reads and retains the one with the highest quality score. I am looking for a way of generating a consensus from PCR duplicate reads by collapsing the reads. Such that:




                          based on a majority vote that is set programmatically (i.e., the base that is present in >50% of reads becomes part of the consensus)
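A >50% majority vote like that could be sketched in Python as follows. Positions without a majority base become N; reads are assumed already grouped into one duplicate set and of equal length (trim or pad beforehand), and the function name and threshold parameter are my own:

```python
from collections import Counter

def majority_consensus(seqs, min_frac=0.5):
    """Consensus by majority vote over a set of duplicate reads.

    A base is emitted only if it appears in more than `min_frac`
    of the reads at that position; otherwise 'N' is emitted.
    All reads must start at the same coordinate and be equal length.
    """
    consensus = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count > min_frac * len(column) else "N")
    return "".join(consensus)

print(majority_consensus(["ACGT", "ACGT", "ACTT"]))  # ACGT (G wins 2/3)
print(majority_consensus(["ACGT", "ACTT"]))          # ACNT (1/2 is a tie)
```

The threshold is a parameter, so a stricter rule (e.g. >90% for high-confidence consensus bases) is a one-argument change.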

                          Anyone have any ideas??
                          Last edited by C9r1y; 11-06-2015, 12:44 PM.