We are working on methods to detect mutations present at very low frequencies in cancer patients. By drawing a blood sample and isolating the plasma, we can detect very small amounts of circulating tumor DNA (ctDNA). We want to detect mutations present at fractions of 1/100 or perhaps 1/1000 relative to the germline DNA, so we are using a targeted approach and sequencing to a depth of more than 10,000. To tell true mutations from sequencing errors, we add a 6 bp unique identifier (UID) in the PCR reaction during library preparation. We want to find the PCR duplicates for each UID and collapse them into a consensus sequence.
As seen in the first code block below, we have a REF->G mutation in UID2, but we want to weed out the other mutations found in the same UID. Taking the consensus sequence of all UIDs would give the picture shown in the second code block.
We would also like the number of reads from which each consensus sequence is built to be stored, either as a tag in the BAM file or reflected in the quality score of the consensus sequence.
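To make the read-count idea concrete, here is a rough Python sketch of the two options. The tag name `XF`, the function names and the quality cap of 40 are my own placeholders, not an existing convention. The only things I am relying on are that the SAM format leaves tags starting with `X` free for local use and that base qualities are normally written as Phred+33 characters. Putting a read count into the quality string changes what the qualities mean, so downstream tools would have to know about it.

```python
def family_size_tag(n_reads, tag="XF"):
    """Format the number of reads behind a consensus as a SAM optional field.

    "XF" is a hypothetical, locally defined tag name.
    """
    return f"{tag}:i:{n_reads}"


def family_size_quality(n_reads, length, cap=40):
    """Encode the read count as a Phred+33 quality string of the given length.

    The count is capped (assumed cap: 40) so the value stays in a range
    that typical tools accept.
    """
    q = min(n_reads, cap)
    return chr(q + 33) * length


print(family_size_tag(5))         # XF:i:5
print(family_size_quality(5, 4))  # &&&&
```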
Could you guide me on how this could be accomplished, or do you have any suggestions for the best tools to program this?
Code:
UID1 ------G--------------------
UID1 ---------------------------
UID1 ---------------------------
UID2 -------G----------C---------
UID2 -------G--------------------
UID2 -------G------A--------T----
UID2 ---------------------------
UID2 ---------------------------
UID3 ---------------------------
UID3 ---------------------T-----
UID3 ---------------------------
Code:
UID1 ---------------------------
UID2 -------G--------------------
UID2 ---------------------------
UID3 ---------------------------
UID3 ---------------------N-----
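To make the collapsing step concrete, here is a minimal Python sketch of what I have in mind, not a finished tool. It assumes the reads are already grouped by UID, aligned to the same coordinates and of equal length, and it uses `-` for a reference match as in the blocks above. The function names and the strict-majority threshold of 0.5 are my own assumptions; positions without a majority are written as `N`. The toy reads are made up for illustration and are shorter than the ones shown above.

```python
from collections import Counter, defaultdict


def consensus(reads, min_fraction=0.5):
    """Collapse equal-length reads column by column.

    Returns (consensus sequence, number of reads used).
    A column becomes "N" unless one symbol exceeds min_fraction of the reads.
    """
    if len({len(r) for r in reads}) != 1:
        raise ValueError("reads must all have the same length")
    out = []
    for column in zip(*reads):
        symbol, count = Counter(column).most_common(1)[0]
        out.append(symbol if count > min_fraction * len(reads) else "N")
    return "".join(out), len(reads)


def collapse_by_uid(tagged_reads, min_fraction=0.5):
    """Group (uid, sequence) pairs by UID and build one consensus per UID."""
    groups = defaultdict(list)
    for uid, seq in tagged_reads:
        groups[uid].append(seq)
    return {uid: consensus(seqs, min_fraction) for uid, seqs in groups.items()}


# Made-up toy data: "-" means the read matches the reference at that position.
toy = [
    ("UID1", "---G--------"),
    ("UID1", "------------"),
    ("UID1", "------------"),
    ("UID2", "----G---C---"),
    ("UID2", "----G-------"),
    ("UID2", "----G--A--T-"),
    ("UID2", "------------"),
    ("UID2", "------------"),
]

for uid, (seq, n_reads) in collapse_by_uid(toy).items():
    print(uid, seq, n_reads)
# UID1 ------------ 3
# UID2 ----G------- 5
```

With a strict majority, the G seen in 3 of 5 UID2 reads survives, while the changes seen in a single read are dropped. A stricter threshold would turn borderline positions into `N` instead.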