We see a high rate (about 50%) of reads flagged as duplicates by Picard when using the Proton RNA-seq protocol, so I ran the in-silico experiment described below.
The data come from two almost-technical replicates of human breast cancer. They differ in starting RNA input: 50 ng vs 25 ng.
No optical duplicates (0%) were detected, as I had hoped, but the duplicate counts are high.
Read counts for the relevant flags:
        1024      1040
25ng    8615686   6934223
50ng    11481086  8776399
These reads represent roughly 50% of the total, and that has been consistent regardless of the starting input material or specimen.
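For readers less familiar with the two flag values: in the SAM format the FLAG field is a bit mask, so 1024 is the duplicate bit alone and 1040 is the duplicate bit plus the reverse-strand bit (1024 + 16). A minimal Python sketch that decodes them is below; the bit values come from the SAM specification, while the short labels and the `decode_flag` helper are my own naming for illustration.

```python
# Bit values of the SAM FLAG field; the label strings are illustrative names.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value (hypothetical helper)."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


for flag in (1024, 1040):
    print(flag, decode_flag(flag))
```

So the two columns are forward-strand and reverse-strand single-end reads that were marked as duplicates.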
Since the mean read length is short and these are single-end reads, I thought the duplication rate might be caused by short, highly expressed transcripts.
Therefore, I merged the BAM files from the two replicates, set the same read group (RG) on all reads with AddOrReplaceReadGroups, and then ran MarkDuplicates on the merged BAM.
Results:
                         1024      1040
50+25 observed           20508673  16097011
and expected results:
                         1024      1040
50+25 expected minimum   20096772  15710622
50+25 observed           20508673  16097011
where I expected the minimum number of duplicates to be the sum of the duplicates from the individual runs.
Since the observed number of duplicates in the merged sample is only slightly higher than that minimum, I conclude that the majority of reads originally marked as duplicates really are PCR duplicates, and that the 'false PCR duplicate' rate is only about 3% with this library preparation.
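The arithmetic behind this comparison can be reproduced with a short Python sketch. The counts are copied from the tables in this post; the `excess_duplicates` helper is my own name, and expressing the excess as a fraction of the expected minimum is an assumption about which denominator to use.

```python
# Duplicate-flagged read counts copied from the tables in this post.
individual = {
    "25ng": {1024: 8615686, 1040: 6934223},
    "50ng": {1024: 11481086, 1040: 8776399},
}
merged_observed = {1024: 20508673, 1040: 16097011}


def excess_duplicates(individual, merged):
    """Per flag: (expected minimum, excess over it, excess / expected minimum)."""
    result = {}
    for flag, observed in merged.items():
        # Expected minimum = sum of duplicates from the individual runs.
        expected_min = sum(run[flag] for run in individual.values())
        excess = observed - expected_min
        result[flag] = (expected_min, excess, excess / expected_min)
    return result


for flag, (expected_min, excess, frac) in excess_duplicates(
    individual, merged_observed
).items():
    print(f"flag {flag}: expected minimum {expected_min}, "
          f"excess {excess} ({frac:.2%})")
```

With this denominator the excess is a little over 2% for each flag; other choices of denominator (for example, total reads) would shift the percentage somewhat.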
Is this interpretation correct?