Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GATK: read counts varying between programs

    Hi all,

    I am doing an exome analysis with BWA 0.6.1-r104 and GATK v2.2-8-gec077cd.
    I have paired end reads, my protocol until now is (in brief, omitting options etc.)

    bwa aln R1.fastq
    bwa aln R2.fastq
    bwa sampe R1.sai R2.sai
    picard/CleanSam.jar
    picard/SortSam.jar
    picard/MarkDuplicates.jar
    picard/AddOrReplaceReadGroups.jar
    picard/BuildBamIndex.jar
    GATK -T RealignerTargetCreator -known dbsnp.vcf
    GATK -T IndelRealigner -known dbsnp.vcf
    GATK -T BaseRecalibrator -knownSites dbsnp.vcf
    GATK -T PrintReads


    A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.

    I have 85767226 paired end = 171534452 sequences in fastQ file

    BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.



    MarkDuplicates reports:

    Read 165619516 records. 2 pairs never matched.
    Marking 20272927 records as duplicates.
    Found 2919670 optical duplicate clusters.

    so nearly 6 million reads seem to miss.



    CreateTargets MicroScheduler reports

    35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
    -> 428072 reads (0.26% of total) failing BadMateFilter
    -> 16077607 reads (9.65% of total) failing DuplicateReadFilter
    -> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

    so nearly 5 million reads seem to miss



    The Realigner MicroScheduler reports

    0 reads were filtered out during traversal out of 171551640 total (0.00%)

    which appears a miracle to me since
    1) there are even more reads now than input sequences,
    2) all those crappy reads reported by CreateTargets do not appear.



    From Base recalibration MicroScheduler, I get

    41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
    -> 16010068 reads (9.32% of total) failing DuplicateReadFilter
    -> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

    ..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.


    I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?


    Thanks for any comments!

  • #2
    Originally posted by stephlo View Post
    Hi all,

    I am doing an exome analysis with BWA 0.6.1-r104 and GATK v2.2-8-gec077cd.
    I have paired end reads, my protocol until now is (in brief, omitting options etc.)

    bwa aln R1.fastq
    bwa aln R2.fastq
    bwa sampe R1.sai R2.sai
    picard/CleanSam.jar
    picard/SortSam.jar
    picard/MarkDuplicates.jar
    picard/AddOrReplaceReadGroups.jar
    picard/BuildBamIndex.jar
    GATK -T RealignerTargetCreator -known dbsnp.vcf
    GATK -T IndelRealigner -known dbsnp.vcf
    GATK -T BaseRecalibrator -knownSites dbsnp.vcf
    GATK -T PrintReads


    A closer look on the output of the above toolchain revealed changes in read counts I did not quite understand.

    I have 85767226 paired end = 171534452 sequences in fastQ file

    BWA reports this number, the cleaned SAM file has 171534452 alignments as expected.



    MarkDuplicates reports:

    Read 165619516 records. 2 pairs never matched.
    Marking 20272927 records as duplicates.
    Found 2919670 optical duplicate clusters.

    so nearly 6 million reads seem to miss.



    CreateTargets MicroScheduler reports

    35915555 reads were filtered out during traversal out of 166579875 total (21.56%)
    -> 428072 reads (0.26% of total) failing BadMateFilter
    -> 16077607 reads (9.65% of total) failing DuplicateReadFilter
    -> 19409876 reads (11.65% of total) failing MappingQualityZeroFilter

    so nearly 5 million reads seem to miss



    The Realigner MicroScheduler reports

    0 reads were filtered out during traversal out of 171551640 total (0.00%)

    which appears a miracle to me since
    1) there are even more reads now than input sequences,
    2) all those crappy reads reported by CreateTargets do not appear.



    From Base recalibration MicroScheduler, I get

    41397379 reads were filtered out during traversal out of 171703265 total (24.11%)
    -> 16010068 reads (9.32% of total) failing DuplicateReadFilter
    -> 25387311 reads (14.79% of total) failing MappingQualityZeroFilter

    ..... so my reads got even more offspring, but, e.g., the duplicate reads reappear with "roughly" the same number.


    I found these varying counts a little irritating -- can someone please give me a hint on the logics of these numbers? And, does the protocol look meaningful?


    Thanks for any comments!
    Hi Stephio,

    Just a comment. I use a similar approach - no CleanSam step used - for exome capture analysis and got comparable results. IMHO, the protocol We are currently following is standard and sound. At the same time, I found no logical explanation to the variation in the no. of read counts. I am also concerned about the high no. of failing MappingQualityZeroFilter reads. Any logical/clear explanation for that?


    Cheers,

    Fernando

    Comment


    • #3
      Hi Fernando,

      I posted the same question on the GATK forum and was told not to worry as long as those counts are "roughly silimar" :




      Best,

      Stephan

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Essential Discoveries and Tools in Epitranscriptomics
        by seqadmin




        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
        04-22-2024, 07:01 AM
      • seqadmin
        Current Approaches to Protein Sequencing
        by seqadmin


        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
        04-04-2024, 04:25 PM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 04-25-2024, 11:49 AM
      0 responses
      19 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-24-2024, 08:47 AM
      0 responses
      20 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-11-2024, 12:08 PM
      0 responses
      62 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 04-10-2024, 10:19 PM
      0 responses
      61 views
      0 likes
      Last Post seqadmin  
      Working...
      X