Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Ohad
    Member
    • Jul 2013
    • 28

    Cleaning MNase BAM file

    Hi,
    I have this dataset of MNase.
    the raw reads were aligned with bwa, the BAM file is about 60GB, and I wonder if I should filter it.

    The header of the file looks like this:

    @RG ID:T0-001 DT:2009-12-18 PG:BWA PI:150 PL:ILLUMINA
    @RG ID:T0-002 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
    @RG ID:T0-003 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
    @RG ID:T0-004 DT:2010-09-10 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
    @RG ID:T0-005 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
    @RG ID:T0-006 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
    @RG ID:T0-007 DT:2010-09-17 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
    @RG ID:T0-008 DT:2011-03-18 PG:BWA PI:150 PL:ILLUMINA PU:LANE3
    @RG ID:T0-009 DT:2011-03-18 PG:BWA PI:150 PL:ILLUMINA PU:LANE4
    @RG ID:T0-010 DT:2011-11-22 PG:BWA PI:150 PL:ILLUMINA PU:LANE1
    @RG ID:T0-011 DT:2011-11-22 PG:BWA PI:150 PL:ILLUMINA PU:LANE2
    @SQ SN:chr1 LN:249250621
    @SQ SN:chr2 LN:243199373
    @SQ SN:chr3 LN:198022430
    @SQ SN:chr4 LN:191154276
    @SQ SN:chr5 LN:180915260
    @SQ SN:chr6 LN:171115067
    @SQ SN:chr7 LN:159138663
    @SQ SN:chr8 LN:146364022
    @SQ SN:chr9 LN:141213431
    @SQ SN:chr10 LN:135534747
    @SQ SN:chr11 LN:135006516
    @SQ SN:chr12 LN:133851895
    @SQ SN:chr13 LN:115169878
    @SQ SN:chr14 LN:107349540
    @SQ SN:chr15 LN:102531392
    @SQ SN:chr16 LN:90354753
    @SQ SN:chr17 LN:81195210
    @SQ SN:chr18 LN:78077248
    @SQ SN:chr19 LN:59128983
    @SQ SN:chr20 LN:63025520
    @SQ SN:chr21 LN:48129895
    @SQ SN:chr22 LN:51304566
    @SQ SN:chrX LN:155270560
    @SQ SN:chrM LN:16571
    @SQ SN:MMTV LN:10937

    Indicating that it is sorted, and it is a pileup(?) of many sequencing done over the years, (done on the same cell line)
    Last two groups are paired-end, titled @RG ID:T0-010 and @RG ID:T0-011, the remaining are single-end.

    Running samtools flagstat gives this output:

    974253427 + 425863565 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    974252673 + 233075580 mapped (100.00%:54.73%)
    603612708 + 143779894 paired in sequencing
    301806108 + 71890193 read1
    301806600 + 71889701 read2
    603406978 + 82441932 properly paired (99.97%:57.34%)
    603612645 + 91537096 with itself and mate mapped
    0 + 7424943 singletons (0.00%:5.16%)
    61464 + 8185912 with mate mapped to a different chr
    61464 + 2965666 with mate mapped to a different chr (mapQ>=5)


    First thing pops clearly is that there are quite a lot of reads who did not passed QC. However, 54.73% of them are aligned.

    I saw that duplicates were not flagged but instead are tagged.

    Considering all that, I would like to share some questions:
    1) Should I filter out QC failed reads ?
    I looked inside the file, how can a 51bp read with a running Phred score of 2 (ASCII chars consist only of "#" so I caculated 35 -33 = 2 , right?) get a CIGAR 51M ? and can I assume that it is a good alignment after all ?

    2) How should I relate to duplicates ?
    I run a "samtools view ...etc | grep -c ...etc " command to count the reads holding tags which indicate that they are duplicates. The output was 50169529 reads that are tagged as "duplicate", which id 3.6% of all the reads. I tend to relate to these reads as valid reads, considering that the cells were treated with MNase.

    3) If I decide to filter out some reads. What are the consequences of removing a read that is a part of a pair ? I suspect that read2 will be left with a wrong flag, indicating it is paired while read1 is no longer present. Can this flag thing create some problems downstream along the pipeline ?

    I would appreciate some thoughts of this. Thanks
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    Originally posted by Ohad View Post
    Considering all that, I would like to share some questions:
    1) Should I filter out QC failed reads ?
    Yes, unless that somehow biases the data or drops your coverage too low. Otherwise you'll be adding a lot of noise, and the noise may be biased in some way. Do you know why the %failed is so high?

    I looked inside the file, how can a 51bp read with a running Phred score of 2 (ASCII chars consist only of "#" so I caculated 35 -33 = 2 , right?) get a CIGAR 51M ? and can I assume that it is a good alignment after all ?
    Look at the mapping score, not the cigar string. Old-style cigar strings are worthless without context: M stands for Match or Mismatch. New-style cigar strings use '=' for match and 'X' for mismatch, and therefore actually give you useful information. Also, 2 is a "special" quality score for Illumina. 5 to 41 are normal Phred values, but 2 indicates something else (like, that two clusters are too close together to distinguish) and is usually much more accurate than a true Q2 base, up to a measured Q11~Q15 in my tests on HighSeq data. Bases with Q2 should not be used for error-sensitive things like assembly and variation calling, but they can increase alignment accuracy if the aligner is error-tolerant, because longer reads have less ambiguity. BWA is not very error-tolerant, though.

    3) If I decide to filter out some reads. What are the consequences of removing a read that is a part of a pair ? I suspect that read2 will be left with a wrong flag, indicating it is paired while read1 is no longer present. Can this flag thing create some problems downstream along the pipeline ?
    I do not recommend removing singletons from a paired dataset. Use a tool that can process paired reads in their original fastq, and if either fails your filtering criteria are, then remove both. I like to create 2 outputs, one for good pairs and one for orphaned singletons, so as to conserve all of the good data (though in practice the orphaned singletons file usually gets ignored unless there is a severe lack of coverage otherwise). Then map the clean files. This way saves a lot of processing and results in smaller intermediate files. And it's much easier, since pairs in the original fastq pairs will be together, and thus can be processed as a stream, whereas in a BAM they could be anywhere.

    Comment

    • Ohad
      Member
      • Jul 2013
      • 28

      #3
      Originally posted by Brian Bushnell View Post
      Yes, unless that somehow biases the data or drops your coverage too low. Otherwise you'll be adding a lot of noise, and the noise may be biased in some way. Do you know why the %failed is so high?

      Look at the mapping score, not the cigar string. Old-style cigar strings are worthless without context: M stands for Match or Mismatch. New-style cigar strings use '=' for match and 'X' for mismatch, and therefore actually give you useful information. Also, 2 is a "special" quality score for Illumina. 5 to 41 are normal Phred values, but 2 indicates something else (like, that two clusters are too close together to distinguish) and is usually much more accurate than a true Q2 base, up to a measured Q11~Q15 in my tests on HighSeq data. Bases with Q2 should not be used for error-sensitive things like assembly and variation calling, but they can increase alignment accuracy if the aligner is error-tolerant, because longer reads have less ambiguity. BWA is not very error-tolerant, though.
      I just finished doing fastqc.
      I see most reads are 50bp, but there are 40 and 36 as well.
      Quality control is GREEN and the box-plot way of distribution shows a large amount of reads with the "special value" of 2.
      I counted 58,180,456 of them using grep, not close enough to explain ~425M bad reads.

      Looking for some cases I saw these three:


      HWI-ST227_83:3:23:15749:55015 528 chr1 9995 25 50M * 0 165
      TTACGATATCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
      ##################################################
      RG:Z:T0-008

      HWI-ST227:145:C06RAACXX:1:1104:11867:197280 629 chr1 9996 0 * = 9996 129
      TCTGATCTTAGGGTTTGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGT
      ###########################################?2+A1:1
      RG:Z:T0-010

      HWI-ST227_83:4:65:8785:4808 512 chr1 10000 25 50M * 0 165
      ATTACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTATCCGT
      ##################################################
      RG:Z:T0-009

      Rolling down some few more lines on my screen I see flags 512 / 528 are quite common.

      Perhaps those "2" quality lines are fine or explainable.

      But another thought in mind is that it seems, by the read lengths and sequencing date, that this is an old data and perhaps the technology was still new and quality calling was less advanced.

      I also see RED cross on 73.3% Duplication Level which is to be bothered by it, and some Kmer popping on some unique positions in the read. Though for Kmer I got the ORANGE escalation mark.

      I do not recommend removing singletons from a paired dataset. Use a tool that can process paired reads in their original fastq, and if either fails your filtering criteria are, then remove both. I like to create 2 outputs, one for good pairs and one for orphaned singletons, so as to conserve all of the good data (though in practice the orphaned singletons file usually gets ignored unless there is a severe lack of coverage otherwise). Then map the clean files. This way saves a lot of processing and results in smaller intermediate files. And it's much easier, since pairs in the original fastq pairs will be together, and thus can be processed as a stream, whereas in a BAM they could be anywhere.
      I don't have the original raw files, and it would be a problem for my small server to hold those files if I extract them back from BAM.

      My current direction of thought is to break this BAM into Read Groups and start evaluating each group's sequence quality singly.
      Last edited by Ohad; 03-02-2014, 05:13 PM.

      Comment

      Latest Articles

      Collapse

      • SEQadmin2
        Nine Things a Sample Prep Scientist Thinks About Before Sequencing
        by SEQadmin2


        I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.


        Here are nine questions we think about, in roughly the order they matter, before...
        06-18-2026, 07:11 AM
      • SEQadmin2
        From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
        by SEQadmin2


        Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


        The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
        ...
        06-02-2026, 10:05 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by SEQadmin2, 06-17-2026, 06:09 AM
      0 responses
      25 views
      0 reactions
      Last Post SEQadmin2  
      Started by SEQadmin2, 06-09-2026, 11:58 AM
      0 responses
      42 views
      0 reactions
      Last Post SEQadmin2  
      Started by SEQadmin2, 06-05-2026, 10:09 AM
      0 responses
      48 views
      0 reactions
      Last Post SEQadmin2  
      Started by SEQadmin2, 06-04-2026, 08:59 AM
      0 responses
      49 views
      0 reactions
      Last Post SEQadmin2  
      Working...