No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Samtools rmdup output / number of reads?

    Dear all,

    I am quite new to this community but have already found lots of helpful answers scanning through the threads here. But now I have come to a topic which I didn't found covered yet.

    I used samtools rmdup to remove duplicate reads from my bwa alignments. I also analysed the data including duplicates, i.e. with all reads, but want to see the difference upon using duplicate-free .bam files...

    Using rmdup, it gives me the following output:

    samtools rmdup -s S1_sorted.bam S1_unique.bam
    [bam_rmdupse_core] 6479447 / 18248499 = 0.3551 in library ' '

    Does this mean that there were initially 18,248,499 reads, and that 6,479,447 were suspected to be duplicate and thus removed?

    That would mean, the number of "uniquely" mapped reads should be 18,248,499 - (minus) 6,479,447 = 11,769,052 reads.

    Since I am not sure how to interpret the rmdupse shell comment, I thought of another possibility to get the number of uniquely mapped reads:

    1) samtools index S1_unique.bam
    2) samtools view S1_unique.bam | wc -l 11897943

    I.E. just counting the lines of the resulting "unique.bam". However, there is a slight difference between the two numbers (11,897,943 vs 11,769,05).

    Anyone able to explain it?

    Ideas are very appreciated!

    Thanks in advance.

  • #2
    Are there unmapped reads in your file? Are you counting the header lines?


    • #3
      Hi nilshomer,

      I guess there are unmapped reads in my file, since no alignment is perfect?! I checked for unmapped reads:

      samtools view -f4 -c S1_unique.bam

      But still, I do not understand why unmapped reads should explain the different in read numbers by a) wc -l and b) output from rmdupse?

      And another question: Is there any need to remove unmapped reads before further processing, e.g. mpileup?

      Concerning header lines in my .bam files, I don't think there are any:

      samtools view S1_unique.bam | head

      gives the following output:

      61WG6AAXX_1:63:18124:15753 16 chr1 10023 0 76M * 0 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAeedbaeeecedbcc`b[abaLdacdd
      eeeeeeeecefffffeeeeedeeeedddddeeeeefffffffffffffcf XT:A:R NM:i:0 X0:i:368 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:54:8505:15971 16 chr1 10024 0 76M * 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACdbdbddb`\^b_b^`\`bb\b^ccc^
      b_ceddddddddddcffdffeeeeedfefefefffefdffeffffeffff XT:A:R NM:i:0 X0:i:372 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:3:3180:2050 0 chr1 10034 0 76M * 0 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACffffffffffefffffeeffffcfff
      fffffedfefddddedd_ded`d^c`cccdddddddbdddddYdaaaXaB XT:A:U NM:i:0 X0:i:1 X1:i:373 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:85:5647:17815 16 chr1 10044 18 76M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCY\b`b`a`\bb\b`dde`dbeeeceb
      eee_dcdffcedeeeaeceeececeefcfddffdfdffffeeeeeeecee XT:A:U NM:i:1 X0:i:1 X1:i:3 XM:i:1 XO:i:0 XG:i:0 MD:Z:73C2 XA:Z:chr4,+191044156,76M,2;chr15,+102521285,76M
      61WG6AAXX_1:9:11361:10428 16 chr1 10164 0 76M * 0 0 AACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTa`aaaS`^```\b`dde_bded_abb
      dd\dbddd\ddddd^eceeeaeceeedeceffdfceeedeceeeeddddd XT:A:R NM:i:2 X0:i:3 X1:i:2 XM:i:2 XO:i:0 XG:i:0 MD:Z:16T53C5
      61WG6AAXX_1:102:19247:8028 0 chr1 10364 23 76M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACcffeffffeffffcffffceffffef
      cfffddfdfcfdafcddadddfddddddfddddddddddcbc`YY_b`b\ XT:A:U NM:i:1 X0:i:1 X1:i:1 XM:i:1 XO:i:0 XG:i:0 MD:Z:74A1 XA:Z:chr20,-62918671,76M,2;
      61WG6AAXX_1:87:9895:13370 0 chr1 14063 0 76M * 0 0 GATGGCACCTCCCTCCCTCTCAACCACTTGAGCAAACTCCAAGACATCTTCTAffdcdfffefffffffffffffffff
      ffffefffffefffedfffffeffffefdffdfefffdfffddffdeddf XT:A:R NM:i:2 X0:i:6 X1:i:1 XM:i:2 XO:i:0 XG:i:0 MD:Z:74C0A0
      61WG6AAXX_1:94:13388:6107 0 chr1 15960 0 76M * 0 0 GGTGGGGAACAGGGCAAGGAGGAAAGGCTGCTCAGGCAGGGCTGGGGAAGCTTeededffffffffffffffedcffdd
      affffffadc`edeffddd`edTeddfccffdfdd`YdbdbY\bedadd^ XT:A:R NM:i:0 X0:i:7 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:16:13382:17255 16 chr1 19197 15 76M * 0 0 TAGCTCCCCTACCTCCAAGAGCCCAGCCCTTGCCCACAGGGCCACACTCCACGY`aaTaaaaY\^b`cbcedYdcff^f
      dafacefffdfcffafdefffefffffdfceffffffffffefffffdff XT:A:U NM:i:0 X0:i:1 X1:i:6 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:104:17203:5703 0 chr1 19556 0 76M * 0 0 CCCCAGAGAGCCAATTTCACAATCCAGAAGTCCCCGTGCCCTAAAGGGTCTGCffffffefffffffffffffefffff
      fffffffffffffffffffdeddffefffffdedffddfdddddeddddd XT:A:R NM:i:0 X0:i:3 X1:i:4 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      Last edited by Marie_Noir; 07-22-2011, 12:36 AM.


      • #4
        Hi Marie,
        Things like that can be quite confusing at times, just check this, it might solve it.

        Read info on my original dataset:
        16562593 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        14493755 + 0 mapped (87.51%:nan%)

        After removing unmapped reads (samtools view -bq 1 file.bam)
        11965422 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        11965401 + 0 mapped (100.00%:nan%)

        Using rmdup on this unique data: (3457 / 11965401 = 0.0003)
        11961965 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        11961944 + 0 mapped (100.00%:nan%)

        So, 11965401-3457=11961944, which the original number of completely aligned reads (100%) from the previous file, but when you do 'wc -l', it tells you the uniquely+non-uniquely aligned reads, which equals to 11961965. There is a difference of 21 reads in there. My data set is small and little non-redundant, when the depth is increased, this difference becomes big, thats why you might have this diff. or for a dataset which is multiplexed and the starting material is small, there are more number of duplicates introduced during the amplification step.

        I will suggest to use 'samtools flagstat file.bam' to get info about the file.

        I hope that helps a bit.



        • #5
          Originally posted by Marie_Noir View Post
          Dear all,

          samtools rmdup -s S1_sorted.bam S1_unique.bam
          [bam_rmdupse_core] 6479447 / 18248499 = 0.3551 in library ' '
          6479447 represents the number of filtered reads.
          18248499 stands for the mapped reads.


          Latest Articles


          • seqadmin
            Advanced Methods for the Detection of Infectious Disease
            by seqadmin

            The recent pandemic caused worldwide health, economic, and social disruptions with its reverberations still felt today. A key takeaway from this event is the need for accurate and accessible tools for detecting and tracking infectious diseases. Timely identification is essential for early intervention, managing outbreaks, and preventing their spread. This article reviews several valuable tools employed in the detection and surveillance of infectious diseases.
            11-27-2023, 01:15 PM
          • seqadmin
            Strategies for Investigating the Microbiome
            by seqadmin

            Microbiome research has led to the discovery of important connections to human and environmental health. Sequencing has become a core investigational tool in microbiome research, a subject that we covered during a recent webinar. Our expert speakers shared a number of advancements including improved experimental workflows, research involving transmission dynamics, and invaluable analysis resources. This article recaps their informative presentations, offering insights...
            11-09-2023, 07:02 AM





          Topics Statistics Last Post
          Started by seqadmin, 12-01-2023, 09:55 AM
          0 responses
          Last Post seqadmin  
          Started by seqadmin, 11-30-2023, 10:48 AM
          0 responses
          Last Post seqadmin  
          Started by seqadmin, 11-29-2023, 08:26 AM
          0 responses
          Last Post seqadmin  
          Started by seqadmin, 11-29-2023, 08:12 AM
          0 responses
          Last Post seqadmin