Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Samtools rmdup output / number of reads?

    Dear all,

    I am quite new to this community but have already found lots of helpful answers scanning through the threads here. But now I have come to a topic which I didn't found covered yet.

    I used samtools rmdup to remove duplicate reads from my bwa alignments. I also analysed the data including duplicates, i.e. with all reads, but want to see the difference upon using duplicate-free .bam files...

    Using rmdup, it gives me the following output:

    samtools rmdup -s S1_sorted.bam S1_unique.bam
    [bam_rmdupse_core] 6479447 / 18248499 = 0.3551 in library ' '

    Does this mean that there were initially 18,248,499 reads, and that 6,479,447 were suspected to be duplicate and thus removed?

    That would mean, the number of "uniquely" mapped reads should be 18,248,499 - (minus) 6,479,447 = 11,769,052 reads.

    Since I am not sure how to interpret the rmdupse shell comment, I thought of another possibility to get the number of uniquely mapped reads:

    1) samtools index S1_unique.bam
    2) samtools view S1_unique.bam | wc -l 11897943

    I.E. just counting the lines of the resulting "unique.bam". However, there is a slight difference between the two numbers (11,897,943 vs 11,769,05).

    Anyone able to explain it?

    Ideas are very appreciated!

    Thanks in advance.

  • #2
    Are there unmapped reads in your file? Are you counting the header lines?

    Comment


    • #3
      Hi nilshomer,

      I guess there are unmapped reads in my file, since no alignment is perfect?! I checked for unmapped reads:

      samtools view -f4 -c S1_unique.bam
      128891

      But still, I do not understand why unmapped reads should explain the different in read numbers by a) wc -l and b) output from rmdupse?

      And another question: Is there any need to remove unmapped reads before further processing, e.g. mpileup?

      Concerning header lines in my .bam files, I don't think there are any:

      samtools view S1_unique.bam | head

      gives the following output:


      61WG6AAXX_1:63:18124:15753 16 chr1 10023 0 76M * 0 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAeedbaeeecedbcc`b[abaLdacdd
      eeeeeeeecefffffeeeeedeeeedddddeeeeefffffffffffffcf XT:A:R NM:i:0 X0:i:368 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:54:8505:15971 16 chr1 10024 0 76M * 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACdbdbddb`\^b_b^`\`bb\b^ccc^
      b_ceddddddddddcffdffeeeeedfefefefffefdffeffffeffff XT:A:R NM:i:0 X0:i:372 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:3:3180:2050 0 chr1 10034 0 76M * 0 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACffffffffffefffffeeffffcfff
      fffffedfefddddedd_ded`d^c`cccdddddddbdddddYdaaaXaB XT:A:U NM:i:0 X0:i:1 X1:i:373 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:85:5647:17815 16 chr1 10044 18 76M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCY\b`b`a`\bb\b`dde`dbeeeceb
      eee_dcdffcedeeeaeceeececeefcfddffdfdffffeeeeeeecee XT:A:U NM:i:1 X0:i:1 X1:i:3 XM:i:1 XO:i:0 XG:i:0 MD:Z:73C2 XA:Z:chr4,+191044156,76M,2;chr15,+102521285,76M
      ,2;chr4,+191043973,76M,2;
      61WG6AAXX_1:9:11361:10428 16 chr1 10164 0 76M * 0 0 AACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTa`aaaS`^```\b`dde_bded_abb
      dd\dbddd\ddddd^eceeeaeceeedeceffdfceeedeceeeeddddd XT:A:R NM:i:2 X0:i:3 X1:i:2 XM:i:2 XO:i:0 XG:i:0 MD:Z:16T53C5
      61WG6AAXX_1:102:19247:8028 0 chr1 10364 23 76M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACcffeffffeffffcffffceffffef
      cfffddfdfcfdafcddadddfddddddfddddddddddcbc`YY_b`b\ XT:A:U NM:i:1 X0:i:1 X1:i:1 XM:i:1 XO:i:0 XG:i:0 MD:Z:74A1 XA:Z:chr20,-62918671,76M,2;
      61WG6AAXX_1:87:9895:13370 0 chr1 14063 0 76M * 0 0 GATGGCACCTCCCTCCCTCTCAACCACTTGAGCAAACTCCAAGACATCTTCTAffdcdfffefffffffffffffffff
      ffffefffffefffedfffffeffffefdffdfefffdfffddffdeddf XT:A:R NM:i:2 X0:i:6 X1:i:1 XM:i:2 XO:i:0 XG:i:0 MD:Z:74C0A0
      61WG6AAXX_1:94:13388:6107 0 chr1 15960 0 76M * 0 0 GGTGGGGAACAGGGCAAGGAGGAAAGGCTGCTCAGGCAGGGCTGGGGAAGCTTeededffffffffffffffedcffdd
      affffffadc`edeffddd`edTeddfccffdfdd`YdbdbY\bedadd^ XT:A:R NM:i:0 X0:i:7 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:16:13382:17255 16 chr1 19197 15 76M * 0 0 TAGCTCCCCTACCTCCAAGAGCCCAGCCCTTGCCCACAGGGCCACACTCCACGY`aaTaaaaY\^b`cbcedYdcff^f
      dafacefffdfcffafdefffefffffdfceffffffffffefffffdff XT:A:U NM:i:0 X0:i:1 X1:i:6 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:104:17203:5703 0 chr1 19556 0 76M * 0 0 CCCCAGAGAGCCAATTTCACAATCCAGAAGTCCCCGTGCCCTAAAGGGTCTGCffffffefffffffffffffefffff
      fffffffffffffffffffdeddffefffffdedffddfdddddeddddd XT:A:R NM:i:0 X0:i:3 X1:i:4 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      Last edited by Marie_Noir; 07-22-2011, 12:36 AM.

      Comment


      • #4
        Hi Marie,
        Things like that can be quite confusing at times, just check this, it might solve it.

        Read info on my original dataset:
        16562593 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        14493755 + 0 mapped (87.51%:nan%)

        After removing unmapped reads (samtools view -bq 1 file.bam)
        11965422 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        11965401 + 0 mapped (100.00%:nan%)

        Using rmdup on this unique data: (3457 / 11965401 = 0.0003)
        11961965 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        11961944 + 0 mapped (100.00%:nan%)


        So, 11965401-3457=11961944, which the original number of completely aligned reads (100%) from the previous file, but when you do 'wc -l', it tells you the uniquely+non-uniquely aligned reads, which equals to 11961965. There is a difference of 21 reads in there. My data set is small and little non-redundant, when the depth is increased, this difference becomes big, thats why you might have this diff. or for a dataset which is multiplexed and the starting material is small, there are more number of duplicates introduced during the amplification step.

        I will suggest to use 'samtools flagstat file.bam' to get info about the file.

        I hope that helps a bit.

        Sukhi

        Comment


        • #5
          Originally posted by Marie_Noir View Post
          Dear all,

          samtools rmdup -s S1_sorted.bam S1_unique.bam
          [bam_rmdupse_core] 6479447 / 18248499 = 0.3551 in library ' '
          6479447 represents the number of filtered reads.
          18248499 stands for the mapped reads.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Genetic Variation in Immunogenetics and Antibody Diversity
            by seqadmin



            The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
            11-06-2024, 07:24 PM
          • seqadmin
            Choosing Between NGS and qPCR
            by seqadmin



            Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
            10-18-2024, 07:11 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, Today, 11:09 AM
          0 responses
          22 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, Today, 06:13 AM
          0 responses
          20 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 11-01-2024, 06:09 AM
          0 responses
          30 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 10-30-2024, 05:31 AM
          0 responses
          21 views
          0 likes
          Last Post seqadmin  
          Working...
          X