Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Samtools rmdup output / number of reads?

    Dear all,

    I am quite new to this community but have already found lots of helpful answers scanning through the threads here. But now I have come to a topic which I didn't found covered yet.

    I used samtools rmdup to remove duplicate reads from my bwa alignments. I also analysed the data including duplicates, i.e. with all reads, but want to see the difference upon using duplicate-free .bam files...

    Using rmdup, it gives me the following output:

    samtools rmdup -s S1_sorted.bam S1_unique.bam
    [bam_rmdupse_core] 6479447 / 18248499 = 0.3551 in library ' '

    Does this mean that there were initially 18,248,499 reads, and that 6,479,447 were suspected to be duplicate and thus removed?

    That would mean, the number of "uniquely" mapped reads should be 18,248,499 - (minus) 6,479,447 = 11,769,052 reads.

    Since I am not sure how to interpret the rmdupse shell comment, I thought of another possibility to get the number of uniquely mapped reads:

    1) samtools index S1_unique.bam
    2) samtools view S1_unique.bam | wc -l 11897943

    I.E. just counting the lines of the resulting "unique.bam". However, there is a slight difference between the two numbers (11,897,943 vs 11,769,05).

    Anyone able to explain it?

    Ideas are very appreciated!

    Thanks in advance.

  • #2
    Are there unmapped reads in your file? Are you counting the header lines?

    Comment


    • #3
      Hi nilshomer,

      I guess there are unmapped reads in my file, since no alignment is perfect?! I checked for unmapped reads:

      samtools view -f4 -c S1_unique.bam
      128891

      But still, I do not understand why unmapped reads should explain the different in read numbers by a) wc -l and b) output from rmdupse?

      And another question: Is there any need to remove unmapped reads before further processing, e.g. mpileup?

      Concerning header lines in my .bam files, I don't think there are any:

      samtools view S1_unique.bam | head

      gives the following output:


      61WG6AAXX_1:63:18124:15753 16 chr1 10023 0 76M * 0 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAeedbaeeecedbcc`b[abaLdacdd
      eeeeeeeecefffffeeeeedeeeedddddeeeeefffffffffffffcf XT:A:R NM:i:0 X0:i:368 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:54:8505:15971 16 chr1 10024 0 76M * 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACdbdbddb`\^b_b^`\`bb\b^ccc^
      b_ceddddddddddcffdffeeeeedfefefefffefdffeffffeffff XT:A:R NM:i:0 X0:i:372 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:3:3180:2050 0 chr1 10034 0 76M * 0 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACffffffffffefffffeeffffcfff
      fffffedfefddddedd_ded`d^c`cccdddddddbdddddYdaaaXaB XT:A:U NM:i:0 X0:i:1 X1:i:373 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:85:5647:17815 16 chr1 10044 18 76M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCY\b`b`a`\bb\b`dde`dbeeeceb
      eee_dcdffcedeeeaeceeececeefcfddffdfdffffeeeeeeecee XT:A:U NM:i:1 X0:i:1 X1:i:3 XM:i:1 XO:i:0 XG:i:0 MD:Z:73C2 XA:Z:chr4,+191044156,76M,2;chr15,+102521285,76M
      ,2;chr4,+191043973,76M,2;
      61WG6AAXX_1:9:11361:10428 16 chr1 10164 0 76M * 0 0 AACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTa`aaaS`^```\b`dde_bded_abb
      dd\dbddd\ddddd^eceeeaeceeedeceffdfceeedeceeeeddddd XT:A:R NM:i:2 X0:i:3 X1:i:2 XM:i:2 XO:i:0 XG:i:0 MD:Z:16T53C5
      61WG6AAXX_1:102:19247:8028 0 chr1 10364 23 76M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACcffeffffeffffcffffceffffef
      cfffddfdfcfdafcddadddfddddddfddddddddddcbc`YY_b`b\ XT:A:U NM:i:1 X0:i:1 X1:i:1 XM:i:1 XO:i:0 XG:i:0 MD:Z:74A1 XA:Z:chr20,-62918671,76M,2;
      61WG6AAXX_1:87:9895:13370 0 chr1 14063 0 76M * 0 0 GATGGCACCTCCCTCCCTCTCAACCACTTGAGCAAACTCCAAGACATCTTCTAffdcdfffefffffffffffffffff
      ffffefffffefffedfffffeffffefdffdfefffdfffddffdeddf XT:A:R NM:i:2 X0:i:6 X1:i:1 XM:i:2 XO:i:0 XG:i:0 MD:Z:74C0A0
      61WG6AAXX_1:94:13388:6107 0 chr1 15960 0 76M * 0 0 GGTGGGGAACAGGGCAAGGAGGAAAGGCTGCTCAGGCAGGGCTGGGGAAGCTTeededffffffffffffffedcffdd
      affffffadc`edeffddd`edTeddfccffdfdd`YdbdbY\bedadd^ XT:A:R NM:i:0 X0:i:7 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:16:13382:17255 16 chr1 19197 15 76M * 0 0 TAGCTCCCCTACCTCCAAGAGCCCAGCCCTTGCCCACAGGGCCACACTCCACGY`aaTaaaaY\^b`cbcedYdcff^f
      dafacefffdfcffafdefffefffffdfceffffffffffefffffdff XT:A:U NM:i:0 X0:i:1 X1:i:6 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      61WG6AAXX_1:104:17203:5703 0 chr1 19556 0 76M * 0 0 CCCCAGAGAGCCAATTTCACAATCCAGAAGTCCCCGTGCCCTAAAGGGTCTGCffffffefffffffffffffefffff
      fffffffffffffffffffdeddffefffffdedffddfdddddeddddd XT:A:R NM:i:0 X0:i:3 X1:i:4 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
      Last edited by Marie_Noir; 07-22-2011, 12:36 AM.

      Comment


      • #4
        Hi Marie,
        Things like that can be quite confusing at times, just check this, it might solve it.

        Read info on my original dataset:
        16562593 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        14493755 + 0 mapped (87.51%:nan%)

        After removing unmapped reads (samtools view -bq 1 file.bam)
        11965422 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        11965401 + 0 mapped (100.00%:nan%)

        Using rmdup on this unique data: (3457 / 11965401 = 0.0003)
        11961965 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        11961944 + 0 mapped (100.00%:nan%)


        So, 11965401-3457=11961944, which the original number of completely aligned reads (100%) from the previous file, but when you do 'wc -l', it tells you the uniquely+non-uniquely aligned reads, which equals to 11961965. There is a difference of 21 reads in there. My data set is small and little non-redundant, when the depth is increased, this difference becomes big, thats why you might have this diff. or for a dataset which is multiplexed and the starting material is small, there are more number of duplicates introduced during the amplification step.

        I will suggest to use 'samtools flagstat file.bam' to get info about the file.

        I hope that helps a bit.

        Sukhi

        Comment


        • #5
          Originally posted by Marie_Noir View Post
          Dear all,

          samtools rmdup -s S1_sorted.bam S1_unique.bam
          [bam_rmdupse_core] 6479447 / 18248499 = 0.3551 in library ' '
          6479447 represents the number of filtered reads.
          18248499 stands for the mapped reads.

          Comment

          Latest Articles

          Collapse

          • seqadmin
            Non-Coding RNA Research and Technologies
            by seqadmin


            Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

            [Article Coming Soon!]...
            Yesterday, 08:07 AM
          • seqadmin
            Recent Developments in Metagenomics
            by seqadmin





            Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
            09-23-2024, 06:35 AM
          • seqadmin
            Understanding Genetic Influence on Infectious Disease
            by seqadmin




            During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

            Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
            09-09-2024, 10:59 AM

          ad_right_rmr

          Collapse

          News

          Collapse

          Topics Statistics Last Post
          Started by seqadmin, 10-02-2024, 04:51 AM
          0 responses
          14 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 10-01-2024, 07:10 AM
          0 responses
          24 views
          0 likes
          Last Post seqadmin  
          Started by seqadmin, 09-30-2024, 08:33 AM
          1 response
          31 views
          0 likes
          Last Post EmiTom
          by EmiTom
           
          Started by seqadmin, 09-26-2024, 12:57 PM
          0 responses
          20 views
          0 likes
          Last Post seqadmin  
          Working...
          X