Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • ct586
    Originally posted by Marie_Noir View Post
    Dear all,

    samtools rmdup -s S1_sorted.bam S1_unique.bam
    [bam_rmdupse_core] 6479447 / 18248499 = 0.3551 in library ' '
    6479447 represents the number of filtered reads.
    18248499 stands for the mapped reads.

    Leave a comment:

  • vanbug
    Hi Marie,
    Things like that can be quite confusing at times, just check this, it might solve it.

    Read info on my original dataset:
    16562593 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    14493755 + 0 mapped (87.51%:nan%)

    After removing unmapped reads (samtools view -bq 1 file.bam)
    11965422 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    11965401 + 0 mapped (100.00%:nan%)

    Using rmdup on this unique data: (3457 / 11965401 = 0.0003)
    11961965 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    11961944 + 0 mapped (100.00%:nan%)

    So, 11965401-3457=11961944, which the original number of completely aligned reads (100%) from the previous file, but when you do 'wc -l', it tells you the uniquely+non-uniquely aligned reads, which equals to 11961965. There is a difference of 21 reads in there. My data set is small and little non-redundant, when the depth is increased, this difference becomes big, thats why you might have this diff. or for a dataset which is multiplexed and the starting material is small, there are more number of duplicates introduced during the amplification step.

    I will suggest to use 'samtools flagstat file.bam' to get info about the file.

    I hope that helps a bit.


    Leave a comment:

  • Marie_Noir
    Hi nilshomer,

    I guess there are unmapped reads in my file, since no alignment is perfect?! I checked for unmapped reads:

    samtools view -f4 -c S1_unique.bam

    But still, I do not understand why unmapped reads should explain the different in read numbers by a) wc -l and b) output from rmdupse?

    And another question: Is there any need to remove unmapped reads before further processing, e.g. mpileup?

    Concerning header lines in my .bam files, I don't think there are any:

    samtools view S1_unique.bam | head

    gives the following output:

    61WG6AAXX_1:63:18124:15753 16 chr1 10023 0 76M * 0 0 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAeedbaeeecedbcc`b[abaLdacdd
    eeeeeeeecefffffeeeeedeeeedddddeeeeefffffffffffffcf XT:A:R NM:i:0 X0:i:368 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
    61WG6AAXX_1:54:8505:15971 16 chr1 10024 0 76M * 0 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACdbdbddb`\^b_b^`\`bb\b^ccc^
    b_ceddddddddddcffdffeeeeedfefefefffefdffeffffeffff XT:A:R NM:i:0 X0:i:372 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
    61WG6AAXX_1:3:3180:2050 0 chr1 10034 0 76M * 0 0 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACffffffffffefffffeeffffcfff
    fffffedfefddddedd_ded`d^c`cccdddddddbdddddYdaaaXaB XT:A:U NM:i:0 X0:i:1 X1:i:373 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
    61WG6AAXX_1:85:5647:17815 16 chr1 10044 18 76M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCY\b`b`a`\bb\b`dde`dbeeeceb
    eee_dcdffcedeeeaeceeececeefcfddffdfdffffeeeeeeecee XT:A:U NM:i:1 X0:i:1 X1:i:3 XM:i:1 XO:i:0 XG:i:0 MD:Z:73C2 XA:Z:chr4,+191044156,76M,2;chr15,+102521285,76M
    61WG6AAXX_1:9:11361:10428 16 chr1 10164 0 76M * 0 0 AACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTa`aaaS`^```\b`dde_bded_abb
    dd\dbddd\ddddd^eceeeaeceeedeceffdfceeedeceeeeddddd XT:A:R NM:i:2 X0:i:3 X1:i:2 XM:i:2 XO:i:0 XG:i:0 MD:Z:16T53C5
    61WG6AAXX_1:102:19247:8028 0 chr1 10364 23 76M * 0 0 AACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACcffeffffeffffcffffceffffef
    cfffddfdfcfdafcddadddfddddddfddddddddddcbc`YY_b`b\ XT:A:U NM:i:1 X0:i:1 X1:i:1 XM:i:1 XO:i:0 XG:i:0 MD:Z:74A1 XA:Z:chr20,-62918671,76M,2;
    61WG6AAXX_1:87:9895:13370 0 chr1 14063 0 76M * 0 0 GATGGCACCTCCCTCCCTCTCAACCACTTGAGCAAACTCCAAGACATCTTCTAffdcdfffefffffffffffffffff
    ffffefffffefffedfffffeffffefdffdfefffdfffddffdeddf XT:A:R NM:i:2 X0:i:6 X1:i:1 XM:i:2 XO:i:0 XG:i:0 MD:Z:74C0A0
    61WG6AAXX_1:94:13388:6107 0 chr1 15960 0 76M * 0 0 GGTGGGGAACAGGGCAAGGAGGAAAGGCTGCTCAGGCAGGGCTGGGGAAGCTTeededffffffffffffffedcffdd
    affffffadc`edeffddd`edTeddfccffdfdd`YdbdbY\bedadd^ XT:A:R NM:i:0 X0:i:7 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
    61WG6AAXX_1:16:13382:17255 16 chr1 19197 15 76M * 0 0 TAGCTCCCCTACCTCCAAGAGCCCAGCCCTTGCCCACAGGGCCACACTCCACGY`aaTaaaaY\^b`cbcedYdcff^f
    dafacefffdfcffafdefffefffffdfceffffffffffefffffdff XT:A:U NM:i:0 X0:i:1 X1:i:6 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
    61WG6AAXX_1:104:17203:5703 0 chr1 19556 0 76M * 0 0 CCCCAGAGAGCCAATTTCACAATCCAGAAGTCCCCGTGCCCTAAAGGGTCTGCffffffefffffffffffffefffff
    fffffffffffffffffffdeddffefffffdedffddfdddddeddddd XT:A:R NM:i:0 X0:i:3 X1:i:4 XM:i:0 XO:i:0 XG:i:0 MD:Z:76
    Last edited by Marie_Noir; 07-22-2011, 12:36 AM.

    Leave a comment:

  • nilshomer
    Are there unmapped reads in your file? Are you counting the header lines?

    Leave a comment:

  • Marie_Noir
    started a topic Samtools rmdup output / number of reads?

    Samtools rmdup output / number of reads?

    Dear all,

    I am quite new to this community but have already found lots of helpful answers scanning through the threads here. But now I have come to a topic which I didn't found covered yet.

    I used samtools rmdup to remove duplicate reads from my bwa alignments. I also analysed the data including duplicates, i.e. with all reads, but want to see the difference upon using duplicate-free .bam files...

    Using rmdup, it gives me the following output:

    samtools rmdup -s S1_sorted.bam S1_unique.bam
    [bam_rmdupse_core] 6479447 / 18248499 = 0.3551 in library ' '

    Does this mean that there were initially 18,248,499 reads, and that 6,479,447 were suspected to be duplicate and thus removed?

    That would mean, the number of "uniquely" mapped reads should be 18,248,499 - (minus) 6,479,447 = 11,769,052 reads.

    Since I am not sure how to interpret the rmdupse shell comment, I thought of another possibility to get the number of uniquely mapped reads:

    1) samtools index S1_unique.bam
    2) samtools view S1_unique.bam | wc -l 11897943

    I.E. just counting the lines of the resulting "unique.bam". However, there is a slight difference between the two numbers (11,897,943 vs 11,769,05).

    Anyone able to explain it?

    Ideas are very appreciated!

    Thanks in advance.

Latest Articles


  • seqadmin
    Multiomics Techniques Advancing Disease Research
    by seqadmin

    New and advanced multiomics tools and technologies have opened new avenues of research and markedly enhanced various disciplines such as disease research and precision medicine1. The practice of merging diverse data from various ‘omes increasingly provides a more holistic understanding of biological systems. As Maddison Masaeli, Co-Founder and CEO at Deepcell, aptly noted, “You can't explain biology in its complex form with one modality.”

    A major leap in the field has
    02-08-2024, 06:33 AM





Topics Statistics Last Post
Started by seqadmin, 02-23-2024, 04:11 PM
1 response
Last Post kim897
by kim897
Started by seqadmin, 02-21-2024, 08:52 AM
0 responses
Last Post seqadmin  
Started by seqadmin, 02-20-2024, 08:57 AM
0 responses
Last Post seqadmin  
Started by seqadmin, 02-14-2024, 09:19 AM
0 responses
Last Post seqadmin