Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • chrishah
    Member
    • Jul 2011
    • 18

    stats from sam/bam

    Hi there,

    I am looking for a tool to extract some statistics from sam/bam files. What I want is something like a tab delimited output for every sequence telling me:
    -# of reads that mapped
    -min, max, avg, median coverage
    -how many percent of the original reads have been covered i.e. are there regions with no reads mapping

    Any suggestions?? I would really appreciate your input!!!

    The idxstats option from samtools is partly doing what I want. It gives me the number of reads mapping per sequence. I am not sure what # of unmapped reads means, though:

    idxstats samtools idxstats <aln.bam>

    Retrieve and print stats in the index file. The output is TAB delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.
    much obliged,
    Chris
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    With paired end data, one read may map to a reference, and the partner can be unmapped. In this case the partner is assigned to that reference (with the same POS), but the FLAG says it is unmapped. This ensures when sorting the unmapped partner will be next to the mapped read, which is useful.

    What this means is that for each reference you can count the mapped reads, and also the placed but not mapped reads.

    The final line for 'samtools idxstats' is the unmapped reads which didn't get placed to a reference in this way.

    Comment

    • chrishah
      Member
      • Jul 2011
      • 18

      #3
      Hi maubp,

      Thanks for your explanation!

      what about cases where # unmapped reads is higher than #mapped reads? I am having those in my samtools idxstats output..

      Comment

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        Could you post some example output?

        Comment

        • chrishah
          Member
          • Jul 2011
          • 18

          #5
          Sure!

          6584_length_242_COV_34.193_gc_40.49_norest 242 0 3
          6657_length_206_COV_19.714_gc_33.81_norest 206 34 11
          6758_length_218_COV_30.721_gc_39.64_norest 218 0 0
          7140_length_205_COV_22.099_gc_38.28_norest 205 2 0
          7148_length_573_COV_32.388_gc_42.02_norest 573 10 2
          7239_length_281_COV_22.531_gc_31.82_norest 281 1 1
          10317_length_223_COV_25.493_gc_43.17_norest 223 15 17
          10334_length_247_COV_20.730_gc_51.59_norest 247 3 130

          Comment

          • maubp
            Peter (Biopython etc)
            • Jul 2009
            • 1544

            #6
            That is curious. Which mapper did you use to make the SAM/BAM file? Have you examined the reads placed against those particular references to see what might be going on?

            Comment

            • swbarnes2
              Senior Member
              • May 2008
              • 910

              #7
              The other time a read can have a 4 flag set is if it hangs off the edge of a reference. It will have the mapping coordinate of the reference it starts on, but bwa will also give it the 4 flag. That might be where those extra unmapped but mapped reads are coming from.

              Comment

              • chrishah
                Member
                • Jul 2011
                • 18

                #8
                Hi,
                Thanks for your answers!
                I used BWA for mapping.
                I extracted all lines that refer to the contig 6584_length_242_COV_34.193_gc_40.49_norest from the sam file. Any idea how these hits can produce the idxstats output #mapped reads=0 and #unmapped reads=3 (6584_length_242_COV_34.193_gc_40.49_norest 242 0 3)? Where would the 4 flag be?

                @SQ SN:6584_length_242_COV_34.193_gc_40.49_norest LN:242
                HWI-ST558:73:B01BVACXX:6:1202:3864:159130 103 6584_length_242_COV_34.193_gc_40.49_norest 241 60 101M 6657_length_206_COV_19.714_gc_33.81_norest 98 0 TTCAAATCTTCTCCACTCCTGCAGGAAGAGTAGTATTTTCTTACATGTTTTCTCCAATTAACAACATTTCTATCTGATATTTCATCTTTGGACAGCACTGC CCCFFFFFHHHHHJJJJJJJJJHHJHIJJJCGHFHIIIJGIIJJJEIHHIIIIJJJIJJJJJJIJJJJJJJJJEHHIJJJJJHGHHHHHFEEFFDDEEEDD XT:A:U NM:i:1 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:1G99
                HWI-ST558:73:B01BVACXX:6:1202:3864:159130 147 6657_length_206_COV_19.714_gc_33.81_norest 98 60 101M 6584_length_242_COV_34.193_gc_40.49_norest 241 0 GCAAATTCTTCAAGTTTCAACATTCTCATCTCATGTACAGATAAACCGGTTTGGAAATTGTAGTTAAGGATCGCTGTCATATTTAGTTAGATTCTGGTTTT CDDDDDCDDDDDDDDDDEDEEEEEFFFFFDEHHHGGHIJIIHIHFIHHFB?ICGIJIIGIGJJIJJIGIGJJJHEJIJJIJJJJJJJJHHHHHFFFFFCCC XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:101
                HWI-ST558:73:B01BVACXX:6:1303:2033:109353 117 6584_length_242_COV_34.193_gc_40.49_norest 241 0 * = 241 0 TGTATACGTATAACAACCATTATGAATTTGAAAAGTCATAGTTTGAAATAGAATATGAAGGTCATTCTCATTTTAAAATCTGAATAATTTCGAAATTGTGT CDCCCDEFFECDFHHEHDD@CGGFACGIIIHEGGFEGHGBBBBHHGGGBGGIBGEJJJIGJIJIGJJJJJHIGJHGHGHIJJIJJIIIHFFFFFDFDD@C@
                HWI-ST558:73:B01BVACXX:6:1303:2033:109353 157 6584_length_242_COV_34.193_gc_40.49_norest 241 37 101M = 241 0 TTCAAATCTTCTCCACTCCTGCAGGAAGAGTAGTATTTTCTTACATGTTTTCTCCAATTAACAACATTTCTATCTGATATTTCATCTTTGGACAGCACTGC AECCDCC?>>A>A?7EEHAEHJHIJJIGGEIJIIJIHGGJIGFEHGHEGHCIGIIIIHGIGHG@HGGHEIHGJIIGGEGIIIIHGGDHFA+HBDDDDD@@@ XT:A:U NM:i:1 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:1 XO:i:0 XG:i:0 MD:Z:1G99

                Comment

                Latest Articles

                Collapse

                • SEQadmin2
                  Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                  by SEQadmin2


                  I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                  Here are nine questions we think about, in roughly the order they matter, before...
                  06-18-2026, 07:11 AM
                • SEQadmin2
                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                  by SEQadmin2


                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                  ...
                  06-02-2026, 10:05 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by SEQadmin2, Yesterday, 11:10 AM
                0 responses
                8 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-17-2026, 06:09 AM
                0 responses
                44 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-09-2026, 11:58 AM
                0 responses
                104 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-05-2026, 10:09 AM
                0 responses
                125 views
                0 reactions
                Last Post SEQadmin2  
                Working...