Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • [BWA] Calculate % of reads not aligned to reference

    Hi,

    I am using BWA to align single reads to reference genome using BWA. Because the sequenced genome is quite diverged from the reference (genetic distance is about 40 to 50%, but still belong to the same species), I am curious how many reads are not aligned by BWA, how to find out this information?

    Thanks

  • #2
    samtools idxstats?

    Comment


    • #3
      Hi,

      When I use this command, I get this:

      ref 12398 1478064 0
      * 0 0 8139515

      with the reference sequence name, sequence length, # mapped reads and # unmapped reads. What is the meaning of the "0" and the "*"? (if they have any).

      best

      Comment


      • #4
        idxstats samtools idxstats <aln.bam>
        Retrieve and print stats in the index file. The output is TAB delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.
        The final line is the unmapped reads (* instead of a reference sequence name, zero length).

        Comment


        • #5
          Thanks, now I get it. But one more question; what if I want to know the total number of reads mapped in a multi-sequence (several contigs) reference?
          i.e.:

          samtools idxstats <aln.bam>

          contig00001 22002 147 0
          contig00002 19783 23 0
          contig00003 19528 25 0
          contig00004 17742 192 0
          contig00005 16684 35 0
          .
          .
          .
          contig61681 100 0 0
          contig61682 100 0 0
          contig61684 100 0 0
          contig61685 100 0 0
          contig61686 100 0 0
          * 0 0 2005333

          Is there a way to see the sum of all the mapped reads.

          thanks in advance.

          Comment


          • #6
            Add up all the values in column 3 (it doesn't hurt to include the final row as it has zero there) for the number of mapped reads.

            That's trivial in Perl / Python / etc. You can do it with a Unix one liner too, this is one way using the cut command to select just column 3, and awk for counting:

            Code:
            samtools idxstats example.bam | cut -f3 | awk 'BEGIN {total=0} {total += $1} END {print total}'
            There may well be a neater Unix solution...

            Comment


            • #7
              Also you can use 'samtools flagstat' as pointed out by av_d on this thread:
              Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

              Comment


              • #8
                With the human genome as reference, when I run samtools idxstat I get a count of unmapped reads for each chromosome. But if the read was unmapped, how was it assigned to a particular chromosome?

                Comment


                • #9
                  Originally posted by ruchira View Post
                  With the human genome as reference, when I run samtools idxstat I get a count of unmapped reads for each chromosome. But if the read was unmapped, how was it assigned to a particular chromosome?
                  Most likely, its mate mapped to that chromosome. Sam specs call for unmapped reads to be given the mapping coordaintes of their mates, when their mates mapped. So the read has both the unampped flag set, and a mapping position.

                  Comment


                  • #10
                    Thanks very much for your answer, swbarnes2. I had used bwa align and then bwa sampe to generate the sam files. So I guess perhaps bwa sampe included the chromosome information but still listed the reads as unmapped.

                    I expected some reads to be completely unmapped (because they were not human DNA). Where would these appear in the idxstats output?

                    Comment


                    • #11
                      I wanted the same information recently and thought I'd post my steps. My output from BWA is in .sam format, so first it needs to be converted to bam, then sorted, then indexed. Not everyone will need to sort.

                      Code:
                      samtools view -Sb filename.sam > filename.bam
                      samtools sort filename.bam filename_sorted
                      samtools index filename_sorted.bam
                      Then I run this awk command. The "cut" of the previous is unnecessary since you can just read the appropriate field via awk.

                      Code:
                      samtools idxstats filename_sorted.bam | awk 'BEGIN {a=0;b=0} {a += $3; b+=$4 } END{print a " mapped " b " unmapped "}'

                      Comment


                      • #12
                        Is there a way to utilize samtools idxstats so i can read the amount of mapped/unmapped reads to particular genes rather than the entire chromosome itself? I get an output like this:
                        chr1 197195432 2022423 0
                        chr2 181748087 1915486 0
                        chr3 159599783 1418344 0
                        chr4 155630120 1352178 0
                        chr5 152537259 1526530 0
                        .....
                        thank you in advance

                        Comment


                        • #13
                          Originally posted by wmyashar View Post
                          Is there a way to utilize samtools idxstats so i can read the amount of mapped/unmapped reads to particular genes rather than the entire chromosome itself? I get an output like this:
                          chr1 197195432 2022423 0
                          chr2 181748087 1915486 0
                          chr3 159599783 1418344 0
                          chr4 155630120 1352178 0
                          chr5 152537259 1526530 0
                          .....
                          thank you in advance
                          You want to filter your .bam based on those coordiantes, then count. I think samtools view can take a .bed file and do that, so can BEDTools.

                          Comment


                          • #14
                            &quot;Mapping&quot; unmapped reads

                            Originally posted by swbarnes2 View Post
                            Most likely, its mate mapped to that chromosome. Sam specs call for unmapped reads to be given the mapping coordaintes of their mates, when their mates mapped. So the read has both the unampped flag set, and a mapping position.
                            By "its mate" are you referring to paired-end protocols?

                            Comment


                            • #15
                              Hi all,

                              after sorting, indexing and counting reads that align to chromosomes in a BAM file i get this (example for one library only):

                              1 | 30,427,671 | 5,913,901 | 0
                              2 | 19,698,289 | 3,386,635 | 0
                              3 | 23,459,830 | 4,784,837 | 0
                              4 | 18,585,056 | 3,292,873 | 0
                              5 | 26,975,502 | 5,032,188 | 0
                              Mt | 366,924 | 37,747 | 0
                              Pt | 154,478 | 9,107 | 0
                              * 0 0 0


                              First column Arabidopsis chromosomes, 2nd chromosomes length (in bp), 3rd mapped reads and 4th unmapped. Correct?

                              Now, is it possible, for example, that in chr 1 i only mapped 5 million reads and I have 0 unmpped for all? Looks odd to me.
                              Any comments.
                              Thanks.
                              Last edited by Gonza; 10-29-2014, 05:39 AM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Best Practices for Single-Cell Sequencing Analysis
                                by seqadmin



                                While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                                06-06-2024, 07:15 AM
                              • seqadmin
                                Latest Developments in Precision Medicine
                                by seqadmin



                                Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                                Somatic Genomics
                                “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                                05-24-2024, 01:16 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 07:49 AM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 07:23 AM
                              0 responses
                              14 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 06-17-2024, 06:54 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 06-14-2024, 07:24 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X