  • sorting a BAM produces a smaller file than the original

    I recently sorted a bunch of BAMs that are about 400 MB in size each. After sorting with samtools the output sorted BAMs are about 3-4 MB smaller than the unsorted originals. Does anyone know why this is? My understanding of a sort is that it simply rearranges the available reads inside a BAM, and I don't see why this would create a smaller file.

  • #2
    Is there any chance that the unsorted BAM included unmapped reads and the sorted one doesn't? What did you use for alignment?



    • #3
      ignore this
      Last edited by gprakhar; 08-24-2011, 05:19 AM. Reason: Incorrect information.



      • #4
        Sorting does not remove duplicate reads.

        I seem to recall this same question being asked previously on Seqanswers, but my search skills are lacking this morning so I can't find the thread. If you were to compare an unsorted vs. a sorted SAM file, the sizes should be identical because, as oiiio said, the information content is the same; it has simply been rearranged. SAM files are plain text, so provided the content is the same, the file size will be the same regardless of order. BAM files are compressed, and the order of the information in the file can affect how well that data compresses. This was the explanation I recall from that earlier thread. (Once the caffeine fully kicks in I'll see if I can find that thread.)



        • #5
          BAM is compressed. Sorting helps to give a better compression ratio because similar sequences are grouped together.
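
The effect described above can be reproduced without any BAM file. The following is a toy sketch, not a measurement on real data: it assumes only `awk`, `sort` and `gzip` are available, and the helper names `make_reads` and `gz_size` are made up for this illustration. BAM's BGZF blocks use the same deflate algorithm as gzip, which only finds repeats inside a 32 KiB window, so identical records compress well only when they sit close together.

```shell
#!/bin/sh
# Toy demonstration (plain text, NOT a real BAM): the same lines are
# compressed twice, once with duplicates far apart and once sorted.

# Hypothetical helper: print 2000 random 100-base "reads", repeating the
# whole list 5 times so copies of a read end up roughly 200 KB apart.
make_reads() {
    awk 'BEGIN {
        srand(1)
        for (i = 0; i < 2000; i++) {
            s = ""
            for (j = 0; j < 100; j++)
                s = s substr("ACGT", int(rand() * 4) + 1, 1)
            seqs[i] = s
        }
        for (pass = 0; pass < 5; pass++)
            for (i = 0; i < 2000; i++)
                print seqs[i]
    }'
}

# Hypothetical helper: gzip-compressed size, in bytes, of stdin.
gz_size() {
    gzip -c | wc -c | tr -d ' '
}

unsorted_size=$(make_reads | gz_size)
sorted_size=$(make_reads | LC_ALL=C sort | gz_size)

echo "unsorted: $unsorted_size bytes"
echo "sorted:   $sorted_size bytes"
```

Both inputs contain exactly the same lines, so the uncompressed sizes match; only the compressed sizes differ. Real BAM records carry more than sequence (names, qualities, tags), so the gain on real data is typically much smaller than in this toy case.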



          • #6
            Thanks everyone, and I also now believe that the order is improving compression.



            • #7
              We found that sorting with samtools removes unmapped reads.
              Marco



              • #8
                Originally posted by marcowanger:
                We found that sorting with samtools removes unmapped reads.
                No, it does not. It puts reads without coordinates at the end of the file.



                • #9
                  Originally posted by lh3:
                  No, it does not. It puts reads without coordinates at the end of the file.
                  Oh? My colleague told me that the unmapped reads had been removed from my samtools-sorted BAM. OK, then I got it wrong.

                  Thanks for clarifying how it actually works.

                  Marco



                  • #10
                    Yes, the unmapped reads are placed at the end of the file. You can check this by looking at the last records of the sorted BAM:

                    Code:
                     samtools view input.bam | tail
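
Instead of eyeballing the tail of the file, the unmapped reads can also be counted directly. A minimal sketch, assuming samtools is on the PATH; the function name `count_unmapped` and the file names in the usage comment are placeholders, not anything from this thread:

```shell
#!/bin/sh
# Hypothetical helper: count records with the "read unmapped" flag (0x4).
#   -c    print only the number of matching records
#   -f 4  keep only records that have flag bit 0x4 set
count_unmapped() {
    samtools view -c -f 4 "$1"
}

# Usage (placeholder file names):
#   count_unmapped unsorted.bam
#   count_unmapped sorted.bam
```

If sorting kept everything, the two numbers should be equal.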



                    • #11
                      Hello guys,
                      This is true: sorting does not remove any data from the BAM file. It puts the unmapped reads at the end of the file. I checked it.



                      • #12
                        I know this thread has been inactive for a while, but my question is basically the same.

                        I have a 120 GB BAM file from RNA-Seq data and ran samtools sort with compression level 1 (i.e. the lightest compression).

                        I ended up with a file of only ~30 GB.

                        This is the first time I have worked with these files, so I have no idea whether this is a normal reduction in size, but my feeling is that something went wrong.

                        Could anyone with experience tell me if the result is expected or not?


                        thanks,
                        Florian



                        • #13
                          RNA-Seq contains a greater fraction of duplicate reads due to highly expressed genes, so sorting reduces the size more than DNA-Seq. If you want to check that nothing was lost, use 'samtools view' to compare the number of reads before/after sorting.
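
The before/after comparison suggested above could look like the sketch below. It assumes samtools is on the PATH; `same_read_count` and the file names are made up for illustration. Note that `samtools view -c` counts alignment records, so secondary and supplementary alignments are included in both totals.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only if both BAM files contain
# the same number of alignment records.
same_read_count() {
    before=$(samtools view -c "$1") || return 2
    after=$(samtools view -c "$2") || return 2
    echo "before: $before  after: $after"
    [ "$before" -eq "$after" ]
}

# Usage (placeholder file names):
#   same_read_count unsorted.bam sorted.bam && echo "nothing lost"
```

Equal counts do not prove the records are identical, only that none were dropped or added.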



                          • #14
                            What HESmith said, but I'll add that if you really want to be 110% sure that nothing was lost/changed, you can use bamHash on both files. If the checksums are the same then they contain the same reads (just in a different order).



                            • #15
                              Originally posted by HESmith:
                              If you want to check that nothing was lost, use 'samtools view' to compare the number of reads before/after sorting.
                              Using 'samtools index' and then 'samtools idxstats' on a sorted file will give you total counts for mapped and unmapped reads, which is slightly easier to check compared to looking at a few million lines to find what's missing.
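
A rough sketch of turning the idxstats output into two totals. `samtools idxstats` prints one tab-separated line per reference (name, length, mapped count, unmapped count) plus a final `*` line for unmapped reads without coordinates; the helper name `sum_idxstats` and the file name in the usage comment are assumptions for this example.

```shell
#!/bin/sh
# Hypothetical helper: read `samtools idxstats` output on stdin and sum
# column 3 (mapped) and column 4 (unmapped) over all lines.
sum_idxstats() {
    awk -F '\t' '{ mapped += $3; unmapped += $4 }
        END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }'
}

# Usage (placeholder file name; the BAM must be coordinate-sorted):
#   samtools index sorted.bam
#   samtools idxstats sorted.bam | sum_idxstats
```

The sum of the two totals can then be compared with the record count of the unsorted file.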
