Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • wdemos
    Member
    • Jun 2012
    • 31

    Threaded Sam to Bam conversion with Samtools View

    I am attempting to speed up the sam to bam conversion of a whole genome (paired end) alignment with Samtools0-1.19. Alignment was done with bwa-0.7.7 mem algorithm
    Commands:
    samtools view -bS -o .bam pe.sam (default)
    samtools view -bS -@ 10 -m 2G -o .bam pe.sam (threaded)

    Comparing the output .bam files there is a 0.4G difference in file size.
    I ran samtools flagstat on both bam files.
    Differences:
    6,026,490 QC passed reads
    6,026,490 paired in sequencing
    779,134 read 1
    5,247,356 read 2
    all other metrics are identical

    Can anyone explain why threading would give less reads on the same input .sam file? I assume it may have to do with the merging of thread data?

    Is there a way to correct this issue without losing the speed increase provided by threading? (Currently without threading conversion takes 6 hours. With threading Conversion takes 1 hour 15 minutes)
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    That really shouldn't happen. It makes sense that you could get different sized files, but they should contain identical number of reads. Does the same thing happen if you use the new version?

    Comment

    • Richard Finney
      Senior Member
      • Feb 2009
      • 701

      #3
      MAYBE ...

      The gzip compression used by samtools binary BAM version will be using different input streams depending on threading or non-threading.

      The non-threaded version will discover a different optimization that the threaded version.

      Can both BAMS versions be re-transformed back to identical SAM files?

      dpryan is right about you should have the same read counts.
      Last edited by Richard Finney; 08-21-2014, 02:06 PM.

      Comment

      • wdemos
        Member
        • Jun 2012
        • 31

        #4
        Thanks. I am currently testing a comparison run with the latest version of samtools. Hope to have an update soon.

        Comment

        • wdemos
          Member
          • Jun 2012
          • 31

          #5
          I completed the same test with the latest version of samtools. Again there is a 0.4G difference in file size:
          10 Threads 2G:
          1367501828 + 0 in total (QC-passed reads + QC-failed reads)
          0 + 0 duplicates
          1367501828 + 0 mapped (100.00%:nan%)
          1367501828 + 0 paired in sequencing
          685896751 + 0 read1
          681605077 + 0 read2
          1343716094 + 0 properly paired (98.26%:nan%)
          1362706162 + 0 with itself and mate mapped
          4795666 + 0 singletons (0.35%:nan%)
          11194510 + 0 with mate mapped to a different chr
          5019248 + 0 with mate mapped to a different chr (mapQ>=5)

          Single Thread, default settings:
          1373528318 + 0 in total (QC-passed reads + QC-failed reads)
          0 + 0 duplicates
          1367501828 + 0 mapped (99.56%:nan%)
          1373528318 + 0 paired in sequencing
          686675885 + 0 read1
          686852433 + 0 read2
          1343716094 + 0 properly paired (97.83%:nan%)
          1362706162 + 0 with itself and mate mapped
          4795666 + 0 singletons (0.35%:nan%)
          11194510 + 0 with mate mapped to a different chr
          5019248 + 0 with mate mapped to a different chr (mapQ>=5)

          Differences:
          6,026,490 QC passed reads
          6,026,490 paired in sequencing
          779,134 read 1
          5,247,356 read 2

          Any ideas? Is it just not possible to use multiple threads for this without losing data?

          Comment

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            This is troubling. Can you post the exact commands you used?

            Would it be possible for you to gzip the original SAM file and post is somewhere (google drive, dropbox, etc.) so we can track down exactly why this is happening?

            Comment

            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              I should note that it's interesting that the differences are due to unmapped reads.

              Comment

              • wdemos
                Member
                • Jun 2012
                • 31

                #8
                Below are the commands that I used to run the sort. I am unable to provide the sam file as it is human data that is not involved in a public study.

                threaded
                Run on our cluster, all threads on one server
                samtools view -bS -@ 10 -m 2G -o {sample}.bam {sample}_pe.sam

                Run on our cluster, same server as above
                samtools view -bS -o {sample}.bam {sample}_pe.sam

                Comment

                • dpryan
                  Devon Ryan
                  • Jul 2011
                  • 3478

                  #9
                  Fair enough (can't argue with HIPAA), I'll have to see if I can reproduce this with a dataset locally (since I only keep BAM files around, I'll have to realign something). There's certainly no reason those commands should produce this behavior.

                  Comment

                  • wdemos
                    Member
                    • Jun 2012
                    • 31

                    #10
                    thank you for looking into this. Greatly appreciated.

                    Comment

                    • wdemos
                      Member
                      • Jun 2012
                      • 31

                      #11
                      If it is of any use the sam file was generated as follows:
                      bwa mem -M -t 8 -R ''$RG'' $bwaREF $read1 $read2 > ${sample}_pe.sam

                      Comment

                      • dpryan
                        Devon Ryan
                        • Jul 2011
                        • 3478

                        #12
                        You read my mind, thanks

                        Comment

                        • dpryan
                          Devon Ryan
                          • Jul 2011
                          • 3478

                          #13
                          I should have looked more carefully at the command you were using. You're mixing up the samtools view and samtools sort options. It looks like you're trying to allow each compression thread to have up to 2 gigs of memory (-m 2G), but that's not what the -m option does in samtools view. What that does is in samtools view is filter by query read length according to the CIGAR string. Since unmapped reads have no CIGAR string (they have an * there), you end up filtering them out. This is the cause of the difference you're seeing.

                          BTW, there's no way to tell the compressor threads how much memory to use when doing the conversion.

                          Comment

                          • wdemos
                            Member
                            • Jun 2012
                            • 31

                            #14
                            Ah, thank you!

                            Comment

                            Latest Articles

                            Collapse

                            • SEQadmin2
                              From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                              by SEQadmin2


                              Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                              The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                              ...
                              06-02-2026, 10:05 AM
                            • SEQadmin2
                              Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                              by SEQadmin2


                              With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                              Introduction

                              Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                              05-22-2026, 06:42 AM
                            • SEQadmin2
                              Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                              by SEQadmin2

                              Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                              Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                              05-06-2026, 09:04 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by SEQadmin2, Yesterday, 08:59 AM
                            0 responses
                            13 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-02-2026, 12:03 PM
                            0 responses
                            21 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-02-2026, 11:40 AM
                            0 responses
                            19 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 05-28-2026, 11:40 AM
                            0 responses
                            31 views
                            0 reactions
                            Last Post SEQadmin2  
                            Working...