Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • wdemos
    Member
    • Jun 2012
    • 31

    Threaded Sam to Bam conversion with Samtools View

    I am attempting to speed up the sam to bam conversion of a whole genome (paired end) alignment with Samtools0-1.19. Alignment was done with bwa-0.7.7 mem algorithm
    Commands:
    samtools view -bS -o .bam pe.sam (default)
    samtools view -bS -@ 10 -m 2G -o .bam pe.sam (threaded)

    Comparing the output .bam files there is a 0.4G difference in file size.
    I ran samtools flagstat on both bam files.
    Differences:
    6,026,490 QC passed reads
    6,026,490 paired in sequencing
    779,134 read 1
    5,247,356 read 2
    all other metrics are identical

    Can anyone explain why threading would give less reads on the same input .sam file? I assume it may have to do with the merging of thread data?

    Is there a way to correct this issue without losing the speed increase provided by threading? (Currently without threading conversion takes 6 hours. With threading Conversion takes 1 hour 15 minutes)
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    That really shouldn't happen. It makes sense that you could get different sized files, but they should contain identical number of reads. Does the same thing happen if you use the new version?

    Comment

    • Richard Finney
      Senior Member
      • Feb 2009
      • 701

      #3
      MAYBE ...

      The gzip compression used by samtools binary BAM version will be using different input streams depending on threading or non-threading.

      The non-threaded version will discover a different optimization that the threaded version.

      Can both BAMS versions be re-transformed back to identical SAM files?

      dpryan is right about you should have the same read counts.
      Last edited by Richard Finney; 08-21-2014, 02:06 PM.

      Comment

      • wdemos
        Member
        • Jun 2012
        • 31

        #4
        Thanks. I am currently testing a comparison run with the latest version of samtools. Hope to have an update soon.

        Comment

        • wdemos
          Member
          • Jun 2012
          • 31

          #5
          I completed the same test with the latest version of samtools. Again there is a 0.4G difference in file size:
          10 Threads 2G:
          1367501828 + 0 in total (QC-passed reads + QC-failed reads)
          0 + 0 duplicates
          1367501828 + 0 mapped (100.00%:nan%)
          1367501828 + 0 paired in sequencing
          685896751 + 0 read1
          681605077 + 0 read2
          1343716094 + 0 properly paired (98.26%:nan%)
          1362706162 + 0 with itself and mate mapped
          4795666 + 0 singletons (0.35%:nan%)
          11194510 + 0 with mate mapped to a different chr
          5019248 + 0 with mate mapped to a different chr (mapQ>=5)

          Single Thread, default settings:
          1373528318 + 0 in total (QC-passed reads + QC-failed reads)
          0 + 0 duplicates
          1367501828 + 0 mapped (99.56%:nan%)
          1373528318 + 0 paired in sequencing
          686675885 + 0 read1
          686852433 + 0 read2
          1343716094 + 0 properly paired (97.83%:nan%)
          1362706162 + 0 with itself and mate mapped
          4795666 + 0 singletons (0.35%:nan%)
          11194510 + 0 with mate mapped to a different chr
          5019248 + 0 with mate mapped to a different chr (mapQ>=5)

          Differences:
          6,026,490 QC passed reads
          6,026,490 paired in sequencing
          779,134 read 1
          5,247,356 read 2

          Any ideas? Is it just not possible to use multiple threads for this without losing data?

          Comment

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            This is troubling. Can you post the exact commands you used?

            Would it be possible for you to gzip the original SAM file and post is somewhere (google drive, dropbox, etc.) so we can track down exactly why this is happening?

            Comment

            • dpryan
              Devon Ryan
              • Jul 2011
              • 3478

              #7
              I should note that it's interesting that the differences are due to unmapped reads.

              Comment

              • wdemos
                Member
                • Jun 2012
                • 31

                #8
                Below are the commands that I used to run the sort. I am unable to provide the sam file as it is human data that is not involved in a public study.

                threaded
                Run on our cluster, all threads on one server
                samtools view -bS -@ 10 -m 2G -o {sample}.bam {sample}_pe.sam

                Run on our cluster, same server as above
                samtools view -bS -o {sample}.bam {sample}_pe.sam

                Comment

                • dpryan
                  Devon Ryan
                  • Jul 2011
                  • 3478

                  #9
                  Fair enough (can't argue with HIPAA), I'll have to see if I can reproduce this with a dataset locally (since I only keep BAM files around, I'll have to realign something). There's certainly no reason those commands should produce this behavior.

                  Comment

                  • wdemos
                    Member
                    • Jun 2012
                    • 31

                    #10
                    thank you for looking into this. Greatly appreciated.

                    Comment

                    • wdemos
                      Member
                      • Jun 2012
                      • 31

                      #11
                      If it is of any use the sam file was generated as follows:
                      bwa mem -M -t 8 -R ''$RG'' $bwaREF $read1 $read2 > ${sample}_pe.sam

                      Comment

                      • dpryan
                        Devon Ryan
                        • Jul 2011
                        • 3478

                        #12
                        You read my mind, thanks

                        Comment

                        • dpryan
                          Devon Ryan
                          • Jul 2011
                          • 3478

                          #13
                          I should have looked more carefully at the command you were using. You're mixing up the samtools view and samtools sort options. It looks like you're trying to allow each compression thread to have up to 2 gigs of memory (-m 2G), but that's not what the -m option does in samtools view. What that does is in samtools view is filter by query read length according to the CIGAR string. Since unmapped reads have no CIGAR string (they have an * there), you end up filtering them out. This is the cause of the difference you're seeing.

                          BTW, there's no way to tell the compressor threads how much memory to use when doing the conversion.

                          Comment

                          • wdemos
                            Member
                            • Jun 2012
                            • 31

                            #14
                            Ah, thank you!

                            Comment

                            Latest Articles

                            Collapse

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by SEQadmin2, Yesterday, 11:58 AM
                            0 responses
                            10 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-05-2026, 10:09 AM
                            0 responses
                            25 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-04-2026, 08:59 AM
                            0 responses
                            35 views
                            0 reactions
                            Last Post SEQadmin2  
                            Started by SEQadmin2, 06-02-2026, 12:03 PM
                            0 responses
                            58 views
                            0 reactions
                            Last Post SEQadmin2  
                            Working...