Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem merging BAM files

    Hi everyone. We are having difficulty merging BAM files. Our pipeline so far is this:-

    50bp single-end rna-seq reads - map to hg19 with Bowtie, writing unaligned reads to a new file and writing mapped reads to a SAM file. Then map unaligned Bowtie reads with TopHat and write to a BAM file.

    We now want to merge the two output files, but there is a problem with the way the reads are sorted. We have tried (1) samtools sort on both files, then samtools reheader and samtools merge, (2) sorting files as per command on cufflinks manual page (sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted) and then merging, (3) picard sort and merge commands, and yet none of these will actually merge the files.

    An example of the picard input / output is here:

    PHP Code:
    java -Xmx2g -Dsnappy.disable=true -jar /path/to/picard-tools-1.52/SortSam.jar INPUT=bowtie.sam OUTPUT=bowtie_picardsort.bam SORT_ORDER=coordinate

    [Fri Oct 07 15:48:34 BST 2011net.sf.picard.sam.SortSam INPUT=bowtie.sam OUTPUT=bowtie_picardsort.bam SORT_ORDER=coordinate    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Fri Oct 07 15:48:34 BST 2011Executing as swebioinf@SWEBIOINFs-Mac-Pro.local on Mac OS X 10.6.8 x86_64Java HotSpot(TM64-Bit Server VM 1.6.0_26-b03-384-10M3425
    Snappy is disabled via system property
    .
    INFO    2011-10-07 15:49:56    SortSam    Read 10000000 records.
    INFO    2011-10-07 15:51:13    SortSam    Read 20000000 records.
    INFO    2011-10-07 15:52:29    SortSam    Read 30000000 records.
    INFO    2011-10-07 15:53:44    SortSam    Read 40000000 records.
    INFO    2011-10-07 15:54:59    SortSam    Read 50000000 records.
    INFO    2011-10-07 15:55:38    SortSam    Finished reading inputsmerging and writing to output now.
    [
    Fri Oct 07 16:04:53 BST 2011net.sf.picard.sam.SortSam doneElapsed time16.31 minutes.
    Runtime.totalMemory()=830885888

    java 
    -Xmx2g -Dsnappy.disable=true -jar /path/to/picard-tools-1.52/SortSam.jar INPUT=tophat.bam OUTPUT=tophat_picardsort.bam SORT_ORDER=coordinate

    [Fri Oct 07 16:07:29 BST 2011net.sf.picard.sam.SortSam INPUT=tophat.bam OUTPUT=tophat_picardsort.bam SORT_ORDER=coordinate    VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Fri Oct 07 16:07:29 BST 2011Executing as swebioinf@SWEBIOINFs-Mac-Pro.local on Mac OS X 10.6.8 x86_64Java HotSpot(TM64-Bit Server VM 1.6.0_26-b03-384-10M3425
    Snappy is disabled via system property
    .
    INFO    2011-10-07 16:07:40    SortSam    Finished reading inputsmerging and writing to output now.
    [
    Fri Oct 07 16:07:54 BST 2011net.sf.picard.sam.SortSam doneElapsed time0.40 minutes.
    Runtime.totalMemory()=527294464

    java 
    -Xmx2g -Dsnappy.disable=true -jar /path/to/picard-tools-1.52/MergeSamFiles.jar INPUT=bowtie_picardsort.bam INPUT=tophat_picardsort.bam OUTPUT=picardall.bam SORT_ORDER=coordinate

    [Fri Oct 07 16:10:02 BST 2011net.sf.picard.sam.MergeSamFiles INPUT=[bowtie_picardsort.bamtophat_picardsort.bamOUTPUT=picardall.bam SORT_ORDER=coordinate    ASSUME_SORTED=false MERGE_SEQUENCE_DICTIONARIES=false USE_THREADING=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
    [Fri Oct 07 16:10:02 BST 2011Executing as swebioinf@SWEBIOINFs-Mac-Pro.local on Mac OS X 10.6.8 x86_64Java HotSpot(TM64-Bit Server VM 1.6.0_26-b03-384-10M3425
    INFO    2011
    -10-07 16:10:02    MergeSamFiles    Input files are in same order as output so sorting to temp directory is not needed.
    [
    Fri Oct 07 16:10:02 BST 2011net.sf.picard.sam.MergeSamFiles doneElapsed time0.00 minutes.
    Runtime.totalMemory()=85000192
    Exception in thread 
    "main" net.sf.samtools.util.SequenceUtil$SequenceListsDifferExceptionSequences at index 1 don't match: 1/243199373/chr2 1/135534747/chr10
        at net.sf.samtools.util.SequenceUtil.assertSequenceListsEqual(SequenceUtil.java:109)
        at net.sf.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:138)
        at net.sf.picard.sam.SamFileHeaderMerger.getSequenceDictionary(SamFileHeaderMerger.java:482)
        at net.sf.picard.sam.SamFileHeaderMerger.<init>(SamFileHeaderMerger.java:133)
        at net.sf.picard.sam.MergeSamFiles.doWork(MergeSamFiles.java:137)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:175)
        at net.sf.picard.sam.MergeSamFiles.main(MergeSamFiles.java:84) 
    The error seems to be this:

    Exception in thread "main" net.sf.samtools.util.SequenceUtil$SequenceListsDifferException: Sequences at index 1 dont match: 1/243199373/chr2 1/135534747/chr10

    I think the problem lies somewhere with the way bowtie and tophat sort their output by default. Bowtie sorts numerically, chr1, chr2, chr3... etc whereas TopHat sorts chr1, chr10, chr11......chr19, chr2, chr20, chr21 etc. But we have sorted both files again using picard SortSam so we cannot understand why this discrepancy is not corrected. (Same happens with samtools sort).

    Can anyone help us please?

    Many thanks

    Helen

  • #2
    Can you double check the embedded SAM text header (@SQ lines) and the BAM binary header (sequence names and lengths) are consistent? I suspect "samtools reheader" doesn't check this.

    [Just a guess - there may well be other reasons why the merge fails]
    Last edited by maubp; 10-10-2011, 01:51 AM.

    Comment


    • #3
      Here are the headers:

      Tophat output before Picard sorting:
      PHP Code:
      @HD    VN:1.0    SO:coordinate
      @SQ    SN:chr1    LN:249250621
      @SQ    SN:chr10    LN:135534747
      @SQ    SN:chr11    LN:135006516
      @SQ    SN:chr12    LN:133851895
      @SQ    SN:chr13    LN:115169878
      @SQ    SN:chr14    LN:107349540
      @SQ    SN:chr15    LN:102531392
      @SQ    SN:chr16    LN:90354753
      @SQ    SN:chr17    LN:81195210
      @SQ    SN:chr18    LN:78077248
      @SQ    SN:chr19    LN:59128983
      @SQ    SN:chr2    LN:243199373
      @SQ    SN:chr20    LN:63025520
      @SQ    SN:chr21    LN:48129895
      @SQ    SN:chr22    LN:51304566
      @SQ    SN:chr3    LN:198022430
      @SQ    SN:chr4    LN:191154276
      @SQ    SN:chr5    LN:180915260
      @SQ    SN:chr6    LN:171115067
      @SQ    SN:chr7    LN:159138663
      @SQ    SN:chr8    LN:146364022
      @SQ    SN:chr9    LN:141213431
      @SQ    SN:chrM    LN:16571
      @SQ    SN:chrX    LN:155270560
      @SQ    SN:chrY    LN:59373566 
      Tophat output after picard sorting:-
      PHP Code:
      @HD    VN:1.0    SO:coordinate
      @SQ    SN:chr1    LN:249250621
      @SQ    SN:chr10    LN:135534747
      @SQ    SN:chr11    LN:135006516
      @SQ    SN:chr12    LN:133851895
      @SQ    SN:chr13    LN:115169878
      @SQ    SN:chr14    LN:107349540
      @SQ    SN:chr15    LN:102531392
      @SQ    SN:chr16    LN:90354753
      @SQ    SN:chr17    LN:81195210
      @SQ    SN:chr18    LN:78077248
      @SQ    SN:chr19    LN:59128983
      @SQ    SN:chr2    LN:243199373
      @SQ    SN:chr20    LN:63025520
      @SQ    SN:chr21    LN:48129895
      @SQ    SN:chr22    LN:51304566
      @SQ    SN:chr3    LN:198022430
      @SQ    SN:chr4    LN:191154276
      @SQ    SN:chr5    LN:180915260
      @SQ    SN:chr6    LN:171115067
      @SQ    SN:chr7    LN:159138663
      @SQ    SN:chr8    LN:146364022
      @SQ    SN:chr9    LN:141213431
      @SQ    SN:chrM    LN:16571
      @SQ    SN:chrX    LN:155270560
      @SQ    SN:chrY    LN:59373566 
      Bowtie output before picard sorting:
      PHP Code:
      @HD    VN:1.0    SO:unsorted
      @SQ    SN:chr1    LN:249250621
      @SQ    SN:chr2    LN:243199373
      @SQ    SN:chr3    LN:198022430
      @SQ    SN:chr4    LN:191154276
      @SQ    SN:chr5    LN:180915260
      @SQ    SN:chr6    LN:171115067
      @SQ    SN:chr7    LN:159138663
      @SQ    SN:chr8    LN:146364022
      @SQ    SN:chr9    LN:141213431
      @SQ    SN:chr10    LN:135534747
      @SQ    SN:chr11    LN:135006516
      @SQ    SN:chr12    LN:133851895
      @SQ    SN:chr13    LN:115169878
      @SQ    SN:chr14    LN:107349540
      @SQ    SN:chr15    LN:102531392
      @SQ    SN:chr16    LN:90354753
      @SQ    SN:chr17    LN:81195210
      @SQ    SN:chr18    LN:78077248
      @SQ    SN:chr19    LN:59128983
      @SQ    SN:chr20    LN:63025520
      @SQ    SN:chr21    LN:48129895
      @SQ    SN:chr22    LN:51304566
      @SQ    SN:chrX    LN:155270560
      @SQ    SN:chrY    LN:59373566
      @SQ    SN:chrM    LN:16571 
      Bowtie output after picard sorting:
      PHP Code:
      @HD    VN:1.0    SO:coordinate
      @SQ    SN:chr1    LN:249250621
      @SQ    SN:chr2    LN:243199373
      @SQ    SN:chr3    LN:198022430
      @SQ    SN:chr4    LN:191154276
      @SQ    SN:chr5    LN:180915260
      @SQ    SN:chr6    LN:171115067
      @SQ    SN:chr7    LN:159138663
      @SQ    SN:chr8    LN:146364022
      @SQ    SN:chr9    LN:141213431
      @SQ    SN:chr10    LN:135534747
      @SQ    SN:chr11    LN:135006516
      @SQ    SN:chr12    LN:133851895
      @SQ    SN:chr13    LN:115169878
      @SQ    SN:chr14    LN:107349540
      @SQ    SN:chr15    LN:102531392
      @SQ    SN:chr16    LN:90354753
      @SQ    SN:chr17    LN:81195210
      @SQ    SN:chr18    LN:78077248
      @SQ    SN:chr19    LN:59128983
      @SQ    SN:chr20    LN:63025520
      @SQ    SN:chr21    LN:48129895
      @SQ    SN:chr22    LN:51304566
      @SQ    SN:chrX    LN:155270560
      @SQ    SN:chrY    LN:59373566
      @SQ    SN:chrM    LN:16571 
      You can see that they are sorted differently and that picard sorting has not corrected this. Why? How can I fix this?

      Comment


      • #4
        Use Picard ReorderSam with the same ref.fa file for both bam files. SortSam / samtools sort just sorts the alignments in the same order as the SQ lines in the header.
        Last edited by Chipper; 10-10-2011, 11:57 AM.

        Comment


        • #5
          Why not just align all reads with tophat? I'd guess the tophat alignment is not working well as all the exon restricted reads are lost in the bowtie alignment. Basically tophat is doing your pipeline as it does what you are doing and leverages the exon reads you are removing by doing bowtie first.

          Comment


          • #6
            Chipper - thank you. I will try your suggestion today.

            Jon - We run Bowtie using best and strata settings, and trim the 3' end of the read, which as far as I am aware you cannot do using TopHat. We have spent considerable time optimising our mapping and find TopHat alone is not enough for our particular project.
            Last edited by hlwright; 10-11-2011, 01:16 AM.

            Comment


            • #7
              Thanks Chipper, your advice has solved the problem.

              For anyone else who may be having the same problem, here is the solution:-

              Create a sequence dictionary for your reference.fa in the same folder:-
              PHP Code:
              java -Xmx2g -jar CreateSequenceDictionary.jar REFERENCE=hg19.fa OUTPUT=hg19.dict 
              Then reorder each of your bam files:-
              PHP Code:
              java -Xmx2g -jar ReorderSam.jar INPUT=bowtie.bam OUTPUT=bowtie_reorder.bam REFERENCE=/path/to/hg19.fa

              java 
              -Xmx2g -jar ReorderSam.jar INPUT=tophat.bam OUTPUT=tophat_reorder.bam REFERENCE=/path/to/hg19.fa 
              Then merge your bam files:
              PHP Code:
              java -Xmx2g -jar MergeSamFiles.jar INPUT=bowtie_reorder.bam INPUT=tophat_reorder.bam OUTPUT=reorder_merge.bam SORT_ORDER=coordinate 

              Once again thank you to everyone who took the time to read and reply to this post. Your help is very much appreciated

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, Yesterday, 06:37 PM
              0 responses
              8 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, Yesterday, 06:07 PM
              0 responses
              8 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              49 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              67 views
              0 likes
              Last Post seqadmin  
              Working...
              X