  • Jon_Keats
    Senior Member
    • Mar 2010
    • 279

    TopHat v1.2.0 sort header

    I never used TopHat v1.1.1, which listed a fix for the SAM sort header (see below), but in the newest version, TopHat v1.2.0, all my SAM files have a header sort order of "sorted" rather than "coordinate". Oddly, passing the file through Samtools sort does not fix the problem, but passing it through Picard does. Also, the sequences in the SAM header are not listed in numeric order:

    Such as:
    chr1
    chr11
    chr12
    ...
    chr2
    chr20

    Not:
    chr1
    chr2
    chr3
    chr4


    Anyone else seeing these minor issues?


    ----Previous release notes-----

    TopHat 1.1.1 release 10/11/2010

    This release of TopHat includes some fixes related to Colorspace read mapping.

    * Negative quality values are now handled correctly.
    * Comments at the beginning of csfasta files no longer trigger an error.
    * --integer-quals no longer conflicts with -i
    * The header in TopHat BAM files now correctly lists the sort order as coordinate, with group order reference
  • colindaven
    Senior Member
    • Oct 2008
    • 417

    #2
    Picard seems to do a better job than Samtools of recording in the SAM header whether the BAM has been sorted. I've been working with a group of computer scientists who picked up on this one, so I have switched from Samtools to Picard for SAM/BAM conversion and sorting.


    • cjp
      Member
      • Jun 2011
      • 58

      #3
      Hi Jon,

      You can use picard ReorderSam:





      First, TopHat gives the wrong sort order in the header, so you'll have to change that, or else Picard will complain.

      e.g.,

      samtools view -H acc.bam | sed 's/sorted/unsorted/' > acc.header.sam
      samtools reheader acc.header.sam acc.bam > acc_head.bam
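Note that `sed 's/sorted/unsorted/'` replaces the first "sorted" on every header line, not only in the @HD record. A slightly more defensive sketch restricts the substitution to the SO tag on the @HD line. This assumes the header reads `SO:sorted`; the function name `fix_hd_sort_order` is hypothetical, and the file names are reused from the commands above:

```shell
# Rewrite only the SO: tag of the @HD line; all other header lines pass through.
fix_hd_sort_order() {
    sed '/^@HD/ s/SO:sorted/SO:unsorted/'
}

# Intended use with samtools (assumes samtools is on PATH and acc.bam exists):
#   samtools view -H acc.bam | fix_hd_sort_order > acc.header.sam
#   samtools reheader acc.header.sam acc.bam > acc_head.bam

# Demonstration on a small made-up header (fields are tab-separated in real SAM).
printf '@HD\tVN:1.0\tSO:sorted\n@SQ\tSN:chr1\tLN:1000\n' | fix_hd_sort_order
```

Anchoring on `^@HD` keeps a stray "sorted" elsewhere in the header, for example inside a @PG command line, from being rewritten.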

      This is the picard command that works for me (it uses the order of sequences in the reference file in the output BAM file):

      java -jar /path/to/picard/jars/ReorderSam.jar I=acc_head.bam O=acc_order.bam R=/path/to/ref/human_g1k_v37.fasta

      Then you may have to re-sort the BAM file. Although, if you trust TopHat's sorting, I guess you can change the sed line above to: sed 's/sorted/coordinate/'. For me, I like to add read group info anyway as a lot of software like GATK needs them to run, so I use picard AddOrReplaceReadGroups, which also allows you to sort with the SO option:

      e.g.,

      name="MY_SAMPLE"

      java -jar /path/to/picard/jars/AddOrReplaceReadGroups.jar I=acc_order.bam O=acc_rg.bam RGID=$name RGLB=$name RGPL=ILLUMINA RGPU=$name RGSM=$name SO=coordinate

      There are other options I use with picard:

      TMP_DIR=/path/to/tmp VALIDATION_STRINGENCY=SILENT VERBOSITY=ERROR QUIET=true CREATE_INDEX=true

      Chris


      • cedance
        Senior Member
        • Feb 2011
        • 108

        #4
        The post is quite old, and newer versions of TopHat (since 1.3.0, I guess), developed in collaboration with the Picard developers, have fixed these issues as well as others, such as the SAM format TLEN field.
        It's better to use 1.3.1 (1.3.2 is out but still in beta), in my opinion.


        • cjp
          Member
          • Jun 2011
          • 58

          #5
          Originally posted by cedance
          The post is quite old, and newer versions of TopHat (since 1.3.0, I guess), developed in collaboration with the Picard developers, have fixed these issues as well as others, such as the SAM format TLEN field.
          It's better to use 1.3.1 (1.3.2 is out but still in beta), in my opinion.

          I still use old versions of TopHat for short reads because the TopHat page says this about version 1.3:

          "For short reads (usually <45-bp), it is recommended that users decrease segment length (--segment-length) to about half the read length and segment mismatches (--segment-mismatches) to 0 or 1"

          When I ran it on 36 bp data, I had to adjust these settings, and I got different results than I did with TopHat 1.2: reads didn't align across splice sites, or aligned in different places or across different splice sites. In the help pages, I couldn't find an explanation of why the new version needs these parameter changes when older versions of TopHat did not.

          On long read data, I think TopHat 1.3 seems to work well.


          • cedance
            Senior Member
            • Feb 2011
            • 108

            #6
            Didn't know that. Thanks for letting me know. Yes, that makes total sense. Fortunately, I work on 80 bp paired-end reads. The problem I faced with TopHat 1.2.0 is that column 9 of the SAM format (TLEN) was always 0. I would like to know the entire fragment length that's mapped.
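One rough way to check whether a file is affected is to count alignments whose column 9 (TLEN) is 0. A sketch with awk, reading SAM text on standard input. The function name `count_zero_tlen` is made up for illustration, and a zero is legitimate for unpaired reads or mates on different references, so the count is only a hint:

```shell
# Count alignment records (non-header lines) whose 9th column, TLEN, equals 0.
count_zero_tlen() {
    awk -F '\t' '!/^@/ && $9 == 0 { n++ } END { print n + 0 }'
}

# Intended use (assumes samtools is installed; acc.bam is a placeholder name):
#   samtools view acc.bam | count_zero_tlen
```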

            Best,
            Arun.


            • cjp
              Member
              • Jun 2011
              • 58

              #7
              That's true - you can use picard FixMateInformation, but I don't know how well it works with reads that align over introns:



              Chris


              • cjp
                Member
                • Jun 2011
                • 58

                #8
                Are you sure it is fixed? I just found one of my TopHat 1.3 files (the @PG line says so, anyway), and its header looks like this (chromosomes are still in the order 1, 10, 11):

                @HD VN:1.0 SO:coordinate
                @SQ SN:1 LN:249250621
                @SQ SN:10 LN:135534747
                @SQ SN:11 LN:135006516
                ...
                @SQ SN:GL000247.1 LN:36422
                @SQ SN:GL000248.1 LN:39786
                @SQ SN:GL000249.1 LN:38502
                @SQ SN:MT LN:16569
                @SQ SN:X LN:155270560
                @SQ SN:Y LN:59373566
                @PG ID:TopHat VN:1.3.1 CL:/home/cjp64/src/tophat-1.3.1/src/tophat -p 12 --segment-length 15 --segment-mismatches 0 -o A37_2_west -G /home/easih/gtf/hg19_ccds_08022011.gtf /home/easih/refs/human_1kg/bowtie/human_g1k_v37 /scratch/svvd2/A37/A3700002.1.f
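To see the sequence order a header declares without reading it by eye, the SN values of the @SQ lines can be pulled out in file order. A sketch, where the function name `sq_names` is hypothetical and the input is assumed to be tab-separated as in a real SAM header (the listing above shows spaces only because of how the forum displays it):

```shell
# Print the SN: value of each @SQ header line, in the order the lines appear.
sq_names() {
    awk -F '\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }'
}

# Intended use (assumes samtools is installed; acc.bam is a placeholder name):
#   samtools view -H acc.bam | sq_names

# Demonstration on a made-up three-line header.
printf '@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:10\tLN:135534747\n' | sq_names
```

The resulting list can then be compared against the sequence names of the reference, for instance the first column of a `.fai` index produced by `samtools faidx`.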


                • cedance
                  Senior Member
                  • Feb 2011
                  • 108

                  #9
                  cjp, it's fixed in TopHat 1.3.1
                  Originally posted by tophat
                  TLEN field in SAM format is correctly output
                  I was talking about this (the TLEN values in column 9, 159 and -159, which were in bold in the original post):

                  5_Solexa_0503:5:75:14816:7572#0 99 SL2.40ch01 17202 255 75M = 17281 159 CGGCCGCACAGTTATTCGTGATGTCGCCATCGGATGTGGCCATAGTAATCACGGTATGTTTATTGGGGCTGCCGG CCCCCCCCCCCCCCCCCCCCCCCDCCCCCCCCCACCBCCCC@C@C@DCDCABBC?BA=@CC<@C=BBBDB@:@8@ NH:i:1 NM:i:2
                  5_Solexa_0503:5:75:14816:7572#0 147 SL2.40ch01 17281 255 80M = 17202 -159 TTGGGTCTTGGAGGAGGCTCTATGTCACTTGTTGGACAACTCGGTGGACAAACAGGTGGAGCCTTTAGTTACTGTTTGGA >DCCB@@?@DCDDCDBCDCACDCDAC?A=CCCCCCACCDCCACCCCCDCCCDCCCCCCCCCCCCCCCCCCCCCCCCCCCC NH:i:1 NM:i:2
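As a sanity check on the two records above: the leftmost mate starts at 17202, and the rightmost mate starts at 17281 with an 80M CIGAR, so it ends at 17281 + 80 - 1 = 17360. The template length is the span from the leftmost to the rightmost mapped base, which can be worked out with shell arithmetic (the variable names are illustrative):

```shell
left_start=17202    # POS of the first record (flag 99)
right_start=17281   # POS of the second record (flag 147)
right_len=80        # reference bases consumed by its 80M CIGAR

right_end=$(( right_start + right_len - 1 ))   # last mapped base of the right mate
tlen=$(( right_end - left_start + 1 ))         # span from leftmost to rightmost base
echo "$tlen"        # 159, matching column 9 (159 / -159) in the records above
```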


                  • cjp
                    Member
                    • Jun 2011
                    • 58

                    #10
                    The original post says this:

                    Originally posted by Jon_Keats
                    I never used Tophat v1.1.1 which listed a fix for the sam sort header (see below) but in newest version, TopHat v1.2.0 all my sam files have sort headers of "sorted" not "coordinate". Oddly, parsing the file through Samtools sort does not fix the problem but parsing it through Picard does. Also the sam headers are not listed in numeric order:

                    Such as:
                    chr1
                    chr11
                    chr12
                    ...
                    chr2
                    chr20

                    Not:
                    chr1
                    chr2
                    chr3
                    chr4


                    Anyone else seeing these minor issues?

                    ...
                    So this was why I mentioned ReorderSam. The TLEN issue is a separate one. I thought you said TopHat 1.3 fixed the wrong chromosome order in the header, but it doesn't seem to have done so in my file.


                    • cjp
                      Member
                      • Jun 2011
                      • 58

                      #11
                      Here is an example of short reads being aligned differently by TopHat 1.1, 1.2 and 1.3 (even though I set the segment values the same as mentioned in the TopHat home page):



                      Chris
