Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • nir
    Junior Member
    • Jun 2010
    • 3

    Sort problem (Tophat --> Cufflinks)

    Hi all,

    I am trying to follow the simple recipe for differential analysis in RNA-seq data without gene and transcript discovery.

    I ran tophat (v1.0.12) using the default parameters and then cuffdiff (v1.3.0) with the standard mm9 files and NCBIM37 (with added "chr" prefix to contig names; This work is done in mouse).

    I am getting an error from cuffdiff saying that the sam files are not sorted. However, looking at the files I see that they are lexicographically sorted (as expected since they are produced by cuffdiff: < chr1, chr10, chr11,...,chr2, chr3, ... , chrM/X/Y> ).

    I tried to add a header to the sam files , but it did not help.

    Here are the technical details:

    Tophat Command:
    tophat -r 150 -G $REF_FILE -o $OUT mm9 $IN_1 $IN_2

    Cuffdiff command:
    cuffdiff -o $OUT Mus_musculus.NCBIM37.61.chr.gtf $SAMPLE_1/accepted_hits.sam $SAMPLE_2/accepted_hits.sam



    Error from cuffdiff:
    ========
    You are using Cufflinks v1.3.0, which is the most recent release.
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).
    File 3//accepted_hits.sam doesn't appear to be a valid BAM file, trying SAM...
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).
    File 5//accepted_hits.sam doesn't appear to be a valid BAM file, trying SAM...
    [16:44:37] Loading reference annotation.
    [16:44:44] Inspecting maps and determining fragment length distributions.

    Error: this SAM file doesn't appear to be correctly sorted!
    current hit is at chrX:3030137, last one was at chrM:16220
    Cufflinks requires that if your file has SQ records in
    the SAM header that they appear in the same order as the chromosomes names
    in the alignments.
    If there are no SQ records in the header, or if the header is missing,
    the alignments must be sorted lexicographically by chromsome
    name and by position.
    ===============





    Edited sam files and added header:
    @HD VN:1.0 SO:coordinate
    @SQ SN:chr1 LN:197195432
    @SQ SN:chr10 LN:129993255
    @SQ SN:chr11 LN:121843856
    @SQ SN:chr12 LN:121257530
    @SQ SN:chr13 LN:120284312
    @SQ SN:chr14 LN:125194864
    @SQ SN:chr15 LN:103494974
    @SQ SN:chr16 LN:98319150
    @SQ SN:chr17 LN:95272651
    @SQ SN:chr18 LN:90772031
    @SQ SN:chr19 LN:61342430
    @SQ SN:chr2 LN:181748087
    @SQ SN:chr3 LN:159599783
    @SQ SN:chr4 LN:155630120
    @SQ SN:chr5 LN:152537259
    @SQ SN:chr6 LN:149517037
    @SQ SN:chr7 LN:152524553
    @SQ SN:chr8 LN:131738871
    @SQ SN:chr9 LN:124076172
    @SQ SN:chrM LN:16299
    @SQ SN:chrX LN:166650296
    @SQ SN:chrY LN:15902555
    @PG TopHat VN:1.0.130


    Output with headers:
    =============
    You are using Cufflinks v1.3.0, which is the most recent release.
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).
    File 3//accepted_hits.sam doesn't appear to be a valid BAM file, trying SAM...
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).
    File 5//accepted_hits.sam doesn't appear to be a valid BAM file, trying SAM...
    [20:30:45] Loading reference annotation.
    [20:30:54] Inspecting maps and determining fragment length distributions.

    Error: this SAM file doesn't appear to be correctly sorted!
    current hit is at chr10:3001890, last one was at chr1:197184857
    Cufflinks requires that if your file has SQ records in
    the SAM header that they appear in the same order as the chromosomes names
    in the alignments.
    If there are no SQ records in the header, or if the header is missing,
    the alignments must be sorted lexicographically by chromsome
    name and by position.
    ==============


    Would greatly appreciate your help!
  • ETHANol
    Senior Member
    • Feb 2010
    • 308

    #2
    I would try to convert your sam files to bam files, sort and convert back to sam files which should add a header (all with samtools) and see if you still have a problem.

    Not saying it would work, but probably worth a try since it would be very easy.
    --------------
    Ethan

    Comment

    • nir
      Junior Member
      • Jun 2010
      • 3

      #3
      Thank you so much for the quick reply.

      I'm wondering -- The sam files are already lexicographically sorted. Why should re-sorting make a difference? (is there some other technical issue I'm overlooking?).
      I'm not sure this is a header issue since I tried running with/ without a header (maybe there's something wrong with my header format?).

      Thanks again!

      Comment

      • ETHANol
        Senior Member
        • Feb 2010
        • 308

        #4
        1. Because do you really know your file is correctly sorted for sure? That's a lot of lines to read through.
        2. Because if samtools has a problem reading your sam file it might give you a better idea of what is wrong with it.
        3. Because Tophat told you your file wasn't sorted, maybe it's telling the truth.
        4. Because there is a small chance it might solve your problem.
        5. When stuff isn't working, I just throw everything at it I can think of.
        --------------
        Ethan

        Comment

        • nir
          Junior Member
          • Jun 2010
          • 3

          #5
          Thanks again for the quick reply. I really appreciate it.

          I went through the file (with a simple script) and verified the chromosome order (it's properly sorted).

          From the error msg one can see that cufflinks complains since he sees "chrX" after "chrM" (which is a bit weird). What I then found is that things work just fine if I remove the chrM lines.

          I'm not crazy about this workaround though. Any inputs?

          Comment

          • Jon_Keats
            Senior Member
            • Mar 2010
            • 279

            #6
            The source of the issue is that you added "chr" to the NCBI reference file. The chr addition is the expect format for UCSC files, which use chrMT not chrM. Ultimately, there is no reason to add chr to the reference file in the majority of situations.

            Comment

            • ETHANol
              Senior Member
              • Feb 2010
              • 308

              #7
              I filter out the chrM reads as well because HTseq-count doesn't like it that the GTF file I use (iGenomes) doesn't contain any chrM genes. It seems that chrM messes up more then one RNA-seq workflow. I don't see it as an issue. It's just one line of code that runs pretty quickly to get rid of the chrM reads. Does your GTF file contain any chrM genes. If it does't then it is no loss.

              Glad to be of absolutely no help and suggest you spend some of your time on something that didn't work. Ha ha.
              --------------
              Ethan

              Comment

              • ETHANol
                Senior Member
                • Feb 2010
                • 308

                #8
                Originally posted by Jon_Keats View Post
                The source of the issue is that you added "chr" to the NCBI reference file. The chr addition is the expect format for UCSC files, which use chrMT not chrM. Ultimately, there is no reason to add chr to the reference file in the majority of situations.
                Wow, that is just confusing. How hard would it be for NCBI, Ensembl and UCSC to just agree on chromosome names.
                --------------
                Ethan

                Comment

                • Jon_Keats
                  Senior Member
                  • Mar 2010
                  • 279

                  #9
                  NCBI and ENSEMBL do, while UCSC is fixated on messing people up with "chr", "MT versus M", and the whole 1-based and 0-based annotation issues (at least in my opinion). One good rule of thumb, is to absolutely stick with one source of annotation and what ever you do, don't mix and match UCSC stuff with other sources.... Sorry just my pet peeve after finding Illumina mis-mapped SNPs back in the day with iGenome because someone used UCSC dpSNP entries which shift the mapping by 1bp and seeing many of these issues.

                  Ethan,

                  on a separate note do you track PF rates on your CHIP-seq runs? We have some odd correlations with IP method.

                  Comment

                  • ETHANol
                    Senior Member
                    • Feb 2010
                    • 308

                    #10
                    I can say I take a glance but that is about it. Usually, I just look and say that seems reasonable and move on. There's always something more to look at, I suppose.
                    --------------
                    Ethan

                    Comment

                    • vyellapa
                      Member
                      • Oct 2011
                      • 59

                      #11
                      If you believe sort order is the real issue, running Picards ReorderSam.jar after a SamSort.jar would work.

                      Comment

                      • kesner
                        Member
                        • Apr 2012
                        • 10

                        #12
                        cuffdif problems with chrM and chrX not in sorted order

                        I made a 3 record sam file with 1 chr1, chrX, and chrM record. When chrM comes prior to chrX I get the same sort order problem. Changing chrM to chrMT does not fix the problem.
                        barry

                        Comment

                        • kesner
                          Member
                          • Apr 2012
                          • 10

                          #13
                          Also,
                          If I change chrM to a chrZ and place it after chrX there is no problem.
                          barry

                          Comment

                          Latest Articles

                          Collapse

                          • SEQadmin2
                            From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                            by SEQadmin2


                            Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                            The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                            ...
                            06-02-2026, 10:05 AM
                          • SEQadmin2
                            Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                            by SEQadmin2


                            With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                            Introduction

                            Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                            05-22-2026, 06:42 AM
                          • SEQadmin2
                            Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                            by SEQadmin2

                            Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                            Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                            05-06-2026, 09:04 AM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by SEQadmin2, Yesterday, 08:59 AM
                          0 responses
                          13 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 06-02-2026, 12:03 PM
                          0 responses
                          22 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 06-02-2026, 11:40 AM
                          0 responses
                          19 views
                          0 reactions
                          Last Post SEQadmin2  
                          Started by SEQadmin2, 05-28-2026, 11:40 AM
                          0 responses
                          31 views
                          0 reactions
                          Last Post SEQadmin2  
                          Working...