Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq question: high number of missing mates/unmatched reads

    Hi,

    I'm trying to create read count tables from my sam files with HTSeq (v0.6.1) and run into a problem I don't get.
    When I run HTSeq-count (htseq-count -r name -s yes -t "Coding gene" -i "name" -f sam foo.sam bar.gff) I encounter two things: (i) a very low percentage of reads matching to features and (ii) half of the reads are without mate (Warning: 13666808 reads with missing mate encountered)

    I wrote a simple script checking how many of my read names occur twice in the *.sam file (i.e., checking for mate names) and less than 5% are single mate reads. Additionally, I checked how many read pairs fall inside a feature and found that >40% do.

    As many people used HTSeq-count without problems I'm pretty sure I'm doing something wrong and not HTSeq. But I'd like to understand what ...

    This is a line from my sam file:
    HWI-ST218:183:H88NHADXX:1:1101:1250:2193 83 chr1 3950365 60 99M1S chr1 3950190 -274 CAACTTCAACGGCCAGAGATAAATACCGCCCTTCCCTTAATATGATTGAAATCAACCAGACCGGAACCCACAACACCACCCCATAAAGCAATACAAACTN DDDDDDDDDDDDDDEDDDEEEEDDBDDDDDDEDEEFFFFFFFHHHHHHJJJJJJJJJJHHJJJJJJJJJIJIIHGHGCJJJJJJJJJHGHHHFFFDD=1# AS:i:99
    ]
    And this a line from my gff file (gff3 produced with bp_genbank2gff script):
    chr1 RefSeq Coding gene 286 609 . - . name=chr1_10001;product="transposase"

    I changed some of the settings for testing purposes,e.g., used -r pos instead of name and ran out of memory. This was kind of expected because HTSeq doesn't find mates and therefore keeps each read in memory. I also tried strand=no without any change at all.

    Does anyone have a clue what I'm doing wrong?

    Regards,
    TimK

  • #2
    hi TimK,
    Most probably your SAM file is not Name sorted. HTSeq requires input SAM to be read name sorted. Generally if a SAM/ BAM is sorted, its coordinate-based.

    Check the header of SAM to see what sorting is present.
    To name sort, use samtools sort command and use the -n option instead of the -o option.

    HTSeq expects that the read pairs would be consecutive in SAM file. For PE data, esp. in RNA-seq, coordinate sorting screws this expectation.

    Edit -
    -o in samtools sort is for output. Sorry about confusion. Coordinate sorting is default. For name sorting use -n
    Last edited by amitm; 08-17-2014, 04:21 PM. Reason: Correction

    Comment


    • #3
      HI amitm,

      thanks for the suggestion but the bam files I used first are sorted (tried name and position).
      I then used only sam files (for human readability) and checked them manually. They seem
      to be sorted by name (at least the few % I checked were). So I am quite sure the sorting is
      not the problem.

      Comment


      • #4
        hi TimK,
        That missing mate warning is characteristic when HTSeq is supplied with non- Name-sorted SAM.
        Name sorting is not default and its different from the (coordinate) sorted BAM which (say) TopHat returns.
        Manual checking is not reasonable as there are millions. After using samtools sort with -n parameter, do you still get the same error?

        Comment


        • #5
          Just to exclude possible errors I re-did the following:
          1. I sorted a *.bam using samtools sort -n
          2. I sorted a *.sam using sort -s -k 1,1

          The result for both is the same:
          Warning: 18631016 reads with missing mate encountered.
          29112998 SAM alignment pairs processed.


          Additionally, reagarding my own scripts:
          1. >95% of the read names appear twice in the *.sam file
          2. all reads have their mate in the next line


          And I agree that it sounds very much like the sam/bam files are not correctly sorted but I can't find the mistake. Is there any naming convention and some characters I use in the names shouldn't be used?
          Also I realized that there are several versions of the sam format. I'm using samtools v0.1.19-44428cd. Could it be related to that?

          TimK
          Last edited by TimK; 08-17-2014, 07:16 PM.

          Comment


          • #6
            hi TimK,
            Samtools version is fine and it shouldn't create such an error. Nonetheless even after Name sorting if you are getting error then probably you should do these -
            1) Use Picard to calculate alignmnet statistics of your data. Is proper alignment good?

            2) Try another count generation tool like featurecounts

            This tool doesn't require Name sorting. Typical coordinate sorted BAM file can be directly taken as input. And is faster too.

            See if that helps.

            Comment


            • #7
              Hi Amitm,

              thanks a lot for the help. I used smalt for the mapping to the reference but will have a look at Picard and definitely at featurecounts.

              Hope it will work with one of those.
              Cheers,
              Tim

              Comment


              • #8
                Follow-up

                Hello, I ran into a similar problem using BAM files and then sorted using:
                Code:
                samtools sort -n -O BAM -o <output> <input>
                While this indeed fix the error message about 'missing mates' and this other warning:
                Code:
                arning: Read ACB052:253:C76YKACXX:8:2311:13183:88600 claims to have an aligned mate which could not be found in an adjacent line.
                I went ahead and compared the read counts of the sorted and unsorted after running through HTSeq-count and found that the sorted had significantly less counts even though the __no_feature, __ambiguous, and __alignment_not_unique parameters were less in the sorted file than the unsorted.

                Any thoughts on this?

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Best Practices for Single-Cell Sequencing Analysis
                  by seqadmin



                  While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                  06-06-2024, 07:15 AM
                • seqadmin
                  Latest Developments in Precision Medicine
                  by seqadmin



                  Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                  Somatic Genomics
                  “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                  05-24-2024, 01:16 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, Today, 07:23 AM
                0 responses
                8 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 06-17-2024, 06:54 AM
                0 responses
                12 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 06-14-2024, 07:24 AM
                0 responses
                24 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 06-13-2024, 08:58 AM
                0 responses
                18 views
                0 likes
                Last Post seqadmin  
                Working...
                X