Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq question: high number of missing mates/unmatched reads

    Hi,

    I'm trying to create read count tables from my sam files with HTSeq (v0.6.1) and run into a problem I don't get.
    When I run HTSeq-count (htseq-count -r name -s yes -t "Coding gene" -i "name" -f sam foo.sam bar.gff) I encounter two things: (i) a very low percentage of reads matching to features and (ii) half of the reads are without mate (Warning: 13666808 reads with missing mate encountered)

    I wrote a simple script checking how many of my read names occur twice in the *.sam file (i.e., checking for mate names) and less than 5% are single mate reads. Additionally, I checked how many read pairs fall inside a feature and found that >40% do.

    As many people used HTSeq-count without problems I'm pretty sure I'm doing something wrong and not HTSeq. But I'd like to understand what ...

    This is a line from my sam file:
    HWI-ST218:183:H88NHADXX:1:1101:1250:2193 83 chr1 3950365 60 99M1S chr1 3950190 -274 CAACTTCAACGGCCAGAGATAAATACCGCCCTTCCCTTAATATGATTGAAATCAACCAGACCGGAACCCACAACACCACCCCATAAAGCAATACAAACTN DDDDDDDDDDDDDDEDDDEEEEDDBDDDDDDEDEEFFFFFFFHHHHHHJJJJJJJJJJHHJJJJJJJJJIJIIHGHGCJJJJJJJJJHGHHHFFFDD=1# AS:i:99
    ]
    And this a line from my gff file (gff3 produced with bp_genbank2gff script):
    chr1 RefSeq Coding gene 286 609 . - . name=chr1_10001;product="transposase"

    I changed some of the settings for testing purposes,e.g., used -r pos instead of name and ran out of memory. This was kind of expected because HTSeq doesn't find mates and therefore keeps each read in memory. I also tried strand=no without any change at all.

    Does anyone have a clue what I'm doing wrong?

    Regards,
    TimK

  • #2
    hi TimK,
    Most probably your SAM file is not Name sorted. HTSeq requires input SAM to be read name sorted. Generally if a SAM/ BAM is sorted, its coordinate-based.

    Check the header of SAM to see what sorting is present.
    To name sort, use samtools sort command and use the -n option instead of the -o option.

    HTSeq expects that the read pairs would be consecutive in SAM file. For PE data, esp. in RNA-seq, coordinate sorting screws this expectation.

    Edit -
    -o in samtools sort is for output. Sorry about confusion. Coordinate sorting is default. For name sorting use -n
    Last edited by amitm; 08-17-2014, 04:21 PM. Reason: Correction

    Comment


    • #3
      HI amitm,

      thanks for the suggestion but the bam files I used first are sorted (tried name and position).
      I then used only sam files (for human readability) and checked them manually. They seem
      to be sorted by name (at least the few % I checked were). So I am quite sure the sorting is
      not the problem.

      Comment


      • #4
        hi TimK,
        That missing mate warning is characteristic when HTSeq is supplied with non- Name-sorted SAM.
        Name sorting is not default and its different from the (coordinate) sorted BAM which (say) TopHat returns.
        Manual checking is not reasonable as there are millions. After using samtools sort with -n parameter, do you still get the same error?

        Comment


        • #5
          Just to exclude possible errors I re-did the following:
          1. I sorted a *.bam using samtools sort -n
          2. I sorted a *.sam using sort -s -k 1,1

          The result for both is the same:
          Warning: 18631016 reads with missing mate encountered.
          29112998 SAM alignment pairs processed.


          Additionally, reagarding my own scripts:
          1. >95% of the read names appear twice in the *.sam file
          2. all reads have their mate in the next line


          And I agree that it sounds very much like the sam/bam files are not correctly sorted but I can't find the mistake. Is there any naming convention and some characters I use in the names shouldn't be used?
          Also I realized that there are several versions of the sam format. I'm using samtools v0.1.19-44428cd. Could it be related to that?

          TimK
          Last edited by TimK; 08-17-2014, 07:16 PM.

          Comment


          • #6
            hi TimK,
            Samtools version is fine and it shouldn't create such an error. Nonetheless even after Name sorting if you are getting error then probably you should do these -
            1) Use Picard to calculate alignmnet statistics of your data. Is proper alignment good?

            2) Try another count generation tool like featurecounts

            This tool doesn't require Name sorting. Typical coordinate sorted BAM file can be directly taken as input. And is faster too.

            See if that helps.

            Comment


            • #7
              Hi Amitm,

              thanks a lot for the help. I used smalt for the mapping to the reference but will have a look at Picard and definitely at featurecounts.

              Hope it will work with one of those.
              Cheers,
              Tim

              Comment


              • #8
                Follow-up

                Hello, I ran into a similar problem using BAM files and then sorted using:
                Code:
                samtools sort -n -O BAM -o <output> <input>
                While this indeed fix the error message about 'missing mates' and this other warning:
                Code:
                arning: Read ACB052:253:C76YKACXX:8:2311:13183:88600 claims to have an aligned mate which could not be found in an adjacent line.
                I went ahead and compared the read counts of the sorted and unsorted after running through HTSeq-count and found that the sorted had significantly less counts even though the __no_feature, __ambiguous, and __alignment_not_unique parameters were less in the sorted file than the unsorted.

                Any thoughts on this?

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Addressing Off-Target Effects in CRISPR Technologies
                  by seqadmin






                  The first FDA-approved CRISPR-based therapy marked the transition of therapeutic gene editing from a dream to reality1. CRISPR technologies have streamlined gene editing, and CRISPR screens have become an important approach for identifying genes involved in disease processes2. This technique introduces targeted mutations across numerous genes, enabling large-scale identification of gene functions, interactions, and pathways3. Identifying the full range...
                  08-27-2024, 04:44 AM
                • seqadmin
                  Selecting and Optimizing mRNA Library Preparations
                  by seqadmin



                  Sequencing mRNA provides a snapshot of cellular activity, allowing researchers to study the dynamics of cellular processes, compare gene expression across different tissue types, and gain insights into the mechanisms of complex diseases. “mRNA’s central role in the dogma of molecular biology makes it a logical and relevant focus for transcriptomic studies,” stated Sebastian Aguilar Pierlé, Ph.D., Application Development Lead at Inorevia. “One of the major hurdles for...
                  08-07-2024, 12:11 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 08-27-2024, 04:40 AM
                0 responses
                16 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 08-22-2024, 05:00 AM
                0 responses
                293 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 08-21-2024, 10:49 AM
                0 responses
                135 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 08-19-2024, 05:12 AM
                0 responses
                124 views
                0 likes
                Last Post seqadmin  
                Working...
                X