Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • all_your_base
    Member
    • Mar 2012
    • 40

    Using HTSeq with Bowtie2 .sam files

    Does anyone here use HTSeq to aggregate read counts per gene for their Bowtie2 generated .SAM files? If not, what other method do you use?

    I have found that PE Illumina reads put through Bowtie2 create a .SAM that breaks HTSeq. It seems that Bowtie2 tries multiple alignments, and only reports the best possible alignment. When HTSeq sees a read's flag variable that indicates it had multiple alignments, it throws a warning and drops the read. Therefore, I believe HTSeq is only reporting counts for uniquely mapping read pairs in Bowtie2 .SAM output.

    Does anyone have any thoughts on this?
  • chadn737
    Senior Member
    • Jan 2009
    • 392

    #2
    Sounds like HTSeq is doing what it was designed to do.

    Comment

    • all_your_base
      Member
      • Mar 2012
      • 40

      #3
      That is not accurate, HTSeq is supposed to aggregate read counts based on .SAM file entries. If Bowtie2 decides a particular read is the best alignment, and therefore discards multiple alignments, that single reported alignment is supposed to be counted for expression profiling.

      If you only aggregated reads that map exactly one place to the genome, you would lose >60% of all of your counts. When Bowtie2 reports the mapping efficiency, let's say at 80%, that includes reads that have multiple alignments but only one alignment reported. Purely uniquely mapping reads really account for <40% of a typical sequencing run.

      Comment

      • all_your_base
        Member
        • Mar 2012
        • 40

        #4
        [SOLVED]

        Found the problem... It's not with HTSeq, but Bowtie2 itself.

        Bowtie2 is supposed to report the SEQ and QUAL strings for secondary alignments, which HTSeq needs to aggregate counts. There is an option that allows Bowtie2 to suppress this information as asterisks that is accidentally stuck on by default. See my other thread:

        Here.

        Comment

        • chadn737
          Senior Member
          • Jan 2009
          • 392

          #5
          <40% unique? That certainly has not been my experience. Even when working with transcriptomes of known polyploids.

          Comment

          • all_your_base
            Member
            • Mar 2012
            • 40

            #6
            Well, you are probably right, ~40% is probably most accurate for polyploid plants, but it really is near 40% in such cases.

            Try running Bowtie2 with -k 4, sorting the .SAM, and separating only those valid alignments with a single mapping positions for the forward and reverse reads. In maize, soybean, Arabidopsis, wheat, etc., I've found the uniquely mapping reads to be the minority.

            What do you typically see for mammals?

            Comment

            • chadn737
              Senior Member
              • Jan 2009
              • 392

              #7
              I work in plants. In Arabidopsis I typically see on the order of 87% of all reads (including unmapped ones) map uniquely. I have seen this same trend across multiple samples.

              For a while I was mapping B. Oleracea transcriptomes to B. Rapa before I got the Oleracea genome. Even then the uniquely mapped reads never dropped below 58%. Now I would not find <40% surprising for wheat, but for Arabidopsis that is far too low.

              But why are you using -k 4 rather than the default where Bowtie 2 finds the best alignment? By using this setting, Bowtie 2 will still report an alignment of lower quality than the best match as long as it is a valid alignment. As a result, you will increase the number of multi-mapping reads even if they are not real.

              Also...I'm assuming you are doing transcriptomes (else why use HTSeq count) in which case, why use Bowtie 2 rather than Tophat 2?

              Comment

              • all_your_base
                Member
                • Mar 2012
                • 40

                #8
                I wonder why you see so many more uniquely mapping reads than I do.

                I use -k 4 only when I want to separate uniquely mapping reads, duplicated reads, and multiple mapping reads. With -k 4, Bowtie2 will look for up to 4 alignments. Therefore, I can tell by the .SAM file is something only has a uniquely mapping location, only 2 mapping locations, or more than two. I find this is helpful for detecting duplicated genes by following the duplicated reads (those that map to exactly 2 locations).

                For situations in which I just want to do expression profiling, I don't specify a -k value.

                When analyzing transcriptomes I do use Tophat2, unless I'm specifically looking for gene duplication events, and then I use Bowtie2 and -k 4.

                Got a question for you... you ask why I would use HTSeq when mapping to genomes, but why wouldn't I? I often map illumina reads to a completed genome that is annotated with .gff file. I then use HTSeq while specifying the gene features and aggregate counts per gene.

                Is there a better way to do this?

                Comment

                • chadn737
                  Senior Member
                  • Jan 2009
                  • 392

                  #9
                  Originally posted by all_your_base View Post
                  I wonder why you see so many more uniquely mapping reads than I do.

                  I use -k 4 only when I want to separate uniquely mapping reads, duplicated reads, and multiple mapping reads. With -k 4, Bowtie2 will look for up to 4 alignments. Therefore, I can tell by the .SAM file is something only has a uniquely mapping location, only 2 mapping locations, or more than two. I find this is helpful for detecting duplicated genes by following the duplicated reads (those that map to exactly 2 locations).

                  For situations in which I just want to do expression profiling, I don't specify a -k value.

                  When analyzing transcriptomes I do use Tophat2, unless I'm specifically looking for gene duplication events, and then I use Bowtie2 and -k 4.

                  Got a question for you... you ask why I would use HTSeq when mapping to genomes, but why wouldn't I? I often map illumina reads to a completed genome that is annotated with .gff file. I then use HTSeq while specifying the gene features and aggregate counts per gene.

                  Is there a better way to do this?
                  I did not mean to imply that HTseq-count should not be used for counting the counts for a gene feature when mapping BACK to a genome. Rather I was referring to the data, whether it be RNA-seq or something like WGS. I was assuming that you were working with RNA-seq data and so wondering why you would be using Bowtie rather than Tophat to do this.

                  It does make sense to me now why you are using Bowtie 2 with -k 4.

                  But when you do this, do you account for the alignment scores? Because using -k 4 means that Bowtie 2 will report the best match and also a lower quality match as long as it still fits with the other settings.

                  I find it highly unlikely that in the vast majority of cases, even those with say 2 alignments, that both those alignments will have equal alignment scores. I would expect that in the vast majority of cases Bowtie 2 will report an alignment of lower quality than the primary alignment that it would report without the -k 4 setting. This would lead to an overestimation of the real number of reads mapping to multiple locations.

                  Comment

                  Latest Articles

                  Collapse

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by SEQadmin2, 06-05-2026, 10:09 AM
                  0 responses
                  14 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-04-2026, 08:59 AM
                  0 responses
                  24 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-02-2026, 12:03 PM
                  0 responses
                  29 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-02-2026, 11:40 AM
                  0 responses
                  23 views
                  0 reactions
                  Last Post SEQadmin2  
                  Working...