Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • cpleis
    Junior Member
    • Oct 2012
    • 8

    Issue with htseq-count

    Hello,
    I am working on an RNA-Seq project in soybean. I used Illumina HiSeq for single end reads on my soybean samples and am working on my alignment. I have run into a problem when trying to get my read counts from HTSeq. I used Tophat2 to do my alignment. When I use htseq-count to get read counts from my accepted_hits file against the soybean GFF3 file I get the error:

    Warning: Skipping read 'DBRHHJN1:263:C0REVACXX:5:1101:2413:101528', because chromosome 'Glyma07g15800.1|pacid:16267415', to which it has been aligned, did not appear in the GFF file.

    This occurs for each read and the job stops running.

    Here is my command line for htseq-count:
    module load python/2.7.3
    htseq-count -m union -s no -t gene -i ID /home/a-m/leisner1/accepted_hits.bam/accepted_hits_Blk1Con.sam /home/a-m/leisner1/accepted_hits.bam/Gmax_109_gene.gff3 > Blk1Con_counts_out

    I realized that the accepted_hits.bam file (that I sorted by read name and converted to a sam file) has a different gene name than the GFF3 file I am using. Since I have multiple isoforms of my transcripts my gene names end in '.1', '.2' while the gene names in my GFF file do not. I am not sure how to solve this problem. When the bioinformatics core originally ran my alignment for me they said that to avoid this problem they extracted sequences for CDS, 5’ UTR and 3’ UTRs for each gene based on the coordinates in the GFF3 file provided by phytozome (version 8) and then concatenated them to form a single reference transcript sequence for each gene (.fna file). If I use this new file to do my alignment in TopHat I could get around the naming issue. However, they said that I may have to modify the GFF3 file with gene names and coordinates matching the Gmax_genes.fna file.

    My overall question is 1) is there an easier/simpler way to avoid the different gene names so I can get htseq-count to work, and 2) if not, how to I modify the GFF3 file in order to get my read counts?

    I realize I could simply 'trim' the names of the files (with some simple code I'm sure) but then I am not sure if that would mess up my count data.

    Thanks!
  • Simon Anders
    Senior Member
    • Feb 2010
    • 995

    #2
    There are good reasons to align against the genome, not the transcriptome (see various previous threads here). Please don't deviate from the accepted best practise unless you know what you are doing. Especially, using TopHat and htseq-count when aligning against a transcriptome does not make any sense. (Read up on what these tools do precisely to se why.)

    Comment

    • cpleis
      Junior Member
      • Oct 2012
      • 8

      #3
      Simon,
      Thank you for your response. I re-did the alignment using the complete nucleotide sequence (Gmax_109.fa) instead of the transcriptome using TopHat. I then sorted the accepted_hits.bam file by read name (-n) using samtools and then converted this sorted file into a SAM file. When I loaded this file into htseq-count I am getting this error repeatedly:
      Warning: Skipping read 'DBRHHJN1:263:C0REVACXX:5:1101:1886:109580', because chromosome 'scaffold_785', to which it has been aligned, did not appear in the GFF file.

      I am still using the same htseq-count code as before the same GFF3 file. I am not really sure how to tackle this error.

      Thanks!

      Comment

      • Simon Anders
        Senior Member
        • Feb 2010
        • 995

        #4
        So, does the scaffold appear in your GFF file?

        Have you checked whether the chromosome names in the genome fasta file and in the gff file match?

        Comment

        • cpleis
          Junior Member
          • Oct 2012
          • 8

          #5
          Simon,
          It looks like my gff file does not contain information about scaffold_785 and that is why reads mapping to scaffold_785 are skipped for counting. I'm assuming then that there is most likely there is no annotation information available for scaffold_785 and so is not present in gff file. There are only a handful of scaffolds that are being skipped. My chromosome names are the same in both files.

          I re-ran the analysis and got an output file with:
          no_feature 2421689
          ambiguous 345554
          too_low_aQual 0
          not_aligned 0
          alignment_not_unique 19096489

          I started with 44,613,416 reads so I'm assuming this output is "normal" (?).

          Thanks for the help!

          Comment

          • Simon Anders
            Senior Member
            • Feb 2010
            • 995

            #6
            Should be okay. Only the fraction of reads with ambiguous alignment is quite high. Transcribed sequence is usually not that repetitive that you would have so many reads mapping to more than one location. Maybe somethig went wrong in your assembly, and some contigs appear in several scaffolds.

            Comment

            • cpleis
              Junior Member
              • Oct 2012
              • 8

              #7
              Would a large number of isoforms not account for the high number of ambiguous alignments? I was also going to try running my htseq-count alignment with a different mode. Currently I am using union, but I was also going to try running it with intersection_nonempty to see what kind of output I get.

              If it is a problem with my alignment how would I troubleshoot that? Thanks!

              Comment

              • Simon Anders
                Senior Member
                • Feb 2010
                • 995

                #8
                Did you align against the genome or the transcriptome? If the latter, using htseq-count is pointless, of course.

                Comment

                • cpleis
                  Junior Member
                  • Oct 2012
                  • 8

                  #9
                  I re-did the alignment against the genome.

                  Comment

                  Latest Articles

                  Collapse

                  • SEQadmin2
                    From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                    by SEQadmin2


                    Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                    The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                    ...
                    Yesterday, 10:05 AM
                  • SEQadmin2
                    Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                    by SEQadmin2


                    With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                    Introduction

                    Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                    05-22-2026, 06:42 AM
                  • SEQadmin2
                    Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                    by SEQadmin2

                    Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                    Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                    05-06-2026, 09:04 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by SEQadmin2, Yesterday, 12:03 PM
                  0 responses
                  19 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, Yesterday, 11:40 AM
                  0 responses
                  14 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 05-28-2026, 11:40 AM
                  0 responses
                  29 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 05-26-2026, 10:12 AM
                  0 responses
                  31 views
                  0 reactions
                  Last Post SEQadmin2  
                  Working...