Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • htseq-count: many reads are "no_feature".

    Hi,
    I just started to use htseq-count for my paired-end RNA-Seq data. I've name-sorted uniquely mapped reads outputted by tophat, converted the sorted bam file to sam, and used the sam file with an ensembl gtf file (version 74 for human) as the input to htseq-count. Here's the command and output.
    Command:
    python -m HTSeq.scripts.count -o out.sam -i gene_name sorted.sam Ensembl_GRCh37.74.gtf > counts.txt

    Output:
    ...
    14098588 sam line pairs processed.
    no_feature 11732479
    ambiguous 52152
    too_low_aQual 0
    not_aligned 0
    alignment_not_unique 0

    There are 21150698 reads (some paired, some not) in my sorted sam file. I understand that paired reads are counted as one, and that's why there are only 14098588 processed by htseq-count. Here are my questions:
    1) Why are there so many "no_feature" reads? Is this normal for a typical RNA-Seq experiment? Is 11732479 the actual number of reads (paired or not) or the number of single reads + pairs counted as one?
    2) I also checked the output.sam but there are only 11585093 reads there. It looks like htseq-count didn't record all the reads that it processed. So how does htseq-count choose which reads to record?

    Thank you,
    Sylvia

  • #2
    Originally posted by zzhao2 View Post
    1) Why are there so many "no_feature" reads? Is this normal for a typical RNA-Seq experiment? Is 11732479 the actual number of reads (paired or not) or the number of single reads + pairs counted as one?
    Hi- Do the chromosome names in the GTF file match those in the sam file? Ensembl files have chromosome names like 1, 2, 3 etc. while UCSC uses chr1, chr2, chr3 etc. Also, maybe check you are using the same version for both the reference fasta and the gtf (e.g. *both* are hg19).

    I guess 11732479 refers to number of fragments so a single end read counts as 1 and the two ends of a paired-end read count 1 as well.

    Good luck
    Dario

    Comment


    • #3
      Originally posted by zzhao2 View Post
      Command:
      Code:
      python -m HTSeq.scripts.count -o out.sam -i gene_name sorted.sam Ensembl_GRCh37.74.gtf > counts.txt
      You did not specify a --stranded option for htseq-count which means it is using the default setting, --stranded=yes. There are three possible settings for --stranded (yes, no, reverse). This has to be set properly according to the protocol used to generate your RNA-Seq library. In fact --stranded=yes is probably the least likely. What kit/protocol was used to construct the library in this case?

      Comment


      • #4
        Hi Dario,
        thanks for your reply. Yes I've noticed the chromosome name issue and changed them to chr1, chr2, etc. Without changing them I could only get a result with all genes' read counts being zero, and all reads as "no_feature".

        I didn't use a reference fasta file. Should I use it, and where?

        If I have 11732479 "no_feature" fragments, then it's even worse, because the total number of my fragments is 14098588, so only 17% of them are mapped to exons.

        Originally posted by dariober View Post
        Hi- Do the chromosome names in the GTF file match those in the sam file? Ensembl files have chromosome names like 1, 2, 3 etc. while UCSC uses chr1, chr2, chr3 etc. Also, maybe check you are using the same version for both the reference fasta and the gtf (e.g. *both* are hg19).

        I guess 11732479 refers to number of fragments so a single end read counts as 1 and the two ends of a paired-end read count 1 as well.

        Good luck
        Dario

        Comment


        • #5
          Hi kmcarr,
          Thanks for pointing this out. I didn't specify the stranded option because my reads are stranded. I've actually tried all three settings: yes, no, and reverse, but none of them gave a number of "no_feature" reads that's significantly less than others.

          I don't actually know the kit used, because what I have is only the data from our sequencing facility, but their report says that my reads are stranded.

          Originally posted by kmcarr View Post
          You did not specify a --stranded option for htseq-count which means it is using the default setting, --stranded=yes. There are three possible settings for --stranded (yes, no, reverse). This has to be set properly according to the protocol used to generate your RNA-Seq library. In fact --stranded=yes is probably the least likely. What kit/protocol was used to construct the library in this case?

          Comment


          • #6
            I wonder though in everyone's experience, what's an "acceptable" percent of no features? Surely no organism's annotation is perfect? Plus how often is it that the RNA that you get is 100% mRNA

            In this nature paper they found about 86% mapped to known exons http://www.nature.com/nature/journal...ture08872.html

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Exploring the Dynamics of the Tumor Microenvironment
              by seqadmin




              The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
              07-08-2024, 03:19 PM
            • seqadmin
              Exploring Human Diversity Through Large-Scale Omics
              by seqadmin


              In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
              06-25-2024, 06:43 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 07-19-2024, 07:20 AM
            0 responses
            37 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 07-16-2024, 05:49 AM
            0 responses
            47 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 07-15-2024, 06:53 AM
            0 responses
            57 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 07-10-2024, 07:30 AM
            0 responses
            43 views
            0 likes
            Last Post seqadmin  
            Working...
            X