Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Zero counts in htseq-count output

    I am pretty desperate, so if anyone has any advice, I will be forever grateful.

    Basically, I am using the following command:

    htseq-count -s no -a 0 FourA.sam hg19.gtf > FourA.count

    and getting zero counts in the output. I have tried -s as yes/no (I am almost sure it should be "yes" based off the site I got my files), and -a between 0 and 10. No matter what parameters, I get zero counts.

    This is for a quick project, so I am not going for accuracy, I just want any results (of any quality)!

    Here is the background of my data
    1) Downloaded .sra file.
    2) Used fastq-dump to convert .sra file to .fastq file
    3) Only kept 1/10 sequences from .fastq file (it was too large).
    4) On Galaxy: Ran "Fastq Groomer" tool to remove the leading quality score for the adaptor base (input 'Color Space Sanger', default for the rest)
    5) On Galaxy: Ran "Manipulation Reads" to convert to base-space "0123. -> ATCGN" and remove the adaptor itself. This generated fastqcssanger file.
    6) On Galaxy: Ran ""Map with BWA for SOLiD" tool on the fastqcssanger file and mrna.fa file to get the SAM file

    I downloaded the mrna.fa file from: (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/)

    Any advice is greatly appreciated. I don't think I have time to rerun/remap. If there is any faster solution for where I went wrong in this pipeline, I would need to do that. I just need something to present, regardless of the quality, and will admit that to my class!

    (Also, I am getting an warning "Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set", although I read in other forums that is usually ignorable).
    Last edited by lanner; 05-04-2014, 07:16 PM. Reason: To add detail of the warning

  • #2
    This is head output of some of my files:

    hg19.gtf:

    chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 66999825 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67098753 67098777 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67101627 67101698 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67101627 67101698 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67105460 67105516 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";


    mrna.fasta:

    >AF001540 1
    ggcacgaggcaggtctgtctgttctgttggcaagtaaatgcagtactgtt
    ctgatcccgctgctattagaatgcattgtgaaacgactggagtatgatta
    aaagttgtgttccccaatgcttggagtagtgattgttgaaggaaaaaatc
    cagctgagtgataaggctgagtgttgaggaaatttctgcagttttaagca
    gtcgtatttgtgattgaagctgagtacatttgctggtgtatttttaggta
    aaatgcttttttgttcatttctgggtggtgggaggggactgaagccttta
    gtcttttccagatgcaaccttaaaatcagtgacaagaaacattccaaaca
    agcaacagtcttcaagaaattaaactggcaagtggaaatgtttaaacagt
    tcagtgatctttagtgcattgtttatgtgtgggtttctctctcccctccc


    FourA.sam:

    @SQ SN:AF001540 LN:1781
    @SQ SN:AF001541 LN:1138
    @SQ SN:AF001542 LN:2992
    @SQ SN:AF001543 LN:903
    @SQ SN:AF001544 LN:434
    @SQ SN:AF001545 LN:370
    @SQ SN:AF001546 LN:1142
    @SQ SN:AF001547 LN:1092
    @SQ SN:AF034176 LN:7232
    @SQ SN:AF038950 LN:2384

    FourA.count (head and tail)

    head:
    NM_000014 0
    NM_000015 0
    NM_000016 0
    NM_000017 0
    NM_000018 0
    NM_000019 0
    NM_000020 0
    NM_000021 0
    NM_000022 0
    NM_000023 0

    tail:
    NR_046431 0
    NR_046432 0
    NR_046433 0
    NR_046435 0
    NR_046436 0
    no_feature 257
    ambiguous 0
    too_low_aQual 0
    not_aligned 6366159
    alignment_not_unique 0

    Comment


    • #3
      Since you mapped against the transcriptome, htseq-count (or any tool for that matter) won't be able to give you counts. The reason for this is that these tools look at the coordinates in the SAM file (e.g., "AF001540:100-300") and then the GTF file and try to see if they overlap (and if the read overlaps multiple things). Since AF001540 (or any other transcript ID) doesn't exist as a chromosome or contig in your annotation file, this will never work.

      You have two possibilities for moving forward. (1) Map to the genome with STAR or tophat2 or a similar tool. The resulting alignments will work fine with htseq-counts (or featureCounts, which is faster). (2) Realize that the counts are equivalent to the number of reads mapping to each contig/transcript and that you can get this information with "samtools idxstats sorted.and.indexed.alignments.bam". Option (2) is the less ideal of the two, since you can't filter by MAPQ (well you can, but you need to just create a new BAM file to do so) and even if you could it would be misleading since you'd have reads that look like they map to multiple places but actually don't since it's likely you have many transcripts that overlap. I would strongly encourage you to just remap things.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Latest Developments in Precision Medicine
        by seqadmin



        Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

        Somatic Genomics
        “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
        05-24-2024, 01:16 PM
      • seqadmin
        Recent Advances in Sequencing Analysis Tools
        by seqadmin


        The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
        05-06-2024, 07:48 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 05-24-2024, 07:15 AM
      0 responses
      123 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 05-23-2024, 10:28 AM
      0 responses
      136 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 05-23-2024, 07:35 AM
      0 responses
      135 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 05-22-2024, 02:06 PM
      0 responses
      11 views
      0 likes
      Last Post seqadmin  
      Working...
      X