Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Zero counts in htseq-count output

    I am pretty desperate, so if anyone has any advice, I will be forever grateful.

    Basically, I am using the following command:

    htseq-count -s no -a 0 FourA.sam hg19.gtf > FourA.count

    and getting zero counts in the output. I have tried -s as yes/no (I am almost sure it should be "yes" based off the site I got my files), and -a between 0 and 10. No matter what parameters, I get zero counts.

    This is for a quick project, so I am not going for accuracy, I just want any results (of any quality)!

    Here is the background of my data
    1) Downloaded .sra file.
    2) Used fastq-dump to convert .sra file to .fastq file
    3) Only kept 1/10 sequences from .fastq file (it was too large).
    4) On Galaxy: Ran "Fastq Groomer" tool to remove the leading quality score for the adaptor base (input 'Color Space Sanger', default for the rest)
    5) On Galaxy: Ran "Manipulation Reads" to convert to base-space "0123. -> ATCGN" and remove the adaptor itself. This generated fastqcssanger file.
    6) On Galaxy: Ran ""Map with BWA for SOLiD" tool on the fastqcssanger file and mrna.fa file to get the SAM file

    I downloaded the mrna.fa file from: (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/)

    Any advice is greatly appreciated. I don't think I have time to rerun/remap. If there is any faster solution for where I went wrong in this pipeline, I would need to do that. I just need something to present, regardless of the quality, and will admit that to my class!

    (Also, I am getting an warning "Warning: Malformed SAM line: RNAME != '*' although flag bit &0x0004 set", although I read in other forums that is usually ignorable).
    Last edited by lanner; 05-04-2014, 07:16 PM. Reason: To add detail of the warning

  • #2
    This is head output of some of my files:

    hg19.gtf:

    chr1 hg19_refGene start_codon 67000042 67000044 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67000042 67000051 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 66999825 67000051 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67091530 67091593 0.000000 + 2 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67091530 67091593 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67098753 67098777 0.000000 + 1 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67098753 67098777 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67101627 67101698 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene exon 67101627 67101698 0.000000 + . gene_id "NM_032291"; transcript_id "NM_032291";
    chr1 hg19_refGene CDS 67105460 67105516 0.000000 + 0 gene_id "NM_032291"; transcript_id "NM_032291";


    mrna.fasta:

    >AF001540 1
    ggcacgaggcaggtctgtctgttctgttggcaagtaaatgcagtactgtt
    ctgatcccgctgctattagaatgcattgtgaaacgactggagtatgatta
    aaagttgtgttccccaatgcttggagtagtgattgttgaaggaaaaaatc
    cagctgagtgataaggctgagtgttgaggaaatttctgcagttttaagca
    gtcgtatttgtgattgaagctgagtacatttgctggtgtatttttaggta
    aaatgcttttttgttcatttctgggtggtgggaggggactgaagccttta
    gtcttttccagatgcaaccttaaaatcagtgacaagaaacattccaaaca
    agcaacagtcttcaagaaattaaactggcaagtggaaatgtttaaacagt
    tcagtgatctttagtgcattgtttatgtgtgggtttctctctcccctccc


    FourA.sam:

    @SQ SN:AF001540 LN:1781
    @SQ SN:AF001541 LN:1138
    @SQ SN:AF001542 LN:2992
    @SQ SN:AF001543 LN:903
    @SQ SN:AF001544 LN:434
    @SQ SN:AF001545 LN:370
    @SQ SN:AF001546 LN:1142
    @SQ SN:AF001547 LN:1092
    @SQ SN:AF034176 LN:7232
    @SQ SN:AF038950 LN:2384

    FourA.count (head and tail)

    head:
    NM_000014 0
    NM_000015 0
    NM_000016 0
    NM_000017 0
    NM_000018 0
    NM_000019 0
    NM_000020 0
    NM_000021 0
    NM_000022 0
    NM_000023 0

    tail:
    NR_046431 0
    NR_046432 0
    NR_046433 0
    NR_046435 0
    NR_046436 0
    no_feature 257
    ambiguous 0
    too_low_aQual 0
    not_aligned 6366159
    alignment_not_unique 0

    Comment


    • #3
      Since you mapped against the transcriptome, htseq-count (or any tool for that matter) won't be able to give you counts. The reason for this is that these tools look at the coordinates in the SAM file (e.g., "AF001540:100-300") and then the GTF file and try to see if they overlap (and if the read overlaps multiple things). Since AF001540 (or any other transcript ID) doesn't exist as a chromosome or contig in your annotation file, this will never work.

      You have two possibilities for moving forward. (1) Map to the genome with STAR or tophat2 or a similar tool. The resulting alignments will work fine with htseq-counts (or featureCounts, which is faster). (2) Realize that the counts are equivalent to the number of reads mapping to each contig/transcript and that you can get this information with "samtools idxstats sorted.and.indexed.alignments.bam". Option (2) is the less ideal of the two, since you can't filter by MAPQ (well you can, but you need to just create a new BAM file to do so) and even if you could it would be misleading since you'd have reads that look like they map to multiple places but actually don't since it's likely you have many transcripts that overlap. I would strongly encourage you to just remap things.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Addressing Off-Target Effects in CRISPR Technologies
        by seqadmin






        The first FDA-approved CRISPR-based therapy marked the transition of therapeutic gene editing from a dream to reality1. CRISPR technologies have streamlined gene editing, and CRISPR screens have become an important approach for identifying genes involved in disease processes2. This technique introduces targeted mutations across numerous genes, enabling large-scale identification of gene functions, interactions, and pathways3. Identifying the full range...
        08-27-2024, 04:44 AM
      • seqadmin
        Selecting and Optimizing mRNA Library Preparations
        by seqadmin



        Sequencing mRNA provides a snapshot of cellular activity, allowing researchers to study the dynamics of cellular processes, compare gene expression across different tissue types, and gain insights into the mechanisms of complex diseases. “mRNA’s central role in the dogma of molecular biology makes it a logical and relevant focus for transcriptomic studies,” stated Sebastian Aguilar Pierlé, Ph.D., Application Development Lead at Inorevia. “One of the major hurdles for...
        08-07-2024, 12:11 PM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 08-27-2024, 04:40 AM
      0 responses
      16 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 08-22-2024, 05:00 AM
      0 responses
      293 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 08-21-2024, 10:49 AM
      0 responses
      135 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 08-19-2024, 05:12 AM
      0 responses
      124 views
      0 likes
      Last Post seqadmin  
      Working...
      X