Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • htseq count without gene structure

    I have a file with genes but all I know are their locations and I have no information on exons and introns or any of those sorts. I have tried to generate a gff file from the gene name, chr, begin, end and strand, but htseq-count cannot count this genes without proper structure.... Is this right? is there anyway around this that I am not aware of?

    How else can I just get the number of reads per gene, without structure information?

    Thank you very much,

    Ines

  • #2
    What organism are you using? I would hope that there's a better annotation for it!

    Comment


    • #3
      It is human, but I am working with non-coding genes that have very poor structure annotation. So because of that I am trying to only use these genes in the annotation file, and disregard the rest of the genome.

      Is there a way to count just the reads per location?

      Comment


      • #4
        Ah, in that case I would expect your artificial GTF or GFF file to work with htseq-count (at least, there's no obvious reason why it wouldn't).

        Depending on the nature of your experiment and how the data looks, I would be somewhat hesitant to leave out the counts for the other genes. If there are a good number of non-coding genes (be they pseudo or ncRNAs) and you have decent coverage, then things will probably work. I would want to ensure, though, that DESeq/edgeR/whatever has enough information to properly calculate the library sizes and dispersions.

        Comment


        • #5
          Then there must be something else wrong because I keep getting all the reads in the not uniquely aligned region.

          This is what my "mock gff" looks like:
          chr2 GeneCard exon 236414395 236416210 . + . ID=exon:AGAP1-IT1.1;ID=gene:AGAP1-IT1;
          chr1 GeneCard exon 49839873 49937757 . - . ID=exon:AGBL4-IT1.1;ID=gene:AGBL4-IT1;
          chr2 GeneCard exon 27272551 27273132 . - . ID=exon:AGBL5-AS1.1;ID=gene:AGBL5-AS1;
          chr2 GeneCard exon 27283906 27284683 . + . ID=exon:AGBL5-IT1.1;ID=gene:AGBL5-IT1;
          chr6 GeneCard exon 161581146 161583014 . - . ID=exon:AGPAT4-IT1.1;ID=gene:AGPAT4-IT1;
          chr15 GeneCard exon 82764078 82798323 . - . ID=exon:AGSK1.1;ID=gene:AGSK1;


          and I called htseq-count as:

          htseq-count -i ID sample1.out.Aligned.sam myAnnotation.gff >> count_table.txt

          Do you know what could be wrong? Also it does not work when I sort the file either... it tells me the sorting is not correct...

          Thanks so much for the help.

          ps: It is an extremely good remark about the library sizes... I had no exactly considered that might be a problem. Thanks, I'll definitely look into that once this part is sorted...

          Comment


          • #6
            When you sort the input file, you need to do so by name rather than coordinate (so "samtools sort -n ..."). Presumably that led to the error you noticed.

            Regarding having most/all of your counts as non-uniquely aligned, I'm not particularly surprised to see that. If a read maps elsewhere in the genome (regardless of whether it's to a gene) then it won't get counted, since it has unclear origin. There's simply no way to get around that (well, you can, but whether the results are actually meaningful remains to be seen). You might be able to get something with the cufflinks pipeline, though you'd need to look back at the original data to see if the results seem reasonable.

            BTW, your gff file seems odd, since each line has multiple ID attributes. I assume the first line should end in 'exon "AGAP1-IT1.1"; gene "AGAP1-IT1";', though that might not lead to the error in and of itself (though it wouldn't help), given just the couple lines you posted..

            Comment


            • #7
              I am sorry, I didn't quite follow what you said was wrong with the gtf file I showed... Is it because they are all treated as exons?

              It gives me the error that the read wasn't aligned because the chr it aligned was not present. I just assumed it is the correct error if the gene that read aligns to is not there...

              So I have 51 222 712 input reads, from which 42 731 666 are uniquely mapped and the htseq-count gives me 8 689 079 alignment not unique reads.... does this means that I have the remainder as uniquely mapped to the file I gave? or what does it mean?

              Comment


              • #8
                I am sorry, I didn't quite follow what you said was wrong with the gtf file I showed... Is it because they are all treated as exons?
                Treating everything as an exon is OK (since you're just looking at a subset of unspliced things anyway), it's the equals signs and the multiple uses of "ID" attributes that could muck things up.

                It gives me the error that the read wasn't aligned because the chr it aligned was not present. I just assumed it is the correct error if the gene that read aligns to is not there...
                You can get that errors like that if your genome has chromosomes on which your annotation lacks any gene mappings (some genomes will produce a lot of these).

                So I have 51 222 712 input reads, from which 42 731 666 are uniquely mapped and the htseq-count gives me 8 689 079 alignment not unique reads.... does this means that I have the remainder as uniquely mapped to the file I gave? or what does it mean?
                Not seeing the actual output of things, I would presume that either the remainder are uniquely mapped and included in the counts or uniquely mapped but added to the "empty" column, since they don't map to a feature that you're counting.

                Comment


                • #9
                  So now to understand if it was the fact that I was using only a subset of the genes I tried using the whole genome, without structure as well. I keep getting the same error...

                  Warning: Skipping read 'HWI-ST827:920AF6ACXX:4:1101:1561:135138', because chromosome 'chr11', to which it has been aligned, did not appear in the GFF file.

                  This is how my annotation file looks like now:

                  chr6 GeneCard exon 161581146 161583014 . - . ID=exon:AGPAT4-IT1;
                  chr15 GeneCard exon 82764078 82798323 . - . ID=exon:AGSK1;
                  chr6 GeneCard exon 88410020 88410933 . - . ID=exon:AKIRIN2-AS1;

                  and this is my output Sam file:

                  HWI-ST827:920AF6ACXX:4:1101:1444:173048 99 chr17 78120611 255 2S74M = 78120738 182 CTCAGCAGGTCCTCCCGCAGNCCCATGGNGTCGAACNNGGGGGTNNCATCCACCTNCTCGCTGGTCTCGAATTCCA <<<??<:??(=;<9>@?35>#21(1=??################################################ NH:i:1 HI:i:1 AS:i:114 nM:i:3
                  HWI-ST827:920AF6ACXX:4:1101:1444:173048 147 chr17 78120738 255 55M = 78120611 -182 GCCATCGTGGCCCTGGTTCCCATGATTCAGAGTCCGCGGAAGAGCACAGCGCGCG 5,(2BA?5((/-(.-(.)(G@44GFBB<==A70@??C9E<EA<+C8@1ADD?@@B NH:i:1 HI:i:1 AS:i:114 nM:i:3

                  Also, about the reads, when I sort the htseq count output file on the reads nothing has aligned... even when I use the whole genome... So it is clearly not counting the reads correctly...

                  What am I missing?

                  Thank you so much for all the help so far!!

                  Comment


                  • #10
                    You'll get that warning if you have alignments to a chromosome or contig that's not in your annotation GTF/GFF file. You can safely ignore those.

                    You're other problem with htseq-count could be due to the odd format of your annotation file (I imagine that the "ID=" part could cause a parsing error). Download the GTF annotation from UCSC (or whatever the accompanying annotation for the genome you're using is) and use it (unchanged) with htseq-count. If that works correctly, then the problem is in your annotation file.

                    Comment


                    • #11
                      Yes I ended up doing that, and it was in fact my annotation file. I created it using awk and I think there must have been some invisible character. But it is all solved now!

                      Thank you very much for all the help!

                      If I can ask just one thing more, if not I can start a new thread, it is not exactly related.

                      Is there anyway for me to output the normalized reads the CountDataSet object uses to test for the differential expression? So after I calculate the sizeFactors and the dispersions, can I output the "read" values at that point into a matrix or something?

                      Thanks again for the help!

                      Comment


                      • #12
                        It'd be best to create a new thread, since that's not particularly related.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM
                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        29 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        31 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        28 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        52 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X