Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using HTSeq-count when no attributes are available?

    I have been using htseq-count with great success, except for one particular type of genome feature, the transposable elements (te).
    There is no attribute defined in the gff file, and this appears to always be the case for this type of genome feature.

    I am not involved in upstream sequencing activities, and do not understand why attributes are not always available. I do not entirely understand why htseq-count needs to know the attribute of interest, unless it must choose between multiple attributes.

    Here are the first few lines of my te.gff file:

    Code:
    C21242	TRF	Tandem_Repeat	38	100	72	+	.	.
    C21306	TRF	Tandem_Repeat	35	143	112	+	.	.
    C21306	TRF	Tandem_Repeat	574	947	208	+	.	.
    Below, the command I hopelessly attempted to execute:

    /usr/bin/python -m HTSeq.scripts.count --format=bam --stranded=no --type=Tandem_Repeat --idattr=. --mode=union ./BAM/ovo1.bam ./features/te.gff > ovo1te.counts
    Error occured when processing GFF file (line 1 of file ./features/te.gff):
    Failure parsing GFF attribute line
    [Exception type: ValueError, raised in __init__.py:164]

    Can someone please explain why htseq-count needs to know the attribute? Is there a way to run it with no attribute defined?

    Kind Regards, Elizabeth

  • #2
    Hi Elizabeth

    The attribute line is necessary because a GTF can have several attributes, for example, when you are counting the number of reads per gene, your GTF file will have gene entries, exon entires, intron entries,etc but you only want to count the reads inside the exons (if you want to measure gene expression) .

    The solution is simple, you have to use awk or excel to add an "attribute" and "id" column (id column is necessary because a feature can have several attributes, for instance, a gene can have several exons, hence in the final file the counts have to be grouped by ID). If you don't how to do this, let me know and I will post the awk sentence. Now I have to go :/

    I hope it helps !

    Comment


    • #3
      Hello Diego,

      Thank you for your fast reply confirming my understanding about the attributes. Defining the attribute is only necessary when there are multiple ones to choose from.

      In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?

      I know how to add this with Tcl, but how can I know which transposable elements should be grouped under the same ID? Any clue as to why my gff file doesn't already have attributes defined, although all the other features do?

      Sorry, but I'm still unsure of things.

      Comment


      • #4
        Hi,
        I work with TE annotations and I use a lot gff with attributes. For instance, using RepeatMarker it pulls out a gff file with "Target" as attribute (the sequence used as library input). Of course, everyone would have their own pipe; I like to edit the attributes so I have one "ID" with my specific transposon and genome location; another attribute with the transposon ("Target") and finally another attribute showing TE family where it belongs (let's say "Name"). Then, I can play with all these and HtSeq together. Just try to find your way on the analysis, with attributes there is a bigger potential.

        Comment


        • #5
          Originally posted by elizabeth000 View Post
          Hello Diego,

          Thank you for your fast reply confirming my understanding about the attributes. Defining the attribute is only necessary when there are multiple ones to choose from.

          In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?

          I know how to add this with Tcl, but how can I know which transposable elements should be grouped under the same ID? Any clue as to why my gff file doesn't already have attributes defined, although all the other features do?

          Sorry, but I'm still unsure of things.
          "In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?"

          You have only one attribute, tandem_repeat, in the third column. But anyway you have to tell htseq-count the name of your attribute. Ok it is silly if you don't have any other attr, but consider that sometimes it is difficult for the programmer to think in all the exceptions

          According what I see in your file, the last gtf column is missed. This column contains the name, ID or some other feature identification. I think the problem is that htseq-count is counting the features (in your case TANDEM_REPEAT) but it doesn't know where to assign it.

          You only have to add this missed column. Don't worry about grouping. You can create an ID for each tandem repeat, for instance, TR_1, TR_2 and so on.

          C21242 TRF Tandem_Repeat 38 100 72 + . . ID=TR_1;Name=TR_1
          C21306 TRF Tandem_Repeat 35 143 112 + . . ID=TR_2;Name=TR_2
          C21306 TRF Tandem_Repeat 574 947 208 + . . ID=TR_3;Name=TR_3

          Then in htseq-count

          htseq-count --format=bam --stranded=no --type=Tandem_Repeat --i ID yourbam.bam yourgtf.gft

          With this, htseq-count will consider only Tandem_Repeats entries from the gtf file and will report the number of reads in each one them using the ID field from the last column.


          Other thing, your are using the "HTSeq.scripts.count" script. I think this is not the regular script to count the reads. Maybe there are some differences
          Last edited by diego diaz; 04-17-2015, 10:43 AM.

          Comment


          • #6
            If I correctly understand the instructions from Simon Anders, HTSeq.scripts.count and htseq-count commands call the same script:

            "The file “htseq-count” has to be in the system’s search path. By default, Python places it in its script directory, which you have to add to your search path. A maybe easier alternative is to write python -m HTSeq.scripts.count instead of htseq-count, followed by the options and arguments, which will launch the htseq-count script as well."

            Thank you very much for the detailed explanations, I understand the use of attributes much better now. In the newer versions of the GFF files I am not encountering any lack of attributes so the problem is resolved.

            Comment


            • #7
              Originally posted by elizabeth000 View Post
              In the newer versions of the GFF files I am not encountering any lack of attributes so the problem is resolved.
              Since you had indicated that you are not a biologist in a different post, I thought I should point this out. If you are using a specific GFF file you will want to make sure that you are using the corresponding genome build/reference for alignments. If you used an incorrect combination you could be looking at subtle (or large errors) when you try to translate your alignments using the GTF file. The errors may not be easily apparent while parsing.

              Comment


              • #8
                Originally posted by elizabeth000 View Post
                If I correctly understand the instructions from Simon Anders, HTSeq.scripts.count and htseq-count commands call the same script:

                "The file “htseq-count” has to be in the system’s search path. By default, Python places it in its script directory, which you have to add to your search path. A maybe easier alternative is to write python -m HTSeq.scripts.count instead of htseq-count, followed by the options and arguments, which will launch the htseq-count script as well."
                Ok, I didn't know, thanks for the information

                Genomax is right, you have to be sure that your reference and your gtf have the same version. For example, is you map your reads to the genome assembly of mouse mm10 , the gene annotations have to be for mm10 too.

                Comment


                • #9
                  Yikes, it is a good thing you mentioned this, because I was about to make a terrible mistake. I have taken over a project left behind by a student, and I just discovered it was a much older version used for the mapping. Thank you for calling my attention to this!

                  I have not found any comprehensive tutorial to guide me in this new project, just documentation on the various tools available... do you know of a good tutorial that can help me avoid these stupid beginner's mistakes?

                  Comment


                  • #10
                    This is a good tutorial to perform differential expression analysis



                    Here, they use htseq-count to count the number of reads per gene, and they give a lot of tips.

                    If you aren't doing something like that, please tell us what kind of analysis you need to perform (Chip-seq, RNA-seq, smallRNA-seq, etc), to point you to a more suitable tutorial

                    Comment


                    • #11
                      Thank you for the link to the EMBL tutorial. Looks good! Yes, htseq-count is useful in my case. I want to compare counts between experimental conditions. My next task is to normalize the count data... I am going to use edgeR.

                      I just added attributes in the Transposable Elements GFF file - that was the original reason for this post - using a script written in Tcl. I thought I would post it just in case somebody out there might want to use the same approach. I'm a big fan of Tcl.

                      Code:
                      #!/usr/bin/tclsh
                      
                      # This script adds arbitrary numbered attributes to the ninth column in a GFF file, so it will be compatible with HTSeq tools.
                      # The user must input the filepath, number of header lines, and the attribute prefix, and use redirect to print the results to a file.
                      
                      proc Attr {path header prefix} {
                          
                         set inId [open $path r]
                         set iline 0
                         
                         while {[gets $inId fileLine] > 0} {
                            incr iline
                      # Skip header lines if they exist. 
                            if {$iline > $header} {
                               set columnList [split $fileLine]
                               set attr "$prefix$iline"
                               set newFileLine "[lindex $columnList 0]\t[lindex $columnList 1]\t[lindex $columnList 2]\t[lindex $columnList 3]\t[lindex $columnList 4]\t[lindex $columnList 5]\t[lindex $columnList 6]\t[lindex $columnList 7]\t$attr"
                               puts $newFileLine
                            }
                         }
                         close $inId
                      }
                      
                      ################################## MAIN ###############################
                      
                      if {$argc != 0} {
                         set argList [split $argv]
                         set path [lindex $argList 0]
                         set header [lindex $argList 1]
                         set prefix [lindex $argList 2]
                      }
                      
                      # Extract the lines with interval length above threshold.
                      Attr $path $header $prefix

                      Comment


                      • #12
                        Originally posted by elizabeth000 View Post
                        I want to compare counts between experimental conditions. My next task is to normalize the count data... I am going to use edgeR.
                        Deseq2 also has good functions to normalize the count data, rlog transformation and varianceStabilizingTransformation. But be careful, they said in their tutorial

                        "The variance stabilizing and rlog transformations are provided for applications other than diferential testing, for example clustering of samples or other machine learning applications. For diferential testing we recommend the DESeq function applied to raw counts"

                        I am not sure but I think that in edgeR happens something similar.

                        whatever you choose, I recommend you to follow the manual reference.



                        The Bioconductor project aims to develop and share open source software for precise and repeatable analysis of biological data. We foster an inclusive and collaborative community of developers and data scientists.

                        Comment


                        • #13
                          I have already downloaded all the manuals I could find on edgeR, do not worry! I want my results to mean something at the end

                          It is amazing, in my opinion, that samples can be compared in these long sequencing experiments, since there are dozens of steps before the analysis stage. Every step is a source of technical error or random variability that may affect the total number of reads. Differences in sequencing quality can lead to bias, and even the efficiency of the PCR amplification and the ligation reactions could have an effect. Also, far far upstream of the analysis there was the sample harvesting... I am not even sure the same biomass was harvested from one sample to the next. Then there are the internal problems of competition (highly expressed transcripts drowning out all the others)... It seems practically inextricable to me.
                          I suppose I'll see how comparable the replicates are, and hopefully find justification for a more optimistic outlook.

                          Comment


                          • #14
                            Originally posted by elizabeth000 View Post
                            I suppose I'll see how comparable the replicates are, and hopefully find justification for a more optimistic outlook.
                            On well designed (and executed) experiments replicates should be fairly close (within limits of biological variability).

                            Interesting situations arise when people expect you to fix experimental "problems" with informatics

                            Comment


                            • #15
                              Yes, one has to be very careful to analyze this sort of experiments. As you said, from the extraction until the differential expression analysis, there are a lot of sources of bias, PCR amplification, contamination, lane specific error. human manipulation, read mapping, etc. Most softwares take into account the huge amount of variation and use complex statistics models. In the case of edgeR or deseq2, both use a negative binomial distribution, a statistical distribution suitable to model over-dispersed count data. This programs test a null hypothesis, gene aren't not differentially expressed, and assign p-values, then this p-values are adjusted for multiple testing, false discovery rate. Both support complex analysis, like GLM.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Exploring the Dynamics of the Tumor Microenvironment
                                by seqadmin




                                The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                                07-08-2024, 03:19 PM
                              • seqadmin
                                Exploring Human Diversity Through Large-Scale Omics
                                by seqadmin


                                In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                                06-25-2024, 06:43 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 07-19-2024, 07:20 AM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-16-2024, 05:49 AM
                              0 responses
                              41 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-15-2024, 06:53 AM
                              0 responses
                              46 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-10-2024, 07:30 AM
                              0 responses
                              42 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X