Using HTSeq-count when no attributes are available?

diego diaz replied

04-23-2015, 06:49 AM
Yes, one has to be very careful to analyze this sort of experiments. As you said, from the extraction until the differential expression analysis, there are a lot of sources of bias, PCR amplification, contamination, lane specific error. human manipulation, read mapping, etc. Most softwares take into account the huge amount of variation and use complex statistics models. In the case of edgeR or deseq2, both use a negative binomial distribution, a statistical distribution suitable to model over-dispersed count data. This programs test a null hypothesis, gene aren't not differentially expressed, and assign p-values, then this p-values are adjusted for multiple testing, false discovery rate. Both support complex analysis, like GLM.
Leave a comment:
GenoMax replied

04-23-2015, 06:23 AM
Originally posted by elizabeth000 View Post

I suppose I'll see how comparable the replicates are, and hopefully find justification for a more optimistic outlook.

On well designed (and executed) experiments replicates should be fairly close (within limits of biological variability).

Interesting situations arise when people expect you to fix experimental "problems" with informatics
Leave a comment:
elizabeth000 replied

04-23-2015, 05:56 AM
I have already downloaded all the manuals I could find on edgeR, do not worry! I want my results to mean something at the end

It is amazing, in my opinion, that samples can be compared in these long sequencing experiments, since there are dozens of steps before the analysis stage. Every step is a source of technical error or random variability that may affect the total number of reads. Differences in sequencing quality can lead to bias, and even the efficiency of the PCR amplification and the ligation reactions could have an effect. Also, far far upstream of the analysis there was the sample harvesting... I am not even sure the same biomass was harvested from one sample to the next. Then there are the internal problems of competition (highly expressed transcripts drowning out all the others)... It seems practically inextricable to me.
I suppose I'll see how comparable the replicates are, and hopefully find justification for a more optimistic outlook.
Leave a comment:
diego diaz replied

04-23-2015, 05:42 AM
Originally posted by elizabeth000 View Post

I want to compare counts between experimental conditions. My next task is to normalize the count data... I am going to use edgeR.

Deseq2 also has good functions to normalize the count data, rlog transformation and varianceStabilizingTransformation. But be careful, they said in their tutorial

"The variance stabilizing and rlog transformations are provided for applications other than diferential testing, for example clustering of samples or other machine learning applications. For diferential testing we recommend the DESeq function applied to raw counts"

I am not sure but I think that in edgeR happens something similar.

whatever you choose, I recommend you to follow the manual reference.

http://www.bioconductor.org/packages/release/bioc/manuals/edgeR/man/edgeR.pdf

Bioconductor - Help

http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf

The Bioconductor project aims to develop and share open source software for precise and repeatable analysis of biological data. We foster an inclusive and collaborative community of developers and data scientists.
Leave a comment:

elizabeth000 replied

04-23-2015, 05:28 AM

Thank you for the link to the EMBL tutorial. Looks good! Yes, htseq-count is useful in my case. I want to compare counts between experimental conditions. My next task is to normalize the count data... I am going to use edgeR.

I just added attributes in the Transposable Elements GFF file - that was the original reason for this post - using a script written in Tcl. I thought I would post it just in case somebody out there might want to use the same approach. I'm a big fan of Tcl.

Code:

#!/usr/bin/tclsh

# This script adds arbitrary numbered attributes to the ninth column in a GFF file, so it will be compatible with HTSeq tools.
# The user must input the filepath, number of header lines, and the attribute prefix, and use redirect to print the results to a file.

proc Attr {path header prefix} {
    
   set inId [open $path r]
   set iline 0
   
   while {[gets $inId fileLine] > 0} {
      incr iline
# Skip header lines if they exist. 
      if {$iline > $header} {
         set columnList [split $fileLine]
         set attr "$prefix$iline"
         set newFileLine "[lindex $columnList 0]\t[lindex $columnList 1]\t[lindex $columnList 2]\t[lindex $columnList 3]\t[lindex $columnList 4]\t[lindex $columnList 5]\t[lindex $columnList 6]\t[lindex $columnList 7]\t$attr"
         puts $newFileLine
      }
   }
   close $inId
}

################################## MAIN ###############################

if {$argc != 0} {
   set argList [split $argv]
   set path [lindex $argList 0]
   set header [lindex $argList 1]
   set prefix [lindex $argList 2]
}

# Extract the lines with interval length above threshold.
Attr $path $header $prefix

Leave a comment:

diego diaz replied

04-23-2015, 04:59 AM
This is a good tutorial to perform differential expression analysis

http://www-huber.embl.de/pub/pdf/nprot.2013.099.pdf

Here, they use htseq-count to count the number of reads per gene, and they give a lot of tips.

If you aren't doing something like that, please tell us what kind of analysis you need to perform (Chip-seq, RNA-seq, smallRNA-seq, etc), to point you to a more suitable tutorial
Leave a comment:
elizabeth000 replied

04-23-2015, 01:34 AM
Yikes, it is a good thing you mentioned this, because I was about to make a terrible mistake. I have taken over a project left behind by a student, and I just discovered it was a much older version used for the mapping. Thank you for calling my attention to this!

I have not found any comprehensive tutorial to guide me in this new project, just documentation on the various tools available... do you know of a good tutorial that can help me avoid these stupid beginner's mistakes?
Leave a comment:
diego diaz replied

04-22-2015, 05:07 AM
Originally posted by elizabeth000 View Post

If I correctly understand the instructions from Simon Anders, HTSeq.scripts.count and htseq-count commands call the same script:

"The file “htseq-count” has to be in the system’s search path. By default, Python places it in its script directory, which you have to add to your search path. A maybe easier alternative is to write python -m HTSeq.scripts.count instead of htseq-count, followed by the options and arguments, which will launch the htseq-count script as well."

Ok, I didn't know, thanks for the information

Genomax is right, you have to be sure that your reference and your gtf have the same version. For example, is you map your reads to the genome assembly of mouse mm10 , the gene annotations have to be for mm10 too.
Leave a comment:
GenoMax replied

04-22-2015, 01:11 AM
Originally posted by elizabeth000 View Post

In the newer versions of the GFF files I am not encountering any lack of attributes so the problem is resolved.

Since you had indicated that you are not a biologist in a different post, I thought I should point this out. If you are using a specific GFF file you will want to make sure that you are using the corresponding genome build/reference for alignments. If you used an incorrect combination you could be looking at subtle (or large errors) when you try to translate your alignments using the GTF file. The errors may not be easily apparent while parsing.
Leave a comment:
elizabeth000 replied

04-22-2015, 12:46 AM
If I correctly understand the instructions from Simon Anders, HTSeq.scripts.count and htseq-count commands call the same script:

"The file “htseq-count” has to be in the system’s search path. By default, Python places it in its script directory, which you have to add to your search path. A maybe easier alternative is to write python -m HTSeq.scripts.count instead of htseq-count, followed by the options and arguments, which will launch the htseq-count script as well."

Thank you very much for the detailed explanations, I understand the use of attributes much better now. In the newer versions of the GFF files I am not encountering any lack of attributes so the problem is resolved.
Leave a comment:
diego diaz replied

04-17-2015, 10:34 AM
Originally posted by elizabeth000 View Post

Hello Diego,

Thank you for your fast reply confirming my understanding about the attributes. Defining the attribute is only necessary when there are multiple ones to choose from.

In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?

I know how to add this with Tcl, but how can I know which transposable elements should be grouped under the same ID? Any clue as to why my gff file doesn't already have attributes defined, although all the other features do?

Sorry, but I'm still unsure of things.

"In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?"

You have only one attribute, tandem_repeat, in the third column. But anyway you have to tell htseq-count the name of your attribute. Ok it is silly if you don't have any other attr, but consider that sometimes it is difficult for the programmer to think in all the exceptions

According what I see in your file, the last gtf column is missed. This column contains the name, ID or some other feature identification. I think the problem is that htseq-count is counting the features (in your case TANDEM_REPEAT) but it doesn't know where to assign it.

You only have to add this missed column. Don't worry about grouping. You can create an ID for each tandem repeat, for instance, TR_1, TR_2 and so on.

C21242 TRF Tandem_Repeat 38 100 72 + . . ID=TR_1;Name=TR_1
C21306 TRF Tandem_Repeat 35 143 112 + . . ID=TR_2;Name=TR_2
C21306 TRF Tandem_Repeat 574 947 208 + . . ID=TR_3;Name=TR_3

Then in htseq-count

htseq-count --format=bam --stranded=no --type=Tandem_Repeat --i ID yourbam.bam yourgtf.gft

With this, htseq-count will consider only Tandem_Repeats entries from the gtf file and will report the number of reads in each one them using the ID field from the last column.

Other thing, your are using the "HTSeq.scripts.count" script. I think this is not the regular script to count the reads. Maybe there are some differences

Last edited by diego diaz; 04-17-2015, 10:43 AM.
Leave a comment:
cascoamarillo replied

04-17-2015, 07:54 AM
Hi,
I work with TE annotations and I use a lot gff with attributes. For instance, using RepeatMarker it pulls out a gff file with "Target" as attribute (the sequence used as library input). Of course, everyone would have their own pipe; I like to edit the attributes so I have one "ID" with my specific transposon and genome location; another attribute with the transposon ("Target") and finally another attribute showing TE family where it belongs (let's say "Name"). Then, I can play with all these and HtSeq together. Just try to find your way on the analysis, with attributes there is a bigger potential.
Leave a comment:
elizabeth000 replied

04-17-2015, 07:40 AM
Hello Diego,

Thank you for your fast reply confirming my understanding about the attributes. Defining the attribute is only necessary when there are multiple ones to choose from.

In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?

I know how to add this with Tcl, but how can I know which transposable elements should be grouped under the same ID? Any clue as to why my gff file doesn't already have attributes defined, although all the other features do?

Sorry, but I'm still unsure of things.
Leave a comment:
diego diaz replied

04-17-2015, 07:22 AM
Hi Elizabeth

The attribute line is necessary because a GTF can have several attributes, for example, when you are counting the number of reads per gene, your GTF file will have gene entries, exon entires, intron entries,etc but you only want to count the reads inside the exons (if you want to measure gene expression) .

The solution is simple, you have to use awk or excel to add an "attribute" and "id" column (id column is necessary because a feature can have several attributes, for instance, a gene can have several exons, hence in the final file the counts have to be grouped by ID). If you don't how to do this, let me know and I will post the awk sentence. Now I have to go :/

I hope it helps !
Leave a comment:
elizabeth000 started a topic Using HTSeq-count when no attributes are available?

04-17-2015, 06:23 AM
Using HTSeq-count when no attributes are available?
I have been using htseq-count with great success, except for one particular type of genome feature, the transposable elements (te).
There is no attribute defined in the gff file, and this appears to always be the case for this type of genome feature.

I am not involved in upstream sequencing activities, and do not understand why attributes are not always available. I do not entirely understand why htseq-count needs to know the attribute of interest, unless it must choose between multiple attributes.

Here are the first few lines of my te.gff file:

Code:

C21242 TRF Tandem_Repeat 38 100 72 + . . C21306 TRF Tandem_Repeat 35 143 112 + . . C21306 TRF Tandem_Repeat 574 947 208 + . .

Below, the command I hopelessly attempted to execute:

/usr/bin/python -m HTSeq.scripts.count --format=bam --stranded=no --type=Tandem_Repeat --idattr=. --mode=union ./BAM/ovo1.bam ./features/te.gff > ovo1te.counts
Error occured when processing GFF file (line 1 of file ./features/te.gff):
Failure parsing GFF attribute line
[Exception type: ValueError, raised in __init__.py:164]

Can someone please explain why htseq-count needs to know the attribute? Is there a way to run it with no attribute defined?

Kind Regards, Elizabeth
Tags: gff, htseq, htseq count, transposable element

Previous template Next

Exploring the Dynamics of the Tumor Microenvironment

by seqadmin

The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
- Channel: Articles
07-08-2024, 03:19 PM

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment: