Header Leaderboard Ad
Collapse
Using HTSeq-count when no attributes are available?
Collapse
Announcement
Collapse
No announcement yet.
X
-
Yes, one has to be very careful to analyze this sort of experiments. As you said, from the extraction until the differential expression analysis, there are a lot of sources of bias, PCR amplification, contamination, lane specific error. human manipulation, read mapping, etc. Most softwares take into account the huge amount of variation and use complex statistics models. In the case of edgeR or deseq2, both use a negative binomial distribution, a statistical distribution suitable to model over-dispersed count data. This programs test a null hypothesis, gene aren't not differentially expressed, and assign p-values, then this p-values are adjusted for multiple testing, false discovery rate. Both support complex analysis, like GLM.
-
Originally posted by elizabeth000 View PostI suppose I'll see how comparable the replicates are, and hopefully find justification for a more optimistic outlook.
Interesting situations arise when people expect you to fix experimental "problems" with informatics
Leave a comment:
-
I have already downloaded all the manuals I could find on edgeR, do not worry! I want my results to mean something at the end
It is amazing, in my opinion, that samples can be compared in these long sequencing experiments, since there are dozens of steps before the analysis stage. Every step is a source of technical error or random variability that may affect the total number of reads. Differences in sequencing quality can lead to bias, and even the efficiency of the PCR amplification and the ligation reactions could have an effect. Also, far far upstream of the analysis there was the sample harvesting... I am not even sure the same biomass was harvested from one sample to the next. Then there are the internal problems of competition (highly expressed transcripts drowning out all the others)... It seems practically inextricable to me.
I suppose I'll see how comparable the replicates are, and hopefully find justification for a more optimistic outlook.
Leave a comment:
-
Originally posted by elizabeth000 View PostI want to compare counts between experimental conditions. My next task is to normalize the count data... I am going to use edgeR.
"The variance stabilizing and rlog transformations are provided for applications other than diferential testing, for example clustering of samples or other machine learning applications. For diferential testing we recommend the DESeq function applied to raw counts"
I am not sure but I think that in edgeR happens something similar.
whatever you choose, I recommend you to follow the manual reference.
Leave a comment:
-
Thank you for the link to the EMBL tutorial. Looks good! Yes, htseq-count is useful in my case. I want to compare counts between experimental conditions. My next task is to normalize the count data... I am going to use edgeR.
I just added attributes in the Transposable Elements GFF file - that was the original reason for this post - using a script written in Tcl. I thought I would post it just in case somebody out there might want to use the same approach. I'm a big fan of Tcl.
Code:#!/usr/bin/tclsh # This script adds arbitrary numbered attributes to the ninth column in a GFF file, so it will be compatible with HTSeq tools. # The user must input the filepath, number of header lines, and the attribute prefix, and use redirect to print the results to a file. proc Attr {path header prefix} { set inId [open $path r] set iline 0 while {[gets $inId fileLine] > 0} { incr iline # Skip header lines if they exist. if {$iline > $header} { set columnList [split $fileLine] set attr "$prefix$iline" set newFileLine "[lindex $columnList 0]\t[lindex $columnList 1]\t[lindex $columnList 2]\t[lindex $columnList 3]\t[lindex $columnList 4]\t[lindex $columnList 5]\t[lindex $columnList 6]\t[lindex $columnList 7]\t$attr" puts $newFileLine } } close $inId } ################################## MAIN ############################### if {$argc != 0} { set argList [split $argv] set path [lindex $argList 0] set header [lindex $argList 1] set prefix [lindex $argList 2] } # Extract the lines with interval length above threshold. Attr $path $header $prefix
Leave a comment:
-
This is a good tutorial to perform differential expression analysis
Here, they use htseq-count to count the number of reads per gene, and they give a lot of tips.
If you aren't doing something like that, please tell us what kind of analysis you need to perform (Chip-seq, RNA-seq, smallRNA-seq, etc), to point you to a more suitable tutorial
Leave a comment:
-
Yikes, it is a good thing you mentioned this, because I was about to make a terrible mistake. I have taken over a project left behind by a student, and I just discovered it was a much older version used for the mapping. Thank you for calling my attention to this!
I have not found any comprehensive tutorial to guide me in this new project, just documentation on the various tools available... do you know of a good tutorial that can help me avoid these stupid beginner's mistakes?
Leave a comment:
-
Originally posted by elizabeth000 View PostIf I correctly understand the instructions from Simon Anders, HTSeq.scripts.count and htseq-count commands call the same script:
"The file “htseq-count” has to be in the system’s search path. By default, Python places it in its script directory, which you have to add to your search path. A maybe easier alternative is to write python -m HTSeq.scripts.count instead of htseq-count, followed by the options and arguments, which will launch the htseq-count script as well."
Genomax is right, you have to be sure that your reference and your gtf have the same version. For example, is you map your reads to the genome assembly of mouse mm10 , the gene annotations have to be for mm10 too.
Leave a comment:
-
Originally posted by elizabeth000 View PostIn the newer versions of the GFF files I am not encountering any lack of attributes so the problem is resolved.
Leave a comment:
-
If I correctly understand the instructions from Simon Anders, HTSeq.scripts.count and htseq-count commands call the same script:
"The file “htseq-count” has to be in the system’s search path. By default, Python places it in its script directory, which you have to add to your search path. A maybe easier alternative is to write python -m HTSeq.scripts.count instead of htseq-count, followed by the options and arguments, which will launch the htseq-count script as well."
Thank you very much for the detailed explanations, I understand the use of attributes much better now. In the newer versions of the GFF files I am not encountering any lack of attributes so the problem is resolved.
Leave a comment:
-
Originally posted by elizabeth000 View PostHello Diego,
Thank you for your fast reply confirming my understanding about the attributes. Defining the attribute is only necessary when there are multiple ones to choose from.
In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?
I know how to add this with Tcl, but how can I know which transposable elements should be grouped under the same ID? Any clue as to why my gff file doesn't already have attributes defined, although all the other features do?
Sorry, but I'm still unsure of things.
You have only one attribute, tandem_repeat, in the third column. But anyway you have to tell htseq-count the name of your attribute. Ok it is silly if you don't have any other attr, but consider that sometimes it is difficult for the programmer to think in all the exceptions
According what I see in your file, the last gtf column is missed. This column contains the name, ID or some other feature identification. I think the problem is that htseq-count is counting the features (in your case TANDEM_REPEAT) but it doesn't know where to assign it.
You only have to add this missed column. Don't worry about grouping. You can create an ID for each tandem repeat, for instance, TR_1, TR_2 and so on.
C21242 TRF Tandem_Repeat 38 100 72 + . . ID=TR_1;Name=TR_1
C21306 TRF Tandem_Repeat 35 143 112 + . . ID=TR_2;Name=TR_2
C21306 TRF Tandem_Repeat 574 947 208 + . . ID=TR_3;Name=TR_3
Then in htseq-count
htseq-count --format=bam --stranded=no --type=Tandem_Repeat --i ID yourbam.bam yourgtf.gft
With this, htseq-count will consider only Tandem_Repeats entries from the gtf file and will report the number of reads in each one them using the ID field from the last column.
Other thing, your are using the "HTSeq.scripts.count" script. I think this is not the regular script to count the reads. Maybe there are some differencesLast edited by diego diaz; 04-17-2015, 10:43 AM.
Leave a comment:
-
Hi,
I work with TE annotations and I use a lot gff with attributes. For instance, using RepeatMarker it pulls out a gff file with "Target" as attribute (the sequence used as library input). Of course, everyone would have their own pipe; I like to edit the attributes so I have one "ID" with my specific transposon and genome location; another attribute with the transposon ("Target") and finally another attribute showing TE family where it belongs (let's say "Name"). Then, I can play with all these and HtSeq together. Just try to find your way on the analysis, with attributes there is a bigger potential.
Leave a comment:
-
Hello Diego,
Thank you for your fast reply confirming my understanding about the attributes. Defining the attribute is only necessary when there are multiple ones to choose from.
In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?
I know how to add this with Tcl, but how can I know which transposable elements should be grouped under the same ID? Any clue as to why my gff file doesn't already have attributes defined, although all the other features do?
Sorry, but I'm still unsure of things.
Leave a comment:
-
Hi Elizabeth
The attribute line is necessary because a GTF can have several attributes, for example, when you are counting the number of reads per gene, your GTF file will have gene entries, exon entires, intron entries,etc but you only want to count the reads inside the exons (if you want to measure gene expression) .
The solution is simple, you have to use awk or excel to add an "attribute" and "id" column (id column is necessary because a feature can have several attributes, for instance, a gene can have several exons, hence in the final file the counts have to be grouped by ID). If you don't how to do this, let me know and I will post the awk sentence. Now I have to go :/
I hope it helps !
Leave a comment:
-
Using HTSeq-count when no attributes are available?
I have been using htseq-count with great success, except for one particular type of genome feature, the transposable elements (te).
There is no attribute defined in the gff file, and this appears to always be the case for this type of genome feature.
I am not involved in upstream sequencing activities, and do not understand why attributes are not always available. I do not entirely understand why htseq-count needs to know the attribute of interest, unless it must choose between multiple attributes.
Here are the first few lines of my te.gff file:
Code:C21242 TRF Tandem_Repeat 38 100 72 + . . C21306 TRF Tandem_Repeat 35 143 112 + . . C21306 TRF Tandem_Repeat 574 947 208 + . .
/usr/bin/python -m HTSeq.scripts.count --format=bam --stranded=no --type=Tandem_Repeat --idattr=. --mode=union ./BAM/ovo1.bam ./features/te.gff > ovo1te.counts
Error occured when processing GFF file (line 1 of file ./features/te.gff):
Failure parsing GFF attribute line
[Exception type: ValueError, raised in __init__.py:164]
Can someone please explain why htseq-count needs to know the attribute? Is there a way to run it with no attribute defined?
Kind Regards, Elizabeth
Latest Articles
Collapse
-
Differential Expression and Data Visualization: Recommended Tools for Next-Level Sequencing Analysisby seqadmin
After covering QC and alignment tools in the first segment and variant analysis and genome assembly in the second segment, we’re wrapping up with a discussion about tools for differential gene expression analysis and data visualization. In this article, we include recommendations from the following experts: Dr. Mark Ziemann, Senior Lecturer in Biotechnology and Bioinformatics, Deakin University; Dr. Medhat Mahmoud Postdoctoral Research Fellow at Baylor College of Medicine;...-
Channel: Articles
05-23-2023, 12:26 PM -
-
by seqadmin
Continuing from our previous article, we share variant analysis and genome assembly tools recommended by our experts Dr. Medhat Mahmoud, Postdoctoral Research Fellow at Baylor College of Medicine, and Dr. Ming "Tommy" Tang, Director of Computational Biology at Immunitas and author of From Cell Line to Command Line.
Variant detection and analysis tools
Mahmoud classifies variant detection work into two main groups: short variants (<50...-
Channel: Articles
05-19-2023, 10:03 AM -
-
by seqadmin
With new tools and computational resources being released regularly, it can be hard to determine which are best suited for the analysis process and which older tools continue to be maintained. In an effort to assist the sequencing community, we interviewed three highly skilled bioinformaticians about their recommended tools for several important analysis applications.
Quality control and preprocessing tools
“Garbage in, garbage out” is a popular...-
Channel: Articles
05-16-2023, 10:11 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Exploring French-Canadian Ancestry: Insights into Migration, Settlement Patterns, and Genetic Structure
by seqadmin
Started by seqadmin, 05-26-2023, 09:22 AM
|
0 responses
8 views
0 likes
|
Last Post
by seqadmin
05-26-2023, 09:22 AM
|
||
Started by seqadmin, 05-24-2023, 09:49 AM
|
0 responses
9 views
0 likes
|
Last Post
by seqadmin
05-24-2023, 09:49 AM
|
||
Introducing ProtVar: A Web Tool for Contextualizing and Interpreting Human Missense Variation in Proteins
by seqadmin
Started by seqadmin, 05-23-2023, 07:14 AM
|
0 responses
29 views
0 likes
|
Last Post
by seqadmin
05-23-2023, 07:14 AM
|
||
Started by seqadmin, 05-18-2023, 11:36 AM
|
0 responses
113 views
0 likes
|
Last Post
by seqadmin
05-18-2023, 11:36 AM
|
Leave a comment: