Unconfigured Ad
Collapse
X
-
Yes, one has to be very careful to analyze this sort of experiments. As you said, from the extraction until the differential expression analysis, there are a lot of sources of bias, PCR amplification, contamination, lane specific error. human manipulation, read mapping, etc. Most softwares take into account the huge amount of variation and use complex statistics models. In the case of edgeR or deseq2, both use a negative binomial distribution, a statistical distribution suitable to model over-dispersed count data. This programs test a null hypothesis, gene aren't not differentially expressed, and assign p-values, then this p-values are adjusted for multiple testing, false discovery rate. Both support complex analysis, like GLM.
-
-
On well designed (and executed) experiments replicates should be fairly close (within limits of biological variability).Originally posted by elizabeth000 View PostI suppose I'll see how comparable the replicates are, and hopefully find justification for a more optimistic outlook.
Interesting situations arise when people expect you to fix experimental "problems" with informatics
Leave a comment:
-
-
I have already downloaded all the manuals I could find on edgeR, do not worry! I want my results to mean something at the end
It is amazing, in my opinion, that samples can be compared in these long sequencing experiments, since there are dozens of steps before the analysis stage. Every step is a source of technical error or random variability that may affect the total number of reads. Differences in sequencing quality can lead to bias, and even the efficiency of the PCR amplification and the ligation reactions could have an effect. Also, far far upstream of the analysis there was the sample harvesting... I am not even sure the same biomass was harvested from one sample to the next. Then there are the internal problems of competition (highly expressed transcripts drowning out all the others)... It seems practically inextricable to me.
I suppose I'll see how comparable the replicates are, and hopefully find justification for a more optimistic outlook.
Leave a comment:
-
-
Deseq2 also has good functions to normalize the count data, rlog transformation and varianceStabilizingTransformation. But be careful, they said in their tutorialOriginally posted by elizabeth000 View PostI want to compare counts between experimental conditions. My next task is to normalize the count data... I am going to use edgeR.
"The variance stabilizing and rlog transformations are provided for applications other than diferential testing, for example clustering of samples or other machine learning applications. For diferential testing we recommend the DESeq function applied to raw counts"
I am not sure but I think that in edgeR happens something similar.
whatever you choose, I recommend you to follow the manual reference.
Leave a comment:
-
-
Thank you for the link to the EMBL tutorial. Looks good! Yes, htseq-count is useful in my case. I want to compare counts between experimental conditions. My next task is to normalize the count data... I am going to use edgeR.
I just added attributes in the Transposable Elements GFF file - that was the original reason for this post - using a script written in Tcl. I thought I would post it just in case somebody out there might want to use the same approach. I'm a big fan of Tcl.
Code:#!/usr/bin/tclsh # This script adds arbitrary numbered attributes to the ninth column in a GFF file, so it will be compatible with HTSeq tools. # The user must input the filepath, number of header lines, and the attribute prefix, and use redirect to print the results to a file. proc Attr {path header prefix} { set inId [open $path r] set iline 0 while {[gets $inId fileLine] > 0} { incr iline # Skip header lines if they exist. if {$iline > $header} { set columnList [split $fileLine] set attr "$prefix$iline" set newFileLine "[lindex $columnList 0]\t[lindex $columnList 1]\t[lindex $columnList 2]\t[lindex $columnList 3]\t[lindex $columnList 4]\t[lindex $columnList 5]\t[lindex $columnList 6]\t[lindex $columnList 7]\t$attr" puts $newFileLine } } close $inId } ################################## MAIN ############################### if {$argc != 0} { set argList [split $argv] set path [lindex $argList 0] set header [lindex $argList 1] set prefix [lindex $argList 2] } # Extract the lines with interval length above threshold. Attr $path $header $prefix
Leave a comment:
-
-
This is a good tutorial to perform differential expression analysis
Here, they use htseq-count to count the number of reads per gene, and they give a lot of tips.
If you aren't doing something like that, please tell us what kind of analysis you need to perform (Chip-seq, RNA-seq, smallRNA-seq, etc), to point you to a more suitable tutorial
Leave a comment:
-
-
Yikes, it is a good thing you mentioned this, because I was about to make a terrible mistake. I have taken over a project left behind by a student, and I just discovered it was a much older version used for the mapping. Thank you for calling my attention to this!
I have not found any comprehensive tutorial to guide me in this new project, just documentation on the various tools available... do you know of a good tutorial that can help me avoid these stupid beginner's mistakes?
Leave a comment:
-
-
Ok, I didn't know, thanks for the informationOriginally posted by elizabeth000 View PostIf I correctly understand the instructions from Simon Anders, HTSeq.scripts.count and htseq-count commands call the same script:
"The file “htseq-count” has to be in the system’s search path. By default, Python places it in its script directory, which you have to add to your search path. A maybe easier alternative is to write python -m HTSeq.scripts.count instead of htseq-count, followed by the options and arguments, which will launch the htseq-count script as well."
Genomax is right, you have to be sure that your reference and your gtf have the same version. For example, is you map your reads to the genome assembly of mouse mm10 , the gene annotations have to be for mm10 too.
Leave a comment:
-
-
Since you had indicated that you are not a biologist in a different post, I thought I should point this out. If you are using a specific GFF file you will want to make sure that you are using the corresponding genome build/reference for alignments. If you used an incorrect combination you could be looking at subtle (or large errors) when you try to translate your alignments using the GTF file. The errors may not be easily apparent while parsing.Originally posted by elizabeth000 View PostIn the newer versions of the GFF files I am not encountering any lack of attributes so the problem is resolved.
Leave a comment:
-
-
If I correctly understand the instructions from Simon Anders, HTSeq.scripts.count and htseq-count commands call the same script:
"The file “htseq-count” has to be in the system’s search path. By default, Python places it in its script directory, which you have to add to your search path. A maybe easier alternative is to write python -m HTSeq.scripts.count instead of htseq-count, followed by the options and arguments, which will launch the htseq-count script as well."
Thank you very much for the detailed explanations, I understand the use of attributes much better now. In the newer versions of the GFF files I am not encountering any lack of attributes so the problem is resolved.
Leave a comment:
-
-
"In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?"Originally posted by elizabeth000 View PostHello Diego,
Thank you for your fast reply confirming my understanding about the attributes. Defining the attribute is only necessary when there are multiple ones to choose from.
In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?
I know how to add this with Tcl, but how can I know which transposable elements should be grouped under the same ID? Any clue as to why my gff file doesn't already have attributes defined, although all the other features do?
Sorry, but I'm still unsure of things.
You have only one attribute, tandem_repeat, in the third column. But anyway you have to tell htseq-count the name of your attribute. Ok it is silly if you don't have any other attr, but consider that sometimes it is difficult for the programmer to think in all the exceptions
According what I see in your file, the last gtf column is missed. This column contains the name, ID or some other feature identification. I think the problem is that htseq-count is counting the features (in your case TANDEM_REPEAT) but it doesn't know where to assign it.
You only have to add this missed column. Don't worry about grouping. You can create an ID for each tandem repeat, for instance, TR_1, TR_2 and so on.
C21242 TRF Tandem_Repeat 38 100 72 + . . ID=TR_1;Name=TR_1
C21306 TRF Tandem_Repeat 35 143 112 + . . ID=TR_2;Name=TR_2
C21306 TRF Tandem_Repeat 574 947 208 + . . ID=TR_3;Name=TR_3
Then in htseq-count
htseq-count --format=bam --stranded=no --type=Tandem_Repeat --i ID yourbam.bam yourgtf.gft
With this, htseq-count will consider only Tandem_Repeats entries from the gtf file and will report the number of reads in each one them using the ID field from the last column.
Other thing, your are using the "HTSeq.scripts.count" script. I think this is not the regular script to count the reads. Maybe there are some differencesLast edited by diego diaz; 04-17-2015, 10:43 AM.
Leave a comment:
-
-
Hi,
I work with TE annotations and I use a lot gff with attributes. For instance, using RepeatMarker it pulls out a gff file with "Target" as attribute (the sequence used as library input). Of course, everyone would have their own pipe; I like to edit the attributes so I have one "ID" with my specific transposon and genome location; another attribute with the transposon ("Target") and finally another attribute showing TE family where it belongs (let's say "Name"). Then, I can play with all these and HtSeq together. Just try to find your way on the analysis, with attributes there is a bigger potential.
Leave a comment:
-
-
Hello Diego,
Thank you for your fast reply confirming my understanding about the attributes. Defining the attribute is only necessary when there are multiple ones to choose from.
In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?
I know how to add this with Tcl, but how can I know which transposable elements should be grouped under the same ID? Any clue as to why my gff file doesn't already have attributes defined, although all the other features do?
Sorry, but I'm still unsure of things.
Leave a comment:
-
-
Hi Elizabeth
The attribute line is necessary because a GTF can have several attributes, for example, when you are counting the number of reads per gene, your GTF file will have gene entries, exon entires, intron entries,etc but you only want to count the reads inside the exons (if you want to measure gene expression) .
The solution is simple, you have to use awk or excel to add an "attribute" and "id" column (id column is necessary because a feature can have several attributes, for instance, a gene can have several exons, hence in the final file the counts have to be grouped by ID). If you don't how to do this, let me know and I will post the awk sentence. Now I have to go :/
I hope it helps !
Leave a comment:
-
-
Using HTSeq-count when no attributes are available?
I have been using htseq-count with great success, except for one particular type of genome feature, the transposable elements (te).
There is no attribute defined in the gff file, and this appears to always be the case for this type of genome feature.
I am not involved in upstream sequencing activities, and do not understand why attributes are not always available. I do not entirely understand why htseq-count needs to know the attribute of interest, unless it must choose between multiple attributes.
Here are the first few lines of my te.gff file:
Below, the command I hopelessly attempted to execute:Code:C21242 TRF Tandem_Repeat 38 100 72 + . . C21306 TRF Tandem_Repeat 35 143 112 + . . C21306 TRF Tandem_Repeat 574 947 208 + . .
/usr/bin/python -m HTSeq.scripts.count --format=bam --stranded=no --type=Tandem_Repeat --idattr=. --mode=union ./BAM/ovo1.bam ./features/te.gff > ovo1te.counts
Error occured when processing GFF file (line 1 of file ./features/te.gff):
Failure parsing GFF attribute line
[Exception type: ValueError, raised in __init__.py:164]
Can someone please explain why htseq-count needs to know the attribute? Is there a way to run it with no attribute defined?
Kind Regards, Elizabeth
Latest Articles
Collapse
-
by SEQadmin2
I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.
Here are nine questions we think about, in roughly the order they matter, before...-
Channel: Articles
06-18-2026, 07:11 AM -
-
by SEQadmin2
Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.
The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
...-
Channel: Articles
06-02-2026, 10:05 AM -
ad_right_rmr
Collapse
News
Collapse
| Topics | Statistics | Last Post | ||
|---|---|---|---|---|
|
Started by SEQadmin2, 06-26-2026, 11:10 AM
|
0 responses
15 views
0 reactions
|
Last Post
by SEQadmin2
06-26-2026, 11:10 AM
|
||
|
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population
by SEQadmin2
Started by SEQadmin2, 06-17-2026, 06:09 AM
|
0 responses
49 views
0 reactions
|
Last Post
by SEQadmin2
06-17-2026, 06:09 AM
|
||
|
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism
by SEQadmin2
Started by SEQadmin2, 06-09-2026, 11:58 AM
|
0 responses
107 views
0 reactions
|
Last Post
by SEQadmin2
06-09-2026, 11:58 AM
|
||
|
Started by SEQadmin2, 06-05-2026, 10:09 AM
|
0 responses
125 views
0 reactions
|
Last Post
by SEQadmin2
06-05-2026, 10:09 AM
|
Leave a comment: