Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • diego diaz
    replied
    Yes, one has to be very careful to analyze this sort of experiments. As you said, from the extraction until the differential expression analysis, there are a lot of sources of bias, PCR amplification, contamination, lane specific error. human manipulation, read mapping, etc. Most softwares take into account the huge amount of variation and use complex statistics models. In the case of edgeR or deseq2, both use a negative binomial distribution, a statistical distribution suitable to model over-dispersed count data. This programs test a null hypothesis, gene aren't not differentially expressed, and assign p-values, then this p-values are adjusted for multiple testing, false discovery rate. Both support complex analysis, like GLM.

    Leave a comment:


  • GenoMax
    replied
    Originally posted by elizabeth000 View Post
    I suppose I'll see how comparable the replicates are, and hopefully find justification for a more optimistic outlook.
    On well designed (and executed) experiments replicates should be fairly close (within limits of biological variability).

    Interesting situations arise when people expect you to fix experimental "problems" with informatics

    Leave a comment:


  • elizabeth000
    replied
    I have already downloaded all the manuals I could find on edgeR, do not worry! I want my results to mean something at the end

    It is amazing, in my opinion, that samples can be compared in these long sequencing experiments, since there are dozens of steps before the analysis stage. Every step is a source of technical error or random variability that may affect the total number of reads. Differences in sequencing quality can lead to bias, and even the efficiency of the PCR amplification and the ligation reactions could have an effect. Also, far far upstream of the analysis there was the sample harvesting... I am not even sure the same biomass was harvested from one sample to the next. Then there are the internal problems of competition (highly expressed transcripts drowning out all the others)... It seems practically inextricable to me.
    I suppose I'll see how comparable the replicates are, and hopefully find justification for a more optimistic outlook.

    Leave a comment:


  • diego diaz
    replied
    Originally posted by elizabeth000 View Post
    I want to compare counts between experimental conditions. My next task is to normalize the count data... I am going to use edgeR.
    Deseq2 also has good functions to normalize the count data, rlog transformation and varianceStabilizingTransformation. But be careful, they said in their tutorial

    "The variance stabilizing and rlog transformations are provided for applications other than diferential testing, for example clustering of samples or other machine learning applications. For diferential testing we recommend the DESeq function applied to raw counts"

    I am not sure but I think that in edgeR happens something similar.

    whatever you choose, I recommend you to follow the manual reference.



    The Bioconductor project aims to develop and share open source software for precise and repeatable analysis of biological data. We foster an inclusive and collaborative community of developers and data scientists.

    Leave a comment:


  • elizabeth000
    replied
    Thank you for the link to the EMBL tutorial. Looks good! Yes, htseq-count is useful in my case. I want to compare counts between experimental conditions. My next task is to normalize the count data... I am going to use edgeR.

    I just added attributes in the Transposable Elements GFF file - that was the original reason for this post - using a script written in Tcl. I thought I would post it just in case somebody out there might want to use the same approach. I'm a big fan of Tcl.

    Code:
    #!/usr/bin/tclsh
    
    # This script adds arbitrary numbered attributes to the ninth column in a GFF file, so it will be compatible with HTSeq tools.
    # The user must input the filepath, number of header lines, and the attribute prefix, and use redirect to print the results to a file.
    
    proc Attr {path header prefix} {
        
       set inId [open $path r]
       set iline 0
       
       while {[gets $inId fileLine] > 0} {
          incr iline
    # Skip header lines if they exist. 
          if {$iline > $header} {
             set columnList [split $fileLine]
             set attr "$prefix$iline"
             set newFileLine "[lindex $columnList 0]\t[lindex $columnList 1]\t[lindex $columnList 2]\t[lindex $columnList 3]\t[lindex $columnList 4]\t[lindex $columnList 5]\t[lindex $columnList 6]\t[lindex $columnList 7]\t$attr"
             puts $newFileLine
          }
       }
       close $inId
    }
    
    ################################## MAIN ###############################
    
    if {$argc != 0} {
       set argList [split $argv]
       set path [lindex $argList 0]
       set header [lindex $argList 1]
       set prefix [lindex $argList 2]
    }
    
    # Extract the lines with interval length above threshold.
    Attr $path $header $prefix

    Leave a comment:


  • diego diaz
    replied
    This is a good tutorial to perform differential expression analysis



    Here, they use htseq-count to count the number of reads per gene, and they give a lot of tips.

    If you aren't doing something like that, please tell us what kind of analysis you need to perform (Chip-seq, RNA-seq, smallRNA-seq, etc), to point you to a more suitable tutorial

    Leave a comment:


  • elizabeth000
    replied
    Yikes, it is a good thing you mentioned this, because I was about to make a terrible mistake. I have taken over a project left behind by a student, and I just discovered it was a much older version used for the mapping. Thank you for calling my attention to this!

    I have not found any comprehensive tutorial to guide me in this new project, just documentation on the various tools available... do you know of a good tutorial that can help me avoid these stupid beginner's mistakes?

    Leave a comment:


  • diego diaz
    replied
    Originally posted by elizabeth000 View Post
    If I correctly understand the instructions from Simon Anders, HTSeq.scripts.count and htseq-count commands call the same script:

    "The file “htseq-count” has to be in the system’s search path. By default, Python places it in its script directory, which you have to add to your search path. A maybe easier alternative is to write python -m HTSeq.scripts.count instead of htseq-count, followed by the options and arguments, which will launch the htseq-count script as well."
    Ok, I didn't know, thanks for the information

    Genomax is right, you have to be sure that your reference and your gtf have the same version. For example, is you map your reads to the genome assembly of mouse mm10 , the gene annotations have to be for mm10 too.

    Leave a comment:


  • GenoMax
    replied
    Originally posted by elizabeth000 View Post
    In the newer versions of the GFF files I am not encountering any lack of attributes so the problem is resolved.
    Since you had indicated that you are not a biologist in a different post, I thought I should point this out. If you are using a specific GFF file you will want to make sure that you are using the corresponding genome build/reference for alignments. If you used an incorrect combination you could be looking at subtle (or large errors) when you try to translate your alignments using the GTF file. The errors may not be easily apparent while parsing.

    Leave a comment:


  • elizabeth000
    replied
    If I correctly understand the instructions from Simon Anders, HTSeq.scripts.count and htseq-count commands call the same script:

    "The file “htseq-count” has to be in the system’s search path. By default, Python places it in its script directory, which you have to add to your search path. A maybe easier alternative is to write python -m HTSeq.scripts.count instead of htseq-count, followed by the options and arguments, which will launch the htseq-count script as well."

    Thank you very much for the detailed explanations, I understand the use of attributes much better now. In the newer versions of the GFF files I am not encountering any lack of attributes so the problem is resolved.

    Leave a comment:


  • diego diaz
    replied
    Originally posted by elizabeth000 View Post
    Hello Diego,

    Thank you for your fast reply confirming my understanding about the attributes. Defining the attribute is only necessary when there are multiple ones to choose from.

    In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?

    I know how to add this with Tcl, but how can I know which transposable elements should be grouped under the same ID? Any clue as to why my gff file doesn't already have attributes defined, although all the other features do?

    Sorry, but I'm still unsure of things.
    "In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?"

    You have only one attribute, tandem_repeat, in the third column. But anyway you have to tell htseq-count the name of your attribute. Ok it is silly if you don't have any other attr, but consider that sometimes it is difficult for the programmer to think in all the exceptions

    According what I see in your file, the last gtf column is missed. This column contains the name, ID or some other feature identification. I think the problem is that htseq-count is counting the features (in your case TANDEM_REPEAT) but it doesn't know where to assign it.

    You only have to add this missed column. Don't worry about grouping. You can create an ID for each tandem repeat, for instance, TR_1, TR_2 and so on.

    C21242 TRF Tandem_Repeat 38 100 72 + . . ID=TR_1;Name=TR_1
    C21306 TRF Tandem_Repeat 35 143 112 + . . ID=TR_2;Name=TR_2
    C21306 TRF Tandem_Repeat 574 947 208 + . . ID=TR_3;Name=TR_3

    Then in htseq-count

    htseq-count --format=bam --stranded=no --type=Tandem_Repeat --i ID yourbam.bam yourgtf.gft

    With this, htseq-count will consider only Tandem_Repeats entries from the gtf file and will report the number of reads in each one them using the ID field from the last column.


    Other thing, your are using the "HTSeq.scripts.count" script. I think this is not the regular script to count the reads. Maybe there are some differences
    Last edited by diego diaz; 04-17-2015, 10:43 AM.

    Leave a comment:


  • cascoamarillo
    replied
    Hi,
    I work with TE annotations and I use a lot gff with attributes. For instance, using RepeatMarker it pulls out a gff file with "Target" as attribute (the sequence used as library input). Of course, everyone would have their own pipe; I like to edit the attributes so I have one "ID" with my specific transposon and genome location; another attribute with the transposon ("Target") and finally another attribute showing TE family where it belongs (let's say "Name"). Then, I can play with all these and HtSeq together. Just try to find your way on the analysis, with attributes there is a bigger potential.

    Leave a comment:


  • elizabeth000
    replied
    Hello Diego,

    Thank you for your fast reply confirming my understanding about the attributes. Defining the attribute is only necessary when there are multiple ones to choose from.

    In my case, there are no attributes to choose from at all. So doesn't it seem kind of silly to add one?

    I know how to add this with Tcl, but how can I know which transposable elements should be grouped under the same ID? Any clue as to why my gff file doesn't already have attributes defined, although all the other features do?

    Sorry, but I'm still unsure of things.

    Leave a comment:


  • diego diaz
    replied
    Hi Elizabeth

    The attribute line is necessary because a GTF can have several attributes, for example, when you are counting the number of reads per gene, your GTF file will have gene entries, exon entires, intron entries,etc but you only want to count the reads inside the exons (if you want to measure gene expression) .

    The solution is simple, you have to use awk or excel to add an "attribute" and "id" column (id column is necessary because a feature can have several attributes, for instance, a gene can have several exons, hence in the final file the counts have to be grouped by ID). If you don't how to do this, let me know and I will post the awk sentence. Now I have to go :/

    I hope it helps !

    Leave a comment:


  • Using HTSeq-count when no attributes are available?

    I have been using htseq-count with great success, except for one particular type of genome feature, the transposable elements (te).
    There is no attribute defined in the gff file, and this appears to always be the case for this type of genome feature.

    I am not involved in upstream sequencing activities, and do not understand why attributes are not always available. I do not entirely understand why htseq-count needs to know the attribute of interest, unless it must choose between multiple attributes.

    Here are the first few lines of my te.gff file:

    Code:
    C21242	TRF	Tandem_Repeat	38	100	72	+	.	.
    C21306	TRF	Tandem_Repeat	35	143	112	+	.	.
    C21306	TRF	Tandem_Repeat	574	947	208	+	.	.
    Below, the command I hopelessly attempted to execute:

    /usr/bin/python -m HTSeq.scripts.count --format=bam --stranded=no --type=Tandem_Repeat --idattr=. --mode=union ./BAM/ovo1.bam ./features/te.gff > ovo1te.counts
    Error occured when processing GFF file (line 1 of file ./features/te.gff):
    Failure parsing GFF attribute line
    [Exception type: ValueError, raised in __init__.py:164]

    Can someone please explain why htseq-count needs to know the attribute? Is there a way to run it with no attribute defined?

    Kind Regards, Elizabeth

Latest Articles

Collapse

  • seqadmin
    Latest Developments in Precision Medicine
    by seqadmin



    Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

    Somatic Genomics
    “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
    05-24-2024, 01:16 PM
  • seqadmin
    Recent Advances in Sequencing Analysis Tools
    by seqadmin


    The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
    05-06-2024, 07:48 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 05-24-2024, 07:15 AM
0 responses
16 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-23-2024, 10:28 AM
0 responses
18 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-23-2024, 07:35 AM
0 responses
22 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-22-2024, 02:06 PM
0 responses
11 views
0 likes
Last Post seqadmin  
Working...
X