Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DEXSEQ dexseq_count.py counts

    Hi,
    I am writing because am trying to figure out the concept behind the python script that counts reads per exon for DEXSEQ analysis.
    Seems like the count output does not explain what I see in IGV and so the statistics can be off in DEXSEQ and render my analysis pointless.

    This is what I have done:
    1. created flattened gff from ensembl gtf
    python ~/bin/DEXSeq/python_scripts/dexseq_prepare_annotation.py ensembl.63.genes.gtf ensembl.63.genes.gff

    2. counted reads in query sorted bam file:
    samtools view sorted.bam | python ~/bin/DEXSeq/python_scripts/dexseq_count.py --paired=yes -s no -a 10 ensembl.63.genes.gff - dexseq.counts.txt

    3. created samples object for DEXSEQ in R.
    I am using DEXSeq_1.4.0. in R Studio 0.97.248

    samples = data.frame(
    condition = factor(c("unaffected","affected")),
    replicate = factor(c(1:1,1:1)),
    row.names = factor(c("A","B")),
    stringsAsFactors = TRUE,
    check.names = FALSE
    )

    annotationfile = "ensembl.63.genes.gff"
    fullFilenames<- list.files("/Users/sszelinger/Documents/PROJECTS/DEXSEQ",full.names=TRUE,pattern="counts.txt")

    ecs<- read.HTSeqCounts(countfiles = fullFilenames,design = samples,flattenedfile = annotationfile)

    Looking up counts for gene of interest
    countTableForGene(ecs, "ENSG00000XXXXXX", normalized=FALSE, withDispersion=FALSE)

    Exon count for sample A
    E001 141
    E002 66
    E003 45
    E004 5
    E005 48
    E006 37
    E007 31
    E008 36
    E009 64
    E010 35

    Exon Count for sample B
    E001 187
    E002 95
    E003 66
    E004 2
    E005 38
    E006 29
    E007 7
    E008 32
    E009 55
    E010 27

    You can see in "A" for exon 4 "E004" there is count=5 and for "B" for exon 4 count = 2. A is unaffected and B is affected.
    When I pull up IGV, it seems like that "A" has a lot more reads than "B". A has an average coverage of 20 and B has about 3. In my view B is a great case for exon skipping between case and control, and the integrated DNA-RNA analysis suggest this but the counts do not pick it up. So I have to wonder what are the rules for counting the reads in dexseq? Is it based on HTSeq? If so is dexseq_count.py uses union, intersection_strict, or intersection_non_empty. From the IGV trace it looks like to me A should have at least 10 times the coverage as B at same position.

    Can someone explain this or give me hints.
    If the python script does not count split reads or those that do not map more than X bases within the exon than I understand, but should be documented with the python script so we can make more educated decisions.
    I feel that there a real problem here in those situations where we want to detect exon skipping and the low count numbers based on the python script might be misleading. Especially if the counts biased in both case and control subjects into a direction where no significant difference might be conculded between them.
    I appreciate any comment. Hope the IGV trace loads up fine..
    Thanks very much,
    Szabi
    Attached Files

  • #2
    Hi @sszelinger,

    Thanks for your interest in DEXSeq!

    It sounds strange. First of all, I would double check in the gtf that the E004 corresponds to the genomic region that you are observing in the browser. Note that the python script assigns new exon ID's to each unique non-overlapping exonic region based on all its annotated transcripts, so it might not correspond to the exon number 4 of the gene model from the browser.

    My second guess would be regarding the quality of the alignments, the python script to count reads discards those reads with quality below 10 (this includes multiple hits). Maybe some of the alignments seen in the browser were filtered by the script.

    Alejandro

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Recent Advances in Sequencing Analysis Tools
      by seqadmin


      The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
      05-06-2024, 07:48 AM
    • seqadmin
      Essential Discoveries and Tools in Epitranscriptomics
      by seqadmin




      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
      04-22-2024, 07:01 AM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 05-07-2024, 06:57 AM
    0 responses
    12 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 05-06-2024, 07:17 AM
    0 responses
    16 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 05-02-2024, 08:06 AM
    0 responses
    22 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-30-2024, 12:17 PM
    0 responses
    24 views
    0 likes
    Last Post seqadmin  
    Working...
    X