**areyes** · 03-27-2012, 12:05 AM

Hi senpeng!

1.

What exact command lines are you using for the DEXSeq scripts? You have a lot of reads, and it sounds strange that so many of them are counted as empty. How many reads do you have in your initial fastq files?

2.

DEXSeq can do this! Have a look at section 4, "Additional technical or experimental variables".

Let me know if you have additional questions,

Alejandro

**Simon Anders** · 03-27-2012, 01:17 AM

Are you sure the coordinates in your GTF file are based on the same genome build as the Fasta files you have aligned your reads against?
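A quick first check along these lines is to compare the sequence names in the GTF with those in the SAM header. The sketch below is only an illustration: the helper names `gtf_seqnames` and `sam_seqnames` and the toy input lines are made up for this example. Matching names cannot confirm that the coordinates come from the same build, but a naming mismatch (e.g. "1" vs "chr1") would at least show up immediately.

```python
def gtf_seqnames(gtf_lines):
    """Collect sequence names from column 1 of GTF lines."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        names.add(line.split("\t", 1)[0])
    return names


def sam_seqnames(sam_lines):
    """Collect reference names from the SN: tags of @SQ header lines."""
    names = set()
    for line in sam_lines:
        if not line.startswith("@"):
            break  # the header ends at the first alignment line
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


# Toy input (hypothetical): the GTF uses "1", the SAM header uses "chr1".
gtf = ['1\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n']
sam = ["@HD\tVN:1.0\n", "@SQ\tSN:chr1\tLN:1000\n", "read1\t0\tchr1\t120\n"]

shared = gtf_seqnames(gtf) & sam_seqnames(sam)
print(shared or "no shared sequence names")
```

With real files you would pass open file handles instead of the toy lists.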

**senpeng** · 03-27-2012, 01:38 PM

Dear Alejandro,
Thanks for your reply.

1. We have around 220 million reads in the fastq files (paired end, so 110 million per end).

I just used python dexseq_prepare_annotation.py ensembl ensembl.63.genes.gtf <output.gtf>
to generate the flattened_gtf_file.

Then python dexseq_count.py <output.gtf> <sam_file> <output_file> for the DEXSeq counts.

**senpeng** · 03-27-2012, 01:57 PM

Dear Simon,

Thanks so much for your reply.

For the alignment we used
TopHat VN:1.3.2 -r 40 -p 8 -G annotation/ensembl.63.genes.gtf /refgenome/GRCh37/Homo_sapiens.GRCh37.62

And I used exactly the same Ensembl GTF for dexseq_prepare_annotation.py and dexseq_count.py.

I checked our alignment files in IGV too. Some reads align correctly within exons, but a lot of aligned reads also fall into introns. Could this imperfect alignment be the cause of the _empty counts?

Also, two more questions:

1. We already have around 150-200 million reads, but it seems that we still cannot get good counts for isoform-level exons (gene-level counts seem OK). How many reads do you recommend for studying isoform-level changes?

2. We also aligned the data using the NCBI GTF (Homo_sapiens.NCBI36.54.noMT.gtf). Could we use the NCBI GTF in DEXSeq? I ask because you recommend Ensembl in your manual, and the file sizes differ (Ensembl ~450 MB vs NCBI ~120 MB).

Thanks again for your help and look forward to your reply.

**Simon Anders** · 03-27-2012, 10:11 PM

Originally posted by senpeng:

I checked our alignment files in IGV too. Some reads align correctly within exons, but a lot of aligned reads also fall into introns. Could this imperfect alignment be the cause of the _empty counts?

Yes. A read that wholly falls into an intron, without even touching the exon, will be counted as "empty".
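To make the rule concrete, here is a minimal sketch of the idea, not the actual logic of dexseq_count.py: a read is treated as "empty" when its interval overlaps none of the annotated exons. The function names and coordinates are made up for illustration, and the sketch ignores spliced reads, strand, and paired ends.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if closed intervals [a_start, a_end] and [b_start, b_end] share a base."""
    return a_start <= b_end and b_start <= a_end


def is_empty(read, exons):
    """True if the read (start, end) touches none of the exons."""
    return not any(overlaps(read[0], read[1], s, e) for s, e in exons)


# Hypothetical gene with two exons and an intron spanning 201-499.
exons = [(100, 200), (500, 600)]

intronic_read = (250, 300)   # lies wholly inside the intron
boundary_read = (190, 240)   # touches the first exon

print(is_empty(intronic_read, exons))
print(is_empty(boundary_read, exons))
```

Only the first read would be counted as empty here; the second still touches an exon.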

You may want to figure out why you have so many reads aligning to introns. Are the introns evenly filled with reads, or are there just small islands of reads within the introns? The former would suggest intron retention, the latter the presence of exons missing from your annotation.
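One rough way to tell the two cases apart, besides looking in IGV, is to ask what fraction of an intron's bases is covered by at least one read. The sketch below is an assumption-laden illustration: `covered_fraction` and the coordinates are hypothetical, and where to draw the line between "evenly filled" and "islands" is a judgment call.

```python
def covered_fraction(intron, reads):
    """Fraction of intron positions covered by at least one read.

    intron and reads are closed (start, end) intervals.
    """
    start, end = intron
    covered = set()
    for r_start, r_end in reads:
        lo, hi = max(start, r_start), min(end, r_end)
        covered.update(range(lo, hi + 1))  # empty range when there is no overlap
    return len(covered) / (end - start + 1)


intron = (1, 100)
island_reads = [(40, 49), (45, 54)]             # one small cluster
even_reads = [(1, 30), (25, 60), (55, 100)]     # reads tile the whole intron

print(covered_fraction(intron, island_reads))
print(covered_fraction(intron, even_reads))
```

A low fraction with clustered reads points towards islands (possibly unannotated exons), a fraction near 1 towards an evenly filled intron.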

If these are biological signals of interest, you should use a tool like Cufflinks or RSEM to find boundaries for these additional features and let DEXSeq test for them, too. We have not yet tried such a toolchain, though.

1. We already have around 150-200 million reads, but it seems that we still cannot get good counts for isoform-level exons (gene-level counts seem OK). How many reads do you recommend for studying isoform-level changes?

As long as most of them align to exons, you should get quite far with 150M reads. Look at the MA plot to see whether you are Poisson-limited.

We also aligned the data using the NCBI GTF (Homo_sapiens.NCBI36.54.noMT.gtf). Could we use the NCBI GTF in DEXSeq? I ask because you recommend Ensembl in your manual, and the file sizes differ (Ensembl ~450 MB vs NCBI ~120 MB).

The Python script needs to know for each exon to which gene it belongs. The UCSC files do not contain this information; their "gene_id" field is just a copy of the "transcript_id" field. You need to manually rectify this.
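As a sketch of what "rectifying" could look like: if you have a transcript-to-gene table from some other source, you can rewrite the gene_id attribute line by line before running dexseq_prepare_annotation.py. The function `fix_gene_id`, the mapping, and the identifiers below are hypothetical; where the mapping comes from is up to you.

```python
import re


def fix_gene_id(gtf_line, tx_to_gene):
    """Replace gene_id with the gene that the line's transcript_id maps to.

    Lines whose transcript is not in tx_to_gene are returned unchanged.
    """
    m = re.search(r'transcript_id "([^"]+)"', gtf_line)
    if m is None or m.group(1) not in tx_to_gene:
        return gtf_line
    gene = tx_to_gene[m.group(1)]
    # A function replacement avoids backslash-escape surprises in the gene name.
    return re.sub(r'gene_id "[^"]*"',
                  lambda _m: 'gene_id "%s"' % gene,
                  gtf_line, count=1)


# Hypothetical mapping and GTF line in which gene_id just copies transcript_id.
tx_to_gene = {"NM_0001": "GENE_A"}
line = 'chr1\tucsc\texon\t100\t200\t.\t+\t.\tgene_id "NM_0001"; transcript_id "NM_0001";\n'

fixed = fix_gene_id(line, tx_to_gene)
print(fixed, end="")
```

The transcript_id attribute is left untouched, so transcripts of the same gene end up sharing one gene_id.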

Simon

**wfan** · 03-26-2013, 03:32 PM

Indeed. The core facility staff who ran TopHat for us used a different version of the GTF. In my case there are ZERO counts: every row of the output file is zero. So ...

I will re-run everything and will update when I know more.
Thanks a lot
Wenhong

**h_manoj** · 06-25-2013, 01:37 PM

Hello,
I am trying to run DEXSeq on a set of several samples. As the first step, I generated the annotation file in the required format using "dexseq_prepare_annotation.py". Then I used the "dexseq_count.py" to generate counts.

I am worried since the "empty" cases are too many:
_ambiguous 0
_empty 66686221
_lowaqual 0
_notaligned 0

Is this expected?

The total number of mapped reads in this dataset is 351,853,340. My annotation is based on GENCODE v14 in GTF format (I also tried using GFF3).

python dexseq_prepare_annotation.py gencv14_levels_12.gtf htseqfrmtdgencv14L12
python dexseq_count.py -p yes -s yes htseqfrmtdgencv14L12 SrtdBAM_FT_1.sam dexcount_FT1_Strndd

**areyes** · 06-25-2013, 11:15 PM

I guess your protocol has a step enriching for polyadenylated RNAs?

You have roughly 20% of reads mapping outside of exons (these could be unspliced introns, unannotated transcripts, lincRNAs, etc.). I would not think this is particularly bad.
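For reference, the figure follows from the numbers posted above: 66,686,221 / 351,853,340 is about 19%. A small sketch of that arithmetic, assuming both numbers count the same unit (reads or pairs); `parse_special_counters` is a made-up helper, not part of DEXSeq.

```python
def parse_special_counters(lines):
    """Pick out the special '_name count' lines from count output."""
    counters = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("_"):
            counters[parts[0]] = int(parts[1])
    return counters


# Counter lines and total as posted in this thread.
lines = ["_ambiguous 0", "_empty 66686221", "_lowaqual 0", "_notaligned 0"]
total_mapped = 351853340

counters = parse_special_counters(lines)
pct_empty = 100.0 * counters["_empty"] / total_mapped
print("%.2f%% of mapped reads counted as _empty" % pct_empty)
```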

**h_manoj** · 06-26-2013, 03:03 PM

Thanks for your insight, areyes. Yes, the protocol does enrich for polyA RNAs.

DEXSeq (1. too many empty 2. paired sample comparison?)
