Unconfigured Ad

**areyes** · 06-25-2013, 11:06 PM

Hi @alittleboy,

It should not take to long to sort them, however it will be even faster if you use samtools or http://picard.sourceforge.net/command-line-oIt should not take to long to sort them, however it will be even faster if you use samtools or picard tools. You could remove the initial sam files, sort them by read name and then you could sort them back with respect with position in the genome and you should get back your initial file.

Alejandro

**krespim** · 06-25-2013, 11:17 PM

Hi alittleboy,

may I ask why you are sorting the SAM files that way? What I normally do is to keep everything in BAM file - as much as possible. So for DEXSeq read count I would index the BAM files with the command you have in (1) and then in (2) pass it directly to deseq_count.py with:

samtools view sample.bam | python dexseq_count.py <options> geneModel - sample.output.dexseq_count.txt

You will notice the "-" after geneModel which indicates that the argument (the reads file) is passed to dexseq_count.py as a stdin (standard input). This bypasses the creation of a intermediate huge SAM file.

Regarding the sorting of the reads, if I am not mistaken, tophat produces already sorted bam files. Just in case, you can always sort your BAM before indexing it:
samtools sort sample.bam sample.sorted # creates a sorted file
samtools index sample.sorted.bam # index sorted bam
rm sample.bam # removes the old bam

You will end up with a sorted and indexed bam that can be used for pretty much any downstream analysis.

I hope it helps.

**areyes** · 06-25-2013, 11:32 PM

Hi @krespin,

may I ask why you are sorting the SAM files that way? What I normally do is to keep everything in BAM file - as much as possible. So for DEXSeq read count I would index the BAM files with the command you have in (1) and then in (2) pass it directly to deseq_count.py with:

I am afraid its my fold in this case, the pasilla vignette indicates that you could do this in this way and has not been updated. But you are certainly right, its better to do it as you are indicating!

Alejandro

**krespim** · 06-25-2013, 11:55 PM

Originally posted by areyes View Post

Hi @krespin,

I am afraid its my fold in this case, the pasilla vignette indicates that you could do this in this way and has not been updated. But you are certainly right, its better to do it as you are indicating!

Alejandro

Still a damn good bioconductor package

To be fair, I also had some difficulty in the beginning because my annotations/mappings were done with UCSC gene symbols and had to circumvent some issues. Otherwise all good - I am doing some DEXseq analysis as we speak.

**alittleboy** · 06-26-2013, 04:49 AM

Hi @krespim and @areyes:

Thank you so much for your inputs and suggestions! Yes, as you see, I strictly follow the instructions of the pasilla vignette, and I think we can avoid generating the huge SAM files during the steps. To @areyes, would you please, if possible, send me an updated vignette on the most efficient way of generating the exon count files in command line? Currently I have TopHat accepted_hits.bam files at hand, and I use:

samtools view C26.bam | sort -k1,1 -k2,2n | python dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - C26.txt

Is this correct? Is there a dash between the gene model and the output .TXT file?

BTW, the first C26.bam file is obtained from running:

samtools index C26.bam (<-- this C26.bam is the "accepted_hits.bam" from TopHat).

Thank you so much!!

**dpryan** · 06-26-2013, 06:48 AM

The dash is correct (you just use it in place of the file name). You don't need to index the BAM file, you can skip that.

FYI, I think you can avoid the initial sorting step by giving tophat the "--no-sort-bam" flag. You'll have to "samtools sort" it later if you want to look at it in IGV (or whatever), but overall this might be a bit faster since you can just script the whole process.

**alittleboy** · 06-26-2013, 11:04 AM

Originally posted by dpryan View Post

The dash is correct (you just use it in place of the file name). You don't need to index the BAM file, you can skip that.

FYI, I think you can avoid the initial sorting step by giving tophat the "--no-sort-bam" flag. You'll have to "samtools sort" it later if you want to look at it in IGV (or whatever), but overall this might be a bit faster since you can just script the whole process.

Hi @dpryan, @areyes and @krespim:

Thanks a lot for all your suggestions! I think I can get the exon count files by "piping" in Linux commands ;-)

Some question about the .TXT output file: I got the following (first gene with 15 exons) output:

ENSG00000000003:001 10
ENSG00000000003:002 0
...
ENSG00000000003:014 0
ENSG00000000003:015 0

Q1: I notice that I have items such as shown below: there are two Ensembl gene names, having all exon counts equal 0 (is that weird?). Can I know what does this mean?

ENSG00000206495+ENSG00000242726:001 0
ENSG00000206495+ENSG00000242726:002 0
...
ENSG00000206495+ENSG00000242726:057 0
ENSG00000206495+ENSG00000242726:058 0

There are also some items with more than 2 genes (non-zero exon counts):

...
ENSG00000197013+ENSG00000269237+ENSG00000196268+ENSG00000268658:064 6

Q2: At the end of the file, I notice some summaries:

_ambiguous 0
_empty 24914546
_lowaqual 0
_notaligned 0

What are each of them represent? I can guess its meaning, but just wanna confirm... especially the "_empty" item... As each of the .TXT files are to be passed to the read.HTSeqCounts() function, will these extra information automatically filtered out?

Thank you so much!!

**dpryan** · 06-26-2013, 12:24 PM

Originally posted by alittleboy View Post

Q1: I notice that I have items such as shown below: there are two Ensembl gene names, having all exon counts equal 0 (is that weird?). Can I know what does this mean?

ENSG00000206495+ENSG00000242726:001 0
ENSG00000206495+ENSG00000242726:002 0
...
ENSG00000206495+ENSG00000242726:057 0
ENSG00000206495+ENSG00000242726:058 0

Have a look at those genes in a genome browser. You'll notice that they overlap in at least one transcript.

Originally posted by alittleboy View Post

Q2: At the end of the file, I notice some summaries:

_ambiguous 0
_empty 24914546
_lowaqual 0
_notaligned 0

The whole python program is only ~180 lines, so it's easy to go through and just see how those numbers are generated.

"_notaligned" is self explanatory (you're using tophat, so those will be in a different file). "_lowaqual" will be incremented when the mapping quality of a read is below the value set by -a/--minaqual. "_ambiguous" will be set when a read can map to more than one "gene". I would assume that this might occur in unstranded libraries with reads aligning to genes that partly overlap on opposite strands, but perhaps areyes could weigh in here. "_empty" is when a read aligns outside of a gene/exon.

**krespim** · 06-28-2013, 10:57 AM

Originally posted by dpryan View Post

"_empty" is when a read aligns outside of a gene/exon.

Not having any read mapped suggests, to me, some issues with the annotation and/or with the parameters set for DEXSeq. I had that problem before when I was mapping using either the wrong annotation or the wrong library type.

**alittleboy** · 06-28-2013, 11:04 AM

Originally posted by krespim View Post

Not having any read mapped suggests, to me, some issues with the annotation and/or with the parameters set for DEXSeq. I had that problem before when I was mapping using either the wrong annotation or the wrong library type.

Hi @krespim:

Thanks for pointing this out!! Could you explain in some more details about "wrong annotation" and "wrong library type"? I think I used the Homo_sapiens.GRCh37.72_norm.gtf file from Ensembl for the annotation, and library type is fr-unstranded for Standard Illumina (paired-end?). Is there anything wrong with this set-up?

BTW, how can I tell if the _empty entry of 24914546 is unreasonably large or not?

**dpryan** · 06-28-2013, 11:08 AM

Originally posted by krespim View Post

Not having any read mapped suggests, to me, some issues with the annotation and/or with the parameters set for DEXSeq. I had that problem before when I was mapping using either the wrong annotation or the wrong library type.

One expects a certain amount of that. Certainly, if you get 80% of your mapped reads in the empty column then you probably did something wrong (or something very interesting is going on!).

**alittleboy** · 06-28-2013, 11:40 AM

Originally posted by dpryan View Post

One expects a certain amount of that. Certainly, if you get 80% of your mapped reads in the empty column then you probably did something wrong (or something very interesting is going on!).

Hi @dpryan and @areyes

Oops... I guess the number of empty I got (24914546) might be too large. Is there a way to check which step went wrong? Here I detailed the steps I used:

(1) I got BAM files for each of the 2 lanes of the samples, and I used samtools to merge the BAM files into a single one:

samtools merge combined.bam accepted_hits_from_L1.bam accepted_hits_from_L2.bam

(2) I used the Python script to generate GFF file:

python dexseq_prepare_annotation.py Homo_sapiens.GRCh37.72_norm.gtf HS.GRCh37.72_norm.DEXSeq.gff

(3) I counted the reads of exons using another Python script to generate TXT files:

samtools view combined.bam | sort -k1,1 -k2,2n | dexseq_count.py -p yes HS.GRCh37.72_norm.DEXSeq.gff - combined.txt

(4) I switched to R to use the read.HTSeqCounts() function to include those TXT files, and then do the analyses.

Interestingly, I did get results for testing differential exon usage (~400 genes are involved by FDR<0.1), and the results "seem" to be reasonable... Just wanna confirm if this is not "garbage in, garbage out"...

Thank you so much!!

**dpryan** · 06-28-2013, 11:50 AM

How many reads mapped uniquely? You can also open your accepted_hits.bam (or whatever it's called) file in IGV or another browser and just spot-check a bit of the data to see if it matches up. It's always good to visually look at things to see if they make sense.

**krespim** · 06-28-2013, 12:13 PM

Originally posted by dpryan View Post

One expects a certain amount of that. Certainly, if you get 80% of your mapped reads in the empty column then you probably did something wrong (or something very interesting is going on!).

I misread the outpout and thought that none of this reads were mapped - my bad. Nevertheless, 24M empty seems quite a bit to me. That said I also follow routinely your suggestion of looking at IGV shots to check if everything is Kosher. Another thing I do as a sanity check is to grep a few genes that I know are expressed in my samples just to see if they have a reasonable number of counts*.

@ alittleboy: I really don't what is going wrong there, your parameters seem correct. By wrong library I mean using "stranded" instead of default for instance

*funnily enough this happened the other day. I ran dexseq count on a number of files and my goto confirmation genes had the right counts in some files but not in others. No pattern. To cut a long story short, it turns out that during the transfer of the bam files from the cluster were the mapping was done, some files got corrupted (the file size remained unchanged). And yes, it was with scp.

Topics	Statistics	Last Post
Long-Read RNA Sequencing Uncovers a Hidden Layer of Immune Cell Regulation by SEQadmin2 Started by SEQadmin2, 06-02-2026, 12:03 PM	0 responses 21 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 12:03 PM
DNA Methylation Study Reveals How Epigenetic Changes Pass Between Generations by SEQadmin2 Started by SEQadmin2, 06-02-2026, 11:40 AM	0 responses 14 views 0 reactions	Last Post by SEQadmin2 06-02-2026, 11:40 AM
MetaBeeAI Helps Scientists Process Research Literature Faster by SEQadmin2 Started by SEQadmin2, 05-28-2026, 11:40 AM	0 responses 29 views 0 reactions	Last Post by SEQadmin2 05-28-2026, 11:40 AM
Scientists Solve a 25-Year Mystery in RNA Interference by SEQadmin2 Started by SEQadmin2, 05-26-2026, 10:12 AM	0 responses 31 views 0 reactions	Last Post by SEQadmin2 05-26-2026, 10:12 AM

Unconfigured Ad

a general question on DEXSeq data preparations

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News