**Wolfgang Huber** · 05-20-2013, 01:26 AM

Dear bob loblaw

thank you. The behaviour you report is not reasonable, and somewhere in your workflow or tool chain there must be a mistake. Can you report the sequence of steps (script) that you perform, to reproduce your observation?

Best wishes
Wolfgang

**bob-loblaw** · 05-20-2013, 06:32 AM

Hi Wolfgang,

I've located one problem in my workflow that resulted in only a fraction of my reads mapping. I'm hoping the behaviour I reported can be attributed to it. I'm currently re-processing my samples. Thanks though!

**colaneri** · 05-20-2013, 10:48 AM

Normalizing libraries sequenced at different depths

I have a question somewhat related to this discussion.
If I have libraries to compare that have been sequenced to different depths,
for example:
Lib1: 12,000,000 reads
Lib2: 17,000,000 reads
Lib3: 9,000,000 reads

Is it OK to randomly pick 9,000,000 reads from the raw data of Lib1 and Lib2?

**bob-loblaw** · 05-21-2013, 02:05 AM

Originally posted by colaneri

I have a question somewhat related to this discussion.
If I have libraries to compare that have been sequenced to different depths,
for example:
Lib1: 12,000,000 reads
Lib2: 17,000,000 reads
Lib3: 9,000,000 reads

Is it OK to randomly pick 9,000,000 reads from the raw data of Lib1 and Lib2?

I think so, assuming that, aside from the varying coverage, they were all prepared in exactly the same way.

**kmcarr** · 05-21-2013, 02:20 AM

Originally posted by colaneri

I have a question somewhat related to this discussion.
If I have libraries to compare that have been sequenced to different depths,
for example:
Lib1: 12,000,000 reads
Lib2: 17,000,000 reads
Lib3: 9,000,000 reads

Is it OK to randomly pick 9,000,000 reads from the raw data of Lib1 and Lib2?

But why do this? All well-designed software for differential expression analysis will account for varying library depth.
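To illustrate what "accounting for library depth" can look like, here is a minimal sketch of a median-of-ratios size factor calculation, the general scheme DESeq uses to put libraries of different depth on a common scale. It is a simplified illustration, not the package's own code; the function name `size_factors` and the toy input layout are assumptions made for this example.

```python
import math
from statistics import median

def size_factors(counts):
    """Estimate one size factor per library by the median-of-ratios idea.

    counts: dict mapping gene ID -> list of raw counts, one per library
            (hypothetical layout chosen for this sketch).
    """
    n_libs = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_libs)]
    for row in counts.values():
        # Genes with a zero count in any library have no finite log
        # geometric mean, so they are skipped.
        if any(c <= 0 for c in row):
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_libs
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # The median ratio per library is robust to a few strongly changed genes.
    return [math.exp(median(r)) for r in log_ratios]
```

Dividing each library's raw counts by its size factor gives counts on a comparable scale without discarding any reads, which is why subsampling to the smallest library is usually unnecessary.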

**bob-loblaw** · 05-21-2013, 02:34 AM

Originally posted by kmcarr

But why do this? All well-designed software for differential expression analysis will account for varying library depth.

That's a good point, although the poster didn't say how they were comparing the libraries.

**sdriscoll** · 05-21-2013, 01:23 PM

A good rule of thumb when randomly subsetting data is to do it more than once, so you can check that the results are stable and not jumping around because of the random sampling. So sure, go ahead and subset down to 9,000,000 reads, but run a few iterations of the entire pipeline. If the results are stable, you're good. If not, subsetting may not be appropriate, for whatever reason.
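A rough sketch of how the repeated subsetting could be done, assuming uncompressed four-line FASTQ input. The helper names `read_fastq` and `reservoir_sample`, the seeds, and the file name in the usage comment are all illustrative, not part of anyone's pipeline in this thread.

```python
import random

def read_fastq(lines):
    """Yield FASTQ records as 4-tuples from an iterable of lines.

    Assumes strict four-line records; a truncated trailing record is dropped.
    """
    it = iter(lines)
    for rec in zip(it, it, it, it):
        yield tuple(s.rstrip("\n") for s in rec)

def reservoir_sample(records, k, seed):
    """Draw k records uniformly at random in a single pass (reservoir sampling)."""
    rng = random.Random(seed)  # a fixed seed makes each subset reproducible
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample

# Hypothetical usage: three independent 9,000,000-read subsets of one library.
# for seed in (1, 2, 3):
#     with open("lib1.fastq") as fh:
#         subset = reservoir_sample(read_fastq(fh), 9_000_000, seed)
```

Each seed gives a different subset, so running the whole pipeline once per seed shows how much the number of differentially expressed genes moves with the sampling alone. Note that the sample is held in memory, which may be a concern for millions of reads. For paired-end data, using the same seed on both mate files should select the same record positions, provided the files list mates in the same order.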

**bob-loblaw** · 05-29-2013, 05:44 AM

Originally posted by Wolfgang Huber

Dear bob loblaw

thank you. The behaviour you report is not reasonable, and somewhere in your workflow or tool chain there must be a mistake. Can you report the sequence of steps (script) that you perform, to reproduce your observation?

Best wishes
Wolfgang

Well, my workflow is very simple. I'm using bowtie2 to map my reads to a reference database, and then the samtools idxstats command to create a count table. I then merge rows with duplicate IDs, load the table into R, and run DESeq on it.

I've also noticed that exactly the same proportion of reads maps in the random subset as in the whole file (e.g. if 67.4% maps for the whole file, 67.4% also maps in the subset). That at least suggests the difference in the number of differentially expressed genes between the whole file and the subset isn't down to some non-random effect in the subsetting process.

Any ideas?
Thanks

**sdriscoll** · 05-29-2013, 07:19 AM

So, to rephrase, you're mapping to a transcriptome reference and not a genome, correct? When you do this, it's very important that the rows you merge cover all features that share exonic sequence: all alternative isoforms of a gene, and even multi-copy genes that reside in separate loci. Otherwise you're going to run into some confusion for sure. Bowtie is neither designed for nor capable of making alignment decisions between features with shared sequence beyond purely random selection. It does make the same decision each time you run it, but that's by design: they use a random number seeding trick to ensure this happens. So even if you're using an Ensembl annotation for mouse or human, simply merging counts based on any of the provided IDs isn't enough to remove the ambiguity problem.
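As a sketch of the kind of merging described above, the counts of every feature in a shared-sequence group can be summed into one row. The function name `collapse_counts` and the `feature_to_group` mapping are hypothetical; building that mapping (which isoforms and paralogs belong together) is the hard part and has to come from your own annotation.

```python
from collections import defaultdict

def collapse_counts(feature_counts, feature_to_group):
    """Sum counts over features that share sequence.

    feature_counts:   dict mapping feature ID -> read count
    feature_to_group: dict mapping feature ID -> group ID (hypothetical,
                      e.g. isoform -> gene, or paralog -> gene family)
    Features missing from the mapping are kept as their own group.
    """
    merged = defaultdict(int)
    for feature, count in feature_counts.items():
        merged[feature_to_group.get(feature, feature)] += count
    return dict(merged)
```

Because the aligner assigns ambiguous reads at random among features with identical sequence, only the group total is meaningful; the split between members of a group is not.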

**bob-loblaw** · 05-29-2013, 07:22 AM

Originally posted by sdriscoll

So, to rephrase, you're mapping to a transcriptome reference and not a genome, correct? When you do this, it's very important that the rows you merge cover all features that share exonic sequence: all alternative isoforms of a gene, and even multi-copy genes that reside in separate loci. Otherwise you're going to run into some confusion for sure. Bowtie is neither designed for nor capable of making alignment decisions between features with shared sequence beyond purely random selection. It does make the same decision each time you run it, but that's by design: they use a random number seeding trick to ensure this happens. So even if you're using an Ensembl annotation for mouse or human, simply merging counts based on any of the provided IDs isn't enough to remove the ambiguity problem.

I'm mapping against the DNA sequences of predicted proteins from sequenced genomes. The only rows that I merge are the ones which have identical IDs. I hadn't even considered merging isoforms...

We originally did it to cut down on computational time more than anything (although it actually doesn't make that big of a difference in terms of the size of the count table).

**Simon Anders** · 05-29-2013, 08:30 AM

This all sounds as if some mistake happens during counting. Your approach of using samtools idxstats is rather unorthodox, and I wonder if it is correct. It might be much safer to use some well-tested tool to obtain a count table instead of some home-brewed solution.

**bob-loblaw** · 05-29-2013, 08:34 AM

Originally posted by Simon Anders

This all sounds as if some mistake happens during counting. Your approach of using samtools idxstats is rather unorthodox, and I wonder if it is correct. It might be much safer to use some well-tested tool to obtain a count table instead of some home-brewed solution.

From what I can see, idxstats works perfectly. Its output needs a bit of tweaking in R so that DESeq can read it, but nothing major.

I am open to suggestions, though. Can you give some examples of those well-tested tools? Thanks.

**Simon Anders** · 05-29-2013, 08:37 AM

Actually, I don't get it. How do you get per-gene counts with idxstats? I thought it only tells you the number of reads mapped to each reference sequence, i.e., to each chromosome. How do you get individual genes?

**bob-loblaw** · 05-29-2013, 08:45 AM

Originally posted by Simon Anders

Actually, I don't get it. How do you get per-gene counts with idxstats? I thought it only tells you the number of reads mapped to each reference sequence, i.e., to each chromosome. How do you get individual genes?

Maybe that depends on the reference that you've mapped to?

For me it produces a table with every gene ID from the reference I'm mapping to, along with its length and the number of reads mapped to it. For some entries it just gives me an organism and genome coordinates, but I can get more information about those from a GFF file of the annotations that I have.
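For reference, here is a minimal sketch of turning idxstats output into a count table outside of R. It assumes the standard tab-separated idxstats columns (reference name, sequence length, mapped reads, unmapped reads) and a reference in which each sequence is one gene, as described above. The function names and the sample layout are made up for this example.

```python
from collections import defaultdict

def parse_idxstats(lines):
    """Return {reference ID: mapped read count} from idxstats text lines.

    Rows with the same ID are summed, which mirrors the "merge rows with
    duplicate IDs" step. The final "*" row (unplaced reads) is skipped.
    """
    counts = defaultdict(int)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            continue
        counts[name] += int(mapped)
    return dict(counts)

def count_table(per_sample):
    """Combine per-sample counts into {gene ID: [count per sample]}.

    per_sample: dict mapping sample name -> dict from parse_idxstats.
    Genes absent from a sample get a count of 0; columns follow the
    insertion order of per_sample.
    """
    genes = sorted(set().union(*per_sample.values()))
    return {g: [c.get(g, 0) for c in per_sample.values()] for g in genes}
```

This only gives per-gene counts because each reference sequence here is a single gene. With a genome reference the same code would return per-chromosome counts, which is the concern raised above.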

RNA-Seq, lower coverage shows more differential expression
