Unconfigured Ad

bswhite · 01-28-2014, 09:03 AM

Originally posted by ThePresident View Post

Hi whiteman, I'm still trying to figure this out.

Even when I try to extract only uniquely mapped reads from BAM files with samtools view -q 255 option, I still see a 2x number of reads for one of my samples. Crazy... I must have confused files or something since this makes no sense whatsoever. Let me figure this out

TP

Did you mean samtools view -F 256? i.e., exclude any read that
has the 'not primary alignment' flag set?

http://picard.sourceforge.net/explain-flags.html

I'm also trying to simplify by excluding both mates if either is multi mapped
(i.e., has flag 256 set or has NH field set with something other than NH:i:1).
This should be extremely conservative in giving uniquely mapped reads.

I plan to run these through cufflinks, htseq-count, and featureCounts
to compare results.

ThePresident · 01-27-2014, 07:24 AM

Hi whiteman, I'm still trying to figure this out.

Even when I try to extract only uniquely mapped reads from BAM files with samtools view -q 255 option, I still see a 2x number of reads for one of my samples. Crazy... I must have confused files or something since this makes no sense whatsoever. Let me figure this out

TP

bswhite · 01-24-2014, 01:38 PM

TP,

I'm also trying to calculate RPKM manually (though for paired end
reads) and to compare against cufflinks FPKM.

Are you seeing a bimodal distribution in your cufflinks RPKMs? Are
you recapitulating that manually? Was this dependent on your only using
uniquely mapped reads?

I can't reproduce the bimodal distribution I see with cufflinks. In particular,
the mode I see at higher cufflinks FPKMs is not present with my manual
htseq-count-based approach. I understand there are plenty of caveats (e.g.,
with respect to FPKMs vs RPKMs, transcripts vs gene, etc).

I'm also using htseq-count. As the length of the gene, I'm taking the number
of exonic bases in any transcript of associated with a gene. These lengths
are universally not longer than those cufflinks is reporting in the locus field.

I have tried passing --no-length-correction to cufflinks(2.0.2) as suggested by
sdriscoll here:

cufflinks reports extremely high FPKMs for short transcripts - SEQanswers

http://seqanswers.com/forums/showthread.php?t=14386&highlight=cufflinks

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

thinking the second mode might be an artifact of cufflinks. But I retain it.

Any ideas or particulars I'm missing? It sounds as if you are much closer
than I am.

Hope this newbie isn't hijacking your thread ...

dpryan · 01-24-2014, 12:25 PM

Originally posted by AdrianP View Post

You mentioned taking into account partial reads, which seems okay. But you also mentioned differences in gene length, so I am wondering why would cufflinks assume a different gene length than the actual gene length given.

From the manual calculations, it seems that one number is always above the other. Is the number that is higher done by cufflinks? In that case, it should be no surprise why it is higher, due to partial read calculations. Except, sometimes it's much higher... like the case where it is double, which is scary.

Part of the point of cufflinks is to reconstruct the transcripts that the organism is actually expressing, rather than just directly using those in an annotation file. If the observed transcript size after assembly is different from the annotation file then the denominators will differ as well. This is more frequently the case in eukaryotes, where there are multiple transcripts and cufflinks will calculate the gene's length as their weighted average (or something like that, I'd have to reread the paper to refresh my memory). Perhaps cufflinks is observing a shorter or longer transcript in these cases, in which case the RPKMs could differ by 2x.

ThePresident · 01-24-2014, 12:21 PM

Originally posted by AdrianP View Post

Is the number that is higher done by cufflinks? In that case, it should be no surprise why it is higher, due to partial read calculations. Except, sometimes it's much higher... like the case where it is double, which is scary.

Yup, it's cufflinks RPKM value that is always higher. But I can't guarantee it would've been for all the values.

dpryan · 01-24-2014, 12:16 PM

I vaguely remember bowtie1 using a MAPQ of 255 for unique mappers (similar to older versions of tophat, since the equivalent is now 50 there). If you're using bowtie2, then just use -q 2 (0 and 1 are for multimappers in bowtie2). If you did in fact use bowtie1, then what you wrote should work. BTW, it's faster to just:

Code:

samtools view -q 255 input.bam | htseq-count ...

Just use a dash ("-") as the file name. If what you wrote above is giving you the apparent 2x increase in read counts then something went seriously wrong at some point

AdrianP · 01-24-2014, 12:14 PM

Originally posted by dpryan View Post

As I mentioned above, you shouldn't expect to get the same numbers as cufflinks (while FPKM==RPKM for SE reads by definition, you and cufflinks will likely have vastly different conceptions of how many reads there are and how long each gene is).

You mentioned taking into account partial reads, which seems okay. But you also mentioned differences in gene length, so I am wondering why would cufflinks assume a different gene length than the actual gene length given.

From the manual calculations, it seems that one number is always above the other. Is the number that is higher done by cufflinks? In that case, it should be no surprise why it is higher, due to partial read calculations. Except, sometimes it's much higher... like the case where it is double, which is scary.

ThePresident · 01-24-2014, 12:03 PM

Originally posted by dpryan View Post

As I mentioned above, you shouldn't expect to get the same numbers as cufflinks (while FPKM==RPKM for SE reads by definition, you and cufflinks will likely have vastly different conceptions of how many reads there are and how long each gene is).

BTW, the ability of htseq-count to discriminate between unique and multimappers is aligner dependent. You mention using bowtie (I'll take that to mean tophat, unless your organism doesn't do splicing), so it should then have the NH auxiliary tag properly set. When in doubt, use the "-o" option of htseq-count to follow how reads are being treated. Then you can determine if there's anything really amiss.

I'm dealing with bacterial transcriptome, so no splicing here. That's why I'm simply using bowtie instead of tophat. And, somebody told me that bowtie does not include the NH auxiliary tag (which is used to discriminate for multimappers / unique reads). What I have been told is that by default, bowtie should assign a mapping quality of 255 to reads that map once. So I could extract them with samtools:

Code:

samtools view -hb -q 255 input.bam > output.bam

That way, I have BAM files with uniquely-mapped reads. Then perform HTSeq-count and manual calculation of RPKM. What's your opinion on that?

dpryan · 01-24-2014, 11:48 AM

Originally posted by ThePresident View Post

It's somewhat near but not exactly the same thing. I did it with just a couple of values but still: 8vs9 , 146vs300, 13vs18.

As I mentioned above, you shouldn't expect to get the same numbers as cufflinks (while FPKM==RPKM for SE reads by definition, you and cufflinks will likely have vastly different conceptions of how many reads there are and how long each gene is).

BTW, the ability of htseq-count to discriminate between unique and multimappers is aligner dependent. You mention using bowtie (I'll take that to mean tophat, unless your organism doesn't do splicing), so it should then have the NH auxiliary tag properly set. When in doubt, use the "-o" option of htseq-count to follow how reads are being treated. Then you can determine if there's anything really amiss.

ThePresident · 01-24-2014, 11:35 AM

Originally posted by AdrianP View Post

If you use 61,402,323 as total reads, what value of RPKM do you get for any given gene compared to FPKM by cufflinks?

It's somewhat near but not exactly the same thing. I did it with just a couple of values but still: 8vs9 , 146vs300, 13vs18.

AdrianP · 01-24-2014, 11:21 AM

Originally posted by ThePresident View Post

It might seems that I'm flooding this thread, but I just realized that I already had the same problem. I think that the right way of dealing with all this is to extract only uniquely-mapped reads from bowtie-generated BAM files and use them in HTSeq-count, DESeq and RPKM calculations.

My only concern is to lose too much reads du to the multimapping...

If you use 61,402,323 as total reads, what value of RPKM do you get for any given gene compared to FPKM by cufflinks?

ThePresident · 01-24-2014, 10:45 AM

It might seems that I'm flooding this thread, but I just realized that I already had the same problem. I think that the right way of dealing with all this is to extract only uniquely-mapped reads from bowtie-generated BAM files and use them in HTSeq-count, DESeq and RPKM calculations.

My only concern is to lose too much reads du to the multimapping...

Simon Anders · 01-24-2014, 10:31 AM

[deleted] .

ThePresident · 01-24-2014, 08:40 AM

But on the other side, the total number of mapped reads in a the RPKM formula is quite arbitrary. It will be the same for every gene inside one single library, so as long as keep that in mind, the formula still makes sense.

On the other side, I wonder if I could compare RPKM values calculated by this manner in between libraries containing replicates? Ex. If I have three libraries of the same condition (biological replicates) but with different sequencing depth, could I calculate RPKM as explained above and then just average them...?

ThePresident · 01-24-2014, 08:23 AM

@AdrianP : I'm at work now and I don't have access to my linux station now. But, from bowtie alignment I know how much reads I have in my libraries (fastq). For example, in one of my libraries I have 61,402,323 reads, and when counted with HTSeq-count (sum of all reads across all genes) I get 116,898,233 which is almost double.

@dpryan: My understanding of HTSeq-count is that it does not count multimappers. I used intersection-nonempty mode. Here is the final output:

no_feature 9034352
ambiguous 958299
too_low_aQual 0
not_aligned 2097930
alignment_not_unique 0

Fastq file contained 61,402,323 reads and 96.58% of that number have been aligned with bowtie. So, the SAM file has to contain around 59,302,363. So, if there are multimapped reads in mu HTSeq-count, how to get rid of them?

Topics	Statistics	Last Post
UC San Diego Bioengineers Map Gene Function in Human Stem Cells by SEQadmin2 Started by SEQadmin2, 07-13-2026, 10:26 AM	0 responses 24 views 0 reactions	Last Post by SEQadmin2 07-13-2026, 10:26 AM
New Analysis Splits Leukemia Into 16 Epigenomic Subgroups by SEQadmin2 Started by SEQadmin2, 07-09-2026, 10:04 AM	0 responses 33 views 0 reactions	Last Post by SEQadmin2 07-09-2026, 10:04 AM
Genome-Wide CRISPR Screen Uncovers Unlikely Psoriasis Target by SEQadmin2 Started by SEQadmin2, 07-08-2026, 10:08 AM	0 responses 21 views 0 reactions	Last Post by SEQadmin2 07-08-2026, 10:08 AM
Engineered Protein Motor Takes Its First Steps Along DNA Track by SEQadmin2 Started by SEQadmin2, 07-07-2026, 11:05 AM	0 responses 34 views 0 reactions	Last Post by SEQadmin2 07-07-2026, 11:05 AM

Unconfigured Ad

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News