Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • bswhite
    replied
    Originally posted by ThePresident View Post
    Hi whiteman, I'm still trying to figure this out.

    Even when I try to extract only uniquely mapped reads from BAM files with samtools view -q 255 option, I still see a 2x number of reads for one of my samples. Crazy... I must have confused files or something since this makes no sense whatsoever. Let me figure this out

    TP
    Did you mean samtools view -F 256? i.e., exclude any read that
    has the 'not primary alignment' flag set?



    I'm also trying to simplify by excluding both mates if either is multi mapped
    (i.e., has flag 256 set or has NH field set with something other than NH:i:1).
    This should be extremely conservative in giving uniquely mapped reads.

    I plan to run these through cufflinks, htseq-count, and featureCounts
    to compare results.

    Leave a comment:


  • ThePresident
    replied
    Hi whiteman, I'm still trying to figure this out.

    Even when I try to extract only uniquely mapped reads from BAM files with samtools view -q 255 option, I still see a 2x number of reads for one of my samples. Crazy... I must have confused files or something since this makes no sense whatsoever. Let me figure this out

    TP

    Leave a comment:


  • bswhite
    replied
    TP,

    I'm also trying to calculate RPKM manually (though for paired end
    reads) and to compare against cufflinks FPKM.

    Are you seeing a bimodal distribution in your cufflinks RPKMs? Are
    you recapitulating that manually? Was this dependent on your only using
    uniquely mapped reads?

    I can't reproduce the bimodal distribution I see with cufflinks. In particular,
    the mode I see at higher cufflinks FPKMs is not present with my manual
    htseq-count-based approach. I understand there are plenty of caveats (e.g.,
    with respect to FPKMs vs RPKMs, transcripts vs gene, etc).

    I'm also using htseq-count. As the length of the gene, I'm taking the number
    of exonic bases in any transcript of associated with a gene. These lengths
    are universally not longer than those cufflinks is reporting in the locus field.

    I have tried passing --no-length-correction to cufflinks(2.0.2) as suggested by
    sdriscoll here:

    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


    thinking the second mode might be an artifact of cufflinks. But I retain it.

    Any ideas or particulars I'm missing? It sounds as if you are much closer
    than I am.

    Hope this newbie isn't hijacking your thread ...

    Leave a comment:


  • dpryan
    replied
    Originally posted by AdrianP View Post
    You mentioned taking into account partial reads, which seems okay. But you also mentioned differences in gene length, so I am wondering why would cufflinks assume a different gene length than the actual gene length given.

    From the manual calculations, it seems that one number is always above the other. Is the number that is higher done by cufflinks? In that case, it should be no surprise why it is higher, due to partial read calculations. Except, sometimes it's much higher... like the case where it is double, which is scary.
    Part of the point of cufflinks is to reconstruct the transcripts that the organism is actually expressing, rather than just directly using those in an annotation file. If the observed transcript size after assembly is different from the annotation file then the denominators will differ as well. This is more frequently the case in eukaryotes, where there are multiple transcripts and cufflinks will calculate the gene's length as their weighted average (or something like that, I'd have to reread the paper to refresh my memory). Perhaps cufflinks is observing a shorter or longer transcript in these cases, in which case the RPKMs could differ by 2x.

    Leave a comment:


  • ThePresident
    replied
    Originally posted by AdrianP View Post
    Is the number that is higher done by cufflinks? In that case, it should be no surprise why it is higher, due to partial read calculations. Except, sometimes it's much higher... like the case where it is double, which is scary.
    Yup, it's cufflinks RPKM value that is always higher. But I can't guarantee it would've been for all the values.

    Leave a comment:


  • dpryan
    replied
    I vaguely remember bowtie1 using a MAPQ of 255 for unique mappers (similar to older versions of tophat, since the equivalent is now 50 there). If you're using bowtie2, then just use -q 2 (0 and 1 are for multimappers in bowtie2). If you did in fact use bowtie1, then what you wrote should work. BTW, it's faster to just:

    Code:
    samtools view -q 255 input.bam | htseq-count ...
    Just use a dash ("-") as the file name. If what you wrote above is giving you the apparent 2x increase in read counts then something went seriously wrong at some point

    Leave a comment:


  • AdrianP
    replied
    Originally posted by dpryan View Post
    As I mentioned above, you shouldn't expect to get the same numbers as cufflinks (while FPKM==RPKM for SE reads by definition, you and cufflinks will likely have vastly different conceptions of how many reads there are and how long each gene is).
    You mentioned taking into account partial reads, which seems okay. But you also mentioned differences in gene length, so I am wondering why would cufflinks assume a different gene length than the actual gene length given.

    From the manual calculations, it seems that one number is always above the other. Is the number that is higher done by cufflinks? In that case, it should be no surprise why it is higher, due to partial read calculations. Except, sometimes it's much higher... like the case where it is double, which is scary.

    Leave a comment:


  • ThePresident
    replied
    Originally posted by dpryan View Post
    As I mentioned above, you shouldn't expect to get the same numbers as cufflinks (while FPKM==RPKM for SE reads by definition, you and cufflinks will likely have vastly different conceptions of how many reads there are and how long each gene is).

    BTW, the ability of htseq-count to discriminate between unique and multimappers is aligner dependent. You mention using bowtie (I'll take that to mean tophat, unless your organism doesn't do splicing), so it should then have the NH auxiliary tag properly set. When in doubt, use the "-o" option of htseq-count to follow how reads are being treated. Then you can determine if there's anything really amiss.
    I'm dealing with bacterial transcriptome, so no splicing here. That's why I'm simply using bowtie instead of tophat. And, somebody told me that bowtie does not include the NH auxiliary tag (which is used to discriminate for multimappers / unique reads). What I have been told is that by default, bowtie should assign a mapping quality of 255 to reads that map once. So I could extract them with samtools:

    Code:
    samtools view -hb -q 255 input.bam > output.bam
    That way, I have BAM files with uniquely-mapped reads. Then perform HTSeq-count and manual calculation of RPKM. What's your opinion on that?

    Leave a comment:


  • dpryan
    replied
    Originally posted by ThePresident View Post
    It's somewhat near but not exactly the same thing. I did it with just a couple of values but still: 8vs9 , 146vs300, 13vs18.
    As I mentioned above, you shouldn't expect to get the same numbers as cufflinks (while FPKM==RPKM for SE reads by definition, you and cufflinks will likely have vastly different conceptions of how many reads there are and how long each gene is).

    BTW, the ability of htseq-count to discriminate between unique and multimappers is aligner dependent. You mention using bowtie (I'll take that to mean tophat, unless your organism doesn't do splicing), so it should then have the NH auxiliary tag properly set. When in doubt, use the "-o" option of htseq-count to follow how reads are being treated. Then you can determine if there's anything really amiss.

    Leave a comment:


  • ThePresident
    replied
    Originally posted by AdrianP View Post
    If you use 61,402,323 as total reads, what value of RPKM do you get for any given gene compared to FPKM by cufflinks?
    It's somewhat near but not exactly the same thing. I did it with just a couple of values but still: 8vs9 , 146vs300, 13vs18.

    Leave a comment:


  • AdrianP
    replied
    Originally posted by ThePresident View Post
    It might seems that I'm flooding this thread, but I just realized that I already had the same problem. I think that the right way of dealing with all this is to extract only uniquely-mapped reads from bowtie-generated BAM files and use them in HTSeq-count, DESeq and RPKM calculations.

    My only concern is to lose too much reads du to the multimapping...
    If you use 61,402,323 as total reads, what value of RPKM do you get for any given gene compared to FPKM by cufflinks?

    Leave a comment:


  • ThePresident
    replied
    It might seems that I'm flooding this thread, but I just realized that I already had the same problem. I think that the right way of dealing with all this is to extract only uniquely-mapped reads from bowtie-generated BAM files and use them in HTSeq-count, DESeq and RPKM calculations.

    My only concern is to lose too much reads du to the multimapping...

    Leave a comment:


  • Simon Anders
    replied
    [deleted] .
    Last edited by Simon Anders; 01-24-2014, 10:36 AM. Reason: post was meant to appear in another thread, but I confused my browser tabs

    Leave a comment:


  • ThePresident
    replied
    But on the other side, the total number of mapped reads in a the RPKM formula is quite arbitrary. It will be the same for every gene inside one single library, so as long as keep that in mind, the formula still makes sense.

    On the other side, I wonder if I could compare RPKM values calculated by this manner in between libraries containing replicates? Ex. If I have three libraries of the same condition (biological replicates) but with different sequencing depth, could I calculate RPKM as explained above and then just average them...?

    Leave a comment:


  • ThePresident
    replied
    @AdrianP : I'm at work now and I don't have access to my linux station now. But, from bowtie alignment I know how much reads I have in my libraries (fastq). For example, in one of my libraries I have 61,402,323 reads, and when counted with HTSeq-count (sum of all reads across all genes) I get 116,898,233 which is almost double.

    @dpryan: My understanding of HTSeq-count is that it does not count multimappers. I used intersection-nonempty mode. Here is the final output:

    no_feature 9034352
    ambiguous 958299
    too_low_aQual 0
    not_aligned 2097930
    alignment_not_unique 0

    Fastq file contained 61,402,323 reads and 96.58% of that number have been aligned with bowtie. So, the SAM file has to contain around 59,302,363. So, if there are multimapped reads in mu HTSeq-count, how to get rid of them?

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM
  • seqadmin
    Techniques and Challenges in Conservation Genomics
    by seqadmin



    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

    Avian Conservation
    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
    03-08-2024, 10:41 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 03-27-2024, 06:37 PM
0 responses
12 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-27-2024, 06:07 PM
0 responses
11 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-22-2024, 10:03 AM
0 responses
53 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-21-2024, 07:32 AM
0 responses
69 views
0 likes
Last Post seqadmin  
Working...
X