Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • RNA-seq: calculating per-gene coverage

    Hello,

    I know that this topic has been massively covered in this forum, but I wasn't able to find the right answer for my questions.

    Basically, I'm dealing with some bacterial RNA-seq data. I have two biological replicates and two conditions (so four libraries). I used bowtie for alignment of my reads and subsequently DESeq to calculate means/fold change/stats in order to compare my two conditions.

    Now, I would like to calculate gene coverage per 1kb (relative) in order to compare absolute expression of every gene. I was thinking that if I take number of reads and I divide it by the gene length and then multiply by 1000 I will get some kind of "gene expression ratio per 1kb" which I could use to compare for all my genes. So, I could say that gene "A" is ten times more expressed then gene "B" just by comparing those ratios. Is this the right way of doing it, or perhaps I'm missing something?

    Also, I sought to use BEDTools for this, but I don't know if I could insert multiple files (biological replicates) for my analysis. Anybody know how?

    Finaly, anybody know how to extract gene length from a GFF or even GenBank file? BEDTools does it but I have duplicated values (for gene/CDS/exons).

    Thanks in advance! I don't know what I would do without this forum...

    TP

  • #2
    After re-reading a couple of papers having for subject RNA-seq analysis, I'm a little bit confused. Many papers use Gene Expression Index (GEI) and it is calculated as mean coverage depth of the gene normalized by the total number of reads mapped to the non-rRNA regions (Wang 2011, BMC Genomics)

    Well, in my case, I've performed HTSeq to get reads/gene count and then DESeq to compare those data for significant genes. And, normally, DESeq already takes into account library depth and perform correction, right? So, I can't do it again?

    I might be missing something here... any help would be greatly appreciated!

    TP

    Comment


    • #3
      Is there a reason you don't want to use RPKM (reads per kilobase per million reads), such as you'd get out of a program like Cufflinks?

      RPKM values are nice, since they correct for different samples arbitrarily being sequenced to different depths (and unless you magically loaded precisely the same amount on the sequencer, and the poisson noise worked exactly to zero, you did sequence them to slightly different depths). On the other hand, the statistics for properly interpreting differential expression from RPKM alone are unconvincing to me, at best, so you'd want to use something like DESeq anyways.

      EDIT: another nice advantage is that people are used to seeing RPKM values, and have a sense of how to reason about them, and where they work/don't work, so reviewers will more readily accept them than something you cobbled together.
      Last edited by rflrob; 10-17-2012, 04:44 PM.

      Comment


      • #4
        Thanks for your answer.

        Well, that's exactly why I went for DESeq: it works with raw data (number of reads that mapped to a specific gene). Since I have only two biological replicates, I was not confident to compare with RPKM.

        On the other side, if I just want to get the general idea of gene expression in my samples, I cannot compare raw counts for two genes that have different length and coming from different sequencing lanes (sequencing depth being different). I don't know if I'm being clear but, for example, let's say that a gene A and B have both 100 reads aligned but gene A is 200pb long and gene B 500pb, their absolute expression is not the same even though they have the same coverage (reads/gene).

        Next, since data obtained from DESeq (coverage per gene, i.e. number of aligned reads per gene) is already normalized for sequencing depth, I could just take those data and divide it by genes lenght, so I would take into account seq-depth and gene length. Sound logic to me... and almost like RPKM.

        Finally, I do agree that calculating RPKM is convenient because everybody is familiar with that and won't stick on it.

        What do you think?

        Comment


        • #5
          The raw data (counts, i.e. DESeq, edgeR, etc) is the way to go for differential expression.

          FPKM is the way to go for relative expression within a condition.

          You just need to understand the limitations and deficiencies.

          For example, map-ability can drastically effect your FPKM value. As can other biases.
          --------------
          Ethan

          Comment


          • #6
            Okay! So, I have to use cufflinks to get RPKM? (I'm dealing with single-end data, so it's RPKM in my case). I'll try that...

            And, while we are on that subject, can you (or someone) give me some limitations and deficiencies on RPKM? What you mean by map-ability? The "ability" of reads to map on a reference sequence? Sorry if it sounds basic to you guys, but I just want to be sure I understand all this

            TP

            Comment


            • #7
              I can't say for certain where I got this impression, but I thought that relative expression in RNA-seq was close to un-knowable. (I know there have been papers criticizing RPKM lately*)

              There are just too many other factors in play (most notably differential PCR amplification of A vs B, although others exist) to be able to confidently say that 10X as many counts reflects a true difference in relative gene expression.

              Even spiked-in controls with known 'truth' values for their relative gene expression have some difficulty comparing A to B**

              *Bias detection and correction in RNA-Sequencing data.
              *Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples
              **Synthetic spike-in standards for RNA-seq experiments.

              Comment


              • #8
                Originally posted by jparsons View Post

                There are just too many other factors in play (most notably differential PCR amplification of A vs B, although others exist) to be able to confidently say that 10X as many counts reflects a true difference in relative gene expression.
                I do agree with this however I think it can give an overall idea... I'm not sure how much can we actually be confident in interpretation but it can give an idea especially if the difference is huge.

                Comment


                • #9
                  Hi, does anybody know how to get the result of raw number of mapped reads for each gene?I don't want the FPKM or RPKM value.

                  Comment


                  • #10
                    HTSeq-count does the job. I have some scripts that will take the grunt work out of it, but it takes a little time to set up and there's a reasonable chance it won't work on your platform.
                    I updated a bunch of my scripts and they are working on Linux and Mac now.  Nothing innovative as I am not a statistician or informatician, but there are a couple nice easy to use scripts that link…
                    --------------
                    Ethan

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Best Practices for Single-Cell Sequencing Analysis
                      by seqadmin



                      While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
                      06-06-2024, 07:15 AM
                    • seqadmin
                      Latest Developments in Precision Medicine
                      by seqadmin



                      Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

                      Somatic Genomics
                      “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
                      05-24-2024, 01:16 PM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 06-14-2024, 07:24 AM
                    0 responses
                    12 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 06-13-2024, 08:58 AM
                    0 responses
                    14 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 06-12-2024, 02:20 PM
                    0 responses
                    17 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 06-07-2024, 06:58 AM
                    0 responses
                    186 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X