Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cufflinks transcripts.expr file

    Hello,

    I am looking for a way to derive the raw count of reads mapping to individual Cufflinks transcripts. My closest guess is that the "coverage" value (column 13) in the transcripts.expr file is what I need (the manual explains this as being an "Estimate for the absolute depth of read coverage across the whole transcript"). This number is non-integer for most transcripts, but I guess that may come from averaging across isoforms. Does anyone have an insight into this? Any advice would be much appreciated.

    Thanks,

    Shurjo

  • #2
    Multiplying the average depth of coverage by the transcript length will give you the *estimated* number of reads assigned to each transcript. Note that Cufflinks' statistical model probabilistically assigns reads to individual isoforms, because when isoforms share exons, reads from those exons can't be unambiguously assigned to one particular isoform.

    Comment


    • #3
      OK, cool, thanks.

      Comment


      • #4
        Watch out when considering these read densities if they come from overlapping transcripts though: although it is counter-intuitive, they cannot be directly compared to each other as if they could indicate a relative "expression" or "splicing" value.

        Indeed, since a read that overlaps a common exonic part of two splicing variants is used to compute both density values (although it comes from one of the variants only), it would be mathematically incorrect to assume that read densities can be compared between such overlapping splicing variants. If reads are assigned to each variant independently, cases can appear where a variant A that is present at a lower number of copy than a variant B is given a higher density.. so RPKM are only relevant across "genes" (non overlapping transcripts).

        Therefore, reads should actually be assigned across transcripts (considering the whole set of variants) in a discriminative way, and that is why proper transcriptome reconstruction requires linear programming algorithms in order to find the optimal distribution.

        Unless of course i am totally wrong

        Comment


        • #5
          Originally posted by Cole Trapnell View Post
          Multiplying the average depth of coverage by the transcript length will give you the *estimated* number of reads assigned to each transcript. Note that Cufflinks' statistical model probabilistically assigns reads to individual isoforms, because when isoforms share exons, reads from those exons can't be unambiguously assigned to one particular isoform.
          So, how do you determine the average depth. For example, consider
          exon1| exon2|exon3

          Lets say that Sample A exhibits
          exon1|exon3 only 20% of the times
          exon1|exon2|exon3 only 80% of the times

          while Sample B exhibits
          exon1|exon3 100% of the times

          how will you determine average depth of this gene in sampleA?
          I understand sampleB should be easier as you should see no reads spanning exon1| exon2 or exon2|exon3

          How does cufflinks handle or calculate relative abundance of isoforms within a sample?

          Comment


          • #6
            Originally posted by thinkRNA View Post
            So, how do you determine the average depth. For example, consider
            exon1| exon2|exon3

            Lets say that Sample A exhibits
            exon1|exon3 only 20% of the times
            exon1|exon2|exon3 only 80% of the times

            while Sample B exhibits
            exon1|exon3 100% of the times

            how will you determine average depth of this gene in sampleA?
            I understand sampleB should be easier as you should see no reads spanning exon1| exon2 or exon2|exon3

            How does cufflinks handle or calculate relative abundance of isoforms within a sample?
            In the upcoming Cufflinks paper, we have a large section in the supplement devoted to the math of how this is done. This isn't the best medium for explaining how the algorithm and the math work, so keep an eye out for the paper. We hope it will be out within a few weeks, and I'll post a message to our mailing list and link to the paper from the site when it appears.

            For now, let me roughly describe how it works to give you some intuition. Suppose you have two isoforms A and B of a gene (we'll assume they are the same length for now). We want to calculate the abundances of A and B by counting up how many fragments came from A and how many came from B, and then dividing each count by the total number of fragments. This gives us the relative abundances for A and B. Now consider a fragment that could have come from either of two isoforms, A or B. We can't just assign this fragment to A or B - we need to "probabilistically assign" to both of them. The probability it came from A in this example is just equal to the relative abundance of A (and the same goes for B). Now we're in a bind, because to get the abundances we need the counts, and to get the (probabilistic) counts, we need the abundances! There's an additional complication that happens because of course transcripts have differing lengths.

            Cufflinks calculates the abundances of transcripts by assuming that the fragments in an experiment were generated according to a basic model of how RNA-Seq works. With these assumptions one can write down a function that takes an assignment of relative abundances to each transcript as its argument. This function outputs a single number - the likelihood of observing the data in the experiment given the abundances supplied as the argument. The function is built by writing down the probability of observing each fragment in the actual data set, given some assignment of abundances to the transcript. It turns out that with this model, there's a single assignment of abundances to transcripts that *best* explains the data that came off of the sequencer. Cufflinks is able to find this value by systematically exploring the space of possible abundances for each transcript.

            There are a lot of details in the paper, but this is the core idea. Once we have the abundances, we can use them to estimate how many fragments came from each transcript, and thus we can estimate the average depth of sequencing coverage for each one.

            Comment


            • #7
              Originally posted by Cole Trapnell View Post

              This function outputs a single number - the likelihood of observing the data in the experiment given the abundances supplied as the argument. The function is built by writing down the probability of observing each fragment in the actual data set, given some assignment of abundances to the transcript. It turns out that with this model, there's a single assignment of abundances to transcripts that *best* explains the data that came off of the sequencer. Cufflinks is able to find this value by systematically exploring the space of possible abundances for each transcript.
              .
              Great explanation and definitely helps intuitively understand the process! Since numbers/examples help a lot, if we continue with your example of two isoforms A and B of a gene. Lets say that reads spanning isoform A = 100 reads and reads spanning isoform B=10 reads. So, cufflinks explores all possible *isoforms* using these numbers and figures out that probabilistically isoform A is more prevalent than isoform B.
              Did I get this right?

              I think, it takes into account what are all the possible isoform for this gene by taking in the gene annotation that we provide (is this where the GTF file comes in as input). It then taken the counts of reads and determines which combination of exons (or which isoform) is most probable.

              Comment


              • #8
                Originally posted by Cole Trapnell View Post
                Multiplying the average depth of coverage by the transcript length will give you the *estimated* number of reads assigned to each transcript.
                Please correct me if I am wrong (which most definitively I am) in interpreting how to calculate the number of mapped reads per transcript. I am total noob at this RNAseq stuff

                if a transcript looks like this in transcripts.gtf:

                Code:
                chr1	Cufflinks	transcript	3204563	3661579	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";
                chr1	Cufflinks	exon	3204563	3207049	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; exon_number "1"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";
                chr1	Cufflinks	exon	3411783	3411982	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; exon_number "2"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";
                chr1	Cufflinks	exon	3660633	3661579	1000	-	.	gene_id "Xkr4"; transcript_id "Xkr4"; exon_number "3"; FPKM "0.0459178326"; frac "1.000000"; conf_lo "0.000000"; conf_hi "0.110855"; cov "0.020363";
                then

                transcript length = sum of length of exons = 3,631
                cov = 0.020363

                number of mapped reads/transcript = 3,631 * 0.020363 = 74

                Is doesn't seem to be correct!

                It would be great if cufflinks could report not only normalized FPKM values but also the raw number of mapped reads per transcript and exon (to use in software like DESeq, which I am using since I couldn't figure out how to use cuffcompare/cuffdiff with biological replicates for each conditition )

                Thanx in advance for your help and consideration.
                Last edited by marcora; 05-18-2010, 03:59 AM.

                Comment


                • #9
                  marcora,

                  it's been a while since you posted this, but it appears to me that 74 is the number or *bases* that mapped to that transcript, as cufflinks coverage is i think on a base-level. so to get the number of reads you would have to divide this number by your read length, i.e, by 50 or 75, or the length of your reads.

                  Does this make sense?

                  Carmen

                  Comment


                  • #10
                    Originally posted by carmeyeii View Post
                    marcora,

                    it's been a while since you posted this, but it appears to me that 74 is the number or *bases* that mapped to that transcript, as cufflinks coverage is i think on a base-level. so to get the number of reads you would have to divide this number by your read length, i.e, by 50 or 75, or the length of your reads.

                    Does this make sense?

                    Carmen
                    yeah,it makes sense!

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    8 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    8 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    49 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    66 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X