  • cuffcompare class codes

    I ran cuffcompare with the -r option and got class codes which are not written in the manual.
    cut -f2 cuffcompare.tracking | sort | uniq -c
    3852	=
    171	-
    11919	.
    159666	c
    18619	e
    2	i
    11263	j
    12492	o
    53	p
    I don't know what the "-" and the "." classes stand for. On the other hand I don't have the "u" class.


  • #2
    Yes, I am afraid that in the last versions of cuffcompare the 'u' class code is no longer used in the tracking file as it's basically replaced by '-' and '.'
    • '-' should appear only when there are no reference transcripts to be associated with the transcript(s) on that line (so it's really the 'u' case)
    • '.' is used when the relationship to the reference transcript is not the same (not consistent) among the multiple cufflinks transcripts on that line.
    Though it seems to me that sometimes '.' is also used as a last resort when no other code has been assigned (i.e. just like '-'). Sorry for the inconsistency, I'll have this fixed in a later version.

    For now I think it's safe to assume that if there is no reference transcript on that line and the code is '-' or '.', the actual class code is in fact 'u', for all intents and purposes. But if there is a reference and the code is '.', the transcripts on that line have different relationships to the reference transcript; one can check the .tmap output file to see exactly how each of the transcripts on that tracking line relates to the reference.
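
    The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not part of cuffcompare: the function name is hypothetical, and it assumes you have already split a tracking line and know which field holds the class code and which holds the reference transcript (with '-' or an empty string meaning "no reference").

```python
def effective_class_code(code, ref_field):
    """Return the class code to use for one tracking-file line.

    code      -- class code as printed on the line
    ref_field -- reference transcript field; '-' or '' is ASSUMED to mean
                 that no reference transcript is associated with the line
    """
    has_reference = ref_field not in ("", "-")
    if code in ("-", ".") and not has_reference:
        # No reference on the line: treat the code as 'u'.
        return "u"
    # Any other code is kept. A '.' with a reference means the transcripts
    # on the line relate to the reference in different ways (check .tmap).
    return code
```

    For example, effective_class_code(".", "-") gives 'u', while a '.' on a line that does name a reference is left unchanged.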


    • #3
      cuffcompare class codes

      What does the class code p stand for? A reference to it is also not in the manual.


      • #4
        That is the code for "polymerase run" - it's supposed to signal that the positioning of the transcripts relative to the reference transcript suggests a potential polymerase read-through downstream of the 3' UTR of the reference. Currently this is crudely implemented: the 'p' code is assigned whenever a cufflinks transcript starts within 2 Kb downstream of a reference.
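
        As a rough illustration of that positional test (not the actual cuffcompare code), the check might look like the sketch below. The function name, the 1-based inclusive coordinates, and the strand handling are all assumptions made for the example; only the 2 Kb window comes from the description above.

```python
def looks_like_polymerase_run(tf_start, tf_end, ref_start, ref_end,
                              strand="+", window=2000):
    """True if the transfrag starts within `window` bp downstream of the reference.

    Coordinates are ASSUMED to be 1-based and inclusive, start <= end.
    On the '-' strand, "downstream" is taken to mean lower coordinates and
    the transfrag's start is taken to be its rightmost base (an assumption).
    """
    if strand == "-":
        distance = ref_start - tf_end   # gap below the reference's 3' end
    else:
        distance = tf_start - ref_end   # gap above the reference's 3' end
    return 0 < distance <= window
```

        With a reference at 1000-5000 on the '+' strand, a transfrag at 5500-6200 passes this test, while one at 8000-9000 does not.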


        • #5
          And what does 'o' stand for? Thanks


          • #6
            The 'o' code stands for "other overlap" - that is an exonic overlap with the reference transcript that doesn't fall in any other, "more interesting" overlap categories - e.g. no splice sites match ('j' class), no containment ('c' code) etc. These 'o' codes could be assigned, for example, to assembled single-exon fragments that happen to overlap one of the terminal exons of a reference transcript (but not enough to make it "contained").


            • #7
              May I know where you got this information? I can't seem to find it.



              • #8
                Originally posted by Haneko
                May I know where you got this information? I can't seem to find it.

                gpertea is Geo Pertea. He wrote cuffcompare, and he and I came up with these classification codes and the rules by which they are assigned to fragments while we were designing the program together.


                • #9
                  Oh ok. Thanks!


                  • #10
                    Since this thread was started the Cufflinks manual has been updated so all the codes should be listed there now, and the Cuffcompare code was also updated to fix the little inconsistencies discussed above.


                    • #11
                      I found the following issue when dealing with cufflinks/cuffcompare:

                      I have an experiment with 7 lanes. I am studying the effect of combining these lanes (or not) in the analysis. I am getting some strange results in the class codes provided by cuffcompare:

                      What does the "e" class code mean? The manual says: "A single exon transcript overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment." However, I checked one of these "e" transcripts and it seems to overlap the intron by only 4 bp...

                      * When I use a single lane in the study, these are the class codes (the results are quite similar for every lane):

                      Class_code  Number_reads  Counts            %
                      c           80589         35546855.389316   24.105
                      e           18586         34152995.119766   23.16
                      i           39500         4951663.761766    3.358
                      j           27            76931.419867      0.052
                      o           21448         60651025.9952     41.128
                      p           6114          1563040.590947    1.06
                      u           15133         6021136.2572      4.083
                      =           396           4504079.989821    3.054

                      * However when I use the 7 lanes:

                      Class_code  Number_reads  Counts            %
                      c           64197         88839653.484614   6.38
                      e           99716         962234656.969969  69.103
                      i           411761        73484397.951014   5.277
                      j           9             25164.917023      0.002
                      o           21828         134378950.510602  9.65
                      p           21268         8650851.800035    0.621
                      u           115917        39115195.697265   2.809
                      =           1690          85729215.60263    6.157

                      Is this normal? I have tried using two lanes and the percentage of "e" would be around 44%...

                      Thanks in advance for any help


                      • #12

                        Re: 1)
                        You are right, I just looked at the code and it appears that the 10 bp check had been inadvertently commented out -- sorry about that, I am going to restore it. There was also an attempt to extend that code definition to include any observed unspliced introns; I have to check with Cole on the status of this modification, and I'll update the manual if that's the case.
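
                        For reference, the 'e' rule as the manual words it can be sketched like this. This is a reading of the manual text, not the cuffcompare source: the function names are hypothetical, exons are assumed to be (start, end) pairs with 1-based inclusive coordinates, and "at least 10 bp of a reference intron" is taken to mean at least 10 bp of any single intron.

```python
def overlap_len(a, b):
    """Number of bases shared by two inclusive intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def is_e_candidate(transfrag_exons, ref_exons, min_intron_overlap=10):
    """Apply the manual's wording of the 'e' class to one transfrag/reference pair."""
    if len(transfrag_exons) != 1:
        return False                      # 'e' is for single-exon transfrags only
    tf = transfrag_exons[0]
    exons = sorted(ref_exons)
    # Reference introns are the gaps between consecutive reference exons.
    introns = [(left[1] + 1, right[0] - 1)
               for left, right in zip(exons, exons[1:])]
    touches_exon = any(overlap_len(tf, exon) > 0 for exon in exons)
    deep_in_intron = any(overlap_len(tf, intron) >= min_intron_overlap
                         for intron in introns)
    return touches_exon and deep_in_intron
```

                        Under this reading, a transfrag that runs only 4 bp into an intron would not qualify as 'e', which is the behaviour the 10 bp check is meant to enforce.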

                        Re: 2)
                        I am not sure which of the output files of cuffcompare you used to generate those tables showing class code frequencies and number of "reads" (?). If you can explain how you got those numbers I will look into why they don't add up.


                        • #13

                          Thank you very much for your quick reply. Please, let me know when you get to fix this little bug.

                          "I am not sure which of the output files of cuffcompare you used to generate those tables showing class code frequencies and number of "reads" (?). If you can explain how you got those numbers I will look into why they don't add up."
                          I used the .tmap file. "Number of reads" is indeed a bad name: it is the number of "transfrags" that I find in the .tmap file with a given class code. The counts are estimated by multiplying the coverage by the length of each transfrag. I was trying to reproduce the information provided in Table 2 of the Supplementary Material of the cufflinks paper. Am I doing something wrong?
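
                          The tally described above can be written as a short Python sketch. It assumes the .tmap file is tab-separated with a header row containing columns named class_code, cov and len; the function name and the toy data are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def summarize_tmap(handle):
    """Return {class_code: (n_transfrags, estimated_count)} for one .tmap file.

    The estimated count is the sum of coverage * length over the transfrags
    of each class, i.e. the rough estimate described in the post.
    """
    n_transfrags = defaultdict(int)
    estimated = defaultdict(float)
    for row in csv.DictReader(handle, delimiter="\t"):
        code = row["class_code"]
        n_transfrags[code] += 1
        estimated[code] += float(row["cov"]) * int(row["len"])
    return {code: (n_transfrags[code], estimated[code]) for code in n_transfrags}


# Toy input with only the columns the sketch needs (values are invented).
sample = (
    "class_code\tcuff_id\tcov\tlen\n"
    "c\tCUFF.1.1\t2.0\t100\n"
    "c\tCUFF.2.1\t1.5\t200\n"
    "e\tCUFF.3.1\t4.0\t50\n"
)
summary = summarize_tmap(io.StringIO(sample))
```

                          For a real file, pass an open file handle instead of the in-memory sample, for example summarize_tmap(open(path, newline="")).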

                          Thanks again for your help


                          • #14
                            I cannot provide a full cufflinks release now but you can get the updated cuffcompare source code with this quick fix, here:
                            After unpacking, simply change to that directory and run 'make' to compile the updated cuffcompare binary for your platform; it should work if you have a recent gcc/g++ installed.

                            About the second problem with the counts, I have no idea where the problem is, as I am still not sure how you actually get the counts -- are you using something like this to count the transfrags for the various class codes in a .tmap file?
                            cut -f3 60hr.tmap | sort | uniq -c
                            Because in this case the counts for a given sample should be consistent between cuffcompare runs if the same reference annotation was used for all runs -- no matter how many other input files (samples) are given to cuffcompare.

                            But perhaps I misunderstood you -- I assumed that you assembled each lane separately with cufflinks and when you said "I used 7 lanes" I assumed you meant that you ran cuffcompare with all the 7 GTF files as input. And then you summed up the class code counts in all 7 resulting .tmap files.


                            • #15
                              I recently ran cuffcompare with all 4 GTF files as input and this is some of my outcome (from my tracking file). Each sample has the raw number and % of each category. I am confused about my output:

                              --my first two samples have essentially no intergenic reads, whereas my other two samples have plenty (sample 1: 2, sample 2: 0, sample 3: 680K, sample 5: 688K)

                              --when looking at the transfrags that fall into the "." category, the first two samples have many more in that category than the last two samples do.

                              --when you add the two columns together, they seem to come to the same number of reads.

                              All the samples are the same type of tissue, but different stages of a disease.

                              Is there something wrong here?

                              All of the other categories are consistent across the board. Any explanation of why ~680k reads were labeled category "u" in two samples and 0 in the other two would be greatly appreciated. Thank you.
                              Last edited by zorph; 06-18-2010, 10:03 AM.

