  • cuffdiff does not output all the CDS in cds.FPKM.tracking file

    Hi guys,

    I ran cuffdiff with an Ensembl annotation file like this:
    -------------------
    cuffdiff -o diff_out -b ../genome.fa -p 8 -L s1,s2 -u genome.gtf tophat_s1/accepted_hits.bam tophat_s2/accepted_hits.bam
    -------------------
    where genome.gtf is the Ensembl annotation file, which contains ~70,000 distinct CDS annotations.

    However, the output file cds.fpkm_tracking contains only ~14,000 CDS features. Why is that? Did I do something wrong?

    Any help is appreciated. Thank you so much.

    Regards,

    xiangq

  • #2
    I'm having a similar issue. I obtained my .gtf file from UCSC at this website:


    My problem is that this outputs REFSEQ, UNIPROT, and UCSC gene IDs. Can anyone make a good recommendation?



    • #3
      how many lines are in the genes.fpkm_tracking file?
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */



      • #4
        Hi, sdriscoll,

        There are ~52,000 lines in the genes.fpkm_tracking file and
        ~156,600 lines in the isoforms.fpkm_tracking file.

        Any idea? Thanks.



        • #5
          Nope. According to my Ensembl GTF reference (downloaded from UCSC), there are 55,796 unique transcript IDs with CDS features and 22,948 unique gene names with CDS features, so the number returned by cuffdiff seems random.
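
Counts like these can be reproduced with a short script. What follows is only a minimal sketch: it assumes a tab-separated GTF whose ninth column holds `key "value"` attributes, and the function name `count_cds_ids` is hypothetical (it is not part of Cufflinks).

```python
import re
from collections import defaultdict

# Matches GTF attributes written as: key "value"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def count_cds_ids(gtf_lines):
    """Return (n_transcript_ids, n_gene_names, n_p_ids) seen on CDS rows."""
    seen = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue  # only rows whose feature type (column 3) is CDS
        attrs = dict(ATTR_RE.findall(fields[8]))
        for key in ("transcript_id", "gene_name", "p_id"):
            if key in attrs:
                seen[key].add(attrs[key])
    return (
        len(seen["transcript_id"]),
        len(seen["gene_name"]),
        len(seen["p_id"]),
    )
```

Passing an open file handle, for example `count_cds_ids(open("genome.gtf"))`, gives the three counts for that annotation, which can then be compared with the number of rows in cds.fpkm_tracking.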


          • #6
            Hi guys,

            I checked the data and found that cuffdiff seems to output CDS features only under the following condition:

            Suppose a CDS has p_id p01 and its gene_id is ENSG00000123456.
            If that gene_id has 3 features in total in the annotation GTF file and all of them carry a p_id annotation (say p01, p02 and p03),
            then CDS p01, p02 and p03 are all listed in the cds.fpkm_tracking file.
            Otherwise, none of p01, p02 or p03 is listed.
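
The condition described above could be tested against the annotation directly. The sketch below is a rough check under stated assumptions, not a description of how cuffdiff works internally: it assumes a tab-separated GTF, treats each `transcript_id` as one "feature" of the gene, and assumes the first column of cds.fpkm_tracking holds the p_id. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Matches GTF attributes written as: key "value"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def p_ids_expected_under_hypothesis(gtf_lines):
    """Return the p_ids of genes in which every transcript carries a p_id."""
    # gene_id -> transcript_id -> p_id (None while no p_id has been seen)
    genes = defaultdict(dict)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene, tx = attrs.get("gene_id"), attrs.get("transcript_id")
        if gene is None or tx is None:
            continue
        # Keep a p_id once it has been seen on any row of the transcript.
        genes[gene][tx] = attrs.get("p_id") or genes[gene].get(tx)
    expected = set()
    for transcripts in genes.values():
        if all(p is not None for p in transcripts.values()):
            expected.update(transcripts.values())
    return expected


def tracking_ids(tracking_lines):
    """Return the first-column values of an fpkm_tracking file, minus the header."""
    rows = iter(tracking_lines)
    next(rows, None)  # skip the header row
    return {r.rstrip("\n").split("\t", 1)[0] for r in rows if r.strip()}
```

If the two sets are equal for a given run, the hypothesis holds for that data set; any p_id found in only one of them is a counterexample worth inspecting.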

            Why does cuffdiff calculate CDS FPKM like that? Is that right?

            Any comments are welcome. Thanks.

            -xiangq



            • #7
              Hi sdriscoll,

              Thanks for your quick response.

              Did you run cuffdiff on your data? How many CDS features are listed in the cds.fpkm_tracking file?

              Thanks.

              -xiangq



              • #8
                did you do as they suggest on their website to modify the annotation you're using to include the necessary information for cuffdiff to work properly? i'm assuming you did since you say your annotation includes p_id values.

                maybe...could we see some lines from your GTF file that include some CDS features? I know that cuffdiff only does CDS DE when there are CDS features in the GTF reference.


                • #9
                  Hi, I used the Ensembl GTF annotation file, and not every feature in it has a p_id annotation. Here are several features from the file:
                  Code:
                  chr10    protein_coding    exon    97990550    97990590    .    -    .    exon_number "4"; gene_id "ENSG00000095585"; gene_name "BLNK"; transcript_id "ENST00000467799"; transcript_name "BLNK-004"; tss_id "TSS39268";
                  chr10    protein_coding    exon    97990550    97990590    .    -    .    exon_number "6"; gene_id "ENSG00000095585"; gene_name "BLNK"; p_id "P56921"; transcript_id "ENST00000393894"; transcript_name "BLNK-202"; tss_id "TSS58878";
                  chr10    protein_coding    exon    97990550    97990590    .    -    .    exon_number "4"; gene_id "ENSG00000095585"; gene_name "BLNK"; p_id "P32418"; transcript_id "ENST00000371176"; transcript_name "BLNK-002"; tss_id "TSS85036";
                  chr10    protein_coding    exon    97998644    97998920    .    -    .    exon_number "4"; gene_id "ENSG00000095585"; gene_name "BLNK"; transcript_id "ENST00000495266"; transcript_name "BLNK-003"; tss_id "TSS11211";
                  I don't know whether the UCSC annotation file would differ from this one.


                  Thanks.



                  • #10
                    did you try this?

                    Code:
                    cuffcompare -s /path/to/genome_seqs.fa -CG -r annotation.gtf annotation.gtf
                    that's what's recommended by the cufflinks people when using GTF files that were NOT generated by cuffcompare. it basically tags up the GTF annotation in a way that cuffdiff prefers. annotation.gtf would be your Ensembl GTF and you'd specify the genome FASTA file you built your bowtie index from.


                    • #11
                      Hi sdriscoll,

                      thanks for your advice.

                      Actually, I tried cuffcompare before. The output GTF looks like this:
                      Code:
                      chr1    Cufflinks       exon    850183  850351  .       +       .       gene_id "XLOC_000031"; transcript_id "TCONS_00000075"; exon_number "2"; gene_name "RP11-54O7.2"; oId "ENST00000398216"; nearest_ref "ENST00000398216"; class_code "="; tss_id "TSS53";
                      chr1    Cufflinks       exon    860260  860328  .       +       .       gene_id "XLOC_000032"; transcript_id "TCONS_00000076"; exon_number "1"; gene_name "SAMD11"; oId "ENST00000420190"; contained_in "TCONS_00000078"; nearest_ref "ENST00000420190"; class_code "="; tss_id "TSS54"; p_id "P5";
                      The gene_id, transcript_id, tss_id and p_id values are replaced by Cufflinks-internal IDs. As a result, all feature IDs in the cuffdiff output are also Cufflinks-internal IDs, which makes downstream analysis more difficult.

                      -xiangq



                      • #12
                        I guess the only important question is if the Cuffcompare version of the GTF results in more CDS output. You can always swap gene names later with Perl or something. Of course I'm a programmer so that's how I think. At least you'll know if it's the annotation causing the lack of CDS output. Maybe try a GTF from UCSC as well. See what happens.
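
Swapping the names back does not have to be done in Perl. Here is a minimal Python sketch of the idea, assuming a tab-separated cuffcompare GTF in which `oId` holds the original transcript ID and `gene_name` the original gene symbol, as in the lines quoted above. The function names `build_id_maps` and `relabel` are hypothetical.

```python
import re

# Matches GTF attributes written as: key "value"
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def build_id_maps(cuffcompare_gtf_lines):
    """Return (transcript_map, gene_map) from a cuffcompare-annotated GTF.

    transcript_map: TCONS_* transcript_id -> original ID stored in oId
    gene_map:       XLOC_* gene_id        -> set of gene_name values
    """
    transcript_map, gene_map = {}, {}
    for line in cuffcompare_gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "transcript_id" in attrs and "oId" in attrs:
            transcript_map[attrs["transcript_id"]] = attrs["oId"]
        if "gene_id" in attrs and "gene_name" in attrs:
            # One XLOC locus may cover several named genes, so keep a set.
            gene_map.setdefault(attrs["gene_id"], set()).add(attrs["gene_name"])
    return transcript_map, gene_map


def relabel(ids, id_map):
    """Swap each ID for its mapped value, leaving unknown IDs unchanged."""
    return [id_map.get(i, i) for i in ids]
```

The resulting maps can be applied to the ID column of any cuffdiff output table with `relabel`. IDs that have no entry in the map are passed through unchanged, so nothing is silently dropped.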


                        • #13
                          Hi, thanks for the response.

                          I will try and let you know the results.



                          • #14
                            So I'm very confused about this GTF format.

                            From the UCSC wiki, it states "At this time, this genePredToGtf command can provide better GTF files than available from the table browser."


                            However, I get outputs from cuffdiff with UniProt, RefSeq, UCSC and Ensembl IDs. That's not a huge deal, since I can use DAVID to look all of them up. What I am really confused by is this:

                            One of my CDS is uc010nxq.1. When I search for this on Google, I get this and this

                            One of these is on chromosome 15 and the other on chromosome X. Please help, I could not be more confused.



                            • #15
                              Be careful with the genome browser when you're searching from Google. What you've got there is one link to uc010nxq.1 in the hg19 build and one to the same ID in the hg18 build. Since the ucXXXxxx.X IDs are always UCSC I'd do this: go to the browser at http://genome.ucsc.edu/cgi-bin/hgGateway, select the correct genome & build, and paste the ID into the empty "gene" box, then click "submit". That will take you to its location on the genome. The transcript corresponding to the one you searched for will be highlighted in the left margin (if it was a UCSC id, that is). You can click on the transcript and you'll be taken to the information page for the gene.
