  • Annotation difference between refSeq and Gencode

    Hi all,

    I am trying to set up an RNA-seq workflow:

    1. Generated genome files for STAR using .fna files from the NCBI FTP site and GTF files from Gencode.

    2. Aligned the FASTQ files with STAR, converted SAM to BAM, and sorted the BAM files.

    3. Used the sorted BAM files to test cufflinks, comparing different GTF files for the -G option. The cufflinks outputs somehow have different positions for the same genes:

    refSeq:
    gene_id gene_short_name locus
    PDIA3 - chr15:44038589-44064804
    CD276 - chr15:73976621-74006859
    PROM2 - chr2:95940200-95957055

    gencode:
    gene_id gene_short_name locus
    ENSG00000167004.12 PDIA3 chr15:43746391-43773279
    ENSG00000103855.17 CD276 chr15:73683965-73714518
    ENSG00000155066.15 PROM2 chr2:95274452-95291308

    As a result, the FPKM values are very different between the two outputs.

    What am I missing here, and how do I fix it? If the two GTF files inherently differ in their gene loci, which one should I trust?

    Best,
    Grace

  • #2
    As you have discovered the hard way, it is extremely important to make sure that you are using a consistent genome build/patch level for your analysis (I assume that is what is being reflected in the co-ordinate differences above).

    If you want to avoid these types of issues you could download sequence/annotation/index bundles (you will need to roll your own indexes if you want to use STAR but at least the sequence/annotation would be consistent) from iGenomes.

    In terms of salvaging the analysis, check to see if there are corresponding annotation files available at NCBI where you got the sequence files.



    • #3
      Example PDIA3:
      RefSeq co-ordinates are from Hg19/GRCh37.p19
      Gencode are from GRCh38.p2

      So if your sequence was from GRCh37/Hg19 then get the corresponding annotation file.
      Last edited by GenoMax; 10-07-2015, 08:49 AM.
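
A quick way to see the build mismatch described above is to compare the loci from the two cufflinks outputs directly. The sketch below is only illustrative: the loci are copied from the first post, while the helper names (`parse_locus`, `compare_loci`) are made up for this example and are not part of cufflinks or any other tool.

```python
import re

# Loci copied from the cufflinks output quoted in the first post.
REFSEQ = {
    "PDIA3": "chr15:44038589-44064804",
    "CD276": "chr15:73976621-74006859",
    "PROM2": "chr2:95940200-95957055",
}
GENCODE = {
    "PDIA3": "chr15:43746391-43773279",
    "CD276": "chr15:73683965-73714518",
    "PROM2": "chr2:95274452-95291308",
}

LOCUS_RE = re.compile(r"^(\S+):(\d+)-(\d+)$")


def parse_locus(locus):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    match = LOCUS_RE.match(locus)
    if match is None:
        raise ValueError(f"unrecognised locus: {locus!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def compare_loci(first, second):
    """For genes present in both tables, report the start shift and spans."""
    rows = {}
    for gene in sorted(set(first) & set(second)):
        chrom_a, start_a, end_a = parse_locus(first[gene])
        chrom_b, start_b, end_b = parse_locus(second[gene])
        rows[gene] = {
            "same_chrom": chrom_a == chrom_b,
            "start_shift": start_a - start_b,
            "span_first": end_a - start_a,
            "span_second": end_b - start_b,
        }
    return rows


if __name__ == "__main__":
    for gene, row in compare_loci(REFSEQ, GENCODE).items():
        print(gene, row)
```

For these three genes the chromosomes agree and the gene spans are close, but the start positions are shifted by hundreds of kilobases, and the two chr15 genes are shifted by a similar amount. That pattern is what one would expect from two different assembly versions rather than from different gene models on the same assembly.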



      • #4
        Thanks for the responses.

        I used GCA_000001405.15_GRCh38_no_alt_analysis_set.fna to build the genome for STAR. Does it mean gencode is the right gtf to use here?

        Is it right that if I want to use the RefSeq annotation, I could just download the hg19 reference sequences from iGenomes?

        Also, the cufflinks outputs with the RefSeq and Gencode GTF files are very different: fewer than 30K genes with RefSeq and about 60K genes with Gencode. Is there any explanation for this?



        • #5
          If you used the GRCh38 fasta then gencode should be the right gtf file to use.

          If you want to re-do the alignments then you could go the iGenomes route and save yourself some trouble.

          Since you are sampling different areas of the genome with the two GTF files (co-ordinate differences), the cufflinks outputs are different (though 2x is a big change). How are you handling multi-mappers? Perhaps there is a repeat region in one but not the other.



          • #6
            Thanks.

            By the 30K vs 60K difference I meant the number of rows in the cufflinks output with the two different GTF files. The number of rows and the genes listed are fixed for each GTF regardless of the input BAM files. I checked and found that the Gencode GTF returns a lot of rows for Y_RNA or 5s_rRNA. Is there a way to return only mRNA annotation with the Gencode/GRCh38 GTF, please?
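
To see where the extra rows come from, one option is to tally the gene records in the GTF by biotype. The sketch below assumes the Gencode convention of a `gene_type` attribute in column 9 (other sources may use a different attribute name, or none), and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches quoted key/value pairs such as: gene_type "protein_coding"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the GTF attribute column into a dict (last value wins per key)."""
    return dict(ATTR_RE.findall(field))


def count_gene_types(lines, attr="gene_type"):
    """Count 'gene' records per biotype; `attr` is assumed to be gene_type."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # only count one record per gene
        counts[parse_attributes(cols[8]).get(attr, "unknown")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (the file name is whatever GTF you downloaded):
    #   python count_gene_types.py annotation.gtf
    with open(sys.argv[1]) as handle:
        for gene_type, n in count_gene_types(handle).most_common():
            print(f"{gene_type}\t{n}")
```

If most of the difference between the two annotations sits in non-coding categories, that would account for the larger row count without implying anything is wrong with the alignment.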



            • #7
              You can filter the rows you do not want/need from the GTF file using grep. Look into the -v option.
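
For anyone who prefers a script to a grep one-liner, the same filtering can be sketched in Python. This assumes the Gencode convention that every record carries a `gene_type` attribute, and that keeping `protein_coding` is a reasonable stand-in for "mRNA only"; `filter_gtf` is a made-up name, not an existing tool.

```python
import re

# Gencode GTF records carry an attribute such as: gene_type "protein_coding";
GENE_TYPE_RE = re.compile(r'gene_type "([^"]*)"')


def filter_gtf(lines, keep_types=("protein_coding",)):
    """Yield header lines plus records whose gene_type is in keep_types."""
    keep = set(keep_types)
    for line in lines:
        if line.startswith("#"):
            yield line  # preserve header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a well-formed GTF record
        match = GENE_TYPE_RE.search(cols[8])
        if match and match.group(1) in keep:
            yield line  # records without a gene_type are dropped


if __name__ == "__main__":
    import sys

    # Usage (both file names are placeholders):
    #   python filter_gtf.py input.gtf filtered.gtf
    with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
        dst.writelines(filter_gtf(src))
```

The filtered file can then be passed to cufflinks with -G in place of the full annotation. Unlike a plain `grep -v` on a gene name, this keys on the biotype, so it does not depend on listing every unwanted gene name.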
