Find all annotated rRNA (rDNA) sequences

  • Find all annotated rRNA (rDNA) sequences

    Hello all,

    I think it would be good to post this here for future reference. I could not find a relevant topic here, and found only one discussion on Biostars.

    Picard tools has a tool named CollectRnaSeqMetrics.

    It's a very useful program that requires, among other more obvious things, a file of ribosomal intervals in a SAM-like format: a SAM-type header, followed by intervals in 5 fields (chr, begin, end, strand (+ or -), and the actual gene name).
    Since I mostly deal with mouse (in mm9 assembly) and human (in hg19 assembly) genomes, I wanted to find these files or make them myself.
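    To make the layout concrete, here is a minimal sketch of writing such a file. It assumes the header needs only an @HD line plus one @SQ line per reference sequence (in practice these would be copied from the header of the aligned BAM), and that coordinates are 1-based and inclusive. The function name and the toy sequence are made up for illustration.

```python
from typing import Iterable, Tuple

# (chrom, start, end, strand, name); coordinates are 1-based and inclusive.
Interval = Tuple[str, int, int, str, str]


def format_interval_list(sequences: Iterable[Tuple[str, int]],
                         intervals: Iterable[Interval]) -> str:
    """Return the text of a SAM-headered, 5-column interval list."""
    lines = ["@HD\tVN:1.4"]
    # One @SQ line per reference sequence: name and length.
    for seq_name, length in sequences:
        lines.append(f"@SQ\tSN:{seq_name}\tLN:{length}")
    # One tab-separated line per interval.
    for chrom, start, end, strand, name in intervals:
        if strand not in ("+", "-"):
            raise ValueError(f"strand must be '+' or '-', got {strand!r}")
        lines.append("\t".join([chrom, str(start), str(end), strand, name]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Toy data only; real sequence names and lengths come from the BAM header.
    print(format_interval_list([("chrToy", 1000)],
                               [("chrToy", 10, 50, "+", "toy_rRNA")]), end="")
```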

    I've tried to make sense of the files at http://www.arb-silva.de/ and just flat-out failed. If someone can tell me how to convert the files available for download there into genomic intervals that correspond to rRNA, I'd be very grateful.

    At any rate, I've proceeded to the latest version of GENCODE. There are 1587 intervals annotated as "rRNA" transcript type in v17 of GENCODE. However, I've found that many intervals I had in a previous rRNA interval file (origins of which are mysterious), such as LSU-rRNAs, are absent.

    So, here are two main questions:

    - what should be the ultimate source of information for annotated rDNA intervals?
    - what would that source be for the mouse genome, considering that there's no GENCODE data for mouse?

    Thank you for your inputs.

  • #2
    Ok I've figured it out - I guess I did not search thoroughly enough.

    You can find the intervals using the UCSC Table browser. For this, you go to

    http://genome.ucsc.edu/cgi-bin/hgTables

    and then set group:all tables, table:rmsk, and filter to "repClass (does match) rRNA"

    then output it as a GTF file. Voila! Works for both mouse and human.
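    The same filter can also be applied offline to a downloaded dump of the rmsk table. This is only a sketch: it assumes the tab-separated column order of the UCSC rmsk.txt dump listed in the code and 0-based start coordinates, and the function name and toy rows are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Assumed column order of the UCSC rmsk table dump (tab-separated).
RMSK_COLUMNS = [
    "bin", "swScore", "milliDiv", "milliDel", "milliIns",
    "genoName", "genoStart", "genoEnd", "genoLeft", "strand",
    "repName", "repClass", "repFamily", "repStart", "repEnd",
    "repLeft", "id",
]
COL = {name: i for i, name in enumerate(RMSK_COLUMNS)}


def rrna_from_rmsk(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, str, str]]:
    """Yield (chrom, start, end, strand, repName) for rows with repClass == rRNA.

    Starts are converted from 0-based to 1-based, so the output is inclusive.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[COL["repClass"]] != "rRNA":
            continue
        yield (
            fields[COL["genoName"]],
            int(fields[COL["genoStart"]]) + 1,
            int(fields[COL["genoEnd"]]),
            fields[COL["strand"]],
            fields[COL["repName"]],
        )


if __name__ == "__main__":
    # Two toy rows, not real annotation.
    toy = [
        "\t".join(["585", "1000", "0", "0", "0", "chrToy", "99", "250", "-750",
                   "+", "5S", "rRNA", "rRNA", "1", "120", "0", "1"]),
        "\t".join(["585", "900", "0", "0", "0", "chrToy", "300", "400", "-600",
                   "-", "AluY", "SINE", "Alu", "1", "100", "0", "2"]),
    ]
    for record in rrna_from_rmsk(toy):
        print(*record, sep="\t")
```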



    • #3
      As another option: this information is also in the GTF files available from Ensembl: http://www.ensembl.org/info/data/ftp/index.html You will have to "grep" out the information.
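      For anyone who prefers a script to "grep", here is a rough sketch of the same extraction. It assumes the biotype sits in a gene_biotype attribute (GENCODE files use gene_type / transcript_type), and that older Ensembl releases put it in the second column instead. The function names and toy lines are made up.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn the 9th GTF column (key "value"; pairs) into a dict."""
    return dict(ATTR_RE.findall(field))


def rrna_records(lines: Iterable[str],
                 keys: Tuple[str, ...] = ("gene_biotype", "gene_type", "transcript_type"),
                 ) -> Iterator[Tuple[str, int, int, str, str]]:
    """Yield (chrom, start, end, strand, name) for GTF records labelled rRNA."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = parse_attributes(fields[8])
        # Assumption: the label is in column 2 or in one of the biotype attributes.
        is_rrna = fields[1] == "rRNA" or any(attrs.get(k) == "rRNA" for k in keys)
        if is_rrna:
            name = attrs.get("gene_name") or attrs.get("gene_id", ".")
            yield fields[0], int(fields[3]), int(fields[4]), fields[6], name


if __name__ == "__main__":
    # Toy GTF lines, not real annotation.
    toy = [
        'toy1\tensembl\tgene\t10\t130\t.\t+\t.\tgene_id "G1"; gene_name "RNA5S_toy"; gene_biotype "rRNA";',
        'toy1\tensembl\tgene\t500\t900\t.\t-\t.\tgene_id "G2"; gene_biotype "protein_coding";',
    ]
    for record in rrna_records(toy):
        print(*record, sep="\t")
```

      GTF coordinates are already 1-based and inclusive, so no conversion is needed here.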



      • #4
        Cool. It would be interesting to compare the two.



        • #5
          There are differences in GTF files for UCSC and Ensembl downloaded from the iGenomes. See for reference: http://seqanswers.com/forums/showthread.php?t=41701

          Can you check to see if the data you got from UCSC matches the Ensembl GTF?



          • #6
            Originally posted by GenoMax
            There are differences in GTF files for UCSC and Ensembl downloaded from the iGenomes. See for reference: http://seqanswers.com/forums/showthread.php?t=41701

            Can you check to see if the data you got from UCSC matches the Ensembl GTF?

            I definitely will, and I'll post the results here in a day or two.



            • #7
              Hey folks,

              I've been looking into this for a day or two as well, and I also downloaded the .gtf corresponding to repClass=rRNA (also tRNA) from the UCSC Table Browser.

              When I scroll through the .gtf file, though, it seems to only have entries corresponding to 5S rRNA. Is this true for you guys, too?


              Thanks,
              Jeremy



              • #8
                Originally posted by jbchang
                Hey folks,
                I've been looking into this for a day or two as well, and I also downloaded the .gtf corresponding to repClass=rRNA (also tRNA) from the UCSC Table Browser.
                When I scroll through the .gtf file, though, it seems to only have entries corresponding to 5S rRNA. Is this true for you guys, too?
                Thanks,
                Jeremy

                So I spent some time learning about the situation, and this is actually pretty cool.
                There are 6 kinds of ribosomal RNA in mammals (well, for sure in humans): 3 belonging to the large subunit (LSU rRNA: 5S, 5.8S, and 28S), 1 belonging to the small subunit (SSU rRNA: 18S), and 2 mitochondrial rRNAs (12S, 16S).

                Mitochondrial ones are the easiest - they reside, well, in chromosome M. Two rRNAs, two genes, very neat.

                However, others are a real mess. Here's what Wiki says about it:

                The 28S, 5.8S, and 18S rRNAs are encoded by a single transcription unit (45S) separated by 2 internally transcribed spacers. The 45S rDNA is organized into 5 clusters (each has 30-40 repeats) on chromosomes 13, 14, 15, 21, and 22. These are transcribed by RNA polymerase I. 5S occurs in tandem arrays (~200-300 true 5S genes and many dispersed pseudogenes), the largest one on the chromosome 1q41-42. 5S rRNA is transcribed by RNA polymerase III.
                So, yes, we do see a ton of 5S genes and pseudogenes all over the annotated GTFs. There are also some 5.8S, but not many (about ten). Where are the above-mentioned clusters of 45S rDNA, though? Well, that's where it gets interesting. They are still not annotated! GeneCards says the following:

                The sequences coding for ribosomal RNAs are present as rDNA repeating units, designated RNR1 through RNR5, in the p12 region of chromosomes 13, 14, 15, 21 and 22. A 45S rRNA which serves as the precursor for the 18S, 5.8S and 28S rRNA, is transcribed from each rDNA unit by RNA polymerase I. The number of rDNA repeating units varied between individuals and from chromosome to chromosome, although usually 30 to 40 repeats are found on each chromosome. These ribosomal repeating units are not currently annotated on the reference genome. This gene represents the portion of one rDNA repeat which encodes an 18S rRNA.(provided by RefSeq, Mar 2009) .
                Indeed, in the Ensembl file I've found the following entries:

                Code:
                GL000220.1      109078  110946  5S_rRNA rRNA
                GL000220.1      109078  110946  RNA18S5 rRNA
                GL000220.1      109078  110946  RNA18S5 rRNA
                GL000220.1      112025  112177  RNA18S5 rRNA
                GL000220.1      112025  112177  RNA5-8S5        rRNA
                GL000220.1      112025  112177  RNA5-8S5        rRNA
                GL000220.1      113348  118417  RNA5-8S5        rRNA
                GL000220.1      113348  118417  RNA28S5 rRNA
                GL000220.1      113348  118417  RNA28S5 rRNA
                GL000220.1      114151  114242  RNA28S5 rRNA
                GL000220.1      114151  114242  RNA28S5 rRNA
                GL000220.1      118197  118253  RNA28S5 rRNA
                GL000220.1      118197  118253  RNA28S5 rRNA
                GL000220.1      155997  156149  RNA28S5 rRNA
                GL000220.1      155997  156149  RNA5-8S5        rRNA
                GL000220.1      155997  156149  RNA5-8S5        rRNA
                GL000228.1      20113   20230   RNA5-8S5        rRNA
                GL000228.1      20113   20230   5S_rRNA rRNA
                GL000228.1      20113   20230   5S_rRNA rRNA
                GL000228.1      22673   22791   5S_rRNA rRNA
                GL000228.1      22673   22791   5S_rRNA rRNA
                GL000228.1      22673   22791   5S_rRNA rRNA
                As of GRCh38, contig GL000220.1 is still unplaced. However, GL000228.1 is obsolete, and its sequence is probably part of the main GRCh38 assembly.

                Well, I think we all learned something today!



                • #9
                  So, to continue, I've looked at the "gene_id" identifiers provided in the UCSC "rmsk" table (which, as I understand now, refers to the RepeatMasker database, go me!).

                  For hg19, the output is

                  Code:
                     1275 5S
                      414 LSU-rRNA_Hsa
                       80 SSU-rRNA_Hsa
                  For mm9, it is

                  Code:
                     1035 5S
                      491 LSU-rRNA_Hsa
                       46 SSU-rRNA_Hsa
                  So basically the conclusions are: 1) UCSC DOES include all of the rRNA in its "rmsk" table annotation, and 2) to get a realistic picture of rRNA presence, you should use a genome that includes all of the "random" chromosomes, etc.
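                  For reference, a tally like the ones above can be produced with a few lines of code. A minimal sketch, assuming each GTF line carries a gene_id "..." attribute as in the Table Browser output; the function name and toy lines are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

GENE_ID_RE = re.compile(r'gene_id "([^"]*)"')


def count_gene_ids(lines: Iterable[str]) -> Counter:
    """Count how many GTF records carry each gene_id value."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        match = GENE_ID_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Toy lines, not real annotation.
    toy = [
        'chrToy\trmsk\texon\t1\t100\t0\t+\t.\tgene_id "5S"; transcript_id "5S";',
        'chrToy\trmsk\texon\t200\t300\t0\t-\t.\tgene_id "5S"; transcript_id "5S_dup1";',
        'chrToy\trmsk\texon\t400\t900\t0\t+\t.\tgene_id "LSU-rRNA_Hsa"; transcript_id "LSU-rRNA_Hsa";',
    ]
    for gene_id, n in count_gene_ids(toy).most_common():
        print(f"{n:7d} {gene_id}")
```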

                  Now, I'm seeing quite a few differences between the intervals provided in the Ensembl GTF file and in the UCSC tables. I'll have to look at this in more detail to try and understand where they come from.



                  • #10
                    Originally posted by apredeus
                    Now, I'm seeing quite a few differences between the intervals provided in the Ensembl GTF file and in the UCSC tables. I'll have to look at this in more detail to try and understand where they come from.

                    The first thing to check would be the genome builds, to make sure they are the same.



                    • #11
                      Right, yes, I've considered this. From what I've learnt, GRCh37 should be precisely identical in terms of genomic coordinates to what's known as hg19 in UCSC notation.

                      Does the same hold true for GRCm38/mm10?



                      • #12
                        Originally posted by apredeus
                        Right, yes, I've considered this. From what I've learnt, GRCh37 should be precisely identical in terms of genomic coordinates to what's known as hg19 in UCSC notation.

                        Does the same hold true for GRCm38/mm10?
                        That is correct.
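                        One caveat when comparing files: identical coordinates do not mean identical sequence names. UCSC prefixes names with "chr", and Ensembl does not. A rough sketch of a name conversion for hg19, assuming unplaced and random contigs follow the chrUn_gl000220 / chr1_gl000191_random pattern and correspond to GL000220.1 / GL000191.1 (version suffix .1 assumed throughout). The function name is made up, and anything it does not recognize (haplotype contigs, chrM) is returned as None so it can be handled by hand.

```python
import re
from typing import Optional

# chrUn_gl000220 or chr1_gl000191_random -> captures the six-digit accession number.
GL_RE = re.compile(r"^chr(?:Un|\d+|X|Y)_gl(\d{6})(?:_random)?$")
PRIMARY_RE = re.compile(r"^chr(\d+|X|Y)$")


def ucsc_to_ensembl(name: str) -> Optional[str]:
    """Map an hg19-style UCSC sequence name to a GRCh37-style Ensembl name.

    Returns None for names this sketch does not cover.
    """
    gl = GL_RE.match(name)
    if gl:
        # Assumption: every GRCh37 GL contig carries version suffix ".1".
        return f"GL{gl.group(1)}.1"
    primary = PRIMARY_RE.match(name)
    if primary:
        return primary.group(1)
    return None


if __name__ == "__main__":
    for ucsc_name in ["chr13", "chrUn_gl000220", "chr1_gl000191_random", "chr6_apd_hap1"]:
        print(ucsc_name, ucsc_to_ensembl(ucsc_name))
```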



                        • #13
                          Thanks, guys, for this valuable discussion. After looking through the rmsk rRNA entries more closely, they seem to make enough sense that I'm willing to trust RepeatMasker.

                          In terms of the Ensembl vs Hg19 (rmsk) coordinates/annotations for rRNA, although I am new to this, it doesn't seem like they should necessarily correspond. Don't they have different annotation pipelines/procedures, anyway?


                          Best,
                          Jeremy



                          • #14
                            Originally posted by jbchang
                            In terms of the Ensembl vs Hg19 (rmsk) coordinates/annotations for rRNA, although I am new to this, it doesn't seem like they should necessarily correspond. Don't they have different annotation pipelines/procedures, anyway?

                            Yes, they are different, but the question is which one is more comprehensive (and thus more complete for, e.g., evaluating rRNA presence in an RNA-seq experiment).

                            Actually, as I'm finding out, they are super different. I'm not quite sure what the reason for that is. I'll post more details in the next post.



                            • #15
                              So, from the comparison I've done, it seems like the rmsk intervals have a lot more coverage. They include virtually all of the Gencode intervals, plus many more unique intervals.

                              So, for hg19:

                              hg19_rRNA_rmsk.gtf has 1769 intervals covering 193760 bp
                              hg19_rRNA_gencode.gtf has 571 intervals covering 70960 bp (v19 of human Gencode)
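                              Covered-bp totals like these can be computed by merging overlapping intervals per chromosome and summing the merged lengths, so that no base is counted twice. A minimal sketch, assuming 1-based inclusive coordinates as in GTF; the function name and toy intervals are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def covered_bp(intervals: Iterable[Tuple[str, int, int]]) -> int:
    """Total bases covered by (chrom, start, end) intervals, overlaps counted once."""
    by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end + 1:
                # Overlapping or directly adjacent: extend the current merged span.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total


if __name__ == "__main__":
    # Toy intervals: 1-10 and 5-20 merge into 1-20 (20 bp), plus a single base.
    toy = [("chrToy1", 1, 10), ("chrToy1", 5, 20), ("chrToy2", 100, 100)]
    print(covered_bp(toy))
```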

                              When you intersect the two, you see that they have about 50 kb in common. However, all of the gencode intervals except the few that follow are included in the rmsk version (I can't evaluate differences in random chromosomes and unplaced scaffolds, since they have different names):

                              Code:
                              chr2	133010727	133010878	RNA5-8SP5
                              chr2	162266065	162266181	5S_rRNA
                              chr9	32293556	32293690	RNA5SP281
                              chr9	110681147	110681259	RNA5SP293
                              chr9	111754689	111754849	RNA5-8SP3
                              chr10	49248476	49248591	RNA5SP315
                              chr11	8866810	8866905	RNA5SP330
                              chr11	96207736	96207856	RNA5SP346
                              chr12	66460001	66460118	RNA5SP362
                              chr15	98015356	98015482	RNA5SP401
                              chr16	33965426	33965577	RNA5-8SP2
                              chr19	24187160	24187309	RNA5-8SP4
                              chr20	5326652	5326806	RNA5-8SP7
                              chrY	10037764	10037915	RNA5-8SP6
                              So basically, I'm going to use rmsk gtf file with the addition of these 14 lines. This should be more than enough for my purposes. I'm also definitely including all of the random chromosomes etc. since a lot of rRNA elements are there.
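                              A list like the one above can be produced by checking each interval of one set for an overlap with the other set. This is a sketch only: it treats "included" as "overlaps at least one interval on the same chromosome", which is an assumption about the comparison, and the function name and toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Named = Tuple[str, int, int, str]  # (chrom, start, end, name), 1-based inclusive


def intervals_without_overlap(query: Iterable[Named],
                              reference: Iterable[Tuple[str, int, int]]) -> List[Named]:
    """Return the query intervals that overlap no reference interval."""
    ref_by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in reference:
        ref_by_chrom[chrom].append((start, end))

    missing = []
    for chrom, start, end, name in query:
        # Two inclusive intervals overlap when each starts before the other ends.
        hit = any(r_start <= end and start <= r_end
                  for r_start, r_end in ref_by_chrom.get(chrom, []))
        if not hit:
            missing.append((chrom, start, end, name))
    return missing


if __name__ == "__main__":
    # Toy data, not real annotation.
    toy_query = [("chrToy", 10, 50, "a"), ("chrToy", 200, 300, "b"), ("chrOther", 1, 5, "c")]
    toy_reference = [("chrToy", 40, 120)]
    for record in intervals_without_overlap(toy_query, toy_reference):
        print(*record, sep="\t")
```

                              The linear scan per query is fine for a couple of thousand intervals; larger sets would call for sorted lists or an interval tree.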
                              Last edited by apredeus; 03-28-2014, 09:08 PM.

