  • A question about the small RNA sequencing data

    Hi,

    Has anyone here analysed small RNA sequencing data produced by Solexa? I found that many sequences have untemplated nucleotides at their 3' ends after the 3' adaptor is removed. For example:

    hsa-let-7a-1
    TGGGATGAGGTAGTAGGTTGTATAGTTTTAGGGTCACACCCACCACTGGGAGATAACTATACAATCTACTGTCTTTCCTA
    TGAGGTAGTAGGTTGTATAGTTgc
    TGAGGTAGTAGGTTGTATAGTTaa
    TGAGGTAGTAGGTTGTATAGTTagg
    TGAGGTAGTAGGTTGTATAGTTatc
    TGAGGTAGTAGGTTGTATAGTTgtt
    TGAGGTAGTAGGTTGTATAGTTaaa
    TGAGGTAGTAGGTTGTATAGTTagt
    TGAGGTAGTAGGTTGTATAGTTca
    TGAGGTAGTAGGTTGTATAGTTa
    TGAGGTAGTAGGTTGTATAGTTggt
    TGAGGTAGTAGGTTGTATAGTTg
    TGAGGTAGTAGGTTGTATAGTTatcttatt

    Lower-case characters are nucleotides sequenced by Solexa that cannot be
    mapped to the genome.

    Does anyone know why this happens? Can I still use sequences that carry
    untemplated 3' nucleotides?

    Leo
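
A minimal Python sketch of how the lower-case marking in the listing above could be reproduced. It assumes plain exact matching of the read's 5' portion against the precursor rather than a real genome aligner; the function name `mark_untemplated_tail` is hypothetical.

```python
# Precursor sequence quoted in the post (hsa-let-7a-1).
PRECURSOR = (
    "TGGGATGAGGTAGTAGGTTGTATAGTTTTAGGGTCACACCCACCACTGGGAGATAACTATACAATCTACTGTCTTTCCTA"
)


def mark_untemplated_tail(read, reference=PRECURSOR):
    """Upper-case the longest 5' prefix found in the reference, lower-case the rest.

    Assumption: exact substring matching only, no mismatches or indels.
    """
    read = read.upper()
    # Try the longest prefix first and shorten it one base at a time.
    for cut in range(len(read), 0, -1):
        if read[:cut] in reference:
            return read[:cut] + read[cut:].lower()
    # No prefix of the read occurs in the reference at all.
    return read.lower()


if __name__ == "__main__":
    for r in ("TGAGGTAGTAGGTTGTATAGTTGC", "TGAGGTAGTAGGTTGTATAGTTATCTTATT"):
        print(mark_untemplated_tail(r))
```

A tail that happens to agree with the precursor (for example an added T where the precursor also continues with T) is counted as templated by this approach, so it can only give a lower bound on the number of added nucleotides.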

  • #2
    Hi Leo,

    A couple of questions:
    What proportion of your reads have these 3' 'untemplated' bits?

    If it's not too many, I wouldn't worry about it. For small RNA you tend to get high levels of coverage so throwing out a few is fine.

    What are the quality scores like for your reads, esp. at the 3' end?
    If the quality is poor, you could try clipping the reads to remove the low-quality regions. We've had to do that in the past, as for some datasets the quality was so poor that any read longer than 25 bp was virtually guaranteed not to match the genome.

    Regards,
    Chris

    • #3
      Hi Chris,

      Thanks for your answer.

      Taking the sequences that map to hsa-let-7 as an example, the number of sequences with 3' untemplated nucleotides is one fifth of the number without them. In addition, most of the sequences with 3' untemplated nucleotides have 1 to 3 of them. I am not sure whether it is appropriate to discard these sequences, since I want to compare microRNA expression levels between two samples and, as we know, some isomiRs do carry an untemplated nucleotide in vivo.

      Most of the quality scores of the 3' untemplated nucleotides are OK; see the following sequences:

      @I82_3_FC30HF2AAXX:6:1:11:1235
      TGAGGTAGTAGGTTGTATAGTTAATCGTATGCCGT
      +I82_3_FC30HF2AAXX:6:1:11:1235
      hhghhhhhhhhhhhhhhhhhhhhhhhhhhh[hhhh
      @I82_3_FC30HF2AAXX:6:1:15:1585
      TGAGGTAGTAGGTTGTATAGTTATCGTATGCCGTC
      +I82_3_FC30HF2AAXX:6:1:15:1585
      hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
      @I82_3_FC30HF2AAXX:6:1:1334:984
      TGAGGTAGTAGGTTGTATAGTTAAAATCGTATTCC
      +I82_3_FC30HF2AAXX:6:1:1334:984
      hhhhhhhchhhXhhhhhhhdhhhhhhhO_hhhDTS
      @I82_3_FC30HF2AAXX:6:1:420:1736
      TGAGGTAGTAGGTTGTATAGTTCATCGTATGCCGT
      +I82_3_FC30HF2AAXX:6:1:420:1736
      hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh`Z_h
      @I82_3_FC30HF2AAXX:6:1:511:1438
      TGAGGTAGTAGGTTGTATAGTTGTCGTATGCCGTC
      +I82_3_FC30HF2AAXX:6:1:511:1438
      hhhhhhhhhhhhhhhhhhhhhhhhhShhhYhhVhF

      The 3' adaptor sequence is TCGTATGCCGTCTTCTGCTTG

      Regards,
      Leo
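
The reads above still contain the start of the 3' adaptor, so a rough Python sketch of the trimming step may help in reading them. It assumes the quality strings use an ASCII offset of 64, as in early Solexa/Illumina pipelines (the exact score scale is not stated in the post), and the names `decode_qualities`, `trim_adaptor`, `min_overlap` and `max_mismatches` are hypothetical.

```python
# 3' adaptor sequence quoted in the post.
ADAPTOR = "TCGTATGCCGTCTTCTGCTTG"


def decode_qualities(qual, offset=64):
    """Convert a quality string to integers (assumption: ASCII offset 64)."""
    return [ord(c) - offset for c in qual]


def trim_adaptor(read, adaptor=ADAPTOR, min_overlap=5, max_mismatches=1):
    """Cut the read at the leftmost position where the remainder of the read
    matches the start of the adaptor with at most `max_mismatches` mismatches.
    """
    for start in range(len(read) - min_overlap + 1):
        # zip stops at the shorter of the read tail and the adaptor.
        mismatches = sum(a != b for a, b in zip(read[start:], adaptor))
        if mismatches <= max_mismatches:
            return read[:start]
    # No adaptor found: return the read unchanged.
    return read


if __name__ == "__main__":
    print(trim_adaptor("TGAGGTAGTAGGTTGTATAGTTAATCGTATGCCGT"))
    print(decode_qualities("hDTS"))
```

Allowing one mismatch matters here: the third read carries TCGTATTCC where the adaptor has TCGTATGCC, so an exact-match search would leave the adaptor in place.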

      • #4
        untemplated 3' additions

        The 3' variability of miRNAs is a headache for both mapping and quantitation. We have recently adopted Novoalign for mapping miRNA-seq reads, since it allows multiple mismatches while still finding the optimal alignment. The alignment process takes many, many CPU-hours, so I recommend collapsing your reads first (which means you can't use your quality values). Once you have all your alignments, you can sum up the tags that align to the same place in the genome (including any with mismatches at the 3' end or elsewhere). This is probably more appropriate than throwing away tags with mismatches, since some miRNAs might be more prone to these extensions than others.

        Ryan
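
The collapse-then-sum idea described above could look roughly like this in Python. This is only a sketch: the `alignments` mapping stands in for whatever the aligner reports and is a hypothetical structure, not Novoalign's actual output format.

```python
from collections import Counter


def collapse_reads(reads):
    """Collapse identical read sequences into sequence -> count.

    Quality values are lost at this step, as noted in the post.
    """
    return Counter(reads)


def sum_by_locus(alignments, counts):
    """Sum the counts of all unique sequences that align to the same place.

    alignments: dict mapping sequence -> (chrom, start, strand); a
    hypothetical stand-in for parsed aligner output.
    counts: the Counter returned by collapse_reads.
    """
    totals = Counter()
    for seq, locus in alignments.items():
        totals[locus] += counts[seq]
    return totals


if __name__ == "__main__":
    reads = ["AAA", "AAA", "AAC", "GGG"]
    counts = collapse_reads(reads)
    alignments = {
        "AAA": ("chr1", 100, "+"),
        "AAC": ("chr1", 100, "+"),  # variant aligned to the same start
        "GGG": ("chr2", 5, "-"),
    }
    print(sum_by_locus(alignments, counts))
```

Only the unique sequences need to be aligned, which is where the CPU time is saved; the counts are re-attached afterwards.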

        • #5
          Originally posted by myrna
          The 3' variability of miRNAs is a headache for both mapping and quantitation. We have recently adopted Novoalign for mapping miRNA-seq reads, since it allows multiple mismatches while still finding the optimal alignment. The alignment process takes many, many CPU-hours, so I recommend collapsing your reads first (which means you can't use your quality values). Once you have all your alignments, you can sum up the tags that align to the same place in the genome (including any with mismatches at the 3' end or elsewhere). This is probably more appropriate than throwing away tags with mismatches, since some miRNAs might be more prone to these extensions than others.

          Ryan
          Hi Ryan,

          I haven't used Novoalign. Is it faster than megablast?

          Do you mean that all sequences with 3' untemplated nucleotides can be used in the subsequent analysis? And how are these 3' untemplated nucleotides generated?

          I have read your paper on hESC microRNA analysis, published in GR. I noticed that you analysed single-nucleotide 3' extensions in that paper. Could you please tell me how to analyse them when many microRNA sequences have more than one untemplated nucleotide at their 3' ends?

          Thanks for any help.

          Leo

          • #6
            3' extensions

            Hi Leo.
            From what is known about miRNA target selection, the 3' extensions should not affect the interaction between a miRNA and its target. If you take this perspective, then the sum of all tags for a given miRNA (including any 3' variants) should tell you how much of the mature miRNA was in the cell. I used megablast for the hESC miRNA paper because there was no better option at the time. SOAP was the first aligner that really addressed the issue of variable length alignment (for next-gen sequence data). Novoalign is much faster than SOAP and allows more flexibility, so that is what we are using now.

            Ryan

            • #7
              We've tended to use vmatch for doing complete variable length matches in the past, but now bowtie seems to be ticking all the boxes for small RNA sequences. Never used any of the BLAST tools as they didn't seem to fit our needs and you need to mess around with gap penalties etc.

              In terms of matching to known miRNAs, I've used vmatch to match the reads to the mature sequences by ignoring any 3' or 5' extensions. This gives me the complete set of matching reads.
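
As an illustration of matching reads to mature sequences while ignoring 5' or 3' extensions, here is a small Python sketch. It uses a plain exact substring test, not vmatch, and the function name `assign_to_mirna` is hypothetical.

```python
def assign_to_mirna(read_counts, matures):
    """Total the counts of every read that contains a mature miRNA sequence.

    read_counts: dict mapping read sequence -> count (collapsed reads).
    matures: dict mapping miRNA name -> mature sequence.
    Assumption: exact substring match, so a read with extra bases on
    either end still counts, but a read with an internal mismatch or a
    shortened end does not.
    """
    totals = {name: 0 for name in matures}
    for read, count in read_counts.items():
        for name, mature in matures.items():
            if mature in read:
                totals[name] += count
                break  # assign each read to at most one miRNA
    return totals


if __name__ == "__main__":
    matures = {"hsa-let-7a": "TGAGGTAGTAGGTTGTATAGTT"}
    read_counts = {
        "TGAGGTAGTAGGTTGTATAGTT": 10,    # exact mature sequence
        "TGAGGTAGTAGGTTGTATAGTTGC": 2,   # 3' extension
        "ATGAGGTAGTAGGTTGTATAGTTA": 1,   # 5' and 3' extension
        "ACGTACGT": 5,                   # unrelated read
    }
    print(assign_to_mirna(read_counts, matures))
```

Summing over all variants in this way gives one total per miRNA, which is the quantity suggested earlier in the thread for comparing expression between samples.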

              • #8
                Bowtie for miRNA alignment

                How does Bowtie handle read trimming for miRNA data? Does it recognize the adaptor in advance and only align the pre-adaptor portion of the read? Or does it do a local alignment of the full read against the reference?

                • #9
                  No. The adaptors have to be removed prior to matching against the reference. To ensure that the majority of the adaptors are removed, we also clip reads using quality score thresholds, i.e. moving from 5' to 3', if the mean quality drops below, say, 20, the read is clipped at that position.
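
One possible reading of the quality clipping described above, sketched in Python. The post does not say how the mean is taken, so the sliding window and its size (`window=4`) are assumptions, as is the function name `quality_clip`.

```python
def quality_clip(seq, quals, threshold=20, window=4):
    """Clip a read at the first window whose mean quality is below threshold.

    seq: read sequence; quals: list of integer quality scores, same length.
    Assumption: the mean is computed over a sliding window of `window`
    bases, scanned from 5' to 3'; the read is cut where that window starts.
    """
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < threshold:
            return seq[:start], quals[:start]
    # Quality never dropped below the threshold: keep the whole read.
    return seq, quals


if __name__ == "__main__":
    seq = "ACGTACGTACGTAC"
    quals = [40] * 10 + [10] * 4
    print(quality_clip(seq, quals))
```

With a window mean, a single poor base in an otherwise good read does not trigger clipping, whereas a run of poor bases at the 3' end does.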
