  • interesting unmapped reads

    I'm new to RNA-seq, so everything I find seems interesting to me, and strange as well.

    I used TopHat to map my mouse PE100 data to the reference genome and got about 85-90% of reads mapped.
    Then I looked into the unmapped reads and found some that look like this: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACCCCCCCACACCTCAAAAAACACCCCAAAATAAAAATAACCGATCTGATTTAAAAATTAG

    I found 20,000+ reads like the one above (>40 Cs at the head) among 30 M total reads.

    Is this usual? Or can we tell anything from these unmapped reads?

  • #2
    A few more details:

    The 20,000+ reads with >40 Cs at the head were found only in the left reads.
    Only a few reads of this kind were found in the right reads.
    And it only happens with C, not A, T, or G.
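
A count like the one described in these two posts can be reproduced with a short script. The following is only a minimal sketch, assuming a plain 4-line-per-record FASTQ file and treating "at the head" as a strictly uninterrupted run of one base; the function names `leading_run_length` and `count_leading_runs` are hypothetical, and the demo reads are synthetic, not the poster's data.

```python
import io

def leading_run_length(seq, base="C"):
    """Length of the uninterrupted run of a single `base` at the start of `seq`."""
    return len(seq) - len(seq.lstrip(base))

def count_leading_runs(fastq_handle, base="C", min_run=40):
    """Return (hits, total) for a FASTQ stream with 4 lines per record.

    A read is a hit when its sequence starts with more than `min_run`
    consecutive copies of `base` (assumption: no interruptions allowed).
    """
    hits = total = 0
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # second line of each record is the sequence
            total += 1
            if leading_run_length(line.strip(), base) > min_run:
                hits += 1
    return hits, total

# Synthetic two-read example (not the poster's data).
demo = io.StringIO(
    "@read1\n" + "C" * 45 + "ACGT\n+\n" + "#" * 49 + "\n"
    "@read2\nACGTCCCC\n+\nIIIIIIII\n"
)
print(count_leading_runs(demo))  # (1, 2)
```

Running it once per base (A, C, G, T) on both the left and right read files would give the same kind of comparison as above. For gzipped files, `gzip.open(path, "rt")` can stand in for the `io.StringIO` handle.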

    • #3
      Those look like low-complexity reads. Sometimes they get flagged as low complexity or as having too many multimaps. If they aren't flagged, you can take some of the reads and run BLAST to figure out where they map. They are probably real data; there are regions like that in genomes, many in fact.
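
One rough way to put a number on "low complexity" is the Shannon entropy of a read's base composition. This is only an illustrative stand-in, not the DUST-style filter that BLAST applies to nucleotide queries, and the function name `base_entropy` is hypothetical.

```python
import math
from collections import Counter

def base_entropy(seq):
    """Shannon entropy (in bits) of the base composition of `seq`.

    0.0 means a single repeated base; 2.0 is the maximum for an even
    mix of A, C, G and T. An empty sequence is scored as 0.0.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq.upper())
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(base_entropy("CCCCCCCCCC"))  # homopolymer: lowest possible score
print(base_entropy("ACGTACGTAC"))  # mixed sequence: close to the 2-bit maximum
```

Composition entropy ignores base order, so a repeat such as ACACACAC still scores 1 bit. It is enough to separate poly-C heads from ordinary reads, but it is not a substitute for a real low-complexity filter.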

      • #4
        Originally posted by rskr
        Those look like low-complexity reads. Sometimes they get flagged as low complexity or as having too many multimaps. If they aren't flagged, you can take some of the reads and run BLAST to figure out where they map. They are probably real data; there are regions like that in genomes, many in fact.
        Yes, I found that miRNAs can have long continuous runs of C.
        But for these 100 bp reads, I tried UCSC BLAT and NCBI BLAST, and the reads seem to match nothing.

        Another question: if these are real and I found "CCCC....C" in the left reads, should I find symmetrical reads in the corresponding right reads? Or is that not necessarily the case?
        Last edited by ZoeG; 07-24-2013, 10:23 AM.

        • #5
          Originally posted by ZoeG
          Yes, I found that miRNAs can have long continuous runs of C.
          But for these 100 bp reads, I tried UCSC BLAT and NCBI BLAST, and the reads seem to match nothing.

          Another question: if these are real and I found "CCCC....C" in the left reads, should I find symmetrical reads in the corresponding right reads? Or is that not necessarily the case?
          Did you turn off low-complexity filtering in BLAST and BLAT?

          • #6
            Originally posted by rskr
            Did you turn off low-complexity filtering in BLAST and BLAT?
            After turning off low-complexity filtering, blastn (Megablast) found no significant similarity against the Mouse G+T database. Against the Nucleotide collection (nr/nt) database it returned a list of hits, with one record for mouse: Mus musculus BAC clone RP24-289J17 from chromosome 14, complete sequence (coverage 52%, score 84.2, identity 96%).

            Seems confusing to me..

            • #7
              Let's start with the obvious...what's the quality string look like? I bet it's all just noisy garbage.

              • #8
                Originally posted by ZoeG
                After turning off low-complexity filtering, blastn (Megablast) found no significant similarity against the Mouse G+T database. Against the Nucleotide collection (nr/nt) database it returned a list of hits, with one record for mouse: Mus musculus BAC clone RP24-289J17 from chromosome 14, complete sequence (coverage 52%, score 84.2, identity 96%).

                Seems confusing to me..
                So you are saying you BLASTed it and it didn't return anything, then you turned off low-complexity filtering and BLAST did return something significant with 96% identity. Which part matched? Maybe you found a missing chunk of the mouse genome!

                • #9
                  Coverage was only 52%, though, and most of that was probably the stretch of Cs.
                  You can BLAST against the various genomes at NCBI, and that's as good as you'll get.
                  It's likely just a junk read. With 90% mapped, it's a good enough run. Don't worry about the junk; it's normal. Sometimes these unmapped reads come from contaminating bacteria or viruses, but in your case it's probably just junk.

                  • #10
                    The matched part is the stretch. Those Cs were miserably thrown out.
                    Yes, it seems these reads are just junk. The quality strings of these reads show a lot of '#'.
                    Thanks, all.
                    It is funny that the machine loves only 'C', not A, T or G.
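
The quality-string check suggested in #7 can be automated. The sketch below assumes Phred+33 (Sanger / Illumina 1.8+) encoding, under which '#' decodes to Q2; if the data use the older Phred+64 encoding, the offset would need to change. The function names `phred33` and `low_quality_fraction` and the demo strings are hypothetical.

```python
def phred33(qual):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - 33 for ch in qual]

def low_quality_fraction(qual, threshold=3):
    """Fraction of bases whose Phred score is below `threshold`.

    With the default threshold, only Q2 and lower ('#' and below in
    Phred+33) are counted as low quality.
    """
    scores = phred33(qual)
    if not scores:
        return 0.0
    return sum(s < threshold for s in scores) / len(scores)

# Synthetic quality strings (not the poster's data).
print(phred33("#I"))                      # [2, 40]
print(low_quality_fraction("####IIII"))  # 0.5
```

Reads where this fraction is high, like the poly-C reads described above, are usually candidates for quality trimming or filtering before mapping.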
