
  • Analysis of Unmapped Reads

    Has anyone done more than a perfunctory analysis of unmapped reads in SOLiD and Illumina data?

    I did a quick alignment against UCSC's unassembled data and found that the largest share of the originally unmapped reads map to mitochondrial DNA. However, this was at most ~1% of the originally unmapped reads.

    I suspect that many of the unmapped reads are artifacts such as PCR dimers and self-ligated oligos (for SOLiD). If so, there may be easy molecular protocol steps to remove them.

  • #2
    We've looked at this in human and mouse Illumina ChIP-Seq samples. From a good run there is little to no primer or vector sequence. You can spot these fairly quickly by looking for the most over-represented sequences in your library. Generally you shouldn't see a single sequence which represents more than 0.1% of your library - if you do then these tend to be vectors / primers.
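The over-representation check described above can be sketched roughly as follows. This is only a minimal illustration, not the tool used by the poster: the function names `fastq_sequences` and `overrepresented` are made up for this example, the FASTQ file is assumed to be a plain four-lines-per-record file, and the 0.1% cutoff is simply the rule of thumb from the post.

```python
from collections import Counter


def fastq_sequences(path):
    """Yield the sequence line of each record in a plain 4-line FASTQ file."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of every record holds the bases
                yield line.strip()


def overrepresented(reads, threshold=0.001):
    """Return (sequence, count, fraction) for every sequence whose share
    of the library exceeds `threshold` (0.001 = 0.1%), most common first."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total > threshold
    ]
```

Something like `overrepresented(fastq_sequences("library.fastq"))` would then list candidate primer or vector sequences, with `library.fastq` standing in for your own file. Exact-match counting like this will miss contaminants that differ by a base or two, so treat it as a first pass only.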

    The majority of our unmappable sequences are repeat sequences - mostly centromeric or telomeric repeats but also other classes of repeats. The remainder are sequences of very poor quality where the base calling is probably too unreliable to be useful.

    We now have a repeat mapping pipeline to measure the distributions of different classes of repeats within our libraries, regardless of whether we can map them uniquely, so we don't miss any interesting information which might be in the unmappable fraction.

    This will of course vary hugely between different libraries.



    • #3
      Hi Simon,

      Any chance your repeat mapping pipeline is or will become public, like some of the other nice tools you provide?



      • #4
        Originally posted by mattanswers View Post
        Any chance your repeat mapping pipeline is or will become public
        It wasn't really written with the intention of releasing it, but I'll ask the guy in our group who developed it how easy it would be to package up. It's fairly simple really - there's a script which uses the Ensembl repeat mapping position data to pull out every instance of every type of repeat and put them into a concatenated fasta file. We then make these into bowtie databases and search them in parallel using a script which then makes up statistics about the repeat content of the library being searched. We can then compare this to a reference sample to look for changes.
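For anyone wanting to try something similar, the non-bowtie parts of the approach described above might look roughly like this. It is a sketch under stated assumptions, not the actual pipeline: all function names are hypothetical, the genome is assumed to fit in memory as a dict of chromosome name to sequence, and repeat annotations are assumed to be already parsed into (chrom, start, end, repeat_class) tuples with 0-based, half-open coordinates. Building the bowtie indexes from the per-class FASTA files and running the alignments would happen outside this code.

```python
from collections import defaultdict


def repeats_by_class(genome, annotations):
    """Collect every annotated repeat instance into one FASTA text per class.

    genome: dict mapping chromosome name -> sequence string.
    annotations: iterable of (chrom, start, end, repeat_class) tuples,
    0-based half-open coordinates (an assumption of this sketch).
    """
    records = defaultdict(list)
    for chrom, start, end, repeat_class in annotations:
        seq = genome[chrom][start:end]
        records[repeat_class].append(
            f">{repeat_class}|{chrom}:{start}-{end}\n{seq}"
        )
    return {cls: "\n".join(recs) + "\n" for cls, recs in records.items()}


def repeat_content(hits_per_class, total_reads):
    """Fraction of the library that hit each repeat class."""
    return {cls: n / total_reads for cls, n in hits_per_class.items()}


def fold_change(sample, reference):
    """Ratio of sample to reference repeat content, for classes that are
    present in both and non-zero in the reference."""
    return {
        cls: sample[cls] / reference[cls]
        for cls in sample
        if reference.get(cls)
    }
```

Each value returned by `repeats_by_class` would be written to its own FASTA file and indexed separately, so a hit against a given index tells you the repeat class directly. The hit counts fed into `repeat_content` would come from whatever aligner output you parse.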



        • #5
          Originally posted by simonandrews View Post
          It wasn't really written with the intention of releasing it, but I'll ask the guy in our group who developed it how easy it would be to package up. It's fairly simple really - there's a script which uses the Ensembl repeat mapping position data to pull out every instance of every type of repeat and put them into a concatenated fasta file. We then make these into bowtie databases and search them in parallel using a script which then makes up statistics about the repeat content of the library being searched. We can then compare this to a reference sample to look for changes.
          Would RepBase be another way of doing this? I think it also allows you to identify what kind of repeat you have matched.



          • #6
            Originally posted by NGSfan View Post
            Would RepBase be another way of doing this? I think it also allows you to identify what kind of repeat you have matched.
            The way we do it also allows you to see what you matched, since the databases are split by repeat type. Using RepBase directly would make the mapping more difficult, since many of their entries are consensus sequences, whereas we pull out every individual instance of that repeat from the genome.



            • #7
              Originally posted by simonandrews View Post
              The way we do it also allows you to see what you matched, since the databases are split by repeat type. Using RepBase directly would make the mapping more difficult, since many of their entries are consensus sequences, whereas we pull out every individual instance of that repeat from the genome.
              Nice, I didn't know you had them separated by repeat type. This is a very useful and simple approach.


              Speaking of unmapped reads: I recently did a targeted enrichment with SureSelect of a subset of the genome (~500 genes). I also found that some portion of the unmapped reads aligned perfectly to mitochondrial genes. Is this a common contaminant? Is there so much mtDNA that not even a hybridization-based selection can clean all of it up?



              • #8
                Originally posted by NGSfan View Post
                I found also that some portion of the unmapped reads aligned perfectly to mitochondrial genes. Is this a common contaminant? Is there so much mtDNA that not even a hybridized selection can clean all of it up?
                The mitochondrion is in fairly large copy number excess relative to the chromosomes so you often see a relatively high coverage of mitochondrial sequences compared to the rest of the genome. We've not seen it take up a significant proportion of a library though - maybe it depends on how you make and clean up your libraries?

