
  • Mapping to specified parts of a genome?

    Is it possible to map reads to specified parts of the genome using BWA, Bowtie, or other tools, without directly customizing the reference genome or mapping to single chromosomes?

    E.g. if one wants to map reads to several distinct positions on the same chromosome?

  • #2
    Do you strictly want to avoid mapping to other regions, or are you only interested in reads that map to the specified regions? In the former case you would have to trim the reference. In the latter case you could map to the entire genome and then use the "samtools view region" option to look only at alignments in the region(s) of interest: http://www.biostars.org/p/16662/ or http://seqanswers.com/forums/showthread.php?t=24237
    Last edited by GenoMax; 10-08-2013, 07:11 AM.



    • #3
      I see two overall strategies:

      1) Map to the whole genome, then filter away all the reads that align to places you don't want

      2) Filter the reference genome ahead of time.

      Option 1 is smarter: you don't want reads being forced to align to your favorite regions when they really belong somewhere else. But samtools can help you with both. samtools faidx can be used to make a FASTA that is just a particular region of a genome. Make a little script to run it over and over again if you have a lot of regions, then align to that. Alternatively, samtools view can filter a .bam file, and BEDTools can too.
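The "little script" idea above can be sketched roughly as follows. This is a minimal illustration, not a tested pipeline: it only builds the command lines for `samtools faidx` (extract regions from an indexed FASTA) and `samtools view` (keep alignments overlapping regions in an indexed BAM) and prints them. The file names `mm9.fa` and `aligned.sorted.bam` and the example coordinates are placeholders invented for this sketch.

```python
from typing import List, Tuple

# (chromosome, 1-based start, inclusive end), the coordinate style samtools
# uses in region strings such as "chr1:1000001-3000000".
Region = Tuple[str, int, int]


def region_string(region: Region) -> str:
    """Format a region the way samtools expects it on the command line."""
    chrom, start, end = region
    return f"{chrom}:{start}-{end}"


def faidx_command(reference: str, regions: List[Region]) -> List[str]:
    """Strategy 2: extract the regions from the reference FASTA."""
    return ["samtools", "faidx", reference] + [region_string(r) for r in regions]


def view_command(bam: str, regions: List[Region]) -> List[str]:
    """Strategy 1: keep only alignments overlapping the regions (BAM output)."""
    return ["samtools", "view", "-b", bam] + [region_string(r) for r in regions]


# Hypothetical example regions; replace with your own list.
regions = [("chr1", 1000001, 3000000), ("chr1", 50000001, 52000000)]
print(" ".join(faidx_command("mm9.fa", regions)))
print(" ".join(view_command("aligned.sorted.bam", regions)))
```

Both commands write to standard output, so in practice you would redirect the result to a file. Region queries with `samtools view` need a coordinate-sorted, indexed BAM.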



      • #4
        Perhaps I should explain more. I want to map to specific regions in the first place to decrease computation time, but I still want the resulting alignment files to be compatible with the reference genome annotation, which in my case is mm9, e.g. when using tracks in the UCSC browser. I'm aware of the possibility to "view" specified regions in samtools; the problem is that I would have to align to the entire reference beforehand and filter afterwards. Is it possible to mask out unwanted sequence from the reference prior to alignment, e.g. with NNNNNNs?

        The estimated computation time for aligning to the full reference would be approx. 1000 hours for all data sets on an i7 with 16 GB of RAM running Ubuntu. We do not have a dedicated workstation in the lab for this kind of analysis.

        Also, in answer to GenoMax's reply: we are only interested in ~60 specific regions, which are sample dependent. These regions could comprise 2 Mbases each, or ~5% of the murine reference genome in total.
        Last edited by puggie; 10-08-2013, 10:18 AM.



        • #5
          It sounds like you are going to have to do some work, either upfront (mask the unwanted regions, keeping the length constant) or later (align to your regions of interest and then edit the alignment files to make it appear as if you aligned to the whole genome).

          For the first, you can use bedtools to mask the genome if you have a BED file of your target regions: http://code.google.com/p/bedtools/wiki/UsageAdvanced or http://bedtools.readthedocs.org/en/l...ight=maskfasta. You will then have to create the indexes and run the alignments. Some editing may still be required to view the results in the UCSC browser.
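To make the "mask but keep the length constant" idea concrete, here is a small pure-Python sketch of what the masking does to a single sequence. It is not how bedtools is implemented; the function names and the toy sequence are made up for illustration. Intervals are assumed to be BED-style (0-based start, exclusive end).

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # BED-style: 0-based start, exclusive end


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask_outside(sequence: str, targets: List[Interval]) -> str:
    """Replace every base outside the target intervals with N.

    The result has the same length as the input, so coordinates of the
    retained regions still match the original reference.
    """
    pieces = []
    cursor = 0
    for start, end in merge_intervals(targets):
        start = max(start, 0)
        end = min(end, len(sequence))
        if start >= end:
            continue  # interval lies entirely outside the sequence
        pieces.append("N" * (start - cursor))  # mask the gap before the target
        pieces.append(sequence[start:end])     # keep the target as-is
        cursor = end
    pieces.append("N" * (len(sequence) - cursor))  # mask the tail
    return "".join(pieces)


# Toy example: keep positions [2, 4) and [7, 9) of a 10-base sequence.
print(mask_outside("ACGTACGTAC", [(2, 4), (7, 9)]))  # NNGTNNNTAN
```

Note that `bedtools maskfasta` masks the intervals listed in the BED file, so to keep the targets and mask everything else you would presumably feed it the complement of the target regions (for example from `bedtools complement`).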

          Just to point this out as a feature in case someone is interested: CLC Genomics Workbench allows masking a reference genome (include/exclude) easily by specifying the regions to be masked as a bed file during alignment.
          Last edited by GenoMax; 10-08-2013, 11:04 AM.
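The second option in the post above (align to the extracted regions, then edit the alignments so they look like whole-genome alignments) comes down to a coordinate shift. The sketch below assumes the reference sequences were named in the `chrom:start-end` style that `samtools faidx` writes for extracted regions, with 1-based inclusive coordinates; the function names are hypothetical.

```python
from typing import Tuple


def parse_region_name(name: str) -> Tuple[str, int, int]:
    """Split a name like 'chr1:1000001-3000000' into (chrom, start, end)."""
    chrom, _, span = name.rpartition(":")
    start_text, _, end_text = span.partition("-")
    return chrom, int(start_text), int(end_text)


def lift_position(region_name: str, local_pos: int) -> Tuple[str, int]:
    """Convert a 1-based position within a region to genome coordinates."""
    chrom, region_start, region_end = parse_region_name(region_name)
    genome_pos = region_start + local_pos - 1
    if not region_start <= genome_pos <= region_end:
        raise ValueError(f"position {local_pos} lies outside {region_name}")
    return chrom, genome_pos


# A read aligned at position 250 of the extracted region
# 'chr1:1000001-3000000' sits at chr1:1000250 in the full genome.
print(lift_position("chr1:1000001-3000000", 250))  # ('chr1', 1000250)
```

A full conversion of a SAM/BAM file would also have to rewrite the header's sequence entries and the mate reference name and position fields, which this sketch leaves out.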



          • #6
            OK, thanks for the suggestion; I will have a look at the bedtools possibilities. I am also aware that mapping all reads to a few specified regions may result in "false" alignments. I believe this can be handled by sample-to-sample comparison as well as repeat masking.

            It will be interesting to see whether restricting the mapping to the specified regions actually decreases computation time.
            Last edited by puggie; 10-08-2013, 11:39 AM.



            • #7
              Originally posted by swbarnes2
              I see two overall strategies:

              1) Map to the whole genome, then filter away all the reads that align to places you don't want

              2) filter the reference genome ahead of time.

              1 is smarter, you don't want reads being forced to align to your favorite regions when they really belong somewhere else.
              This is very true. Several years ago I aligned 454 reads from a dog sequence capture experiment to just the target region and, separately, to the whole genome. Aligning to just the target region resulted in some pretty terrible data; when the whole genome was used, the quality was much better. That was quite a few Newbler versions ago. I'm not sure what it would look like now, but I imagine it would still be problematic.

              Cloud computing and storage space are pretty affordable these days, whereas problems from false mapping could be very time-consuming.



              • #8
                Jeremy, do you remember how large your target region was as a percentage of the total, and what the read length was?



                • #9
                  The target region was a little over 5 Mb, and about 60% of the reads were on target. Read lengths were between 200 and 400 bp, I think.
