
  • 2b-RAD and GATK

    Hello all,

    I have some Type 2B RAD data for many individuals from several populations of my non-model species. I have made a reference to map the tags to by extracting all potential RAD sites from the available genome. I am now trying to discover SNPs among the individuals.

    I've previously used samtools mpileup and a home-made haplotype caller, and have also tried Stacks. My conclusion is that the SNPs Stacks calls do not agree with those from the other two methods, even after accounting for a weird indexing issue that appears to be going on. I am skeptical that Stacks is appropriate for Type 2B RAD.

    I would ideally like to run the same SNP analysis using GATK, then find the intersection of SNPs called using both mpileup and GATK. However, I can't get GATK to run!
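The intersection step described above could be sketched roughly as follows. This is only an illustration under assumptions: both callers are assumed to write standard tab-separated VCF records, the function name `read_snp_sites` is made up for this example, and the toy records are invented rather than taken from the poster's data.

```python
def read_snp_sites(vcf_lines):
    """Collect (chrom, pos, ref, alt) for single-base substitutions in VCF text lines."""
    sites = set()
    for line in vcf_lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        # A record may list several comma-separated ALT alleles.
        for alt in fields[4].split(","):
            if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
                sites.add((chrom, pos, ref, alt))
    return sites


# Invented toy records standing in for the two callers' output.
mpileup_vcf = [
    "##fileformat=VCFv4.1",
    "tag1\t12\t.\tA\tG\t50\t.\t.",
    "tag2\t5\t.\tC\tT\t40\t.\t.",
]
gatk_vcf = [
    "##fileformat=VCFv4.1",
    "tag1\t12\t.\tA\tG\t61\t.\t.",
    "tag3\t20\t.\tG\tA\t33\t.\t.",
]

shared = read_snp_sites(mpileup_vcf) & read_snp_sites(gatk_vcf)
print(sorted(shared))
```

Keying on the alleles as well as the position means a site only counts as shared when both callers report the same substitution there.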

    I have trimmed, filtered, and mapped (bowtie1) reads with added read groups in individual.sorted.bam format. I would like to run the GATK UnifiedGenotyper on a single individual as a first pass, then refine that SNP list using BQSR, VQSR and multi-sample UnifiedGenotyper SNP identification.

    However, here is the problem: even on our available supercomputer and even using -nt and -nct, this single individual will take 4.9 weeks to process!! I am surprised at this. The individual in question contains 6,925,188 36-bp reads, mapped to a reference made of every possible RAD site in the genome: 1,624,953 36-bp contigs.

    A collaborator suggested that GATK massively increases run time with increasing numbers of reference contigs, so I went back to my .sam files and deleted any reference RAD site from my reference.fasta that was not seen more than 100 times among all my individuals. This reduced my reference from the 1.6 million potential tags to 95,000 tags that are actually seen in my data. Still, this did not solve (or even appreciably decrease) the amount of time the UnifiedGenotyper predicts.
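A minimal sketch of the reference-pruning step described above, assuming plain-text SAM records and a FASTA reference. The helper names `count_hits` and `filter_fasta` are hypothetical, and the toy data uses a threshold of 1 instead of the 100 mentioned in the post so that the example stays small.

```python
from collections import Counter


def count_hits(sam_lines):
    """Count mapped reads per reference name from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & 4 or rname == "*":  # FLAG bit 0x4 marks an unmapped read
            continue
        counts[rname] += 1
    return counts


def filter_fasta(fasta_lines, keep):
    """Return only the FASTA records whose name is in the set `keep`."""
    kept, writing = [], False
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            writing = bool(parts) and parts[0] in keep
        if writing:
            kept.append(line)
    return kept


# Invented toy data; in practice the SAM lines would be pooled over all individuals.
sam = [
    "@SQ\tSN:tagA\tLN:4",
    "r1\t0\ttagA\t1\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\ttagA\t1\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t16\ttagB\t1\t255\t4M\t*\t0\t0\tGGCC\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
]
fasta = [">tagA", "ACGT", ">tagB", "GGCC", ">tagC", "TTAA"]

min_hits = 1  # the post used 100
counts = count_hits(sam)
keep = {name for name, n in counts.items() if n > min_hits}
pruned = filter_fasta(fasta, keep)
print(pruned)
```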

    Does anyone have any ideas about what is going on here, or, better, how to overcome this problem? I would really like to use GATK!

    Thank you!

  • #2
    Have you tried running it against the reference without extracting potential RAD sites? I've done large populations using novoalign against a reference (RAD or nextRAD) and the time is trivial. I also routinely take a population, identify the tags in the population (you can decide to include only predominant tags or all tags), and then align those tags against each other to determine alleles, and then count those alleles in each sample.

    Sorry to not address your question, but these are paths that work for me for RAD and nextRAD. I assume 2b-RAD would behave in similar ways.
    Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com



    • #3
      Interesting. I had previously resisted mapping to the whole genome (12.5k contigs) because I felt like there may be increased mapping error around the restriction site. However, I don't have any data about this yet, just a gut feeling that feeding the mapper known sites selected for their restriction sites would improve this. But... on the other hand, I filter and trim my raw reads and insist that each one contains the RAD site, so maybe I'm just being cautious for no reason.

      Anyway -- I did what you suggested and mapped an individual to the whole genome, then ran the UnifiedGenotyper on that guy. We're down to 51 minutes! Looks like the reference database size does indeed make a huge time difference (12.5k contigs/1 hour vs 95k contigs/4.9 weeks...)

      Another idea that was suggested to me was to extract the RAD sites from the genome, then concatenate them into artificial chromosomes, potentially using strings of 1000 N's to separate each tag. I think this is a good idea, but more complicated than just mapping to the genome.
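The artificial-chromosome idea could look something like the sketch below. It is an illustration only: `build_pseudo_chromosome` and `locate` are hypothetical helper names, the tags are invented, and a spacer of 3 N's is used in the demo so the output is readable (the post suggests 1000).

```python
def build_pseudo_chromosome(tags, spacer_len=1000):
    """Join (name, sequence) tags into one sequence separated by runs of N.

    Returns the joined sequence and a dict mapping tag name -> 0-based start offset.
    """
    spacer = "N" * spacer_len
    pieces, offsets, position = [], {}, 0
    for i, (name, seq) in enumerate(tags):
        if i > 0:  # spacer goes between tags, not before the first one
            pieces.append(spacer)
            position += spacer_len
        offsets[name] = position
        pieces.append(seq)
        position += len(seq)
    return "".join(pieces), offsets


def locate(position, tags, offsets):
    """Map a 0-based pseudo-chromosome position back to (tag name, offset within tag)."""
    for name, seq in tags:
        start = offsets[name]
        if start <= position < start + len(seq):
            return name, position - start
    return None  # the position falls inside a spacer


tags = [("t1", "ACGT"), ("t2", "GGCC")]  # invented example tags
chrom, offsets = build_pseudo_chromosome(tags, spacer_len=3)
print(chrom)
print(offsets)
print(locate(8, tags, offsets))
```

Keeping the offset table around matters because any SNP called on the artificial chromosome has to be translated back to its original tag afterwards.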

      Can you think of any reason why mapping to the genome instead of to the sites themselves would be better or worse? I can think of pros and cons for each side, but I am not convinced either way yet. Obviously, though, the computational time alone will make up my mind.



      • #4
        To me, the biggest source of error when mapping to a whole genome is from alignment to duplicate genes/genomic regions. Sometimes a tag will align to multiple locations. One is true, and the others spurious. But a sequencing error may be enough to shift the tag from one location to the other. Or, because few genomes are "complete", your tag may map to the duplicate region rather than its real locus.

        But these are not serious issues. A small loss of tags from tossing out ones that map to multiple locations is just not an issue when you have 10k (or 100k) markers to play with. And in the second scenario, the usual information desired is comparing your samples to each other. So if all the samples map to a spurious locus because of missing sequence in the reference, they will be true in comparison to each other. Mapping issues might be more of a problem with 2b-RAD with its short sequence length, but I bet it will be OK.
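Tossing out tags that map to multiple locations could be sketched as below. This assumes the mapper was run so that every alignment of a read is reported, and that the alignments have already been parsed into (read name, contig, position) tuples; `unique_alignments` is a hypothetical helper and the records are invented.

```python
from collections import defaultdict


def unique_alignments(alignments):
    """Keep only reads that align to exactly one distinct (contig, position)."""
    locations = defaultdict(set)
    for read, contig, pos in alignments:
        locations[read].add((contig, pos))
    return {
        read: next(iter(locs))
        for read, locs in locations.items()
        if len(locs) == 1
    }


# Invented example: r2 hits two loci and is discarded.
alignments = [
    ("r1", "contig1", 10),
    ("r2", "contig1", 50),
    ("r2", "contig7", 300),
    ("r3", "contig2", 5),
]
kept = unique_alignments(alignments)
print(kept)
```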
        Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com
