  • alignment questions...pooled-seq, multiple references, pseudogenes, plastid genomes

    Hi Everyone,

    I am working with pooled-seq data from several plant populations. From this data we are primarily interested in obtaining nuclear SNP frequencies by population but would also like to recover whatever information we can about the chloroplast and mitochondrial genome. I have the following questions:

    1) Is it standard to always include the mitochondria and chloroplast genome as part of the reference genome or do people usually only align reads to the nuclear genome?

    2) Do we have to set any of the parameters in BWA differently if we have multiple reference genomes in the same fasta file (e.g. the nuclear, chloroplast and mitochondrial genome)?

    3) Is it straightforward to separate the results for the nuclear and plastid genomes downstream (e.g. is it that by indexing the reference genome, we will be able to somehow partition the SAM/BAM file into data for our analysis of just nuclear SNP frequencies and data that goes into our analysis of the plastid genomes)?

    4) Finally, wondering if anyone has a sense of how common plastid pseudogenes are in plant nuclear genomes? My thinking with the alignment of our reads to all three genomes is (in part) that we will be able to detect and avoid these types of pseudogenes but how important is this?

    Thank you in advance for the help! Sorry if some of these are naive questions: still new to all of this!

  • #2
    I cannot help with the plant specific questions. But, for mapping, I usually have my reference genome, and then a filter file to exclude things mapping to mtDNA and non-protein coding genomic regions. So anything getting mapped to the filter file contents just gets excluded from the genomic mapping altogether.

    Of course, I don't care about mitochondrial genes or non-coding genes, and that sort of filter mapping is very simple to set up in LifeScope.

    In your case, it depends on what you'd like to do with such genes. If you want to analyze them independently downstream, then using a filter reference may be best (you can save the results of the filter reference mapping separately from the genomic mapping). That way you get your mtDNA and chloroplast mappings separated from your genomic mappings in the same run.

    Otherwise you can include everything into one reference genome file and map to that. It just depends on what you intend to do with it all later on, and which will give you the most straightforward file or set of files.
    Michael Black, Ph.D.
    ScitoVation LLC. RTP, N.C.

    • #3
      Originally posted by jullee
      1) Is it standard to always include the mitochondria and chloroplast genome as part of the reference genome or do people usually only align reads to the nuclear genome?
      Yes, they should all be together when mapping for greatest accuracy.
      2) Do we have to set any of the parameters in BWA differently if we have multiple reference genomes in the same fasta file (e.g. the nuclear, chloroplast and mitochondrial genome)?
      No. Most references will have multiple scaffolds anyway; aligners don't care if they are different chromosomes, different organelles, or different organisms.
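Since everything downstream will be told apart by sequence name, it may be worth checking that the names are unique before concatenating the FASTA files. Here is a minimal Python sketch of that check; the function name `merge_fasta` and the sequence names `chr1`, `chrC` and `chrM` are made up for illustration and are not from any particular reference.

```python
def merge_fasta(texts):
    """Concatenate FASTA texts into one reference, refusing duplicate names."""
    seen = []    # sequence names in the order they were read
    merged = []  # output lines
    for text in texts:
        for line in text.splitlines():
            if not line.strip():
                continue  # skip blank lines between records
            if line.startswith(">"):
                fields = line[1:].split()
                if not fields:
                    raise ValueError("FASTA header with no sequence name")
                name = fields[0]  # the first word of the header is taken as the name
                if name in seen:
                    raise ValueError(f"duplicate sequence name: {name}")
                seen.append(name)
            merged.append(line)
    return "\n".join(merged) + "\n", seen


if __name__ == "__main__":
    # Hypothetical toy sequences standing in for the three genomes.
    nuclear = ">chr1\nACGT\n>chr2\nGGCC\n"
    chloroplast = ">chrC\nTTAA\n"
    mitochondrion = ">chrM\nCCGG\n"
    combined, names = merge_fasta([nuclear, chloroplast, mitochondrion])
    print(names)
```

The combined text can then be written to a single file and indexed as one reference; the list of names is what you would later use to tell nuclear and organelle alignments apart.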
      3) Is it straightforward to separate the results for the nuclear and plastid genomes downstream (e.g. is it that by indexing the reference genome, we will be able to somehow partition the SAM/BAM file into data for our analysis of just nuclear SNP frequencies and data that goes into our analysis of the plastid genomes)?
      Well... it is if you use BBSplit.

      bbsplit.sh ref=plant.fa,mito.fa,chloroplast.fa in=reads.fastq basename=out_%.sam outu=unmapped.fastq -Xmx29g

      Adjust "-Xmx29g" to roughly 85% of the computer's total RAM. The command aligns to all of the references at once but writes one output file per reference, plus a file of unmapped reads:
      out_plant.sam (reads mapped best to the plant)
      out_mito.sam (reads mapped best to mito)
      out_chloroplast.sam (reads mapped best to chloroplast)
      unmapped.fastq (reads that did not map)

      If you have paired reads, you can use "in1=read1.fq in2=read2.fq" for input.
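If you want to script this, here is a rough Python sketch that sizes the memory flag at 85% of total RAM and assembles the argument list. It uses only the options quoted above (`ref=`, `in=`, `in1=`, `in2=`, `basename=`, `outu=`, `-Xmx`); the helper names `heap_flag` and `bbsplit_command` are hypothetical, and the sketch only builds the command without running it.

```python
import math


def heap_flag(total_ram_gb, fraction=0.85):
    """Return a -Xmx flag sized to a fraction of total RAM, in whole gigabytes."""
    gb = max(1, math.floor(total_ram_gb * fraction))
    return f"-Xmx{gb}g"


def bbsplit_command(refs, reads, total_ram_gb,
                    basename="out_%.sam", unmapped="unmapped.fastq"):
    """Build the bbsplit.sh argument list for single-end or paired reads."""
    if len(reads) == 1:
        inputs = [f"in={reads[0]}"]
    elif len(reads) == 2:
        inputs = [f"in1={reads[0]}", f"in2={reads[1]}"]
    else:
        raise ValueError("expected one read file or a pair of read files")
    return [
        "bbsplit.sh",
        "ref=" + ",".join(refs),   # all references are aligned against at once
        *inputs,
        f"basename={basename}",    # % is replaced by each reference's name
        f"outu={unmapped}",
        heap_flag(total_ram_gb),
    ]


if __name__ == "__main__":
    cmd = bbsplit_command(["plant.fa", "mito.fa", "chloroplast.fa"],
                          ["read1.fq", "read2.fq"], total_ram_gb=34)
    print(" ".join(cmd))
```

The list can be passed to `subprocess.run` if you would rather launch it from Python than from the shell.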

      4) Finally, wondering if anyone has a sense of how common plastid pseudogenes are in plant nuclear genomes? My thinking with the alignment of our reads to all three genomes is (in part) that we will be able to detect and avoid these types of pseudogenes but how important is this?
      Pseudogenes are not conserved, so I wouldn't worry about them too much as long as you align to all references at once: they accumulate SNPs, so pseudogene reads go to the pseudogenes and real-gene reads go to the real genes. Aligning to the references separately would be more problematic.
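To make the "align to everything at once" argument concrete, here is a toy Python illustration. It is not how BWA or BBSplit score alignments: it just counts mismatches for ungapped placements, and the two 20-base sequences (a chloroplast gene and a nuclear copy differing by one SNP) are invented for the example.

```python
def mismatches(read, ref):
    """Fewest mismatches over all ungapped placements of read in ref."""
    best = None
    for start in range(len(ref) - len(read) + 1):
        window = ref[start:start + len(read)]
        count = sum(a != b for a, b in zip(read, window))
        if best is None or count < best:
            best = count
    return best  # None if the read is longer than the reference


def best_reference(read, refs):
    """Name of the reference with the fewest mismatches; None on a tie."""
    scores = {}
    for name, seq in refs.items():
        score = mismatches(read, seq)
        if score is not None:
            scores[name] = score
    if not scores:
        return None
    lowest = min(scores.values())
    winners = [name for name, score in scores.items() if score == lowest]
    return winners[0] if len(winners) == 1 else None


# Invented sequences: the nuclear pseudogene carries one SNP (C -> A).
GENE = "ACGTACGTTAGCCGATAGCT"
PSEUDOGENE = "ACGTACGTTAGCAGATAGCT"
PSEUDO_READ = "GTTAGCAGATAG"  # read covering the SNP in the pseudogene copy

if __name__ == "__main__":
    both = {"chloroplast": GENE, "nuclear_pseudogene": PSEUDOGENE}
    print(best_reference(PSEUDO_READ, both))
    print(best_reference(PSEUDO_READ, {"chloroplast": GENE}))
```

With both sequences present, the pseudogene read has an exact match in the nuclear copy and goes there. With only the chloroplast sequence available, the same read is placed on the real gene with one mismatch, which is the situation that would show up as a false SNP.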

      • #4
        Thank you mbblack and Brian Bushnell for the replies!

        Unfortunately, I think my collaborator is quite set on working with BWA for the alignment and he will be doing that part of the pipeline. It looks like I will be coming in afterwards and be in the position of needing to extract mtDNA and cpDNA information from the resulting SAM or BAM file (e.g. the goal would be to partition the full SAM file into three new separate files for the nuclear, mtDNA and cpDNA data). I think this may be pretty straightforward using standard unix commands, but this will be my first time working with these types of files so if anyone has any additional thoughts, I'd be interested in hearing them...

        Thanks!
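For the partitioning step, the reference name is the third tab-separated column (RNAME) of each SAM alignment line, so the split can be done on that column alone. Below is a minimal Python sketch under a few assumptions: the organelle sequences are named `chrC` and `chrM` (hypothetical, use whatever names are in your FASTA headers), everything else that is mapped counts as nuclear, and the full header is simply copied into every group rather than trimmed.

```python
from collections import defaultdict


def partition_sam(lines, organelles):
    """Group SAM lines by reference name (RNAME, column 3).

    organelles maps a reference sequence name to a group label. Any other
    mapped read is labelled "nuclear"; reads with RNAME "*" are "unmapped".
    Header lines (starting with "@") are copied to the front of every group.
    """
    header = []
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
            continue
        rname = line.split("\t")[2]
        if rname == "*":
            label = "unmapped"
        else:
            label = organelles.get(rname, "nuclear")
        groups[label].append(line)
    return {label: header + records for label, records in groups.items()}


# Tiny made-up SAM content, just enough to exercise the function.
EXAMPLE_SAM = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chrC\tLN:500",
    "@SQ\tSN:chrM\tLN:400",
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchrC\t5\t60\t4M\t*\t0\t0\tTTAA\tIIII",
    "r3\t16\tchrM\t7\t60\t4M\t*\t0\t0\tCCGG\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tGGGG\tIIII",
]

if __name__ == "__main__":
    parts = partition_sam(EXAMPLE_SAM, {"chrC": "chloroplast", "chrM": "mitochondrion"})
    for label, sam_lines in parts.items():
        print(label, len(sam_lines))
```

To use it on a real file, pass an open file handle as `lines` and write each group to its own `.sam` file. If the alignments are in a sorted, indexed BAM instead, `samtools view` with a reference name as the region should give the same kind of subset without any custom code.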
