
  • batch effect in radseq

    Hi all,

    We sent 600 samples of a plant species (no reference genome) for DNA extraction, library prep and ddRADseq (paired-end, Illumina HiSeq). First, 200 samples were sequenced as a pilot; the other 400 then followed the same procedure. The raw reads were then:

    - Demultiplexed and adapter-clipped
    - Filtered by restriction enzyme cut site
    - Merged and clustered using CD-HIT-EST to form the reference contigs
    - Quality-trimmed
    - Aligned against the reference contigs (Bowtie2)
    - Used for variant discovery and SNP calling with Freebayes

    Then, several filters were applied to remove less informative SNPs or samples:
    - filter out SNPs with missingness (% of samples with missing data) above 10%
    - filter out samples with missingness (% of SNPs with missing data) above 10%
    - filter out SNPs with minor allele frequency (frequency of the less represented allele) below 5%

    This left roughly 500 samples and 1500 SNPs.
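
A minimal sketch of the three filters listed above, assuming the genotypes have already been loaded into a samples x SNPs matrix of alternate-allele counts (0, 1, 2) with -1 for missing calls. The function name `filter_genotypes`, the -1 coding, and the choice to apply the filters one after another in the listed order are assumptions for illustration, not a description of the pipeline actually used.

```python
import numpy as np

def filter_genotypes(gt, max_snp_missing=0.10, max_sample_missing=0.10, min_maf=0.05):
    """Apply SNP-missingness, sample-missingness and MAF filters in that order.

    gt: samples x SNPs array of alt-allele counts (0/1/2), -1 = missing (assumed coding).
    Returns (filtered matrix, kept sample indices, kept SNP indices),
    with indices referring to the rows/columns of the input matrix.
    """
    gt = np.asarray(gt)

    # 1) Drop SNPs that are missing in too many samples.
    snp_keep = np.where((gt < 0).mean(axis=0) <= max_snp_missing)[0]
    gt = gt[:, snp_keep]

    # 2) Drop samples that are missing too many of the remaining SNPs.
    sample_keep = np.where((gt < 0).mean(axis=1) <= max_sample_missing)[0]
    gt = gt[sample_keep, :]

    # 3) Drop SNPs whose minor allele frequency is below the threshold.
    called = gt >= 0
    n_called = called.sum(axis=0)
    alt = np.where(called, gt, 0).sum(axis=0)
    alt_freq = np.divide(alt, 2 * n_called,
                         out=np.full(n_called.shape, np.nan),
                         where=n_called > 0)
    maf = np.minimum(alt_freq, 1 - alt_freq)
    maf_keep = maf >= min_maf  # NaN (no calls at all) compares False and is dropped

    return gt[:, maf_keep], sample_keep, snp_keep[maf_keep]
```

Because missingness depends on which samples and SNPs are still in the matrix, changing the order of the filters can change what is kept. Running the same function on each batch separately and comparing the retained SNP sets is one way to check whether the filters themselves behave differently between pilot and non-pilot samples.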

    The results show a clear batch effect between the samples sequenced as pilot and those sequenced after. Particularly:

    - The PCA shows a clear separation on the first axis (explaining 3% of the total variance) between the pilot and the rest of the samples.
    - The samples from the pilot show an overall higher level of heterozygosity (12% in the pilot against 9% in the rest of the samples). This excess of heterozygous calls is spread across many loci.
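
The heterozygosity comparison above can be sketched as follows, assuming the same hypothetical genotype matrix (alt-allele counts 0/1/2, -1 for missing) and one batch label per sample. Heterozygosity is taken here as the fraction of a sample's called genotypes that are heterozygous, which is an assumption about how the 12% and 9% figures were computed.

```python
import numpy as np

def heterozygosity_by_batch(gt, batches):
    """Mean per-sample observed heterozygosity for each batch label.

    gt: samples x SNPs array of alt-allele counts, -1 = missing (assumed coding).
    batches: one label per sample (e.g. "pilot" / "rest").
    """
    gt = np.asarray(gt)
    batches = np.asarray(batches)

    n_called = (gt >= 0).sum(axis=1)   # called genotypes per sample
    n_het = (gt == 1).sum(axis=1)      # heterozygous calls per sample
    per_sample = np.divide(n_het, n_called,
                           out=np.full(gt.shape[0], np.nan),
                           where=n_called > 0)

    # Average the per-sample rates within each batch, ignoring samples with no calls.
    return {str(b): float(np.nanmean(per_sample[batches == b]))
            for b in np.unique(batches)}
```

Summing over `axis=0` instead of `axis=1` within each batch gives a per-locus rate, which helps check whether the excess is spread over many loci or driven by a few.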

    I've spent a few weeks trying to identify a technical factor that could explain these differences. The most remarkable technical difference I found between the samples is:

    - I calculated the median read count for each sample, then compared it between pilot and non-pilot samples. The mean of this parameter is higher in the pilot samples, and its variance is half that of the other samples. This is shown in the attached figure (AB: pilot samples, CDEF: other samples).
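
A minimal sketch of that depth comparison, assuming a samples x loci matrix of read counts and taking the per-sample median across loci (the post does not say what the median is taken over, so this is an assumption). The function name `depth_summary_by_batch` is hypothetical.

```python
import numpy as np

def depth_summary_by_batch(depth, batches):
    """Mean and sample variance of per-sample median depth, per batch.

    depth: samples x loci array of read counts (assumed layout).
    batches: one label per sample.
    """
    depth = np.asarray(depth, dtype=float)
    batches = np.asarray(batches)

    per_sample_median = np.median(depth, axis=1)

    summary = {}
    for b in np.unique(batches):
        vals = per_sample_median[batches == b]
        summary[str(b)] = {
            "mean": float(vals.mean()),
            # Sample variance needs at least two samples in the batch.
            "var": float(vals.var(ddof=1)) if vals.size > 1 else float("nan"),
        }
    return summary
```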

    My questions are:
    - Can a difference in sequencing depth cause an increase or decrease in heterozygous calls in RADseq? Is there a read-processing step that could overcome this problem?
    - Can you see other possible causes of these batch effects?

    Thank you in advance.


  • #2
    The combination of higher read count and higher heterozygosity is certainly suggestive, since more reads give you a better chance to see and call a second allele, or to create an error artifact that looks like a second allele. You might redo the genotype calls with the read depth at each nucleotide capped at 20 and see if the difference is reduced.
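
One way to try the depth cap without touching the alignments is to downsample per-site allele counts, sketched below. This assumes reference and alternate read counts per site are available as arrays; `cap_depth` is a hypothetical helper and does not reproduce Freebayes's own genotype model. Sites at or below the cap are left unchanged, and sites above it are subsampled without replacement so the allele balance is preserved on average.

```python
import numpy as np

def cap_depth(ref_counts, alt_counts, cap=20, seed=0):
    """Randomly downsample per-site allele counts to at most `cap` reads.

    ref_counts, alt_counts: arrays of read counts per site (assumed inputs).
    Returns (new_ref, new_alt) with new_ref + new_alt <= cap at every site.
    """
    rng = np.random.default_rng(seed)
    new_ref = np.array(ref_counts, dtype=np.int64)
    new_alt = np.array(alt_counts, dtype=np.int64)

    over = (new_ref + new_alt) > cap
    if over.any():
        # Number of ref reads among `cap` reads drawn without replacement.
        drawn = rng.hypergeometric(new_ref[over], new_alt[over], cap)
        new_ref[over] = drawn
        new_alt[over] = cap - drawn
    return new_ref, new_alt
```

The capped counts would then be passed to whatever genotype-calling rule is being tested. If the heterozygosity gap between pilot and non-pilot samples shrinks after capping, that points to depth as a contributor.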

    I'd also look at SNPs per read position to see if there was a difference in quality scores between runs. The paired-end read seems less consistent, so check that closely. You might see an increase in SNPs toward the end of the read, a greater increase in the pilot, or a spike at a particular position.
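
A minimal sketch of the per-position check, assuming the 0-based within-read positions of mismatches (or SNP-supporting bases) and the read lengths have already been extracted from the alignments for one batch and one read of the pair. The function name and input layout are hypothetical.

```python
import numpy as np

def mismatch_rate_by_position(mismatch_positions, read_lengths):
    """Mismatches per covering read at each 0-based read position.

    mismatch_positions: one entry per observed mismatch (position within the read).
    read_lengths: length of every read in the set (after trimming).
    """
    read_lengths = np.asarray(read_lengths, dtype=np.int64)
    max_len = int(read_lengths.max())

    positions = np.asarray(mismatch_positions, dtype=np.int64)
    mismatches = np.bincount(positions, minlength=max_len)[:max_len]

    # coverage[p] = number of reads long enough to include position p.
    length_hist = np.bincount(read_lengths, minlength=max_len + 1)
    coverage = read_lengths.size - np.cumsum(length_hist)[:max_len]

    return mismatches / coverage
```

Computing this separately for read 1 and read 2 in each batch and plotting the four curves together should make a rise toward the read end, or a spike at one position, easy to spot.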

