Genome analysis - reference genomes and co-contamination


  • Genome analysis - reference genomes and co-contamination

    Dear SEQanswers forum,

    Thanks in advance for any contribution to this topic. Sorry for the long post; I want to provide as much information as possible. Our goal is to isolate bacteria from a complex environment and perform de novo assembly of their genomes. We obtained isolates from several different phyla/taxa, but we mostly focus on one dominant taxon. This group comprises more than 100 different species (~65% G+C content). For this post I provide an example from one isolate of this group.

    BACKGROUND: Colonies from solid media were picked and transferred/purified several times. DNA was extracted and the 16S rRNA gene was amplified by PCR. The PCR product was sequenced to identify the isolate and check for contamination. All isolates generated a unique 16S sequence (no further tests were performed).

    Genomic DNA was sequenced using the Nextera kit on a HiSeq (2x125 PE). The initial coverage was ~120X with an average insert size of 320 (from the insert-size histogram). Standard processing was applied to the libraries/reads using the BBMap software package:

    (1)    Removal of adapters
    (2)    Removal of PhiX and artifacts (Illumina) and quality trimming
    (3)    Removal of human contaminants
    (4)    Normalization, removal of low-coverage reads, and error correction
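
    The four steps above could be chained roughly as follows with BBTools. This is a minimal dry-run sketch, not the commands actually used for these libraries: the sample ID, file names, index path, and parameter values are all assumptions chosen for illustration, and the script only prints the commands so they can be reviewed before running.

```shell
#!/bin/sh
# Dry-run sketch of the four preprocessing steps; it only prints commands.
# Sample ID, file names, paths and parameter values are illustrative assumptions.
ID="H001"                              # hypothetical sample ID
RES="/path/to/bbmap/resources"         # BBTools resources folder
HUMAN="/path/to/masked_human_index"    # hypothetical pre-built BBMap index

preprocess_commands() {
    # (1) Adapter removal by right-side k-mer trimming
    echo "bbduk.sh in1=${ID}_raw_r1.fastq.gz in2=${ID}_raw_r2.fastq.gz" \
         "out1=${ID}_step01_r1.fastq.gz out2=${ID}_step01_r2.fastq.gz" \
         "ref=${RES}/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo"
    # (2) PhiX/artifact filtering plus quality trimming
    echo "bbduk.sh in1=${ID}_step01_r1.fastq.gz in2=${ID}_step01_r2.fastq.gz" \
         "out1=${ID}_step02_r1.fastq.gz out2=${ID}_step02_r2.fastq.gz" \
         "ref=${RES}/phix174_ill.ref.fa.gz k=31 hdist=1 qtrim=r trimq=10"
    # (3) Human removal: keep only the reads that do NOT map to human
    echo "bbmap.sh in=${ID}_step02_r1.fastq.gz in2=${ID}_step02_r2.fastq.gz" \
         "outu1=${ID}_step03_r1.fastq.gz outu2=${ID}_step03_r2.fastq.gz" \
         "path=${HUMAN} minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2"
    # (4) Normalization, low-coverage read removal and error correction
    echo "bbnorm.sh in=${ID}_step03_r1.fastq.gz in2=${ID}_step03_r2.fastq.gz" \
         "out=${ID}_step04_r1.fastq.gz out2=${ID}_step04_r2.fastq.gz" \
         "target=100 min=5 ecc=t"
}

preprocess_commands
```

    Each printed line can be pasted into a terminal once the paths have been adapted; the step04 outputs then feed the downstream analyses.
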
    After this initial processing, we apply two types of downstream analysis:

    (1) De novo assembly using SPAdes. For example, a typical result looks like this (QUAST, no reference):

    Code:
    Assembly		H001_scaffolds
    # contigs (>= 0 bp)		88  
    # contigs (>= 1000 bp)		40  
    Total length (>= 0 bp)		5451929  
    Total length (>= 1000 bp)	5437233  
    # contigs			46
    Largest contig			1047126  
    Total length			5441988  
    GC (%)				64.01 
    N50				269139 
    N75				135614 
    L50				5 
    L75				12  
    # N's per 100 kbp		0.00  
    # predicted genes (unique)	5316  
    # predicted genes (>= 0 bp)	5317  
    # predicted genes (>= 300 bp)	4786  
    # predicted genes (>= 1500 bp)	642 
    # predicted genes (>= 3000 bp)	85  
    
    All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).
    (2) An initial split (reads mapped using BBSplitter, or alternatively mapped to BAM using BBMap) followed by assembly using SPAdes.

    For the split, I used a folder containing a mixture of type (reference) genomes and draft genomes. In most cases the reads from a sample mapped mainly to a single draft genome (80 to 90%), with the remainder distributed among the other genomes; in rare cases they mapped mainly to a type genome, and in a few cases the split was 50-50% (co-contamination?). After assembly with SPAdes, the split reads produced from a hundred to several hundred contigs (largest contig ~200,000 bp). When the contigs are BLASTed, they do not show fragmentation (NCBI graph), whereas the de novo assembly shows a lot of fragmentation.

    Code:
    java -Xmx29g -cp /path/to/bbmap/current align2.BBSplitter in=ID_step04_r#.fastq.gz ref=/path/to/bbmap/resources/taxa_group outu1=ID_step05_unmapped_r1.fastq.gz outu2=ID_step05_unmapped_r2.fastq.gz ihist=ID_step05_insert_histogram.txt minratio=0.56 minhits=1 maxindel=16000 basename=ID_step05_mapped_r%_#.fastq.gz ambig2=all
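
    One way to screen many isolates for the 50-50 pattern described above is to check what share of the mapped reads goes to the top reference. The following is a minimal sketch, assuming a hypothetical tab-separated table with one line per reference (reference name, number of mapped reads); neither the table layout nor the 70% default cutoff comes from BBSplit or from an established threshold.

```shell
#!/bin/sh
# flag_split: read "reference<TAB>mapped_reads" lines on stdin and report the
# top reference, its share of mapped reads (%), and a simple label.
# The table layout and the 70% default cutoff are assumptions for illustration.
flag_split() {
    awk -F '\t' -v cutoff="${1:-70}" '
        { total += $2; if ($2 > max) { max = $2; top = $1 } }
        END {
            if (total == 0) { print "no_mapped_reads"; exit }
            pct = 100 * max / total
            label = (pct >= cutoff) ? "single_dominant" : "possible_mixture"
            printf "%s\t%.1f\t%s\n", top, pct, label
        }'
}

# Example: two references each attract about half of the mapped reads.
printf 'draft_A\t5100\ndraft_B\t4900\n' | flag_split
```

    A sample labelled possible_mixture is only a candidate for closer inspection; reads from one strain can also split between two near-identical references without any contamination.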


    Here are my questions:

    Is it valid/appropriate (or not) to use draft genomes (instead of type genomes) as reference genomes for mapping (e.g. for BBSplitter, BBMap, etc.)?

    Are the few, large contigs generated in step 1 (de novo) the product of co-contamination by very close strains or subspecies (see comment below)?

    There are more questions related to each specific downstream analysis, and to the possible responses to these questions, but for now I would like your comments/suggestions about how to proceed.

    Thanks again



    Another, maybe related, question:
    There is ample discussion in the forum (and elsewhere) about how to proceed in genome assembly when samples contain prokaryotic contamination in eukaryote samples (or vice versa), contamination from very different taxa (Bacteroidetes in Proteobacteria), or other contaminants with very clear differences. However, it is very hard to find information about how to proceed when you suspect that a sample (i.e. the sequenced product) contains very close species (maybe at the level of subspecies or strains).
    How do you proceed in this situation?
    Last edited by vingomez; 05-06-2015, 08:54 AM.

  • #2
    I wrote a program, CrossBlock, to do automatic de-cross-contamination of assemblies from cross-contaminated samples. However, it is designed for the low levels of contamination - generally, 2.5% or lower - that you might get from barcode misassignment or slight impurities in adapter batches, not the 50% level of physically combining two libraries. It should theoretically work with up to around 10% cross-contamination, though.

    It's pretty easy to use. First, interleave all of your fastq reads so you have one fastq file per library, and give each a unique name. Then give each fasta file (one assembly per library) a unique name also. Then make two text files, one named "readlist.txt" and one named "reflist.txt". Each should have one path per line, pointing to a fastq or fasta file, in the same order. For example:

    readlist.txt:
    Code:
    x.fq.gz
    y.fastq.gz
    z.fastq.gz
    reflist.txt:
    Code:
    x.fa
    y.fa
    z.fa
    Then run this:

    crossblock.sh readnamefile=readlist.txt refnamefile=reflist.txt out=clean/ log=dclog.txt

    The result should be clean assemblies. It won't clean the reads for you; you'd have to subsequently use BBSplit for that, against the clean assemblies. Whether CrossBlock will work for strains or subspecies is hard to say. It should have no trouble with different species, but will start to have problems once genomes exceed 97% identity.
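
    The two list files described above can be generated rather than typed by hand, which helps keep them in the same order. This is a minimal sketch, assuming a hypothetical layout in which every library has an interleaved file reads/<name>.fq.gz and an assembly assemblies/<name>.fa; the helper names make_lists and check_lists are invented for this example.

```shell
#!/bin/sh
# Sketch: write readlist.txt and reflist.txt in the same order from sample names.
# The layout reads/<name>.fq.gz and assemblies/<name>.fa is a hypothetical convention.
# Paired files can be interleaved beforehand with BBTools, for example:
#   reformat.sh in1=x_r1.fq.gz in2=x_r2.fq.gz out=reads/x.fq.gz

make_lists() {
    : > readlist.txt
    : > reflist.txt
    for name in "$@"; do
        echo "reads/${name}.fq.gz" >> readlist.txt
        echo "assemblies/${name}.fa" >> reflist.txt
    done
}

# Succeeds only if line i of both lists refers to the same sample name.
check_lists() {
    sed -e 's|.*/||' -e 's|\.fq\.gz$||' readlist.txt > read_names.tmp
    sed -e 's|.*/||' -e 's|\.fa$||' reflist.txt > ref_names.tmp
    cmp -s read_names.tmp ref_names.tmp
    status=$?
    rm -f read_names.tmp ref_names.tmp
    return "$status"
}

cd "$(mktemp -d)" || exit 1    # work in a scratch directory for the demo
make_lists x y z
check_lists && echo "lists are in matching order"
```

    Running check_lists again just before launching crossblock.sh is a cheap guard against pairing a library with the wrong assembly.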

    As for BBSplit, and your first question -

    Can you clarify as to whether the assemblies were better before or after splitting and reassembling? It looks like you had quite good assemblies initially, then you ran BBSplit, reassembled, and ended up with highly fragmented assemblies. Is that correct?

    So, my recommendation may change in light of your answer, but my current view is that you should use BBSplit just between your different assemblies (ideally, after decontamination). If you have, say, E. coli Strain 1 in your assemblies and you include the reference for type genome E. coli Strain 2 even though it is not one of your assemblies, it will force the reads to split between two very similar things and could cause major fragmentation.

    Since you do suspect some degree of heterogeneity in the organisms, you may want to run BBSplit with increased sensitivity, with the additional flags "minhits=1 minratio=0.56 maxindel=100" which will give sensitivity similar to BBMap. BBSplit's defaults are low sensitivity which can also increase fragmentation in the presence of assembly errors or strain variation.
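
    Put together, a split restricted to your own assemblies with the higher-sensitivity flags might look like the command printed by this sketch. The sample ID, the read and output file names, and the helper name bbsplit_command are assumptions for illustration; the script only prints the command so it can be checked before running.

```shell
#!/bin/sh
# Dry-run sketch: print a BBSplit command that uses only your own assemblies
# as references, plus the higher-sensitivity flags. File names are hypothetical.
bbsplit_command() {
    sample="$1"; shift
    # BBSplit takes a comma-separated reference list (paths without spaces)
    refs=$(echo "$*" | tr ' ' ',')
    echo "bbsplit.sh in=${sample}_r1.fastq.gz in2=${sample}_r2.fastq.gz" \
         "ref=${refs} basename=${sample}_mapped_%_#.fastq.gz" \
         "outu1=${sample}_unmapped_r1.fastq.gz outu2=${sample}_unmapped_r2.fastq.gz" \
         "minhits=1 minratio=0.56 maxindel=100"
}

bbsplit_command H001 x.fa y.fa z.fa
```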

    And overall, since you asked, the presence of strain variation is expected to give more fragmented assemblies, not nice big contigs like you got in your initial assemblies.
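
    Contiguity can be compared between the two assemblies with N50 and L50, the same statistics reported in the QUAST table earlier in the thread. N50 is the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the assembly length; L50 is how many contigs that takes. A minimal sketch with standard tools follows; the helper names are invented here, and unlike QUAST (which by default counts only contigs >= 500 bp) it uses every sequence it is given.

```shell
#!/bin/sh
# fasta_lengths: print the length of each sequence in a FASTA file (or stdin).
fasta_lengths() {
    awk '/^>/ { if (n) print n; n = 0; next }
         { n += length($0) }
         END { if (n) print n }' "$@"
}

# n50_l50: read contig lengths (one per line) on stdin and print "N50 L50".
n50_l50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running * 2 >= total) { print len[i], i; exit }
            }
        }'
}

# Toy example: lengths sum to 1500, and 500 + 400 is the first total >= 750.
printf '%s\n' 100 200 300 400 500 | n50_l50    # prints: 400 2
```

    For a real assembly the two helpers can be combined as fasta_lengths scaffolds.fasta | n50_l50, where scaffolds.fasta stands for whichever assembly file is being checked.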
