
  • Genome analysis - reference genomes and co-contamination

    Dear Seqanswer forum,

    Thanks in advance for any contribution to this topic. Sorry for the long post; I want to provide as much information as possible. Our goal is to isolate bacteria from a complex environment and perform de novo assembly of their genomes. We obtained isolates from several different phyla/taxa, but we focus mostly on one particular dominant taxon. This group comprises more than 100 different species (~65% G+C content). For this post I provide an example for one isolate of this group.

    BACKGROUND: Colonies from solid media were picked and transferred/purified several times. DNA was extracted and the 16S rRNA gene was amplified by PCR. The sequenced PCR product was used to identify the isolate and check for contamination. All isolates generated a unique 16S sequence (no further tests were performed).

    Genomic DNA was sequenced using the Nextera kit on a HiSeq 2x125 PE. The initial coverage was ~120X with an average insert size of 320 (from the insert-size histogram). Standard processing was applied to the libraries/reads using the BBMap software package:

    Code:
    (1)    Removal of adapters
    (2)    Removal of phiX and artifacts (Illumina) and quality trimming
    (3)    Removal of human contaminants
    (4)    Normalization, removal of low-coverage reads, and error correction
    After this initial processing we apply two types of downstream analysis:

    (1) De novo assembly using SPAdes. For example, a typical result looks like this (QUAST – no reference):

    Code:
    Assembly		H001_scaffolds
    # contigs (>= 0 bp)		88  
    # contigs (>= 1000 bp)		40  
    Total length (>= 0 bp)		5451929  
    Total length (>= 1000 bp)	5437233  
    # contigs			46
    Largest contig			1047126  
    Total length			5441988  
    GC (%)				64.01 
    N50				269139 
    N75				135614 
    L50				5 
    L75				12  
    # N's per 100 kbp		0.00  
    # predicted genes (unique)	5316  
    # predicted genes (>= 0 bp)	5317  
    # predicted genes (>= 300 bp)	4786  
    # predicted genes (>= 1500 bp)	642 
    # predicted genes (>= 3000 bp)	85  
    
    All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).
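    For readers unfamiliar with the N50/L50 and N75/L75 rows above, here is a minimal sketch of how such values can be computed from a list of contig lengths. The function name `nx_lx` and the toy lengths are made up for illustration; they are not taken from QUAST or from the H001 assembly.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths.

    Nx is the length of the contig at which the running total of
    lengths, taken from longest to shortest, first reaches `fraction`
    of the total assembly size; Lx is how many contigs that took.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count


# Toy lengths, invented for illustration (not the H001 contigs).
toy = [500, 300, 100, 60, 40]
n50, l50 = nx_lx(toy, 0.5)
n75, l75 = nx_lx(toy, 0.75)
print(n50, l50, n75, l75)  # prints: 500 1 300 2
```

    Read this way, an L50 of 5 means that the five longest contigs alone account for at least half of the assembled length.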
    (2) Initial split (reads mapped using BBSplitter) followed by assembly using SPAdes (or, alternatively, mapping to BAM with BBMap).

    For the split, I used a folder containing a mixture of type reference genomes and draft genomes. In most cases the sample's reads mapped mainly to a single draft genome (80 to 90%), with the remainder distributed among the other genomes; in rare cases they mapped mainly to a type genome, and in a few cases the split was roughly 50-50% (co-contamination?). After running SPAdes on the split reads, the assembly contained from a hundred to several hundred contigs (largest contig ~200,000 bp). When these contigs are BLASTed, the alignments do not show fragmentation (NCBI graph), whereas the de novo assembly shows a lot of fragmentation.

    Code:
    java -Xmx29g -cp /path/to/bbmap/current align2.BBSplitter in=ID_step04_r#.fastq.gz ref=/path/to/bbmap/resources/taxa_group outu1=ID_step05_unmapped_r1.fastq.gz outu2=ID_step05_unmapped_r2.fastq.gz ihist=ID_step05_insert_histogram.txt minratio=0.56 minhits=1 maxindel=16000 basename=ID_step05_mapped_r%_#.fastq.gz ambig2=all
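
    To separate the "80 to 90% to one genome" cases from the "50-50%" cases more explicitly, the per-reference read counts from a split could be summarised with something like the sketch below. The function name, the reference names, and the 0.8 cutoff are all hypothetical choices for illustration; the counts would have to be taken from the split output by whatever means is convenient.

```python
def summarize_split(counts, dominant=0.8):
    """Classify how one library's mapped reads spread across references.

    counts:   dict of reference name -> number of reads assigned to it.
    dominant: hypothetical cutoff above which the top reference is
              treated as the single dominant one.
    Returns (label, fractions), with fractions sorted from the most
    to the least represented reference.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    fractions = [(name, n / total) for name, n in ranked]
    top_fraction = fractions[0][1]
    if top_fraction >= dominant:
        label = "single dominant reference"
    else:
        label = "mixed: check for co-contamination"
    return label, fractions


# Invented counts for two libraries, for illustration only.
print(summarize_split({"draft_A": 850, "draft_B": 100, "type_C": 50})[0])
print(summarize_split({"draft_A": 500, "draft_B": 500})[0])
```

    A "mixed" label only flags the library for a closer look; on its own it cannot distinguish co-contamination from reads that map ambiguously to two very similar references.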


    Here are my questions:

    Is it valid/appropriate (or not) to use draft genomes (instead of type genomes) as references for mapping (e.g. with BBSplitter, BBMap, etc.)?

    Are the few, large contigs generated in step 1 (de novo) the product of co-contamination by very close strains or subspecies (see comment below)?

    There are more questions related to each specific downstream analysis, and to the possible answers to these questions, but for now I would like your comments/suggestions on how to proceed.

    Thanks again



    Another, maybe related, question:
    There is ample discussion in the forum (and elsewhere) about how to proceed with genome assembly when samples contain prokaryotic contamination in eukaryotic samples (or vice versa), contamination from very different taxa (Bacteroidetes in Proteobacteria), or other very clear differences. However, it is very hard to find information about how to proceed when you suspect that a sample (i.e. the sequenced product) contains very close species (maybe at the level of subspecies or strains).
    How do you proceed in this situation?
    Last edited by vingomez; 05-06-2015, 08:54 AM.

  • #2
    I wrote a program, CrossBlock, to do automatic de-cross-contamination of assemblies from cross-contaminated samples. However, it is designed for the low levels of contamination - generally, 2.5% or lower - that you might get from barcode misassignment or slight impurities in adapter batches, not the 50% level of physically combining two libraries. It should theoretically work with up to around 10% cross-contamination, though.

    It's pretty easy to use. First, interleave all of your fastq reads so you have one fastq file per library, and give each file a unique name. Then give each fasta file (one assembly per library) a unique name as well. Then make two text files, one named "readlist.txt" and one named "reflist.txt". Each should have one path per line, pointing to a fastq or fasta file, in the same order. For example:

    readlist.txt:
    Code:
    x.fq.gz
    y.fastq.gz
    z.fastq.gz
    reflist.txt:
    Code:
    x.fa
    y.fa
    z.fa
    Then run this:

    crossblock.sh readnamefile=readlist.txt refnamefile=reflist.txt out=clean/ log=dclog.txt

    The result should be clean assemblies. It won't clean the reads for you; you'd have to subsequently use BBSplit for that, against the clean assemblies. Whether CrossBlock will work for strains or subspecies is hard to say. It should have no trouble with different species, but start to have problems once genomes exceed 97% identity.
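
    If there are many libraries, the two list files can be generated rather than typed by hand. The sketch below is only one way to do it, and it assumes (my assumption, not something CrossBlock requires) that each library's reads and assembly share a base name, as in x.fq.gz and x.fa. The function names and suffix lists are made up for illustration.

```python
from pathlib import Path

# Suffixes treated as reads or assemblies (illustrative, extend as needed).
READ_SUFFIXES = (".fq.gz", ".fastq.gz", ".fq", ".fastq")
REF_SUFFIXES = (".fa", ".fasta", ".fna")


def strip_suffix(name, suffixes):
    """Return name without the first matching suffix, or None if none match."""
    for suffix in suffixes:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return None


def collect(directory, suffixes):
    """Map library name -> path for every file with a known suffix."""
    found = {}
    for path in Path(directory).iterdir():
        key = strip_suffix(path.name, suffixes)
        if key is not None:
            found[key] = path
    return found


def write_lists(read_dir, ref_dir, out_dir="."):
    """Write readlist.txt and reflist.txt with libraries in the same order."""
    reads = collect(read_dir, READ_SUFFIXES)
    refs = collect(ref_dir, REF_SUFFIXES)
    unpaired = set(reads) ^ set(refs)
    if unpaired:
        raise ValueError(f"no read/assembly pair for: {sorted(unpaired)}")
    names = sorted(reads)
    out = Path(out_dir)
    (out / "readlist.txt").write_text("".join(f"{reads[n]}\n" for n in names))
    (out / "reflist.txt").write_text("".join(f"{refs[n]}\n" for n in names))
    return names
```

    Calling `write_lists("reads", "assemblies")` would write both files to the current directory, and it raises an error instead of writing mismatched lists if any library lacks its partner file.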

    As for BBSplit, and your first question -

    Can you clarify as to whether the assemblies were better before or after splitting and reassembling? It looks like you had quite good assemblies initially, then you ran BBSplit, reassembled, and ended up with highly fragmented assemblies. Is that correct?

    So, my recommendation may change in light of your answer, but my current view is that you should use BBSplit just between your different assemblies (ideally, after decontamination). If you have, say, E. coli Strain 1 in your assemblies and you include the type genome of E. coli Strain 2 as a reference even though it is not one of your assemblies, it will force the reads to split between two very similar things and could cause major fragmentation.

    Since you do suspect some degree of heterogeneity in the organisms, you may want to run BBSplit with increased sensitivity, using the additional flags "minhits=1 minratio=0.56 maxindel=100", which will give sensitivity similar to BBMap. BBSplit's defaults are low-sensitivity, which can also increase fragmentation in the presence of assembly errors or strain variation.

    And overall, since you asked, the presence of strain variation is expected to give more fragmented assemblies, not nice big contigs like you got in your initial assemblies.
