Dear SEQanswers forum,
Thanks in advance for any contribution to this topic. Sorry for the long post; I want to provide as much information as possible. Our goal is to isolate bacteria from a complex environment and perform de novo genome assembly of the isolates. We obtained several different phyla/taxa, but we mostly focus on one particular dominant taxon. This group comprises more than 100 different species (~65% G+C content). For this post I provide an example for one isolate of this group.
BACKGROUND: Colonies from solid media were picked and transferred/purified several times. DNA was extracted and PCR of the 16S rRNA gene was performed. The sequenced PCR product was used to identify the isolate and check for contamination. All isolates generated a unique 16S sequence (no further tests were performed).
Genomic DNA was sequenced using the Nextera kit on a HiSeq (2x125 PE). The initial coverage was ~120X with an average insert size of 320 bp (histogram). Standard processing was applied to the libraries/reads using the BBMap software package:
Code:
(1) Removal of adapters
(2) Removal of PhiX and artifacts (Illumina), and quality trimming
(3) Removal of human contaminants
(4) Normalization, removal of low-coverage reads, and error correction
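As a sanity check on the ~120X figure above, mean coverage is simply total sequenced bases divided by genome size. A minimal sketch in Python, assuming a genome of about 5.44 Mb (the total assembly length QUAST reports for this isolate); the function names are only for illustration and are not part of BBMap:

```python
import math


def estimated_coverage(read_pairs: int, read_length: int, genome_size: int) -> float:
    """Mean depth = total sequenced bases / genome size (paired-end: 2 reads per pair)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return 2 * read_pairs * read_length / genome_size


def pairs_for_coverage(target_depth: float, read_length: int, genome_size: int) -> int:
    """Smallest number of read pairs that reaches the target mean depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_depth * genome_size / (2 * read_length))


if __name__ == "__main__":
    genome_size = 5_441_988  # assumed: assembly length taken from the QUAST report
    pairs = pairs_for_coverage(120, 125, genome_size)
    print(pairs, round(estimated_coverage(pairs, 125, genome_size), 2))
```

Under these assumptions, 120X at 2x125 corresponds to roughly 2.6 million read pairs. This describes the raw data only; normalization in step (4) changes the depth that the assembler actually sees.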
After the initial processing we apply two types of downstream analysis:
(1) De novo assembly using SPAdes. For example, a typical result looks like this (QUAST, no reference):
Code:
Assembly                          H001_scaffolds
# contigs (>= 0 bp)               88
# contigs (>= 1000 bp)            40
Total length (>= 0 bp)            5451929
Total length (>= 1000 bp)         5437233
# contigs                         46
Largest contig                    1047126
Total length                      5441988
GC (%)                            64.01
N50                               269139
N75                               135614
L50                               5
L75                               12
# N's per 100 kbp                 0.00
# predicted genes (unique)        5316
# predicted genes (>= 0 bp)       5317
# predicted genes (>= 300 bp)     4786
# predicted genes (>= 1500 bp)    642
# predicted genes (>= 3000 bp)    85
All statistics are based on contigs of size >= 500 bp, unless otherwise noted (e.g., "# contigs (>= 0 bp)" and "Total length (>= 0 bp)" include all contigs).
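For readers less familiar with the N50/L50 and N75/L75 rows in the report above, this is how I understand them to be computed. A minimal sketch in Python on made-up contig lengths (not my data); the function name and the `min_length` parameter are my own, with 500 bp chosen to mirror the cutoff that the QUAST note mentions:

```python
def nx_lx(contig_lengths, fraction=0.5, min_length=500):
    """Return (Nx, Lx) for a list of contig lengths.

    Nx is the length of the contig at which the running sum of lengths,
    taken from longest to shortest, first reaches `fraction` of the total.
    Lx is the number of contigs needed to get there.
    """
    ordered = sorted((n for n in contig_lengths if n >= min_length), reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("no contigs at or above min_length")


if __name__ == "__main__":
    toy = [1000, 800, 600, 400, 200]  # hypothetical contig lengths in bp
    print(nx_lx(toy, fraction=0.5))   # N50 and L50 with the 500 bp cutoff
    print(nx_lx(toy, fraction=0.75, min_length=0))  # N75 and L75 on all contigs
```

An N50 of 269139 with an L50 of 5 therefore means that the five longest contigs already cover half of the 5.44 Mb assembly.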
(2) Initial split (reads mapped using BBSplitter) followed by assembly using SPAdes (or, alternatively, a BAM file produced with BBMap).
For the split, I used a folder containing a mixture of type reference genomes and draft genomes. In most cases the reads from a sample mapped mainly to a single draft genome (80 to 90%), with the remainder distributed among the other genomes; in rare cases they mapped mainly to a type genome, and in a few cases the split was 50-50% (co-contamination?). After SPAdes, the assembly of the split reads produced from one hundred to several hundred contigs (largest contig ~200,000 bp). When these contigs are BLASTed they do not show fragmentation (NCBI graph), compared with the de novo assembly, which shows a lot of fragmentation.
Code:
java -Xmx29g -cp /path/to/bbmap/current align2.BBSplitter in=ID_step04_r#.fastq.gz ref=/path/to/bbmap/resources/taxa_group outu1=ID_step05_unmapped_r1.fastq.gz outu2=ID_step05_unmapped_r2.fastq.gz ihist=ID_step05_insert_histogram.txt minratio=0.56 minhits=1 maxindel=16000 basename=ID_step05_mapped_r%_#.fastq.gz ambig2=all
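To screen many isolates for the 50-50 cases, the per-reference read counts from the split can be reduced to a simple flag. A minimal sketch in Python, assuming the counts have already been collected into a dict; the reference names, the counts, and the 0.25 cutoff are hypothetical, not values from my data or BBSplitter defaults:

```python
def summarize_split(assigned_reads, second_cutoff=0.25):
    """Summarize how reads from one sample were distributed across references.

    assigned_reads: dict mapping reference name -> number of reads assigned.
    second_cutoff:  assumed threshold; if the second-best reference holds at
                    least this share of reads, the sample is flagged.
    """
    total = sum(assigned_reads.values())
    if total == 0:
        raise ValueError("no reads were assigned to any reference")
    ranked = sorted(assigned_reads.items(), key=lambda item: item[1], reverse=True)
    top_name, top_count = ranked[0]
    second_fraction = ranked[1][1] / total if len(ranked) > 1 else 0.0
    return {
        "top_reference": top_name,
        "top_fraction": top_count / total,
        "second_fraction": second_fraction,
        "possible_mixture": second_fraction >= second_cutoff,
    }


if __name__ == "__main__":
    # Hypothetical counts for two samples.
    print(summarize_split({"draft_A": 850, "draft_B": 100, "type_C": 50}))
    print(summarize_split({"draft_A": 500, "draft_B": 480, "type_C": 20}))
```

One caveat with this kind of summary: with ambig2=all, an ambiguously mapped read is written to every reference it matches, so closely related references can share reads and a flag here is only a hint, not proof of a mixed culture.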
Here are my questions:
Is it valid/appropriate (or not) to use draft genomes (instead of type genomes) as references for mapping (e.g., for BBSplitter, BBMap, etc.)?
Are the few, large contigs generated in step 1 (de novo) the product of co-contamination by very close strains or subspecies (see comment below)?
There are more questions related to each specific downstream analysis, and to the possible answers to the questions above, but for now I would like your comments/suggestions on how to proceed.
Thanks again
Another, maybe related, question:
There is ample discussion in the forum (and elsewhere) about how to proceed with genome assembly when samples contain prokaryotic contamination in eukaryotic samples (or vice versa), contamination from very different taxa (e.g., Bacteroidetes in Proteobacteria), or other very clear differences. However, it is very hard to find information about how to proceed when you suspect that a sample (i.e., the sequenced product) contains very closely related species (maybe at the level of subspecies or strains).
How do you proceed in this situation?