Seqanswers Leaderboard Ad

**Brian Bushnell** · 04-17-2014, 08:19 PM

It looks like you have around 60x coverage, which is too low to benefit from normalization, so I suggest you skip that step. I have written a normalizer, BBNorm, that also does error-correction - you can do either or both. When you have too much data, I find assemblies greatly benefit from normalization plus error-correction (with Soap and Velvet, at least; AllPathsLG does not seem to like normalization). Still, it's often worthwhile to compare assemblies with raw, normalized, error-corrected, and ((error-corrected)+(normalized)) data, to see which comes out the best. Sometimes error-correction and normalization will both reduce assembly quality. You can roughly determine assembly quality using metrics like N50, percent of raw reads that map to the assembly, error rates when mapping to the assembly, pairing rates, ambiguous mapping rates, assembly size (lower is better, all else equal), and so forth. If you have EST data, you can also evaluate assemblies using that (using, for example, bbest.sh).

Adapter-trimming is always a good idea when you know the adapter sequences, as is human removal, but since you're using a reptile, my suggestions in that post are inadequate. Still, you can at least remove human symbiotes/parasites like e.coli and salmonella, and known Illumina contaminants like phiX. Adapters can be trimmed with Trimmomatic, but BBDuk is both faster and more sensitive (allowing adjustable edit distance or hamming distance). I do not recommend Fastx, ever, since it ruins read pairing.

Quality-trimming can be useful but is best done conservatively (to Q10 or less). There are many quality-trimming algorithms; the best-performing is the Phred algorithm (that website doesn't show up for me, but maybe it does for you? If not, I can send you a description of my reimplementation of the algorithm), implemented by BBTools and Seqtk. There are other quality-trimming programs, and all give inferior results, because the Phred algorithm is provably optimal. Most are written with no understanding of logarithms.

FLASH does a poor job of merging, with a high false-positive rate. BBMerge (part of the BBTools package) does much better, with a similar merging rate to FLASH but around 1/100 the false positive rate (under 0.03% with the default settings, using synthetic data which can be verified). However, merging is not always beneficial. Ray seems to perform much better with merged reads, while Soap gives terrible assemblies using merged reads, compared to unmerged. I have not tested AllPathsLG or Velvet on merged data. So if you want to merge reads, don't use FLASH; but whether merging is a good idea depends on your genome and assembler. The accuracy of a merger is quite easy to verify - if you have an assembly, no matter how primitive, just track the fraction of mapped reads raw and after merging, allowing no indels.

For optimizing kmer size in an assembly - the normal solution is to assemble using all odd kmer sizes (from, say, k=21 to 80% of read length) and pick the one with the best N50. It's not ideal, but much better than choosing K at random. There are algorithms available that claim to determine the optimal kmer length, but I have yet to test them.

Lastly - it's normal to assemble all fragment libraries into contigs, then use the long-mate-pair library only for scaffolding. They're different, so treat them as such.

**jpummil** · 04-18-2014, 07:50 AM

Good morning Brian!
Thank you for the very helpful synopsis you have provided. I have gone so far as to run three assemblies with just the base, un-preprocessed data with Ray at 3 different kmer settings (31, 43, and 57). Command line I am using is:
/share/apps/Ray/Ray-2.3.1/ray-build/Ray -k 43 -p /scratch/jpummil/Snake_v2/3
_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq /scratch/jpummil/Snake_v2/3_Snake_4117_TSDR27_ATTCCT_L003_R2_001.fastq -p
/scratch/jpummil/Snake_v2/RattleSnakeReverseCompedRead1.fastq /scratch/jpummil/Snake_v2/RattleSnakeReverseCompedRead
2.fastq -p /scratch/jpummil/Snake_v2/s_1_1_sequence.fastq /scratch/jpummil/Snake_v2/s_1_2_sequence.fastq -p /scratch/
jpummil/Snake_v2/s_2_1_sequence.fastq /scratch/jpummil/Snake_v2/s_2_2_sequence.fastq -p /scratch/jpummil/Snake_v2/MIK
E_S1_L001_R1_001.fastq /scratch/jpummil/Snake_v2/MIKE_S1_L001_R2_001.fastq -o /scratch/jpummil/Snake_v2/RayOut43-AMD

Using the quast quality metrics tool, I get the following:
kmer=31
Assembly Contigs
# contigs (>= 0 bp) 742358
# contigs (>= 1000 bp) 355954
Total length (>= 0 bp) 1150125950
Total length (>= 1000 bp) 969349841
# contigs 511734
Largest contig 39372
Total length 1085189817
GC (%) 37.98
N50 2917
N75 1583
L50 106043
L75 233027
# N's per 100 kbp 0.00

kmer=43
Assembly Contigs
# contigs (>= 0 bp) 677575
# contigs (>= 1000 bp) 355684
Total length (>= 0 bp) 1194030037
Total length (>= 1000 bp) 1030937265
# contigs 501163
Largest contig 40236
Total length 1138917269
GC (%) 38.17
N50 3235
N75 1717
L50 100464
L75 221906
# N's per 100 kbp 0.00

kmer=57
Assembly Contigs
# contigs (>= 0 bp) 594352
# contigs (>= 1000 bp) 349257
Total length (>= 0 bp) 1155057304
Total length (>= 1000 bp) 1014358551
# contigs 487964
Largest contig 40714
Total length 1117432969
GC (%) 38.10
N50 3249
N75 1732
L50 98136
L75 216568
# N's per 100 kbp 0.00

Looks like next step is to work on some pre-processing of the data to eliminate poor quality areas and adapters and maybe try again at kmer=57 to see if there is a substantial improvement.

**westerman** · 04-18-2014, 10:23 AM

Brian's suggestions, as always, are very good. You may also wish to try the 'ABySS' program. In my experience there is no one perfect assembler so if you have the time then trying out two or three is a good idea.

**Brian Bushnell** · 04-18-2014, 10:25 AM

Since you're using Ray, I do recommend trying an assembly of merged data to see if you get a better result (assuming your libraries have short enough inserts for merging). The command is like this:

bbmerge.sh in1=reads1.fq in2=reads2.fq out=merged.fq outu1=unmerged1.fq outu2=unmerged2.fq.

If you do merging with bbmerge, it's best to do it after adapter removal, but before quality-trimming.

**jpummil** · 04-18-2014, 10:55 AM

So, to use bbtrim.sh and NOT quality trim, just don't flag the quality options? So for each PE set, use as follows?

bbtrim.sh in=<input file> in2=<input file> out=<output file> out2=<output file> ref=<adapter files>

So, for the adapter references, I just looked at one of the newer PE reads to count the various adapters:

[jpummil@razor-l2 Snake_v2]$ !762
grep -c "^GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
122
[jpummil@razor-l2 Snake_v2]$ !763
grep -c "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG$" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
104
[jpummil@razor-l2 Snake_v2]$ !764
grep -c "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
7217
[jpummil@razor-l2 Snake_v2]$ !765
grep -c "AGATCGGAAGAGC" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
267557

Should I include all 4 of these in csv form for the adapter ref?

What about PCR primers? I haven't found a reference that suits me for those yet...

**Brian Bushnell** · 04-18-2014, 11:18 AM

Originally posted by jpummil View Post

So, to use bbtrim.sh and NOT quality trim, just don't flag the quality options? So for each PE set, use as follows?

bbtrim.sh in=<input file> in2=<input file> out=<output file> out2=<output file> ref=<adapter files>

bbtrim.sh is intended for quality-trimming. To do adapter trimming without quality-trimming, you would use bbduk:

bbduk.sh in=<input file> in2=<input file> out=<output file> out2=<output file> ref=<adapter files> ktrim=r mink=10 -Xmx1g

You can add "hdist=1" to allow up to one substitution. The "ktrim=r" flag means trim to the right when finding a kmer (without that flag, matches would be discarded rather than trimmed). The "mink=10" allows it to look for matching sequences as short as 10 bp at the end of the read. In the middle of the read, the full kmer length (default 28) will be used.

I have included both the truseq adapters and phix reference in /bbmap/resources/

So, for the adapter references, I just looked at one of the newer PE reads to count the various adapters:

...

Should I include all 4 of these in csv form for the adapter ref?

You can put them in a properly-formatted fasta file, like this:

>1
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
>2
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
>3
GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
>4
AGATCGGAAGAGC

...or, you can use the flag "literal=GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG,GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG,GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG,AGATCGGAAGAGC"

Note that BBDuk uses kmers, and the default kmer length is 28; it will not find adapters shorter than kmer length. You can change it to, say, 13 with the "k=13" flag, but the shorter it is the more false positives will be found, particularly if you allow mismatches.

What about PCR primers? I haven't found a reference that suits me for those yet...

I have a file of all common Illumina contaminants that we use for filtering on everything at JGI. It's small, ~120kb, and includes various primers. I'll check for permission, and post it here if it's allowed.

**jpummil** · 04-18-2014, 12:24 PM

Ahhh...understood. bduk.sh as well as the noted options.

So, while I could "nuke" about 7500 of the longer adapters with k=28:
grep -c "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
7217
I would be left with nearly a quarter million of the shorter adapters unless I reduce k to 13...
grep -c "AGATCGGAAGAGC" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
267557

But...to lower k to less than half of default in this case would substantially increase the likelihood of eliminating things OTHER than adapters. Of course, the grep shown is likely indicating this as well.

Looks like 75M reads in that particular file, so...
grep -c @HWI 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
75552988

Yes, if allowable, I'd love to have the file with all the adapters, primers, etc that you mentioned.

Want to thank you for your patience and all the help thus far...I truly do appreciate it!

**Brian Bushnell** · 04-18-2014, 03:48 PM

Sorry, I couldn't get permission to release it. I think some of it may be considered proprietary by Illumina.

However, you might try these lists:

fastqc/Contaminants/contaminant_list.txt at master · csf-ngs/fastqc

https://github.com/csf-ngs/fastqc/blob/master/Contaminants/contaminant_list.txt

csf fork of fastqc for usage on selected reads of unaligned bam file - csf-ngs/fastqc

Sequence Contaminant List - BioWiki

http://bioinfo-core.org/index.php/Sequence_Contaminant_List

...both of which contain some Illumina primers. They'd have to be formatted as Fasta before use.

**jpummil** · 04-18-2014, 03:52 PM

No worries, Brian

As it stands now, i have plenty to tinker with. I'll do some adapter trimming with what I have and see where that leads and report back on progress.

Oh yeah, while digging around, I ran upon some 454 files from 2010 for the same project. Not sure if they'd be more trouble than they're worth, or useful for a specific purpose:

-rwxr-xr-x 1 jpummil jpummil 1.5G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GE68THX01_RSn3-5.sff
-rwxr-xr-x 1 jpummil jpummil 1.9G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GEHF0TB01_RSn3-3.sff
-rwxr-xr-x 1 jpummil jpummil 1.9G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GEHF0TB02_RSn3-3.sff
-rwxr-xr-x 1 jpummil jpummil 1.1G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GELISTB01_RSn3-2pL1.sff
-rwxr-xr-x 1 jpummil jpummil 599M Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GEUJLEB01_RSn3-4pL2.sff
-rwxr-xr-x 1 jpummil jpummil 1.9G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GF99A5Z01_RSn3-JM1.sff
-rwxr-xr-x 1 jpummil jpummil 1.9G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GF99A5Z02_RSn3-JM1.sff
-rwxr-xr-x 1 jpummil jpummil 2.2G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGDZ8VL01_RSn3-JM2.sff
-rwxr-xr-x 1 jpummil jpummil 2.1G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGDZ8VL02_RSn3-JM2.sff
-rwxr-xr-x 1 jpummil jpummil 1.7G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGLRLMF01_RSn3-JM1r2.sff
-rwxr-xr-x 1 jpummil jpummil 2.2G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGLRLMF02_RSn3-JM2r2.sff
-rwxr-xr-x 1 jpummil jpummil 2.0G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGPCL1V01_RSn3-JM3.sff
-rwxr-xr-x 1 jpummil jpummil 2.0G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGPCL1V02_RSn3-JM3.sff

**jpummil** · 04-18-2014, 05:24 PM

So, first try seemed to run fine, but minimal change in file sizes:
Start:
-rw-r--r-- 1 jpummil jpummil 19907863971 Mar 25 21:57 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
-rw-r--r-- 1 jpummil jpummil 19907863971 Mar 25 21:57 3_Snake_4117_TSDR27_ATTCCT_L003_R2_001.fastq

First with cautious: literal=GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
-rw-rw-r-- 1 jpummil jpummil 19905819559 Apr 18 20:14 TRIM-AD/New_PE_R1.fastq
-rw-rw-r-- 1 jpummil jpummil 19906242537 Apr 18 20:14 TRIM-AD/New_PE_R2.fastq

And with more aggressive: k=13 literal=GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG,AGATCGGAAGAGC
-rw-rw-r-- 1 jpummil jpummil 19885080597 Apr 18 20:20 TRIM-AD/New_PE_R1a.fastq
-rw-rw-r-- 1 jpummil jpummil 19884681173 Apr 18 20:20 TRIM-AD/New_PE_R2a.fastq

Also, the size of the paired files is no longer the same...I guess I expected they would be.

**Brian Bushnell** · 04-18-2014, 09:14 PM

The program will print statistics about how how many reads and bases were trimmed to stander error. No reason to expect the files to remain equal size, since adapters may be detected in one read but not the other. They should have the same number of lines, though.

Remember that even if you lose only 0.2% of your data by trimming down to 13-mers, you will also break your assembly everywhere that 13-mer occurs. Hopefully it will be rare (in a random genome, 1 in 67 million), but still, I don't recommend dropping below K=21.

**jpummil** · 04-19-2014, 08:04 AM

Output of both reductions.
Conservative:
Initial:
Memory: free=1507m, used=11m

Added 41 kmers; time: 1.031 seconds.
Memory: free=1467m, used=51m

Input is being processed as paired
Started output stream: 1.221 seconds.

Input: 151105976 reads 15261703576 bases.
KTrimmed: 106096 reads (0.07%) 1750831 bases (0.01%)
Result: 151103306 reads (100.00%) 15259952745 bases (99.99%)

Time: 98.518 seconds.
Reads Processed: 151m 1533.80k reads/sec
Bases Processed: 15261m 154.91m bases/sec

Aggressive:
Initial:
Memory: free=1507m, used=11m

Added 32 kmers; time: 0.069 seconds.
Memory: free=1467m, used=51m

Input is being processed as paired
Started output stream: 0.102 seconds.

Input: 151105976 reads 15261703576 bases.
KTrimmed: 640339 reads (0.42%) 21374886 bases (0.14%)
Result: 151053684 reads (99.97%) 15240328690 bases (99.86%)

Time: 89.117 seconds.
Reads Processed: 151m 1695.59k reads/sec
Bases Processed: 15261m 171.25m bases/sec

And, as you predicted, still the same number of lines in each file:
[jpummil@razor-l2 TRIM-AD]$ wc -l New_PE_R1.fastq
302206612 New_PE_R1.fastq
[jpummil@razor-l2 TRIM-AD]$ wc -l New_PE_R2.fastq
302206612 New_PE_R2.fastq

So, I think I'll run an assembly with the best kmer setting of the original runs and see what it yields in the way of improvement for both trimmed sets. Then, do a merge and re-run again. Might as well collect some statistics while I'm learning

And, I still plan on trying a couple of other assemblers as well. Certainly Velvet, and maybe ABySS. Thought about SOAP also.

**AdrianP** · 04-20-2014, 05:28 AM

Sorry to ask this here, but where can I find more about BBNorm, and what exactly does k-mer normalization do?

**Brian Bushnell** · 04-20-2014, 09:24 AM

Originally posted by AdrianP View Post

Sorry to ask this here, but where can I find more about BBNorm, and what exactly does k-mer normalization do?

BBNorm is packaged together with BBMap. The usage information is available by running the shellscript, "bbnorm.sh", with no arguments. It's not published yet (though I do have a mostly-written draft), but it is being used in an upcoming paper on a new assembler, "Omega".

Kmer normalization is useful for preprocessing prior to assembly in situations where you have too much data or highly variable coverage (for example in a metagenome, amplified single cell, RNA-seq, or genomes with high-copy plasmids or features like 16S). It's an alternative to subsampling. Some (not all) assemblers perform much better with a flat coverage distribution, and most assemblers use an amount of time and memory that increases with the number of input reads, so reducing the input and flattening the coverage can give better results, and often allow assembly of data that would otherwise take too long or run out of memory.

Subsampling: Randomly discard X% of all reads.
Normalization: Randomly discard reads with a probability based on kmer frequency, to achieve uniform coverage depth.

Subsampling is very fast and easy. You can subsample to 1% depth with BBTools like this:
reformat.sh in1=r1.fq in2=r2.fq out1=s1.fq out2=s2.fq samplerate=0.01

That will reduce the coverage, but not flatten it. Instead, to normalize it to 50x:

bbnorm.sh in1=r1.fq in2=r2.fq out1=n1.fq out2=n2.fq target=50

This takes longer and requires much more memory, but many assemblers (like Velvet and Soap) will produce a substantially better assembly with the normalized data, in my testing.

BBNorm can also do error-correction at the same time as normalization, if you include the flag "ecc=t". Even if you don't error-correct, BBNorm will discard reads containing errors with a higher probability than error-free reads, so the resulting data after normalization generally has a lower error rate. This is in contrast to naive normalization, which will typically increase the error rate (because reads with errors will have more rare kmers and thus be discarded at a lower rate).

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 22 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

de novo assembly of 1.7Gb reptile - Best Practices?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News