Seqanswers Leaderboard Ad

**sganschow** · 11-03-2016, 09:22 AM

Thanks for your reply.
I followed your advise to simulate different repeat lengths and generated paired end reads. Indeed, when adding more repeat units, there's a point where the pair is not joined but classified as ambiguous. I could manage to merge some longer amplicons by setting loose=t.

However, it is not clear to my why a pair like the following can't be merged, although the reads have an overlap of ~190nt and the repeat region has a complex structure, i.e. repeat stretches are interspersed by single bases and non-repetitive elements. Merging by hand (aligning R1 and rev-comp R2) revealed that there should be only 1 solution without any mismatches.

Please have a look at the following read pair (2x250nt):

Read 1

Code:

>31.2_14/17
AATCTGGGCGACAAGAGTGAAACTCCGTCAAAAGAAAGAAAGAAAGAGACAAAGAGAGTTAGAAAGAAAGAAAGAGAGAGAGAGAGAAAGGAAGAAAGGAAGAAAAAGAAAGAAAAAGAAAGAAAGAGAAAGAAAGAAAGAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAA

Code:

>31.2_14/17
ACATCTCCCCTACCGCTATAGTAACTTGCTCTTTCTTTCCTTCCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTCTTTCTTTCTTTCTCTTTCTTTCTTTTTCTTTCTTTTTCTTCCTTTCTTCCTTTCTCTCTCTCTCTCTTTCTTTCTTTC

Amplicon (311nt)

Code:

AATCTGGGCGACAAGAGTGAAACTCCGTCAAAAGAAAGAAAGAAAGAGACAAAGAGAGTTAGAAAGAAAGAAAGAGAGAGAGAGAGAAAGGAAGAAAGGAAGAAAAAGAAAGAAAAAGAAAGAAAGAGAAAGAAAGAAAGAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGAAAGGAAGGAAAGAAAGAGCAAGTTACTATAGCGGTAGGGGAGATGT

Maybe it just depends on the parameters in bbmerge?
Thanks in advance for your help.
Sebastian

**Brian Bushnell** · 11-07-2016, 09:51 AM

Originally posted by sganschow View Post

Maybe it just depends on the parameters in bbmerge?
Thanks in advance for your help.
Sebastian

Indeed, it does depend on the parameters. They will merge at 311bp if you use these parameters:

bbmerge.sh in=amplicons.fa loose minoverlap0=14

..or anything less stringent, such as

bbmerge.sh in=amplicons.fa xloose minoverlap0=50

...which would yield a much higher overall merge rate for this kind of read pair. The problem is that in this case the last 13 bp of the reads look like this:

Code:

AAAGAAAGAAAGA

TCTTTCTTTCTTT

...which, once you reverse-complement read 2, look like this:

Code:

AAAGAAAGAAAGA
|||||||||||||
AAAGAAAGAAAGA

They match perfectly, so there's no way to determine that that isn't the correct overlap frame based on sequence information alone, without additional information about the expected insert size or overlap length.

**sganschow** · 11-08-2016, 09:56 AM

Ah I see. That was really helpful, thank you Brian.
Since I know the shortest and longest expected amplicon, it was possible to get perfectly merged reads by setting the correct values for minoverlap0, minoverlap, mininsert0, and mininsert.

One last question: what does minentropy stand for? I haven't seen it in the bbmerge usage.

**Brian Bushnell** · 11-08-2016, 10:14 AM

BBMerge's "minentropy" is an approximation; it generates a score based on counting the number of unique 3-mers in the head and tail of the read to adjust the minimum allowed overlap. It can be disabled with the flag "entropy=f". Specifically, the score is the sum of 4*(kmers occurring at least once)+1*(kmers occurring at least twice).

For normal (complex) sequence, minoverlap will be used as the minimum allowed overlap to merge reads. But if there is a minentropy setting, the head (3' end) and tail (5' end) of the read are examined and the minimum overlap may be increased (never decreased) if the sequence is low-complexity, until the entropy score reaches a minimum value. For example, if one read ends with "...ATCGTTAGTCCCCCCC....CCCCCCC", the poly-C part will contribute only a single unique 3-mer no matter how much there is, so the score will remain at 5 until it gets to the complex part, at which the score will start to rapidly increase due to new unique 3-mers. When the score passes a threshold (default 39, corresponding to roughly 10 unique 3-mers) the minimum overlap length is determined.

This reduces the false-positive merge rate because low-complexity sequence is much more prone to coincidental matches than high-complexity sequence, so it makes more sense to determine the minimum overlap dynamically based on information content than to use the same fixed number for all reads. You can manually set the entropy cutoff with the "minentropy=" flag on the command line, but it's not documented because it's hard to explain concisely. Rather, the different stringencies (strict, loose, etc) each have their own default value for minentropy, with higher values for higher stringency. For your reads, the minentropy flag may be important because they are so low-complexity.

**GenoMax** · 11-22-2016, 08:43 AM

Originally posted by moistplus

is it possible to estimate insert size of non overlapped library ?

By mapping with bbmap (assuming you have a reference).

**Brian Bushnell** · 11-22-2016, 01:19 PM

There is no equivalent to -M, since BBMap does not produce split alignments. Also, by default, BBMap does not consider a pair "proper" unless it aligns in the fragment orientation, with reads on opposite strands. if you are trying to quantify the insert size distribution of a long-mate-pair library with a different orientation, you should add the flag "requirecorrectstrand=f".

To generate an insert size histogram, you can do this:

Code:

(for interleaved reads)
bbmap.sh ref=ref.fa in=reads.fq ihist=ihist.txt

(for reads in two files)
bbmap.sh ref=ref.fa in1=r1.fq in2=r2.fq ihist=ihist.txt

**GenoMax** · 11-23-2016, 11:59 AM

@moistplus: Take a look at these options for BBMap and their default values.

Code:

samestrandpairs=f       (ssp) Specify whether paired reads should map to the
                        same strand or opposite strands.
requirecorrectstrand=t  (rcs) Forbid pairing of reads without correct strand 
                        orientation.  Set to false for long-mate-pair libraries.
killbadpairs=f          (kbp) If a read pair is mapped with an inappropriate
                        insert size or orientation, the read with the lower  
                        mapping quality is marked unmapped.
pairedonly=f            (po) Treat unpaired reads as unmapped.  Thus they will 
                        be sent to 'outu' but not 'outm'.

**Brian Bushnell** · 11-23-2016, 01:25 PM

Basically, you can do something like this:

bbmap.sh in=reads.fq outm=normal_pairs.sam outu=abnormal.fq killbadpairs=t pairedonly=t

Then the proper pairs will be in normal_pairs.sam, and everything else will be in abnormal.fq. You can subsequently apply further filters to abnormal.fq to look at it in more detail. Either file can put output in fastq or sam (depending on the extension) but everything in outu will have its mapping information removed so if you want to see the orientations, you'd need to remap them. You could in a subsequent pass, do this:

bbmap.sh in=abnormal.fq outm=strange_pairs.sam outu=unmapped.fq requirecorrectstrand=f pairedonly=t

That would give you FF and RR pairs in strange_pairs.sam and all the unmapped or half-mapped pairs in unmapped.fq. You can further split them like this:

bbmap.sh in=unmapped.fq outm=half_mapped.sam outu=both_unmapped.fq

**Brian Bushnell** · 11-23-2016, 01:57 PM

Originally posted by moistplus

If I set

HTML Code:

 outm=normal_pairs.fq

, I will have interleaved paired end ?

Yes, they will be interleaved. If you want you can use "outm1=r1.fq outm2=r2.fq" to keep them in 2 files but I find interleaved reads more convenient.

**Brian Bushnell** · 12-01-2016, 10:19 AM

You can output fastq to outm and sam to out or outu (or any similar combination), but you can't output both sam and fastq to outm.

**Brian Bushnell** · 12-06-2016, 10:12 AM

@Moistplus

To clarify, "mix" puts both merged and unmerged reads in the same file, the target of "out=". It should only be used in conjunction with "ecco" which does not actually merge the reads. With "ecco" and "mix" you will get the same number of output reads as input reads. Without "ecco" and "mix" your merged reads will go to "out=" and unmerged reads will go to "outu=", and the total number of reads afterward will be less than the number of input reads.

**tshalev** · 12-15-2016, 10:32 AM

Confusion regarding quality/adapter trimming and read merging

Hi everyone,

First off, I'd like to say thanks Brian, BBMerge has been really useful in speeding up my assemblies and improving assembly quality.

I'm having some confusion regarding how to combine trimming and merging. In the quick guide for BBMerge, it says:

"Adapter-trimming reads is fine and is recommended prior to running BBDuk, particularly if kmers will be used. Quality-trimming is usually not recommended, unless the library has major quality issues at the right end of reads resulting in a low merge rate, in which case only weak trimming (such as qtrim=r trimq=8) should be used."

I'm not sure if "BBDuk" is a typo, as this is the guide for BBMerge. Basically I understand that the process needs to be something like this:

1)

Code:

 bbduk.sh in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

To trim adapters;

2)

Code:

bbmerge-auto.sh in=clean.fq out=merged.fq adapter1=something adapter2=something rem k=62 extend2=50 ecct

To merge reads;

3)

Code:

bbduk.sh in=merged.fq out=merged_trimmed.fq qtrim=r trimq=10

To further trim poor quality reads.

My main issues here are:

1) This seems a bit convoluted, and feels like too much messing with the data.

2) I'm still unsure whether I am actually supposed to trim adapters before merging, and if so, does it make sense to use the adapter flags in BBMerge? Can it use the adapter information if I've already trimmed them?

Thanks!

**j.m.c** · 12-15-2016, 01:29 PM

BBmerge smaller insert sizes than expected

Hi,

I just run BBmerge on a set of paired-end reads (illumina 2x100bp) that had been trimmed with trimmomatic to remove adapter sequences and low quality bases at the end of the reads. Trimmed reads (87 bases long) were merged with BBmerge (following the recommended command for optimal accuracy) and the merged-reads had an insert size that ranged from 35 to 265 bases with an avg of 124 bases.

My questions:

How can I get merged reads that are 35 bases long if my input PE reads were 87 bases long?

Is there something wrong in my commands?

Below the code I used with trimmomatic and BBmerge:

trimmomatic-0.36.jar PE -threads 4 -trimlog YC1.trimlog.txt YC1_R1.fastq YC1_R2.fastq YC1_R1.trimmed.paired.fastq YC1_R1.trimmed.unpaired.fastq YC1_R2.trimmed.paired.fastq YC1_R2.trimmed.unpaired.fastq HEADCROP:13 SLIDINGWINDOW:4:15 MINLEN:60 &

/usr/local/BBMap_36.62/bbmerge.sh in1=YC1_R1.trimmed.paired.fastq in2=YC1_R2.trimmed.paired.fastq out=YC1.trimmed.paired.merged.fastq outu1=YC1_R1.trimmed.paired.unmerged.fastq outu2=YC1_R2.trimmed.paired.unmerged.fastq rem k=62 extend2=50 ecct -Xmx48000m -threads=6 &

Thank you very much!

**Brian Bushnell** · 12-15-2016, 02:05 PM

@tshalev:

Hi! Sorry it's a little convoluted. Well, allow me to explain:

First off, yep, BBDuk was a typo, thanks for catching it. Secondly, I would change your commands to something like this:

Code:

bbduk.sh in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

bbmerge-auto.sh in=clean.fq out=merged.fq [B]outu=unmerged.fq[/B] adapter1=something adapter2=something rem k=62 extend2=50 ecct

bbduk.sh in=merged.fq out=merged_trimmed.fq qtrim=r[B]l[/B] trimq=10

[B]bbduk.sh in=unmerged.fq out=unmerged_trimmed qtrim=r trimq=10[/B]

I always recommend keeping the unmerged reads and using those too for the assembler, particularly if you are assembling with Spades. Some reads, like those in low-complexity areas, will not merge without a very long overlap, so tossing out unmerged reads can incur bias.

As for using adapter sequences for BBMerge - that reduces the false-positive rate. If two reads appear to have a good overlap for merging, but that overlap indicates the insert size is shorter than read length, then if and only if you have specified adapter sequences, BBMerge will look to see if adapter sequences are in the expected locations (the overhanging part outside of the overlap). If there are not adapter sequences there, it will assume that was a false positive and the merge will be rejected. Therefore, if you already did adapter-trimming, all of these pairs that merge with insert shorter than read length should be false positives, but BBMerge won't know that unless you actually give it the adapter sequences. So it will print at the end something like "300 adapter sequences expected, 7 found" or similar, indicating that 293 false-positive merges were prevented.

As for over-processing... that's a good thing to worry about. All the processing steps are beneficial as long as the assembly improves, so I often like to assemble after every step to determine its impact. Adapter-trimming is basically always beneficial. Merging depends on the library, the sequencing mode, and the assembler, but is (in my experience) always beneficial with Spades, Tadpole and Ray, and not beneficial with Megahit. Quality-trimming is generally beneficial if you pick the correct cutoff, but 10 is a nice, conservative value that should generally be fine.

**Brian Bushnell** · 12-15-2016, 02:12 PM

Originally posted by j.m.c View Post

My questions:

How can I get merged reads that are 35 bases long if my input PE reads were 87 bases long?

Is there something wrong in my commands?

Your commands look fine, although I'm a little confused about what you mean by 87-bp reads. Were they initially 87 bp long, or was that the average after trimming? Also, I suggest trying BBMerge prior to quality-trimming unless you have very low quality data, but you can also try either way and see which gives a higher merge rate.

Note that BBMerge's default "mininsert" is 35; it does not, by default, look for overlaps indicating an insert size smaller than that. If you have a lot of short-insert reads you may want to set it lower, or just do adapter-trimming prior to BBMerge and throw away the short junk. You can do adapter-trimming with BBDuk as in the above post.

The 35bp reads you ended up with are because of the short insert. When you have 2x87bp reads with a 35bp insert, you get 35bp of overlap on the 3' end and then 52bp of the 5' end overhanging on each side; that's adapter sequence. BBMerge trims that off so you are left with only the 35bp of genomic sequence.

Topics	Statistics	Last Post
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, Yesterday, 05:31 AM	0 responses 10 views 0 likes	Last Post by seqadmin Yesterday, 05:31 AM
Small Blood Stem Cell Subset Linked to Immune System Aging by seqadmin Started by seqadmin, 10-24-2024, 06:58 AM	0 responses 20 views 0 likes	Last Post by seqadmin 10-24-2024, 06:58 AM
New AI Model Designs Synthetic DNA Switches for Targeted Gene Expression in Specific Cell Types by seqadmin Started by seqadmin, 10-23-2024, 08:43 AM	0 responses 50 views 0 likes	Last Post by seqadmin 10-23-2024, 08:43 AM
Microbes in Urban Spaces Adapt to Disinfectants and Scarce Resources by seqadmin Started by seqadmin, 10-17-2024, 07:29 AM	0 responses 58 views 0 likes	Last Post by seqadmin 10-17-2024, 07:29 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News