  • Dedupe will merge assemblies, but it will not produce consensus sequences or join overlapping reads; it only removes sequences that are fully contained within other sequences (allowing the specified number of mismatches or edits).
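
    A minimal Dedupe run along these lines performs that containment removal (a sketch: the filenames are placeholders, and the "s" and "e" flags, which set the allowed substitutions and edits, should be checked against dedupe.sh's built-in help):

    Code:
    dedupe.sh in=assembly1.fa,assembly2.fa out=deduped.fa s=3 e=2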

    BBMap works pretty well for mapping Nanopore reads, but as GenoMax stated, it has a length cap of 6kbp. Reads longer than this will be broken into 6kbp pieces and mapped independently. The command I have been using for mapping Nanopore reads:

    Code:
    mapPacBio.sh -Xmx20g k=7 in=reads.fastq ref=reference.fa maxlen=1000 minlen=200 idtag ow int=f qin=33 out=mapped1.sam minratio=0.15 ignorequality slow ordered maxindel1=40 maxindel2=400

    The "maxlen" flag shreds them to a max length of 1000; you can set that up to 6000. But I found 1000 gave a higher mapping rate.



    • Hi, Brian,
      I used BBMap to evaluate the insert size of a paired-end dataset downloaded from the SRA database, and found that the peak of the histogram is at 89 bp. However, the PE reads are 96+96 bp. That seems weird, since I saw elsewhere that you defined the insert size as "PE reads + unsequenced part". In addition, the minimum size starts at 1 bp!
      Although the dataset has some quality problems, it is surprising to get this result.
      Could you help me? Sorry for my poor English...



      • It sounds like you don't have any "unsequenced part" in this library. If the insert size is shorter than the read length, that means the library was fragmented into pieces that were too small, and the sequencer reads off the end of the genomic insert into the adapter sequence. Before using those reads, you should do adapter trimming with (for example) BBDuk.

        As for the minimum size being 1bp, that is probably a read pair being mis-mapped. You can get a second opinion on the insert size distribution with BBMerge, which calculates it by looking for overlaps instead of by mapping, but I expect the result to be the same in this case. BBMap is more accurate for insert size calculations with long inserts (which don't overlap), while BBMerge is more accurate for insert sizes shorter than read length because the adapter sequence interferes with mapping.
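
        For the BBMerge cross-check mentioned above, a run along these lines writes an insert-size histogram based on overlaps (a sketch; the filenames are placeholders, and "ihist" is BBMerge's histogram output flag):

        Code:
        bbmerge.sh in1=reads_R1.fq in2=reads_R2.fq ihist=ihist.txt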



        • Trimming of Nextera mate pair reads

          Hi Brian,

          I am using BBMap to trim Nextera mate pair reads. Could you please provide a workflow, with parameters, showing how to do this with splitnextera.sh and bbduk.sh?

          Thanks in advance.



          • Sure. For simplicity, I will assume the reads are interleaved in a single file initially (you can transform between interleaved and dual-file reads with reformat.sh if you want). First thing is to trim the adapters with BBDuk:

            Code:
            bbduk.sh in=reads.fq out=trimmed.fq ktrim=r k=23 mink=12 hdist=1 tbo tpe ref=nextera_LMP_adapter.fa.gz
            nextera_LMP_adapter.fa.gz is in /bbmap/resources/.
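
            If your reads start out in two files, a reformat.sh call along these lines should interleave them first (a sketch; the filenames are placeholders):

            Code:
            reformat.sh in1=reads_R1.fq in2=reads_R2.fq out=reads.fq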

            Next, if you want to remove contaminants/spikeins such as PhiX, you can do so now with a filtering pass of BBDuk, but that's optional.
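
            Such a filtering pass might look like this (a sketch; the PhiX reference filename is an assumption, so check /bbmap/resources/ for the exact name):

            Code:
            bbduk.sh in=trimmed.fq out=filtered.fq ref=phix174_ill.ref.fa.gz k=31 hdist=1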

            Finally, split the reads into LMP pairs, short pairs, and singletons:

            Code:
            splitnextera.sh in=trimmed.fq out=longmatepairs.fq outf=fragmentpairs.fq outu=unknownpairs.fq outs=singletons.fq mask=t
            At this point you can quality-trim the individual sub-libraries if you want; it's important not to do so sooner, as it interferes with the ability to detect junctions. If all you care about is the LMP data, you can keep longmatepairs.fq and throw away the others. But theoretically, you might be able to get enough fragment pairs and singletons to assemble contigs from those, then use the LMP for scaffolding, and thus get a complete single-library assembly. That's what Illumina is encouraging, at any rate.

            The majority of the reads in the "unknown" bin were LMP, not short-fragment reads, in the one library I tested (and also according to Illumina's data), but I would not really consider it safe to use for scaffolding OR contig-building, and would probably throw it away unless you don't have enough data.



            • Originally posted by Brian Bushnell View Post
              Hi Brian,

              Thank you for your patient instruction!
              I have three more questions about the split step:
              1) After the split, what are the orientations of longmatepairs.fq and fragmentpairs.fq? I guess RF and FR, respectively?
              2) What will be the insert size of the reads in fragmentpairs.fq, around 300 bp? Does the original library size matter?
              3) I have tried another tool called NxTrim, which has a similar function of separating Nextera mate pairs into different groups; it gives me about 40% fragment pairs. Do you think that number is reasonable? If you have used it, could you please tell me the difference between splitnextera.sh and NxTrim for this job?

              Thanks for your time in advance; I know that is a lot to answer.



              • 1) LMP maps as "outie", so the left read is reverse and the right read is forward. Fragment maps as "innie" like a normal fragment library, so left read forward and right read reverse.
                2) The insert size of the fragment pairs depends on your shearing length, while the insert size of the LMP pairs depends on something else, like the transposase concentration. In my data, frag peaked at 319bp and LMP peaked at 2605bp (see image).

                3) This was the output for one of our libraries:
                Code:
                Long Mate Pairs: 851440 reads (42.57%) 98864654 bases (32.74%)
                Fragment Pairs: 655758 reads (32.79%) 82605742 bases (27.35%)
                Unknown Pairs: 496638 reads (24.83%) 74992338 bases (24.83%)
                Singletons: 170008 reads (8.50%) 9856956 bases (3.26%)
                
                Adapters Detected: 751681 (75.17%)
                Bases Recovered: 266319690 (88.19%)
                I think 40% as fragment is reasonable; I assume the exact amount depends on reagent concentrations and times.

                I designed splitnextera.sh to recover more pairs as LMP and fewer as fragments, compared to NXTrim; you have some leeway in deciding which one you want to recover. In general, it should run faster, recover more reads as LMP, and recover a higher total number of reads and bases. But that's just based on my reading of the NXTrim description; I have not actually tested them on the same dataset, so I'd be interested if you could give us some numbers.

                Incidentally, I'm considering upgrading it to increase the number of fragments and reduce the amount of unknown, but the difference would probably be slight since unknown is already mostly LMP.
                Attached Files
                Last edited by Brian Bushnell; 05-11-2015, 09:33 AM.



                • Originally posted by Brian Bushnell View Post
                  Your answer is always informative and helpful!

                  Thanks!



                  • Hi Brian,

                    Could you please clarify the license for BBMap? I thought I might try a couple of runs with it for fun, but once I installed it I saw conflicting license information. In the top level of the tar file there is a license.txt that looks like a fairly permissive license (it doesn't look like one of the standard OSI open source licenses, but that's probably fine), yet within docs/readme.txt it explicitly states that the software is to be used for non-commercial use only.

                    Cheers,
                    Len.
                    Len Trigg, Ph.D.
                    Real Time Genomics
                    www.realtimegenomics.com



                    • Hi Len,

                      To summarize, it appears that BBMap/BBTools is usable by any individual or organization, for any purpose (as long as the disclaimers are followed), without any need to recompense Berkeley Labs in any way. So a for-profit enterprise may run it without any restrictions.

                      The text about it being free for noncommercial use was added by me as a clarification, in case people were wondering. Initially the legal department made it clear to me that noncommercial use was fine, but I was not sure about the status of commercial use (it was hard to get them to state things explicitly). Subsequently, however, BBMap was made part of Geneious, after they contacted Berkeley Lab and were told by the legal department that it's essentially an unrestricted license and commercial organizations do not need to pay anything. I should probably remove that line from the readme.



                      • Brian,

                        Thanks for the reply, that clears things up nicely! (I agree with you about adjusting the readme.)

                        Cheers,
                        Len.
                        Len Trigg, Ph.D.
                        Real Time Genomics
                        www.realtimegenomics.com



                        • bbmap settings

                          Hi Brian,
                          I was trying to use BBMap to remap some PacBio reads to reference sequences in order to generate the detailed cigar string that BBMap includes in SAM output. I want to produce mappings for sequences that extend past the reference on one or both ends but match at 99% identity over the overlapping portion. Previously I was using Geneious for mapping; its tool does the soft clipping I am looking for, but it does not output the right type of cigar string. Could you help me with the specific settings for BBMap to include these mapping situations in the output?



                          • For PacBio reads, you need to use mapPacBio.sh rather than BBMap.sh. It's the same algorithm, but with different parameters and support for longer reads.

                            Code:
                            mapPacBio.sh ref=reference.fasta in=reads.fasta out=mapped.sam idfilter=0.99 local

                            (if the reads are in fastq format, then also add the flag "maxlen=6000").

                            "idfilter=0.99" will limit the mappings to reads with 99% identity. The local flag will soft-clip the overhang bases before the idfilter is applied, so only the overlapping bases will be considered.



                            • Hi Brian,
                              I took your advice and used the "local" parameter with mapPacBio.sh, but I am still seeing reads that map perfectly to a reference in the overlap, and whose ends should be soft-clipped, absent from the output file. However, if I run mapPacBio.sh in isolation with just one read (fasta) and one reference (fasta), I get the expected output. The soft-clipped perfect mappings only seem to be missed in the larger file.

                              output when running in isolation:
                              Code:
                              @HD     VN:1.4  SO:unsorted
                              @SQ     SN:16120_FcGR1_776delG_Ctg25    LN:894
                              @PG     ID:BBMap        PN:BBMap        VN:33.65        CL:java -Djava.library.path=/Users/mgraham/Desktop/PB6-FCGR/bbmap/jni/ -ea -Xmx10g BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 ref=776delG.fasta in=lbc1_read.fasta out=test.sam outputunmapped=f idfilter=0.99 maxlen=6000 local -Xmx10g
                              lbc1_read_1165  0       16120_FcGR1_776delG_Ctg25       1       23      14S894=245S     *       0       0
                                     AGCTTGGAGACAACATGTGGTTCTTGACAGCTCTGCTCCTTTGGGTTCCAGTTGATGGGCAAGTGGATACCACAAAGGCAGTGATCACTTTGCAGCCTCCATGGGTCAGCGTGTTCCAAGAGGAAACTGTAACCTTACAGTGTGAGGTGCCCCGTCTGCCTGGGAGCAGCTCCACACAGTGGTTTCTCAATGGCACAGCCACTCAGACCTCGACTCCCAGCTACAGAATCACCTCTGCCAGTGTCAAGGACAGTGGTGAATACAGGTGCCAGAGAGGTCCCTCAGGGCGAAGTGACCCCATACAGCTGGAAATCCACAGAGACTGGCTACTACTGCAGGTATCCAGCAGAGTCTTCACAGAAGGAGAACCTCTGGCCTTGAGGTGTCATGCATGGAAGGATAAGCTGGTGTACAATGTGCTTTACTATCAAAATGGCAAAGCCTTTAAGTTTTTCTACCGGAATTCTCAACTCACCATTCTGAAAACCAACATAAGTCACAACGGCGCCTACCACTGCTCAGGCATGGGAAAGCATCGCTACACATCAGCAGGAGTATCTGTCACTGTGAAAGAGCTATTTCCAGCTCCAGTGCTGAATGCATCCGTGACATCCCCGCTCCTGGAGGGGAATCTGGTCACCCTGAGCTGTGAAACAAAGTTGCTTCTGCAGAGGCCTGGTTTGCAGCTTTACTTCTCCTTCTACATGGGCAGCAAGACCCTGCGAGGCAGGAACACGTCCTCTGAATACCAAATACTAACTGCTAGAAGAGAAGACTCTGGTTTTACTGGTGCGAGGCCACCACAGAAGACGGAAATGTCCTTAAGCGCAGCCCTGAGTTGGAGCTTCAAGTGCTTGGCCTCCAGTTACCAACTCCTGTCTGGCTTCATGTCCTTTTCTATCTGGTAGTGGGAATAATGTTTTTAGTGAACACTGTTCTCTGGGTGACAATACGTAAAGAACTGAAAAGAAAGAAAAAGTGGAATTTAGAAATCTCTTTGGATTCTGCTCATGAGAAGAAGGTAACTTCCAGCCTTCAAGAAGACAGACATTTAGAAGAAGAGCTGAAGAGTCAGGAACAAGAATAAAGAACAGCTGCAGGAAGGGGTGCACCGGAAGGAGCCTGAGGAGGCCAAGTAGCAGCAGCTCAGTGG       *       NM:i:0  AM:i:23
                              I can email a compressed set of files to demonstrate this if you'd be willing to take the time to look into it.
                              Last edited by Brian Bushnell; 05-15-2015, 01:06 PM.



                              • Yes, please do email it, and I'll take a look.
