**pongly** · 06-25-2014, 04:22 AM

what no reads?

Hi

You seem to have only 200 reads, not enough by a huge margin to make any sensible estimations.

**hartmaier** · 07-23-2014, 06:01 AM

Originally posted by pongly

Hi

You seem to have only 200 reads, not enough by a huge margin to make any sensible estimations.

As I said in my post, I took a subset of my data to troubleshoot what was going on. The same input file was used in both programs, yet the calculated insert sizes were completely different. Yes, I agree that 100 read pairs are not enough to estimate the true insert size distribution accurately. My point, however, is that the two programs give extremely different insert sizes for those same 100 pairs, and it is obvious from the output that bwa is wrong.

**N311V** · 07-23-2014, 05:23 PM

Are the mean and standard deviation the estimated insert size from novoalign?

**Brian Bushnell** · 07-23-2014, 06:58 PM

Thread: [Bio-bwa-help] bwa mem insert size as option | Burrows-Wheeler Aligner

http://sourceforge.net/p/bio-bwa/mailman/bio-bwa-help/thread/[email protected]/

Originally posted by Heng Li

If you have a library with two distinct insert size distributions, the better way is not to perform paired-end mapping at all.

Maybe this library has a small-insert peak and a large-insert peak, which confuses BWA? Often LMP libraries have a substantial fraction of reads with a short insert size. I have not used Novoalign, but the "MP 4000,2500" could perhaps be forcing it to use the higher of two peaks and ignore short inserts. In other words, it is not at all clear that BWA-mem is wrong and Novoalign is right.

Perhaps you should use a 3rd aligner as a tiebreaker. BBMap can plot the insert size distribution so you can see what's going on, though you'd need more like 100k reads to produce a nice smooth curve. e.g.

bbmap.sh -Xmx24g in=reads.fq ref=ref.fa ihist=ihist.txt rcs=f reads=100000

...where the 'rcs=f' flag is for long-mate-pair libraries; otherwise it assumes fragment library orientation for pairs.
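To check for two peaks without a plotting tool, the histogram file can be summarized directly. Below is a minimal Python sketch, assuming `ihist.txt` contains '#'-prefixed header lines followed by tab-separated insert size and count columns; check your own output before relying on that layout. The function names and the cutoff argument are illustrative and not part of BBMap.

```python
def read_ihist(lines):
    """Parse 'insert_size<TAB>count' rows, skipping blank and '#' header lines.

    The two-column layout is an assumption about the histogram file.
    """
    hist = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        size, count = line.split()[:2]
        hist[int(size)] = hist.get(int(size), 0) + int(count)
    return hist


def fraction_below(hist, cutoff):
    """Fraction of pairs whose insert size is strictly below `cutoff` bp."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    return sum(c for s, c in hist.items() if s < cutoff) / total


def weighted_median(hist):
    """Smallest insert size at which the cumulative count reaches half of all pairs."""
    total = sum(hist.values())
    running = 0
    for size in sorted(hist):
        running += hist[size]
        if 2 * running >= total:
            return size
    return None


if __name__ == "__main__":
    # Hypothetical file name taken from the command above.
    with open("ihist.txt") as handle:
        hist = read_ihist(handle)
    print("median insert size:", weighted_median(hist))
    print("fraction under 1000 bp:", fraction_below(hist, 1000))
```

A large fraction of pairs under a cutoff such as 1000 bp, together with a median far above it, would point to the short-insert plus long-insert mixture described above.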

**nucacidhunter** · 07-23-2014, 08:08 PM

It is normal to find paired end reads in mate pair libraries. This becomes more pronounced as the size of the fragments used for mate pair library prep increases. There are at least two sources of paired end reads in mate pair libraries. One is non-biotinylated fragments (fragments lacking the junction) binding to the capture beads and ending up as library fragments. The other is the random shearing of true mate pair fragments. The latter is explained here: http://res.illumina.com/documents/pr...processing.pdf
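To make the distinction concrete, here is a minimal Python sketch that labels a pair by the strands of its leftmost and rightmost reads, so inward-facing pairs (FR, the paired end contaminants) can be counted separately from outward-facing pairs (RF, the orientation expected for Illumina mate pairs). It assumes both reads map to the same chromosome and that positions are leftmost mapping coordinates as in SAM. This is a simplified convention for illustration and is not claimed to match the orientation labels any particular aligner reports.

```python
from collections import Counter


def pair_orientation(pos1, rev1, pos2, rev2):
    """Return 'FR', 'RF', 'FF' or 'RR' for one read pair.

    pos1/pos2: leftmost mapping positions of read 1 and read 2.
    rev1/rev2: True if the read maps to the reverse strand.
    First letter is the strand of the leftmost read, second of the rightmost.
    """
    if pos1 <= pos2:
        left_rev, right_rev = rev1, rev2
    else:
        left_rev, right_rev = rev2, rev1
    return ("R" if left_rev else "F") + ("R" if right_rev else "F")


def tally_orientations(pairs):
    """Count orientations over an iterable of (pos1, rev1, pos2, rev2) tuples."""
    return Counter(pair_orientation(*p) for p in pairs)


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    example = [
        (1000, False, 1300, True),   # inward-facing, short span
        (1000, True, 5000, False),   # outward-facing, long span
    ]
    print(tally_orientations(example))
```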

**hartmaier** · 07-24-2014, 11:04 AM

First, thank you everyone for reviving this old post and getting some nice discussion here.

Originally posted by N311V

Are the mean and standard deviation the estimated insert size from novoalign?

Yes, size selection was done via gel cutting for 3.5-5 kb so novoalign fits perfectly. I also have some larger insert size libraries and novoalign insert sizes always match with what is expected.

I re-ran bwa mem with 100k reads...same result...note the complete lack of R-F reads (correct Illumina mate pairs should be in this orientation).

Code:

[M::main_mem] read 50000 sequences (4242525 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (9024, 0, 0, 9093)
[M::mem_pestat] analyzing insert size distribution for orientation FF...
[M::mem_pestat] (25, 50, 75) percentile: (10, 22, 43)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 109)
[M::mem_pestat] mean and std.dev: (27.99, 23.96)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 142)
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation RR...
[M::mem_pestat] (25, 50, 75) percentile: (10, 22, 45)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 115)
[M::mem_pestat] mean and std.dev: (29.08, 25.38)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 150)
[M::mem_process_seqs] Processed 50000 reads in 8.093 CPU sec, 8.111 real sec
[main] Version: 0.7.10-r789
[main] CMD: bwa mem -M /Volumes/GenomicsIII/Reference_Genomes/hg19.fa test.R1.fastq test.R2.fastq
[main] Real time: 188.272 sec; CPU: 14.480 sec
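For comparing runs, the orientation counts can be pulled out of the log automatically. The following is a minimal Python sketch that parses the `# candidate unique pairs` line exactly as it appears in the log above; the regular expression is tied to that wording and may need adjusting for other bwa versions. The function names are illustrative.

```python
import re

# Matches the mem_pestat summary line shown in the log above.
PESTAT = re.compile(
    r"candidate unique pairs for \(FF, FR, RF, RR\): "
    r"\((\d+), (\d+), (\d+), (\d+)\)"
)


def orientation_counts(log_line):
    """Return {'FF': n, 'FR': n, 'RF': n, 'RR': n}, or None if the line does not match."""
    match = PESTAT.search(log_line)
    if match is None:
        return None
    return dict(zip(("FF", "FR", "RF", "RR"), map(int, match.groups())))


def orientation_fraction(counts, orientation):
    """Fraction of candidate pairs reported in the given orientation."""
    total = sum(counts.values())
    return counts[orientation] / total if total else 0.0


if __name__ == "__main__":
    line = ("[M::mem_pestat] # candidate unique pairs for "
            "(FF, FR, RF, RR): (9024, 0, 0, 9093)")
    counts = orientation_counts(line)
    print(counts, orientation_fraction(counts, "RF"))
```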

Originally posted by nucacidhunter

It is normal to find paired end reads in mate pair libraries.

Yup, I know, and I expect some. After plotting insert sizes from Novoalign, this is a very minor fraction for the new Nextera-based Illumina protocol. I've also aligned the old v2 Illumina MP data with Novoalign, which shows a much larger fraction of contaminating 'inward' facing reads (the new Nextera kit is much better at eliminating them). So Novoalign can detect both library types if they are there. Together, this tells me that bwa mem, at least in its current form, cannot be used with large insert mate pair data (unless someone has additional suggestions).

Originally posted by Brian Bushnell

Perhaps you should use a 3rd aligner as a tiebreaker. BBMap can plot the insert size distribution so you can see what's going on, though you'd need more like 100k reads to produce a nice smooth curve.

I've never run BBMap before, and I got an 'out of memory' error on my local machine with 16GB. I have access to more on a cluster, but based on the above explanation I am convinced that bwa mem cannot work with large insert mate pair data.

**Brian Bushnell** · 07-24-2014, 11:38 AM

Originally posted by hartmaier

I've never run BBMap before, and I got an 'out of memory' error on my local machine with 16GB. I have access to more on a cluster, but based on the above explanation I am convinced that bwa mem cannot work with large insert mate pair data.

Yes, BBMap normally requires ~20GB for a human reference, though you can run it on a 16GB node with this command:

bbmap.sh -Xmx12g in=reads.fq ref=ref.fa ihist=ihist.txt rcs=f reads=100000 usemodulo nodisk


BWA MEM mate pair incorrect insert sizes?
