I'm using BFAST with BWA (version 0.6.4e) on SOLiD paired-end reads (read1: 50 bp, read2: 35 bp). So far, the following steps work nicely and run quite fast (thanks to help from the BFAST team):

find CALs separately for the 50 bp ends:

bfast match -n 4 -t -f $REF -A 1 -z -T $TEMPDIR -r reads_r1.fastq > reads_r1.bmf

and for the 35 bp ends:

bfast bwaaln -c $REF reads_r2.fastq > reads_r2.bmf

bring them together:

bfast localalign -f $REF -1 reads_r1.bmf -2 reads_r2.bmf -A 1 -t -U -n 8 > reads.baf

But the postprocess step, which finished in a few minutes for single-end data, can take more than 100 hours on 16 CPUs for 50 million read pairs:

bfast postprocess -f $REF -i reads.baf -a 3 -A 1 -R -z -t -n 16 > reads.sam

Also, the reported values seem quite strange to me. Often there are negative means and large standard deviations, e.g.

*********************************************************

Estimating paired end distance...

Used 7438 paired end distances to infer the insert size distribution.

The paired end distance range was from -10240 to 2847.

The paired end distance mean and standard deviation was -5413.80 and 4928.88.

The inversion ratio was 0.999866 (7437 / 7438).

Reads processed: 2700000

*********************************************************

Estimating paired end distance...

Used 9477 paired end distances to infer the insert size distribution.

The paired end distance range was from -4985 to 1899.

The paired end distance mean and standard deviation was -1038.24 and 2097.46.

The inversion ratio was 0.999894 (9476 / 9477).

Reads processed: 2750000

*********************************************************

*********************************************************

Estimating paired end distance...

Used 9925 paired end distances to infer the insert size distribution.

The paired end distance range was from -17 to 7508.

The paired end distance mean and standard deviation was 110.48 and 79.03.

The inversion ratio was 1.000000 (9925 / 9925).

Reads processed: 2800000

*********************************************************
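To see how often the estimates drift, the log blocks above can be tabulated with a small script. This is only a sketch: it assumes the postprocess messages keep exactly the wording shown here, the names `parse_estimates` and `looks_suspicious` are made up for illustration, and the 94-206 window is simply the insert range BioScope reported for this data set (BFAST's "paired end distance" may not be defined identically).

```python
import re

# Matches one "Estimating paired end distance..." block as printed above.
# Assumption: the log wording is exactly as in the excerpts in this post.
BLOCK_RE = re.compile(
    r"Used (\d+) paired end distances.*?"
    r"range was from (-?\d+) to (-?\d+)\..*?"
    r"mean and standard deviation was (-?\d+(?:\.\d+)?) and (\d+(?:\.\d+)?)\..*?"
    r"Reads processed: (\d+)",
    re.DOTALL,
)


def parse_estimates(log_text):
    """Return one dict per estimation block found in the log text."""
    rows = []
    for m in BLOCK_RE.finditer(log_text):
        used, lo, hi, mean, sd, processed = m.groups()
        rows.append({
            "used": int(used),
            "min": int(lo),
            "max": int(hi),
            "mean": float(mean),
            "sd": float(sd),
            "reads_processed": int(processed),
        })
    return rows


def looks_suspicious(row, expected_min=94, expected_max=206):
    """Flag blocks whose mean lies outside the expected insert range.

    The default window is the BioScope range quoted in this post; it is a
    hypothetical threshold, not something BFAST itself uses.
    """
    return not (expected_min <= row["mean"] <= expected_max)


if __name__ == "__main__":
    import sys
    for row in parse_estimates(sys.stdin.read()):
        tag = "CHECK" if looks_suspicious(row) else "ok"
        print(row["reads_processed"], row["mean"], row["sd"], tag)
```

Feeding it the captured stderr of the postprocess run (for example `python check_log.py < postprocess.log`, both file names hypothetical) prints one line per block, which makes it easier to see whether the negative means are occasional or the norm.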

If I use -g for gapped rescue, it's even slower. (By the way, where can I find documentation on how gapped rescue works?)

For the whole set, ABI BioScope could map 34% as proper pairs, 42% of the reads were unmapped, and it reported Insert range 94-206. I split the data set for BFAST, and of the 2 parts that finished, at least 60% mapped but <20% in proper pairs.
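For reference, percentages like these can be recomputed directly from the SAM output using the standard FLAG bits (0x4 = read unmapped, 0x2 = mapped in a proper pair). The sketch below is one way to do that; `pair_stats` is a hypothetical helper name, and it assumes that BFAST and BioScope set the proper-pair bit by comparable criteria, which may not hold.

```python
def pair_stats(sam_lines):
    """Count primary records that are unmapped, mapped, or in a proper pair.

    Uses the FLAG column (field 2) of each SAM record. Secondary (0x100) and
    supplementary (0x800) records are skipped so each read is counted once.
    """
    total = unmapped = proper = 0
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete SAM record
            continue
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if flag & 0x4:
            unmapped += 1
        elif flag & 0x2:
            proper += 1
    mapped = total - unmapped
    return {"total": total, "mapped": mapped,
            "unmapped": unmapped, "proper": proper}


if __name__ == "__main__":
    import sys
    stats = pair_stats(sys.stdin)
    if stats["total"]:
        for key in ("mapped", "unmapped", "proper"):
            print(key, round(100.0 * stats[key] / stats["total"], 1), "%")
```

Run as, for example, `python pair_stats.py < reads.sam` (the script name is hypothetical). The percentages are per read, not per pair, so they are only directly comparable to another mapper's summary if that summary is computed the same way.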

Am I doing something wrong? Or might it be because of bad read quality? Any help will be very much appreciated.

Barbara

## Comment