I'm using BFAST with BWA (version 0.6.4e) on SOLiD paired-end reads (read 1: 50 bp, read 2: 35 bp). So far, the following steps work well and run quite fast (thanks to help from the BFAST team):
Find CALs (candidate alignment locations) separately for the 50 bp ends:
bfast match -n 4 -t -f $REF -A 1 -z -T $TEMPDIR -r reads_r1.fastq > reads_r1.bmf
and for the 35 bp ends:
bfast bwaaln -c $REF reads_r2.fastq > reads_r2.bmf
bring them together:
bfast localalign -f $REF -1 reads_r1.bmf -2 reads_r2.bmf -A 1 -t -U -n 8 > reads.baf
But the postprocess step, which finished in a few minutes for single-end data, can take more than 100 hours on 16 CPUs for 50 million read pairs:
bfast postprocess -f $REF -i reads.baf -a 3 -A 1 -R -z -t -n 16 > reads.sam
Also, the reported paired end distance estimates seem quite strange to me. Often there are negative means and large standard deviations, e.g.:
*********************************************************
Estimating paired end distance...
Used 7438 paired end distances to infer the insert size distribution.
The paired end distance range was from -10240 to 2847.
The paired end distance mean and standard deviation was -5413.80 and 4928.88.
The inversion ratio was 0.999866 (7437 / 7438).
Reads processed: 2700000
*********************************************************
Estimating paired end distance...
Used 9477 paired end distances to infer the insert size distribution.
The paired end distance range was from -4985 to 1899.
The paired end distance mean and standard deviation was -1038.24 and 2097.46.
The inversion ratio was 0.999894 (9476 / 9477).
Reads processed: 2750000
*********************************************************
*********************************************************
Estimating paired end distance...
Used 9925 paired end distances to infer the insert size distribution.
The paired end distance range was from -17 to 7508.
The paired end distance mean and standard deviation was 110.48 and 79.03.
The inversion ratio was 1.000000 (9925 / 9925).
Reads processed: 2800000
*********************************************************
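To see how often the estimate goes off across a long run, the log blocks above can be summarized with a small script. This is only a minimal sketch in Python, not part of BFAST: the function names `parse_estimates` and `is_suspicious` are made up for illustration, the patterns assume the log wording stays exactly as in the excerpt, and the 94-206 window is simply the insert range BioScope reported for this data set.

```python
import re

# Patterns follow the wording of the postprocess log excerpt above.
RANGE_RE = re.compile(r"distance range was from (-?\d+) to (-?\d+)")
MEAN_SD_RE = re.compile(
    r"mean and standard deviation was (-?\d+(?:\.\d+)?) and (-?\d+(?:\.\d+)?)"
)
PROCESSED_RE = re.compile(r"Reads processed: (\d+)")

# The three estimate blocks quoted above, without the separator lines.
SAMPLE_LOG = """\
Estimating paired end distance...
Used 7438 paired end distances to infer the insert size distribution.
The paired end distance range was from -10240 to 2847.
The paired end distance mean and standard deviation was -5413.80 and 4928.88.
The inversion ratio was 0.999866 (7437 / 7438).
Reads processed: 2700000
Estimating paired end distance...
Used 9477 paired end distances to infer the insert size distribution.
The paired end distance range was from -4985 to 1899.
The paired end distance mean and standard deviation was -1038.24 and 2097.46.
The inversion ratio was 0.999894 (9476 / 9477).
Reads processed: 2750000
Estimating paired end distance...
Used 9925 paired end distances to infer the insert size distribution.
The paired end distance range was from -17 to 7508.
The paired end distance mean and standard deviation was 110.48 and 79.03.
The inversion ratio was 1.000000 (9925 / 9925).
Reads processed: 2800000
"""


def parse_estimates(log_text):
    """Return one dict per estimate block, closed by its 'Reads processed' line."""
    estimates, current = [], {}
    for line in log_text.splitlines():
        m = RANGE_RE.search(line)
        if m:
            current["min"], current["max"] = int(m.group(1)), int(m.group(2))
            continue
        m = MEAN_SD_RE.search(line)
        if m:
            current["mean"], current["sd"] = float(m.group(1)), float(m.group(2))
            continue
        m = PROCESSED_RE.search(line)
        if m and "mean" in current:
            current["reads_processed"] = int(m.group(1))
            estimates.append(current)
            current = {}
    return estimates


def is_suspicious(est, expected_min=94, expected_max=206):
    """Flag an estimate whose mean lies outside the assumed insert range."""
    return not (expected_min <= est["mean"] <= expected_max)


if __name__ == "__main__":
    for est in parse_estimates(SAMPLE_LOG):
        flag = "SUSPICIOUS" if is_suspicious(est) else "ok"
        print(est["reads_processed"], est["mean"], est["sd"], flag)
```

Run against the full postprocess log instead of `SAMPLE_LOG`, this would show which batches of reads produce implausible estimates and whether they cluster in particular parts of the input.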
If I use -g for gapped rescue, it's even slower. (By the way, where can I find documentation on how gapped rescue works?)
For the whole set, ABI BioScope mapped 34% of the reads as proper pairs, left 42% unmapped, and reported an insert range of 94-206. I split the data set for BFAST, and in the 2 parts that finished, at least 60% of the reads mapped, but fewer than 20% in proper pairs.
Am I doing something wrong? Or might it be because of bad read quality? Any help will be very much appreciated.
Barbara