I am flummoxed by "fastq" formats. My goal for the moment is to test out BreakDancer to find translocations at low coverage. I don't have our own data yet, so want to use data from SRA. I have downloaded a paired-end data set (two 'fastq' files called SRR031156_1.fastq and SRR031156_2.fastq) from SRA but I cant get the SRA file format to be accepted by anything.
@SRR031156.3 HWI-EAS367_fcID30810AAXXr1:6:1:1564:603 length=31
GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG
+SRR031156.3 HWI-EAS367_fcID30810AAXXr1:6:1:1564:603 length=31
<<<<<<<<<<<<<<<<<<<<<<4<<<<<<<<
@SRR031156.4 HWI-EAS367_fcID30810AAXXr1:6:1:1183:938 length=31
GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
+SRR031156.4 HWI-EAS367_fcID30810AAXXr1:6:1:1183:938 length=31
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
My first effort is to align these nice fastq files with BWA, but I get a segmentation fault. I have checked some other threads on this forum, and it seems clear that the BWA segfault is due to problems with fastq format. This is just bad. The most popular aligner should work with the official standard file archive format in the public data repository.
[genomes@hactar BreakDancer]$ ./bwa-0.5.8a/bwa aln Mouse_mm9_fa/MUS SRX014041/SRR031156_1.fastq > MUS_aln.sai
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... Segmentation fault
My second effort is to try to convert to SAM or BAM format using FastqToSam.jar from the SAMTOOLS/Picard toolkit. This fails because there is not a /1 or /2 at the end of the header line for each read.
I really do not want to write some new Perl script for such an obvious task. There must be a well validated workflow out there from SRA to BWA.
... And from BWA output to BreakDancer (to avoid my next post when that also turns out to be problematic)
@SRR031156.3 HWI-EAS367_fcID30810AAXXr1:6:1:1564:603 length=31
GTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG
+SRR031156.3 HWI-EAS367_fcID30810AAXXr1:6:1:1564:603 length=31
<<<<<<<<<<<<<<<<<<<<<<4<<<<<<<<
@SRR031156.4 HWI-EAS367_fcID30810AAXXr1:6:1:1183:938 length=31
GATCGGAAGAGCGGTTCAGCAGGAATGCCGA
+SRR031156.4 HWI-EAS367_fcID30810AAXXr1:6:1:1183:938 length=31
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
My first effort is to align these nice fastq files with BWA, but I get a segmentation fault. I have checked some other threads on this forum, and it seems clear that the BWA segfault is due to problems with fastq format. This is just bad. The most popular aligner should work with the official standard file archive format in the public data repository.
[genomes@hactar BreakDancer]$ ./bwa-0.5.8a/bwa aln Mouse_mm9_fa/MUS SRX014041/SRR031156_1.fastq > MUS_aln.sai
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... Segmentation fault
My second effort is to try to convert to SAM or BAM format using FastqToSam.jar from the SAMTOOLS/Picard toolkit. This fails because there is not a /1 or /2 at the end of the header line for each read.
I really do not want to write some new Perl script for such an obvious task. There must be a well validated workflow out there from SRA to BWA.
... And from BWA output to BreakDancer (to avoid my next post when that also turns out to be problematic)
Comment