Unconfigured Ad

**lh3** · 11-09-2009, 07:02 AM

Someone has reported that bwa-short segfaults given 454 reads. I have not looked into the details. Nonetheless, if you read the documentation or the FAQ, you will know that bwa-{aln,sampe,samse} is not designed for 454 reads at all. You should use "bwasw" instead. For sequencing reads, bwa typically uses less than 4GB memory.

**mingkunli** · 11-09-2009, 07:08 AM

btw, how did you covert 454 reads to fastq format?

**dpmccul** · 11-09-2009, 07:32 AM

I used sfffile to extract the data of interest to a .sff file, and then sffinfo to go to a *.fa and
*.qual file. I then wrote a c program to convert these to fastq (I know c better than perl),
and then verified with validateFastq Version 1 by Tyler Backman.

The c code for conversion from fasta to fastq is:
-----------------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <ctype.h>

#define MAX_STRING_SIZE 200

/**********************************************************************
*
* Program to convert fasta (with qual) files to fastq format.
*
* fasta2fastq takes the path to the fasta fil4e and associated qual
* file, and writes to the fastq file the converted file format
* information.
*
* Dean P McCullough
* Nov 8, 2009
*
*********************************************************************/

int main(int argc, char *argv[] )
{
FILE *fasta, *qual, *fastq;
int i,jj;
char k;
char *aline, *qline;
int nreads,imin,imax;

/******************
* USAGE ********
******************/
if( argc < 4 )
{
fprintf(stderr,"\n\n");
fprintf(stderr,"usage: %s dna_fasta_file dna_qual_file fastq_file\n", argv[0]);
fprintf(stderr,"Convert the fasta and qual file to fastq format.\n");
exit(1);
}

fasta = fopen(argv[1],"r");

fprintf(stderr,"Fasta file = %s \n",argv[1]);

if(fasta == NULL)
{
fprintf(stderr,"fopen failed for fasta file \n");
exit(1);
}

qual = fopen(argv[2],"r");

fprintf(stderr,"Qual file = %s \n",argv[2]);

if(qual == NULL)
{
fprintf(stderr,"fopen failed for qual file \n");
exit(1);
}

fastq = fopen(argv[3],"w");

fprintf(stderr,"Fastq output file = %s \n",argv[3]);

if(fastq == NULL)
{
fprintf(stderr,"fopen failed for fastq output file \n");
exit(1);
}

/* line for fgets reads */
aline=(char *)malloc(MAX_STRING_SIZE*sizeof(char));
if( aline == NULL )
{
fprintf(stderr,"malloc failed for aline\n");
exit(1);
}

qline=(char *)malloc(MAX_STRING_SIZE*sizeof(char));
if( qline == NULL )
{
fprintf(stderr,"malloc failed for qline\n");
exit(1);
}

memset((char *)aline,'\0',MAX_STRING_SIZE*sizeof(char) );
memset((char *)qline,'\0',MAX_STRING_SIZE*sizeof(char) );

nreads=0;
imin=1000;
imax = -1;

fgets( (char *)aline,MAX_STRING_SIZE,fasta);
fgets( (char *)qline,MAX_STRING_SIZE,qual);
while(strlen(aline) != 0)
{
if(strncmp(&aline[0],">",1)==0)
{
nreads++;

fprintf(fastq,"@%s",aline+1);

if(strcmp(aline,qline)!=0)
{
fprintf(stderr,"Read labels do not agree.\n%s\n%s\n",
aline,qline);
exit(1);
}

for(;

{
memset((char *)aline,'\0',MAX_STRING_SIZE*sizeof(char) );
fgets( (char *)aline,MAX_STRING_SIZE,fasta);
if(strncmp(&aline[0],">",1)==0) break;
if(strlen(aline) == 0) break;
fputs(aline,fastq);
}
fprintf(fastq,"+\n");

for(;

{
memset((char *)qline,'\0',MAX_STRING_SIZE*sizeof(char) );
fgets( (char *)qline,MAX_STRING_SIZE,qual);
if(strncmp(&qline[0],">",1)==0) break;
if(strlen(qline) == 0) break;
i=0;
while(qline[i]!='\0')
{
jj=atoi(&qline[i]);
k = jj+33;
fprintf(fastq,"%c",k);
if(imin > jj) imin=jj;
if(imax < jj) imax=jj;
i++;

while(isdigit(qline[i]))i++;
while(isspace(qline[i]))i++;
}
fprintf(fastq,"\n");
}
}
else
{
fprintf(stderr,"Something wrong \n%s\n%s\n",
aline,qline);
exit(1);
}
}

fprintf(stderr,"%d reads min = %d max =%d\n",nreads,imin,imax);
}

**jnfass** · 11-16-2009, 03:20 PM

I don't want to get into a language debate, but here's the same thing in perl:

Code:

#!/usr/bin/perl

# AUTHOR: Joseph Fass
# LAST REVISED: August 2009
# 
# The Bioinformatics Core at UC Davis Genome Center
# http://bioinformatics.ucdavis.edu
# Copyright (c) 2008 The Regents of University of California, Davis Campus.
# All rights reserved.

use strict;

my $usage = "\nusage: $0 <reads (fasta format)> <qualities (fasta format)>\n\n".
            "-- All sequences and qualities must be on a single line each. --.\n".
            "-- All quality values MUST be followed by a space character. --\n\n".
            "Merges fasta and qual files into a single Sanger fastq file (STDOUT).";

my $seqfile = shift or die $usage;
my $qualfile = shift or die $usage;

open F, "<$seqfile" or die $usage;
open Q, "<$qualfile" or die $usage;

my ($fh, $fs, $qh, $qs);
my @quals;  my $qv;

while (<F>) {
	$fh = $_;
	$fs = <F>;
	$qh = <Q>;
	$qs = <Q>;
	if ($fh eq $qh) {
		$fh =~ s/>/@/;
		print $fh.$fs;
		$qh =~ s/>/+/;
		print $qh;
		@quals = split(/\s+/,$qs);
		for (my $i=0; $i<=$#quals; $i++) {
			$qv = $quals[$i] + 33;
			print chr($qv);
		}
		print "\n";
	}
	else { print "PROBLEM\n" }
}

close F; close Q;

A note on the above ... I tend to write all my code depending on having each sequence on one line ... even when I'm working with huge scaffolds or chromosomes ... can be hard to read, but generally doesn't cause problems unless the sequences are really huge ... and it makes coding a lot easier. So here's some perl for converting to the 'oneline' format:

Code:

#!/usr/bin/perl

# AUTHOR: Joseph Fass
# LAST REVISED: October 2009
# 
# The Bioinformatics Core at UC Davis Genome Center
# http://bioinformatics.ucdavis.edu
# Copyright (c) 2009 The Regents of University of California, Davis Campus.
# All rights reserved.

use strict;

my $usage = "\nusage: cat <input fasta file> | $0 > <output fasta file>\n\n".
            "For each sequence, puts nt/aa all on one line following header line.\n".
            "Reads from and writes to STDOUT, so use redirection ... \n\n\n";

my $block=""; my $line;

while (<>) {
	$line = $_;
	if (m/^>/) {  # header line
		print $block;  # which is either empty, or capped by a newline
		$block = $line."\n";  # adds another newline, because it will be chomped below
	}
	else {
		chomp $line;
		if (substr($line,-1) =~ m/[0-9]/) {
			chomp $block;  # remove newline before adding more sequence
			$block .= $line." \n";  # must be fasta qualities, and needs a space
		}
		else {
			chomp $block;  # remove newline before adding more sequence
			$block .= $line."\n";  # not fasta qualities, so no space needed
		}
	}
}
print $block;  # print last sequence block

**Francesco Lescai** · 04-15-2010, 08:47 AM

I have the same problem.
I got a segmentation error in indexing the hg19.fasta sequence of the human genome.
then I changed machine (actually to one with much less memory) and it worked.. magic?
then I tried to bwa align my reads and I get this typical pattern, independently of the computer, and indipendently of the number of reads I'm trying to align
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... Segmentation fault

If I align the same reads over the chromosome X only, it seems to work but then I get reads without coordinates (as if they were not mapped).
Of course I know they map very well because I aligned them with MAQ and I got 90% of the reads mapping.

what about the segmentation fault?

I have the same problem.
I got a segmentation error in indexing the hg19.fasta sequence of the human genome.
then I changed machine (actually to one with much less memory) and it worked.. magic?
then I tried to bwa align my reads and I get this typical pattern, independently of the computer, and indipendently of the number of reads I'm trying to align
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... Segmentation fault

If I align the same reads over the chromosome X only, it seems to work but then I get reads without coordinates (as if they were not mapped).
Of course I know they map very well because I aligned them with MAQ and I got 90% of the reads mapping.

what about the segmentation fault?

**Lien** · 04-30-2010, 12:30 AM

I also have the same segmentation fault when trying to align Illumina fastq reads to human genome.

Anyone has an idea to solve this problem?

**Lien** · 05-06-2010, 07:30 AM

I used the latest version of BWA 0.5.7 and got the segmentation fault when aligning Illumina reads. Even when I aligned only 10 reads, this segmentation fault appeared.
Then I changed to BWA 0.5.6 and everything works just as it should!

Some other question:
I aligned my Illumina data both with Bowtie and with BWA. With Bowtie, I get a nice overview of the number of reads that were processed, the number of reads with at least 1 reported alignment, ...
Is it possible to get similar results for BWA?

Thanks,
Lien

**maubp** · 05-06-2010, 08:35 AM

Originally posted by dpmccul View Post

I used sfffile to extract the data of interest to a .sff file, and then sffinfo to go to a *.fa and
*.qual file. I then wrote a c program to convert these to fastq (I know c better than perl),

There are much easier ways to do SFF to FASTQ, for example sff_extract:

http://bioinf.comav.upv.es/sff_extract/index.html

As long as you are not working with paired end 454, you can also do this with Biopython 1.54b or later in just two lines of code (import the Bio.SeqIO module, call the Bio.SeqIO.convert function).

404 Page not found

http://news.open-bio.org/news/2010/04/biopython-1-54-beta-released/

**lh3** · 05-06-2010, 08:38 AM

@Lien

Please use the version of bwa in SVN. You can get the statistics from samtools' flagstat. Or a simple awk command will do.

**prisnirath** · 05-31-2011, 02:06 AM

BWA for Roche454 data.

Hi there..
I am using BWA to perform sequence alignment for the reads generated from Roche454 data.
I used sfffile and sffinfo to MID sort and trim off mids from SFF files.
So, I have got a read file in fastq format, and my reference sequences in fasta format.
I used BWa to index my reference fasta file.
Then, aligning the read fastq file against my reference fasta file.
Now, I am getting a SAM format file with data in this format::

GGDP4G001AT8TK 0 gi|237757283|ref|NM_007294.3| 1331 197 57M1I50M1I13M1I5M1I4M1I15M1I9M1I23M1I12M1I8M1I5M1I9M1I24M1I16M25S * 0 0 ACTGAAGATGTTCCTTGGATAACACTAAATAGCAGCATTCAGAAAGTTAATGAGTGGTTTTTCCAGAAGTGATGAACTGTTAGGTTCTGATGACTCACATGATGGGGAGGTCTGAATCAAATAGCCAACGGTAGGCTGATGTATTGGACGGTTCTAAATAGNGGTAGATGAATATTCTGGTTCGTTCAGAGAAAATAAGACTTACTTGGCCGAGTGATCCTCCATGAGGCTTTATTATGTAAAAGTAGAAGAAGTTCACTCCAGTCATCGCTGCGCCCCTCGACGCAC IIIIIIIIIIIIIIIDHHHIGGHHHFFFDDDDFFFFFDDDFFFFFDDC><@A???4411333==BBBBDFFFFFFFFFA@@FFFFFFFFFFFFFFFFFFFFFF;;99933998<<899666:7=;;../..4/..:<==<<<<;;../4....//..6=49!<<??AAB::555<<?>>>>40111::<<<4455A5143446332155-.////3::;9322<7442631,,-,,-172....995<<<?>><44444444333:::::57---,,,,7,----175 AS:i:139 XS:i:0 XF:i:3 XE:i:2 XN:i:0

I am not clearly understanding this part.
I am using the default parameters for all.

Topics	Statistics	Last Post
High-Resolution Sequencing Exposes Hidden Toxoplasma Diversity by SEQadmin2 Started by SEQadmin2, 07-02-2026, 11:08 AM	0 responses 17 views 0 reactions	Last Post by SEQadmin2 07-02-2026, 11:08 AM
New AI Model Captures Long-Range Genomic Signals to Improve RNA Splice Site Prediction by SEQadmin2 Started by SEQadmin2, 06-30-2026, 05:37 AM	0 responses 18 views 0 reactions	Last Post by SEQadmin2 06-30-2026, 05:37 AM
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 21 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 54 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM

Unconfigured Ad

Using BWA

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News