Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using BWA

    I am having problems with bwa, and would like some direction. I am running on some
    454 reads for Salmonella.

    I have converted the reads to fastq format.
    I index with the command:
    bwa index -p sal_chr sal.fa

    I then align with the command:
    bwa aln -t 4 sal_chr 454Reads.MID1.sff.fq sal.sam
    and get:
    [bwa_aln] 17bp reads: max_diff = 2
    [bwa_aln] 38bp reads: max_diff = 3
    [bwa_aln] 64bp reads: max_diff = 4
    [bwa_aln] 93bp reads: max_diff = 5
    [bwa_aln] 124bp reads: max_diff = 6
    [bwa_aln] 157bp reads: max_diff = 7
    [bwa_aln] 190bp reads: max_diff = 8
    [bwa_aln] 225bp reads: max_diff = 9
    [bwa_aln_core] calculate SA coordinate... Segmentation fault

    Ideas on what I am doing wrong?
    I am requesting 10GB memory, is this insufficient?
    Thank you.

  • #2
    Someone has reported that bwa-short segfaults given 454 reads. I have not looked into the details. Nonetheless, if you read the documentation or the FAQ, you will know that bwa-{aln,sampe,samse} is not designed for 454 reads at all. You should use "bwasw" instead. For sequencing reads, bwa typically uses less than 4GB memory.

    Comment


    • #3
      btw, how did you covert 454 reads to fastq format?

      Comment


      • #4
        I used sfffile to extract the data of interest to a .sff file, and then sffinfo to go to a *.fa and
        *.qual file. I then wrote a c program to convert these to fastq (I know c better than perl),
        and then verified with validateFastq Version 1 by Tyler Backman.

        The c code for conversion from fasta to fastq is:
        -----------------------------------------------------------------------------------
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <strings.h>
        #include <ctype.h>

        #define MAX_STRING_SIZE 200

        /**********************************************************************
        *
        * Program to convert fasta (with qual) files to fastq format.
        *
        * fasta2fastq takes the path to the fasta fil4e and associated qual
        * file, and writes to the fastq file the converted file format
        * information.
        *
        * Dean P McCullough
        * Nov 8, 2009
        *
        *********************************************************************/

        int main(int argc, char *argv[] )
        {
        FILE *fasta, *qual, *fastq;
        int i,jj;
        char k;
        char *aline, *qline;
        int nreads,imin,imax;


        /******************
        * USAGE ********
        ******************/
        if( argc < 4 )
        {
        fprintf(stderr,"\n\n");
        fprintf(stderr,"usage: %s dna_fasta_file dna_qual_file fastq_file\n", argv[0]);
        fprintf(stderr,"Convert the fasta and qual file to fastq format.\n");
        exit(1);
        }

        fasta = fopen(argv[1],"r");

        fprintf(stderr,"Fasta file = %s \n",argv[1]);

        if(fasta == NULL)
        {
        fprintf(stderr,"fopen failed for fasta file \n");
        exit(1);
        }

        qual = fopen(argv[2],"r");

        fprintf(stderr,"Qual file = %s \n",argv[2]);

        if(qual == NULL)
        {
        fprintf(stderr,"fopen failed for qual file \n");
        exit(1);
        }

        fastq = fopen(argv[3],"w");

        fprintf(stderr,"Fastq output file = %s \n",argv[3]);

        if(fastq == NULL)
        {
        fprintf(stderr,"fopen failed for fastq output file \n");
        exit(1);
        }


        /* line for fgets reads */
        aline=(char *)malloc(MAX_STRING_SIZE*sizeof(char));
        if( aline == NULL )
        {
        fprintf(stderr,"malloc failed for aline\n");
        exit(1);
        }

        qline=(char *)malloc(MAX_STRING_SIZE*sizeof(char));
        if( qline == NULL )
        {
        fprintf(stderr,"malloc failed for qline\n");
        exit(1);
        }

        memset((char *)aline,'\0',MAX_STRING_SIZE*sizeof(char) );
        memset((char *)qline,'\0',MAX_STRING_SIZE*sizeof(char) );

        nreads=0;
        imin=1000;
        imax = -1;


        fgets( (char *)aline,MAX_STRING_SIZE,fasta);
        fgets( (char *)qline,MAX_STRING_SIZE,qual);
        while(strlen(aline) != 0)
        {
        if(strncmp(&aline[0],">",1)==0)
        {
        nreads++;

        fprintf(fastq,"@%s",aline+1);

        if(strcmp(aline,qline)!=0)
        {
        fprintf(stderr,"Read labels do not agree.\n%s\n%s\n",
        aline,qline);
        exit(1);
        }

        for(;
        {
        memset((char *)aline,'\0',MAX_STRING_SIZE*sizeof(char) );
        fgets( (char *)aline,MAX_STRING_SIZE,fasta);
        if(strncmp(&aline[0],">",1)==0) break;
        if(strlen(aline) == 0) break;
        fputs(aline,fastq);
        }
        fprintf(fastq,"+\n");


        for(;
        {
        memset((char *)qline,'\0',MAX_STRING_SIZE*sizeof(char) );
        fgets( (char *)qline,MAX_STRING_SIZE,qual);
        if(strncmp(&qline[0],">",1)==0) break;
        if(strlen(qline) == 0) break;
        i=0;
        while(qline[i]!='\0')
        {
        jj=atoi(&qline[i]);
        k = jj+33;
        fprintf(fastq,"%c",k);
        if(imin > jj) imin=jj;
        if(imax < jj) imax=jj;
        i++;

        while(isdigit(qline[i]))i++;
        while(isspace(qline[i]))i++;
        }
        fprintf(fastq,"\n");
        }
        }
        else
        {
        fprintf(stderr,"Something wrong \n%s\n%s\n",
        aline,qline);
        exit(1);
        }
        }

        fprintf(stderr,"%d reads min = %d max =%d\n",nreads,imin,imax);
        }

        Comment


        • #5
          I don't want to get into a language debate, but here's the same thing in perl:

          Code:
          #!/usr/bin/perl
          
          # AUTHOR: Joseph Fass
          # LAST REVISED: August 2009
          # 
          # The Bioinformatics Core at UC Davis Genome Center
          # http://bioinformatics.ucdavis.edu
          # Copyright (c) 2008 The Regents of University of California, Davis Campus.
          # All rights reserved.
          
          use strict;
          
          my $usage = "\nusage: $0 <reads (fasta format)> <qualities (fasta format)>\n\n".
                      "-- All sequences and qualities must be on a single line each. --.\n".
                      "-- All quality values MUST be followed by a space character. --\n\n".
                      "Merges fasta and qual files into a single Sanger fastq file (STDOUT).";
          
          my $seqfile = shift or die $usage;
          my $qualfile = shift or die $usage;
          
          open F, "<$seqfile" or die $usage;
          open Q, "<$qualfile" or die $usage;
          
          my ($fh, $fs, $qh, $qs);
          my @quals;  my $qv;
          
          while (<F>) {
          	$fh = $_;
          	$fs = <F>;
          	$qh = <Q>;
          	$qs = <Q>;
          	if ($fh eq $qh) {
          		$fh =~ s/>/@/;
          		print $fh.$fs;
          		$qh =~ s/>/+/;
          		print $qh;
          		@quals = split(/\s+/,$qs);
          		for (my $i=0; $i<=$#quals; $i++) {
          			$qv = $quals[$i] + 33;
          			print chr($qv);
          		}
          		print "\n";
          	}
          	else { print "PROBLEM\n" }
          }
          
          close F; close Q;
          A note on the above ... I tend to write all my code depending on having each sequence on one line ... even when I'm working with huge scaffolds or chromosomes ... can be hard to read, but generally doesn't cause problems unless the sequences are really huge ... and it makes coding a lot easier. So here's some perl for converting to the 'oneline' format:

          Code:
          #!/usr/bin/perl
          
          # AUTHOR: Joseph Fass
          # LAST REVISED: October 2009
          # 
          # The Bioinformatics Core at UC Davis Genome Center
          # http://bioinformatics.ucdavis.edu
          # Copyright (c) 2009 The Regents of University of California, Davis Campus.
          # All rights reserved.
          
          use strict;
          
          my $usage = "\nusage: cat <input fasta file> | $0 > <output fasta file>\n\n".
                      "For each sequence, puts nt/aa all on one line following header line.\n".
                      "Reads from and writes to STDOUT, so use redirection ... \n\n\n";
          
          my $block=""; my $line;
          
          while (<>) {
          	$line = $_;
          	if (m/^>/) {  # header line
          		print $block;  # which is either empty, or capped by a newline
          		$block = $line."\n";  # adds another newline, because it will be chomped below
          	}
          	else {
          		chomp $line;
          		if (substr($line,-1) =~ m/[0-9]/) {
          			chomp $block;  # remove newline before adding more sequence
          			$block .= $line." \n";  # must be fasta qualities, and needs a space
          		}
          		else {
          			chomp $block;  # remove newline before adding more sequence
          			$block .= $line."\n";  # not fasta qualities, so no space needed
          		}
          	}
          }
          print $block;  # print last sequence block

          Comment


          • #6
            I have the same problem.
            I got a segmentation error in indexing the hg19.fasta sequence of the human genome.
            then I changed machine (actually to one with much less memory) and it worked.. magic?
            then I tried to bwa align my reads and I get this typical pattern, independently of the computer, and indipendently of the number of reads I'm trying to align
            [bwa_aln] 17bp reads: max_diff = 2
            [bwa_aln] 38bp reads: max_diff = 3
            [bwa_aln] 64bp reads: max_diff = 4
            [bwa_aln] 93bp reads: max_diff = 5
            [bwa_aln] 124bp reads: max_diff = 6
            [bwa_aln] 157bp reads: max_diff = 7
            [bwa_aln] 190bp reads: max_diff = 8
            [bwa_aln] 225bp reads: max_diff = 9
            [bwa_aln_core] calculate SA coordinate... Segmentation fault

            If I align the same reads over the chromosome X only, it seems to work but then I get reads without coordinates (as if they were not mapped).
            Of course I know they map very well because I aligned them with MAQ and I got 90% of the reads mapping.

            what about the segmentation fault?

            I have the same problem.
            I got a segmentation error in indexing the hg19.fasta sequence of the human genome.
            then I changed machine (actually to one with much less memory) and it worked.. magic?
            then I tried to bwa align my reads and I get this typical pattern, independently of the computer, and indipendently of the number of reads I'm trying to align
            [bwa_aln] 17bp reads: max_diff = 2
            [bwa_aln] 38bp reads: max_diff = 3
            [bwa_aln] 64bp reads: max_diff = 4
            [bwa_aln] 93bp reads: max_diff = 5
            [bwa_aln] 124bp reads: max_diff = 6
            [bwa_aln] 157bp reads: max_diff = 7
            [bwa_aln] 190bp reads: max_diff = 8
            [bwa_aln] 225bp reads: max_diff = 9
            [bwa_aln_core] calculate SA coordinate... Segmentation fault

            If I align the same reads over the chromosome X only, it seems to work but then I get reads without coordinates (as if they were not mapped).
            Of course I know they map very well because I aligned them with MAQ and I got 90% of the reads mapping.

            what about the segmentation fault?
            Last edited by nilshomer; 04-15-2010, 10:05 AM.

            Comment


            • #7
              I also have the same segmentation fault when trying to align Illumina fastq reads to human genome.

              Anyone has an idea to solve this problem?

              Comment


              • #8
                I used the latest version of BWA 0.5.7 and got the segmentation fault when aligning Illumina reads. Even when I aligned only 10 reads, this segmentation fault appeared.
                Then I changed to BWA 0.5.6 and everything works just as it should!

                Some other question:
                I aligned my Illumina data both with Bowtie and with BWA. With Bowtie, I get a nice overview of the number of reads that were processed, the number of reads with at least 1 reported alignment, ...
                Is it possible to get similar results for BWA?

                Thanks,
                Lien

                Comment


                • #9
                  Originally posted by dpmccul View Post
                  I used sfffile to extract the data of interest to a .sff file, and then sffinfo to go to a *.fa and
                  *.qual file. I then wrote a c program to convert these to fastq (I know c better than perl),
                  There are much easier ways to do SFF to FASTQ, for example sff_extract:


                  As long as you are not working with paired end 454, you can also do this with Biopython 1.54b or later in just two lines of code (import the Bio.SeqIO module, call the Bio.SeqIO.convert function).

                  Comment


                  • #10
                    @Lien

                    Please use the version of bwa in SVN. You can get the statistics from samtools' flagstat. Or a simple awk command will do.

                    Comment


                    • #11
                      BWA for Roche454 data.

                      Hi there..
                      I am using BWA to perform sequence alignment for the reads generated from Roche454 data.
                      I used sfffile and sffinfo to MID sort and trim off mids from SFF files.
                      So, I have got a read file in fastq format, and my reference sequences in fasta format.
                      I used BWa to index my reference fasta file.
                      Then, aligning the read fastq file against my reference fasta file.
                      Now, I am getting a SAM format file with data in this format::
                      GGDP4G001AT8TK 0 gi|237757283|ref|NM_007294.3| 1331 197 57M1I50M1I13M1I5M1I4M1I15M1I9M1I23M1I12M1I8M1I5M1I9M1I24M1I16M25S * 0 0 ACTGAAGATGTTCCTTGGATAACACTAAATAGCAGCATTCAGAAAGTTAATGAGTGGTTTTTCCAGAAGTGATGAACTGTTAGGTTCTGATGACTCACATGATGGGGAGGTCTGAATCAAATAGCCAACGGTAGGCTGATGTATTGGACGGTTCTAAATAGNGGTAGATGAATATTCTGGTTCGTTCAGAGAAAATAAGACTTACTTGGCCGAGTGATCCTCCATGAGGCTTTATTATGTAAAAGTAGAAGAAGTTCACTCCAGTCATCGCTGCGCCCCTCGACGCAC IIIIIIIIIIIIIIIDHHHIGGHHHFFFDDDDFFFFFDDDFFFFFDDC><@A???4411333==BBBBDFFFFFFFFFA@@FFFFFFFFFFFFFFFFFFFFFF;;99933998<<899666:7=;;../..4/..:<==<<<<;;../4....//..6=49!<<??AAB::555<<?>>>>40111::<<<4455A5143446332155-.////3::;9322<7442631,,-,,-172....995<<<?>><44444444333:::::57---,,,,7,----175 AS:i:139 XS:i:0 XF:i:3 XE:i:2 XN:i:0
                      I am not clearly understanding this part.
                      I am using the default parameters for all.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM
                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      31 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      33 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      28 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      53 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X