Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Question on the speed of bwa aln

    Hi,

    I am trying to use bwa to align ~800 million 100bp PE reads from a whole genome dataset generated on the Illumina platform. These are "anomaly" reads which are either unmapped reads, or singletons or discordant pairs by another aligner.

    I have the reads in read1.fastq and read2.fastq and called "bwa aln" the standard way:

    bwa aln hg19.fa read1.fastq > 1.sai
    bwa aln hg19.fa read2.fastq > 2.sai

    The problem is that when I checked the progress, it processed every 250K reads in ~10 minutes. This is much, much slower than I expected - at this speed it will take 20+ days for me to get the .sai files.

    I had some experience with BWA before on an whole exome dataset and it seemed to be much faster. So I don't know whether this is due to the nature of these reads I started with (the fact that majority of them have failed to be aligned by another aligner or aligned discordantly), or I have made any simple mistake.

    Any comment or suggestion will be high appreciated. Thanks!

  • #2
    Did you forget the -t parameter to increase the number of threads ?

    I use a rough perl script to run BWA, here it is.

    Code:
    # A wrapper script for bwa - for PE data
    # version 0.2 
    
    
    use strict;
    my($version) = 0.2;
    
    
    #Declar vars
    my($mapProcessorInfile) = "";
    my($mapProcessorOutfile) = "";
    my($runSpeciesLevelAnalysis) = 1;
    my($i) = 0;
    #args
    my($infile1) = "";
    my($infile2) = "";
    my($infileprefix) = "";
    my($fastqInput) = 0;
    my($reference) = "/home/col/sequences/genome_data/human_build37_ucsc/human_build37_ucsc.fa";
    my($runTypeArg) = 'species';
    my($version) = '0.2';
    
    open (LOGFILE,   ">>", "bwa.log")  || die "Couldn't open file 'bwa2.log'";
    
    print "Reading arguments\n";
    
    if ($ARGV[1] eq "") {
    	die "########\nError: Not enough arguments entered.
    	Please enter a file containing reads as an argument:
    	perl runbwa_PE.pl -q1 reads1.fastq -q2 reads2.fastq
    	
    # Required arguments 
    
    -q1 reads.fastq : the input read file 1 in multi fastq format 
    -q2 reads.fastq : the input read file 2 in multi fastq format 
    
    # Optional arguments
    -r all_viral : the reference name in the subdirectory indexes (options: any! default: allgenomes_2009) 
    "
    };
    
    #Assign command line 
    assignArgs();
    
    print "Settings\n";
    print "Reads file 1: ".$infile1."\n";
    print "Reads file 2: ".$infile2."\n";
    print "Reference: ".$reference."\n";
    print "------- \n";
    
    print LOGFILE "Reads file: ".$infile1."\n";
    print LOGFILE "Reads file: ".$infile2."\n";
    print LOGFILE "Reference: ".$reference."\n";
    print LOGFILE "------- \n";
    print  LOGFILE ("\nRunning Bwa PE - version $version"."\n");
    
    # Time
    my(@months) = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
    my(@weekDays) = qw(Sun Mon Tue Wed Thu Fri Sat Sun);
    (my($second), my($minute), my($hour), my($dayOfMonth), my($month), my($yearOffset), my($dayOfWeek), my($dayOfYear), my($daylightSavings)) = localtime();
    my($year) = 1900 + $yearOffset;
    my($theTime) = "$hour:$minute:$second, $weekDays[$dayOfWeek] $months[$month] $dayOfMonth, $year";
    print $theTime;
    print LOGFILE $theTime;
    
    
    #### create fasta index ####
    
    #bwa index -a bwtsw -p pao1 NC_002516.fna 
    
    $infileprefix = $infile1;
    $infileprefix =~ s/.fastq//;
    $infileprefix =~ s/.fas//;
    $infileprefix =~ s/.fq//;
    $infileprefix =~ s/.fa//;
    $infileprefix =~ s/.fna//;
    
    #### Alignment ####
    
    `bwa aln -t 15 $reference $infile1 > $infileprefix-sa1.sai`;
    `bwa aln -t 15 $reference $infile2 > $infileprefix-sa2.sai`;
    
    #### align, print aligns to SAM format ###
    
    #bwa samse indexes/allgenomes_2009_bwa humanstoolhealthy_sa.sai HumanStoolHealthy_SRX000431_filt.fas > x.sam
    
    `bwa sampe $reference $infileprefix-sa1.sai $infileprefix-sa2.sai $infile1 $infile2 > $infileprefix-bwa.sam`;
    
    
    
    
    print "Job completed\n";
    print LOGFILE "\nJob completed\n";
    
    
    # Time
    my(@months) = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
    my(@weekDays) = qw(Sun Mon Tue Wed Thu Fri Sat Sun);
    (my($second), my($minute), my($hour), my($dayOfMonth), my($month), my($yearOffset), my($dayOfWeek), my($dayOfYear), my($daylightSavings)) = localtime();
    my($year) = 1900 + $yearOffset;
    my($theTime) = "$hour:$minute:$second, $weekDays[$dayOfWeek] $months[$month] $dayOfMonth, $year";
    print $theTime."\n";
    print LOGFILE $theTime."\n";
    
    
    sub assignArgs {
    
    
    
    while ($i<9) {
    	#Assign all arguments
    	if ($ARGV[$i] eq "-f") {
    		$infile1 = $ARGV[$i+1];
    	}
    	if ($ARGV[$i] eq "-q1") {
    		$infile1 = $ARGV[$i+1];
    		$fastqInput = 1;
    	}
    	if ($ARGV[$i] eq "-q2") {
    		$infile2 = $ARGV[$i+1];
    		$fastqInput = 1;
    	}
    	if ($ARGV[$i] eq "-r") {
    		$reference = $ARGV[$i+1];
    	}
    
    $i++;
    }
    
    }

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Essential Discoveries and Tools in Epitranscriptomics
      by seqadmin




      The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
      04-22-2024, 07:01 AM
    • seqadmin
      Current Approaches to Protein Sequencing
      by seqadmin


      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
      04-04-2024, 04:25 PM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, Today, 11:49 AM
    0 responses
    10 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, Yesterday, 08:47 AM
    0 responses
    16 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-11-2024, 12:08 PM
    0 responses
    61 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 04-10-2024, 10:19 PM
    0 responses
    60 views
    0 likes
    Last Post seqadmin  
    Working...
    X