Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • bwa of colorspace: obviously I am doing something stupid

    I'm trying to run bwa (0.5.8a, the last version on SourceForge) on a 64-bit box running Enterprise Oracle Linux with colorspace data (using this guide) & am getting an obviously bogus result. These reads are known to align with Bowtie; here is one

    Code:
    @SRR040290.24278 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1232_543_1118 length=50
    T32310003110000202001010330012330302002022323210221
    +SRR040290.24278 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1232_543_1118 length=50
    !:8>>;=6=<9<9>>799;>::9:7<<=<<<7<:=8;7;:;/,81<)9;:&
    First I indexed hg18 in colorspace
    Code:
    bwa index -a bwtsw -c hg18.fasta > build-cs.sh.out
    Which produced STDERR output which seemed to indicate success
    Code:
    [bwa_index] Pack nucleotide FASTA... 41.87 sec
    [bwa_index] Convert nucleotide PAC to color PAC... 16.71 sec
    [bwa_index] Reverse the packed sequence... 11.21 sec
    [bwa_index] Construct BWT for the packed sequence...
    [bwa_index] 1675.10 seconds elapse.
    [bwa_index] Construct BWT for the reverse packed sequence...
    [bwa_index] 1658.72 seconds elapse.
    [bwa_index] Update BWT... 12.65 sec
    [bwa_index] Update reverse BWT... 11.94 sec
    [bwa_index] Construct SA from BWT and Occ... 616.15 sec
    [bwa_index] Construct SA from reverse BWT and Occ... 618.90 sec
    Then I have this code
    Code:
    bwa aln -c -t 8 ./hg18.fasta test-reads.fastq > test-reads.sai
    bwa samse ./hg18.fasta test-reads.sai test-reads.fastq > test-reads.sam
    with the usual sort of STDERR output
    Code:
    [bwa_aln] 17bp reads: max_diff = 2
    [bwa_aln] 38bp reads: max_diff = 3
    [bwa_aln] 64bp reads: max_diff = 4
    [bwa_aln] 93bp reads: max_diff = 5
    [bwa_aln] 124bp reads: max_diff = 6
    [bwa_aln] 157bp reads: max_diff = 7
    [bwa_aln] 190bp reads: max_diff = 8
    [bwa_aln] 225bp reads: max_diff = 9
    [bwa_aln_core] calculate SA coordinate... 0.00 sec
    [bwa_aln_core] write to the disk... 0.00 sec
    [bwa_aln_core] 482 sequences have been processed.
    [bwa_aln_core] convert to sequence coordinate... 3.14 sec
    [bwa_aln_core] refine gapped alignments... 1.45 sec
    [bwa_aln_core] print alignments... 0.00 sec
    [bwa_aln_core] 482 sequences have been processed.
    with the rather peculiar SAM output of everything looking like the below (i.e. nothing aligned)
    Code:
    SRR040290.24278 4       *       0       0       *       *       0       0       TNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN     !:8>>;=6=<9<9>>799;>::9:7<<=<<<7<:=8;7;:;/,81<)9;:&
    SRR040290.113515        4       *       0       0       *       *       0       0       TNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN     !5;2.25:15/2//77*:.48:9466195.<7949.:9)4)/-,1*1/4,4
    SRR040290.129188        4       *       0       0       *       *       0       0       TNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN     !796878778;:<:8:5::95;<331:=397A;7;998)4:<;/,936&9.
    SRR040290.182661        4       *       0       0       *       *       0       0       TNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN     !:==?A><8>>?<9>>>/>:<?/)?7>3:<<<)3=<>/;::<,<:31<282
    SRR040290.188969        4       *       0       0       *       *       0       0       TNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN     !:<:<:?:>>9<<;<<(:57=2@<<<689>>8<<>=9:):8*4)7</<8&,
    Any clue what I did wrong?

  • #2
    Hi Krobison,

    I had the exact same problem and the problem is that you have to double-encode your colorspace reads before mapping them with BWA. NovoalignCS, Bowtie and BFAST will take these as CSFASTQ as-is.
    For BWA I ran a preprocessing step with a perl script to do the job then mapped the reads. Your example read with double encoding would look like this:

    Code:
    @SRR040290.24278 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1232_543_1118 length=50
    TGTCAAATCCAAAAGAGAACACATTAACGTTATAGAAGAGGTGTGCAGGC
    +SRR040290.24278 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1232_543_1118 length=50
    :8>>;=6=<9<9>>799;>::9:7<<=<<<7<:=8;7;:;/,81<)9;:&

    My code is below and only works for colorspace fastq. Just copy and paste if you would like to use it (no warranty ofcourse):

    Code:
    #!/usr/bin/perl -w
    my $f=shift or die "No input";
    if ($f =~ /.gz/) {
            open(IN,"zcat $f|") or die "$!";
    }else {
            open(IN,$f) or die "$!";
    }
    while(<IN>) {
            my $h=$_;
            my $s=<IN>;
            my $c=<IN>;
            my $d=<IN>;
            my $seq= substr($s,1,length($s));
            my $qual= substr($d,1,length($d));
            $seq=~tr/[0123.]/[ACGTN]/;
            
            print $h;
            print $seq;
            print $c;
            print $qual;
    
    }
    close IN;
    Last edited by zee; 09-19-2010, 10:03 PM.

    Comment


    • #3
      Thank you for the rapid & informative reply!

      I must say "UGH!" though about rewriting colorspace so it masquerades as basespace. It will be too easy to get confused with that.

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Strategies for Sequencing Challenging Samples
        by seqadmin


        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
        03-22-2024, 06:39 AM
      • seqadmin
        Techniques and Challenges in Conservation Genomics
        by seqadmin



        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

        Avian Conservation
        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
        03-08-2024, 10:41 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, Yesterday, 06:37 PM
      0 responses
      10 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, Yesterday, 06:07 PM
      0 responses
      9 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 03-22-2024, 10:03 AM
      0 responses
      49 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 03-21-2024, 07:32 AM
      0 responses
      67 views
      0 likes
      Last Post seqadmin  
      Working...
      X