Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    @Brad: hack back to color representiation to use solid2fastq.pl

    Hi Brad,

    You're right about the file formats that solid2fastq is expecting ... there's a <prefix>_F3.csfasta and a <prefix>_F3_QV.qual file, and then a stats file (not needed by the script). But your fastq is still in numerical representation of colorspace, whereas the fastq that results from solid2fastq.pl is really still in colorspace, but the colors are represented (for convenience -- probably for the data structures in bwa?) as [ATCG].

    So I'd recommend splitting/converting your fastq back into the csfasta and QV.qual files, and then using solid2fastq. For reference, I'll paste in the heads of some csfasta and qual files I have (below). However, I haven't yet dealt with PE SOLiD reads ... so I don't know how that's going to figure in your hacking. I don't see any "R3" references in your fastq example, which I thought was the reverse read nomenclature ... so I can't advise any further on that.

    head solid0180sequencer_20091218_AS75_1_AS75_F3.csfasta
    Code:
    # Thu Dec 24 14:34:04 2009 /share/apps/corona/bin/filter_fasta.pl --output=/data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/results.F1B1/primary.20091224210709876 --name=solid0180sequencer_20091218_AS75_1_AS75 --tag=F3 --minlength=50 --prefix=T /data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/jobs/postPrimerSetPrimary.206/rawseq 
    # Cwd: /home/pipeline
    # Title: solid0180sequencer_20091218_AS75_1_AS75
    >2_14_29_F3
    T22301223212203112133123302220112001120102111112031
    >2_14_37_F3
    T00031010222021220101011121221111121121112121222212
    >2_14_79_F3
    T01231023120123320111231011110221102001010122213211
    >2_14_97_F3
    head solid0180sequencer_20091218_AS75_1_AS75_F3_QV.qual
    Code:
    # Thu Dec 24 14:34:04 2009 /share/apps/corona/bin/filter_fasta.pl --output=/data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/results.F1B1/primary.20091224210709876 --name=solid0180sequencer_20091218_AS75_1_AS75 --tag=F3 --minlength=50 --prefix=T /data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/jobs/postPrimerSetPrimary.206/rawseq 
    # Cwd: /home/pipeline
    # Title: solid0180sequencer_20091218_AS75_1_AS75
    >2_14_29_F3
    4 24 8 12 24 4 26 4 11 4 4 6 4 4 4 4 4 4 4 4 4 8 4 4 4 4 4 0 0 4 4 4 0 0 4 4 4 0 4 0 4 4 0 0 4 4 4 0 0 4 
    >2_14_37_F3
    6 13 6 7 6 7 20 4 4 4 13 18 4 4 4 12 9 4 4 4 4 20 4 4 4 5 11 0 4 4 4 5 4 4 4 4 5 0 0 4 4 4 0 0 4 4 4 0 0 0 
    >2_14_79_F3
    8 25 7 6 9 8 30 6 4 5 4 21 4 4 4 7 20 4 4 4 4 21 4 0 4 4 17 4 0 4 4 17 4 0 4 4 11 0 0 4 0 15 0 0 4 4 6 0 0 4 
    >2_14_97_F3
    hope that helps ...
    ~Joe

    Comment


    • #17
      Problem Solved - For Now

      I was able to obtain the original .csfasta and .qual files from the scientists that generated them and convert them using solid2fastq.pl from BWA v 0.5.7. Along the way I had to fix 3 minor bugs in the script. I'm surer there are better ways to do this (I am nto a perl programmer) but I have included the altered script in case it is of use to anyone else.

      @Joe, Thanks for the examples and advice!

      -- Brad

      Code:
      #!/usr/bin/perl -w
      
      # Author: lh3
      # Note: Ideally, this script should be written in C. It is a bit slow at present.
      # Also note that this script is different from the one contained in MAQ.
      # Three bugs fixed: bgulko 2010-04-06
      #   1-replace -1 with 0 BEFORE trimming first quality value per line
      #   2-ignore comment lines that start with #
      #   3-ignore comments in name fields, ie characters after the "," in XXXX_YYYY_ZZZZ_F3,......
      
      use strict;
      use warnings;
      use Getopt::Std;
      
      my %opts;
      my $version = '0.1.3';
      my $usage = qq{
      Usage: solid2fastq.pl <in.title> <out.prefix>
      
      Note: <in.title> is the string showed in the `# Title:' line of a
            ".csfasta" read file. Then <in.title>F3.csfasta is read sequence
            file and <in.title>F3_QV.qual is the quality file. If
            <in.title>R3.csfasta is present, this script assumes reads are
            paired; otherwise reads will be regarded as single-end.
      
            The read name will be <out.prefix>:panel_x_y/[12] with `1' for R3
            tag and `2' for F3. Usually you may want to use short <out.prefix>
            to save diskspace. Long <out.prefix> also causes troubles to maq.
      
      };
      
      getopts('', \%opts);
      die($usage) if (@ARGV != 2);
      my ($title, $pre) = @ARGV;
      my (@fhr, @fhw);
      my @fn_suff = ('F3.csfasta', 'F3_QV.qual', 'R3.csfasta', 'R3_QV.qual');
      my $is_paired = (-f "$title$fn_suff[2]" || -f "$title$fn_suff[2].gz")? 1 : 0;
      if ($is_paired) { # paired end
        for (0 .. 3) {
      	my $fn = "$title$fn_suff[$_]";
      	$fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
      	open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
        }
        open($fhw[0], "|gzip >$pre.read1.fastq.gz") || die; # this is NOT a typo
        open($fhw[1], "|gzip >$pre.read2.fastq.gz") || die;
        open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
        my (@df, @dr);
        @df = &read1(1); @dr = &read1(2);
        while (@df && @dr) {
      	if ($df[0] eq $dr[0]) { # mate pair
      	  print {$fhw[0]} $df[1]; print {$fhw[1]} $dr[1];
      	  @df = &read1(1); @dr = &read1(2);
      	} else {
      	  if ($df[0] le $dr[0]) {
      		print {$fhw[2]} $df[1];
      		@df = &read1(1);
      	  } else {
      		print {$fhw[2]} $dr[1];
      		@dr = &read1(2);
      	  }
      	}
        }
        if (@df) {
      	print {$fhw[2]} $df[1];
      	while (@df = &read1(1, $fhr[0], $fhr[1])) {
      	  print {$fhw[2]} $df[1];
      	}
        }
        if (@dr) {
      	print {$fhw[2]} $dr[1];
      	while (@dr = &read1(2, $fhr[2], $fhr[3])) {
      	  print {$fhw[2]} $dr[1];
      	}
        }
        close($fhr[$_]) for (0 .. $#fhr);
        close($fhw[$_]) for (0 .. $#fhw);
      } else { # single end
        for (0 .. 1) {
      	my $fn = "$title$fn_suff[$_]";
      	$fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
      	open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
        }
        open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
        my @df;
        while (@df = &read1(1, $fhr[0], $fhr[1])) {
      	print {$fhw[2]} $df[1];
        }
        close($fhr[$_]) for (0 .. $#fhr);
        close($fhw[2]);
      }
      
      sub read1 {
        my $i = shift(@_);
        my $j = ($i-1)<<1;
        my ($key, $seq);
        my ($fhs, $fhq) = ($fhr[$j], $fhr[$j|1]);
        my ($a, $b, $c, $key2);
        while (<$fhs>) {
      	while (substr( $_,0,1) eq '#') { <$fhs>; };
      	my $t = <$fhq>;
      	while (substr( $t,0,1) eq '#') { $t = <$fhq>; }
      	if (/^>(\d+)_(\d+)_(\d+)_[FR]3/) {
      	  $key = sprintf("%.4d_%.4d_%.4d", $1, $2, $3); # this line could be improved on 64-bit machines
                ( $a, $b, $c ) = split( /_/, substr($t,1) );
      	  $key2 = sprintf("%.4d_%.4d_%.4d", $a, $b, $c); # this line could be improved on 64-bit machines
      	  die(qq/** unmatched read name: '$_' != '$t' , '$key', '$key2'\n/) unless ($key eq $key2);
      	  my $name = "$pre:$1_$2_$3/$i";
      	  $_ = substr(<$fhs>, 2);
      	  tr/0123./ACGTN/;
      	  my $s = $_;
      	  $_ = <$fhq>;
      	  s/-1\b/0/eg;
      	  s/^(\d+)\s*//;
      	  s/(\d+)\s*/chr($1+33)/eg;
      	  $seq = qq/\@$name\n$s+\n$_\n/;
      	  last;
      	}
        }
        return defined($seq)? ($key, $seq) : ();
      }

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Best Practices for Single-Cell Sequencing Analysis
        by seqadmin



        While isolating and preparing single cells for sequencing was historically the bottleneck, recent technological advancements have shifted the challenge to data analysis. This highlights the rapidly evolving nature of single-cell sequencing. The inherent complexity of single-cell analysis has intensified with the surge in data volume and the incorporation of diverse and more complex datasets. This article explores the challenges in analysis, examines common pitfalls, offers...
        06-06-2024, 07:15 AM
      • seqadmin
        Latest Developments in Precision Medicine
        by seqadmin



        Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

        Somatic Genomics
        “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
        05-24-2024, 01:16 PM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, Yesterday, 07:23 AM
      0 responses
      9 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 06-17-2024, 06:54 AM
      0 responses
      12 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 06-14-2024, 07:24 AM
      0 responses
      24 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 06-13-2024, 08:58 AM
      0 responses
      18 views
      0 likes
      Last Post seqadmin  
      Working...
      X