Seqanswers Leaderboard Ad

**jnfass** · 03-08-2010, 01:15 PM

@Brad: hack back to color representiation to use solid2fastq.pl

Hi Brad,

You're right about the file formats that solid2fastq is expecting ... there's a <prefix>_F3.csfasta and a <prefix>_F3_QV.qual file, and then a stats file (not needed by the script). But your fastq is still in numerical representation of colorspace, whereas the fastq that results from solid2fastq.pl is really still in colorspace, but the colors are represented (for convenience -- probably for the data structures in bwa?) as [ATCG].

So I'd recommend splitting/converting your fastq back into the csfasta and QV.qual files, and then using solid2fastq. For reference, I'll paste in the heads of some csfasta and qual files I have (below). However, I haven't yet dealt with PE SOLiD reads ... so I don't know how that's going to figure in your hacking. I don't see any "R3" references in your fastq example, which I thought was the reverse read nomenclature ... so I can't advise any further on that.

head solid0180sequencer_20091218_AS75_1_AS75_F3.csfasta

Code:

# Thu Dec 24 14:34:04 2009 /share/apps/corona/bin/filter_fasta.pl --output=/data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/results.F1B1/primary.20091224210709876 --name=solid0180sequencer_20091218_AS75_1_AS75 --tag=F3 --minlength=50 --prefix=T /data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/jobs/postPrimerSetPrimary.206/rawseq 
# Cwd: /home/pipeline
# Title: solid0180sequencer_20091218_AS75_1_AS75
>2_14_29_F3
T22301223212203112133123302220112001120102111112031
>2_14_37_F3
T00031010222021220101011121221111121121112121222212
>2_14_79_F3
T01231023120123320111231011110221102001010122213211
>2_14_97_F3

head solid0180sequencer_20091218_AS75_1_AS75_F3_QV.qual

Code:

# Thu Dec 24 14:34:04 2009 /share/apps/corona/bin/filter_fasta.pl --output=/data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/results.F1B1/primary.20091224210709876 --name=solid0180sequencer_20091218_AS75_1_AS75 --tag=F3 --minlength=50 --prefix=T /data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/jobs/postPrimerSetPrimary.206/rawseq 
# Cwd: /home/pipeline
# Title: solid0180sequencer_20091218_AS75_1_AS75
>2_14_29_F3
4 24 8 12 24 4 26 4 11 4 4 6 4 4 4 4 4 4 4 4 4 8 4 4 4 4 4 0 0 4 4 4 0 0 4 4 4 0 4 0 4 4 0 0 4 4 4 0 0 4 
>2_14_37_F3
6 13 6 7 6 7 20 4 4 4 13 18 4 4 4 12 9 4 4 4 4 20 4 4 4 5 11 0 4 4 4 5 4 4 4 4 5 0 0 4 4 4 0 0 4 4 4 0 0 0 
>2_14_79_F3
8 25 7 6 9 8 30 6 4 5 4 21 4 4 4 7 20 4 4 4 4 21 4 0 4 4 17 4 0 4 4 17 4 0 4 4 11 0 0 4 0 15 0 0 4 4 6 0 0 4 
>2_14_97_F3

hope that helps ...
~Joe

**bgulko** · 04-06-2010, 11:49 AM

Problem Solved - For Now

I was able to obtain the original .csfasta and .qual files from the scientists that generated them and convert them using solid2fastq.pl from BWA v 0.5.7. Along the way I had to fix 3 minor bugs in the script. I'm surer there are better ways to do this (I am nto a perl programmer) but I have included the altered script in case it is of use to anyone else.

@Joe, Thanks for the examples and advice!

-- Brad

Code:

#!/usr/bin/perl -w

# Author: lh3
# Note: Ideally, this script should be written in C. It is a bit slow at present.
# Also note that this script is different from the one contained in MAQ.
# Three bugs fixed: bgulko 2010-04-06
#   1-replace -1 with 0 BEFORE trimming first quality value per line
#   2-ignore comment lines that start with #
#   3-ignore comments in name fields, ie characters after the "," in XXXX_YYYY_ZZZZ_F3,......

use strict;
use warnings;
use Getopt::Std;

my %opts;
my $version = '0.1.3';
my $usage = qq{
Usage: solid2fastq.pl <in.title> <out.prefix>

Note: <in.title> is the string showed in the `# Title:' line of a
      ".csfasta" read file. Then <in.title>F3.csfasta is read sequence
      file and <in.title>F3_QV.qual is the quality file. If
      <in.title>R3.csfasta is present, this script assumes reads are
      paired; otherwise reads will be regarded as single-end.

      The read name will be <out.prefix>:panel_x_y/[12] with `1' for R3
      tag and `2' for F3. Usually you may want to use short <out.prefix>
      to save diskspace. Long <out.prefix> also causes troubles to maq.

};

getopts('', \%opts);
die($usage) if (@ARGV != 2);
my ($title, $pre) = @ARGV;
my (@fhr, @fhw);
my @fn_suff = ('F3.csfasta', 'F3_QV.qual', 'R3.csfasta', 'R3_QV.qual');
my $is_paired = (-f "$title$fn_suff[2]" || -f "$title$fn_suff[2].gz")? 1 : 0;
if ($is_paired) { # paired end
  for (0 .. 3) {
	my $fn = "$title$fn_suff[$_]";
	$fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
	open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
  }
  open($fhw[0], "|gzip >$pre.read1.fastq.gz") || die; # this is NOT a typo
  open($fhw[1], "|gzip >$pre.read2.fastq.gz") || die;
  open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
  my (@df, @dr);
  @df = &read1(1); @dr = &read1(2);
  while (@df && @dr) {
	if ($df[0] eq $dr[0]) { # mate pair
	  print {$fhw[0]} $df[1]; print {$fhw[1]} $dr[1];
	  @df = &read1(1); @dr = &read1(2);
	} else {
	  if ($df[0] le $dr[0]) {
		print {$fhw[2]} $df[1];
		@df = &read1(1);
	  } else {
		print {$fhw[2]} $dr[1];
		@dr = &read1(2);
	  }
	}
  }
  if (@df) {
	print {$fhw[2]} $df[1];
	while (@df = &read1(1, $fhr[0], $fhr[1])) {
	  print {$fhw[2]} $df[1];
	}
  }
  if (@dr) {
	print {$fhw[2]} $dr[1];
	while (@dr = &read1(2, $fhr[2], $fhr[3])) {
	  print {$fhw[2]} $dr[1];
	}
  }
  close($fhr[$_]) for (0 .. $#fhr);
  close($fhw[$_]) for (0 .. $#fhw);
} else { # single end
  for (0 .. 1) {
	my $fn = "$title$fn_suff[$_]";
	$fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
	open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
  }
  open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
  my @df;
  while (@df = &read1(1, $fhr[0], $fhr[1])) {
	print {$fhw[2]} $df[1];
  }
  close($fhr[$_]) for (0 .. $#fhr);
  close($fhw[2]);
}

sub read1 {
  my $i = shift(@_);
  my $j = ($i-1)<<1;
  my ($key, $seq);
  my ($fhs, $fhq) = ($fhr[$j], $fhr[$j|1]);
  my ($a, $b, $c, $key2);
  while (<$fhs>) {
	while (substr( $_,0,1) eq '#') { <$fhs>; };
	my $t = <$fhq>;
	while (substr( $t,0,1) eq '#') { $t = <$fhq>; }
	if (/^>(\d+)_(\d+)_(\d+)_[FR]3/) {
	  $key = sprintf("%.4d_%.4d_%.4d", $1, $2, $3); # this line could be improved on 64-bit machines
          ( $a, $b, $c ) = split( /_/, substr($t,1) );
	  $key2 = sprintf("%.4d_%.4d_%.4d", $a, $b, $c); # this line could be improved on 64-bit machines
	  die(qq/** unmatched read name: '$_' != '$t' , '$key', '$key2'\n/) unless ($key eq $key2);
	  my $name = "$pre:$1_$2_$3/$i";
	  $_ = substr(<$fhs>, 2);
	  tr/0123./ACGTN/;
	  my $s = $_;
	  $_ = <$fhq>;
	  s/-1\b/0/eg;
	  s/^(\d+)\s*//;
	  s/(\d+)\s*/chr($1+33)/eg;
	  $seq = qq/\@$name\n$s+\n$_\n/;
	  last;
	}
  }
  return defined($seq)? ($key, $seq) : ();
}

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, 07-25-2024, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin 07-25-2024, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 26 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 160 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Latest Articles

ad_right_rmr

News