Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    @Brad: hack back to color representiation to use solid2fastq.pl

    Hi Brad,

    You're right about the file formats that solid2fastq is expecting ... there's a <prefix>_F3.csfasta and a <prefix>_F3_QV.qual file, and then a stats file (not needed by the script). But your fastq is still in numerical representation of colorspace, whereas the fastq that results from solid2fastq.pl is really still in colorspace, but the colors are represented (for convenience -- probably for the data structures in bwa?) as [ATCG].

    So I'd recommend splitting/converting your fastq back into the csfasta and QV.qual files, and then using solid2fastq. For reference, I'll paste in the heads of some csfasta and qual files I have (below). However, I haven't yet dealt with PE SOLiD reads ... so I don't know how that's going to figure in your hacking. I don't see any "R3" references in your fastq example, which I thought was the reverse read nomenclature ... so I can't advise any further on that.

    head solid0180sequencer_20091218_AS75_1_AS75_F3.csfasta
    Code:
    # Thu Dec 24 14:34:04 2009 /share/apps/corona/bin/filter_fasta.pl --output=/data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/results.F1B1/primary.20091224210709876 --name=solid0180sequencer_20091218_AS75_1_AS75 --tag=F3 --minlength=50 --prefix=T /data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/jobs/postPrimerSetPrimary.206/rawseq 
    # Cwd: /home/pipeline
    # Title: solid0180sequencer_20091218_AS75_1_AS75
    >2_14_29_F3
    T22301223212203112133123302220112001120102111112031
    >2_14_37_F3
    T00031010222021220101011121221111121121112121222212
    >2_14_79_F3
    T01231023120123320111231011110221102001010122213211
    >2_14_97_F3
    head solid0180sequencer_20091218_AS75_1_AS75_F3_QV.qual
    Code:
    # Thu Dec 24 14:34:04 2009 /share/apps/corona/bin/filter_fasta.pl --output=/data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/results.F1B1/primary.20091224210709876 --name=solid0180sequencer_20091218_AS75_1_AS75 --tag=F3 --minlength=50 --prefix=T /data/results/solid0180sequencer/solid0180sequencer_20091218_AS75_1/AS75/jobs/postPrimerSetPrimary.206/rawseq 
    # Cwd: /home/pipeline
    # Title: solid0180sequencer_20091218_AS75_1_AS75
    >2_14_29_F3
    4 24 8 12 24 4 26 4 11 4 4 6 4 4 4 4 4 4 4 4 4 8 4 4 4 4 4 0 0 4 4 4 0 0 4 4 4 0 4 0 4 4 0 0 4 4 4 0 0 4 
    >2_14_37_F3
    6 13 6 7 6 7 20 4 4 4 13 18 4 4 4 12 9 4 4 4 4 20 4 4 4 5 11 0 4 4 4 5 4 4 4 4 5 0 0 4 4 4 0 0 4 4 4 0 0 0 
    >2_14_79_F3
    8 25 7 6 9 8 30 6 4 5 4 21 4 4 4 7 20 4 4 4 4 21 4 0 4 4 17 4 0 4 4 17 4 0 4 4 11 0 0 4 0 15 0 0 4 4 6 0 0 4 
    >2_14_97_F3
    hope that helps ...
    ~Joe

    Comment


    • #17
      Problem Solved - For Now

      I was able to obtain the original .csfasta and .qual files from the scientists that generated them and convert them using solid2fastq.pl from BWA v 0.5.7. Along the way I had to fix 3 minor bugs in the script. I'm surer there are better ways to do this (I am nto a perl programmer) but I have included the altered script in case it is of use to anyone else.

      @Joe, Thanks for the examples and advice!

      -- Brad

      Code:
      #!/usr/bin/perl -w
      
      # Author: lh3
      # Note: Ideally, this script should be written in C. It is a bit slow at present.
      # Also note that this script is different from the one contained in MAQ.
      # Three bugs fixed: bgulko 2010-04-06
      #   1-replace -1 with 0 BEFORE trimming first quality value per line
      #   2-ignore comment lines that start with #
      #   3-ignore comments in name fields, ie characters after the "," in XXXX_YYYY_ZZZZ_F3,......
      
      use strict;
      use warnings;
      use Getopt::Std;
      
      my %opts;
      my $version = '0.1.3';
      my $usage = qq{
      Usage: solid2fastq.pl <in.title> <out.prefix>
      
      Note: <in.title> is the string showed in the `# Title:' line of a
            ".csfasta" read file. Then <in.title>F3.csfasta is read sequence
            file and <in.title>F3_QV.qual is the quality file. If
            <in.title>R3.csfasta is present, this script assumes reads are
            paired; otherwise reads will be regarded as single-end.
      
            The read name will be <out.prefix>:panel_x_y/[12] with `1' for R3
            tag and `2' for F3. Usually you may want to use short <out.prefix>
            to save diskspace. Long <out.prefix> also causes troubles to maq.
      
      };
      
      getopts('', \%opts);
      die($usage) if (@ARGV != 2);
      my ($title, $pre) = @ARGV;
      my (@fhr, @fhw);
      my @fn_suff = ('F3.csfasta', 'F3_QV.qual', 'R3.csfasta', 'R3_QV.qual');
      my $is_paired = (-f "$title$fn_suff[2]" || -f "$title$fn_suff[2].gz")? 1 : 0;
      if ($is_paired) { # paired end
        for (0 .. 3) {
      	my $fn = "$title$fn_suff[$_]";
      	$fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
      	open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
        }
        open($fhw[0], "|gzip >$pre.read1.fastq.gz") || die; # this is NOT a typo
        open($fhw[1], "|gzip >$pre.read2.fastq.gz") || die;
        open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
        my (@df, @dr);
        @df = &read1(1); @dr = &read1(2);
        while (@df && @dr) {
      	if ($df[0] eq $dr[0]) { # mate pair
      	  print {$fhw[0]} $df[1]; print {$fhw[1]} $dr[1];
      	  @df = &read1(1); @dr = &read1(2);
      	} else {
      	  if ($df[0] le $dr[0]) {
      		print {$fhw[2]} $df[1];
      		@df = &read1(1);
      	  } else {
      		print {$fhw[2]} $dr[1];
      		@dr = &read1(2);
      	  }
      	}
        }
        if (@df) {
      	print {$fhw[2]} $df[1];
      	while (@df = &read1(1, $fhr[0], $fhr[1])) {
      	  print {$fhw[2]} $df[1];
      	}
        }
        if (@dr) {
      	print {$fhw[2]} $dr[1];
      	while (@dr = &read1(2, $fhr[2], $fhr[3])) {
      	  print {$fhw[2]} $dr[1];
      	}
        }
        close($fhr[$_]) for (0 .. $#fhr);
        close($fhw[$_]) for (0 .. $#fhw);
      } else { # single end
        for (0 .. 1) {
      	my $fn = "$title$fn_suff[$_]";
      	$fn = "gzip -dc $fn.gz |" if (!-f $fn && -f "$fn.gz");
      	open($fhr[$_], $fn) || die("** Fail to open '$fn'.\n");
        }
        open($fhw[2], "|gzip >$pre.single.fastq.gz") || die;
        my @df;
        while (@df = &read1(1, $fhr[0], $fhr[1])) {
      	print {$fhw[2]} $df[1];
        }
        close($fhr[$_]) for (0 .. $#fhr);
        close($fhw[2]);
      }
      
      sub read1 {
        my $i = shift(@_);
        my $j = ($i-1)<<1;
        my ($key, $seq);
        my ($fhs, $fhq) = ($fhr[$j], $fhr[$j|1]);
        my ($a, $b, $c, $key2);
        while (<$fhs>) {
      	while (substr( $_,0,1) eq '#') { <$fhs>; };
      	my $t = <$fhq>;
      	while (substr( $t,0,1) eq '#') { $t = <$fhq>; }
      	if (/^>(\d+)_(\d+)_(\d+)_[FR]3/) {
      	  $key = sprintf("%.4d_%.4d_%.4d", $1, $2, $3); # this line could be improved on 64-bit machines
                ( $a, $b, $c ) = split( /_/, substr($t,1) );
      	  $key2 = sprintf("%.4d_%.4d_%.4d", $a, $b, $c); # this line could be improved on 64-bit machines
      	  die(qq/** unmatched read name: '$_' != '$t' , '$key', '$key2'\n/) unless ($key eq $key2);
      	  my $name = "$pre:$1_$2_$3/$i";
      	  $_ = substr(<$fhs>, 2);
      	  tr/0123./ACGTN/;
      	  my $s = $_;
      	  $_ = <$fhq>;
      	  s/-1\b/0/eg;
      	  s/^(\d+)\s*//;
      	  s/(\d+)\s*/chr($1+33)/eg;
      	  $seq = qq/\@$name\n$s+\n$_\n/;
      	  last;
      	}
        }
        return defined($seq)? ($key, $seq) : ();
      }

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Recent Advances in Sequencing Analysis Tools
        by seqadmin


        The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
        05-06-2024, 07:48 AM
      • seqadmin
        Essential Discoveries and Tools in Epitranscriptomics
        by seqadmin




        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
        04-22-2024, 07:01 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, 05-14-2024, 07:03 AM
      0 responses
      17 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 05-10-2024, 06:35 AM
      0 responses
      40 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 05-09-2024, 02:46 PM
      0 responses
      50 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 05-07-2024, 06:57 AM
      0 responses
      41 views
      0 likes
      Last Post seqadmin  
      Working...
      X