Seqanswers Leaderboard Ad

**ECO** · 05-20-2008, 04:59 PM

I haven't gotten it running yet on my system, but Maq seems to include such a functionality:

Will SolexaPipeline generate the FASTQ file Maq needs? If not, what should I do?
The Gerald module of the SolexaPipeline generates a FASTQ-like format, but it is not a FASTQ in fact. You should convert the format with "maq sol2sanger". This page gives more explanation. The latest SolexaPipeline-0.3.0 generates a new file in the "Export" format. Read sequences, qualities and most of alignment information is stored in this file

From here...http://maq.sourceforge.net/qual.shtml

**Farhat** · 05-21-2008, 12:19 PM

Thanks for the information, I will check it out. In the meantime I got a working script from the Sanger Center to do the conversion.

**ECO** · 05-21-2008, 01:53 PM

If you have permission, post it!

**Farhat** · 05-23-2008, 06:56 AM

Originally posted by ECO View Post

If you have permission, post it!

Sure.

There seems to be some problem with my posting it. I have tried several times since yesterday but I get a server unavailable message every time.

Service Temporarily Unavailable

The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.

**ECO** · 05-23-2008, 07:05 AM

Hi Farhat,

There is a bug in vBulletin and/or my server configuration that crops up with some characters. Try putting it within [ code ] [ / code ] tags before posting it, otherwise you can email it to me. I will PM my email address.

**ECO** · 05-23-2008, 08:59 AM

Script to Convert Illumina to FASTQ

Code:

#!/usr/bin/perl -w
#######################################################################
# This software has been created by Genome Research Limited (GRL).    # 
#                                                                     #
# GRL hereby grants permission to use, copy, modify and distribute    # 
# this software and its documentation for non-commercial purposes     # 
# without fee at the user's own risk on the basis set out below.      #
#                                                                     #
# GRL neither undertakes nor accepts any duty whether contractual or  # 
# otherwise in connection with the software, its use or the use of    # 
# any derivative, and makes no representations or warranties, express #
# or implied, concerning the software, its suitability, fitness for   #
# a particular purpose or non-infringement.                           #
#                                                                     #
# In no event shall the authors of the software or GRL be responsible # 
# or liable for any loss or damage whatsoever arising in any way      # 
# directly or indirectly out of the use of this software or its       # 
# derivatives, even if advised of the possibility of such damage.     #
#                                                                     #
# Our software can be freely distributed under the conditions set out # 
# above, and must contain this copyright notice.                      #
#######################################################################

# Author: James Bonfield, March 2006
#
# Reads _seq.txt files and _prb.txt/_qcal.txt files in the main
# Bustard directory to produce a fastq output file. Sequence names are
# generated based on the input filename and their position in the
# file. Optionally an additional 'command' may be supplied to run as a
# filter to modify the name, eg to make the names unique across runs
#
# FH 5.21.08 Modified 'directory/filename hackery' to remove underscores 

use strict;
use Cwd;
use Getopt::Long;

# Argument parsing
my %opts = ("trim" => 0, "calibrated" => 0, "quiet" => 0);

$opts{gerald_dir} = [sort glob("GERALD*")]->[-1];

exit unless GetOptions(\%opts, "trim=i", "name_sub=s", "calibrated|c",
               "gerald_dir=s", "quiet|q");

# Compute log-odds to fastq-scale mapping
my @rescale;
for (my $i = -100; $i < 100; $i++) {
    $rescale[$i+100] = chr (041+int(10*log(1+10**($i/10))/log(10)+.499));
}
my %hmap = ('A' => 0, 'C' => 1, 'G' => 2, 'T' => 3, '.' => 0);

# Loop through all aligned sequence records;
foreach my $fn (@ARGV) {
    if (-d $fn) {
    foreach (glob("$fn/*seq.txt")) {
        process_file($_);
    }
    } else {
    process_file($fn);
    }
}

# Processes a single _seq.txt (the filename) and it's corresponding _prb.txt
# quality file.
sub process_file {
    my ($in_fn) = @_;
    local $"="";

    print STDERR "processing $in_fn\n" unless $opts{quiet};

    # Directory/filename hackery
    $in_fn = getcwd() . "/$in_fn" if ($in_fn !~ /^\//);
    my ($dir, $fn) = ($in_fn =~ m:(.*)/(.*):);

    if ($fn !~ /(.*)seq.txt$/) {
    print STDERR "Filename '$fn' is not a *seq.txt file\n";
    return;
    }

    my $base = $1;

    # Default 'op' is based on directory name, or $opts{name_sub}
    my $old_dir = getcwd();
    chdir($dir);
    my @dirs = split('/', getcwd());

    chdir($old_dir);
    my ($machine, $run) = ("unknown", "unknown");
    if (exists($dirs[-4]) &&
    $dirs[-4] =~ m:^[0-9]+_slxa-([^-_]+).*[-_]([0-9]+)$:) {
    ($machine, $run) = ($1, $2+0);
    }    
    my $op;
    if (exists($opts{name_sub})) {
    $op = $opts{name_sub};
    } else {
    $op = "\"IL${machine}_${run}_\${lane}_\${tile}_\${x}_\${y}\"";
    }

    # Open the files
    my $qfn;
    if ($opts{calibrated}) {
    $qfn = "$opts{gerald_dir}/${base}_qcal.txt";
    } else {
    $qfn = "${base}prb.txt";
    }
    my $count = 0;

    open (my $qualfh,  "<", "$dir/$qfn")
    || die "Couldn't open prb file '$qfn': $@";
    open (my $seqfh,   "<", "$dir/$fn") 
    || die "Couldn't open seq file '$fn': $@";

    # Format of seq is <lane> <tile> <xpos> <ypos> <sequence>
    while (<$seqfh>) {
    my $q = (<$qualfh>);

    # Decode quality into 1 per called base
    my ($lane,$tile,$x,$y,$seq) = split(/\s+/,$_);
    my $i = 0;
    my @qual;
    if ($opts{calibrated}) {
        chomp($q);
        @qual = map {ord($_) - 64} split("", $q);
    } else {
        my @qa = split('\t', $q);
        @qual = map {[split()]->[$hmap{substr($seq, $i++, 1)}]} @qa;
    }

    # Map from log-odds to Phred scale
    @qual = map { $rescale[$_+100] } @qual;

    ++$count;
    $_ = eval $op;
    die if $@;

    $seq =~ s/[^ACGT]/N/gi;
    print "\@$_\n" . substr($seq, $opts{trim}) .
         "\n+\n" . substr("@qual", $opts{trim}) . "\n";
    }

    close($seqfh);
    close($qualfh);
}

__END__

=head1 NAME

solexa2fastq - converts Bustard seq and prb files into fastq format

=head1 SYNPOSIS

 solexa2fastq [options] s_1_1_seq.txt ...
 solexa2fastq [options] bustard_dir ...

=head1 DESCRIPTION

This converts a Bustard output files (I<*_seq.txt> and I<*_prb.txt>,
and optionally GERALD*/*_qcal.txt) into Fastq format. Only the
I<seq.txt> files need to be specified on the command line as the
program will automatically read either the associated I<prb.txt> or
I<qcal.txt> file. Alternatively specifying a directory acts as a
synonym for reading all I<*_seq.txt> files from within that
directory. Any number of files or directories may be specified, with
all translations being concatenated and written to stdout.

=head1 OPTIONS

=over

=item --trim <integer>

This specifies how many bases to trim from the start of the
sequence. This now defaults this is "0", but older runs will likely
require "1". (Should this be specifying use-bases and an
nYYYYY.. string instead to allow trimming off the other end too?)

=item --name_sub <perlop>

This is a perl expression applied to the generate the reading name.
It may use the perl variables $machine, $run, $lane, $tile, $x, $y and
$count with the the first two being computed from the current working
directory and the latter being an auto-incremented counter. By default
it uses "IL\${machine}_\${run}_\${lane}_\${tile}_\${x}_\${y}".

For example to use the older solexa2fastq default of
s_<lane>_<tile>_<counter> you'd specified:

 --name_sub '"s_${lane}_${tile}_${count}"'

A more complex example could be:

 --name_sub 'sprintf("s_%d_%04d_%X_%X", $run, $tile, $x, $y)'

=item --calibrated

This boolean flag indicates that the quality values should be fetched
from the *_qcal.txt files in the GERALD directory.

=item --gerald_dir <directory name>

This specifies the GERALD output directory to use with the --calibrated flag.
Without it the last (sorted by name, not time) one found in the
directory is used instead. Note that there is an implicit assumption that
the Bustard directory is 3 levels lower down than our current working
directory. (This is assumed by when writing the output files to "../../..")

=back

=head1 AUTHOR

James Bonfield

=cut

Thanks Farhat, got it. Via trial and error that error you were getting was due to the "chr(" on line 49. Adding a space got rid of the error, weird. I've attached the original script to this post also. Please report if it works for you.

Attached Files

solexa2fastq.pl (6.8 KB, 277 views)

**biocc** · 10-08-2008, 06:53 AM

my result

Originally posted by Farhat View Post

Thanks for the information, I will check it out. In the meantime I got a working script from the Sanger Center to do the conversion.

HI,using your script, I got a result like:
@ILunknown_unknown_1_1_84_918
TACTTGTTGATCTATATGTTACTTAATATAATGAAG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII0II%
@ILunknown_unknown_1_1_334_568
AGAGCGTGATAGTGCCATGCACAGACCTGTAGATAT
+
6EI+-$/I,%>E'8",I(IIB$I&5I,.<&%I&)4"
I am puzled.and my prb.txt is like this:
-40 -40 40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 -40
and seq.txt is
1 1 119 395 GAAGAGGAGATAAATAAAACTCAAAATACAGCTGAA

**kmcarr** · 10-08-2008, 08:21 AM

Besides the solexa2fastq.pl script you can also use the script fq_all2std.pl. This script is part of the standard MAQ distribution (look in the scripts/ subdirectory). It handles conversion of just about every Solexa file format to FASTA/FASTQ, including the .seq and .prb files.

**swbarnes2** · 10-08-2008, 10:20 AM

Originally posted by biocc View Post

HI,using your script, I got a result like:
@ILunknown_unknown_1_1_84_918
TACTTGTTGATCTATATGTTACTTAATATAATGAAG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII0II%
@ILunknown_unknown_1_1_334_568
AGAGCGTGATAGTGCCATGCACAGACCTGTAGATAT
+
6EI+-$/I,%>E'8",I(IIB$I&5I,.<&%I&)4"
I am puzled.and my prb.txt is like this:
-40 -40 40 -40 40 -40 -40 -40 40 -40 -40 -40 -40 -40
and seq.txt is
1 1 119 395 GAAGAGGAGATAAATAAAACTCAAAATACAGCTGAA

That's all correct.

@name_of_sequence
sequence
+name_of _sequence, or left blank to keep the file smaller
quality of each letter

Is the fastq format. The highest number of the group of 4 is the quality score of that base (its position in the 4 tells you which letter it is). Then an ASCII letter stands for that number. From the letters you've got there, ! means a quality of 0, and I is 40, the best quality.

The unknowns are the machine and the date, but the rest of the name is the lane, the tile, and the X + Y coordiantes of the cluster.

**jkbonfield** · 10-16-2008, 12:41 AM

The original doesn't seem to have that space after chr so it's likely some problem with mailing it or some daft word wrapping thing in a mail client.

If you're in the run folder then the automatically generated name should work, but it checks pwd to figure that out so copying the files elsewhere would break that. There's also some hideous hackery of supplying an anonymous function to generate the sequence names. When this was written Illumina hadn't produced any fastq files at all and even eland output didn't have sequence nanes, so we were a bit in the dark as to what names would look like. Hence the overcomplicated method in there :-)

James

**olus** · 10-17-2008, 01:31 AM

Perhaps I've missed it, but nobody cited the 'ShortRead' package in Bioconductor.
Thought I didn't use it yet (hope to do so very soon ;-)) the manual claims that:
"ShortRead includes additional functions to facilitate input. For instance, read-
Prb reads Solexa _prb.txt files. These files contain base-specific quality in-
formation, and readPrb returns an SFastqQuality-class object representing the
fastq-encoded base-specic quality scores of all reads."

Cheers

**hannat** · 01-26-2009, 03:28 AM

seq.txt
.................................................. ......................
6 1 914 893 GCTACTGCCGTGACCTCATTTCTCTTA
6 1 898 905 GAAAAAGAGAAAGTTTAGGAGATCGAT
.................................................. ...................................
prob.txt
.................................................. ...................................
-30 -30 30 -30 -30 30 -30 -30 -30 -30 -30 30 30 -30 -30
-30 -30 30 -30 -30 -30 -30 -30 30 -30 -30 30 -30 -30 3
0 -30 -30 -30 30 -30 -30 -30 -30 30 -30 -30 -30 -30 30
-30 -30 30 -30 30 -30 -30 -30 -30 30 -30 -30 -30 30 -30
-30 -30 -30 -30...
.................................................. ...........................

but when i run
fq_all2std.pl seqprb2std seq.txt prb.txt
The output is following,
...........................................
@6:1:914:893
GCTACTGCCGTGACCTCATTTCTCTTA
+
???????????????????????????
..................................................

but i have lots warning messages, like following:

Use of uninitialized value in array element at solexa2fastq.pl line 128, <$qualfh> line 6609.
Argument "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0..." isn't numeric in addition (+) at solexa2fastq.pl line 132, <$qualfh> line 6609.

i wonder if any of you had met similar problems before, if so, where do you think the problem is?
now i am checking prb.txt, hope i can find out whats wrong with my input.

**robsyme** · 05-23-2009, 03:22 AM

fq_all2std.pl Errors.

UPDATE: My simple error, the prb had somehow become truncated. Should have checked before posting, my apologies.

I tried maq's fq_all2std.pl script, but having initial problems

Executing

Code:

fq_all2std.pl seqprb2std foo.seq.txt bar.prb.txt > sn4.fastq

seems to run well for a while, but then a get a flood of

Code:

..
Use of uninitialized value $_ in split at fq_all2std.pl line 148, <$fhq> line 1132267
Use of uninitialized value $_ in split at fq_all2std.pl line 148, <$fhq> line 1132267
Use of uninitialized value $_ in split at fq_all2std.pl line 148, <$fhq> line 1132267
..

I'm pretty sure those blank lines are supposed to filled with quality information.

**Oxygen81** · 08-07-2009, 06:15 AM

Hello everybody,

like some others here I have some problems to convert my Solexa reads into an appropriate format for Maq.
I tried the scipts fq_all2std.pl and the Script of Farhad (#7) but nothing is working. Here is an example of my files:

*.seq.txt
>KN-964_02-2424_07-03179_1_1_1_1001_230
GTGCTAA ... (36 sites)

*.qual.txt
>KN-964_02-2424_07-03179_1_1_1_1001_230
32 32 32 32 32 32 32 ...

The sequencing was done in 2007. The command seqprb2std is not contained in fq_all2std.pl, is it? We did an other sequencing (paired-end) in 2008/2009 where I got a new format:

>KN-1413_07-00010:6:1:727:174/1: GTGTG...:40 40 40 40 40

First I thought this is the new Illumina 1.3 format and tried to convert it with a script but it doesn't work, too.

So what can I do? Could someone give me an example of a working input and tell me what I have to prompt in the command line exactly.

Thank you very much and a nice weekend to all!
Janina

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 24 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 20 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

Illumina's PRB file to FastQ

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News