**maubp** · 12-07-2009, 11:55 AM

That does look like something has gone wrong.

Also, assuming you are using FASTQ files from Illumina pipeline 1.3+, then don't use sol2sanger, use ill2sanger (requires a patch to MAQ - search the forum).

Or BioPerl, or EMBOSS, or an ad-hoc Perl script, or ... there are lots of examples on the forum. My biased suggestion would be to use Biopython, http://news.open-bio.org/news/2009/0...vert-function/

See also: http://en.wikipedia.org/wiki/FASTQ_format

**dawe** · 12-08-2009, 04:59 AM

Originally posted by cliff

It is weird. They should have come out the same size, right? Besides, in all the other lanes, this conversion output the same size for the pair.

Have you checked the files? sol2sanger doesn't print the sequence header twice, so

@seqID
CGATCGTAGCTAGC
+seqID
BBBBBBBBBBBBBB

becomes

@seqID
CGATCGTAGCTAGC
+
##############

(the scores are completely random in this example ^__^)

hence you may be missing bytes
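To put a number on that effect, here is a minimal Python sketch. It assumes strict four-line FASTQ records (no wrapped sequence lines), and the function name `bytes_saved` is made up for illustration. It only counts the bytes lost when each `+header` line is shortened to a bare `+`.

```python
def bytes_saved(fastq_text):
    """Bytes removed if every '+header' line is shortened to a bare '+'."""
    lines = fastq_text.splitlines()
    # In four-line FASTQ the '+' line is the third line of each record.
    return sum(len(line) - 1 for line in lines[2::4])


example = "@seqID\nCGATCGTAGCTAGC\n+seqID\nBBBBBBBBBBBBBB\n"
print(bytes_saved(example))  # 5
```

The saving per record is the header length, so two files with the same number of reads and equally long headers should shrink by the same amount.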

**maubp** · 12-08-2009, 05:02 AM

I'd wondered about that too dawe, and while it does explain why the converted files are smaller than the originals, it does not explain why they are different sizes to each other.

cliff - how about posting the first few records of each file?

**dawe** · 12-08-2009, 05:07 AM

Originally posted by maubp

I'd wondered about that too dawe, and while it does explain why the converted files are smaller than the originals, it does not explain why they are different sizes to each other.

cliff - how about posting the first few records of each file?

You're right! On a second read I realize the issue here is not "the sizes differ before and after conversion" but "the paired reads differ in size after conversion"... Whoops!

d

**cliff** · 12-08-2009, 08:18 AM

Thanks for all your replies. Here are the FASTQ files:

1: $ more s_1_1_sequence.fastq

@BILLIEHOLIDAY:1:1:3:1204#0/1
GACCACACCCTGNAGCCCTTTCTGTCCAAACAGAAAGTAAGATATTCCTTGGGCTGGTTGGTCTGAGGACCTGAGGTTGTAGGTGGACACCCTCATGGAGG
+
BBBCBCCCCBB.&6;=-:>9>7?.>@=B7B>+?1.2=0;-90?<B<>;><@@3/6<*4*>47584.:597<723>9%%%%%%%%%%%%%%%%%%%%%%%%%
@BILLIEHOLIDAY:1:1:3:277#0/1
TTGAGACAAGAGNATCACTTGAACCCAGAAATTCGAGGCCAGCCTGGGCAACAGAGAGAGCCCTCATTTCTACAAAAAATAAAAATATTAGCCAGGCATGG
+
BABBA?BB8?<9&40C4BA@:?@BBB:B?A@>8B=@)7B><8@B6:>>=<4=38?8?9;739%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2: $ more s_1_2_sequence.fastq
@BILLIEHOLIDAY:1:1:3:1204#0/2
TCATTACCTACTTTATTGCTCACACATAGCCTGTTTGGTGGTCTCTTCACACGGACGCGTGTGACATTTGGTGCCAAAACCCAGGACAGGAGGAGCNCTTT
+
BCBBCAACB@CCCBAC@?B@B?C<CB@CBBBAB=ABB?@A45BCCBBAACCB?BB@BBB6B@B@B@@A@AA@BA@?3?8@?@B<@-@@<6@A%%%%%%%%%
@BILLIEHOLIDAY:1:1:3:277#0/2
GATAGGGTTTAGATGTCGTTTAGGCTGGAGTGCAGTGGTACATCACGGCTCACTGCAGCCTCGACCTCCCAGGCTTAAGCAGCCCTCCCACCTCAGCCTCC
+
A=@9B?>6??B7ABA=BC>9B@6BC@B>@B4BBBB;BB1B<;BABBAABB<(3@=?@A>@=@A>6A>?>>>>??8=93?5>=):>=;A92;81?26>226>

**maubp** · 12-08-2009, 08:24 AM

Or using the [ code ] tags, since otherwise the forum mangles them:

1: $ more s_1_1_sequence.fastq

Code:

@BILLIEHOLIDAY:1:1:3:1204#0/1
GACCACACCCTGNAGCCCTTTCTGTCCAAACAGAAAGTAAGATATTCCTTGGGCTGGTTGGTCTGAGGACCTGAGGTTGTAGGTGGACACCCTCATGGAGG
+
BBBCBCCCCBB.&6;=-:>9>7?.>@=B7B>+?1.2=0;-90?<B<>;><@@3/6<*4*>47584.:597<723>9%%%%%%%%%%%%%%%%%%%%%%%%%
@BILLIEHOLIDAY:1:1:3:277#0/1
TTGAGACAAGAGNATCACTTGAACCCAGAAATTCGAGGCCAGCCTGGGCAACAGAGAGAGCCCTCATTTCTACAAAAAATAAAAATATTAGCCAGGCATGG
+
BABBA?BB8?<9&40C4BA@:?@BBB:B?A@>8B=@)7B><8@B6:>>=<4=38?8?9;739%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2: $ more s_1_2_sequence.fastq

Code:

@BILLIEHOLIDAY:1:1:3:1204#0/2
TCATTACCTACTTTATTGCTCACACATAGCCTGTTTGGTGGTCTCTTCACACGGACGCGTGTGACATTTGGTGCCAAAACCCAGGACAGGAGGAGCNCTTT
+
BCBBCAACB@CCCBAC@?B@B?C<CB@CBBBAB=ABB?@A45BCCBBAACCB?BB@BBB6B@B@B@@A@AA@BA@?3?8@?@B<@-@@<6@A%%%%%%%%%
@BILLIEHOLIDAY:1:1:3:277#0/2
GATAGGGTTTAGATGTCGTTTAGGCTGGAGTGCAGTGGTACATCACGGCTCACTGCAGCCTCGACCTCCCAGGCTTAAGCAGCCCTCCCACCTCAGCCTCC
+
A=@9B?>6??B7ABA=BC>9B@6BC@B>@B4BBBB;BB1B<;BABBAABB<(3@=?@A>@=@A>6A>?>>>>??8=93?5>=):>=;A92;81?26>226>

At first glance, I see nothing amiss with the FASTQ representation. Interestingly the read quality of the forward reads trails off much more quickly than the reverse reads.
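One rough way to quantify "trails off" is to measure the run of the lowest quality symbol at the 3' end of each read. The sketch below is only illustrative: the helper names are hypothetical, and treating `%` as the floor symbol is an assumption based on the records posted above.

```python
def trailing_run(quality, symbol="%"):
    """Length of the run of `symbol` at the 3' end of a quality string."""
    return len(quality) - len(quality.rstrip(symbol))


def tail_fraction(qualities, symbol="%"):
    """Mean fraction of each read taken up by its trailing run of `symbol`."""
    fractions = [trailing_run(q, symbol) / float(len(q)) for q in qualities if q]
    return sum(fractions) / len(fractions) if fractions else 0.0
```

Feeding the quality lines of the forward and reverse files to `tail_fraction` separately would give two numbers to compare.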

**cliff** · 12-08-2009, 08:58 AM

Thanks, maubp!

We use Illumina pipeline 1.5. I am thinking of trying ill2sanger. Do I need to use ill2sanger to convert all my _sequence.txt files to .fastq files? As I said, all the other FASTQ files have the same size between paired reads. Can I just try ill2sanger on the paired reads which differ in .fastq size?

Thanks~

**maubp** · 12-08-2009, 09:07 AM

Originally posted by cliff

Thanks, maubp!

We use Illumina pipeline 1.5. I am thinking of trying ill2sanger. Do I need to use ill2sanger to convert all my _sequence.txt files to .fastq files? As I said, all the other FASTQ files have the same size between paired reads. Can I just try ill2sanger on the paired reads which differ in .fastq size?

Thanks~

This probably won't make any difference to the file size oddity. The difference between sol2sanger and ill2sanger is how they map the quality scores.

If your data is from Illumina 1.3 or later, use ill2sanger.

If your data is from Solexa 1.0 up to Illumina 1.2, use sol2sanger.
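For illustration, here is a minimal Python sketch of the two mappings. It is not MAQ's actual code. It assumes the commonly documented encodings: Illumina 1.3+ stores PHRED scores at ASCII offset 64, Solexa stores Solexa-scaled scores at offset 64, and Sanger stores PHRED scores at offset 33. The function names are hypothetical, and the rounding may differ slightly from what MAQ does.

```python
import math


def illumina_to_sanger(quality):
    """Re-encode PHRED+64 (Illumina 1.3+) as PHRED+33 (Sanger)."""
    return "".join(chr(ord(c) - 64 + 33) for c in quality)


def solexa_to_sanger(quality):
    """Convert Solexa-scaled scores (+64) to PHRED scores (+33)."""
    out = []
    for c in quality:
        q_solexa = ord(c) - 64
        # Standard Solexa -> PHRED relation; rounding here is an assumption.
        q_phred = 10 * math.log10(10 ** (q_solexa / 10.0) + 1)
        out.append(chr(int(round(q_phred)) + 33))
    return "".join(out)


print(illumina_to_sanger("B"))  # #
print(solexa_to_sanger("B"))    # %
```

Both functions return one output character per input character, which is why the choice of mapping should not change the file size. Under these formulas an Illumina `B` becomes `#` with the PHRED mapping but `%` with the Solexa mapping.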

**cliff** · 12-08-2009, 09:21 AM

maubp, thanks. I just downloaded ill2sanger from here http://sourceforge.net/tracker/?func...15&atid=938895

Do you know how to install and use this maq-ill2sanger.patch?

I am sorry, I don't have a CS background.

**maubp** · 12-08-2009, 09:59 AM

Originally posted by cliff

maubp, thanks. I just downloaded ill2sanger from here http://sourceforge.net/tracker/?func...15&atid=938895

Do you know how to install and use this maq-ill2sanger.patch?

I am sorry, I don't have a CS background.

There was a discussion on this here:

MAQ problems - SEQanswers

http://seqanswers.com/forums/showthread.php?t=2499&highlight=ill2sanger

Basically (and this isn't going to be detailed enough), grab the MAQ source code, use the patch command to make this change, compile MAQ, install MAQ. If you didn't install MAQ in the first place, this might be tricky.

--

Alternatively, there are non-MAQ options for converting the FASTQ files.

If you like Perl, there are plenty of scripts to do this in Perl (some using BioPerl) - search the forum.

You could also use the seqret tool from EMBOSS 6.1.0 patch 1 or later.

Other options include installing Biopython 1.52 or later, and using a tiny Python script like http://www.biopython.org/wiki/Reading_from_unix_pipes or like this:

Code:

from Bio import SeqIO

# Convert Illumina 1.3+ FASTQ (PHRED+64) to Sanger FASTQ (PHRED+33).
count = SeqIO.convert("s_1_1_sequence.txt", "fastq-illumina", "s_1_1_sequence.fastq", "fastq-sanger")
print("Converted %i forward reads" % count)
count = SeqIO.convert("s_1_2_sequence.txt", "fastq-illumina", "s_1_2_sequence.fastq", "fastq-sanger")
print("Converted %i reverse reads" % count)

**HESmith** · 12-08-2009, 01:45 PM

I'm having a different issue with the ill2sanger patch in updating an existing install of MAQ. Downloaded the patch from sourceforge, ran the patch command (which modified fastq2bfq.c, main.c, and main.h), compiled MAQ with "make" and installed with "make install". Tried to run the ill2sanger command, which exited with a segmentation fault. Ran the command in gdb, which returned "Program received signal SIGSEGV, Segmentation fault. 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6". Backtrace returned the following:
"#0 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6
#1 0x000000340fc4faa8 in fprintf () from /lib64/libc.so.6
#2 0x0000000000405369 in ill2sanger (fpin=0x63a010, fpout=0x0) at fastq2bfq.c:105
#3 0x0000000000405424 in ma_ill2sanger (argc=<value optimized out>, argv=<value optimized out>)
at fastq2bfq.c:137
#4 0x000000340fc1ea2d in __libc_start_main () from /lib64/libc.so.6
#5 0x00000000004019b9 in _start ()"

Any suggestions in solving the problem(s) would be greatly appreciated.

Thanks,
Harold

**dawe** · 12-09-2009, 01:09 AM

Originally posted by HESmith

I'm having a different issue with the ill2sanger patch in updating an existing install of MAQ. Downloaded the patch from sourceforge, ran the patch command (which modified fastq2bfq.c, main.c, and main.h), compiled MAQ with "make" and installed with "make install". Tried to run the ill2sanger command, which exited with a segmentation fault. Ran the command in gdb, which returned "Program received signal SIGSEGV, Segmentation fault. 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6". Backtrace returned the following:
"#0 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6
#1 0x000000340fc4faa8 in fprintf () from /lib64/libc.so.6
#2 0x0000000000405369 in ill2sanger (fpin=0x63a010, fpout=0x0) at fastq2bfq.c:105
#3 0x0000000000405424 in ma_ill2sanger (argc=<value optimized out>, argv=<value optimized out>)
at fastq2bfq.c:137
#4 0x000000340fc1ea2d in __libc_start_main () from /lib64/libc.so.6
#5 0x00000000004019b9 in _start ()"

Any suggestions in solving the problem(s) would be greatly appreciated.

Thanks,
Harold

Interesting... can you tell me your system configuration (hardware/software)? Also, can you test whether sol2sanger works? ill2sanger is nothing but a different version of sol2sanger, so a segfault should be raised in that case too.

**maubp** · 12-09-2009, 05:32 AM

As dawe suggested, retry sol2sanger on your newly compiled MAQ to see if that crashes.

It would also be worth re-downloading the FASTQ files (from your service provider, collaborator - wherever you got them from) just in case there was a corruption on transfer. That could explain the file size oddity. It's a long shot though.

**cliff** · 12-21-2009, 01:18 PM

Hi, maubp

I have tried ill2sanger, but still got the same problem.

The original txt files from Read 1 and Read 2 of the same lane are the same size, as shown below:

4116883072 read1.txt
4116883072 read2.txt

But, after ill2sanger, the two reads have different sizes:

3644668984 read1.fastq
3644660878 read2.fastq

This problem is exactly the same as what I saw after sol2sanger. And all the other lanes are fine except this one.

Do you have thoughts on this?

Thanks
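The two converted files differ by 8106 bytes (3644668984 - 3644660878), so one way to narrow this down is to walk both files in parallel and report the first record where they stop matching. The following is a minimal sketch, not a tested tool. It assumes strict four-line records, read IDs ending in `/1` and `/2`, and equal read lengths in both files. The names `read_records` and `first_mismatch` are made up for illustration.

```python
import io
from itertools import zip_longest


def read_records(handle):
    """Yield (header, seq, plus, qual) tuples from four-line FASTQ."""
    while True:
        header = handle.readline()
        if not header:
            return
        rest = [handle.readline() for _ in range(3)]
        yield tuple(s.rstrip("\n") for s in [header] + rest)


def first_mismatch(handle1, handle2):
    """Return (record index, reason) for the first divergence, else None."""
    pairs = zip_longest(read_records(handle1), read_records(handle2))
    for i, (r1, r2) in enumerate(pairs):
        if r1 is None or r2 is None:
            return i, "one file has fewer records"
        # Strip the trailing /1 or /2 before comparing read IDs.
        if r1[0].rsplit("/", 1)[0] != r2[0].rsplit("/", 1)[0]:
            return i, "read IDs differ"
        if sum(map(len, r1)) != sum(map(len, r2)):
            return i, "record lengths differ"
    return None


fwd = io.StringIO("@r:1#0/1\nACGT\n+\n%%%%\n@r:2#0/1\nACGT\n+\n%%%%\n")
rev = io.StringIO("@r:1#0/2\nTTGA\n+\n%%%%\n@r:2#0/2\nTTGA\n+\n%%%\n")
print(first_mismatch(fwd, rev))  # (1, 'record lengths differ')
```

For real files, the two `io.StringIO` objects would be replaced by `open("read1.fastq")` and `open("read2.fastq")`.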


Maq - sol2sanger problem - different sizes for the pair?
