
**nilshomer** · 02-04-2010, 09:51 AM

Originally posted by jkbonfield

I've had a report from a user of srf2fastq that bfast cannot read its output. Specifically in the fastq I produce I write out N instead of . for ambiguity code, as N is the original "unknown" symbol with dot being a recent illumina invention (along with many other broken changes to fastq to further muddy the waters).

So my questions here are:

1) Is it correct that bfast cannot handle N and requires .? I haven't tested this myself.

2) Should I "fix" srf2fastq to output . instead via a command line option?

My own inclination to question 2 is simply to say no, fix bfast instead - we don't need to try and promote yet another format variant. However if the community feels it's needed then I'll put it in.

Comments anyone?

James

BFAST does handle "N"s (actually [nN.] for Illumina data). If you could give me a link to the SRA # or the srf file or the fastq file, I would be happy to debug to identify the problem.

One thing I would recommend for a SRF2FASTQ program is to output paired end (mate-pair) reads to the same file. Programs like BFAST and Velvet expect that there is only one FASTQ file, with paired end (mate pair) reads occurring successively with the same name. This allows BFAST at least to support triple-end, quad-end, or higher grouping data, which we have generated (it exists!). Having one file per "end" or "mate" is not scalable to such grouping data.

Nils
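For anyone wanting to try the single-file layout described above, here is a minimal Python sketch. It is not taken from srf2fastq or BFAST; the function names (`read_fastq`, `interleave`, `format_fastq`) are made up for illustration. It assumes plain 4-line FASTQ records, one input per end, the same read order in every input, and identical read names across ends.

```python
from itertools import zip_longest


def read_fastq(lines):
    """Yield (name, seq, qual) from an iterable of 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # skip the '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def interleave(*ends):
    """Yield records so that all ends of one fragment occur successively.

    Works for two, three, four or more ends. Assumes (hypothetically) that
    every end of a fragment carries exactly the same read name.
    """
    for group in zip_longest(*ends):
        if any(rec is None for rec in group):
            raise ValueError("inputs hold different numbers of reads")
        names = {rec[0] for rec in group}
        if len(names) > 1:
            raise ValueError(f"read names differ within a group: {sorted(names)}")
        yield from group


def format_fastq(record):
    """Format one (name, seq, qual) tuple as a 4-line FASTQ record."""
    name, seq, qual = record
    return f"@{name}\n{seq}\n+\n{qual}\n"
```

Usage would be along the lines of `for rec in interleave(read_fastq(f1), read_fastq(f2)): out.write(format_fastq(rec))`, with `f1`, `f2` and `out` being open file handles. Adding a third or fourth end only means passing more inputs.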

**lcollado** · 02-04-2010, 10:27 AM

Originally posted by nilshomer

This allows BFAST at least to support triple-end, quad-end, or higher grouping data, which we have generated (it exists!).

I had no idea that such type of data existed! WOW!!!

**jkbonfield** · 02-05-2010, 01:01 AM

Originally posted by nilshomer

BFAST does handle "N"s (actually [nN.] for Illumina data). If you could give me a link to the SRA # or the srf file or the fastq file, I would be happy to debug to identify the problem.

This wasn't my data so I haven't seen it; I simply had a patch offered to change the code to output dots instead.

I believe it was ABI SOLiD data though, which uses dot, so in that context I think it's correct: given that 0123 aren't "normal" sequence characters, we're already defining a new character set, and so . for ambiguity seems fine.

One thing I would recommend for a SRF2FASTQ program is to output paired end (mate-pair) reads to the same file. Programs like BFAST and Velvet expect that there is only one FASTQ file, with paired end (mate pair) reads occurring successively with the same name. This allows BFAST at least to support triple-end, quad-end, or higher grouping data, which we have generated (it exists!). Having one file per "end" or "mate" is not scalable to such grouping data.

Nils

This has already been added, so there is the option to output ends sequentially. In theory it should work for the triple and quad-ended scenario too as SRF supports such data using the "REGN" (region) list.

More intriguing will be to see quite what aligner output we can produce for it, though, given that SAM only supports two ends currently.

**jkbonfield** · 02-05-2010, 03:51 AM

Is there an official document describing the csfastq format? The SOLID run outputs I have do not contain any of these files, but I believe it to be ABI's own format?

The only documentation I can find consists of the ZOOM manual, which states (for example):

Code:

@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
T32322133300002330031001022230020232002203222030231 
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
!21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'

This appears to be taken from the NCBI downloads directly at:

ftp://ftp.ncbi.nlm.nih.gov/sra/stati...15241.fastq.gz

However an old post on the ABYSS users mailing list by Nils gives this example:

Code:

@ucla153_20090610_1102N.41:796_1758_1693
g23111112222312301131331111331023122031222222111120
+
:46<=985889::;<829462*3<464554-6403128+-+&-.'$$%.#
@ucla153_20090610_1102N.62:1159_1411_238
t32200300033221321101031000000332000002013110000000
+
89;669>?6<<;57.:+/#&+%$####$&#&&&#####'#&###$###%#

So my question is: which is correct, or are both formats in use? My instinct tells me that the former, with the fastq quality line being the same length as the sequence line, is the standard.

**nilshomer** · 02-05-2010, 10:52 AM

Originally posted by jkbonfield

Is there an official document describing the csfastq format? The SOLID run outputs I have do not contain any of these files, but I believe it to be ABI's own format?

The only documentation I can find consists of the ZOOM manual, which states (for example):

Code:

@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
T32322133300002330031001022230020232002203222030231 
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
!21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'

This appears to be taken from the NCBI downloads directly at:

ftp://ftp.ncbi.nlm.nih.gov/sra/stati...15241.fastq.gz

However an old post on the ABYSS users mailing list by Nils gives this example:

Code:

@ucla153_20090610_1102N.41:796_1758_1693
g23111112222312301131331111331023122031222222111120
+
:46<=985889::;<829462*3<464554-6403128+-+&-.'$$%.#
@ucla153_20090610_1102N.62:1159_1411_238
t32200300033221321101031000000332000002013110000000
+
89;669>?6<<;57.:+/#&+%$####$&#&&&#####'#&###$###%#

So my question is - which is correct or are both formats in use? My instinct tells me that the former, with the fastq quality line being the same length as the sequence line, is the standard.

BFAST requires that the # of qualities is one less than the length of the sequence line. For a 50bp read, there are only 50 observed colors (the adapter is never observed), so it makes sense to have 50 qualities while the sequence has length 51 (one adapter base plus 50 colors). This follows from viewing the original *csfasta and *_QV.qual files.

Arguments can be made for both.

Nils
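To make the difference between the two layouts concrete, here is a small Python sketch that classifies a color-space record by the size of its quality line and converts the "same length" layout to the "one less" layout. The function names are hypothetical and this is not BFAST or SRA toolkit code. It assumes that the extra quality in the "same length" layout is the first character and belongs to the leading adapter base, as the `!` in the SRR015241 example suggests.

```python
def quality_layout(seq, qual):
    """Classify one color-space record by how its quality line is sized.

    "same_length": len(qual) == len(seq), as in the SRR015241 example.
    "one_less":    len(qual) == len(seq) - 1, as in the ucla153 example.
    """
    if len(qual) == len(seq):
        return "same_length"
    if len(qual) == len(seq) - 1:
        return "one_less"
    raise ValueError(
        f"unexpected lengths: {len(seq)} sequence chars, {len(qual)} qualities"
    )


def to_one_less(seq, qual):
    """Return (seq, qual) with one quality per observed color.

    Assumption: in the "same_length" layout the first quality character
    belongs to the adapter base, so it is the one that gets dropped.
    """
    if quality_layout(seq, qual) == "same_length":
        return seq, qual[1:]
    return seq, qual


# The two records quoted in this thread.
sra_seq = "T32322133300002330031001022230020232002203222030231"
sra_qual = "!21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'"
ucla_seq = "g23111112222312301131331111331023122031222222111120"
ucla_qual = ":46<=985889::;<829462*3<464554-6403128+-+&-.'$$%.#"

print(quality_layout(sra_seq, sra_qual))    # same_length
print(quality_layout(ucla_seq, ucla_qual))  # one_less
```

Going the other way (padding a quality for the adapter base) would need an agreed placeholder value, which is exactly the kind of thing that would have to be standardized first.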

**jkbonfield** · 02-08-2010, 01:45 AM

Agreed, arguments can be made for both; however, a strong argument can be made that we pick ONE and stick with it. The world really doesn't need yet more fastq variants to deal with, yet apparently we already have two variants in the wild.

Does anyone know if "csfastq" is ABI's own format name or just a name we have placed on their data when reformatted? If the former then I think we can just do whatever they do. If not then I'd advise following the lead of the main public data banks (as frankly, good luck trying to get them to change their output formats now).

**nilshomer** · 02-08-2010, 02:05 AM

Originally posted by jkbonfield

Agreed, arguments can be made for both; however, a strong argument can be made that we pick ONE and stick with it. The world really doesn't need yet more fastq variants to deal with, yet apparently we already have two variants in the wild.

Does anyone know if "csfastq" is ABI's own format name or just a name we have placed on their data when reformatted? If the former then I think we can just do whatever they do. If not then I'd advise following the lead of the main public data banks (as frankly, good luck trying to get them to change their output formats now).

ABI is changing its output format to SAM/BAM, so it is not as bad as you think. I agree, let's standardize how to convert SRF to FASTQ for ABI SOLiD data (no need to call it csfastq).


Bfast, fastq and Ns
