  • nilshomer
    replied
    Originally posted by jkbonfield
    Agreed, arguments can be made for both; however, a strong argument can be made that we pick ONE and stick with it. The world really doesn't need yet more fastq variants to deal with, yet apparently we already have two variants in the wild.

    Does anyone know if "csfastq" is ABI's own format name or just a name we have placed on their data when reformatted? If the former then I think we can just do whatever they do. If not then I'd advise following the lead of the main public data banks (as frankly, good luck trying to get them to change their output formats now).
    ABI is changing output format to SAM/BAM, so it is not as bad as you think. I agree, let's standardize how to convert SRF to FASTQ for ABI SOLiD data (no need to call it csfastq).

  • jkbonfield
    replied
    Agreed, arguments can be made for both; however, a strong argument can be made that we pick ONE and stick with it. The world really doesn't need yet more fastq variants to deal with, yet apparently we already have two variants in the wild.

    Does anyone know if "csfastq" is ABI's own format name or just a name we have placed on their data when reformatted? If the former then I think we can just do whatever they do. If not then I'd advise following the lead of the main public data banks (as frankly, good luck trying to get them to change their output formats now).

  • nilshomer
    replied
    Originally posted by jkbonfield
    Is there an official document describing the csfastq format? The SOLID run outputs I have do not contain any of these files, but I believe it to be ABI's own format?

    The only documentation I can find consists of the ZOOM manual, which states (for example):

    Code:
    @SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
    T32322133300002330031001022230020232002203222030231 
    +SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
    !21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'
    This appears to be taken from the NCBI downloads directly at:

    ftp://ftp.ncbi.nlm.nih.gov/sra/stati...15241.fastq.gz

    However an old post on the ABYSS users mailing list by Nils gives this example:

    Code:
    @ucla153_20090610_1102N.41:796_1758_1693
    g23111112222312301131331111331023122031222222111120
    +
    :46<=985889::;<829462*3<464554-6403128+-+&-.'$$%.#
    @ucla153_20090610_1102N.62:1159_1411_238
    t32200300033221321101031000000332000002013110000000
    +
    89;669>?6<<;57.:+/#&+%$####$&#&&&#####'#&###$###%#
    So my question is - which is correct, or are both formats in use? My instinct tells me that the former, with the fastq quality line being the same length as the sequence line, is the standard.
    BFAST requires that the number of qualities is one less than the length of the sequence line. For a 50bp read, there are only 50 observed colors (the adapter is never observed), so it makes sense to have 50 qualities while the sequence is length 51 (one adapter base plus 50 colors). This follows from viewing the original *csfasta and *_QV.qual files.

    Arguments can be made for both.

    Nils
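As a rough illustration of where the 50-versus-51 difference comes from, the sketch below pairs one csfasta read with its quality values. The input layouts (an adapter base followed by color digits, and whitespace-separated integer qualities), the Phred+33 encoding, and the clamping of negative values are all assumptions made for illustration. `csfasta_qual_to_fastq` is a hypothetical helper, not BFAST's own converter.

```python
def csfasta_qual_to_fastq(name, colors, qvs, offset=33):
    """Build one FASTQ record from a csfasta read and its quality values.

    name   -- read name without the leading '>' or '@'
    colors -- csfasta read, e.g. "T0123": one adapter base, then color calls
    qvs    -- whitespace-separated integer qualities, one per color call
    offset -- ASCII offset for the quality encoding (assumed Phred+33)
    """
    values = [int(v) for v in qvs.split()]
    # The adapter base is never observed, so it has no quality value:
    # N color calls give N qualities but a sequence line of length N + 1.
    if len(values) != len(colors) - 1:
        raise ValueError("expected exactly one quality value per color call")
    # Assumption: negative values mark missing calls; clamp them to 0.
    qual = "".join(chr(max(v, 0) + offset) for v in values)
    return f"@{name}\n{colors}\n+\n{qual}\n"


if __name__ == "__main__":
    print(csfasta_qual_to_fastq("example_read", "T0123", "10 20 30 5"), end="")
```

Under this convention the quality line is always one character shorter than the sequence line. The other convention would need an extra placeholder quality for the adapter base.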

  • jkbonfield
    replied
    Is there an official document describing the csfastq format? The SOLID run outputs I have do not contain any of these files, but I believe it to be ABI's own format?

    The only documentation I can find consists of the ZOOM manual, which states (for example):

    Code:
    @SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
    T32322133300002330031001022230020232002203222030231 
    +SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
    !21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'
    This appears to be taken from the NCBI downloads directly at:

    ftp://ftp.ncbi.nlm.nih.gov/sra/stati...15241.fastq.gz

    However an old post on the ABYSS users mailing list by Nils gives this example:

    Code:
    @ucla153_20090610_1102N.41:796_1758_1693
    g23111112222312301131331111331023122031222222111120
    +
    :46<=985889::;<829462*3<464554-6403128+-+&-.'$$%.#
    @ucla153_20090610_1102N.62:1159_1411_238
    t32200300033221321101031000000332000002013110000000
    +
    89;669>?6<<;57.:+/#&+%$####$&#&&&#####'#&###$###%#
    So my question is - which is correct, or are both formats in use? My instinct tells me that the former, with the fastq quality line being the same length as the sequence line, is the standard.
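To see which of the two layouts a given file follows, a minimal sketch along these lines may help. It assumes strict 4-line FASTQ records with no wrapped lines. `read_fastq`, `classify_record` and `summarise` are illustrative names, not part of srf2fastq, BFAST or ZOOM.

```python
from collections import Counter


def read_fastq(lines):
    """Yield (header, seq, qual) from lines in strict 4-line FASTQ layout."""
    it = (line.rstrip("\r\n") for line in lines)
    # Zipping the same iterator four times consumes one record per step.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header, seq.strip(), qual.strip()


def classify_record(seq, qual):
    """Name the length convention used by one color-space record."""
    if len(qual) == len(seq):
        return "equal"        # quality line as long as the sequence line
    if len(qual) == len(seq) - 1:
        return "one-shorter"  # one quality per color, none for the adapter base
    return "other"


def summarise(lines):
    """Count how many records follow each convention."""
    return Counter(classify_record(seq, qual) for _, seq, qual in read_fastq(lines))


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        print(dict(summarise(handle)))
```

A file that mixes conventions, or reports "other", probably deserves a closer look before it goes to an aligner.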

  • jkbonfield
    replied
    Originally posted by nilshomer
    BFAST does handle "N"s (actually [nN.] for Illumina data). If you could give me a link to the SRA # or the srf file or the fastq file I would be happy to debug to identify the problem.
    This wasn't my data so I haven't seen it; I simply had a patch offered to change the code to output dots instead.

    I believe it was ABI SOLID though, which uses dot, and in that context I think it's correct: given that 0123 aren't "normal" sequence characters, we're already defining a new character set, and so . for ambiguity seems fine.

    One thing I would recommend for a SRF2FASTQ program is to output paired end (mate-pair) reads to the same file. Programs like BFAST and Velvet expect that there is only one FASTQ file, with paired end (mate pair) reads occurring successively with the same name. This allows BFAST at least to support triple-end, quad-end, or higher grouping data, which we have generated (it exists!). Having one file per "end" or "mate" is not scalable to such grouping data.

    Nils
    This has already been added, so there is the option to output ends sequentially. In theory it should work for the triple and quad-ended scenario too as SRF supports such data using the "REGN" (region) list.

    More intriguing will be to see quite what aligner output we can produce for it, though, given that SAM currently only supports two ends.

  • lcollado
    replied
    Originally posted by nilshomer
    This allows BFAST at least to support triple-end, quad-end, or higher grouping data, which we have generated (it exists!).
    I had no idea that this type of data existed! WOW!!!

  • nilshomer
    replied
    Originally posted by jkbonfield
    I've had a report from a user of srf2fastq that bfast cannot read its output. Specifically in the fastq I produce I write out N instead of . for ambiguity code, as N is the original "unknown" symbol with dot being a recent Illumina invention (along with many other broken changes to fastq to further muddy the waters).

    So my questions here are:

    1) Is it correct that bfast cannot handle N and requires .? I haven't tested this myself.

    2) Should I "fix" srf2fastq to output . instead via a command line option?

    My own inclination to question 2 is simply to say no, fix bfast instead - we don't need to try and promote yet another format variant. However if the community feels it's needed then I'll put it in.

    Comments anyone?

    James
    BFAST does handle "N"s (actually [nN.] for Illumina data). If you could give me a link to the SRA # or the srf file or the fastq file I would be happy to debug to identify the problem.

    One thing I would recommend for a SRF2FASTQ program is to output paired end (mate-pair) reads to the same file. Programs like BFAST and Velvet expect that there is only one FASTQ file, with paired end (mate pair) reads occurring successively with the same name. This allows BFAST at least to support triple-end, quad-end, or higher grouping data, which we have generated (it exists!). Having one file per "end" or "mate" is not scalable to such grouping data.

    Nils
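The single-file layout described above can be read with a short sketch like the following. It assumes strict 4-line records and that all ends of a read carry exactly the same name (no /1 or /2 suffixes) on consecutive records. `read_fastq` and `group_ends` are hypothetical names, not functions from BFAST or Velvet.

```python
from itertools import groupby


def read_fastq(lines):
    """Yield (name, seq, qual) from lines in strict 4-line FASTQ layout."""
    it = (line.rstrip("\r\n") for line in lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        # Read name: header text after '@', up to the first whitespace.
        yield header[1:].split()[0], seq, qual


def group_ends(records):
    """Group consecutive records sharing a name into one multi-end read."""
    for name, group in groupby(records, key=lambda rec: rec[0]):
        yield name, [(seq, qual) for _, seq, qual in group]


if __name__ == "__main__":
    demo = [
        "@r1", "ACGT", "+", "IIII",
        "@r1", "TTTT", "+", "IIII",
        "@r2", "GGGG", "+", "IIII",
        "@r2", "CCCC", "+", "IIII",
        "@r2", "AAAA", "+", "IIII",
    ]
    for name, ends in group_ends(read_fastq(demo)):
        print(name, len(ends))
```

Because grouping is driven only by consecutive names, the same loop handles two, three, or four ends per read without any change, which is the scalability argument for a single interleaved file.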

  • jkbonfield
    started a topic Bfast, fastq and Ns

    Bfast, fastq and Ns

    I've had a report from a user of srf2fastq that bfast cannot read its output. Specifically in the fastq I produce I write out N instead of . for ambiguity code, as N is the original "unknown" symbol with dot being a recent Illumina invention (along with many other broken changes to fastq to further muddy the waters).

    So my questions here are:

    1) Is it correct that bfast cannot handle N and requires .? I haven't tested this myself.

    2) Should I "fix" srf2fastq to output . instead via a command line option?

    My own inclination to question 2 is simply to say no, fix bfast instead - we don't need to try and promote yet another format variant. However if the community feels it's needed then I'll put it in.

    Comments anyone?

    James
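For readers facing the same N-versus-dot mismatch between tools, one low-risk workaround is to rewrite the ambiguity symbol on the sequence line before handing the file on. The sketch below is only an illustration: `normalise_ambiguity` is a hypothetical helper, not an srf2fastq or bfast option, and the set of accepted symbols mirrors the [nN.] mentioned elsewhere in this thread.

```python
AMBIGUOUS = frozenset("nN.")


def normalise_ambiguity(seq, symbol="N"):
    """Replace every ambiguity character in a sequence line with `symbol`.

    Works for base space ("ACGT") and color space ("T0123") alike, since
    only the characters in AMBIGUOUS are touched.
    """
    return "".join(symbol if ch in AMBIGUOUS else ch for ch in seq)


if __name__ == "__main__":
    print(normalise_ambiguity("AC.GnT"))      # base space, N as the symbol
    print(normalise_ambiguity("T01N2", "."))  # color space, dot as the symbol
```

Only the sequence line of each record should be passed through such a function, because "." is also a valid quality character.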
