  • Bfast, fastq and Ns

    I've had a report from a user of srf2fastq that bfast cannot read its output. Specifically in the fastq I produce I write out N instead of . for ambiguity code, as N is the original "unknown" symbol with dot being a recent illumina invention (along with many other broken changes to fastq to further muddy the waters).

    So my questions here are:

    1) Is it correct that bfast cannot handle N and requires .? I haven't tested this myself.

    2) Should I "fix" srf2fastq to output . instead via a command line option?

    My own inclination to question 2 is simply to say no, fix bfast instead - we don't need to try and promote yet another format variant. However if the community feels it's needed then I'll put it in.

    Comments anyone?

    James
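For readers wondering what the option in question 2 would amount to, here is a minimal sketch. It is not srf2fastq code: the function name `normalize_ambiguity` and its `unknown` parameter are made up for illustration, and it assumes base-space reads stored as plain 4-line FASTQ records with no line wrapping.

```python
def normalize_ambiguity(fastq_lines, unknown="N"):
    """Yield FASTQ lines with '.', 'N' and 'n' in sequence lines set to `unknown`.

    Hypothetical helper: assumes 4-line records (header, sequence, '+', quality).
    """
    table = str.maketrans({".": unknown, "N": unknown, "n": unknown})
    for i, line in enumerate(fastq_lines):
        line = line.rstrip("\n")
        if i % 4 == 1:  # only the sequence line; quality lines may legitimately contain '.'
            line = line.translate(table)
        yield line
```

Calling it with `unknown="."` gives the dotted variant, so a single switch covers both conventions.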

  • #2
    Originally posted by jkbonfield
    I've had a report from a user of srf2fastq that bfast cannot read its output. Specifically in the fastq I produce I write out N instead of . for ambiguity code, as N is the original "unknown" symbol with dot being a recent illumina invention (along with many other broken changes to fastq to further muddy the waters).

    So my questions here are:

    1) Is it correct that bfast cannot handle N and requires .? I haven't tested this myself.

    2) Should I "fix" srf2fastq to output . instead via a command line option?

    My own inclination to question 2 is simply to say no, fix bfast instead - we don't need to try and promote yet another format variant. However if the community feels it's needed then I'll put it in.

    Comments anyone?

    James
    BFAST does handle "N"s (actually [nN.] for Illumina data). If you could give me a link to the SRA # or the srf file or the fastq file I would be happy to debug to identify the problem.

    One thing I would recommend for a SRF2FASTQ program is to output paired end (mate-pair) reads to the same file. Programs like BFAST and Velvet expect that there is only one FASTQ file, with paired end (mate pair) reads occurring successively with the same name. This allows BFAST at least to support triple-end, quad-end, or higher grouping data, which we have generated (it exists!). Having one file per "end" or "mate" is not scalable to such grouping data.

    Nils
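The single-file layout described above can be sketched roughly as follows. This is an illustration only, not BFAST or srf2fastq code: the names `read_records` and `interleave` are hypothetical, and it assumes 4-line FASTQ records whose headers are already identical across ends.

```python
def read_records(lines):
    """Group an iterable of lines into 4-line FASTQ records (no line wrapping assumed)."""
    it = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times pulls four consecutive lines per record;
    # a truncated trailing record is silently dropped.
    return zip(it, it, it, it)


def interleave(*ends):
    """Yield lines of one FASTQ stream with the ends of each read written successively.

    Accepts any number of ends (two, three, four, ...), one line iterable per end.
    Stops at the shortest input.
    """
    for group in zip(*map(read_records, ends)):
        names = {record[0] for record in group}
        if len(names) != 1:
            raise ValueError(f"read names differ across ends: {sorted(names)}")
        for record in group:
            yield from record
```

Because `interleave` takes a variable number of inputs, the same loop handles triple-end or quad-end data without a new file per end.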



    • #3
      Originally posted by nilshomer
      This allows BFAST at least to support triple-end, quad-end, or higher grouping data, which we have generated (it exists!).
      I had no idea that such type of data existed! WOW!!!
      L. Collado Torres, Ph.D. student in Biostatistics.



      • #4
        Originally posted by nilshomer
        BFAST does handle "N"s (actually [nN.] for Illumina data). If you could give me a link to the SRA # or the srf file or the fastq file I would be happy to debug to identify the problem.
        This wasn't my data so I haven't seen it; I simply had a patch offered to change the code to output dots instead.

        I believe it was ABI SOLID though, which uses dot, and so in that context I think it's correct: given that 0123 aren't "normal" sequence characters, we're already defining a new character set, and so . for ambiguity seems fine.

        One thing I would recommend for a SRF2FASTQ program is to output paired end (mate-pair) reads to the same file. Programs like BFAST and Velvet expect that there is only one FASTQ file, with paired end (mate pair) reads occurring successively with the same name. This allows BFAST at least to support triple-end, quad-end, or higher grouping data, which we have generated (it exists!). Having one file per "end" or "mate" is not scalable to such grouping data.

        Nils
        This has already been added, so there is the option to output ends sequentially. In theory it should work for the triple and quad-ended scenario too as SRF supports such data using the "REGN" (region) list.

        More intriguing will be to see quite what aligner output we can produce for it though given that SAM only supports two ends currently.



        • #5
          Is there an official document describing the csfastq format? The SOLID run outputs I have do not contain any of these files, but I believe it to be ABI's own format?

          The only documentation I can find consists of the ZOOM manual, which states (for example):

          Code:
          @SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
          T32322133300002330031001022230020232002203222030231 
          +SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
          !21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'
          This appears to be taken from the NCBI downloads directly at:

          ftp://ftp.ncbi.nlm.nih.gov/sra/stati...15241.fastq.gz

          However an old post on the ABYSS users mailing list by Nils gives this example:

          Code:
          @ucla153_20090610_1102N.41:796_1758_1693
          g23111112222312301131331111331023122031222222111120
          +
          :46<=985889::;<829462*3<464554-6403128+-+&-.'$$%.#
          @ucla153_20090610_1102N.62:1159_1411_238
          t32200300033221321101031000000332000002013110000000
          +
          89;669>?6<<;57.:+/#&+%$####$&#&&&#####'#&###$###%#
          So my question is - which is correct or are both formats in use? My instinct tells me that the former, with the fastq quality line being the same length as the sequence line, is the standard.



          • #6
            Originally posted by jkbonfield
            Is there an official document describing the csfastq format? The SOLID run outputs I have do not contain any of these files, but I believe it to be ABI's own format?

            The only documentation I can find consists of the ZOOM manual, which states (for example):

            Code:
            @SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
            T32322133300002330031001022230020232002203222030231 
            +SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50 
            !21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'
            This appears to be taken from the NCBI downloads directly at:

            ftp://ftp.ncbi.nlm.nih.gov/sra/stati...15241.fastq.gz

            However an old post on the ABYSS users mailing list by Nils gives this example:

            Code:
            @ucla153_20090610_1102N.41:796_1758_1693
            g23111112222312301131331111331023122031222222111120
            +
            :46<=985889::;<829462*3<464554-6403128+-+&-.'$$%.#
            @ucla153_20090610_1102N.62:1159_1411_238
            t32200300033221321101031000000332000002013110000000
            +
            89;669>?6<<;57.:+/#&+%$####$&#&&&#####'#&###$###%#
            So my question is - which is correct or are both formats in use? My instinct tells me that the former, with the fastq quality line being the same length as the sequence line, is the standard.
            BFAST requires that the number of qualities is one less than the length of the sequence line. For a 50bp read, there are only 50 observed colors (the adapter is never observed), so it makes sense to have 50 qualities while the sequence is length 51 (one adapter base plus 50 colors). This follows from viewing the original *csfasta and *_QV.qual files.

            Arguments can be made for both.

            Nils
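To make the two conventions concrete, here is a small sketch that tells them apart and converts one into the other. The function names and the layout labels are invented for this example, and the conversion rests on an assumption: that in the equal-length layout the first quality character is a placeholder for the leading adapter base (the NCBI example above starts with `!`, which is consistent with that but does not prove it).

```python
def quality_layout(seq, qual):
    """Classify a colour-space FASTQ record by how its quality string is laid out."""
    if len(qual) == len(seq):
        return "with-adapter"   # one quality per character, adapter base included
    if len(qual) == len(seq) - 1:
        return "colours-only"   # one quality per observed colour
    raise ValueError(f"unexpected lengths: seq={len(seq)}, qual={len(qual)}")


def to_colours_only(seq, qual):
    """Return the record with one quality per colour.

    Assumption: in the 'with-adapter' layout the first quality character
    belongs to the adapter base and can be dropped.
    """
    if quality_layout(seq, qual) == "with-adapter":
        return seq, qual[1:]
    return seq, qual
```

A tool that accepts either variant could run `to_colours_only` on input and then work with a single internal layout.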



            • #7
              Agreed arguments can be made for both, however a strong argument can be made that we pick ONE and stick with it. The world really doesn't need yet more fastq variants to deal with, yet apparently we already have two variants in the wild.

              Does anyone know if "csfastq" is ABI's own format name or just a name we have placed on their data when reformatted? If the former then I think we can just do whatever they do. If not then I'd advise following the lead of the main public data banks (as frankly, good luck trying to get them to change their output formats now).



              • #8
                Originally posted by jkbonfield
                Agreed arguments can be made for both, however a strong argument can be made that we pick ONE and stick with it. The world really doesn't need yet more fastq variants to deal with, yet apparently we already have two variants in the wild.

                Does anyone know if "csfastq" is ABI's own format name or just a name we have placed on their data when reformatted? If the former then I think we can just do whatever they do. If not then I'd advise following the lead of the main public data banks (as frankly, good luck trying to get them to change their output formats now).
                ABI is changing its output format to SAM/BAM, so it is not as bad as you think. I agree, let's standardize how to convert SRF to FASTQ for ABI SOLiD data (no need to call it csfastq).
