New illumina2srf available on sourceforge


  • New illumina2srf available on sourceforge

    I have uploaded a new source package for illumina2srf on sourceforge. This version can read the .cif format files that RTA produces. It will also (with the right options) handle samples with sequence bar codes rather more sensibly than previous versions were able to. N.B.: This version requires .qseq.txt files, so it will not work with pipeline versions earlier than 1.3.

    It can be downloaded from https://sourceforge.net/projects/sequenceread/files/.

    For those interested, here are the release notes:

    Sequenceread package v2.0.0


    This package includes illumina2srf which can be used to convert the contents of an Illumina GA-pipeline run folder to the more convenient SRF format. Illumina2srf was originally part of the Staden io_lib package and was later shipped with the Illumina analysis software. This version is a major revision of the software which can support sequence bar codes and more variants of the Illumina pipeline. Also included is a helper application called 'index_decoder' which can be used to decode sequence bar codes.

    The most notable changes to illumina2srf are:
    • Support for pipeline versions prior to 1.3 has been dropped.
    • Support has been added for the RTA (real-time analysis) versions of the Illumina software.
    • Illumina2srf now searches for the intensity files produced by all the different variants of the pipeline it supports automatically. It should therefore no longer be necessary to use the -I flag for IPAR projects (although it shouldn't do any harm).
    • When storing unprocessed data (using the -b or -r options) illumina2srf will no longer automatically include noise data (from *.nse.txt, *.nse.txt.p or *.cnf files). If you do want to include this data, you will need to use the -nse option.
    • Support has been added for samples that include sequence bar codes. Such samples will first need to be processed using the included index_decoder program to work out which bar code was attached to each sample and make a new set of .qseq.txt files. These can then be processed by illumina2srf. You will need to use the new -use_bases option to indicate which part of the read corresponds to the bar code tag. This will change the REGN tag so that it indicates which bases comprise the bar code and will also make illumina2srf attach the name of the decoded tag to the end of the read name after a '#' character. See the illumina2srf and index_decoder man pages for more details.
    • Some extra options have been added to give finer control over what data gets stored and which file variants are searched for.
    • There is a new test harness that does some basic checks to ensure that illumina2srf is storing the correct data.
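To illustrate the read naming described in the bar code item above, here is a minimal Python sketch of pulling the decoded tag back out of a read name. The function name `split_tag` and the example read names are hypothetical; the only thing taken from the release notes is that the tag name follows a '#' at the end of the read name.

```python
def split_tag(read_name):
    """Split a read name into (base_name, tag).

    Assumes the decoded tag name is everything after the last '#',
    as described in the release notes. The tag is None when the
    read name carries no '#'.
    """
    base, sep, tag = read_name.rpartition("#")
    if not sep:
        return read_name, None
    return base, tag


# Hypothetical read names, for illustration only.
print(split_tag("run42:3:1:100:200#5"))   # ('run42:3:1:100:200', '5')
print(split_tag("run42:3:1:100:200"))     # ('run42:3:1:100:200', None)
```

Splitting on the last '#' rather than the first keeps the sketch working if the base read name happens to contain a '#' of its own.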


    Unlike previous versions of illumina2srf, this package does not come with a version of io_lib. If you do not have it already, io_lib can be obtained by downloading it from https://sourceforge.net/projects/staden/files/ or you may be able to install a pre-packaged version from your operating system vendor. You will need io_lib version 1.12.1 or later. When you have installed io_lib, you can build and install this package in the standard GNU autoconf way by running 'configure' followed by 'make' and 'make install'. Please see the INSTALL file for more details.
    Rob.

  • #2
    I have made a new version (2.1.0) of illumina2srf, which is now available on sourceforge. Highlights include:
    • The -N/-n command line arguments are back, by popular demand. These allow you to change the read name format.
    • New options -pos/-no_pos have been added. Selecting -pos will make illumina2srf store spot positions as metadata on the BASE chunks.
    • Byte swapping code has been added to the .cif file reader for big-endian platforms. This means sparc and powerpc owners will now be able to build srf files from RTA runs correctly.
    • A new program called srf_split_by_tag has been included in the package. This is for use with data that has been tagged with sequence barcodes. It will split a single srf file into a set of files where each output file contains the reads for a single tag.
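The byte-swapping item can be illustrated with a short Python sketch. It assumes, as the note implies, that values in the file are stored little-endian, and it further assumes 16-bit signed integers. The function name `decode_le_int16` is hypothetical and this is not the illumina2srf code. Stating the byte order explicitly gives the same result on little-endian and big-endian hosts.

```python
import struct


def decode_le_int16(raw):
    """Decode a bytes object holding little-endian 16-bit signed integers.

    The '<' prefix fixes the byte order, so no host-dependent
    swapping is needed. Assumes len(raw) is even.
    """
    count = len(raw) // 2
    return list(struct.unpack("<%dh" % count, raw))


# Two values, 258 and -2, written least significant byte first.
print(decode_le_int16(b"\x02\x01\xfe\xff"))   # [258, -2]
```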


    The new version can be obtained from the project download page.

    Rob.



    • #3
      Trouble using srf_split_by_tag

      I tried using srf_split_by_tag for the first time today and have encountered an error. I get the following error:
      Code:
      srf_split_by_tag: srf_split_by_tag.c:118: outfiles_open: Assertion `added != 0' failed.
      Abort
      This occurs whether I use the defaults (i.e. no arguments other than the SRF file name) or explicitly specify "-u unindexed" and "-d myDir".

      index_decoder was first run on the qseq files, followed by illumina2srf. I can extract fastq files from this SRF and everything appears to have been created properly. The index_decoder and illumina2srf were from v 2.1.0 of the sequenceread package. I have tested v 2.1.0 and 2.1.1 of srf_split_by_tag with the same result. The srf_split_by_tag 2.1.0 as well as the illumina2srf used to build the SRF file were built against io_lib 1.12.1. The v 2.1.1 of srf_split_by_tag was built against v 1.12.4 of io_lib.

      Has anyone else experienced this problem using srf_split_by_tag? Any and all help appreciated.

      ---------------------------------
      Never mind, solution found.

      It turns out that the program did not like the '/' I had put in the read name. What tipped me off to this possibility was the following comment in the source code:

      Code:
      /* Crikey, someone put a / in a tag name.  We need to replace it. */
      Who says code comments aren't useful?
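A '/' is presumably a problem here because the tag name ends up in an output file name, where '/' acts as the directory separator. A minimal Python sketch of that kind of replacement follows. The function name `safe_tag` and the choice of '_' as the replacement character are assumptions, not what srf_split_by_tag actually does.

```python
import os


def safe_tag(tag, replacement="_"):
    """Return tag with path separators replaced so it can go in a file name."""
    for sep in {"/", os.sep}:
        tag = tag.replace(sep, replacement)
    return tag


print(safe_tag("sample/1"))   # sample_1
```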
      Last edited by kmcarr; 08-09-2010, 11:30 AM. Reason: Found source of problem.



      • #4
        I haven't used srf_split_by_tag so far, but I've realized today that srf2fastq messes up my sequence qualities, attempting to convert them from the stored scale to a mixed sanger/solexa one...

        d



        • #5
          By design and default srf2fastq outputs Phred style q-scores using the Sanger scale (Phred+33). Could you describe what you mean by "a mixed sanger/solexa one"?



          • #6
            Originally posted by kmcarr
            By design and default srf2fastq outputs Phred style q-scores using the Sanger scale (Phred+33). Could you describe what you mean by "a mixed sanger/solexa one"?
            Mmm... I've been too quick... can you confirm srf2fastq converts automagically from phred64 to phred33 (or from solexa to phred33)?

            d



            • #7
              Originally posted by dawe
              Mmm... I've been too quick... can you confirm srf2fastq converts automagically from phred64 to phred33 (or from solexa to phred33)?

              d
              Well, it's really a combination of illumina2srf and srf2fastq.

              When illumina2srf creates the SRF file from the *_qseq.txt files, it stores Phred-style q-scores as integers from 0-40; that is, it subtracts 64 from the ASCII value of each quality character in the *_qseq.txt files before storing it in the SRF.

              srf2fastq reads the integer q-score from the SRF file, offsets it from the character '!' (ASCII = 33), and prints the corresponding ASCII character.

              The combination of these two transformations creates the appearance of magic.
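The two transformations described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the actual illumina2srf or srf2fastq code; the function names and the example quality string are made up.

```python
def qseq_to_stored(qual_string):
    """Mimic the first step: ASCII (Phred+64) characters -> integer scores."""
    return [ord(ch) - 64 for ch in qual_string]


def stored_to_sanger(scores):
    """Mimic the second step: integer scores -> ASCII (Phred+33) characters."""
    return "".join(chr(q + 33) for q in scores)


qseq_quals = "hhdB"                      # Phred+64 encoding of 40, 40, 36, 2
stored = qseq_to_stored(qseq_quals)
print(stored)                            # [40, 40, 36, 2]
print(stored_to_sanger(stored))          # IIE#
```

Because the stored value is a plain integer, the output offset is independent of the input offset, which is why the conversion looks automatic from the outside.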



              • #8
                Originally posted by kmcarr
                Well, it's really a combination of illumina2srf and srf2fastq.

                When illumina2srf creates the SRF file from the *_qseq.txt files, it stores Phred-style q-scores as integers from 0-40; that is, it subtracts 64 from the ASCII value of each quality character in the *_qseq.txt files before storing it in the SRF.

                srf2fastq reads the integer q-score from the SRF file, offsets it from the character '!' (ASCII = 33), and prints the corresponding ASCII character.
                Got it! I've missed the first step, I didn't know illumina2srf stores phred values (and not q-values).
                Thanks.
                d



                • #9
                  Illumina certainly have a lot to answer for with the myriad of quality encodings. The sole reason they had +64 was because they were using log-odds (which isn't a bad idea by any means - I rather liked them) and so could get negative values.

                  Switching to Phred was I guess a business decision to go with the flow, but using phred scale +64 was a total disaster!

                  For what it's worth, SRF can store either phred or log-odds encodings, but internally it doesn't store these as ASCII. Instead it stores the data in a binary form representing the actual value. This may come from ASCII phred-33, phred-64 or logodds-64 depending on the input. It has a meta-data field to indicate the scale (phred vs logodds).

                  These days though they seem to be generating purely phred, even for secondary scores which then end up all the same as phred can't cope with that... should have stuck with logodds. Arggh
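For readers unfamiliar with the two scales mentioned above, here is a small Python sketch of the standard relationship between them, assuming the usual definitions: phred is -10*log10(p) and log-odds is -10*log10(p/(1-p)), where p is the error probability. The function names are hypothetical and nothing here is taken from the SRF code.

```python
import math


def logodds_to_phred(q_logodds):
    """Convert a log-odds (old Solexa style) score to the Phred scale."""
    return 10.0 * math.log10(10.0 ** (q_logodds / 10.0) + 1.0)


def phred_to_logodds(q_phred):
    """Inverse conversion; only defined for q_phred > 0."""
    return 10.0 * math.log10(10.0 ** (q_phred / 10.0) - 1.0)


# Log-odds scores can be negative, Phred scores cannot.
for q in (-5, 0, 10, 40):
    print(q, round(logodds_to_phred(q), 2))
```

The two scales nearly coincide for high scores and diverge for low ones, where the log-odds value drops below zero while the Phred value approaches zero.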



                  • #10
                    Originally posted by kmcarr
                    I tried using srf_split_by_tag for the first time today and have encountered an error. I get the following error:
                    Code:
                    srf_split_by_tag: srf_split_by_tag.c:118: outfiles_open: Assertion `added != 0' failed.
                    Abort
                    This was indeed a bug, caused by the use of the wrong variable name in a loop. It has now been fixed, and I have uploaded a new version (2.1.2) of the package which includes the correction.

                    I have also added a couple of new options to srf_split_by_tag. The -s option can be used to change the separator in the output file names, so:

                    Code:
                    srf_split_by_tag 2956_3.srf
                    will produce output files named:

                    Code:
                    2956_3_1.srf
                    2956_3_2.srf
                    ...etc.
                    assuming that the tags were imaginatively named 1,2,3 and so on, whereas:

                    Code:
                    srf_split_by_tag -s '#' 2956_3.srf
                    will produce files named:

                    Code:
                    2956_3#1.srf
                    2956_3#2.srf
                    ...etc.
                    The -e option takes a comma separated list of tag names. If it is present, then only tags which appear in the list will be split into their own files. Any others will be treated as if they are unindexed.

                    This can be useful in the case where some of the tags in the list passed to index_decoder were not actually used in the sequencing experiment. When this happens, you often find that a small number of reads match the unused tags due to random base calling errors happening to match the tag sequence. Normally srf_split_by_tag would put these reads in their own files. By using the -e option, they can instead be put in with all the other tags that can't be decoded. For example, if 2956_3.srf contained tags 1 to 6, but only 1 to 3 were for real samples,

                    Code:
                    srf_split_by_tag -u 0 -e 1,2,3 2956_3.srf
                    will produce the following files:

                    Code:
                    2956_3_0.srf
                    2956_3_1.srf
                    2956_3_2.srf
                    2956_3_3.srf
                    2956_3_1.srf, 2956_3_2.srf and 2956_3_3.srf will contain the reads for tags 1, 2 and 3 respectively. 2956_3_0.srf will contain all of the reads where the tag could not be decoded along with any that matched the unwanted tags 4, 5 and 6.
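The naming rules described above can be summarised with a small Python sketch. It is a guess at the behaviour based only on the examples in this post, not the srf_split_by_tag source. The function name is hypothetical, `separator` stands in for -s, `expected` for the -e list and `unindexed` for the -u name; the default of "0" for `unindexed` simply mirrors the `-u 0` example and is not claimed to be the program's real default.

```python
import os


def split_output_name(srf_path, tag, separator="_", expected=None, unindexed="0"):
    """Guess the output file name for the reads carrying one tag.

    tag is None for reads whose tag could not be decoded. When an
    expected list is given, tags outside it are treated as unindexed.
    """
    stem, ext = os.path.splitext(srf_path)
    if tag is None or (expected is not None and tag not in expected):
        tag = unindexed
    return "%s%s%s%s" % (stem, separator, tag, ext)


print(split_output_name("2956_3.srf", "1"))                   # 2956_3_1.srf
print(split_output_name("2956_3.srf", "2", separator="#"))    # 2956_3#2.srf
print(split_output_name("2956_3.srf", "5", expected={"1", "2", "3"}))  # 2956_3_0.srf
```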

                    As usual, the new version can be obtained from the sourceforge downloads area.



                    • #11
                      This program is a lifesaver.
                      However, I get this error repeatedly when I try to extract .srf files generated by your program:
                      Zero or greater than one CNF chunks found.
                      Another error occurs repeatedly when using Illumina's GA 1.5.1 srf2illumina:
                      WARNING: can't find expected pos information in read
                      Any idea what causes this?
                      Also, which program is best to extract these v2 .srf files? Illumina's srf2illumina? or the staden io_lib one? or what?

                      I found this: http://seqanswers.com/forums/showthread.php?t=4101
                      Which says use the "-c" command, which none of my srf2illumina executables possess.

                      Any thoughts?
                      Last edited by Awesome; 12-20-2010, 03:57 PM.



                      • #12
                        The -c option would be for srf2fastq in the io_lib package. In fact, the latest version of srf2fastq can work out which quality values are present by itself, so it should work correctly without the -c now.

                        Unfortunately srf2illumina has been unsupported for a while now, so it has fallen a bit behind the times. It should be possible to tweak the Illumina version so that it works with the newer files, but I expect the output would be for a very old version of the Illumina pipeline. Fixing that would take much more effort.

                        What do you want to use srf2illumina for? If you just need fastq files, then srf2fastq is a much better way to go.



                        • #13
                          I need to be able to store and retrieve base calls, quality scores, intensities, and noises. That is why I'm concerned with srf2illumina.



                          • #14
                            You can get the intensity and noise values out using srf_dump_all, for example:

                            srf_dump_all -c int -t solexa myfile.srf

                            will dump out all of the intensity data. The format is from an ancient version of the Illumina pipeline, i.e. lane, tile, x, y (derived from the read name) followed by the intensity data in groups of four numbers (for A, C, G and T). The groups are separated by tabs. You can change the -c parameter to get different data types (nse for noise, sig2 for processed intensities), if they are present.

                            It isn't an ideal solution, but it does give a fairly easy way of getting at the data.
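As a rough illustration of reading that output back in, here is a Python sketch of a parser for one line. It assumes that lane, tile, x and y are integers, that every field (including the first four) is tab-separated, and that the four numbers inside a group are separated by whitespace. Those details, the function name and the example line are assumptions, so check them against real srf_dump_all output before relying on this.

```python
def parse_intensity_line(line):
    """Parse one line into (lane, tile, x, y, groups).

    groups is a list of (A, C, G, T) tuples, one tuple per group.
    """
    fields = line.rstrip("\n").split("\t")
    lane, tile, x, y = (int(v) for v in fields[:4])
    groups = []
    for field in fields[4:]:
        values = tuple(float(v) for v in field.split())
        if len(values) != 4:
            raise ValueError("expected 4 intensities, got %r" % field)
        groups.append(values)
    return lane, tile, x, y, groups


# Made-up line with two groups of intensities.
example = "3\t1\t105\t2047\t12.0 -3.5 0.0 88.1\t7.2 91.4 -1.0 4.4\n"
print(parse_intensity_line(example))
```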



                            • #15
                              Running illumina2srf after removing cycle(s)

                              Hello-

                              I have a question regarding the use of the script illumina2srf. We recently had a HiSeq run in which the first cycle did not contain any data (clogged fluidics?). Illumina technical support advised us that we could improve the overall quality of our data for the lane in question by removing the first cycle. This involved removing the data folder in <run folder>/Data/Intensities/<lane>/C1.1, renaming the folders for all of the subsequent cycles, editing the config.xml in the Intensities folder to reflect the changes, and then repeating the entire procedure for the control lane as well. Following these steps we were able to generate fastq files, but when we attempt to run illumina2srf to generate our srf files we encounter an error indicating that cycle 1 is missing from our renumbered tiles:

                              /house/sdm/prod/illumina/staging/hiseq05/110224_HISEQ05_0066_B816YKABXX_1606/Data/Intensities/Bustard1.8.0_25-04-2011_sdm/../../../Config/FlowCellId.xml:
                              No such file or directory
                              Processing sequence files
                              /house/sdm/prod/illumina/staging/hiseq05/110224_HISEQ05_0066_B816YKABXX_1606/Data/Intensities/Bustard1.8.0_25-04-2011_sdm/s_3_1_0001_qseq.txt
                              /house/sdm/prod/illumina/staging/hiseq05/110224_HISEQ05_0066_B816YKABXX_1606/Data/Intensities/Bustard1.8.0_25-04-2011_sdm/s_3_2_0001_qseq.txt
                              Error: Missing cycle 1 for lane 3 tile 1 from CIF files.

                              I don't know how illumina2srf knows about cycles - perhaps they are encoded in the cif files? Is there a way that we can (easily) fool illumina2srf and force it to process the lane in a similar way to how we generated our fastqs?

                              Thanks in advance!
