
SFF Read names


  • SFF Read names

    Does anyone here know how read names are assigned to reads in the SFF output from a 454 sequencing run? I have multiple reads with the same read name but almost (!) identical nucleotide sequences. Has anyone seen something like this, or does anyone know how the read names are assigned?

  • #2
    Do you mean IDs that look like this?


    454 calls those unique accession numbers (uaccno). The first seven characters encode the start time of the run, the next two digits represent the region of the picotiter plate that contained the reads, and the last five characters encode the X and Y coordinates of the read. I forget the exact encoding scheme, but I think it's some sort of 16 bit encoding of the epoch time and x-y positions.

    These IDs are supposed to be universally unique, so you should not have multiple reads with the same ID. If you do, it most likely means that someone has altered the names.


    • #3
      I mean exactly these IDs. And if you're correct, I should really start worrying about my non-unique IDs... Thanks a lot for the information!


      • #4
        Could someone have processed the original sff file in different ways (changed filters, trim points etc.), with the resultant files later being merged together?

        You could have a look at the manifest with sffinfo -m <filename> and see if there are any duplications.


        • #5
          For those interested in all the gory details of what the Universal Accession Number means, I stumbled across the description in the Roche documentation "SW-Manual_Overview-FileFormats_Oct2009":

          2.3.7 454 “Universal” Accession Numbers
          The standard 454 read identifiers, used in Genome Sequencer FLX System data analysis software versions prior to 1.0.52 (early GS 20 System), have the format “rank_x_y” (as in 003048_1034_0651), where “rank” is a ranking of the well in a region by signal intensity, and “x” and “y” are the pixel location of the well’s center on the sequencing Run images. This identifier is guaranteed to be unique only within the context of a single sequencing Run, and may or may not be unique across specific sets of Runs.

          To allow for the combination of reads across larger data sets, a more unique accession number format has been developed. An accession in this format is a 14 character string, as in C3U5GWL01CBXT2, and consists of 4 components:
          C3U5GW - a six character encoding of the timestamp of the Run
          L - a randomizing “hash” character to enhance uniqueness
          01 - the region the read came from, as a two-digit number
          CBXT2 - a five character encoding of the X,Y location of the well
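          The four components listed above can be pulled apart with a small pattern match. This is only a sketch in Python; the function name split_accession is made up for illustration and is not part of any Roche tool.

```python
import re

# Layout from the quoted manual: 6-char timestamp, 1 hash character,
# 2-digit region, 5-char X,Y location. Accessions are case-insensitive,
# so the input is upper-cased before matching.
ACCESSION_RE = re.compile(r"^([A-Z0-9]{6})([A-Z0-9])(\d{2})([A-Z0-9]{5})$")

def split_accession(accession):
    """Return (timestamp, hash_char, region, xy) for a 14-character accession.

    Hypothetical helper name; raises ValueError if the layout does not match.
    """
    match = ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        raise ValueError("not a 14-character 454 accession: %r" % accession)
    return match.groups()

print(split_accession("C3U5GWL01CBXT2"))
```

          For the manual's own example this yields the same four pieces as the list above.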

          The timestamp, hash character and X,Y location use a base-36 encoding (where values 0-25 are the letters ‘A’-‘Z’ and the values 26-35 are the digits ‘0’-‘9’). An accession thus consists only of letters and digits, and is case-insensitive.
          • The timestamp is encoded by computing a “total” value as shown below, then converting it into a base-36 string:
          total =
          (year - 2000) * 13 * 32 * 24 * 60 * 60 +
          month * 32 * 24 * 60 * 60 +
          day * 24 * 60 * 60 +
          hour * 60 * 60 +
          minute * 60 +
          second
          As a result of this calculation, the first character of read accessions will always be a letter for Runs performed from now until 2038. The timestamp values are taken from the rigRunName found in the analysisParms.parse file in the specified analysis directory.

          This rigRunName is the R_... name that is generated by the instrument software, and is also used as the standard directory name for the Run. Thus, a Run whose name begins with R_2004_09_22_16_59_10_... generates C3U5GW as its encoded timestamp value.
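          As a check on the timestamp arithmetic, here is a rough Python sketch of the encoding and its inverse. It assumes the final term of the total is the seconds value (the example Run name ends in ..._16_59_10, which includes seconds); the function names are my own.

```python
import string

# Alphabet from the manual: values 0-25 are 'A'-'Z', 26-35 are '0'-'9'.
ALPHABET = string.ascii_uppercase + string.digits

def to_base36(value, width):
    """Encode a non-negative integer as a fixed-width string in the 454 alphabet."""
    chars = []
    for _ in range(width):
        value, rem = divmod(value, 36)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def from_base36(text):
    """Decode a string in the 454 alphabet back to an integer."""
    value = 0
    for ch in text.upper():
        value = value * 36 + ALPHABET.index(ch)
    return value

def encode_timestamp(year, month, day, hour, minute, second):
    """Six-character timestamp code; the seconds term is an assumption."""
    total = ((year - 2000) * 13 * 32 * 24 * 60 * 60
             + month * 32 * 24 * 60 * 60
             + day * 24 * 60 * 60
             + hour * 60 * 60
             + minute * 60
             + second)
    return to_base36(total, 6)

def decode_timestamp(code):
    """Invert encode_timestamp by peeling off each field in turn."""
    total = from_base36(code)
    total, second = divmod(total, 60)
    total, minute = divmod(total, 60)
    total, hour = divmod(total, 24)
    total, day = divmod(total, 32)
    year_offset, month = divmod(total, 13)
    return (2000 + year_offset, month, day, hour, minute, second)

print(encode_timestamp(2004, 9, 22, 16, 59, 10))  # C3U5GW
print(decode_timestamp("C3U5GW"))
```

          The Run name R_2004_09_22_16_59_10_... from the manual does come out as C3U5GW under this reading, which supports the assumption about the seconds term.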

          • Since two Runs may be started at the same second, an additional base-36 character is generated by hashing the full rigRunName to a base-31 number (the highest prime below 36), as in:

           chval = 0;
           for (s=rigRunName; *s; s++) {
            chval += (int) *s;
            chval %= 31;
           }
           ch = (chval < 26 ? 'A' + chval : '0' + chval - 26);
          • The X,Y location is encoded by computing a total value of “X * 4096 + Y” and encoding that as a five character, base-36 string.
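          The hash character and the X,Y encoding can be sketched in Python in the same way. This assumes the rigRunName is plain ASCII, so that ord() matches the C cast, and that Y is always below 4096; the function names are hypothetical.

```python
import string

# Alphabet from the manual: values 0-25 are 'A'-'Z', 26-35 are '0'-'9'.
ALPHABET = string.ascii_uppercase + string.digits

def hash_char(rig_run_name):
    """Running sum of character codes modulo 31, mapped into the alphabet."""
    chval = 0
    for ch in rig_run_name:
        chval = (chval + ord(ch)) % 31
    return ALPHABET[chval]

def encode_xy(x, y):
    """Five-character code for X * 4096 + Y."""
    value = x * 4096 + y
    chars = []
    for _ in range(5):
        value, rem = divmod(value, 36)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def decode_xy(code):
    """Recover (X, Y) from the last five characters of an accession."""
    value = 0
    for ch in code.upper():
        value = value * 36 + ALPHABET.index(ch)
    return divmod(value, 4096)

print(decode_xy("CBXT2"))
```

          Because the hash value is taken modulo 31, the hash character can only ever be one of 'A'-'Z' or '0'-'4'.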


          • #6
            Thanks to all of you for your answers, information and suggestions. I have now discussed this with the bioinformatician who sent me the sequences, and it turned out that the problem was with the DNA barcodes for the different samples. Mismatches were allowed in these barcodes, which in a few instances led to the same accession number being coupled to more than one sequence. As the library was sent over as one file per barcode, the IDs looked unique until all sequences from the run were compared to each other and the problem appeared. The problem was solved by not allowing mismatches in the barcodes.
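            For anyone who wants to check for this situation, a minimal Python sketch is to collect the read IDs per barcode file and report any ID that appears in more than one file. The file names and IDs below are invented for illustration, and how the IDs are read from each file is left out.

```python
from collections import defaultdict

def find_shared_ids(ids_by_file):
    """Map each read ID found in more than one file to the sorted list of those files.

    ids_by_file is a dict of {file name: iterable of read IDs}; both the
    function name and the input layout are assumptions for this sketch.
    """
    files_for_id = defaultdict(set)
    for filename, read_ids in ids_by_file.items():
        for read_id in read_ids:
            files_for_id[read_id].add(filename)
    return {
        read_id: sorted(files)
        for read_id, files in files_for_id.items()
        if len(files) > 1
    }

# Hypothetical per-barcode ID lists.
example = {
    "barcode_01.fna": ["C3U5GWL01CBXT2", "C3U5GWL01AAAAA"],
    "barcode_02.fna": ["C3U5GWL01CBXT2", "C3U5GWL01BBBBB"],
}
print(find_shared_ids(example))
```

            An empty result means every accession was assigned to exactly one barcode file.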


            • #7
              Does anyone know how to extract the X/Y coordinates from the name? Somehow sffinfo does this ...



              • #8
                Originally posted by maasha
                Does anyone know how to extract the X/Y coordinates from the name? Somehow sffinfo does this ...

                On this page, the first script listed, 454_base36, will do what you want.


                • #9
                  Biopython 1.60 will include this too. Thanks kmcarr for that very informative post, and Jeff Hussmann who wrote the code.