Does anyone here know how read names are assigned to reads in the SFF output from a 454 sequencing run? I have multiple reads with the same read name and almost (!) identical nucleotide sequences. Has anyone seen something like this, or does anyone know how the read names are assigned?
-
Do you mean IDs that look like this?
EBO6PME01EGNVK
454 calls those unique accession numbers (uaccno). The first seven characters encode the start time of the run, the next two digits represent the region of the picotiter plate that contained the read, and the last five characters encode the X and Y coordinates of the read. I forget the exact encoding scheme, but I think it's some sort of 16-bit encoding of the epoch time and X-Y positions.
These IDs are supposed to be universally unique, so you should not have multiple reads with the same ID. If you do, it most likely means that someone has altered the names.
-
Could someone have processed the original sff file in different ways (changed filters, trim points etc.), with the resultant files later being merged together?
You could have a look at the manifest with sffinfo -m <filename> and see if there are any duplications.
-
For those interested in all the gory details of what the Universal Accession Number means, I stumbled across the description in the Roche documentation "SW-Manual_Overview-FileFormats_Oct2009":
2.3.7 454 “Universal” Accession Numbers
The standard 454 read identifiers, used in Genome Sequencer FLX System data analysis software versions prior to 1.0.52 (early GS 20 System), have the format “rank_x_y” (as in 003048_1034_0651), where “rank” is a ranking of the well in a region by signal intensity, and “x” and “y” are the pixel location of the well’s center on the sequencing Run images. This identifier is guaranteed to be unique only within the context of a single sequencing Run, and may or may not be unique across specific sets of Runs.
To allow for the combination of reads across larger data sets, a more unique accession number format has been developed. An accession in this format is a 14-character string, as in C3U5GWL01CBXT2, and consists of 4 components:
C3U5GW - a six character encoding of the timestamp of the Run
L - a randomizing “hash” character to enhance uniqueness
01 - the region the read came from, as a two-digit number
CBXT2 - a five character encoding of the X,Y location of the well
The timestamp, hash character and X,Y location use a base-36 encoding (where values 0-25 are the letters ‘A’-‘Z’ and the values 26-35 are the digits ‘0’-‘9’). An accession thus consists only of letters and digits, and is case-insensitive.
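To make the layout above concrete, here is a minimal Python sketch that splits an accession into its four components and decodes a base-36 field using the alphabet described in the manual. The function names `split_accession` and `base36_decode` are made up for illustration; they are not part of any Roche tool.

```python
import re

# Alphabet from the quoted text: values 0-25 are 'A'-'Z', values 26-35 are '0'-'9'.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

# 6 timestamp chars, 1 hash char, 2 region digits, 5 X,Y chars = 14 characters.
ACCESSION_RE = re.compile(r"([A-Z0-9]{6})([A-Z0-9])([0-9]{2})([A-Z0-9]{5})")


def base36_decode(text):
    """Decode a string written in the 454 base-36 alphabet to an integer."""
    value = 0
    for ch in text.upper():
        value = value * 36 + ALPHABET.index(ch)
    return value


def split_accession(accession):
    """Split a 14-character accession into its four components."""
    # Accessions are case-insensitive, so normalise to upper case first.
    match = ACCESSION_RE.fullmatch(accession.upper())
    if match is None:
        raise ValueError(f"not a 14-character 454 accession: {accession!r}")
    timestamp, hash_char, region, xy = match.groups()
    return {
        "timestamp": timestamp,
        "hash": hash_char,
        "region": int(region),
        "xy": xy,
    }


if __name__ == "__main__":
    print(split_accession("C3U5GWL01CBXT2"))
```

Note that the digit characters sort after the letters in this alphabet, so Python's built-in `int(text, 36)` would give a different result.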
• The timestamp is encoded by computing a “total” value as shown below, then converting it into a base-36 string:
total = (year - 2000) * 13 * 32 * 24 * 60 * 60 +
        month * 32 * 24 * 60 * 60 +
        day * 24 * 60 * 60 +
        hour * 60 * 60 +
        minute * 60 +
        second;
As a result of this calculation, the first character of read accessions will always be a letter for Runs performed from now until 2038. The timestamp values are taken from the rigRunName found in the analysisParms.parse file in the specified analysis directory.
This rigRunName is the R_... name that is generated by the instrument software, and is also used as the standard directory name for the Run. Thus, a Run whose name begins with R_2004_09_22_16_59_10_... generates C3U5GW as its encoded timestamp value.
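The timestamp calculation can be checked against the manual's own example. The following is a sketch in Python, not Roche's code; the names `encode_timestamp` and `decode_timestamp` are my own, and the decoder simply reverses the mixed-radix arithmetic of the formula.

```python
# Alphabet from the quoted text: values 0-25 are 'A'-'Z', values 26-35 are '0'-'9'.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def encode_timestamp(year, month, day, hour, minute, second):
    """Encode a run start time as the six-character accession prefix."""
    total = ((year - 2000) * 13 * 32 * 24 * 60 * 60
             + month * 32 * 24 * 60 * 60
             + day * 24 * 60 * 60
             + hour * 60 * 60
             + minute * 60
             + second)
    chars = []
    for _ in range(6):
        total, digit = divmod(total, 36)
        chars.append(ALPHABET[digit])
    if total:
        raise ValueError("timestamp does not fit in six base-36 characters")
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(chars))


def decode_timestamp(code):
    """Recover (year, month, day, hour, minute, second) from the prefix."""
    total = 0
    for ch in code.upper():
        total = total * 36 + ALPHABET.index(ch)
    total, second = divmod(total, 60)
    total, minute = divmod(total, 60)
    total, hour = divmod(total, 24)
    total, day = divmod(total, 32)
    years_since_2000, month = divmod(total, 13)
    return (2000 + years_since_2000, month, day, hour, minute, second)


if __name__ == "__main__":
    # Run name R_2004_09_22_16_59_10_... from the manual's example.
    print(encode_timestamp(2004, 9, 22, 16, 59, 10))  # C3U5GW
    print(decode_timestamp("C3U5GW"))
```

Decoding the first six characters of a read name this way tells you when the run was started, which is handy when tracking down which run a stray read came from.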
• Since two Runs may be started at the same second, an additional base-36 character is generated by hashing the full rigRunName to a base-31 number (the highest prime below 36), as in:
Code:
chval = 0;
for (s = rigRunName; *s; s++) {
    chval += (int) *s;
    chval %= 31;
}
ch = (chval < 26 ? 'A' + chval : '0' + chval - 26);
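For anyone who prefers to experiment in a scripting language, the C snippet above translates to Python roughly as follows. This is a sketch that assumes the rigRunName contains only ASCII characters; the function name `hash_char` is made up for illustration. The test strings below are arbitrary, since the full rigRunName of the manual's example is not given.

```python
def hash_char(rig_run_name):
    """Return the single hash character for a run name.

    Sums the character codes modulo 31, then maps 0-25 to 'A'-'Z'
    and 26-30 to '0'-'4', as in the C snippet from the manual.
    """
    chval = 0
    for ch in rig_run_name:
        chval = (chval + ord(ch)) % 31
    if chval < 26:
        return chr(ord("A") + chval)
    return chr(ord("0") + chval - 26)


if __name__ == "__main__":
    # Arbitrary input strings, not real run names.
    print(hash_char("A"))  # ord('A') = 65, 65 % 31 = 3, so 'D'
    print(hash_char("9"))  # ord('9') = 57, 57 % 31 = 26, so '0'
```

Because the sum is taken modulo 31, the hash character can only be one of A-Z or 0-4.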
-
Thanks to all of you for your answers, information and suggestions. I have now discussed this with the bioinformatician who sent me the sequences, and it turned out that the problem was with the DNA barcodes for the different samples. Mismatches were allowed in these barcodes, which in a few instances led to the same accession number being coupled to more than one sequence. As the library was sent over as one file per barcode, the IDs looked unique until all sequences from the run were compared to each other and the problem appeared. The problem was solved by not allowing mismatches in the barcodes.
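For anyone who hits the same problem, a quick way to spot it is to look for read IDs that occur in more than one per-barcode file. Here is a minimal Python sketch, assuming the IDs have already been pulled out of each file into a list; the function name `shared_read_ids` and the file names are hypothetical.

```python
from collections import defaultdict


def shared_read_ids(ids_by_file):
    """Return {read_id: [files]} for IDs that appear in more than one file.

    ids_by_file maps a file name to an iterable of the read IDs it contains.
    """
    files_by_id = defaultdict(set)
    for filename, read_ids in ids_by_file.items():
        for read_id in read_ids:
            files_by_id[read_id].add(filename)
    return {
        read_id: sorted(files)
        for read_id, files in files_by_id.items()
        if len(files) > 1
    }


if __name__ == "__main__":
    # Hypothetical per-barcode files; the IDs are the examples from this thread.
    example = {
        "barcode1.fna": ["EBO6PME01EGNVK", "C3U5GWL01CBXT2"],
        "barcode2.fna": ["EBO6PME01EGNVK"],
    }
    print(shared_read_ids(example))
```

Any ID reported here was assigned to more than one barcode, which is what happens when mismatches are tolerated during demultiplexing.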
-
Originally posted by maasha: Does anyone know how you can extract the X/Y coordinates from the name? Somehow sffinfo does this ...
M
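A possible way to do this, as a sketch only: decode the last five characters with the base-36 alphabet from the manual, then split the resulting number into X and Y. The manual text quoted in this thread does not say how X and Y are packed into that number, so the packing `X * 4096 + Y` used below is an assumption that has not been confirmed here. The function name `decode_xy` and the `row_width` parameter are made up for illustration.

```python
# Alphabet from the quoted manual text: 'A'-'Z' are 0-25, '0'-'9' are 26-35.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def decode_xy(accession, row_width=4096):
    """Return (x, y) decoded from the last five characters of an accession.

    ASSUMPTION: the five characters encode x * row_width + y with
    row_width = 4096. This packing is not stated in the quoted manual text.
    """
    number = 0
    for ch in accession.upper()[-5:]:
        number = number * 36 + ALPHABET.index(ch)
    return divmod(number, row_width)


if __name__ == "__main__":
    print(decode_xy("C3U5GWL01CBXT2"))
```

If the coordinates this gives do not match what sffinfo reports for the same read, the packing assumption is wrong for your data.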