454 re-scoring

CPCantalapiedra

Member

Join Date: Sep 2011

Posts: 38
#1

03-20-2012, 09:55 AM

Hi there!

I am a little confused about the re-scoring option of sfffile. The manual for v2.6 of the Data Analysis package (2011) says:

sfffile -r : This option re-generates the phred-based quality scores for each of the input reads using the current quality scoring table, and overwrites the existing quality scores with these new quality scores in the output file.

But Section 6.6 of the manual for the Data Processing software, v2.3 (2009), states (not a verbatim quote, sorry, this is from my own notes):

For GS 20, GS FLX and GS FLX Titanium, different training sets were used to build the lookup tables, since they show slightly different error tendencies.

So my question is: how do I know when to use this sfffile option? How many different scoring tables exist? For which chemistries should I use this option, and can I use it with old .sff files from the NCBI SRA archive?

This is SRR000001.sra processed with the SRA Toolkit's fastq-dump:

@SRR000001.1 EM7LVYS01C1LWG length=255
TCAGGGGGGAGCTTAAATTTGAAACTAGAAAAATTTTGAACAAAATAATCATAATTGTTAGCTGATGAAAAACTAGAAAAGATTTTCTGAGTGTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAACGGTATCCCGTAGTGTGCATTCATCCCTGCTCTGGATACAGTCAGCTCCCAAATTCCATAAACAACTCCTTTGTAAGTAACCTCCTTTTGACAGGGGGTACTGAGCGGGCTGGCAAGGCN
+SRR000001.1 EM7LVYS01C1LWG length=255
=;8GC91*#==<C=EA.EA/<B=(<<:=HC90'FB5&;B:<GC6(=D=<<==C=C==B<=<<<=;<<GC8.#<<9=FB4%<8EA4%87:<<8=B;C<@8>5=C?*A<&A<&<=49/2A='@;#A<&<A9C=@9B::B:<;=C?+<<;<===<=;C<==<FB0=<=<<<D=9=;;=<=<=<;=FB2FB2C<C<;=FB0<C==;C<D@-<=B:<=C=C;<C=GD7*=;:=HD90'==<<=<=:FB0<<C<;C=C=<!

And this is the same read, after "sfffile -r ... ; sff2fastq ... ":

@EM7LVYS01C1LWG
TCAGGGGGGAGCTTAAATTTGAAACTAGAAAAATTTTGAACAAAATAATCATAATTGTTAGCTGATGAAAAACTAGAAAAGATTTTCTGAGTGTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAACGGTATCCCGTAGTGTGCATTCATCCCTGCTCTGGATACAGTCAGCTCCCAAATTCCATAAACAACTCCTTTGTAAGTAACCTCCTTTTGACAGGGGGTA
+EM7LVYS01C1LWG
FFFDEFGGGFFFEEEEFFFFD;;;FFFE55555BBCCEEFHFFFGIHGIFFFFFFFEEEEFFFFFFD77777FFCC1111CA7777@AEFFFFFFFFDDAAC?33444444=??7774444444443?FAAEEEEFFFFEEDDDFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEAAA===EEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEEEBBBEE@@@@AEE
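To compare the two quality strings numerically rather than by eye, they can be decoded into integer scores. The following is only a minimal sketch, assuming both strings really are Phred+33 encoded; the function names are made up for illustration and are not part of sfffile, sff2fastq or the SRA Toolkit.

```python
def phred33_to_scores(quality: str) -> list[int]:
    """Convert a Phred+33 FASTQ quality string to integer quality scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(q: int) -> float:
    """Standard Phred relation Q = -10 * log10(P), solved for P."""
    return 10 ** (-q / 10)


def summarize(quality: str) -> dict:
    """Return min, max and mean quality score of one read."""
    scores = phred33_to_scores(quality)
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": sum(scores) / len(scores),
    }


if __name__ == "__main__":
    # Only the first 11 characters of each quality line above, for brevity.
    original = "=;8GC91*#=="
    rescored = "FFFDEFGGGFF"
    print("fastq-dump :", phred33_to_scores(original))
    print("sfffile -r :", phred33_to_scores(rescored))
    print("fastq-dump summary:", summarize(original))
    print("sfffile -r summary:", summarize(rescored))
```

Running the same summary over the full quality lines of both records would show how far the two scoring schemes differ for this read.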

Both programs use Phred+33 encoding. SRR000001 is from a GS FLX experiment done in 2007. From http://sourceforge.net/apps/mediawik...454_Platforms:

"Recent versions (since 2008) produce better QV's than early versions. Our SFF parser detects the software version by searching for the XML element "<qualityScoreVersion>1.1.03</qualityScoreVersion>" in the SFF manifest. The parser will complain "WARNING: Fragments not rescored!" if this XML element is not found."

but when I ran "sffinfo -mnf" I got "No manifest found". So this is all very confusing.
Which FASTQ quality values should I trust?
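The version check described in the quoted wiki text could be approximated along these lines. This is a hedged sketch, not that parser's actual code: it assumes the manifest text has already been obtained as a string (for example, saved output from sffinfo), and the function name and the sample manifest fragment are hypothetical.

```python
import re
from typing import Optional

# Matches the XML element named in the quoted wiki text and captures its value.
_QSV_PATTERN = re.compile(
    r"<qualityScoreVersion>\s*([^<]+?)\s*</qualityScoreVersion>"
)


def quality_score_version(manifest_text: str) -> Optional[str]:
    """Return the qualityScoreVersion value, or None if the element is absent."""
    match = _QSV_PATTERN.search(manifest_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical manifest fragment, made up for illustration only.
    sample = "<manifest><qualityScoreVersion>1.1.03</qualityScoreVersion></manifest>"
    for text in (sample, "No manifest found"):
        version = quality_score_version(text)
        if version is None:
            print("no qualityScoreVersion element: reads may not be rescored")
        else:
            print("qualityScoreVersion:", version)
```

A missing element here only means the metadata is absent from the file; it does not by itself say which scoring table produced the quality values.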

Some final, maybe broader questions:
How much has 454 quality scoring changed from the GS 20 until now?
Is it still the same homopolymer-based scoring?
Is the current algorithm better?
Does it matter which Data Analysis package version I use, relative to the 454 chemistry the reads came from?

Thank you,
Carlos

Last edited by CPCantalapiedra; 03-20-2012, 09:59 AM.
Tags: 454 backward compat, 454 data analysis, 454 scoring
flxlex

Moderator

Join Date: Nov 2008

Posts: 414
#2

03-23-2012, 04:24 AM

I have always thought that the SRA 454 files are incomplete, and your case of a missing manifest could mean I'm right in my suspicion. What happens if you run 'sffinfo' without parameters on your sff file? Do you see anything metadata-like?

As far as I know, any older sff file benefits from rescoring. The latest scoring is supposed to be better than the older ones, even for older data. Also, the newest data analysis software should work best regardless of sequencing chemistry (at least that is what 454 intended, I believe).