I have some 454 data produced from cDNA that was randomly amplified using proofreading Taq, sheared and then sequenced. Once the amplification tags were removed, we assembled it using CLC Genomics Workbench. Looking at the protein translations of two particular genes, we have very good homology to genes from related organisms, but in two places we appear to have frameshifts in the DNA sequences which introduce stop codons, leading to incomplete proteins. Correcting the frameshifts would mean CCC to CCCC in one case and CCCC to CCC in the other, and would completely restore the protein homology. Both regions have over 200-fold coverage supporting the frameshift, with only one read giving the "corrected" unshifted sequence. I have read that the 454 does have problems with homopolymers. What I am wondering is whether this is biological (a mutation leading to an unusual terminated protein, which I think is unlikely as these proteins are essential), a sequencing error (introduced by the 454 / emPCR) or a sample prep artifact (PCR error, although I was using proofreading Taq). Does anyone have any experience of this sort of thing, and any comments on the "introduced by the sequencing / emPCR" option?
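To show the kind of effect I mean, here is a minimal Python sketch on a made-up toy sequence (not our actual data; the sequences and the `translate` helper are purely illustrative). Dropping one C from a CCCC run shifts the reading frame and brings a premature stop codon into frame:

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate from the first base in frame 1, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy sequences: the only difference is the length of the C run.
with_cccc = "ATGGCCCCGTTAAGGCATTAA"  # run of four C's, full-length reading frame
with_ccc = "ATGGCCCGTTAAGGCATTAA"    # run of three C's, frame shifted after the run

print(translate(with_cccc))  # MAPLRH
print(translate(with_ccc))   # MAR (premature stop)
```

In the toy case the undercalled homopolymer truncates the protein three residues in, which is the same pattern we see in our two genes.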
We are going to Sanger sequence across these regions and see what we get but that will take time.
thx
Ian