I have processed a lot of 454 datasets (mostly fetched from the NCBI Short Read Archive). My general recommendation: clean up the reads before throwing them into any assembler. The assemblers won't work any magic on your behalf. Crap in, crap out.
Second, you mention transcriptome sequencing (of plants, of course):

Ouch. Extract the full raw reads from the .sra files and run them through a trimming pipeline. Don't presume the sequence in the "high-qual" region is free of adapters.
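As a minimal sketch of the extraction step, assuming the NCBI SRA Toolkit's fastq-dump is installed and on PATH (the accession and output directory are placeholders; check fastq-dump --help for the clip-related options of your toolkit version, since you want the unclipped reads):

```python
# Minimal sketch: pull the full reads out of an .sra accession with
# fastq-dump from the NCBI SRA Toolkit. Assumes the tool is on PATH.
import subprocess

def dump_raw_reads(accession, outdir="raw_reads"):
    # fastq-dump does no adapter trimming of its own, so whatever it
    # writes still has to go through your trimming pipeline afterwards.
    subprocess.run(
        ["fastq-dump", "--outdir", outdir, accession],
        check=True,  # raise if the dump fails
    )

dump_raw_reads("SRR000001")  # hypothetical accession; use your own
```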
Finally, some people deposited already-trimmed FASTA/Q files into NCBI SRA. If you extract the sequences from those .sra files, you end up with all-uppercase sequences, which gives the impression they are cleaned up. They are not. The issue is not even the low-qual regions, which you could locate from the quality values in the FASTQ; we are talking here about adapters, and sadly, for lack of appropriate software and knowledge, they often remain in the "high-qual" region. So do not be fooled into thinking that an all-uppercase sequence is already cleaned up: (re)do the work yourself. Even worse, figuring out what was left uncorrected in a dataset badly processed by somebody else is not an easy task. I hit several cases like that, and the unavailability of the original, unprocessed data is quite unpleasant.
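As a quick first look at whether adapters survived, an exact-match scan is easy to do. This is a sketch, not a trimmer: the adapter string below is a placeholder, so substitute the actual Roche adapter/primer from the library prep, and keep in mind that real leftovers are often partial or contain errors, so a dedicated trimming tool should do the real work.

```python
# Rough check for leftover adapter in extracted reads, assuming a FASTA
# file. Exact matching only; a real trimmer must handle partial matches.
ADAPTER = "GCCTCCCTCGCGCCATCAG"  # placeholder; use your library's adapter

def reads(fasta_path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(fasta_path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

hits = sum(1 for _, s in reads("reads.fasta") if ADAPTER in s.upper())
print(f"reads with exact adapter match: {hits}")
```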
(I have to admit you will likely fail to get it right the first time -- across roughly 400 datasets from 454 pyrosequencers I saw so many *different* issues that it will take you a long while to recognize and overcome all of them.)
BTW: When you say ~230 nt long reads ... that is a quality-trimmed read length, right? Were these libraries prepared with the Titanium protocol? Don't expect long assembled transcripts from these; the properly trimmed reads might end up in the 120-180 nt range, way too short to reconstruct the CDS of even an average-length protein.
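If you want to check what you actually have, a read-length histogram tells you quickly whether the reads look raw or pre-trimmed. A sketch, assuming a FASTA of extracted reads: raw Titanium runs peak somewhere around 400 nt, so a peak near 230 nt suggests somebody already trimmed them.

```python
# Bin read lengths from a FASTA file into 20 nt buckets so the peak of
# the distribution is easy to eyeball.
from collections import Counter

def read_lengths(fasta_path):
    """Collect sequence lengths from a FASTA file."""
    lengths, current = [], 0
    with open(fasta_path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths

bins = Counter((n // 20) * 20 for n in read_lengths("reads.fasta"))
for start in sorted(bins):
    print(f"{start:4d}-{start + 19:4d} nt: {bins[start]}")
```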