
**Jeremy** · 11-05-2012, 12:24 AM

Depends on the format that you downloaded the reads in. If they are SFF, get Newbler from Roche (it is free to researchers) and use GSMapper. If you have them in FASTA and are already familiar with TopHat/Cufflinks, then just use those.

**ZHONG Xiao** · 11-05-2012, 06:22 AM

Hi, Jeremy! Thank you very much for your reply!
The reads I downloaded come from the NCBI SRA (Short Read Archive), so they should be in FASTQ format, but their lengths vary: some are longer and some shorter than 230 nt.

At first, I tried to assemble the transcripts using TopHat/Cufflinks, but it failed, which may be due to the type of reads! TopHat expects standard FASTQ: "TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies"!

http://tophat.cbcb.umd.edu/manual.html

Next, I converted the 454 FASTQ reads to FASTA with an in-house script, aligned them directly to the reference genome with BLAT, and used PASA to join the related aligned hits into transcripts. Unfortunately, most of the "transcripts" are too short and too fragmented!
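For anyone repeating the FASTQ to FASTA step, a minimal Python sketch is below. It is not the in-house script mentioned above; it assumes plain four-line FASTQ records (no wrapped sequence lines), and the function names `read_fastq` and `fastq_to_fasta` are made up for illustration.

```python
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


def fastq_to_fasta(src: TextIO, dst: TextIO, width: int = 60) -> int:
    """Write every FASTQ record in src as FASTA to dst; return the record count."""
    count = 0
    for name, seq, _qual in read_fastq(src):
        dst.write(f">{name}\n")
        # Wrap the sequence at a fixed width, as most FASTA consumers expect.
        for start in range(0, len(seq), width):
            dst.write(seq[start:start + width] + "\n")
        count += 1
    return count
```

To use it, open the input and output files and call `fastq_to_fasta(infile, outfile)`. Note that the conversion throws away the quality values, so any quality trimming has to happen before this step.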

So, today, I am trying this: first, de novo assemble the 454 reads into contigs using TGICL (similar to Newbler), and then map/BLAT the contigs to the genome to get exon-intron transcripts. Wish me luck!

gsMapper maps reads to a transcriptome or genome reference. Sorry, I am not familiar with it, but I guess it is similar to BLAT and TopHat: just an alignment tool with no assembly function, right?

So, I guess there should be a tool that assembles 454 single-end reads (FASTQ/FASTA, etc.) into transcripts based on the genome sequence, which I think would be much more accurate than de novo assembly.

I want to do the same job in different ways and then pick the best method!
Happy to get your reply! Thanks~

**martin2** · 01-06-2013, 05:22 PM

Hi Xiao,
I have processed a lot of 454 datasets (mostly fetched from the NCBI Short Read Archive). My general recommendation is: clean up the reads before throwing them into any assembler. The assemblers won't do any magic on your behalf. Crap in, crap out.

Second, as you mention transcriptome sequencing (of plants, of course), I fear the adapters used for sample preparation were from Evrogen/Clontech, which offer molecular methods for first-strand cDNA synthesis, directional or formerly non-directional cloning, and eventually normalization. These datasets have completely different types of issues compared to those made according to Roche protocols. If this procedure was used in the lab, then I am quite certain you will end up with chimeric assemblies. Look up the sequences of the MINT/SMART adapters elsewhere and trim the raw reads.
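To make the trimming idea concrete, here is a rough Python sketch, not a replacement for a real trimming pipeline. It only finds exact copies of an adapter (or its reverse complement) and keeps the longer flank of the read; real adapter remnants are often partial or carry sequencing errors, so a proper tool has to allow mismatches. The function names are made up, and `PLACEHOLDER_ADAPTER` is NOT a real MINT/SMART sequence: substitute the sequences from the kit documentation.

```python
from typing import Iterable

# Placeholder only: this is not a real Evrogen/Clontech adapter sequence.
PLACEHOLDER_ADAPTER = "GGATCCAAGTTTCC"

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def trim_adapters(read: str, adapters: Iterable[str]) -> str:
    """Remove exact adapter copies from a read, keeping the longer flank.

    Matching is case-insensitive, so adapters sitting in lowercase
    (low-quality) or uppercase ("high-qual") regions are both found.
    """
    probes = []
    for adapter in adapters:
        if adapter:  # ignore empty strings, which would match everywhere
            probes.append(adapter.upper())
            probes.append(reverse_complement(adapter.upper()))

    seq = read
    changed = True
    while changed:  # repeat until no adapter copy is left
        changed = False
        for probe in probes:
            pos = seq.upper().find(probe)
            if pos == -1:
                continue
            end = pos + len(probe)
            if pos >= len(seq) - end:
                seq = seq[:pos]   # left flank is longer (or equal): keep it
            else:
                seq = seq[end:]   # right flank is longer: keep it
            changed = True
    return seq
```

Keeping only the longer flank is a deliberate simplification: a read with an adapter in the middle is probably a chimera, and this sketch just discards the shorter piece instead of splitting the read in two.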

Also, extract the full raw reads from the .sra files and process them through the trimming pipeline. Don't presume the sequence in the "high-qual" region is free of adapters.

Finally, some people have deposited already-trimmed FASTA/Q files into NCBI SRA. If you extract the sequences from such .sra files you will end up with sequences in all-uppercase letters, giving you the impression they are cleaned up. They are not. You don't even have to look into the FASTQ quality values to learn where a low-qual region is. We are talking here about adapters, and sadly, due to a lack of appropriate software and knowledge, they often remain in the "high-qual" region. So do not be fooled into thinking an all-uppercase sequence is already cleaned up, and (re)do the work yourself. Even worse, figuring out what was left uncorrected in a dataset badly processed by somebody else is not an easy task. I have hit some cases like that, and the unavailability of the original, "unprocessed" data is quite unpleasant.
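A quick way to check whether an "all-uppercase" dataset is really clean is to count the reads that still carry an adapter inside the uppercase region. The Python sketch below assumes the usual 454 FASTA convention of lowercase for clipped bases and uppercase for the "high-qual" part; the function names are made up, it only looks for exact matches, and the adapter you pass in has to be the real one for the library in question.

```python
from typing import Iterable, Tuple

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_uppercase_hit(read: str, probe: str) -> bool:
    """True if an exact copy of probe lies entirely in uppercase bases."""
    upper = read.upper()
    pos = upper.find(probe)
    while pos != -1:
        if read[pos:pos + len(probe)].isupper():
            return True
        pos = upper.find(probe, pos + 1)  # look for a later occurrence
    return False


def count_adapter_in_high_qual(reads: Iterable[str], adapter: str) -> Tuple[int, int]:
    """Return (reads with an adapter copy in the uppercase region, total reads)."""
    forward = adapter.upper()
    reverse = reverse_complement(forward)
    hits = 0
    total = 0
    for read in reads:
        total += 1
        if has_uppercase_hit(read, forward) or has_uppercase_hit(read, reverse):
            hits += 1
    return hits, total
```

If the first number is not close to zero, the dataset was not cleaned properly, whatever the letter case suggests. Exact matching will undercount, so a zero here does not prove the reads are clean.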

(I have to admit you will likely fail to do it right: I have seen so many *different* issues in about 400 datasets from 454 pyrosequencers that it will take you a long while to recognize and overcome all of them.)

BTW: When you say ~230 nt long reads... that is the quality-trimmed read length, right? Were these prepared by the Titanium protocol? Don't expect long assembled transcripts from these; the properly trimmed reads might be in the range of 120-180 nt, way too short to reconstruct the CDS of even average-length proteins.
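To see where a dataset actually lands after trimming, a small length summary helps. This is only a sketch: `length_summary` is a made-up name, and the default cutoff of 120 nt is just the lower end of the range mentioned above, not a recommended threshold.

```python
from statistics import median
from typing import Dict, Iterable, Union


def length_summary(reads: Iterable[str], short_cutoff: int = 120) -> Dict[str, Union[int, float]]:
    """Summarize read lengths: count, min, median, max, fraction below cutoff."""
    lengths = sorted(len(read) for read in reads)
    if not lengths:
        raise ValueError("no reads given")
    n_short = sum(1 for length in lengths if length < short_cutoff)
    return {
        "n": len(lengths),
        "min": lengths[0],
        "median": median(lengths),
        "max": lengths[-1],
        "short_fraction": n_short / len(lengths),
    }
```

Run it once on the reads as extracted and once after adapter and quality trimming; the difference between the two medians shows how much of the nominal read length was really usable.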


How to assemble the transcripts for 454 reads?
