How to assemble the transcripts for 454 reads?

  • martin2
    replied
    Hi Xiao,
    I have processed a lot of 454 datasets (mostly fetched from the NCBI Short Read Archive). My general recommendation is: clean up the reads before throwing them into any assembler. The assemblers won't do anything magic on your behalf. Crap in, crap out.

    Second, since you mention transcriptome sequencing (of plants, of course), I fear the adapters used for sample preparation came from Evrogen/Clontech, which offer molecular methods for first-strand cDNA synthesis, directional or (formerly) non-directional cloning, and eventually normalization. These datasets have completely different kinds of issues from those made according to Roche protocols. If this procedure was used in the lab, then I am quite certain you will end up with chimeric assemblies. Look up the sequences of the MINT/SMART adapters elsewhere and trim the raw reads.

    Also, extract the full raw reads from the .sra files and run them through the trimming pipeline. Don't presume the sequence in the "high-qual" region is free of adapters.

    Finally, some people deposited FASTA/Q files into NCBI SRA that had already been trimmed in some way. If you extract the sequences from such .sra files you will end up with sequences in all-uppercase letters, giving you the impression they are cleaned up. No. You don't even have to look at the quality values in the FASTQ to learn where a low-qual region is. We are talking here about adapters, and sadly, due to a lack of appropriate software and knowledge, they often do remain in the "high-qual" region. So do not be fooled into thinking an all-uppercase sequence is already cleaned up, and (re)do the work yourself. Even worse, working out what was left uncorrected in a dataset badly processed by somebody else is not an easy task. I have hit some cases like that, and the unavailability of the original, "unprocessed" data is quite unpleasant.
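
As a rough illustration of the point above, here is a minimal Python sketch that looks for an adapter in a read regardless of letter case and cuts the read at the first hit. The adapter string is a hypothetical placeholder, not a real MINT/SMART sequence, and the function names, mismatch limit, and minimum overlap are assumptions chosen for the example. It tolerates substitutions only, so it will miss adapters disrupted by the homopolymer indels typical of 454 data, and it simply discards everything downstream of the first hit.

```python
def find_adapter(read, adapter, max_mismatches=1, min_overlap=8):
    """Return the leftmost position where `adapter` matches `read`, or -1.

    A match may be a full-length adapter inside the read or a truncated
    adapter prefix (at least `min_overlap` bases) at the 3' end.
    Comparison is case-insensitive and tolerates substitutions only.
    """
    read_u = read.upper()
    adapter_u = adapter.upper()
    for start in range(len(read_u)):
        window = read_u[start:start + len(adapter_u)]
        if len(window) < min_overlap:
            break  # remaining overlap is too short to trust
        mismatches = sum(1 for a, b in zip(window, adapter_u) if a != b)
        if mismatches <= max_mismatches:
            return start
    return -1


def trim_adapter(read, adapter, max_mismatches=1, min_overlap=8):
    """Cut the read at the first adapter hit; return it unchanged if none."""
    pos = find_adapter(read, adapter, max_mismatches, min_overlap)
    return read if pos < 0 else read[:pos]


# Hypothetical placeholder adapter, NOT a real MINT/SMART sequence.
ADAPTER = "ACGTTGCAACGTTGCA"

# An all-uppercase read that still carries the adapter at its 3' end.
read = "TTTTGGGGCCCC" + ADAPTER
print(trim_adapter(read, ADAPTER))
```

A dedicated trimming tool is still the better choice for real data; the sketch only shows why an all-uppercase read cannot be assumed adapter-free.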

    (I have to admit you will likely fail to do it right: in about 400 datasets from 454 pyrosequencers I have seen so many *different* issues that it will take you a long while to recognize and overcome all of them.)

    BTW: When you say ~230 nt long reads... That is a quality-trimmed read length, right? Were these prepared with the Titanium protocol? Don't expect long assembled transcripts from these; the properly trimmed reads might be in the range of 120-180 nt, way too short to reconstruct the CDS of even average-length proteins.
    Last edited by martin2; 03-04-2014, 11:22 AM. Reason: Typo editing.

  • ZHONG Xiao
    replied
    Hi, Jeremy! Thank you very much for your reply!
    The reads I downloaded are from the NCBI SRA (Short Read Archive), so they should be in FASTQ format, but their lengths vary: some are longer and some are shorter than 230 nt.

    At first, I tried to assemble the transcripts using TopHat/Cufflinks, but it failed, which may be due to the type of reads! TopHat calls for standard FASTQ: "TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies"!


    Next, I converted the 454 FASTQ reads to FASTA with an in-house script, aligned them directly to the reference genome with BLAT, and joined the related aligned hits into transcripts using PASA. Unfortunately, most of the "transcripts" are too short and too fragmented!
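
The in-house conversion script is not shown in the thread. A minimal Python sketch of such a FASTQ-to-FASTA step could look like the following, assuming plain four-line FASTQ records (no wrapped sequence lines); the function name `fastq_to_fasta` is made up for this example.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert four-line FASTQ records to FASTA; return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        seq = fastq_handle.readline().rstrip("\r\n")
        plus = fastq_handle.readline().rstrip("\r\n")
        qual = fastq_handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in: " + header)
        # Keep the read name and description, drop the quality string.
        fasta_handle.write(">" + header[1:] + "\n" + seq + "\n")
        count += 1
    return count


# Small in-memory example; real use would pass open file handles instead.
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
out = io.StringIO()
fastq_to_fasta(example, out)
print(out.getvalue(), end="")
```

Note that this conversion throws away the quality values, so any quality or adapter trimming has to happen before it.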

    So today I am trying this: first, de novo assemble the 454 reads into contigs using TGICL (similar to Newbler), and then map the contigs to the genome with BLAT to get exon-intron transcripts. Wish me luck!

    gsMapper maps reads to a transcriptome or genome reference. Sorry, I am not familiar with it, but I guess it is similar to BLAT and TopHat: just an alignment tool with no assembly function, right?

    So I guess there should be a tool that assembles 454 SE reads (FASTQ/FASTA, etc.) into transcripts based on the genome sequence, which I think would be much more accurate than de novo assembly.

    I want to do the same work in different ways and then pick the best method!
    Glad to have your reply! Thanks~
    Last edited by ZHONG Xiao; 11-05-2012, 06:37 AM.

  • Jeremy
    replied
    It depends on the format you downloaded the reads in. If they are SFF, get Newbler from Roche (it is free for researchers) and use GSMapper. If you have them in FASTA and are already familiar with TopHat/Cufflinks, then just use that.

  • ZHONG Xiao
    started a topic How to assemble the transcripts for 454 reads?

    How to assemble the transcripts for 454 reads?

    Hello! I have downloaded 454 SE reads (~230 nt) of a plant from NCBI, but I don't know how to assemble them into transcripts by mapping to its genome, similar to the method that assembles Illumina PE reads using TopHat/Cufflinks. How can I do this? Which tools can I use?
    Thanks!
