

454 RNA assembly




  • 454 RNA assembly

    Hello all,

    I'm working on a transcriptome sequenced on a Roche 454 platform. At first I assembled it with Newbler, but the result was awful. Then I tried iAssembler and got an assembly that looks all right. However, when I mapped the reads back, only about 30% of them aligned to the assembly.

    Is there anyone who can help me with this issue? Many thanks.

  • #2
    We investigated the effect of the assembly algorithm on the quality of 454 assemblies some time ago (see J. Exp. Bot.: "Critical assessment of assembly strategies for non-model species mRNA-Seq data and application of next-generation sequencing to the comparison of C(3) and C(4) species").

    However, to cut a long story short, I would suggest 1) read cleaning with a decent quality cut-off (Phred 20 or higher) and 2) assembly with CAP3 or TGICL, which is essentially CAP3 run on pre-clustered reads.
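    The quality cut-off step suggested above can be sketched like this. This is a toy illustration, not any particular tool's algorithm; `phred_trim`, the cutoff and minimum-length values, and the example reads are all invented for the sketch.

    ```python
    # Hypothetical sketch of a Phred >= 20 read-cleaning step.
    # phred_trim() and the example reads are illustrative, not a real tool.

    def phred_trim(seq, quals, cutoff=20, min_len=50):
        """Trim a read from the 3' end while the quality is below `cutoff`,
        then discard it entirely if it ends up shorter than `min_len`."""
        end = len(seq)
        while end > 0 and quals[end - 1] < cutoff:
            end -= 1
        trimmed = seq[:end]
        return trimmed if len(trimmed) >= min_len else None

    # Toy read: 60 good bases (Q30) followed by a low-quality tail (Q10).
    seq = "ACGT" * 20                       # 80 bp
    quals = [30] * 60 + [10] * 20
    print(phred_trim(seq, quals))           # keeps the first 60 bases
    print(phred_trim("ACGT" * 5, [10] * 20))  # too short after trimming -> None
    ```

    Real trimmers often use a sliding-window or running-sum criterion rather than a hard per-base cutoff, but the idea is the same: low-quality 3' tails inflate assembly errors, so remove them before assembly.
    
    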



    • #3
      Hi Simon,

      Thanks for your suggestions.
      Can you suggest some software for quality control? I used SeqClean to trim vectors, but it can't do quality trimming.
      iAssembler uses MIRA and CAP3, and I also ran TGICL afterwards. But I still don't understand why so few reads align to the assembly.


      • #4
        Sorry, I just realized my post was a bit short.

        So we usually use the FASTQ/FASTA quality trimming and filtering tools from the FASTX-Toolkit, also available in Galaxy (http://hannonlab.cshl.edu/fastx_toolkit/).
        There are multiple possible reasons for your low mapping efficiency:
        1) Mapping tool too stringent
        2) Assembly errors
        3) Only unique mapping reads reported, while having many repetitive contigs

        As there is no golden rule in transcriptomics (as far as I know), you might want to share some information about your project and discuss it with the people here (e.g. model vs. non-model organism, animal/plant/etc., experimental design: qualitative vs. quantitative, etc.). The more you are able and willing to share, the more you can get out of it here.
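        One way to check reason 3 above (only uniquely mapping reads reported) is to compute the back-mapping rate yourself from the aligner's SAM output, counting only primary records. This is a minimal sketch with made-up toy SAM lines; the FLAG bits used (0x4 unmapped, 0x100 secondary, 0x800 supplementary) are from the SAM specification.

        ```python
        # Sketch: estimate the fraction of reads that mapped back to the assembly
        # from SAM records. Skips secondary (0x100) and supplementary (0x800)
        # alignments so each read is counted exactly once. Toy data, not real reads.

        def mapping_rate(sam_lines):
            mapped = total = 0
            for line in sam_lines:
                if line.startswith("@"):            # header line
                    continue
                flag = int(line.split("\t")[1])
                if flag & 0x100 or flag & 0x800:    # not a primary record
                    continue
                total += 1
                if not flag & 0x4:                  # unmapped bit clear -> mapped
                    mapped += 1
            return mapped / total if total else 0.0

        toy_sam = [
            "@SQ\tSN:contig1\tLN:1000",
            "read1\t0\tcontig1\t1\t60\t4M\t*\t0\t0\tACGT\t####",
            "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t####",              # unmapped
            "read3\t0\tcontig1\t100\t60\t4M\t*\t0\t0\tACGT\t####",
            "read3\t256\tcontig1\t300\t0\t4M\t*\t0\t0\tACGT\t####",   # secondary, skipped
        ]
        print(mapping_rate(toy_sam))   # 2 of 3 primary reads mapped
        ```

        If counting all primary records this way gives you a much higher rate than your mapper reported, the mapper was probably discarding multi-mapping reads, which points at repetitive or redundant contigs rather than a bad assembly.
        
        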



        • #5
          About my project: I have two shrimp samples, one infected with a virus and the other as a control. My goal is to find gene expression changes under infection.



          • #6
            454's own program, newbler (gsAssembly, runAssembly), has a '-cDNA' option. Did you use that? It is very unfortunate this program was not tested in the study by @sisch, as on paper it seems a nice approach.


            • #7
              @flxlex Yes, it is a pity that newbler was not included.

              I don't have a good overview of your field, as I come from plant science, so please bear with my ignorance. Just to add some more thoughts:

              - Is there ANY genome available that is reasonably closely related? In plants, species as far diverged as 50 million years can still yield good results.
              - Do you know the viral genome? (i.e. can you tell which transcripts are of viral origin?)
              - I've seen datasets with virus-induced transcriptional changes where many of the differences are in splice variants. These are hardly detectable in de novo approaches, so you would want a genome reference to fall back on.


              • #8
                Hi Robin,
                I developed my own tool to do QC and adapter/artifact/MID removal, and I offer cleanup work as a commercial service. I think I may dare to say that I really have an overview, based on more than 1700 454 datasets from around the world. I have collected lots of artifacts from all those datasets and learned what one has to remove to get better assemblies, and found several funny mistakes that happen from time to time in some labs, some software-driven errors and, notably, bad designs of certain lab protocols.

                As for publicly available tools, and your question about what you could use for your work: none. They just don't do the right thing at all. Nobody told those programmers what to look for, and because I know what they are missing I can only say that they never tested their software properly. On the contrary, I have been constantly rewriting my own tool for several years and kept hitting yet another new artifact or new adapter in datasets weekly. Frustrating. Nowadays I perform at least several hundred queries for each read in a dataset before deciding what is in a particular read. Sadly, in some cases it takes a few thousand queries before I can judge what and how to trim. On top of that, I have to use several aligners to find what I want, because it is sometimes obscured too much by sequencing errors.

                Forget about QC based on Phred values alone; it is nearly useless. CAP3 is the usual trick to squash reads into some contiguous sequence once you realize you cannot get them merged any other way. It is good for getting contigs 400 nt long on average for the purposes of your paper. Interestingly, reviewers let such papers through even when the raw read length was, say, 310 nt on average (FLX Titanium). Anyway, CAP3 merges closely related copies of the same gene together. Just think how many whole-genome duplications your organism has already undergone. Do you have 4 or 8 loci of your favourite gene in the genome? Why do you think CAP3 would not mix them into a single bunch of splice variants? It may even make up just one or two splice variants, discard 3 SNPs, and drop the 3'-UTR exons of 7 of those 8 loci. Applying CAP3 to next-gen data should have been banned long ago; it is a last resort.
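                The paralog-collapse concern can be shown with a toy example. An overlap-identity threshold (CAP3's `-p` cutoff is around 90% by default) cannot separate gene copies that are more similar to each other than the threshold, so recent duplicates cluster into one contig. The sequences and the `would_merge` helper below are invented for illustration, not CAP3's actual overlap algorithm.

                ```python
                # Toy illustration: two gene copies differing at 2 of 50 positions
                # (96% identity) exceed a 90% identity cutoff, so an identity-based
                # assembler would collapse them into a single contig.

                def percent_identity(a, b):
                    """Identity over an ungapped alignment of equal-length sequences."""
                    assert len(a) == len(b)
                    matches = sum(1 for x, y in zip(a, b) if x == y)
                    return 100.0 * matches / len(a)

                def would_merge(a, b, threshold=90.0):
                    return percent_identity(a, b) >= threshold

                gene_copy_1 = "ATGGCTAAGGTTCCAGATTGGACTCCAAAGGAAGTTGACGCTATGGCTAA"
                # Same 50 bp with two SNPs -> 96% identity, above the 90% cutoff.
                gene_copy_2 = "ATGGCTAAGGTTCCAGATTGCACTCCAAAGGAAGTTGTCGCTATGGCTAA"

                print(percent_identity(gene_copy_1, gene_copy_2))  # 96.0
                print(would_merge(gene_copy_1, gene_copy_2))       # True: collapsed
                ```

                The same arithmetic explains why raising the cutoff does not fully help: very recent duplicates can be 99%+ identical, at which point only haplotype-aware assemblers or a genome reference can keep them apart.
                
                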

                I see several shrimp datasets in NCBI SRA, including one infected with some virus. Is that the dataset you are talking about here? Drop me an email if you want me to clean up the data for you.