Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • 454 assemly

    Hello,

    I have been given two sets 454 data which have already been assembled into contigs using the newbeler assembler. The data is transcript sequence from two different flower types, from a plant species which has not yet had its genome sequenced.

    They want me to compare the expression levels between two sets of data by comparing the number of reads which were used to make up matching contigs. The coverage isn't that great so some genes have multiple contigs. I can tell this as their are many cases several smaller contigs from one sample (but still in the 100's) align with a larger ones in the other.

    I thought the best way to deal with this would be to pool the reads from the two samples and assemble them. Then compare the number of reads from each sample that make up each of the contigs.

    I don't have access to newbeler so I tired re-assembling one of the samples with velvet. But this resulted in 25,091 contigs whereas newebler had produced 13,990.

    Does anyone know which of the publicly available assemeblers works bets on 454 data? Or has anyone got an suggestion on how to deal with this data?

    Thanks,

    John

  • #2
    John,

    You are absolutely on the right track in wanting to do a single, unified assembly of the data. We do exactly the type of analysis you are trying on a very regular basis. We haven't used Newbler for transcript assembly in quite a while. Here is the procedure we use:

    Clean the hell out of your raw sequences; trimming polyA (or polyT), vector/adapter sequences and low quality and low complexity regions. It certainly helps to know what procedures/kits/adapters were used in creating the cDNA library used for 454 sequencing so that you can limit screening steps to just those. We first run cross_match to do vector screening. The screened output is then input to the SeqClean. SeqClean (http://compbio.dfci.harvard.edu/tgi/software/) is a pipeline originally created at TIGR for cleaning EST sequences.

    After cleaning the reads are fed into the assembly pipeline TGI Clustering Tools (TGICL, also available at the URL above). This is another pipeline first developed by TIGR for clustering and assembling ESTs for their Gene Index project. It calculates pairwise similarity scores for all possible pairwise comparisons. It then performs a transitive clustering of the reads based on these similarity scores. Finally, it assembles each cluster using CAP3. We use parameters a little more stringent than the defaults (minimum overlap and percent identity). At this stage any singletons are set aside and not considered further. All of the contigs created are then assembled together using CAP3, with more relaxed parameters than the first round. You will still end up with multiple contigs which are very similar.

    The two stage assembly does add an extra layer of complexity when you are trying to track reads. Since the assembly components of the second round would be contigs themselves you have to track back to which reads made up those contigs from the first round assembly.

    If you decide that you do not want to do an entire new assembly I do have an alternative. As you have discovered you will never be able to make a 1-to-1 matching of contigs but you could try to create groups of contigs from the two assemblies. A useful program to do this is blastclust, which is part of the standard NCBI blast toolkit. The grouping can be very stringent (e.g. only finding orhtologous sequences) or more relaxed (grouping sequences from gene families) based on the adjusting the two primary scoring parameters -L and -S. In a situation like yours you will have to be careful with -L parameter. This parameter controls what percentage of the shorter sequence must overlap the longer one. Blastclust was written assuming assuming that people would be comparing complete sequences (transcripts or proteins) so that one sequence should be 'contained' within the other. This is not true for your incomplete transcript assemblies.

    I rambled on for quite some time here, I hope you find some of this information useful.

    Comment


    • #3
      Thanks for your detailed reply.

      I had been thinking about clustering the contigs today. I have used blastclust before also I was having a look at CLOBB (http://www.biomedcentral.com/1471-2105/3/31/abstract) as it seems to be similar to blastclust but optimised for EST clustering. Also I found a program called wcd (http://bioinformatics.oxfordjournals...ull/24/13/1542) that is also designed EST data.

      I think I'll give the clustering a go. If I have lots of problems then I might look into using the pipeline you suggested.

      Comment


      • #4
        Hi,

        I am just following kmcarr's advice using TGICL to cluster ~75.000 ~250nt sequences from a GS FLX run.

        CAP3 has a problem, giving the following error:
        *** glibc detected *** cap3: corrupted double-linked list: 0x09724b98 ***
        ======= Backtrace: =========
        /lib/libc.so.6[0x8763a4]
        /lib/libc.so.6[0x878123]
        /lib/libc.so.6(cfree+0x96)[0x878356]
        /lib/libc.so.6(fclose+0xea)[0x92b65a]
        cap3[0x8056cd6]
        cap3[0x80497c0]
        cap3(free+0x4a)[0x8048726]
        ======= Memory map: ========
        0070a000-00717000 r-xp 00000000 fd:00 21227675 /lib/libgcc_s-4.3.2-20081105.so.1
        00717000-00718000 rwxp 0000c000 fd:00 21227675 /lib/libgcc_s-4.3.2-20081105.so.1
        007e2000-00802000 r-xp 00000000 fd:00 21227666 /lib/ld-2.9.so
        00803000-00804000 r-xp 00020000 fd:00 21227666 /lib/ld-2.9.so
        00804000-00805000 rwxp 00021000 fd:00 21227666 /lib/ld-2.9.so
        00807000-00975000 r-xp 00000000 fd:00 21227601 /lib/libc-2.9.so
        00975000-00977000 r-xp 0016e000 fd:00 21227601 /lib/libc-2.9.so
        00977000-00978000 rwxp 00170000 fd:00 21227601 /lib/libc-2.9.so
        00978000-0097b000 rwxp 00978000 00:00 0
        08048000-08060000 r-xp 00000000 fd:00 18547219 /home/ele/tools/tgicl_linux/bin/cap3
        08060000-08061000 rwxp 00017000 fd:00 18547219 /home/ele/tools/tgicl_linux/bin/cap3
        08061000-081e1000 rwxp 08061000 00:00 0
        0970e000-097d6000 rwxp 0970e000 00:00 0 [heap]
        f7e00000-f7e21000 rwxp f7e00000 00:00 0
        f7e21000-f7f00000 ---p f7e21000 00:00 0
        f7f9f000-f7fa2000 rwxp f7f9f000 00:00 0
        f7fbb000-f7fbc000 rwxp f7fbb000 00:00 0
        f7fbc000-f7fbd000 r-xp f7fbc000 00:00 0 [vdso]
        ff993000-ff9bd000 rwxp 7ffffffd5000 00:00 0 [stack]

        this is repeated several times...

        It is the same error whether I use multiple processors (-c option) or just one.

        Has anybody an idea what to do?

        I'm working on a 2x quad-core, 16GB RAM running fedora10.x86_64.
        Last edited by Emanuel Heitlinger; 05-20-2009, 07:08 AM.

        Comment


        • #5
          Maybe it's a good idea to get the latest cap3 binary for your platform at http://seq.cs.iastate.edu/

          If you want to cluster before assembling you could try wcd (http://code.google.com/p/wcdest/) which makes a good job on clustering our data.

          You can also use (for assembling your transcriptome dataset) MIRA2 assembler, http://chevreux.org/projects_mira.html

          We ran into problems with TGICL with 454 data with very deep clusters (the clustering itself crashed, not the subsequent assembly). The author doesn't recommend TGICL for big NGS datasets.

          hth,
          Sven

          Comment


          • #6
            Thanks!

            I just downloaded the right binay, copied it to tgicl/bin and voila: It works! The clustering takes only very short time running on 6 processors... I still have to wait wait till CAP3 finishes.

            MIRA looks very promising, their web page has great howtos and what they say in their paper on miraEST seems very promising.

            SCARF looks promising as well. Although I am sure, that I will not have an organism being related closely enough (working on a notoriously long branching nematode).

            I have to take a closer look into the literature as I am still a bit confused which tools do clustering, assembly or both... I will try to post a summary once I have a bit more of a clue.
            Last edited by Emanuel Heitlinger; 05-21-2009, 05:06 AM.

            Comment


            • #7
              Wcd

              I am trying to run WCD however I am not able to generate sequence cluster that is generated by WCD with default parameters as follows:
              wcd -c -N 4 -o test.clust -g -t testSeq.fna


              I get the output file test.clust but it does not have clustered sequences just the information as follows:
              Index Cluster Link Orientation Witness
              0 0 -1 1 -1
              1 1 -1 1 -1

              I looked further down and its shows the stats:

              Number of clusters is 2023 with ave sq of len %
              Size of Cluster Number of Clusters
              1 1902
              2 86
              3 25
              4 4
              5 4
              11 1
              29 1

              And further down below are just each cluster and the links:
              ...........
              208.
              210.
              211.
              212.
              213 161 406.
              ..........

              Am I missing any parameter. In the document, its written that sequences are generated. Any help will be highly appreciated.

              Comment


              • #8
                Originally posted by donniemarco View Post
                [....]
                And further down below are just each cluster and the links:
                ...........
                208.
                210.
                211.
                212.
                213 161 406.
                ..........

                Am I missing any parameter. In the document, its written that sequences are generated. Any help will be highly appreciated.
                This output is all you need ... It tells you that the sequences with index 208, 210, 211 and 212 are singletons, whereas the sequences 213, 161 and 406 form a cluster with the given parameters (note that defaultly counting starts from "0").
                It's now up to you to extract the cluster-forming sequences from your input file and assemble them with cap3, phrap, MIRA or whatever assembler ...

                cheers,
                Sven

                Comment


                • #9
                  Hi Sven,

                  Thanks for your helpful reply.

                  If I extract all the cluster forming sequences say 213, 161 and 406 (actual sequence 214, 162, 407: thanks for pointing this as I had missed this) from the file and try to assemble them with say MIRA.

                  Do you not think that Mira not knowing these being one cluster will again be restricted to perform assembling the same way as these sequences would be assembled from original file (example the transitive closure property that assemblers lack).

                  I am thinking of using these cluster to generate single supercontig (and then probably rerun with my remainging singleton data with MIRA) however I am looking any particular program to do this. Any help would be appreciated.

                  Thanks!!

                  Originally posted by sklages View Post
                  This output is all you need ... It tells you that the sequences with index 208, 210, 211 and 212 are singletons, whereas the sequences 213, 161 and 406 form a cluster with the given parameters (note that defaultly counting starts from "0").
                  It's now up to you to extract the cluster-forming sequences from your input file and assemble them with cap3, phrap, MIRA or whatever assembler ...

                  cheers,
                  Sven

                  Comment


                  • #10
                    Originally posted by donniemarco View Post
                    Hi Sven,

                    Thanks for your helpful reply.

                    If I extract all the cluster forming sequences say 213, 161 and 406 (actual sequence 214, 162, 407: thanks for pointing this as I had missed this) from the file and try to assemble them with say MIRA.
                    -Q int, --reindex int: reindex sequence ids.

                    may change the default indexing start.

                    Do you not think that Mira not knowing these being one cluster will again be restricted to perform assembling the same way as these sequences would be assembled from original file (example the transitive closure property that assemblers lack).
                    Yes, I'd expect MIRA to assemble these reads similar as if done with the whole dataset, maybe. It's not always easy to directly compare different clustering/assembly approaches. It depends on what you are expecting.

                    For 454-generated EST datasets I currently use MIRA3 alone, quite successful IMHO. Datasets range from ~300,000 to 1,500.000 reads.

                    I am thinking of using these cluster to generate single supercontig (and then probably rerun with my remainging singleton data with MIRA) however I am looking any particular program to do this. Any help would be appreciated.
                    Thanks!!
                    What do want to achieve? What is a 'supercontig' in this context?

                    A common way is to use a cluster program (wcd,tclust) and assemble these clusters with "cap3". This works quite fine. You could also use phrap.
                    I'd use MIRA as a "complete program to assemble ESTs".

                    If you want to further "join" contigs, just do so with cap3 or phrap
                    (although I would not do this with my datasets ;-))

                    Again, everthings depends on what you want / are expecting.

                    cheers,
                    Sven

                    Comment


                    • #11
                      Thanks for the reply.

                      I will run assembling with mira and see how it goes. BTW, earlier I ran the same data with TIGR but it seems to have crashed while performing assembling with CAP3 (it finished clustering). I used 32gb RAM, 8 cpu processor so I think it was not memory contraint somehow. The no. of reads are approx. 900,000 with file size of 350mb.

                      Originally posted by sklages View Post
                      -Q int, --reindex int: reindex sequence ids.

                      may change the default indexing start.



                      Yes, I'd expect MIRA to assemble these reads similar as if done with the whole dataset, maybe. It's not always easy to directly compare different clustering/assembly approaches. It depends on what you are expecting.

                      For 454-generated EST datasets I currently use MIRA3 alone, quite successful IMHO. Datasets range from ~300,000 to 1,500.000 reads.



                      What do want to achieve? What is a 'supercontig' in this context?

                      A common way is to use a cluster program (wcd,tclust) and assemble these clusters with "cap3". This works quite fine. You could also use phrap.
                      I'd use MIRA as a "complete program to assemble ESTs".

                      If you want to further "join" contigs, just do so with cap3 or phrap
                      (although I would not do this with my datasets ;-))

                      Again, everthings depends on what you want / are expecting.

                      cheers,
                      Sven
                      Thanks!!!!!!!!

                      Comment


                      • #12
                        donniemarco,
                        did you read the whole thread? Look at my post about progblems with cap within tgicl! If yours are is the same problem, use the solution Sklages provided, it worked for me and for at least one other user of tgicl I know of.

                        Comment


                        • #13
                          yes, i got the same error and i downloaded the cap3 specific to my machine as suggested by Sklages.
                          It seemed to have worked as it didn't gave the same error message about cap3 but crashed after 1.5-2 days with error message something as follows:

                          >>> Assemble [] started at
                          -Waiting for all children to finish before starting last child!!
                          -Waiting for all children to finish!!

                          Process terminated with an error at step 'Assemble' !

                          TGICL encountered an error at Step 'Aseemble'

                          I don't have any idea whether it crashed on assembling or last step of clustering when it seems to be waitng for last thread to return?

                          Originally posted by Emanuel Heitlinger View Post
                          donniemarco,
                          did you read the whole thread? Look at my post about progblems with cap within tgicl! If yours are is the same problem, use the solution Sklages provided, it worked for me and for at least one other user of tgicl I know of.

                          Comment


                          • #14
                            That's a problem with the TGICL, not with cap3 itself, it seems.
                            I ran into problems as well with TGICL, that's why I don't use it anymore :-)

                            cheers,
                            Sven

                            Comment


                            • #15
                              Originally posted by sklages View Post
                              That's a problem with the TGICL, not with cap3 itself, it seems.
                              I ran into problems as well with TGICL, that's why I don't use it anymore :-)
                              Originally posted by donniemarco View Post
                              yes, i got the same error and i downloaded the cap3 specific to my machine as suggested by Sklages.
                              It seemed to have worked as it didn't gave the same error message about cap3 but crashed after 1.5-2 days with error message something as follows:

                              >>> Assemble [] started at
                              -Waiting for all children to finish before starting last child!!
                              -Waiting for all children to finish!!

                              Process terminated with an error at step 'Assemble' !

                              TGICL encountered an error at Step 'Aseemble'

                              I don't have any idea whether it crashed on assembling or last step of clustering when it seems to be waitng for last thread to return?
                              It could still be a problem with CAP3. All TIGCL knows is that CAP3 did not exit normally. There are error logs in each of the 'assemble_n' subdirectories. Look in those error logs to get a better idea of the actual error.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Genetic Variation in Immunogenetics and Antibody Diversity
                                by seqadmin



                                The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                                11-06-2024, 07:24 PM
                              • seqadmin
                                Choosing Between NGS and qPCR
                                by seqadmin



                                Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                10-18-2024, 07:11 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 11-08-2024, 11:09 AM
                              0 responses
                              31 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 11-08-2024, 06:13 AM
                              0 responses
                              26 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 11-01-2024, 06:09 AM
                              0 responses
                              32 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-30-2024, 05:31 AM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X