
  • sklages
    replied
    Hi Edge,

    You should have a look at their web site, http://compbio.dfci.harvard.edu/tgi/software/ .
    Geo Pertea, who has not worked on TGICL for years now, fixed some things in the source code in 2008 after I ran into problems with large (454) datasets.

    I compiled it (years ago) under 32-bit Linux.
    It runs quite fine on 64-bit systems, but with limitations. If you have huge datasets, I would not use TGICL.

    regards,
    Sven



  • edge
    replied
    Hi Sven,

    Would you mind emailing the zmsort source code to me?
    I am trying to run tgicl on my server.
    However, it still fails because zmsort does not install properly there.
    I'm using x86_64.
    Many thanks, and I look forward to hearing from you.

    best regards
    Edge



  • sklages
    replied
    Originally posted by Old guy View Post
    How do you remove poly A's before assembly? Don't want to use SeqClean because you have to use FASTA files.
    What do you want to use?

    Primer/adaptor clipping can easily be done directly on the SFF files, but polyA clipping is best done on FASTA, because you can use your program of choice, e.g. cross_match, for finding polyA.

    If you use MIRA, you can even let MIRA do this for you ...
    Last edited by sklages; 07-07-2010, 10:43 PM. Reason: ...
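A minimal sketch of polyA clipping on FASTA input, for illustration only. The names `trim_poly_tail` and `read_fasta` and the minimum run length of 10 are assumptions made here; the sketch removes only exact homopolymer runs at the read ends, which is simpler than what cross_match, SeqClean, or MIRA do.

```python
import re

def trim_poly_tail(seq: str, min_run: int = 10) -> str:
    """Remove a trailing poly-A run and a leading poly-T run of at least min_run bases."""
    # Trailing poly-A (read in the mRNA sense orientation).
    seq = re.sub(r"[Aa]{%d,}$" % min_run, "", seq)
    # Leading poly-T (read from the opposite strand).
    seq = re.sub(r"^[Tt]{%d,}" % min_run, "", seq)
    return seq

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Real 454 reads often carry sequencing errors inside the tail, so a production tool would allow mismatches within the run; this sketch does not.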



  • AndreaB
    replied
    Originally posted by johnwhitaker View Post
    They want me to compare the expression levels between two sets of data by comparing the number of reads that were used to make up matching contigs. The coverage isn't that great, so some genes have multiple contigs. I can tell this because there are many cases where several smaller contigs from one sample (but still in the 100's) align with a larger one in the other.
    John,

    There may be another way to assess the expression level. We also work with plants without a sequenced genome and tried read matching to A. thaliana sequences, and it worked beautifully in our case. Basically, we used the A. thaliana coding sequences as a reference and matched reads with BLAST and BLAT. Palmieri and Schlotterer (2009) showed that BLAT will work reasonably well with 454 reads since they are long enough. In our hands, BLAST and BLAT give essentially the same result, but BLAT is much faster. The output in this type of analysis is based on the Arabidopsis AGIs. Once we know our genes of interest, we retrieve the contig from a contig database using BLAST for further work in the non-model species.
    If you are interested in pursuing this kind of analysis, I could post further details or you could check http://www.plantphysiol.org/cgi/cont...p.110.159442v1
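The read-matching idea above can be sketched as a per-gene read count over tabular hit output. This is only an illustration under stated assumptions: it expects the standard 12-column BLAST tabular format (BLAT can write the same layout), keeps each read's best hit by bit score, and the function name `count_reads_per_gene` and the E-value cutoff are hypothetical choices, not the procedure used in the paper linked above.

```python
import csv
from collections import Counter

def count_reads_per_gene(tab_lines, max_evalue=1e-10):
    """Count reads per reference ID, using each read's best hit by bit score."""
    best = {}  # read ID -> (bit score, reference ID)
    for row in csv.reader(tab_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        read_id, ref_id = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit too weak to count (cutoff is an assumption)
        if read_id not in best or bitscore > best[read_id][0]:
            best[read_id] = (bitscore, ref_id)
    return Counter(ref for _, ref in best.values())
```

To compare two samples, the raw counts would still need to be scaled by the total number of reads in each sample.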



  • Old guy
    replied
    How do you remove poly A's before assembly? Don't want to use SeqClean because you have to use FASTA files.



  • akira2291
    replied
    Originally posted by kmcarr View Post
    John,

    You are absolutely on the right track in wanting to do a single, unified assembly of the data. We do exactly the type of analysis you are trying on a very regular basis. We haven't used Newbler for transcript assembly in quite a while. Here is the procedure we use:

    Clean the hell out of your raw sequences: trim polyA (or polyT), vector/adapter sequences, and low-quality and low-complexity regions. It certainly helps to know what procedures/kits/adapters were used in creating the cDNA library used for 454 sequencing so that you can limit screening steps to just those. We first run cross_match to do vector screening. The screened output is then fed to SeqClean. SeqClean (http://compbio.dfci.harvard.edu/tgi/software/) is a pipeline originally created at TIGR for cleaning EST sequences.

    After cleaning the reads are fed into the assembly pipeline TGI Clustering Tools (TGICL, also available at the URL above). This is another pipeline first developed by TIGR for clustering and assembling ESTs for their Gene Index project. It calculates pairwise similarity scores for all possible pairwise comparisons. It then performs a transitive clustering of the reads based on these similarity scores. Finally, it assembles each cluster using CAP3. We use parameters a little more stringent than the defaults (minimum overlap and percent identity). At this stage any singletons are set aside and not considered further. All of the contigs created are then assembled together using CAP3, with more relaxed parameters than the first round. You will still end up with multiple contigs which are very similar.
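The transitive clustering step can be illustrated with a small union-find sketch. This shows the general single-linkage idea only, not TGICL's implementation; the function name `transitive_clusters` is hypothetical, and the input is assumed to be read pairs that already passed the overlap and identity thresholds.

```python
def transitive_clusters(pairs):
    """Group read IDs so that reads linked by any chain of similar pairs share a cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = {}
    for read in list(parent):
        clusters.setdefault(find(read), set()).add(read)
    return list(clusters.values())
```

Because membership is transitive, a single highly expressed or repetitive sequence can pull many reads into one very deep cluster, which is then handed to CAP3 as a whole.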

    The two stage assembly does add an extra layer of complexity when you are trying to track reads. Since the assembly components of the second round would be contigs themselves you have to track back to which reads made up those contigs from the first round assembly.
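A rough sketch of that read tracking, assuming the contig membership of both rounds has already been extracted into dictionaries (how to get it out of the CAP3 output files is not shown here). The function name `reads_per_final_contig` and the dictionary layout are hypothetical.

```python
def reads_per_final_contig(round1, round2):
    """Expand second-round contigs (built from first-round contigs) into read sets.

    round1: first-round contig ID -> iterable of read IDs
    round2: second-round contig ID -> iterable of first-round contig IDs
    """
    result = {}
    for final_id, members in round2.items():
        reads = set()
        for contig_id in members:
            reads.update(round1[contig_id])
        result[final_id] = reads
    return result
```

First-round contigs that were not merged in the second round keep their original read lists and would have to be carried over separately.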

    If you decide that you do not want to do an entirely new assembly, I do have an alternative. As you have discovered, you will never be able to make a 1-to-1 matching of contigs, but you could try to create groups of contigs from the two assemblies. A useful program for this is blastclust, which is part of the standard NCBI blast toolkit. The grouping can be very stringent (e.g. only finding orthologous sequences) or more relaxed (grouping sequences from gene families), depending on how you adjust the two primary scoring parameters, -L and -S. In a situation like yours you will have to be careful with the -L parameter. This parameter controls what percentage of the shorter sequence must overlap the longer one. Blastclust was written assuming that people would be comparing complete sequences (transcripts or proteins), so that one sequence should be 'contained' within the other. This is not true for your incomplete transcript assemblies.
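To see why -L matters for partial contigs, here is a small arithmetic sketch of the coverage rule as described in the post. It is not blastclust's code; the function name `passes_length_coverage` and the example lengths are made up for illustration.

```python
def passes_length_coverage(aln_len, len_a, len_b, min_cov):
    """True if the aligned region covers at least min_cov of the shorter sequence."""
    return aln_len / min(len_a, len_b) >= min_cov

# Two partial contigs of 800 bp and 1200 bp from the same transcript that
# overlap by only 300 bp cover 300 / 800 = 0.375 of the shorter one.
strict = passes_length_coverage(300, 800, 1200, min_cov=0.9)   # False
relaxed = passes_length_coverage(300, 800, 1200, min_cov=0.3)  # True
```

Under a strict coverage requirement the two fragments would not be grouped even though they come from the same gene, so the threshold likely has to be lowered for incomplete assemblies.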

    I rambled on for quite some time here, I hope you find some of this information useful.
    Hi kmcarr,

    I have also been working with a few assemblers over the last weeks, TGICL among them. Using cross_match in combination with SeqClean is a very good idea, which I will try out soon; thanks for that advice. You said in your post that you use more stringent parameters for TGICL than the defaults. Can you tell me which ones, and why?

    I also have a question about the process you described; maybe I misunderstood. Do you run TGICL first and then use CAP3 again in a second round of clustering? If so, can you tell me the advantage of that? Is it generally useful, or specific to the circumstances of the first post in this thread?

    I'm thankful for your advice and the good idea.

    take care, akira.



  • Khanjan
    replied
    Originally posted by dbrami View Post
    Have you tried using the '-large' flag?
    Thanks a lot, it worked !

    Cheers,
    Khanjan



  • sklages
    replied
    As an assembler I could recommend MIRA3 (http://sourceforge.net/apps/mediawik...itle=Main_Page). It does a good job. But I am afraid 16G is not really much for assembling 3 million sequences ... (try "miramem" from the package to estimate RAM usage).

    You could try to cluster your data and assemble every cluster separately (e.g. http://code.google.com/p/wcdest/ and then cap3 or phrap).

    If your library is not normalised, you will run into problems with either approach :-)

    my 2p,
    Sven



  • dbrami
    replied
    Have you tried using the '-large' flag?



  • Khanjan
    replied
    Assembly Failure

    Hey guys,

    I am assembling 3,101,509 454 FLX reads. These are cDNA sequences. I tried the assembly with both Newbler and Velvet, but both fail due to memory issues. I am using Red Hat Linux 5 with 16 GB RAM.

    Is there a better EST assembler for this? Or can I work this out in Newbler itself?

    Thanks in Advance,
    Khanjan



  • sklages
    replied
    miramem could help estimating the amount of memory needed.

    cheers,
    Sven



  • ikim
    replied
    Hi, I've run into a very similar problem using TGICL/CAP3 for assembly, but I didn't think it was an issue with memory. Would anyone know if MiraEST can handle a set of 4 million+ reads? We are operating on a system with 64 GB of usable RAM.

    Thanks!
    IK



  • sklages
    replied
    Oh well, now I have seen donniemarco's post. Memory :-)
    16G is quite small ...

    cheers,
    Sven



  • sklages
    replied
    Originally posted by kmcarr View Post
    It could still be a problem with CAP3. All TGICL knows is that CAP3 did not exit normally. There are error logs in each of the 'assemble_n' subdirectories. Look in those error logs to get a better idea of the actual error.
    Well, you are probably right ... I did not remember it very well :-)

    I had a problem with zmsort,

    [...]
    WAITING for all children to finish before starting last child!
    WAITING for all children to finish!
    <<< --- clustering [Pimp454FLXcomplete_2008-04-07_2008-04-08.fasta]
    finished at May 22 03:40:16 2008
    Error getting file size for 'zdir_cluster_1_002.Z'
    Error at command:
    zmsort -f11 -n -r -o zdir_cluster_1 -s 700 cluster_1/*.Z

    Process terminated with an error, at step 'clustering'!
    [...]
    cap3 has problems assembling very deep clusters; it runs into memory problems. You are right, having a closer look at the logs is a good idea :-)

    cheers,
    Sven



  • donniemarco
    replied
    Thanks, I looked into the error file and it seems to have run out of memory.

    "************
    Ran out of memory: 10911460019 bytes requested.
    Error! cap3 failure detected (code=256) on: CL1
    *************"

    I also ran it on another machine with 16 GB of RAM, but it seems to fail there as well. I think I will try to upgrade the machine.

    Originally posted by kmcarr View Post
    It could still be a problem with CAP3. All TGICL knows is that CAP3 did not exit normally. There are error logs in each of the 'assemble_n' subdirectories. Look in those error logs to get a better idea of the actual error.
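A small sketch for checking those logs in bulk, assuming the subdirectories match `assemble_*` and that the message format is the one quoted above. The log file names are not given in the thread, so the sketch simply scans every file in those directories; the function names are hypothetical.

```python
import re
from pathlib import Path

# Matches the message quoted in the error file above.
OOM_RE = re.compile(r"Ran out of memory: (\d+) bytes requested")

def find_oom_requests(text):
    """Return the byte counts from every 'Ran out of memory' message in text."""
    return [int(n) for n in OOM_RE.findall(text)]

def scan_assembly_dirs(run_dir="."):
    """Map each file under assemble_* subdirectories to the out-of-memory requests it reports."""
    hits = {}
    for path in sorted(Path(run_dir).glob("assemble_*/*")):
        if path.is_file():
            found = find_oom_requests(path.read_text(errors="replace"))
            if found:
                hits[str(path)] = found
    return hits
```

For the message quoted above, 10911460019 bytes is about 10.2 GiB in a single request, on top of whatever cap3 had already allocated, which is consistent with the failure on a 16 GB machine.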

