  • Aman Mahajan
    replied
    Software used: SOAPdenovo-Trans



  • nangillala
    replied
    Hi,
    first of all: Which program gave you this output?
    Originally posted by Aman Mahajan View Post
    Size_includeN 14238304
    Size_withoutN 14238304

    Nucleotide_A 3733290 26.22%
    Nucleotide_C 3403704 23.91%
    Nucleotide_G 3387000 23.79%
    Nucleotide_T 3714310 26.09%
    GapContent_N 0 0.00%
    I'm just guessing here, but in your example the nucleotides A, C, G and T add up to the total size, and it seems you have no Ns at all, so the size including Ns is the same as the size without them.

    Originally posted by Aman Mahajan View Post
    N10 611 1677
    N20 420 4532
    N30 315 8483
    N40 250 13577
    N50 206 19868
    N60 174 27405
    N70 151 36212
    N80 134 46255
    N90 120 57488
    I _guess_ that your N50 is 206 here, because the N60 should be smaller, and so on. I don't know what the second number is. Maybe the number of contigs above this threshold, or something?
    Shouldn't this be in the documentation of the program you are using to generate this output?
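
    For what it is worth, here is a minimal sketch of how such a table could be produced from a plain list of contig lengths (the file name lengths.txt is hypothetical, one length per line); read that way, the second column would be the number of contigs needed to reach each threshold:

    Code:
    # total assembly size
    total=$(awk '{ s += $1 } END { print s }' lengths.txt)

    # walk the lengths from longest to shortest; the N50 is the first length at
    # which the running sum reaches half of the total, and n counts how many
    # contigs were needed to get there
    sort -rn lengths.txt | awk -v total="$total" '
        { sum += $1; n++ }
        sum * 2 >= total { printf "N50 %d %d\n", $1, n; exit }'

    The N10 to N90 rows would follow the same pattern, just with a different fraction of the total.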

    Hope this helps.



  • Aman Mahajan
    replied
    <-- Information for assembly Scaffold 'output.scafSeq' (cut_off_length < 100bp) -->

    Size_includeN 14238304
    Size_withoutN 14238304
    Scaffold_Num 69976
    Mean_Size 203
    Median_Size 154
    Longest_Seq 5423
    Shortest_Seq 100
    Singleton_Num 69976
    Average_length_of_break(N)_in_scaffold 0

    Known_genome_size NaN
    Total_scaffold_length_as_percentage_of_known_genome_size NaN

    scaffolds>100 69864 99.84%
    scaffolds>500 2964 4.24%
    scaffolds>1K 324 0.46%
    scaffolds>10K 0 0.00%
    scaffolds>100K 0 0.00%
    scaffolds>1M 0 0.00%

    Nucleotide_A 3733290 26.22%
    Nucleotide_C 3403704 23.91%
    Nucleotide_G 3387000 23.79%
    Nucleotide_T 3714310 26.09%
    GapContent_N 0 0.00%
    Non_ACGTN 0 0.00%
    GC_Content 47.69% (G+C)/(A+C+G+T)

    N10 611 1677
    N20 420 4532
    N30 315 8483
    N40 250 13577
    N50 206 19868
    N60 174 27405
    N70 151 36212
    N80 134 46255
    N90 120 57488


    Can anyone explain what Size_includeN means, and how the Size_withoutN number can be the same?

    And what is the N50 value of this result?



  • Nomijill
    replied
    Originally posted by eslondon View Post
    I have been playing around with ABYSS, SOAPdenovo and CLC Bio for a genome project. To cut a very long story short, these are our experiences.

    We started from a set of standard 200bp PE reads and a set of 5kb mate pair reads.

    -ABYSS: with our limited 5kb reads, we never managed to get ABYSS to use them properly for scaffolding. The contig N50 was a bit poor, whatever we tried. It took a fair while, and we never got it to parallelize.

    -SOAPdenovo: very fast, because using multiple threads is as simple as saying -p <number of processors>, and VERY good at scaffolding. The contig N50 was not great, but better than ABYSS (around 600bp).

    -CLC Bio: although it does not support scaffolding, it gave us by far the best N50 in terms of contigs (an N50 of 2.2Kb)

    In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb.

    Finally, we used the SOAPdenovo GapCloser to close gaps in the scaffolds produced, which removed about 25% of the Ns we had in the assembly!

    All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc.) pointed to CLCBio + SOAPdenovo as the best we had.

    Now we are going to throw more data at it, hoping for a much better assembly

    best regards

    Elia
    Update on the CLC bio de novo assembler: it now has scaffolding, it has the ability to control for bubble size, and it is faster than ever. I assembled 10 million paired-end MiSeq reads in 15 minutes on my 8GB laptop. This is in the new version 5.0. The memory footprint makes it possible to assemble on machines that would otherwise be too small. It is commercial, but a two-week trial is free, and the Genomics Workbench is very easy to use on Mac, Windows or Linux.



  • sagarutturkar
    replied
    Originally posted by westerman View Post
    #1) Run ABySS as SE (single end) or split your file into two parts. perl or awk would be my tools of choice for this.

    #2) Set them up as individual libraries, e.g., lib="libA libB" libA="001_R1.fastq 001_R2.fastq" libB="002_R1.fastq 002_R2.fastq"

    #3) I always do trimming pre-ABySS. What did you read about the '-q' flag? Did it mention trimming?
    Thank you very much. That was helpful.



  • westerman
    replied
    #1) Run ABySS as SE (single end) or split your file into two parts. perl or awk would be my tools of choice for this.

    #2) Set them up as individual libraries, e.g., lib="libA libB" libA="001_R1.fastq 001_R2.fastq" libB="002_R1.fastq 002_R2.fastq"

    #3) I always do trimming pre-ABySS. What did you read about the '-q' flag? Did it mention trimming?
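
    In case it helps, here is a rough sketch of #1 and #2 (file names are hypothetical, and the split assumes two-line FASTA records alternating forward/reverse, as in your example):

    Code:
    # 1) split an interleaved FASTA into forward and reverse read files
    awk 'NR % 4 == 1 || NR % 4 == 2' interleaved.fa > reads_1.fa
    awk 'NR % 4 == 3 || NR % 4 == 0' interleaved.fa > reads_2.fa

    # 2) run abyss-pe with each read set as its own paired-end library
    abyss-pe name=asm k=64 lib='libA libB' \
        libA='001_R1.fastq 001_R2.fastq' \
        libB='002_R1.fastq 002_R2.fastq'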



  • sagarutturkar
    replied
    Running ABySS

    Hi,

    Several people who posted in this thread were able to run ABySS successfully. I am a novice and have some questions about running ABySS. Please answer:

    Question 1:
    I want to use ABySS for paired-read assembly, but I have the paired reads (forward and reverse) in a single file. This is the file generated after quality trimming.

    The file structure is
    >001_forward
    ATGC.......
    >001_reverse
    ATGC....
    >002_forward
    ATGC....
    >002_reverse
    ATGC....

    How do I run ABySS on such a file? I need the command for this. Any suggestions?

    Question 2:

    I have paired end files for single genome. e.g. Genome X reads are
    001_R1.fastq 001_R2.fastq
    002_R1.fastq 002_R2.fastq
    003_R1.fastq 003_R2.fastq

    Do I need to treat each pair as a separate library? Or will
    abyss-pe name=ecoli k=64 in='001_R1.fastq 001_R2.fastq 002_R1.fastq 002_R2.fastq'
    work fine?

    Question 3:
    Does ABySS have automated quality trimming incorporated, or is it necessary to use quality-trimmed reads? I read somewhere that it has a -q flag.


    Thanks



  • jiltysequence
    replied
    Originally posted by seb567 View Post
    Yes, I think it is very clever to store genome variations as they are encountered.
    I've been hosting genome variations on a secure cloud server (have you heard of http://www.rackspace.com?). It would be interesting if some of us were able to collaborate and create some kind of archive. This would be a good step in making information, from basic to advanced, available to interested people of all shapes and sizes. What do you guys think?
    Last edited by jiltysequence; 06-23-2011, 10:16 AM.



  • tonybolger
    replied
    Originally posted by themwg View Post
    Could you elaborate on what you mean by frayed ropes turning into single contigs?
    This phrase 'frayed rope' refers to the shape of part of the assembly graph. If you have non-tandem repeats, you get a graph something like:

    Code:
    A--->     ---->E
         C--->D
    B--->     ---->F

    where the 'correct' paths are A->C->D->E and B->C->D->F, with C->D being a repeat.

    It appears that CLC tends to be overly aggressive for my taste, and collapses the A->C and B->C paths into a forced consensus, even in the presence of strong support for the different paths. Likewise D->E and D->F. Unfortunately, due to the lack of tuning options, this isn't easy to prevent. Check for Ns in the assembly - this might be an indicator.

    Faced with this situation, other assemblers usually produce 5 contigs, whereas CLC will produce 1. This has already caused us to closely investigate gene-family size differences (versus a related organism) which turned out to be genes that were merely 'merged' in the CLC assembly.

    Originally posted by themwg View Post
    We have been using CLCbio, though I worry there is some mystery to its algorithm. CLC seems to return much better N50 and max contig lengths than SOAP; CLC is also faster and able to handle significantly more data.
    Agreed on all points - the problem is one of correctness however.

    Originally posted by themwg View Post
    We have an insect genome of ~200 Mb, for which we use one Illumina paired-end lane (~200 million reads at 100bp) and one mate-pair lane (~50 million reads at 36bp, with a library size of 3kb). Approximately 75% of paired reads end up mapping. With 200bp minimum contigs we get an N50 of ~4kb.

    To compare to SOAP using our limited machine (44GB RAM), we can only process ~50 million reads; that subset with CLC gives an N50 of 1kb while SOAP gives an N50 of 300bp.
    This would tally with my experience.

    For SOAP assemblies, I would strongly recommend pre-filtering the reads by quality - it considerably reduces the memory footprint. Both assemblers may well give a better N50 with filtering. Still, I would expect CLC to beat SOAP by a factor of 5-10 in contig N50.
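
    A sketch of one way to do that pre-filtering, using seqtk purely as an example (the tool choice, file names and the 36bp cutoff are placeholders; any quality trimmer/filter would do):

    Code:
    # trim low-quality ends from each read, then drop reads shorter than 36bp
    seqtk trimfq reads.fastq | seqtk seq -L 36 - > reads.trimmed.fastq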

    SOAP contig N50 is somewhat hampered by the fact that it doesn't use pairing information at all until the scaffolding stage. It is also broken in other interesting ways, but there doesn't seem to be a perfect beast for the job. You might also want to give the new CLC v4 beta a spin - it doesn't work on very big assemblies, but 200 million reads may be OK.

    Originally posted by themwg View Post
    We then scaffold the CLC contigs using the mate-pair reads with SSPACE, which seems to do a very decent job (from the above example, post-SSPACE scaffolding gives an N50 of 88kb).
    Originally posted by themwg View Post
    Still, CLC users are rare (due to it being proprietary) and the inability to control k-mer size makes me wary. So I'd appreciate any further light on the subject.
    You can control the k-mer size with CLC, with -w, up to a maximum of 31 (at least in the version I'm using, 3.20) - unfortunately, it's about the only thing you can control.



  • themwg
    replied
    Originally posted by tonybolger View Post
    We've noticed a tendency for CLC de novo to oversimplify complex repeat-filled areas, turning 'frayed ropes' into single contigs. These don't tend to be in coding regions, so you won't find them with genes or RNA.
    Could you elaborate on what you mean by frayed ropes turning into single contigs?

    We have been using CLCbio, though I worry there is some mystery to its algorithm. CLC seems to return much better N50 and max contig lengths than SOAP; CLC is also faster and able to handle significantly more data.

    We have an insect genome of ~200 Mb, for which we use one Illumina paired-end lane (~200 million reads at 100bp) and one mate-pair lane (~50 million reads at 36bp, with a library size of 3kb). Approximately 75% of paired reads end up mapping. With 200bp minimum contigs we get an N50 of ~4kb.

    To compare to SOAP using our limited machine (44GB RAM), we can only process ~50 million reads; that subset with CLC gives an N50 of 1kb while SOAP gives an N50 of 300bp.

    We then scaffold the CLC contigs using the mate-pair reads with SSPACE, which seems to do a very decent job (from the above example, post-SSPACE scaffolding gives an N50 of 88kb). Still, CLC users are rare (due to it being proprietary) and the inability to control k-mer size makes me wary. So I'd appreciate any further light on the subject.



  • lcollado
    replied
    You could also try using the Columbus module from Velvet.



  • tonybolger
    replied
    Originally posted by Gators View Post
    Quick question in the same vein as this thread...

    I have some deep sequencing results from a virus-infected sample. We know the viral sequence - kinda. We know that there are differences in our reference sequence and what is actually in the cells. If I allow for a couple mismatches in the alignment I do with bowtie, I seem to have more or less complete coverage of the viral genome in our reads. I'd like to assemble the reads to get a "consensus" sequence of the virus. Any recommendations for what program to use for this small scale assembly? Reads are about 25 bp, total viral genome should be <10kb
    I assume a reference-based assembly would be OK. You just need to call the consensus on the alignment. You could try samtools, or the early steps of any SNP pipeline.
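
    A rough sketch of that route, assuming the reads are already aligned and sorted into a BAM against the viral reference; the exact flags differ between samtools/bcftools versions, so treat this as the shape of the pipeline rather than exact syntax:

    Code:
    # pile up the reads on the reference, call the consensus,
    # and write it out as a FASTQ consensus sequence
    samtools mpileup -uf viral_ref.fa aln.sorted.bam \
        | bcftools call -c \
        | vcfutils.pl vcf2fq > consensus.fq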



  • Gators
    replied
    Quick question in the same vein as this thread...

    I have some deep sequencing results from a virus-infected sample. We know the viral sequence - kinda. We know that there are differences in our reference sequence and what is actually in the cells. If I allow for a couple mismatches in the alignment I do with bowtie, I seem to have more or less complete coverage of the viral genome in our reads. I'd like to assemble the reads to get a "consensus" sequence of the virus. Any recommendations for what program to use for this small scale assembly? Reads are about 25 bp, total viral genome should be <10kb



  • tonybolger
    replied
    Originally posted by avtsanger View Post
    There is a version of SOAP that does (up to a k-mer of 63 I think). I e-mailed the authors who kindly provided it. Not sure if the current downloadable version is the most recent
    Right you are, sir - it's been out since yesterday; the limits are now k-mers of 31/63/127 using the various versions.

    But strangely, some (not all) of the versions require the Intel MKL library.



  • avtsanger
    replied
    Originally posted by tonybolger View Post

    Also, does soap actually use K-mer above 31?
    There is a version of SOAP that does (up to a k-mer of 63 I think). I e-mailed the authors who kindly provided it. Not sure if the current downloadable version is the most recent

