
  • #16
    Originally posted by Gators View Post
    Quick question in the same vein as this thread...

    I have some deep sequencing results from a virus-infected sample. We know the viral sequence - kinda. We know that there are differences between our reference sequence and what is actually in the cells. If I allow a couple of mismatches in the alignment I do with bowtie, I seem to have more or less complete coverage of the viral genome in our reads. I'd like to assemble the reads to get a "consensus" sequence of the virus. Any recommendations for what program to use for this small-scale assembly? Reads are about 25 bp; the total viral genome should be <10kb.
    I assume a reference-based assembly would be OK. You just need to call the consensus on the alignment. You could try samtools, or the early steps of any SNP pipeline.
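
To make "call the consensus on the alignment" concrete, here is a minimal sketch of a per-position majority vote. It is not what samtools does internally: it assumes ungapped alignments given as hypothetical `(start, sequence)` tuples, ignores base qualities and indels, and falls back to the lowercase reference base where no read covers a position.

```python
from collections import Counter

def consensus(reference, aligned_reads):
    """Majority-vote consensus from ungapped read alignments.

    aligned_reads: iterable of (start, sequence) with 0-based start
    positions on the reference (a hypothetical input format).
    """
    # One base counter per reference position
    columns = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(columns):
                columns[pos][base] += 1

    out = []
    for ref_base, counts in zip(reference, columns):
        if counts:
            # Most frequent observed base wins
            out.append(counts.most_common(1)[0][0])
        else:
            # No coverage: keep the reference base, lowercased as a flag
            out.append(ref_base.lower())
    return "".join(out)
```

With 25 bp reads and a genome under 10 kb this toy approach is fast enough, but a real pipeline should also weigh base and mapping qualities.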

    Comment


    • #17
      You could also try using the Columbus module from Velvet.
      L. Collado Torres, Ph.D. student in Biostatistics.

      Comment


      • #18
        Originally posted by tonybolger View Post
        We've noticed a tendency for CLC denovo to oversimplify complex repeat-filled areas, turning 'frayed ropes' into a single contig. These don't tend to be in coding regions, so you won't find them with genes or RNA.
        Could you elaborate on what you mean by frayed ropes turning into single contigs?

        We have been using CLCbio, though I worry there is some mystery to its algorithm. CLC seems to return much better N50 and max contig lengths than SOAP. CLC is also faster and able to handle significantly more data.

        We have an insect genome of ~200 Mb, for which we use one Illumina paired-end lane (~200 million reads at 100bp) and one mate-pair lane (~50 million reads at 36 bp, with a library size of 3kb). Approx 75% of paired reads end up mapping. With 200bp min contigs we get an N50 of ~4kb.

        To compare to SOAP using our limited machine (44GB RAM) we can only process ~50 million reads.. that subset with CLC gives an n50 of 1kb while SOAP gives n50 of 300bp.

        We then scaffold the CLC contigs using the mate-pair reads with SSPACE, which seems to do a very decent job (from the above example, post-SSPACE scaffolding gives N50=88kb). Still, CLC users are rare (due to it being proprietary) and the inability to control k-mer size makes me wary. So I'd appreciate any further light on the subject.

        Comment


        • #19
          Originally posted by themwg View Post
          Could you elaborate on what you mean by frayed ropes turning into single contigs?
          This phrase 'frayed rope' refers to the shape of part of the assembly graph. If you have non-tandem repeats, you get a graph something like:

          Code:
          A--->     ---->E
               C--->D
          B--->     ---->F

          where the 'correct' paths are A->C->D->E and B->C->D->F, with C->D being a repeat.

          It appears that CLC tends to be overly aggressive for my taste, and collapses the A->C and B->C paths into a forced consensus, even in the presence of strong support for the different paths. Likewise for D->E and D->F. Unfortunately, due to the lack of tuning options, this isn't easy to prevent. Check for Ns in the assembly - this might be an indicator.

          Faced with this situation, other assemblers usually produce 5 contigs, whereas CLC will produce 1. This has already caused us to closely investigate family number differences of related genes (vs a related organism) which turned out to be merely 'merged' in the CLC assembly.
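
As a rough illustration of where the "5 contigs" come from, the sketch below cuts the frayed-rope graph at every branch point and reports the maximal unbranched paths. It is a toy model, not the algorithm of any particular assembler; the function name and edge-list input are made up for the example, and it assumes an acyclic graph.

```python
from collections import defaultdict

def unbranched_contigs(edges):
    """Split a directed acyclic graph into maximal non-branching paths."""
    succ = defaultdict(list)
    pred = defaultdict(list)
    nodes = []
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        for n in (u, v):
            if n not in nodes:
                nodes.append(n)

    def starts_path(n):
        # A node continues an existing path only if it has exactly one
        # predecessor and that predecessor has exactly one successor.
        if len(pred[n]) != 1:
            return True
        return len(succ[pred[n][0]]) != 1

    contigs = []
    for n in nodes:
        if not starts_path(n):
            continue
        path = [n]
        # Extend while the path neither forks nor merges
        while len(succ[path[-1]]) == 1:
            nxt = succ[path[-1]][0]
            if len(pred[nxt]) != 1:
                break
            path.append(nxt)
        contigs.append(path)
    return contigs

# The frayed rope from the diagram: C->D is the shared repeat
frayed_rope = [("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("D", "F")]
```

Calling `unbranched_contigs(frayed_rope)` yields A, B, C->D, E and F as separate pieces, i.e. five contigs, whereas a forced consensus would report one.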

          Originally posted by themwg View Post
          We have been using CLCbio, though I worry there is some mystery to its algorithm. CLC seems to return much better N50 and max contig lengths than SOAP. CLC is also faster and able to handle significantly more data.
          Agreed on all points - the problem is one of correctness however.

          Originally posted by themwg View Post
          We have an insect genome of ~200 Mb, for which we use one Illumina paired-end lane (~200 million reads at 100bp) and one mate-pair lane (~50 million reads at 36 bp, with a library size of 3kb). Approx 75% of paired reads end up mapping. With 200bp min contigs we get an N50 of ~4kb.

          To compare to SOAP using our limited machine (44GB RAM) we can only process ~50 million reads.. that subset with CLC gives an n50 of 1kb while SOAP gives n50 of 300bp.
          This would tally with my experience.

          For SOAP assemblies, I would strongly recommend pre-filtering the reads by quality - it considerably reduces the memory footprint. Both assemblers may well give better N50 with filtering. Still, I would expect CLC to beat SOAP by a factor of 5-10 in contig N50.
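
A minimal sketch of what pre-filtering by quality could look like, for readers who have not done it before. It assumes plain 4-line FASTQ records with Phred+33 encoding, and the mean-quality threshold of 20 is only an illustrative default, not a recommendation from this thread. For paired data the two mates must be kept or dropped together, which this sketch does not handle.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(c) - offset for c in quality_line) / len(quality_line)

def filter_fastq(lines, min_mean_q=20, offset=33):
    """Yield (header, seq, plus, qual) for records passing the threshold.

    lines: iterable of FASTQ lines, strictly 4 lines per record.
    """
    records = [line.rstrip("\n") for line in lines]
    for i in range(0, len(records) - 3, 4):
        header, seq, plus, qual = records[i:i + 4]
        # Skip empty reads and reads whose mean quality is too low
        if qual and mean_phred(qual, offset) >= min_mean_q:
            yield header, seq, plus, qual
```

Dedicated trimmers do considerably more (adapter removal, sliding windows, pair handling), so treat this only as a picture of the idea.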

          SOAP contig N50 is somewhat hampered by the fact that it doesn't use pairing information at all until the scaffolding stage. It is also broken in other interesting ways, but there doesn't seem to be a perfect beast for the job. You might also want to give the new CLC v4 beta a spin - it doesn't work on very big assemblies, but 200 million reads may be ok.

          Originally posted by themwg View Post
          We then Scaffold the CLC contigs using the mate pair reads with SSPACE.. which seems to do a very decent job (from above example, post SSPACE scaffolding gives n50=88kb).
          Originally posted by themwg View Post
          Still, CLC users are rare (due to it being proprietary) and the inability to control k-mer size makes me wary. So I'd appreciate any further light on the subject.
          You can control the k-mer size with CLC, with -w, up to a max of 31 (at least in the version I'm using - 3.20). Unfortunately, it's about the only thing you can control.

          Comment


          • #20
            Originally posted by seb567 View Post
            Yes, I think it is very clever to store genome variations as they are encountered.
            I've been hosting genome variations on a secure cloud server (have you heard of http://www.rackspace.com?). It would be interesting if some of us were able to collaborate and create some kind of archive. This would be a good step toward making information, from basic to advanced, available to interested people of all shapes and sizes. What do you guys think?
            Last edited by jiltysequence; 06-23-2011, 10:16 AM.

            Comment


            • #21
              Running Abyss

              Hi,

              Several people who posted in this thread were able to run ABySS successfully. I am a novice and have some questions about running ABySS. Please answer:

              Question 1:
              I want to use ABySS for paired-read assembly, but I have the paired reads (forward and reverse) in a single file. This is the file generated after quality trimming.

              The file structure is
              >001_forward
              ATGC.......
              >001_reverse
              ATGC....
              >002_forward
              ATGC....
              >002_reverse
              ATGC....

              How do I run ABySS on such a file? I need the command for this. Any suggestions?

              Question 2:

              I have paired end files for single genome. e.g. Genome X reads are
              001_R1.fastq 001_R2.fastq
              002_R1.fastq 002_R2.fastq
              003_R1.fastq 003_R2.fastq

              Do I need to treat each pair as a separate library? Or, if I specify
              abyss-pe name=ecoli k=64 in='001_R1.fastq 001_R2.fastq 002_R1.fastq 002_R2.fastq'
              should that work fine?

              Question 3:
              Does ABySS have automated quality trimming built in, or is it necessary to use quality-trimmed reads? I read somewhere that it has a -q flag.


              Thanks

              Comment


              • #22
                #1) Run ABySS as SE (single end) or split your file into two parts. perl or awk would be my tools of choice for this.

                #2) Set them up as individual libraries, e.g., lib="libA libB" libA="001_R1.fastq 001_R2.fastq" libB="002_R1.fastq 002_R2.fastq"

                #3) I always do trimming pre-ABySS. What did you read about the '-q' flag? Did it mention trimming?
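
For the "split your file into two parts" option in #1, here is a small Python sketch as an alternative to perl or awk. It assumes the headers end in `_forward` / `_reverse` exactly as in the example file above; the function name and the in-memory list output are choices made for illustration.

```python
def split_interleaved_fasta(lines):
    """Split a FASTA with *_forward / *_reverse headers into two lists.

    Returns (forward_lines, reverse_lines); each list holds the header
    and sequence lines of one mate, in the original order.
    """
    forward, reverse = [], []
    target = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            # Route this record by the suffix of its header
            if line.endswith("_forward"):
                target = forward
            elif line.endswith("_reverse"):
                target = reverse
            else:
                raise ValueError("unexpected header: " + line)
        elif target is None:
            raise ValueError("sequence line before any header")
        target.append(line)
    return forward, reverse
```

Writing each returned list to its own file (one line per entry) gives two mate files that can then be listed as a pair, as in the `in='...'` example from Question 2.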

                Comment


                • #23
                  Originally posted by westerman View Post
                  #1) Run ABySS as SE (single end) or split your file into two parts. perl or awk would be my tools of choice for this.

                  #2) Set them up as individual libraries, e.g., lib="libA libB" libA="001_R1.fastq 001_R2.fastq" libB="002_R1.fastq 002_R2.fastq"

                  #3) I always do trimming pre-ABySS. What did you read about the '-q' flag? Did it mention trimming?
                  Thank you very much. That was helpful.

                  Comment


                  • #24
                    Originally posted by eslondon View Post
                    I have been playing around with ABYSS, SOAPdenovo and CLC Bio for a genome project. To cut a very long story short, these are our experiences.

                    We started from a set of standard 200bp PE reads and a set of 5kb mate pair reads.

                    -ABYSS: with our limited 5kb reads, we never managed to get ABYSS to use them properly for scaffolding. The Contig N50 was a bit poor, whatever we tried. It took a fair while, we never got it to parallelize

                    -SOAPdenovo: very fast because using multiple threads is as simple as saying -p number of processors, and VERY good at scaffolding. The Contig N50 was not great, but better than ABYSS (around 600bp)

                    -CLC Bio: although it does not support scaffolding, it gave us by far the best N50 in terms of contigs (an N50 of 2.2Kb)

                    In the end we used CLCBio contigs with SOAPdenovo for scaffolding, which got us a nice N50 of 8kb.

                    Finally we use the SOAPdenovo GapCloser to close GAPS in the scaffolds produced, which removed about 25% of the Ns we had in the assembly!

                    All the QC on these assemblies (mapping known genes, mapping RNA-Seq reads, etc) pointed to the CLCBio + SOAPdenovo as being the best we had.

                    Now we are going to throw more data at it, hoping for a much better assembly

                    best regards

                    Elia
                    Update on the CLC bio de novo assembler: it now has scaffolding and the ability to control bubble size, and it is faster than ever. I assemble 10 million paired-end MiSeq reads in 15 minutes on my 8GB laptop. This is in the new version 5.0. The memory footprint makes it possible to assemble on machines that would otherwise be too small. It is commercial, but a two-week trial is free, and the Genomics Workbench is very easy to use on Mac, Windows or Linux.

                    Comment


                    • #25
                      <-- Information for assembly Scaffold 'output.scafSeq'.(cut_off_length < 100bp) -->

                      Size_includeN 14238304
                      Size_withoutN 14238304
                      Scaffold_Num 69976
                      Mean_Size 203
                      Median_Size 154
                      Longest_Seq 5423
                      Shortest_Seq 100
                      Singleton_Num 69976
                      Average_length_of_break(N)_in_scaffold 0

                      Known_genome_size NaN
                      Total_scaffold_length_as_percentage_of_known_genome_size NaN

                      scaffolds>100 69864 99.84%
                      scaffolds>500 2964 4.24%
                      scaffolds>1K 324 0.46%
                      scaffolds>10K 0 0.00%
                      scaffolds>100K 0 0.00%
                      scaffolds>1M 0 0.00%

                      Nucleotide_A 3733290 26.22%
                      Nucleotide_C 3403704 23.91%
                      Nucleotide_G 3387000 23.79%
                      Nucleotide_T 3714310 26.09%
                      GapContent_N 0 0.00%
                      Non_ACGTN 0 0.00%
                      GC_Content 47.69% (G+C)/(A+C+G+T)

                      N10 611 1677
                      N20 420 4532
                      N30 315 8483
                      N40 250 13577
                      N50 206 19868
                      N60 174 27405
                      N70 151 36212
                      N80 134 46255
                      N90 120 57488


                      Can anyone explain what Size_includeN means, and why Size_withoutN is the same number?

                      And what is the N50 value of this result?

                      Comment


                      • #26
                        Hi,
                        first of all: Which program gave you this output?
                        Originally posted by Aman Mahajan View Post
                        Size_includeN 14238304
                        Size_withoutN 14238304

                        Nucleotide_A 3733290 26.22%
                        Nucleotide_C 3403704 23.91%
                        Nucleotide_G 3387000 23.79%
                        Nucleotide_T 3714310 26.09%
                        GapContent_N 0 0.00%
                        I'm just guessing here, but in your example the nucleotides A,C,G and T add up to the total size and it seems like you have no Ns in it, thus the number including Ns is the same.

                        Originally posted by Aman Mahajan View Post
                        N10 611 1677
                        N20 420 4532
                        N30 315 8483
                        N40 250 13577
                        N50 206 19868
                        N60 174 27405
                        N70 151 36212
                        N80 134 46255
                        N90 120 57488
                        I _guess_ that your N50 is 206 here because the N60 should be smaller and so on. I don't know what the second number is. Maybe number of contigs above this threshold or something?
                        Shouldn't this be in the documentation of the program you used to generate this output?
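
For reference, a minimal sketch of how an Nxx value is usually computed from a list of sequence lengths. Returning the number of sequences needed to reach the threshold alongside the length is my assumption about what the second column in the table above means; it fits the fact that those numbers grow from N10 to N90, but it is not confirmed by the program's documentation here.

```python
def nxx(lengths, fraction=0.5):
    """Return (length, count) for the given Nxx fraction.

    length: size of the sequence at which the running total, taken from
    longest to shortest, first reaches `fraction` of the total size.
    count: how many sequences were needed to get there.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("no lengths given")
```

For example, `nxx([10, 8, 6, 4, 2])` covers half of the 30 total bases after the two longest sequences, so the N50 is 8 with a count of 2.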

                        Hope this helps.

                        Comment


                        • #27
                          Software used: SOAPdenovo-Trans

                          Comment
