
  • De novo assembly using Velvet: any idea why such small Kmers with long reads?

    Greetings!

    I am doing a de novo assembly of a 23Mb genome using Velvet 1.2.10 and quality-trimmed MiSeq (Illumina) reads that average about 180bp in length. I have assembled different individuals of this species before with 100bp reads, and the optimal kmer size always came in around 61, giving the best N50s and a very good max. contig size. However, with this assembly the optimal kmer sizes (for N50) are really low, in the low/mid 30s. Velvet estimated the kmer coverage as averaging 23X.

    Results here (kmer size is on the x-axis; the left y-axis is for the max. contig size ("max," red line) and the right y-axis is for the N50s (blue line)):

    I am concerned about how good an assembly would be for such a large genome and such a small kmer, and I also just wonder why, with longer reads, I need smaller kmers.

    Thanks!
    Last edited by Genomics101; 02-21-2014, 03:51 PM. Reason: spelling, added kmer coverage detail
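For readers comparing the N50 values discussed in this thread: N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal Python sketch follows; the function name and the toy contig lengths are made up for illustration and are not taken from the poster's data.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    # Walk from the longest contig down until half of the total is covered.
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths, not from the thread).
toy_lengths = [100, 80, 50, 30, 20, 10, 10]
print(n50(toy_lengths))
```

Because N50 only looks at the length distribution, a higher N50 says nothing by itself about whether the contigs are correct.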

  • #2
    Your coverage is too low. At 23x coverage with 100bp reads and k=60, you'll only get a kmer depth of about 40% of that, or around 10x, which will give a fragmented assembly that misses low-depth areas and makes it hard to distinguish valid kmers from error kmers.

    Edit:

    Oops, I see now that the old assemblies were 100bp and the new ones are 180bp. Still, the point remains that as you increase K you decrease kmer depth, and you appear to have too little data for that to help. What's the insert size distribution and quality distribution? I've seen a lot of MiSeq libraries get made with insert sizes shorter than read length. So, also, you might consider adapter-trimming based on kmers before quality trimming.
    Last edited by Brian Bushnell; 02-21-2014, 02:57 PM.
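The arithmetic behind the "40%, or around 10" figure is the usual relation between base coverage C, read length L and kmer size k: kmer coverage = C * (L - k + 1) / L. A small Python sketch, assuming every read has the same length (the function name is illustrative only):

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected kmer coverage for reads of uniform length.

    Each read of length L contains L - k + 1 kmers, so the base coverage
    is scaled by (L - k + 1) / L.
    """
    if k > read_length:
        return 0.0  # a read shorter than k contributes no kmers
    return base_coverage * (read_length - k + 1) / read_length


# The case from this post: 23x base coverage, 100bp reads.
for k in (31, 61, 91):
    print(k, round(kmer_coverage(23, 100, k), 1))
```

Quality-trimmed reads vary in length, so for real data this is only an approximation.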



    • #3
      Originally posted by Genomics101:
      Velvet estimated the coverage averaging 23X.
      Are you talking about kmer coverage or nucleotide coverage?

      In other words, what is the cov value of your biggest contigs? (Those values show kmer coverage.)



      • #4
        Kmer coverage



        • #5
          Originally posted by Genomics101:
          Kmer coverage
          Okay, then what the person in post #2 said doesn't apply, because they assumed nucleotide coverage. You need to go to higher kmers. A kmer coverage higher than 20 is a waste; you need to aim for between 10 and 20.

          Use VelvetOptimiser and try kmers up to 160-180; you might see a second peak in N50, which is common.



          • #6
            @Brian Bushnell The mean insert size is 407 with an SD of 130, and the 23X is kmer coverage; the base coverage is about 43X. The per-sequence quality is between 36 and 38 for almost all of the reads. There are no adaptors, as they are removed by the Illumina 1.9 pipeline.
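As a rough check on these numbers, the relation kmer coverage = base coverage * (L - k + 1) / L can be inverted to see which k would land in the 10-20X kmer coverage range suggested in post #5. The sketch below assumes all reads are exactly 180bp, which quality-trimmed reads are not, so treat the result as a ballpark; the function name is illustrative only.

```python
def k_for_target_kmer_coverage(base_coverage, read_length, target_kmer_coverage):
    """Largest odd k whose expected kmer coverage is at least the target.

    Inverts Ck = C * (L - k + 1) / L, assuming uniform read length.
    """
    k = read_length + 1 - read_length * target_kmer_coverage / base_coverage
    k = int(k)  # round down so the coverage does not fall below the target
    if k % 2 == 0:
        k -= 1  # Velvet works with odd kmer sizes
    return max(k, 1)


# Values reported in this thread: ~43X base coverage, ~180bp reads.
for target in (20, 10):
    print(target, k_for_target_kmer_coverage(43, 180, target))
```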



            • #7
              @AdrianP Thanks for your reply, but I actually did kmers (with a larger gap between them) all the way up to 191 in the initial analysis:



              • #8
                Originally posted by Genomics101:
                @Brian Bushnell The mean insert size is 407 with an SD of 130, and the 23X is kmer coverage; the base coverage is about 43X. The per-sequence quality is between 36 and 38 for almost all of the reads. There are no adaptors, as they are removed by the Illumina 1.9 pipeline.
                Okay, with a base coverage of 43X you do not want Velvet. De Bruijn graph (DBG) assemblers need high coverage for repeat resolution.

                My advice is to use SeqPrep to merge your reads. After using it you should have 3 files: forward, reverse, and merged. Feed those to the MIRA assembler, which is an overlap-layout-consensus (OLC) assembler, and I expect you will get better results.
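How many pairs can actually be merged depends on how often the fragment is shorter than the two reads combined. A back-of-the-envelope Python sketch, assuming insert sizes are normally distributed with the mean and SD reported in post #6, that both reads have the same length, and a hypothetical minimum overlap of 10bp (an arbitrary choice for illustration, not a SeqPrep setting):

```python
import math


def fraction_of_pairs_overlapping(read_length, insert_mean, insert_sd, min_overlap=10):
    """Estimate the fraction of read pairs whose mates overlap.

    A pair overlaps by at least min_overlap bases when the fragment is no
    longer than 2 * read_length - min_overlap. Insert sizes are modelled
    as a normal distribution, which is only an approximation.
    """
    threshold = 2 * read_length - min_overlap
    z = (threshold - insert_mean) / insert_sd
    # Normal CDF expressed with the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Numbers from this thread: ~180bp trimmed reads, insert mean 407, SD 130.
print(round(fraction_of_pairs_overlapping(180, 407, 130), 2))
```

Under these assumptions only about a third of the trimmed pairs would overlap, so the forward and reverse files would still hold most of the data.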



                • #9
                  Originally posted by Genomics101:
                  @Brian Bushnell The mean insert size is 407 with an SD of 130, and the 23X is kmer coverage; the base coverage is about 43X. The per-sequence quality is between 36 and 38 for almost all of the reads. There are no adaptors, as they are removed by the Illumina 1.9 pipeline.
                  Hmm, well in that case, I am surprised that the N50 is best at such a low K. Unless the coverage is highly non-uniform, as can happen if data is over-amplified. Do you have a kmer-depth histogram?



                  • #10
                    Originally posted by Brian Bushnell:
                    Hmm, well in that case, I am surprised that the N50 is best at such a low K. Unless the coverage is highly non-uniform, as can happen if data is over-amplified. Do you have a kmer-depth histogram?
                    To obtain that, I can recommend:


                    CMD:
                    ./kmergenie --diploid -k <higher_kmer> -e 1 -l <lower_kmer> -t <cpu_threads> -o <output_name> <read_location>

                    Start with 101 as the higher kmer and 41 as the lower one, and see what the graphs look like.



                    • #11
                      Originally posted by Brian Bushnell:
                      Hmm, well in that case, I am surprised that the N50 is best at such a low K. Unless the coverage is highly non-uniform, as can happen if data is over-amplified. Do you have a kmer-depth histogram?



                      • #12
                        Ah - what I actually mean is, a graph for a fixed kmer length (of, say, 31) where the X axis is depth and Y axis is number of kmers found at that depth, both log-scale. Ideally, you should have a sharp peak at some depth (maybe 40) and it should drop dramatically on the left and right.

                        My attachment shows the 31-mer frequency histogram for E. coli synthetic reads. You can see a main peak at about 200 and a few repeat peaks after that. If the data were real and had uneven coverage, there would be a broad peak rather than a sharp one.

                        FYI, I generated this with the 'khist.sh' script in the BBMap package and plotted it in Excel.
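For anyone unsure what such a histogram contains, here is a minimal Python sketch of the idea: count how often each kmer occurs, then count how many distinct kmers occur at each depth. This is not how khist.sh works internally; it keeps everything in memory, does not merge a kmer with its reverse complement, and uses made-up toy reads, so it only suits small tests.

```python
from collections import Counter


def kmer_depth_histogram(reads, k=31):
    """Return sorted (depth, number_of_distinct_kmers) pairs."""
    kmer_counts = Counter()
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    # Histogram over the counts: depth -> how many kmers have that depth.
    depth_histogram = Counter(kmer_counts.values())
    return sorted(depth_histogram.items())


# Toy reads (hypothetical), with a short k so that kmers repeat.
toy_reads = ["ACGTACGTAC", "CGTACGTACG", "TTTTACGTAC"]
for depth, n_kmers in kmer_depth_histogram(toy_reads, k=5):
    print(depth, n_kmers)
```

Plotting depth against the number of kmers on log-log axes gives the kind of graph described above.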
                        Last edited by Brian Bushnell; 02-21-2014, 04:33 PM.



                        • #13
                          I also have the option of using the longer and more uniform untrimmed reads, but the quality is pretty questionable:



                          I tried doing an assembly with these and got better N50s at very high kmers (~99-135), but the kmer depth has a weird bell-curve relationship with kmer size rather than a direct one. The lowest kmer I tried (45) had a coverage of only 1.2, and the kmer coverage peaked at 41.5X at kmer = 93 (which also gave a relatively good N50 of ~20kb). Also, I am very wary of using data with so many errors.



                          • #14
                            Read length uniformity shouldn't matter to Velvet. Trimming to ~180bp seems like overkill for data of that quality; I would probably try trimming to something very conservative like Q10. Excessive trimming can also cause biases.
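To make the difference between a Q30 and a Q10 cutoff concrete, here is a minimal Python sketch of the simplest kind of 3' quality trimming: drop bases from the end of the read while their Phred score is below the threshold. It assumes Phred+33 encoded quality strings; real trimmers typically use windows or other algorithms, and the read and function names are made up for illustration.

```python
def error_probability(phred_quality):
    """Convert a Phred score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-phred_quality / 10)


def trim_3prime(sequence, quality_string, min_quality, phred_offset=33):
    """Trim bases from the 3' end while their quality is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality_string[end - 1]) - phred_offset < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: eight Q40 bases ('I'), then one Q20 ('5') and one Q10 ('+').
seq, qual = "ACGTACGTAC", "IIIIIIII5+"
for cutoff in (30, 10):
    trimmed_seq, _ = trim_3prime(seq, qual, cutoff)
    print(cutoff, error_probability(cutoff), len(trimmed_seq))
```

Q10 corresponds to a 1 in 10 chance of error and Q30 to 1 in 1000, so the Q30 cutoff removes far more of each read tail.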



                            • #15
                              Originally posted by Brian Bushnell:
                              Trimming to ~180bp seems like overkill for data of that quality; I would probably try trimming to something very conservative like Q10.
                              Thanks. I didn't trim by length, but by quality (Q30 cutoff), and most of the reads just came out at around 180bp. But doing a less strict trimming may be the answer.

                              Since I have your very helpful attention here, and since I am doing the assembly with the untrimmed reads, do you have a suggestion for a good way to assess how the sequencing errors are affecting the accuracy of the contigs? Should I just BLAST a few regions I have done with Sanger sequencing?

