  • How to set -max_coverage in velvetg


    I am assembling bacterial genomes (~6 Mb) using 250 bp paired-end MiSeq data. I have tried a bunch of assemblers (idba_ud, MIRA, Ray, SOAPdenovo, ABySS, to name a few...), but am getting reasonably good results using good old velvet (~360 contigs, N50 = 40 kb). But I have a question about how to set the velvetg parameter -max_coverage. Its value has a large effect on the resulting number of contigs and on the total number of bases in the assembly (i.e. the assembled genome size). Am I correct in thinking that many of these high-coverage nodes are errors (or at least error-prone, like repeat elements etc.) and should be excluded for a better assembly?

    I estimate the coverage distribution (in R, using plotrix) from the stats.txt file after a preliminary run of:
    velvetg velvet_big_127 -cov_cutoff auto -exp_cov auto
    It is then fairly easy to calculate the weighted mean coverage for -exp_cov and to set a reasonable value for -cov_cutoff, but there is often a long tail in the distribution, meaning there is a small number of nodes with very high coverage.
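The weighted mean described above can be computed directly from stats.txt. A minimal sketch, assuming the standard velvet stats.txt header where "lgth" is the node length and "short1_cov" is the short-read k-mer coverage column (the helper name is my own, not a velvet tool):

```python
# Sketch: length-weighted mean node coverage from velvet's stats.txt.
# Assumes the usual tab-separated stats.txt with a header row that
# includes "lgth" and "short1_cov" columns.
import csv

def weighted_mean_coverage(stats_lines):
    """stats_lines: iterable of tab-separated lines, header first."""
    total_len, weighted = 0, 0.0
    for row in csv.DictReader(stats_lines, delimiter="\t"):
        lgth = int(row["lgth"])
        weighted += lgth * float(row["short1_cov"])
        total_len += lgth
    return weighted / total_len

# Usage: weighted_mean_coverage(open("velvet_big_127/stats.txt"))
```

Weighting by node length keeps a few very long nodes from being swamped by the many short, low-coverage ones.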

    Generally, what is a good way to determine a sensible value for -max_coverage?

    Many thanks! Reuben

  • #2
    You should run velvetg WITHOUT the two auto options first if you want to examine the coverage plot in R.

    There is no need to set -max_coverage unless you know you have a HIGH level of undesirable contamination. Those high-coverage contigs are important - they will likely be your ribosomal RNA loci and insertion sequences.

    If you are using 250 bp PE MiSeq data, I would first run FLASH to pre-combine any overlapping pairs. Also, if Nextera was used to prep your samples and your reads are all exactly 250 bp long, then your provider probably did NOT remove all the Nextera adaptors, so you will have to do it yourself.

    For the record, 360 contigs for a 6 Mbp bacterium is pretty good, considering it will probably have a bunch of insertion sequences. Pseudomonas?


    • #3
      Hello and thanks for the advice Torst,

      Firstly, yes, these are Pseudomonas genomes - well deduced! How did you guess??

      So, just to clarify: the reads have been adapter-trimmed using cutadapt and quality-trimmed using ConDeTri, so there is considerable variation in read length in the input FastQ files.

      Doing as you suggest, without setting any -max_coverage, gives 487 contigs (after scaffolding and gap-filling with SSPACE and GapFiller)... I'm just a bit surprised it's not lower than that, considering the length of the reads.

      I did in fact try FLASH, and it combined a whopping ~80% of my reads. But running velvet on these data did not improve the contig numbers/N50 etc. It also led to really strange coverage distributions, where a huge proportion of the nodes had a coverage of only 1 and were thus discarded when cov_cutoff was set. I suspect this may be due to FLASH not working optimally - perhaps combining reads that should not be combined? Have you used FLASH on MiSeq reads with success before? I can't help feeling that something is not quite right with it yet!

      Thanks again, and any feedback or tips are most welcome


      • #4
        The guess of Pseudomonas was based on genome size, number of contigs, and googling your name :-) I had similar results for the Pa AES strains.

        Nextera seems to produce a lot of very short reads, many of which completely overlap each other. The default parameters of FLASH are not suitable for 250 bp MiSeq reads. You need to change -M, -r, -f, and -s to bigger values, as the defaults are tuned for 2x100 bp.

        Ultimately, we are finding many Nextera 2x250 runs aren't that much better than TruSeq 2x150 or 2x100, due to the smaller fragment sizes. It's going to be a while before people fine-tune it. If you can afford to do TruSeq size selection and then sequence 2x250, you might get better results.


        • #5
          Thanks again Torst, unfortunately the sequencing stage of our project is completed - we'll have to work with what we've got!

          I have now tried using FLASH with more realistic input parameters (-m 100 -r 207 -f 302 -s 132), but again there is no 'improvement' (insofar as N50, number of contigs etc. go) in the final assembly... I find this a bit puzzling, since there is much talk in the literature about how much improvement can be gained by combining reads like this.
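A quick sanity check on those parameters, assuming FLASH's -r is the read length and -f the mean fragment length: the expected pairwise overlap is roughly the two read lengths minus the fragment length. The helper below is my own illustration, not part of FLASH:

```python
# Sketch: expected overlap between a read pair for a given fragment
# length. With fragments near 302 bp and reads up to ~250 bp, most
# pairs overlap heavily, which is consistent with FLASH merging ~80%.
def expected_overlap(read_len1, read_len2, fragment_len):
    # Reads longer than the fragment they came from must overlap by
    # the excess; otherwise there is a gap and no overlap.
    return max(0, read_len1 + read_len2 - fragment_len)

print(expected_overlap(250, 250, 302))  # 198 bp of expected overlap
print(expected_overlap(100, 100, 300))  # 0 -> pair cannot be merged
```

So a -m (minimum overlap) of 100 looks comfortable for full-length 250 bp pairs at that fragment size.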

          I may open this up as a new thread in fact...



          • #6

            If you have generated larger reads by combining the pairs, you could/should also set the k-mer size higher. Did you do that?



            • #7
              Hi Boetsie, well I'm running velvet with a k-mer of around 127 currently, so quite large already. I tried upping it a bit, with no improvement in the assembly statistics. But there is considerable variation in read length in both my unmerged paired-end reads and my merged reads (post-FLASH), like this:

              unmerged reads:
              Min. 1st Qu. Median Mean 3rd Qu. Max.
              50.0 177.0 222.0 202.4 249.0 251.0

              extendedFrags (i.e. 'FLASHed' reads):
              Min. 1st Qu. Median Mean 3rd Qu. Max.
              100.0 195.0 241.0 243.9 294.0 402.0

              Please correct me if I am wrong, but can velvet use reads that are shorter than the k-mer length? There is clearly a significant fraction of reads in both datasets that are relatively short. For example, if I up the k-mer value to around the 200 mark, does this render ~one quarter of the reads unusable? The k-mer coverage plots (from the stats.txt velvet output) for assemblies using these FLASHed datasets look very strange too, with a very large number of nodes having a coverage of only 1...

              I'm still trying to get my head around what FLASH is actually doing to the reads (I mean specifically to my reads) - perhaps it's stitching reads together very badly, resulting in a bunch of reads that do not assemble?

              I just find it weird that FLASH certainly seems to merge a lot of my reads (i.e. look at the stats above), but that this has no effect (or a negative effect) on the resulting assembly...

              Any ideas, then please do let me know!



              • #8
                My experience with assembling 2x250 bp metagenomic data with idba_ud is that I saw little to no improvement using longer reads unless I increased the maxK to ~230-250. This will hog memory, but the assembly metrics (N50, N75) get significantly bigger.


                • #9
                  That's interesting - so had you also run FLASH or something similar? Was the distribution of read lengths similar to what I have?

                  Also, idba_ud has a max k-mer of 124. How did you recompile it with a larger max k-mer? Can you remember which file you had to modify? The README says to increase kMaxShortSequence in src/sequence/short_sequence.h, but not how to raise the max k-mer value.



                  • #10
                    Nope - I just threw the two reads in directly without end-assembling them. I haven't tried assembling them together or anything yet.

                    To increase maxK, I modified src/basic/kmer.h and increased kNumUint64 from 4 to 8. Be careful doing this - it really increases the required memory! With my dataset, I don't run this on machines with <32 GB of RAM. That said, the effect on N50 is pretty dramatic with a larger maxK. Let me know if you need help getting that to work and I'll see about digging up my actual modified file.
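The arithmetic behind that edit, assuming idba_ud packs bases at 2 bits each into 64-bit words (so 32 bases per uint64): the raw capacity is kNumUint64 x 32, and the enforced cap sits a few bases below it (124 with the stock 4 words, per the post above). A sketch, not code from idba_ud itself:

```python
# Sketch: why bumping kNumUint64 from 4 to 8 roughly doubles maxK.
# Each base takes 2 bits, so one uint64 word packs 32 bases.
BASES_PER_UINT64 = 64 // 2

def raw_kmer_capacity(k_num_uint64):
    # Raw bases that fit in the packed k-mer; idba's enforced maximum
    # is a few below this (e.g. maxK 124 with 4 words).
    return k_num_uint64 * BASES_PER_UINT64

print(raw_kmer_capacity(4))  # 128 -> stock build, maxK 124
print(raw_kmer_capacity(8))  # 256 -> enough headroom for k ~250
```

Doubling the words also roughly doubles per-k-mer memory, which matches the >32 GB RAM requirement mentioned above.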


                    • #11
                      Thanks! That worked a treat. And running it up to k=250 produced the assembly with the highest N50 I've managed to produce thus far (>60 kb for a 6 Mb bacterial genome), across a range of assemblers (and I've tried loads).

                      There are still a lot of short contigs though:
                      Number of contigs: 847
                      Number of contigs > 500 nt: 382
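For anyone wanting to reproduce these metrics, N50 is the contig length L such that contigs of length >= L together cover at least half the assembly. A minimal sketch (my own helper, not from any assembler's toolkit):

```python
# Sketch: N50 from a list of contig lengths. Sort descending and walk
# down until the running total reaches half the assembly size.
def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

print(n50([100, 200, 300, 400, 500]))  # 400 for this toy assembly
```

The same list also gives the contig-count cutoffs quoted above, e.g. `sum(1 for l in lengths if l > 500)`.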

                      I'm going to try a read-correction procedure (Reptile probably) to see if I can bring this number down.

                      Thanks for your input, and if you have any other tips or tricks please do let me know - it's a great way of learning!



                      • #12
                        Great! I have noticed that too - for some reason, with these changes the minimum contig length cutoff doesn't seem to work properly. I've just been post-filtering to exclude the small stuff under 500 bp or 1 kb.
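The post-filtering step is simple enough to do without external tools. A minimal sketch of dropping short contigs from a FASTA (the function names are my own; for real work, seqtk or a Biopython one-liner would do the same):

```python
# Sketch: keep only contigs of at least min_len bases from a FASTA.
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def filter_fasta(records, min_len=500):
    return [(h, s) for h, s in records if len(s) >= min_len]

# Usage: filter_fasta(read_fasta(open("contig.fa")), 500)
```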

                        I'm curious - how did your scaffolds.fa look? One thing I've seen is that there are no Ns, suggesting really good overlap and coverage. Did you see something similar?


                        • #13
                          Yeah, there are zero Ns in the scaffolds.fa file in my assemblies also. I'm going to run SSPACE and GapFiller anyway to see if there is further improvement.

                          Re the small contigs (<500 bp): generally speaking, what are these? Could they be single reads that don't map anywhere, possibly due to a large number of errors in the read? And is it generally acceptable to exclude them from final reports of assembly metrics?

                          I had a look at what effect both read-correction and read-merging had on the assembly. I tried both Reptile and Coral read-correction, with both merged and unmerged datasets where possible (merging with FLASH). Generally, these pre-processes had little effect on the overall assembly metrics:

                          No read correction, no FLASH
                          # scaffolds = 847
                          # scaffolds >500 bp = 382
                          N50 = 61299
                          genome size (Mb) = 6.341

                          Reptile, no FLASH
                          # scaffolds = 845
                          # scaffolds >500 bp = 387
                          N50 = 62820
                          genome size (Mb) = 6.343

                          Coral, no FLASH
                          # scaffolds = 963
                          # scaffolds >500 bp = 452
                          N50 = 51815
                          genome size (Mb) = 6.335

                          No read correction, FLASH
                          # scaffolds = 1060
                          # scaffolds >500 bp = 397
                          N50 = 60802
                          genome size (Mb) = 6.401

                          I'm still a little puzzled as to why these pre-processes fail to improve the assembly greatly (Reptile read-correction does produce a slightly better N50, but Reptile is not very easy to use...).

                          Using FLASH on Reptile-corrected reads would be interesting, but is super-faffy because of file output issues.



                          • #14

                            I have two paired-end MiSeq fastq files (2 x 2,200,250 reads) where the read length is 26 bases, and I would like to perform de novo assembly on these reads. I just installed velveth/velvetg from the Galaxy Tool Shed. I used the tools on Galaxy but keep getting no result. I figured this could be due to the parameters I am using; I used the defaults, since I am not sure which ones to change to suit my short read length.

                            I decided to do this from the command line:
                            1. I shuffled both fastq files into one.
                            2. Then: velveth auto 31,42,2 -fastq -shortPaired1 shuffled_sequences.fq
                            I got one auto_31 folder, but now I am not sure what to use for VAL1 and VAL2 in the command
                            velvetg auto_31 -exp_cov VAL1 -ins_length1 VAL2

                            I would appreciate any kind of help either how to work with Galaxy or Command line options.



                            • #15

                              I'm not very familiar with galaxy, but I am familiar with velvet.

                               The k-mer size has to be an odd number, and you can't have a k-mer longer than your reads.

                               So if your reads are 26 bp, the longest k-mer you can try is 25, and 31,42,2 will not work.

                              When you first run velvetg you can set the -exp_cov to auto, and the program will calculate a value for exp_cov.

                               ins_length for velvet is the size of the two reads in the pair plus the distance between them. You should have some estimate of the insert size for the library from your sequence provider; otherwise, velvet has a script that can calculate an insert length from the assembled read pairs.
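As a worked example of that definition (illustrative numbers, not from the poster's library):

```python
# Sketch: insert length as velvet defines it -- both read lengths plus
# the gap between the reads' inner ends.
def insert_length(read_len1, read_len2, inner_gap):
    return read_len1 + read_len2 + inner_gap

print(insert_length(250, 250, 100))  # 2x250 bp reads, 100 bp gap -> 600
print(insert_length(26, 26, 150))    # the 26 bp reads above, 150 bp gap
```

Note the gap can be negative for overlapping pairs, in which case the insert is shorter than the summed read lengths.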

                              Hope this helps,

