  • jvanleuven
    Member
    • Nov 2011
    • 23

    First time with Velvet

    Hi,

I'm new to SEQanswers and bioinformatics and need a little help with Velvet. I keep getting large blocks of Ns in my PE assembly. The Ns go away when I do an assembly with just the single reads.

Workflow:

    ~30 million PE 100bp reads
lowest average Sanger quality score is 34

filter using the FASTX-Toolkit: remove reads < 20 bp in length and reads with quality < 30 over 90% of the read

also had to remove reads that failed the Illumina chastity filter

remove non-PE (orphaned) reads after filtering and interleave the surviving pairs into one file
now have ~20 million reads

velveth with k = 31 and -shortPaired
found average ins_length using velvetg: 240, sd = 50
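One thing worth checking at this stage: Velvet's exp_cov and cov_cutoff are expressed in k-mer coverage, not base coverage; the Velvet manual gives the conversion Ck = C * (L - k + 1) / L. A small sketch using the read length and k from this thread (the 500X/75X figures are the base coverages estimated below):

```python
# Convert base coverage to Velvet's k-mer coverage: Ck = C * (L - k + 1) / L.
def kmer_coverage(base_cov, read_len, k):
    return base_cov * (read_len - k + 1) / read_len

print(kmer_coverage(500, 100, 31))  # 350.0
print(kmer_coverage(75, 100, 31))   # 52.5
```

So a genome at 500X base coverage is only ~350X in 31-mer coverage, which matters when passing explicit values to velvetg.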

    experimenting with parameters, I found the two genomes of interest to have coverages of 500 and 75.

    I have been able to find 4-5 contigs per genome that cover nearly the entire length. The problem is that the contigs contain large blocks of Ns. For example

one assembly has 3120 contigs > 500 bp in length, but there are 11634 blocks of Ns with an average length of 41 bp

in the 4 contigs that cover one of the genomes, there are 86 blocks of Ns that average 56 bp in length.
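For reference, N-block statistics like these can be computed with a few lines of Python (a hypothetical helper, not part of Velvet):

```python
import re

def n_block_stats(seq):
    """Count runs of Ns and their mean length in an assembled sequence."""
    blocks = [len(m.group()) for m in re.finditer(r"[Nn]+", seq)]
    mean = sum(blocks) / len(blocks) if blocks else 0.0
    return len(blocks), mean

# Toy scaffold with two N gaps of 40 and 60 bp:
count, mean = n_block_stats("ACGT" + "N" * 40 + "ACGT" + "N" * 60 + "ACGT")
print(count, mean)  # 2 50.0
```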

How do I get rid of these Ns?

    Thanks,

    JT
  • nickloman
    Senior Member
    • Jul 2009
    • 355

    #2
    This is a result of Velvet scaffolding contigs using paired-end information. Set "-scaffolding" to "no" if you don't want it to do that.
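A minimal sketch of the re-run, assuming the output directory and insert length from the thread (output_31/ is a placeholder name):

```shell
# Re-run velvetg without scaffolding; contigs will no longer be
# joined across gaps padded with Ns.
velvetg output_31/ -scaffolding no -ins_length 240
```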


    • jvanleuven
      Member
      • Nov 2011
      • 23

      #3
      okay, thanks.

when scaffolding is off, I don't get large contigs. I guess I'll have to try changing parameters again.


      • themerlin
        Member
        • Feb 2010
        • 51

        #4
I would also look at changing your k-mer value. For 100 bp reads, I use a k-mer value of 57; 31 seems more suited to 50 bp reads.


        • aloliveira
          Member
          • Aug 2010
          • 47

          #5
You can tune the exp_cov and cov_cutoff parameters. You can plot the stats file from the Velvet output in R and check for the best values for these parameters.
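One way to do this without eyeballing a plot is to take the length-weighted mode of node coverage from stats.txt, so long, well-covered nodes dominate over short noise nodes. A sketch (the tab-separated stats_txt string below is a tiny synthetic stand-in for a real Velvet stats.txt; column names follow Velvet's header):

```python
import csv
import io

# Synthetic stand-in for output_57/stats.txt (tab-separated with header).
stats_txt = (
    "ID\tlgth\tout\tin\tlong_cov\tshort1_cov\n"
    "1\t5000\t1\t1\t0\t49.8\n"
    "2\t300\t1\t1\t0\t3.1\n"   # short, low-coverage noise node
    "3\t8000\t1\t1\t0\t50.4\n"
)

weights = {}
for row in csv.DictReader(io.StringIO(stats_txt), delimiter="\t"):
    cov = round(float(row["short1_cov"]))                  # bin coverage to integers
    weights[cov] = weights.get(cov, 0) + int(row["lgth"])  # weight by node length

exp_cov = max(weights, key=weights.get)
print(exp_cov)  # 50
```

The same idea is what a peak in the R plot would show; if the curve just decreases with no peak, the k-mer coverage may be too fragmented to call.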


          • jvanleuven
            Member
            • Nov 2011
            • 23

            #6
thanks for the helpful suggestions, I really appreciate it.

I tried k-mer size 57

then tried keeping everything set to auto

            $velvetg output_57/ -scaffolding off -ins_length 250 -exp_cov auto -cov_cutoff auto

            Final graph has 469034 nodes and n50 of 124, max 2556, total 20289293, using 7937596/38767538 reads

when I plot the stats in R, I just get a decreasing curve with no apparent peaks. Before, I found the expected coverage by BLASTing the contigs against my reference genomes and calculating the average coverage of the contigs that hit.

            jt


            • themerlin
              Member
              • Feb 2010
              • 51

              #7
You could also try VelvetOptimiser to help tune settings to your dataset.
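A sketch of such a run (flag names from memory of VelvetOptimiser's usage, so treat them as assumptions; reads.fq is a placeholder):

```shell
# Scan odd k-mer sizes from 51 to 71 and let VelvetOptimiser pick the
# best assembly according to its optimisation function.
VelvetOptimiser.pl -s 51 -e 71 -f '-shortPaired -fastq reads.fq'
```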


              • jvanleuven
                Member
                • Nov 2011
                • 23

                #8
I tried VelvetOptimiser, but it seems to favor low coverage. I have not tried specifying my expected coverage.


                • jvanleuven
                  Member
                  • Nov 2011
                  • 23

                  #9
one of the genomes is at 400X, the other is near 150X I think. When I assemble with exp_cov 400 and cov_cutoff 50, the max contig size increases to ~8000 bp.

Why won't Velvet join these contigs together better? I tried cutting the amount of data down to 0.25 of the original, but that decreased the max contig size and N50.

thanks again.

I know what I'll be doing this weekend.

                  jt


                  • aloliveira
                    Member
                    • Aug 2010
                    • 47

                    #10
Why won't velvet put these contigs together better?
I did not understand this statement.


                    • jvanleuven
                      Member
                      • Nov 2011
                      • 23

                      #11
sorry, I meant to ask why the contigs are so small. 8000 bp is not even close to the size of the genome, and 400X coverage implies that there is sufficient data.

                      jt


                      • aloliveira
                        Member
                        • Aug 2010
                        • 47

                        #12
                        Hi,

So, high coverage is good, but it does not guarantee a single giant contig. Several biological factors have an important impact on genome reconstruction. A very important one is repetitive elements and the size of those elements in the sequenced genome.

With mate-pair or paired-end libraries you can resolve repeats and assemble more contiguous sequences (scaffolds), but if a repeat is longer than the insert size of your library, the reconstruction will be fragmented.

Another factor: high coverage also introduces more sequencing errors into the data, and the assembly N50 can decrease as a result. Each assembler (ABySS, SOAPdenovo, Velvet) has a different peak of N50 versus coverage. Once that peak is reached, the contig N50 does not get bigger; it stagnates at a plateau. So coverage is an important factor up to a certain point (roughly 30-50X); beyond that, it is useless.

Here are two papers that talk about this:
Regards,

                        André


                        • jvanleuven
                          Member
                          • Nov 2011
                          • 23

                          #13
                          okay, thanks. I'll read the papers carefully.

                          I understand that optimal assembly occurs around 50X.

                          What I'm not entirely clear on yet is the effects of having scaffolding on or off. With scaffolding off, is Velvet basically assembling single reads instead of taking advantage of the PEs? If so, I need it on to get large contigs. But when I get large contigs with scaffold on, then I have lots of Ns in the sequence.

I think that I may attempt to assemble all the reads as singles, then map the resulting contigs to the large contigs with N gaps. Maybe the small contigs will cover the N gaps.
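An alternative to mapping: split the scaffolds at their N runs and treat the pieces as contigs, then use those for downstream work (a hypothetical helper, not part of Velvet):

```python
import re

def split_scaffold(seq, min_gap=10):
    """Split a scaffold into contigs at runs of at least min_gap Ns."""
    return [piece for piece in re.split("[Nn]{%d,}" % min_gap, seq) if piece]

print(split_scaffold("ACGTACGT" + "N" * 41 + "TTTTGGGG"))
# ['ACGTACGT', 'TTTTGGGG']
```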

                          jt


                          • cliffbeall
                            Senior Member
                            • Jan 2010
                            • 144

                            #14
I would try being very aggressive in quality filtering/trimming. Most accounts that I have seen show no improvement beyond 50X coverage, so you can afford to be selective with your input data.
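As a back-of-envelope check on how selective you can be: with ~400X you could keep roughly one read in eight and still sit near 50X (a hypothetical helper; a tool such as seqtk sample could then apply the fraction):

```python
def subsample_fraction(current_cov, target_cov):
    """Fraction of reads to keep to reduce current_cov to target_cov."""
    return min(1.0, target_cov / current_cov)

print(subsample_fraction(400, 50))  # 0.125
```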


                            • gringer
                              David Eccles (gringer)
                              • May 2011
                              • 845

                              #15
So with 400X coverage and 20M 100 bp reads, that's a genome size of about 5 Mb. Is that correct?
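A quick sanity check of that estimate (numbers taken from the thread):

```python
reads = 20_000_000   # ~20M reads after filtering
read_len = 100       # bp
coverage = 400       # deeper of the two genomes
genome_size = reads * read_len / coverage
print(genome_size)   # 5000000.0, i.e. about 5 Mb
```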

