
  • Velvet & paired-ends

    Greetings to you all,

    First of all thank you for creating this forum, it seems like a great way to share knowledge.

    I am an undergraduate student starting a job at my university that involves assembling sequence data. I am an experienced Linux user, so I will try to use Linux programs.

    I have familiarized myself with Velvet a bit, and I would like to test its capabilities on some data that has previously been assembled well. Once I am sure I know how to use Velvet, I will also try Ray and SOAP.

    So I have 2 files:
    100611_s_4_1_seq_GDR-7.txt (1.6 GB)
    100611_s_4_2_seq_GDR-7.txt (1.6 GB)

    I have used this as a reference for my work:

    From what I understand, I need to merge the two files into one file. Is this correct?
    Code: 100611_s_4_1_seq_GDR-7.txt 100611_s_4_2_seq_GDR-7.txt 100611_s_4_both_seq_GDR-7.txt
    I have trouble understanding what subsetting means on that page. If we look at:
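For readers wondering what "merging" the two paired-end files amounts to, here is a minimal sketch in Python. It assumes both files have already been converted to 4-line FASTQ records and list the reads in the same order; the function names `read_fastq` and `interleave` are hypothetical and are not part of Velvet. The output alternates read 1 and read 2 of each pair, which is the layout an interleaved paired-end file uses.

```python
from itertools import islice


def read_fastq(handle):
    """Yield one 4-line FASTQ record at a time as a list of lines."""
    while True:
        # Normalise line endings so the last record always ends with a newline.
        record = [line.rstrip("\n") + "\n" for line in islice(handle, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(path1, path2, out_path):
    """Write read 1, read 2, read 1, read 2, ... and return the number of pairs."""
    pairs = 0
    with open(path1) as f1, open(path2) as f2, open(out_path, "w") as out:
        second = read_fastq(f2)
        for rec1 in read_fastq(f1):
            rec2 = next(second, None)
            if rec2 is None:
                raise ValueError("second file has fewer reads than the first")
            out.writelines(rec1)
            out.writelines(rec2)
            pairs += 1
        if next(second, None) is not None:
            raise ValueError("second file has more reads than the first")
    return pairs
```

The sketch refuses to continue if the two files hold different numbers of reads, since a silent mismatch would pair the wrong mates.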
    17. Do the subsetting. Soon we will compare the single ended assembly to the paired-end assembly. In order for the comparison to be fair, we must use the same total number of reads. Therefore each paired end file will contain 1/4 of the reads:
    It seems like they are trying to compare single ended with paired-end. I am only doing paired-end, do I need to do subsetting?
    Final graph has 302039 nodes and n50 of 175, max 1779, total 5984104, using 13024530/17609332 reads
    Now one more thing, after running velveth_de and velvetg_de, at the end of velvetg_de it tells me how many nodes have been created. Is that the number of contigs? How do I interpret that last line?

    I am using the -shortPaired option for velveth.

    My last question is, has anyone here used consed? I just wanted to ask since I have some problems setting up that program.

    Thank you.
    Last edited by AdrianP; 04-23-2011, 10:42 AM.

  • #2
    if you don't want to compare paired end to single end you don't need to do "subsetting".

    the last line just tells you your N50, how many reads were used (you can also ask velvet to output an UnusedReads.fa file), etc. I am not sure whether the number of nodes is the number of your contigs. You can check that with grep ">" -c contigs.fa.
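The same check can be sketched in Python for anyone who prefers a script to grep. The function name `count_fasta_records` is hypothetical; it simply counts header lines, on the assumption that every contig in contigs.fa starts with a ">" line.

```python
def count_fasta_records(path):
    """Count the sequences in a FASTA file by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Comparing this count with the node count in Velvet's last line shows whether the two numbers agree for a given run.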


    • #3
      Thank you for your previous reply.
      Is there any way to see which contig is largest? Somehow sort contigs by their size?


      • #4
        sure there are a lot of ways. ;-) Use a script, you even have the (length + kmer -1) in the id of the contig so it is really easy.
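As one possible way to do this, the sketch below sorts contigs by size in Python. It measures each sequence directly instead of parsing the length out of the contig ID, so it makes no assumption about the ID layout; the function names `read_fasta` and `contigs_by_length` are hypothetical.

```python
def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line and name is not None:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contigs_by_length(path):
    """Return (name, length in bp) pairs, longest contig first."""
    lengths = [(name, len(seq)) for name, seq in read_fasta(path)]
    return sorted(lengths, key=lambda item: item[1], reverse=True)
```

The first element of the returned list is the largest contig, for example `contigs_by_length("contigs.fa")[0]`.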

        here are some perl scripts that might help:


        • #5
          If Velvet assembles poor (small) contigs when other programs with same settings (coverage and insertsize) do much much better, what can be my conclusions?

          By the way, most of those scripts are for FASTA and I have FASTQ. Is there a script that converts between them?



          • #6

            first hit in google. ;-) Also, trimming "programs" normally take fastq as input and output fasta.

            velvet needs good coverage to do well because it's de Bruijn based. Since I don't know what data you are running velvet on or what you expect, I can't give more specific help. Try smaller kmers, try different parameters, try multiple kmers....
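For illustration, a FASTQ to FASTA conversion can be sketched in a few lines of Python. This assumes plain 4-line FASTQ records (no wrapped sequence lines); the function name `fastq_to_fasta` is hypothetical and the sketch simply drops the quality values.

```python
def fastq_to_fasta(fastq_path, fasta_path):
    """Convert 4-line FASTQ records to FASTA and return the number of reads written."""
    count = 0
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        while True:
            header = src.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate blank lines between or after records
            if not header.startswith("@"):
                raise ValueError("expected a FASTQ header starting with '@'")
            sequence = src.readline()
            src.readline()  # the '+' separator line
            src.readline()  # the quality line, discarded
            dst.write(">" + header[1:].rstrip("\n") + "\n")
            dst.write(sequence.rstrip("\n") + "\n")
            count += 1
    return count
```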
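Trying several k-mer lengths can be scripted. The sketch below only builds the velveth and velvetg command lines for each k and does not run anything. The helper name `velvet_commands`, the output directory prefix and the insert length are hypothetical placeholders; the insert size of this library is not given in the thread, so it has to be supplied by the user. It assumes an interleaved FASTQ file and uses odd k values only.

```python
def velvet_commands(reads, kmers, insert_length, prefix="velvet_k"):
    """Build velveth/velvetg argument lists for each k-mer length (dry run only)."""
    commands = []
    for k in kmers:
        if k % 2 == 0:
            raise ValueError("use an odd k-mer length")
        out_dir = f"{prefix}{k}"  # hypothetical directory naming scheme
        commands.append(["velveth", out_dir, str(k), "-shortPaired", "-fastq", reads])
        commands.append(
            ["velvetg", out_dir, "-exp_cov", "auto", "-ins_length", str(insert_length)]
        )
    return commands


if __name__ == "__main__":
    # Print the commands so they can be inspected before running them by hand.
    for cmd in velvet_commands("reads.fastq", [21, 25, 31], insert_length=300):
        print(" ".join(cmd))
```

Keeping one output directory per k makes it easy to compare the last line of each velvetg run afterwards.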


            • #7
              One thing that is puzzling me is that Genegenious takes a few days to assemble the data that velvet assembles in 10-15 minutes. Is it normal that velvet runs so fast?


              • #8
                i have no clue what genegenious is or what algorithm it is based on. If it is an overlap-based method, then yes, it is possible; it depends on the kmer, the number of reads you have, expected coverage, read length .....


                • #9
                  I was wondering a bit more about Velvet's last line output.

                  What is n50? (I understand that it is a measurement of quality, the higher the better???)
                  What is max?
                  What is total?
                  What are nodes? (this isn't as important since i believe it is related to the graph that velvet builds, and I do not use the graph, just the final contigs.fa file. Should I use the graph?)

                  Also, what is the difference between shortpaired and shortpaired2? Something to do with insert libraries...

                  Thanks a lot!


                  • #10
                    come on, do a bit more research on your own. :P

                    read the velvet paper:

                    You don't want to use the graph and if you know what a graph is you should also know what nodes are. Nodes are the vertices of a graph.
                    shortpaired2 is the same as shortpaired but for a separate insert size library (also stated in the manual).


                    • #11
                      I have read the manual a few times, but what I was asking is what "separate insert size library" means. Separate from what?

                      As for the research paper, I had a look at it before and I can't find a definition of n50; it jumps straight into using that term, even in the abstract. I would not have asked if I had not done the research myself.

                      I found the answer here after googling n50


                      • #12
                        yes, it is the first hit when you google "definition: N50"

                        separate to "shortpaired" I assume. So you can use 2 different PE libraries with different insert sizes, but maybe I am wrong.

                        Total is their calculated base pairs total, but keep in mind that the bp length of each transcript is the "real" bp length minus kmer plus 1, as mentioned somewhere.^^

                        max might be the longest contig, but I don't remember it exactly, since you need to do more statistics anyway. ;-)
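To make the three numbers concrete, here is a minimal sketch that computes n50, max and total from a list of contig lengths. It uses the common definition of N50 (the length of the contig at which the running sum of lengths, longest first, reaches half of the total); the function name `assembly_stats` is hypothetical. The result is in whatever unit the input lengths use, so lengths taken from contigs.fa give values in bp.

```python
def assembly_stats(lengths):
    """Return n50, max and total for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = 0
    for length in ordered:
        running += length
        # N50 is the first length at which the running sum covers half the total.
        if running * 2 >= total:
            n50 = length
            break
    return {"n50": n50, "max": ordered[0] if ordered else 0, "total": total}
```

For example, contigs of lengths 2, 3, 4, 5 and 6 have a total of 20; the two longest (6 and 5) already cover more than half of that, so the N50 is 5.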


                        • #13
                          Originally posted by AdrianP View Post
                          If Velvet assembles poor (small) contigs when other programs with same settings (coverage and insertsize) do much much better, what can be my conclusions?
                          Could be many things: appropriate vs inappropriate settings, more/less sensitive to data error vs coverage, or simply wrong/right tool for this particular job.

                          You also could have one tool making a decent size but completely incorrect assembly, with another making a cautious but correct assembly. N50/size isn't everything. Can you validate your results somehow, e.g. against another closely-related known genome?

                          Either way, getting the best out of a dataset may require months of trial and error, tweaking etc. Even getting a particular tool to run properly and produce decent output can take weeks and can make a massive difference - the DBG assemblers all seem to have glass jaws. Pre-filtering the data seems to make a massive difference to most though.

                          Incidentally, I don't have a lot of experience with velvet in particular, since it's simply too heavy for my project (>1GBase genome)


                          • #14
                            Yeah actually my next genome to work with is a mitGenome and is about 70k, pretty cool.

                            I will start working with consed, not an easy program to work with but as I understand incredibly useful.


                            • #15
                              Originally posted by AdrianP View Post
                              Greetings to you all,

                              After making sure I know how to use velvet I will also try Ray and SOAP.

                              Hi !

                              I am the author of Ray so if you have any question, ask away.

                              Basically, with Ray, you will need to convert your two files to fasta or fastq format.

                              There is a script in maq for that.

                              maq-0.7.1/scripts/ export2std 100611_s_4_1_seq_GDR-7.txt > 100611_s_4_1_seq_GDR-7.txt.fastq
                              maq-0.7.1/scripts/ export2std 100611_s_4_2_seq_GDR-7.txt > 100611_s_4_2_seq_GDR-7.txt.fastq

                              Ray is available at

                              Then, using Ray, you assemble these reads:

                              mpirun -np 8 Ray -k 31 -p 100611_s_4_1_seq_GDR-7.txt.fastq 100611_s_4_2_seq_GDR-7.txt.fastq -o Ray-test-1.4.0

                              I encourage you to explore the files written by Ray:

                              ls Ray-test-1.4.0.*

                              see http://denovoassembler.sourceforge.n...00000000000000

