  • Assembling bacterial genome de novo

    Hi
    I am assembling a bacterial genome de novo. I have about 40 million Illumina reads, 75bp. These were paired-end, but for reasons I won't go into, the insert sizes were ~150bp, so the reads overlap.
    I have tried Velvet and MIRA to some extent. My best Velvet run (I used VelvetOptimiser) gave me 10,500 contigs with an N50 of 1,455, longest contig 14,500bp, 4.7M total bases in contigs, and 1,500 contigs greater than 1kb, spanning ~3M of the genome. The expected genome size is ~4.7M.
    Any recommendations on the next step to get to a build with fewer contigs of larger sizes?

    Thanks
    Noa
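
    For reference, the N50 quoted above can be computed from a list of contig lengths; a minimal sketch (the contig lengths below are made up for illustration, not the actual assembly):

    ```python
    def n50(lengths):
        """N50: the contig length at which contigs of that length or
        longer contain at least half of the total assembled bases."""
        total = sum(lengths)
        running = 0
        for length in sorted(lengths, reverse=True):
            running += length
            if running * 2 >= total:
                return length
        return 0

    # Hypothetical contig lengths, just to show the calculation
    contigs = [14500, 8000, 5000, 3000, 2000, 1455, 900, 600, 300]
    print(n50(contigs))  # 8000 for this toy list
    ```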

  • #2
    Those statistics are pretty poor for a typical bacterial genome, so you should be looking to improve that N50 dramatically.

    Use only a subset of those reads and aim for 30x-100x coverage (you could filter for the highest-quality reads).

    Too many reads will slow down your experiments and won't improve the assembly. In fact, too many reads can cause the assembly to fragment due to a build-up of erroneous k-mers.
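
    The coverage arithmetic behind that advice, using the numbers from this thread (75bp reads, ~4.7Mb expected genome), can be sketched as:

    ```python
    read_len = 75            # bp per read, from the original post
    genome_size = 4_700_000  # expected genome size in bp

    def coverage(n_reads):
        """Approximate nucleotide coverage for a given read count."""
        return n_reads * read_len / genome_size

    def reads_for_coverage(target):
        """Read count needed to hit a target coverage."""
        return int(target * genome_size / read_len)

    print(round(coverage(40_000_000)))  # all 40M reads: ~638x, far too much
    print(reads_for_coverage(50))       # reads needed for ~50x
    ```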

    As your pairs are overlapping, try joining them together with a tool like FLASH
    (http://bioinformatics.oxfordjournals...tr507.abstract), then assemble these longer reads (try both an OLC assembler like MIRA and a de Bruijn assembler like Velvet; with merged reads you can increase the k-mer size).
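
    A toy sketch of what overlap-merging tools like FLASH do: find the suffix/prefix overlap between a read and its reverse-complemented mate and join them. This exact-match version is for illustration only; real tools also tolerate mismatches and use base qualities to resolve them.

    ```python
    def revcomp(seq):
        """Reverse complement of a DNA sequence."""
        comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
        return "".join(comp[b] for b in reversed(seq))

    def merge_pair(read1, read2_rc, min_overlap=4):
        """Merge read1 with the reverse-complemented mate read2_rc by
        finding the longest exact suffix/prefix overlap, or None."""
        for ov in range(min(len(read1), len(read2_rc)), min_overlap - 1, -1):
            if read1[-ov:] == read2_rc[:ov]:
                return read1 + read2_rc[ov:]
        return None  # no confident overlap found

    # Toy 16bp fragment sequenced with 12bp reads from both ends
    fragment = "ACGTAACCGGTTACGT"
    r1 = fragment[:12]
    r2 = revcomp(fragment[-12:])
    print(merge_pair(r1, revcomp(r2)))  # reconstructs the 16bp fragment
    ```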

    • #3
      Thanks for your help. I know the statistics are poor; happy for any advice, as I am new to all this!

      I have been working on a subset of the reads. I am using VelvetOptimiser, so I scanned all possible k-mers from 19 to 31 for each of 250,000, 1 million, or 2 million reads, and I tried both single-end and paired-end options. The best so far was 2 million paired-end (those are the stats I quoted). Should I use even more reads? I have been trimming the reads from base 5 to 60, which gives good-looking boxplots with every base at a quality score >20. Happy for suggestions on other ways to filter.

      I had also originally tried to work with data that was already paired using SHERA, but I didn't seem to get significantly improved results. I may readdress that now that I am slightly more proficient with Velvet.

      I also tried MIRA yesterday for the first time. With 2 million trimmed reads (55bp) I got 22,000 contigs covering a 4.8M consensus; the largest contig is 3,200bp and the N50 is 287, which is also not great for a bacterial genome. Happy for advice there too!

      • #4
        I have to say your results are curiously poor. 2M reads should give you ~30x coverage, which I would expect to give a much better result than you are getting.

        What happens if you just take the first 2M reads from the (hopefully correctly shuffled: pair /1 followed by /2) file, don't filter at all, and pass them to velveth with -shortPaired and a k-mer of 31, and velvetg with -exp_cov auto -cov_cutoff auto -ins_length 150?

        Also are you absolutely sure you have got the correct insert length?

        • #5
          I agree they are much worse results than expected.
          I just rechecked the shuffling and it is /1 and then /2 as expected (I used the shuffleSequences Perl script that comes with Velvet for that).
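
          A quick way to double-check that interleaving programmatically; this sketch assumes old-style /1 and /2 read-name suffixes, as described above:

          ```python
          def check_interleaving(fastq_lines):
              """Return True if read names in an interleaved FASTQ alternate
              /1, /2, /1, /2 ... and each pair shares the same base name.
              fastq_lines: the lines of the file (4 lines per record)."""
              names = [line.split()[0] for line in fastq_lines[::4]]
              for i in range(0, len(names) - 1, 2):
                  first, second = names[i], names[i + 1]
                  if not (first.endswith("/1") and second.endswith("/2")):
                      return False
                  if first[:-2] != second[:-2]:
                      return False
              return True

          good = ["@r1/1", "ACGT", "+", "IIII",
                  "@r1/2", "ACGT", "+", "IIII"]
          bad  = ["@r1/1", "ACGT", "+", "IIII",
                  "@r2/2", "ACGT", "+", "IIII"]
          print(check_interleaving(good), check_interleaving(bad))  # True False
          ```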

          There are 4 million reads in total (2 million of each end), just to make sure that is clear.

          I just tried running as you suggested with the unfiltered sequences (which are longer, but the quality gets really bad after about 90bp), and the process got killed. I will try to find a computer with more memory... Is it worth trying with half the sequences?

          The insert size is definitely ~150; I just double-checked with the guy who did the actual experiment, and that is also what VelvetOptimiser spits out at the end of the run when I don't give it that info.

          • #6
            I thought your reads were 75bp long? Not sure what you mean by quality dropping off after 90bp.

            Quality may well be the problem. Can you post a FastQC chart?

            • #7
              And yes, try with 1/2 the sequences.

              • #8
                Try k-mers above 31. We got good results from 37 to 47, with -shortPaired and -exp_cov auto. Try the maximum number of reads your computer can handle: we got the best results with 15+15 million and 20+20 million reads; fewer was worse.
                Last edited by vtosha; 02-29-2012, 03:56 AM.

                • #9
                  Sorry, here is a boxplot. The original reads were 150bp (144bp after barcode removal); I chopped them down to 56bp (start at base 5, end at base 60).
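
                  That trim (keep bases 5 to 60 of each read) is simple slicing on the sequence and quality strings; a minimal sketch, assuming plain four-line FASTQ records:

                  ```python
                  def trim_record(record, start=5, end=60):
                      """Trim one FASTQ record (name, seq, plus, qual) to bases
                      start..end, 1-based and inclusive, mirroring the trim above."""
                      name, seq, plus, qual = record
                      return (name, seq[start - 1:end], plus, qual[start - 1:end])

                  # A 144bp read (post-barcode-removal length quoted above)
                  rec = ("@read1", "A" * 144, "+", "I" * 144)
                  trimmed = trim_record(rec)
                  print(len(trimmed[1]))  # 56bp, as quoted in the post
                  ```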
                  Attached Files

                  • #10
                    I've tried larger k-mers and I wasn't getting much better results. VelvetOptimiser is such a useful tool, and right now it is only working up to 31-mers. (Not sure if I can reinstall it to go higher? I have to ask the guy who installed it...)

                    • #11
                      We tried it with a Perl script written by our bioinformatician. As I understand it, it is not too difficult for an IT person to write such a script to auto-scan Velvet's parameters.

                      • #12
                        Thanks, we don't have IT here to help, unfortunately. I am a wet-lab biologist learning the computational stuff on my own! (Actually enjoying it, but it is slow going!)
                        Just to make sure: you were doing 20 million x2 reads on what size genome?

                        • #13
                          Nick,
                          I still can't run with half the sequences. I will try to get hold of a stronger computer, but that has to wait until our sysadmin is back tomorrow. Any insights or things I can try in the meanwhile? Thanks so much for your help!

                          • #14
                            I just tried running with a 41-mer on the paired-end 2 million x2 reads, and I am getting an N50 of 243, max contig 5,556, and a 3.6M total, so that doesn't look too promising. (Maybe not surprising, since the entire trimmed read I am using is 56bp, so that may be too large a k-mer.)
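
                            That intuition is right: effective k-mer coverage drops as k approaches the read length, roughly Ck = C * (L - k + 1) / L. Plugging in the numbers from this thread (2 million x2 trimmed 56bp reads, ~4.7Mb genome):

                            ```python
                            def kmer_coverage(cov, read_len, k):
                                """Effective k-mer coverage: Ck = C * (L - k + 1) / L."""
                                return cov * (read_len - k + 1) / read_len

                            read_len = 56  # trimmed read length from the posts above
                            cov = 2_000_000 * 2 * read_len / 4_700_000  # ~48x nucleotide coverage

                            # k-mer coverage shrinks quickly as k nears the read length
                            for k in (21, 31, 41, 51):
                                print(k, round(kmer_coverage(cov, read_len, k), 1))
                            ```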

                            • #15
                              Some bacterial genomes, near 5.5M; 72nt paired-end reads; N50 from 100,000 to 170,000. Not the best at all, but our best. For one chloroplast genome, though, we couldn't get any long contigs despite excellent sequencing results, so the problem may lie in library preparation or the DNA.
                              Yes, there may be an optimal k-mer for your own data, and longer k-mers are not always better.
                              Last edited by vtosha; 02-29-2012, 04:19 AM.
