Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • SOAPdenovo2: rd_len_cutoff and assembly quality

    Dear all,

    working with SOAPdenovo2, I noticed a (to my mind) strange behaviour.

    I am trying to assemble simulated MiSeq data of a fungal genome. Inputting about 12 million 2x250 bp sequences, I expect the assembly length to reach around 38Mbp.

    Yet, with the rd_len_cutoff set to 250, I observe distressing results, as in
    • total scaffold length: 100,681
    • average scaffold length: 466
    • N50: 146 (!!)


    With rd_len_cutoff = 100, the results are significantly better:
    • total scaffold length: 34,867,163
    • average scaffold length: 13,657
    • N50: 29,759


    (In all cases, kmer length is set to 63, which I empirically found to yield the best results for any read length.)

    As far as I can tell after inspecting the read length range between 100 and 250, there seems to be a negative linear correlation between read length and assembly quality.
    (Number of scaffolds decreases with rd_len_cutoff, while N50, assembly size and average scaffold length increase.)

    Is there any explanation to this behavior?

    Thank you in advance and best regards!

  • #2
    Are your reads trimmed? They should be, for adapter removal, at least. Thus if you set the length cutoff at 250 you are probably using very few reads and thus have too little data to assemble.

    I would generally expect the negative correlation you observe, down to a cutoff equal to your kmer size, particularly when you don't have very much data - you only had around 150x coverage before trimming, and if you throw away a lot of it, you will drop too low. Don't set a length cutoff unless you have plenty of data. In your case even 63bp reads might be useful.

    And don't worry about the L50 being lower than the read length... contigs come from the kmer graph which is not directly related to individual reads, so that's not a bug, it's a feature

    Comment


    • #3
      Thank you for your quick reply!

      I'm not sure if I understand you correctly - from your answer it sounds as if the read length cutoff specifies the minimum length reads should have to be included in the assembly.
      From the manual and the example.config file given on the SOAP website I had been assuming that the rd_len_cutoff indicates, for each read, the position of the last base used for assembly (i.e. the position after which the reads will be cut).

      Comment


      • #4
        I think you're right, my understanding of the Soapdenovo web page is that the rd_len_cutoff is the last base left in the read, Soapdenovo trims off all the bases after that point.

        In that case, if you set the rd_len_cutoff to 250, nothing will be trimmed from the 3' ends of the reads, which often have low quality bases, so that could explain why the assembly is worse.

        In your original post you mentioned that you are working with simulated MiSeq data.

        How are you simulating the MiSeq data? What software are you using? Do you end up with reads that are all the same length (250) or not?

        Comment


        • #5
          Again a prompt reply, thank you!

          Yes, all sequences are of same length and they are generated using ART.

          Initially, I had the same thought as you. That's why I conducted a series of runs with varying rd_len_cutoff. Yet, to me, these didn't explain why results from rd_len_cutoff = 175 with
          • total scaffold length: 35,545,564
          • average scaffold length: 9,007
          • N50: 14,637


          would still be far worse than results from rd_len_cutoff = 150 with
          • total scaffold length: 36,161,245
          • average scaffold length: 12,362
          • N50: 25,376


          You're right of course - in most cases, a decrease in base quality can be observed towards the 3' end. But, regarding the FastQC Report (see attachment) for my data, I wouldn't have expected the assembly to be affected so distinctly. Even though the 25 bases ommited in the example above are of a lower quality I still would have thought that the quality overall is rather good and that the loss of information associated with trimming should at least lead to similar results in both cases.

          Moreover, Velvet, which is also works on De Bruijn graphs, performs perfectly well on the set of 250 bp sequences. Do you happen to know which computational difference between these two programs might lead to such divergent results?
          Attached Files

          Comment


          • #6
            I agree your per base quality seems pretty good, which is not what you would see in real data.

            I have used velvet quite a bit, but have never used Soapdenovo or ART.

            With real data you would also get adapters at the 3' ends of some reads, and with long MiSeq PE reads, sometimes R1 and R2 overlap. Does ART add adapters to some reads or not, and what is the average insert length for the simulated reads?

            Of course, what also happens when you make the reads shorter, is that the coverage decreases. I don't know about Soapdenovo, but velvet doesn't do so well with very high coverage.

            Comment


            • #7
              Have you looked at what the kmer frequenting plots are doing as you change the read length? Those can be helpful in diagnosing a problem. And have you tried using SOAP’s error correction method?

              Comment


              • #8
                Originally posted by ju.lie View Post
                Thank you for your quick reply!

                I'm not sure if I understand you correctly - from your answer it sounds as if the read length cutoff specifies the minimum length reads should have to be included in the assembly.
                From the manual and the example.config file given on the SOAP website I had been assuming that the rd_len_cutoff indicates, for each read, the position of the last base used for assembly (i.e. the position after which the reads will be cut).
                Sorry, my answer was completely wrong; I misinterpreted the meaning of that flag. Since you are using synthetic data of high quality, I can't explain the results unless the reads have adapters inserted, or an intentional base-composition bias toward the end of the reads, or there is some problem with the read generator. I suggest you post a base-composition by position plot.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Recent Advances in Sequencing Analysis Tools
                  by seqadmin


                  The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
                  05-06-2024, 07:48 AM
                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  04-22-2024, 07:01 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 05-14-2024, 07:03 AM
                0 responses
                23 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 05-10-2024, 06:35 AM
                0 responses
                44 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 05-09-2024, 02:46 PM
                0 responses
                58 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 05-07-2024, 06:57 AM
                0 responses
                44 views
                0 likes
                Last Post seqadmin  
                Working...
                X