  • wired assembly result--abysmally short contigs

    Hi, all:

We sequenced two Bacillus strains to about 300X coverage using Illumina MiSeq paired-end sequencing. Our library was built with the Nextera XT kit, which produces a wide range of fragment sizes: the average fragment size is about 1,000 bp, ranging from 300 to 3,000 bp.
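For scale, the coverage arithmetic can be sketched in a few lines of Python (the ~4.2 Mb genome size is an assumption for a typical Bacillus strain; adjust for the actual strains):

```python
# Rough sequencing-coverage arithmetic for a 2x250 bp MiSeq run.
# Genome size is an assumption (typical Bacillus ~4.2 Mb).
genome_size = 4_200_000      # bp, assumed
read_length = 250            # bp per read
reads_per_pair = 2

def pairs_needed(target_coverage):
    """Read pairs required to reach a given fold-coverage."""
    return target_coverage * genome_size / (read_length * reads_per_pair)

print(round(pairs_needed(300)))  # -> 2520000 pairs for 300x
```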

I ran the Velvet assembler with these command lines:

Code:
./bin/velvet_1.2.10/velveth out 33 -fastq -short ./P06_S1_L001_R1_001.fastq ./P06_S1_L001_R2_001.fastq
velvetg out -exp_cov auto -ins_length 1044 -scaffolding yes

The contig length distribution is as below:

Length (bp)  Count
100-199      199670
200-299      9825
300-399      287
400-499      73
500-599      24
600-699      18
700-799      9
800-899      2
900-999      4
1000-1099    1
1100-1199    2
1200-1299    2
Extremely short contigs, especially considering the reads are 250 bp long!
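From that distribution one can approximate the N50 with a short script (using bin midpoints, which is only an approximation):

```python
# Approximate assembly stats from the binned contig-length distribution above.
# Using each bin's midpoint is an approximation, but enough to see the problem.
bins = {  # (low, high): count, taken from the Velvet output above
    (100, 199): 199670, (200, 299): 9825, (300, 399): 287,
    (400, 499): 73, (500, 599): 24, (600, 699): 18,
    (700, 799): 9, (800, 899): 2, (900, 999): 4,
    (1000, 1099): 1, (1100, 1199): 2, (1200, 1299): 2,
}

def approx_n50(bins):
    """N50: length L such that contigs >= L hold half of all assembled bases."""
    lengths = []
    for (lo, hi), count in bins.items():
        lengths.extend([(lo + hi) // 2] * count)
    lengths.sort(reverse=True)
    half = sum(lengths) / 2
    running = 0
    for l in lengths:
        running += l
        if running >= half:
            return l

print(approx_n50(bins))  # -> 149: almost all assembled bases sit in 100-199 bp contigs
```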

1) First I suspected the sequenced samples might be contaminated with DNA from other organisms, so I mapped the reads to a close reference: 90% of reads map with Bowtie under default parameters. We were also afraid the sequencing might be biased, so we plotted the coverage depth distribution (shown in the attached Rplot.jpeg). Though some loci are strongly biased and were sequenced up to 7,000-fold, they are only a small proportion.

2) We also plotted the k-mer distribution to screen for problematic repeats or sequencing errors (shown in the attached histogram-k29.histo.pdf), but the figure looks normal.

What would you do if you ran into these problems?

  • #2
My first guess (based on seeing your histogram-k29 chart):

Your coverage is very high, and that is making the assembler believe that some noisy k-mers are real. One way to check whether that is really the case: try assembling only 1/6th of the reads and see whether the quality goes up. Alternatively, you can assemble the entire set but play with the Velvet parameter that filters out k-mers below a certain frequency (-cov_cutoff in velvetg).
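A sketch of what that 1/6 subsampling has to do (a tool like seqtk sample does this in practice; the key point is applying the same random decisions to both mates so pairing is preserved):

```python
import random

def subsample_pairs(r1_records, r2_records, fraction=1/6, seed=42):
    """Keep the same random subset of records from both mates so pairs
    stay in sync. Each record stands for one FASTQ entry; r1_records and
    r2_records must be in matching order."""
    rng = random.Random(seed)
    keep = [rng.random() < fraction for _ in r1_records]
    r1_out = [rec for rec, k in zip(r1_records, keep) if k]
    r2_out = [rec for rec, k in zip(r2_records, keep) if k]
    return r1_out, r2_out
```

With seqtk you would run `seqtk sample` on each mate file with the same seed for the same effect.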

    Let's see the results, and then we can brainstorm other possibilities/solutions.

P.S. Weird, not wired. We call it wired when the assembly works
    http://homolog.us



    • #3
      A couple of things could be going on here.

First, make sure that your reads have had the Nextera adapter sequence trimmed off. When setting up the sample sheet for sequencing XT libraries, adapter trimming should normally be turned on, but as some users have found out, some core facilities don't do that. For any reads that are longer than the sequenced fragment, the read-through adapter sequence will either cause the assembler to choke and die or give lots of small contigs because it can't properly resolve the graph.
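The core of adapter clipping is just cutting the read at the transposase sequence; a bare-bones Python sketch (real trimmers like Trimmomatic or cutadapt also handle partial and error-containing adapter matches):

```python
# Minimal 3'-adapter clipping: if the read runs through the insert into the
# Nextera transposase sequence, cut at its first occurrence.
NEXTERA = "CTGTCTCTTATACACATCT"  # Nextera read-through adapter

def clip_adapter(read, adapter=NEXTERA):
    i = read.find(adapter)
    return read if i == -1 else read[:i]

print(clip_adapter("ACGTACGTACGT" + NEXTERA + "GGG"))  # -> ACGTACGTACGT
```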

If you're sure that the adapter sequences have been trimmed, try merging reads that can be overlapped into single reads. That will remove some of the range of the pair distance, and if you feed in a mix of merged single reads and paired reads, you should get a better assembly than with just paired reads at a large pair distance. There are a bunch of programs that can do the merging for you; we use SeqPrep, but FLASH and numerous others are widely used as well.
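The overlap-merge idea, roughly what SeqPrep and FLASH do (they additionally weigh base qualities and use tuned thresholds; min_overlap and max_mismatch_frac here are illustrative values):

```python
def merge_pair(r1, r2_revcomp, min_overlap=15, max_mismatch_frac=0.1):
    """Try to merge a pair whose fragment is shorter than 2x the read length.
    r2_revcomp is read 2 already reverse-complemented. Returns the merged
    sequence, or None if no acceptable overlap is found."""
    for olen in range(min(len(r1), len(r2_revcomp)), min_overlap - 1, -1):
        a, b = r1[-olen:], r2_revcomp[:olen]
        mismatches = sum(x != y for x, y in zip(a, b))
        if mismatches <= max_mismatch_frac * olen:
            return r1 + r2_revcomp[olen:]  # take the longest acceptable overlap
    return None

print(merge_pair("AACCGGTTAACCGGTTAACC", "GGTTAACCGGTTAACCTTTT"))
# -> AACCGGTTAACCGGTTAACCTTTT
```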

It doesn't look like you did any quality trimming, based on the commands you listed (P06_S1_L001_R1_001.fastq would be a direct output file from MiSeq Reporter, based on its name), so you should probably give that a try. With 300x coverage, bad reads and bad parts of reads will prevent Velvet from collapsing a lot of bubbles in the graph, which will result in lots of small contigs. Additionally, increasing the k-mer size after quality trimming should help the assembly as well. You can also try using khmer to down-sample/normalize your data so that erroneous reads/k-mers don't cause issues.
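The down-sampling khmer does is "digital normalization"; a rough sketch of the idea (khmer itself uses a probabilistic counting structure rather than an exact dictionary, but the decision rule is the same):

```python
from collections import Counter
from statistics import median

def normalize_by_median(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers, as seen so far,
    is still below the cutoff; otherwise the coverage it represents is
    already saturated and the read is discarded."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        if median(counts[km] for km in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)
    return kept
```

Identical high-coverage reads stop being kept after the cutoff is reached, flattening the 300x peaks while keeping rare (real) sequence.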



      • #4
        Originally posted by chjp0632 View Post

I ran the Velvet assembler with these command lines:

Code:
./bin/velvet_1.2.10/velveth out 33 -fastq -short ./P06_S1_L001_R1_001.fastq ./P06_S1_L001_R2_001.fastq
velvetg out -exp_cov auto -ins_length 1044 -scaffolding yes

If you want Velvet to treat the reads as paired, you need to use the flags -shortPaired and -separate.

        Code:
        ./bin/velvet_1.2.10/velveth out 33 -fastq -shortPaired -separate ./P06_S1_L001_R1_001.fastq ./P06_S1_L001_R2_001.fastq



        • #5
          Originally posted by samanta View Post
My first guess (based on seeing your histogram-k29 chart):

Your coverage is very high, and that is making the assembler believe that some noisy k-mers are real. One way to check whether that is really the case: try assembling only 1/6th of the reads and see whether the quality goes up. Alternatively, you can assemble the entire set but play with the Velvet parameter that filters out k-mers below a certain frequency (-cov_cutoff in velvetg).

          Let's see the results, and then we can brainstorm other possibilities/solutions.

P.S. Weird, not wired. We call it wired when the assembly works
I subsampled the sequencing data to 300k read pairs, 500k, 1 million, 2M, 3M, and 4M. The SOAPdenovo results are shown below:

#ReadPairs  300k   500k   1m     2m    3m   4m
N50         13740  24563  16256  2721  941  522

The result indicates that 500k pairs of 250 bp reads produce the most contiguous contigs with SOAPdenovo. Velvet's performance on these data sets shows a similar trend. Therefore, 5 million sequencing reads may be more than enough.



          • #6
            Originally posted by mcnelson.phd View Post
            A couple of things could be going on here.

If you're sure that the adapter sequences have been trimmed, try merging reads that can be overlapped into single reads. That will remove some of the range of the pair distance, and if you feed in a mix of merged single reads and paired reads, you should get a better assembly than with just paired reads at a large pair distance. There are a bunch of programs that can do the merging for you; we use SeqPrep, but FLASH and numerous others are widely used as well.

It doesn't look like you did any quality trimming, based on the commands you listed (P06_S1_L001_R1_001.fastq would be a direct output file from MiSeq Reporter, based on its name), so you should probably give that a try. With 300x coverage, bad reads and bad parts of reads will prevent Velvet from collapsing a lot of bubbles in the graph, which will result in lots of small contigs. Additionally, increasing the k-mer size after quality trimming should help the assembly as well. You can also try using khmer to down-sample/normalize your data so that erroneous reads/k-mers don't cause issues.
1) Merging the reads into longer single reads is a good suggestion, but I am also worried about false merges. Can SeqPrep and FLASH distinguish true read extensions from false ones?

2) Read trimming and down-sampling can be skipped if we use the SPAdes assembler. I used it and saved a lot of time. It automatically assembles with several different k-mer sizes and merges the results.

3) Compared with SOAPdenovo and Velvet, our results show SPAdes performed best on contig contiguity for this data set.



            • #7
WRT FLASH, SeqPrep, etc. -- you will get a lot of valid merges. You will also get some false merges if the reads overlap repeats in an unfortunate way.

I would also recommend trying a k-mer-based error corrector such as MUSKET.

A lot of people like the SPAdes assembler, which has a cleanup step built in. I have had very good luck with Ray. Velvet is the old workhorse, but many newer programs deliver more contiguous assemblies.

