  • Illumina paired end assembly; in silico normalization

    Dear all,
    I want to assemble a fungal genome of approx. 35 Mbp. My data are paired-end reads of 101 bp in length with two different insert lengths (300 bp and 600 bp).
    I decided to use Velvet for the assembly and searched for the best k-mer using VelvetOptimizer. I searched for the "best" k-mer on a small subset of my NGS data, using N50 and longest contig as the criteria for the best assembly.
    I have of course already read about the stagnation of assembly statistics once a specific level of coverage is reached; one cannot assemble more than the "real" genome in the end. BUT: what I observed is that the statistics drop when increasing the coverage (using the full data set).
    Inspecting the results of the full-set assembly and the subset assembly shows that long contigs from the subset assembly (median coverage 25x) are split into smaller contigs in the other assembly (median coverage 250x).
    Initially the genome was "over-sequenced", so the calculated expected coverage is about 1000x.
    Do you have any suggestions why that happens? And do you think in silico normalization might get rid of it? Using DigiNorm, for example, I would lose the PE information and even the quality information, if I'm right, because the output is always FASTA.
    I would be very happy if we could start a discussion about this. Has anyone of you observed that phenomenon before?

  • #2

    You can either subsample or normalize; sometimes one gives a better assembly than the other. BBNorm will not lose pairing or quality information.

    That package also contains Reformat, a tool that can do subsampling: reformat.sh in=reads.fq.gz out=sampled.fq.gz samplerate=0.04

    ...will subsample to 4%.
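    What samplerate does can be sketched with a toy Python model (not BBTools code; the function and read names here are made up for illustration): each pair is kept independently with the given probability, so pairing is preserved and the expected depth scales by the rate.

    ```python
    import random

    def subsample_pairs(pairs, rate, seed=0):
        """Keep each read pair with probability `rate`; mates stay together,
        so pairing information is preserved."""
        rng = random.Random(seed)
        return [p for p in pairs if rng.random() < rate]

    # toy data: each pair is (read 1 name, read 2 name)
    pairs = [(f"r{i}/1", f"r{i}/2") for i in range(10000)]
    kept = subsample_pairs(pairs, 0.04)
    print(len(kept) / len(pairs))  # roughly 0.04
    ```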

    With BBNorm, to normalize to 35x coverage: bbnorm.sh in=reads.fq.gz out=normalized.fq.gz target=35 -Xmx29g

    You may or may not need to set the -Xmx flag, depending on your environment. If you do, then set it to about 85% of the machine's physical memory.

    In my experience, normalization normally yields a better assembly than subsampling, particularly if you have variable coverage. The disadvantages of normalization compared to subsampling are that it is slower, uses more memory, and destroys information about repeat content, but I consider these disadvantages unimportant if the resultant assembly is better.

    Also, I find that Velvet, with default parameters, yields the best assembly at around 35-40x coverage.


    • #3
      Thank you for the reply! I will try again with BBNorm.


      • #4
        BBnorm target depth setting

        Hi Thread/Brian,

        I ran BBNorm on a pre-BBduk'd pair of read files. I plan to use the normalized reads with assemblers (Velvet or A5, well, A4 in this case).

        bbnorm.sh in1=R1_bbduk20.fq in2=R2_bbduk20.fq out1=R1_bbduk20_norm35.fq out2=R2_bbduk20_norm35.fq target=35

        The opening dialog states that the target depth is 140 on pass 1, then 35 on pass 2.

        Why the two passes?

        What are the "error reads/pairs/types"?

        Do these normalizations typically help a BBmap (or other) run as well?



        • #5

          There are two passes because normalization can selectively enrich the dataset for reads with errors, if no precautions are taken, since reads with errors appear to have rare kmers. To combat this, I first normalize down to a higher-than-requested level; or specifically, reads that appear error-free are normalized to the high level (140 in this case), and reads that appear to contain errors are normalized to the low level (35 in this case) on the first pass.

          So, after the first pass all reads will still have minimum depth 35 (so nothing was lost), but the new dataset will be selectively enriched for error-free reads. The second pass normalizes the remaining reads to the final target regardless of whether errors are suspected.

          It's not really possible to do this in a single pass because if (for example) half your reads contain errors, and error-free reads are randomly discarded at the target rate but error-containing reads are discarded at a higher rate, you will ultimately achieve only half of the desired final coverage.
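          The two-pass scheme can be modeled with a toy simulation (hypothetical numbers, not BBNorm's actual implementation; here a read at apparent depth d is kept with probability target/d):

          ```python
          import random

          def normalize(reads, depth, target, rng):
              """Keep each read with probability min(1, target/depth)."""
              return [r for r in reads if rng.random() < min(1.0, target / depth)]

          rng = random.Random(1)
          # toy region sequenced to 1000x: 600 error-free reads, 400 reads with errors
          clean = ["clean"] * 600
          errors = ["error"] * 400

          # pass 1: reads that look error-free are normalized to 140x,
          # reads that look error-containing to 35x
          p1 = normalize(clean, 1000, 140, rng) + normalize(errors, 1000, 35, rng)

          # pass 2: everything to the final 35x target, errors ignored
          p2 = normalize(p1, len(p1), 35, rng)

          print(len(p2))                      # about 35 reads
          print(p1.count("clean") / len(p1))  # clean fraction enriched well above 0.6
          ```

          After pass 1 the minimum depth is still at least the final target, but the surviving reads are disproportionately error-free; pass 2 then flattens everything to the requested depth.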

          You can set "passes=1" when running normalization, and look at the kmer frequency histogram afterward with "histout=x.txt". The histogram from a run with 2-pass normalization will have far fewer error kmers, which is obvious from a quick visual examination of the graph.

          Normalization is not useful for mapping in most circumstances. The only advantage it would convey is a decrease in runtime by reducing the size of the input, which is useful if you are using a huge reference (like nt) or a slow algorithm (like BLAST), but I don't use it for that. Error-correction (done by the same program as normalization, with the "ecc" flag) may be useful before mapping, though - it will increase mapping rates, though always with the possibility that the ultimate output may be slightly altered. So, I would not error-correct data before mapping, either, except when using a mapping algorithm that is very intolerant of errors.

          I designed BBNorm as a preprocessing step for assembly. Normalizing and/or error-correcting data with highly uneven coverage - single cell amplified data and metagenomes (and possibly transcriptomes, though I have not tried it) - can yield a much better assembly, particularly with assemblers that are designed for isolates with a fairly flat distribution, but even on assemblers designed for single-cell or metagenomic assembly. Also, normalization can allow assembly of datasets that would otherwise run out of memory or take too long. Depending on the circumstance, it often yields better assemblies with isolates too, but not always.

          I'm sure there are other places where BBNorm can be useful, but I don't recommend it as a general-purpose preprocessing step prior to any analysis - just for cases where you can achieve better results by reducing data volume, or flattening the coverage, or reducing the error rate.

          Oh... and as for the error reads/pairs/types, that shows a summary of the input data quality, and fractions of reads or pairs that appear to contain errors. The "types" indicate which heuristic classified the read as appearing to contain an error; that's really there for testing and I may remove it.


          • #6
            BBduk goose.

            Thanks Brian,

            That's much clearer and makes sense.

            I noticed that FastQC complained more about kmers after the normalization, and I can see why that could happen.


            • #7
              Hi Brian,
              I used the following command to normalize my over-sequenced genome:
               bbnorm.sh in=species_L004_paired.fastq out=normalized_L004_80x.fastq target=80
              Maybe this is the same question as mentioned before, but I expect that errors, represented by low-frequency kmers, carry more weight after normalization. Is that so?
              By the way, even after the first run of your tool my assembly statistics improved. You did a great job! Thank you!
              Also, did I miss giving the pairing argument?


              • #8

                1) BBNorm was designed to prioritize discarding of reads that appear to contain errors, so typically, there are fewer low-frequency kmers after normalization. When you normalize, you can set "hist=khist_input.txt" and "histout=khist_output.txt". The first file will get the frequency histogram of kmers before normalization, and the second one after normalization, so you can see how the process changed the distribution.

                2) I'm glad it was helpful!

                3) If the input is interleaved, the program will autodetect that, as long as the reads follow the standard Illumina naming patterns, so it should be fine. I will update it to print a message indicating whether it is processing the data as paired or not. You can force a file to be interpreted as interleaved with the "interleaved=t" flag.
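                 The autodetection heuristic amounts to checking that consecutive read names pair up; a simplified version for two common Illumina naming styles (an illustration only, not BBNorm's actual code):

                 ```python
                 def mates_match(name1, name2):
                     """Heuristic pair check: 'id/1' + 'id/2', or
                     'id 1:...' + 'id 2:...' (same id, mate 1 then mate 2)."""
                     if name1.endswith("/1") and name2.endswith("/2"):
                         return name1[:-2] == name2[:-2]
                     p1, p2 = name1.split(maxsplit=1), name2.split(maxsplit=1)
                     if len(p1) == 2 and len(p2) == 2:
                         return p1[0] == p2[0] and p1[1][0] == "1" and p2[1][0] == "2"
                     return False

                 def looks_interleaved(names):
                     """True if consecutive records pair up as mate1/mate2."""
                     return len(names) % 2 == 0 and all(
                         mates_match(names[i], names[i + 1])
                         for i in range(0, len(names), 2))

                 print(looks_interleaved(["r1/1", "r1/2", "r2/1", "r2/2"]))  # True
                 print(looks_interleaved(["r1/1", "r2/1"]))                  # False
                 ```

                 Names that follow neither convention defeat the heuristic, which is why forcing the mode with a flag is safer when you already know the layout.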


                • #9
                  Hi Brian,
                  I just do not know if it was successful with respect to the pairing... I got the following exception using DigiNorm:
                  ** ERROR: Error: Improperly interleaved pairs
                  That's why I was wondering if maybe there are some problems in the interleaved mode. Would your software give a hint if it is not possible to run in interleaved mode?
                  Can you also tell me what the best practice is for running two or more iterative steps with BBNorm?


                  • #10

                    You can run Reformat to verify that interleaving is valid, like this:

           reformat.sh in=reads.fq out=null vint

                     It will print a message indicating whether or not the interleaving is correct, based on the read names. So you can run it on the file from before normalization, and on the file after, to check. Another program, BBSplitPairs, can fix files with broken interleaving:

           bbsplitpairs.sh in=reads.fq out=fixed.fq fint

                    This is very fast and requires very little memory, but will only work on files that were interleaved, then some of the reads were thrown away. If the reads are arbitrarily disordered, you can run this:

           repair.sh in=reads.fq out=fixed.fq repair

                    This requires a lot of memory, though, as it may potentially store up to half of the reads in memory.
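                     The memory cost comes from buffering unmatched mates until their partner shows up; a minimal repair sketch (assumes 'id/1'/'id/2' names and toy data, purely illustrative):

                     ```python
                     def repair(reads):
                         """Re-pair arbitrarily ordered reads: buffer unmatched mates in a
                         dict (worst case about half the reads), and emit (read1, read2)
                         once both mates have been seen."""
                         pending = {}
                         pairs = []
                         for name, seq in reads:
                             base, mate = name[:-2], name[-1]  # assumes 'id/1' / 'id/2' naming
                             if base in pending:
                                 other = pending.pop(base)
                                 pairs.append((other, (name, seq)) if mate == "2"
                                              else ((name, seq), other))
                             else:
                                 pending[base] = (name, seq)
                         singles = list(pending.values())  # mates that never found a partner
                         return pairs, singles

                     reads = [("a/1", "ACGT"), ("b/2", "TTTT"), ("a/2", "GGGG"), ("c/1", "AAAA")]
                     pairs, singles = repair(reads)
                     print(pairs)    # [(('a/1', 'ACGT'), ('a/2', 'GGGG'))]
                     print(singles)  # [('b/2', 'TTTT'), ('c/1', 'AAAA')]
                     ```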

                     If BBNorm broke pairing, it is much better to redo normalization with the "int=t" flag to force reads to be treated as pairs than to try to fix the corrupted output. If you run it with "int=t" it will completely ignore the read names, and not give any sort of error if they don't match, since the names could have been changed or could be in some format that is not recognized. If you run with "int=f" then, similarly, the input will be forced to be treated as single-ended. And if you run without the flag, it will autodetect, but unfortunately it doesn't currently tell you whether the data was processed single-ended or not; you can run this program:


                     ...which will print whether or not a file is detected as interleaved (and various other things). If it says a file contains single-ended ASCII-33 fastq reads, then all of my programs will treat that file as such (with the exception of the pair-fixing tools run with the 'fint' flag, because those are designed for corrupted files).

                    I'm not really sure what you mean by running iterative steps with BBNorm - it runs two passes by default, transparently, as that yields the best assemblies in my testing. You can run multiple iterations yourself if you want (using the output as input), but there's no reason to do that unless the coverage comes out above your target. I have not found that multiple iterations of error-correction are particularly useful, either; one seems to be fine. So if you use this command line:

           bbnorm.sh in=reads.fq out=normalized.fq int=t ecc=t hist=khist.txt histout=khist2.txt ow

                    ...then it will do three passes:

                    1a) Kmer counting of input kmers for frequency histogram
                    1b) Error correction
                    1c) Normalization biased against reads with errors

                    2) Unbiased normalization

                    3) Kmer counting for output frequency histogram

                    ...which should give you optimal results. The "ow" flag is optional and tells it to overwrite output files if they already exist; the "ecc" flag is optional and usually makes things better, but not always; and of course the "int" flag is optional but I always include it if I know whether or not the reads are paired, since autodetection is reliant on reads having Illumina standard names.

                    I hope that helps!

