  • Velvet assembly from MiSeq data - am I doing it right?

    Hi everyone,

    I am a newbie to sequencing and assembly. I posted before asking for help with my library, and you guys were great; thanks again for that. Now I am running Velvet de novo assemblies and have no idea whether I am doing it right, or as well as possible. I have several questions about FastX processing and Velvet input, as well as about interpreting the quality of my output contigs.
    I haven't found a FastX -> Velvet tutorial; if there is one, I am sorry for wasting your time.

    My data:
    2x250 bp paired-end Illumina MiSeq data, 1.9 million reads per genome. The organism is an E. coli strain, so I assume my assembly should be roughly 5 Mbp.

    What I did so far:
    - Assembly without FastX processing:
    Used both reads and shuffled them using the Velvet Perl script. Fed the shuffled sequences into velveth with 'MAXKMERLENGTH=151' and -shortPaired.

    The output was:
    - Expected coverage: 17.949313
    - Estimated cutoff: 8.974657
    - nodes: 148
    - n50 of 234827
    - max 537019
    - total 5039428
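For context, the Velvet manual relates k-mer coverage Ck to base coverage C as Ck = C * (L - k + 1) / L, where L is the read length. Below is a minimal sketch of that arithmetic. It assumes the 1.9 million figure counts individual reads (not pairs), a 5 Mbp genome, and a hash length of k = 151; none of these is confirmed in the post, and the function names are illustrative.

```python
def base_coverage(n_reads, read_len, genome_size):
    """Nominal base coverage C = total sequenced bases / genome size."""
    return n_reads * read_len / genome_size


def kmer_coverage(base_cov, read_len, k):
    """K-mer coverage from the Velvet manual: Ck = C * (L - k + 1) / L."""
    return base_cov * (read_len - k + 1) / read_len


# Assumed inputs (see the lead-in): 1.9 M reads of 250 bp, 5 Mbp genome, k = 151.
c = base_coverage(1_900_000, 250, 5_000_000)
ck = kmer_coverage(c, 250, 151)
print(f"base coverage ~{c:.1f}x, nominal k-mer coverage ~{ck:.1f}x")
```

Under these assumptions the nominal k-mer coverage comes out higher than the expected coverage Velvet reported. K-mers that contain sequencing errors do not contribute to the observed coverage, which is one plausible contributor to the gap, but nothing in the post establishes the cause.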

    This was basically my first attempt to get to know Velvet. Next I tried to improve my assembly by quality-processing my data. I trimmed 7 bp off the ends of my reads so that FastQC would rate them as high quality.

    Then I retried my assembly with those reads and kmer lengths between 131 and 149 (the region with the lowest number of nodes). Using the quality-trimmed reads, I end up at best with 158 nodes and an n50 of 212407. Only the max contig length rose, to 650709.

    Lastly, I applied all the FastX tools to filter, trim, and clip my reads. However, I think that because reads were filtered out, the shuffle script paired the reads incorrectly, and I got very bad assemblies.
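A positional interleaver pairs record N of the first file with record N of the second, so pairs drift out of sync as soon as a filter drops a read from only one file. The sketch below shows one way pairing could be restored by read name before interleaving. It is an illustration on in-memory toy records, not the Velvet shuffle script or any FastX tool; the function names and the "/1", "/2" naming convention are assumptions.

```python
def parse_fastq(text):
    """Split FASTQ text into (header, sequence, plus, quality) records."""
    lines = text.strip().splitlines()
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def read_id(header):
    """Read name without the leading '@' and without a trailing /1 or /2."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def interleave_surviving_pairs(r1_records, r2_records):
    """Keep only reads whose mate survived filtering, interleaved as R1, R2."""
    r2_by_id = {read_id(rec[0]): rec for rec in r2_records}
    interleaved = []
    for rec in r1_records:
        mate = r2_by_id.get(read_id(rec[0]))
        if mate is not None:
            interleaved.extend([rec, mate])
    return interleaved


# Toy data: read "b" was filtered out of the R2 file only.
r1 = parse_fastq("@a/1\nACGT\n+\nIIII\n@b/1\nGGCC\n+\nIIII\n@c/1\nTTAA\n+\nIIII\n")
r2 = parse_fastq("@a/2\nTGCA\n+\nIIII\n@c/2\nAATT\n+\nIIII\n")

paired = interleave_surviving_pairs(r1, r2)
print([rec[0] for rec in paired])
```

In the toy data a positional shuffle would have paired b/1 with c/2; matching by name drops the orphaned read instead.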

    So here are the questions that came out of my attempts:

    - Is it acceptable to de novo assemble unprocessed reads?
    - If no, which quality enhancement methods are absolutely necessary?
    - How can I apply the FastX filter tools and still run velvet paired end assemblies?
    - How can I tell a good assembly from a bad assembly? By contig n50 or number of nodes? Kmer coverage?

    I have found threads that cover aspects of what I am asking, but I am very unsure whether I am doing it right, and I want to do it right.

    Thanks for your patience,
    Illnoobina

  • #2
    Originally posted by illnoobina View Post
    1) Is it acceptable to de novo assemble unprocessed reads?
    2) If no, which quality enhancement methods are absolutely necessary?
    3) How can I apply the FastX filter tools and still run velvet paired end assemblies?
    4) How can I tell a good assembly from a bad assembly? By contig n50 or number of nodes? Kmer coverage?
    1) Yes, particularly if the data is very high quality, but some processing will usually give you a better assembly.
    2) I recommend adapter trimming and filtering out artifacts (primers and other synthetic molecules, phiX, human reads). Depending on the quality and assembler, quality-trimming or error-correction may be useful (Velvet is pretty robust, though). Depending on the library type, sometimes normalization is useful (e.g. for highly amplified single-cell data).
    3) I suggest you not use FastX. Instead, use a tool like BBDuk which retains pairing when doing trimming/filtering operations.
    4) This is a difficult question. You might try using a tool like Quast, which is designed for evaluating assemblies; it works best if you provide it with a reference, so you can use a known strain of E. coli for that. Also, mapping reads to the assembly is useful; the higher the mapping rate, and the lower the error count, the better the assembly reflects the reads. Looking at the coverage, you may also be able to spot things like collapsed repeats. You can also plot the cumulative length of the assembly as you include more contigs, starting with the longest and ending with the shortest; that line will tell you more than a single number like N50.
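The cumulative-length curve mentioned in point 4 can be sketched as follows. This is a minimal illustration with made-up contig lengths, not output from the assemblies in this thread; the function names are illustrative.

```python
def cumulative_lengths(contig_lengths):
    """Running total of assembly length, adding contigs longest first."""
    total = 0
    curve = []
    for length in sorted(contig_lengths, reverse=True):
        total += length
        curve.append(total)
    return curve


def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    ordered = sorted(contig_lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Made-up contig lengths, for illustration only.
lengths = [200, 400, 100, 300]
print(cumulative_lengths(lengths))
print(n50(lengths))
```

N50 is a single point on this curve (where it crosses half the total length), which is why the whole curve carries more information: two assemblies with the same N50 can have very different tails of short contigs.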

    Also, once you have an assembly, you can BLAST it against nt or something to see if all your contigs are E. coli. If some are not, you can remove the contaminant reads and reassemble.
    Last edited by Brian Bushnell; 08-16-2014, 12:27 PM.


    • #3
      Hi Brian,

      Thank you for your fast and very helpful reply. I am using BBDuk, as you suggested (I guess you coded it), and it's awesome! I am down 2 nodes compared to my initial assembly, without optimizing kmer lengths. I guess the 2 nodes were the phiX spike-in and TruSeq adapter contamination, which I filtered for.

      Estimated Coverage = 17.9
      Estimated Coverage cutoff = 8.9
      146 nodes
      n50 of 234k
      max 537k
      total 5034k

      I used the adapter trim as you did in your tutorial (I hope that's fine with TruSeq paired end):

      ./bbduk.sh -Xmx1g in1=R1in.fastq in2=R2in.fastq out1=R1out.fastq out2=R2out.fastq ref=truseq.fa ktrim=r k=28 mink=12 hdist=1

      The only thing I am worried about is that I still see kmer irregularities in the first 10 bp in FastQC. I thought I would get rid of those with adapter trimming; is that true?


      • #4
        Kmer frequency irregularities in the first 10-20bp are not unusual, depending on your fragmentation methodology. I don't think it happens with sonication, but with other approaches like Nextera (transposon) and "random" hexamer priming, it does. In my testing highly nonrandom base frequencies in the first 20bp do not exhibit inflated error rates, and thus are not due to artifacts or base-calling problems, but you can test this by running BBMap with the "mhist=mhist.txt" flag, against your assembly. This will show the error rate by read position; if it is not elevated for the first 20bp, then the nonuniformity is just an artifact of nonrandom cleavage and does not need trimming.
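The check described above amounts to comparing the error rate at the start of the read with the rate over the rest of the read. The sketch below shows that idea on hypothetical, already-aligned (read, reference) string pairs without indels. It does not parse BBMap's mhist output, whose format is not reproduced here, and the 2x threshold is an arbitrary assumption.

```python
def mismatch_rate_by_position(aligned_pairs):
    """Fraction of reads that mismatch the reference at each read position."""
    mismatches = []
    totals = []
    for read, ref in aligned_pairs:
        for i, (read_base, ref_base) in enumerate(zip(read, ref)):
            if i == len(totals):
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            if read_base != ref_base:
                mismatches[i] += 1
    return [m / t for m, t in zip(mismatches, totals)]


def mean(values):
    return sum(values) / len(values) if values else 0.0


def head_is_elevated(rates, head=20, factor=2.0):
    """True if the mean error rate of the first `head` positions exceeds
    `factor` times the mean rate of the remaining positions."""
    return mean(rates[:head]) > factor * mean(rates[head:])


# Toy alignments (4 bp reads), for illustration only.
pairs = [("ACGT", "ACGT"), ("ACGA", "ACGT"), ("ACGT", "ACGT"), ("TCGT", "ACGT")]
rates = mismatch_rate_by_position(pairs)
print(rates)
print(head_is_elevated(rates, head=2))
```

If the head of the read is not flagged, the skewed base composition seen in FastQC is consistent with nonrandom cleavage rather than sequencing error, which is the argument made in the post.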

        Adapters (for fragment libraries) are present on the right end, not the left end, so they don't affect the first 10-20bp unless you have a high population of adapter-dimers and so forth. Adapter-dimers should be removed during adapter trimming. There are other artifacts, though, like primer-dimers and various other artificial constructs; you may want to BLAST your assembled contigs against nt (or some other database) to see if any are contaminants. If so, you should filter out the contaminants from the raw reads and reassemble; you get the best assembly when contaminants are removed before assembling. Alternately, you could just use BBDuk to remove all known Illumina artifacts prior to assembly (which is what we do at JGI). I'm not supposed to distribute the files containing all Illumina contaminant sequences since some are patented, but they're not difficult to find online.
        Last edited by Brian Bushnell; 08-17-2014, 10:11 AM.


        • #5
          Should one remove adapters FIRST, before trimming using PHRED scores?


          • #6
            There are arguments for each approach, but I prefer to remove adapters first; quality trimming never adds information. If you have severe contamination you can always trim adapters both before and after quality trimming, but that's generally a waste of time.

            Basically -

            If you quality-trim first, then you might remove so much adapter that the remaining little piece is no longer recognized, and therefore not trimmed.
            If you quality-trim second, you may remove some error bases that prevented the detection of a shorter-than-K adapter prefix on the very end of the read.

            So neither is perfect but quality-trimming second seems to be better.
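The first case above (quality-trimming first leaves an adapter remnant too short to recognize) can be reproduced with a toy model. The sketch below is not BBDuk's algorithm; it only mimics the idea of a full k-length match plus a shorter end match down to a minimum length, and the read, quality values, and parameters (k=8, mink=4, Q10) are made up for illustration.

```python
def quality_trim_right(seq, quals, min_q):
    """Remove trailing bases whose quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def adapter_trim_right(seq, quals, adapter, k, mink):
    """Toy right-end adapter trim: cut at a full k-length adapter match, else at
    a read suffix that equals an adapter prefix of at least mink bases."""
    cut = seq.find(adapter[:k])
    if cut == -1:
        for n in range(min(k - 1, len(seq)), mink - 1, -1):
            if seq.endswith(adapter[:n]):
                cut = len(seq) - n
                break
    if cut == -1:
        return seq, quals
    return seq[:cut], quals[:cut]


genome = "TTGACCTGAAGCTTGCAATC"   # made-up insert sequence
adapter = "AGATCGGAAGAGC"         # illustrative adapter start
read = genome + adapter[:10]      # read runs 10 bases into the adapter
quals = [30] * 22 + [2] * 8       # only the last 8 bases are low quality

# Order A: adapter trimming first, then quality trimming.
adapter_first = quality_trim_right(*adapter_trim_right(read, quals, adapter, 8, 4), min_q=10)

# Order B: quality trimming first, then adapter trimming.
trimmed_seq, trimmed_quals = quality_trim_right(read, quals, 10)
quality_first = adapter_trim_right(trimmed_seq, trimmed_quals, adapter, 8, 4)

print(adapter_first[0])
print(quality_first[0])
```

In this toy case, trimming adapters first recovers the clean insert, while quality-trimming first leaves two adapter bases ("AG") on the end of the read because the remnant is shorter than mink.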


            • #7
              Ok, thanks!
