  • de novo assembly of 1.7Gb reptile - Best Practices?

    So, first and foremost, greetings to all of you! I've been a browser here for a while, but this is my first post and I thought I'd make it a good one.

    Up front, I am not a Biologist or Bioinformatician... merely a curious and adventurous manager of HPC for an R1 university. That being said, I am working on a de novo assembly of a 1.7Gb reptilian genome from 209GB of Illumina data. While I have successfully assembled a few Mb-sized organisms in the past, I have never tried anything on this scale! The data I currently have is as follows:

    Illumina HiSeq 2000 Mate Pair (reverse-complemented, 2012)
    -rw-r--r-- 1 jpummil jpummil 21G Mar 25 21:49 RattleSnakeReverseCompedRead1.fastq
    -rw-r--r-- 1 jpummil jpummil 21G Mar 25 21:50 RattleSnakeReverseCompedRead2.fastq

    Illumina HiSeq 2000 Paired End (2012)
    -rw-r--r-- 1 jpummil jpummil 27G Mar 25 21:53 s_1_1_sequence.fastq
    -rw-r--r-- 1 jpummil jpummil 27G Mar 25 21:53 s_1_2_sequence.fastq

    Illumina HiSeq 2000 Paired End (2012)
    -rw-r--r-- 1 jpummil jpummil 24G Mar 25 21:54 s_2_1_sequence.fastq
    -rw-r--r-- 1 jpummil jpummil 24G Mar 25 21:54 s_2_2_sequence.fastq

    Illumina HiSeq 2000 Paired End (2014)
    -rw-r--r-- 1 jpummil jpummil 19G Mar 25 21:57 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
    -rw-r--r-- 1 jpummil jpummil 19G Mar 25 21:57 3_Snake_4117_TSDR27_ATTCCT_L003_R2_001.fastq

    Illumina MiSeq Paired End (2014)
    -rw-r--r-- 1 jpummil jpummil 16G Mar 25 21:57 MIKE_S1_L001_R1_001.fastq
    -rw-r--r-- 1 jpummil jpummil 16G Mar 25 21:57 MIKE_S1_L001_R2_001.fastq

    So, as I look at this pile of data, I am thinking that my steps should be as follows:

    1. Trim adapters and poor-quality data with FastX or Trimmomatic.
    2. Maybe make a second copy of the trimmed data and use FLASH to merge reads?
    3. Consider digital normalization to select just the "best" data?
    4. Determine the best way of selecting parameters, such as kmer size, to optimize the assembly.

    I have been using Ray and Velvet in my early experiments on the 2012 data without a lot of success, as I think our overall coverage was poor. BUT... with the new data added, I want to start off on a fresh note and learn something as I go that I can document and pass on to the REAL biologists if/when they are confronted with such a project.

    If there are responses to this thread, it is highly likely that I will ask some very non-biological questions as follow-ups, but it is all in the process of learning and I will be very appreciative of explanations and pointers.

    Cheers,
    --Jeff

  • #2
    It looks like you have around 60x coverage, which is too low to benefit from normalization, so I suggest you skip that step. I have written a normalizer, BBNorm, that also does error-correction - you can do either or both. When you have too much data, I find assemblies greatly benefit from normalization plus error-correction (with Soap and Velvet, at least; AllPathsLG does not seem to like normalization).

    Still, it's often worthwhile to compare assemblies built from raw, normalized, error-corrected, and error-corrected-then-normalized data, to see which comes out best. Sometimes error-correction and normalization will both reduce assembly quality.

    You can roughly gauge assembly quality using metrics like N50, the percent of raw reads that map to the assembly, error rates when mapping to the assembly, pairing rates, ambiguous-mapping rates, assembly size (lower is better, all else equal), and so forth. If you have EST data, you can also evaluate assemblies using that (with, for example, bbest.sh).
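    For concreteness, producing the normalized and normalized-plus-corrected variants might look like this (a minimal sketch; the file names and the 50x target are placeholders, not from this thread):

    bbnorm.sh in1=raw_1.fq in2=raw_2.fq out1=norm_1.fq out2=norm_2.fq target=50
    bbnorm.sh in1=raw_1.fq in2=raw_2.fq out1=normecc_1.fq out2=normecc_2.fq target=50 ecc=t

    Each variant (plus the raw reads) would then be assembled and compared on the metrics above.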

    Adapter-trimming is always a good idea when you know the adapter sequences, as is human-read removal, but since you're working with a reptile, my suggestions in that post are inadequate. Still, you can at least remove human symbionts/parasites like E. coli and Salmonella, and known Illumina contaminants like phiX. Adapters can be trimmed with Trimmomatic, but BBDuk is both faster and more sensitive (allowing an adjustable edit distance or Hamming distance). I do not recommend Fastx, ever, since it ruins read pairing.
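    As a rough sketch of that kind of filtering with BBDuk (assuming the phix174_ill.ref.fa.gz file shipped in /bbmap/resources/ - the exact file name may differ in your copy, and the read/output names are placeholders):

    bbduk.sh in=r1.fq in2=r2.fq out=clean_1.fq out2=clean_2.fq ref=/bbmap/resources/phix174_ill.ref.fa.gz k=31 hdist=1 stats=contam_stats.txt

    With ref= and no ktrim flag, reads matching a reference kmer are discarded rather than trimmed.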

    Quality-trimming can be useful but is best done conservatively (to Q10 or less). There are many quality-trimming algorithms; the best-performing is the Phred algorithm (that website doesn't show up for me, but maybe it does for you? If not, I can send you a description of my reimplementation of the algorithm), implemented by BBTools and Seqtk. There are other quality-trimming programs, and all give inferior results, because the Phred algorithm is provably optimal. Most are written with no understanding of logarithms.
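    A conservative Q10 trim with BBDuk would look something like this (file names are placeholders):

    bbduk.sh in=r1.fq in2=r2.fq out=qtrim_1.fq out2=qtrim_2.fq qtrim=rl trimq=10

    Here qtrim=rl trims both ends of each read and trimq=10 sets the Q10 threshold.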

    FLASH does a poor job of merging, with a high false-positive rate. BBMerge (part of the BBTools package) does much better, with a similar merging rate to FLASH but around 1/100th the false-positive rate (under 0.03% with default settings, on synthetic data where this can be verified). However, merging is not always beneficial: Ray seems to perform much better with merged reads, while Soap gives terrible assemblies from merged reads compared to unmerged; I have not tested AllPathsLG or Velvet on merged data. So if you want to merge reads, don't use FLASH - but whether merging is a good idea at all depends on your genome and assembler. The accuracy of a merger is quite easy to verify: if you have an assembly, no matter how primitive, just track the fraction of reads that map, raw versus merged, allowing no indels.
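    A sketch of that verification, assuming you have some draft assembly to map against (paths are placeholders; maxindel=0 disallows indels during mapping):

    bbmap.sh ref=draft.fa nodisk in=r1.fq in2=r2.fq maxindel=0
    bbmap.sh ref=draft.fa nodisk in=merged.fq maxindel=0

    BBMap prints mapped-read percentages to standard error; a clearly lower rate for the merged reads suggests false-positive merges.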

    For optimizing kmer size in an assembly - the normal solution is to assemble using all odd kmer sizes (from, say, k=21 to 80% of read length) and pick the one with the best N50. It's not ideal, but much better than choosing K at random. There are algorithms available that claim to determine the optimal kmer length, but I have yet to test them.
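    A sweep along those lines might be scripted roughly like this (a sketch only - Ray runs under MPI, so adapt the launch line to your cluster; read paths are placeholders, and QUAST is just one way to pull out N50):

    for K in 21 31 41 51 61 71; do
        mpiexec -n 64 Ray -k $K -p r1.fq r2.fq -o Ray_k$K
        quast.py Ray_k$K/Contigs.fasta -o quast_k$K
    done
    grep N50 quast_k*/report.txt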

    Lastly - it's normal to assemble all fragment libraries into contigs, then use the long-mate-pair library only for scaffolding. They're different, so treat them as such.
    Last edited by Brian Bushnell; 04-17-2014, 08:28 PM.



    • #3
      Good morning Brian!
      Thank you for the very helpful synopsis you have provided. I have gone so far as to run three assemblies with just the base, un-preprocessed data with Ray at 3 different kmer settings (31, 43, and 57). Command line I am using is:
      /share/apps/Ray/Ray-2.3.1/ray-build/Ray -k 43 \
        -p /scratch/jpummil/Snake_v2/3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq \
           /scratch/jpummil/Snake_v2/3_Snake_4117_TSDR27_ATTCCT_L003_R2_001.fastq \
        -p /scratch/jpummil/Snake_v2/RattleSnakeReverseCompedRead1.fastq \
           /scratch/jpummil/Snake_v2/RattleSnakeReverseCompedRead2.fastq \
        -p /scratch/jpummil/Snake_v2/s_1_1_sequence.fastq \
           /scratch/jpummil/Snake_v2/s_1_2_sequence.fastq \
        -p /scratch/jpummil/Snake_v2/s_2_1_sequence.fastq \
           /scratch/jpummil/Snake_v2/s_2_2_sequence.fastq \
        -p /scratch/jpummil/Snake_v2/MIKE_S1_L001_R1_001.fastq \
           /scratch/jpummil/Snake_v2/MIKE_S1_L001_R2_001.fastq \
        -o /scratch/jpummil/Snake_v2/RayOut43-AMD

      Using the QUAST quality-metrics tool, I get the following (all three assemblies side by side):

      Metric                      k=31          k=43          k=57
      # contigs (>= 0 bp)         742358        677575        594352
      # contigs (>= 1000 bp)      355954        355684        349257
      Total length (>= 0 bp)      1150125950    1194030037    1155057304
      Total length (>= 1000 bp)   969349841     1030937265    1014358551
      # contigs                   511734        501163        487964
      Largest contig              39372         40236         40714
      Total length                1085189817    1138917269    1117432969
      GC (%)                      37.98         38.17         38.10
      N50                         2917          3235          3249
      N75                         1583          1717          1732
      L50                         106043        100464        98136
      L75                         233027        221906        216568
      # N's per 100 kbp           0.00          0.00          0.00
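      (For the record, a QUAST run along these lines produced the metrics above - Ray writes its contigs to Contigs.fasta inside the -o directory; the output path here is just an example:)

      quast.py /scratch/jpummil/Snake_v2/RayOut43-AMD/Contigs.fasta -o QuastOut43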

      Looks like the next step is to work on some pre-processing of the data to eliminate poor-quality regions and adapters, then maybe try again at kmer=57 to see if there is a substantial improvement.



      • #4
        Brian's suggestions, as always, are very good. You may also wish to try the 'ABySS' program. In my experience there is no one perfect assembler so if you have the time then trying out two or three is a good idea.



        • #5
          Since you're using Ray, I do recommend trying an assembly of merged data to see if you get a better result (assuming your libraries have short enough inserts for merging). The command is like this:

          bbmerge.sh in1=reads1.fq in2=reads2.fq out=merged.fq outu1=unmerged1.fq outu2=unmerged2.fq

          If you do merging with bbmerge, it's best to do it after adapter removal, but before quality-trimming.
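          Put together, that ordering looks roughly like this (a sketch; the adapter file and read names are placeholders):

          bbduk.sh in=r1.fq in2=r2.fq out=at_1.fq out2=at_2.fq ref=adapters.fa ktrim=r mink=10
          bbmerge.sh in1=at_1.fq in2=at_2.fq out=merged.fq outu1=unm_1.fq outu2=unm_2.fq
          bbduk.sh in=merged.fq out=merged_qtrim.fq qtrim=rl trimq=10

          (The unmerged pairs from outu1/outu2 would get the same quality-trimming pass before assembly.)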



          • #6
            So, to use bbtrim.sh and NOT quality trim, just don't flag the quality options? So for each PE set, use as follows?

            bbtrim.sh in=<input file> in2=<input file> out=<output file> out2=<output file> ref=<adapter files>

            So, for the adapter references, I just looked at one of the newer PE reads to count the various adapters:

            [jpummil@razor-l2 Snake_v2]$ !762
            grep -c "^GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
            122
            [jpummil@razor-l2 Snake_v2]$ !763
            grep -c "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG$" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
            104
            [jpummil@razor-l2 Snake_v2]$ !764
            grep -c "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
            7217
            [jpummil@razor-l2 Snake_v2]$ !765
            grep -c "AGATCGGAAGAGC" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
            267557

            Should I include all 4 of these in csv form for the adapter ref?

            What about PCR primers? I haven't found a reference that suits me for those yet...



            • #7
              Originally posted by jpummil:
              So, to use bbtrim.sh and NOT quality trim, just don't flag the quality options? So for each PE set, use as follows?

              bbtrim.sh in=<input file> in2=<input file> out=<output file> out2=<output file> ref=<adapter files>
              bbtrim.sh is intended for quality-trimming. To do adapter trimming without quality-trimming, you would use bbduk:

              bbduk.sh in=<input file> in2=<input file> out=<output file> out2=<output file> ref=<adapter files> ktrim=r mink=10 -Xmx1g

              You can add "hdist=1" to allow up to one substitution. The "ktrim=r" flag means trim to the right when finding a kmer (without that flag, matches would be discarded rather than trimmed). The "mink=10" allows it to look for matching sequences as short as 10 bp at the end of the read. In the middle of the read, the full kmer length (default 28) will be used.

              I have included both the TruSeq adapters and the phiX reference in /bbmap/resources/

              So, for the adapter references, I just looked at one of the newer PE reads to count the various adapters:

              ...

              Should I include all 4 of these in csv form for the adapter ref?
              You can put them in a properly-formatted fasta file, like this:

              >1
              GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
              >2
              GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
              >3
              GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
              >4
              AGATCGGAAGAGC

              ...or, you can use the flag "literal=GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG,GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG,GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG,AGATCGGAAGAGC"

              Note that BBDuk uses kmers, and the default kmer length is 28; it will not find adapters shorter than kmer length. You can change it to, say, 13 with the "k=13" flag, but the shorter it is the more false positives will be found, particularly if you allow mismatches.

              What about PCR primers? I haven't found a reference that suits me for those yet...
              I have a file of all common Illumina contaminants that we use for filtering on everything at JGI. It's small, ~120kb, and includes various primers. I'll check for permission, and post it here if it's allowed.



              • #8
                Ahhh... understood. bbduk.sh as well as the noted options.

                So, while I could "nuke" about 7500 of the longer adapters with k=28:
                grep -c "GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
                7217
                I would be left with nearly a quarter million of the shorter adapters unless I reduce k to 13...
                grep -c "AGATCGGAAGAGC" 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
                267557

                But... lowering k to less than half of the default in this case would substantially increase the likelihood of eliminating things OTHER than adapters. Of course, the grep counts shown likely include some such false positives as well.

                Looks like 75M reads in that particular file, so...
                grep -c @HWI 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
                75552988
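                (A header grep can miscount, since quality lines can also begin with "@HWI"; for an uncompressed FASTQ the exact read count is the line count divided by four:)

                echo $(( $(wc -l < 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq) / 4 ))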

                Yes, if allowable, I'd love to have the file with all the adapters, primers, etc that you mentioned.

                Want to thank you for your patience and all the help thus far...I truly do appreciate it!



                • #9
                  Sorry, I couldn't get permission to release it. I think some of it may be considered proprietary by Illumina.

                  However, you might try these lists:



                  ...both of which contain some Illumina primers. They'd have to be formatted as Fasta before use.



                  • #10
                    No worries, Brian!

                    As it stands now, I have plenty to tinker with. I'll do some adapter trimming with what I have, see where that leads, and report back on progress.

                    Oh yeah, while digging around, I ran across some 454 files from 2010 for the same project. Not sure if they'd be more trouble than they're worth, or useful for a specific purpose:

                    -rwxr-xr-x 1 jpummil jpummil 1.5G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GE68THX01_RSn3-5.sff
                    -rwxr-xr-x 1 jpummil jpummil 1.9G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GEHF0TB01_RSn3-3.sff
                    -rwxr-xr-x 1 jpummil jpummil 1.9G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GEHF0TB02_RSn3-3.sff
                    -rwxr-xr-x 1 jpummil jpummil 1.1G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GELISTB01_RSn3-2pL1.sff
                    -rwxr-xr-x 1 jpummil jpummil 599M Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GEUJLEB01_RSn3-4pL2.sff
                    -rwxr-xr-x 1 jpummil jpummil 1.9G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GF99A5Z01_RSn3-JM1.sff
                    -rwxr-xr-x 1 jpummil jpummil 1.9G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GF99A5Z02_RSn3-JM1.sff
                    -rwxr-xr-x 1 jpummil jpummil 2.2G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGDZ8VL01_RSn3-JM2.sff
                    -rwxr-xr-x 1 jpummil jpummil 2.1G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGDZ8VL02_RSn3-JM2.sff
                    -rwxr-xr-x 1 jpummil jpummil 1.7G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGLRLMF01_RSn3-JM1r2.sff
                    -rwxr-xr-x 1 jpummil jpummil 2.2G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGLRLMF02_RSn3-JM2r2.sff
                    -rwxr-xr-x 1 jpummil jpummil 2.0G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGPCL1V01_RSn3-JM3.sff
                    -rwxr-xr-x 1 jpummil jpummil 2.0G Jun 11 2012 /storage/jpummil/snake-orig/GSFLX2010/GGPCL1V02_RSn3-JM3.sff



                    • #11
                      So, first try seemed to run fine, but minimal change in file sizes:
                      Start:
                      -rw-r--r-- 1 jpummil jpummil 19907863971 Mar 25 21:57 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
                      -rw-r--r-- 1 jpummil jpummil 19907863971 Mar 25 21:57 3_Snake_4117_TSDR27_ATTCCT_L003_R2_001.fastq

                      First with cautious: literal=GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
                      -rw-rw-r-- 1 jpummil jpummil 19905819559 Apr 18 20:14 TRIM-AD/New_PE_R1.fastq
                      -rw-rw-r-- 1 jpummil jpummil 19906242537 Apr 18 20:14 TRIM-AD/New_PE_R2.fastq

                      And with more aggressive: k=13 literal=GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG,AGATCGGAAGAGC
                      -rw-rw-r-- 1 jpummil jpummil 19885080597 Apr 18 20:20 TRIM-AD/New_PE_R1a.fastq
                      -rw-rw-r-- 1 jpummil jpummil 19884681173 Apr 18 20:20 TRIM-AD/New_PE_R2a.fastq

                      Also, the sizes of the paired files are no longer the same... I guess I expected they would be.



                      • #12
                        The program will print statistics about how many reads and bases were trimmed to standard error. There's no reason to expect the files to remain equal in size, since adapters may be detected in one read but not the other. They should have the same number of lines, though.

                        Remember that even if you lose only 0.2% of your data by trimming down to 13-mers, you will also break your assembly everywhere that 13-mer occurs. Hopefully that will be rare (a given 13-mer occurs about once per 67 million bases in a random genome, since 4^13 = 67,108,864), but still, I don't recommend dropping below K=21.



                        • #13
                          Output of both reductions.
                          Conservative:
                          Initial:
                          Memory: free=1507m, used=11m

                          Added 41 kmers; time: 1.031 seconds.
                          Memory: free=1467m, used=51m

                          Input is being processed as paired
                          Started output stream: 1.221 seconds.

                          Input: 151105976 reads 15261703576 bases.
                          KTrimmed: 106096 reads (0.07%) 1750831 bases (0.01%)
                          Result: 151103306 reads (100.00%) 15259952745 bases (99.99%)

                          Time: 98.518 seconds.
                          Reads Processed: 151m 1533.80k reads/sec
                          Bases Processed: 15261m 154.91m bases/sec


                          Aggressive:
                          Initial:
                          Memory: free=1507m, used=11m

                          Added 32 kmers; time: 0.069 seconds.
                          Memory: free=1467m, used=51m

                          Input is being processed as paired
                          Started output stream: 0.102 seconds.

                          Input: 151105976 reads 15261703576 bases.
                          KTrimmed: 640339 reads (0.42%) 21374886 bases (0.14%)
                          Result: 151053684 reads (99.97%) 15240328690 bases (99.86%)

                          Time: 89.117 seconds.
                          Reads Processed: 151m 1695.59k reads/sec
                          Bases Processed: 15261m 171.25m bases/sec

                          And, as you predicted, still the same number of lines in each file:
                          [jpummil@razor-l2 TRIM-AD]$ wc -l New_PE_R1.fastq
                          302206612 New_PE_R1.fastq
                          [jpummil@razor-l2 TRIM-AD]$ wc -l New_PE_R2.fastq
                          302206612 New_PE_R2.fastq

                          So, I think I'll run an assembly with the best kmer setting from the original runs and see what it yields in the way of improvement for both trimmed sets. Then, do a merge and re-run again. Might as well collect some statistics while I'm learning!

                          And, I still plan on trying a couple of other assemblers as well. Certainly Velvet, and maybe ABySS. Thought about SOAP also.



                          • #14
                            Sorry to ask this here, but where can I find more about BBNorm, and what exactly does k-mer normalization do?



                            • #15
                              Originally posted by AdrianP:
                              Sorry to ask this here, but where can I find more about BBNorm, and what exactly does k-mer normalization do?
                              BBNorm is packaged together with BBMap. The usage information is available by running the shellscript, "bbnorm.sh", with no arguments. It's not published yet (though I do have a mostly-written draft), but it is being used in an upcoming paper on a new assembler, "Omega".

                              Kmer normalization is useful for preprocessing prior to assembly in situations where you have too much data or highly variable coverage (for example in a metagenome, amplified single cell, RNA-seq, or genomes with high-copy plasmids or features like 16S). It's an alternative to subsampling. Some (not all) assemblers perform much better with a flat coverage distribution, and most assemblers use an amount of time and memory that increases with the number of input reads, so reducing the input and flattening the coverage can give better results, and often allow assembly of data that would otherwise take too long or run out of memory.

                              Subsampling: Randomly discard X% of all reads.
                              Normalization: Randomly discard reads with a probability based on kmer frequency, to achieve uniform coverage depth.

                              Subsampling is very fast and easy. You can subsample to 1% depth with BBTools like this:
                              reformat.sh in1=r1.fq in2=r2.fq out1=s1.fq out2=s2.fq samplerate=0.01

                              That will reduce the coverage, but not flatten it. Instead, to normalize it to 50x:

                              bbnorm.sh in1=r1.fq in2=r2.fq out1=n1.fq out2=n2.fq target=50

                              This takes longer and requires much more memory, but many assemblers (like Velvet and Soap) will produce a substantially better assembly with the normalized data, in my testing.

                              BBNorm can also do error-correction at the same time as normalization, if you include the flag "ecc=t". Even if you don't error-correct, BBNorm will discard reads containing errors with a higher probability than error-free reads, so the resulting data after normalization generally has a lower error rate. This is in contrast to naive normalization, which will typically increase the error rate (because reads with errors will have more rare kmers and thus be discarded at a lower rate).
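                               So the combined run is just the normalization command above with that flag appended:

                               bbnorm.sh in1=r1.fq in2=r2.fq out1=n1.fq out2=n2.fq target=50 ecc=t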
                              Last edited by Brian Bushnell; 04-20-2014, 09:28 AM.
