  • Brian Bushnell
    replied
    BBMerge can't force-trim just one read of the pair, sorry. You'd have to have the paired reads in two files, and run BBDuk on the r2 file only, with the flags "ftr2=50 ordered".
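A minimal sketch of that split-and-trim step, with placeholder file names and assuming interleaved input (if your pairs are already in two files, skip the reformat.sh step):

    Code:
    reformat.sh in=pairs.fq out1=r1.fq out2=r2.fq
    bbduk.sh in=r2.fq out=r2_ftrimmed.fq ftr2=50 ordered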

    "ecct" and extension use kmer data from all reads to do error-correction, so it's best to do that initially, using all reads:

    Code:
    1) [adapter-trim as you have described]
    2) bbmerge.sh in1=r1.fq in2=r2.fq out=merged.fq outu=unmerged#.fq ecct extend2=50 k=62 rem qtrim2=r trimq=10,15,20,25
    3) bbduk.sh in=unmerged2.fq out=ftrimmed2.fq ftr=50 minlen=0
    (make sure the out count is the same as the in count)
    4) bbmerge.sh in1=unmerged1.fq in2=ftrimmed2.fq out=mergedB.fq outu=unmergedB#.fq ecct extend2=50 k=62 rem qtrim2=r trimq=10,15,20,25 extra=merged.fq
    When BBMerge fails to merge reads, it undoes all the operations (error-correction, extension, quality-trimming). Or at least, it's supposed to; I have never actually combined extension, error-correction, and iterative quality-trimming at the same time, so let me know if anything seems odd.

    But! It's important to ask yourself whether you want to go to great effort to salvage the lowest-quality fraction of your data. I don't generally use "xloose" because it has a higher false-positive merge rate; in practice, I never go beyond "vloose". In this case, since it appears that sequencing of read 2 largely failed toward the end, it's worth some effort to salvage those pairs, or else you'll lose the longest-insert reads; but it's also worth considering having the data rerun, since it's clearly a sequencing failure. I'd be interested in seeing the error rates of the reads from mapping (after adapter-trimming):

    Code:
    bbmap.sh in1=r1.fq in2=r2.fq ref=ref.fa mhist=mhist.txt qhist=qhist.txt bhist=bhist.txt



  • mcmc
    replied
    Originally posted by Brian Bushnell View Post
    This appears to be getting killed by the scheduler because it's using more memory than it's allowed to, although usually that happens much faster so I can't be sure. I suggest you reduce the -Xmx argument. However, I don't know what you should reduce it to, since I don't know the memory limit for that node... but you could just try, say, -Xmx50g and see if that works. Otherwise, I suggest you talk to your sysadmin and ask how much memory you are allowed to use on that node, or why the job got killed.
    Indeed, reducing it to -Xmx50g solved the problem (on a 64g exclusive node).

    I've tried to push my merging rate above ~75%, and as you suspected, I think poor quality is limiting the merging. Here are the gory details:

    First I adapter-trimmed with bbduk: ref=truseq.fa.gz ktrim=r mink=11 hdist=2 k=21 tpe tbo

    Then bbmerge:
    1. default: 76.661% join, median=292
    2. adapter=default, qtrim2=r —> 76.672% join, median=292, adapter expected=11900, adapter found=3188
    3. -Xmx50g prealloc=t prefilter=1 extend2=50 k=62 rem adapter=default —> 76.564%, median=293, adapter expected=11511, adapter found=2805
    4. Focusing on the unmerged reads from (1):
    • 4a. qtrim2=r trimq=10,15,20 adapter=default —> 34.5% (of unmerged) now join, median=322
    • 4b. qtrim2=r trimq=10,15,20,25 adapter=default forcetrimright2=25 —> 41.15% (of unmerged) now join, median=321
    • 4c. ecct qtrim2=r adapter=default —> 10.40% (of unmerged) now join, median=339
    • 4d. xloose —> 33.6% (of unmerged) to join

    5. Try sequential merging (sketched below): default bbmerge; feed the unmerged reads to a second pass with qtrim2=r, adapter=default, trimq=10,15,20; feed the still-unmerged reads to a third pass with forcetrimright2=50.
    Result: total merge rate of 86.8%, leaving 3.57M/27M unmerged
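    A rough reconstruction of the sequential approach in (5), with assumed file names (the "#" in outu expands to 1 and 2 for paired output):

    Code:
    bbmerge.sh in1=r1.fq in2=r2.fq out=m1.fq outu=u1_#.fq
    bbmerge.sh in1=u1_1.fq in2=u1_2.fq out=m2.fq outu=u2_#.fq qtrim2=r adapter=default trimq=10,15,20
    bbmerge.sh in1=u2_1.fq in2=u2_2.fq out=m3.fq outu=u3_#.fq forcetrimright2=50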
    6. Combining in one command did worse (order of operations different?): qtrim2=r, trimq=10,15,20, adapter=default, forcetrimright2=50, ecct —> 73.8% total join

    So it seems like quality trimming and even force trimming have the greatest effect?

    Questions:
    - what is the order of operations (if specified in a single command): qtrim2, forcetrimright2, ecct, adapter? Is it better to run iteratively on the unmerged reads?
    - are trim operations performed even if the merge fails (i.e. are the unmerged output reads trimmed)?
    - can you force trim from just the R2 read?

    Any other suggestions for dealing with poor quality, primarily the last 50 bases of the R2 read (in a 2x250 HiSeq run)?

    Thanks very much,
    MC



  • Brian Bushnell
    replied
Tadpole is a purely kmer-based assembler, so by the time assembly is being done, there is no knowledge of which reads were used; therefore, Tadpole can't provide that output. There are two ways to get it:

    1) Map the reads to the assembly with BBMap and use the "outm" and "outu" flags to capture mapped and unmapped reads (see the sketch after the BBDuk command below).
    2) Use BBDuk to capture reads that share or do not share kmers with the assembly:

    Code:
    bbduk.sh in=reads.fq ref=contigs.fa k=31 mm=f out=unmatched.fq outm=matched.fq
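    For option 1, a minimal BBMap sketch (file names are placeholders):

    Code:
    bbmap.sh in=reads.fq ref=contigs.fa nodisk outm=mapped.fq outu=unmapped.fq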



  • JVGen
    replied
    Hi Brian,

    I was wondering if there was a way to save the reads that are used during contig assembly, so that they can be mapped back to the generated consensus in a downstream step. I noticed the 'outd=<file>' flag will list discarded reads - anything similar for used reads?

    I'm quite happy with Tadpole's ability to assemble PCR-amplified viral genomes after Illumina sequencing. I've compared it to SPAdes and CLC. The one thing I'd like is to see which reads are getting used, discarded, etc.

    Thanks!



  • Brian Bushnell
    replied
    This appears to be getting killed by the scheduler because it's using more memory than it's allowed to, although usually that happens much faster so I can't be sure. I suggest you reduce the -Xmx argument. However, I don't know what you should reduce it to, since I don't know the memory limit for that node... but you could just try, say, -Xmx50g and see if that works. Otherwise, I suggest you talk to your sysadmin and ask how much memory you are allowed to use on that node, or why the job got killed.
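    For example, an excerpt from a hypothetical SLURM batch script pairing the scheduler's memory request with a lower -Xmx (the scheduler, limits, and file names are all assumptions here):

    Code:
    #!/bin/bash
    #SBATCH --mem=64g
    # Leave headroom below the scheduler limit; the JVM needs memory beyond -Xmx.
    bbmerge-auto.sh -Xmx50g in1=r1.fq in2=r2.fq out=merged.fq outu=unmerged.fq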



  • mcmc
    replied
    I tried the command you suggested; after first running out of memory (with -Xmx54g), I retried with prealloc=t, and now I get this error:

    Code:
    Initial size set to 19219231
    Initial:
    Ways=71, initialSize=19219231, prefilter=t, prealloc=1.0
    Memory: max=55566m, free=54116m, used=1450m
    
    Made prefilter:         hashes = 2       mem = 8.47 GB          cells = 36.38B          used = 11.568%
    Estimated valid kmers:          1095698442
    Prefilter time: 213.467 seconds.
    After prefilter:
    Memory: max=55566m, free=37242m, used=18324m
    
    Estimated kmer capacity:        1228108886
    After table allocation:
    Memory: max=55613m, free=17712m, used=37901m
    
    bbmerge-auto.sh: line 60: 35987 Killed
    Is this still a memory issue?

    Thanks for your help,
    MC



  • Brian Bushnell
    replied
    So, I would recommend a command like this:

    Code:
    bbmerge-auto.sh in1=r1.fq in2=r2.fq out=merged.fq outu=unmerged.fq prefilter=1 extend2=50 k=62 rem adapter=default
    This operates in three phases:

    1) Reads are processed and kmers are counted approximately, for the prefilter flag.
    2) Reads are processed again and kmers are counted exactly, ignoring kmers that only occur once (to save memory).
    3) Reads are processed again, and merging occurs:
    3a) For each pair, merging is attempted.
    3b) If a good overlap is not discovered, each read is extended by up to 50bp on the right end only, and merging is attempted again. If they still don't overlap, the extension is undone. Otherwise, they will be merged.

    This means that with the flag "extend2=50" you could get up to ~100 bp in the middle that is created from the kmers in other reads. This is not really different from normal assembly; assuming you will assemble this data at some point, this process is going to occur eventually. It's true that this process could result in the formation of chimeric sequence, but even with a complex metagenome, Tadpole has a very low rate of chimeric sequence generation. It will basically only happen if you have two strains of a microbe, one that is over 20x as abundant as the other (you can adjust that '20' with the 'branchmult' flag); in that case, the middle of a pair of reads from the less-abundant strain might get filled with sequence from the more-abundant strain. But in practice this is very rare.

    With a median insert of 292, it seems unlikely to me that 25% of the reads would be too far apart to overlap (>~490bp insert). Can you post the insert size distribution (you can get it with the flag ihist=ihist.txt)? It's possible there's another reason for the low merging rate, such as a high error rate, in which case error-correction would be more effective than extension. Of course, error-correction also poses the risk of creation of chimeric sequence, but again, at a very low rate.
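    For reference, the histogram can be generated with a command along these lines (file names assumed):

    Code:
    bbmerge.sh in1=r1.fq in2=r2.fq ihist=ihist.txt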
    Last edited by Brian Bushnell; 02-01-2017, 03:54 PM.



  • mcmc
    replied
    Hi Brian,
    I have a very basic question about how merging works with the tadpole extend option. I have 2x250 HiSeq metagenome data, and about 75-80% of the reads merge by overlap after adapter trimming (median insert = 292, so I wasted a lot of sequencing, but that's another story). I've tried loose and vloose, which gained a couple % over default.

    I'm wondering if the tadpole extend option might help merge a bit more, if we're just missing overlaps by a little bit. But I'm confused about how it works. Is the idea that it uses the rest of the dataset to find joins between my FW and RV read? Are those filled in with N's in a merged fragment (to give an insert size estimate) or do they actually get filled in with sequence from other reads (which I would not want in this case, since I have a complex metagenome)?

    Thanks very much,
    MC



  • Brian Bushnell
    replied
    Yes, the input can be comma-delimited. For error-correction:

    Code:
    tadpole.sh in=a.fq.gz,b.fq.gz out=a_ecc.fq.gz,b_ecc.fq.gz ecc
    For assembly:

    Code:
    tadpole.sh in=a.fq.gz,b.fq.gz out=contigs.fa
    Note, incidentally, that gzipped files can be concatenated without decompressing them first, and the result is still valid.
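    For instance, a simple shell sketch (file names are placeholders):

    Code:
    cat a.fq.gz b.fq.gz > combined.fq.gz
    tadpole.sh in=combined.fq.gz out=contigs.fa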



  • seq4franken
    replied
    multiple fastq.gz for Tadpole

    Hello,
    Is it possible to specify multiple fastq.gz files (paired-end) as input for Tadpole? I'm trying to find a way to avoid decompressing and concatenating them first.

    Thanks for any suggestion!
    Alfons



  • JVGen
    replied
    Originally posted by Brian Bushnell View Post
    I don't really know where the high mutation rate comes from in supposedly isolate libraries. Unfortunately, there's no mechanism in Tadpole to fix them - it always halts at a branch and breaks into two contigs, even if the cause is a single SNP. You can use the "branchmult1" and "branchmult2" flags to adjust this, though. Dropping them to "bm1=6 bm2=2" can often substantially increase continuity: "bm1=6" means it will continue past a branch when the allele ratio of the next base is at least 6:1, rather than the default of at least 20:1. With an allele ratio of 1:1 like you have, it will never continue, though. Instead, you might try a scaffolding or gap-filling program that will use paired-end reads to glue the two contigs back together; I have not written or tested any, but I know several are available.

    "rinse" and "shave" can, in some cases, improve continuity by remove very low-depth errors, but that's not the case here. "shave=f" will not do anything, though. "t" and "f" stand for "true" and "false". The default for "shave" is already "false". So, you can try enabling these with "rinse shave" or "rinse=t shave=t" or "rinse=true shave=true", which are all equivalent.

    Anyway - believe me, I would also prefer in your case for Tadpole to assemble this as a single contig, but it generally won't do that in the case of such evenly-split alleles. Though maybe with "bm1=1.1 bm2=1.05" it would; I'm not really sure what would happen in that case.
    Thank you for the input, Brian. I tried bm1=8 bm2=2 earlier, as per info available on JGI's website. Unfortunately it still splits, though I haven't tried bm1=6 yet. Of course, this is an instance where we want the contigs to be merged, but in other cases such lenient branching parameters wouldn't be ideal - for instance, if in the above example there were more 'SNPs' present on either side of this single mixed call, we might have a mixed reaction.

    It's almost as if there needs to be an "if" rule that takes into account other nearby basecalls. One mixed basecall per, say, 1000 bp isn't so worrisome, but 1 per 10 certainly is. Perhaps this is what scaffolding does?

    For now, I'm using CLC to de novo assemble. I'm on a trial license right now...my PI will have to eat the bill if this ends up being the best option. I have a feeling it's gonna cost a pretty penny x_x.

    Jake



  • Brian Bushnell
    replied
    I don't really know where the high mutation rate comes from in supposedly isolate libraries. Unfortunately, there's no mechanism in Tadpole to fix them - it always halts at a branch and breaks into two contigs, even if the cause is a single SNP. You can use the "branchmult1" and "branchmult2" flags to adjust this, though. Dropping them to "bm1=6 bm2=2" can often substantially increase continuity: "bm1=6" means it will continue past a branch when the allele ratio of the next base is at least 6:1, rather than the default of at least 20:1. With an allele ratio of 1:1 like you have, it will never continue, though. Instead, you might try a scaffolding or gap-filling program that will use paired-end reads to glue the two contigs back together; I have not written or tested any, but I know several are available.
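    As a sketch, lowering the branch multipliers might look like this (input and output names are placeholders):

    Code:
    tadpole.sh in=reads.fq out=contigs.fa bm1=6 bm2=2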

    "rinse" and "shave" can, in some cases, improve continuity by remove very low-depth errors, but that's not the case here. "shave=f" will not do anything, though. "t" and "f" stand for "true" and "false". The default for "shave" is already "false". So, you can try enabling these with "rinse shave" or "rinse=t shave=t" or "rinse=true shave=true", which are all equivalent.

    Anyway - believe me, I would also prefer in your case for Tadpole to assemble this as a single contig, but it generally won't do that in the case of such evenly-split alleles. Though maybe with "bm1=1.1 bm2=1.05" it would; I'm not really sure what would happen in that case.



  • JVGen
    replied
    Originally posted by Brian Bushnell View Post
    Do you know what your read length and approximate depth are? Tadpole's default kmer length is 31, but with sufficient depth and read length, you will get a better assembly with a longer kmer.
    Hi Brian,

    Reads are 150x2, and I generally have >100x coverage. I tried increasing to k=60, but that splits the input into many more contigs.

    My goal here is to extract a consensus sequence for the purpose of annotating ORFs. Prior to this, I mapped the reads to a reference using BBMap, and I'm running into problems with multiple contigs being generated by Tadpole. For instance, in BBMap one of my samples has reads mapped across the entirety of the reference. If I extract the reads that mapped to the reference and use them to de novo assemble in Tadpole, the output is two contigs (the sum of which equals the length of the reference).

    Upon inspection of the BBMap output, I find one ambiguity within the reads at the region where Tadpole has split the assembly into two contigs. About 50% of the reads have an "A" at one nucleotide position, while the other half have a "G". My guess is that this 'SNP' was introduced during my PCR amplification (prior to sequencing) or library-prep PCR. It doesn't suggest the presence of two viral genomes, because everything else is too homogeneous. In my mind, since there is great overlap on both sides of this nucleotide call, I'd rather assemble a single contig and call an ambiguous base for this position: R.

    Any idea on how to accomplish this, and do you agree with my thought? I tried adding "shave=f" as a flag, but still no luck. By the way, what does "f" stand for?




  • Brian Bushnell
    replied
    Hi Jake,

    The problem is the presence of spaces in your filepath. Try adding quotes:

    Code:
    java -ea -Xmx15000m -Xms15000m -cp /Users/DJV/programs/bbmap/current/ assemble.Tadpole in="/Users/DJV/Desktop/IN/150-P6-C9 assembled to HXB2 Nested Amplified Region.fastq" out=/Users/DJV/Desktop/Out/contigs.fa
    Also, you need to add a filename to the output (e.g. /Users/DJV/Desktop/Out/contigs.fa). If using quotes does not work, try renaming the file with underscores rather than spaces. Do you know what your read length and approximate depth are? Tadpole's default kmer length is 31, but with sufficient depth and read length, you will get a better assembly with a longer kmer.



  • JVGen
    replied
    Hi Brian,

    I'm getting an error while trying to use tadpole. Tadpole.sh brings up the expected list of options, so I know that I've directed the terminal correctly.

    However, adding the in=<readsfile.fastq> out=<output directory> doesn't work - I get the following error:


    Code:
    java -ea -Xmx15000m -Xms15000m -cp /Users/DJV/programs/bbmap/current/ assemble.Tadpole in=/Users/DJV/Desktop/IN/150-P6-C9 assembled to HXB2 Nested Amplified Region.fastq out=/Users/DJV/Desktop/Out
    Executing assemble.Tadpole1 [in=/Users/DJV/Desktop/IN/150-P6-C9, assembled, to, HXB2, Nested, Amplified, Region.fastq, out=/Users/DJV/Desktop/Out]
    
    Exception in thread "main" java.lang.RuntimeException: Unknown parameter assembled
    at assemble.Tadpole.<init>(Tadpole.java:447)
    at assemble.Tadpole1.<init>(Tadpole1.java:66)
    at assemble.Tadpole.makeTadpole(Tadpole.java:77)
    at assemble.Tadpole.main(Tadpole.java:64)

    Here's my Java version information:
    java version "1.8.0_111"
    Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
    Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)


    Any idea why I'm getting these errors?

    Thanks
    Jake

