Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Brian Bushnell
    replied
    Thanks for noting that; I've corrected it. Originally, for error-correction, you had to say "oute" for the output but I changed that a while ago so the current syntax is "out1" and "out2".

    Leave a comment:


  • Brian Bushnell
    replied
    Since you are pulling the kmer information from randomly-sheared metagenomic reads, and using the variable region as a seed, it should be safe. However, there is a possibility of running into a tandem repeat region during extension when you give it a very high extension limit, so I recommend also adding the flag "ibb=f". It probably won't make any difference though.

    Leave a comment:


  • bavci
    replied
    I just saw this comment from Brian Bushnell:

    P.S. DO NOT use read-extension or error-correction for metagenomic 16S or other amplicon studies! It is intended only for randomly-sheared fragment libraries. Error-correction or read-extension using any algorithm are a bad idea for any amplicon library with a long primer. For normal metagenomic fragment libraries, these operations should be useful and safe if you specify a sufficiently long K.

    I recently used Tadpole for contig extension. After reading this comment, I am not sure Tadpole is a reliable tool for this. I have full-length 16S rRNA gene sequences from clone libraries. I took the variable region of 16S rRNA gene sequences (first 300 nt nucleotides) and mapped on metagenome contigs which are taxonomically classified as taxa of interest. These metagenome contigs were assembled from randomly-sheared metagenomic fragment libraries (sequencing platfrom: HiSeq). 300 nt nucleotide of 16S rRNA gene sequence perfectly mapped on a contig. I then run Tadpole to extend this contig using all pair-end reads in this metagenome. I could extend the contig up to 6500 nucleotide.

    Is Tadpole reliable for such usages?

    Leave a comment:


  • ycl6
    replied
    Hi gringer, thanks, I didn't realize the multiple fastq files wasn't concatenated properly before inputting them to tadpole. I'll fix it and try again.

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by gringer View Post
    From the error messages, your input read files are corrupted:



    Do you have any other program that can read the files to check for errors? At the very least, a simple count of lines would be useful:

    Code:
    wc -l read1.corrected.fq.gz; wc -l read2.corrected.fq.gz
    [numbers should be identical, and this should produce no errors]
    Gringer is correct; the files appear to be corrupt. But just to add a small correction, it does not look like wc can handle gzipped files, so I'd suggest either:

    Code:
    zcat read1.corrected.fq.gz | wc -l; zcat read2.corrected.fq.gz | wc -l
    ...which I have no idea how it would behave on a corrupt file, but should produce a correct answer for a correct file. Or this:
    Code:
    gzip -t read1.corrected.fq.gz; gzip -t read2.corrected.fq.gz
    ...which will test the integrity of the files. It seems to print nothing if the file is OK; if the file is truncated, it will print:
    Code:
    gzip: x.gz: unexpected end of file
    Probably, the process generating the corrected files (probably Tadpole) was killed or timed out before it was finished.

    Leave a comment:


  • gringer
    replied
    From the error messages, your input read files are corrupted:

    ...Unexpected end of ZLIB input stream
    ...Unexpected end of ZLIB input stream
    ...The pairing may have been corrupted by an upstream process.
    Do you have any other program that can read the files to check for errors? At the very least, a simple count of lines would be useful:

    Code:
    wc -l read1.corrected.fq.gz; wc -l read2.corrected.fq.gz
    [numbers should be identical, and this should produce no errors]

    Leave a comment:


  • ycl6
    replied
    Hi, I am using tadpole from BBMap v36.11 and received some java error messages. The command I used is:

    tadpole.sh -Xmx1000g in1=read1.corrected.fq.gz in2=read2.corrected.fq.gz out=contigs.fa.gz

    Code:
    java.io.EOFException: Unexpected end of ZLIB input stream
            at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
            at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
            at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
            at fileIO.ByteFile1.fillBuffer(ByteFile1.java:217)
            at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
            at fileIO.ByteFile2$BF1Thread.run(ByteFile2.java:264)
    open=true
    
    java.io.EOFException: Unexpected end of ZLIB input stream
            at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240)
            at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158)
            at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117)
            at fileIO.ByteFile1.fillBuffer(ByteFile1.java:217)
            at fileIO.ByteFile1.nextLine(ByteFile1.java:136)
            at fileIO.ByteFile2$BF1Thread.run(ByteFile2.java:264)
    open=true
    
    Exception in thread "Thread-4" java.lang.AssertionError:
    There appear to be different numbers of reads in the paired input files.
    The pairing may have been corrupted by an upstream process.  It may be fixable by running repair.sh.
            at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:480)
            at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:345)
            at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:189)
            at java.lang.Thread.run(Thread.java:745)

    Leave a comment:


  • GenoMax
    replied
    Originally posted by JeanLove View Post
    Hi! Although this is a very helpful post, there is still so much unclarities about Tadpole.
    As input, I have fasta file with ~11 mil reads, and I would like for Tadpole to take 50% random reads from this fasta file.
    I see that there is reads=-1 input parameter, where -1 (default) stands for all reads, so I am curious what could stand for 50% etc? Cannot find any info on this parameter, so I would appreciate any help!
    Provide a number that you want to use e.g. reads=55000000. I don't know if that randomly subsamples though. It may be first 5.5M reads. @Brian will have to clarify.

    If you want to independently subsample the reads you could also use reformat.sh or seqtk (sample) from Heng Li.

    Leave a comment:


  • JeanLove
    replied
    Hi! Although this is a very helpful post, there is still so much unclarities about Tadpole.
    As input, I have fasta file with ~11 mil reads, and I would like for Tadpole to take 50% random reads from this fasta file.
    I see that there is reads=-1 input parameter, where -1 (default) stands for all reads, so I am curious what could stand for 50% etc? Cannot find any info on this parameter, so I would appreciate any help!

    Leave a comment:


  • nepossiver
    replied
    Originally posted by Brian Bushnell View Post
    This works well with error-corrected PacBio reads, which have a very unbiased coverage distribution. It was able to isolate and assemble the mitochondria in 4/4 fungi tested. I'm not sure how well it works on Illumina reads, though. Also, note that the default k=100 is designed for reads of at least 150bp.
    ok, noted. I will test on Illumina reads and report back later. Many thanks.

    Leave a comment:


  • Brian Bushnell
    replied
    Sure:

    Code:
    #First link reference as ref.fa and reads as reads.fa
    
    /global/projectb/sandbox/gaag/bbtools/jgi-bbtools/kmercountexact.sh in=reads.fq.gz khist=khist_raw.txt peaks=peaks_raw.txt
    
    primary=`grep "haploid_fold_coverage" peaks_raw.txt | sed "s/^.*\t//g"`
    cutoff=$(( $primary * 3 ))
    
    bbnorm.sh in=reads.fq.gz out=highpass.fq.gz pigz passes=1 bits=16 min=$cutoff target=9999999
    reformat.sh in=highpass.fq.gz out=highpass_gc.fq.gz maxgc=0.45
    
    kmercountexact.sh in=highpass_gc.fq.gz khist=khist_100.txt k=100 peaks=peaks_100.txt smooth ow smoothradius=1 maxradius=1000 progressivemult=1.06 maxpeaks=16 prefilter=2
    
    mitopeak=`grep "main_peak" peaks_100.txt | sed "s/^.*\t//g"`
    
    upper=$((mitopeak * 6 / 3))
    lower=$((mitopeak * 3 / 7))
    mcs=$((mitopeak * 3 / 4))
    mincov=$((mitopeak * 2 / 3))
    
    tadpole.sh in=highpass_gc.fq.gz out=contigs78.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=78
    tadpole.sh in=highpass_gc.fq.gz out=contigs100.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=100
    tadpole.sh in=highpass_gc.fq.gz out=contigs120.fa prefilter=2 mincr=$lower maxcr=$upper mcs=$mcs mincov=$mincov k=120
    
    bbduk.sh in=highpass.fq.gz ref=contigs100.fa outm=bbd005.fq.gz k=31 mm=f mkf=0.05
    
    tadpole.sh in=bbd005.fq.gz out=contigs_bbd.fa prefilter=2 mincr=$((mitopeak * 3 / 8)) maxcr=$((upper * 2)) mcs=$mcs mincov=$mincov k=100 bm1=6
    
    ln -s contigs_bbd.fa contigs.fa
    This works well with error-corrected PacBio reads, which have a very unbiased coverage distribution. It was able to isolate and assemble the mitochondria in 4/4 fungi tested. I'm not sure how well it works on Illumina reads, though. Also, note that the default k=100 is designed for reads of at least 150bp.

    Leave a comment:


  • nepossiver
    replied
    Originally posted by Brian Bushnell View Post
    Don't feel like you have to use all aspects of Tadpole in order to use it effectively! I am currently using it for mitochondrial assembly also, because it's easy to set a specific depth band to assemble, and thus pull out the mito without the main genome after identifying it on a kmer frequency histogram (in fact, I wrote a script to do this automatically). But in that case I don't actually use the error-correction or extension capabilities, as they are not usually necessary as the coverage is already incredibly high and low-depth kmers are being ignored. I use those more for single-cell work, which has lots of very-low-depth regions.
    How do you identify the mitochondrial band? Could you share the script?

    Leave a comment:


  • RobMaher
    replied
    Hi, Brian,

    Thanks very much for your quick feedback. That does help, thank you. I was able to specify reads=1000 and that approximated the 10% for some test cases. I took a quick shot at normalizing, and I'm having some issues with it, but they may have more to do with Geneious than anything.

    Thanks again!

    Rob

    Leave a comment:


  • Brian Bushnell
    replied
    Hi Rob,

    Tadpole does not do subsampling; you'd have to first sample 10% of the reads with another tool (such as Reformat, with the "samplerate=0.1" flag). However, you CAN restrict the input to a certain number of reads. "reads=400k", for example, will only use the first 400,000 reads (or if the reads are paired, the first 400,000 pairs).

    Tadpole also has "mincr" and "maxcr" flags which allow you to ignore kmers with counts outside of some band, and is sometimes useful for ignoring low-depth junk than can occur when you have too much data:

    mincountretain=0 (mincr) Discard kmers with count below this.
    maxcountretain=INF (maxcr) Discard kmers with count above this.
    Also, I recommend you try normalization (targeting maybe 100x coverage) - there are many situations where normalization gives a better assembly than subsampling, particularly when the coverage of features is highly variable.

    Good luck!

    Leave a comment:


  • RobMaher
    replied
    Hi, Brian,

    Thanks for building this amazing assembler. I have been using it from within the Geneious software package for a few weeks now with great success. Quick question: is it possible to enter a command line under 'custom Tadpole options' that will cause the assembler to only use a specific percentage of the data? Also, is it possible to specify usage of maximum number of reads when performing the assembly (as opposed to a percent)?

    I ask because I am trying to assembly a large number of plasmids from MiSeq FASTQ data, and I've found that the results are much better when only using 10% of the data. However, there is a bug wherein this option is missing from the workflow version of Tadpole; and another where it defaults to 100% when attempting to assemble each sequence list separately. They know about the first bug and are fixing it, but I thought there might be a quick workaround via the command line.

    Thanks!

    Rob

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Non-Coding RNA Research and Technologies
    by seqadmin


    Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

    [Article Coming Soon!]...
    Yesterday, 08:07 AM
  • seqadmin
    Recent Developments in Metagenomics
    by seqadmin





    Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
    09-23-2024, 06:35 AM
  • seqadmin
    Understanding Genetic Influence on Infectious Disease
    by seqadmin




    During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

    Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
    09-09-2024, 10:59 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 10-02-2024, 04:51 AM
0 responses
14 views
0 likes
Last Post seqadmin  
Started by seqadmin, 10-01-2024, 07:10 AM
0 responses
25 views
0 likes
Last Post seqadmin  
Started by seqadmin, 09-30-2024, 08:33 AM
1 response
31 views
0 likes
Last Post EmiTom
by EmiTom
 
Started by seqadmin, 09-26-2024, 12:57 PM
0 responses
20 views
0 likes
Last Post seqadmin  
Working...
X