  • 348056755@qq.com
    replied
    Hi Brian,

    I am using bbnorm.sh to filter out low-depth reads and error-correct them in the same pass. I have a question about it:
    The raw paired-end read files (.gz) are 223 + 229 MB; after running bbnorm.sh, the outputs are 111 + 113 + 64 MB (out, out2, and outt).
    So, what is the difference between the 64 MB of reads in outt and the reads that were filtered out (164 MB = 223 + 229 - 111 - 113 - 64)?
    This is my command:
    bbnorm.sh in=input_fq1.gz in2=input_fq2.gz zerobin=t prefilter=t maxdepth=1000 lowbindepth=10 highbindepth=500 ecc=t out1=bbnorm.fq1.gz out2=bbnorm.fq2.gz outt=exclued.fq.gz

    Thanks,
    Xiao

  • Brian Bushnell
    replied
    Originally posted by jazz710:
    Oh cool! Can I specify the kmer size? And can it accept paired end FASTQ files like the other tools? Ideally, I'd like to take a pair of PE FASTQs and extract the unique 31-mers (or some other value of k depending on needs). Currently I use Jellyfish for this.
    Yes and yes!

  • jazz710
    replied
    Oh cool! Can I specify the kmer size? And can it accept paired end FASTQ files like the other tools? Ideally, I'd like to take a pair of PE FASTQs and extract the unique 31-mers (or some other value of k depending on needs). Currently I use Jellyfish for this.
    Last edited by jazz710; 12-05-2015, 09:17 AM. Reason: More complete

  • Brian Bushnell
    replied
    Hi Bob,

    That's not possible with BBNorm, as it uses a lossy data structure called a count-min sketch to store counts. However, you can do that with KmerCountExact, which is faster than BBNorm, but less memory-efficient. Usage:

    kmercountexact.sh in=reads.fq out=kmers.fasta


    That will print them in fasta format; for 2-column tsv, add the flag "fastadump=f". There are also flags to suppress storing or printing of kmers with low counts.
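
    To make the distinction concrete, here is a toy count-min sketch in Python (my own illustration, not BBNorm's actual code): the structure keeps only hashed counters, so it can answer "how many times did this kmer occur?" but the kmer strings themselves are never stored and so cannot be listed back out.

    # Toy count-min sketch: approximate counts in fixed-size hash tables.
    import random

    class CountMinSketch:
        def __init__(self, width=1000, depth=4, seed=1):
            rng = random.Random(seed)
            self.salts = [rng.getrandbits(64) for _ in range(depth)]
            self.tables = [[0] * width for _ in range(depth)]
            self.width = width

        def add(self, kmer):
            for salt, table in zip(self.salts, self.tables):
                table[hash((salt, kmer)) % self.width] += 1

        def count(self, kmer):
            # Collisions can only inflate a counter, so the minimum across
            # tables is an upper bound that is usually close to the truth.
            return min(table[hash((salt, kmer)) % self.width]
                       for salt, table in zip(self.salts, self.tables))

    cms = CountMinSketch()
    for kmer in ["ACGTA", "ACGTA", "TTTTC"]:
        cms.add(kmer)
    print(cms.count("ACGTA"))  # ~2; no kmer string is retained anywhere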

  • jazz710
    replied
    Unique kmers from BBNorm?

    Brian,

    I see that in the output of BBNorm there are counts of unique kmers. Is there an option within BBNorm to export the unique kmers as a list or FASTA?

    Best,

    Bob

  • Brian Bushnell
    replied
    I have the manuscript mostly written, but it's not really ready to submit anywhere yet. However, there is a postdoc who is eager to get started on preparing it for submission, so... hopefully soon?

  • C. Olsen
    replied
    Comparison paper

    Hello Brian,

    You mentioned the intent to submit a paper in March comparing BBNorm with other methods. Were you able to scrape up any time to submit something? I'm keen to read it.

  • Brian Bushnell
    replied
    I'll read the article in a few days, and comment on it then. As Titus stated, you cannot do binning by depth after normalization - it destroys that information. Furthermore, MDA'd single cells cannot be individually binned for contaminants based on depth, as the depth is exponentially random across the genome.

    I use BBNorm (with the settings target=100 min=2) to preprocess amplified single cells prior to assembly with Spades, as it vastly reduces the total runtime and memory use, meaning that the jobs are much less likely to crash or be killed. If you want to reduce contamination, though, I have a different tool called CrossBlock, which is designed to eliminate cross-contamination between multiplexed single-cell libraries. You need to first assemble all the libraries, then run CrossBlock with all of the libraries and their reads (raw, not normalized!) as input; it essentially removes contigs from assemblies that have greater coverage from another library than their own library. Incidentally, CrossBlock does in fact use BBNorm.
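
    As a toy illustration of that decision rule (my own sketch with hypothetical library names, not CrossBlock's actual implementation): a contig is flagged when some other library covers it more deeply than the library it was assembled from.

    # Hypothetical per-contig coverage table: contig -> {library: fold coverage}.
    coverage = {
        "contig_1": {"libA": 80.0, "libB": 2.0},
        "contig_2": {"libA": 1.5, "libB": 95.0},  # probably belongs to libB
    }

    def flag_cross_contamination(coverage, own_library):
        """Return contigs whose coverage from any other library exceeds
        the coverage from the library they were assembled in."""
        flagged = []
        for contig, per_lib in coverage.items():
            own = per_lib.get(own_library, 0.0)
            foreign = max((c for lib, c in per_lib.items() if lib != own_library),
                          default=0.0)
            if foreign > own:
                flagged.append(contig)
        return flagged

    print(flag_cross_contamination(coverage, "libA"))  # ['contig_2']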

    The latest version of Spades does not really have too much trouble with high-abundance kmers, unless they get extremely high or you have a limited amount of memory. So, you don't HAVE to normalize before running Spades, but normalization tends to give a comparable assembly with a small fraction of the resources - typically with slightly better contiguity and slightly lower misassembly rates, with slightly lower genome recovery, but a slightly higher rate of long genes being called (according to Quast).

    On the other hand, if you want to assemble MDA-amplified single-cell data with an assembler designed for isolate data, normalization is pretty much essential for a decent assembly.
    Last edited by Brian Bushnell; 09-05-2015, 09:32 PM.

  • titusbrown
    replied
    Hi suzumar, if you apply any kind of coverage normalization (whether with khmer or bbnorm) prior to assembly, you won't be able to use those reads to calculate the coverage of the contigs. However, you can go back and use the unnormalized reads to calculate the contig coverage.
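
    For example, once the raw reads are mapped back to the assembly, mean per-contig coverage can be computed from a per-base depth table (a minimal sketch; it assumes a three-column contig/position/depth TSV such as "samtools depth" produces, and the filename is hypothetical):

    # Mean coverage per contig from a contig<TAB>pos<TAB>depth table.
    from collections import defaultdict

    totals = defaultdict(lambda: [0, 0])  # contig -> [depth sum, positions seen]
    with open("depth.tsv") as fh:
        for line in fh:
            contig, _, depth = line.rstrip("\n").split("\t")
            totals[contig][0] += int(depth)
            totals[contig][1] += 1

    for contig, (depth_sum, n) in totals.items():
        print(contig, depth_sum / n)  # mean depth over covered positions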

    Re the Science paper, I don't know what they actually did; if you can find the software and/or detailed instructions in their methods, I'd be happy to comment.

  • suzumar
    replied
    Coverage Binning

    Hi Brian
    I am trying to assemble a genome from single-cell amplification using Spades, and subsequently using tetramer + coverage analysis (e.g., CONCOCT) to remove "contaminating reads" that seem to be introduced during sequencing. If I understood correctly, BBNorm will keep enough information to allow coverage to be used subsequently to bin contigs. Is that the case?

    I also would like to know your opinion on how bbnorm compares to the strategy described in https://www.sciencemag.org/cgi/conte.../6047/1296/DC1

    I quote:

    "High-depth k-mers, presumably derived from MDA amplification bias, cause problems in the assembly, especially if the k-mer depth varies in orders of magnitude for different regions of the genome. We removed reads representing high-abundance k-mers (>64x k-mer depth) and trimmed reads that contain unique k-mers. The results of the k-mer based coverage normalization are shown in table S8."

  • Brian Bushnell
    replied
    BBNorm cannot be used for raw PacBio data, as the error rate is too high; it throws everything away because the apparent depth is always 1, since essentially all the kmers are unique. Normalization uses 31-mers (by default), which requires that, on average, the error rate be below around 1/40, so that a large fraction of kmers will be error-free. Raw PacBio data, however, has on average a 1/7 error rate, so most programs that use long kmers will not work on it at all. BBMap, on the other hand, uses short kmers (typically 9-13) and can process PacBio data, but it does not do normalization - a longer kmer is needed for that.
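
    The arithmetic behind that cutoff, assuming errors land independently at each base so that the chance a 31-mer is error-free is (1 - e)^31:

    # Fraction of error-free kmers under an independent-errors model.
    k = 31
    for label, e in [("error rate 1/40", 1 / 40), ("raw PacBio, 1/7", 1 / 7)]:
        p = (1 - e) ** k
        print(f"{label}: {p:.3f} of {k}-mers error-free")
    # error rate 1/40: 0.456 -> nearly half of all kmers are usable
    # raw PacBio, 1/7: 0.008 -> almost every kmer looks unique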

    PacBio CCS or "Reads of Insert" that are self-corrected, with multiple passes to drop the error rate below 3% or so, could be normalized by BBNorm. So, if you intentionally fragment your reads to around 3kbp or less, but run long movies, then self-correct, normalization should work.

    PacBio data has a very flat coverage distribution, which is great, and means that typically it does not need normalization. But MDA'd single cells have highly variable coverage regardless of the platform, and approaches like HGAP to correct by consensus of multiple reads covering the same locus will not work anywhere that has very low coverage. I think your best bet is really to shear to a smaller fragment size, self-correct to generate "Reads of Insert", and use those to assemble. I doubt normalization will give you a better assembly with error-corrected single-cell PacBio data, but if it did, you would have to use custom parameters to not throw away low-coverage data (namely, "mindepth=0 lowthresh=0"), since a lot of the single-cell contigs have very low coverage. BBNorm (and, I imagine, all other normalizers) have defaults set for Illumina reads.

  • inkang
    replied
    Dear Brian,

    Is it possible to use BBNorm to normalize filtered subreads from PacBio sequencing?

    BBNorm seems to work well for my Illumina MiSeq data.
    But, when I tried it for PacBio filtered subreads (fastq format), the resulting output file was empty.

    Actually, I'm working on PacBio sequencing data from MDA-amplified single cell genomes. When I assembled the data using HGAP (SMRT Analysis), the results were not good (too many rather short contigs). Some of my colleagues told me that uneven coverage might be the cause of bad assembly. So, I've been trying to normalize raw data.

    Thanks.

  • Brian Bushnell
    replied
    At default settings, BBNorm will correct using kmers that occur a minimum of 22 times (the echighthresh flag). Whether 50x is sufficient to do a good job depends on things like the ploidy of the organism, how heterozygous it is, and the uniformity of your coverage. 50x of fairly uniform coverage is plenty for a haploid organism, or a diploid organism like human with a low het rate, but not for a wild tetraploid plant. You can always reduce the setting, of course, but I would not put it below ~16x typically. You can't get good error-correction with 3x depth no matter what you do. Bear in mind that the longer the kmer is compared to read length, the lower the kmer depth will be compared to read depth.
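
    To put that last sentence in numbers (a back-of-the-envelope model, ignoring errors and read edges): a read of length L contributes L - k + 1 kmers, so kmer depth is roughly read depth scaled by (L - k + 1) / L.

    # Approximate kmer depth from read depth for kmer size k.
    def kmer_depth(read_depth, read_len, k=31):
        return read_depth * (read_len - k + 1) / read_len

    print(kmer_depth(50, 100))  # 100 bp reads at 50x -> 35x kmer depth
    print(kmer_depth(50, 300))  # 300 bp reads at 50x -> 45x kmer depth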

    To deal with multiple different data sources, you can run BBNorm with the "extra" flag, which uses additional data to build the kmer frequency tables without including it in the output, like this:

    ecc.sh in=miseq.fq out=miseq_ecc.fq extra=gaII_1.fq,gaII_2.fq,gaII_3.fq

    That would take extra processing time, since all the data would have to be reprocessed for every output file you generate. Alternately, you can do this:

    bbrename.sh miseq.fq addprefix prefix=miseq
    bbrename.sh gaII_1.fq addprefix prefix=gaII_1

    ...etc

    Then cat all the files together, and error-correct them:
    ecc.sh in=combined.fq out=ecc.fq ordered int=f

    Then demultiplex:
    demuxbyname.sh in=ecc.fq out=demuxed_%.fq names=miseq,gaII_1 int=f
    (note the % symbol; it will be replaced by a name)

    That will keep the read order the same. So, if all the reads are initially either single-ended or interleaved (i.e. one file per library), pairs will be kept together, and you can de-interleave them afterward if you want. You can convert between 2-file and interleaved formats with reformat.sh.

  • dcard
    replied
    Optimal depth for read error correcting

    Hi Brian and others,

    I am wondering what depth you need and what depth is optimal (if the two differ) for proper read error correction using BBMap or any other error-correcting program. The Quake website mentioned >15x coverage, but a quick round of Googling hasn't given me much more than that.

    The reason I ask is that I have a couple lanes of MiSeq data (600-cycle PE sequencing), which each total about 3x coverage of my genome. Therefore, kmer-based error correction wouldn't work well, even if I were to concatenate the two together. We do have an additional HiSeq lane (100 bp PE) and a few GAII lanes (50-60x coverage total), so we have the option of concatenating all of the datasets together (though one GAII lane isn't paired-end). However, we would then have to separate the individual lanes back out, since we next plan to merge the MiSeq reads to create longer, 454-like reads.

    Therefore, my second question is: what workflow would be best to accomplish this task? Are there settings in ecc.sh or the like that would allow decent error correction with low coverage? Or alternatively, is there an easy way of separating data from different lanes if we were to concatenate a bunch together to reach the coverage necessary to correct confidently? Thanks in advance for the help.

  • Brian Bushnell
    replied
    I am currently collaborating with another person on writing BBNorm's paper and we plan to submit it in March. I will post here once it gets accepted.
