Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Brian Bushnell
    replied
    BBMap does not have any explicit chimera identification capabilities; that would need to be done in post-processing. Also, it does not produce split alignments, but it does do a very good job of identifying long deletions, which it will capture in a single alignment.

    You can run BBMap like this:

    Code:
    bbmap.sh in=reads.fq outm=normal.bam outu=unmapped.fq pairedonly ambig=all maxindel=400k
    That will produce a bam file of the normally-mapped pairs. Then, mapping with a second pass:

    Code:
    bbmap.sh in=unmapped.fq out=odd.bam ambig=all local maxindel=400k
    That will produce a file containing all of the unmapped and unpaired reads. I have not used Lumpy so I'm not sure if this is the best way to do it, but that sounds approximately like what it wants.

    Leave a comment:


  • nicklisbon
    replied
    Good afternoon Brian,
    In my lab we are dealing with a couple of thousands of whole genome seq. data. We are running some tests with BBtools and its performance looks really promising.
    We would like to use Lumpy for structural variants call and this tool requires as inputs beside a "normal" BAM file also a bam file containing unpaired and split reads.
    I have seen that it is possible to run BBmap in local alignment mode but is there any way to identify chimeras with BBmap?
    Thanks in advance.

    Leave a comment:


  • Brian Bushnell
    replied
    Seem my response in this thread: http://seqanswers.com/forums/showthr...313#post201313

    But basically, yes. And no, you don't need to be in the /ref/ directory, you need to be in the directory containing the ref directory.

    By default, there is only one alignment per read.

    Leave a comment:


  • Brian Bushnell
    replied
    I will update the description; it's a bit misleading. What it means is, reads with average quality below the setting will not be mapped. All of the reads will still be printed to "out", even if they failed the quality threshold. However, if you use the "outm" and "outu" streams, only mapped reads will go to outm, and only unmapped reads will go to outu. So, maq would suppress low-quality reads from going to outm.

    Thanks for noting that the long version of the flag does not work. It's actually "minavgquality"; I'll fix that.

    Leave a comment:


  • ShellfishGene
    replied
    qtrim and maq in BBMap

    Hi Brian,

    I'm using BBMap to get reads that do not align to a genome. I want the output quality trimmed, and use
    Code:
    bbmap.sh in=raw_reads.fastq.gz outu=stdout.sam qin=auto qout=33 qtrim=lr maq=30 mintrimlength=30
    The BBDuk docs state that the minimum average quality is checked after clipping. I assume this is the same in BBMap. However with this command I have output such as this:
    Code:
    1600:7:1:1413:2208      653     *       0       0       *       *       0       0       AAATGAAGAGAAAGATTACAGGGAAGTAAC  ##############################
    The read should clearly be removed, I would think? Or does that work only for the mapped reads?
    Also, is there an option to drop all reads that are clipped to below a certain length?

    Btw, the option "minaveragequality" is not recognized, only the short version "maq". Not sure if that's intentional.

    Cheers,

    Till

    Leave a comment:


  • darthsequencer
    replied
    Great - Thanks a bunch!

    Leave a comment:


  • Brian Bushnell
    replied
    I just uploaded BBMap v36.59 which adds "coverage" and "metagenome" flags.

    coverage=X will automatically set "reads" to a level that will give X average coverage (decimal point is allowed).

    metagenome will assign each scaffold a random exponential variable, which decides the probability that a read be generated from that scaffold. So, if you concatenate together 20 bacterial genomes, you can run randomreads and get a metagenomic-like distribution. It could also be used for RNA-seq when using a transcriptome reference.

    The coverage is decided on a per-reference-sequence level, so if a bacterial assembly has more than one contig, you may want to glue them together first with fuse.sh before concatenating them with the other references.

    Leave a comment:


  • darthsequencer
    replied
    That would be awesome Brian! Most of the tools out there are bit wonky, slow, or have fallen to the wayside.

    More of a nuts and bolts question but how do you handle fragmented reference genomes in a simulated metagenome? In the sense that one reference but is in multiple fasta files but will get assigned different coverages because most programs think each file is a different genome.

    Leave a comment:


  • Brian Bushnell
    replied
    You can simulate a mixed dataset just fine that way, but unfortunately metagenomes tend to have an exponential distribution of coverage. So to really simulate a metagenome, you need to mimic that, which would mean (for now) running randomreads on each genome individually with an exponential-random coverage target, then combining all the outputs together.

    I did write a program "mutate.sh" that I use in simulating metagenomes, which takes a genome fasta and produces a mutant version of it at a specified error rate; I use it to do things like verify that error-correction is not erasing strain differences in metagenomes.

    Anyway, I think it will be pretty easy to modify randomreads for a metagenome mode, so I'll try to do that today. And a coverage (rather than #reads) threshold is pretty easy as well.

    Leave a comment:


  • GenoMax
    replied
    Originally posted by Brian Bushnell View Post
    Unfortunately, no to both. And you are correct, simulating metagenomes is a need we have at the JGI so I probably should have developed one by now!

    I'll look into making a wrapper for the purpose; it shouldn't take very long.
    Starting with a bunch of bacterial genomes in a file and using randomreads.sh should produce something acceptable, correct? I have used it in the past with a defined set of genomes to simulate a mixed dataset.

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by Carol_B View Post
    Hi Brian,

    I'm also struggling with the selection of unique alignments. I had previously used the option ambiguous=toss, but for some pair-end reads I only get the forward to be considered aligned. I saw your suggestions of removing the "ambiguos=toss" flag and later filter the alignment based on their mapq value. Nevertheless, I am curious on how to select a correct threshold for my filtering. I am using local=t alignment, is the maximum mapq score different when you use local as compared to global alignment? why do you suggest 4 as a good threshold?

    Thanks in advance !!!

    Carol
    Hi Carol,

    The definition of the mapq is the Phred-scaled (decibel) probability that the mapping is incorrect; q=10 means a 10% chance of being incorrect, q=20 means 1%, q=30 means 0.1%, etc. q=3 indicates 50% - so q=3 or below indicate that the read is at least 50% likely to have come from somewhere else, while q=4 or above means the mapping is probably correct. Therefore, reads that map equally well to multiple locations should be assigned a mapping score of 3 or less, and a mapping score of 4 or more indicates the read cannot map anywhere else as well as the indicated position.

    The local alignments will give different scores than global alignments because mismatches toward the ends of the reads are ignored. In practice, though, there will not be much difference in most cases; mainly just near the ends of contigs, or when dealing with reads that are chimeric/low-quality/not-adapter-trimmed, etc.

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by darthsequencer View Post
    Hi Brian,
    Is there a way to set a desired coverage for randomreads.sh rather than the total number of reads?

    On that note - I'm trying to simulate a metagenome with bbmap tools. Knowing BBMap there's probably a way to do that (especially since JGI has probably needed it too). So can BBmap simulate a metagenome (in terms of abundance profile) from a bunch of reference genomes?
    Unfortunately, no to both. And you are correct, simulating metagenomes is a need we have at the JGI so I probably should have developed one by now!

    I'll look into making a wrapper for the purpose; it shouldn't take very long.

    Leave a comment:


  • darthsequencer
    replied
    Hi Brian,
    Is there a way to set a desired coverage for randomreads.sh rather than the total number of reads?

    On that note - I'm trying to simulate a metagenome with bbmap tools. Knowing BBMap there's probably a way to do that (especially since JGI has probably needed it too). So can BBmap simulate a metagenome (in terms of abundance profile) from a bunch of reference genomes?

    Leave a comment:


  • Carol_B
    replied
    Originally posted by Brian Bushnell View Post
    You can filter proper pairs only with the "po" flag; unpaired reads will then be classified as unmapped. You can use the 'outm' stream to just get mapped reads. You can also filter with samtools after the fact.

    From these, you could filter only the pairs where at least one read is uniquely aligned (mapq>3 or some higher threshold for greater confidence), but this can't be done in BBMap currently; you could, however, do this with a custom script or with samtools.
    Hi Brian,

    I'm also struggling with the selection of unique alignments. I had previously used the option ambiguous=toss, but for some pair-end reads I only get the forward to be considered aligned. I saw your suggestions of removing the "ambiguos=toss" flag and later filter the alignment based on their mapq value. Nevertheless, I am curious on how to select a correct threshold for my filtering. I am using local=t alignment, is the maximum mapq score different when you use local as compared to global alignment? why do you suggest 4 as a good threshold?

    Thanks in advance !!!

    Carol

    Leave a comment:


  • GenoMax
    replied
    Originally posted by sebl View Post
    Hi,

    I would like to calculate the average coverage of some genomes. I used:

    Code:
    bbmap.sh in1=reads_R1.fq in2=reads_R2.fq ref=scaffolds.fasta nodisk covstats=stats.txt covhist=histogram.txt
    I get nice output files with coverage etc. for each scaffold, but the average for the whole genome is shown only in the terminal. How could I output it to a file, and preferably in a way that I can get tens of genomes coverages summarized in the same file?

    Thanks!
    Capture the standard out/error streams to a file (would depend on your shell but examples for bash are here) and you will have that information in a file.

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM
  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
31 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
32 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 09:21 AM
0 responses
28 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-04-2024, 09:00 AM
0 responses
53 views
0 likes
Last Post seqadmin  
Working...
X