Announcement

Collapse
No announcement yet.

BBMap (aligner for DNA/RNAseq) is now open-source and available for download.

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Normally, I would just map the reads with the flag "maxindel=100k" (or whatever is appropriate, given the length of the genome) and then call variants with "callvariants.sh" using the flag "rarity=0.01" to find low-frequency events. But alternatively, you can separate the reads with multiple BBMap passes, like this:

    bbmap.sh in=reads.fq outm=mapped.fq maxindel=100k
    bbmap.sh in=mapped.fq outm=normal.fq outu=longdeletion.fq maxindel=100k dellenfilter=20
    bbmap.sh in=longdeletion.fq out=mapped_longdeletion.sam maxindel=100k

    Comment


    • Originally posted by Brian Bushnell View Post
      Normally, I would just map the reads with the flag "maxindel=100k" (or whatever is appropriate, given the length of the genome) and then call variants with "callvariants.sh" using the flag "rarity=0.01" to find low-frequency events. But alternatively, you can separate the reads with multiple BBMap passes, like this:

      bbmap.sh in=reads.fq outm=mapped.fq maxindel=100k
      bbmap.sh in=mapped.fq outm=normal.fq outu=longdeletion.fq maxindel=100k dellenfilter=20
      bbmap.sh in=longdeletion.fq out=mapped_longdeletion.sam maxindel=100k
      Brian, that's awesome, I didn't even know there was a variant caller in your suite. That worked well and I was able to find a bunch of deletions by using the flag "clearfilters" in callvariants.sh. However, I'm sure a lot of it is probably not real. Is there a particular parameter you suggest looking at as a measure of probability of being real? I'm not sure what some of the readouts in the output from callvariants.sh actually mean.

      Thanks again.

      Comment


      • Normally, the variant quality is the best indicator of whether a variant is real. I think, in this case, it's likely that most of them are real - viruses mutate rapidly, so viral "isolates" often contain multiple different viral genomes (which is why they are hard to assemble). The possibilities are, essentially...

        1) Sequencing error randomly matched somewhere distant in the genome, causing the false appearance of a deletion. This is extremely unlikely with small, non-redundant genomes, and would usually only be represented by a single read.
        2) Chimeric reads formed from different genomic fragments fusing together during sequencing may look like long deletion events. These should universally be represented by a single read (or pair) with a PCR-free or deduplicated library.
        3) Real deletion events. Hopefully these will be represented by multiple reads.

        So, you may want to deduplicate your library (if it was PCR'd, otherwise don't) and then call variants using a threshold of, say, 3 reads (minreads=3 flag, placed after "clearfilters").

        Comment


        • Originally posted by Brian Bushnell View Post
          @minja - I replied via email and still have not had a chance to look into that; sorry - I will when I have a chance.

          @catagui - BBMap cannot handle query sequences over 600bp. When you feed it a fasta input file, by default, sequences longer than 500bp are broken into 500bp pieces and mapped individually (and not recombined afterward). Some of these pieces could be quite short (say, 15bp) and will usually map if you have a big enough reference. You can discard them with "minlen=40" or so. Or you could simply use MapPacBio for mapping as it can accommodate much longer reads. In that case, though, you should increase "maxindel" which has different defaults between BBMap and MapPacBio, and furthermore, a longer transcript could contain multiple long introns.

          Try "mapPacBio.sh k=13 maxindel=200000 maxlen=60000 minlen=40 (other arguments)" and see if that improves its mapping rate. Ultimately, though, longer query sequences will always have lower alignment rates than shorter ones when using a global or glocal aligner, so shredding the sequences (which BBMap does) artificially increases the alignment rate.
          Hi Brian thanks for your reply.

          I haven't been able to improve the mapping percentage. I tried your suggestion (bbmap/mapPacBio.sh k=13 maxindel=200000 maxlen=60000 minlen=40) and still get 13%.

          Is it because the mapping is too stringent? As I am dealing with different species within the same genus. Is there another flag aside from "minid" that I could change to improve the mapping.

          Thanks again

          Comment


          • You can reduce kmer length and increase sensitivity - change "k=13" to "k=11" and add "slow". But ultimately, since BBMap is a global aligner, it does not like large structural rearrangements when aligning long queries, since they do not fit into the model of "match/substitution/insertion/deletion" reported in single alignments.

            Comment


            • High error rate with BBmap?

              Hi Brian,

              while trying to call variants on a human MT DNA I noticed the unusually high error rate in mapping with bbmap (37.10 and 36.92 tested) compared to bwa (0.7.12-r1039). Both were run with the default parameters, and below is the output of samtools stats (after sorting and indexing bams). IGV shows many more inverted pairs with bbmap mapping, and this came as a surprise.

              I can provide details, however I prefer to share them offline due to the sensitivity of some of the data. Any hints on how to track down the issue?

              Many thanks!

              BBmap
              Code:
              BBM: raw total sequences:       537764
              BBM: filtered sequences:        0
              BBM: sequences: 537764
              BBM: is sorted: 1
              BBM: 1st fragments:     268882
              BBM: last fragments:    268882
              BBM: reads mapped:      256366
              BBM: reads mapped and paired:   230042  # paired-end technology bit set + both mates mapped
              BBM: reads unmapped:    281398
              BBM: reads properly paired:     228748  # proper-pair bit set
              BBM: reads paired:      537764  # paired-end technology bit set
              BBM: reads duplicated:  0       # PCR or optical duplicate bit set
              BBM: reads MQ0: 0       # mapped and MQ=0
              BBM: reads QC failed:   4282
              BBM: non-primary alignments:    0
              BBM: total length:      67353071        # ignores clipping
              BBM: bases mapped:      32229667        # ignores clipping
              BBM: bases mapped (cigar):      29307   # more accurate
              BBM: bases trimmed:     0
              BBM: bases duplicated:  0
              BBM: mismatches:        1672477 # from NM fields
              BBM: error rate:        5.706749e+01    # mismatches / bases mapped (cigar)
              BBM: average length:    125
              BBM: maximum length:    151
              BBM: average quality:   30.4
              BBM: insert size average:       255.7
              BBM: insert size standard deviation:    117.0
              BBM: inward oriented pairs:     60121
              BBM: outward oriented pairs:    2977
              BBM: pairs with other orientation:      48
              BBM: pairs on different chromosomes:    0
              BWA
              Code:
              BWA: raw total sequences:       537764
              BWA: filtered sequences:        0
              BWA: sequences: 537764
              BWA: is sorted: 1
              BWA: 1st fragments:     268882
              BWA: last fragments:    268882
              BWA: reads mapped:      266324
              BWA: reads mapped and paired:   248064  # paired-end technology bit set + both mates mapped
              BWA: reads unmapped:    271440
              BWA: reads properly paired:     245908  # proper-pair bit set
              BWA: reads paired:      537764  # paired-end technology bit set
              BWA: reads duplicated:  0       # PCR or optical duplicate bit set
              BWA: reads MQ0: 178     # mapped and MQ=0
              BWA: reads QC failed:   0
              BWA: non-primary alignments:    0
              BWA: total length:      67353071        # ignores clipping
              BWA: bases mapped:      33736838        # ignores clipping
              BWA: bases mapped (cigar):      32160973        # more accurate
              BWA: bases trimmed:     0
              BWA: bases duplicated:  0
              BWA: mismatches:        603462  # from NM fields
              BWA: error rate:        1.876380e-02    # mismatches / bases mapped (cigar)
              BWA: average length:    125
              BWA: maximum length:    151
              BWA: average quality:   30.4
              BWA: insert size average:       384.2
              BWA: insert size standard deviation:    1040.7
              BWA: inward oriented pairs:     67383
              BWA: outward oriented pairs:    3645
              BWA: pairs with other orientation:      112
              BWA: pairs on different chromosomes:    0

              Comment


              • I noticed this line:
                BBM: bases mapped (cigar): 29307 # more accurate
                I'm guessing you are not counting "=" and "X" symbols as mapped, just "M"... generally, I would classify "M", "=", "X", and "I" as mapped bases. What version of samtools are you using?

                Also, I find it odd that the insert size average and stdev differ so much. Often the mode or median is more stable as it is less influenced by outliers.

                By inverted pairs, do you mean both reads map to the same side of the reference? If not, I wonder if the fact that the reference is very small and circular could play some role in the differences.

                Anyway, please feel free to email me both bam files and the reference, and I'll look at them. It's possible that the circularity, or the fact that bwa looks for local rather than global alignments, is the primary factor behind the difference.

                Comment


                • Originally posted by Brian Bushnell View Post
                  I noticed this line:
                  Anyway, please feel free to email me both bam files and the reference, and I'll look at them.
                  Thanks Brian, I sent you a PM with the details. I'm using samtools 1.3.1 (and the above is the result of samtools stats

                  Comment


                  • Hi Kristian,

                    Is this a Nextera long-mate pair library? Those need special processing before they can be mapped. Or... can you give me any more information about the library construction, and the trimming methodology? The library has an extremely high error rate (particularly with read 2), less than half of the reads map to the mito, and it appears that both adapters and transposase are still present.... also, I'm measuring the median insert size as 159 (BBMap) or 133 (BBMerge), so there are a lot of pairs with insert size shorter than the sequenced read length; those might be displayed differently in IGV depending on whether the adapter portion was soft-clipped (which bwa would do by default) or not (bbmap does not soft-clip by default).

                    I adapter-trimmed the reads and error-corrected them, but still under 50% map. I'm not really sure what's wrong with the library. But, I don't see anything unusual about the pairing orientations. I get 45.5670% properly paired with "rcs=f" (require correct strand = false) and 45.5481% with "rcs=t", so only 0.02% map in the wrong orientation.
                    Last edited by Brian Bushnell; 04-17-2017, 02:38 PM.

                    Comment


                    • Hi Brian,

                      Originally posted by Brian Bushnell View Post
                      Is this a Nextera long-mate pair library?
                      This is Nextera XT prep, and the adapters/linker were trimmed along with demultplexing directly through Illumina RTA pipeline. On my end the bbduk2 left and right trimming had almost no effect, and the splitnextera also had only a few hits (0.1%) to the adapter, so I'd guess the dataset is (for the most part) trimmed.
                      Last edited by Kristian; 04-18-2017, 05:21 AM.

                      Comment


                      • Hi Brian,

                        I'm wondering if there is any way you can get bbmap (or another of your tools) to give a consensus sequence of an alignment? I went to map to a ref sequence and then use that to create a consensus sequence and then use that to map again and then call variants.

                        Thanks!

                        Comment


                        • Hi Jweger,

                          Sorry, I don't have anything that will do that. Clumpify allows something sort of similar; you can create consensus sequence from raw reads, and map those. But that loses per-base depth information which is important for variant-calling, so I don't think it's what you want.

                          Comment


                          • I was testing something with BBMap today and I realized I had forgotten something about how to use it and couldn't figure it out. I was mapping RNA-Seq and I thought that it would report spliced alignment cigars (using sam 1.3) as /[0-9]+M[0-9]+N[0-9]+M/ but it was reporting the introns as deletions with the [D] cigar value. Is that right or is there a way to get it to not do that?

                            Thanks-
                            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                            Salk Institute for Biological Studies, La Jolla, CA, USA */

                            Comment


                            • You can use the flag "intronlen" to control this. For example, "intronlen=100" will change the reporting so that all deletions of at least 100bp will reported using 'N' instead of 'D' in the cigar string.

                              Comment


                              • Oh yes! Thanks Brian.
                                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                                Salk Institute for Biological Studies, La Jolla, CA, USA */

                                Comment

                                Working...
                                X