  • Oh, that's kind of irritating. BBMap as currently structured has a maximum reference sequence length of 500Mbp. I designed it that way because I was unaware of any chromosomes longer than that, and I believed the reason to be that 500Mbp was above the maximum stable length of an individual chromosome... looks like I may have been wrong!

    I'll have to think about how to resolve this; there's no simple setting for it. Thanks for bringing it to my attention.



    • Thank you Brian for the quick response.
      We would really appreciate your thoughts/inputs on how to work around our issue.



      • Purely speculating here. I don't know where the centromere is in this chromosome, but you could split it in a region with long stretches of N's (so that each piece stays smaller than 500 Mbp). That way the chance of reads needing to map across the break would be small.
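A minimal sketch of that splitting idea, assuming the chromosome is already loaded as a Python string. The function name `split_at_n_runs` and the `min_run` threshold are hypothetical, chosen only for illustration; the 500 Mbp limit is the figure quoted earlier in the thread.

```python
import re

MAX_LEN = 500_000_000  # per-sequence limit quoted in the thread (500 Mbp)


def split_at_n_runs(seq, max_len=MAX_LEN, min_run=100):
    """Return (start, end) pieces of seq, each at most max_len long.

    Cuts are placed only at the midpoint of a run of at least
    min_run N's, so few reads should span a cut. min_run is an
    arbitrary illustrative threshold, not a BBMap setting.
    """
    pattern = r"[Nn]{%d,}" % min_run
    # Candidate cut points: the middle of every sufficiently long N run.
    cuts = [(m.start() + m.end()) // 2 for m in re.finditer(pattern, seq)]

    pieces = []
    start = 0
    while len(seq) - start > max_len:
        # Take the right-most candidate that keeps this piece within the limit.
        usable = [c for c in cuts if start < c <= start + max_len]
        if not usable:
            raise ValueError("no usable N run after position %d" % start)
        cut = usable[-1]
        pieces.append((start, cut))
        start = cut
    pieces.append((start, len(seq)))
    return pieces


# Toy example with a tiny limit so the behaviour is easy to follow.
toy = "A" * 40 + "N" * 10 + "C" * 40 + "N" * 10 + "G" * 40
print(split_at_n_runs(toy, max_len=60, min_run=5))
```

Each piece's `start` is the offset to add back to mapped coordinates when converting them to positions on the original chromosome.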



        • I'm asking because I sometimes stumble over this issue in tutorials (which don't seem to care about it) and saw it again in the recent question...

          I was once taught (and lost points in a test for not knowing it) that using even k-mer sizes is frowned upon. The rationale is that only odd k-mer sizes ensure a k-mer can never be its own reverse complement in the de Bruijn graph. The ambiguity created by palindromic k-mers in the de Bruijn graph supposedly makes its resolution difficult.

          So, to settle that question once and for all: does it really have an impact on mapping efficiency if I choose an even k-mer or its neighboring odd k-mer?



          • No. The longer the kmer, the greater the speed (and memory consumption); even versus odd is not important.

            Additionally, I don't see that even-length kmers cause problems in assembly, either. Genomic palindromes of kmer length or longer cause problems whether you are using an even or odd kmer length. These palindromes always have an even length, but say you have a genomic palindrome of length 22. Using K=22, you will not (trivially) be able to resolve it. Nor will you with K=21. You will with K=23, and you will with K=24. It's not clear to me in this situation why K=23 would be preferable to K=24 with regard to palindromes, but K=24 can resolve longer repeats than K=23.



            • Actually, an odd k-mer length ensures that the strand orientation can always be determined: the central nucleotide can never be its own complement, so an odd-length k-mer cannot equal its own reverse complement (an even-length k-mer can be a perfect palindrome, identical in both orientations).

              But the point about longer k-mers is spot-on.
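The odd-versus-even point can be checked by brute force. The sketch below is only an illustration (the helper names are made up for this example): it enumerates every k-mer of a given length and counts those that equal their own reverse complement.

```python
from itertools import product

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def count_self_reverse_complement(k):
    """Count k-mers over ACGT that are their own reverse complement."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            count += 1
    return count


for k in range(1, 7):
    print(k, count_self_reverse_complement(k))
```

For odd k the count is always zero, because the middle base would have to be its own complement. For even k, the first half of the k-mer determines the second half, so there are 4^(k/2) such palindromes.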



              • Thanks a lot for your answers! Your replies and examples were really helpful and gave me more insight.



                • Hi, I have a couple of questions on the terminology used for retaining ambiguous sites with BBMap.

                  Does "ambiguous=best" mean that if a read has a bunch of matches all with the same score, only the first match will be retained? Or does it mean that, of all the matches above a score cutoff, the first one will be picked?

                  Along the same lines, for "ambiguous=all": if, say, 5 locations all share the same highest score, will those be reported, or will all locations above the score cutoff be retained?



                  • "ambiguous=best" is a bit misleading, but it means the genomically first location with a maximum score will be used. "ambiguous=all" will report all locations within the ambiguity threshold of the first. This does not mean they need exactly the same score; it means that they are very close, so much so that none can be confidently determined to be the correct mapping location. Normally they're identical, but if, for example, one mapping had a single 1bp deletion and another mapping had two 1bp substitutions, the scores would be different, but would be close enough for both to be reported. But if there was a third potential mapping with, say, 5 substitutions, that would be excluded. This can be controlled with the "secondarysitescoreratio" flag; if you set it to 1.0, only mappings with scores identical to the best score will be reported.
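A toy model of the selection rule described above, not BBMap's actual code. The function `select_sites`, the example scores, and the 0.95 ratio are all assumptions made for illustration. In particular, 0.95 is not claimed to be BBMap's default for "secondarysitescoreratio", and real alignment scores are computed differently.

```python
def select_sites(sites, mode="best", score_ratio=0.95):
    """Pick mapping sites from a list of (position, score) tuples.

    mode="best": the genomically first site among those with the top score.
    mode="all":  every site whose score is at least score_ratio * top score.
    This is a simplified illustration of the behaviour described in the
    thread, under assumed scoring; it is not BBMap's implementation.
    """
    if not sites:
        return []
    top = max(score for _, score in sites)
    if mode == "best":
        # min() on (position, score) tuples returns the lowest position.
        return [min(site for site in sites if site[1] == top)]
    if mode == "all":
        return sorted(site for site in sites if site[1] >= score_ratio * top)
    raise ValueError("mode must be 'best' or 'all'")


# Made-up candidate sites: two perfect hits, one near hit, one poor hit.
candidates = [(500, 100), (120, 100), (900, 97), (300, 80)]

print(select_sites(candidates, mode="best"))
print(select_sites(candidates, mode="all", score_ratio=0.95))
print(select_sites(candidates, mode="all", score_ratio=1.0))
```

With a ratio of 1.0 only the two top-scoring sites survive, which mirrors the statement that setting the ratio to 1.0 reports only mappings whose score equals the best score.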



                    • Hi, Brian

                      We recently increased our PacBio amplicon size from ~1100 bp to 3 kb. With the smaller amplicon size we were able to map reads to our allele reference sequence library of non-full-length allele sequences using "semiperfectmode" to allow for soft-clipping. I'm now looking to map ~3 kb read sequences obtained from gDNA sequencing to exon reference sequences of ~270 bases apiece, and I am not able to tune the settings to get any mapping results. Is there a way to tune mapPacBio.sh to get hits for regions within long reads that perfectly match short exon sequences?



                      • Hi,

                        couldn't you just do it the other way around: use the PacBio reads as the reference and map your short refs to them?

                        Although I don't understand why your refs are so short.


                        S.



                        • I agree with Susan. BBMap is a global aligner, and not really designed to map reads to substantially shorter reference sequences. But you could try with the flags "minid=0 local", which might work. Note that "semiperfectmode" will not allow a single mismatch or indel, so it's really only useful in special situations; "local" is more appropriate in this situation.
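As a rough sketch, the suggested flags could be assembled into a command line like this. The file names are placeholders, the helper `build_local_mapping_command` is hypothetical, and the `ref=`/`in=`/`out=` arguments are assumed to follow the usual BBMap `key=value` convention; only "minid=0 local" comes from the post above. The command is printed rather than executed.

```python
import shlex


def build_local_mapping_command(ref, reads, out, script="mapPacBio.sh"):
    """Assemble an argument list for a local-alignment mapping run.

    "minid=0" and "local" are the flags suggested in the thread; the
    ref=/in=/out= arguments and all file names are assumptions here.
    """
    return [
        script,
        f"ref={ref}",
        f"in={reads}",
        f"out={out}",
        "minid=0",  # do not discard alignments for low identity
        "local",    # allow local alignments (soft-clipped ends)
    ]


cmd = build_local_mapping_command("exons.fa", "long_reads.fq", "mapped.sam")
print(shlex.join(cmd))
```

To try the reverse approach suggested earlier in the thread, swap the two inputs so that the long reads act as the reference and the short exon sequences are the queries.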



                          • @lankage: You don't have to align to the short amplicon regions. You could align to the genome (and find out if you have any non-specific amplification along the way).



                            • @moistplus: If you were to use bbmap.sh to do the alignments then you would get that information in the alignment report along with the bam file (as long as you have samtools available in $PATH).



                              • Hi Brian,

                                Since I've seen increased activity here again lately, I was wondering whether you have had a chance to think about the issue we discussed back in January (~post #300). It was about dedupe not writing out the identifiers of exactly matched and contained sequences.

                                As mentioned before, solving this would make the tool very competitive with existing ones, due to the immense speed-up.

                                Thanks for your consideration!

                                Best wishes,
                                Shini
