  • Actually, while it's not documented, the flag "minlength" also works with BBMap. Reads shorter than that will be discarded completely (they won't be output as unmapped).
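    For anyone landing here, the undocumented flag can be used like this (a sketch with assumed file names; as noted above, minlength works but does not appear in the official documentation):

```shell
# Map reads, silently discarding anything shorter than 50 bp
# (short reads are dropped entirely rather than written as unmapped).
bbmap.sh ref=ref.fa in=reads.fq out=mapped.sam minlength=50
```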



    • Originally posted by Brian Bushnell View Post
      Actually, while it's not documented, the flag "minlength" also works with BBMap. Reads shorter than that will be discarded completely (they won't be output as unmapped).
      Thanks for the tip - saves me some space so I don't have to make additional files. I deal with a lot of libraries with different read lengths.

      Is there a way to make BBMap require the smallest read to be some fraction of the longest read length? I know that's a niche use case, but BBMap always surprises me with its built-in functions.



      • Ahh... sorry, but nope. That would require reading the file twice, so it would not be easy to implement.



        • Originally posted by Brian Bushnell View Post
          Ahh... sorry, but nope. That would require reading the file twice, so it would not be easy to implement.
          Oh yeah - duh



          • Originally posted by Shini Sunagawa View Post
            Dear Brian,

            I have been looking for a tool that would quickly dereplicate (100% containments) nucleotide sequences and track for each unique sequence the identifiers of the removed duplicates.

            Something like:

            dedupe.sh in=in.fa out=out.fa outd=outd.fa mid=100 mop=100

            where:

            in.fa:
            seq1
            seq2 (contained in seq1)
            seq3 (contained in seq1)
            seq4

            out.fa:
            seq1
            seq4

            outd.fa:
            seq2
            seq3

            I am interested in:
            seq1<tab>seq2,seq3
            seq4

            dedupe.sh does a fantastic job of returning out and outd, but I cannot find any option that would return the information I am interested in. Am I missing something? Otherwise, I believe this could be a great feature: compared to other tools that return this information, dedupe is so much faster.

            Best,
            Shini
            Did anything like this get added to BBMap? Would be really helpful for me too!



            • Hmmm... no, not yet, though I did add into Clumpify's dereplication step the ability to count the number of duplicate reads and add "count=3", for example, to the name of a read representing 3 total reads (itself and 2 duplicates). It would not be difficult to modify that to report read identifiers. I'll add it to my todo list.
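    As a sketch of the Clumpify behavior described above (file names are assumed; dedupe enables dereplication and addcount appends the copy count to the read name):

```shell
# Exact dereplication; each surviving read gets "count=N" appended to its
# name, where N is the total number of reads it represents.
clumpify.sh in=reads.fq.gz out=dereplicated.fq.gz dedupe=t addcount=t
```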



              • Originally posted by Brian Bushnell View Post
                Hmmm... no, not yet, though I did add into Clumpify's dereplication step the ability to count the number of duplicate reads and add "count=3", for example, to the name of a read representing 3 total reads (itself and 2 duplicates). It would not be difficult to modify that to report read identifiers. I'll add it to my todo list.
                Well, my real use is to group sequences that are >99% similar, whereas Clumpify finds exact matches? Although I suppose Clumpify could be run in error-correction mode with "midid" and it should be the same as Dedupe with minidentity?



                • Clumpify can consider sequences as duplicates if they have at most X substitutions, but it's not as flexible as Dedupe. For example, Clumpify requires duplicates to overlap 100% with neither overhanging, while Dedupe allows containments (this only matters when using variable-length sequences) and also allows indels. What I actually added to my todo list was to update both of them with that capability, since it seems useful.
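    Side by side, the two modes described above look roughly like this (file names assumed):

```shell
# Clumpify: duplicates must align end-to-end; up to 2 substitutions allowed.
clumpify.sh in=reads.fq out=clumped.fq dedupe=t subs=2

# Dedupe: also absorbs containments and tolerates indels, via an identity cutoff.
dedupe.sh in=seqs.fa out=unique.fa minidentity=99
```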



                  • Originally posted by Brian Bushnell View Post
                    Clumpify can consider sequences as duplicates if they have at most X substitutions, but it's not as flexible as Dedupe. For example, Clumpify requires duplicates to overlap 100% with neither overhanging, while Dedupe allows containments (this only matters when using variable-length sequences) and also allows indels. What I actually added to my todo list was to update both of them with that capability, since it seems useful.
                    Great! Yeah I'd like to know the duplicate membership for containments too.



                    • Hello Brian,

                      I would like to ask you for a suggestion about BBMap.
                      I am trying to reassemble a bin from metagenomic data, hoping that I will get a better assembly if I use just the mapped reads.
                      I ran bbmap.sh with normal parameters and used outm to collect all the aligned reads, then normalized the reads with bbnorm.sh and reassembled with SPAdes. I want to note that the initial metagenomic assembly was done on normalized reads, and SPAdes does error correction; however, I did not use those libraries here. I used the adapter- and quality-trimmed libraries, but not the normalized and error-corrected ones (this is why I do normalization after mapping).

                      I got a better assembly (some longer scaffolds and a slightly larger N50), but briefly checking the SSUs I noticed that some "contaminants" were present. Also, the number of SSU sequences was much higher than expected (I expect 4: 3 complete ones and one near-complete).
                      In the metagenomic data (assembled using all the reads) I have 10 SSUs, but here I got a lot of them (15+), and most of them are really partial.

                      What I am thinking is that bbmap includes in the output some (not all) reads from other bacterial SSUs that map to a certain degree to my reference (since it can have very conserved regions); SPAdes is then somehow confused by these reads and fragments my SSU sequences in multiple places. Sorry if this sounds strange, but I am just speculating. I am not sure if this is the case, and unfortunately I am not an expert bioinformatician.

                      I was thinking of adding the parameters minidentity=0.98 idfilter=0.98 to the command line, hoping to avoid mapping "non-specific" reads.

                      I avoid using perfectmode=t because the reads still carry small mismatches (SPAdes does the error correction later), and with perfect mapping I would lose some reads.

                      Would you have any suggestion for better parameters of bbmap?

                      Thank you!

                      PS: I am using PE 2X250bp and PE2x300bp libraries for the mapping.
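    The stricter mapping proposed above would look something like this (a sketch with assumed file names; note that idfilter is the hard post-alignment cutoff, while minid mainly tunes alignment sensitivity):

```shell
# Keep only reads whose alignments reach 98% identity to the bin.
bbmap.sh ref=bin.fa in1=r1.fq.gz in2=r2.fq.gz outm=mapped.fq.gz minid=0.98 idfilter=0.98
```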



                      • Non-deterministic BBMap results... how to ensure determinism?

                        Hi Brian,

                        It is important that my pipeline's results can be perfectly reproduced. I notice that the non-deterministic behavior comes from the human-read removal step in my datasets... here are the parameters I specified in this call:

                        bbmap.sh \
                        -Xmx23g \
                        minid=0.9 \
                        idfilter=0.9 \
                        maxindel=3 \
                        bwr=0.16 \
                        bw=12 \
                        minhits=2 \
                        printunmappedcount=t

                        The numbers are close between runs (e.g. 672510 vs. 672492).

                        I run the PE reads through BBDuk to remove low-quality pairs first. That output looks identically sorted and deterministic across re-runs.

                        Thanks for your thoughts, Kate
                        Last edited by sk8bro; 05-03-2017, 08:32 AM.
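    One debugging experiment worth trying (an assumption to test, not a documented fix): run the removal step twice with a single worker thread and compare checksums, to see whether thread scheduling explains the run-to-run drift:

```shell
# File names assumed; threads=1 forces single-threaded mapping.
bbmap.sh -Xmx23g threads=1 ref=human.fa in=reads.fq outu=clean_run1.fq
bbmap.sh -Xmx23g threads=1 ref=human.fa in=reads.fq outu=clean_run2.fq
md5sum clean_run1.fq clean_run2.fq   # identical sums would point at threading
```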



                        • By the way, it looks like the input is identically sorted.



                          • Originally posted by Brian Bushnell View Post
                            Actually, while it's not documented, the flag "minlength" also works with BBMap. Reads shorter than that will be discarded completely (they won't be output as unmapped).
                            Does this work with paired end reads?

                            I get this error: Read of length 36 outside of range 50--1. Paired input is incompatible with 'breaklength'



                            • Originally posted by darthsequencer View Post
                              Does this work with paired end reads?

                              I get this error: Read of length 36 outside of range 50--1. Paired input is incompatible with 'breaklength'
                              Yep, looks like that's only for single reads, since I intended it for PacBio/Nanopore (breaking up long reads to a fixed length and discarding short dangling pieces)... I'll add the ability to handle paired reads too.



                              • Originally posted by Brian Bushnell View Post
                                I'll add the ability to handle paired reads too.
                                Pre-filtering with reformat.sh in the meantime, then?
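    A pre-filtering sketch (file names and the 50 bp cutoff are assumed; reformat.sh processes mates together, so pairing stays intact):

```shell
# Drop short reads before mapping so BBMap never sees them.
reformat.sh in1=r1.fq in2=r2.fq out1=filt_r1.fq out2=filt_r2.fq minlength=50
```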

