Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by sk8bro View Post
    Hi Brian,

    It is important that my pipeline's results can be perfectly reproduced. I notice that the non-deterministic behavior is coming from the human read removal from my datasets... here are the parameters I have specified in this call:

    bbmap.sh\
    -Xmx23g\
    minid=0.9\
    idfilter=0.9\
    maxindel=3\
    bwr=0.16\
    bw=12\
    minhits=2\
    printunmappedcount=t\

    The numbers are close (ie. 672510 vs 672492)

    I run the PE reads through BBduk to remove low-quality pairs first. It looks like that output is sorted the same and deterministic upon re-runs.

    Thanks for your thoughts, Kate
    Hi Kate,

    BBMap is always nondeterministic when run in paired-end mode with multiple threads, because the insert-size average is calculated on a per-thread basis, which affects mapping; and which reads are assigned to which thread is nondeterministic. The only way to avoid that would be to restrict it to a single thread (threads=1), or map the reads as single-ended and then fix pairing afterward:

    Code:
    bbmap.sh in=reads.fq outu=unmapped.fq int=f
    repair.sh in=unmapped.fq out=paired.fq fint outs=singletons.fq
    In this case you'd want to only keep the paired output.

    Comment


    • Originally posted by GenoMax View Post
      Pre-filtering with reformat.sh in the mean time then?
      Yep, looks like it

      Comment


      • Originally posted by sebastian_1 View Post
        Hello Brian,

        I would like to ask you a suggestion for bbmap.
        I am trying to reassemble a bin from a metagenomic data, hoping that I will get better assembly if I use just the mapped reads.
        I tried bbmap.sh on normal parameters and used the outm to take all the aligned reads, then I normalized the reads with bbnorm.sh, and reassembled with SPAdes. I want to note that the initial metagenomic assembly was done on normalized reads, and SPAdes does error correction, but I did not use these libraries, I used the adapter and quality trimmed libraries but not normalized and error corrected (this is why I do normalization after mapping).

        I got better assembly, (some longer scaffolds, and slightly larger N50) but checking briefly the SSU I noticed that some "contaminants" were present. Also, the amount of SSU sequences was much higher than expected. (I expect 4, 3 complete ones and one near complete).
        In the metagemic data (assembled using all the data) I have 10 SSUs, but here I got a lot of them (15+) and most of them are really partial.

        What I am thinking is that bbmap, includes in the output some(not all) reads from other bacterial SSU which map to a certain degree to my reference (since it can have very conserved regions) then SPAdes is somehow confused by these, and fragments my SSU sequences in multiple places due to these reads. Sorry if this sounds strange, but I am just speculating, I am not sure if this is the case, and unfortunately I am not an expert bioinformatician.

        I was thinking to add to the command line the parameters minidentity=0.98 idfilter=0.98 hoping that I will somehow avoid the mapping of "non-specific" reads.

        I avoid using the perfectmode=t due to the fact that SPAdes does error correction, and this would somehow cause some small mismatches, and thus I would lose some reads.

        Would you have any suggestion for better parameters of bbmap?

        Thank you!

        PS: I am using PE 2X250bp and PE2x300bp libraries for the mapping.
        It sounds like related strains are being binned together. Ribosomes are always hard to assemble anyway. But yes, you can try minidentity=0.98 idfilter=0.98. That might knock out a bunch of your 2x300bp reads since those are usually low quality, but you can add the flags "qtrim=r trimq=15 untrim" to trim the reads before mapping, then undo the trimming after mapping, so that the low-quality right end won't contribute to the identity calculation. Alternatively, you can try error-correction via Tadpole and/or BBMerge prior to mapping:

        Code:
        bbmerge.sh in=reads.fq out=ecco.fq ecco mix rsem strict k=62
        tadpole.sh in=ecco.fq out=ecct.fq ecc k=62

        Comment


        • Thank you for the explanation and suggested resolution Brian,

          How about BBduk and BBsplit? Can they similarly behave non-deterministically in paired-end mode?

          Comment


          • BBSplit is based on BBMap, so it is also nondeterministic in paired mode with multiple threads. BBDuk and Seal (which can be used similarly to BBSplit) are always deterministic. I could add a deterministic mode to BBMap without too much effort.

            Comment


            • Hi Brian,

              Thanks for letting me know. That would be really awesome... keep me posted! My email is [email protected]. Til then, I will probably need to stick with the single thread.

              Comment


              • Hey Brian,

                Do you have any sort of time estimate on when you might consider adding the deterministic mode to bbmap in a future release? It would help me decide what course of action to take if you had a rough (but realistic) time estimate!

                Comment


                • I just looked at the code, and it's trivial, there was already a variable:

                  Code:
                  protected static final boolean DYNAMIC_INSERT_LENGTH=true;
                  I just need to remove the word "final" and parse a new argument (in fact, I just did that). So that will be in the next release, probably Monday. Then you'll need to add these flags:

                  Code:
                  deterministic averagepairdist=100
                  ...where instead of 100, you use the expected average inner distance between pairs (that's the insert size minus the sum of the read lengths, or the length of the unsequenced portion). You don't need the number to be correct, and you don't need to set it at all (the default is 100), but setting it accurately will yield slightly better pairing.
                  Last edited by Brian Bushnell; 05-04-2017, 02:58 PM.

                  Comment


                  • OK, I just released the latest version with the "deterministic" flag in 37.22.

                    Comment


                    • Thank you so much Brian!!! I will download it

                      Comment


                      • Hi,

                        I try to align paired-end reads of Pisum sativum (pea) to the genome of Arabidopsis thaliana, as there is no reference genome for pea. I first tried HÌSAT2 but had alignment rates of ~4%. Therefore I wanted to give BBMap a try but have some questions:

                        1) I know we sequenced strand-specific but don't know the protocol but I see that the R2 read (second in pair) aligns always on the strand on which the gene is located and the R1 read (first in pair) on the opposite strand. How do I set the "xstag" parameter? xstag=fs or xstag=ss?

                        2) Regarding question 1 I tried both xstag=fs and xstag=ss but don't see a XS flag in the SAM file? Why?

                        3) When I aligned the first 1,250,000 read pairs (alignment rate ~12%) I saw from the CIGAR format (and IGV) that spliced reads (so normally sth. like 120N) are reported as having deletions (e.g. 120D). I thought BBMap was aware of spliced alignments?

                        Here is how I call BBMap:

                        Code:
                        sh /opt/bbmap/bbmap.sh \
                        mappedonly=t \
                        xstag=fs \
                        in=/.../test_R1.fastq \
                        in2=/.../test_R2.fastq \
                        out=/.../test_BBMap_fs.sam \
                        ref=/.../TAIR10_full.fa

                        I am really thankful for answers.

                        Best Regards
                        Mario
                        Last edited by Mchicken; 05-09-2017, 06:15 AM.

                        Comment


                        • You should take a look at the maxindel= and intronlen= settings.

                          Sometimes it is not apparent, but if you did not specify a setting, it means that the program is using the default values for it.

                          This is what help for those two parameters says
                          Code:
                          maxindel=16000          Don't look for indels longer than this. Lower is faster.
                                                  Set to >=100k for RNAseq with long introns like mammals.
                          intronlen=999999999     Set to a lower number like 10 to change 'D' to 'N' in 
                                                  cigar strings for deletions of at least that length.
                          So those are the two values you used for this run.
                          Last edited by GenoMax; 05-09-2017, 11:40 AM.

                          Comment


                          • As Genomax said, you need to set the "intronlen" flag, maybe to about 10. Arabidopsis has very short introns as I recall, so you shouldn't need to increase maxindel.

                            As for the xstag, you should not need to set that unless you plan on using the Tuxedo pipeline for analysis, but in that case... reads get XS:A:+ if R1 is mapped to the plus strand in firststrand mode. But, only spliced reads get an XS tag, so without setting intronlen, none of them will get it.

                            Comment


                            • Thanks a lot for your helpful replies. It is now working perfectly . One more question to the "xstag". I see that if the gene is on the + strand, R2 maps on+ and R1 maps on -. If I use "xstag=ss" I get for spliced R1 and R2 reads XS=+. So I think this is the correct library type. Am I right?

                              Comment


                              • Hi Brian,

                                I'm trying to count some Kmers in a specific location in a read. I was hoping I could use bbduk or bbduk2 to trim around the sequence I want to count. I can't seem to make bbduk trim all the reads at these sites though. I'm trying many iterations of something like

                                bbduk2.sh in=my.fq.gz out=my_trim.fq.gz lliteral=CCACGCACTGATAACATCAC rliteral=CGGGGCACACTGCTTGTGGC k=20 hdist=2 tbo tpe

                                I've also tried this with bbduk specifying ktrim=rl and separating the sequences with a comma.

                                Is there a way to make bbduk(2) actually trim all of the sequences? I'm ending up with a bunch of the reads still having that sequence within a hdist=2.

                                You had previously suggested to reformat to a specific location in the reads but the problem with this is that even though they were all amplified with the same primers some additional nucleotides have managed to stay on the end of a subset of the reads. I can filter these out with size parameters but then I lose a lot of potentially valuable reads.

                                Any help is greatly appreciated.

                                Thanks!

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Exploring the Dynamics of the Tumor Microenvironment
                                  by seqadmin




                                  The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                                  07-08-2024, 03:19 PM
                                • seqadmin
                                  Exploring Human Diversity Through Large-Scale Omics
                                  by seqadmin


                                  In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                                  06-25-2024, 06:43 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Today, 11:09 AM
                                0 responses
                                15 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-19-2024, 07:20 AM
                                0 responses
                                147 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-16-2024, 05:49 AM
                                0 responses
                                121 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-15-2024, 06:53 AM
                                0 responses
                                111 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X