Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by GenoMax View Post
    @Brian will confirm later but I think BBMap loads pre-made indexes on disk in memory if you provide path= option. Note: "nodisk" option builds indexes in memory from fasta files but am not sure if it can (or needs to) be used with path= option.
    That's correct. If the index is present on disk, it gets loaded at startup; no random-access is ever performed. So, the only difference is a faster startup when the index is already on disk.

    Comment


    • Originally posted by Brian Bushnell View Post
      That's correct. If the index is present on disk, it gets loaded at startup; no random-access is ever performed. So, the only difference is a faster startup when the index is already on disk.
      Perfect, thank you!

      Comment


      • Thank you for developing BBMap. Could you please advise how to preprocess the output VCF to make it compatible with VCFAnnotator, SNPEff. I recieve "No gene feature" and "Chromososme is missing" errors on annotation. Tried Pilon also VCFs and errors persist.

        Comment


        • Hi Brian,

          I have a discontinuous reference genome in a single fasta file (6000 coding sequences) and I would like to use bbmap to align my paired reads to the coding sequence only. I want to know how many reads align to each ORF.

          Now I realised that the order of the reference genome affects the alignment because bbmap seems to ignore the ORF borderes in the reference.
          Do you have a suggestion how to constrain the alignment within each ORF?

          Thank you!

          Comment


          • @YeastGuy: Are those sequences in multi-fasta format?

            Comment


            • @GenoMax: Yes, they are.

              Comment


              • Originally posted by Mat29 View Post
                Thank you for developing BBMap. Could you please advise how to preprocess the output VCF to make it compatible with VCFAnnotator, SNPEff. I recieve "No gene feature" and "Chromososme is missing" errors on annotation. Tried Pilon also VCFs and errors persist.
                Please ensure that your chromosome names do not have spaces in them. If they do, use the "trd" flag when mapping (trim read description), or for simplicity, just process the fasta initially like this:

                reformat.sh in=ref.fa out=fixed.fa trd


                ...then use fixed.fa for mapping and everything else.

                Comment


                • Originally posted by YeastGuy View Post
                  Hi Brian,

                  I have a discontinuous reference genome in a single fasta file (6000 coding sequences) and I would like to use bbmap to align my paired reads to the coding sequence only. I want to know how many reads align to each ORF.

                  Now I realised that the order of the reference genome affects the alignment because bbmap seems to ignore the ORF borderes in the reference.
                  Do you have a suggestion how to constrain the alignment within each ORF?

                  Thank you!
                  How are the ORFs indicated? Do you have them in a separate file (gff/gtf/bed)? BBMap doesn't use any of those, though you could probably use a different tool to mask the reference with them. Generally, I don't recommend constraining alignment to a specific subset of the genome, which incurs confirmation bias.

                  Comment


                  • Originally posted by Brian Bushnell View Post
                    How are the ORFs indicated? Do you have them in a separate file (gff/gtf/bed)? BBMap doesn't use any of those, though you could probably use a different tool to mask the reference with them. Generally, I don't recommend constraining alignment to a specific subset of the genome, which incurs confirmation bias.
                    I tried to simply use the multi-fasta format with the hope that the sequences would be seen as seperate and not connected:
                    >lcl||1|YAL001C|TFC3|1|145
                    TAGTTACTATGGTCGTTAACGAAATAATATTTCATCCAGGGA
                    >lcl||2|YAL002W|VPS8|1|145
                    CTGGTCTGGACCCATTACTTTTTCTAGCTTGGGAAAATGTACAG

                    But after randomising the order withing the ref file the coverage changed.

                    Comment


                    • Oh, that file should be fine; BBMap sees the individual sequences as unconnected. It's still possible for read 1 to map to one csequence and read 2 to a different sequence, though. The coverage may be different for various reasons, typically due to ambiguity (two different reference sequences that are very similar). You can use "ambig=toss" or "ambig=all" to change the behavior of reads mapping to multiple possible locations. But bear in mind that BBMap is slightly nondeterministic when mapping paired reads.

                      Comment


                      • Originally posted by Brian Bushnell View Post
                        Oh, that file should be fine; BBMap sees the individual sequences as unconnected. It's still possible for read 1 to map to one csequence and read 2 to a different sequence, though. The coverage may be different for various reasons, typically due to ambiguity (two different reference sequences that are very similar). You can use "ambig=toss" or "ambig=all" to change the behavior of reads mapping to multiple possible locations. But bear in mind that BBMap is slightly nondeterministic when mapping paired reads.
                        Okay then, thank you for your quick help! I will try your suggestions.
                        Now that I know what the problem is, it should be possible to find a solution.

                        Comment


                        • @Brian: What setting (if any) would prevent the paired reads from aligning to two references (maxindel=0 perhaps) in first place?

                          Comment


                          • Difficulties using bbmap to generate sorted, indexed bam file

                            Hello people,

                            I am trying to generate sorted, indexed bam file with the bbmap tool, but I have an error message.

                            tanshiming@S620100019205:~/Documents/ACDC-trial/shimingles$ bbmap.sh in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq out=mapped.sam -Xmx30g ref=reformatted-contigs.fasta bamscript=bs.sh; sh bs.sh
                            java -Djava.library.path=/home/tanshiming/tools/bbmap/jni/ -ea -Xmx30g -cp /home/tanshiming/tools/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq out=mapped.sam -Xmx30g ref=reformatted-contigs.fasta bamscript=bs.sh
                            Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq, in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq, out=mapped.sam, -Xmx30g, ref=reformatted-contigs.fasta, bamscript=bs.sh]

                            BBMap version 36.02
                            Retaining first best site only for ambiguous mappings.
                            Writing reference.
                            Executing dna.FastaToChromArrays2 [reformatted-contigs.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

                            Set genScaffoldInfo=true
                            Writing chunk 1
                            Set genome to 1

                            Loaded Reference: 0.084 seconds.
                            Loading index for chunk 1-1, build 1
                            No index available; generating from reference genome: /home/tanshiming/Documents/ACDC-trial/shimingles/ref/index/1/chr1_index_k13_c7_b1.block
                            Indexing threads started for block 0-1
                            Indexing threads finished for block 0-1
                            Generated Index: 4.146 seconds.
                            Analyzed Index: 3.074 seconds.
                            Started output stream: 0.015 seconds.
                            Cleared Memory: 0.307 seconds.
                            Processing reads in paired-ended mode.
                            Started read stream.
                            Started 32 mapping threads.
                            Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31

                            ------------------ Results ------------------

                            Genome: 1
                            Key Length: 13
                            Max Indel: 16000
                            Minimum Score Ratio: 0.56
                            Mapping Mode: normal
                            Reads Used: 13403154 (2943817149 bases)

                            Mapping: 634.915 seconds.
                            Reads/sec: 21110.16
                            kBases/sec: 4636.56


                            Pairing data: pct reads num reads pct bases num bases

                            mated pairs: 82.0206% 5496676 85.8858% 2528321686
                            bad pairs: 1.6892% 113202 1.6858% 49625732
                            insert size avg: 339.43


                            Read 1 data: pct reads num reads pct bases num bases

                            mapped: 85.2928% 5715965 85.4106% 1312819237
                            unambiguous: 85.0524% 5699850 85.1949% 1309504251
                            ambiguous: 0.2405% 16115 0.2157% 3314986
                            low-Q discards: 0.0000% 0 0.0000% 0

                            perfect best site: 56.5028% 3786577 57.4090% 882415075
                            semiperfect site: 57.7478% 3870016 58.6870% 902059233
                            rescued: 0.7551% 50604

                            Match Rate: NA NA 95.4737% 1291210481
                            Error Rate: 32.2729% 1844876 3.9447% 53348516
                            Sub Rate: 31.0425% 1774536 0.7285% 9852710
                            Del Rate: 2.5010% 142970 2.9285% 39606112
                            Ins Rate: 4.1940% 239750 0.2876% 3889694
                            N Rate: 2.6675% 152488 0.5816% 7866352


                            Read 2 data: pct reads num reads pct bases num bases

                            mapped: 85.4634% 5727397 85.6329% 1204639477
                            unambiguous: 85.1581% 5706936 85.3792% 1201071323
                            ambiguous: 0.3053% 20461 0.2536% 3568154
                            low-Q discards: 0.0000% 0 0.0000% 0

                            perfect best site: 54.6069% 3659525 55.8135% 785155749
                            semiperfect site: 55.7207% 3734167 56.9902% 801709710
                            rescued: 1.0234% 68587

                            Match Rate: NA NA 95.6569% 1184979976
                            Error Rate: 34.8039% 1993488 3.7863% 46904205
                            Sub Rate: 33.7245% 1931661 0.7674% 9506079
                            Del Rate: 2.2309% 127782 2.7561% 34141479
                            Ins Rate: 3.7482% 214688 0.2629% 3256647
                            N Rate: 2.4683% 141381 0.5567% 6896775

                            Total time: 643.704 seconds.
                            bs.sh: 2: bs.sh: module: not found
                            bs.sh: 3: bs.sh: module: not found
                            Note: This script is designed to run with the amount of memory detected by BBMap.
                            If Samtools crashes, please ensure you are running on the same platform as BBMap,
                            or reduce Samtools' memory setting (the -m flag).
                            Note: Please ignore any warnings about 'EOF marker is absent'; this is a bug in samtools that occurs when using piped input.
                            [bam_sort] Use -T PREFIX / -o FILE to specify temporary and final output files
                            Usage: samtools sort [options...] [in.bam]
                            Options:
                            -l INT Set compression level, from 0 (uncompressed) to 9 (best)
                            -m INT Set maximum memory per thread; suffix K/M/G recognized [768M]
                            -n Sort by read name
                            -o FILE Write final output to FILE rather than standard output
                            -T PREFIX Write temporary files to PREFIX.nnnn.bam
                            -@, --threads INT
                            Set number of sorting and compression threads [1]
                            --input-fmt-option OPT[=VAL]
                            Specify a single input file format option in the form
                            of OPTION or OPTION=VALUE
                            -O, --output-fmt FORMAT[,OPT[=VAL]]...
                            Specify output format (SAM, BAM, CRAM)
                            --output-fmt-option OPT[=VAL]
                            Specify a single output file format option in the form
                            of OPTION or OPTION=VALUE
                            --reference FILE
                            Reference sequence FASTA FILE [null]
                            [E::hts_open_format] fail to open file 'mapped_sorted.bam'
                            samtools index: failed to open "mapped_sorted.bam": No such file or directory

                            Could someone please advice me on this?

                            Thank You.

                            Comment


                            • Do you have samtools installed and available in your $PATH and which version?

                              If yes then you can create bam files directly in bbmap command and then sort/index them. Following example is for samtools (v.1.3.1).

                              Code:
                              bbmap.sh in1=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R1_001.fastq in2=UPWRP0508161KA_TCCGGAGA-ATAGAGGC_L1L2_R2_001.fastq out=mapped.bam -Xmx30g ref=reformatted-contigs.fasta
                              
                              samtools sort -o sorted.bam mapped.bam
                              samtools index sorted.bam

                              Comment


                              • Actually, I think the problem in this case is the samtools version. Samtools changed the default command line flags going from 0.x to 1.x, which causes headaches, and I didn't realize this until recently. If you download the latest version of BBMap (36.92), it will look at the version of samtools installed and change the samtools command appropriately.

                                Alternately, you can just run these commands manually. For samtools 0.x:
                                Code:
                                samtools view -bSh1 mapped.sam.gz | samtools sort -m 1G -@ 4 - mapped_sorted
                                samtools index mapped_sorted.bam
                                For samtools 1.x:
                                Code:
                                samtools view -bSh1 mapped.sam | samtools sort -m 1G -@ 4 - -o mapped_sorted.bam
                                samtools index mapped_sorted.bam
                                The exact amount of memory specified (e.g. 1G) depends on your computer but should be fine in this case.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                30 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                53 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X