  • genome_anawk1
    Junior Member
    • May 2011
    • 7

    bfast parallelization

    Hello,

    I am a new bfast user and have successfully run a complete bfast job. To speed up processing, I would like to parallelize the individual steps, and I have the following questions for the authors and experts.

    Using the 10 unique masks suggested in the manual, I created 10 index files:
    refgenome.fa.cs.<j>.1.bif (j=1,..10).
    I then used the index files, the reference sequence, and one reads.k.fastq file (k=1…N) to:
    search the indices (bfast index …)
    perform local alignment (bfast localalign)
    filter alignments (bfast postprocess)

    1) Can I further parallelize the index search step (bfast index)?
    Can I run small jobs, each using an independent index file for the above 3 steps, and then merge the final *bam files after "bfast postprocess"?

    2) How can I generate similarly sized "reads.k.fastq" files? I noticed that some of the generated files are larger than others, perhaps twice as large, and they take twice as long to process.

    Thanks in advance,
    cheers
    Last edited by genome_anawk1; 05-13-2011, 01:21 PM.
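On question 2, one way to get similarly sized reads.k.fastq files is to split by read count rather than by file size. The sketch below is only an illustration and is not part of BFAST: `split_fastq` is a hypothetical helper, and it assumes single-end data with exactly four lines per FASTQ record.

```shell
#!/usr/bin/env bash
# Split a FASTQ file into chunks that each hold the same number of reads.
# ASSUMPTION: every record is exactly 4 lines (no wrapped sequence/quality lines).
# split_fastq is a hypothetical helper name, not a BFAST utility.
split_fastq() {
    local fastq=$1 reads_per_chunk=$2 prefix=$3
    awk -v n="$reads_per_chunk" -v prefix="$prefix" '
        # Start a new output file every n records (4 * n lines).
        (NR - 1) % (4 * n) == 0 {
            if (out != "") close(out)
            chunk++
            out = prefix "." chunk ".fastq"
        }
        { print > out }
    ' "$fastq"
}

# Demo on a tiny synthetic file: 5 reads, 2 reads per chunk.
tmpdir=$(mktemp -d)
for i in 1 2 3 4 5; do
    printf '@read%s\nT0123\n+\nIIII\n' "$i"
done > "$tmpdir/reads.fastq"
split_fastq "$tmpdir/reads.fastq" 2 "$tmpdir/reads"
ls "$tmpdir"
```

Only the last chunk can be smaller than the others, so per-chunk run time should be roughly even as long as the reads have similar lengths.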
  • nilshomer
    Nils Homer
    • Nov 2008
    • 1283

    #2
    Note: "bfast match" searches the index, while "bfast index" creates an index.

    Here are the basic steps (remember to use multiple threads with the "-n" option).

    1.
    Partition the N reads into bins of R reads each. Run each bin through "bfast match" to create X ≈ N/R BMF files. Alternatively, if you have only one input file, use the "-s/-e" options to specify which reads to process in each invocation of "bfast match".

    2.
    Process each BMF file from step 1 with "bfast localalign". You can also use "-s/-e" to further sub-divide each input file. Let's call the number of BAF files Y (Y equals X unless you sub-divide).

    3.
    Process each BAF file from step 2 with "bfast postprocess" to get a SAM file.

    4.
    After all SAM files have been created, merge them with Picard or samtools (I recommend the former).

    This is the best way to partition the various steps of bfast, since step 1 ("bfast match") may take much more memory (for large references) than step 2 ("bfast localalign"). On a heterogeneous cluster you could therefore submit the step 1 jobs to high-memory nodes and the step 2 jobs to low-memory nodes, and so on. This is the spirit of the "bfast.submit.pl" script found in the BFAST distribution, although that script is not well supported (there be dragons).
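The "-s/-e" variant of step 1 can be sketched as a small planner that prints one "bfast match" command per bin. This is a rough illustration rather than a tested recipe: `plan_match_jobs` and the file names are hypothetical, the commands are only printed and never executed, and it assumes that "-s" and "-e" take 1-based, inclusive read numbers.

```shell
#!/usr/bin/env bash
# Print one "bfast match" command per bin of R reads, selecting each bin
# with -s/-e. Nothing is executed; the commands are only echoed.
# ASSUMPTIONS: -s/-e are 1-based and inclusive; plan_match_jobs,
# ref_genome.fa, reads.fastq and the matches.*.bmf names are hypothetical.
plan_match_jobs() {
    local total_reads=$1 reads_per_bin=$2 threads=${3:-4}
    local start=1 end bin=1
    while [ "$start" -le "$total_reads" ]; do
        end=$(( start + reads_per_bin - 1 ))
        # The last bin may be short.
        if [ "$end" -gt "$total_reads" ]; then end=$total_reads; fi
        echo "bfast match -f ref_genome.fa -A 1 -r reads.fastq -n $threads -s $start -e $end > matches.$bin.bmf"
        start=$(( end + 1 ))
        bin=$(( bin + 1 ))
    done
}

# Example: 10 reads in bins of 4 gives the ranges 1-4, 5-8 and 9-10.
plan_match_jobs 10 4
```

The printed lines could be handed to a cluster scheduler, one job per line, with the high-memory nodes reserved for this step.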


    • genome_anawk1
      Junior Member
      • May 2011
      • 7

      #3
      Hi Nils,

      Thanks very much for your prompt response. Apologies, I had some incorrect commands earlier. I am making the fixes here.

      I did not use the "-s" or "-e" flags as I needed the algorithm to use color space information - the stress being on the accuracy. As the manual suggests (page 60), there was a trade off. Hence the long run time.

      My parallelization question concerns the bif files (index files in color space). For example, I have 10 index files and N (N~100) read files named reads.j.fastq (j=1..N). My jobs are split as shown below:

      Using your set of masks, I create 10 bif files.
      using mask_1
      bfast index -f ref_genome.fa -m 111...11 -w 14 -i 1 -A 1

      ..
      using mask_k
      bfast index -f ref_genome.fa -m 110...11 -w 14 -i k -A 1

      ...
      (k=1 ..10 such unique masks as suggested in the manual).

      I am using the mouse genome, so is 10 an optimal number of hash masks? For now I am using "-w 14", but I guess that remains an open question for the mouse genome.

      That gives me the 10 unique bif files. For the next steps, do I need to keep ALL 10 bif files as inputs (along with the reference genome in color space and nucleotide space) for bfast match, bfast localalign, and bfast postprocess?
      Here is the parallelization that I now have:

      dir_wk_1/
      bfast match -f ref_genome.fa -A 1 -r reads.1.fastq > bfast.matches_file.1.bmf
      bfast localalign -f ref_genome.fa -m bfast.matches_file.1.bmf -A 1 > bfast.aligned.file.1.baf
      bfast postprocess -f ref_genome.fa -i bfast.aligned.file.1.baf -A 1 > bfast.reported.file.1.sam


      dir_wk_2/
      bfast match -f ref_genome.fa -A 1 -r reads.2.fastq > bfast.matches_file.2.bmf
      bfast localalign ..
      bfast postprocess ...

      ...
      dir_wk_N/
      bfast match -f ref_genome.fa -A 1 -r reads.N.fastq > bfast.matches_file.N.bmf
      bfast localalign ..
      bfast postprocess ...

      ----------
      Specifically, do the 10 bif files have to be referenced in each of the N subdirectories? Or can I make 10xN separate runs, where each run references only ONE bif file (index file) and one reads.j.fastq file?

      Thanks very much, and apologies for the previous incorrect commands.
      cheers,
      new genome analyzer.
      Last edited by genome_anawk1; 05-14-2011, 07:12 AM.
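The per-directory layout above can be written out for all N read files with a short loop. The sketch below only prints the three commands for each chunk, using the same options and file names as the commands in this post; `plan_chunk_pipeline` is a hypothetical helper name and nothing here invokes bfast.

```shell
#!/usr/bin/env bash
# Print the match / localalign / postprocess commands for chunks 1..N.
# The options and file names mirror the commands quoted in the post;
# plan_chunk_pipeline itself is a hypothetical helper, and nothing is executed.
plan_chunk_pipeline() {
    local n_chunks=$1 ref=${2:-ref_genome.fa}
    local k=1
    while [ "$k" -le "$n_chunks" ]; do
        echo "bfast match -f $ref -A 1 -r reads.$k.fastq > bfast.matches_file.$k.bmf"
        echo "bfast localalign -f $ref -m bfast.matches_file.$k.bmf -A 1 > bfast.aligned.file.$k.baf"
        echo "bfast postprocess -f $ref -i bfast.aligned.file.$k.baf -A 1 > bfast.reported.file.$k.sam"
        k=$(( k + 1 ))
    done
}

# Example: print the nine commands for three chunks.
plan_chunk_pipeline 3
```

Each group of three lines depends only on its own reads.k.fastq, so the groups can run as independent jobs.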


      • nilshomer
        Nils Homer
        • Nov 2008
        • 1283

        #4
        Where's the "bfast match" command? You are definitely missing something. Try reading through the example in the manual's appendix.


        • genome_anawk1
          Junior Member
          • May 2011
          • 7

          #5
          Hi Nils,
          Apologies, I fixed my previous query. I had copied the command incorrectly.
          cheers,
          new analyzer


          • genome_anawk1
            Junior Member
            • May 2011
            • 7

            #6
            Hi Nils,

            It is not obvious to me whether one needs to use all the different index files / *.bif files (for example, 10 different files) at the same time for the three steps:
            1- bfast match
            2- bfast localalign
            3- bfast postprocess

            As posted previously, can I split the jobs such that each of the ten bif files is processed separately? This would lead to 10xN separate jobs, each of which spends less time reading in the dataset.

            I hope you can advise. Thanks in advance.

            cheers,


            • nilshomer
              Nils Homer
              • Nov 2008
              • 1283

              #7
              You should use all index files to get full sensitivity, but you could run one "bfast match" per index file on the input reads, then use "bmfmerge" (see the "butil" folder) to merge the index results (BMF files).
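A rough sketch of that one-job-per-index layout follows. It only prints a plan and makes assumptions that should be checked against the usage message of "bfast match": in particular, that "-i" selects which index number to search. `plan_per_index_jobs` and the output file names are hypothetical, and no "bmfmerge" arguments are shown because they are not given in this thread.

```shell
#!/usr/bin/env bash
# Print one "bfast match" command per (read chunk, index) pair, followed by
# a reminder to merge the per-index BMF files for that chunk.
# ASSUMPTION: "-i <j>" makes bfast match search only index number j; check
# the usage message before relying on it. Nothing is executed.
# plan_per_index_jobs and the matches.*.bmf names are hypothetical.
plan_per_index_jobs() {
    local n_indexes=$1 n_chunks=$2
    local k=1 j
    while [ "$k" -le "$n_chunks" ]; do
        j=1
        while [ "$j" -le "$n_indexes" ]; do
            echo "bfast match -f ref_genome.fa -A 1 -i $j -r reads.$k.fastq > matches.$k.index$j.bmf"
            j=$(( j + 1 ))
        done
        echo "# then merge matches.$k.index*.bmf with bmfmerge before running bfast localalign"
        k=$(( k + 1 ))
    done
}

# Example: 10 indexes and 2 read chunks give 20 match jobs.
plan_per_index_jobs 10 2
```

The merge has to happen before "bfast localalign", since sensitivity depends on combining the hits from all index files for each read.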


              • pengchy
                Senior Member
                • Feb 2009
                • 116

                #8
                To parallelize the run, can I split the reference genome into several (say 2) sections, create an index for each section, align against each separately, and then merge the bmf files?
                Thanks.


                • nilshomer
                  Nils Homer
                  • Nov 2008
                  • 1283

                  #9
                  Split the reads, merge the SAM files.
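For the merge half of that advice, the sketch below builds a single Picard MergeSamFiles command from the per-chunk SAM files. It is an illustration only: `plan_sam_merge` is a hypothetical helper, the location of picard.jar and the SAM file names are assumptions, and the command is printed rather than run.

```shell
#!/usr/bin/env bash
# Build (but do not run) a Picard MergeSamFiles command that merges the
# per-chunk SAM files 1..N into one output file.
# ASSUMPTIONS: picard.jar is in the current directory and the inputs are
# named bfast.reported.file.<k>.sam; plan_sam_merge is a hypothetical helper.
plan_sam_merge() {
    local n_chunks=$1 output=${2:-merged.bam}
    local cmd="java -jar picard.jar MergeSamFiles"
    local k=1
    while [ "$k" -le "$n_chunks" ]; do
        cmd="$cmd I=bfast.reported.file.$k.sam"
        k=$(( k + 1 ))
    done
    echo "$cmd O=$output"
}

# Example: merge command for three chunks.
plan_sam_merge 3
```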


                  • rgarcia
                    Junior Member
                    • May 2012
                    • 2

                    #10
                    are mate pairs split when partitioning reads?

                    Hello! I'm using this approach to parallelize the alignment of a run of SOLiD mate-pair reads. I converted the reads using solid2fastq, then partitioned the fastq files using the unix split command. I then successfully ran the match step on each partition.

                    I then realised that solid2fastq placed the F3 reads at the beginning of the fastq file and the R3 reads at the end, so after splitting the fastq, F3 reads will be aligned separately from R3 reads. The approach suggested here is to generate a separate SAM file from each partition.

                    Does this mean that my reads won't be paired? Will samtools or Picard pair my reads when I merge my SAM files?

                    Thanks!

