  • BFAST indexing memory requirements

    I'm trying to get BFAST working as an aligner to detect human contamination in a bacterial metagenomic sample (all reads will be 100-mer Illumina reads). I am using the Ensembl build 36 human genome plus some additional novel regions from 2 other human genomes sequenced at the BGI. The total db size is ~3.0Gb, but it consists of 24 chromosomes that are VERY large, plus several thousand small sequences. So it's kind of a 'lopsided' database.

    I successfully ran 'bfast fasta2brg' on the file, but for the 'bfast index' step I am using the '-d 1' parameter to reduce the memory footprint. From other threads I'd gotten the idea that '-d 1' would probably keep my memory footprint down to ~8GB. But all my blade jobs keep dying when I request only 8GB of memory. How much memory can I expect my job to require?


    On another matter, I'm using the masks listed in the BFAST manual for 'illumina reads > 40bp'. Should those be good enough to align Illumina 100-mers, or would I be better off defining new masks? My goal is to pick out human reads from among bacterial sequences, so I believe I can be fairly relaxed in my search criteria without fear of falsely identifying bacterial reads as human.

  • #2
    In reply to jmartin's post above:
    I would not expect more than 8GB to be required when creating split indexes ("-d 1"). Nevertheless, in your case it looks like more is needed. Make sure you use the multi-threaded parameter as well. Can you test with more memory?

    As for the masks, the ones you are using are great for 100bp data.

    • #3
      I was able to index successfully using 24GB of memory per blade job. At some point I may throttle down the memory to find the minimum I can get by with for my db, which may grow somewhat.
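
One way to find that minimum without repeated trial and error is to run the indexing step once with plenty of memory and record its peak resident memory. The following is only a minimal sketch under stated assumptions: it is not part of BFAST, the helper name `peak_child_rss_gb` is made up for illustration, and it relies on Python's standard `resource` module, so it works on Linux/macOS only. The number it reports is the largest peak among all child processes the script has waited for, which is the indexing job if that is the only command it runs.

```python
import resource
import subprocess
import sys


def peak_child_rss_gb(cmd):
    """Run cmd to completion and return (exit_code, peak resident memory in GB).

    cmd is a list of arguments, e.g. the same command line submitted to the
    cluster. ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    proc = subprocess.run(cmd)
    # Peak RSS of the largest child process this script has waited for so far.
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    divisor = 1024 ** 3 if sys.platform == "darwin" else 1024 ** 2
    return proc.returncode, peak / divisor
```

To use it, pass the exact 'bfast index' command you already run as a list of strings and print the returned value. Requesting somewhat more than the reported peak leaves headroom if the db grows.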
