  • Looking for best strategy to realign BAM files

    Hi all, I was given a set of BAM files (100 GB each) that I would like to realign using bwa, and I want to do it as fast as possible.
    At the moment I start from chunked fastq files and send a single bwa alignment job to each node of my cluster, which gives good parallelism and speed.
    When starting from BAM files I have two options:
    1- convert BAM to fastq -> split fastq into chunks -> align in the same way
    2- feed bwa with BAM files
    If I go for (2) I cannot really parallelize the whole process unless I can split the BAM files into chunks that contain both reads of each pair. The only way to do this, I guess, is to sort my BAM files by read name and then split them. I have no idea how much time this would take or how much space the newly sorted files would need.
    If I go for (1) I can use Picard SamToFastq, but it takes ~25 s per 100k reads to convert... each of my BAM files contains 130M reads, so it would take more than a week just to convert.
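
As a quick sanity check on the conversion estimate, the quoted rate can be turned into hours per file. This is only arithmetic on the figures in the post (25 s per 100k reads, 130M reads per file). Whether "reads" means single reads or pairs, and how many files are in the set, is not stated, so the function name and its defaults are illustrative assumptions.

```python
def conversion_hours(n_reads, seconds_per_batch=25.0, batch_size=100_000):
    """Estimated conversion time in hours for one file at a constant rate."""
    batches = n_reads / batch_size
    return batches * seconds_per_batch / 3600.0


# Figures quoted in the post: 130M reads per file, ~25 s per 100k reads.
hours_per_file = conversion_hours(130_000_000)
print(f"about {hours_per_file:.1f} hours per file")
```

At these figures a single file comes out at roughly 9 hours, so a total of more than a week would correspond to about 19 or more files converted one after another on a single machine.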

    Does anybody want to spend two cents on this with some advice?
    thanks

    d
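
The requirement in option (2), that every chunk contains both reads of each pair, can be sketched in a few lines once the records are grouped by read name. This is a minimal illustration on plain `(read_name, payload)` tuples, not on real BAM records; the function name and the tuple layout are assumptions made for the example.

```python
from itertools import groupby


def chunk_by_read_name(records, max_records):
    """Split name-grouped records into chunks without separating mates.

    `records` is an iterable of (read_name, payload) tuples in which all
    records sharing a read name are adjacent (e.g. after a name sort).
    A chunk may exceed `max_records` only if a single read name has more
    records than that.
    """
    chunk = []
    for _name, group in groupby(records, key=lambda rec: rec[0]):
        group = list(group)
        # Close the current chunk before it would overflow, never mid-pair.
        if chunk and len(chunk) + len(group) > max_records:
            yield chunk
            chunk = []
        chunk.extend(group)
    if chunk:
        yield chunk


reads = [("r1", "fwd"), ("r1", "rev"), ("r2", "fwd"), ("r2", "rev"), ("r3", "fwd"), ("r3", "rev")]
chunks = list(chunk_by_read_name(reads, max_records=3))
```

With a limit of 3 records, the six records above end up in three chunks of two, because a pair is never split to fill a chunk.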

  • #2
    Is your data PE or SE?

    In fact you can split a BAM file directly, compressed or not, because it uses the BGZF (blocked gzip) format. I believe there are tools for this.

    I would prefer feeding the BAM files to BWA and running multiple BWA instances.

    my two pence.

    dong
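
To illustrate why a BGZF file can be cut without decompressing it, here is a minimal sketch that walks the block headers and reports where each block starts. It relies on the block layout from the SAM/BAM specification (a gzip member whose extra field carries a `BC` subfield holding the block size minus one). The function name is an assumption for this example, and note that block boundaries need not coincide with BAM record boundaries, so a real splitter has to handle records that span two blocks and copy the header into each piece.

```python
import struct


def bgzf_block_offsets(path):
    """Yield (file_offset, block_size) for every BGZF block in `path`."""
    with open(path, "rb") as fh:
        offset = 0
        while True:
            header = fh.read(12)
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated BGZF header")
            id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack("<BBBBIBBH", header)
            if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
                raise ValueError(f"not a BGZF block at offset {offset}")
            extra = fh.read(xlen)
            bsize = None
            pos = 0
            # Walk the gzip extra subfields looking for the 'BC' entry.
            while pos + 4 <= len(extra):
                si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
                if (si1, si2, slen) == (66, 67, 2):
                    bsize = struct.unpack_from("<H", extra, pos + 4)[0]
                    break
                pos += 4 + slen
            if bsize is None:
                raise ValueError(f"missing BC subfield at offset {offset}")
            block_size = bsize + 1  # BSIZE stores total block size minus 1
            yield offset, block_size
            offset += block_size
            fh.seek(offset)
```

The offsets it yields are the only places where the compressed stream can be cut cleanly; everything between two offsets is an independent gzip member.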

    • #3
      Yikes, any way you do this will require you to reorder things by read for bwa to run efficiently (I was unaware that one could even directly feed bwa a BAM file). Honestly, I suspect you'd be best off using one of the parallel versions of samtools to sort the monster BAM file by read name. I hope that there's only one alignment reported for each read, otherwise you have to take that into account if you then make a fastq file (maybe bwa knows how to deal with that in a BAM file).

      If you're familiar with programming, you could split the BAM file (don't forget to put a header on each of the split files!), sort the pieces on different cluster nodes, and finally merge them. Not being familiar with how your cluster is set up, I can't rule out that it would be simpler to just use one of the parallel versions of samtools instead.
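
The concern about more than one alignment per read can be handled before any fastq conversion by keeping only primary records. Below is a minimal sketch that works on SAM text lines (for instance the output of `samtools view -h`); it passes header lines through unchanged and drops records whose FLAG has the secondary (0x100) or supplementary (0x800) bit set. The function name is an assumption for this example.

```python
SECONDARY = 0x100      # FLAG bit: not the primary alignment
SUPPLEMENTARY = 0x800  # FLAG bit: supplementary (chimeric) alignment


def primary_records(sam_lines):
    """Yield header lines and primary alignment lines from SAM text."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line  # keep the header so the output stays valid SAM
            continue
        flag = int(line.split("\t", 2)[1])  # FLAG is the second column
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        yield line


example = [
    "@HD\tVN:1.0\tSO:queryname",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r1\t355\tchr2\t500\t0\t4M\t=\t600\t104\tACGT\tIIII",
    "r1\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII",
]
kept = list(primary_records(example))
```

In the example the record with FLAG 355 has the 0x100 bit set and is dropped, leaving the header and one record per mate.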

      • #4
        Originally posted by xied75
        Is your data PE or SE?

        In fact you can split a BAM file directly, compressed or not, because it uses the BGZF (blocked gzip) format. I believe there are tools for this.

        I would prefer feeding the BAM files to BWA and running multiple BWA instances.

        my two pence.

        dong
        These are PE data. I've started a bwa realignment directly from the BAM files, using bwa's multithreading for the alignment step. The sampe step will take ages, though...
        thanks

        d
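
For reference, the BAM-input workflow described here can be sketched as three commands: two `bwa aln` runs (one per mate, using `-b` with `-1` or `-2`) followed by `bwa sampe` with the BAM file given twice in place of the two fastq files. The helper names, file names and thread count below are assumptions for illustration, and the sketch assumes the BAM file is grouped by read name so that the two mates are read in the same order.

```python
import subprocess


def realign_steps(ref, bam, prefix, threads=4):
    """Return (argv, output_path) pairs for realigning a paired-end BAM."""
    sai1, sai2 = f"{prefix}.1.sai", f"{prefix}.2.sai"
    return [
        # -b: input is BAM; -1 / -2: use the first / second read of each pair
        (["bwa", "aln", "-t", str(threads), "-b", "-1", ref, bam], sai1),
        (["bwa", "aln", "-t", str(threads), "-b", "-2", ref, bam], sai2),
        # sampe takes the BAM twice where it would take the two fastq files
        (["bwa", "sampe", ref, sai1, sai2, bam, bam], f"{prefix}.sam"),
    ]


def run_steps(steps):
    """Run each command in order, writing its stdout to the paired path."""
    for argv, out_path in steps:
        with open(out_path, "wb") as out:
            subprocess.run(argv, stdout=out, check=True)


steps = realign_steps("ref.fa", "sample.bam", "sample.realigned", threads=8)
```

Only the `aln` steps benefit from `-t`; the final `sampe` step is the part that remains serial.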

        • #5
          Originally posted by dawe
          These are PE data. I've started a bwa realignment directly from the BAM files, using bwa's multithreading for the alignment step. The sampe step will take ages, though...
          thanks

          d
          Yes, because sampe is single-threaded no matter how powerful your machine is, and one instance will eat >6 GB of memory, so running multiple instances is also difficult unless you have something like 128 GB.

          Turning on -P will eat more memory, but should run faster.

          My Windows bwa can do multithreaded sampe. I suggested that people could use my approach to modify the Linux version, but nobody seems interested in taking on this job. If I get some free time I might do it myself.

          Best,

          dong
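
The memory constraint above can be turned into a simple cap on how many sampe jobs run at once. This is a rough sketch: the 6 GB per instance is the lower bound quoted in the post (more with `-P`), and the function names, the optional reserve and the `run_one` callback are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def max_instances(total_gb, per_instance_gb, reserve_gb=0.0):
    """How many jobs fit in memory, keeping `reserve_gb` free; at least 1."""
    usable = total_gb - reserve_gb
    return max(1, int(usable // per_instance_gb))


def run_bounded(jobs, workers, run_one):
    """Run `run_one(job)` for every job, at most `workers` at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, jobs))


# 128 GB node, 6 GB per sampe instance, 8 GB kept free for everything else.
workers = max_instances(128, 6, reserve_gb=8)
```

In practice `run_one` would launch one sampe process per chunk (for example through `subprocess.run`), so the threads only wait on external processes and the cap keeps total memory within the node.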
