  • Looking for best strategy to realign BAM files

    Hi all, I was given a set of BAM files (100 Gb each) that I would like to realign using bwa. The problem is that I would like to do this as fast as possible.
    At the moment I start from chunked fastq files and send a single bwa alignment to each node of my cluster, which gives me good parallelism and speed.
    When starting from BAM files I have two options:
    1- convert BAM to fastq -> split fastq in chunks -> align in the same way
    2- feed bwa with BAM files
    If I go for (2) I cannot really parallelize the whole process, unless I can split the BAM files into chunks that each contain both mates of every fragment. The only way to do this, I guess, is to sort my BAM files by read name and then split them. I have no idea how much time this would take or how much space the newly sorted files would need.
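
The "sort by read name, then split" idea can be sketched as below. This is a minimal illustration that works on SAM text lines, not the poster's actual pipeline; the function names `read_name` and `chunk_by_read_name` are hypothetical, and it assumes the records have already been grouped by read name so that both mates are adjacent.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def read_name(sam_line: str) -> str:
    """Return QNAME, the first tab-separated field of a SAM record."""
    return sam_line.split("\t", 1)[0]


def chunk_by_read_name(records: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield chunks of name-grouped SAM records without splitting a read name.

    Assumption: `records` is already grouped by read name (for example after
    a name sort), so all records of one fragment are adjacent. A chunk is
    closed once it holds `reads_per_chunk` distinct read names.
    """
    chunk: List[str] = []
    names_in_chunk = 0
    for _, group in groupby(records, key=read_name):
        chunk.extend(group)  # keep every record of this read name together
        names_in_chunk += 1
        if names_in_chunk >= reads_per_chunk:
            yield chunk
            chunk, names_in_chunk = [], 0
    if chunk:
        yield chunk  # final, possibly smaller, chunk
```

Because a chunk is only ever closed between two read names, every chunk contains complete pairs and could be handed to a separate aligner instance.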
    If I go for (1) I can use picard SamToFastq, but it takes ~25 s per 100k reads to convert. Each of my BAM files contains 130M reads, and it would take more than a week just to do the conversion.
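
As a quick back-of-the-envelope check of the figures quoted above, the sketch below assumes the conversion rate stays constant at 25 s per 100k reads. The function name `conversion_time_hours` is hypothetical. Taken at face value, the numbers come to about nine hours per file, so the week-long estimate presumably refers to the whole set of files converted one after another, or to a slower rate than the one quoted.

```python
def conversion_time_hours(total_reads: int, seconds_per_batch: float, batch_size: int) -> float:
    """Estimate wall-clock hours, assuming a constant conversion rate."""
    batches = total_reads / batch_size
    return batches * seconds_per_batch / 3600.0


# Figures from the post: 130M reads per file, ~25 s per 100k reads.
hours_per_file = conversion_time_hours(130_000_000, 25.0, 100_000)
print(f"{hours_per_file:.1f} hours per file")  # prints "9.0 hours per file"
```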

    Does anybody want to offer their two cents on this?
    thanks

    d

  • #2
    Is your data PE or SE?

    In fact you can split a BAM file directly, whether it is compressed or not, because it uses the bgzip format. I believe there are tools for this.

    I would prefer to feed the BAM files to BWA directly and run multiple BWA instances.
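
The reason a bgzip-compressed file can be cut into pieces is that it is a series of independently compressed gzip blocks. The sketch below only illustrates that property with Python's standard `gzip` module; it does not read real BAM data, and the payloads are made up. Real BGZF blocks also carry a block-size field in the gzip header, and a BAM record (or the two mates of a pair) may straddle two blocks, so cutting at block boundaries alone does not guarantee clean record or pair boundaries.

```python
import gzip

# Three payloads compressed independently, standing in for BGZF blocks.
payloads = [b"block-one;", b"block-two;", b"block-three;"]
members = [gzip.compress(p) for p in payloads]

# Concatenating the members yields one valid multi-member gzip stream.
stream = b"".join(members)
whole = gzip.decompress(stream)

# Each member can also be decompressed on its own, without its neighbours.
second_only = gzip.decompress(members[1])
```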

    my two pence.

    dong



    • #3
      Yikes, any way you do this will require you to reorder things by read for bwa to run efficiently (I was unaware that one could even directly feed bwa a BAM file). Honestly, I suspect you'd be best off using one of the parallel versions of samtools to sort the monster BAM file by read name. I hope that there's only one alignment reported for each read, otherwise you have to take that into account if you then make a fastq file (maybe bwa knows how to deal with that in a BAM file).
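
The concern about multiple alignments per read can be handled with the SAM FLAG field when writing fastq. The sketch below assumes SAM text records with the standard eleven mandatory columns; the function names are hypothetical. It skips secondary and supplementary alignments so each read is emitted once, and restores the original orientation of reads that were stored reverse-complemented.

```python
from typing import Optional

SECONDARY = 0x100       # SAM FLAG bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM FLAG bit: supplementary alignment
REVERSE = 0x10          # SAM FLAG bit: SEQ is stored reverse-complemented
FIRST_IN_PAIR = 0x40    # SAM FLAG bit: first mate
SECOND_IN_PAIR = 0x80   # SAM FLAG bit: second mate

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def is_primary(flag: int) -> bool:
    """True when the record is neither secondary nor supplementary."""
    return not (flag & (SECONDARY | SUPPLEMENTARY))


def sam_to_fastq_record(sam_line: str) -> Optional[str]:
    """Return a 4-line FASTQ record for a primary alignment, else None."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if not is_primary(flag):
        return None  # avoid emitting the same read more than once
    if flag & REVERSE:
        # Undo the aligner's reverse-complement to recover the original read.
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    suffix = "/1" if flag & FIRST_IN_PAIR else "/2" if flag & SECOND_IN_PAIR else ""
    return f"@{name}{suffix}\n{seq}\n+\n{qual}\n"
```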

      If you're familiar with programming, you could split the BAM file (don't forget to put a header on each of the split files!) and then sort them on different cluster nodes (not being familiar with how your cluster is made, it may be simpler to just use one of the parallel versions of samtools instead) until eventually merging them.
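
A minimal sketch of the split, sort, and merge idea, again on SAM text lines, with hypothetical function names. It assumes all `@` header lines come before the records, and it compares read names as plain strings, which may differ from the ordering a tool such as samtools uses for its name sort. On real BAM files the sorting and merging would normally be left to such a tool.

```python
import heapq
from typing import Iterable, Iterator, List


def split_with_header(sam_lines: Iterable[str], records_per_chunk: int) -> Iterator[List[str]]:
    """Split SAM text into chunks, repeating the @-header in every chunk."""
    header: List[str] = []
    body: List[str] = []
    for line in sam_lines:
        if line.startswith("@"):
            header.append(line)  # header lines precede the records
            continue
        body.append(line)
        if len(body) == records_per_chunk:
            yield header + body
            body = []
    if body:
        yield header + body


def merge_name_sorted(chunks: List[List[str]]) -> List[str]:
    """Merge chunks that were each sorted by read name into one sorted list."""
    header = [line for line in chunks[0] if line.startswith("@")]
    bodies = [[line for line in c if not line.startswith("@")] for c in chunks]
    by_name = lambda line: line.split("\t", 1)[0]
    return header + list(heapq.merge(*bodies, key=by_name))
```

Each chunk produced by `split_with_header` is a valid standalone SAM fragment, so it can be sorted on a different node before the final merge.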



      • #4
        Originally posted by xied75
        Is your data PE or SE?

        In fact you can split a BAM file directly, whether it is compressed or not, because it uses the bgzip format. I believe there are tools for this.

        I would prefer to feed the BAM files to BWA directly and run multiple BWA instances.

        my two pence.

        dong
        These are PE data. I've started a bwa realignment directly from the BAM files, using bwa's multithreading for the alignment step. The sampe step will take ages, though...
        thanks

        d



        • #5
          Originally posted by dawe
          These are PE data. I've started a bwa realignment directly from the BAM files, using bwa's multithreading for the alignment step. The sampe step will take ages, though...
          thanks

          d
          Yes, because sampe is single-threaded no matter how powerful your machine is, and one instance will eat >6 GB of memory, so running multiple instances is also difficult unless you have something like 128 GB.

          Turning on -P will eat more memory, but it should run faster.
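
To make the memory constraint concrete, the sketch below estimates how many sampe instances fit on one machine. It uses the figures from this post (roughly 6 GB per instance, 128 GB of RAM); the function name and the optional `reserved_gb` headroom for the operating system are assumptions for illustration. Since each instance needs more than 6 GB, and more again with -P, the result is an upper bound.

```python
def max_concurrent_instances(total_ram_gb: float, ram_per_instance_gb: float,
                             reserved_gb: float = 0.0) -> int:
    """Upper bound on instances that fit in memory at the same time."""
    usable = total_ram_gb - reserved_gb
    if usable <= 0 or ram_per_instance_gb <= 0:
        return 0
    return int(usable // ram_per_instance_gb)


# 128 GB machine, ~6 GB per single-threaded sampe instance.
print(max_concurrent_instances(128, 6))  # prints 21
```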

          My Windows build of bwa can run sampe multithreaded. I suggested that people could use my approach to modify the Linux version, but nobody seems interested in taking on the job. If I find some free time I might do it myself.

          Best,

          dong
