Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • memento
    Member
    • Nov 2010
    • 49

    fastest solution to get bam2fastq done

    Hi,
    I am currently using Hydra package solution but quite quickly this gets clogged as all the RAM is used (4gb). Is there any other solution that has smaller memory footprint (picard?)? What would be the optimal memory size (considering that any other tool would not offer better memory handling)?

    Thanks!
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    When you say bam2fastq are you trying to recover the original read sequences (possibly paired) as FASTQ files, in their original orientation (reverse complement undone if mapped to the reverse strand)?

    Comment

    • memento
      Member
      • Nov 2010
      • 49

      #3
      that is correct maubp.

      Comment

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        I think EMBOSS seqret can do BAM to FASTQ, but I'm not sure what it does with pairs or reads mapped to the reverse strand.

        I did this once myself using a custom Python script.

        Comment

        • cjp
          Member
          • Jun 2011
          • 58

          #5
          This software worked OK on my 4GB workstation:



          Chris

          Comment

          • Richard Finney
            Senior Member
            • Feb 2009
            • 701

            #6
            bampe2fqs.c using AVL trees ...

            For the peoples ...

            I've run into similar problems: bam to fastq implementations take up too much memory and take all day to run. I whipped together my own solution.
            This thing is very fast and uses very little memory (it outputs a pair when the second pair is input). It will take a little knowledge of how to point to libraries and headers to get working on your system.

            It uses AVL trees. It only works on paired end data.

            You must install samtools, tree library and possibly zlib. The tree library is at http://piumarta.com/software/tree/


            CODE HAS BEEN REDACTED FOR MORE ROBUST VERSION, SEE LATER POSTS (edit 6/21/2013)
            Last edited by Richard Finney; 06-21-2013, 10:46 AM.

            Comment

            • memento
              Member
              • Nov 2010
              • 49

              #7
              @Richard

              wow, this really zips thru the data! is there counterpart solution for non-paired data?

              Comment

              • Richard Finney
                Senior Member
                • Feb 2009
                • 701

                #8
                Single End bam to fastq ?

                There's actually faster ways to do it, a hash table implementation I did was 5x faster, but had a fixed amount of memory. I just wanted something simple and guaranteed to work on about any system (including the 4GB nodes on biowulf at NIH) and would run fast enough that I didn't lose focus. The implementation above uses heap for unmatched pairs and so uses very little memory on a proper ("strict"?) paired end bam. Soon as the pair is matched, the node is deleted and the AVL tree is rebalanced. The pieces are there for a single end implementation, just need to cut some code out. I'll see if can do it; check back tomorrow. Everybody's welcome to do whatever they want with the code.

                Comment

                • Richard Finney
                  Senior Member
                  • Feb 2009
                  • 701

                  #9
                  bamse2fq.c - single ended version

                  Here it is. This is the single ended version. Nothing special but does (re-)reverse compliment the sequence and (re-)reverses the quality for opposite strand hits. Please check your results as this is brand new code.


                  /*

                  bam to fastq SE (single end) version [ use bampe2fq for paried end data] .
                  how to compile:

                  edit line below to point to samtools libraries

                  gcc -Wall -O3 -I/h1/finneyr/samtools-0.1.18/ bamse2fqs.c -o bamse2fqs -lm /h1/finneyr/samtools-0.1.18/libbam.a /h1/finneyr//zlib-1.2.5/libz.a (AVL tree not used in this version).

                  example
                  ./bamse2fqs /tcga_next_gen/NG_buckets/bucket11/bam/TCGA-AA-3858-01A-01W-0900-09_IlluminaGA-DNASeq_exome.bam

                  outputs a fastq file called 1se.fq

                  */


                  #include <stdio.h>
                  #include <stdlib.h>
                  off_t ftello(FILE *stream); // sam.h complains unless this prototype is there for -Wall on older systems
                  #include "sam.h"
                  #include <sys/types.h>
                  #include <unistd.h>

                  static char seq[500];
                  static char qual[500];

                  static char compli(char ch)
                  {
                  if (ch == 'A') return 'T';
                  else if (ch == 'T') return 'A';
                  else if (ch == 'C') return 'G';
                  else if (ch == 'G') return 'C';
                  else if (ch == 'N') return 'N';
                  else if (ch == 'a') return 't';
                  else if (ch == 't') return 'a';
                  else if (ch == 'c') return 'g';
                  else if (ch == 'g') return 'c';
                  return '.';
                  }

                  static void reverse_and_compliment(char s[], char outs[], int len)
                  {
                  register int i,j;

                  j = 0;
                  i = len - 1;
                  while (i >= 0)
                  outs[j++] = compli(s[i--]);
                  return;
                  }

                  static void reverse(char s[], char outs[])
                  {
                  register int i,j;

                  i = strlen(s);
                  i--;

                  j = 0;
                  while (i >= 0)
                  {
                  outs[j] = s[i];
                  j++;
                  i--;
                  }
                  outs[j] = (char)0;
                  strcpy(s,outs);
                  return;
                  }


                  static FILE *fp1; // fastq output file


                  long int in_cnt = 0L;
                  long int flip_cnt = 0L;

                  static int fetch_func(const bam1_t *b, void *data)
                  {
                  uint32_t *cigar = bam1_cigar(b);
                  const bam1_core_t *c = &b->core;
                  int i, l;
                  // NO!!!!!!!!! if (b->core.tid < 0) return 0;
                  for (i = l = 0; i < c->n_cigar; ++i) {
                  int op = cigar[i]&0xf;
                  if (op == BAM_CMATCH || op == BAM_CDEL || op == BAM_CREF_SKIP)
                  l += cigar[i]>>4;
                  }

                  uint8_t *s = bam1_seq(b);
                  uint8_t *t = bam1_qual(b);
                  qual[0] = (char)0;
                  seq[0] = (char)0;
                  for (i = 0; i < c->l_qseq; i++) seq[i] = bam_nt16_rev_table[bam1_seqi(s, i)];
                  seq[i] = (char)0;
                  i = 0;
                  for (i = 0; i < c->l_qseq; ++i) qual[i] = t[i] + 33;
                  qual[i] = (char)0;
                  if (c->flag&BAM_FREVERSE)
                  {
                  char tmps[512];

                  reverse_and_compliment(seq, tmps, strlen(seq));
                  strcpy(seq,tmps);
                  reverse(qual, tmps);
                  strcpy(qual,tmps);
                  flip_cnt++;
                  }
                  in_cnt++;

                  fprintf(fp1,"@%s\n%s\n+\n%s\n",bam1_qname(b), seq,qual); // single end, just output it

                  #if 0 // for paired end - see bampe2fq.c
                  if (c->flag&BAM_FREAD1) add(bam1_qname(b),0,seq,qual);
                  else add(bam1_qname(b),1,seq,qual);
                  #endif
                  return 0;
                  }


                  int main(int argc, char *argv[])
                  {
                  samfile_t *fp;

                  if (argc == 1) {
                  fprintf(stderr, "Usage: bamse2fq <in.bam> \n");
                  return 1;
                  }
                  if ((fp = samopen(argv[1], "rb", 0)) == 0) {
                  fprintf(stderr, "bamse2fq: Fail to open BAM file %s\n", argv[1]);
                  return 1;
                  }

                  if (argc == 2) { /* if a region is not specified */
                  fp1 = fopen("1se.fq","w");
                  bam1_t *b = bam_init1();
                  while (samread(fp, b) >= 0) fetch_func(b, fp);
                  bam_destroy1(b);
                  fclose(fp1);
                  printf("DONE in_cnt = %ld , reversed=%ld \n",in_cnt,flip_cnt);
                  } else {
                  fprintf(stderr,"ERROR: ");
                  fprintf(stderr, "Usage: bamse2fq <in.bam> \n");
                  }
                  samclose(fp);
                  return 0;
                  }
                  Last edited by Richard Finney; 12-22-2011, 06:35 PM.

                  Comment

                  • lh3
                    Senior Member
                    • Feb 2008
                    • 686

                    #10
                    The latest samtools actually has a hidden command: bam2fq, which converts a BAM to *single-end* FASTQ.

                    @Richard: samtools comes with a single-header hash table library khash.h. BTW, I could not open your link. I would be interested in an AVL tree implementation in C. I know a couple single-header libraries for B-tree and red-black trees.

                    Comment

                    • Richard Finney
                      Senior Member
                      • Feb 2009
                      • 701

                      #11
                      AVL library

                      Heng (lh3),
                      Yes, the link is down. It was up the other day. The server at http://piumarta.com must be taking a holiday. It will probably be back. Regardless I'll email you the AVL "library" at your "me" email address (let me know if there's a better one). Tree-1.0 is really just include files, not an actual linkable library. It is MIT license. Re-implementing the data structures and operations of an AVL package would be pretty easy if it does not suit your needs.

                      AVL seems most elegant way and will avoid the pathological worst case key clashes of a hash solution; but hash will in most cases be faster with the cost of larger memory usage.

                      Comment

                      • vinay052003
                        Member
                        • Jan 2010
                        • 59

                        #12
                        Hi guys,
                        I've run into a more complex problem regarding bam to fastq conversion. Somehow my bam files have some unpaired reads mixed with the paired-end reads. I was running Picard but that seems to terminate when it finds unpaired read.
                        Any suggestions on how to proceed now?

                        Thanks in advance.

                        Comment

                        • Richard Finney
                          Senior Member
                          • Feb 2009
                          • 701

                          #13
                          [ EDIT 6/21/2013 ... see later posts for full, improved more robust version ]
                          If you want the "orphan" reads [ i.e. read not properly paired], modify bampe2fqs.c as instructed below ...

                          This will dump "orphan" reads to the file "orphans.fq" in addtion to the pairs in 1.fq and 2.fq

                          1) Uncomment dump_orphans() call.

                          2) Add this code to bampe2fqs.c after line with "static Tree tree;"

                          Remember AVL tree "library" is at http://piumarta.com/software/tree/ as include files and point to your libz and libbam libraries when compiling.

                          =============== insert after Tree declaration ==============
                          int orphcnt = 0; // count of orphans reads

                          void orphan_node_printer(Node *self, void *stream)
                          {
                          fprintf(stream,"@%s\n%s\n+\n%s\n",self->name, self->seq,self->qual);
                          orphcnt++;
                          }
                          void dump_orphans()
                          {
                          FILE *fpo; // orphan fastqs
                          fpo = fopen("orphan.fq","w");
                          TREE_FORWARD_APPLY(&tree, _Node, linkage, orphan_node_printer, fpo);
                          fclose(fpo);
                          fprintf(stderr,"DONE, %d orphans flushed\n",orphcnt);
                          }
                          Last edited by Richard Finney; 06-21-2013, 10:47 AM.

                          Comment

                          • lh3
                            Senior Member
                            • Feb 2008
                            • 686

                            #14
                            @Richard: khash takes smaller memory than most search trees on small key-value pairs. The only implementations that beat khash on memory are stx-tree and my own kbtree.h, both of which are based on B-tree/B+-tree. Nonetheless, if you store large objects (I usually store pointers instead), tree structures can be more memory efficient.

                            Comment

                            • vinay052003
                              Member
                              • Jan 2010
                              • 59

                              #15
                              Hi Richard,
                              Thanks a lot for the info. Works perfectly.
                              Does this program make any assumptions about the input bam file (sorting on bam file). Does it also work if I have multiple alignments for reads (non-primary alignments)?

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              29 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-26-2026, 10:12 AM
                              0 responses
                              31 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...