  • Issue with BGZF and Samtools

    Hi,

    I'm a college student working on a parallel version of Samtools.
    I have to use BGZF to compress blocks of data in order to create a .bam file.

    But I'm facing a problem. When I try to read the BAM file I created with samtools view, I get the following error:
    Code:
    [bam_header_read] EOF marker is absent. The input is probably truncated.
    What did I do wrong?
    How can I fix it?

    I'm happy to answer any further questions.
    Thanks for your help !

  • #2
    If you are streaming/piping the data into samtools, v0.1.18 and v0.1.19 would wrongly give this warning message as a false alarm: https://github.com/samtools/samtools/issues/18

    Have you actually written the 28-byte empty BGZF block at the end of the file? This was only formally added to the specification in Dec 2013 but was in use long before that.
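
    As a rough illustration, the marker can be checked for (or appended) with plain C file I/O. This is only a sketch: the helper names `has_bgzf_eof` and `append_bgzf_eof` are made up for this example and are not part of samtools, and the byte values are the empty BGZF block as listed in the SAM/BAM specification.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* The 28-byte empty BGZF block used as the end-of-file marker
 * (byte values as listed in the SAM/BAM specification). */
static const unsigned char BGZF_EOF[28] = {
    0x1f, 0x8b, 0x08, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00, 0xff,
    0x06, 0x00, 0x42, 0x43, 0x02, 0x00, 0x1b, 0x00, 0x03, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

/* Hypothetical helper: returns 1 if the file ends with the marker,
 * 0 if it does not, and -1 if the file cannot be opened or read. */
static int has_bgzf_eof(const char *path)
{
    unsigned char tail[sizeof BGZF_EOF];
    int result = 0;
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    /* Seek to 28 bytes before the end. If the seek fails (for example
     * because the file is shorter than 28 bytes), report "absent". */
    if (fseek(f, -(long)sizeof BGZF_EOF, SEEK_END) == 0) {
        if (fread(tail, 1, sizeof tail, f) == sizeof tail)
            result = (memcmp(tail, BGZF_EOF, sizeof tail) == 0);
        else
            result = -1;
    }
    fclose(f);
    return result;
}

/* Hypothetical helper: appends the marker to an already-open binary
 * output stream. Returns 0 on success, -1 on a short write. */
static int append_bgzf_eof(FILE *out)
{
    assert(out != NULL);
    return fwrite(BGZF_EOF, 1, sizeof BGZF_EOF, out) == sizeof BGZF_EOF ? 0 : -1;
}
```

    Calling `has_bgzf_eof()` on the generated .bam file is a quick way to tell a real truncation apart from the false alarm seen when piping.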


    • #3
      Are you piping the file into samtools when you get that warning? That's a common occurrence and can be ignored (some versions of samtools mistakenly try to check for an end-of-file (EOF) marker when using pipes).

      If not, it's likely your file is truncated for some reason (we'd have to know more about how you created the file in the first place to guess how).

      Edit: I should have refreshed the tab, I see Peter beat me to it!


      • #4
        Hi guys, thanks for your answers.

        Yes, I was piping the file into Samtools.

        I looked into bgzf.c and found that the bgzf_close() function adds this 28-byte EOF marker.

        But when I try to use this function, I get a segmentation fault, and I don't know why.


        • #5
          Please clarify: Are you getting a segmentation fault from your own code when writing the EOF marker - or are you getting a segmentation fault from samtools view when reading your BAM file?


          • #6
            It would also be helpful if you posted some of the code that's causing the problem.


            • #7
              Sorry, here is the code

              Code:
              void compressData(MPI_File *in, MPI_File *out, const int rank, const int num_proc, const int overlap,
                             char ***lines, int *nlines) {
              
                  MPI_Offset filesize;
                  MPI_Offset localsize;
                  MPI_Offset start;
                  MPI_Offset end;
                  char *chunk;
                  uint8_t *dunk;
                  BGZF *fp;
                  int *offset_tab;
                  /* figure out who reads what */
              
                  MPI_File_get_size(*in, &filesize);
                  localsize = filesize/num_proc;
                  start = rank * localsize;
                  end   = start + localsize - 1;
              
                  /* add overlap to the end of everyone's chunk... */
                  end += overlap;
              
                  /* except the last processor, of course */
                  if (rank == num_proc-1) end = filesize;
              
                  localsize =  end - start + 1;
                  /* allocate memory */
                  chunk = malloc( (localsize)*sizeof(char));
              
                  /* everyone reads in their part */
                  printf("Rank %d we read data!! \n", rank);
                  MPI_File_read_at_all(*in, start, chunk, localsize, MPI_CHAR, MPI_STATUS_IGNORE);
                  //chunk[localsize] = '\0';
              
                  int dlen;
                  int slen = strlen(chunk);
                  printf("Rank %d size of the data read %d !! \n", rank, slen);
                  printf("Rank %d start compression!! \n", rank);
              
              
              
                  bam_header_t *head = bam_header_init();
                  bam_header_write(fp, head);
              
                  fp = bgzf_write_init(Z_DEFAULT_COMPRESSION);
                  memcpy(fp->uncompressed_block, chunk, localsize);
                  int comp_size = deflate_block(fp, slen);
              	
              	
                  if(!bgzf_close(fp)){
                  	printf("Error for CPU number %d", rank);
                  	exit(2);
                  }
              
              }
              I get a segmentation fault when I call bgzf_close(), but no error at all when I don't call it.

              I used this function to add the EOF marker at the end of the blocks.
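
              For reference, the "who reads what" arithmetic in the function above can be written with half-open byte ranges so that no rank is asked to read past the end of the file. This is a minimal sketch, not the code from the project: `byte_range` and `rank_range` are hypothetical names, and it assumes the same scheme as above (equal-sized chunks, a fixed overlap, and the last rank taking the remainder).

```c
#include <assert.h>

/* Half-open byte range [start, end) that one rank should read. */
typedef struct {
    long long start;
    long long end;
} byte_range;

/* Hypothetical helper: splits a file of `filesize` bytes across
 * `num_proc` ranks. Every rank except the last reads `overlap`
 * extra bytes, clamped so the range never extends past the file. */
static byte_range rank_range(long long filesize, int rank,
                             int num_proc, long long overlap)
{
    byte_range r;
    long long base;

    assert(filesize >= 0 && num_proc > 0 && overlap >= 0);
    assert(rank >= 0 && rank < num_proc);

    base = filesize / num_proc;
    r.start = (long long)rank * base;
    if (rank == num_proc - 1) {
        r.end = filesize;                 /* last rank takes the remainder */
    } else {
        r.end = r.start + base + overlap; /* chunk plus overlap */
        if (r.end > filesize)
            r.end = filesize;             /* never read past the file */
    }
    return r;
}
```

              The number of bytes to read is then `end - start`. A buffer that is later passed to strlen() needs one extra byte for a terminating '\0', since the data read from the file is not NUL-terminated.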


              • #8
                Originally posted by granzanimo:
                Sorry, here is the code

                Code:
                    bam_header_t *head = bam_header_init();
                    bam_header_write(fp, head);
                You're writing an initialized but otherwise empty struct to an uninitialized file pointer...

                Code:
                    fp = bgzf_write_init(Z_DEFAULT_COMPRESSION);
                Now you have an initialized BGZF struct, though it still has no file association.

                Code:
                    if(!bgzf_close(fp)){
                Since "fp->fp" points to uninitialized memory, this will segfault in the internal fclose(fp->fp) step.

                Firstly, there's usually no reason to manually add the EOF marker to a BAM file, since by doing so you're probably lying to yourself that the contents aren't corrupt. Secondly, why are you trying to do this with MPI? The bottleneck here is usually I/O, which is often saturated with 4 or so compression threads.


                • #9
                  Thanks for your answer

                  The MPI code is there because I'm trying to run this program on an 800-CPU cluster.
                  I'm trying to have each CPU compress its own block of data.

                  I still don't understand: what do I need to fix in my code?


                  • #10
                    Firstly, you need to

                    Code:
                    fp = bgzf_write_init(Z_DEFAULT_COMPRESSION);
                    before you can
                    Code:
                    bam_header_write(fp, head);
                    Secondly, the above line is problematic. While the header will likely fit into the buffer and not cause a call to bgzf_flush, if it doesn't fit you'll get a segfault, since fp->fp isn't initialized.

                    Code:
                    if(!bgzf_close(fp)){
                    You'll have to write your own close function, since bgzf_close() will first try to flush the buffer, which it can't due to the aforementioned fp->fp issue ... causing the segfault that you saw.
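
                    The buffer behaviour described above can be illustrated with a small stand-in. This is only a sketch: `toy_writer`, `toy_write` and `toy_flush` are hypothetical and are not the real BGZF struct or samtools functions. It shows why a small write succeeds with no file attached, while a write that fills the buffer forces a flush that has nowhere to go. The toy version reports an error at that point instead of crashing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_BUF_CAP 16

/* Stand-in for a buffered writer: data is collected in `buf` and only
 * touches `file` when the buffer is flushed. */
typedef struct {
    FILE  *file;              /* backing file, NULL if none is attached */
    char   buf[TOY_BUF_CAP];
    size_t used;              /* number of buffered bytes */
} toy_writer;

/* Writes buffered bytes to the backing file.
 * Returns 0 on success, -1 if there is data but no usable file. */
static int toy_flush(toy_writer *w)
{
    assert(w != NULL);
    if (w->used == 0)
        return 0;             /* nothing buffered, nothing to do */
    if (w->file == NULL)
        return -1;            /* guard: no file association */
    if (fwrite(w->buf, 1, w->used, w->file) != w->used)
        return -1;
    w->used = 0;
    return 0;
}

/* Copies data into the buffer, flushing whenever the buffer is full.
 * Returns 0 on success, -1 if a required flush fails. */
static int toy_write(toy_writer *w, const char *data, size_t len)
{
    assert(w != NULL && data != NULL);
    while (len > 0) {
        size_t room = TOY_BUF_CAP - w->used;
        size_t n = len < room ? len : room;
        memcpy(w->buf + w->used, data, n);
        w->used += n;
        data += n;
        len -= n;
        if (w->used == TOY_BUF_CAP && toy_flush(w) != 0)
            return -1;
    }
    return 0;
}
```

                    With a zero-initialized `toy_writer`, writing 3 bytes succeeds because nothing is flushed, while writing another 16 bytes fails as soon as the buffer fills. A custom close function for the real struct would need a similar guard, or would have to hand the compressed bytes to MPI instead of to fp->fp.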
