Header Leaderboard Ad

Collapse

"samtools: no @sq lines in the header" - metagenome!

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • "samtools: no @sq lines in the header" - metagenome!

    Hi,

    I'm trying to clean a bam file, by deleting reads that failed QC (for instance), and writing a new bam file. When I run the following test command:
    Code:
    samtools view -h dirty.bam | samtools view -bS - > clean.bam
    I get the following error:
    Code:
    [samopen] no @SQ lines in the header.
    From googling the error message, every post I've read says "use the -t reference.tsv option" or "use the -T reference.fasta option". The problem is, this data is for a metagenome, and consequently there is no reference genome. Fortunately, this is samtools version 0.1.18, and the program doesn't abort (unlike samtools 0.1.16) when it prints this error, but I just wanted to let the world know that -t/-T is not required nor desired in some cases.

    If someone could update the man page for samtools to reflect that, and maybe eliminate or clarify the actual error message, that would be great.

    Thanks!
    Last edited by PenelopeFudd; 08-08-2012, 08:56 AM. Reason: or -> nor

  • #2
    When you do 'samtools view' to convert BAM into SAM, the header isn't shown by default. Try -h here to include it.

    Comment


    • #3
      There's already an '-h' in the command.

      The dirty.bam file has three header lines:

      @HD VN:1.0 SO:coordinate
      @RG ID:124048 PL:illumina PU:C06N6ACXX.7 LB:INX749 SM:MF9.8 CN:BCCAGSC
      @PG ID:0 VN:0.5.7 CL:bwa aln; bwa sampe
      None of these header lines are @SQ lines. Here's a few lines of data:
      HS11_168:7:1101:10000:124266 77 * 0 0 * * 0 0 TGTGTTGGGTGTGTTTGGGGGGTGGGTTTACATATTATCTCGTTTATTCAACGAGGTCCATGGGTTCCTCTGGAG :11=ABDDF?DFFGIIIIIIIB&[email protected]>[email protected]<B?B2>[email protected]@@)(07)[email protected]:>(3<9?14::3>A:< RG:Z:124048
      HS11_168:7:1101:10000:124266 141 * 0 0 * * 0 0 TGTGTTGGGGGGGTTTGGGTTGGGTGGGGAATTTGCCTGCTTGATTGTTTTCGCAAATATATTTCCCCCAATTAT =?1ADDDFG08')6:[email protected]######################################################## RG:Z:124048
      As you can see, columns 3 through 9 are '*' or '0'. No alignment has been done, so no reference sequence can or should be specified with @SQ, if I'm understanding the file format correctly. Is this right?

      Thanks

      Comment


      • #4
        I think that samtools 0.1.18 will still abort with an error if it runs across a sequence name that is not in the header.

        For example, I ran the following on a well-formed bam file:

        samtools view input.bam | samtools view -bS -
        [samopen] no @SQ lines in the header.
        [sam_read1] missing header? Abort!
        This was to simulate trying to convert a SAM to a BAM that didn't have any header information (so I intentionally left off the -h flag). I think that because you don't have aligned reads, you get that warning but SAMtools can still proceed. If the data were aligned, then you would need to have names for the reference sequences within the header section.

        Justin

        Comment


        • #5
          Have you tried the very latest samtools (from github)? Under some circumstances is will generate the missing @SQ lines from the BAM header (which contains the reference names and their lengths).

          Comment


          • #6
            Originally posted by maubp View Post
            Have you tried the very latest samtools (from github)? Under some circumstances is will generate the missing @SQ lines from the BAM header (which contains the reference names and their lengths).
            I haven't tried the very latest samtools, but with this file, there is no reference name nor length in the headers that I could see.

            Perhaps the sequencing company is using this file format in an unexpected way?

            Originally posted by BAMseek View Post
            This was to simulate trying to convert a SAM to a BAM that didn't have any header information (so I intentionally left off the -h flag). I think that because you don't have aligned reads, you get that warning but SAMtools can still proceed. If the data were aligned, then you would need to have names for the reference sequences within the header section.
            Ok, that makes sense, although if it can still proceed, I'd like it to not print the message, and to update the man page to say more about it. Do you know where I could send such a feature request? Is it [email protected]?

            Thanks

            Comment


            • #7
              Originally posted by PenelopeFudd View Post
              I haven't tried the very latest samtools, but with this file, there is no reference name nor length in the headers that I could see.
              BAM files have a mandatory table of reference names and lengths, plus optionally an embedded SAM header which may repeat that information as @SQ lines (and can supplement it with extra things like MD5 checksums).

              The main reason for this is each mapped BAM read doesn't store the reference name, but the index of the reference. This is much more space efficient (one integer rather than a many character long string).

              It does mean it is possible to make a corrupt BAM file where the BAM header and the embedded @SQ lines don't match up.

              Comment


              • #8
                Originally posted by maubp View Post
                BAM files have a mandatory table of reference names and lengths, plus optionally an embedded SAM header which may repeat that information as @SQ lines (and can supplement it with extra things like MD5 checksums).
                ...
                It does mean it is possible to make a corrupt BAM file where the BAM header and the embedded @SQ lines don't match up.
                Aha, I see the problem. The Illumina sequencing machine generated a BAM file with no table of reference names and lengths, and no @SQ lines, as it did not have that information. In fact, nobody has that information, it's unknown. Illumina is using a BAM file in this instance for unaligned, raw data. This conflicts with this 'mandatory' table.

                Comment


                • #9
                  Originally posted by PenelopeFudd View Post
                  Aha, I see the problem. The Illumina sequencing machine generated a BAM file with no table of reference names and lengths, and no @SQ lines, as it did not have that information. In fact, nobody has that information, it's unknown. Illumina is using a BAM file in this instance for unaligned, raw data. This conflicts with this 'mandatory' table.
                  If your BAM file is unaligned reads, then the BAM header table will be present but empty

                  And there should indeed be no @SQ lines in this case.

                  Comment


                  • #10
                    So, the error message about there being no @SQ lines is an error... :-)

                    Comment


                    • #11
                      Originally posted by PenelopeFudd View Post
                      Aha, I see the problem. The Illumina sequencing machine generated a BAM file with no table of reference names and lengths, and no @SQ lines, as it did not have that information. In fact, nobody has that information, it's unknown. Illumina is using a BAM file in this instance for unaligned, raw data. This conflicts with this 'mandatory' table.
                      Penelope,

                      The header information that is present suggest that the BAM file is the result of an alignment. Specifically the program ID header line you showed above

                      Code:
                      @PG	ID:0	VN:0.5.7	CL:bwa aln; bwa sampe
                      Indicating the data was aligned with bwa. I suppose that could be a lie though.

                      Comment


                      • #12
                        Originally posted by kmcarr View Post
                        Penelope,

                        The header information that is present suggest that the BAM file is the result of an alignment. Specifically the program ID header line you showed above

                        Code:
                        @PG	ID:0	VN:0.5.7	CL:bwa aln; bwa sampe
                        Indicating the data was aligned with bwa. I suppose that could be a lie though.
                        It would almost have to be a lie; none of the sequences have info in the alignment columns, and I doubt bwa would generate a file like that.

                        Is the CL: field supposed to store all of the command line arguments? I have a feeling it should, or else most of the benefit is lost, especially when tweaking a parameter and saving the output files that result.

                        Thanks!

                        Comment

                        Working...
                        X