Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • PenelopeFudd
    replied
    Originally posted by kmcarr View Post
    Penelope,

    The header information that is present suggest that the BAM file is the result of an alignment. Specifically the program ID header line you showed above

    Code:
    @PG	ID:0	VN:0.5.7	CL:bwa aln; bwa sampe
    Indicating the data was aligned with bwa. I suppose that could be a lie though.
    It would almost have to be a lie; none of the sequences have info in the alignment columns, and I doubt bwa would generate a file like that.

    Is the CL: field supposed to store all of the command line arguments? I have a feeling it should, or else most of the benefit is lost, especially when tweaking a parameter and saving the output files that result.

    Thanks!

    Leave a comment:


  • kmcarr
    replied
    Originally posted by PenelopeFudd View Post
    Aha, I see the problem. The Illumina sequencing machine generated a BAM file with no table of reference names and lengths, and no @SQ lines, as it did not have that information. In fact, nobody has that information, it's unknown. Illumina is using a BAM file in this instance for unaligned, raw data. This conflicts with this 'mandatory' table.
    Penelope,

    The header information that is present suggest that the BAM file is the result of an alignment. Specifically the program ID header line you showed above

    Code:
    @PG	ID:0	VN:0.5.7	CL:bwa aln; bwa sampe
    Indicating the data was aligned with bwa. I suppose that could be a lie though.

    Leave a comment:


  • PenelopeFudd
    replied
    So, the error message about there being no @SQ lines is an error... :-)

    Leave a comment:


  • maubp
    replied
    Originally posted by PenelopeFudd View Post
    Aha, I see the problem. The Illumina sequencing machine generated a BAM file with no table of reference names and lengths, and no @SQ lines, as it did not have that information. In fact, nobody has that information, it's unknown. Illumina is using a BAM file in this instance for unaligned, raw data. This conflicts with this 'mandatory' table.
    If your BAM file is unaligned reads, then the BAM header table will be present but empty

    And there should indeed be no @SQ lines in this case.

    Leave a comment:


  • PenelopeFudd
    replied
    Originally posted by maubp View Post
    BAM files have a mandatory table of reference names and lengths, plus optionally an embedded SAM header which may repeat that information as @SQ lines (and can supplement it with extra things like MD5 checksums).
    ...
    It does mean it is possible to make a corrupt BAM file where the BAM header and the embedded @SQ lines don't match up.
    Aha, I see the problem. The Illumina sequencing machine generated a BAM file with no table of reference names and lengths, and no @SQ lines, as it did not have that information. In fact, nobody has that information, it's unknown. Illumina is using a BAM file in this instance for unaligned, raw data. This conflicts with this 'mandatory' table.

    Leave a comment:


  • maubp
    replied
    Originally posted by PenelopeFudd View Post
    I haven't tried the very latest samtools, but with this file, there is no reference name nor length in the headers that I could see.
    BAM files have a mandatory table of reference names and lengths, plus optionally an embedded SAM header which may repeat that information as @SQ lines (and can supplement it with extra things like MD5 checksums).

    The main reason for this is each mapped BAM read doesn't store the reference name, but the index of the reference. This is much more space efficient (one integer rather than a many character long string).

    It does mean it is possible to make a corrupt BAM file where the BAM header and the embedded @SQ lines don't match up.

    Leave a comment:


  • PenelopeFudd
    replied
    Originally posted by maubp View Post
    Have you tried the very latest samtools (from github)? Under some circumstances is will generate the missing @SQ lines from the BAM header (which contains the reference names and their lengths).
    I haven't tried the very latest samtools, but with this file, there is no reference name nor length in the headers that I could see.

    Perhaps the sequencing company is using this file format in an unexpected way?

    Originally posted by BAMseek View Post
    This was to simulate trying to convert a SAM to a BAM that didn't have any header information (so I intentionally left off the -h flag). I think that because you don't have aligned reads, you get that warning but SAMtools can still proceed. If the data were aligned, then you would need to have names for the reference sequences within the header section.
    Ok, that makes sense, although if it can still proceed, I'd like it to not print the message, and to update the man page to say more about it. Do you know where I could send such a feature request? Is it [email protected]?

    Thanks

    Leave a comment:


  • maubp
    replied
    Have you tried the very latest samtools (from github)? Under some circumstances is will generate the missing @SQ lines from the BAM header (which contains the reference names and their lengths).

    Leave a comment:


  • BAMseek
    replied
    I think that samtools 0.1.18 will still abort with an error if it runs across a sequence name that is not in the header.

    For example, I ran the following on a well-formed bam file:

    samtools view input.bam | samtools view -bS -
    [samopen] no @SQ lines in the header.
    [sam_read1] missing header? Abort!
    This was to simulate trying to convert a SAM to a BAM that didn't have any header information (so I intentionally left off the -h flag). I think that because you don't have aligned reads, you get that warning but SAMtools can still proceed. If the data were aligned, then you would need to have names for the reference sequences within the header section.

    Justin

    Leave a comment:


  • PenelopeFudd
    replied
    There's already an '-h' in the command.

    The dirty.bam file has three header lines:

    @HD VN:1.0 SO:coordinate
    @RG ID:124048 PL:illumina PU:C06N6ACXX.7 LB:INX749 SM:MF9.8 CN:BCCAGSC
    @PG ID:0 VN:0.5.7 CL:bwa aln; bwa sampe
    None of these header lines are @SQ lines. Here's a few lines of data:
    HS11_168:7:1101:10000:124266 77 * 0 0 * * 0 0 TGTGTTGGGTGTGTTTGGGGGGTGGGTTTACATATTATCTCGTTTATTCAACGAGGTCCATGGGTTCCTCTGGAG :11=ABDDF?DFFGIIIIIIIB&07B7BBBBBDBDA@8>A@<B?B2>A@3@@)(07)9@:>(3<9?14::3>A:< RG:Z:124048
    HS11_168:7:1101:10000:124266 141 * 0 0 * * 0 0 TGTGTTGGGGGGGTTTGGGTTGGGTGGGGAATTTGCCTGCTTGATTGTTTTCGCAAATATATTTCCCCCAATTAT =?1ADDDFG08')6:??@B######################################################## RG:Z:124048
    As you can see, columns 3 through 9 are '*' or '0'. No alignment has been done, so no reference sequence can or should be specified with @SQ, if I'm understanding the file format correctly. Is this right?

    Thanks

    Leave a comment:


  • maubp
    replied
    When you do 'samtools view' to convert BAM into SAM, the header isn't shown by default. Try -h here to include it.

    Leave a comment:


  • "samtools: no @sq lines in the header" - metagenome!

    Hi,

    I'm trying to clean a bam file, by deleting reads that failed QC (for instance), and writing a new bam file. When I run the following test command:
    Code:
    samtools view -h dirty.bam | samtools view -bS - > clean.bam
    I get the following error:
    Code:
    [samopen] no @SQ lines in the header.
    From googling the error message, every post I've read says "use the -t reference.tsv option" or "use the -T reference.fasta option". The problem is, this data is for a metagenome, and consequently there is no reference genome. Fortunately, this is samtools version 0.1.18, and the program doesn't abort (unlike samtools 0.1.16) when it prints this error, but I just wanted to let the world know that -t/-T is not required nor desired in some cases.

    If someone could update the man page for samtools to reflect that, and maybe eliminate or clarify the actual error message, that would be great.

    Thanks!
    Last edited by PenelopeFudd; 08-08-2012, 08:56 AM. Reason: or -> nor

Latest Articles

Collapse

  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM
  • seqadmin
    Techniques and Challenges in Conservation Genomics
    by seqadmin



    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

    Avian Conservation
    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
    03-08-2024, 10:41 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 06:37 PM
0 responses
10 views
0 likes
Last Post seqadmin  
Started by seqadmin, Yesterday, 06:07 PM
0 responses
9 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-22-2024, 10:03 AM
0 responses
49 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-21-2024, 07:32 AM
0 responses
67 views
0 likes
Last Post seqadmin  
Working...
X