Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

featureCounts error

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • featureCounts error

    I am running into a featureCounts error that has been referred to in prior reports (https://www.biostars.org/p/213666/ and http://seqanswers.com/forums/showthread.php?t=64587 etc).
    Code:
    featureCounts: readSummary.c:1578: process_pairer_header: Assertion `l_name < 100' failed.
    It does not appear to have been addressed as far as I can tell. I am using data from a regular Illumina HiSeq 4000 run mapped using BBMap.

    @Wei Shi: I will send you a PM in case you don't happen to see this thread.
    Last edited by GenoMax; 03-19-2017, 05:11 PM.

  • #2
    Sorry about this. Could you please try out the latest version we released a couple of days ago? If the problem persists, we will need to take a look at the data.

    Comment


    • #3
      Hi Wei: New version of subread still has the same error. I just emailed you with a sample of the data.
      Last edited by GenoMax; 03-20-2017, 04:20 AM.

      Comment


      • #4
        Originally posted by GenoMax View Post
        Hi Wei: New version of subread still has the same error. I just emailed you with a sample of the data.
        Hi GenoMax,

        I tried using Samtools (v 1.1) to convert the BAM file you provided to the SAM format. In the Samtools output, the chromosome names looked abnormal:

        Code:
        K00270:29:HFV25BBXX:6:1205:24616:23681 1:N:0:TTAGGC     163     [COLOR="Red"]chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38[/COLOR]  11612   3       50M     =       11692   130     GTCTAGGAATGCCTGTTTCTCCACAAAGTGTTTACTTTTGGATTTTTGCC      AAFFFJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJ   XT:A:R  NM:i:0  AM:i:3
        K00270:29:HFV25BBXX:6:1205:24616:23681 1:N:0:TTAGGC     83      [COLOR="Red"]chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38[/COLOR]  11692   3       50M     =       11612   -130    ATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAATTTCCA      JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA   XT:A:R  NM:i:0  AM:i:3
        The texts marked in red are where the chromosome names should be in the Samtools output from a normal BAM file. However, in the Samtools output from your BAM file, they are not only the chromosome names, but also include all the information tags ("LN", "M5", "AS", etc) in the BAM file header. This suggests that your BAM file had an incorrect format on the chromosome names.

        Can you please look into your pipeline to see if there are format conversion issues when generating the BAM files?

        Thanks!

        Yang

        Comment


        • #5
          Thanks for sending us an example bam file, @GenoMax. We found there were some tags added into the chr field in the bam file, which led to the crash of featureCounts. Below are the first two reads in the file
          K00270:29:HFV25BBXX:6:2220:19278:41018 1:N:0:TTAGGC 99 chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38 13603 3 50M = 13712 159 GTCCAGAGTGTTGCCAGGACCCAGGCACAGGCATTAGTGCCCGTTGGAGA AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

          JJJJJJJFFJJJF XT:A:R NM:i:0 AM:i:3
          K00270:29:HFV25BBXX:6:2220:19278:41018 1:N:0:TTAGGC 147 chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38 13712 3 50M = 13603 -159 AGCTCCCCTCTGTGGAATCCAATCTGTCTTCCATCCTGCGTGGCCGAGGG JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
          JJJJJJJJFFFAA XT:A:R NM:i:0 AM:i:3

          Comment


          • #6
            It looks like the issue is that BBMap was printing the original names. If you use the "trd" flag it will truncate the names after the first whitespace (to just "chr1"), which some tools expect. Samtools will happily accept names with whitespaces, and by default BBMap retains the entire headers, to avoid loss of information. Can you check whether BBMap with the "trd" flag resolves the issue, or if it's something else?

            Comment


            • #7
              Originally posted by Brian Bushnell View Post
              Samtools will happily accept names with whitespaces, and by default BBMap retains the entire headers, to avoid loss of information.
              Given the SAM specifications disallow spaces in the RNAME field (Regular expression: "\*|[!-()+-<>-~][!-~]*"), it's unsurprising that downstream tools break.

              Would changing the BBMap default output to comply with the file format specifications cause other issues? If loss of information is the issue then the full fasta headers could still be included in the SAM file in the @SQ headers, or is the use case you're defending against one in which the fasta header are not unique (eg "chr1 hg19" and "chr1 mm10" in the same reference)?

              Edit: Out of curiosity: if the fasta header contains a TAB, does BBMap clean it up, or just write a malformed SAM record?
              Last edited by dcameron; 03-23-2017, 03:44 PM.

              Comment


              • #8
                Originally posted by dcameron View Post
                Given the SAM specifications disallow spaces in the RNAME field (Regular expression: "\*|[!-()+-<>-~][!-~]*"), it's unsurprising that downstream tools break.
                BBMap reports names as they were stated in the input files. I have no idea why the sam format disallows spaces, since they don't cause any problems and are widely used in sequence headers. I did not initially notice that they might be disallowed since it's never explicitly stated, just encoded in a regex. I ignored that since regexes are language-specific and the sam spec has no mention of which language it refers to, so it's impossible to decode unambiguously. Also, it has no mention of the term "whitespace". It explicitly states that spaces are allowed in some fields, but never mentions any fields in which they are disallowed. Coupled with the fact that samtools supports spaces in names, I have never seen any evidence that spaces in names were disallowed in the sam format, and I conclude that they in fact are not disallowed, because "(Regular expression: "\*|[!-()+-<>-~][!-~]*")" has no meaning without a language being specified; and no language has been specified by the sam format. There are many regex languages. There are with numerous other absences in the sam spec, so you generally have to write an interpretation based on observation, and it involves a lot of guesswork. Consider the cigar string and MD tag, for example - I still have no idea what "P" means in the cigar string. It's defined as "P padding (silent deletion from padded reference)". What does that mean? There's no way to know. I've never encountered one. Similarly, MD tags are crucial, but they are never defined anywhere. The current sam spec does not even mention the term MD. You just have to kind of guess how it works from observation. So I'd say sam is kind of like English or other spoken languages - defined based on usage, since it's obviously not completely specified. As such, not randomly discarding information is probably a better choice.

                That said -

                1) If people tend to feel like the sam spec disallows spaces in reference names, I will probably change BBMap's default to truncate at the first whitespace, so lossless mode will require the "trd=f" flag. It's never been a problem in the past, though, and samtools obviously does handle spaces in reference names. Since the sam spec is not very informative, I tend to feel like samtools is the definitive answer on what is or is not sam compliant; by that metric, currently, spaces in names are compliant.
                2) This change would be unfortunate, since fasta files may contain sequences like "chr1" and "chr1 alt3" which are not supported by this interpretation of the sam format, but are supported by both BBMap and samtools.

                Would changing the BBMap default output to comply with the file format specifications cause other issues? If loss of information is the issue then the full fasta headers could still be included in the SAM file in the @SQ headers, or is the use case you're defending against one in which the fasta header are not unique (eg "chr1 hg19" and "chr1 mm10" in the same reference)?
                There's no problem, I just don't like to discard information when I don't need to. If an input sequence is named "chr1 hg19" I don't see any reason to rename it, as long as downstream tools can handle spaces, which they generally can. But I'll probably change the defaults to discard information, since compatibility is generally considered to be more important than accuracy.

                Edit: Out of curiosity: if the fasta header contains a TAB, does BBMap clean it up, or just write a malformed SAM record?
                That's a good question - I think, in that case, BBMap would write a malformed sam record. I'll fix that.
                Last edited by Brian Bushnell; 03-23-2017, 07:04 PM.

                Comment


                • #9
                  Originally posted by Brian Bushnell View Post
                  I have no idea why the sam format disallows spaces, since they don't cause any problems and are widely used in sequence headers. I did not initially notice that they might be disallowed since it's never explicitly stated, just encoded in a regex. I ignored that since regexes are language-specific and the sam spec has no mention of which language it refers to, so it's impossible to decode unambiguously.
                  The specs are indeed a bit light on detail. Not only language, but character encoding is not specified. For VCF, I contributed a clarification that it is US-ASCII (technically headerless UTF-8 for VCF) but it appears that this, and many other clarifications, did not get ported to the SAM specs as well.

                  Originally posted by Brian Bushnell View Post
                  It explicitly states that spaces are allowed in some fields, but never mentions any fields in which they are disallowed. I have never seen any evidence that spaces in names were disallowed in the sam format, and I conclude that they in fact are not disallowed, because "(Regular expression: "\*|[!-()+-<>-~][!-~]*")" has no meaning without a language being specified; and no language has been specified by the sam format. There are many regex languages. There are with numerous other absences in the sam spec, so you generally have to write an interpretation based on observation, and it involves a lot of guesswork.
                  If we treated every ambiguous part of the SAM specifications as if it didn't exist, then nobody's output would be compatible with anyone else's programs. Yes we have to guess, but as everything I've seen in the SAM specs has used C-style syntax (e.g. the bit-wise operators in the BAM header definition), and the initial SAM/BAM implementation was was written in C, it seems reasonable to default to a C-style interpretation for such things - including things such as char = 1 byte US-ASCII.

                  There have been multiple attempts (including by me) to define a formal grammar for SAM/VCF, but none of these attempts have even made it into the specifications document. My attempt for VCF just ended up with me identifying ~10 different parsing ambiguities that required some sort of change to the specifications to resolve and stalled getting people to agree on what to do.

                  Originally posted by Brian Bushnell View Post
                  Similarly, MD tags are crucial, but they are never defined anywhere. The current sam spec does not even mention the term MD.
                  That's now defined in the SAM tags specifications document. Any yes, that document still doesn't even actually define it properly.

                  Originally posted by Brian Bushnell View Post
                  1) If people tend to feel like the sam spec disallows spaces in reference names, I will probably change BBMap's default to truncate at the first whitespace, so lossless mode will require the "trd=f" flag. It's never been a problem in the past, though, and samtools obviously does handle spaces in reference names. Since the sam spec is not very informative, I tend to feel like samtools is the definitive answer on what is or is not sam compliant; by that metric, currently, spaces in names are compliant.
                  Ideally, everyone's programs would be able to accept as much invalid input as possible and still function. Unfortunately, it's usually downstream of the alignment/samtools that everything blows up. Even more unfortunate is the fact that the VCF file format can't even unambiguously handle all the characters that are actually allowed by the SAM regex.

                  Originally posted by Brian Bushnell View Post
                  There's no problem, I just don't like to discard information when I don't need to. If an input sequence is named "chr1 hg19" I don't see any reason to rename it, as long as downstream tools can handle spaces, which they generally can.
                  Clearly you are more optimistic about the robustness of bioinformatics software than I am

                  Comment


                  • #10
                    I wanted to provide follow-up. The problem did turn out to be due to spaces in the fasta headers present in iGenomes GRCh38 genome fasta.

                    As suggested by Brian, after truncating the reference names after first space (trd=t option for bbmap.sh) the problem went away. featureCounts ran and produced counts.

                    Origins of the problem lie in the iGenomes bundle that I had downloaded (and then built BBMap index) for GRCh38. Instead of simple fasta ID’s there are long strings with spaces in the chromosome names which caused featureCounts to produce an error.

                    Based on the discussion above it appears that the “standard” for SAM is vague and there may be no reasonable solution.

                    Prompt assistance provided by @Brian, @Wei and @Yangliao is much appreciated. Bioinformaticians/developers like them who stand behind their software provide an incredible service for us all.
                    Last edited by GenoMax; 03-27-2017, 04:53 AM.

                    Comment


                    • #11
                      Originally posted by GenoMax View Post
                      Based on the discussion above it appears that the “standard” for SAM is vague and there may be no reasonable solution.
                      I've raised an issue with the SAM specs here to explicitly state that the regexes are POSIX basic regular expressions and the encoding is US-ASCII.

                      As to whether fasta/fastq information after a space is part of the sequence name or is considered a comment is outside the scope of the SAM specifications. Unfortunately, there's not even a FASTA specification document to argue over. I'm inclined to break at any whitespace but I suspect that's due to my familiarity with myriad downstream complications when doing SV calling in VCF breakend notation. Did you know "C]<=:-0>]" is a perfectly valid ALT allele?

                      Comment


                      • #12
                        Originally posted by dcameron View Post
                        Unfortunately, there's not even a FASTA specification document to argue over.
                        Slightly off-topic but Bill Pearson offered a historical perspective on FASTA format in a thread he participated in a few months back on Biostars.

                        Comment


                        • #13
                          Originally posted by GenoMax View Post
                          I wanted to provide follow-up. The problem did turn out to be due to spaces in the fasta headers present in iGenomes GRCh38 genome fasta.
                          I would say the issues wasn't the spaces in the fasta headers but the program that interpreted the headers without any room for the usual ">name comment" syntax. There has never been, as far as I'm aware, any assumption that the fasta name after the first whitespace (or semicolon I think in early days) was to be used as part of the name itself.

                          Comment

                          Working...
                          X