Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • sunkid
    Member
    • Nov 2014
    • 16

    CIGAR error when running htseq-count after BBMap

    I am getting the following types of errors when running htseq-count on a SAM file generated with bbmap.sh

    Code:
    Error occured when reading beginning of SAM/BAM file.
      ("Illegal CIGAR string '66='", 'line 81726 of file bb_mapped.sam')
      [Exception type: ValueError, raised in _HTSeq.pyx:1116]
    The above SAM file was generated with a command like

    Code:
    bbmap.sh build=2 in=stdin.fq bamscript=bam.sh maxindel=200000 ambiguous=toss \
     usejni=t outu=bb_unm apped.sam outm=bb_mapped.sam rpkm=rpkm.txt -Xmx28g
    The htseq-count command I used was

    Code:
    htseq-count -i gene bb_mapped.sam genome.gff > ! counts.txt
    Any pointers greatly appreciated!
  • GenoMax
    Senior Member
    • Feb 2008
    • 7142

    #2
    See this for a solution: https://www.biostars.org/p/182156/ You won't need to re-do the alignment but you would have to reformat your bam files.

    Comment

    • sunkid
      Member
      • Nov 2014
      • 16

      #3
      Originally posted by GenoMax View Post
      See this for a solution: https://www.biostars.org/p/182156/ You won't need to re-do the alignment but you would have to reformat your bam files.
      Perfect! Thank you. For some reason I thought 1.3 was the default and so I had only experimented with the 1.4 flag.

      Comment

      • GenoMax
        Senior Member
        • Feb 2008
        • 7142

        #4
        I assume reformat.sh command will work (can you confirm, if you try it?). I have been using sam=1.3 to avoid this issue.

        Comment

        • sunkid
          Member
          • Nov 2014
          • 16

          #5
          Originally posted by GenoMax View Post
          I assume reformat.sh command will work (can you confirm, if you try it?). I have been using sam=1.3 to avoid this issue.
          Yes, reformat.sh does work.

          Comment

          • chiayi
            Member
            • Dec 2016
            • 23

            #6
            On a related note, I encountered this issue even after reformat.

            The sam file was generated by BBMap,
            Code:
            bbmap.sh -Xmx32g in1=r1.BBDuk.fastq.gz in2=r2.BBDuk.fastq.gz maxindel=2000 outm=mapped.sam outu=unmapped.sam ehist=ehist
            I used reformat to convert the CIGAR to samtools v1.3,
            Code:
            reformat.sh sam=1.3 in=in.sam out=reformat.sam
            However, all the reads were still unassigned when I ran featureCounts,
            Code:
            featureCounts -T 4 -p -C -O -a ref.gtf -R -o output reformat.sam
            Here's one example read from the BBMap-generated SAM,
            HTML Code:
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 1:N:0:CAGATC    83      Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02      17851283 44       101M    =       17851177        -207    TGCTGCAAATGCATTTGTTCCTCGTAAACGTCCCAATACGGCTGGTAGAGTTTCAGTGGAACATCCAAACGGTGAACATCCTTCAAGGACATTATTTGTTA   8:DDDECC@@>A=DDDDB=/@=;6BCFFEE?23CA@@@F@JIIEHFE3JJIGHF>HGGGGDCBIG?GEE?GIIIGIEJGGGIJGGJJJHFGGGFFFFFCBB     NM:i:0  AM:i:44
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 1:N:0:CAGATC    163     Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02      17851177 44       101M    =       17851283        207     TCGAAGAGTGTGATGTCTTTTGCACGGGAGGGGGTATGGAGTTGGATGTTGAATCCCAAGATAATCATGCTGTTGATGCGTCAGGGATGCAGATTTCTGAT   @;;DD;DDBDBFFEH>FGH>@HHHHHI<AGIIID7;@FHADGA4C=CEEHACFFFFFEECEC@;@@CCCCCCC@C:>>(88><BC<89??>>3>C>ADC:A     NM:i:0  AM:i:44
            To makre sure this read was not mapped to the unannotated region, I used to same featureCounts command with a HiSat2-generated BAM and that same read was assigned.

            HTML Code:
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 83      Chr2    17851283        60      101M    =       17851177        -207    TGCTGCAAATGCATTTGTTCCTCGTAAACGTCCCAATACGGCTGGTAGAGTTTCAGTGGAACATCCAAACGGTGAACATCCTTCAAGGACATTATTTGTTA     8:DDDECC@@>A=DDDDB=/@=;6BCFFEE?23CA@@@F@JIIEHFE3JJIGHF>HGGGGDCBIG?GEE?GIIIGIEJGGGIJGGJJJHFGGGFFFFFCBB     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YS:i:0  YT:Z:CP NH:i:1
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 163     Chr2    17851177        60      101M    =       17851283        207     TCGAAGAGTGTGATGTCTTTTGCACGGGAGGGGGTATGGAGTTGGATGTTGAATCCCAAGATAATCATGCTGTTGATGCGTCAGGGATGCAGATTTCTGAT     @;;DD;DDBDBFFEH>FGH>@HHHHHI<AGIIID7;@FHADGA4C=CEEHACFFFFFEECEC@;@@CCCCCCC@C:>>(88><BC<89??>>3>C>ADC:A     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YS:i:0  YT:Z:CP NH:i:1
            The CIGARs are identical (101M) in both SAM and BAM, so I wasn't sure what may be the cause of this issue. Has anyone encounted this before? Thank you for any advice and input.

            Comment

            • GenoMax
              Senior Member
              • Feb 2008
              • 7142

              #7
              @chiyai: What version of BBMap are you using?

              Comment

              • chiayi
                Member
                • Dec 2016
                • 23

                #8
                @GenoMax
                BBMap version 36.62
                featureCounts v1.5.1
                Thank you.

                Comment

                • Michael.Ante
                  Senior Member
                  • Oct 2011
                  • 127

                  #9
                  Hi chiayi,

                  in your BBmap index, the naming of the chromosome is "Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02", whilst in the Hisat alignment it's just "Chr2". Either you rename each alignment in the bbamp results, or redo the alignment with a reduced fasta header, or rename the chromosmes in your gtf accordingly.

                  Cheers,

                  Michael

                  Comment

                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    #10
                    @Michael: Good catch. Not cleaning up fasta headers comes back and bites each time :-)

                    @Chiayi: Don't leave any spaces in fasta headers since they always cause problems with one thing or other.

                    Comment

                    • chiayi
                      Member
                      • Dec 2016
                      • 23

                      #11
                      @Michael: I used sed to clean up the chromosome name in each alignment in the SAM. featureCounts works fine after that. Thank you so much.
                      @GenoMax: Thanks for the suggestion.
                      BTW, does reformat.sh only clean fasta header, or also clean sam/bam like my case here?

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        #12
                        Originally posted by chiayi View Post
                        BTW, does reformat.sh only clean fasta header, or also clean sam/bam like my case here?
                        reformat in your command iteration is not touching the headers it only converts CIGAR strings to SAM 1.3 format (bbmap uses SAM format 1.4 by default). featureCounts and HTSeq-count don't understand 1.4 format (as of last time I was looking at that). BTW: Samtools v.1.3.x is unrelated to issue of SAM format v.1.3.

                        Comment

                        • chiayi
                          Member
                          • Dec 2016
                          • 23

                          #13
                          Originally posted by GenoMax View Post
                          reformat in your command iteration is not touching the headers it only converts CIGAR strings to SAM 1.3 format (bbmap uses SAM format 1.4 by default). featureCounts and HTSeq-count don't understand 1.4 format (as of last time I was looking at that). BTW: Samtools v.1.3.x is unrelated to issue of SAM format v.1.3.
                          I was referring to the trd (trimreaddescription) option in reformat. I thought that would clean the fasta header after the first space, or did I misunderstand?

                          Comment

                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #14
                            Originally posted by chiayi View Post
                            I was referring to the trd (trimreaddescription) option in reformat. I thought that would clean the fasta header after the first space, or did I misunderstand?
                            Discrepancy in your case was with the genome fasta headers and not read names. You could have done trd on fasta genome file to trim names after the first space and then made the index.

                            I suspect that option won't work when processing sam/bam files. Even if it did, it is for read names and not genome headers. @Brian will have to confirm.

                            Comment

                            • Brian Bushnell
                              Super Moderator
                              • Jan 2014
                              • 2709

                              #15
                              Yep - "trd" can be used on the fasta file prior to mapping, or the parameter "trd" can be added during mapping. However, once the reads are mapped, Reformat will not change the "rname" field of the sam file, just the "qname". That's a good idea, though - I'll change it so that it can do that.

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM
                              • SEQadmin2
                                Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                by SEQadmin2


                                With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                Introduction

                                Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                05-22-2026, 06:42 AM
                              • SEQadmin2
                                Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                by SEQadmin2

                                Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                05-06-2026, 09:04 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 08:59 AM
                              0 responses
                              14 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              22 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 05-28-2026, 11:40 AM
                              0 responses
                              32 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...