Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • CIGAR error when running htseq-count after BBMap

    I am getting the following types of errors when running htseq-count on a SAM file generated with bbmap.sh

    Code:
    Error occured when reading beginning of SAM/BAM file.
      ("Illegal CIGAR string '66='", 'line 81726 of file bb_mapped.sam')
      [Exception type: ValueError, raised in _HTSeq.pyx:1116]
    The above SAM file was generated with a command like

    Code:
    bbmap.sh build=2 in=stdin.fq bamscript=bam.sh maxindel=200000 ambiguous=toss \
     usejni=t outu=bb_unm apped.sam outm=bb_mapped.sam rpkm=rpkm.txt -Xmx28g
    The htseq-count command I used was

    Code:
    htseq-count -i gene bb_mapped.sam genome.gff > ! counts.txt
    Any pointers greatly appreciated!

  • #2
    See this for a solution: https://www.biostars.org/p/182156/ You won't need to re-do the alignment but you would have to reformat your bam files.

    Comment


    • #3
      Originally posted by GenoMax View Post
      See this for a solution: https://www.biostars.org/p/182156/ You won't need to re-do the alignment but you would have to reformat your bam files.
      Perfect! Thank you. For some reason I thought 1.3 was the default and so I had only experimented with the 1.4 flag.

      Comment


      • #4
        I assume reformat.sh command will work (can you confirm, if you try it?). I have been using sam=1.3 to avoid this issue.

        Comment


        • #5
          Originally posted by GenoMax View Post
          I assume reformat.sh command will work (can you confirm, if you try it?). I have been using sam=1.3 to avoid this issue.
          Yes, reformat.sh does work.

          Comment


          • #6
            On a related note, I encountered this issue even after reformat.

            The sam file was generated by BBMap,
            Code:
            bbmap.sh -Xmx32g in1=r1.BBDuk.fastq.gz in2=r2.BBDuk.fastq.gz maxindel=2000 outm=mapped.sam outu=unmapped.sam ehist=ehist
            I used reformat to convert the CIGAR to samtools v1.3,
            Code:
            reformat.sh sam=1.3 in=in.sam out=reformat.sam
            However, all the reads were still unassigned when I ran featureCounts,
            Code:
            featureCounts -T 4 -p -C -O -a ref.gtf -R -o output reformat.sam
            Here's one example read from the BBMap-generated SAM,
            HTML Code:
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 1:N:0:CAGATC    83      Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02      17851283 44       101M    =       17851177        -207    TGCTGCAAATGCATTTGTTCCTCGTAAACGTCCCAATACGGCTGGTAGAGTTTCAGTGGAACATCCAAACGGTGAACATCCTTCAAGGACATTATTTGTTA   8:DDDECC@@>A=DDDDB=/@=;6BCFFEE?23CA@@@F@JIIEHFE3JJIGHF>HGGGGDCBIG?GEE?GIIIGIEJGGGIJGGJJJHFGGGFFFFFCBB     NM:i:0  AM:i:44
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 1:N:0:CAGATC    163     Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02      17851177 44       101M    =       17851283        207     TCGAAGAGTGTGATGTCTTTTGCACGGGAGGGGGTATGGAGTTGGATGTTGAATCCCAAGATAATCATGCTGTTGATGCGTCAGGGATGCAGATTTCTGAT   @;;DD;DDBDBFFEH>FGH>@HHHHHI<AGIIID7;@FHADGA4C=CEEHACFFFFFEECEC@;@@CCCCCCC@C:>>(88><BC<89??>>3>C>ADC:A     NM:i:0  AM:i:44
            To makre sure this read was not mapped to the unannotated region, I used to same featureCounts command with a HiSat2-generated BAM and that same read was assigned.

            HTML Code:
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 83      Chr2    17851283        60      101M    =       17851177        -207    TGCTGCAAATGCATTTGTTCCTCGTAAACGTCCCAATACGGCTGGTAGAGTTTCAGTGGAACATCCAAACGGTGAACATCCTTCAAGGACATTATTTGTTA     8:DDDECC@@>A=DDDDB=/@=;6BCFFEE?23CA@@@F@JIIEHFE3JJIGHF>HGGGGDCBIG?GEE?GIIIGIEJGGGIJGGJJJHFGGGFFFFFCBB     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YS:i:0  YT:Z:CP NH:i:1
            HWI-ST975:91:C49TLACXX:7:1101:2847:2191 163     Chr2    17851177        60      101M    =       17851283        207     TCGAAGAGTGTGATGTCTTTTGCACGGGAGGGGGTATGGAGTTGGATGTTGAATCCCAAGATAATCATGCTGTTGATGCGTCAGGGATGCAGATTTCTGAT     @;;DD;DDBDBFFEH>FGH>@HHHHHI<AGIIID7;@FHADGA4C=CEEHACFFFFFEECEC@;@@CCCCCCC@C:>>(88><BC<89??>>3>C>ADC:A     AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:101        YS:i:0  YT:Z:CP NH:i:1
            The CIGARs are identical (101M) in both SAM and BAM, so I wasn't sure what may be the cause of this issue. Has anyone encounted this before? Thank you for any advice and input.

            Comment


            • #7
              @chiyai: What version of BBMap are you using?

              Comment


              • #8
                @GenoMax
                BBMap version 36.62
                featureCounts v1.5.1
                Thank you.

                Comment


                • #9
                  Hi chiayi,

                  in your BBmap index, the naming of the chromosome is "Chr2 CHROMOSOME dumped from ADB: Jun/20/09 14:54; last updated: 2009-02-02", whilst in the Hisat alignment it's just "Chr2". Either you rename each alignment in the bbamp results, or redo the alignment with a reduced fasta header, or rename the chromosmes in your gtf accordingly.

                  Cheers,

                  Michael

                  Comment


                  • #10
                    @Michael: Good catch. Not cleaning up fasta headers comes back and bites each time :-)

                    @Chiayi: Don't leave any spaces in fasta headers since they always cause problems with one thing or other.

                    Comment


                    • #11
                      @Michael: I used sed to clean up the chromosome name in each alignment in the SAM. featureCounts works fine after that. Thank you so much.
                      @GenoMax: Thanks for the suggestion.
                      BTW, does reformat.sh only clean fasta header, or also clean sam/bam like my case here?

                      Comment


                      • #12
                        Originally posted by chiayi View Post
                        BTW, does reformat.sh only clean fasta header, or also clean sam/bam like my case here?
                        reformat in your command iteration is not touching the headers it only converts CIGAR strings to SAM 1.3 format (bbmap uses SAM format 1.4 by default). featureCounts and HTSeq-count don't understand 1.4 format (as of last time I was looking at that). BTW: Samtools v.1.3.x is unrelated to issue of SAM format v.1.3.

                        Comment


                        • #13
                          Originally posted by GenoMax View Post
                          reformat in your command iteration is not touching the headers it only converts CIGAR strings to SAM 1.3 format (bbmap uses SAM format 1.4 by default). featureCounts and HTSeq-count don't understand 1.4 format (as of last time I was looking at that). BTW: Samtools v.1.3.x is unrelated to issue of SAM format v.1.3.
                          I was referring to the trd (trimreaddescription) option in reformat. I thought that would clean the fasta header after the first space, or did I misunderstand?

                          Comment


                          • #14
                            Originally posted by chiayi View Post
                            I was referring to the trd (trimreaddescription) option in reformat. I thought that would clean the fasta header after the first space, or did I misunderstand?
                            Discrepancy in your case was with the genome fasta headers and not read names. You could have done trd on fasta genome file to trim names after the first space and then made the index.

                            I suspect that option won't work when processing sam/bam files. Even if it did, it is for read names and not genome headers. @Brian will have to confirm.

                            Comment


                            • #15
                              Yep - "trd" can be used on the fasta file prior to mapping, or the parameter "trd" can be added during mapping. However, once the reads are mapped, Reformat will not change the "rname" field of the sam file, just the "qname". That's a good idea, though - I'll change it so that it can do that.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Understanding Genetic Influence on Infectious Disease
                                by seqadmin




                                During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                                Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                                09-09-2024, 10:59 AM
                              • seqadmin
                                Addressing Off-Target Effects in CRISPR Technologies
                                by seqadmin






                                The first FDA-approved CRISPR-based therapy marked the transition of therapeutic gene editing from a dream to reality1. CRISPR technologies have streamlined gene editing, and CRISPR screens have become an important approach for identifying genes involved in disease processes2. This technique introduces targeted mutations across numerous genes, enabling large-scale identification of gene functions, interactions, and pathways3. Identifying the full range...
                                08-27-2024, 04:44 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 06:25 AM
                              0 responses
                              13 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 01:02 PM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-18-2024, 06:39 AM
                              0 responses
                              14 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 09-11-2024, 02:44 PM
                              0 responses
                              14 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X