Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Samtools Cigar error

    Hi everyone,

    I ran some alignments using the bwa - samtools pipeline and now I am in the process of calling the consensus sequence. Everything has worked fine so far, but I have just hit a brick wall. After running the pileup command, the samtools works for a while and then crashes giving me the following error:

    samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed

    I searched through the forums and could only find one thread about this, where a person used CG to convert their files into BAM format, which generated the above error. In my case I have used the "samtools view" command to convert my files from SAM to BAM so I don't think this could be the issue.

    If anyone has any suggestions I would greatly appreciate it.

    By the way, the command I've been using is:

    samtools pileup -vcf [input.fasta] [input.aln.sorted.bam] > [output.txt]

    Thanks!

    PS. I've been wondering (and trying to figure out) if my starting fastq files are the issue. They are in Illumina 1.5 format, and I never specified that my files are Illumina-format in the pipeline (supposedly bwa mem works for this).

  • #2
    It's bizarre that there would be a cigar error on a bam file generated by samtools. Samtools 'view' would have caught that during the conversion process from sam to bam since the issue would have been in the sam file. This suggests that samtools converted the cigar string incorrectly which isn't likely. Question - is there any reason you're using 'pileup' instead of 'mpileup'? I think the 'pileup' command is depreciated. That could be the issue.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #3
      Hey sdriscoll,

      Thanks for the advise. I was using "pileup" because that's what the samtools web page said. I did try mpileup as well and it gave me the same error though.

      At this point I have also redone the whole pipeline, starting with bwa and moving onto samtools to no luck.

      Comment


      • #4
        I took a quick look at the pileup code and while it's not that readable it seems like what's going on is the code is trying to evaluate cigar operations beyond the total number of operations in one alignment. In case you don't know what this cigar business is about it's a single field in the alignment format (SAM/BAM) that specifies the alignment structure from left to right.

        A couple examples:

        100M - this means 100 n matched continuously
        10S90M - 10 bases "soft clipped" followed by 90 bases matched
        25M1D75M - 25 bases matched, 1 deletion, 75 bases matched.

        As you can see there can be several "operations" in a single CIGAR field. I don't know what would trigger this error you're getting but maybe there's just a simple formatting issue. For example maybe the field ends in a number instead of a character as it should. That would throw off the code. Unfortunately the code completely bails out when it runs into something like a simple formatting issue rather than just throwing a warning and skipping that alignment. I think that would be more ideal...so that you're not completely stuck. Additionally it would be nice to see the line that's causing the issue...or a line number at least. /endrant

        The confusing thing about this is still that samtools 'view' does not see any issue with your alignments. That command should not miss something like a formatting error.

        Shawn
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #5
          Thanks for all the help guys. After struggling with it for a while I ended up just giving up and using a program we have in our lab ("CLC"). Thanks a lot for the help. If I ever figure out what the problem was I'll be sure to let all of you know.

          Comment


          • #6
            Sorry to bring back this thread from the dead, but I am currently using a pipeline that requires the use of the samtools mpileup output, and I just cannot get it to work. I keep getting the same error I got before:

            samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed.

            I am wondering if anyone else has had a problem with this? So you have an idea of what I did, here's my exact pipeline so far:

            1. Trim fastq reads using sickle
            2. Use BWA's "mem" to map fastq pair-end reads to reference genomes
            3. Convert bam to sam using samtools
            4. Sorted Bams
            5. Indexed Bams
            6. Kept reads that mapped to only one location on the genome (samtools view -q 1)
            7. Indexed new "unique hit" Bams (the ones with quality score > 0)
            8. Attempted samtools pileup and failed.

            I also followed the instructions provided on another post about the using an awk command to fix the CIGAR error, but had no luck. The only difference I saw was that it took longer for the mpileup to crash:

            #samtools view -h myreads.bam | awk '{if ($6!~/^[[:digit:]]+N[[:digit:]]+D|^[[:digit:]]+P|[[:digit:]]+P\$|^[[:digit:]]+I/) print $0}' | samtools view -bhS - > mycleanreads.bam

            One thing I kep wondering is if this might have anything to do with my initial fastq files. They were in Illumina 1.9 format and I am not sure if this could be a problem...

            If anyone has any idea, I'd appreciate it.

            Thanks

            Comment


            • #7
              Mgaldos,

              The code is incomprehensible. But it looks like we're running off the end of the cigar string. You imply that it takes a while to fail, so it sounds like there is one read in there somewhere with a bad cigar.

              Does the problem happen all the time, with different sets of data?

              It would help if you could find the one bad read -- but that won't be easy with mpileup. It sounds like samtools view works OK (not too surprising, since it doesn't try to parse the cigar string). Try running samtools calmd, which also needs to look at the cigar. If that fails similarly, the bad read should be the one just after the last good read output -- or at least, very close to there.

              If you can find the offending read, post the SAM line here.

              --SP

              Comment


              • #8
                Hi SP,

                You are correct. I should've been more explicit, but the code essentially looks at each line and removes it if it:

                * starts with \d+N\dD
                * starts with \d+P
                * starts with \d+I
                * ends with \d+P

                The full answer that shows that code is here: http://seqanswers.com/forums/showthread.php?t=16691

                and the code should probably be accredited to SEQquestions.

                The problem does happen all the time, with different sets of data. I am unsure as to why, since I ran that awk command (but then again, I found some people that ran into the same issue after running it).

                I'll run the samtools calmd command in a few hours and let you know what the output is.

                Thanks,

                MG

                Comment


                • #9
                  Sorry, mgaldos (and SEQquestions). I didn't mean the AWK code is incomprehensible. I was talking about the code in bam_pileup.c where the exception occurred!

                  However, I worry about the AWK approach for two reasons. As SEQquestions implies, it doesn't fix the problem -- it just works around it. And in this case I'm betting the bad cigar string is more subtly malformed, so that it may well pass the AWK check.

                  This will be a lot clearer if we can get our hands on the bad cigar string (and associated BAM record).

                  --SP

                  Comment


                  • #10
                    Haha, no biggie, SllyPoint.

                    Sorry for the delayed response. I've been banging my head against the wall looking for alternatives to this because I don't want to mess too much with the BAM files. I think that running the AWK command is already bad enough because I am chucking reads. I found out that using the genomecoveragebed command from bedtools works like a charm, and I basically get the same output.

                    Anyway, I ran the command because I think it would still be good to know what is going on with my BAM file. Here are a few lines from the output of samtools calmd:

                    [bam_fillmd1] different NM for read 'HWI-ST420:104:C1BYMACXX:5:1304:4847:7846': 0 -> 1
                    [bam_fillmd1] different NM for read 'HWI-ST420:104:C1BYMACXX:5:1210:5291:11776': 0 -> 1
                    [bam_fillmd1] different NM for read 'HWI-ST420:104:C1BYMACXX:1:2104:20060:41437': 0 -> 1
                    [bam_fillmd1] different NM for read 'HWI-ST420:104:C1BYMACXX:1:2210:3760:14416': 0 -> 1
                    [bam_fillmd1] different NM for read 'HWI-ST420:104:C1BYMACXX:1:2201:8462:37781': 0 -> 1
                    [bam_fillmd1] different NM for read 'HWI-ST420:104:C1BYMACXX:1:1115:18851:20084': 0 -> 1
                    [bam_fillmd1] different NM for read 'HWI-ST420:104:C1BYMACXX:1:2306:18305:96046': 0 -> 1
                    [bam_fillmd1] different NM for read 'HWI-ST420: 104:C1BYMACXX:5:1311:4434:26104': 0 -> 1
                    [bam_fillmd1] different NM for read 'HWI-ST420:104:C1BYMACXX:1:2206:12161:20118': 0 -> 2
                    [bam_fillmd1] different NM for read 'HWI-ST420:104:C1BYMACXX:1:1116:11792:93723': 0 -> 2
                    [bam_fillmd1] different NM for read 'HWI-ST420:104:C1BYMACXX:5:1310:12471:11424': 0 -> 2
                    [bam_fillmd1] different NM for read 'HWI-ST420:104:C1BYMACXX:5:2111:3437:4614': 0 -> 2

                    I checked the SAM file as well, looking for the first read in that list, and came across this:


                    HWI-ST420:104:C1BYMACXX:5:1304:4847:7846 83 Consensus_A 17038 3 67M33S = 16929 -176 TGAACTTACGAGGTTTTAAATTAAGGTTGAAGTAATGTGTTGAATGTCAGACTCGAAATGCATTCGAAGTTAAAATTGAAATTTCGCCACAAATATACTC EDDBDDDDDDDDEEEDEEFFFFFECHHHHHJJJJJJJJJJJJJJIJJIIHDHHIJFJJJJJIJJJIJJJJJJJJJJJJJJJJJJJJJHHHHHFFFDBCCB NM:i:0 AS:i:67 XS:i:66
                    HWI-ST420:104:C1BYMACXX:5:1304:4847:7846 163 Consensus_A 16929 3 100M = 17038 176 GGATTGAGTAGGTAAAATTAGTAAGGCTCGGATGCCCGAAAAGACGAAAGACAGACACAATCTAATCATAAACTGTAATGTTCAAGCTTCGCTTACAATC @C@FFFDFFHHHCGHJJJJJJFHJJJJJJJJEGHGGIG>HGGGGHGIIJIHEFEHFFDDFEEEEEEEDCDDDCCCCDDDCDCDDDCDDDDBBDDBDC@CC NM:i:0 AS:i:100 XS:i:100

                    To be perfectly honest, I'm not sure about what any of this means. I hope you or someone can make sense of it.

                    Thanks a lot!

                    Comment


                    • #11
                      I gather samtools calmd ran to completion then? That's too bad -- I was hoping it would blow up when it got to the bad read.

                      As to the example you posted: The SAM lines themselves look consistent. They show two 100-base reads aligned in opposing directions to a 176-base sequence of the reference. The final 33 bases of the first read (aligned to the reverse strand) are "soft-clipped" (i.e., ignored in the alignment) -- probably because their quality scores are too low (???).

                      The calmd output is suspicious: the NM field it is complaining about shows the number of mismatches to the reference. The original SAM lines both show NM:i:0 -- no mismatches. But calmd thinks there should be one mismatch in each.

                      The most likely explanation for that is that you used a different reference for calmd than you used for the original alignment.

                      Unfortunately, I can't see how that would lead to the assertion failure you are seeing. OTOH, as the bam_pileup.c code is pretty impenetrable, it's worth matching up the references and trying again.

                      If anything interesting happens, post another note here.

                      --SP

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      8 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      8 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      49 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      66 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X