Seqanswers Leaderboard Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • bnfoguy
    Member
    • May 2011
    • 17

    CIGAR and Sequence length incosistent

    Hello,

    I am trying to convert a .sam file into .bam file and I get the following error:

    CIGAR and Sequence length are inconsistent. Below is the offending line:

    ERR035488.960541 147 chr1 27205782 99 72M = 27205550 -304 TATGCCTCCTTGAGTGTCAGTGGCGTGATCTTGGCCCGGCTCACACCGGCCGGCAGGAAGTCTAGTAGGCAG 8<=21<?7A<<DA8>>@CEDE.D<DDB?@;BAC6;48EB@AE4>4@&FEB6FEFGFD1GFAC@DFAFBCBB> PQ:i:46 SM:i:96 UQ:i:0 MQ:i:96 XQ:i:15 NM:i:0

    I need to convert this into a .bam file and further into a .flt.vcf format for indel calling. So far, I have cut the first 1 million alignments from the original sam output and have gotten as far as creating an {prefix}_sorted.bam file.
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    I don't see what's wrong with that - the sequence and the quality string appear to both be 72 characters, and the CIGAR string of 72M is consistent with that (72 matches or mismatches). Are you sure you are looking at the right line number?

    Comment

    • Richard Finney
      Senior Member
      • Feb 2009
      • 701

      #3
      What software produced the sam file?
      What version of samtools are you using ?

      Please do this:
      check with this command:

      cat yoursample.sam | awk '{if (length($10)!=length($11)) print p+1" "$0;p++}'

      It will print out the line number and offending line.

      Line number includes sam header, not just sequence read lines.

      Comment

      • katussa10
        Member
        • Jun 2010
        • 11

        #4
        I have the same problem:
        Line 18632882, sequence length 48 vs 74 from CIGAR
        Parse error at line 18632882: CIGAR and sequence length are inconsistent



        I used:
        "cat yoursample.sam | awk '{if (length($10)!=length($11)) print p+1" "$0;p++}'"

        and got the following:

        76327 @PG ID:Bowtie VN:0.12.7 CL:"bowtie -p 8 --sam -e 120 Trinity_name -1 ../PAIR_END_1_Trim_the_reads_by_quality.fastqsanger -2 ../PAIR_END_2_Trim_the_reads_by_quality.fastqsanger Mysample.sam"
        18632882 SOLEXA_0019_FC:1:27:11595:10519#AACTAC 163 c0_seq1 4639 255 74M = 4646 81 ATGAAATCGAACGGTTTTCTACACGATCGTGCAAGTGAAACACATGGG

        However, when I used:
        head -n 18632882 Mysample.sam |tail -n 1



        SOLEXA_0019_FC:1:27:11595:10519#AACTAC 163 1089-1104_All_comp1980_c0_seq1 4639 255 74M = 4646 81 ATGAAATCGAACGGTTTTCTACACGATCGTGCAAGTGAAACACATGGGTTATTGAGAGCATTGGGAGTCATTAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIDFDGGGGEHIIIGIIIIIIIHIIIIEIIIIIHIIFGIGIGIIIIIH XA:i:0 MD:Z:74 NM:i:0


        What should I do next?

        Comment

        • Richard Finney
          Senior Member
          • Feb 2009
          • 701

          #5
          Well, there's "od" : "octal dump".

          Try

          head -n 18632882 Mysample.sam |tail -n 1 | od -c

          (or try "od -x" , whichever you can read easier )

          See if there's some strange control character snuck in by some piece of software.

          I'd just cut out the read and move on, like this ..

          cat yoursample.sam | awk '{if (p!=18632881) print $0;p++}'" > fixed.sam

          # note the zerobased couting using 18632881, not 18632882

          Nobody's going to miss one read.

          Comment

          • mxr1895
            Junior Member
            • Feb 2012
            • 6

            #6
            I'm dealing with the same issue. More specifically, I am trying to convert a sam file generated using 'miraconvert' from a MIRA *.caf file.
            This issue has been raised in [mira_talk]:



            Note that the above code: "cat yoursample.sam | awk '{if (length($10)!=length($11)) print p+1" "$0;p++}'"

            will not necessarily work for this issue, since in the example above (and in my case) both fields are of the same length.
            anybody know of a quick way of reading/interpreting the sum of a CIGAR field in order to remove the offending lines?
            Cheers,
            Miguel

            Comment

            • EricHaugen
              Member
              • Sep 2009
              • 13

              #7
              Originally posted by mxr1895 View Post
              anybody know of a quick way of reading/interpreting the sum of a CIGAR field in order to remove the offending lines?
              Here's a little Perl script I've used for that purpose:
              Code:
              #!/usr/bin/perl -w
              use strict;
              
              # Pass through valid-looking SAM records
              # Suppress records with invalid CIGAR strings, sending warning to STDERR
              
              while (<>) {
                  my ($readName,$flag,$refName,$refOffset,$mapQuality,$cigar,$ignore1,$ignore2,$ignore3,$seq,$qual,$tag,@other) = split /\t/;
                  unless (defined($seq)) {
                      warn "No sequence found in $_\n";
                      next;
                  }
                  my $orig = $cigar;
                  my $readbases = 0;
                  while (length($cigar) > 0) {
                      if ($cigar =~ /^([\d]+)([MISP\=X])(.*)/) {
                          $readbases += int($1);
                          $cigar = $3;
                      } elsif ($cigar =~ /^([\d]+)([DNH])(.*)/) {
                          # ignore
                          $cigar = $3;
                      } else {
                          warn "Failed to parse ($orig) in $_\n";
                          last;
                      }
                  }
                  my $seqlen = length($seq); 
                  if ($seqlen != $readbases) {
                      warn "$seqlen bases in $seq but $readbases bases parsed from CIGAR string $orig, $_";
                  } else {
                      print;
                  }
              }

              Comment

              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                Bastien hasn't yet fixed this bug as of MIRA 4.0 RC5,

                Comment

                • cankut
                  Junior Member
                  • Feb 2014
                  • 1

                  #9
                  Hi,

                  Here you can find information about why I got the same problem and how I solved it.

                  Today I had the same problem... when I was trying convert sam file to bam file with following command:
                  # samtools view -bS myfile.sam -o myfile.bam

                  To solve my problem first I tried
                  # cat yoursample.sam | awk '{if (p!=18632881) print $0;p++}' > fixed.sam
                  as Richard Finney mentioned.
                  But it is not the best solution when the error continues within different lines.

                  So I checked my .sam file.
                  # head -n error_line myfile.sam | tail -n1
                  Here everything was ok.
                  Then I copied the mapped sequence and search it on my readfile.fa
                  I noticed that the following line had "--" instead of seq data.
                  example:
                  >seq1
                  GATATTGGCGCGGCTCAATCA
                  --
                  >seq2
                  GATATTGGCGCGGCT
                  >seq3
                  GATATTGGCGCGGC

                  And it was exist several times in different lines.
                  So I removed these undesirable symbols with command:
                  # sed -i '/--/d' readfile.fa

                  Then I started again from aligning the curated readfile with using bwa
                  ....
                  then I was succesful to convert sam to bam and all following steps done without any error.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Pathogen Surveillance with Advanced Genomic Tools
                    by seqadmin




                    The COVID-19 pandemic highlighted the need for proactive pathogen surveillance systems. As ongoing threats like avian influenza and newly emerging infections continue to pose risks, researchers are working to improve how quickly and accurately pathogens can be identified and tracked. In a recent SEQanswers webinar, two experts discussed how next-generation sequencing (NGS) and machine learning are shaping efforts to monitor viral variation and trace the origins of infectious...
                    Yesterday, 11:48 AM
                  • seqadmin
                    New Genomics Tools and Methods Shared at AGBT 2025
                    by seqadmin


                    This year’s Advances in Genome Biology and Technology (AGBT) General Meeting commemorated the 25th anniversary of the event at its original venue on Marco Island, Florida. While this year’s event didn’t include high-profile musical performances, the industry announcements and cutting-edge research still drew the attention of leading scientists.

                    The Headliner
                    The biggest announcement was Roche stepping back into the sequencing platform market. In the years since...
                    03-03-2025, 01:39 PM
                  • seqadmin
                    Investigating the Gut Microbiome Through Diet and Spatial Biology
                    by seqadmin




                    The human gut contains trillions of microorganisms that impact digestion, immune functions, and overall health1. Despite major breakthroughs, we’re only beginning to understand the full extent of the microbiome’s influence on health and disease. Advances in next-generation sequencing and spatial biology have opened new windows into this complex environment, yet many questions remain. This article highlights two recent studies exploring how diet influences microbial...
                    02-24-2025, 06:31 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 03-20-2025, 05:03 AM
                  0 responses
                  30 views
                  0 reactions
                  Last Post seqadmin  
                  Started by seqadmin, 03-19-2025, 07:27 AM
                  0 responses
                  36 views
                  0 reactions
                  Last Post seqadmin  
                  Started by seqadmin, 03-18-2025, 12:50 PM
                  0 responses
                  30 views
                  0 reactions
                  Last Post seqadmin  
                  Started by seqadmin, 03-03-2025, 01:15 PM
                  0 responses
                  190 views
                  0 reactions
                  Last Post seqadmin  
                  Working...