  • sam files convert to bam files error

    Hi all,

    When I use samtools to convert a SAM file to a BAM file, I get the following errors:
    samtools view -h -F 4 -q 1 -bS C.filsa.sam >C.filsa.bam
    [samopen] SAM header is present: 7 sequences.
    [sam_read1] reference 'SR' is recognized as '*'.
    [main_samview] truncated file.

    I have also seen "missing colon in auxiliary data" and "CIGAR and sequence length are inconsistent" for individual rows. My SAM files are the output of gsnap. I am not sure whether these problems are caused by gsnap or by samtools. How can I deal with them?

    Any suggestions and answers are appreciated. Thank you.

  • #2
    The following is a sample of my SAM file. I don't understand where the reference 'SR' comes from.
    SRR019035.130 16 Chr5 9804788 40 36M * 0 0 CAGCCTCAAACGGCGCCGTCTTATACGGTGAGTTAC IIIII9IIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.131 16 Chr1 753661 40 30M * 0 0 TGAAGATATTGAACCTCTCCGTTAGGGAAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:30 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.132 16 Chr3 7844307 40 36M * 0 0 ATGCTGGTAATTCACGAGCTTGATGAAACATTTCAC I3IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.133 0 Chr1 28835502 40 36M * 0 0 GTTTTAGTTTCGTCTGCAACTGAGTCATCACCTACT IIIIIIIIIIIIIIIIIIIIIIDIIIIIIDIII-II MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.134 0 Chr1 28836313 40 36M * 0 0 GAAAATTTCAGGTCTGGTTCAGAATTGGTTCCGAAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII7II MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.135 0 Chr5 22542176 40 25M * 0 0 CGTGGTTCTAGGACATCATCTGATA IIIIIIIIIIIIIIIIIIIIIIIII MD:Z:25 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.136 0 ChrC 100327 3 36M * 0 0 GAATAAAGGATTAATCCGTATCATCTTGACTTGGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:2 HI:i:1 NM:i:0 SM:i:3 XQ:i:40 X2:i:40 XO:Z:UM PG:Z:A
    SRR019035.136 272 ChrC 138287 3 36M * 0 0 AACCAAGTCAAGATGATACGGATTAATCCTTTATTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:2 HI:i:2 NM:i:0 SM:i:3 XQ:i:40 X2:i:40 XO:Z:UM PG:Z:A
    SRR019035.137 16 Chr1 28835623 40 36M * 0 0 TATTTTCGTCGTCTCTAGAGTTTGAAGCATCAGTCC IIBI61IIIIIHIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.138 16 Chr5 19304066 40 36M * 0 0 ATCAATGATATGTTTAAGCAAGACGACTCTTTCAGC IIIII?IIIIIIIIIIIIIIIIIIIIIIIIIIIIII MD:Z:36 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A
    SRR019035.139 0 Chr4 162871 40 26M * 0 0 TGATTTCGTTGTGCTATGTAAACTTT IIIIIIIIIIIIIIIIIIII1IIIII MD:Z:26 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0 XO:Z:UU PG:Z:A


    • #3
      The SR... stuff is just the name of the read, which I see you downloaded from SRA (or ENA). Out of curiosity, what happens if you just run:

      Code:
      samtools view -F 0x4 -q 1 -Sbo C.filsa.bam C.filsa.sam
      I wonder if giving the -h option is just screwing things up (it shouldn't do anything when you write a BAM file).


      • #4
        Thanks, dpryan.
        I tried your command, but "reference 'SR' is recognized as '*'." still occurred. I downloaded my SRA data from http://www.ncbi.nlm.nih.gov/sra/?term=SRR019035.


        • #5
          If the first 1000 lines or so are sufficient to reproduce this, could you attach that (you have to edit in "advanced" mode and click on the paperclip)? That'd provide a reproducible example. To get the first 1000 (or whatever) lines, just:

          Code:
          head -n 1000 file.sam > excerpt.txt


          • #6
            I tried the first 1000 lines, and there was no problem. So I have attached the first 500 lines and the last 500 lines for you, but I am not sure the problem will appear in them.

            Whenever I work with large SAM files, only a very few lines have problems such as 'missing colon in auxiliary data' or 'CIGAR and sequence length are inconsistent'. Those two errors always point to the specific lines, so I can find the problem. Only for "reference *** is recognized as '*'" can I not find which lines are affected.
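One way to locate those lines is sketched below. It assumes the warning refers to a record whose reference-name column (field 3, RNAME) is not listed in any @SQ header line; that is a guess at the cause, not something confirmed in this thread. The function name `find_unknown_references` is made up for illustration and is not part of samtools or gsnap.

```python
def find_unknown_references(lines):
    """Yield (line_number, rname) for records whose RNAME is not in the @SQ header."""
    known = set()
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith("@"):
            # Collect reference names from header lines such as "@SQ\tSN:Chr1\tLN:...".
            if line.startswith("@SQ"):
                for field in line.split("\t")[1:]:
                    if field.startswith("SN:"):
                        known.add(field[3:])
            continue
        fields = line.split("\t")
        # RNAME is the third column; '*' marks an unmapped read and is always valid.
        rname = fields[2] if len(fields) > 2 else ""
        if rname != "*" and rname not in known:
            yield lineno, rname


# Hypothetical usage on the file from this thread:
# with open("C.filsa.sam") as handle:
#     for lineno, rname in find_unknown_references(handle):
#         print(f"line {lineno}: unknown reference {rname!r}")
```

If a reported line turns out to be a fragment of a record (for example a read name sitting in the third column), that would point to truncated or interleaved output rather than a genuinely unknown chromosome.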

            My SAM files come from a gsnap alignment, so I am confused about whether the problems are caused by gsnap or by samtools. If they are caused by gsnap, 99% of the data is still OK. How can I avoid these problems and filter out the bad records in advance?

            • #7
              That doesn't seem to reproduce the problem either. It's very likely that the problem is with gsnap, which apparently is producing corrupt output on occasion. You might consider upgrading if that's an option or report the issue to the developer.
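Until the aligner is upgraded, a pre-filter that drops malformed records before they reach samtools may be a workable stopgap. The sketch below is only an illustration: the names `record_problem` and `filter_sam` are hypothetical, and the checks (11 mandatory columns, CIGAR length versus SEQ length, `TAG:TYPE:VALUE` shape for optional fields) are a loose approximation of what samtools complains about, not its actual validation logic.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
AUX_TAG = re.compile(r"^[A-Za-z][A-Za-z0-9]:[AifZHB]:.*$")
# CIGAR operations that consume bases of the read (the SEQ column).
QUERY_OPS = set("MIS=X")


def record_problem(line):
    """Return a short description of the first problem found, or None."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        return "fewer than 11 mandatory fields"
    cigar, seq = fields[5], fields[9]
    if cigar != "*":
        ops = CIGAR_OP.findall(cigar)
        # Reassembling the parsed operations must give back the original string.
        if "".join(n + op for n, op in ops) != cigar:
            return "malformed CIGAR"
        query_len = sum(int(n) for n, op in ops if op in QUERY_OPS)
        if seq != "*" and query_len != len(seq):
            return "CIGAR and sequence length are inconsistent"
    for tag in fields[11:]:
        if not AUX_TAG.match(tag):
            return "malformed auxiliary field"
    return None


def filter_sam(lines, report=None):
    """Yield header lines and well-formed records; pass the rest to `report`."""
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("@"):
            yield line
            continue
        problem = record_problem(line)
        if problem is None:
            yield line
        elif report is not None:
            report(lineno, problem)
```

Writing the output of `filter_sam` to a new file and converting that file keeps the good records, while the `report` callback records which lines were dropped and why, which is also useful material for a bug report to the gsnap developer.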


              • #8
                Thank you for your good advice. It really helped me.
