
  • SAM/BAM sort by read names produces truncated read names

    Hi,

    I tried to sort the alignment file by read name, but the output contained truncated read names. This happened no matter which program I used: SAMtools sort (0.1.8), Picard SortSam (1.77), or Novosort (2.08).

    Here are the first few records of the original SAM file:
    Code:
    HWI-ST621:415:D197AACXX:8:1101:1        113     chr2    236798427       70      100M1S  chr8    3088040 0       ACCTCTGTTTCTAAGCAGTGGAATAGAATTGCTTATGGAATAGCCAGGTCATAGGATGTNATAANTTCCCTGGAAATCAGAGGGGAAAAGAAGCAAAACAN   C@?>?AC@:C@>CECDEE@ACFEBFFDEEHECDACADHFHFEHIJGJIGIHJJIHDB80#HF?1#GDJIHCIGGHHAIIIJJHEHJJIHHHHHFFFDD=1#        PG:Z:novoalign  RG:Z:LS148      AS:i:18 UQ:i:18 NM:i:2  MD:Z:59G4T35
    HWI-ST621:415:D197AACXX:8:1101:1        177     chr8    3088040 70      101M    chr2    236798427       0       AAATACATACATACACACAGACTGATTTTCTCTTCAGCAATATTTTAATGAAACCCCATACTGCAAATTACATAAACTAGTTAAAGTACACCAACCTCAAG   DEEDDDFDCEECEEDDBFFFDHHHFGHECJJIHFJJJIJJJIJHHGIHGDDGGJJJIIHGHIJJJIIJIGJJIIIFIIJJJJJJIIHFFAHHHDFFDFCCB        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST621:415:D197AACXX:8:1101:1223:2124        83      chr8    143208201       70      100M1S  =       143207998       -303    CGCTGAGAGCAAGGTGCCAGCAGGGTGGGCCCTTCTGGAGGCTCCGGCCGGGATCTGTTCCAGGCCACCCCCGCCTTCCGGCCATCCTCAGCTTGGCTCCN   >@CA>A:A>>>3(CA<AACDDDB<<?3?@9?CDCDCBCC?7<BBDBB@<93?DCCAA8<B?A<<DB7DCIGGBHGAHIIHFJJIEJIIHHHHHFFFDD=1#        PG:Z:novoalign  RG:Z:LS148      AS:i:47 UQ:i:47 NM:i:1  MD:Z:6C93       PQ:i:59 SM:i:70 AM:i:70
    HWI-ST621:415:D197AACXX:8:1101:1223:2124        163     chr8    143207998       70      92M     =       143208201       303     TTGTGGAGTCAGGTGTCCCTGGGGTCACGGTGACTGGCCAGGCGNGGGGAGCCAGGAGGCACACGGTCCTGGGCTCTNGCAGGGCTGGAGTG    @BBDFFADD?FHH@@EGGGGIIII@BCGHG8?DGHGB@FHHGAG#-<CC;@E?ACEE?B7?BCA?B;?BDDCB9??A#++28?B?B@B1<>A PG:Z:novoalign  RG:Z:LS148      AS:i:12 UQ:i:12 NM:i:2  MD:Z:44C32G14   PQ:i:59 SM:i:70 AM:i:70
    HWI-ST621:415:D197AACXX:8:1101:14       65      chr6    74783346        70      1S100M  chr1    1867309 0       NGATTAAGCAGCCAAGCTGTATCCTGAGGGAAACATGGGCAATGGAAAGCATCAGATTTCCTGGGTCAAAGCTATCCTGAGCTCAGGCACTGGGCTAACTG   #4=DFFFFGHHHHJJJJJJIJJJJJJJJJJGHIJIHIIJIGIIJJBFHIIIJJJJDIJJIHHIJJIGGHHHHHFFFFFFEDEEEEDDD@DDDDDDCDCDDD        PG:Z:novoalign  RG:Z:LS148      AS:i:6  UQ:i:6  NM:i:0  MD:Z:100
    HWI-ST621:415:D197AACXX:8:1101:14       129     chr1    1867309 70      101M    chr6    74783346        0       ACACACACACACACACACGAACTGCAGGGGGCTCTGGAGCCATGGAGTTAGAAAAGCTCTCTGAGAGGCCAGGTGTAGTGGCTCATGCCTGTAATCCCAGC   CCCFDFFFHHHHGJJJIJJIJJJJJFHIJIJFHIJJJDHEHHHHG@D?BDACCEDCBDDDDDDCDDDDBDBDB@CCCCCCBDDCCC@ACAC@>AB>CCACD        PG:Z:novoalign  RG:Z:LS148      AS:i:30 UQ:i:30 NM:i:1  MD:Z:68T32
    HWI-ST621:415:D197AACXX:8:1101:14       97      chr2    62756955        70      1S100M  chr6    74783591        0       NGTGCTGTTTGGTTTGTGTGTATTATATGGGTTTGGATTACAATAATTCCTCCCTTTTGTATAATGTTTTGCAGTTTTTAAAGCACTTCATGCTCTAAATC   #1=DDFFDHHGGFHIIHHEHGFGIDHHIIIIFGIIICGGEHHHIIIII>GGGIIIIIIIICFGHHGGHIIIIDAAEHHHEBDDFCEEECCDCCCCCC>ACC        PG:Z:novoalign  RG:Z:LS148      AS:i:6  UQ:i:6  NM:i:0  MD:Z:100
    HWI-ST621:415:D197AACXX:8:1101:14       145     chr6    74783591        70      101M    chr2    62756955        0       ATTTTTGTAAGTCACCAATGGTTGGATGTTGGCAGTTTCATAAGGTTCATTCTAATAGTTCCTGGGACACAAATGACTCGAAGTAGGTCAAGACAGGTTCA   <DDDDDDDDDDDDEEECCFDFFGHEHGJJIIJJJJIGHIHCIIIJIGCGIIGDIHEIIHGIJJJJIHIIJJIIHGBHHJIJJJJJJJJHHHFHFFFFD?C@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST621:415:D197AACXX:8:1101:1        81      chr1    155944063       70      101M    chr11   19838477        0       CAGCTGTACCTGGCAGCAGCCCCTTCCCCAAGATGGTGACACCTCTGTCCACACCCTCTGTAATAGTGACCGGAGAGCCTGTGGAGCATTCCACCAGGATT   DDDEDAA:BCAA:DD@BDDDDB?@=BDEDEEDFFFD@;??=HHIIIIGJIHF<JIHFGBIHIJIIIIIHJJJJJJJIJJJJIJJJJJIHHHHHFFFFFCC@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST621:415:D197AACXX:8:1101:1        161     chr11   19838477        70      101M    chr1    155944063       0       AGCCCCTTATGCAGAAAAAGGGACTCCACCTGGAGCCCTCTCTGGATCTACTTCTCCCAGATAAATCAGTCGGCTGTGTAATCTTTCAGGAAACCTGACCC   ??<DDFFFFHHDDDHIGDDAFE9FFGHGCHEGG9FGGHGGGGCFHBF*0BBCBGGE@GHGCHA@ECE@H;ADBFDCDDCCDD@CCC;33:32:595<9>3<        PG:Z:novoalign  RG:Z:LS148      AS:i:1  UQ:i:1  NM:i:0  MD:Z:101
    After sorting:
    Code:
    HWI-ST  81      chr7    83652142        70      82M     chr8    142160880       0       CTTTGTATTTACAGATACCACGGCCATTTTGCAATGTCCTCAGCACATAGTGGAAGCTGAACAAACAATCACATTTTCTAAT      @D<EA?7)==77@=7)('-'FF;FABB*0>EDB9DFDGDEBDEECC<FHHHBE@9HHEAB<;>FFDBBFA<DFA;A,B48;?   PG:Z:novoalign  RG:Z:LS148      AS:i:22 UQ:i:22 NM:i:1  MD:Z:76A5
    HWI-ST  65      chr9    120922414       70      101M    chr6    160312253       0       TCACTGAGTCTGATTGAAGCAACTGGCATTGGTGATCATACTTCAATATTTCTCTCATATTTGAAGTTAGAATTAGTTGATGTGAGATATTATATTAGCCT   @CCFFFFFHFHFAHHIDGHIJGIIJGHCGIJICFHIIIIIJIJJIIJGIIEIJHHGGIICGHIBGHFGHHGGHIDC@DHGIHGIGHHHHECBDFFFFFEDE        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  81      chr2    46872242        70      101M    chr17   79461315        0       CATGGATTAAAATATTAAGTAATTTGATCTAGATGATTGTTTACAGTTTAACGCAAATACACTTAGTCTGTTCTGATTATTTACTCAAGGATTATATTACT   >C>:EDDFCDDFFDFFHHHHHHJIHGG=GIGJJIIIIJGIIIHIJHDGGHHJFIIJIIGC:JHHAIIFJJJIHGH@IJJJHHCGB>HGGHGHHFFFFF@C@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  65      chr8    103315908       70      93M     chr17   40205036        0       AGATATCTGAGAAACTGACCTAAATAAGCAATCTGAAAAGATTAAGGTTCCTTCAATTATTATACTACTTGTTCTCCAAATAACACACTAACT   <@@ADD>DDBA<FG?A43?@FFF:3AEB>DFECE91:C<CFCFCFFC::4?D>FCDDD<FC8DFEFDG88@.==C=4@D;7@:7?CCBDD@>@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:93
    HWI-ST  89      chr16   61016706        70      101M    =       61016706        0       TGTTGAGTCAATGTAAGACCTTGGTAAGAATTCTTCAATTTAGACATGGCTAATTTTTAATGTCAACCACAGCTATTGAGGTACTTATATTAATTAACCTT   C?CECACCFFFFDDDE?=CCGGIIIGGEGIIIGGIIEGIIHHDBFGIGFIIIHGIIIIIGGIHG@CHHHGHHHGHDEIFIIGIHBIIIHHDDHEDEDF@?@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  97      chr12   16510044        70      101M    chr9    75346048        0       TAATAAAAATTCAGTTTTAACTATAGATGCCTTCTTCTCCTCTTGTGTTTGATTTATTGCTCCAAATGGGCCAACCTGGATGTCTATATTTCTTCCACTAA   CCCFFFFFHHHHHJIIGIIJJIJIIJJJJJJIEIIJJJJJHJIJGFGFHJJJIIJJJJJJJJIGJJJJIIJJJIJHJHHFHHBBEDFFCFEFEEEEDDDDD        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  73      chr5    22843028        70      97M     =       22843028        0       TAACTGTGTTTACTTTTCTCAGTTTCTACCAGAGAAAAGGCAGGTGCATTTTTTTGGTATGTTTGTGTAAAGTGAATTTGGCTTTACTTTTTCAAAT       =?<DD>=;FHDFFHGE@EFH?EA<B4AA@EBGCC1?91*:8CFG0?@?<D@@B;AFB=7=3?CHEEBE77B@6>;(6;.;;@;?>A>5(5:@CC5@>    PG:Z:novoalign  RG:Z:LS148      AS:i:3  UQ:i:3  NM:i:0  MD:Z:97
    HWI-ST  73      chr6    152150636       70      101M    =       152150636       0       CATTTGTCATCATTACACGGTCATGGGAGTGCTAAGAAGACTTAAATGCAGGGCTACCACCCCTTCCCAATTCATCTTTTATCCATTTTATTTCTCTAAGG   @CCDDDDEHHFHHFBHGGHHAFEFFHIGG:?CFGIGIGGHHEGIEHIGHGDE@;B=FA@F@FGGGEEHECCFFEFFCECDECCCDDDEDDCC@BCC>CCCC        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    HWI-ST  113     chr7    63064316        30      101M    chr17   26080536        0       CCTGCTCATCTCAGGCCTGCCGGCTCCTCCACCTGCCTTTTCGAGTACCCTGGGAACCCCCCGAGGACAGGTGTCATCGGTTGCTTCATCTCACCATCCCT   A94+(:ACCC??@BB@@7DDBDB<2????@8;BDB@A@BCDBCCCA<-DCC>3?8DB=7@@IHCIIJIGIJIIIJGHHGGGGHGEIDIFFFFAFFFDF@@@        PG:Z:novoalign  RG:Z:LS148      AS:i:31 UQ:i:31 NM:i:1  MD:Z:42C58
    HWI-ST  89      chr4    96140737        70      101M    =       96140737        0       AACAACGAGCCTCACTAGGTGACGATTAGCTATGGTTTCCCTGGTCTATACTGGATTTGGGTTCATTGGTAAATCATTCTATTCATAGCAATACAAGATAT   <<A?8DDDDDDCCAEEEFFFFHHHHHFIJJJJIIIIIGIGHIFIGHIIGDGGIJIJJIIHIHIEHIIJJJJJJIIJJJIJJIIJJIJJHGHHHFFFFFB@@        PG:Z:novoalign  RG:Z:LS148      AS:i:0  UQ:i:0  NM:i:0  MD:Z:101
    Does anyone have any idea what is wrong with the programs or the data?

    Thanks a lot!

    Allen

  • #2
    Very strange. Was that a typo in the version of samtools (I have 0.1.18 on my machine), or do you really have an out of date copy?



    • #3
      The original SAM file also looks to have truncated names. Your read names should all end in ":8:[\d]+:[\d]+:[\d]+" (or something like that), where [\d]+ is regex for a number. The SAM file that you posted looks to have 3 reads (according to read name), but 5 reads if you look at the sequences. Is there something screwed up in your original fastq files?
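
A check along these lines can be sketched in Python. This is a minimal illustration, not part of any tool mentioned in the thread: the seven-field name layout (instrument:run:flowcell:lane:tile:x:y) is an assumption based on the names shown above, and `truncated_qnames` is a hypothetical helper name.

```python
import re

# Assumed read-name layout: instrument:run:flowcell:lane:tile:x:y
QNAME_RE = re.compile(r"^[^:\s]+:\d+:[^:\s]+:\d+:\d+:\d+:\d+$")

def truncated_qnames(sam_lines):
    """Return the QNAMEs that do not match the assumed seven-field layout."""
    bad = []
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        qname = line.split("\t", 1)[0]
        if not QNAME_RE.match(qname):
            bad.append(qname)
    return bad

# Records abbreviated from the thread (first six SAM columns only).
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "HWI-ST621:415:D197AACXX:8:1101:1\t113\tchr2\t236798427\t70\t100M1S",
    "HWI-ST621:415:D197AACXX:8:1101:1223:2124\t83\tchr8\t143208201\t70\t100M1S",
    "HWI-ST\t81\tchr7\t83652142\t70\t82M",
]
print(truncated_qnames(example))
# prints ['HWI-ST621:415:D197AACXX:8:1101:1', 'HWI-ST']
```

Running it on the file before and after sorting would show whether the names were already damaged before the sort step.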



      • #4
        Originally posted by maubp View Post
        Very strange. Was that a typo in the version of samtools (I have 0.1.18 on my machine), or do you really have an out of date copy?
        You are right, that was a typo. Thanks for spotting that.



        • #5
          Originally posted by dpryan View Post
          The original SAM file also looks to have truncated names. Your read names should all end in ":8:[\d]+:[\d]+:[\d]+" (or something like that), where [\d]+ is regex for a number. The SAM file that you posted looks to have 3 reads (according to read name), but 5 reads if you look at the sequences. Is there something screwed up in your original fastq files?
          Yes, you are right. It seems the read names were mangled by Novoalign; the read names in the original FASTQ file were fine:

          Code:
          @HWI-ST621:415:D197AACXX:7:1101:1179:2146 1:N:0:
          NCAGAATGAGCAATTAGAAATCCTCTGTNNTNNTAGNNNNCTGGAAATTAAACCAAGTGTATAATGCACCTAATGAAGTGTATGGTCTGANGTTTAANTAG
          +
          #1=DDFFFHHHHHJJJJJJJJJJJJJJI##2##1:C####00?DHGIJJJEHIHIEHCHFGIIJJJIGEEHHFEHFFFDDDFEEECDEDC#,5<@@C####
          @HWI-ST621:415:D197AACXX:7:1101:1185:2187 1:N:0:
          TTTGAACATCCCCACTAGGTTCTTTTCCATTGNCAANNNGGAGCATCAGCCAGTGAATCTGTTTCAGGTTTCCATTCTGCAGAACTCCTCCAAAGCATGTG
          +
          CCCFDFFFHHHHHEHIJJJCHHIIJJIIGGIG#1:C###00?DHIJHGIIJJJGHIEHIIIGDHGIJI@DHFH>AEHFFFFFFECCCCEDCDCCDDDCDCC



          • #6
            Hi Allenyu

            Try adding " --hdrhd 4" to your novoalign command in case there is more than 1 byte difference between the read names of a set of paired reads.
            Also note that read1 and read2 should be in order throughout your FASTQ input file. If this is not the case then most aligners will probably not do the right thing.
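
The idea of a byte difference between paired read names can be illustrated with a short Python sketch. It is not Novoalign's implementation: `header_distance` and `mismatched_pairs` are hypothetical helpers, treating a length difference as extra mismatches is an assumption, and the read 2 headers are made up by changing the read number, since the thread only shows read 1.

```python
def header_distance(h1, h2):
    """Count differing byte positions between two headers.

    Assumption: any difference in length is counted as extra mismatches.
    """
    mismatches = sum(a != b for a, b in zip(h1, h2))
    return mismatches + abs(len(h1) - len(h2))

def mismatched_pairs(r1_headers, r2_headers, max_distance=1):
    """Return (index, header1, header2) for pairs whose headers differ too much."""
    return [
        (i, h1, h2)
        for i, (h1, h2) in enumerate(zip(r1_headers, r2_headers))
        if header_distance(h1, h2) > max_distance
    ]

r1 = [
    "@HWI-ST621:415:D197AACXX:7:1101:1179:2146 1:N:0:",
    "@HWI-ST621:415:D197AACXX:7:1101:1185:2187 1:N:0:",
]
# Hypothetical mates: same names with the read number changed from 1 to 2.
r2_in_order = [h.replace(" 1:N:0:", " 2:N:0:") for h in r1]
r2_swapped = list(reversed(r2_in_order))

print(mismatched_pairs(r1, r2_in_order))      # no pairs flagged
print(len(mismatched_pairs(r1, r2_swapped)))  # both pairs flagged
```

When the two files are in the same order, each pair of headers differs only in the read number. When the order differs, the coordinates no longer match and the distance grows quickly, which is why record order matters more than the threshold.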



            • #7
              Hi Allen,

              Yes, you need to sort your FASTQ input before running Novoalign. No luck, man.


              Originally posted by zee View Post
              Hi Allenyu

              Try adding " --hdrhd 4" to your novoalign command in case there is more than 1 byte difference between the read names of a set of paired reads.
              Also note that read1 and read2 should be in order throughout your FASTQ input file. If this is not the case then most aligners will probably not do the right thing.
              Marco
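
One way to bring two FASTQ files into a common order is to sort each of them by read name. The following Python sketch assumes plain four-line FASTQ records and files small enough to hold in memory; `sort_fastq` is a hypothetical helper, and the sequences and qualities in the example are placeholders, not data from the thread.

```python
def sort_fastq(lines):
    """Return FASTQ records as (header, seq, plus, qual) tuples sorted by read name.

    Assumes strict four-line records; a trailing incomplete record is dropped.
    """
    it = iter(line.rstrip("\n") for line in lines)
    records = list(zip(it, it, it, it))  # group every four lines into one record
    # Sort on the name only, i.e. the header text before the first whitespace.
    records.sort(key=lambda rec: rec[0].split()[0])
    return records

# Two records in reverse name order, with placeholder sequence and quality.
unsorted_lines = [
    "@HWI-ST621:415:D197AACXX:7:1101:1185:2187 1:N:0:\n", "TTTG\n", "+\n", "CCCF\n",
    "@HWI-ST621:415:D197AACXX:7:1101:1179:2146 1:N:0:\n", "NCAG\n", "+\n", "#1=D\n",
]
for header, seq, plus, qual in sort_fastq(unsorted_lines):
    print(header, seq, plus, qual, sep="\n")
```

Applying the same sort to the read 1 and read 2 files gives both the same order, provided every read has a mate in the other file.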



              • #8
                Thanks! I am now trying again with sorted reads.
